Visual Inference for IoT Systems: A Practical Approach
ISBN 3030909026, 9783030909024

This book presents a systematic approach to the implementation of Internet of Things (IoT) devices achieving visual inference.


English, 172 pages, 2022


Table of contents:
Preface
Contents
Acronyms
1 Introduction
References
2 Embedded Vision for the Internet of Things: A Survey on State-of-the-Art Hardware, Software, and Deep Learning Models
2.1 Network Models
2.1.1 CNN-Based Tasks
2.1.2 Metrics on CNN Complexity and Performance
2.1.3 Popular Convolutional Neural Networks
2.1.4 Training Convolutional Neural Networks
2.2 Software
2.2.1 Software Implementation Strategies
2.2.2 Additional Software Tools
2.3 Hardware Devices and Accelerators
2.4 Summary and Advanced Topics
References
3 Optimal Selection of Software and Models for Visual Inference
3.1 Methodology Overview
3.2 Benchmarking
3.2.1 Performance Evaluation Metrics
3.2.2 Selected Components
3.2.3 Benchmarking on Raspberry Pi—Practical Realization
3.2.4 Benchmarking Results
3.3 Optimum Selection
3.3.1 Pairwise Metric Comparison
3.3.2 Figure of Merit
3.4 Summary and Advanced Topics
3.5 Appendix
References
4 Relevant Hardware Metrics for Performance Evaluation
4.1 Introduction
4.2 Procedure Overview
4.3 Hardware-Aware Analysis
4.4 High-Level Performance Analysis
4.5 Qualitative Performance Explanation
4.5.1 From Aggregated Statistics to Inference Performance
4.5.2 From Instantaneous Statistics to Inference Performance
4.6 Graphical Comparison of Metrics
4.7 Summary and Advanced Topics
4.8 Appendix
References
5 Prediction of Visual Inference Performance
5.1 Introduction
5.2 Fundamentals of CNNs
5.3 Modeling and Prediction of Inference Performance
5.3.1 General Description
5.3.2 Selected System
5.3.3 Network Profiling
5.3.4 Model Construction
5.4 PreVIousNet: A Network for Fine-Grained Characterization
5.4.1 Layer Parameters
5.4.2 Architecture
5.4.3 Network Configuration for the Modeling Stage
5.5 Experimental Tests
5.5.1 Embedded System
5.5.2 Networks
5.5.3 Experimental Results: Layerwise Predictions
5.5.4 Experimental Results: Network Predictions
5.5.5 Analysis
5.6 Summary and Advanced Topics
5.7 Appendix
References
6 A Case Study: Remote Animal Monitoring
6.1 Introduction: Target Application
6.2 Methodology—Overview of DL Model Deployment
6.3 Data Collection
6.4 Training CNNs
6.4.1 Classification Network
6.4.2 Object-Detection Networks
6.4.3 Discussion
6.4.4 Good Practices: Explainable Deep Learning
6.5 Visual System Implementation
6.5.1 Hardware
6.5.2 Software Libraries
6.5.3 Application Algorithm
6.6 Experimental Tests
6.7 Summary and Advanced Topics
References


Delia Velasco-Montero · Jorge Fernández-Berni · Angel Rodríguez-Vázquez

Visual Inference for IoT Systems: A Practical Approach

Delia Velasco-Montero Instituto de Microelectrónica de Sevilla Universidad de Sevilla-CSIC Seville, Spain

Jorge Fernández-Berni Instituto de Microelectrónica de Sevilla Universidad de Sevilla-CSIC Seville, Spain

Angel Rodríguez-Vázquez Instituto de Microelectrónica de Sevilla Universidad de Sevilla-CSIC Seville, Spain

European Regional Development Fund

ISBN 978-3-030-90902-4    ISBN 978-3-030-90903-1 (eBook)
https://doi.org/10.1007/978-3-030-90903-1

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

The three authors of this book belong to the Integrated Interface Circuits and Sensory Systems (I2CASS) group at the Instituto de Microelectrónica de Sevilla (Universidad de Sevilla-CSIC, Spain), which has been conducting research on vision hardware, among other topics, for over 25 years. The group was founded by the third author, Prof. Angel Rodríguez-Vázquez. The initial focus was on the design of bioinspired image sensors emulating the initial processing stages of natural visual systems, which typically take place in the retina. By conveying some processing and memory close to the photosensors, we built computationally powerful and efficient chips to solve early vision tasks. At the architectural level, we explored the parallelism that is inherent to low-level vision, effectively dealing with massive but simple raw data. At the circuit level, we exploited the device physics in order to efficiently implement the required functionalities. This scope was later extended by integrating more of the system functionalities demanded along the vision pipeline to achieve inference. This further integration encompassed on-chip and off-chip hardware components and their co-design with diverse algorithmic approaches, eventually reaching the application level. More recently, on the basis of our accumulated expertise, we started actively contributing to the disruptive field of deep learning (DL). In particular, we have been investigating how to optimally leverage the scarce resources of IoT devices to achieve edge visual inference with the performance standards demanded by real application scenarios.

Clear evidence of the long-term impact of the I2CASS group is the 2019 Mac Van Valkenburg Award presented to its founder, Prof. Rodríguez-Vázquez, for fundamental contributions to hardware architectures for smart imaging, vision, and 2D data processing. The Mac Van Valkenburg Award annually honors individuals for outstanding technical contributions and distinguishable leadership in a field within the scope of the IEEE Circuits and Systems (CAS) Society.

In this monograph, we have put all our experience and knowledge into play in order to provide a systematic and up-to-date approach to vision-enabled embedded systems, and a comprehensive analysis of their implementation specificities. Practical aspects are profusely detailed. Guidelines are presented for the optimal selection


of hardware and software components according to prescribed application requirements. The book includes a remarkable set of experimental results and functional procedures supporting the theoretical concepts and methodologies introduced. A case study is also presented.

All the contents revolve around the state-of-the-art computer vision paradigm, i.e., DL. These days, accurate, but computationally heavy, neural networks are employed for a variety of tasks. The strong processing and memory requirements of such networks come into conflict with the limited availability of resources in IoT devices. Meanwhile, a great number of technological components have been released to support DL deployments. Each of these components claims advantages in terms of performance, system integration, energy consumption, etc. The book provides tools to navigate this complex scenario in an effective manner. All in all, this monograph

• surveys the state of the art of embedded vision based on DL
• describes strategies to leverage the limited resources of IoT devices
• defines metrics and figures of merit to evaluate diverse system configurations in a rapid and effective manner
• reports extensive benchmarking on different hardware platforms
• includes detailed examples of DL-based realizations of visual inference
• presents coding and parameterization for a diversity of tools under different operation conditions
• addresses a specific application scenario in terms of both training and inference, based on a unique database composed of real-world images and sequences.

The book is organized into six chapters. Chapter 1 introduces the context within which the rest of the chapters elaborate. Chapter 2 surveys state-of-the-art DL components at both the hardware and software levels, as well as a diversity of neural network models. Chapter 3 describes a benchmarking-based methodology to select the optimum combination of deep neural network and software framework for visual inference on embedded systems. In Chap. 4, we analyze how the hardware resources of a low-power processor are differently exploited by DL frameworks, and the key metrics to identify limitations and bottlenecks in particular realizations. Chapter 5 demonstrates that it is possible to predict the performance of convolutional neural networks on embedded vision devices with high accuracy through the characterization of a single ad hoc model. Finally, Chapter 6 details a case study: remote animal monitoring. In this conclusive chapter, we apply some of the concepts and results reported in previous chapters to a real application scenario.

Seville, Spain
September 2021

Delia Velasco-Montero Jorge Fernández-Berni Angel Rodríguez-Vázquez

Contents

1 Introduction ........................................................... 1
References ............................................................... 3

2 Embedded Vision for the Internet of Things: A Survey on State-of-the-Art Hardware, Software, and Deep Learning Models ... 5
2.1 Network Models ....................................................... 5
2.1.1 CNN-Based Tasks .................................................... 10
2.1.2 Metrics on CNN Complexity and Performance .......................... 11
2.1.3 Popular Convolutional Neural Networks .............................. 14
2.1.4 Training Convolutional Neural Networks ............................. 16
2.2 Software ............................................................. 17
2.2.1 Software Implementation Strategies ................................. 18
2.2.2 Additional Software Tools .......................................... 20
2.3 Hardware Devices and Accelerators .................................... 20
2.4 Summary and Advanced Topics .......................................... 22
References ............................................................... 24

3 Optimal Selection of Software and Models for Visual Inference ......... 31
3.1 Methodology Overview ................................................. 31
3.2 Benchmarking ......................................................... 32
3.2.1 Performance Evaluation Metrics ..................................... 32
3.2.2 Selected Components ................................................ 33
3.2.3 Benchmarking on Raspberry Pi—Practical Realization ................. 35
3.2.4 Benchmarking Results ............................................... 40
3.3 Optimum Selection .................................................... 41
3.3.1 Pairwise Metric Comparison ......................................... 41
3.3.2 Figure of Merit .................................................... 42
3.4 Summary and Advanced Topics .......................................... 51
3.5 Appendix ............................................................. 51
References ............................................................... 58

4 Relevant Hardware Metrics for Performance Evaluation .................. 61
4.1 Introduction ......................................................... 61
4.2 Procedure Overview ................................................... 62
4.3 Hardware-Aware Analysis .............................................. 64
4.4 High-Level Performance Analysis ...................................... 70
4.5 Qualitative Performance Explanation .................................. 75
4.5.1 From Aggregated Statistics to Inference Performance ................ 75
4.5.2 From Instantaneous Statistics to Inference Performance ............. 77
4.6 Graphical Comparison of Metrics ...................................... 78
4.7 Summary and Advanced Topics .......................................... 80
4.8 Appendix ............................................................. 82
References ............................................................... 87

5 Prediction of Visual Inference Performance ............................ 89
5.1 Introduction ......................................................... 89
5.2 Fundamentals of CNNs ................................................. 91
5.3 Modeling and Prediction of Inference Performance ..................... 93
5.3.1 General Description ................................................ 93
5.3.2 Selected System .................................................... 95
5.3.3 Network Profiling .................................................. 95
5.3.4 Model Construction ................................................. 96
5.4 PreVIousNet: A Network for Fine-Grained Characterization ............. 100
5.4.1 Layer Parameters ................................................... 101
5.4.2 Architecture ....................................................... 101
5.4.3 Network Configuration for the Modeling Stage ....................... 103
5.5 Experimental Tests ................................................... 104
5.5.1 Embedded System .................................................... 104
5.5.2 Networks ........................................................... 104
5.5.3 Experimental Results: Layerwise Predictions ........................ 105
5.5.4 Experimental Results: Network Predictions .......................... 109
5.5.5 Analysis ........................................................... 110
5.6 Summary and Advanced Topics .......................................... 112
5.7 Appendix ............................................................. 114
References ............................................................... 122

6 A Case Study: Remote Animal Monitoring ................................ 125
6.1 Introduction: Target Application ..................................... 125
6.2 Methodology—Overview of DL Model Deployment .......................... 126
6.3 Data Collection ...................................................... 128
6.4 Training CNNs ........................................................ 129
6.4.1 Classification Network ............................................. 132
6.4.2 Object-Detection Networks .......................................... 136
6.4.3 Discussion ......................................................... 139
6.4.4 Good Practices: Explainable Deep Learning .......................... 140
6.5 Visual System Implementation ......................................... 143
6.5.1 Hardware ........................................................... 143
6.5.2 Software Libraries ................................................. 144
6.5.3 Application Algorithm .............................................. 149
6.6 Experimental Tests ................................................... 153
6.7 Summary and Advanced Topics .......................................... 156
References ............................................................... 157

Acronyms

AI            Artificial Intelligence
AP            Average Precision
API           Application Programming Interface
ATLAS         Automatically Tuned Linear Algebra Software
BGR           Blue-Green-Red
BLAS          Basic Linear Algebra Subprograms
BN            Batch-Normalization Layer
CIE L*a*b*    International Commission on Illumination L*a*b* Color Space
CIFAR         Canadian Institute For Advanced Research Dataset
CNN           Convolutional Neural Network
COCO          (Microsoft) Common Objects in Context
CONV          Convolutional Layer
CPU           Central Processing Unit
DL            Deep Learning
DNN           Deep Neural Network
DVFS          Dynamic Voltage and Frequency Scaling
FC            Fully Connected Layer
FFT           Fast Fourier Transform
FLOP          Floating-Point Operation
fmap          Feature Map
FN            False Negative
FoM           Figure of Merit
FP            False Positive
fps           Frames Per Second
GAP           Global Average Pooling
GEMM          General Matrix-Matrix Multiplication
GPIO          General Purpose Input/Output Pins
GPU           Graphics Processing Unit
HLS           Hue-Saturation-Lightness
HSV           Hue-Saturation-Value
ILSVRC        ImageNet Large-Scale Visual Recognition Challenge
im2col        Image-to-Column Transformation
IoT           Internet of Things
IoU           Intersection over Union
L1            Level 1 Memory System
L2            Level 2 Memory System
LPDDR         Low-Power Double Data Rate
MAC           Multiply-Accumulate Operation
mAP           Mean Average Precision
MAPE          Mean Absolute Percentage Error
MCU           Microcontroller Unit
MKL           Math Kernel Library
ML            Machine Learning
MMU           Memory Management Unit
MNIST         Modified National Institute of Standards and Technology Database
MS COCO       Microsoft Common Objects in Context
NAS           Neural Architecture Search
NCS           Neural Compute Stick
OLS           Ordinary Least Squares
ONNX          Open Neural Network Exchange
OS            Operating System
OSD           On-Screen Display Interface
PASCAL VOC    Pattern Analysis, Statistical Modeling and Computational Learning, Visual Object Classes Challenge
pb            Protocol Buffer
PIR           Passive Infrared Sensor
PMU           Performance Monitoring Unit
POOL          Pooling Layer
RAM           Random Access Memory
R-CNN         Region CNN
ReLU          Rectified Linear Unit
RGB           Red-Green-Blue
RMSprop       Root Mean Square Propagation
RoI           Region of Interest
RPi           Raspberry Pi
RSS           Resident Set Size
SD            Secure Digital
SGD           Stochastic Gradient Descent
SIMD          Single Instruction Multiple Data
SoC           System-on-a-Chip
SPP           Spatial Pyramid Pooling
SSD           Single-Shot Detector
SVHN          Street View House Numbers Dataset
SWA           Scottish Wildcat Action
TF            TensorFlow
TLB           Translation Lookaside Buffer
TN            True Negative
TP            True Positive
TPU           Tensor Processing Unit
URL           Uniform Resource Locator
USS           Unique Set Size
VFP           Vector Floating-Point Register
YOLO          You Only Look Once Architecture

Chapter 1

Introduction

Back in 2012, AlexNet [1], a DNN fundamentally based on trainable convolutions, accomplished an unprecedented boost in accuracy in the ImageNet Large-Scale Visual Recognition Challenge [2]. Prior to that milestone, vision algorithms were ad hoc pieces of engineering demanding painstaking efforts from senior practitioners to achieve moderate performance in real-world scenarios. AlexNet proved that highly accurate visual pipelines were possible. Since then, both academia [3] and industry [4] have been successfully pushing the limits of DNNs and creating a vast technological ecosystem to make the most of them. Indeed, the number of hardware platforms, software frameworks, and model architectures tailored for DNN-based vision is by now overwhelming. From the point of view of application developers, the optimal selection of suitable components for prescribed specifications can be a daunting task. One of the main objectives of this book is precisely to assist with this task.

Neural networks are not a new concept; they were proposed back in the 1960s. In 1989, a major breakthrough based on these networks was accomplished: accurate visual recognition of handwritten digits [5]. However, no remarkable advances were reported for the next couple of decades until the advent of AlexNet. Four aspects are commonly identified as the joint reason for the renaissance of neural networks, with special relevance in the field of computer vision:

• Massive data available for training: DNNs require extensive datasets to learn feature maps at different abstraction levels that make accurate inference possible. Efforts in academia to create such datasets, in addition to the capacity of companies such as Facebook or Google to collect information from millions of users, produced enough training material to improve the performance of DNNs significantly.

• Massive compute capacity: Advances in semiconductor technologies [6] and processing architectures [7] provide the computing horsepower to conduct training and inference in reasonable timescales.

• Algorithmic techniques: New types of layers, operations, and interlayer connections have been progressively explored and incorporated into DNNs to increase their performance [8].


• Vast technological ecosystem for prototyping: In parallel with the three previous aspects, an extensive ecosystem of open-source tools and platforms has been developed by both academia and industry, enabling rapid prototyping of ideas and products. This, in turn, has spurred significant efforts in benchmarking [9] for fair comparison among the countless DNN-based implementations proposed in the literature.

Early successes with DNN solutions (e.g., AlexNet for image recognition) encouraged the study of their application to other vision tasks. Gradually, DNNs have become the reference approach for a number of tasks—object detection and tracking, pixel segmentation, image enhancement, etc.—in contrast to the myriad of distinct options available for every task in classical computer vision. This convergence of procedures around DNNs, together with the boost in accuracy, is the major asset fueling this technological revolution. A common processing scheme for many tasks facilitates the exploitation of a particular software–hardware infrastructure across multiple scenarios, which in turn decreases the risk of product development, creating a virtuous cycle of innovation [10].

But everything has a cost. DNNs are computationally heavy and memory hungry, not only during training but also when it comes to inference in real deployments. This significant increase in consumption of hardware resources leads to an energy consumption several orders of magnitude greater than that of classical algorithms [11]. Therefore, on the one hand, DNNs constitute a highly accurate and flexible approach for artificial vision; on the other hand, they are a computational burden that can quickly deplete resources needed by other concurrent processes, especially in embedded devices. The challenge is clear: How can we lighten DNNs while keeping their accuracy and versatility?

Pruning stands out as one of the most efficient techniques to address this challenge. It aims at removing unnecessary, or at least non-critical, operations within a DNN in order to reduce its computational complexity while ideally keeping accuracy unchanged or even increasing it. However, this technique must be conducted jointly with energy measurements to be effective [12]. Energy-aware quantization [13], based on data precision scaling, is another technique with a similar objective; it also has a major impact on memory management. In addition to these approaches, which act on the network models, solutions at system [14] or chip [15] level have also been proposed. In this book, we describe how bottlenecks in DNN processing can be identified by monitoring specific hardware metrics. We also present a methodology to predict, layer by layer, the performance of a network on a specific combination of hardware and software. Both tools provide meaningful information for direct action on those aspects of the system and network architecture that exert the greatest impact on the global operation, with a focus on runtime and energy consumption.

We are just starting to envisage the possibilities that pervasive DNN-enabled artificial vision will bring about [16]. From permanent assistance to blind and visually impaired people [17], to autonomous drone navigation [18] or unattended shopping [19], deep learning is becoming a silent background technology in all kinds of practical scenarios. We also cover the application level in this book by exploring the field of remote animal monitoring.


In collaboration with conservationists, the specifications for a smart camera trap were defined. Every step in the design and implementation process is carefully analyzed, from data curation to the final system realization, including DNN training parameterization and inference tuning.

All in all, artificial intelligence, and its embodiment in the form of DNNs, will play—is already playing—an instrumental role in people's daily lives, and the implications of this are being evaluated at all levels [20]. In this context, it is evident that the demand for researchers, specialists, and practitioners in the field will only grow over the next few years. We really hope that the contents presented in this book will be helpful for such professionals and also serve as a practical reference for both undergraduate and graduate students who want to get started in the discipline of embedded vision.

References

1. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105 (2012)
2. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A., Fei-Fei, L.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015)
3. Alam, M., Samad, M., Vidyaratne, L., Glandon, A., Iftekharuddin, K.: Survey on deep neural networks in speech and vision systems. Neurocomputing 17, 302–321 (2020)
4. Edge AI and Vision Alliance, Computer Vision Developer Survey. https://www.edge-ai-vision.com/2020/02/edge-ai-and-vision-alliance-november-2019-computer-vision-developer-survey-white-paper/
5. Le Cun, Y., Jackel, L., Boser, B., Denker, J., Graf, H., Guyon, I., Henderson, D., Howard, R., Hubbard, W.: Handwritten digit recognition: applications of neural network chips and automatic learning. IEEE Communications Magazine 27(11), 41–46 (1989)
6. IEEE International Roadmap for Devices and Systems (IRDS). https://irds.ieee.org/
7. Perry, T.: David Patterson says it's time for new computer architectures and software languages. IEEE Spectrum (2018). https://spectrum.ieee.org/david-patterson-says-its-time-for-new-computer-architectures-and-software-languages
8. Wang, X., Zhao, Y., Pourpanah, F.: Recent advances in deep learning. International Journal of Machine Learning and Cybernetics 11, 747–750 (2020)
9. Reddi, V., Cheng, C., Kanter, D., Mattson, P., Schmuelling, G., Wu, C., Anderson, B., Breughe, M., Charlebois, M., Chou, W., Chukka, R., Coleman, C., Davis, S., Deng, P., Diamos, G., Duke, J., Fick, D., Gardner, J., Hubara, I., Idgunji, S., Jablin, T., Jiao, J., John, T., Kanwar, P., Lee, D., Liao, J., Lokhmotov, A., Massa, F., Meng, P., Micikevicius, P., Osborne, C., Pekhimenko, G., Rajan, A., Sequeira, D., Sirasao, A., Sun, F., Tang, H., Thomson, M., Wei, F., Wu, E., Xu, L., Yamada, K., Yu, B., Yuan, G., Zhong, A., Zhang, P., Zhou, Y.: MLPerf inference benchmark. In: Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 446–459 (2020)
10. Bier, J.: Computer Vision 2.0: Where We Are and Where We're Going. Embedded Vision Summit (2016). https://www.edge-ai-vision.com/2016/05/computer-vision-2-0-where-we-are-and-where-were-going-a-presentation-from-the-embedded-vision-alliance/
11. Suleiman, A., Chen, Y., Emer, J., Sze, V.: Towards closing the energy gap between HOG and CNN features for embedded vision. In: IEEE International Symposium on Circuits and Systems (ISCAS) (2017)


12. Yang, T., Chen, Y., Sze, V.: Designing energy-efficient convolutional neural networks using energy-aware pruning. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6071–6079 (2017)
13. Peluso, V., Calimera, A.: Energy-driven precision scaling for fixed-point ConvNets. In: IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC), pp. 113–118 (2018)
14. Moreno, A., Olivito, J., Resano, J., Mecha, H.: Analysis of a pipelined architecture for sparse DNNs on embedded systems. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28(9), 1993–2003 (2020)
15. Wang, J., Fu, X., Wang, X., Liu, S., Gao, L., Zhang, W.: Enabling energy-efficient and reliable neural network via neuron-level voltage scaling. IEEE Transactions on Computers 69(10), 1460–1473 (2020)
16. Qualcomm, Emerging vision technologies: Enabling a new era of intelligent devices. Technical Report (2016). https://www.qualcomm.com/documents/emerging-vision-technologies-enabling-new-era-intelligent-devices
17. OrCam MyEye. https://www.orcam.com/en/
18. Skydio Autonomy. https://www.skydio.com/skydio-autonomy
19. Amazon Go. https://www.computerweekly.com/feature/Amazon-Go-is-now-the-right-time
20. Artificial Intelligence, A European Perspective. https://publications.jrc.ec.europa.eu/repository/handle/JRC113826

Chapter 2

Embedded Vision for the Internet of Things: A Survey on State-of-the-Art Hardware, Software, and Deep Learning Models

Abstract The adoption of DL-based computer vision algorithms in IoT devices enables the development of accurate vision-enabled applications. This requires the integration of network models in embedded hardware and software. In this chapter, we review core technological components to achieve such integration in practice. Within the broad and ever-growing ecosystem of DL technological elements, we describe representative CNNs, libraries, and hardware architectures currently available. A global overview of this ecosystem is depicted in Fig. 2.1. Each application scenario is intimately related to a particular dataset to train a CNN architecture, which will be deployed on a framework supported by underlying libraries that interact with the hardware through the operating system.

2.1 Network Models

Convolutional neural networks (CNNs) mainly rely on the convolution operation to extract features. Their architecture is inspired by the processing flow inherent to visual perception in biological systems. The generic architecture of a CNN is illustrated in Fig. 2.2.

Network training refers to the optimization process through which the learnable parameters (weights) of the CNN are adjusted to perform a visual task, e.g., classification, object detection, or segmentation. A large dataset and high computational power are required for training CNNs. Network inference (also called network execution or CNN forward pass) refers to the process through which a previously trained network infers on an input image. This inference produces a numerical output resulting from the layers' operations and learnt weights. For instance, classification networks yield an output vector in which each element represents the probability of its corresponding category.

CNNs are composed of a large number of layers operating on their corresponding input data. Three architectural principles are followed: (i) local connectivity between neurons, (ii) weight sharing for groups of connections, which reduces the number of parameters, and (iii) dimensionality reduction to retain only useful information. Each layer's output constitutes the next layer's input. Relevant aspects of CNN layers are outlined in this section, and the mathematical notation employed throughout the book is also introduced.

Fig. 2.1 Hierarchical stack of technological components involved in embedded realizations of DL. Numerous options are available at each level

Fig. 2.2 Generic CNN architecture. Its core operation is the convolutional layer, but other operations, such as pooling or normalization, are also included in the network architecture. Fully connected layers are usually inserted at the end of the CNN


Fig. 2.3 Processing based on a sliding window. The green array is the operation kernel (implementing filtering, pooling, etc.), whereas the RGB array is the input image. The process of sliding the kernel along the horizontal dimension of the input is depicted with light areas. The kernel will be eventually applied to the whole image by sliding in the vertical dimension as well

Feature maps. The input and output data of each network layer comprise several channels of two-dimensional arrays; these are the so-called feature maps (fmaps), arranged in a 3D tensor. (Although the input of a network is given by the original image composed of pixels, the subsequent output data from network layers are called activations and are also arranged into fmaps.) We will denote fmap dimensions as H × W × C, meaning a set of H rows of activations, with W activations per row, and C channels in the third dimension. Subindexes "in" and "out" refer to input and output data of network operation layers. Most CNNs are fed with a 3-channel input image (in RGB format), which is the input for the first network layer. Then, the CNN progressively reduces the spatial dimensions (H × W) of the successive fmaps while increasing the number of channels C (thus achieving dimensionality reduction). Input and output fmaps will be denoted as I and O, respectively, whereas W denotes the learnt network weights.

Therefore, each layer takes a 3D input tensor I, performs operations that may involve a set of parameters W, and generates output data O for the next layer. These layer operations can be locally performed on subregions of the I volume called receptive fields (local connectivity). In this case, a kernel of size k_h × k_w × k_c usually operates over the inputs composing the receptive field (k_c usually coincides with C_in). The layer eventually operates over the whole input data as the kernel is progressively slid over I—the so-called sliding window, with a stride of s pixels (weight sharing). Figure 2.3 exemplifies how a 3 × 3 × 3 kernel filter (green array) is progressively applied on receptive fields of the H × W × 3 (RGB array) input image. In particular, the figure illustrates four positions of the sliding window (light areas) resulting from a unit stride s = 1 (to save space, only a set of horizontal steps is shown; the filter eventually travels over the entire image). Therefore, the vertical size of the output fmap O is expressed as follows (a similar equation for the horizontal dimension is obtained by replacing H with W):

$$H_{out} = \left\lfloor \frac{H_{in} - k_h + 2p}{s} \right\rfloor + 1 \qquad (2.1)$$
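As a quick check of Eq. (2.1), the following Python sketch computes the output size of a sliding-window layer; the function name and sample values are illustrative, not taken from any particular framework.

```python
def output_size(h_in: int, k: int, s: int = 1, p: int = 0) -> int:
    """Spatial output size per Eq. (2.1): input h_in, kernel k, stride s, padding p."""
    return (h_in - k + 2 * p) // s + 1

# "Valid" padding (p = 0): 224 input rows, 3x3 kernel, stride 1 -> 222 output rows
assert output_size(224, k=3) == 222
# "Same" padding (p = (k - 1) // 2) keeps the dimensions: -> 224 output rows
assert output_size(224, k=3, p=1) == 224
```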


Table 2.1 Notation used throughout this book for data and operations involved in CNNs

Variable name    Description
I, O, W          Input, output, and weight tensors
X_{i,j,k}        Element located at the (i, j, k)th position of the 3D tensor X
n(X)             Number of elements in the tensor X
H, W, C          Height and width of a feature map, and number of channels
k_h, k_w, k_c    Kernel size: height, width, and number of channels
s, p             Stride and padding used for sliding operations

where p denotes an optional zero-padding that can be applied to enlarge the input in order to keep prescribed spatial dimensions. For instance, if the input contains H_in = 224 rows and the receptive field is characterized by k = 3, s = 1, without padding, we obtain an output with H_out = 222. However, if a border with p = 1 is applied, the input dimensions are kept constant, i.e., H_out = H_in = 224. This is the so-called same padding, given by p = (k − 1)/2, which keeps the dimensions unmodified, as opposed to the so-called valid padding, which adds no borders (p = 0). Table 2.1 summarizes the explained parameters.

Common layers. Three types of layers are usually included in CNNs: convolutional, pooling, and fully connected. They are briefly described next, together with other types that can also be found in modern networks.

• Convolutional (CONV). A set of N kernels of dimensions k_h × k_w × C_in filters the input fmap. In the 4D tensor W, the nth kernel W^{(n)}, n = 1, ..., N, yields the nth 2D output feature map, in which the activations are obtained from the dot product:

$$O_{x,y,n} = \sum_{c=1}^{C_{in}} \sum_{i=1}^{k_h} \sum_{j=1}^{k_w} W_{i,j,c}^{(n)} \, I_{x+i,\, y+j,\, c} \qquad (2.2)$$

Moreover, learnt biases b^{(n)} can also be added to each output activation. Figure 2.4 illustrates a CONV layer operation in which N 3 × 3 kernel filters are applied. Note that C_out is equal to the number of applied kernels N. Variations of the standard CONV expressed by Eq. (2.2) have been proposed:

• Group CONV. It was first used in AlexNet [1]. Input channels are divided into N_g groups (C_in/N_g channels per group) on which different kernel filters operate separately (only N/N_g filters per group). Thus, each of the N/N_g output fmaps depends only on the corresponding fraction of inputs, reducing the layer's weights and computation overall:


Fig. 2.4 Convolutional layers constitute the core operation of CNNs. Each kernel filter—depicted in green—operates on sliding local regions of the input fmaps—receptive fields in light blue—to produce the corresponding output fmaps

$$O_{x,y,n} = \sum_{c=1}^{C_{in}/N_g} \sum_{i=1}^{k_h} \sum_{j=1}^{k_w} W_{i,j,c}^{(n)} \, I_{x+i,\; y+j,\; \left\lfloor \frac{n-1}{N/N_g} \right\rfloor \frac{C_{in}}{N_g} + c} \qquad (2.3)$$

where the group index 0, 1, 2, ..., N_g − 1 is given by $\lfloor (n-1)/(N/N_g) \rfloor$ for n = 1, 2, ..., N. The special case given by N_g = C_in = N (one 2D kernel filter is applied to each input channel to produce one output fmap) is called depthwise CONV:

$$O_{x,y,n} = \sum_{i=1}^{k_h} \sum_{j=1}^{k_w} W_{i,j}^{(n)} \, I_{x+i,\, y+j,\, n} \qquad (2.4)$$

• Pointwise CONV is a standard convolution whose filters are 1 × 1:

$$O_{x,y,n} = \sum_{c=1}^{C_{in}} W_{c}^{(n)} \, I_{x,y,c} \qquad (2.5)$$
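Before moving on to the remaining layer types, the following minimal NumPy sketch makes Eqs. (2.2), (2.4), and (2.5) concrete for a single input fmap (no padding, unit stride, no biases). The loop-based code favors readability over speed; function names and tensor layouts are our own choices for illustration.

```python
import numpy as np

def conv_standard(I, W):
    """Standard CONV, Eq. (2.2). I: (H, W, C_in); W: (N, kh, kw, C_in)."""
    n, kh, kw, _ = W.shape
    h_out, w_out = I.shape[0] - kh + 1, I.shape[1] - kw + 1
    O = np.zeros((h_out, w_out, n))
    for x in range(h_out):
        for y in range(w_out):
            for f in range(n):
                # Dot product between the f-th kernel and the receptive field
                O[x, y, f] = np.sum(W[f] * I[x:x + kh, y:y + kw, :])
    return O

def conv_depthwise(I, W):
    """Depthwise CONV, Eq. (2.4): one 2D kernel per input channel. W: (C, kh, kw)."""
    c, kh, kw = W.shape
    h_out, w_out = I.shape[0] - kh + 1, I.shape[1] - kw + 1
    O = np.zeros((h_out, w_out, c))
    for x in range(h_out):
        for y in range(w_out):
            for f in range(c):
                O[x, y, f] = np.sum(W[f] * I[x:x + kh, y:y + kw, f])
    return O

def conv_pointwise(I, W):
    """Pointwise CONV, Eq. (2.5): 1x1 kernels mixing channels. W: (N, C_in)."""
    return I @ W.T  # (H, W, C_in) x (C_in, N) -> (H, W, N)
```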

• Pooling (POOL). This operation lowers the spatial dimensions of the fmaps by applying a simple operation to each k_h × k_w patch with a stride s. The "maximum" operation is a simple and widely used computation on receptive fields. A special case is global average pooling, which operates over the entire input fmap (k_h = H_in, k_w = W_in), producing an output vector sized 1 × 1 × C_in.

• Fully Connected (FC). Similar to classical neural networks operating on data vectors, an FC layer applies one weight factor to each connection between input and output activations—both 1D vectors. Additional bias terms can be added. FC layers are usually located at the end of the network to perform classification on the fmaps produced by the rest of the layers.

• Nonlinear activation. To introduce non-linearities between layers, various functions can be individually applied to each input activation. The rectified linear unit (ReLU) is the most popular among them. It simply selects the maximum between each input activation I_{i,j,c} and 0. This speeds up the calculation of non-linearities with respect to other activation functions such as sigmoid and tanh. In addition, ReLU is more suitable for rapid training convergence [2].

• Normalization. This type of activation layer aims to enhance the training stage by changing the range of activation values. For instance, the popular batch-normalization (BN) normalizes the activations of each channel to zero mean and unit variance—plus scale and shift operations—across the training batch [3].

• Multiple-input layers operate by combining input volumes from more than one previous layer. These inputs usually have the same spatial dimensions, H × W, but their number of channels may differ. This is useful to merge data from branched networks or modules, as is the case of the Inception module included in a series of CNNs [4–6]. For instance, concatenation of fmaps—in which no mathematical operation is carried out apart from data reorganization—or elementwise operations are usually performed.

• Loss function. Networks usually end with a loss function. Softmax is the most notable one for classification tasks. It outputs a normalized probability distribution from a vector of N class predictions, [I_1, I_2, ..., I_N], by applying the function

$$softmax(I_c) = \frac{e^{I_c}}{\sum_{p=1}^{N} e^{I_p}}$$

to obtain the probability score for the input I_c.
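A minimal NumPy sketch of this softmax computation (subtracting the maximum logit, a standard trick for numerical stability, is our addition):

```python
import numpy as np

def softmax(logits):
    """Normalized probability distribution over N class predictions."""
    e = np.exp(logits - np.max(logits))  # shift by the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # -> approximately [0.66, 0.24, 0.10]
```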

2.1.1 CNN-Based Tasks

Several types of algorithms can be implemented using CNNs. Common visual tasks based on CNNs include:

• Image classification. The network recognizes the single object contained in an image and assigns it to one image category with an associated probability. This task is employed in most DL-based vision systems and is widely supported by software libraries and hardware accelerators.

• Object detection. The network finds and classifies all the objects in the image, providing both bounding boxes and labels for such objects (also with confidence scores). This task builds on image classifiers but requires complex post-processing to deal with the numerous bounding boxes yielded with different probabilities. Object detection can use one-stage or two-stage approaches—more details in Sect. 2.1.3. Many software tools and DL hardware processors do not cover detector requirements and do not implement the whole set of operations involved.

• Semantic segmentation. The network classifies each and every pixel into object categories, thus dividing the image into different areas. The algorithm includes the "deconvolution" operation to map assigned categories to pixels. This task is barely supported by current DL software and hardware.

• Face recognition. The network identifies human faces. The CNN algorithm relies on detection, alignment, feature extraction, and classification.


2.1.2 Metrics on CNN Complexity and Performance

To compare different CNN architectures, quantitative metrics are employed. Useful indicators for CNN implementation (summarized in Table 2.2) are:

Computational complexity. A common estimator of model complexity is the number of floating-point operations (FLOPs) required for network inference—in some cases, only CONV and FC operations are taken into account.

Model size. The number of learnable parameters of a network impacts both the time required for network training and its memory footprint during inference.

Accuracy. Networks are trained on a target dataset for a specific application, with diverse loss functions and optimizers used during the training process. Once a CNN is trained, it provides a particular inference precision. For instance, Top-N accuracy and mean average precision (mAP) are the most frequently applied metrics to evaluate classification [7] and detection [8] tasks, respectively:

• Classification networks. The number of outputs coincides with the number of target classification categories. The image is classified as belonging to the most probable class, that is, the class whose corresponding network output value is the highest. If this output matches the image ground-truth label, the classification is correct. A relaxed criterion (Top-N) deems a classification correct if the ground-truth category is included among the N most probable classes in the output vector. Evaluating the Top-N accuracy on a dataset means averaging per-image prediction results.
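As an illustrative sketch of Top-N evaluation, assuming outputs is an images × classes array of network scores and labels holds the ground-truth class indices (both names are ours):

```python
import numpy as np

def top_n_accuracy(outputs, labels, n=5):
    """Fraction of images whose ground-truth class is among the N best scores."""
    top_n = np.argsort(outputs, axis=1)[:, -n:]  # indices of the N highest scores
    hits = [label in row for label, row in zip(labels, top_n)]
    return float(np.mean(hits))
```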

Table 2.2 Widely used CNN performance metrics

Metric                          Unit
Computational complexity        FLOPs
Network size                    # parameters, MB
Application-specific accuracy   rate
Throughput                      # inferences per second (fps)
Energy, power                   J, mW
Memory allocation               MB
System utilization              %

Fig. 2.5 Intersection over union. Intersection is the overlap region between areas A (green) and B (blue), whereas union is the overall region including both boxes. IoU is the fraction between intersection and union


Table 2.3 Confusion matrix and network accuracy metrics for object-detection CNNs

                   Positive prediction   Negative prediction
Actual positive    TP                    FN
Actual negative    FP                    TN

Precision = TP / (TP + FP)          NPV = TN / (TN + FN)
Recall = TP / (TP + FN)             Specificity = TN / (FP + TN)
Accuracy = (TP + TN) / (TP + FN + FP + TN)

NPV stands for negative predictive value. Likewise, precision is the positive predictive value.

• Detection networks. These networks output bounding boxes including the detected objects, in addition to vectors of "confidence scores" to classify the detections among a list of categories. The detection performance is typically evaluated through four indicators, namely precision rate, recall rate, mAP, and F1-score. First, given the ground-truth B_gt and predicted B_p bounding boxes, the intersection over union (IoU) is calculated as the ratio between the overlap and union areas (see Fig. 2.5):

$$IoU = \frac{area(B_p \cap B_{gt})}{area(B_p \cup B_{gt})} \qquad (2.6)$$

If the overlap exceeds a given threshold t, i.e., IoU > t, the prediction is considered a true positive (TP); otherwise, it is a false positive (FP) (see Table 2.3). We elaborate on object-detection metrics next.

• The recall rate reflects the proportion of actual positive cases within the total ground truth of positives—that is, the fraction of detected objects (in this context, recall, sensitivity, and true-positive rate are synonyms).

$$R = \frac{TP}{TP + FN} = \frac{TP}{\#Positives} \qquad (2.7)$$
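A minimal Python sketch of Eq. (2.6); the corner-based box representation (x1, y1, x2, y2) is an assumption made for illustration:

```python
def iou(box_a, box_b):
    """Intersection over union, Eq. (2.6), for (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```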

• The precision rate indicates the proportion of real positive samples within all the predictions identified—that is, the fraction of detections that are correct (precision is also called positive predictive value).

$$P = \frac{TP}{TP + FP} = \frac{TP}{\#Detections} \qquad (2.8)$$

• F1-score is the harmonic mean of the precision and recall rates. Higher values of this metric indicate better models.

$$F1 = \frac{2PR}{P + R} \qquad (2.9)$$


• True-negative rate (TNR or specificity) and false-positive rate (FPR) are defined as follows:

$$TNR = \frac{TN}{FP + TN} = \frac{TN}{\#Negatives} \qquad (2.10)$$

$$FPR = 1 - TNR = \frac{FP}{FP + TN} = \frac{FP}{\#Negatives} \qquad (2.11)$$

• One of the most used accuracy metrics for detectors is the "mean average precision" (mAP). Average precision (AP) is the area under the precision–recall (P–R) curve of each category. This curve takes into consideration the accumulated TP and FP, calculates the corresponding precision and recall rates, and then plots a monotonically decreasing precision. The AP is calculated as the area under this curve [9, 10]:

$$AP = \int_{0}^{1} p(r) \, dr \qquad (2.12)$$

where p and r denote the precision and recall rates, respectively. PASCAL VOC 2007 only takes 11 equally spaced recall points on the P–R curve (from 0 to 1 with a step size of 0.1):

$$AP = \frac{1}{11} \sum_{r \in \{0,\, 0.1,\, 0.2,\, \ldots,\, 1\}} p(r) \qquad (2.13)$$

A newer criterion, set in PASCAL VOC 2010–2012, considers all points to calculate the area under the P–R curve (AUC). The AP is calculated for each detection category, and then the mAP is obtained as the average of these per-class AP values:

$$mAP = \frac{1}{n_{classes}} \sum_{i=1}^{n_{classes}} AP_i \qquad (2.14)$$
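The following sketch illustrates the 11-point AP of Eq. (2.13) and the class average of Eq. (2.14), assuming recall and precision are NumPy arrays of accumulated P–R points for one category (interpolating precision as the maximum at recall ≥ r, as in PASCAL VOC 2007):

```python
import numpy as np

def average_precision_11pt(recall, precision):
    """11-point interpolated AP, Eq. (2.13); arrays sorted by increasing recall."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        # Interpolated precision: maximum precision at recall levels >= r
        p = precision[mask].max() if mask.any() else 0.0
        ap += p / 11.0
    return ap

def mean_average_precision(per_class_ap):
    """Eq. (2.14): average of per-class AP values."""
    return float(np.mean(per_class_ap))
```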

In contrast, Microsoft Common Objects in Context (MS COCO) [11] considers different IoU thresholds—from 0.5 to 0.95 with a step size of 0.05—and then averages the resulting mAP values for each threshold. That is, COCO mAP_{IoU=0.50} corresponds to the traditional mAP on PASCAL VOC.

• Segmentation networks. The percentage of pixels correctly labeled (TP) is a common evaluation metric for these algorithms:

$$Accuracy = \frac{TP}{TP + FP + FN} \qquad (2.15)$$


This accuracy rate can be calculated per class. Another possible evaluation metric is the intersection of the segmentation and the ground truth divided by their union [9].

Inference performance metrics. Table 2.2 includes common indicators of inference performance. Some key parameters useful to compare CNN implementations at execution time include power consumption (W), energy demand per image (J), inference time (s), network throughput (fps), memory allocation (MB), FLOPs executed per second (FLOP/s), and system utilization (%). These metrics will be extensively evaluated, compared, and discussed in the next chapters. For a further reference, an extensive comparison of various classification CNNs was reported in [12]; this comparison, based on the aforementioned metrics, is presented in plots to facilitate the assessment.

2.1.3 Popular Convolutional Neural Networks

Since the breakthrough of AlexNet [1] back in 2012, diverse architectures have been designed. Here, we summarize the main characteristics of classic CNNs for image classification—most of them trained on ImageNet [7] (1000 image categories).

Classification networks

LeNet [13] was proposed in 1998 for handwriting recognition—the MNIST dataset [14]. Its architecture contains two CONV, two pooling, and three FC layers. This network set some of the design guidelines for subsequent CNNs: local receptive fields, weight sharing, and spatial sub-sampling. However, this CNN did not receive much attention at the time.

AlexNet [1] was the first CNN to win the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2012, significantly outperforming all the competitors. Its architecture is composed of 8 CONV and 3 FC layers. This CNN also includes ReLU, normalization, and dropout layers; a dropout layer, used to prevent overfitting, randomly sets input activations to zero during training. To distribute the training process across multiple GPUs, AlexNet includes group convolutions.

Network in Network [15] is based on stacked pointwise CONV layers—equivalent to micro multi-layer perceptrons, called MLPconv in this architecture. Given that the last MLPconv layer generates one feature map per predicted class, FC layers are omitted and replaced by a global average pooling layer. By exploiting both strategies, this network achieves a considerable reduction in the number of parameters with respect to preceding models.

Inception [5] was the winner of ILSVRC in 2014 (its first version is also known as GoogLeNet). It was the first architecture composed of modules containing parallel layers. Its building block, i.e., the Inception module, contains 4 parallel branches designed as follows: CONV 1 × 1 plus CONV 5 × 5; CONV 1 × 1 plus CONV 3 × 3;


CONV 1 × 1; and pooling plus CONV 1 × 1. These different kernel sizes can extract features at different image scales. Subsequent realizations of the Inception family introduced additional strategies (batch-normalization layers, kernel-size reduction, convolution factorization, etc.) [4, 6].

VGG [16] is a very deep model also submitted to ILSVRC in 2014. Its architecture was inspired by AlexNet, but the large 11 × 11 and 5 × 5 kernels were replaced by kernels with a reduced receptive field (3 × 3 and 1 × 1). Various configurations with different depths were explored, all of them containing ReLU, pooling, and FC layers, in addition to CONV.

ResNet [17], or residual network, won the ILSVRC competition in 2015 and was the first CNN to exceed human-level accuracy. The contribution of this family of networks was the use of shortcut connections skipping one or more layers. Another remarkable feature is the bottleneck module, or residual block, composed of 3 stacked 1 × 1, 3 × 3, and 1 × 1 CONV layers. The data channels are first reduced (1 × 1 CONV), and then a 3 × 3 CONV operation is performed. Finally, the original data dimensions are restored, thereby alleviating the computational load of the intermediate 3 × 3 layer. Overall, ResNet networks are very deep but accurate models.

SqueezeNet [18] proposes a drastic reduction in the number of network parameters while maintaining accuracy levels similar to previous CNNs. To achieve this, 1 × 1 CONV layers are exploited to massively reduce the number of weights. Its building block, the so-called Fire module, stacks the "squeeze" block, composed of 1 × 1 filters, and the "expand" block, made up of 1 × 1 and 3 × 3 CONV layers. This architecture also removes the parameter-intensive FC layers.

MobileNet [19, 20] is a family of networks proposed by Google, aiming to be efficient on embedded and mobile devices. The architecture introduces the concept of depthwise separable convolutions, which greatly reduce the computational load and the number of weights. Its novel technique is the factorization of a standard CONV into two layers: (i) a depthwise CONV, extracting features in space, and (ii) a pointwise CONV, extracting features in the channel dimension. A set of MobileNet networks were trained, providing a trade-off between classification accuracy and inference cost. These networks are characterized by two parameters: the width multiplier α ∈ (0, 1], which reduces the number of channels, and the resolution multiplier ρ ∈ (0, 1], which reduces the height and width of the squared input image. These multipliers have the effect of reducing the computational cost by α² and ρ², respectively. The baseline MobileNet architecture sets both factors to 1.

ShuffleNet [21, 22] uses two techniques to reduce the computational load, namely pointwise group CONV (as introduced in AlexNet) and channel shuffle. Its main building block, i.e., the ShuffleNet unit, is composed of a group pointwise CONV, a channel shuffle layer, a 3 × 3 depthwise CONV, and another pointwise group CONV. Each group CONV splits the fmaps into G groups and applies a depthwise kernel to each group. In addition, the channel shuffle operation randomly combines information between groups.
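A back-of-the-envelope sketch of why MobileNet's depthwise separable convolutions are cheaper than standard ones; the multiply-accumulate (MAC) counts follow the layer definitions of Sect. 2.1, and the concrete numbers are illustrative:

```python
def macs_standard(h, w, c_in, c_out, k):
    # Standard CONV: one k x k x c_in kernel per output channel
    return h * w * c_in * c_out * k * k

def macs_separable(h, w, c_in, c_out, k):
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 filters mix the channels
    return depthwise + pointwise

h = w = 112
c_in, c_out, k = 64, 128, 3
ratio = macs_standard(h, w, c_in, c_out, k) / macs_separable(h, w, c_in, c_out, k)
print(f"{ratio:.1f}x fewer MACs")  # ~8.4x; in general 1 / (1/c_out + 1/k**2)
```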


Object-detection networks

The architectures of object-detection networks can be classified into two-stage and one-stage approaches. These networks are usually trained on two popular datasets for object detection: PASCAL VOC [9] (with bounding boxes belonging to 20 object categories) or the MS COCO dataset [11] (including 80 classes).

• Two-stage object detectors first generate class-agnostic regions of interest (RoI) and then classify them into object categories. Relevant two-stage architectures include region CNN (R-CNN) [23], Fast R-CNN [24], Faster R-CNN [25], region-based fully convolutional network (R-FCN) [26], and spatial pyramid pooling (SPP-net) [27]. These networks usually achieve higher accuracy than one-stage approaches, but with slower inference.

• One-stage approaches combine the two aforementioned stages into a single end-to-end trainable network. Therefore, they are more computationally efficient models, although some accuracy is traded off. Relevant one-stage models include the single-shot detector (SSD) [28], YOLO [29–31], CornerNet [32, 33], CenterNet [34], ExtremeNet [35], RetinaNet [36], and EfficientNet [37].

Many architectures were inspired by the SSD [28] one-stage detector. This network searches for bounding boxes with a series of established aspect ratios and scales. The object-size variability is handled by using multiple-resolution fmaps as prediction inputs. The architecture includes (i) a backbone network for extracting image features and (ii) extra feature layers for predicting the bounding boxes and their associated confidences (i.e., the "SSD head" located at the end of the network). For instance, MobileNet-SSD uses MobileNet as the backbone, thus becoming a high-performance model.

The YOLO series ("You Only Look Once") [29–31] was the pioneering one-stage model, becoming a widely adopted real-time detector. It divides the input image into an S × S grid of cells; the output is given in terms of confidences for multiple categories and bounding boxes—each cell is responsible for detecting objects centered in it. Using different stride sizes, YOLO can detect objects at three scales: 13 × 13 (small), 26 × 26 (medium), and 52 × 52 (large). Its backbone network for feature extraction is the so-called DarkNet-53 architecture (53 CONV layers) [38], plus detection modules. Tiny-YOLO [38] and YOLO-LITE are lightweight versions of YOLO.

2.1.4 Training Convolutional Neural Networks

Given a set of labeled samples (images), DL training based on supervised learning is an optimization process that minimizes a loss function. For instance, classification networks usually minimize the difference between predictions and true labels as measured by the cross-entropy function. It is common to use the stochastic gradient


descent (SGD) for this optimization problem. This method iteratively takes subsets of samples (batches) to update the parameters (network weights) in the opposite direction of the gradient of the objective function w.r.t. the parameters. The initialization of the network parameters can be performed randomly or may be taken from a pre-trained network—this is the so-called transfer learning approach. Then, a weight update rule is employed to adapt the weights in each iteration. This optimizer rule [39] usually depends on training parameters such as the learning rate and momentum, which control the speed of the gradient descent. The gradient of the loss function w.r.t. the weights, ∇f(W), is calculated for each layer in the network using backpropagation—from the top layers back to the initial ones in the CNN. For instance, two simple weight update rules are defined as follows:

W(i+1) = W(i) − η∇f(W(i))    (2.16)

W(i+1) = W(i) + μ(W(i) − W(i−1)) − η∇f(W(i))    (2.17)

where W(i) represents the weights at iteration i, and η and μ are the learning rate and momentum, respectively. Note that at each iteration i, only a subset of the dataset is employed to calculate ∇f(W). More refined weight optimization algorithms are Nesterov, Adagrad, Adadelta, Adam, RMSProp, etc. [39].

The majority of this book will focus on pre-trained neural networks ready for embedded inference. Notwithstanding, specific application scenarios require CNN training on particular datasets. In this regard, network training will be covered in Chap. 6, which comprehensively elaborates and exemplifies the process—also including tips and code.
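To make Eq. (2.17) concrete, here is a minimal NumPy sketch of an SGD-with-momentum loop; the loss gradient is a hypothetical stand-in for the backpropagated gradient, not code from any framework.

SGD with momentum (Eq. 2.17) – Python

import numpy as np

def sgd_momentum_step(W, W_prev, grad, eta=0.01, mu=0.9):
    """One update following Eq. (2.17): momentum term plus gradient step."""
    return W + mu * (W - W_prev) - eta * grad

def grad_f(W, batch):
    # Stand-in gradient: for the toy loss f(W) = ||W||^2 / 2, grad is W itself.
    return W

W_prev = np.zeros(10)
W = np.random.randn(10)
for i in range(100):
    batch = None                       # stand-in for a mini-batch of samples
    g = grad_f(W, batch)
    W, W_prev = sgd_momentum_step(W, W_prev, g), W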

2.2 Software

We can find a notable number of software libraries that facilitate the design, training, and execution of CNNs. Some of the most popular ones are summarized next.

Caffe [40] was developed by the Berkeley Vision and Learning Center (BVLC). This very popular tool offers implementations of an extensive number of state-of-the-art CNNs. Pre-trained weights for these architectures are also provided and supported by developers [41]. Its core code is written in C++, but it also provides Python and MATLAB APIs.

TensorFlow [42] (TF) was released by Google as an open-source library for developing ML applications. The set of computational operations (layers) is expressed as a stateful dataflow graph based on tensors (activations). Differentiable programming on that graph allows model training, whereas graph execution enables image inference. C++ and Java interfaces are available. However, most of TF’s open-source code is provided in Python. This framework has become a reference tool for ML development. In addition, Google also developed TensorFlow Lite for model


implementation on embedded devices. This variant specifically targets network inference, reducing latency and model size.

Caffe2 [43], released by the Facebook AI Research group, was based on Caffe code but with a focus on mobile deployment. The network definition is expressed as computation graphs. This lightweight and modular framework specifically supports embedded operating systems such as Android, Raspbian, and iOS.

PyTorch [44, 45] was also developed by Facebook AI. It is built on the Torch library. A software package called PyTorch Mobile was also released to deploy models on embedded devices running Android or iOS. In 2018, Facebook AI merged the Caffe2 and PyTorch projects, so Caffe2 is now part of PyTorch.

OpenCV [46] (Open Source Computer Vision Library) is an open-source library that includes a large number of functions and modules for CV, mainly supporting classical algorithms. Concerning DL, the library implements network inference in its “DNN module.” CNNs trained with a diversity of frameworks—e.g., Caffe, TF, Torch, etc.—can be loaded and then executed on OpenCV just to run inference. This library offers C++, Python, and Java APIs, and is compatible with several platforms—Linux, Windows, iOS, Android, etc.

Other alternatives to the above frameworks can be pointed out, such as MXNet [47], Paddle/Paddle-Lite (PArallel Distributed Deep LEarning) [48, 49], Theano [50], Mobile AI Compute Engine (MACE) [51], Mobile Neural Network (MNN) [52, 53], and Tengine [54].

2.2.1 Software Implementation Strategies

Although the inputs and outputs of convolutional networks are tensors, their computational implementation and memory allocation can be performed in several ways. For instance, a usual approach for storing image data is a row-major C × H × W layout, in which activations from the same row ([0, 1, 2, ..., W − 1]) are physically adjacent in memory, whereas fetching channel information implies discontinuous memory accesses. By contrast, a column-major layout allocates elements from the same column adjacently in memory.

In addition, there are different ways to implement each CNN layer. For instance, a simple six-level nested loop over Cin–Hin–Win–Cout–Hout–Wout can perform the CONV operation defined in Eq. (2.2). However, this naive implementation can be enhanced by restructuring the data or calling a matrix multiplication subroutine. For instance, vectorizing input data or changing the loop order can exploit local data reuse in the registers and the cache hierarchy. A common approach to implement convolutions involves the use of highly optimized general matrix-to-matrix multiplication (GEMM) functions, such as those provided by Basic Linear Algebra Subroutines (BLAS) [55–58] libraries: ATLAS [59], OpenBLAS [56], MKL [60], Eigen [61], cuBLAS [62], cuDNN [63], etc. To leverage such routines, an image-to-column transformation, known as im2col or matrix lowering, is first performed [64, 65].


Fig. 2.6 Data reshaping on input data and kernel weights—so-called im2col and im2row. Each receptive field on an input fmap sized kh × kw × Cin becomes a column of an enlarged matrix. Similarly, weights on each 3D filter are flattened into a row. Thus, convolution becomes a matrix-to-matrix product

This data reorder operation expands the 3D input volume into a large 2D matrix, containing one receptive field of kh·kw·Cin elements per column (i.e., an array with shape n(W) × (Hout·Wout)). Convolution weights are also organized into 2D arrays of size Cout × n(W)—see Fig. 2.6. Thus, the convolution becomes a large matrix-to-matrix multiplication expressed as O_unrolled = W_unrolled I_unrolled. Reading the “unrolled” input matrix involves sequential accesses to memory, which is more efficient and also essential for exploiting SIMD and parallel execution. However, note that this approach inherently leads to a trade-off between memory consumption and computational efficiency, because oversized arrays that include duplicate data are built. These arrays are larger than the original input tensor.

Likewise, FC layers can also be implemented with GEMM routines. Layer weights W can be arranged into a matrix of size n(O) × n(I), and then the FC operation becomes a matrix-to-vector multiplication expressed as O = WI (see Fig. 2.7). Other approaches to accelerate matrix multiplication in CNNs include algorithms based on the fast Fourier transform (FFT) [66], Winograd [67], and Strassen [68].
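The im2col transformation and the subsequent GEMM can be sketched in a few lines of NumPy. This illustrative version handles a single input fmap with no padding and is not optimized; names and shapes are chosen for clarity.

im2col and convolution as GEMM – Python

import numpy as np

def im2col(x, kh, kw, stride=1):
    """Expand a Cin x H x W input into a (Cin*kh*kw) x (Hout*Wout) matrix.
    Each column holds one receptive field (duplicated data, larger memory)."""
    C, H, W = x.shape
    H_out = (H - kh) // stride + 1
    W_out = (W - kw) // stride + 1
    cols = np.empty((C * kh * kw, H_out * W_out), dtype=x.dtype)
    col = 0
    for i in range(0, H - kh + 1, stride):
        for j in range(0, W - kw + 1, stride):
            cols[:, col] = x[:, i:i + kh, j:j + kw].ravel()
            col += 1
    return cols, H_out, W_out

x = np.random.randn(3, 8, 8).astype(np.float32)      # Cin=3 input fmap
w = np.random.randn(16, 3, 3, 3).astype(np.float32)  # Cout=16, 3x3 kernels
cols, H_out, W_out = im2col(x, 3, 3)
# Flatten each 3D filter into a row (im2row), then one matrix product:
out = (w.reshape(16, -1) @ cols).reshape(16, H_out, W_out)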


Fig. 2.7 Left: FC layer operating on 9 input activations and producing a 5-neuron output vector. One weight connection is applied between each input/output pair (these weights are omitted to avoid clutter in the graph). Right: implementation of this layer as the matrix product of an array of weights (green) and the input vector (blue). For instance, the dot product of the weights highlighted in the second row and the column of input activations produces the second output of this FC layer

2.2.2 Additional Software Tools

There exist open-source tools that facilitate the process of designing, training, and integrating CNNs in embedded systems.

• Network visualization tools, such as Netscope [69] and Netron [70], help to graphically understand the architecture of the networks.
• Image annotation and labeling software, such as LabelImg [71] and LabelMe [72], among others [10]. These tools allow, for instance, drawing bounding boxes to create object-detection training datasets.
• Framework format conversion tools. For example, ONNX [73] enables library format conversion—involving some of the libraries described in Sect. 2.2—through an intermediate representation. This is convenient when different frameworks are used during the training and deployment stages (see the sketch after this list).
• The operating system (OS) installed on an embedded device can significantly simplify the management of platform resources. Linux is popular for embedded platforms owing to its open-source code, which allows creating variants tailored for specific hardware. Mobile and wearable devices commonly run Android or iOS. Microcontrollers can also employ operating systems specifically designed for their architectures.
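As a hedged illustration of format conversion, the sketch below exports a pre-trained PyTorch model to an ONNX file; torchvision’s mobilenet_v2 is used purely as an example model, and the file and tensor names are placeholders.

Exporting a model to ONNX – Python

import torch
import torchvision

# Load an example pre-trained model and switch to inference mode.
model = torchvision.models.mobilenet_v2(pretrained=True).eval()
# ONNX export traces the graph with a dummy input of the expected NCHW shape.
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "mobilenet_v2.onnx",
                  input_names=["data"], output_names=["prob"])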

2.3 Hardware Devices and Accelerators

The heterogeneity in computing architectures translates into a variety of hardware features. Indeed, each architecture can be particularly exploited for prescribed workloads on which optimal performance is achieved [74].

CPU-based devices

Central processing units (CPUs) constitute a ubiquitous compute architecture with broad library support. Although the unit is designed to process instructions sequentially, more efficient program execution is possible by exploiting instruction-level


parallelism. Multi-thread processors and CPUs supporting single-instruction multiple-data (SIMD) pipelines speed up program execution by performing several instructions per clock cycle. To this end, accurate branch prediction mechanisms are incorporated—once the dependency graph of the program is analyzed, various instructions are fetched and executed in parallel. Finally, an evident advantage of this architecture is the reduced data-transfer latency when compared to external accelerators or GPUs. CPUs are also easy to program, which makes them attractive for non-expert users.

This book will particularize various inference studies on Raspberry Pi (RPi) platforms [75], a series of low-cost general-purpose embedded devices. For instance, the popular RPi model 3B [76] features a quad-core ARM Cortex-A53 1.2 GHz 64-bit CPU with 1 GB of LPDDR2 RAM, enough for loading and running most lightweight CNNs. Note that this is a low-power device, with a consumption below 5 W when the four cores are at full operation. There is a myriad of providers in the market offering general-purpose CPU-based embedded devices, such as the Odroid series [77], LattePanda [78], the Orange Pi series [79], the Beagleboard series [80], etc. In this regard, Cloutier et al. [81] provide an overview of ARM-based embedded systems, including computational performance and power-related metrics.

GPU-based devices

Graphics processing units (GPUs) are specialized processors optimized for operation on data vectors. Data processing on these units relies on single-instruction multiple-thread (SIMT) execution. GPUs are composed of numerous parallel “execution units” (small and efficient cores that can process SIMD instruction streams), which independently process massive data. In general, GPUs offer high processing performance on large workloads—they are not designed for irregular workflows. Notwithstanding, these high computing capabilities usually come at the cost of higher power consumption.

A popular embedded GPU-based platform is the Nvidia Jetson Nano [82], which comprises a 128-core Maxwell GPU and 4 GB of 64-bit LPDDR4 memory (in addition to a quad-core ARM A57 1.43 GHz CPU). More powerful Nvidia devices include the Nvidia Jetson Xavier NX [83]—a high-power Nvidia CUDA GPU edge platform—and the Nvidia Jetson TX2 series [84]—which offers higher computational capacity (256-core Nvidia Pascal GPU) at the cost of a higher power demand (from 7.5 to 20 W). These platforms rely on the TensorRT library to perform CNN inference.

MCU devices

Microcontroller units (MCUs) constitute a valid option for building IoT systems. They enable applications with very low energy budgets but at the cost of extremely scarce computational and memory resources. Still, these simple computing architectures can be effectively exploited by ML libraries, such as CMSIS-NN [85], which leverages hardware-optimized kernels on an MCU. Lightweight CNNs, such as LeNet, can run on these processors [86, 87].

Examples of edge devices incorporating the ubiquitous ARM Cortex-M architecture include: Arduino Nano 33 BLE [88, 89] (featuring a 32-bit ARM Cortex-M4


CPU running at 64 MHz), STM32 microcontrollers [90] (also providing software libraries for embedded DNNs [91]), or SparkFun Edge [92] and AdaFruit Edge Badge [93] (both including a single-core ARM Cortex-M4F processor compatible with TensorFlow Lite).

Other Architectures and Accelerators

DL pipelines can leverage other existing computing architectures. For instance, field-programmable gate arrays (FPGAs) are specific hardware components composed of processing units that can be reprogrammed via software. This technology offers high flexibility, allowing custom instructions and data types.

DL accelerators arise as a hardware solution applicable to multiple IoT scenarios. These specific circuits provide high performance on DL workloads, although they may not support all operations—programmability is limited. For instance, we briefly describe two popular DL accelerators—compatible with CNN visual inference—which can be integrated into edge systems:

• Google Coral Accelerator [94]. This device features a Google edge TPU coprocessor (tensor processing unit, i.e., an array of processors), designed to provide high-performance inference on mobile CNN models. The device can be attached to any system running Debian Linux OS (e.g., RPi); it is compatible with TensorFlow Lite software (also from Google).
• Intel Neural Compute Stick (NCS). This USB accelerator includes a dedicated Myriad X vision processing unit. The Intel NCS2 release [95] outperforms its predecessor, the Intel Movidius NCS [96]. This company also provides the OpenVINO Toolkit, a library compatible with this device that supports a variety of DL operations.

Moreover, some platforms include both general-purpose CPUs and AI accelerators:

• Google Coral Dev Board [97]. In addition to its quad-core Cortex-A53 CPU and its integrated GC7000 Lite Graphics GPU, this device features a tensor processing unit (TPU), which is an application-specific integrated circuit (ASIC) for accelerating DNNs. It is compatible with Google’s TensorFlow framework as well.
• Tinker Edge R board [98]. Its ARM big.LITTLE technology comprises a hexa-core CPU (dual-core ARM Cortex-A72 1.8 GHz plus quad-core ARM Cortex-A53 1.4 GHz). It also integrates a neural processing unit, the Rockchip NPU, an ML accelerator that supports various software frameworks.
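Several of the devices above rely on TensorFlow Lite for on-device inference. As a hedged illustration of the typical invocation flow, the snippet below runs a generic converted model on the CPU; the model file name and the dummy input are placeholders.

TensorFlow Lite Inference – Python

import numpy as np
import tensorflow as tf

# Load a converted .tflite model (placeholder file name).
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Feed a dummy image with the expected shape and dtype, then run inference.
img = np.zeros(inp['shape'], dtype=inp['dtype'])
interpreter.set_tensor(inp['index'], img)
interpreter.invoke()
probs = interpreter.get_tensor(out['index'])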

2.4 Summary and Advanced Topics

Many commercial applications require complex DL-based solutions. Deeper networks usually offer higher accuracy, but they challenge—or even prevent—embedded realization. Therefore, understanding the trade-offs between different CNN architectures and implementations is critical. This book addresses the challenges associated


with the heterogeneity of existing CNN architectures, software, and hardware components.

Research on component benchmarking

The extensive ecosystem of architectures, datasets, DL frameworks, software libraries and tools, implementation strategies, and hardware alternatives calls for benchmarking and comparison. For instance, concerning embedded platforms running open-source or vendor-provided DL software, experimental evaluations have been conducted. Thus, we can find benchmarking and characterizations for specific DL accelerators [99], smartphones [100–102], GPU-based devices [103], and other hardware devices [104–111]. With respect to CNNs, both image classification [12, 112] and object detection [113] networks have been analyzed in terms of accuracy, runtime, computational complexity, etc.

Research on Optimization

The resource and memory constraints of edge devices set the need for optimizing and compressing CNN architectures. Optimization strategies can be globally categorized into four approaches:

• Network pruning. This technique identifies the parameters that contribute least to the output and removes them, in principle with only a slight accuracy loss [114–120]. Therefore, the model size and memory footprint are reduced. A common and simple way to apply pruning is by removing weights below a threshold. A dense network may then become a sparse one. However, irregular sparsity of kernels is difficult to accelerate on most hardware.
• Parameter quantization. The objective is to reduce the memory requirements by constraining the number of bits used to represent each weight, i.e., by reducing the range of possible unique values (2^n for n bits). Several data precisions (float32, int8, 4 bits, 2 bits, or binary) can be used to reduce the model size [121–126]. Note that a smaller data size implies greater information loss. In these quantization approaches, two key choices are (i) the number of bits and (ii) how to allocate bits to cover the quantization range. For instance, the popular uniform-scale quantization technique—implemented in frameworks such as TensorFlow and PyTorch—first scales/shifts the values to fit into the [−2^(n−1), 2^(n−1)] range and then rounds each value to its nearest quantization step. Owing to the limited number of quantization steps (2^n), an inherent quantization error emerges in the process (a numerical sketch follows at the end of this section).
• Low-rank approximation or tensor decomposition. This approach divides large weight arrays into several smaller matrices, aiming at reducing the information loss as much as possible. Different methods have been proposed, such as Tucker and CP decomposition of 4D kernel tensors [127, 128], structured transformation into circulant or Toeplitz-like matrices [129], or singular value decomposition [130].
• Network architecture search (NAS) [131]. Instead of manual design of networks, current research is moving toward neural-architecture-search algorithms [132–138]. The NAS approach treats model design or optimization as a search task. The objective is to automatically find optimal architectures according to specific


performance criteria by systematically exploring the design space while using a performance estimation strategy. The search space, whose variables could be, for instance, the filter size or the number of kernels, might be extremely large. Therefore, this technique is resource-intensive and time-consuming.

Note that the implementations of layers provided by DL frameworks such as those described in Sect. 2.2 do not support the specific optimization methods listed above in most cases. Indeed, leveraging the underlying hardware with an optimized CNN may be challenging. For instance, some bit-widths may not be supported by a particular device.

Basic steps for deployment of an embedded vision application

• Carefully collect a varied and large dataset according to the application scenario. Label and annotate the data.
• Design or select a CNN according to accuracy or application-specific metrics (Sect. 2.1).
• Train—or use transfer learning (Sect. 2.1.4)—on the dataset by leveraging a software framework (Sect. 2.2).
• Select the proper software–hardware combination for deployment (Sects. 2.2–2.3), and convert the model format into one compatible with the selection.
• Apply, if possible, an optimization strategy (Sect. 2.4).
• Test the system in real scenarios and monitor its behavior, possibly requiring iteration on some of the previous steps until reaching the expected performance.
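Referring back to the uniform-scale quantization approach listed above, the following minimal NumPy sketch maps a float32 weight tensor to n-bit signed integers; it is an illustrative simplification, not the exact scheme implemented by TensorFlow or PyTorch.

Uniform weight quantization – Python

import numpy as np

def uniform_quantize(w, n_bits=8):
    """Symmetric uniform quantization of a float tensor to n-bit integers."""
    qmax = 2 ** (n_bits - 1) - 1              # e.g., 127 for int8
    scale = np.abs(w).max() / qmax            # map the value range onto steps
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale                           # dequantize with q * scale

w = np.random.randn(64, 3, 3, 3).astype(np.float32)
q, scale = uniform_quantize(w, n_bits=8)
err = np.abs(w - q * scale).max()             # inherent quantization error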

References

1. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet Classification with Deep Convolutional Neural Networks. In: Advances in Neural Information Processing Systems, NIPS, pp. 1097–1105 (2012)
2. Mishkin, D., Sergievskiy, N., Matas, J.: Systematic evaluation of convolution neural network advances on the ImageNet. Computer Vision and Image Understanding 161, 11–19 (2017). https://doi.org/10.1016/j.cviu.2017.05.007
3. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 37, pp. 448–456 (2015)
4. Szegedy, C., Ioffe, S., Vanhoucke, V.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv (1602.07261) (2016)
5. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. arXiv (1409.4842) (2014)
6. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826 (2016). https://doi.org/10.1109/CVPR.2016.308


7. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
8. Everingham, M., Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88, 303–338 (2009). https://doi.org/10.1007/s11263-009-0275-4
9. Everingham, M., Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88, 303–338 (2009)
10. Padilla, R., Passos, W.L., Dias, T.L.B., Netto, S.L., da Silva, E.A.B.: A comparative analysis of object detection metrics with a companion open-source toolkit. Electronics 10(3) (2021). https://doi.org/10.3390/electronics10030279. https://www.mdpi.com/2079-9292/10/3/279
11. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common Objects in Context. In: D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (eds.) Computer Vision – ECCV 2014, pp. 740–755. Springer International Publishing (2014)
12. Canziani, A., Paszke, A., Culurciello, E.: An Analysis of Deep Neural Network Models for Practical Applications. arXiv (1605.07678) (2017)
13. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998). https://doi.org/10.1109/5.726791
14. LeCun, Y., Cortes, C., Burges, C.J.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/
15. Lin, M., Chen, Q., Yan, S.: Network in network. arXiv (1312.4400) (2013)
16. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv (1409.1556) (2014)
17. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv (1512.03385) (2015)
18. Iandola, F., Han, S., Moskewicz, M., Ashraf, K., Dally, W., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv (1602.07360) (2016)

pi@raspberrypi:˜ $ echo performance |
> sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
performance
pi@raspberrypi:˜ $ vcgencmd measure_clock arm
frequency(45)=1200000000

¹ Since the default pi user on Raspberry Pi OS is not the owner of system configuration files, sudo is required for the raspi-config menu.


Note that although the RPi3B comprises four cores (0–3), they belong to the same quad-core CPU, so setting a CPU governor for one of them will also change the other three governors. You can verify this fact as shown below—thus, you will be changing the cpu0–3 scaling frequency mode by just modifying the cpu0 governor file.

Quad-core frequency scaling governor

pi@raspberrypi:˜ $ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
ondemand
ondemand
ondemand
ondemand
pi@raspberrypi:˜ $ echo performance |
> sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
performance
pi@raspberrypi:˜ $ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
performance
performance
performance
performance

This scaling-governor management scheme is common to other embedded systems. For instance, the Odroid-XU4 incorporates two clusters of quad-core processors. In this case, a CPU governor is set simultaneously either for the four “LITTLE” cores (0–3) or for the “big” cores (4–7).

• Temperature monitoring. A temperature surge is likely to occur when running continuous network inference on an embedded device. When the temperature reaches a limit, thermal throttling reduces the frequency of the RPi’s ARM cores. In this case, Raspbian displays a warning icon on the screen showing a red thermometer [42]. Temperature measurement is also facilitated by the vcgencmd RPi tool:

pi@raspberrypi:˜ $ vcgencmd measure_temp

• Other services. Unplugging the keyboard, mouse, or screen relieves the OS of extra workload and interruptions, and also reduces power consumption. Unused peripherals can also be deactivated with

sudo raspi-config → 5 Interfacing Options

Furthermore, you can check all running services using the Linux commands sudo service --status-all or systemctl, and deactivate needless ones with sudo service <name> stop.

Further details about the benchmarked DL software under study—including software code examples—are covered in the companion Appendix.

38

3 Optimal Selection of Software and Models …

Fig. 3.1 Inference performance metrics for each combination of DNN model and DL framework. Numerical results are also reported in Table 3.3

Table 3.3 Performance metrics for each network–framework combination running on Raspberry Pi 3B. Each entry is a triplet of measured metrics: top-5 accuracy (%), throughput (fps), and power consumption (W)—also shown in Fig. 3.1. In the printed table, green/red values highlight the best/worst performance for each parameter.

Network                  Caffe              TensorFlow         OpenCV             Caffe2
Network in Network       79.23, 2.08, 3.75  79.23, 2.58, 3.49  79.23, 1.42, 3.08  79.23, 2.55, 3.44
GoogLeNet Inception-v1   89.02, 0.80, 3.81  89.53, 1.74, 3.58  89.02, 1.28, 3.63  89.02, 0.92, 3.00
ResNet-50                91.04, 0.41, 3.72  91.03, 0.71, 3.41  91.04, 0.50, 3.44  91.04, 0.59, 3.22
SqueezeNet v1.1          80.81, 3.11, 3.59  80.81, 4.22, 2.95  80.81, 4.79, 3.21  79.92, 3.74, 3.11
MobileNet v1 1.0-224     89.00, 1.14, 3.51  89.67, 2.41, 3.26  89.00, 2.56, 3.23  89.00, 1.80, 2.88
SimpleNet v1             83.25, 1.17, 3.74  83.25, 1.63, 3.56  83.25, 0.60, 3.44  83.25, 1.44, 3.52



3.2.4 Benchmarking Results

Each benchmarked software–network combination was evaluated while performing inference over 100 images of ImageNet. Instantaneous inference time and power consumption were measured—see Fig. 3.1a, which shows the average rate of frames per second over the 100 images, and Fig. 3.1b, which depicts the ranges of power consumption during inference for each combination.

Concerning network accuracy, the corresponding research papers and repositories report the values achieved by the models. These values can be verified by running inference over the 50k validation images of ImageNet and comparing network outputs with the corresponding ground truth. Figure 3.1c shows the measured accuracy using OpenCV and single-precision floating-point format for image pre-processing. This pre-processing includes (1) image resizing to 256 × 256 plus cropping out a central patch according to the input size—see Table 3.1 (224 in most cases), and (2) mean and standard deviation normalization. Slight differences in the applied pre-processing would explain the small deviations from the reported values. Note that there are other usual ways to pre-process images: center crop followed by resizing; resizing the shortest side while keeping the aspect ratio plus central crop, etc.

All in all, network accuracy, as well as the average values of throughput and power consumption from Fig. 3.1, is summarized in Table 3.3. Note that the most accurate model, ResNet-50, is also the most complex one, requiring the greatest number of operations per inference, thereby leading to the lowest throughput. On the other hand, SqueezeNet, the lightest network in computational terms, achieves the highest throughput at the cost of lower—but not the lowest—accuracy. Concerning power consumption, MobileNet is one of the least energy-demanding models, reaching the minimum average consumption when running on Caffe2.



! Thermal Throttling

Concerning embedded performance, the high computational load of neural networks can cause a significant device temperature rise when executing continuous inference. The CPU thermal throttling process automatically reduces the clock speed when the CPU reaches a limit temperature. To avoid this effect on the results of the proposed benchmark, only samples of inference at maximum-performance operation were averaged, that is, with the CPU working above a frequency threshold—see details in the companion Appendix. For the RPi3B, a value of 1.1 GHz was selected—keep in mind that the maximum of the system is 1.2 GHz. Therefore, for the selected benchmark, the combinations that underwent thermal issues, leading to fewer than 100 averaged samples, were: GoogLeNet–Caffe (64 samples), ResNet–Caffe/TF/OpenCV (24, 69, and 75, respectively), and SimpleNet–Caffe (55).


3.3 Optimum Selection

Practitioners demand guidelines to select the best components for a particular application scenario. In this case, by “components to select” we mean each combination of software and network running on the RPi3B, whereas the “application scenario” is defined in terms of the relative importance of inference metrics for the targeted application. Performance parameters can be categorized into two groups: minimization and maximization variables. Thus, for instance, power consumption would be a minimization target, whereas network accuracy should be maximized. This section proposes two methodologies to guide the optimal selection of technological components for visual inference in different application scenarios.

3.3.1 Pairwise Metric Comparison

Selecting solutions according to one performance parameter is trivial. It is an optimization problem with a single objective and simply implies selecting the best combination for the given metric in Fig. 3.1a–c. Technically, this is a single-objective global minimum/maximum optimization problem. However, for multi-objective optimization tasks, we can use the concept of “Pareto optimization.” A solution is considered Pareto-optimal if there exists no other solution that would improve some objective without worsening at least one other objective. These Pareto-optimal solutions within the decision space constitute the “Pareto front,” that is, solutions whose corresponding variables cannot all be simultaneously improved [43].

Results and Analysis

In our case study, only a subset of combinations are Pareto-optimal with respect to the given metrics. To facilitate pairwise performance comparisons, graphical representations are highly beneficial. Figures 3.2, 3.3, and 3.4 show possible solutions—network–framework selections—as points in the space and compare performance according to two inference variables at a time. All in all, Table 3.4 summarizes the best selections for system optimization according to (a) one metric and (b) each pairwise selection of metrics. According to these results, when one or two metrics are relevant for the application, only a reduced subset of 9 out of the 24 analyzed network–framework combinations should actually be considered for real deployment. In fact, these technological solutions include only 3 network architectures, completely discarding Network in Network, GoogLeNet, and SimpleNet, which are far from the Pareto fronts in Figs. 3.2, 3.3, and 3.4 (a sketch of the Pareto-front computation follows).
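A Pareto front over measurements such as those of Table 3.3 can be computed in a few lines. The sketch below assumes each solution is a triplet (accuracy, throughput, power), with the first two maximized and the last minimized; the sample entries are taken from Table 3.3.

Pareto-optimal selection – Python

def dominates(a, b):
    """True if a is at least as good as b in every metric and strictly
    better in at least one (maximize acc/fps, minimize power)."""
    acc_a, fps_a, p_a = a
    acc_b, fps_b, p_b = b
    ge = acc_a >= acc_b and fps_a >= fps_b and p_a <= p_b
    gt = acc_a > acc_b or fps_a > fps_b or p_a < p_b
    return ge and gt

def pareto_front(solutions):
    """Keep the combinations that no other solution dominates."""
    return {name: s for name, s in solutions.items()
            if not any(dominates(o, s) for o in solutions.values() if o != s)}

# Sample entries from Table 3.3 (top-5 %, fps, W):
solutions = {'SqueezeNet-OpenCV': (80.81, 4.79, 3.21),
             'ResNet-50-Caffe':   (91.04, 0.41, 3.72),
             'MobileNet-Caffe2':  (89.00, 1.80, 2.88),
             'NiN-OpenCV':        (79.23, 1.42, 3.08)}
print(pareto_front(solutions))  # NiN-OpenCV is dominated by MobileNet-Caffe2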


Table 3.4 Optimum single-objective selection and Pareto-optimal dual-objective selections

Variables                      Optimal combinations                    Reference figure
Throughput                     SqueezeNet–OpenCV                       Fig. 3.1a
Power consumption              MobileNet–Caffe2                        Fig. 3.1b
Top-5 accuracy                 ResNet-50–Caffe/OpenCV/TF               Fig. 3.1c
Throughput–accuracy            SqueezeNet–OpenCV, MobileNet–OpenCV,    Fig. 3.2
                               MobileNet–TensorFlow,
                               ResNet-50–TensorFlow
Throughput–power consumption   SqueezeNet–OpenCV,                      Fig. 3.3
                               SqueezeNet–TensorFlow,
                               MobileNet–Caffe2
Accuracy–power consumption     MobileNet–Caffe2, ResNet-50–Caffe2      Fig. 3.4

3.3.2 Figure of Merit

When several performance variables come into play, the definition of a single-objective function—a figure of merit (FoM)—enables a feasible comparison.

Fig. 3.2 Throughput performance versus network accuracy. Each sample point is the performance of a network–framework combination from Figs. 3.1a and 3.1c. The black curve represents the Pareto frontier, along which one metric cannot be improved without worsening the other.


Fig. 3.3 Throughput performance versus power consumption for each network–framework combination and Pareto frontier of optimal solutions

Fig. 3.4 Network accuracy versus power consumption for each network–framework combination and Pareto frontier of optimal solutions


Fig. 3.5 Figure of merit defined in Eq. (3.1). It is expressed as “number of correctly inferred images per joule”

We can reduce the multi-objective optimization problem to a single-objective one by introducing a combination of the variables.

Definition 3.1 The “number of correctly inferred images per joule” can be obtained from a figure of merit aggregating the three performance parameters under study:

FoM = Accuracy · Throughput / Power    (3.1)

This meaningful metric allows the joint assessment of the computational and energy efficiency of networks and frameworks on the considered hardware platform. The FoM in Eq. (3.1), calculated for the values in Table 3.3—or Fig. 3.1a–c—enables a single and quick comparison between solutions. Figure 3.5 depicts the single-criterion evaluation of networks and frameworks accomplished by defining this FoM. According to the FoM in Eq. (3.1), SqueezeNet running on OpenCV performs the best. The superiority of this solution arises from achieving a superior frame rate in a power-efficient way, while trading off some accuracy in comparison with other networks. The question is: Would this solution still be the best choice for a particular application where accuracy is more important, in relative terms, than throughput and/or power consumption?
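As a quick numerical check, the FoM of Eq. (3.1) can be evaluated directly on entries of Table 3.3, with accuracy expressed as a fraction; the three combinations below are just a sample.

Figure of merit (Eq. 3.1) – Python

# (top-5 accuracy %, throughput fps, power W) from Table 3.3:
combos = {'SqueezeNet-OpenCV': (80.81, 4.79, 3.21),
          'MobileNet-Caffe2':  (89.00, 1.80, 2.88),
          'ResNet-50-Caffe':   (91.04, 0.41, 3.72)}
for name, (acc, fps, power) in combos.items():
    fom = (acc / 100) * fps / power   # correctly inferred images per joule
    print(f'{name}: {fom:.3f} images/J')
# SqueezeNet-OpenCV yields the highest value, matching Fig. 3.5.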


Weighted FoM

In order to answer the question above, the relative importance of each performance metric on high-level specifications will define a possible application scenario.

Definition 3.2 Relative weights of the evaluation parameters—accuracy, throughput, power—can be specified by a vector [αA, αT, αP] satisfying

∑i αi = 1,   αi ∈ [0, 1]    (3.2)

where i corresponds to A, T, and P, which stand for accuracy, throughput, and power consumption. It follows from Eq. (3.2) that, for instance, the triplet [αA, αT, αP] = [0.35, 0.05, 0.60] defines an application scenario where power consumption presents a relative preeminence of 60% on the targeted performance, followed by accuracy with a weight of 35% and throughput with very little importance, only 5%.

Definition 3.3 Each specific weighted combination of the variables (i.e., a vector of αi) must produce a single metric. A new FoM parameterized by the set of αi can be defined as follows:

FoM(αA, αT, αP) = Accuracy^(3αA) · Throughput^(3αT) / Power^(3αP)    (3.3)

Therefore, Eq. (3.1) is just a particular case of the generalized FoM introduced in Eq. (3.3): when the three performance parameters have the same impact on application-level requirements (αi = 1/3 ∀i). Any other distribution of αi will emphasize or downplay a specific performance variable by setting its exponent to a value greater or less than unity. Extreme cases occur when some αi is set to 1, boosting just one performance parameter in the FoM at the cost of completely dismissing the other ones. These singular scenarios have already been identified as single-objective selections in Table 3.4. Hence, the choice of αi values highly influences the final decision. Now, the objective is to find the optimum network–framework choice per application scenario, that is, per possible combination of relative weights for accuracy, throughput, and power consumption.

Results

The parameterized FoM(αA, αT, αP) can be obtained for each technological combination per application scenario by sweeping αi. Note that, for these calculations, we make use again of the baseline experimental measurements presented in Sect. 3.2.4. Graphical representations of FoM values for each network–framework pair as a function of αi are depicted in Figs. 3.6, 3.7, 3.8, 3.9, 3.10, and 3.11.² In fact, for

² Note that the plot axes represent αA and αT since, according to Eq. (3.2), αP = 1 − αA − αT.


Fig. 3.6 Normalized weighted FoM for Network in Network

Fig. 3.7 Normalized weighted FoM for GoogLeNet

Fig. 3.8 Normalized weighted FoM for ResNet-50

each combination of αi, FoM values are normalized with respect to the maximum value, obtained for the optimum network–framework pair. Thus, the best selection exhibits a FoM = 1, whereas the range of FoM values allows easy evaluation of how far each solution is from the optimum one through color degradation from blue (representing the maximum, 1) toward lighter tones (solutions getting farther from this maximum). For a particular application scenario, that is, for a specific point within the exploration domain of αi, a single-variable comparison similar to that of Fig. 3.5 can be extracted. For instance, the scenario defined by the triplet previously mentioned, [αA, αT, αP] = [0.35, 0.05, 0.6], is characterized by the values in Fig. 3.12. In contrast to Fig. 3.5, where all metrics were equally important, MobileNet on Caffe2 is the best choice in this case.
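The scenario above can be reproduced numerically with the weighted FoM of Eq. (3.3). The sketch below evaluates a single weight vector over a few Table 3.3 entries; dividing by the maximum then yields the normalized values plotted in the figures.

Weighted FoM (Eq. 3.3) – Python

combos = {'SqueezeNet-OpenCV': (80.81, 4.79, 3.21),
          'MobileNet-Caffe2':  (89.00, 1.80, 2.88),
          'ResNet-50-Caffe':   (91.04, 0.41, 3.72)}

def weighted_fom(acc, fps, power, aA, aT, aP):
    """FoM of Eq. (3.3); [aA, aT, aP] must sum to one."""
    return (acc / 100) ** (3 * aA) * fps ** (3 * aT) / power ** (3 * aP)

# Power-dominated scenario from the text: 60% power, 35% accuracy, 5% throughput
foms = {n: weighted_fom(*m, aA=0.35, aT=0.05, aP=0.60) for n, m in combos.items()}
best = max(foms, key=foms.get)         # MobileNet-Caffe2 for this sample
normalized = {n: f / foms[best] for n, f in foms.items()}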


Fig. 3.9 Normalized weighted FoM for SqueezeNet

Fig. 3.10 Normalized weighted FoM for MobileNet

Fig. 3.11 Normalized weighted FoM for SimpleNet

The graphical representation that summarizes the methodology for optimal selection based on the proposed weighted FoM in multiple application scenarios is shown in Fig. 3.13.

Specifically, Fig. 3.13 shows the best selection, i.e., the network–framework pair reaching the maximum FoM, for the whole domain of αi . On the opposite side, Fig. 3.14a overviews which combinations perform the worst across the domain of application scenarios. The corresponding lowest values of FoM are shown in Fig. 3.14b. Light tones in this plot reveal that there is a wide range of FoM across technological solutions, highlighting that some network–framework combinations can be really far from optimum performance.


Fig. 3.12 Normalized values of the FoM defined in Eq. (3.3) for a scenario defined by a relative preeminence of power consumption, accuracy, and throughput of 60%, 35%, and 5%, respectively. In this particular case, MobileNet on Caffe2 is the best choice

Fig. 3.13 Optimal network–framework selection according to the weighted FoM defined in Eq. (3.3). Only a subset of possible solutions perform the best in at least one point of the exploration domain

Across the whole domain of scenarios, the percentages of solutions achieving optimum and worst performance in Figs. 3.13 and 3.14a are reported in Table 3.5.


Fig. 3.14 Lowest-FoM network–framework combinations: a worst selection across the domain of application scenarios; b corresponding lowest normalized FoM values


Table 3.5 Left: optimum network–framework selections according to the FoM in Eq. (3.3) when sweeping αi, and corresponding percentage of scenarios (area covered by optimum solutions in Fig. 3.13). Right: worst selections and corresponding areas in Fig. 3.14a

Optimum network–framework       Cases (%)   Last-place network–framework   Cases (%)
ResNet-50–Caffe/OpenCV/Caffe2   0.00ᵃ       SimpleNet–Caffe                0.04
ResNet-50–TensorFlow            0.04        Network in Network–OpenCV      0.43
ResNet-50–Caffe2                0.25        GoogLeNet–Caffe                1.76
MobileNet–OpenCV                1.27        SimpleNet–OpenCV               5.77
MobileNet–TensorFlow            2.58        Network in Network–Caffe       5.95
MobileNet–Caffe2                9.61        ResNet-50–Caffe                86.05
SqueezeNet–TensorFlow           28.17
SqueezeNet–OpenCV               58.09

ᵃ Singular case just for αA → 1. These three combinations achieve the highest accuracy and thus maximize FoM(αA = 1, αT = 0, αP = 0)—see Table 3.3.

Analysis

The corners of the graphical representation in Fig. 3.13 represent single-objective optimization scenarios. Any other vector of αi defining the application is assessed through the FoM in Eq. (3.3). These results reveal that SqueezeNet performs the best for the majority of the explored scenarios—in conjunction with OpenCV for high frame rates and with TensorFlow for energy efficiency. For relaxed frame-rate requirements, MobileNet is the best option, combining high accuracy and reduced power consumption. In practical terms, only 3 combinations—SqueezeNet on OpenCV and TensorFlow; MobileNet on Caffe2—cover more than 95% of cases.

On the other hand, Network in Network, GoogLeNet, and SimpleNet are discarded in all scenarios, no matter the software framework considered. Actually, they are the worst solutions in several cases, as shown in Fig. 3.14a. Network in Network features the lowest accuracy, a moderate frame rate, and also high power consumption, especially in combination with Caffe. SimpleNet also offers deficient performance in the three considered metrics. Despite the notable accuracy of GoogLeNet, it exhibits poor performance in the other aspects, being the most energy-demanding model when running on Caffe. Concerning ResNet-50, this model performs with high accuracy but presents the lowest throughput and a high power demand. This is why it is the worst model in more than 85% of the scenarios—see Fig. 3.14a. In particular, its notably low throughput (0.41 fps), compared with the best one (4.79 fps for SqueezeNet–OpenCV), makes the normalized FoM in Fig. 3.14b degrade for increasing values of αT.


3.4 Summary and Advanced Topics

This chapter exemplifies two graphical methodologies for the optimal selection of software tools and networks for embedded inference. First, measured metrics for several networks and software tools on the popular Raspberry Pi 3B demonstrate the capability of low-cost embedded platforms for implementing CNN image classification. Second, the presented methodologies filter out most of the studied alternatives, thus significantly simplifying informed system decisions. Certainly, only 9 network–software selections perform the best in practical scenarios. In fact, this reduced number of cases has been pointed out by both methodologies, i.e., Pareto-optimal choices and weighted FoM maximization in Sects. 3.3.1 and 3.3.2.

Other methodologies can be applied. For instance, additional parameters, such as temperature and memory, can be taken into account in the optimization decision as problem restrictions. Note also that the presented approaches are strictly quantitative since they are exclusively based on measurable performance. However, practitioners must also deal with qualitative parameters encoding the suitability of a particular framework or hardware platform according to, for instance, a long-term technological strategy in a company. Platform size, weight, or cost are also relevant. These extra restrictions would mainly affect the benchmarking process. In fact, instead of manually evaluating each network–software–platform combination, available public benchmarks can assist this step—see Chap. 2.

Alternative definitions of figures of merit can also be beneficial. For example, most public benchmarks only include inference time and network accuracy as measured metrics. This allows defining a FoM as the “number of correct images per second” [44]:

FoM = Accuracy · Throughput    (3.4)

or the “number of valid FLOPs³ per second”:

FoM = Accuracy · FLOPs · Throughput    (3.5)

³ FLOPs stands for the number of floating-point operations per inference, reflecting the computational complexity of the network.

Acknowledgements This chapter is partially based on a previous authors’ publication [45].

3.5 Appendix

The variety of available software frameworks and APIs establishes different ways to define and evaluate networks, and offers pre-trained weights saved in several file formats. In addition, data handling and computational strategies differ from one software package to another, mainly depending on the libraries on which they rely. Code details to perform network inference with the software frameworks evaluated in this chapter are

presented below. Moreover, further pre-processing and supplementary code are also exemplified.

Image Reading. A widely used open-source computer vision library for image processing is OpenCV. Images are stored in arrays shaped H × W × C, where H, W, and C stand for image height, image width, and number of channels, respectively. Generally, color images contain three channels of information corresponding to red (R), green (G), and blue (B) colors, when using the RGB color space—other spaces are possible, such as HSV, HLS, or CIE L*a*b*. In particular, for the RGB space, OpenCV swaps the red and blue channels, so that the channel order is BGR.

Image Pre-processing. It is usual to train models on pre-processed images. Depending on the training procedure, the input image to a pre-trained network should be pre-processed into the corresponding form. This process differs from one framework to another and from one network architecture to another. In addition to image resizing according to the network input resolution—see Table 3.1—mean subtraction and/or standard deviation normalization is usually conducted. The dataset mean value over the pixels in each channel is subtracted from the image. The mean μ is a well-known value for public datasets, such as ImageNet. A normalization value σ can also be applied to the pixels in order to scale their range. Thus, original pixels x in the range [0, 255] (if using unsigned 8-bit format) are standardized as (x − μ)/σ. Channel reordering is also usual.

Image reading and pre-processing functions – Python

import cv2  # OpenCV Python module. Check version with 'cv2.__version__'
import os
import numpy as np

def resize_crop_center(img, scale, target_size):
    '''
    1. Resize image according to scale
    2. Centrally crop the image according to target_size
    '''
    img_resized = cv2.resize(img.astype(np.float32), (scale, scale))
    H, W = img_resized.shape[:2]
    offset_i = int((H - target_size) / 2)
    offset_j = int((W - target_size) / 2)
    img_crop = img_resized[offset_i:(offset_i + target_size),
                           offset_j:(offset_j + target_size), :]
    return img_crop

def mean_scale(img, mu, std):
    ''' Subtract mean 'mu' (list with 3 elements), and normalize by 'std' '''
    img = img.astype(np.float32)
    img -= mu  # img[:,:,0] -= mu[0]; img[:,:,1] -= mu[1]; img[:,:,2] -= mu[2]
    img /= std
    return img

def HWC_to_NCHW(img):
    img = img.swapaxes(1, 2).swapaxes(0, 1)  # switch from HWC to CHW
    img = img[np.newaxis, :, :, :]           # add batch dimension
    return img

def BGR_to_RGB(img):
    img = img[..., (2, 1, 0)]  # swap channels R-B
    return img

def read_batch(dataset_path, NHC=[32, 256, 3], start_idx=0):
    '''
    Read 'N' files from 'dataset_path' starting from 'start_idx' file.
    Save resized square images (HxW) into an NHWC batch array (assuming H=W).
    '''
    N, H, C = NHC  # NHC is a Python list.
    file_list = [fn for fn in os.listdir(dataset_path)]
    file_list = sorted(file_list)
    batch_img = np.empty([N, H, H, 3], dtype=np.float32)
    for k in range(N):
        im_file = file_list[start_idx + k]
        read_img = cv2.imread(dataset_path + im_file)
        # Resize and centrally crop to H x H (call fixed to match the
        # function defined above):
        batch_img[k, :, :, :] = resize_crop_center(read_img, H, H)
    return batch_img

Caffe processes data using “blobs,” which are N-dimensional arrays that keep model parameters, input images, derivatives, and so on. Batches of N images are stored in arrays of dimension N × C × H × W. Note that this way, pixels from the rightmost dimension (W) are physically located successively, in a row-major layout. Channels are at the outermost position in the image array. Thus, for instance, the pixel at index (n, c, i, j) will be physically accessed through the index [(nC + c)H + i]W + j. In addition, by default, models are configured to take input images in BGR format with pixels in the range [0, 255]. It follows that, if the image was loaded with OpenCV, there is no need for channel reordering or pixel standardization.

Below, there is an example of creating a CNN in Caffe with pre-trained weights and running inference. The network definition is loaded from a file with prototxt extension, while the network weights are stored in a caffemodel file. The image array NCHW_img should be an already pre-processed BGR image—mean-subtracted and scaled.

Caffe Inference – Python

import caffe

# Create network:
net = caffe.Net(prototxt_file, caffemodel_file, caffe.TEST)  # (1)
# Set input:
net.blobs['data'].reshape(N, C, H, W)
net.blobs['data'].data[...] = NCHW_img
# Execute network inference:
output = net.forward()  # (2)

Note that:
(1) When loading the network, Caffe will print model information on the stderr file descriptor.
(2) The output will be a Python dictionary whose key–value pair contains the name of the network output and the output probability vector, respectively.

A basic Caffe code example can be found at [46]. For further information, check [47].


TensorFlow creates deep learning models as dataflow graphs. Nodes are mathematical operations, and edges are “tensors”—multi-dimensional arrays—on which operations are performed. This framework offers several APIs to create deep learning models. Specifically for CNNs, the TF-Slim library [17, 48] was its first API allowing one to define, train, fine-tune, and evaluate CNNs. It simplifies network description through “argument scoping,” which is a high-level definition of a collection of numerous layers, variables, and operations sharing arguments. In addition, this API provides pre-trained “checkpoints” for well-known neural networks.

What is a checkpoint in TensorFlow? Basically, it is a mechanism to save and restore graph variables. To restore the model, you have to create the same network architecture before loading the checkpoint weights. These trained parameters are stored in a set of files composing the TensorFlow checkpoint:

.data-xxxxx-to-xxxxx   One or more files containing the network's weights.
.index                 Index file indicating the weight distribution over the “data” files.
.meta                  Protocol buffer file describing the complete TensorFlow graph structure.
checkpoint             Plain text file which keeps a record of the latest checkpoint files saved, registering the names of these files.

The code below outlines the TF-Slim [17] graph definition to run a CNN—code tested on TF version 1.3.0.⁴ The angle-bracket placeholders must be replaced as explained in the numbered notes.

TensorFlow TF-Slim Python API for inference

import tensorflow as tf
from tensorflow.contrib import slim
from nets import <network>  # (1)

net_scope = slim.arg_scope(<network_arg_scope>)  # (2)
# Create network:
graph = tf.Graph()
with graph.as_default():
    with net_scope:
        input_tensor = tf.placeholder(tf.float32, shape=[N, H, W, C])
        output_tensor, _ = <network>.<network_fn>(input_tensor, ...)  # (3)
        probabilities = tf.nn.softmax(output_tensor)  # (4)
        init_fn = slim.assign_from_checkpoint_fn(ckpt_file,
                                                 slim.get_model_variables())  # (5)
with tf.Session(graph=graph) as sess:
    init_fn(sess)
    # Execute network inference:
    output = sess.run(probabilities, {input_tensor: NHWC_img})  # (6)

For instance, the InceptionV1 network would be defined, restored from pre-trained weights, and evaluated with these functions and statements:
(1) The <network> in the code should be replaced by inception. Check the list of importable networks at [49].

⁴ Module tf.contrib is not available on TF version 2.x. In that case, import the newest version of TF-Slim with import tf_slim as slim and import tf_slim.nets as nets.


(2) The corresponding network scope <network_arg_scope> is inception.inception_v1_arg_scope().
(3) This line defines the network graph. For this example, the function and arguments would be: inception.inception_v1(input_tensor, num_classes=1001, is_training=False). Note that both input_tensor and output_tensor belong to the default graph defining the CNN.
(4) Note that this network produces “logits,” that is, raw output numbers from the network classifier. The softmax function translates logits into probabilities that sum to one. This way, to obtain the network probability vector, we must evaluate the graph tensor probabilities.
(5) Here, the checkpoint file is specified—in this case, 'inception_v1.ckpt'. You should download the checkpoint file before running the code.⁵ Several pre-trained checkpoints are available, as listed at [50].
(6) The TF-Slim API also provides pre-processing functions compatible with the training process of each pre-trained network [51]. For instance, the mean-subtracted and scaled image array NHWC_img for InceptionV1 can be easily obtained as follows:

from preprocessing import inception_preprocessing
HWC_img = inception_preprocessing.preprocess_image(HWC_tensor, H, W,
                                                   is_training=False)
NHWC_img = tf.expand_dims(HWC_img, 0)

⁵ Note that the TF-Slim module of TF 1.x saves checkpoints in a single file with ckpt extension.

A complete extension of this snippet of code is available at [52]. For more information, please check [17, 48].

Another popular high-level API for building deep learning models is TensorFlow–Keras [53]. In addition to checkpoints to save weights, the entire model can be saved into SavedModel or HDF5 formats. Creating a model and loading trained weights with TF–Keras is as easy as follows—code tested on TensorFlow 2.1.0. The angle-bracket placeholders are replaced as explained in the notes.

TensorFlow-Keras Inference – Python

from tensorflow.keras.applications.<module> import <Network>  # (1)

# Create network:
model = <Network>(weights='imagenet')  # (2)
# Execute network inference:
output = model.predict(NHWC_img)  # (3)

(1) TF–Keras offers several model definitions [54]. For instance, you can load the network InceptionV3 (<Network>) from the tf.keras.applications module inception_v3 (<module>).
(2) The first call to this function will download pre-trained weights in HDF5 format.
(3) The output will be a numpy probability vector of size N × n_classes. Note that the tensorflow.keras.applications.inception_v3 module also provides



preprocess_input and decode_predictions functions in order to process input images and show output labels.

In order to run on TensorFlow extra networks not available at [50, 54], the tool Caffe-to-TensorFlow [24] translates models from Caffe format to TensorFlow-compatible files. The output files from this tool (a NumPy .npy file and a Python class .py file) contain both the trained weights and the TF network graph description.

OpenCV-DNN—the deep learning module of OpenCV—allows loading pre-trained models from other frameworks without file format conversion, i.e., directly using the same files. The following code would run visual inference with a network trained on Caffe—code tested on OpenCV 3.3.1.

OpenCV-DNN Inference – Python

import cv2

# Create network:
net = cv2.dnn.readNetFromCaffe(prototxt_file, caffemodel_file)  # (1)
# Set input:
blob = cv2.dnn.blobFromImage(HWC_img, 1/std, (H, W), mu)  # (2)
net.setInput(blob)
# Execute network inference:
output = net.forward()  # (3)

Note that:
(1) In this code, the pre-trained network is loaded from files in Caffe format. There are also other OpenCV functions to create networks from frameworks such as TensorFlow or Torch.
(2) OpenCV offers a function to automatically create a blob of data as expected by the network object, i.e., an NCHW-shaped pre-processed image.
(3) The output will be a numpy array containing the probability vector.

A simple example of CNN inference on OpenCV can be found at [55]. You will find more information about the DNN module of OpenCV at [56].

Caffe2 is now integrated into the PyTorch project, as stated on its Web site [20]. However, Caffe2 installation is still possible, and its network inference would be evaluated as follows.

Caffe2 Inference – Python

from caffe2.python import workspace

# Create network:
net = workspace.Predictor(init_net, predict_net)  # (1)
# Set input:
workspace.FeedBlob("data", NCHW_img)
# Execute network inference:
output = net.run([])  # (2)
# Alternatively, execute network inference on input data:
output = net.run({'data': NCHW_img})



Note that:
(1) init_net and predict_net are data read from two protobuf files—pb extension—containing the network weights and the computation graph definition, respectively.
(2) The output will be a Python list containing the output vectors.
A basic Caffe2 example can be found at [57] and further information at [58].
Image classification—top-N output classes. Assuming you have a txt labels file with n_classes categories, containing one class name per row, the network class prediction and probabilities can be easily obtained with the following Python lines.

Get top-N labels – Python
import numpy as np

labels = np.loadtxt(txt_file, str, delimiter='\n')
# Top-1 prediction:
out_label = labels[output.argmax()] #(1)
out_probability = 100*output[output.argmax()]
print("Top-1 class: {}, probability {}%".format(out_label, out_probability))
# Top-N predictions:
N = 5
for k, order in zip(output.argsort()[::-1][:N], range(N)):
    print("({}) {}, probability {}%".format(order+1, labels[k], 100*output[k]))

(1) The output variable contains the data yielded by network inference in any DL software. It must have been converted to a 1D numpy array.
RPi Performance Measurement. The inference time on each framework can be determined through the time Python library.

Inference time measurement – Python
import time

# [create network]
start_time = time.time()
# [network inference]
end_time = time.time()
inference_time = end_time - start_time

The first inference can be slower than successive ones due to memory allocation, cache misses, etc. To measure inference time, you should ideally run inference over multiple images in a long-running process and then average the measured times. Discarding the first inference time is also reasonable.
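A minimal sketch of this averaging procedure follows, assuming net is the OpenCV-DNN network object created above; the warm-up and run counts are arbitrary choices:

Averaged inference time – Python
import time

N_WARMUP, N_RUNS = 1, 50
times = []
for i in range(N_WARMUP + N_RUNS):
    t0 = time.time()
    output = net.forward()  # network inference
    t1 = time.time()
    if i >= N_WARMUP:       # discard warm-up inferences
        times.append(t1 - t0)

avg = sum(times) / len(times)
print("Average: {:.1f} ms ({:.2f} fps)".format(1e3*avg, 1.0/avg))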



However, continuous embedded inference can lead to thermal issues: when the RPi CPU heats up and reaches a limit temperature, the CPU clock immediately goes down, thus ensuring safe operation. Assuming the CPU frequency governor is set to performance, the following code uses the vcgencmd tool to issue a warning whenever the RPi CPU clock speed drops below a threshold thr.

RPi CPU thermal throttling warning – Python
import subprocess

def check_freq(thr):
    # Check temperature and frequency:
    cmd_out = subprocess.check_output(['vcgencmd','measure_clock','arm']).decode('utf-8')
    freq = int(cmd_out.split('=')[1])
    cmd_out = subprocess.check_output(['vcgencmd','measure_temp']).decode('utf-8')
    temp = float(cmd_out.split('=')[1].split('\'')[0])
    # Return status / warning:
    if freq < thr:
        print("\033[93m Warning: Thermal Throttling ({}*C). Freq {:.0f} Hz\033[0m".format(temp, freq))
        warn = 1
    else:
        warn = 0
    return warn, temp, freq

References
1. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A., Fei-Fei, L.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
2. Velasco-Montero, D., et al.: Performance analysis of real-time DNN inference on Raspberry Pi. In: SPIE Commercial + Scientific Sensing and Imaging (2018)
3. Raspberry Pi 3 Model B. https://www.raspberrypi.org/products/raspberry-pi-3-model-b/
4. Raspberry Pi. Operating system images. https://www.raspberrypi.org/software/operating-systems/
5. Lin, M., Chen, Q., Yan, S.: Network in network. arXiv (1312.4400) (2013)
6. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. arXiv (1409.4842) (2014)
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv (1512.03385) (2015)
8. Iandola, F., Han, S., Moskewicz, M., Ashraf, K., Dally, W., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv (1602.07360) (2016)

For instance, OpenCV can be built from source with ARM-specific optimizations enabled through cmake flags:
-D ENABLE_NEON=ON -D ENABLE_VFPV3=ON \
> # [extra settings]

Keep in mind that both TensorFlow and Caffe2 graphs are turned into parallel code through their back-end library, i.e., Eigen, which exploits ARM NEON registers and VFP optimizations for code acceleration.
Benchmarked CNNs
Finally, to identify performance trends on each software, three different network architectures were used. This set of networks, namely GoogLeNet, ResNet-50, and SqueezeNet, shows a trade-off between complexity and accuracy, as listed below:

GoogLeNet        ∼68% top-1 acc.   ∼89% top-5 acc.   ∼7.0M weights    ∼1.6G MACs
ResNet-50        ∼73% top-1 acc.   ∼91% top-5 acc.   ∼25.6M weights   ∼3.9G MACs
SqueezeNet-v1.1  ∼58% top-1 acc.   ∼81% top-5 acc.   ∼1.3M weights    ∼396M MACs

4.3 Hardware-Aware Analysis
The embedded hardware in RPi3B includes the so-called performance monitoring unit (PMU) available in its Cortex-A53 processor. This is a non-invasive debug component that provides six event counters related to statistics on both the processor and the memory system. Several tools facilitate the profiling of these performance counter events occurring in the processor. In particular, perf [14] is a high-level Linux API to this end. Keep in mind that PMU hardware events are specific to the particular CPU, as documented by its vendor [9]. You can check the list of events available on your hardware—and accessible through perf—by typing the command perf list.



Counting the events derived from running a program is possible through the perf stat command with the corresponding options:

pi@raspberrypi:~ $ perf stat -e <events> -r <n> <command> [<options>]

where:
• n is the number of test repetitions. Averaging over various command executions reduces the estimation errors due to multiplexing events [14].
• events: comma-separated (with no spaces) list of events to gather, according to perf list and the processor's architecture manual [9, 10].
• command: application to evaluate. In our case, python constitutes the command to assess.
• options: further options can be optionally specified [14]. For instance, results from this tool are in general aggregated metrics gathered during the command execution; however, the extra option -I allows retrieving temporal samples. This will be used in Sect. 4.5.2.
In particular, experimental data in this section correspond to PMU hardware events extracted when running Python scripts for CNN inference for all framework–network combinations under study. For the sake of a fair comparison, per-image hardware statistics are calculated with the following procedure:

count = (c_N − c_0) / N_img    (4.1)

where c_N is the count corresponding to the complete inference script, which runs inference on N_img images randomly selected from ImageNet, and c_0 denotes the selected statistics derived from loading the network and libraries—obtained by running the same script with N_img = 0. Thus, per-image inference processing hardware metrics are obtained from Eq. (4.1). Moreover, to reduce perf estimation errors due to multiplexing events [14], perf stat -r will be used to average values from five runs. Keep in mind that processing data on general-purpose registers is immediate, whereas fetching data from high-level caches supposes an incremental delay. Note also that the main memory is larger but slower.
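A minimal sketch of Eq. (4.1) follows; the totals are hypothetical values taken from two perf stat runs, with N_img = 50 and N_img = 0, respectively:

Per-image event counts – Python
def per_image_count(c_N, c_0, n_img):
    # Eq. (4.1): per-image count, removing network/library loading (c_0)
    return (c_N - c_0) / float(n_img)

# Hypothetical aggregated instruction counts:
print("{:.2e} instructions per image".format(
    per_image_count(c_N=2.91e11, c_0=1.02e9, n_img=50)))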

Table 4.2 summarizes and explains several significant microarchitectural events provided by RPi3B’s PMU. The following PMU hardware events [10] and derived rates are included in the proposed approach:



Table 4.2 Various relevant events generated in the PMU of the ARM Cortex-A53 processor of RPi3B

• 0x03 — L1 data cache refill (L1-dcache-load-misses): attributable memory-read or memory-write operations causing a refill of at least the L1 data cache. Specifically, this includes L1 cache accesses causing a refill that is satisfied by another L1 data or unified cache, an L2 cache, or memory.
• 0x04 — L1 data cache access (L1-dcache-loads): attributable memory-read or memory-write operations causing a cache access to at least the L1 data or unified cache.
• 0x05 — L1 data TLB refill (dTLB-load-misses): attributable memory-read or memory-write operations causing a TLB refill of at least the L1 data or unified TLB. These operations cause a translation table walk or an access to another level of TLB caching.
• 0x08 — Instruction architecturally executed (instructions): instructions executed in a simple sequential execution of the program flow. A counterexample of "architectural" execution is a "speculative" execution performed by the processor that ends up discarded.
• 0x10 — Mispredicted or not predicted branch speculatively executed (branch-load-misses): corrections to the predicted program flow related to instructions that the branch prediction resources are capable of predicting.
• 0x12 — Predictable branch speculatively executed (branch-loads): branches or changes in the program flow that the branch prediction resources are capable of predicting. "Speculatively executed" means the processor did some work associated with one or more instructions, but the instructions were not necessarily architecturally executed. For instance, mispredicted branches are speculatively executed but not architecturally executed.
• 0x13 — Data memory access: memory-read and memory-write operations that the processor made, including accesses to an L1 data or unified cache, an L2 data or unified cache, or neither of these.
• 0x16 — L2 data cache access: attributable memory-read and memory-write operations causing a cache access to at least the L2 data or unified cache. This count includes refills of and write-backs from the L1 data, instruction, or unified caches.
• 0x17 — L2 data cache refill: attributable memory-read and memory-write operations that access at least the L2 data cache and cause a refill of an L1 data or unified cache, or of the L2 data or unified cache. Specifically, this includes L2 cache accesses causing an L1 or L2 refill that is satisfied by another L2 cache, an L3 cache, or memory, and refills of and write-backs from any L1 cache.

Further information in [9, 10].



• # instructions architecturally executed—0x08. This metric includes each instruction executed in the program flow. Thus, the software framework coding definitely determines this count.
• # data memory accesses—0x13. This is a measure of the memory traffic, which mainly depends on the platform cache hierarchy—and evidently on the application.
• # L1 and L2 cache loads—0x04 and 0x16. Memory hierarchy accesses. The overall cache performance depends on parameters such as data locality, size of the cache, or data block size.
• L1 and L2 cache miss ratios (%)—0x03/0x04 and 0x17/0x16. These ratios are obtained as the fraction between cache misses and accesses. Remember that a cache access can result in a hit (if data are available) or a miss (otherwise). Cache misses may stall the CPU while it waits for data. Thus, high cache miss ratios increase the data fetching time in the overall program execution.
• L2/L1 cache loads ratio—0x16/0x04. This parameter provides insight into cache hierarchy performance and exploitation. Larger caches require more CPU cycles to access. Reusing data from L1 will reduce data access latency.
• Data TLB misses per memory access—0x05/0x13. Each memory access requires an address translation from the virtual space to the physical one. Keep in mind that the TLB in the MMU is the cache of mappings between virtual and physical addresses. A virtual address translation can result in a TLB hit (in which case the physical address is immediately retrieved from the TLB) or a TLB miss (requiring a look-up in the page table). The more TLB misses, the higher the time span. Conversely, access to recently translated addresses will reduce the execution time by exploiting TLB hits.
• # branches architecturally executed—0x12. This parameter characterizes the changes in program flow that the branch prediction resources in the processor are capable of predicting.
• Branch mispredict ratio (%)—0x10/0x12. Percentage of mispredictions within predictable branches. The branch predicted to be the most likely is speculatively executed. If it ends up as a misprediction, the speculatively executed instructions are discarded, incurring a delay.
• # instructions per second—0x08/runtime (the execution time is also displayed in the perf report). This is a measure of the processor performance. The number of instructions depends on software coding. Thus, exploitation of parallelism is reflected in a higher rate of instructions per second. Note that each system architecture presents a peak performance that cannot be surpassed—usually expressed in GFLOPs/s (giga-FLOPs per second). Memory-bound applications employ numerous CPU cycles in memory accesses, so their instructions per second come down, far from the peak performance.
• # data memory accesses per second—0x13/runtime. This rate is always below the peak bandwidth achieved by the RAM of the system—the rate at which data are transferred between the processor and memory. Compute-bound applications show high workloads such that the rate of memory accesses per second is diminished.
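These derived rates follow directly from the raw counts; a minimal sketch with hypothetical per-image counts keyed by event code:

Derived hardware rates – Python
# Hypothetical per-image PMU counts (Eq. (4.1)) and per-image runtime:
ev = {'0x03': 1.1e7, '0x04': 2.4e8, '0x05': 9.5e5, '0x08': 5.8e9,
      '0x10': 2.1e7, '0x12': 4.9e8, '0x13': 3.1e8, '0x16': 5.2e7, '0x17': 6.3e6}
runtime = 2.0  # seconds per inference

l1_miss_pct   = 100.0 * ev['0x03'] / ev['0x04']  # L1 cache miss ratio (%)
l2_miss_pct   = 100.0 * ev['0x17'] / ev['0x16']  # L2 cache miss ratio (%)
l2_l1_loads   = ev['0x16'] / ev['0x04']          # L2 / L1 cache loads ratio
dtlb_rate     = ev['0x05'] / ev['0x13']          # TLB misses per memory access
mispredict    = 100.0 * ev['0x10'] / ev['0x12']  # branch mispredict ratio (%)
instr_per_s   = ev['0x08'] / runtime             # instructions per second
mem_acc_per_s = ev['0x13'] / runtime             # data memory accesses per second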



Fig. 4.1 Effects of branch prediction on the system behavior. This plot shows the profile of instructions per second while performing various network inferences separated by 0.5 s—GoogLeNet–Caffe. Instructions are still executed after network inference ends, thus reducing the idle period—0.5 s between red lines; note the extra interval of 0.26 s related to instruction workload

Additional remarks on PMU hardware events on RPi
• Since RPi3B features a two-level memory system, the two perf events L2-dcache-loads and LLC-loads (last-level cache) are coincident—as L2-dcache-load-misses and LLC-load-misses are. Indeed, the 0x2A and 0x2B events (L3 data cache refill and access, respectively) defined for the ARMv8 processor yield zero counts when analyzed with perf on RPi3B.
• The external memory bus rate can be retrieved with the 0x1D (bus-cycles) and 0x11 (CPU-cycles) events. Then, for any program execution on RPi3B, it can be checked that the ratio 0x1D/0x11 is 0.5, confirming the fact that the interface between the processor and its closely coupled caches works at half the processor's clock frequency.
• Branch prediction on the system can be observed when analyzing the event counts of a program execution. This is exemplified by the temporal event profile in Fig. 4.1, obtained with perf -I <interval>. This plot corresponds to instantaneous instruction activity for various executions of GoogLeNet–Caffe, with an idle interval of 500 milliseconds between inferences set via software and marked with red lines. Note the extra account of events registered once the network workload has finished (i.e., the slot between 1.7 and 2.0 s), suggesting that the processor carries out branch prediction and data pre-fetching activity. The same effect was observed when measuring power consumption.
Results and Analysis
The statistics and metrics listed above were registered and aggregated while running visual inference on RPi3B—specifically using perf version 4.9.82. Note that perf can be installed with apt-get install linux-tools; on RPi3B, this installs version 4.9.82, accessible via the perf_4.9 command. Therefore, in this chapter we assume that alias perf="perf_4.9" was set.


Fig. 4.2 Per-image extracted PMU hardware statistics registered during inference




Figure 4.2 depicts related experimental data, corresponding to per-image counts (Eq. (4.1), with N_img = 50), averaged over five script executions. We next examine these results comprehensively. To start with, Caffe's coding strategies and underlying libraries—threaded OpenBLAS for GEMM—lead to the highest number of processing instructions (Fig. 4.2a) and data memory accesses (Fig. 4.2c) for the three networks. To deal with these work and memory traffic requirements, the processor renders the highest rates of instructions and memory fetches per second (Figs. 4.2b and 4.2d). In addition, branch prediction resources on the ARM Cortex-A53 processor are intensively exploited when running Caffe code (Fig. 4.2k). Under branch prediction, instructions of a guessed branch of code are speculatively executed before checking whether that was the branch to execute. Succeeding in branch prediction speeds up computation. Nonetheless, the poor prediction performance shown by the ARM resources when running Caffe code (Fig. 4.2l) makes the CPU execute extra instructions that end up discarded.
Concerning cache exploitation, keeping reused data in higher levels of the memory hierarchy reduces the data access delay. Caffe significantly makes the most of the L1 and L2 caches, loading high amounts of data (Figs. 4.2e and 4.2g) with low miss rates (Figs. 4.2f, 4.2h and 4.2i). The exploitation of the TLB by Caffe is also notable (Fig. 4.2j). In fact, the OpenBLAS library underlying this framework is highly oriented to the reduction of TLB misses by keeping part of one of the operands in the L1 cache [4].
Concerning the other three software tools, there are also differences to highlight. The distinctive instruction reduction with respect to Caffe depicted in Fig. 4.2a suggests an efficient exploitation of the processor's SIMD instruction set—in fact, these three frameworks leverage ARM hardware optimizations at compilation time. TensorFlow's coding performance is remarkable, as revealed by its lowest number of instructions and data operations per inference (Figs. 4.2a and 4.2c), but high operation rates (Figs. 4.2b and 4.2d). Concerning OpenCV, it is by far the best at exploiting the L2 cache, although at a high L1 refill rate (Figs. 4.2e–h). The performance of TensorFlow and Caffe2 is similar concerning cache exploitation. Finally, Caffe2's performance is remarkable in terms of effective branch prediction, thus accelerating code pipelines.

4.4 High-Level Performance Analysis
As Chap. 3 evidences, the high computational demand of CNNs increases the SoC temperature, which in turn has an impact on the instantaneous performance. To take this aspect into account, in this study we registered the CPU status (temperature, frequency, and utilization), memory footprint, and inference performance (throughput and power consumption) over a long-term period of 6 min of continuous inference. Given that all the frameworks are accessed through a Python interface, the same inference Python script can gather all these metrics after each image classification.



Overall, the following performance figures were considered, calculated with the code below entitled "Performance monitoring – Python":
• Throughput. Frame rate calculated as the inverse of the per-image processing time. In this case, in addition to the network inference time, it includes the time required to read and pre-process the input image.
• CPU utilization. It can be easily measured by using the Python psutil library.
• CPU frequency and temperature. As explained in Chap. 3, the vcgencmd tool provides that information.
• Memory footprint. Several memory statistics can be gathered with the psutil library. For instance, the free memory in the system is provided by psutil.virtual_memory().free—alternatively reported by the free command or the /proc/meminfo file. For comparison between software libraries, the so-called unique set size (USS) parameter is used in this chapter. This parameter represents the effective physical memory allocation unique to the running process; it comprises the memory required for processing one image plus the CNN weights plus unshared libraries, including the framework library itself. Likewise, further statistics can be gathered, such as the RSS, which is the actual use of non-swapped physical memory including shared libraries.
• Power consumption. It can be measured with several metering devices. In particular, in this study an external Keysight N6705C DC Power Analyzer was employed to monitor the instantaneous power demanded by the networks. See power measurement details in the companion Appendix.

Performance monitoring – Python
import time, psutil

t_start = time.time()
while time.time() - t_start < 60*MIN: #(1)
    # Execute network inference:
    t0 = time.time()
    # [image reading + preprocessing + CNN feeding]
    # [network inference] #(2)
    # Measure performance: #(3)
    t1 = time.time()
    fps = 1.0/(t1-t0)
    CPU = psutil.cpu_percent()
    warn, temp, freq = check_freq(1.2e9) #(4)
    # Display results:
    print("CPU {}%, at {}*C. {:.1f} fps".format(CPU, temp, fps))
    # [save results into vectors or a file]
    print("\t- USS memory {} bytes".format( check_mem() )) #(5)



In this code:
(1) The loop will run for MIN minutes.
(2) The Python API of a framework for network inference is employed, for instance, Caffe, OpenCV, or TF.
(3) Note that these system monitoring measurements after each inference take a short lapse of time, which in turn slightly reduces the number of processed images within the interval MIN.
(4) Temperature and frequency can be monitored using the vcgencmd tool in check_freq(), as explained in the Appendix of Chap. 3. Under normal conditions and the performance governor, the ARM Cortex-A53 CPU runs at 1.2 GHz. However, long-term inference causes a high SoC temperature and hence CPU down-clocking. In that case, a warning message will be shown on the terminal.
(5) Memory allocation keeps stable during inference. We can use the psutil Python library to monitor this parameter:

import os, psutil

def check_mem():
    pid = os.getpid()                      # PID
    py = psutil.Process(pid)               # process
    mem_proc = py.memory_full_info().uss
    return mem_proc                        # USS (Bytes)

Results
The temporal behavior of the CPU when running continuous inference, and its impact on frame rate, is exemplified in Fig. 4.3 for the SqueezeNet network. The average values of the performance indicators for all the network–framework pairs under study are reported in Fig. 4.4. Some aspects must be emphasized:

Fig. 4.3 Long-term performance metrics when running SqueezeNet. Similar behavior is demonstrated by other network architectures



Fig. 4.4 Average values of a CPU utilization, b frame rate, c power consumption, and d allocated memory, during a period of continuous inference with performance degradation due to temperature



Fig. 4.5 Power profiling of GoogLeNet running on Caffe (black) and average value (blue). Measurement data come from an external power analyzer—more information in Appendix

• Thermal throttling. Figure 4.3 illustrates the fact that when the temperature of the quad-core ARM processor exceeds a critical value of 80 °C, the clock frequency is progressively reduced, which in turn decreases the throughput.
• The particular acceleration libraries exploited by the tools make them allocate different amounts of memory (Fig. 4.4d)—reported in MiB (mebibytes, i.e., 2^20 bytes)—and also explain the differences in CPU utilization and throughput (Fig. 4.4a, b).
• Both the framework coding and the network layers composing the CNN architecture affect the instantaneous power consumption (averaged in Fig. 4.4c). Detailed power profiling of GoogLeNet–Caffe inference is depicted in Fig. 4.5. Note that some layers require higher power than others, emphasizing that the network architecture has an influence on the overall power demand. This topic will be covered in Chap. 5.
Analysis
Performance metrics greatly differ among software tools. In this embedded system case study, Caffe shows the highest CPU utilization, which makes the system demand high power and quickly increases its temperature. However, in spite of apparently making the most of the CPU, Caffe's throughput is the lowest. Actually, the same behavior has been observed for other CNNs running on this framework, such as those characterized in Chap. 3.
The hardware exploitation analysis carried out earlier can explain the underlying reasons for these outcomes. Firstly, the highest processing workload and memory traffic, achieved by the Caffe framework (Fig. 4.2a, c), justify its measured high CPU utilization and power consumption. However, this DL library also showed good cache and TLB exploitation, which contrasts with its poor frame rate. These outcomes suggest a poor mapping between Caffe's source code and the RPi's ARM instruction set.




Regarding TensorFlow, its high throughput is explained by its lowest computational and memory workloads, which are indeed executed at high operation rates (Fig. 4.2a–d). The great cache exploitation efficiency of OpenCV renders notable frame rates on this framework—note that external memory accesses take several CPU cycles, which OpenCV saves through its efficient use of the last-level cache (Fig. 4.2h). Finally, the lowest power consumption of Caffe2 could result from its reduced memory access and instruction execution rates (Fig. 4.2b, d).

4.5 Qualitative Performance Explanation
Numerical results presented in Sects. 4.3 and 4.4 are consistent, as previously discussed. In fact, identifying correlations among these data establishes a bottom-up approach to trace visual inference performance.

4.5.1 From Aggregated Statistics to Inference Performance
To start with, the results above outline how each network architecture leverages the available hardware resources for performing inference. For each DL network under consideration, Fig. 4.6a shows that the per-image processor workload (shown in Fig. 4.2a) is clearly consistent with the amount of MAC operations required for inference (Sect. 4.2)—with each framework exhibiting a distinctive relationship due to its coding strategy, as previously discussed. Similarly, in Fig. 4.6b, data memory-write and memory-read operations are correlated with the number of weights learnt by the network. Moreover, the specific acceleration libraries underlying each framework give rise to a range of memory requirements on the system. Indeed, the amount of memory resources per framework roughly follows a linear pattern according to the network architecture, as depicted in Fig. 4.7.
Concerning visual inference applications on resource-constrained embedded platforms, optimizing hardware resources—instructions and memory traffic—is mandatory in order to boost performance. The greater the number of CPU cycles employed for memory accesses or execution of CPU operations, the lower the overall frame rate. This is observed for the RPi3B, where a nearly linear pattern between frame rate and data memory accesses can be identified (Fig. 4.8), making it possible to obtain good performance estimates from this aggregated hardware metric.

Fig. 4.6 Each software shows a correlation between PMU hardware statistics (y-axis) and parameters defining the network architecture (x-axis), for the three CNNs

Fig. 4.7 Memory allocation versus network weights




Fig. 4.8 Alignment between throughput and registered memory access events for the 12 assessed framework–network pairs. Pentagonal, triangular, and hexagonal points represent SqueezeNet, GoogLeNet, and ResNet-50, respectively

4.5.2 From Instantaneous Statistics to Inference Performance
This subsection moves on from the aggregated statistics analyzed in Sect. 4.3 to study hardware metrics sampled over time. Specifically, a sampling period of 10 milliseconds was set with the -I option of the perf stat command.
The following commands show an example of continuous recording of PMU hardware statistics and data storage into a readable file report.csv of comma-separated values [14]:

perf – example of registering statistics
pi@raspberrypi:~ $ perf stat record -I 500 -e r008,r004,r003,r016,r017 <command>
#           time             counts unit events
     0.500420610            220,063      r008:u
     0.500420610             75,680      r004:u
     0.500420610              2,295      r003:u
     0.500420610             10,995      r016:u
     0.500420610              2,982      r017:u
     1.001385282            285,128      r008:u
# [more results]
pi@raspberrypi:~ $ perf stat report 2>report.csv
pi@raspberrypi:~ $ rm perf.data

• The time interval was fixed to 500 ms. Thus, samples were approximately recorded at time instants 0.5, 1.0, 1.5, etc.
• Events are monitored at the user level—this is indicated with :u—for the 0x08, 0x04, 0x03, 0x16, and 0x17 event codes.
• Data generated by perf stat record are by default saved into a file called perf.data, which can be loaded with perf stat report.
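The stored report can then be post-processed in Python. The following is a minimal sketch assuming the whitespace-separated layout shown above, with <not counted> samples set to zero, as done in this chapter's Appendix:

Parsing perf temporal samples – Python
def parse_perf_report(path, period=0.5):
    # Returns {event: [(time, counts per second), ...]}
    profiles = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 3 or fields[0] == '#':
                continue  # skip headers and malformed lines
            t = float(fields[0])
            raw = fields[1].replace(',', '')  # drop thousands separators
            rate = (float(raw) if raw.isdigit() else 0.0) / period
            profiles.setdefault(fields[-1], []).append((t, rate))
    return profiles

profiles = parse_perf_report('report.csv', period=0.5)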



Fig. 4.9 Instantaneous hardware statistics and power consumption for four inferences of GoogLeNet on Caffe. High correlation can be visually identified

Figure 4.9 profiles the instantaneous power consumption (right y-axis) versus three hardware metrics simultaneously measured (left y-axis)—namely L1 and L2 data cache loads per second and instructions per second—during four consecutive inferences. The different measurement sources employed, i.e., (1) the PMU event counters and (2) the external power analyzer mentioned in Sect. 4.4, make temporal alignment necessary. Further details about how these measurements were obtained and processed are provided in this chapter's Appendix.
Although not included in this book, plots similar to Fig. 4.9 were obtained for all the analyzed networks running on each DL library. Note the high correlation visually identifiable in that plot. The corresponding Pearson correlation coefficients of these aligned temporal signals of hardware counts and power consumption are given in Table 4.3. Note that these coefficients are in the range 0.54–0.95, being greater than 0.80 in most cases. Taking into account how difficult measuring power consumption is on most embedded platforms—supply pins must be accessible and special laboratory equipment is required—and its importance for the operational lifetime of edge vision systems, the established hardware-aware analysis constitutes a simple way to characterize the overall system—embedded platform, network, software libraries, vision application—in terms of energy.
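Once both signals share the same length and sampling period (see the alignment procedure in the Appendix), the coefficients in Table 4.3 reduce to a one-liner; a sketch using numpy:

Pearson correlation – Python
import numpy as np

def pearson(x, y):
    # Pearson correlation coefficient between two aligned, equal-length signals
    return np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1]

# e.g., r = pearson(l1_loads_per_s, power_samples)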

4.6 Graphical Comparison of Metrics
Outcomes from this study can be properly represented to assist in evaluating application requirements.



Table 4.3 Pearson correlation coefficient between three hardware statistics and instantaneous power consumption

                          L1-dcache loads/s  L2-dcache loads/s  Instructions/s
GoogLeNet–Caffe                0.85               0.72              0.94
GoogLeNet–TensorFlow           0.95               0.88              0.92
GoogLeNet–OpenCV               0.89               0.66              0.89
GoogLeNet–Caffe2               0.82               0.79              0.80
ResNet-50–Caffe                0.79               0.61              0.78
ResNet-50–TensorFlow           0.76               0.68              0.73
ResNet-50–OpenCV               0.86               0.54              0.85
ResNet-50–Caffe2               0.67               0.73              0.61
SqueezeNet–Caffe               0.94               0.82              0.95
SqueezeNet–TensorFlow          0.80               0.80              0.80
SqueezeNet–OpenCV              0.94               0.70              0.94
SqueezeNet–Caffe2              0.86               0.81              0.86

To simplify this task, a quick visual comparison of the benchmarked scenarios is provided in Fig. 4.10a, b. These plots also summarize the twofold experimental data reported in Sects. 4.3 and 4.4 for the bottom-up approach followed in this chapter.
From the point of view of hardware resource exploitation, Fig. 4.10a assesses five selected hardware events for each software package. Three diagrams also help to compare among CNNs. Colored charts represent the hardware metrics for the four studied software tools. The external circumference on the circular plots corresponds to the maximum measured value among all the frameworks and networks, whereas the radial axis scales values from that maximum. Both maximum and scale factors are indicated on the plots. Thus, for example, the branch mispredict ratio of GoogLeNet–TensorFlow (Fig. 4.10a–left, orange line) is approximately 0.6 times 7.5%, which is the maximum obtained for this metric.
Figure 4.10a immediately suggests the bottlenecks of each software library on this CPU-based platform: the higher the value of a metric, the worse the performance on that aspect. For instance, Caffe is characterized by the highest number of executed instructions and branch mispredictions in the three network cases. On the opposite side, TensorFlow stands out for requiring fewer instructions, OpenCV for reduced L2 cache misses, and Caffe2 for a very efficient use of branch prediction resources.
Figure 4.10b compares high-level performance metrics of the system. There is an unequivocal correspondence between reduced latency and TensorFlow, minimum memory allocation and OpenCV, and low power consumption and Caffe2. Comparing the networks on each circular diagram, note that SqueezeNet trades off classification accuracy for low power, memory resources, and latency (Fig. 4.10b–right).



Fig. 4.10 Polar charts highlighting the main factors that critically impact the system performance

4.7 Summary and Advanced Topics
This chapter establishes guidelines to investigate hardware resource utilization in order to explain system performance and identify critical factors. These low-level metrics effectively characterize the performance of CNN software frameworks running on the CPU of embedded systems. The identified correlations between such parameters and performance metrics highlight bottlenecks and limitations in the interaction between hardware and software. The companion graphical representations prove the validity of these considerations in practical terms.
Results such as those presented in Sect. 4.5.2 suggest the use of analytical models for instantaneous power consumption estimation, whereas outcomes in Sect. 4.5.1 support the idea of modeling the system performance from network architecture parameters. Accordingly, a comprehensive approach including analytical models to assess embedded systems will be presented in the next chapter. Notably, this approach characterizes the system performance prior to actually running inference.



Concerning event counters, additional metrics can be gathered, for instance, those selected as relevant in the hierarchical bottleneck analysis proposed by Yasin [15]. Regarding perf, there are further options to explore. Data collected by this tool can also be classified through per-thread, per-process, and per-cpu profiles—related commands are perf record/report/annotate/script. This provides a function-level breakdown that allows detecting code bottlenecks, for instance, functions taking the majority of CPU cycles. Other tools can also be employed to monitor hardware events on specific embedded hardware, such as the toplev tool specific to Intel's CPUs, among others. All in all, further information about what hardware performance counters are actually measuring and how these statistics explain software performance is comprehensively described in [16].
Besides, the proposed hardware-aware analysis is closely related to the so-called roofline performance model [17, 18]. Given the memory bandwidth and processor resources, this model provides an upper-bound performance for different workloads. Three parameters are considered and visually assessed in a "roofline" graph:
1. Floating-point performance, measured as the number of floating-point operations per second (GFLOPs/s). Hardware specifications can report the peak floating-point performance; otherwise, microbenchmarks such as LINPACK [19] can be carried out to find this limit.
2. Memory performance, which is related to the capacity of transferring data from the caches. The peak memory bandwidth (GB/s) will also constrain the system performance. The STREAM benchmark [20] can provide this value.
3. Operational or arithmetic intensity I, that is, the mean number of operations per byte of memory traffic. Defining the memory traffic Q (bytes) as the input/output data transferred between the caches and the off-chip memory, and given the software application work W (FLOPs), the system operational intensity is the ratio of the work to the memory traffic, I = W/Q, usually expressed in FLOPs/Byte. Note that if the traffic is measured between the processor and the cache, this ratio will instead express arithmetic intensity.
Figure 4.11 exemplifies the roofline model. In this plot, the axes are reserved for I and floating-point performance. A horizontal line establishes the upper bound to floating-point performance (GFLOPs/s), whereas the peak memory bandwidth exhibits a constant slope (GB/s, or (GFLOPs/s)/(FLOPs/Byte)). Therefore, the upper bound for a workload is the minimum between (i) the processor's peak floating-point performance—blue horizontal line—and (ii) the product of the peak memory bandwidth (GB/s) and the operational intensity—blue slanted line. This analysis exposes limitations, i.e., memory- or compute-bound applications. For a particular operational intensity—red vertical line—if it hits the flat limit (peak GFLOPs/s), the application is compute-bound; on the opposite side, if it hits the sloped line, it is memory-bound. Tips for reducing memory bottlenecks include restructuring loops for consecutive-address memory accesses or using data pre-fetching; meanwhile, exploiting instruction-level parallelism—SIMD—will reduce computational bottlenecks.
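The attainable performance is therefore the minimum of the two bounds; a minimal sketch with hypothetical platform figures:

Roofline bound – Python
def roofline_bound(intensity, peak_gflops, peak_bw_gbs):
    # Attainable performance (GFLOPs/s) for operational intensity I (FLOPs/Byte)
    return min(peak_gflops, peak_bw_gbs * intensity)

# Hypothetical platform: 4.0 GFLOPs/s peak compute, 2.0 GB/s peak bandwidth
print(roofline_bound(1.0, 4.0, 2.0))  # -> 2.0 GFLOPs/s: memory-bound
print(roofline_bound(8.0, 4.0, 2.0))  # -> 4.0 GFLOPs/s: compute-bound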



Fig. 4.11 Roofline performance model

The roofline model is related to the event counter approach employed in this chapter. Assuming proportionality between the number of instructions and FLOPs, the event count would be an estimate of the work W; likewise, the memory traffic Q would rely on the data memory accesses provided by the event counters. Then, the arithmetic intensity would reveal workload bounds.
Acknowledgements
This chapter is partially based on the authors' publication [21].

4.8 Appendix
Power Measurement
We employed an external Keysight N6705C DC Power Analyzer to extract temporal samples of the device power consumption. This equipment powered the embedded platform while recording the current and voltage supplied. In the case of the RPi series, the pinout command provides insight into the GPIO pinout, as exemplified in Fig. 4.12. For RPi3B, pins #2 and #4 correspond to the 5V voltage supply, whereas various pins—#6 as an example—are reserved for the GND reference. Thus, the RPi can be powered by generating a constant 5V signal on the N6705C and wiring it up to the RPi's GPIO. Note that there is no fuse protection or regulation on the GPIO to protect the system from overvoltage or current spikes.


Fig. 4.12 Information from pinout command on RPi3B




Additionally, in order to estimate the input resistance of this measurement setup, we proceeded as follows. We incrementally established the N6705C output voltage V in the range [5.0, 5.5] V, with step variations of 0.1 V. In each case, we used a multimeter to measure both the input voltage on the RPi's GPIO pins, V_RPi, and the current supplied, I. Thus, we calculated the resistance as R = (V − V_RPi)/I. With these data, we obtained R = 0.805 ± 0.003 Ω.
Therefore, in all experiments performed to measure power consumption, the power analyzer recorded a set of V and I temporal samples—this monitoring instrument allows saving and exporting these data (V, I) into a csv file. Then, the RPi's power consumption was calculated as P_RPi = V·I − I²·R. The sampling period was established at its minimum (40.96 µs), thus ensuring high precision in the measurement.
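A minimal sketch of this per-sample computation follows, assuming V_samples and I_samples are the voltage and current arrays exported from the analyzer's csv file (hypothetical variable names):

Power computation – Python
import numpy as np

R = 0.805  # estimated series resistance (ohms), see above

def rpi_power(V, I):
    # Per-sample RPi power: P = V*I - I^2*R
    V, I = np.asarray(V, float), np.asarray(I, float)
    return V * I - R * I**2

# e.g., P = rpi_power(V_samples, I_samples); P_avg = P.mean()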

Data Alignment
We rendered a list of event counts periodically gathered during the inference—see the perf command in Sect. 4.5.2. To temporally distinguish each network inference, the minimum sampling period had to be set on the PMU counters. This minimum available value was 10 ms.

Event counter issues

As explained in the reference manual, the absolute counts provided by the six PMU counters might vary due to pipeline effects. This can have an effect when the counters are enabled for a very short time and switched very frequently. For the sampling period established in the experiments described in this chapter, a warning is issued by perf: "print interval < 100ms. The overhead percentage could be high in some cases. Please proceed with caution."

Another concern is the absence of event counting at some time intervals. In these cases, <not counted> appears in the perf output instead of the corresponding event count:

#           time             counts unit events
     2.002717908      <not counted>      r008:u
     2.002717908      <not counted>      r004:u
     2.002717908      <not counted>      r003:u

This effect is exemplified in Fig. 4.13a, which shows the already processed data gathered from perf stat -I 10 for various CNN inferences. Empty counts at some instants—for example, in the slot between seconds 14–15—correspond to those not-counted values. Time overflow in the perf report is another issue to take care of when processing its output.



Fig. 4.13 Alignment process between two recordings from different sources while running various network inferences. Plots a and b show raw data from the PMU hardware counters and the power analyzer, respectively. Note that hardware events are also gathered during network loading. Power monitoring was manually started and stopped to register inference. An alignment of identified starting points—annotated with red lines—is depicted in plot c. However, the number of samples highly differs due to different sampling periods. Finally, plot d illustrates a detailed view—corresponding to one inference—of those already processed signals sharing the sampling rate and hence the number of samples



Finally, keep in mind that when measuring more events than available counters, the kernel uses counter time multiplexing to gather all the data. With multiplexing, each event is not constantly measured; instead, the final result is scaled according to the total time enabled and the overall time running. Therefore, these counts are estimations that could differ from the actual event counts. As advice, avoid multiplexing numerous events. Furthermore, a way to achieve coherent counts for various events—with the same scale—is to measure them together in a "group"; in perf, this is done by wrapping the event list in braces, e.g., -e '{r008,r004}'.

At this point, we have extracted two network profiles from (1) hardware events provided by event counters in the processor's PMU and gathered via perf, and (2) platform power consumption measured with an external power analyzer. Unfortunately, their completely different measurement procedures make them difficult to compare. How can samples from different sources be aligned?
Figure 4.13 depicts how data alignment was performed to obtain the results presented in this chapter. First of all, processing the hardware event statistics allows visualizing a profile similar to that shown in Fig. 4.13a. This processing implies setting missing samples to zero—not-counted events—and checking for a possible time overflow and the actual sampling period—it may slightly differ from the established one. In this particular case, the average sampling period was approximately 10.41 ms. The instantaneous rates of statistic counts per second on the y-axis were simply calculated as the fraction of raw counts divided by the sampling period. This figure evidences that there are two distinguishable stages: network loading and network inference.
Second, we found a common reference time in both series of measurements. For instance, the beginning of inference is characterized by an abrupt step from idle power to high consumption. This is marked in Fig. 4.13a, b with a vertical red reference. Then, we were able to plot the aligned measurements (Fig. 4.13c). Indeed, as the time axis is equally scaled, we can identify the same profile, with the end of inference temporally coinciding. However, each data record—power and hardware statistics—actually contains a very different number of samples owing to their distinct sampling periods. Certainly, in order to calculate correlation coefficients, vectors of the same length are needed, so we had to make their sampling periods equivalent. To do that, we checked the ratio between sampling periods and down-sampled the signal with the lowest sampling period, i.e., the power recording. That is, we kept only the samples at int(ratio_f × k) positions, for all integer k values in the range [0, L], where



• ratio_f > 1 is the proportion between the highest and lowest sampling periods, ratio_f = T_high/T_low.
• L is the length of the resulting down-sampled vector, i.e., the number of samples of the slower recording.
These same-length power and hardware statistics vectors are plotted in Fig. 4.13d for one network inference, also displaying red lines for the corresponding sampling times, separated by T_high.
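A minimal sketch of this down-sampling step follows, assuming fast holds the high-rate recording (the power samples in our case) and using the sampling periods quoted above:

Down-sampling for alignment – Python
def align_downsample(fast, ratio_f):
    # Keep the samples at positions int(ratio_f * k) of the high-rate signal
    L = int(len(fast) / ratio_f)
    return [fast[int(ratio_f * k)] for k in range(L)]

# e.g., ratio_f = 10.41e-3 / 40.96e-6 (PMU period / power-analyzer period):
# power_aligned = align_downsample(power_samples, 10.41e-3 / 40.96e-6)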

References
1. Kielmann, T., et al.: Basic Linear Algebra Subprograms (BLAS) (2011). https://doi.org/10.1007/978-0-387-09766-4_2066
2. Lawson, C.L., Hanson, R.J., Kincaid, D.R., Krogh, F.T.: Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Softw. 5(3), 308–323 (1979). https://doi.org/10.1145/355841.355847
3. Automatically tuned linear algebra software (ATLAS). http://math-atlas.sourceforge.net
4. Goto, K., van de Geijn, R.A.: Anatomy of high-performance matrix multiplication. ACM Trans. Math. Softw. 34(3), 12:1–12:25 (2008). https://doi.org/10.1145/1356052.1356053
5. Intel math kernel library. https://software.intel.com/en-us/mkl
6. Guennebaud, G., et al.: Eigen. http://eigen.tuxfamily.org/
7. Dense linear algebra on GPUs. https://developer.nvidia.com/cublas
8. ARM Processors. Cortex-A53. https://developer.arm.com/products/processors/cortex-a/cortex-a53
9. ARM: ARM® Cortex®-A53 MPCore Processor. Technical Reference Manual
10. ARM: ARM® Architecture Reference Manual. ARMv8, for ARMv8-A architecture profile (2017)
11. ARM: ARM Cortex-A53 MPCore Processor Advanced SIMD and Floating-point Extension. Technical Reference Manual
12. Raspberry Pi. Operating system images. https://www.raspberrypi.org/software/operating-systems/
13. GitHub – OpenCV. Open Source Computer Vision Library. https://github.com/opencv/opencv
14. perf: Linux profiling with performance counters. https://perf.wiki.kernel.org/index.php/Main_Page
15. Yasin, A.: A top-down method for performance analysis and counters architecture. In: 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 35–44 (2014). https://doi.org/10.1109/ISPASS.2014.6844459
16. Becker, M., Chakraborty, S.: Measuring software performance on Linux. arXiv (1811.01412) (2018)
17. Cabezas, V.C., Püschel, M.: Extending the roofline model: bottleneck analysis with microarchitectural constraints. In: 2014 IEEE International Symposium on Workload Characterization (IISWC), pp. 222–231 (2014). https://doi.org/10.1109/IISWC.2014.6983061
18. Williams, S., Waterman, A., Patterson, D.: Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4), 65–76 (2009). https://doi.org/10.1145/1498765.1498785
19. Dongarra, J.J., Luszczek, P., Petitet, A.: The LINPACK benchmark: past, present and future. Concurrency and Computation: Practice and Experience 15(9), 803–820 (2003). https://doi.org/10.1002/cpe.728
20. McCalpin, J.: Memory bandwidth and machine balance in high performance computers. IEEE Technical Committee on Computer Architecture Newsletter, pp. 19–25 (1995)
21. Velasco-Montero, D., Fernández-Berni, J., Carmona-Galán, R., Rodríguez-Vázquez, A.: Performance assessment of deep learning frameworks through metrics of CPU hardware exploitation on an embedded platform. International Journal of Electrical and Computer Engineering Systems 11(1) (2020). https://doi.org/10.32985/ijeces.11.1.1

Chapter 5

Prediction of Visual Inference Performance

Abstract The implementation of CNN-based applications on resource-constrained devices operating at the edge constitutes a major challenge. Once the network chosen for a specific edge application is trained, its actual implementation on a particular edge platform may not fulfill the prescribed application requirements. Therefore, selecting or designing a CNN able to deliver acceptable execution time or energy consumption on the embedded system is crucial—even before training the network on the target dataset. In this chapter, a methodology for estimating CNN performance on edge devices is presented. It allows evaluating the execution latency and energy demand of CNNs prior to their actual deployment. Furthermore, this procedure also provides a layerwise performance analysis that can be exploited by NAS engines or guide the design of ad hoc networks optimally adapted to the system. The resulting predictive models for both inference time and energy consumption have been tested on several well-known CNNs running on three different software–hardware systems.

5.1 Introduction
When it comes to implementing embedded vision systems, practitioners must focus on a particular realization defined in terms of a complex algorithm–software–hardware stack. For this design task, a preliminary benchmark methodology such as the one presented in Chap. 3 can guide the technological choices according to application requirements. In addition, software coding enhancements can further boost the system performance; to this end, the recommendations provided in Chap. 4 are helpful. Furthermore, once the software libraries underlying the operation of a particular device have been selected and optimized, edge inference performance only depends on the network architecture. In this regard, designing networks optimally adapted to the hardware to fulfill stringent application requirements is not easy. Well-known lightweight models, such as those listed in Chap. 2, were designed to reduce the number of parameters and thus the computational load as a whole.





Actually, as they did not specifically target a particular embedded hardware, their performance significantly varies depending on the host system [1]. In this regard, a preliminary evaluation is possible by simply comparing the number of operations performed by each network architecture for inference. However, this straightforward evaluation does not usually produce reliable values of performance metrics, in particular power consumption [2]. Additionally, the manual design of computationally efficient CNNs is currently moving toward automatic algorithms such as NAS. To incorporate specific platform constraints into such approaches, a previous step is modeling how both the network architecture and the software–hardware system impact the optimization target function.
The main objective when implementing CNNs on resource-constrained devices is maximizing throughput and inference accuracy while minimizing energy consumption. Concerning architectural design or network selection, there exists a vast space of possibilities that perform differently on the same embedded system. Benchmarking does not scale with this diversity of network models. To address this issue, a methodology that removes the need for comprehensive benchmarking is presented in this chapter. This methodology builds a performance prediction model based on the single characterization of a CNN specifically designed to cover most of the common layer configurations in state-of-the-art CNNs. The resulting model targets a particular software–hardware combination and provides accurate per-layer estimation of the expected performance for any network architecture running on that combination. Such a priori prediction is key for rapid exploration and optimal implementation of visual inference. In summary, this chapter describes the following:
• A procedure to accurately estimate the performance of CNN architectures—in terms of throughput and energy consumption—layer by layer, without the need of actually running them.
• A neural network whose fine-grained characterization enables the construction of the aforementioned prediction model. This network incorporates a large variety of CNN layers and parameters to achieve CNN layerwise characterization.
• Fast identification of the layers incurring the most inference time or demanding distinctively higher energy consumption for a network running on a particular software–hardware combination. Such layers would be the first ones to be modified by an optimization procedure or a NAS engine.
In order to guide the reader through the methodology, an overview of CNN parameters and metrics is first provided. Then, we comprehensively describe the methodology for building performance estimation models. This procedure relies on an ad hoc CNN specifically designed for performance characterization, whose details are reported later on. Finally, the capability of such predictive models to estimate the performance of any CNN on the targeted system is demonstrated through experimental tests.



5.2 Fundamentals of CNNs
We refer the reader to Chap. 2 for a review of the fundamental layers composing CNNs. Notwithstanding, the aspects most relevant to the proposed methodology are also described in this section. The mathematical notation also conforms to the aforementioned chapter—remember, we use in/out subscripts for input/output 3D data volumes sized (H × W × C). In addition, n(X) denotes the number of elements in the tensor X. Thus, for example, n(I) for an input fmap I containing C_in channels sized H_in × W_in will be equal to H_in·W_in·C_in.
Common CNN Layers and Parameters
From an architectural point of view, common layers composing state-of-the-art CNNs and their related metrics are summarized in Table 5.1 and explained below.
• CONV layers apply a total of C_out filters sized k_h × k_w × C_in on sliding windows in the input volume. The dot product is performed between the kernel weights and the considered receptive field, thus executing k_h·k_w·C_in MACs per output activation. In the case of adding biases to each output, the computation increases by n(O) extra MACs.
• FC layers yield a vector of activations after applying a set of weights to each input. Generally, assuming n(I) inputs that produce C_out outputs, this layer involves a computational cost of n(I)·C_out MACs.
• Pooling is applied on sliding windows sized k_h × k_w, performing a total of k_h·k_w·H_out·W_out·C_out operations. Note that different functions can be applied—maximum, average, etc.
• Activation layers perform a specific operation on each input activation, thus keeping the input dimensions unmodified. Examples are nonlinear layers (ReLU, sigmoid, etc.), normalization layers, scale, softmax, etc. These layers may require learning weights—as in the case of adding biases or normalizing inputs.
• Multiple-input layers combine activations from various branches. This is usually implemented by directly concatenating fmaps (Concat) or applying elementwise operations (Eltwise).
Overall, the approach presented in this chapter covers nine types of layers: CONV, FC, pooling, BN, ReLU, scale, Eltwise, Concat, and softmax. Therefore, the most popular normalization (BN) and nonlinearities (ReLU, scale) are studied, in addition to branch-combining layers (Eltwise, Concat) and the loss function included in most classification networks (softmax).
Intrinsic CNN Metrics
Trained CNNs present the following parameters, relevant for the practical deployment of embedded applications:
• Computational complexity (#OPs). The amount of floating-point operations required for network inference—or, at least, required by the CONV and FC layers—is



Table 5.1 Operation metrics for the most usual layers composing CNNs. The number of weights and operations assumes no biases added

Convolutional layer:
  Output dimensions: H_out = ⌊(H_in − k_h + 2p)/s_h⌋ + 1; W_out = ⌊(W_in − k_w + 2p)/s_w⌋ + 1; C_out = # filters
  # weights: k_h·k_w·C_in·C_out      # operations: k_h·k_w·C_in·n(O)
Fully connected:
  Output dimensions: 1 × 1 × C_out
  # weights: n(I)·C_out              # operations: n(I)·C_out
Pooling:
  Output dimensions: H_out = ⌊(H_in − k_h + 2p)/s_h⌋ + 1; W_out = ⌊(W_in − k_w + 2p)/s_w⌋ + 1; C_out = C_in
  # weights: —                       # operations: k_h·k_w·n(O)
Activation layer:
  Output dimensions: H_in × W_in × C_in
  # weights: ∝ C_in                  # operations: ∝ n(O)
Multiple-input layer:
  Output dimensions: H_in × W_in × C*
  # weights: ∝ C_in                  # operations: ∝ n(O)

* For concatenation layers, C* = Σ_inputs C_in; whereas simple elementwise operations keep the same channel dimension as the input volumes, i.e., C* = C_in

a widely used metric to measure computational complexity. The overall computational load of the CNN can be determined by adding up the number of operations per layer—previously detailed in Table 5.1. Note that this is a general estimation of the minimum number of operations required for inference—not only FLOPs; ultimately, it is the specific interaction between hardware and software in the targeted system that determines the actual computational complexity in terms of processor workload.
• Model size. It is calculated as the total number of parameters learnt by the network during the training stage. This value impacts the memory footprint—as exemplified in Chap. 4—and may preclude the execution of certain networks on specific hardware devices.
• Memory accesses. In addition to the network weights, working activations are also allocated in memory during inference. Thus, the minimum number of basic memory operations for a layer forward pass will be

#memOPs = n(I) + n(W) + n(O)    (5.1)

This is again a plain estimation: the actual number of memory accesses will ultimately depend on the hardware platform—memory word size, memory hierarchy, cache size, etc.—and on the computational strategy for each operation—partial matrix products, data access patterns, etc.

• Accuracy. As opposed to the metrics above, this figure is related to the dataset on which the network was trained—network architecture and complexity also affect the capability of accurate network training. These intrinsic metrics gather information about resource requirements of the network, but note that they are just “indicators” of actual inference performance: Although they can provide a preliminary estimation of the network operation efficiency, the final empirical performance may significantly differ [2–5]. 1

This is a general estimation on the minimum number of operations required for inference—not only including FLOPs. Ultimately, it will be the specific interaction between hardware and software in the targeted system that will determine the actual computational complexity in terms of processor workload. 2 This is again a plain estimation. The actual number of memory accesses will ultimately depend on the hardware platform—memory word size, memory hierarchy, cache size, etc.—and the computational strategy for each operation—partial matrix products, data access pattern, etc.
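As a minimal illustration of how these intrinsic metrics follow from Table 5.1 and Eq. (5.1), the sketch below computes #OPs, the model size, and #memOPs for a single CONV layer. The function name and arguments are ours, biases are ignored as in Table 5.1, and the AlexNet-like example values are only indicative.

Computing Intrinsic CONV Metrics—Python

def conv_metrics(h_in, w_in, c_in, c_out, k_h, k_w, stride=1, pad=0):
    ''' Intrinsic metrics of a CONV layer, following Table 5.1 and Eq. (5.1) '''
    h_out = (h_in - k_h + 2 * pad) // stride + 1   # output dimensions (Table 5.1)
    w_out = (w_in - k_w + 2 * pad) // stride + 1
    n_in = h_in * w_in * c_in             # n(I)
    n_out = h_out * w_out * c_out         # n(O)
    n_w = k_h * k_w * c_in * c_out        # n(W), no biases
    ops = k_h * k_w * c_in * n_out        # kh*kw*Cin MACs per output activation
    mem_ops = n_in + n_w + n_out          # Eq. (5.1)
    return {'#OPs': ops, 'model_size': n_w, '#memOPs': mem_ops}

# AlexNet-like first layer: 227x227x3 input, 96 filters sized 11x11, stride 4
print(conv_metrics(227, 227, 3, 96, 11, 11, stride=4))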


CNN Inference Performance Metrics

The most relevant inference performance parameters regarding forward pass execution are as follows:
• Throughput, or processing frame rate, which depends on the execution time of the complete network workload.
• Energy consumption, which directly constrains the battery lifetime of the system.

CNN Implementation Strategies

The relation between network architecture and inference performance metrics highly depends on the implementation. Concerning software, several optimization strategies can be adopted to accelerate network processing according to the available hardware resources. In particular, the libraries underlying DL frameworks implement numerous approaches to accelerate matrix multiplication on specific hardware—CPU, GPU, etc. For CPU-based systems, the common approach, i.e., unrolled convolution, integrates an im2col reordering before the matrix product—see details in Chap. 2. Thus, a memory overhead is introduced, which must be taken into account in Eq. (5.1), where n(I) becomes (kh·kw·Cin)(Hout·Wout) for CONV layers implementing this strategy.³ In sum, each implemented approach will perform differently according to how hardware resources are exploited—memory allocation, data access patterns, use of instruction-level parallelism, etc. The question to answer is: Can these inference performance parameters be estimated from intrinsic network metrics?

³ Note that in addition to the kh·kw memory overhead factor introduced, the data reorganization of im2col may also cause a slight operation delay.
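To make the im2col overhead tangible, the following sketch—our own illustration, not part of any DL library—compares the nominal n(I) with the unrolled buffer size (kh·kw·Cin)(Hout·Wout):

Estimating the im2col Memory Overhead—Python

def im2col_overhead(h_in, w_in, c_in, k, stride=1, pad=0):
    ''' Nominal input size versus the im2col buffer of an unrolled convolution '''
    h_out = (h_in - k + 2 * pad) // stride + 1
    w_out = (w_in - k + 2 * pad) // stride + 1
    n_in = h_in * w_in * c_in                       # nominal n(I)
    n_unrolled = (k * k * c_in) * (h_out * w_out)   # reordered input matrix
    return n_in, n_unrolled

n_in, n_unrolled = im2col_overhead(56, 56, 64, k=3, stride=1, pad=1)
print(n_unrolled / n_in)  # close to k*k = 9 for a stride-1 3x3 convolution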

5.3 Modeling and Prediction of Inference Performance

5.3.1 General Description

Layerwise modeling allows per-layer performance assessment, which is valuable not only for network selection but also for network architecture design, layer selection, or network compression. This can be concluded from the following remarks:
• Different layers are stacked in network architectures, thus involving numerous types of computational operations and data access patterns. For instance, FC layers require a great deal of memory operations due to their large number of learnt weights, whereas ReLU layers only perform a simple operation on the input data.
• Energy consumption also depends on the particular layer operation and on how this operation is mapped onto the underlying hardware resources—as explained in Chap. 4. For instance, the energy cost of accessing memory varies by up to two orders of magnitude across the different levels of the memory hierarchy [6].



Therefore, the expected performance of CNN inference will be characterized through per-layer predictive models. Layer-level performance modeling takes into account the layer diversity in terms of operations and data—layers can be memory- or compute-bound—and how embedded hardware resources are differently exploited to perform inference. Moreover, it supports low-level decisions when designing ad hoc network architectures or compressing models.

Figure 5.1 illustrates the proposed approach, called prediction of visual inference performance (PreVIous), which comprises two main stages. The first one involves the complete performance characterization—in terms of runtime and energy—of each layer composing a full-custom network called PreVIousNet when running on the selected embedded system—defined as a software–hardware combination. The resulting benchmarking data enable the construction of a prediction model whose inputs are network parameters such as those explained in Sect. 5.2. This one-time constructed model can then be used in a second stage to estimate the performance of any other CNN to be run on that system.

Fig. 5.1 General overview of the proposed methodology. It comprises two stages: (1) performance modeling, in which the characterization of PreVIousNet enables the construction of a prediction model for the selected system; (2) performance prediction, in which the performance of any CNN of interest to be run on the selected system is accurately predicted on the basis of the previously constructed model


5.3.2 Selected System

PreVIous is agnostic with respect to the software–hardware combination used for modeling and prediction. However, this study will specifically evaluate the methodology on two popular software frameworks deployed on two low-cost hardware platforms. The baseline combination integrates Caffe on RPi3B. Two variations of this baseline system are also experimentally analyzed: OpenCV on RPi3B and Caffe on Odroid-XU4 [7]. Particularities of the systems under study will be outlined in Sect. 5.5.1.

5.3.3 Network Profiling

This step involves the extraction of three sets of measurements in the modeling stage. While the first set—architectural network metrics—can be directly extracted from the model architecture, the other two sets—i.e., inference performance metrics—must be empirically measured while executing the network layers. By "layer execution," we mean the computation needed to obtain output data from inputs at each layer. In order to ensure accurate measurements, these metrics are assessed over N inferences, and the obtained values are averaged. Concerning the performance prediction stage, only architectural metrics are extracted from the CNN of interest; performance metrics for this CNN are predicted.

• Architectural Metrics. Simply by analyzing the network definition, it is possible to extract per-layer architectural parameters such as input/output dimensions (H, W, C), number of learnt weights n(W), and kernel sizes (kh, kw), as well as estimates of computational (#OPs) and memory (#memOPs) requirements. An overview of the calculations to obtain some of these metrics is provided in Table 5.1.
• Time Profiling. Each layer composing a network is executed individually, and its runtime is measured. This per-layer execution profiling on the targeted system is carried out for PreVIousNet in the characterization stage.
• Energy Profiling. Fine-grained characterization of the energy demand of each individual layer of PreVIousNet is also required. While running each layer separately, the power consumption of the corresponding system was monitored, and the exact runtime required by each layer was simultaneously recorded. Therefore, each layer's consumption can be delimited within the power signal provided by the analyzer in order to obtain the energy profiling of the whole network. As an example, Fig. 5.2a illustrates the profiling of the per-layer power demand for All-CNN-C [8]. A total of 50 executions per layer were carried out, and the energy measurements were then averaged. For proper identification of the layers, a time interval of 300 ms was established via software to separate each set of layer executions. In addition, the previously obtained time profiling allows extracting the portion of the power


Fig. 5.2 a Power signal measured while performing 50 consecutive executions of each layer of All-CNN-C on the combination RPi–Caffe, with a time interval of 300 ms between layer executions. b Portion of the power signal corresponding to layer “conv2” of All-CNN-C. This extracted signal is integrated and averaged over the 50 executions to obtain the energy consumption of the layer

signal corresponding to the layer under characterization—see further details in the Appendix. For instance, Fig. 5.2b shows the extracted signal for layer "conv2" of All-CNN-C. The energy consumption of each layer is then obtained by integrating its power signal and averaging over the 50 performed executions.

The companion Appendix provides further details about network profiling procedures on the selected embedded systems.

5.3.4 Model Construction

Linear regression models are proposed to predict performance figures from CNN metrics. In particular, regression models for each type of layer described in Sect. 5.2 are required for predicting both runtime and energy consumption. Note that this diversity of layers must be covered by PreVIousNet—see Sect. 5.4. Architectural metrics play the role of predictors for such models. Thus, the performance prediction stage simply consists in parsing the definition of a CNN of interest to extract its architectural metrics and applying the corresponding regression models to its constituent layers.

Model Formulation

For the sake of simplicity and reduction of overfitting, we make use of linear models. Generally speaking, a linear regression model aims at finding the best weighted combination of a set of variables x = [x_1, x_2, ..., x_p] (predictors) to estimate the values of another variable y (response) with minimum error. Given n observations of such predictors and response, {(x_i, y_i)}_{i=1}^{n}, the model can be expressed as follows:

y = Xw + ε    (5.2)


where X is the n × p matrix of input observations, w is the p × 1 vector of adjusted model coefficients (weights), y is the n × 1 observed response vector, and ε is the n × 1 error vector. Note that the observations for building the model are encoded by each row x_i = [x_{i1}, x_{i2}, ..., x_{ip}] of the observation matrix X, and p is related to the complexity of this multivariate linear model. In the method known as ordinary least squares (OLS) linear regression, the coefficients w are obtained by minimizing the following equation:

||y − Xw||² = Σ_{i=1}^{n} (y_i − x_i w)²    (5.3)

where again x_i is the i-th row of X containing one observation and n is the total number of available samples for building the model. Therefore, minimizing this equation will also minimize the variance of the residuals, yielding unbiased estimations (E[ε_i] = 0). Once the model is built, i.e., the weights are adjusted, the model prediction ŷ for a new set of predictors x′ is obtained as follows:

ŷ = x′w = Σ_{j=1}^{p} x′_j w_j    (5.4)
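Equations (5.2)–(5.4) can be verified numerically in a few lines; the NumPy sketch below fits OLS weights on synthetic data and is purely illustrative.

OLS Fit and Prediction on Synthetic Data—Python

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.random((n, p))                            # n observations of p predictors
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(n)    # y = Xw + eps, Eq. (5.2)

# OLS: minimize ||y - Xw||^2, Eq. (5.3)
w, *_ = np.linalg.lstsq(X, y, rcond=None)

x_new = np.array([0.2, 0.4, 0.6])                 # new set of predictors x'
print(w, x_new @ w)                               # prediction y_hat, Eq. (5.4)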

According to this notation, PreVIous produces a regression model for each type of layer, based exclusively on architectural metrics (x), to estimate runtime or energy (y).

How to obtain the observation set to build the models

For each regression model—that is, for each type of layer and target metric, runtime or energy—we need to measure the performance of n samples in order to build the model. The observation set {(x_i, y_i)}_{i=1}^{n} then comprises the architectural metrics of these n layers of a particular type, {x_i}_{i=1}^{n}, and their associated performance measurements, {y_i}_{i=1}^{n}. This one-time characterization will enable the model construction by minimizing its associated function as in Eq. (5.3).

Variability in the set of n samples is desirable in order to obtain models with enough generalization capacity. For instance, a set of characterized FC layers must include samples featuring several input/output dimensions and numbers of weights. Diverse performance behaviors will therefore be encompassed in {x_i}_{i=1}^{n}; thus, model estimations will likely be more accurate, and the model will generalize better.

The observation dataset could be obtained by collecting performance data from many different popular CNNs. However, this procedure presents two undesirable issues:
• The characterization of many CNNs is time-consuming and arduous.


• Layers composing different well-known CNNs still share similar parameters; therefore, their characterization would be redundant and would not provide data variability.

Alternatively, PreVIous proposes a simpler and systematic approach to produce the performance dataset: the introduction of a network, PreVIousNet, which by design sweeps the most usual layer parameters, features, and data dimensions for each layer. Its characterization is simple but still delivers data heterogeneity on {(x_i, y_i)}_{i=1}^{n}. Hence, the architectural design space is comprehensively covered. See details on this network in Sect. 5.4.

Model Tuning. Feature Selection and Extraction

In general, a greater complexity in machine learning models implies a greater probability of overfitting. In this regard, linear models are the simplest ones. As practical advice, it is worth applying well-known techniques to make the most of simple linear regression models:

• Feature extraction: nonlinear transformation. In general, applying transformations to the input variables—logarithmic, polynomial, etc.—improves the linear behavior of the response variable with respect to the predictors. Provided that these new features are more correlated with the response, the performance of the linear models is enhanced. Note that, in our case, polynomial transformations are indirectly applied: We use meaningful variables—i.e., n(I), #OPs, etc.—which are high-order products of basic ones—H, W, C, etc. In addition, depending on the problem, additional techniques to extract new features from the available data could be explored, such as principal component analysis.
• Dimensionality reduction. The higher the number of variables, the higher the number of observations needed to fit the model. In fact, to avoid overfitting, n should increase exponentially with the number of predictors p—a fact also known as "the curse of dimensionality." A technique to overcome this problem is feature selection: Only variables highly correlated with the target response y should be considered for the model. Figure 5.3 shows a correlation analysis providing insight into the best candidates to create a linear regressor.
• Model regularization. This technique overcomes the need for a high number of observations to accurately fit the model. Regularization reduces model complexity, making it less sensitive to the reduced set of data. In particular, we apply standardized ridge regularization, in which the coefficients are obtained by minimizing the following equation:

||y − Xw||² + λ||w||² = Σ_{i=1}^{n} (y_i − x_i w)² + λ Σ_{j=1}^{p} w_j²    (5.5)

where λ denotes the regularization tuning parameter controlling the strength of the ridge penalty term. We set this penalty parameter to 1.


Fig. 5.3 Pearson correlation coefficient for analyzing architectural parameters versus the corresponding response variable, namely a runtime or b energy consumption. Data come from PreVIousNet profiling on the baseline system, i.e., RPi–Caffe. Variables are ordered by increasing correlation. For simplicity, data from all the types of layers were jointly considered

An alternative regularization is the least absolute shrinkage and selection operator (Lasso), whose minimization function has the following expression:

||y − Xw||² + λ||w||₁ = Σ_{i=1}^{n} (y_i − x_i w)² + λ Σ_{j=1}^{p} |w_j|    (5.6)

which penalizes the L1 norm of the coefficients, ||w||₁ = Σ_j |w_j|, thus also performing variable selection by setting some coefficients to zero.

Considering the points above, to keep the modeling stage simple, only the most meaningful predictors are considered to build both the runtime and energy per-layer regression models. According to our analysis, n(W), #OPs, and #memOPs are a priori the best candidates. In fact, these selected features show correlation with the prediction variable, as demonstrated in Fig. 5.3 and also exemplified in Fig. 5.4, which plots layer runtime versus #memOPs in the baseline system for the data gathered from PreVIousNet profiling during the modeling stage. Indeed, this inherent linear relation supports the decision of applying linear regression models rather than more complex approaches such as support vector machines, neural networks, or Gaussian process regression. Furthermore, the derivation of per-layer regressors also makes sense, as observed in the different linear trends followed by the observation data (Fig. 5.4). In contrast, basic metrics such as first-order variables (i.e., H, W, C, etc.) would hardly predict the output in a linear model—see, as an example, Fig. 5.5, in which no linear relationship can be identified for Hin.
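For completeness, ridge regression—Eq. (5.5)—admits the closed-form solution w = (XᵀX + λI)⁻¹Xᵀy; the sketch below is a minimal illustration with synthetic data and λ = 1, the value used in this chapter.

Closed-Form Ridge Solution—Python

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.standard_normal(100)

lam = 1.0  # ridge penalty lambda
# Minimizer of Eq. (5.5): w = (X^T X + lambda*I)^-1 X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w_ridge)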


Fig. 5.4 Inference runtime versus architectural metric #memOPs (variable selected for the prediction model). Per-layer linear correlations can be identified


Fig. 5.5 Inference runtime versus the architectural metric Hin (which was finally excluded from the set of model input variables)

5.4 PreVIousNet: A Network for Fine-Grained Characterization

In order to define a systematic modeling stage, a full-custom neural network, PreVIousNet, is proposed. It is specifically designed for modeling the performance of a diversity of layers with different settings when running on a selected system. Therefore, it neither needs to be trained nor is applicable to vision tasks. The common architectural settings and layer parameters included in CNNs are contemplated in PreVIousNet, as described below.


5.4.1 Layer Parameters

Concerning convolutions, the common strategies implemented in lightweight CNNs are included in PreVIousNet; a numerical sketch of their relative costs is given at the end of this subsection:

• Standard CONV. Adjustable layer parameters include:
  – Kernel sizes (kh, kw): Conventionally, square odd-sized kernels are used. In addition, state-of-the-art CNNs employ small kernel sizes in order to reduce the computational load (#OPs ∝ k²).
  – Number of kernels N = Cout: Frequently, N > Cin kernels are applied to progressively expand the channel dimension.
  – Stride s: If the sliding step is greater than 1, the output fmap spatial dimensions H, W are accordingly reduced. The most common value for strided convolutions is s = 2.
• Depthwise CONV. This type of layer operates independently on each input channel by applying one kh × kw × 1 kernel filter per 2D fmap. Thus, the number of weights and the computational load are reduced by a factor of Cin.
• Pointwise CONV. If the kernel only operates along the channel dimension, i.e., filters are sized 1 × 1, we have a non-spatial convolution that reduces the computational load by k². Two variants are possible:
  – Bottleneck pointwise CONV, which aims to reduce the computational load of subsequent layers by cutting down the number of channels (Cout < Cin).
  – Expansion pointwise CONV, which increments the channel dimension. This type of layer is applied either to revert the bottleneck channel-shrinking effect [9] or to build separable convolutions in conjunction with depthwise CONV (note that pointwise CONV does not operate along the width or height axes, whereas depthwise does) [10].

Concerning other layers, only pooling layers apply sliding windows with adjustable (kh, kw, s) parameters. The usual operation performed is the maximum function, typically applied over 2 × 2 patches. Additionally, the so-called global average pooling is introduced in state-of-the-art CNNs in order to replace the highly memory-consuming FC layers. This type of pooling averages over each entire input 2D fmap. For the remaining layers, which have no parameters, the input data dimensions are the parameters to consider for performance modeling.
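As announced above, the following sketch—illustrative figures only—quantifies the cost reductions of the depthwise and pointwise variants with respect to a standard convolution:

Comparing #OPs of CONV Variants—Python

def conv_ops(h_out, w_out, c_in, c_out, k):
    ''' #OPs of the CONV variants above, for an output fmap sized h_out x w_out '''
    standard = (k * k * c_in) * (h_out * w_out * c_out)
    depthwise = (k * k * 1) * (h_out * w_out * c_in)       # one k x k filter per channel
    pointwise = (1 * 1 * c_in) * (h_out * w_out * c_out)   # 1x1 kernels
    return standard, depthwise, pointwise

std, dw, pw = conv_ops(56, 56, 64, 128, 3)
# Separable convolution (depthwise + pointwise) saving, close to k^2 for large c_out:
print(std / (dw + pw))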

5.4.2 Architecture

For each layer composing a particular network, both the input data dimensions and the layer parameters determine the computational load and memory requirements, thus


Fig. 5.6 Macroarchitecture of PreVIousNet [11] used in the modeling stage of the proposed method. a The first architecture encompasses seven types of layers with different settings (PreVIousNet-01), whereas b the second structure (PreVIousNet-02) covers the vectorized FC and Softmax layers. The network input dimensions at the first level—H, W, and C—are adjustable variables of the network

affecting the execution performance. Based on this fact, PreVIousNet aims at covering a wide range of settings within the architectural design space. For the main architecture of PreVIousNet—denoted as PreVIousNet-01 in Fig. 5.6a—this is achieved as follows:
1. Data dimensions, and thus the computational load, progressively increase as the network goes deeper—i.e., moving rightward in Fig. 5.6a through the stacked levels of the network. The network input dimensions at the first level—H, W, and C—are adjustable variables that allow covering different data volume sizes.
2. Diverse layers and parameters are contemplated in parallel branches inserted at each level—vertically arranged in Fig. 5.6a.

Based on these key points, PreVIousNet-01 was designed as follows. At each level, a set of parallel CONV layers covers the aforementioned strategies: standard convolutions, pointwise, depthwise, and bottlenecks. Layers individually operating on activations (BN, scale, and ReLU) are introduced at different levels and network branches in order to operate on diversely shaped intermediate fmaps. Likewise, pooling layers with diverse settings are introduced at different levels. Finally, multiple-input layers (Eltwise and Concat) operate on equally sized pairs of tensors coming from previous branches of the network. Note that FC and softmax layers deal with 1D data vectors instead of 3D tensors; in addition, FCs are memory intensive. Therefore, these layers are not included in PreVIousNet-01 but in a different architecture, PreVIousNet-02, depicted in Fig. 5.6b. Following the same design guidelines, various sets of parallel FC layers deal with a diversity of input and output vector sizes. The network input is an adjustable 1 × 1 × C vector. Vectors of various sizes are then processed by the layers: either resulting from applying an expansion factor to the input (2C, 4C, etc.),


or using customized vector lengths K_i. For instance, K_1 = 10 and K_2 = 1000 are common output sizes in classification networks trained on ImageNet, CIFAR, and MNIST. Similarly, softmax layers operate on differently sized vectors. All in all, PreVIousNet comprises two compact specialized networks that include 52 layers—15 CONV, 7 BN, 7 scale, 7 ReLU, 6 pooling, 5 Eltwise, and 5 Concat—plus 44 1D data layers—32 FC and 12 softmax. Further details about the architecture are included in its model definition, publicly available in Caffe format [11]. For instance, although no weight file is provided, loading the network in Caffe will automatically initialize the weights according to the "MSRA" initialization scheme [12]. Interestingly, the proposed architecture can be adjusted according to the features of the networks to be characterized in the performance prediction stage.

Finally, note that PreVIousNet constitutes one particular approach to characterizing CNN layers; alternative networks or systematic benchmarking methods would also be valid. For instance, the characterization of recurrent building blocks of highly optimized architectures as a whole—e.g., the Fire module of SqueezeNet or the separable convolutions of MobileNets—could constitute an alternative.

5.4.3 Network Configuration for the Modeling Stage

As previously mentioned, the adjustable input size of PreVIousNet-01 (H × W × C) can be set according to the most common tensor sizes handled by CNNs. For instance, consider SqueezeNet [13], a popular lightweight CNN. Input volumes in the layers composing this network range from 3 to 512 channels, whereas both their height and width decrease following the sequence 227−113−56−28−14. Accordingly, a characterization of PreVIousNet with varied input tensor sizes allows collecting enough data to build accurate prediction models. In the modeling stage, we empirically set the following four input dimensions for PreVIousNet-01: (i) 56 × 56 × 32, (ii) 28 × 28 × 64, (iii) 14 × 14 × 64, and (iv) 7 × 7 × 64. Thus, we are particularly sampling⁴ CONV layers with Hin = Win = {56, 28, 14, 7} and Cin = {32, 64, 128, 256}. The regression models will accurately interpolate or extrapolate performance values from these ranges, as will be shown in Sect. 5.5. Concerning PreVIousNet-02, the particular realization characterized uses an input vector sized 1 × 1 × 256. Consequently, the common vector lengths covered are Cin = {256, 512, 1024, 2048, 4096}, plus the customized sizes {K_1, K_2} = {10, 1000}. Note that for specific architectures of interest, different input configurations could be used.

⁴ Note that in PreVIousNet-01, the channel dimension increases over network levels, whereas the input fmap resolution remains constant. The purpose is to design a simplified network to be characterized under different configurations.


5.5 Experimental Tests

5.5.1 Embedded System

To test the performance of the proposed prediction methodology, the baseline system consisted of the low-cost RPi3B platform running CNNs on Caffe. Once the effectiveness of PreVIous on this baseline case was verified, the experimental tests were extended by changing first the software—OpenCV 4.0.1 on RPi3B—and then the hardware—Caffe on Odroid-XU4. This second hardware device is more suitable for high-performance IoT applications, presenting a multi-core architecture arranged into two clusters that can be tuned for higher performance or power efficiency. Further details about these embedded devices are included in Table 5.2. All the experiments were performed on the quad-core CPU included in the RPi3B or on the "big" cluster at 2 GHz in Odroid-XU4. The operating systems were booted in console mode to boost inference performance and reduce power consumption.

5.5.2 Networks

We applied the PreVIous methodology to predict the runtime and energy consumption of CNNs when running on the aforementioned systems. The modeling stage included the performance profiling of PreVIousNet under the five configurations specified in Sect. 5.4.3—four for PreVIousNet-01 and one for PreVIousNet-02. Then, the vector y in Eq. (5.2) was built for both runtime and energy consumption. Likewise, the observation matrix X in Eq. (5.2) was obtained for each type of layer according to its corresponding architectural parameters.
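As an illustration of how such an observation matrix can be assembled from the profiling records, consider the sketch below; the record format and the figures are hypothetical, not the actual PreVIous data layout.

Assembling X and y per Layer Type—Python

import numpy as np

# Hypothetical profiling records, one per characterized layer:
records = [
    {'type': 'Convolution', 'n_w': 1728,  'ops': 86704128,  'mem_ops': 1004800, 'time_ms': 8.3},
    {'type': 'Convolution', 'n_w': 36864, 'ops': 462422016, 'mem_ops': 1642496, 'time_ms': 21.7},
    {'type': 'Pooling',     'n_w': 0,     'ops': 3211264,   'mem_ops': 1003520, 'time_ms': 1.9},
]

def build_observations(records, layer_type):
    ''' Gather predictors (n(W), #OPs, #memOPs) and response (runtime) per layer type '''
    rows = [r for r in records if r['type'] == layer_type]
    X = np.array([[r['n_w'], r['ops'], r['mem_ops']] for r in rows], dtype=float)
    y = np.array([r['time_ms'] for r in rows])
    return X, y

X_conv, y_conv = build_observations(records, 'Convolution')
print(X_conv.shape, y_conv)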

Table 5.2 Features of the embedded platforms included in the studied IoT vision systems

| | Raspberry Pi 3B | Odroid-XU4 |
|---|---|---|
| SoC | Broadcom BCM2837 | Octa-core Exynos 5422 big.LITTLE |
| Processor | Quad-core ARM Cortex-A53 @ 1.2 GHz | Quad-core ARM Cortex-A15 @ 2 GHz + quad-core ARM Cortex-A7 @ 1.4 GHz |
| Accelerator | VideoCore IV GPU | Mali-T628 MP6 GPU |
| Memory | 1 GB LPDDR2 RAM @ 900 MHz | 2 GB LPDDR3 RAM @ 933 MHz |
| OS | Raspbian v9.4, Linux kernel v4.14 | Ubuntu v16.04, Linux kernel v4.14 |
| GPIO | 40-pin GPIO | 40-pin GPIO |
| Dimensions | 85 × 56 × 20 mm³ | 83 × 58 × 20 mm³ |
| Power | MicroUSB socket, 5 V/2.5 A | Barrel jack, 5 V/4 A |
| Cost | ∼30 € | ∼43 € |


Table 5.3 Popular CNN architectures employed for testing the performance prediction capability of PreVIous. The regression models built are applied to the networks' constituent layers, summarized here

| Network | Dataset | Layers |
|---|---|---|
| AlexNet | ImageNet | 5 CONV, 3 FC, 3 pooling, 7 ReLU, 1 softmax, 2 LRN⁵ |
| All-CNN-C | CIFAR-10 | 9 CONV, 1 pooling, 9 ReLU, 1 softmax |
| MobileNet | ImageNet | 27 BN, 28 CONV, 1 pooling, 27 ReLU, 27 scale, 1 softmax |
| ResNet-18 | ImageNet | 21 BN, 21 CONV, 8 Eltwise, 1 FC, 2 pooling, 17 ReLU, 21 scale, 1 softmax |
| SimpleNet | CIFAR-10 | 10 BN, 13 CONV, 1 FC, 5 pooling, 13 ReLU, 10 scale, 1 softmax |
| SqueezeNet | ImageNet | 8 Concat, 26 CONV, 4 pooling, 26 ReLU, 1 softmax |
| Tiny YOLO | COCO | 8 BN, 9 CONV, 6 pooling, 8 ReLU, 8 scale |

⁵ AlexNet also includes two local response normalization (LRN) layers that have not been included in the layerwise prediction because they were deprecated in favor of other normalization layers such as BN.

The prediction stage was conducted on seven popular CNNs, most of them designed for embedded inference: AlexNet [14], All-CNN-C [8], MobileNet [10], ResNet-18 [9], SimpleNet [15], SqueezeNet [13], and Tiny YOLO [16]—see Table 5.3. These networks cover classification on ImageNet and CIFAR-10 or object detection on COCO. All in all, 399 CNN layers were assessed in this extensive study.

5.5.3 Experimental Results: Layerwise Predictions

The precision of the per-layer prediction models resulting from PreVIous can be assessed by comparing layerwise predictions with the actual experimental profiling of the considered networks under test. For instance, a fine-grained layer-level performance characterization of a particular CNN is shown in Fig. 5.7. It compares the actual profiling of All-CNN-C [8] with the model estimations for the RPi–Caffe implementation. The prediction model correctly identifies "conv2" and "conv5" as the layers consuming most of the inference time and energy. This is an example of how useful this methodology can be for optimization algorithms or NAS methods.

PreVIous is designed for the prediction of inference time and energy consumption; their combination also allows estimating the average power consumption as follows:

P̂_l = Ê_l / t̂_l    (5.7)

where Ê_l, t̂_l, and P̂_l are the energy, runtime, and power predictions for layer l, respectively. Figure 5.8 exemplifies the per-layer power consumption predictions P̂_l



Fig. 5.7 Layerwise measured profiling versus predictions for All-CNN-C [8] running on the baseline system, i.e., RPi–Caffe

as a function of the predicted runtime t̂_l for the complete All-CNN-C model. Note that this average power estimation is similar to the actual measured profiling outlined in Fig. 5.2a. Applying the methodology to all the considered networks, we obtain per-layer predictions, as shown in Fig. 5.9 for the inference time on the baseline test system. Note that the runtime predictions are accurate when compared with the empirical measurements.
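Applying Eq. (5.7) is direct once both per-layer models have been evaluated; a minimal sketch with made-up predictions:

Per-layer Power from Runtime and Energy Predictions—Python

t_hat = [12.4, 35.1, 3.2]    # hypothetical runtime predictions (ms)
e_hat = [41.0, 122.5, 9.8]   # hypothetical energy predictions (mJ)

# Eq. (5.7): average power per layer; mJ/ms yields watts
p_hat = [e / t for e, t in zip(e_hat, t_hat)]
print(['{:.2f} W'.format(p) for p in p_hat])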


Fig. 5.8 Layerwise power consumption predictions, expressed as P̂_l = Ê_l/t̂_l, during the All-CNN-C network inference. The temporal duration of each layer (x-axis) is also a prediction (Fig. 5.7a). Note that the power, represented on the y-axis, keeps constant for each layer owing to the average estimation derived from the aforementioned equation. The predicted energy demand (Fig. 5.7b) corresponds to the colored area

Fig. 5.9 Per-layer inference runtime predictions (y-axis) compared with actual measurements (x-axis) for the seven networks running on the baseline combination RPi–Caffe. Similar results are obtained for RPi–OpenCV and XU4–Caffe. The dashed line depicts an ideal estimation in which predictions exactly match actual measurements


Figure 5.9 highlights the non-negligible contribution of certain layers, even dominant in some cases, e.g., pooling in SqueezeNet or BN in MobileNet and ResNet, to the global inference runtime. This proves the importance of their consideration for performance modeling, as opposed to approaches exclusively focused on CONV and FC layers.

When adding up the layer-level predictions for all layers composing each particular network, the total runtime can be estimated, as shown in Table 5.4 for RPi–Caffe. Each row reports the total measured (t) and predicted (t̂) runtime, as well as the prediction error, according to the following equations:

t = Σ_{l=1}^{N_l} t_l    t̂ = Σ_{l=1}^{N_l} t̂_l    error = (t̂ − t)/t    (5.8)

where t_l denotes the per-layer measurements, t̂_l the per-layer predictions, and N_l the number of layers in the particular CNN. The last row in the table shows the mean absolute percentage error (MAPE) over the seven considered networks:

MAPE = (100%/n) Σ_{i=1}^{n} |(ŷ_i − y_i)/y_i|    (5.9)

where y_i, ŷ_i are the actual and predicted values and n is the number of test samples considered—seven in this case. Layerwise measurements, predictions, and errors can be obtained for energy consumption as follows:

E = Σ_{l=1}^{N_l} E_l    Ê = Σ_{l=1}^{N_l} Ê_l    error = (Ê − E)/E    (5.10)

where E_l and Ê_l represent the per-layer energy measurements and predictions, respectively. All in all, Table 5.5 shows the accurate prediction results achieved for all tested systems, CNNs, and measured metrics. For each vision system, this table presents the MAPE, given by Eq. (5.9), over the seven considered CNNs resulting from applying Eqs. (5.8) and (5.10).
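Equations (5.8)–(5.10) reduce to a few NumPy reductions; the sketch below uses invented per-layer runtimes only to show the arithmetic.

Total Runtime, Error, and MAPE—Python

import numpy as np

t_meas = np.array([10.2, 33.8, 3.5])   # hypothetical per-layer measurements (ms)
t_pred = np.array([11.0, 31.9, 3.3])   # hypothetical per-layer predictions (ms)

t, t_hat = t_meas.sum(), t_pred.sum()  # Eq. (5.8)
print('error = {:.2%}'.format((t_hat - t) / t))

def mape(y, y_hat):
    ''' Mean absolute percentage error, Eq. (5.9) '''
    return 100.0 * np.mean(np.abs((y_hat - y) / y))

print('MAPE = {:.2f}%'.format(mape(t_meas, t_pred)))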


Table 5.4 Predictions of total runtime on the RPi–Caffe system based on per-layer values. The corresponding profiling data are depicted in Fig. 5.9

| Network | Measured t (ms) | Predicted t̂ (ms) | Error (%) |
|---|---|---|---|
| AlexNet | 561.64 | 526.75 | −6.21 |
| All-CNN-C | 115.68 | 123.22 | 6.52 |
| MobileNet | 943.73 | 908.74 | −3.71 |
| ResNet-18 | 1032.84 | 1049.10 | 1.57 |
| SimpleNet | 347.59 | 349.22 | 0.47 |
| SqueezeNet | 348.15 | 343.54 | −1.32 |
| Tiny YOLO | 1691.37 | 1740.03 | 2.88 |
| Mean absolute percentage error (MAPE) | | | 3.24 |

Table 5.5 MAPE for predictions based on per-layer values on the seven tested CNNs

| System | Runtime (%) | Energy (%) |
|---|---|---|
| RPi3B–Caffe | 3.24 | 5.30 |
| RPi3B–OpenCV | 4.08 | 3.63 |
| XU4–Caffe | 3.82 | 5.01 |

5.5.4 Experimental Results: Network Predictions

Equations (5.8) and (5.10) in Sect. 5.5.3 represent a widely used procedure to estimate the global forward pass performance of a CNN: adding up per-layer metrics. Indeed, this is an assumption underlying several works on network optimization or NAS [3, 5, 17, 18]. Per-layer aggregation should be valid to estimate the global metrics given that layers are sequentially executed during network inference in many realizations. However, in practice, this direct addition of independently taken per-layer measurements may not reflect the actual network performance metrics [19–21]. This mismatch arises from computational optimizations that can be applied during the forward pass, such as those related to software algorithms (e.g., layer fusion or constant folding) or processor strategies (e.g., data prefetching or data reutilization within the memory hierarchy).

For further prediction accuracy taking this identified mismatch into account, additional experimental data were collected during complete network inference without layer separation. From these forward pass characterizations and the previous per-layer profiling, we can find a simple expression to predict network inference, thus replacing Eq. (5.8) or (5.10) with:

ŷ = c Σ_{l=1}^{N_l} ŷ_l    (5.11)


Table 5.6 Empirical coefficient c in Eq. (5.11) for the three studied systems

| System | Runtime | Energy |
|---|---|---|
| RPi3B–Caffe | 0.88 | 1.08 |
| RPi3B–OpenCV | 0.85 | 0.89 |
| XU4–Caffe | 0.93 | 1.09 |

Table 5.7 MAPE for forward pass predictions on the seven testing CNNs

| System | Runtime (%) | Energy (%) |
|---|---|---|
| RPi3B–Caffe | 5.02 | 8.52 |
| RPi3B–OpenCV | 7.92 | 7.24 |
| XU4–Caffe | 3.25 | 10.46 |

where ŷ stands for the total predicted runtime or energy, ŷ_l represents the per-layer predictions for either runtime or energy, and c is a mismatch-fitting coefficient. This factor c results from a linear regression between the aggregated per-layer predictions (Σ ŷ_l) and the corresponding actual forward pass measurements (y), adjusted over the five configurations of PreVIousNet on each visual system. The resulting coefficients c for each system are shown in Table 5.6.

The proposed performance prediction framework allows evaluating complete network inference by using Eq. (5.11). Performance predictions and experimental forward pass measurements are compared in Fig. 5.10a, b for runtime and energy, respectively.⁶ Note that the complete forward pass estimations are also accurate, with deviations below 10% in 18 out of the 21 studied cases for runtime and in 15 out of 21 cases for energy. Table 5.7 summarizes the MAPE on forward pass performance prediction.

⁶ For a fair comparison on AlexNet, actual profiling measurements of the deprecated LRN layers were added to the summation in Eq. (5.11).
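One way to obtain c—a sketch under the assumption of a single-coefficient regression through the origin, with hypothetical data for the five PreVIousNet configurations—is:

Fitting the Mismatch Coefficient c—Python

import numpy as np

sum_pred = np.array([120.3, 95.7, 64.2, 41.8, 22.5])  # aggregated per-layer predictions (ms)
fwd_meas = np.array([105.1, 84.9, 57.0, 36.2, 20.1])  # measured forward pass runtimes (ms)

# Least-squares slope through the origin: c = sum(x*y) / sum(x^2)
c = np.dot(sum_pred, fwd_meas) / np.dot(sum_pred, sum_pred)
print('c = {:.2f}'.format(c))  # then y_hat = c * sum of per-layer predictions, Eq. (5.11)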

5.5.5 Analysis

Let us next analyze the reasons for the less accurate predictions in some networks—for instance, AlexNet, especially when modeling energy on Odroid-XU4; in fact, runtime and energy consumption were underestimated in all cases involving AlexNet—note the negative errors shown in Fig. 5.10. This behavior can be explained in terms of the mismatch between per-layer and complete-CNN inference performance, characterized with the c term through the simple Eq. (5.11). A single coefficient c per software–hardware system may not accurately describe the empirical behavior in all cases. For example, delving into the worst prediction case (energy estimation


Fig. 5.10 Network predictions versus actual measurements for all the assessed CNNs and software–hardware systems


of AlexNet⁷ on XU4–Caffe), the actual ratio between per-layer and complete-CNN measurements is 1.60, whereas it is characterized by a factor of 1.09 (Table 5.6), notably lower. The same issue is identified for MobileNet's energy consumption on RPi–Caffe and XU4–Caffe. Note that, as a whole, Fig. 5.10 reports absolute errors below 10% for most of the 42 studied cases (although most network prediction errors exceed 5%). By contrast, the per-layer predictions of PreVIous summarized in Table 5.5 were certainly more accurate, with absolute network errors below 5% in the vast majority of the 42 cases and below 10% in 95% of the cases.

Finally, it is worth noting that performance metrics can slightly fluctuate between network executions according to the state of the system at the moment of running the CNN. Indeed, this is why the measurement data were averaged over N = 50 executions. Therefore, conducting the modeling stage again will produce slightly different predictions and errors. Interestingly, the prediction errors from PreVIous are on the order of the experimental fluctuations observed when repeating executions. This suggests that much better prediction accuracy may not be achievable.

⁷ AlexNet is a starting point in the CNN paradigm. Therefore, its architecture includes deprecated layers and a high number of weights, somewhat differing from state-of-the-art architectures.

5.6 Summary and Advanced Topics

The methodology developed in this chapter offers accurate predictions of the performance of CNNs on CPU-based embedded vision devices. It is a simple procedure that (i) requires a single systematic network characterization, (ii) relies on linear regression models, and (iii) is agnostic with respect to the selected hardware–software system. Moreover, a greater diversity of network layers has been considered with respect to previous studies, which mostly focused on CONV layers. The application of this methodology to IoT vision systems is twofold. First, fine-grained layer-level performance estimation assists network architecture design and optimization in order to create ad hoc networks optimally adapted to the system. Second, network performance prediction facilitates the task of selecting CNNs according to prescribed requirements, such as latency and battery lifetime. Finally, a comprehensive characterization of the layers composing seven networks on two different software frameworks and two edge devices has demonstrated the prediction capacity of this method.

Regarding hardware, this chapter focused on the multi-core CPUs available on two different low-power, low-cost platforms, but the methodology could be extended to other types of devices. In fact, similar approaches have been applied to a diversity of platforms, such as those summarized below. However, many of the hardware devices


characterized with these procedures may not qualify as IoT/edge vision systems due to their energy consumption and/or cost.

• Augur [21]. Focused on two edge platforms—NVidia TK1 and TX1, using both CPU and GPU—this work estimates runtime and energy when running CNNs on Caffe. The prediction models are based on performance linearity with respect to the dimensions of the matrix data involved in the CONV layers. Up to four networks—NiN, VGG19M, SqueezeNet, and MobileNet—were used for testing the models, with MAPE on these CNNs between 4 and 40% over all tested cases—four software–hardware systems and two prediction variables (runtime/energy).
• NeuralPower [19]. This procedure adjusts polynomial regression models as a function of layer metrics in order to estimate both runtime and energy consumption. Three layers are contemplated—CONV, FC, and pooling—to build models in a baseline system incorporating TensorFlow on the GPU-based NVidia GTX Titan X platform. MAPE values on a set of five CNNs—VGG16, AlexNet, NIN, Overfeat, and CIFAR-10-6conv—are below 8% for both energy and runtime. Additional tests on other systems include the Titan X GPU platform with the Caffe library, with average errors ranging from 8 to 21%.
• SyNERGY [22, 23]. Two hardware metrics, like those extracted and analyzed in Chap. 4—bus accesses and SIMD operations—allow building a linear model to estimate CONV performance on the Jetson TX1 CPU running Caffe. A MAPE of approximately 12% over an extensive test set of ten CNNs confirms the energy estimation capacity of this approach. In an extension of this work [24], architectural metrics were employed to build energy predictive models for the Jetson TX1 and Snapdragon 820 platforms running Caffe and Caffe2, with errors ranging from 15 to 24% using a single model predictor.
• Bouzidi et al. [25]. Five ML algorithms to predict the runtime of CNNs on edge GPUs are proposed and compared. The target systems encompass two GPU-based devices—NVidia Jetson AGX Xavier and NVidia Jetson TX2—running Keras–TensorFlow. The resulting prediction models based on architectural metrics are capable of predicting inference runtime with MAPE values in the range from 7 to 26%.
• Accelergy [4, 26]. Particularly focused on the Eyeriss hardware accelerator [27], both memory accesses and the MAC energy cost at each level of the memory hierarchy constitute the basis of an energy estimation methodology.
• Heim et al. [28]. This method optimally implements CNNs on low-power ARM Cortex-M MCUs and includes runtime and energy estimation models. Similar to PreVIousNet, this work varies the hyper-parameters of a simple network in order to characterize the MCU performance for different optimizations and implementations—TFLite and CMSIS-NN libraries, running both non-optimized and quantized models. Linearities between execution time and MAC operations are identified, as well as between energy consumption and measured runtime.
• Additional performance estimation approaches. Simple performance modeling procedures, such as look-up tables, allow accelerating automatic optimization and NAS algorithms [5, 17, 18, 29–31] or reducing training costs on high-end GPUs [32, 33].


In addition, when it comes to estimating the global system power consumption, there are approaches that avoid the need for an external measurement system. Energy can instead be estimated from hardware statistics and linear regressors [34–36]—as pointed out in Chap. 4 through quantitative correlation coefficients.

Acknowledgements This chapter is partially based on the authors' publication [37].

5.7 Appendix

Network profiling for the modeling stage of PreVIous involves measuring "layer execution" performance. To this end, the individual layers composing a CNN must be sequentially executed. Caffe offers specific functions that perform this layer operation, such as the Python method forward() from the class caffe.Net. Concerning OpenCV version 4.0.1, employed in this chapter, no specific function for single-layer execution is provided; however, its open-source code [38] can be edited to create ad hoc C++ functions that can also be called from the Python API.

How to develop a new OpenCV functionality? OpenCV automatically exports its C++ code to Python at compilation time. To this end, some macros are available to specify which functions should be exported to Python [39]. As an example, we can code a new C++ method Net::forward1() in the DNN module of OpenCV that only executes one layer. This method will then be accessible from Python thanks to the following code—inserted in the dnn.hpp header file included in the DNN module of OpenCV [38].

OpenCV DNN module—allowing per-layer inference

class CV_EXPORTS_W_SIMPLE Net
{
    [...]
    CV_WRAP Mat forward1(const String& outputName);
    [...]
};

Note that the CV_EXPORTS_W_SIMPLE and CV_WRAP macros are employed to extend the class Net and its new method Net::forward1(), respectively, at compilation time [39].

Facilitating Network Performance Profiling

To simplify the network profiling process, we suggest creating a Python class per framework, as exemplified below for Caffe and OpenCV.


Caffe Python Class

class network_Caffe:
    def __init__(self, netfiles):
        ''' Create network '''
        self.network = caffe.Net(netfiles['prototxt'], netfiles['caffemodel'], caffe.TEST)  #(1)

    def get_list_of_layers(self):
        ''' Get network layers '''
        return list(self.network._layer_names)[1:]  # omit 'Input' layer

    def inference(self):
        ''' Run CNN inference '''
        return self.network.forward()  #(2)

    def inference1(self, layer):
        ''' Run one-layer inference '''
        self.network.forward(start=layer, end=layer)  #(3)

    def preprocess_and_feed_img(self, file_path, H, W, C, mu, std):
        ''' Read image and set it as network input '''
        # [load and preprocess image]
        self.network.blobs['data'].reshape(1, C, H, W)
        self.network.blobs['data'].data[...] = NCHW_img

OpenCV Python Class

class network_OpenCV:
    def __init__(self, netfiles):
        ''' Create network '''
        self.network = cv2.dnn.readNetFromCaffe(netfiles['prototxt'], netfiles['caffemodel'])  #(1)

    def get_list_of_layers(self):
        ''' Get network layers '''
        return [k for k in self.network.getLayerNames()]

    def inference(self):
        ''' Run CNN inference '''
        return self.network.forward()  #(2)

    def inference1(self, layer):
        ''' Run one-layer inference '''
        self.network.forward1(outputName=layer)  #(3)

    def preprocess_and_feed_img(self, file_path, H, W, C, mu, std):
        ''' Read image and set it as network input '''
        # [load and preprocess image]
        self.network.setInput(NCHW_blob)


Note that:
(1) The netfiles Python dictionary contains the files for loading the network on the particular DL framework. The specific framework function is then called to load that CNN architecture.
(2) Visual inference is facilitated by the method inference(), which executes a complete network forward pass.
(3) Executing one specified layer is possible through the inference1() method. As explained above, it relies either on the forward() Caffe method or on forward1() from OpenCV.

Performance Profiling Measurement

By using the classes network_Caffe and network_OpenCV defined above, it is easy to measure per-layer performance on each framework, which is needed for the modeling stage of PreVIous.

Time profiling. Loading a network and averaging runtime over N executions per layer could be done as follows.

Profiling Per-layer Inference Time—Python

import time
import numpy as np

# Create network & pre-process image (network_OpenCV would work equally):
network = network_Caffe({'prototxt': prototxt_file, 'caffemodel': caffemodel_file})
network.preprocess_and_feed_img(file_path, H, W, C, mu, std)
list_layers = network.get_list_of_layers()

# Measure time:
tlayers = np.zeros((len(list_layers), N))
for n in range(N):
    for k in range(0, len(list_layers)):
        t0l = time.time()
        network.inference1(list_layers[k])
        # register per-layer execution time (ms) on repetition 'n':
        tlayers[k, n] = 1000.0 * (time.time() - t0l)
    time.sleep(t_sleep)  # avoid thermal throttling
for k in range(0, len(list_layers)):
    print('{}) \t{}: \t{:.2f} ms'.format(k, list_layers[k], np.mean(tlayers[k])))
# [save data into a file]

Energy profiling. To monitor the energy consumption, we need a separate system that interfaces with the embedded device, such as the one used in Chaps. 3 and 4—some platforms also incorporate vendor-specific power meter tools to facilitate energy measurement. Experimental layerwise measurements must be based on known events, such as the start time and the individual layer execution time t_l. Furthermore, in order to simplify the process, we can register the energy consumption of all the


Fig. 5.11 Power signal corresponding to layerwise execution of All-CNN-C on the system RPi– Caffe, with an idle time interval of 300 ms between layers. The identification of the starting and ending events for each layer—dotted vertical red lines—allows determining the energy consumption of the layers—colored areas. Complete energy profiling is extracted after integrating the power signal on the identified time intervals

layers composing a CNN in a single file. In such a case, to delimit the start time corresponding to each layer, we set an idle time interval t_idle between layers. All in all, the following code performs N repetitions of each layer execution, with a prescribed interval between layers. In addition, it also registers the exact runtime required by each layer.

Per-layer Inference to Manually Measure Power Consumption

# Measure power:
raw_input('>> Please, start measuring and press ENTER to start inference: ')
tlayers = []
for k in range(0, len(list_layers)):
    t0l = time.time()
    for n in range(N):
        network.inference1(list_layers[k])
    # register execution time (ms) for N layer executions:
    tlayers.append(1000.0 * (time.time() - t0l))
    time.sleep(t_idle)  # facilitates layer identification
print('>> Finished. Please, stop measuring')

Therefore, it is possible to establish the start time associated with each layer l as Σ_{i=1}^{l−1} t_i + (l − 1)·t_idle, and thus automatically process the measured data. For instance, Fig. 5.11 depicts the registered power consumption while running the code above for N = 50 executions per layer on the All-CNN-C architecture (the sampling period of the power signal was 40.96 µs). The starting and ending time events allow easy averaging of the layers' energy consumption.⁸

⁸ Note that the effect of data prefetching and branch prediction on the CPU, already illustrated in Chap. 4, is also observable in the signal in Fig. 5.11. That is, power consumption keeps at a moderate level even once a layer execution is completed.
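Building on the start-time expression above, the measured power trace can be segmented and integrated as sketched below; the variable names and the rectangle-rule integration are our own simplifications.

Segmenting the Power Trace into Per-layer Energies—Python

import numpy as np

Ts = 40.96e-6  # sampling period of the power signal (s)

def layer_energies(power, tlayers, t_idle, N):
    ''' Integrate the power trace over each layer interval; average over N runs.
        power: 1-D array (W); tlayers: per-layer runtimes of the N executions (ms) '''
    energies = []
    t_start = 0.0
    for t_l in tlayers:
        i0 = int(t_start / Ts)
        i1 = int((t_start + t_l / 1000.0) / Ts)
        energies.append(power[i0:i1].sum() * Ts / N)  # rectangle-rule integral (J)
        t_start += t_l / 1000.0 + t_idle              # next layer start time
    return energies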


Hardware statistics profiling. Additionally, hardware statistics also constitute powerful indicators of power consumption, as mentioned in Sect. 5.6. Although not included in this chapter, the proposed methodology could employ hardware metrics—such as those extracted with the perf tool in Chap. 4—as inputs for the regression models. This would require profiling those resource exploitation metrics per layer. Again, in order to simplify the measurements, we can use a single script to register the hardware events associated with each layer, as exemplified in the code below.

Profiling Per-layer Hardware Statistics—Python and External Processes

import os
import subprocess
import time
import psutil
from threading import Event, Thread

event_start = Event()  #(1) start inference
event_stop = Event()   #(1) stop perf
pid = os.getpid()
py = psutil.Process(pid)

# Function to be spawned in a new thread:
def forward_1_layer(net, layer, N):  #(2)
    ''' execute N layer executions and register execution time '''
    event_start.wait()  #(1) get blocked until the flag becomes true
    # [run per-layer inference and measure time]
    # [save inference time]
    event_stop.set()  #(1) set the flag for stopping perf

# Measure hardware metrics:
for k in range(0, len(list_layers)):
    print('\n[INFO] Running {} for {} times:'.format(list_layers[k], N))
    # [open file 'f' to save perf output]
    # Reset the flags for successive executions:
    event_start.clear()
    event_stop.clear()
    # Spawn thread for per-layer inference:
    th = Thread(target=forward_1_layer, args=(network, list_layers[k], N,))  #(2)
    th.start()  # start a new thread per layer
    tid = py.threads()[-1].id  # thread ID
    # Start perf:
    command = ["sudo perf stat -I 10 record -o out.data -e $EVENTS -t " + str(tid) + " sleep " + str(t_max)]  #(3)
    proc = subprocess.Popen(command, shell=True, stderr=f, stdout=subprocess.PIPE)
    event_start.set()  #(1) set the flag to start operations within the thread
    # Wait and stop perf:
    event_stop.wait()  #(1) get blocked until the inference is over
    # [kill launched sub-processes]
    th.join()  # wait until thread actually finishes
    time.sleep(t_sleep)  # avoid thermal throttling
    # [close file 'f']


where:
(1) Two Python events establish communication between the Python threads: event_start announces that perf is already measuring hardware metrics, while event_stop indicates the end of the layer inference.
(2) A new Python thread is created for each layer execution.
(3) The perf tool then extracts hardware metrics related to that thread—with identifier tid. This perf command is executed in a new system subprocess.

As a result, one file per layer will contain hardware statistics gathered with a sampling period of 10 ms. These data can be processed similarly to how the power consumption signal was processed—taking into account the layers' execution time⁹—to finally obtain the total number of hardware events associated with each layer execution on the target system.

⁹ Note that here, again, the branch prediction strategy can affect the hardware statistics.

Architectural Metrics Profiling

Parsing the network definition—in the prototxt file from Caffe, also compatible with OpenCV—we can extract useful architectural information for both the modeling and prediction steps of the procedure.

Profiling Per-layer Architectural Metrics—Caffe Python

def analyze_net(prototxt_file):
    ''' Get by-layer information from prototxt '''
    # (1) Activations and weights data sizes
    # (from caffe.Net object, containing specific input size):
    net = caffe.Net(prototxt_file, caffe.TEST)
    list_layer_data = []
    for lidx in range(len(net.layers)):
        layer_data = {}  # store layer's information
        layer_data['name'] = net._layer_names[lidx]
        layer_data['type'] = net.layers[lidx].type
        bottoms = [(net._blob_names[bidx], net.blobs[net._blob_names[bidx]].data.shape)
                   for bidx in list(net._bottom_ids(lidx))]
        tops = [(net._blob_names[bidx], net.blobs[net._blob_names[bidx]].data.shape)
                for bidx in list(net._top_ids(lidx))]
        weights = [net.layers[lidx].blobs[bidx].data.shape
                   for bidx in range(len(net.layers[lidx].blobs))]
        layer_data['bottoms'] = bottoms
        layer_data['tops'] = tops
        layer_data['weights'] = weights
        list_layer_data.append(layer_data)

    # (2) Additional layer information (from prototxt file):
    net_proto = caffe.proto.caffe_pb2.NetParameter()
    f = open(prototxt_file, 'r')
    net_proto = text_format.Merge(str(f.read()), net_proto)
    f.close()
    lidx = 1 if net_proto.layer[0].type != 'Input' else 0
    for layer in net_proto.layer:  # see help(net_proto.layer[idx])
        if layer.type == 'Convolution':
            kernel = layer.convolution_param.kernel_size[0]
            stride = layer.convolution_param.stride[0]
            pad = layer.convolution_param.pad[0]
            group = layer.convolution_param.group
            bias = layer.convolution_param.bias_term
            list_layer_data[lidx]['layer_params'] = {'kernel': kernel, 'stride': stride,
                                                     'pad': pad, 'group': group}
        # [parameters extraction for each type of layer]
        lidx += 1

    # (3) Print all the gathered information:
    for lidx in range(len(list_layer_data)):
        print('\n{}) {}, type {}'.format(lidx, list_layer_data[lidx]['name'],
                                         list_layer_data[lidx]['type']))
        print(' -- bottoms {}'.format(list_layer_data[lidx]['bottoms']))
        print(' -- tops {}'.format(list_layer_data[lidx]['tops']))
        print(' -- weights {}'.format(list_layer_data[lidx]['weights']))
        # .get() avoids a KeyError for layers without extracted parameters:
        print(' -- parameters {}'.format(list_layer_data[lidx].get('layer_params')))

Note that:
(1) The loaded network object contains information about input/output volumes, top/bottom relationships, learned weights, etc.
(2) Additional layer parameters can be extracted for each type of layer. This code exemplifies reading CONV layer parameters, such as k, s, p.
(3) All the extracted data can be examined and also saved into a file for future use.

Regression Model

The network profiling performed on a set of CNNs running on a software–hardware system allows building the prediction models proposed in this chapter. To this end, we must first process all the collected data in order to build a comprehensive dataset containing (i) the architectural layer metrics and (ii) the runtime or energy demanded during layer execution.

Once this dataset is available, we can gather the data corresponding to each layer type, {(x_i, y_i)}_{i=1}^{n}, to build a predictive model. This process of model adjustment is facilitated by ML libraries such as scikit-learn [40]. Below is an example of building a linear model with such a Python package.

Basic Linear Regression Model Construction—Python scikit-learn

import numpy as np
from sklearn import linear_model

# [data loading, pre-processing or standardization]  #(1)(5)

# Model:
reg = linear_model.LinearRegression()  #(2)
#reg = linear_model.Ridge()
#reg = linear_model.Lasso()

# Model building (objective function minimization):
reg.fit(X, y)  #(3) adjusted model weights: reg.coef_ and reg.intercept_

# Model prediction:
y_pred = reg.predict(X_test)  #(4)

# Calculate error:
MAPE = 100 * np.mean(abs(y_test - y_pred) / y_test)
print('MAPE = {:.2f}%'.format(MAPE))

(1) In this code, we assume the declaration of Python variables X, y, X_test, and y_test, corresponding to the training data X, y and the test data X′, y′ in Eqs. (5.2) and (5.4), respectively.
(2) To characterize the input data behavior, a diversity of ML techniques can be applied, such as OLS, ridge, Lasso, polynomial regression, and perceptrons.
(3) This ML library minimizes the corresponding objective function when calling the fit() method. The linear model coefficients w are then stored in the coef_ attribute; optionally, the model can include an independent term w0, which is stored in the intercept_ attribute.
(4) Model predictions ŷ on new data—Eq. (5.4)—are provided by the predict() method.
(5) Furthermore, different techniques can boost the model performance. For instance, a usual practice is dataset standardization, which scales the input variables and removes the mean. Note that, in this case, the same transformation must be applied to both the training data X and the test data X_test:

from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X)
# Standardization towards zero mean and unit variance
X_stand = scaler.transform(X)
print('Standardized with mean = {}, scale = {}'.format(scaler.mean_, scaler.scale_))
reg.fit(X_stand, y)
# The same scaler must also transform the test data before predicting
X_test_stand = scaler.transform(X_test)

An example of model construction with the code above is illustrated in Fig. 5.12. This plot shows one-dimensional training data X (a single input variable, p = 1) following a linear trend (blue dots). Once the model is adjusted, its predictions ŷ (y-axis) versus X (x-axis) are also displayed as red squares. Additional information about ML models is documented in [40]. A minimal sketch reproducing this kind of experiment is included after the figure.


Fig. 5.12 One-dimensional linear OLS model built for the displayed X, y training data (blue dots). Model predictions for two test samples are also shown (red squares)
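
For completeness, below is a minimal sketch reproducing a Fig. 5.12-style experiment on synthetic one-dimensional data; the slope, intercept, noise level, and test points are illustrative values, not the profiled layer data used in this chapter.

One-Dimensional OLS Model on Synthetic Data—Python scikit-learn

import numpy as np
from sklearn import linear_model

# Synthetic training data: a single input variable (p = 1) with a noisy linear trend
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 20).reshape(-1, 1)
y = 2.0*X.ravel() + 1.0 + rng.normal(0, 0.5, X.shape[0])

# Fit the OLS model and predict two test samples, as in Fig. 5.12
reg = linear_model.LinearRegression().fit(X, y)
X_test = np.array([[3.5], [8.0]])
y_pred = reg.predict(X_test)
print('w = {:.2f}, w0 = {:.2f}'.format(reg.coef_[0], reg.intercept_))
print('Predictions:', y_pred)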

References 1. Li, D., Chen, X., Becchi, M., Zong, Z.: Evaluating the energy efficiency of deep convolutional neural networks on CPUs and GPUs. In: 2016 IEEE Int. Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) (BDCloud-SocialCom-SustainCom), pp. 477–484 (2016). https://doi.org/10.1109/BDCloud-SocialCom-SustainCom.2016.76 2. Canziani, A., Paszke, A., Culurciello, E.: An Analysis of Deep Neural Network Models for Practical Applications. arXiv (1605.07678) (2016) 3. Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian, Y., Vajda, P., Jia, Y., Keutzer, K.: FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019) 4. Yang, T.J., Chen, Y.H., Sze, V.: Designing energy-efficient convolutional neural networks using energy-aware pruning. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp. 6071–6079 (2016) 5. Yang, T.J., Howard, A., Chen, B., Zhang, X., Go, A., Sandler, M., Sze, V., Adam, H.: NetAdapt: Platform-aware neural network adaptation for mobile applications. In: European Conference on Computer Vision (ECCV) (2018) 6. Sze, V., Chen, Y., Yang, T., Emer, J.: Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE 105(12), 2295–2329 (2017) 7. Odroid XU4. https://wiki.odroid.com/odroid-xu4/odroid-xu4 8. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: The All Convolutional Net. arXiv (1412.6806) (2014) 9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 10. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv (1704.04861) (2017) 11. Velasco-Montero, D., Fernández-Berni, J., Carmona-Galán, R., Rodríguez-Vázquez, A.: GitHub – PreVIous. https://github.com/DVM000/PreVIous.git


12. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: International Conference on Computer Vision (ICCV) (2015)
13. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv (1602.07360) (2016)

To reduce energy consumption, our smart camera will capture and process images instead of video sequences. Capture parameters can be configured by the user.

CNN Classification
A CNN architecture constitutes the core intelligent algorithm to identify the species of interest, thus giving rise to a smart DL-based camera trap. As comprehensively explained in previous sections, the trained three-category network implicitly filters out blank images resulting from false-positive motion detection events. The operation of the CNN on the captured images can be activated or deactivated through user configuration.

Send Notification—Alarm
In case any of the captured images is classified as one of the classes relevant to the RZSS conservation staff, i.e., (i) wildcat or (ii) mixed species, an alarm will be sent to the park staff by using an e-mail service. This notification will include the images themselves and a timestamp. In particular, we employed the mutt tool for the e-mail service. Alternative Linux command-line tools for this purpose include mpack, mail, and sendmail. An example of the dispatched e-mail is presented next.

[RPi-Camera-00] Alarm triggered
Please, find attached images classified as follows:
- wildcat - . (Fri_Jul_23_11:44:11_2021)
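
As a reference, the following is a minimal sketch of how such a notification could be dispatched from the application; the recipient address, camera name, image path, and label are illustrative placeholders, and mutt is assumed to be installed and configured on the platform.

Sending the alarm e-mail via mutt (sketch)—Python

import subprocess
import time

def send_alarm(dst_email, camera_name, image_path, label):
    # Compose subject and body following the e-mail format shown above
    subject = '[{}] Alarm triggered'.format(camera_name)
    timestamp = time.asctime().replace(' ', '_')
    body = ('Please, find attached images classified as follows:\n'
            '- {} ({})\n'.format(label, timestamp))
    # mutt reads the message body from stdin; -a attaches the image file
    subprocess.run(['mutt', '-s', subject, '-a', image_path, '--', dst_email],
                   input=body.encode(), check=True)

send_alarm('staff@example.org', 'RPi-Camera-00', 'frames/img_000.jpg', 'wildcat')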

Application Configuration
The smart camera operation can be configured with different settings, according to the user's preferences. Figure 6.10 expands the application operation workflow presented in Fig. 6.9 by including the configuration parameters. A file is provided in order to define such parameters. Some of them are exemplified below:

Fig. 6.10 Expanded workflow of the smart camera application developed, including configuration parameters. The basic steps are (1) main loop for motion detection, (2) capturing frames when motion is detected, (3) CNN classification on frames, and (4) checking whether to send a notification

Application software—configuration file

# (1) detection
READ_DELAY=
CONFIRM_GAP=
# (2) camera capture
WIDTH=
HEIGHT=
READ_FPS=
POST_CAPTURE=
EVENT_GAP=
FRAMES_FOLDER=
# (3) CNN
CNN_FILE=
LABEL_FILE=
TARGET_LABELS=
# (4) alarm
CAMERA_NAME=
DST_EMAIL=
# [extra settings]

(1) We can tune the time interval between PIR readings (READ_DELAY seconds). A second reading after motion detection can be requested; to this end, a confirmation period must be established through CONFIRM_GAP (a zero value indicates that no confirmation is required).
(2) The camera operation after motion detection is also defined. It will operate in burst mode, capturing READ_FPS × POST_CAPTURE images per camera trigger.
(3) The CNN for inference could be replaced in the future by a new model that performs better. In such a case, we only need to indicate its corresponding files. In addition, the user can specify which output categories of the new network trigger the alarm (TARGET_LABELS).
(4) The notification e-mail including the corresponding images will be sent to the DST_EMAIL account—only if any of the captured images is classified into the target categories.

In particular, the default configuration is as follows. The camera resolution is set to 320 × 240, a standard resolution close to the CNN input size (227 × 227). The PIR sensor is read once per second, and once the camera is triggered, the system conducts inference on ten images captured at 2 fps (during 5 s). The example e-mail above comes from a smart camera named "RPi-Camera-00" (CAMERA_NAME) that has just detected a Scottish wildcat.

By employing general-purpose low-power embedded platforms in conjunction with DL software libraries, we endow our visual system with flexibility and programmability: we can configure the operation parameters, or even select diverse CNN architectures, as sketched below.
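
To make this concrete, the following is a minimal sketch of how the application could load such a KEY=VALUE configuration file at start-up; the parsing logic, the file name, and the default values are illustrative, not the exact implementation deployed on the camera.

Loading the configuration file (sketch)—Python

def load_config(path):
    # Parse simple KEY=VALUE lines, skipping blanks and '#' comments
    config = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            key, _, value = line.partition('=')
            config[key.strip()] = value.strip()
    return config

config = load_config('camera.conf')
read_delay = float(config.get('READ_DELAY', 1))        # PIR polling period (s)
target_labels = config.get('TARGET_LABELS', '').split(',')
print(read_delay, target_labels)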

6.6 Experimental Tests

System Setup
Figure 6.11 shows the experimental setup employed for the different performance tests carried out in the laboratory. The central processing device is the RPi4B, equipped with an off-the-shelf camera module [32] and the Panasonic EKMB long-distance PIR sensor [36]. We disconnected extra peripherals such as the HDMI interface or the keyboard for the sake of greater similarity with real operation scenarios.

Performance Metrics
Concerning CNN inference, we measured our three-category SqueezeNet model running on three software alternatives compatible with TF–Keras: (a) TFLite, (b) TFLite after quantizing ("Q") the model weights [52], and (c) OpenCV after adapting the TF model format [53]. The throughput achieved on the RPi4B is shown in Fig. 6.12a; OpenCV clearly outperforms the other two frameworks. A sketch of the conversion step for alternatives (a) and (b) is provided below.

Given that the application scenario demands a low-energy system, we measured and compared the power consumption of the different alternatives available to implement the steps shown in Fig. 6.9. First, we measured the current required by the system in idle state, which is 0.41 A.
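
A minimal sketch of these two conversions is shown next, assuming a trained Keras model and TensorFlow 2.x; the file names are illustrative placeholders. The quantized ("Q") variant only adds the optimizations flag to the converter [50, 52].

TFLite conversion and weight quantization (sketch)—Python TensorFlow

import tensorflow as tf

# Load the trained three-category Keras model (illustrative path)
model = tf.keras.models.load_model('squeezenet_3class.h5')

# (a) Plain TFLite conversion
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# (b) "Q": post-training quantization of the model weights
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_q_model = converter.convert()

with open('squeezenet_3class_q.tflite', 'wb') as f:
    f.write(tflite_q_model)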


Fig. 6.11 Embedded vision system designed and implemented for the case study. The low-form-factor, low-cost RPi4B platform constitutes the core device. Two specific peripherals, i.e., a camera module and a PIR motion sensor, were integrated.

1. The current consumed hardly departs from 0.41 A if the PIR is detached—note that, according to the PIR datasheet [36], its consumption is in the range of microamperes. The motion and PiKrellCam software tools make the system consume 0.49 A and 0.43 A, respectively. This implies a significant increase with respect to the use of the PIR sensor, which is why both tools were finally dismissed.
2. Concerning image capture, the raspistill tool leads to a system consumption of 0.48 A, whereas the OpenCV C++ capture function leads to 0.46 A. The latter was therefore selected because of its lower consumption.
3. Finally, the CNN software tools were also compared in terms of power consumption and energy per image (Fig. 6.12b, c, respectively). The values in the plots correspond to the total consumption of the system. These results confirm that OpenCV is clearly the best option.

Further details on power measurements
Note that the capacity of batteries is commonly specified in milliampere-hours (mAh). Therefore, taking current measurements (let us denote them generically as I) suffices to provide useful information concerning system operation lifetime. Voltage values V around 5.1 V constitute safe levels for correct operation of the RPi4 [37]—notwithstanding, the official RPi power supply powers the platform at 5.2 V.


Fig. 6.12 Performance metrics of the proposed three-category SqueezeNet model when running on three different software frameworks. OpenCV is clearly the best choice

Figure 6.12b, c shows power consumption values calculated as P = V·I and energy demand calculated as E = P·t, assuming an input voltage of 5.1 V. Furthermore, we also measured the current demanded when setting 5.25 V in the Keysight power analyzer, to compensate for the input resistance of our measurement system. We thus confirmed that the current I does not depend on the input voltage V; it only depends on the processing workload.

OpenCV proved to be the most energy-efficient framework while also achieving the highest inference rate.
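
As a quick numerical check of these formulas, consider the CNN processing step, which demands roughly 0.86 A during about 90 ms per image (see Table 6.5); the sketch below simply evaluates P = V·I and E = P·t for those values.

Power and energy per inference (worked example)—Python

V = 5.1      # supply voltage (V)
I = 0.86     # measured current during CNN processing (A)
t = 0.090    # inference + data processing time per image (s)

P = V * I    # instantaneous power (W): ~4.4 W
E = P * t    # energy per processed image (J): ~0.39 J
print('P = {:.2f} W, E = {:.2f} J per image'.format(P, E))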

Overall, the processing requirements of the developed application are summarized in Table 6.5. In addition, the system specifications can be summarized as follows:
• Processing time. The high CNN throughput enables quasi-instantaneous CNN inference. In fact, the time devoted to camera captures (set by POST_CAPTURE) will in general be higher than the CNN processing time.6 Therefore, the system is able to send real-time alarms.
• Power consumption. This metric is critical for remote operation: trips to change batteries should be kept as few as possible. With the energy results reported above and a battery providing 30 Ah, approximately 3 days of continuous operation are ensured; this period would extend to 6 days if night hours are excluded (a back-of-the-envelope check of this estimate is sketched after this list).

⁶ Note in entry 3 of Table 6.5 that the inference runtime includes both the CNN inference itself and data processing. This processing encompasses the latency owing to image loading, network input preprocessing (mean subtraction, scaling, and data reordering), and network output interpretation (looking for the label assigned to the highest value in the output vector).
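
For reference, the following is a minimal sketch of this preprocessing and interpretation pipeline using OpenCV's DNN module [51]; the model file, mean values, scale factor, and label list are illustrative placeholders rather than the exact values used in the deployed application.

CNN input preprocessing and output interpretation (sketch)—Python OpenCV

import cv2
import numpy as np

net = cv2.dnn.readNetFromTensorflow('frozen_squeezenet.pb')  # adapted TF model [53]
img = cv2.imread('frames/img_000.jpg')                       # image loading

# blobFromImage performs scaling, mean subtraction, and HWC-to-NCHW data reordering
blob = cv2.dnn.blobFromImage(img, scalefactor=1.0, size=(227, 227),
                             mean=(104, 117, 123), swapRB=True)
net.setInput(blob)
out = net.forward()

# Output interpretation: label assigned to the highest value in the output vector
labels = ['blank', 'wildcat', 'other']  # illustrative label file contents
print('Predicted:', labels[int(np.argmax(out))])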


Table 6.5 Overall performance results—per camera trigger

Application step            Performance metric   Value
1. Waiting detection        Runtime              Variable
                            Energy               ∼0.43 A
2. Reading frames           Runtime              POST_CAPTURE s
                            Energy               ∼0.46 A
3. CNN + data processing    Runtime              ∼90 ms per image
                            Energy               ∼0.86 A
4. Alarm                    Runtime              Variableᵃ
                            Energy               Variableᵃ

ᵃ E-mail service performance highly depends on the available connectivity status

• System accuracy. The trained CNN achieves an accuracy of approximately 78%—see Table 6.3. However, this figure should be considered jointly with the effectiveness of the motion sensor at reducing the number of blank captures, thereby decreasing the false-alarm rate. Actually, the global system accuracy will highly depend not only on the PIR performance, but also on the system location, environmental conditions, etc.
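
As anticipated above, the following back-of-the-envelope sketch estimates the battery lifetime from the measured currents; the 30-Ah capacity and the assumption that the system spends most of its time in the waiting-detection state (∼0.43 A, Table 6.5) come from the discussion above, while the 12-hour daytime-only duty cycle is an illustrative value.

Battery lifetime estimation (sketch)—Python

capacity_ah = 30.0   # battery capacity (Ah)
i_wait = 0.43        # current while waiting for detection (A)

hours = capacity_ah / i_wait                                       # ~70 h
print('Continuous: {:.1f} days'.format(hours / 24))                # ~2.9 days

daytime_hours = 12.0  # illustrative duty cycle excluding night hours
print('Daytime only: {:.1f} days'.format(hours / daytime_hours))   # ~5.8 days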

6.7 Summary and Advanced Topics

This chapter reviews all the components required to build an embedded vision system for a real application scenario: remote animal monitoring. On-site tests have been delayed because of the COVID-19 situation, but they will eventually be conducted. Still, each development step has been comprehensively analyzed. In particular, a reprogrammable embedded application was developed and implemented on a low-cost device. Its core AI algorithm relies on a CNN for classification whose training procedure was also described. We achieved an accuracy of around 80% on wildlife classification that could be further improved by exploiting the output probabilities per class—confidence thresholds can be set in order to decide whether the output probabilities are high enough to assign the output label to an image. Moreover, the training process of object-detection CNNs was also outlined. It is worth emphasizing that networks for object detection are much more difficult to implement. First, images with annotated bounding boxes are required for training. Second, some CNN libraries do not include all the operations required by object detectors. Finally, their computational load is notably higher than that of CNN classifiers. As an additional asset related to CNN training, this chapter also addresses the concept of "Explainable DL" to understand the network operation. A variety of components at both software and hardware level have also been reviewed.

In summary, networked smart camera traps able to perform inference and communicate results in real time are still experimental and mostly constrained to academic works. The case study hereby analyzed not only serves as an example of system implementation and deployment, but is also of great interest for wildlife conservation tasks.

Acknowledgements The authors would like to thank the staff and volunteers of RZSS Highland Wildlife Park and Scottish Wildcat Action, in particular Mr. David Barclay and Holly Forbes, for the data and support offered for this case study.

References
1. Gomez Villa, A., Salazar, A., Vargas, F.: Towards automatic wild animal monitoring: Identification of animal species in camera-trap images using very deep convolutional neural networks. Ecological Informatics 41, 24–32 (2017). https://doi.org/10.1016/j.ecoinf.2017.07.004
2. Norouzzadeh, M.S., Nguyen, A., Kosmala, M., Swanson, A., Palmer, M.S., Packer, C., Clune, J.: Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning. Proceedings of the National Academy of Sciences 115(25), E5716–E5725 (2018). https://doi.org/10.1073/pnas.1719367115
3. Parham, J., Stewart, C., Crall, J., Rubenstein, D., Holmberg, J., Berger-Wolf, T.: An animal detection pipeline for identification. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1075–1083 (2018). https://doi.org/10.1109/WACV.2018.00123
4. Willi, M., Pitman, R.T., Cardoso, A.W., Locke, C., Swanson, A., Boyer, A., Veldthuis, M., Fortson, L.: Identifying animal species in camera trap images using deep learning and citizen science. Methods in Ecology and Evolution 10(1), 80–91 (2019). https://doi.org/10.1111/2041-210X.13099
5. Guo, Y., Rothfus, T., Ashour, A.S., Si, L., Du, C., Ting, T.F.: A varied channels region proposal and classification network for wildlife image classification under complex environment. IET Image Processing 14 (2019). https://doi.org/10.1049/iet-ipr.2019.1042
6. Loos, A., Weigel, C., Koehler, M.: Towards automatic detection of animals in camera-trap images. In: 2018 26th European Signal Processing Conference (EUSIPCO), pp. 1805–1809 (2018). https://doi.org/10.23919/EUSIPCO.2018.8553439
7. Schneider, S., Taylor, G.W., Kremer, S.C.: Deep learning object detection methods for ecological camera trap data. In: 2018 15th Conference on Computer and Robot Vision (CRV), pp. 321–328 (2018)
8. Beery, S., Morris, D., Yang, S.: Efficient pipeline for camera trap image review. arXiv (1907.06772) (2019)
9. Royal Zoological Society of Scotland. https://www.rzss.org.uk/
10. Scottish Wildcat Action. https://www.scottishwildcataction.org/
11. Saving Wildcats. https://savingwildcats.org.uk/
12. Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In: 4th International Conference on Learning Representations (ICLR) (2016). http://arxiv.org/abs/1510.00149
13. He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1398–1406 (2017). https://doi.org/10.1109/ICCV.2017.155
14. Hu, H., Peng, R., Tai, Y., Tang, C.: Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv (1607.03250) (2016)


15. Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolutional neural networks for resource efficient transfer learning. In: International Conference on Learning Representations (ICLR) (2017)
16. Beery, S., Van Horn, G., Perona, P.: Recognition in terra incognita. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
17. Ruder, S.: An overview of gradient descent optimization algorithms. arXiv (1609.04747) (2016)
18. TensorFlow: TensorFlow – Transfer learning and fine-tuning (2017). https://www.tensorflow.org/tutorials/images/transfer_learning
19. TensorFlow: TensorFlow – Module tf.keras.applications. https://www.tensorflow.org/api_docs/python/tf/keras/applications
20. LabelImg. https://tzutalin.github.io/labelImg/
21. Everingham, M., Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88, 303–338 (2009)
22. GitHub – YOLO3 (Detection, Training, and Evaluation). https://github.com/experiencor/keras-yolo3/
23. Redmon, J., Farhadi, A.: YOLOv3: An incremental improvement. arXiv (1804.02767) (2018)
24. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Computer Vision – ECCV 2014, pp. 740–755. Springer International Publishing (2014)
25. Xie, N., Ras, G., van Gerven, M., Doran, D.: Explainable deep learning: A field guide for the uninitiated. arXiv (2004.14545) (2020)
26. Ribeiro, M.T., Singh, S., Guestrin, C.: "Why should I trust you?": Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016)
27. GitHub – Lime: Explaining the predictions of any machine learning classifier. https://github.com/marcotcr/lime/
28. Raspberry Pi 4. https://www.raspberrypi.org/products/raspberry-pi-4-model-b/
29. Raspberry Pi Zero W. https://www.raspberrypi.org/products/raspberry-pi-zero-w/
30. Intel: Data sheet. Intel Neural Compute Stick 2 (2017). https://software.intel.com/content/dam/develop/public/us/en/documents/ncs2-data-sheet.pdf
31. Google LLC: Coral USB Accelerator datasheet. Version 1.4 (2019). https://coral.ai/static/files/Coral-USB-Accelerator-datasheet.pdf
32. Raspberry Pi Documentation. Accessories. Camera. https://www.raspberrypi.org/documentation/accessories/camera.html
33. Camera Module 2 NoIR. https://www.raspberrypi.org/products/pi-noir-camera-v2/
34. Raspberry Pi High Quality Camera. https://www.raspberrypi.org/products/raspberry-pi-high-quality-camera/
35. RPi Camera (G), Fisheye Lens. https://www.waveshare.com/rpi-camera-g.htm
36. Panasonic: PIR Motion Sensor PaPIRs. https://www3.panasonic.biz/ac/e/control/sensor/human/index.jsp/
37. Raspberry Pi Documentation. Raspberry Pi Hardware. Power Supply. https://www.raspberrypi.org/documentation/computers/raspberry-pi.html#power-supply
38. Raspberry Pi. Operating system images. https://www.raspberrypi.org/software/operating-systems/
39. Tiny Core Linux. http://distro.ibiblio.org/tinycorelinux/ports.html/
40. Arch Linux. https://archlinux.org/
41. Ubuntu MATE. https://ubuntu-mate.org/
42. pigpio library. http://abyz.me.uk/rpi/pigpio/download.html
43. Multi-Media Abstraction Layer (MMAL). Draft Version 0.1. http://www.jvcref.com/files/PI/documentation/html/
44. GitHub – RaspiCam: C++ API for using Raspberry camera (with OpenCV). https://github.com/rmsalinas/raspicam
45. GitHub – Motion, a software motion detector. https://github.com/Motion-Project/motion


46. Motion, a software motion detector – Documentation. https://motion-project.github.io/index.html
47. Motion, a software motion detector – Configuration. https://motion-project.github.io/motion_config.html
48. GitHub – PiKrellCam. https://github.com/billw2/pikrellcam
49. PiKrellCam – OSD Motion Detect Program. http://billw2.github.io/pikrellcam/pikrellcam.html
50. TensorFlow: TensorFlow – TensorFlow Lite converter. https://www.tensorflow.org/lite/convert
51. OpenCV. Open Source Computer Vision (Documentation). https://docs.opencv.org/
52. TensorFlow: TensorFlow – Model optimization. https://www.tensorflow.org/lite/performance/model_optimization
53. TensorFlow: TensorFlow – Python Tools. freeze_graph. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/freeze_graph.py