Computer Vision and Graphics: Proceedings of the International Conference on Computer Vision and Graphics ICCVG 2022 3031220242, 9783031220241

This book contains 17 papers presented at the conference, devoted to cutting-edge technologies and concepts related to image analysis, computer vision, and computer graphics.


Language: English. Pages: 263 [264]. Year: 2023.



Table of contents :
Organization
Preface
Contents
Computers and Humans
On Multi-stream Classification of Two Person Interactions in Video with Skeleton-Based Features
1 Introduction
2 Related Work
3 Approach
3.1 Structure
3.2 Key Frame Selection
3.3 Skeleton Detection
3.4 Skeleton Tracking and Correcting
3.5 Feature Extraction
4 Ensemble of "Weak" Classifiers
5 Triple Stream LSTM
6 Results
6.1 Dataset
6.2 Ensemble of Experts
6.3 Multi-stream LSTM
6.4 Ablation Study
7 Conclusion
References
On Formal Models of Interactions Between Detectors and Trackers in Crowd Analysis Tasks
1 Introduction
2 Formal Models
2.1 Basic Definitions
2.2 Matrix-Based Data Associations Between Monitoring Events
2.3 Estimates of Association Matrices
2.4 From Association Matrices to Labeling
3 Discussion
References
Digital Wah-Wah Guitar Effect Controlled by Mouth Movements
1 Introduction
2 The Principle of the Wah-Wah Effect
3 Conception
4 Initial Experiments
5 Research Environment
6 Results
7 Summary
References
Traffic and Driving
Traffic Sign Classification Using Deep and Quantum Neural Networks
1 Introduction
2 Quantum Neural Networks
3 Previous Work
4 QNN for Traffic Sign Recognition
5 Conclusion
References
Real Time Intersection Traffic Queue Length Estimation System
1 Introduction
2 Methodology
2.1 Module Which Creates Configuration for Selected Camera (Configuration Mode)
2.2 Module Which Estimates the Queue Length Based on Configuration
3 Experiments
4 Conclusion and Further Plans
References
Image Processing
Carotid Artery Wall Segmentation in Ultrasound Image Sequences Using a Deep Convolutional Neural Network
1 Introduction
2 Datasets
3 Segmentation Method
3.1 Detection of the Far Wall
3.2 Segmentation of the IMC
4 Results
5 Discussion
6 Compliance with Ethical Standards Information
References
Novel Co-SIFT Detector for Scanned Images Differentiation
1 Introduction
2 Related Work
2.1 Analysis of 8 Methods Utilizing Color for SIFT or SURF
3 Contactless Scanning
4 Transformation to Assignment Problem
5 Co-SIFT
5.1 Preprocessing
5.2 Interesting Point Detection
6 Experiments and Results
7 Conclusions
8 Future Work
References
PointPillars Backbone Type Selection for Fast and Accurate LiDAR Object Detection
1 Introduction
2 Related Work
2.1 DCNN Methods for 3D Object Detection on a LiDAR Point Cloud
2.2 Methods for Real-Time Object Detection on Images
3 Comparison of Backbone Types for the PointPillars Network
4 Results
5 Conclusions
References
Fuzzy Approach to Object-Detection-Based Image Retrieval
1 Introduction
2 Related Works
3 Fuzzy Modeling Object Mutual Positions
4 Querying Image Database
4.1 Single Query with Complete Relation
4.2 Combined Queries
4.3 Queries with Incomplete Relations
4.4 Complex Queries and Parsing
5 Tests
6 Conclusions
References
Adaptive Binarization of Metal Nameplate Images Using the Pixel Voting Approach
1 Introduction
2 A Brief Review of Image Binarization Methods
2.1 Global and Adaptive Thresholding Algorithms
2.2 Recent Proposals
3 Proposed Approach
3.1 Pixel Voting
3.2 Dataset of "Industrial" Images
4 Results
5 Conclusions
References
Influence of Step Parameterisation on the Results of the Reidentification Pipeline
1 Introduction
2 Related Work
3 Materials and Methods
3.1 Pipeline Under Study
3.2 Metrics Used
4 Results
5 Conclusions
References
On the Influence of Image Features on the Performance of Deep Learning Models in Human-Object Interaction Detection
1 Introduction
2 Related Works
3 Baseline Approach
3.1 Object Detection and Feature Extraction
3.2 Human and Object Streams
3.3 Relation Stream and Final Score Calculation
4 Experiments
4.1 Experimental Setup
4.2 Human Stream
4.3 Human and Object Streams
4.4 Object Category and Distance Information
4.5 Summary of Results
5 Conclusions
References
Computer Graphics
Fast Triangle Strip Generation and Tunneling for Different Cost Metrics
1 Introduction
2 Background
3 Related Work
4 Cost Metrics
5 Implementation
5.1 Overview
5.2 Stripify Stage
5.3 Tunneling Stage
5.4 Novel Loop Detection Algorithm
6 Benchmarking
6.1 Comparison of Implementations
6.2 Results
6.3 Extended Tunneling
7 Conclusion and Future Work
8 Appendix
References
A Deep Multi-Layer Perceptron Model for Automatic Colourisation of Digital Grayscale Images
1 Introduction
2 Related Works
3 Methods
4 Results
5 Conclusion and Future Works
References
An Algorithm for Automatic Creation of Ground Level Maps in Two-Dimensional Top-Down Digital Role Playing Games
1 Introduction
2 A Brief Description of the General Approach
3 The Algorithm for the Creation of the Ground Level Map
4 The Two Rules Algorithm
4.1 The Double Edges Rule
4.2 The Middle Rule
5 Results
6 Conclusions and Future Plans
References
Hardware and Cryptography
Energy Efficient Hardware Acceleration of Neural Networks with Power-of-Two Quantisation
1 Introduction
2 Power-of-Two Quantisation
3 Related Work
4 Hardware Design
4.1 Benchmark Results
4.2 Going Lower on Power—Pruning
5 Conclusion
References
Error Analysis and Graphical Evidence of Randomness in Two Methods of Color Visual Cryptography
1 Introduction
2 Two Methods of Coding
2.1 Description
2.2 Algorithms
2.3 Errors
2.4 Example Images
3 Probability of Errors
3.1 Theoretical Background
3.2 Experiment
3.3 Results and Discussion
4 Randomness Tests
4.1 Presenting the Histograms of p-Values
4.2 Presenting the Results of Tests with Subtests
4.3 Results and Discussion
5 Conclusions
References
Appendix Author Index
Index


Lecture Notes in Networks and Systems 598

Leszek J. Chmielewski Arkadiusz Orłowski   Editors

Computer Vision and Graphics Proceedings of the International Conference on Computer Vision and Graphics ICCVG 2022

Lecture Notes in Networks and Systems Volume 598

Series Editor Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland Advisory Editors Fernando Gomide, Department of Computer Engineering and Automation—DCA, School of Electrical and Computer Engineering—FEEC, University of Campinas—UNICAMP, São Paulo, Brazil Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Turkey Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA Institute of Automation, Chinese Academy of Sciences, Beijing, China Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus Imre J. Rudas, Óbuda University, Budapest, Hungary Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong

The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems—quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them. Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science. For proposals from Asia please contact Aninda Bose ([email protected]).

Leszek J. Chmielewski · Arkadiusz Orłowski Editors

Computer Vision and Graphics Proceedings of the International Conference on Computer Vision and Graphics ICCVG 2022

Editors Leszek J. Chmielewski Institute of Information Technology Warsaw University of Life Sciences—SGGW Warsaw, Poland

Arkadiusz Orłowski Institute of Information Technology Warsaw University of Life Sciences—SGGW Warsaw, Poland

ISSN 2367-3370 ISSN 2367-3389 (electronic) Lecture Notes in Networks and Systems ISBN 978-3-031-22024-1 ISBN 978-3-031-22025-8 (eBook) https://doi.org/10.1007/978-3-031-22025-8 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Organization

Association for Image Processing (TPO)—main organizer
Institute of Information Technology of the Warsaw University of Life Sciences SGGW—principal supporting organizer
Faculty of Computer Science and Information Technology, West Pomeranian University of Technology, Szczecin—supporting organizer
Springer Verlag, Springer Nature (SN)—publisher of the Proceedings

Organizing Committee
Leszek J. Chmielewski, Institute of Information Technology, SGGW
Arkadiusz Orłowski, Institute of Information Technology, SGGW
Ryszard Kozera, Institute of Information Technology, SGGW
Dariusz Frejlichowski, Faculty of Computer Science and Information Technology, West Pomeranian University of Technology
Marcin Bator, Institute of Information Technology, SGGW
Krzysztof Lipka, Institute of Information Technology, SGGW
Robert Budzyński, Institute of Information Technology, SGGW
Tomasz Minkowski, Institute of Information Technology, SGGW
Grzegorz Wieczorek, Institute of Information Technology, SGGW

Scientific Committee Ivan Bajla, Institute of Measurement Science, Slovak Academy of Sciences, Bratislava, Slovakia Nadia Brancati, Institute for High Performance Computing and Networking—ICAR, CNR, Rende, Napoli, and Palermo, Italy


Leszek Chmielewski, Warsaw University of Life Sciences—SGGW, Warszawa, Poland
Dariusz Frejlichowski, West Pomeranian University of Technology, Szczecin, Poland
Ewa Grabska, Jagiellonian University in Cracow, Kraków, Poland
Marcin Iwanowski, Warsaw University of Technology, Warszawa, Poland
Joanna Jaworek-Korjakowska, AGH University of Science and Technology, Kraków, Poland
Włodzimierz Kasprzak, Warsaw University of Technology, Warszawa, Poland
Bertrand Kerautret, LIRIS, Université de Lyon 2 Lumière, Lyon, France
Nahum Kiryati, Tel Aviv University, Tel Aviv, Israel
Józef Korbicz, University of Zielona Góra, Poland
Marcin Korzeń, West Pomeranian University of Technology, Szczecin, Poland
Tomasz Kryjak, AGH University of Science and Technology, Kraków, Poland
Juliusz L. Kulikowski, Nałęcz Institute of Biocybernetics and Biomedical Engineering, Polish Academy of Sciences, Warszawa, Poland
Bogdan Kwolek, AGH University of Science and Technology, Kraków, Poland
Krzysztof Małecki, West Pomeranian University of Technology, Szczecin, Poland
Tomasz Marciniak, University of Technology and Life Sciences, Bydgoszcz, Poland
Andrzej Materka, Łódź University of Technology, Łódź, Poland
Przemysław Mazurek, West Pomeranian University of Technology, Szczecin, Poland
Adam Nowosielski, West Pomeranian University of Technology, Szczecin, Poland
Krzysztof Okarma, West Pomeranian University of Technology, Szczecin, Poland
Maciej Orkisz, CREATIS, Université de Lyon 1, Lyon, France
Arkadiusz Orłowski, Warsaw University of Life Sciences—SGGW, Warszawa, Poland
Henryk Palus, The Silesian University of Technology, Gliwice, Poland
Wiesław Pamuła, The Silesian University of Technology, Gliwice, Poland
Piotr Porwik, University of Silesia in Katowice, Poland
Artur Przelaskowski, Warsaw University of Technology, Warszawa, Poland
Giuliana Ramella, CNR—National Research Council, Institute for the Applications of Calculus "Mauro Picone", Napoli, Italy
Khalid Saeed, Białystok University of Technology, Białystok, Poland
Samuel Silva, IEETA/DETI—University of Aveiro, Portugal
Václav Skala, University of West Bohemia, Plzen, Czech Republic
Maciej Smiatacz, Gdańsk University of Technology, Gdańsk, Poland
Andrzej Śluzek, Warsaw University of Life Sciences—SGGW, Warszawa, Poland
João Manuel R. S. Tavares, Universidade do Porto, Portugal
Libor Váša, University of West Bohemia, Pilsen, Czech Republic
Michał Woźniak, Wroclaw University of Science and Technology, Wrocław, Poland

Preface

The International Conference on Computer Vision and Graphics—ICCVG, organized since 2002, is the continuation of The International Conferences on Computer Graphics and Image Processing—GKPO, held in Poland every second year from 1990 to 2000. The main objective of ICCVG is to provide an environment for the exchange of ideas between researchers in the closely related domains of computer vision and computer graphics. ICCVG 2022 brought together 43 authors. The proceedings contain 17 papers, each accepted on the grounds of merit and relevance confirmed by two independent reviewers. The conference was organized as a hybrid event—classic and online— during the 2022 stage of the COVID pandemic, which partly reduced the attendance. ICCVG 2022 was organized by the Association for Image Processing, Poland (Towarzystwo Przetwarzania Obrazów—TPO), the Institute of Information Technology at the Warsaw University of Life Sciences—SGGW, together with the Faculty of Information Science at the West Pomeranian University of Technology (WI ZUT), Szczecin, as the supporting organizer. The Association for Image Processing integrates the Polish community working on the theory and applications of computer vision and graphics. It was formed between 1989 and 1991. The location of the Institute of Information Technology at the Warsaw University of Life Sciences—SGGW, the leading life sciences university in Poland, is the source of opportunities for valuable research at the border of applied information sciences, forestry, furniture and wood industry, veterinary medicine, agribusiness, and the broadly understood domains of biology and economy. We would like to thank all the members of the Scientific Committee for their help in ensuring the high quality of the papers. We would also like to thank


Grażyna Domańska-Żurek for her excellent work on technically editing the proceedings, and Dariusz Frejlichowski, Marcin Bator, Krzysztof Lipka, Robert Budzyński, Tomasz Minkowski, and Grzegorz Wieczorek for their engagement in the conference organization and administration.

Warsaw, Poland September 2022

Conference Chairmen: Leszek J. Chmielewski, Ryszard Kozera, Arkadiusz Orłowski

Contents

Computers and Humans

On Multi-stream Classification of Two Person Interactions in Video with Skeleton-Based Features (p. 3)
Włodzimierz Kasprzak, Sebastian Puchała, and Paweł Piwowarski

On Formal Models of Interactions Between Detectors and Trackers in Crowd Analysis Tasks (p. 17)
Andrzej Śluzek and M. Sami Zitouni

Digital Wah-Wah Guitar Effect Controlled by Mouth Movements (p. 31)
Adam Nowosielski and Przemysław Reginia

Traffic and Driving

Traffic Sign Classification Using Deep and Quantum Neural Networks (p. 43)
Sylwia Kuros and Tomasz Kryjak

Real Time Intersection Traffic Queue Length Estimation System (p. 57)
Kamil Bolek

Image Processing

Carotid Artery Wall Segmentation in Ultrasound Image Sequences Using a Deep Convolutional Neural Network (p. 73)
Nolann Lainé, Hervé Liebgott, Guillaume Zahnd, and Maciej Orkisz

Novel Co-SIFT Detector for Scanned Images Differentiation (p. 85)
Paula Stancelova, Zuzana Cernekova, and Andrej Ferko

PointPillars Backbone Type Selection for Fast and Accurate LiDAR Object Detection (p. 99)
Konrad Lis and Tomasz Kryjak

Fuzzy Approach to Object-Detection-Based Image Retrieval (p. 121)
Marcin Iwanowski, Aleksei Haidukievich, Maciej Leszczynski, and Bartosz Wnorowski

Adaptive Binarization of Metal Nameplate Images Using the Pixel Voting Approach (p. 137)
Hubert Michalak and Krzysztof Okarma

Influence of Step Parameterisation on the Results of the Reidentification Pipeline (p. 151)
Damian Pęszor, Konrad Wojciechowski, and Łukasz Czarnecki

On the Influence of Image Features on the Performance of Deep Learning Models in Human-Object Interaction Detection (p. 165)
Marcin Grząbka, Marcin Iwanowski, and Grzegorz Sarwas

Computer Graphics

Fast Triangle Strip Generation and Tunneling for Different Cost Metrics (p. 183)
Jonas Treumer, Lorenzo Neumann, Ben Lorenz, and Bastian Pfleging

A Deep Multi-Layer Perceptron Model for Automatic Colourisation of Digital Grayscale Images (p. 203)
Olawande M. Shokunbi, Joseph Damilola Akinyemi, and Olufade Falade Williams Onifade

An Algorithm for Automatic Creation of Ground Level Maps in Two-Dimensional Top-Down Digital Role Playing Games (p. 213)
Krzysztof Kaczmarzyk and Dariusz Frejlichowski

Hardware and Cryptography

Energy Efficient Hardware Acceleration of Neural Networks with Power-of-Two Quantisation (p. 225)
Dominika Przewlocka-Rus and Tomasz Kryjak

Error Analysis and Graphical Evidence of Randomness in Two Methods of Color Visual Cryptography (p. 237)
Leszek J. Chmielewski, Mariusz Nieniewski, and Arkadiusz Orłowski

Author Index (p. 269)

Computers and Humans

On Multi-stream Classification of Two Person Interactions in Video with Skeleton-Based Features Włodzimierz Kasprzak , Sebastian Puchała, and Paweł Piwowarski

Abstract A method of human skeleton-tracking and -refinement, and feature extraction for two-person interaction recognition in video is proposed. Its purpose is to properly reassign the same person-representing skeletons, approximate the missing joints and extract meaningful relational features. In addition, based on the created feature streams, two different multi-stream deep neural networks are designed to perform data transformation and interaction classification. They provide different relations between model complexity and performance quality. The first one is an ensemble of “weak” pose-based action classifiers, which are trained on different time-phases of an interaction. At the same time, the overall classification result is a time-driven aggregation of weighted combinations of their results. In the second approach, three input feature streams were created, which fed a triple-stream LSTM network. Both network models were trained and tested on the interaction subset of the NTU RGB+D data set, showing comparable performance with the best reported CNN- and Graphic CNN-based classifiers. Keywords Action classification · Skeleton data analysis · Human pose estimation · Video processing

W. Kasprzak (B) · S. Puchała · P. Piwowarski
Warsaw University of Technology, Institute of Control and Computation Engineering, ul. Nowowiejska 15/19, 00-665 Warszawa, Poland
e-mail: [email protected]
URL: https://www.ia.pw.edu.pl/
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
L. J. Chmielewski and A. Orłowski (eds.), Computer Vision and Graphics, Lecture Notes in Networks and Systems 598, https://doi.org/10.1007/978-3-031-22025-8_1

1 Introduction

Although human activity recognition based on computer vision techniques has gained noticeable improvements in recent years, it still faces many challenges in practice, e.g. occlusions, low resolution, different view-points, non-rigid deformations, and intra-class variability in shape [1]. It is dedicated to supporting real-world applications that will make our life better and safer, such as human-computer interaction in robotics and gaming, video surveillance and social activity

recognition [2]. Typically, human activity recognition in image sequences requires first the detection of human body parts or key points of a human skeleton. In early solutions, hand-designed features like edges, contours, the Scale-Invariant Feature Transform (SIFT), and Histograms of Oriented Gradients (HOG) were usually used for the detection and localization of human body parts or key points in the image [3]. More recently, neural network-based solutions were successfully proposed, e.g. based on Deep Neural Networks (DNN), especially Convolutional Neural Networks (CNN) [4], as they have the capability to automatically learn rich semantic and discriminative features. In skeleton-based methods [5], 2D or 3D human skeletons are detected first, even by functions installed on specialized devices, like the Microsoft Kinect. Some popular solutions to human pose estimation deliver skeleton data that can be utilized for action recognition, e.g., OpenPose [6] or DeeperCut [7]. The skeleton-based methods compensate for some of the drawbacks of vision-based methods, such as assuring the privacy of persons and reducing the sensitivity to scene lighting.

In this work, we focus on two-person interaction recognition in video, assuming the existence of skeleton data for video frames. The recent most successful solutions [8] show two main trends (see Sect. 2). It is essential to extract meaningful information from skeleton data and to suppress its errors, either by employing specialized algorithms or function approximation (e.g., deep neural networks). A trend is also visible towards lightweight solutions, i.e., using Graph CNNs instead of CNNs, or 2D CNNs instead of 3D CNNs. We propose two methods that differently interpret the idea of using multi-stream networks. The common structure of both methods (including the important skeleton tracking/correction and feature extraction steps) is presented in Sect. 3. The first method (Sect. 4) is an ensemble of "weak" pose classifiers, where every classifier is trained on a different time-phase of an interaction, while the overall classification result is a weighted combination of their results [9]. In the second approach (Sect. 5), three data streams are created from the skeleton data and fed into a triple-stream LSTM network performing feature transformation and classification. The classifiers are trained and tested on the well-known "NTU RGB+D" video dataset for actions and interactions [10] (Sect. 6).

2 Related Work In this short review of related work, we skip classic approaches, using different rich feature extraction methods and classic classifiers, like SVM or AdaBoost [3]. More recently, artificial neural networks are replacing such classic methods. Networks based on the LSTM architecture or a modification of this architecture (a ST-LSTM network with trust gates) were proposed in [11] and [10]. They introduced so-called “Trust Gates” in LSTM cells for controlling the quality of skeleton joints. Then, they arranged the LSTM modules into a 2D array to capture spatial and temporal dependencies simultaneously (denoted as ST-LSTM). The idea of applying convolutional filters to pseudo-images in action classification was introduced in [12]. A pseudo-image is a map (a 2D matrix) of feature vectors from successive time-points,


aligned along the time axis. Thus, the convolutional filters find local relationships of a combined time-space nature. Liang et al. [13] extended this idea to a multi-stream network with three stages. They use three types of features, extracted from the skeleton data: positions of joints, motions of joints and orientations of line segments between joints. Every feature type is processed independently in its own stream, but after every stage the results are exchanged between streams. The long-standing top performance on the NTU RGB+D interaction dataset was reported by the work of Perez et al. [14]. Its main contribution is a powerful two-stream CNN network with three stages, called "Interaction Relational Network", and a final LSTM network for a classification based on a dense sequence of frames. The input of this network consists of basic relations between skeleton joints of two interacting persons, tracked over the length of an image sequence. Graph convolutional networks are currently considered the best skeleton-based approach to the action recognition problem. They can achieve high quality results with lower requirements of computational resources than needed by CNNs (for example, "Spatial Temporal Graph Convolutional Networks" [15] and "Actional-Structural Graph Convolutional Networks" [16]). Another recent development is extracting different types of information from the initial skeletons (e.g., information on joints and bones, and their relations in space and time), which leads to multi-stream Graph CNNs (for example, the "Two-Stream Adaptive Graph Convolutional Network" (2S-AGCN) proposed by Shi et al. [17]). The currently best results are reported by Zhu et al. [18], where two new modules are proposed for baseline GCN and 2S-AGCN networks. The first module extends the idea of modeling relational links between two skeletons by a spatio-temporal graph to a "Relational Adjacency Matrix (RAM)". The second novelty is a processing module, called "Dyadic Relational Graph Convolution Block", which combines the RAM with spatial graph convolution and temporal convolution to generate new spatial-temporal features. The new solutions are called "dyadic relational GCNs" (DR-GCN, 2S DR-AGCN). From the analysis of the recent most successful solutions, we conclude that they differ by: processing sparse or dense frame sequences, extracting different features from skeleton data, and applying light- or heavy-weight neural networks.

3 Approach 3.1 Structure The input data has the form of video clips, apparently containing a two-person interaction. Both proposed solutions have a similar structure, shown in Fig. 1, differing only by the neural networks. There is a sequence of processing steps: key frame selection, skeleton estimation, skeleton tracking and correcting, feature extraction, neural network training and model testing.


Fig. 1 General structure of our approach

3.2 Key Frame Selection

Assuming the start and end of the human activity are known, a given number N of key frames is selected, uniformly distributed along the time axis (e.g., N = 32). Assume that N = Nm · M, where M is a given number of phases of the activity and Nm is the number of frames per single phase. We shall set M = 4 stages (phases) and interpret them as the start, 1-st intermediate, 2-nd intermediate and final phase of an action.
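The key-frame selection described above can be expressed compactly; the following is a minimal sketch under the paper's setting (N = 32 frames, M = 4 phases), with function and variable names of our own choosing.

```python
import numpy as np

def select_key_frames(num_frames, N=32, M=4):
    """Pick N key-frame indices uniformly spread over a clip of num_frames frames
    and group them into M equally sized phases (N = Nm * M)."""
    assert N % M == 0, "N must be a multiple of the number of phases M"
    # uniformly distributed indices along the time axis
    idx = np.linspace(0, num_frames - 1, num=N).round().astype(int)
    # split into M consecutive phases: start, intermediate 1, intermediate 2, final
    phases = np.split(idx, M)
    return idx, phases

frame_idx, phases = select_key_frames(num_frames=120, N=32, M=4)
```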

3.3 Skeleton Detection

In our implementation, we use the core block of OpenPose [19], the "body_25" model, to extract 25 human skeletal joints from an image. The result of OpenPose, as applied to a single key frame, is a set of skeleton data, where a 25-element array represents a single skeleton, providing 2D (or 3D, if needed) image coordinates and a confidence score for every joint.
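For illustration, skeleton data of this form can be loaded as below. The sketch assumes OpenPose's JSON output, in which each detected person carries a flat "pose_keypoints_2d" list of (x, y, confidence) triplets; this interface detail is our assumption, not something stated in the paper.

```python
import json
import numpy as np

def load_skeletons(json_path):
    """Read one OpenPose "body_25" result file and return an array of shape
    (num_people, 25, 3) holding (x, y, confidence) for every joint."""
    with open(json_path) as f:
        data = json.load(f)
    people = []
    for person in data.get("people", []):
        kp = np.array(person["pose_keypoints_2d"], dtype=float).reshape(25, 3)
        people.append(kp)
    return np.stack(people) if people else np.empty((0, 25, 3))
```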

3.4 Skeleton Tracking and Correcting

In case more than two skeletons are returned by OpenPose for an image, the two largest skeletons, S_a and S_b, are selected first and then tracked in the remaining frames. We focus on the first 15 joints of every skeleton, as the information about the remaining joints is very noisy (Fig. 2a).

Size normalization. The invariance of features with respect to the size of the skeleton in the image was obtained by normalizing the coordinates of the joints with the segment between the neck (joint1) and the center of the hips (joint8) (Fig. 2b). This distance is nearly always correctly detected by OpenPose. Secondly, it does not depend on the angle of the human position in relation to the camera. The only exception is when the person's spine is positioned along the depth axis of the camera system (this case does not occur in the data sets used). After calculating the length of the segment between joint1 and joint8, it becomes the normalization value for all other measured distances in the feature sets. This distance is measured only for the first person, and both persons are normalized by it. The absolute image coordinates are transformed into relative coordinates by dividing them by the size normalization distance |joint1 - joint8| of the first skeleton. Finally, the distance between the two persons (a, b), i.e. between the skeleton centres, is obtained as d = |joint8^(a) - joint8^(b)|.

Fig. 2 Joints: a the 15 reliable joints (indexed from 0 to 14) out of 25 of the OpenPose "body_25" skeleton model; b the size normalization distance between joint1 and joint8

Correction of joints. We cancel joint data that are uncertain. Those joints j whose certainty value is c_j < 0.3 are removed and replaced by a special mark representing "not a value". The location data for joints received from OpenPose is not always perfect: some joints are not detected, while others are detected with low certainty, and we remove them. Fortunately, due to the sequential nature of the available data, a number of techniques can be used to fill these gaps by approximating empty joints from their locations in neighboring frames.

RAW features. The result of tracking two sets of skeleton joints in N frames can be represented as a 2D map of N × 15 × (2 + 1) vector entries.
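The normalization and joint-correction steps can be sketched as follows. The joint indices (1 = neck, 8 = hip centre) follow the description above; the linear interpolation used for gap filling is just one of the neighbor-frame approximation techniques the text alludes to, and the helper names are ours.

```python
import numpy as np

CONF_THRESHOLD = 0.3  # joints below this certainty are treated as missing

def normalize_and_correct(skel_a, skel_b):
    """skel_a, skel_b: arrays (N_frames, 15, 3) with (x, y, confidence) per joint.
    Returns relative coordinates (divided by the neck-hip distance of skeleton a),
    with unreliable joints marked as NaN, plus the inter-person distance d per frame."""
    def to_xy(s):
        xy = s[..., :2].astype(float).copy()
        xy[s[..., 2] < CONF_THRESHOLD] = np.nan      # cancel uncertain joints
        return xy

    xy_a, xy_b = to_xy(skel_a), to_xy(skel_b)
    # size-normalization segment: joint1 (neck) to joint8 (hip centre) of person a
    # (if either of these joints is missing, the whole frame becomes NaN here)
    scale = np.linalg.norm(xy_a[:, 1] - xy_a[:, 8], axis=1)        # shape (N_frames,)
    xy_a /= scale[:, None, None]
    xy_b /= scale[:, None, None]
    d = np.linalg.norm(xy_a[:, 8] - xy_b[:, 8], axis=1)            # distance of skeleton centres
    return xy_a, xy_b, d

def fill_gaps(xy):
    """Approximate missing joints by linear interpolation over neighbouring frames."""
    out = xy.copy()
    n = out.shape[0]
    t = np.arange(n)
    for j in range(out.shape[1]):
        for c in range(2):
            v = out[:, j, c]
            ok = ~np.isnan(v)
            if ok.any() and not ok.all():
                out[~ok, j, c] = np.interp(t[~ok], t[ok], v[ok])
    return out
```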

3.5 Feature Extraction A strict representation of junctions, like the RAW data map, has obvious disadvantages—it is not invariant with respect to the position in the image and it does not explicitly represent relationships between both skeletons. First, the coordinates of the joints may randomly change but still represent the same semantic meaning (i.e. an action stage). The second problem is that the distance of points during interaction depends on the scale of the presentation of the scene in the image and the size of the people. Thirdly, the point representation does not explicitly model


Fig. 3 Illustration of the LA features: a the 14 line segments (bones) per skeleton; b the 13 orientation changes between every two consecutive segments (bones)


other important relationships between silhouettes, such as relative orientation and movement. Of course, a deep network would also be able to learn such dependencies, but then we unnecessarily spend computing resources and deteriorate the quality of predictions on learning data transformations which can easily be done analytically. Therefore, a mutual representation of both skeletons was developed, which reduces the aforementioned disadvantages of the raw representation of joints:
1. bone lengths (limbs) and orientation changes (angles), the so-called "LA features";
2. limbs, angles and joint motion vectors, denoted as "LAM features".
The LA features will be fed into the three-stream LSTM, while the LAM features into the ensemble of weak pose-based action classifiers.

3.5.1 LA Features

For every skeleton a and b there are obtained:
• lengths of 14 bones, i.e. line segments (L, "limbs") f_a, f_b: distances between two neighboring joints (Fig. 3a);
• 13 bone orientation changes at joints (A, "angles") r_a, r_b between two neighboring segments (Fig. 3b).

Additionally, distances d_ab^(j) between pairs of corresponding joints (the same index j) of the two skeletons S_a and S_b are also considered (15 distances). Thus, for every frame there are 69 features defined (= (14 + 13) · 2 + 15), denoted as LA. The N · 69 features are split into three maps, one for each skeleton, F_a^N and F_b^N, while the inter-distances are collected into a third map F_D^N:

$$
F_a^N = \begin{bmatrix} \mathbf{f}_a^1 & \mathbf{r}_a^1 \\ \mathbf{f}_a^2 & \mathbf{r}_a^2 \\ \cdots & \cdots \\ \mathbf{f}_a^N & \mathbf{r}_a^N \end{bmatrix},\quad
F_b^N = \begin{bmatrix} \mathbf{f}_b^1 & \mathbf{r}_b^1 \\ \mathbf{f}_b^2 & \mathbf{r}_b^2 \\ \cdots & \cdots \\ \mathbf{f}_b^N & \mathbf{r}_b^N \end{bmatrix},\quad
F_D^N = \begin{bmatrix} D_{ab}^1 \\ D_{ab}^2 \\ \cdots \\ D_{ab}^N \end{bmatrix} \qquad (1)
$$
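A possible computation of the LA features and the three maps of Eq. (1) is sketched below. The concrete list of 14 bones and 13 consecutive-bone pairs is an illustrative assumption (the paper only states their counts and refers to Fig. 3).

```python
import numpy as np

# Hypothetical bone list for the 15-joint subset of "body_25" (14 segments);
# the exact pairing is an assumption, not taken from the paper.
BONES = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
         (1, 8), (8, 9), (9, 10), (10, 11), (8, 12), (12, 13), (13, 14)]
# 13 pairs of consecutive bones used for the orientation-change ("angle") features,
# again an illustrative choice
CONSECUTIVE = [(i, i + 1) for i in range(13)]

def la_features(xy_a, xy_b):
    """xy_a, xy_b: normalized joint coordinates, shape (N, 15, 2).
    Returns F_a (N, 27), F_b (N, 27) and F_D (N, 15) as in Eq. (1):
    14 bone lengths + 13 angles per skeleton, plus 15 inter-skeleton distances."""
    def limbs_and_angles(xy):
        vec = np.stack([xy[:, j2] - xy[:, j1] for j1, j2 in BONES], axis=1)  # (N, 14, 2)
        lengths = np.linalg.norm(vec, axis=2)                                # (N, 14)
        angles = []
        for b1, b2 in CONSECUTIVE:
            a1 = np.arctan2(vec[:, b1, 1], vec[:, b1, 0])
            a2 = np.arctan2(vec[:, b2, 1], vec[:, b2, 0])
            angles.append(np.abs(np.angle(np.exp(1j * (a2 - a1)))))          # wrap to [0, pi]
        return np.concatenate([lengths, np.stack(angles, axis=1)], axis=1)   # (N, 27)

    F_a = limbs_and_angles(xy_a)
    F_b = limbs_and_angles(xy_b)
    F_D = np.linalg.norm(xy_a - xy_b, axis=2)                                # (N, 15)
    return F_a, F_b, F_D
```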

3.5.2 LAM Features

For memory-less networks, like the MLP, we need to define motion vectors for every joint and every skeleton. Let u_{a,t}^j = (u_x, u_y) be the motion vector of joint j of skeleton a in frame t, while U_a^t is the sequence of such vectors for all 15 joints. Thus, for every frame there are 60 more features defined (= 15 joints · 2 coordinates · 2 skeletons). In total, the LAM feature map has 129 features per frame (= 60 + 69), i.e. N · 129 entries.
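The joint motion vectors added in the LAM variant reduce to frame-to-frame coordinate differences; a small sketch (the zero-motion convention for the first frame is our assumption):

```python
import numpy as np

def motion_vectors(xy):
    """Per-joint motion vectors u_t = joint(t) - joint(t-1) for one skeleton.
    xy: (N, 15, 2) -> (N, 15, 2); the first frame gets zero motion."""
    u = np.zeros_like(xy)
    u[1:] = xy[1:] - xy[:-1]
    return u
```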

4 Ensemble of "Weak" Classifiers

The block structure of our first network contains: feature transformation layers of an initially trained pose classifier, four weak action classifiers (called "experts") and a fusion layer with gain-like weights (Fig. 4). Let us denote this network as E-ANN-LAM, where "E" stands for "ensemble" and "LAM" for "limbs, angles and motion". The pose classifier is a fully-connected ANN network with two hidden layers. We have evaluated many variants of several hyper-parameters: the number of hidden layers of the network varied from 1 to 3, activation functions ReLU and/or sigmoid were tried, different numbers of neurons in hidden layers and the learning rate were

Fig. 4 The structure of the ensemble of classifiers (E-ANN-LAM): two feature transformation layers, four weak action classifiers and a fusion layer


checked. The four weak action classifiers use the embedding data generated by the pose classifier as their input. They replace the output layer of the pose classifier by a fully connected layer with N outputs each. The "ensemble of experts" consists of a fusion layer for the weak classifiers and aggregation of class likelihoods over the entire frame sequence. The fusion layer is again a fully connected layer that weights the results of all weak classifiers. It takes the frame index t as its additional input. Finally, interaction classification consists of summing up all the time-indexed interaction class likelihoods, obtained by the ensemble classifier over the entire frame sequence. The fusion and time-based aggregation operation is formally a weighted sum of pose-based interaction likelihoods (vector data), for frames indexed from t = 1 to t = T. At the same time, the gain network provides gain coefficients w_i(t) (i = 1, 2, 3, 4) for the four weak interaction classifiers Pr_{expert_i}(t):

$$
S = \sum_{t=1}^{T} \left[ Pr_{expert\_1}(t) \cdot w_1(t) + Pr_{expert\_2}(t) \cdot w_2(t) + Pr_{expert\_3}(t) \cdot w_3(t) + Pr_{expert\_4}(t) \cdot w_4(t) \right] \qquad (2)
$$
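Eq. (2) amounts to a gain-weighted sum of the four experts' class-likelihood vectors over all frames; a compact sketch (the array layout and names are ours):

```python
import numpy as np

def ensemble_score(expert_probs, gains):
    """Time-driven aggregation of Eq. (2).
    expert_probs: array (T, 4, C) with class likelihoods of the four weak experts
                  for frames t = 1..T; gains: array (T, 4) with the coefficients w_i(t).
    Returns the aggregated score vector S (shape (C,)) and the winning class index."""
    S = np.einsum('tec,te->c', expert_probs, gains)  # sum over frames and experts
    return S, int(np.argmax(S))
```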

5 Triple Stream LSTM The “triple stream” LSTM (TS-LSTM-LA) consists of three independent LSTM streams, a concatenation layer and two dense layers. Every LSTM stream has two LSTM layers interleaved by two Dropout layers (Fig. 5). Two of the LSTM streams process the feature subsets of every skeleton separately, while the third one processes the common feature subset (15 distances between joints).
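A possible Keras realization of this architecture is sketched below; the stream width, dropout rate and dense-layer size are illustrative assumptions, and only the overall topology (three two-layer LSTM streams with Dropout, concatenation, two dense layers) follows the description above.

```python
from tensorflow import keras
from tensorflow.keras import layers

def lstm_stream(n_features, n_frames=32, units=128, name=None):
    """One stream: two LSTM layers interleaved with Dropout."""
    inp = keras.Input(shape=(n_frames, n_features), name=name)
    x = layers.LSTM(units, return_sequences=True)(inp)
    x = layers.Dropout(0.3)(x)
    x = layers.LSTM(units)(x)
    x = layers.Dropout(0.3)(x)
    return inp, x

def build_ts_lstm_la(n_classes=11, n_frames=32):
    in_a, out_a = lstm_stream(27, n_frames, name="skeleton_a")   # limbs + angles of person a
    in_b, out_b = lstm_stream(27, n_frames, name="skeleton_b")   # limbs + angles of person b
    in_d, out_d = lstm_stream(15, n_frames, name="distances")    # 15 inter-joint distances
    x = layers.Concatenate()([out_a, out_b, out_d])
    x = layers.Dense(128, activation="relu")(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return keras.Model([in_a, in_b, in_d], out)

model = build_ts_lstm_la()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```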

Fig. 5 Architecture of the TS-LSTM-LA network

6 Results

6.1 Dataset

The primary dataset on which our models have been trained and evaluated is the interaction subset of the NTU RGB+D dataset. It includes 11 two-person interactions of 40 actors: A50: punch/slap, A51: kicking, A52: pushing, A53: pat on back, A54: point finger, A55: hugging, A56: giving object, A57: touch pocket, A58: shaking hands, A59: walking towards, A60: walking apart. Skeleton data is already available for all frames of all video clips. We trained and tested our models in the CS (cross-subject) verification mode of the NTU RGB+D dataset, i.e., the actors in the training set are different than in the test set, but data from all the camera views are included in both sets. There are 10347 video clips in total, of which 7334 videos are in the training set and the remaining 3013 videos are in the test set. No distinct validation subset is distinguished. In the cross-subject setting, samples used for training show actions performed by half of the actors, while test samples show actions of the remaining actors, i.e., videos of 20 persons are used for training and videos of the remaining 20 persons for testing. Typically, the classification accuracy grows with the growing number of frames per video clip. We have chosen to extract 8 frames per time phase (in total 32 frames per video clip) to evaluate our solutions similarly to the reference work [14]. The selected frames are uniformly distributed along the time axis. The training set was split into learning and test subsets: two thirds for learning and one third for validation/testing. There were 100 training epochs, and the model that emerged after the epoch with the best validation result was chosen. In addition, we performed a hyper-parameter search during training by the Random Search algorithm, offered in Keras [20].

6.2 Ensemble of Experts We compare a weak action classifier with the ensemble of four weak classifiers, where both classifiers are processing the LAM features. The final score of the ensemble classifier is obtained by the aggregation equation (2). The class with the highest score is selected as the winner. A notable improvement is observed, when aggregating the weak experts to an ensemble classifier. The mean accuracy of a single action classifier was 88.4% (training) and 76.6% (testing), while the ensemble classifier reached 94.5% and 84.0%, respectively.


6.3 Multi-stream LSTM The accuracy of the three-stream LSTM network operating on the LA features was tested against a baseline single-stream LSTM, which operated on the RAW skeleton features. Also in this case, our solution has shown superior performance. The test accuracy of the LSTM-RAW classifier was 75.7%, whereas the TS-LSTM-LA reached 94.6%. Confusion matrices allow for accurate analysis of incorrect predictions of individual classes. The confusing results are as follows: • The striking class is confused with the finger pointing class—in both cases, a similar hand movement is made towards the other person; • The class of patting on the back is confused with the class of touching a pocket— touching a pocket involves touching another person’s pocket in an interaction (a simulation of stealing a wallet), so the movement is close to patting someone on the back; • The giving object class and the hand-squeezing class represent very similar interactions; both involve the contact of the hands. The classes “zooming out” and “zooming in” are recognized virtually flawlessly even with weaker models (Fig. 6).
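The per-class analysis mentioned above can be reproduced from a confusion matrix; a small sketch using scikit-learn (which is our tooling choice, not the paper's):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

CLASSES = ["punch/slap", "kicking", "pushing", "pat on back", "point finger",
           "hugging", "giving object", "touch pocket", "shaking hands",
           "walking towards", "walking apart"]

def most_confused_pairs(y_true, y_pred, top=3):
    """Return the most frequent off-diagonal (true class, predicted class) pairs."""
    cm = confusion_matrix(y_true, y_pred, labels=range(len(CLASSES)))
    np.fill_diagonal(cm, 0)                       # ignore correct predictions
    flat = np.argsort(cm, axis=None)[::-1][:top]  # indices of largest confusions
    return [(CLASSES[i // len(CLASSES)], CLASSES[i % len(CLASSES)], int(cm.flat[i]))
            for i in flat]
```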

6.4 Ablation Study

So far, several methods for two-person interaction classification have been tested on the NTU-RGB+D interaction dataset. We list some of the leading works in Table 1. Our solutions can be characterized as follows: (a) the ensemble of action experts needs a low number of weights (462K) to be trained but achieves good quality (84%), (b) the three-stream LSTM needs a high number of weights (9.8M) but achieves top quality (94.6%), comparable with the current best approaches based on graph CNNs. Solutions that process all or nearly all frames of the video clip demonstrate superior performance over solutions operating on sparse frame sequences.


Fig. 6 Illustration of properly classified interactions of hugging and punch

Table 1 Interaction classification accuracy of leading works evaluated on the NTU-RGB+D interaction set in the CS (cross-subject) mode. Note: † result according to [14], ‡ result according to [18]

Model                  Year   Accuracy (CS)   Parameters   Sequence
ST-LSTM [11]           2016   83.0% †         ~2.1M        32
ST-GCN [15]            2018   83.3% †         3.08M        32
AS-GCN [16]            2019   89.3% †         ~9.5M        32
IRN_inter+intra [14]   2019   85.4% †         ~9.0M        32
Our E-ANN-LAM          2022   84.0%           462K         32
Our TS-LSTM-LA         2022   94.6%           9.8M         32
LSTM-IRN [14]          2019   90.5% †         ~9.08M       max(all, 300)
2S-AGCN [17]           2019   93.4% ‡         3.0M         max(all, 300)
DR-GCN [18]            2021   93.6% ‡         3.18M        max(all, 300)
2S DR-AGCN [18]        2021   94.6% ‡         3.57M        max(all, 300)


7 Conclusion

Two methods for two-person interaction classification were proposed and evaluated: the first one uses a lightweight network implementing an ensemble of action experts, while the second one uses a three-stream LSTM network. We apply the OpenPose library, an efficient deep network solution, to generate human skeleton sets from an image or video frame. Our main contributions are the algorithms for skeleton tracking and correction, and the design of the two classifiers. The main advantage of the first network is a good performance obtained with an extremely low number of weights, while the second network shows a top classification performance without using a CNN or Graph CNN. The limitation of both methods is the low accuracy for actions performed by hand palms or feet, as the data for fingers, palms and feet is very noisy.

Acknowledgments This work was conducted within the project APAKT and supported by "Narodowe Centrum Badań i Rozwoju", Warszawa, grant No. CYBERSECIDENT/455132/III/NCBR/2020.

References 1. Stergiou, A., Poppe, R.: Analyzing human-human interactions: a survey. Comput. Vis. Image Underst. Elsevier 188, 102799 (2019). https://doi.org/10.1016/j.cviu.2019.102799 2. Coppola, C., Cosar, S., Faria, D.R., Bellotto, N.: Automatic detection of human interactions from RGB-D data for social activity classification. In: 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Lisbon, pp. 871–876 (2017). https://doi.org/10.1109/ROMAN.2017.8172405 3. Zhang, S., Wei, Z., Nie, J., Huang, L., Wang, S., Li, Z.: A review on human activity recognition using vision-based method. J. Healthc. Eng., Hindawi 2017, Article ID 3090343, 31 pages (2017). https://doi.org/10.1155/2017/3090343 4. Bevilacqua, A., MacDonald, K., Rangarej, A., Widjaya, V., Caulfield, B., Kechadi, T.: Human activity recognition with convolutional neural networks. In: Machine Learning and Knowledge Discovery in Databases, LNAI, vol. 11053, pp. 541–552. Springer, Cham, Switzerland (2019). https://doi.org/10.1007/978-3-030-10997-4_33 5. Cippitelli, E., Gambi, E., Spinsante, S., Florez-Revuelta, F.: Evaluation of a skeleton-based method for human activity recognition on a large-scale RGB-D dataset. In: 2nd IET International Conference on Technologies for Active and Assisted Living (TechAAL 2016), London, UK (2016). https://doi.org/10.1049/ic.2016.0063 6. Cao, Z., Hidalgo, G., Simon, T., Wei, S.-E., Sheikh, Y.: OpenPose: realtime multi-person 2D pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 43(1), 172– 186 (2021). https://doi.org/10.1109/TPAMI.2019.2929257 7. Insafutdinov, E., Pishchulin, L., Andres, B., Andriluka, M., Schiele, B.: Deepercut: a deeper, stronger, and faster multi-person pose estimation model. In: Computer Vision – ECCV 2016, LNCS, vol. 9907, pp. 34–50. Springer, Cham, CH (2016). https://doi.org/10.1007/978-3-31946466-4_3 8. [Online]. NTU RGB+D 120 Dataset. Papers With Code. https://paperswithcode.com/dataset/ ntu-rgb-d-120 (Accessed on 20.07.2022) 9. Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E.: Adaptive mixtures of local experts. Neural Comput. 3(1), 79–87 (1991). https://doi.org/10.1162/neco.1991.3.1.79


10. Shahroudy, A., Liu, J., Ng, T.-T., Wang, G.: NTU RGB+D: a large scale dataset for 3D human activity analysis (2016). arXiv:1604.02808 [cs.CV]. https://arxiv.org/abs/1604.02808 (Accessed on 15.07.2022) 11. Liu, J., Shahroudy, A., Xu, D., Wang, G.: Spatio-temporal LSTM with trust gates for 3D human action recognition. In: Computer Vision ECCV 2016, LNCS, vol. 9907, pp. 816–833. Springer, Cham, CH (2016). https://doi.org/10.1007/978-3-319-46487-9_50 12. Li, C., Zhong, Q., Xie, D., Pu, S.: Skeleton-based action recognition with convolutional neural networks (2017). arXiv:1704.07595v1 [cs.CV]. https://arxiv.org/abs/1704.07595v1 (Accessed on 15.07.2022) 13. Liang, D., Fan, G., Lin, G., Chen, W., Pan, X., Zhu, H.: Three-stream convolutional neural network with multi-task and ensemble learning for 3D action recognition. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE (2019). https://doi.org/10.1109/cvprw.2019.00123 14. Perez, M., Liu, J., Kot, A.C.: Interaction relational network for mutual action recognition (2019). arXiv:1910.04963 [cs.CV]. https://arxiv.org/abs/1910.04963 (Accessed on 15.07.2022) 15. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeletonbased action recognition (2018). arXiv:1801.07455 [cs.CV]. https://arxiv.org/abs/1801.07455 (Accessed on 15.07.2022) 16. Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., Tian, Q.: Actional-structural graph convolutional networks for skeleton-based action recognition. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019, pp. 3590–3598. https://doi.org/10.1109/CVPR.2019.00371 17. Shi, L., Zhang, Y., Cheng, J., Lu, H.-Q.: Two-stream adaptive graph convolutional networks for skeleton-based action recognition, 10 July 2019. arXiv:1805.07694v3 [cs.CV]. https://doi. org/10.48550/ARXIV.1805.07694 (Accessed on 15.07.2022) 18. Zhu, L.-P., Wan, B., Li, C.-Y., Tian, G., Hou, Y., Yuan, K.: Dyadic relational graph convolutional networks for skeleton-based human interaction recognition. Pattern Recognit., Elsevier 115, 107920 (2021). https://doi.org/10.1016/j.patcog.2021.107920 19. [Online]. openpose. CMU-Perceptual-Computing-Lab (2021). https://github.com/CMUPerceptual-Computing-Lab/openpose/ (Accessed on 20.07.2022) 20. [Online]. Keras Tuner. https://keras-team.github.io/keras-tuner/ (Accessed on 15.07.2022)

On Formal Models of Interactions Between Detectors and Trackers in Crowd Analysis Tasks
Andrzej Śluzek and M. Sami Zitouni

Abstract In crowd analysis tasks (crowds of humans, cattle, birds, drones, etc.) the low-level vision tools are usually the same, i.e. detection and tracking of either individuals or groups. The required results, however, are more complicated (e.g. patterns of group splitting/merging, changes in group sizes and membership, group formation and disappearance, etc.). To complete such tasks, raw results of detection/tracking are converted into data associations representing crowd structure/evolution. Normally, those associations are deterministic and based on target labeling. However, performances of detectors/trackers are non-perfect, i.e. their outcomes are effectively nondeterministic. We discuss matrix-based mathematical models of interactions between detectors and trackers to represent such data associations non-deterministically. In particular, a methodology for reconstructing weak or missing associations by alternative sequences of matrix operations is proposed. This can provide more reliable label correspondences between selected moments/points of monitored scenes. Apart from mathematical details, the paper presents examples illustrating feasibility of the proposed approach. Keywords Crowd analysis · Data associations · Detectors · Trackers · Matrix representation

A. Śluzek (B)
Warsaw University of Life Sciences-SGGW, Warsaw, Poland
e-mail: [email protected]
M. Sami Zitouni
Khalifa University, Abu Dhabi, UAE
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
L. J. Chmielewski and A. Orłowski (eds.), Computer Vision and Graphics, Lecture Notes in Networks and Systems 598, https://doi.org/10.1007/978-3-031-22025-8_2

1 Introduction

Automatic analysis of crowd behaviors from visual surveillance data (be it a crowd of pedestrians in a city center, a flock of birds, a large group of migrating animals, a swarm of drones approaching the defended area, or a herd of grazing cattle) is an interesting and challenging problem for machine vision and AI. Although analysis

of human crowds is the most popular area (e.g. [2, 15, 17, 25]) there are a number of works discussing group behaviors of animals or even mobile robots (e.g. [13, 18]). Regardless the intended application, the low-level tools of vision-based crowd monitoring remain the same, i.e. detection and tracking algorithms targeting either individuals or groups. However, the required results of surveillance-based analysis are usually more complicated than direct outcomes of detectors or trackers. Prospective users might be interested in issues such as patterns of group splitting or merging, changes in group sizes and membership, duration of group existence (including group formation and disappearance), etc. To complete such tasks, raw-data results of detection and tracking should be converted into data association patterns (maintaining continuity of target identifiers/labels) which represent various scenarios of crowd structure and evolution. From those data association patterns, meaningful conclusions can be drawn regarding the type of crowd behavior, development of potentially dangerous events, situational abnormalities, etc. Since, in general, data associations based on target labeling are deterministic, conversion of detection/tracking results into deterministic associations might be inaccurate. First, even state-of-the-art detectors/trackers very seldom provide accurate results in multi-target scenarios of crowded places. Apart from typical effects degrading visibility (e.g. shadows, weather conditions, poor illumination, etc.) we can also expect deterioration caused by crowd density, low resolution of individual silhouettes, occlusions, multiple similar targets, vaguely defined boundaries of groups, etc. Therefore, continuity of labels assigned to (actually the same) individuals or groups cannot be realistically maintained over longer periods of time and/or over wider areas. Performances would be even less reliable, if several detection/tracking algorithms are simultaneously used. Figures 1 and 2 illustrate typical inaccuracies/mistakes expected in real-world images, even if state-of-the-art algorithms are used and monitored scenes have moderate complexity.

Fig. 1 Individual and group detection results in exemplary street-view images. Exemplary high-performance detectors are used, i.e. ACF (see [6]) for individuals, and a GMM-based detector of groups (see [21])


Fig. 2 Several label-continuity mistakes (mostly in group labels) in just three consecutive frames of a surveillance video (from [22]). JPDA-based (see [12]) trackers are used

Thus, various techniques are proposed to handle inconsistencies between deterministic (label-based) associations and imperfect performances of trackers and detectors (producing results which should be considered non-deterministic). For example, extended objects (e.g. [7]) are a mathematical concept for tracking targets with statistically diluted boundaries (using random matrices). Alternatively (an approach proposed in [16]), labels can be assigned to all particles (if a particle filter is used) within the target outlines so that label distributions approximate locations, shapes and identity of targets.

In this paper, we propose and illustrate (on selected examples) another approach to data association in crowd analysis tasks. The methodology seems novel, and it is general enough to incorporate any typical detectors and trackers, as long as those detectors/trackers provide geometric outlines of the handled objects. In our opinion, the most significant aspects of the presented work are as follows:

1. Data acquisition and processing by the surveillance system is represented by a collection of (formally unrelated) monitoring events. Each monitoring event is a triplet consisting of: (a) data acquisition device (camera), (b) (discretized) acquisition time and (c) data processing algorithm, i.e. either a detector (of individual targets or groups) or a tracker (of individual targets or groups). Thus, there is no formal distinction between detectors and trackers. At each monitoring event, a number of objects are identified (by detection or tracking). Objects are either individuals or groups, depending on what algorithm is used in the event.

2. Data associations are built between monitoring events, i.e. between objects identified at two points. However, no labels are used. Instead, we propose to use matrices; their elements represent (usually non-deterministically) relations between the corresponding objects of both monitoring events. The matrix relating two monitoring events can be directly obtained from the algorithms used at these events, or derived from a sequence of matrices representing a chain of events linking the original two monitoring events.

3. By using the concept of monitoring events (and the corresponding matrix-represented data associations) it is possible to maintain continuity of data associations even if detection/tracking of either individuals or groups is temporarily/locally discontinued. In such cases, missing detection/tracking results can be rectified by using chains of monitoring events to bypass the disrupted fragments (an illustrative sketch of such chaining is given below). Re-initialization is needed only if tracking/detection of both individuals and groups is temporarily lost.

4. If needed, standard label-based data associations can be established (with the highest possible accuracy) between any monitoring events. For that purpose, the contents of matrices relating two events are analyzed using simple principles (mainly based on min-max operations along matrix columns or rows).

The paper is a continuation and improvement of our recent results in [23]. In that paper, we focused only on one aspect of the considered problems, i.e. on substituting tracking individuals by tracking groups or another way around (in cases when one of the trackers is temporarily unable to provide credible results). This paper provides a significant generalization of the mechanisms, both in terms of the considered scenarios and from the theoretical perspective.

In Sect. 2 (which is the main part of the paper) we systematically introduce the proposed formalisms, explain how they work for various surveillance scenarios, and provide illustrative examples of the presented concepts. The final Sect. 3 summarizes the paper (including the summary of results from past papers) and highlights the prospective directions for the future works.
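As an informal illustration of items 2-3, soft association matrices along a chain of monitoring events can be composed to relate the first and last event even when an intermediate detector/tracker fails. The composition rule and the threshold below are our simplifying assumptions; the paper's actual estimation of association matrices and its min-max labeling principles are developed in Sect. 2.

```python
import numpy as np

def chain_associations(matrices):
    """Illustrative sketch (our assumption, not the paper's exact rule): combine a
    chain of soft association matrices M1 (events E0->E1), M2 (E1->E2), ... into one
    matrix relating the first and the last monitoring event. Rows/columns index the
    objects found at the respective events; entries lie in [0, 1]."""
    combined = matrices[0]
    for m in matrices[1:]:
        combined = combined @ m                 # propagate association strength
        combined = np.clip(combined, 0.0, 1.0)  # keep values interpretable as confidences
    return combined

def to_labels(assoc, threshold=0.5):
    """Simple label assignment from an association matrix: each object of the later
    event inherits the label (row index) of its strongest association, provided the
    association exceeds a threshold."""
    best = assoc.argmax(axis=0)
    strength = assoc.max(axis=0)
    return [int(b) if s > threshold else None for b, s in zip(best, strength)]
```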

2 Formal Models

2.1 Basic Definitions

Generally, we assume a surveillance system with K cameras {C_1, C_2, ..., C_K} acquiring visual data (video frames) at discrete time instances {..., t_1, t_2, ..., t_n, ...} (e.g. at a fixed frame rate). Additionally, the cameras are equipped with M data-processing algorithms {A_1, A_2, ..., A_M}, where each algorithm can be:
• A detector of individuals, i.e. extracting in the current frame a number of individuals {i_1, i_2, ..., i_P}. Each extracted individual is represented by its shape function (see below).
• A tracker of individuals, i.e. extracting in the current frame a number of individuals {i_1, i_2, ..., i_P} which are also represented by their shape functions. In contrast to detectors (which extract individuals from the current frame only), trackers identify the current-frame individuals as continuations of individuals from a past frame.
• A detector of groups, i.e. extracting in the current frame a number of consistently-looking groups {g_1, g_2, ..., g_P}. Each group is represented by its shape function.
• A tracker of groups, i.e. extracting in the current frame a number of groups {g_1, g_2, ..., g_P} which are continuations of groups from a past frame.

2.1.1 Shape Functions

Shape functions sf(i_k) (or sf(g_l)) are a generalization of the intuitively straightforward notion of geometric outlines for the extracted individuals i_k or groups g_l. In practice,


Fig. 3 Rectangular outlines in exemplary crowd images of: (a) individuals by using the ACF detector, (b) individuals and groups by using the CNN-based detectors from [24], and (c) groups only by the detector from [21]


Fig. 4 (a) Individuals outlined by their heads (using the head detector from [19]). (b) Groups outlined by irregular patches (using the method from [1])

Fig. 5 Bounding boxes of three detected targets (by YOLO detector, [11]). The provided confidence levels (softmax outputs of the terminal layer) are: bicycle= 0.82, dog= 0.79 and truck= 0.78

rectangular bounding boxes are the most popular form of shape functions for both individuals and groups, see Fig. 3. Alternatively, targets might be represented by differently defined outlines, e.g. heads only (plus default body shapes) for individuals or irregular patches for groups (examples in Fig. 4). Additionally, the outline-based binary shape functions can be scaled by a credibility factor (ranging from 0 to 1) which represents the confidence level of target presence and/or localization. This is what typical CNN-based detectors provide (as softmax outputs of the terminal layer). Figure 5 shows an exemplary image with three detected objects and the corresponding credibility values.


Another way to define shape functions of targets (groups, in particular) is by using density maps (e.g. in [9]) or heat maps (e.g. in [20]). In such cases, target outlines are indirectly defined by near-zero values of these maps, while credibility (which varies within the outlines) is provided locally by the map values.

2.1.2 Monitoring Events

Monitoring events are the proposed abstract entities for visual data acquisition and processing in surveillance systems. Each monitoring event is a triplet MP_{i,j,k} = (C_i, t_j, A_k) consisting of:
• C_i – the camera, with its geometric configuration (with respect to other cameras, if any) provided.
• t_j – the data acquisition time (frame number). If multiple cameras are used, they are assumed to trigger at the same (at least approximately) time instances.
• A_k – the algorithm, i.e. a detector or tracker of either individuals or groups.
Therefore, a monitoring event delivers a list of extracted individuals or groups (as explained above) that is (formally) unrelated to such lists obtained at other monitoring events (which can use different cameras, different algorithms, or perform data acquisition at different times). For each extracted object, its shape function is also provided.

2.2 Matrix-Based Data Associations Between Monitoring Events

In practice, results obtained at different monitoring events are highly related (e.g. results at two monitoring events which differ only by one frame are almost identical), so the proposed model represents those relations in a general and practical way. Within the intended applications, we consider only two types of relations, i.e. sameness and membership, which are further explained as follows:
1. For individuals, the sameness relation is simply identity.
2. For groups, the sameness relation may also include cases of group nesting (i.e. a group being a sub-group of another group). Thus, the formal properties of an equivalence relation are not satisfied for sameness of groups.
3. The membership relation is defined only between individuals and groups, and it is intuitively straightforward.
Given the results obtained at two monitoring events MP_{i1,j1,k1} and MP_{i2,j2,k2}, i.e. two sets of either individual or group objects (of the corresponding cardinalities P1 and P2), we represent associations between these objects by P1 × P2 matrices, with


their elements indicating confidence levels of the corresponding relations between the selected pair of objects. Thus, if individuals are identified at both monitoring events, we have:

$$
II^{i1,j1,k1}_{i2,j2,k2} =
\begin{bmatrix}
ii_{1,1} & \dots & ii_{1,P_2} \\
ii_{2,1} & \dots & ii_{2,P_2} \\
\dots & \dots & \dots \\
ii_{P_1,1} & \dots & ii_{P_1,P_2}
\end{bmatrix}
\qquad (1)
$$

where the ii_{p,q} values are numerical estimates of sameness (identity) between the pth individual from MP_{i1,j1,k1} and the qth individual from MP_{i2,j2,k2}. Similarly, if groups are identified at both monitoring events, we have:

$$
GG^{i1,j1,k1}_{i2,j2,k2} =
\begin{bmatrix}
gg_{1,1} & \dots & gg_{1,P_2} \\
gg_{2,1} & \dots & gg_{2,P_2} \\
\dots & \dots & \dots \\
gg_{P_1,1} & \dots & gg_{P_1,P_2}
\end{bmatrix}
\qquad (2)
$$

where the gg_{p,q} values are numerical estimates of sameness between the pth group from MP_{i1,j1,k1} and the qth group from MP_{i2,j2,k2}. Finally, if the first monitoring event identifies individuals, and the second one groups, the matrix is:

$$
IG^{i1,j1,k1}_{i2,j2,k2} =
\begin{bmatrix}
ig_{1,1} & \dots & ig_{1,P_2} \\
ig_{2,1} & \dots & ig_{2,P_2} \\
\dots & \dots & \dots \\
ig_{P_1,1} & \dots & ig_{P_1,P_2}
\end{bmatrix}
\qquad (3)
$$

where the ig_{p,q} values are numerical estimates that the pth individual from MP_{i1,j1,k1} is/was a member of the qth group from MP_{i2,j2,k2}. For simple cases (more details later) we expect all values in Eqs. 1–3 to be within the [0, 1] range, but generally we can accept any non-negative numbers, with the obvious interpretation that larger values indicate higher confidence that a relationship actually exists.
With matrix-represented data associations, associations between more distant (e.g. temporally) monitoring events can be formally defined by matrix multiplications (via a number of proxy events). Example: Assume that MP_{i1,j1,k1} detects groups, MP_{i2,j2,k2} detects individuals, and MP_{i3,j3,k3} also detects groups. Then, we can estimate sameness between the former and the latter groups as:

$$
GG^{i1,j1,k1}_{i3,j3,k3} = \left( IG^{i2,j2,k2}_{i1,j1,k1} \right)^{T} \times IG^{i2,j2,k2}_{i3,j3,k3}
\qquad (4)
$$

etc.
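From a computational point of view, such a chained association is an ordinary matrix product. A minimal NumPy sketch is given below (an illustration only, assuming the two proxy membership matrices have already been estimated; the variable names and values are hypothetical):

```python
import numpy as np

# Hypothetical membership matrices (values are confidence estimates):
# ig_1: individuals of MP2 (rows) vs. groups of MP1 (columns)
# ig_3: individuals of MP2 (rows) vs. groups of MP3 (columns)
ig_1 = np.array([[0.9, 0.0],
                 [0.8, 0.1],
                 [0.0, 0.7]])
ig_3 = np.array([[0.85, 0.0, 0.0],
                 [0.75, 0.0, 0.1],
                 [0.0,  0.6, 0.0]])

# Eq. 4: sameness between groups of MP1 and groups of MP3,
# estimated through the shared individuals of MP2.
gg_13 = ig_1.T @ ig_3
print(gg_13)  # entry (p, q): evidence that group p of MP1 and group q of MP3 share members
```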


2.3 Estimates of Association Matrices

Even though Eq. 4 provides an elegant way of estimating relations between targets extracted by (possibly spatially and/or temporally distant) monitoring events, we still need the explicit values of the proxy matrices contributing to the multiplication. Those matrices can actually be estimated only in specific cases, where the corresponding detectors or trackers are expected to provide meaningful results. Typically, monitored scenes are subject to varying visibility conditions, unpredictable content fluctuations, and other changes, which constrain the performance of even state-of-the-art detectors and trackers. Thus, we focus on cases where the effects of those changes can be potentially minimized and the obtained results have a reasonable chance to be close to the ground truth. This is the list of typical scenarios we consider (note some notation variations indicating the nature of the scenarios):
1. II^{i,j,Tr}_{i,j−L,Tr} matrices, i.e. individuals are tracked by the same camera (using tracker Tr) and the tracking horizon is short (very small values of L, e.g. L = 1). Such matrices are actually provided (even if indirectly) by advanced multi-target trackers (e.g. [4, 5, 12]), so no additional calculations are needed.
2. GG^{i,j,Tr}_{i,j−L,Tr} matrices, i.e. groups are tracked by the same camera (using tracker Tr) and the tracking horizon is very short. The same remarks as above apply (e.g. [3, 10]).
3. II^{i,j,A1}_{i,j,A2} matrices, i.e. individuals are extracted in the same frame by alternative algorithms A1 and A2.
4. GG^{i,j,A1}_{i,j,A2} matrices, i.e. groups are extracted in the same frame by alternative algorithms A1 and A2.
5. IG^{i,j,A1}_{i,j,A2} matrices, i.e. in the same frame individuals are extracted by method A1, while groups are identified by algorithm A2.
6. II^{Ca,j,A1}_{Cb,j,A1} and GG^{Ca,j,A2}_{Cb,j,A2} matrices, i.e. individuals or groups are simultaneously identified in two cameras (Ca and Cb), and both cameras use the same method A1 for extracting individuals and the same method A2 for extracting groups.
For the cases where actual computational steps are needed (i.e. points 3 to 6 in the above list), the following operations are proposed:

2.3.1 Matrices II^{i,j,A1}_{i,j,A2} and GG^{i,j,A1}_{i,j,A2} (Points 3 and 4)

In such cases, the values of ii_{p,q} (or gg_{p,q}) are estimated by overlaps between the shape functions of the corresponding individuals (or groups). The straightforward Jaccard metric (i.e. IOU, intersection-over-union) is proposed, with the necessary modifications for handling non-binary and multi-valued shape functions (e.g. [8]). Figures 6 and 7 show simple relevant examples, with the corresponding II^{i,j,A1}_{i,j,A2} and GG^{i,j,A1}_{i,j,A2} matrices given, correspondingly, in Table 1 and Eq. 5.

$$
GG^{i,j,A1}_{i,j,A2} =
\begin{bmatrix}
0 & 0.6 & 0.02 \\
0 & 0.36 & 0.89
\end{bmatrix}
\qquad (5)
$$
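For rectangular binary shape functions, the IOU-based estimate can be sketched as follows (our own illustration, assuming axis-aligned bounding boxes; the function and variable names are not from the paper):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def association_matrix(boxes_a1, boxes_a2):
    """P1 x P2 matrix of IOU overlaps between detections of two algorithms."""
    m = np.zeros((len(boxes_a1), len(boxes_a2)))
    for p, a in enumerate(boxes_a1):
        for q, b in enumerate(boxes_a2):
            m[p, q] = iou(a, b)
    return m
```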


Fig. 6 Detection of individuals by two alternative methods ([19] and [6]). Binary shape functions are used, and the resulting II^{i,j,A1}_{i,j,A2} matrix is given in Table 1

Fig. 7 Detection of groups by two alternative algorithms. Binary shape functions are used, and the resulting GG^{i,j,A1}_{i,j,A2} matrix is given in Eq. 5

Table 1 II^{i,j,A1}_{i,j,A2} matrix for the Fig. 6 example

0.12  0.08  0     0     0     0     0     0     0     0     0     0
0     0.09  0.06  0.09  0     0     0     0     0     0     0     0
0     0     0     0     0.07  0     0     0     0     0     0     0
0     0     0     0     0     0.07  0.06  0.07  0.09  0     0     0
0     0     0     0     0     0     0     0     0     0.12  0     0
0     0     0     0     0     0     0     0     0     0     0.06  0.06

2.3.2 Matrices IG^{i,j,A1}_{i,j,A2} (Point 5)

To estimate the credibility of membership relations between individuals and groups (within the same frame), we need a formula for computing the ig_{p,q} values indicating whether the pth individual can be a member of the qth group (see Eq. 3). In the most general case (when the shape functions of individuals and groups are actually density or heat maps) the proposed formula is:

$$
ig_{p,q} = \frac{\iint sf(i_p) \cdot sf(g_q)\, dx\, dy}{\iint sf(i_p)\, dx\, dy}
\qquad (6)
$$


Fig. 8 Binary shape functions (of actually the same individual) in images captured by two cameras (a, b), and back-projection of the first function onto the second image (c). IOU of the polygons in (c) is used as the estimate of sameness credibility for both individuals

where sf(i_p) and sf(g_q) are, correspondingly, the shape functions of the pth individual and of the qth group in an image with X × Y coordinates. It can be easily found that for binary shape functions, Eq. 6 becomes very similar to the IOU metric, i.e.

$$
ig_{p,q} = \frac{\left| sf(i_p) \cap sf(g_q) \right|}{\left| sf(i_p) \right|}
\qquad (7)
$$
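A possible discrete implementation of Eqs. 6–7, treating shape functions as binary or real-valued masks on the image grid, could look as follows (our own sketch; array names are illustrative):

```python
import numpy as np

def membership(sf_individual, sf_group, eps=1e-9):
    """ig_{p,q}: fraction of the individual's shape function covered by the group.

    Both arguments are H x W arrays: binary masks or non-negative density/heat maps.
    """
    num = np.sum(sf_individual * sf_group)   # discrete version of the integral in Eq. 6
    den = np.sum(sf_individual) + eps
    return float(num / den)
```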

2.3.3 Matrices II^{Ca,j,A1}_{Cb,j,A1} and GG^{Ca,j,A2}_{Cb,j,A2} (Point 6)

In the case of two cameras Ca and Cb simultaneously monitoring the environment (and using, correspondingly, the same algorithms), the association matrices are built similarly to Points 3 and 4. However, we additionally need a 3D transformation H^{Ca}_{Cb} defining the relative spatial configuration of both cameras. Given two individuals i_{Ca} and i_{Cb} from both cameras, we first calculate the centers of mass [x̄_A, ȳ_A] and [x̄_B, ȳ_B] of their shape functions, using the first-order moments (e.g. [14]). Then, assuming the centers of mass are actually projections of the same 3D point, we reconstruct its depth using the H^{Ca}_{Cb} matrix, and back-project the whole sf(i_{Ca}) shape function onto the Cb camera image (or the other way around). Then, the elements of the II^{Ca,j,A1}_{Cb,j,A1} matrices are found in the same way as in Point 3. Figure 8 provides an illustrative example.
For group analysis, i.e. for obtaining the GG^{Ca,j,A2}_{Cb,j,A2} matrices, the procedure is basically identical (i.e. as in Point 4). The only practical issue is whether group shape functions can be considered planar objects in 3D scenes. Nevertheless, we found that in typical cases the inaccuracies produced by ignoring depth in group outlines are acceptable, and comparable to the already existing errors in the group outline estimates.
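The two-camera estimate can be sketched as below, simplified to a planar back-projection with a known homography (the full method reconstructs depth from the 3D transformation; `H_ab`, the mask names and the IOU-style score are our illustrative assumptions):

```python
import cv2
import numpy as np

def centroid(mask):
    """Center of mass of a binary shape function, from first-order moments."""
    m = cv2.moments(mask.astype(np.uint8), binaryImage=True)
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])

def cross_camera_sameness(mask_a, mask_b, H_ab):
    """Back-project the Ca shape function onto the Cb image and compute IOU."""
    h, w = mask_b.shape
    warped_a = cv2.warpPerspective(mask_a.astype(np.uint8), H_ab, (w, h))
    inter = np.logical_and(warped_a > 0, mask_b > 0).sum()
    union = np.logical_or(warped_a > 0, mask_b > 0).sum()
    return inter / union if union > 0 else 0.0
```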


2.4 From Association Matrices to Labeling

In typical applications of automatic crowd analysis, we usually need deterministic labeling as the final outcome, with exemplary questions to be addressed as follows:
• Which of the current group members were in the same group 5 minutes ago?
• How many groups merged into the currently existing group?
• What is the pattern of size changes of the existing groups? etc.
Thus, the proposed models should enable conversion of the association matrices into (the most credible) deterministic (label-based) associations. It should be noted, however, that the labels are generated only instantaneously to relate items between two (or more) monitoring events, and these labels are not used for defining any other associations.
For example, we represent relations between groups in the current frame and groups detected in the past by a GG^{i,j,A}_{i,j−K,A} matrix (over longer periods, such matrices are approximated by a number of matrix multiplications, as in the Eq. 4 example). Effectively, the elements of this matrix are estimates of the number of members shared by the corresponding current and past groups. From such data, groups can be deterministically linked based on the maximum values in the GG^{i,j,A}_{i,j−K,A} columns (the most credible current counterpart of a past group)

or in its rows (the most credible past counterpart of a current group). Because such links are created bidirectionally, multiple deterministic associations are possible, i.e. the same current group can be linked to several past groups (which indicates group merging) or the same past group can be linked to multiple current groups (which indicates group splitting). In this way, deterministic links (labels) flexibly propagate through time, and informative details on the crowd evolution in the monitored scene can be provided. More technical and mathematical details can be found in [23].
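The column/row maximum rule described above can be illustrated with the following sketch (our own example; a practical system would additionally threshold weak links):

```python
import numpy as np

def deterministic_links(gg, min_confidence=0.0):
    """Bidirectional links for a GG matrix with current groups as rows
    and past groups as columns."""
    # Most credible past counterpart of every current group (row maxima).
    current_to_past = []
    for p, row in enumerate(gg):
        q = int(np.argmax(row))
        current_to_past.append((p, q if row[q] > min_confidence else None))
    # Most credible current counterpart of every past group (column maxima).
    past_to_current = []
    for q, col in enumerate(gg.T):
        p = int(np.argmax(col))
        past_to_current.append((q, p if col[p] > min_confidence else None))
    return current_to_past, past_to_current

# The same current group appearing as the counterpart of several past groups
# indicates merging; the reverse pattern indicates splitting.
```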

3 Discussion

In our previous paper [23], we discussed the mechanisms for substituting trackers of individuals by trackers of groups (or the other way around). It has been shown that such mechanisms can improve the reliability of various crowd analysis tasks. In particular, we found that:
• Matrix-based relations embedded into baseline trackers improve the performance of those trackers.
• Individuals can be effectively tracked using only group trackers (and the corresponding association matrices), i.e. trackers of individuals can be temporarily disabled (e.g. due to visibility conditions).
• The ground-truth numbers of groups are more accurately estimated using the proposed mechanisms.


We expect that the formalisms presented in this paper (which cover more diversified and complex surveillance scenarios) can prospectively improve the performance of many other crowd analysis tasks. Currently, the prospective applications are mainly in human crowd analysis. Unfortunately, there are many justified concerns regarding potential abuse of such capabilities, and we are fully aware of such reservations. Nevertheless, recent developments in mobile robotics (e.g. swarms of drones) open new avenues for future applications. We believe that the proposed models are suitable for integration with any typical surveillance system for automatic crowd analysis, regardless of the type of trackers and detectors used in the system, and regardless of the application domain.

References 1. Ali, S., Shah, M.: A lagrangian particle dynamics approach for crowd flow segmentation and stability analysis. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–6 (2007). https://doi.org/10.1109/CVPR.2007.382977 2. Bendali-Braham, M., Weber, J., Forestier, G., Idoumghar, L., Muller, P.A.: Recent trends in crowd analysis: a review. Mach. Learn. Appl. 4, 100,023 (2021). https://doi.org/10.1016/j. mlwa.2021.100023 3. Bochinski, E., Senst, T., Sikora, T.: Extending iou based multi-object tracking by visual information. In: 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6 (2018). https://doi.org/10.1109/AVSS.2018.8639144 4. Ciaparrone, G., Luque Sanchez, F., Tabik, S., Troiano, L., Tagliaferri, R., Herrera, F.: Deep learning in video multi-object tracking: a survey. Neurocomputing 381, 61–88 (2020). https:// doi.org/10.1016/j.neucom.2019.11.023 5. Dehghan, A., Modiri Assari, S., Shah, M.: Gmmcp tracker: globally optimal generalized maximum multi clique problem for multiple object tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4091–4099 (2015). https://doi.org/10.1109/ CVPR.2015.7299036 6. Dollár, P., Appel, R., Belongie, S., Perona, P.: Fast feature pyramids for object detection. IEEE Trans. Pattern Anal. Mach. Intell. 36(8), 1532–1545 (2014). https://doi.org/10.1109/TPAMI. 2014.2300479 7. Feldmann, M., Franken, D., Koch, W.: Tracking of extended objects and group targets using random matrices. IEEE Trans. Signal Process. 59(4), 1409–1420 (2011). https://doi.org/10. 1109/TSP.2010.2101064 8. Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets, 3rd edn., chap. Finding Similar Items. Cambridge University Press (2020) 9. Lia, D., Zhua, J., Xua, B., Lua, M., Li, M.: An ant-based filtering random-finite-set approach to simultaneous localization and mapping. Int. J. Appl. Math. Comput. Sci. 28(3), 505–519 (2018). https://doi.org/10.2478/amcs-2018-0039 10. Mazzon, R., Poiesi, F., Cavallaro, A.: Detection and tracking of groups in crowd. In: 2013 10th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 202–207 (2013). https://doi.org/10.1109/AVSS.2013.6636640 11. Redmon, J., Farhadi, A.: Yolo9000: better, faster, stronger. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7263–7271 (2017). https://doi.org/10. 1109/CVPR.2017.690


12. Rezatofighi, S.H., Milan, A., Zhang, Z., Shi, Q., Dick, A., Reid, I.: Joint probabilistic data association revisited. In: IEEE International Conference on Computer Vision (ICCV), pp. 3047–3055. IEEE (2015). https://doi.org/10.1109/ICCV.2015.349 13. Schranz, M., Umlauft, M., Sende, M., Elmenreich, W.: Swarm robotic behaviors and current applications. Front. Robot. AI 7(36) (2020). https://doi.org/10.3389/frobt.2020.00036 ´ 14. Sluzek, A.: Zastosowanie metod momentowych do identyfikacji obiektów w cyfrowych systemach wizyjnych. Wydawnictwa Politechniki Warszawskiej, Warszawa (1990) 15. Sreenu, G., Saleem Durai, M.: Intelligent video surveillance: a review through deep learning techniques for crowd analysis. J. Big Data 6(48) (2019). https://doi.org/10.1186/s40537-0190212-5 16. Steyer, S., Tanzmeister, G., Lenk, C., Dallabetta, V., Wollherr, D.: Data association for gridbased object tracking using particle labeling. In: 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 3036–3043 (2018). https://doi.org/10.1109/ITSC. 2018.8569511 17. Tomar, A., Kumar, S., Pant, B.: Crowd analysis in video surveillance: a review. In: 2022 International Conference on Decision Aid Sciences and Applications (DASA), pp. 162–168 (2022). https://doi.org/10.1109/DASA54658.2022.9765008 18. Wang, X., Lu, J.: Collective behaviors through social interactions in bird flocks. IEEE Circuits Syst. Mag. 19(3), 6–22 (2019). https://doi.org/10.1109/MCAS.2019.2924507 19. Zhou, T., Yang, J., Loza, A., Al-Mualla, M., Bhaskar, H.: Crowd modeling framework using fast head detection and shape-aware matching. J. Electron. Imaging 24 (2015). https://doi.org/ 10.1117/1.JEI.24.2.023019 20. Zhou, X., Zhuo, J., Krahenbuhl, P.: Bottom-up object detection by grouping extreme and center points. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 850–859 (2019). https://doi.org/10.1109/CVPR.2019.00094 21. Zitouni, M.S., Bhaskar, H., Al-Mualla, M.E.: Robust background modeling and foreground detection using dynamic textures. In: International Conference on Computer Vision Theory and Applications (VISIGRAPP ’16), pp. 403–410 (2016). https://doi.org/10.5220/ 0005724204030410 ´ 22. Zitouni, M.S., Sluzek, A.: Video-surveillance tools for monitoring social responsibility under covid-19 restrictions. In: Computer Vision and Graphics (Proceedings of the ICCVG 2020), pp. 227–239. Springer International Publishing, Cham (2020). https://doi.org/10.1007/978-3030-59006-2_20 ´ 23. Zitouni, M.S., Sluzek, A.: A data association model for analysis of crowd structure. Int. J. Appl. Math. Comput. Sci. 32(1), 81–94 (2022). https://doi.org/10.34768/amcs-2022-0007 24. Zitouni, M.S., Sluzek, A., Bhaskar, H.: Cnn-based analysis of crowd structure using automatically annotated training data. In: IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) (2019). https://doi.org/10.1109/AVSS.2019.8909846 25. Zitouni, M.S., Sluzek, A., Bhaskar, H.: Visual analysis of socio-cognitive crowd behaviors for surveillance: a survey and categorization of trends and methods. Eng. Appl. Artif. Intell. 82, 294–312 (2019). https://doi.org/10.1016/j.engappai.2019.04.012

Digital Wah-Wah Guitar Effect Controlled by Mouth Movements

Adam Nowosielski and Przemysław Reginia

Abstract The wah-wah is a guitar effect used to modulate the sound while playing. This is an unusual effect in that the guitar player, having both hands on the instrument, controls it in real time with the foot. The digital equivalent proposed in this paper transfers this control to mouth movements by capturing an image from a computer camera and then applying computer vision algorithms. The paper analyzes the applicability and studies the effectiveness of using mouth movement to control a wah-wah type guitar effect.

Keywords Wah-wah effect · Face features · Head-controlled interface

A. Nowosielski (B) · P. Reginia, Faculty of Computer Science and Information Technology, West Pomeranian University of Technology, Szczecin, Żołnierska 52 Str., 71-210 Szczecin, Poland, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. L. J. Chmielewski and A. Orłowski (eds.), Computer Vision and Graphics, Lecture Notes in Networks and Systems 598, https://doi.org/10.1007/978-3-031-22025-8_3

1 Introduction

Digital musical effects are gaining popularity with the dynamic development of algorithms used for signal modeling. Each year they simulate their analog counterparts more and more faithfully, making it hard even for musicians to distinguish the original from the simulation. While there are still many supporters of traditional solutions, one cannot ignore the great advantage of digital effects in terms of the convenience of their use, transportation and the lower costs they generate. Wah-wah is a guitar effect used to modulate the sound while playing. The name of the wah-wah effect (also often called Cry Baby after the most popular effect model developed by Dunlop) is an onomatopoeia: the spoken words sound similar to the sound of a guitar subjected to this effect. It consists in boosting and cutting appropriate frequencies of the guitar signal. These frequencies are determined by the guitarist with the foot, as both hands are occupied while playing. The purpose of this paper is to analyze the applicability and study the effectiveness of using mouth movement to control a wah-wah type guitar effect. This is an unusual effect in that the user controls it in real time with their foot. The digital equivalent


developed in this paper will transfer this control to mouth movement by capturing an image from a computer camera and then applying computer vision algorithms. The study will look at how natural it is to use such a solution, and whether it can be considered a real-time solution, as the analog effect is. The work is motivated by the possibility of creating a device that could be used in recording studios or their small home equivalents without having to purchase a physical effect. The proposed solution falls into the category of vision-based interfaces operated by mouth movement (or mouth-operated interfaces). The problem has already been addressed in the literature and examples of such interfaces include: control of audio synthesis modules [1], text entry for people with disabilities [2], robot control [3], and many others.
The rest of the paper is structured as follows. Section 2 presents the principle of the wah-wah effect. In Sect. 3 the concept of the interface is proposed. Initial experiments are described in Sect. 4. The research environment is described in Sect. 5. Results of the detailed studies are provided and discussed in Sect. 6. The paper ends with a summary.

2 The Principle of the Wah-Wah Effect

The wah-wah in its classical, most common form is a foot-operated pedal, presented in Fig. 1 on the left. The operation of the digital wah-wah effect is possible by using filters. The filter decreases and/or increases the power of certain frequencies of the signal. Because the effect setting changes in real time, the filter parameters (the frequencies to be boosted or cut) also change, which results in the characteristic wah-wah sound. The most common filter used in the wah-wah effect is the band-pass filter. The result can be easily observed in the frequency domain of the signal using white noise, which has a flat spectrum (see Fig. 1 top right). When a guitarist sets the expression pedal to the extreme position away from the base, the filter will modulate the signal by boosting the low frequencies with a peak at around 600 Hz. At the same time, other frequencies will be clipped progressively according to how far they are from the band (see the middle pair of Fig. 1). The opposite situation occurs when the musician presses the pedal as close to the base as possible (bottom case in Fig. 1). The output signal in this situation is mostly devoid of bass bands, and the middle and high frequencies are clearly audible, with the peak observed at around 2000 Hz. This is a generalized principle of operation that may vary with the type of filter used. Wah-wah filters operate in a much narrower spectrum than the 20 Hz–20 kHz range heard by humans, since the key frequencies in a guitar's tone are those in the mid-range (500 Hz–2 kHz). The effect is controlled by 3 parameters: f_c – the center frequency (the one which will be boosted/cut the most), Q – the width coefficient of the boosted/cut band, and Gain – the power with which the band will be boosted/cut. The magnitude of the Q parameter is inversely proportional to the bandwidth and is calculated using the formula [4]: Q = f_c / bandwidth.
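A minimal sketch of such a sweeping band-pass stage, using SciPy's peaking-filter design, is given below (our own illustration; the paper's implementation is based on the JUCE framework, and the parameter values are only examples):

```python
import numpy as np
from scipy.signal import iirpeak, lfilter

fs = 44100  # sampling rate [Hz]

def wah_block(x, fc, Q=5.0):
    """Filter one audio block with a band-pass (peaking) filter centered at fc."""
    b, a = iirpeak(fc, Q, fs=fs)   # second-order IIR resonator
    return lfilter(b, a, x)

# Sweeping fc between 500 Hz and 2000 Hz block by block gives the wah-wah sound
# (for simplicity the filter state is not carried across blocks here).
noise = np.random.randn(fs)        # 1 s of white noise for a quick test
blocks = np.array_split(noise, 100)
sweep = np.linspace(500.0, 2000.0, len(blocks))
out = np.concatenate([wah_block(b, fc) for b, fc in zip(blocks, sweep)])
```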


Fig. 1 Illustration of the classical wah-wah pedal and its effect on the white noise

3 Conception

Having explained the principle of operation of the classical solution, we will continue with the proposal of a digital wah-wah guitar effect controlled by mouth movements. This requires appropriate detection of the mouth opening level and a corresponding mapping procedure for the guitar effect. Contemporary state-of-the-art solutions allow effective detection of faces and facial features. Although there are newer solutions like the BlazeFace model, which offers 468 3D face landmarks in real time [5], we opted for the well-established and acknowledged Dlib solution [6]. Dlib is employed in contemporary projects involving the use of facial features, such as: driver fatigue detection [7, 8], age assessment [9], liveness detection (for detecting spoofing or fakeness) [10], face-operated user interfaces [11], face recognition [12], and many others. Dlib and its shape_predictor_68_face_landmarks model offer 68 distinctive landmark points on a human face. Based on that model we will examine three mechanisms for mouth opening level detection (see Fig. 2): basic, eye, and polygon. The basic approach employs the calculation of the distance between landmark points P^52 and P^58: m_basic = ‖P^52 − P^58‖. In the second calculation method – eye – we adopted the mechanism originally proposed in [13] for eye blink detection. The algorithm proposed in [13] estimates eye landmark positions and extracts a single scalar quantity, the eye aspect ratio (EAR), which characterizes the eye opening. Considering the Dlib landmark model, the EAR value can be transferred to calculate the level of mouth opening as follows:


Fig. 2 Dlib facial landmarks and three techniques (basic, eye, and polygon) employed for mouth opening level detection

$$
m_{eye} = \frac{\left\| P^{51} - P^{59} \right\| + \left\| P^{53} - P^{57} \right\|}{2 \left\| P^{49} - P^{55} \right\|}
\qquad (1)
$$

The last method – polygon – calculates the mouth opening level m_polygon as the area of the polygon formed by the external mouth landmarks with indices from 49 to 60. The calculated values define the range of mouth openness. The obtained values are normalized to the 0–1 range according to a user-specific calibration, during which reference values of the maximally open and the closed mouth are obtained for an individual. The normalized value is then mapped to the base f_c frequency. A pass band is determined and its range is boosted. The value in the range 0–1 can be interpreted as a percentage of the filter's operating range. Setting the minimum value at which the filter operates to 500 Hz and the maximum to 2000 Hz, and for an exemplary value of the normalized mouth opening equal to 0.4, the resulting value of f_c is 1100 Hz. Due to the selected solution for face and facial feature detection, the ability to use the effect in concert settings may be limited. The lighting at such events is often insufficient or highly dynamic, which causes problems for algorithms operating in the visible spectrum. One possibility is to use a capturing device operating outside the visible range [14]. The use of modern imaging technologies such as infrared, thermal imaging or depth maps could allow the effect to be used in difficult lighting conditions. An example of successful employment of thermal imaging for a head-operated touchless typing interface can be found in [15]. In its current form, however, the proposed solution will successfully find application in studio use and home recording.
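The three measures and the linear mapping to f_c could be sketched as below (our own illustration; `landmarks` is assumed to be the 68 × 2 array returned by Dlib's predictor, indexed from 0, so the paper's 1-based point P^k becomes index k − 1):

```python
import numpy as np

def mouth_opening(landmarks, method="basic"):
    """Mouth opening level from 68 Dlib landmarks (0-based array indexing)."""
    P = lambda k: np.asarray(landmarks[k - 1], dtype=float)   # paper uses 1-based indices
    if method == "basic":        # vertical lip distance
        return np.linalg.norm(P(52) - P(58))
    if method == "eye":          # EAR-like ratio, Eq. 1
        return (np.linalg.norm(P(51) - P(59)) +
                np.linalg.norm(P(53) - P(57))) / (2 * np.linalg.norm(P(49) - P(55)))
    if method == "polygon":      # shoelace area of the outer mouth contour (points 49-60)
        pts = np.array([P(k) for k in range(49, 61)])
        x, y = pts[:, 0], pts[:, 1]
        return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    raise ValueError(method)

def map_to_fc(value, closed, opened, f_min=500.0, f_max=2000.0):
    """Normalize against calibration values and map linearly to the filter frequency."""
    norm = np.clip((value - closed) / (opened - closed), 0.0, 1.0)
    return f_min + norm * (f_max - f_min)      # e.g. norm = 0.4 -> 1100 Hz
```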


Fig. 3 Functions used to map mouth openness into the base f c filter frequency: linear, hyperbolic tangent, area hyperbolic tangent, exponential

4 Initial Experiments

We performed some initial experiments during which users, after the calibration process, were asked to open their mouths to what they considered halfway. As expected, most responses fell within the actually measured 40–50% of the opening. However, the second most common measured result of opening the mouth halfway was only 20% of the maximum openness, and the whole range of results spanned from 20% all the way up to 80%. To overcome potential user-specific problems we decided to use supplementary functions to map the result (see Fig. 3). Our second observation was that the developed solution showed detection instability: when the mouth was stationary, the central frequency fluctuated over a rather wide range, which audibly affected the signal. To solve the problem a FIFO queue was utilized. Filled with successive normalized values of the mouth opening level, it is used to calculate the arithmetic mean, and the resulting value is used to generate the filter center frequency. The optimal FIFO queue size will be investigated in detail later in the paper.
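Such smoothing can be realized with a fixed-length queue, e.g. as in the following sketch (our own illustration; the queue length corresponds to the Acc parameter discussed later):

```python
from collections import deque

class FifoSmoother:
    """Moving average over the last `size` normalized mouth-opening values."""
    def __init__(self, size=4):
        self.values = deque(maxlen=size)

    def update(self, norm_opening):
        self.values.append(norm_opening)
        return sum(self.values) / len(self.values)

# smoother = FifoSmoother(size=4)
# smoothed = smoother.update(norm)   # feed the smoothed value into the f_c mapping
```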

5 Research Environment To verify the validity of the concept we prepared the research environment using the JUCE framework. Due to its high level of support for audio solutions, it is a popular choice when developing applications and audio plugins. To capture the sound signal of the guitar we used two audio interfaces from the middle price range: Line 6 UX1 and Audient iD4. All image processing was implemented with the OpenCV and Dlib libraries. The graphical user interface of the research environment is presented in Fig. 4. It is divided into two sections (individual tabs) which contain controls for manipulating parameters related to audio or video. In the video tab, an operator has access to: controls for handling statistical surveys (1), video preview (2), selection of the output (mapping) function (3), and its visualization (7), choice of the method for calculating the level of mouth opening (4), information on the average processing time of a single frame (5), controls for starting the calibration process (6), indicator showing the current calculated value


Fig. 4 The interface of the developed research environment: video tab (top) and audio tab (bottom)

of f_c (8), and an accuracy (Acc) slider, determined by the size of the FIFO queue (9). The audio tab allows changing the wah-wah effect parameters: Q (1), Gain (2) and the minimum (3) and maximum (4) frequency of the range for the effect. Number (5) in the audio tab presents the current calculated value of f_c.

6 Results

The evaluation of the proposed mouth-controlled digital wah-wah effect was divided into two groups – objective and subjective. The objective research focused on the areas of the environment responsible for video processing, which made it possible to invite people completely unrelated to music to the experiment. From each participant, 200 f_c frequency values provided by the video processing algorithms were gathered under different operating parameters. Ten people between the ages of 17 and 51 took part in the objective study. We calculated the fluctuations of the mapped f_c for the same level of mouth opening. The differences, defined as the f_c factor, are shown in Fig. 5. First, we analyzed the stability of the f_c factor depending on the size of the FIFO queue (the Acc parameter). The research environment allows for eleven FIFO queue sizes. Due to the imperceptible differences between the levels, and taking into account the time required and consequently the fatigue of the participants, the research was limited to four levels: 1, 4, 7 and 11. The results are provided in Fig. 5 on the left. As its size increased, a


Fig. 5 Influence of the FIFO queue size (left) and the method employed (right) on the f c factor

stabilisation of the detection performance could be observed. However, as the queue size increased significantly the delay between mouth movement and effect response became noticeable. This was because many values from previous iterations affected the current value of the central frequency. The results of the Student’s t-test between the four levels showed that levels 4 and 7 did not show statistically significant differences. When analysing the methods used to determine the level of mouth opening, eye and polygon proved to be the most stable (see Fig. 5 on the right). The highest variability score (and the lowest Student’s t-test score) was observed for the eye method. This may be due to the fact that this method, originally designed to determine in a binary manner the opening or closing of the eye, is not fully transferable as a measure of mouth opening. The subjective research involved 4 guitarists, representing different musical styles and experienced in using the classical wah-wah effect. For such people the developed solution is dedicated. The first parameter investigated was accuracy (determined by the size of the FIFO queue) and its impact on perceived delay. We used questionnaire approach with the 5 level scale. Results are provided in Fig. 6 on the left. Each subject was then asked to identify which setting had the best trade-off between stability and latency. The majority considered the Acc = 4 setting to be the best, as shown in Fig. 6 top right. Interestingly, we cannot provide any quantitative results for the four output mapping functions. Each musician felt that they could hear the differences between these functions. They agreed that they are necessary for the full functionality of the system and provide appropriate control for a specific situation. As part of the summary, guitarists were asked to specify how intuitive/natural they thought it was to control the wah-wah effect with mouth movement on a five-point scale. Results in a graphical form are presented in Fig. 6 bottom right.


Fig. 6 Subjective evaluation of the developed solution by guitarists

7 Summary

In the paper we proposed a digital wah-wah guitar effect controlled by mouth movements. Many musicians move their mouth in sync with the wah-wah pedal as if miming the sounds, so the approach where mouth movements are employed to control and mimic the traditional wah-wah effect seems reasonable. We tried to imitate the traditional wah-wah effect in terms of sound, control and latency, and the obtained results provide sufficient evidence that this is feasible. Although in concert settings with dynamic lighting the usage of a mouth-controlled interface based on image analysis may be limited, the solution will find application in audio recording when a physical wah-wah pedal is not available. In further work it would be desirable to stabilize and improve the reliability of face landmark detection over time. In the current version all deficiencies have been compensated for by averaging values over time, which, however, with too long queues may lead to noticeable delays. All in all, our proposal was well received among the few guitarists who have had the opportunity to work with the effect.


References 1. Silva, G.C., Smyth, T., Lyons, M.: A novel face-tracking mouth controller and its application to interacting with bioacoustic models. In: Proceedings of the 2004 Conference on New Interfaces for Musical Expression, pp. 169–172 (2004) 2. Mu-Chun, S., Chin-Yen, Y., Yi-Zeng, H., Shih-Chieh, L., Pa-Chun, W.: An image-based mouth switch for people with severe disabilities. Recent Patent. Comput. Sci. 5(1), 66–71 (2012) 3. Gomez, J., Ceballos, A., Prieto, F., Redarce, T.: Mouth gesture and voice command based robot command interface. In: 2009 IEEE International Conference on Robotics and Automation, pp. 333–338 (2009). https://doi.org/10.1109/ROBOT.2009.5152858 4. Corey, J., Benson, D.H.: Audio Production and Critical Listening: Technical Ear Training, 2nd edn, Routledge (2016) 5. Kartynnik, Y., Ablavatski, A., Grishchenko, I., Grundmann, M.: Real-time Facial Surface Geometry from Monocular Video on Mobile GPUs, arXiv preprint arXiv:1907.06724 (2019) 6. King, D.E.: Dlib-ml: A machine learning toolkit. J. Mach. Learn. Res. 10, 1755–1758 (2009) 7. Kavitha, R., Subha, P., Srinivasan, R., Kavitha, M.: Implementing OpenCV and Dlib OpenSource library for detection of driver’s fatigue. In: Raj, J.S., Kamel, K., Lafata, P. (eds.) Innovative Data Communication Technologies and Application. Lecture Notes on Data Engineering and Communications Technologies, vol. 96. Springer, Singapore (2022). https://doi.org/10. 1007/978-981-16-7167-8_26 8. Babu, A., Nair, S., Sreekumar, K.: Driver’s drowsiness detection system using Dlib HOG. In: Karuppusamy, P., Perikos, I., García Márquez, F.P. (eds.) Ubiquitous Intelligent Systems. Smart Innovation, Systems and Technologies, vol. 243. Springer, Singapore (2022). https://doi.org/ 10.1007/978-981-16-3675-2_16 9. Elmahmudi, A., Ugail, H.: A framework for facial age progression and regression using exemplar face templates. Vis. Comput. 37, 2023–2038 (2021). https://doi.org/10.1007/s00371-02001960-z 10. Mandol, S., Mia, S., Ahsan, S.M.M.: Real time liveness detection and face recognition with OpenCV and deep learning. In: 2021 5th International Conference on Electrical Information and Communication Technology (EICT), pp. 1–6 (2021). https://doi.org/10.1109/EICT54103. 2021.9733685 11. Jácome, J., Gomes, A., Costa, W.L., Figueiredo. L.S., Abreu, J., Porciuncula, L., Brant, P.K., Alves, L.E.M., Correia, W.F.M., Teichrieb, V., Quintino, J.P., da Silva, F.Q.B., Santos, A.L.M., Pinho, H.S.: Parallax engine: Head controlled motion parallax using notebooks’ RGB camera. In: Symposium on Virtual and Augmented Reality, pp. 137–146 (2021). https://doi.org/10. 1145/3488162.3488218 12. Boyko, N., Basystiuk, O., Shakhovska, N.: Performance evaluation and comparison of software for face recognition, based on Dlib and Opencv library. In: 2018 IEEE Second International Conference on Data Stream Mining & Processing (DSMP), pp. 478–482 (2018). https://doi. org/10.1109/DSMP.2018.8478556 13. Soukupová, T., Cech, J.: Real-time eye blink detection using facial landmarks. In: 21st Computer Vision Winter Workshop, pp. 3–5. Rimske Toplice, Slovenia (2016) 14. Małecki, K., Nowosielski, A., Forczma´nski, P.: Multispectral data acquisition in the assessment of driver’s fatigue. In: Mikulski J. (ed.) Smart Solutions in Today’s Transport. TST 2017. Communications in Computer and Information Science, vol. 715. Springer, Cham (2017) 15. Nowosielski, A., Forczma´nski, P.: Touchless typing with head movements captured in thermal spectrum. Pattern Anal. Appl. 22(3), 841–855 (2019)

Traffic and Driving

Traffic Sign Classification Using Deep and Quantum Neural Networks

Sylwia Kuros and Tomasz Kryjak

Abstract Quantum Neural Networks (QNNs) are an emerging technology that can be used in many applications, including computer vision. In this paper, we present a traffic sign classification system implemented using a hybrid quantum-classical convolutional neural network. Experiments on the German Traffic Sign Recognition Benchmark dataset indicate that currently QNNs do not outperform classical DCNNs (Deep Convolutional Neural Networks), yet they still provide an accuracy of over 90% and are a definitely promising solution for advanced computer vision.

Keywords Quantum neural networks · Traffic sign recognition · DCNN · GTSRB

S. Kuros · T. Kryjak (B), Embedded Vision Systems Group, Computer Vision Laboratory, Department of Automatic Control and Robotics, AGH University of Science and Technology, Krakow, Poland, e-mail: [email protected]; S. Kuros e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. L. J. Chmielewski and A. Orłowski (eds.), Computer Vision and Graphics, Lecture Notes in Networks and Systems 598, https://doi.org/10.1007/978-3-031-22025-8_4

1 Introduction

Nowadays, the amount of data produced doubles every two years [6], and the heyday of computers in the classical sense is coming to an end. Maintaining the current momentum of technological development requires a change in the approach to computing. One of the most promising solutions is to transfer the idea of computing from the field of classical mechanics to quantum mechanics, which creates new and interesting possibilities, but at the same time poses significant challenges. In computer vision systems, which require processing of large amounts of data in real time for effective operation, quantum neural networks (QNNs) can prove to be a very attractive solution. On the basis of experiments carried out in recent years, it can be


observed that the network training process is becoming quicker and results are more accurate. They also show better generalization capabilities even with small amounts of training data and require several times fewer epochs compared to classical networks. QNNs were successfully used, among others, in the following applications: tree recognition in aerial space of California [3], cancer recognition [10], facial expression recognition [17], vehicle classification [19], traffic sign recognition from the LISA database considering the vulnerability of adversarial attacks [12], handwriting recognition [7, 8, 15, 18, 21, 22], object segmentation [1, 2], pneumonia recognition [20], classification of ants and bees [13], classification of medical images of chest radiography and retinal color of the fundus [14], and generative networks [4, 11]. In this paper, we describe the results of our work on a quantum neural network for the classification of traffic signs from the German Traffic Sign Recognition Benchmark (GTSRB) dataset. The aim of our research was to analyze and compare the results obtained using a classical deep convolutional neural network (DCNN) and a quantum neural network. The architecture was implemented using the Python language and the PennyLane library for quantum computing. To the best of our knowledge, this is the only work on traffic sign classification for the popular GTSRB dataset that uses quantum computing techniques. Another paper [18] on a similar topic deals with a different set and the proposed methods have lower efficiency. The remainder of this paper is organized as follows. Section 2 provides background information on quantum neural networks. In Sect. 3 our motivation and the purpose of our experiment were described in the context of related work on machine learning applied to computer vision systems. Section 4 presents the experiments conducted and the results obtained. In Sect. 5 conclusions and ideas for further development are provided.

2 Quantum Neural Networks

The registers of quantum computers can exist in all possible states simultaneously, due to the property of superposition, with a chance of capturing a given state at the time of measurement whose probability was encoded before the measurement. Quantum computing works on the principle of increasing the probability of a desired state to a certain high value, so that this state can be reached with high confidence and with as few measurements as possible. In this context, quantum interference resulting from the superposition phenomenon allows the amplitudes of the probabilities corresponding to given states to influence each other. The same is true for quantum entanglement, which allows quantum objects to be linked by a strong correlation, useful for readings of quantum states. Quantum entanglement means that two particles remain connected despite the distances that separate them. In quantum computing, such a state allows for the connection of a large number of qubits (the quantum equivalent of the classical bit) together, thereby increasing computational resources in a nonlinear way.


There are many approaches to the issue of physically prototyping a quantum computer, the two most popular being based on superconducting electronic circuits or ion traps. In the case of superconducting quantum circuits, the quantum processor is placed at the bottom of the cryostat cylinder in a shield composed of a material called cryoperm and a magnetic field. It is connected by wires that send microwave pulses of varying frequency and duration to control and measure the state of the qubits. Due to the interference that temperature and environmental factors bring, dilution refrigerators are used to lower the temperature to 15 millikelvin. Quantum information can also be destroyed by resistance, so a superconducting material with zero resistance at some low temperature is used [9]. A slightly different operating principle can be observed in quantum computer architectures based on ion traps. The advantage of this approach is the ability to capture the state of a qubit at a well-defined location by trapping it with an electromagnetic trap called a Paul trap [16]. It works on the principle of a time-varying electric field applied in such a way as to hold the ion in a constant position. To obtain architectures with multiple qubits, it is necessary to trap multiple ions in a linear chain. The use of ions as qubits requires laser light to change the state of an electron in an atom from the ground state to an excited state. Keeping it in the excited state is achieved by changes in the laser frequency. Due to the lack of widespread access to real quantum computers, a lot of research is carried out in an artificial environment using special software that simulates the ideal relationships found in quantum mechanics. In the case of this paper, considerations based on a simulated environment have also been performed. Currently available classical computers, including supercomputers and cloud computing, allow simulating quantum hardware. This supports, to a limited extent, research on quantum algorithms and computations in parallel to the work on the hardware layer. One of the interesting topics is quantum neural networks, addressed in this paper. At the current stage of development, quantum deep learning is based on two classes of neural networks. The first one consists of quantum and classical neural layers and is called a hybrid quantum-classical neural network, whereas the second uses only quantum gates in layer construction. A hybrid quantum-classical neural network contains a hidden quantum layer built of parameterized quantum circuits whose basic units are quantum gates. They allow the state of a qubit to be modified by superposition, entanglement, and interference, so that the measurement result uniquely correlates with the unobservable state. In quantum neural networks, quantum gates rotate the state of a qubit by an angle that is a parameter of the rotation gates in a given layer, based on the output of the classical circuit in the preceding layer. The idea of training a classical and a quantum network is the same. However, an important difference is that, in the case of quantum networks, the parameters of the quantum circuit are optimized, whereas, in the process of learning a classical network, the search for the best possible weights is performed. There are an infinite number of quantum gates, but this paper focuses on discussing the most relevant ones. The quantum NOT gate (or Pauli-X) belongs to the class of


single-qubit gates. For visualization purposes, a way to represent qubit states in 3D spherical coordinates, called the Bloch sphere, was proposed. The operation performed by the NOT gate corresponds to a rotation around the x axis of the Bloch sphere by an angle π. In other words, the NOT gate swaps the probability amplitudes of the states |0⟩ and |1⟩ according to the following formula:

$$
X : \alpha|0\rangle + \beta|1\rangle \rightarrow \beta|0\rangle + \alpha|1\rangle
\qquad (1)
$$

The matrix representation is as follows:

$$
X = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
\qquad (2)
$$

The Pauli-Y gate, corresponding to a rotation around the y axis of the Bloch sphere by an angle π, is described by a unitary matrix of the following form:

$$
Y = \begin{bmatrix} 0 & -i \\ i & 0 \end{bmatrix}
\qquad (3)
$$

The Pauli-Z gate, a rotation around the z axis of the Bloch sphere by the angle π, is described by the following unitary matrix:

$$
Z = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}
\qquad (4)
$$

Similarly, the rotation gates R_X, R_Y, R_Z, belonging to the group of single-qubit gates, perform a rotation by an angle θ around the x, y, z axis of the Bloch sphere, respectively. In matrix representation, R_X is given as follows:

$$
R_X(\theta) = \begin{bmatrix} \cos\frac{\theta}{2} & -i\sin\frac{\theta}{2} \\ -i\sin\frac{\theta}{2} & \cos\frac{\theta}{2} \end{bmatrix}
\qquad (5)
$$

The R_Y gate:

$$
R_Y(\theta) = \begin{bmatrix} \cos\frac{\theta}{2} & -\sin\frac{\theta}{2} \\ \sin\frac{\theta}{2} & \cos\frac{\theta}{2} \end{bmatrix}
\qquad (6)
$$

The R_Z gate:

$$
R_Z(\theta) = \begin{bmatrix} e^{-i\theta/2} & 0 \\ 0 & e^{i\theta/2} \end{bmatrix}
\qquad (7)
$$

The Hadamard gate also operates on a single qubit. On a given computational basis, it creates a superposition state. This operation can be illustrated as a rotation around the (x + z)/√2 axis of the Bloch sphere by an angle π. This gate is given by the unitary matrix:

$$
H = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
\qquad (8)
$$

47

(8)

Phase gates also fall into the category of single-cubit gates. U1 gate rotates the cubit around z axis, U2 gate around x + z axis and U3 gate is a rotation gate with three Euler angles. The matrix representations of these gates are as follows:   1 1 0 U 1(θ ) = √ θ 2 0 ei 2  U 2(θ, γ ) =  U 3(λ, θ, γ ) =

γ

1 −ei 2 θ +γ θ ei 2 ei 2

θ

 (10) γ

cos λ4

ei 2 sin

(9)

−ei 2 sin λ 4

ei

θ +γ 2

cos

λ 4 λ 4

 (11)

For two-qubit gates consisting of a control qubit and a target qubit, the state of the control qubit remains constant during operations. A change of the state of the target qubit is performed when the state of the control qubit is |1. A popular one in this group is the C N O T gate (controlled Pauli-X). The unitary matrix representing the CNOT gate is of the form: ⎡

1 ⎢0 C N OT = ⎢ ⎣0 0

0 1 0 0

0 0 0 1

⎤ 0 0⎥ ⎥ 1⎦ 0

(12)

For the CZ gate, the representation is as follows: ⎡

1 ⎢0 CZ = ⎢ ⎣0 0

0 1 0 0

0 0 1 0

⎤ 0 0⎥ ⎥ 0⎦ −1

(13)

In the case of a controllable CRY rotation gate, the matrix is given: ⎡

1 0 0 0 ⎢0 cos θ 0 − sin 2 ⎢ C RY (θ ) = ⎢ 0 ⎣0 0 1 0 sin

θ 2

0 cos θ2

⎤ θ⎥ 2⎥

⎥ ⎦

(14)

48

S. Kuros and T. Kryjak

A network built solely from quantum layers based on unitary gates was first proposed by Edward Farhi and Hartmut Neven in 2018 [5]. In this neural network, the quantum circuit consisted of a sequence of parameter-dependent unary transformations that performed transformations on the input quantum state. At the current stage of development, hybrid networks seem much more promising because they assume operations on small quantum circuits with negligible or zero error correction, so they are shown to have potential for use with upcoming quantum computer architectures. This is because Noisy Intermediate-Scale Quantum (NISQ) era computer architectures are solutions with limited computational resources and need the support of performing some of the computation on classical hardware. In many publications, the term quantum neural network is actually a hybrid model. It is worth noting that, at the current stage of development, purely quantum architectures are successfully applied only to classification problems with a small number of classes, since the number of qubits required to encode and read data increases with the number of classes. For some classification tasks, this leads to the system exceeding the number of available qubits or generating significant errors.

3 Previous Work Due to their fast convergence and high accuracy, quantum neural networks can find applications in many areas, especially in vision systems, whose bottleneck is the need to process large amounts of data efficiently. For example, in the paper [3] a successful attempt was made to recognize trees in the aerial space of California, where ensemble methods were used. The authors proposed a truncation and rescaling of the training objective through a trainable metaparameter. Accuracies of 92% in validation and 90% on a test scene were obtained. In the paper [10] experiments on cancer recognition were carried out. The authors proposed a qubit neural network model with sequence input based on controlled rotation gates, called QNNSI. The three-layer model with a hidden layer, which employed the Levenberg-Marquardt algorithm for learning, was the best of all the approaches tested. The experimental results reveal that a greater difference between the input nodes and the sequence length leads to a lower performance of the proposed model than that of the classical ANN, in contrast, it obviously enhances the approximation and generalization ability of the proposed model when the input nodes are closer to the sequence length [10]. An experiment proposed in [19] included the classification of vehicles. Successful object segmentation proposed in [1, 2]. Other research shows that quantum neural networks can be effective in the recognition of pneumonia [20]. The classification of ants and bees introduced in [13] indicates that the use of a transfer learning technique can result in the creation of an efficient system for differentiating between two types of insects. The retinal color fundus images and chest radiography classification presented in [14] is another example of the superiority of quantum neural networks over classical ones.


Topics similar to the subject of this research can be found in [12]. The authors applied a transfer learning approach: a ResNet-18 network pretrained on the ImageNet dataset was used, with its fully connected layer replaced by a quantum layer wrapped in a PyTorch-compatible module. The results obtained from such a network were compared with those of a simple two-layer classical network. Quantum computations were made possible by the PennyLane plug-in, combined with the popular PyTorch library. Three steps are distinguished in the quantum layer. The first embeds the data in a quantum circuit using a single gate or combinations of several gates: the Hadamard gate and the Rotational Y, Rotational X, Rotational Z, U1, U2, and U3 gates. The data are then fed into a parameterized quantum circuit of two-qubit gates (Controlled NOT, Controlled Z, and Controlled RX) combined with single-qubit parameterized gates (i.e., Rotational Y, Rotational X, Rotational Z, U1, U2 and U3). The last step is a measurement using bases composed of the X, Y, Z quantum gates. In this experiment, the LISA traffic sign set was reduced to 18 traffic sign classes. A binary classification was tested, with one class consisting of stop signs and the other containing all remaining signs; 231 samples divided into a training set and a test set in a ratio of 80:20 were used. For multiclass classification, the signs were assigned to one of three classes: stop signs, speed limit signs, and a class containing the other signs; 279 samples divided into training and test sets in a ratio of 80:20 were used. The following adversarial attack algorithms were applied: the gradient attack, the Fast Gradient Sign Method (FGSM), and the Projected Gradient Descent (PGD) attack. The implementations used achieved high accuracy (more than 90%).
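As an illustration of the kind of hybrid model described in [12], the sketch below shows how a pretrained ResNet-18 can have its fully connected layer replaced by a small PennyLane quantum circuit wrapped as a PyTorch layer. The circuit structure (angle embedding followed by entangling layers) and the qubit count are illustrative assumptions, not the exact configuration used by the cited authors.

```python
import pennylane as qml
import torch
from torch import nn
from torchvision import models

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev, interface="torch")
def qnode(inputs, weights):
    # Step 1: embed classical features as rotation angles.
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    # Step 2: parameterized circuit of single-qubit rotations and CNOTs.
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
    # Step 3: measurement in the Z basis.
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

weight_shapes = {"weights": (2, n_qubits)}  # 2 entangling layers (assumed)
quantum_layer = qml.qnn.TorchLayer(qnode, weight_shapes)

backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = nn.Sequential(
    nn.Linear(512, n_qubits),   # compress features to the qubit count
    quantum_layer,
    nn.Linear(n_qubits, 2),     # binary output: stop sign vs. other signs
)
```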

4 QNN for Traffic Sign Recognition
The GTSRB (German Traffic Sign Recognition Benchmark) dataset used in this experiment was created in 2011. It contains 39209 training images and 12630 test images, which vary in size and other parameters, e.g. brightness and contrast. Each image belongs to one of the 43 classes representing traffic signs. An initial two-stage pre-processing was introduced. In the first stage, grayscale normalization was performed; in the second, the dataset was filtered and only images larger than 64 × 64 pixels were considered in the next steps, due to the observation that the network performed better for larger images. In the next step, the dataset was divided in an 80:10:10 ratio into training, test and validation subsets. Data labels were converted to vectors of length equal to the number of classes according to the one-hot encoding method. The images of each subset were then subjected to a quantum convolution operation, following an approach similar to that presented in [7]. Quantum convolution (quanvolution) layers are built from a group of N quantum filters that create feature maps by locally transforming the input data. The key difference between quantum and classical convolution is the processing of spatially local subareas of data using random or structured quantum circuits. Figure 1 shows the design of such an architecture.


Fig. 1 The architecture of a quanvolutional neural network proposed in [7]. The quanvolution layers are made up of a group of N quantum filters that create feature maps. Local subareas of data are processed spatially using random or structured quantum circuits
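The following sketch, written for this text, shows what one such 2 × 2 quantum filter can look like in PennyLane; the RY-rotation embedding and the two random layers follow the description in the next paragraph, while the stride and the parameter ranges are assumptions.

```python
import numpy as np
import pennylane as qml

n_qubits = 4                    # one qubit per pixel of the 2x2 window
n_random_layers = 2             # number of random layers used in the experiments
dev = qml.device("default.qubit", wires=n_qubits)
rand_params = np.random.uniform(0, np.pi, size=(n_random_layers, n_qubits))

@qml.qnode(dev)
def quanv_circuit(pixels):
    # Embed the 2x2 patch: each pixel value becomes an RY rotation angle.
    for i, p in enumerate(pixels):
        qml.RY(np.pi * p, wires=i)
    # Random (fixed) circuit acting on the embedded state.
    qml.RandomLayers(rand_params, wires=range(n_qubits))
    # One expectation value per output channel.
    return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]

def quanvolve(image):
    """Apply the 2x2 quantum filter with stride 2 to a grayscale image."""
    h, w = image.shape
    out = np.zeros((h // 2, w // 2, n_qubits))
    for r in range(0, h - 1, 2):
        for c in range(0, w - 1, 2):
            patch = [image[r, c], image[r, c + 1],
                     image[r + 1, c], image[r + 1, c + 1]]
            out[r // 2, c // 2] = np.array(quanv_circuit(patch))
    return out
```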

For each image, a small area of dimensions 2 × 2 was embedded in the quantum circuit by parameterized rotations applied to qubits initialized in the ground state. Further quantum computations were performed on the basis of the unitary space generated by a random circuit similar to that proposed in [7]. The experiment performed in this paper uses a circuit consisting of a single-qubit rotation operator Y and a Pauli-X gate. The number of random layers was set to two, because this parameter was not observed to lead to any change in classification quality during the experiments. Then, as in classical convolution, each measured value was mapped to a separate output channel. Iteratively repeating these steps yielded a complete output object that could be structured into a multichannel image. The main differences between the experiment proposed in this paper and [7] were: reducing the number of epochs from ten thousand to one hundred, reducing the dataset to seven thousand samples instead of applying dimensionality reduction, and changing the neural network architecture to a more advanced one. Both the classical and the quantum network consisted of convolutional, max-pooling, dropout, flatten, and dense layers; the difference between them lay in the convolution operation performed on the images, as in the hybrid model the network was fed images convolved with quantum circuits. The models were compiled with the adaptive Adam optimizer, the cross-entropy cost function, and the accuracy metric. The final step was to verify the quality of the learning by determining the confusion matrix and the precision, recall, and f-beta parameters on the test set.
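The layer sizes are not specified in the text, so the Keras sketch below is only a plausible reconstruction of such a classifier (convolution, max-pooling, dropout, flatten and dense layers, compiled with Adam and cross-entropy); in the hybrid variant the same model would simply receive the quanvolved multichannel images as input.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_classifier(input_shape, n_classes=43):
    # Hypothetical layer sizes; the paper only names the layer types.
    model = keras.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",   # one-hot encoded labels
                  metrics=["accuracy"])
    return model

# Classical network: raw (resized) images; hybrid network: quanvolved images.
cnn = build_classifier(input_shape=(64, 64, 1))
qnn = build_classifier(input_shape=(32, 32, 4))  # 2x2 quanvolution halves H and W
```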


Table 1 Summary of results obtained for the classical network (CNN) and the network with quantum convolution (QNN) at particular batch sizes for the 43 classes of the GTSRB dataset

Batch size   Accuracy           Precision          Recall             F-beta
             CNN      QNN       CNN      QNN       CNN      QNN       CNN      QNN
4            0.9957   0.9254    0.9914   0.9211    0.9918   0.9310    0.9912   0.9216
8            0.9914   0.9426    0.9949   0.9426    0.9943   0.9456    0.9944   0.9216
16           0.9971   0.9426    1.0000   0.9426    1.0000   0.9453    1.0000   0.9419
32           0.9986   0.9369    0.9947   0.9947    0.9369   0.9943    0.9476   0.9403
64           0.9957   0.9383    0.9959   0.9983    0.9957   0.9423    0.9954   0.9378
128          0.9986   0.9440    0.9973   0.9440    0.9986   0.9494    0.9979   0.9446
256          0.9943   0.9369    0.9973   0.9369    0.9971   0.9428    0.9971   0.9382
512          0.9986   0.9354    0.9959   0.9354    0.9957   0.9381    0.9956   0.9351

Table 1 contains the results obtained. In the case of the classical network, there was only one incorrect classification for a given input, and the number of images assigned to incorrect labels did not exceed one. For the network with quantum convolution, the incorrect detections are slightly more numerous, and those that recur most frequently and contain more than one misclassified label for a given batch are the following pairs of visually similar signs:

• pedestrians and priority road,
• bicycles crossing and double curve,
• wild animals and road narrows on the right,
• go straight or right and pedestrians,
• go straight or left and pedestrians,
• wild animals and pedestrians,
• road narrows right and end of all limits,
• pedestrians and road narrows right.

In some of the above cases, there is little or no similarity from the point of view of a human viewer, for example "crossing priority at an intersection" and "attention: pedestrians". The implication is that the process of quantum convolution makes the model more robust to errors associated with similar sign content, while shifting errors towards a less recognizable cause. This brings to mind the characteristic non-explicability of quantum computations. The advantage of the classical network can be seen in every case, as it achieves up to 99.86% accuracy for batch sizes of 32, 128 and 512. For the network with quantum convolution, the highest accuracy obtained is 94.40% for batch size 128, so even in the best cases this network shows almost 4% worse classification accuracy. Although the classical network in its worst case achieved an accuracy of 99.14% for a batch of size 8, the network with quantum convolution still shows a decrease of more than 2% in classification quality; its best result, 94.4% for a batch size of 128, is more than 5% lower. The learning curve for this batch size is illustrated in Fig. 2. The classical network achieves a worst-case score of 99.14% for a batch of size 4, and for the network with quantum convolution the score for the same batch size is more than 2% worse.


Fig. 2 Comparison of the accuracy obtained with a batch size of 128 for the classical network and the network with quantum convolution on the training set

Fig. 3 Confusion matrix with a batch of size 128 for the network with quantum convolution

The differences in the values of the recall and f-beta coefficients were at a similar level. Figure 3 shows the confusion matrix for the neural network with quantum convolution trained with a batch of size 128. The analysis shows that for both networks the worst results were obtained with a batch size equal to 4. The best results for each network were obtained with different batch sizes: for the classical network, the best of the tested values was a batch size of 16, while for the network with quantum convolution it was a batch size of 128. This points to an important difference in the approach to quantum networks, as these models can take larger portions of data, and thus their learning can be faster.


Note, however, that they are very sensitive to the appropriate choice of batch size, much more so than classical networks, whose degradation between their best and worst performance is at the level of tenths of a percent. The applied algorithm allowed us to obtain much better classification accuracy than in the case of [18], even though in our experiment classification was performed over 43 classes. The results obtained are at a level similar to some of the experiments carried out in [12].

5 Conclusion
Quantum machine learning is a challenging field. This concerns not only the proper design of algorithms but also the very way in which classical, real-world data are encoded in quantum space. Uncertainty also stems from the fact that there is no single established form of a quantum computer, only a number of proposals for what it could look like. This may result in going down wrong paths and having to redesign the algorithms multiple times. The experiment showed that it is possible to achieve high classification accuracy (more than 94%) for a neural network with quantum convolution, but it also raised the question of quantum supremacy. Quantum algorithms require special preprocessing of the dataset, which in the case of hybrid networks, of which the network with quantum convolution is an example, requires significant computational resources, and the results obtained do not show the expected significant superiority over the results returned by the classical network. However, the purpose of the experiment was not to improve the chosen path, but to outline a direction for prototyping an exemplary quantum-classical neural network model for the multiclass classification problem, which could find application in vision systems. The prospects for further development of the project are broad. It is planned to compare the influence of data augmentation on learning accuracy and the potential overfitting effect for the classical network and the network with quantum convolution. In addition, another task will be undertaken in the area of explainable AI, that is, an attempt to visualize, for example in the form of a heat map, which features of the image influence a given classification result. It would also be interesting to try out the designed network architecture on a real quantum computer, along with an analysis of the temporal performance of the algorithm.

Acknowledgements The work presented in this paper was supported by the AGH University of Science and Technology project no. 16.16.120.773.


References 1. Aytekin, C., Kiranyaz, S., Gabbouj, M.: Quantum mechanics in computer vision: automatic object extraction. In: 2013 IEEE International Conference on Image Processing, pp. 2489–2493 (2013) 2. Aytekin, C., Kiranyaz, S., Gabbouj, M.: Automatic object segmentation by quantum cuts. In: 2014 22nd International Conference on Pattern Recognition, pp. 112–117 (2014) 3. Boyda, E., Basu, S., Ganguly, S., Michaelis, A., Mukhopadhyay, S., Nemani, R.R.: Deploying a quantum annealing processor to detect tree cover in aerial imagery of California. PLoS ONE 12(2) (2017) 4. Dallaire-Demers, P.L., Killoran, N.: Quantum generative adversarial networks. Phys. Rev. A 98, 012324 (2018). https://doi.org/10.1103/PhysRevA.98.012324 5. Farhi, E., Neven, H.: Classification with quantum neural networks on near term processors (2018). arXiv:1802.06002 6. Gallagher, B.: The amount of data in the world doubles every two years (last access: 22042022). https://medium.com/callforcode/the-amount-of-data-in-the-world-doublesevery-two-years-3c0be9263eb1 7. Henderson, M., Shakya, S., Pradhan, S., Cook, T.: Quanvolutional neural networks: powering image recognition with quantum circuits. Quant. Mach. Intell. 2(1), 2 (2020). https://doi.org/ 10.1007/s42484-020-00012-y 8. Hernández, H.I.G., Ruiz, R.T., Sun, G.H.: Image classification via quantum machine learning (2020). arXiv:2011.02831 9. Huang, H.L., Wu, D., Fan, D., Zhu, X.: Superconducting quantum computing: a review. Sci. China Inf. Sci. 63(8), 180501 (2020). https://doi.org/10.1007/s11432-020-2881-9 10. Li, P., Xiao, H.: Model and algorithm of quantum-inspired neural network with sequence input based on controlled rotation gates. Appl. Intell. 40(1), 107–126 (2014). https://doi.org/10.1007/ s10489-013-0447-3 11. Lloyd, S., Weedbrook, C.: Quantum generative adversarial learning. Phys. Rev. Lett. 121, 040502 (2018). https://link.aps.org/doi/10.1103/PhysRevLett.121.040502 12. Majumder, R., Khan, S.M., Ahmed, F., Khan, Z., Ngeni, F., Comert, G., Mwakalonge, J., Michalaka, D., Chowdhury, M.: Hybrid classical-quantum deep learning models for autonomous vehicle traffic image classification under adversarial attack (2021). arXiv:2108.01125 13. Mari, A., Bromley, T.R., Izaac, J., Schuld, M., Killoran, N.: Transfer learning in hybrid classicalquantum neural networks. Quantum 4, 340 (2020). https://doi.org/10.22331/q-2020-10-09340 14. Mathur, N., Landman, J., Li, Y.Y., Strahm, M., Kazdaghli, S., Prakash, A., Kerenidis, I.: Medical image classification via quantum neural networks (2021). arXiv:2109.01831 15. Oh, S., Choi, J., Kim, J.: A tutorial on quantum convolutional neural networks (qcnn). In: 2020 International Conference on Information and Communication Technology Convergence (ICTC), pp. 236–239 (2020) 16. Paul, W., Steinwedel, H.: Notizen: Ein neues massenspektrometer ohne magnetfeld. Zeitschrift für Naturforschung A 8(7), 448–450 (1953). https://doi.org/10.1515/zna-1953-0710 17. Peng, L., Li, J.: A facial expression recognition method based on quantum neural networks. In: Proceedings of the 2007 International Conference on Intelligent Systems and Knowledge Engineering (ISKE 2007), pp. 51–54. Atlantis Press (2007/10). https://doi.org/10.2991/iske. 2007.10 18. Potempa, R., Porebski, S.: Comparing concepts of quantum and classical neural network models for image classification task. In: Chora´s, M., Chora´s, R.S., Kurzy´nski, M., Trajdos, P., Peja´s, J., Hyla, T. (eds.) Progress in Image Processing, Pattern Recognition and Communication Systems, pp. 61–71. 
Springer, Cham (2022) 19. Yu, S., Ma, N.: Quantum neural network and its application in vehicle classification. In: 2008 Fourth International Conference on Natural Computation, vol. 2, pp. 499–503 (2008) 20. Yumin, D., Wu, M., Zhang, J.: Recognition of pneumonia image based on improved quantum neural network. IEEE Access 8, 224500–224512 (2020)


21. Zhao, C., Gao, X.S.: Qdnn: deep neural networks with quantum layers. Quantum Mach. Intell. 3(1), 15 (2021). https://doi.org/10.1007/s42484-021-00046-w 22. Zhou, J., Gan, Q., Krzy˙zak, A., Suen, C.Y.: Recognition of handwritten numerals by quantum neural network with fuzzy features. Int. J. Doc. Anal. Recognit. 2(1), 30–36 (1999). https:// doi.org/10.1007/s100320050034

Real Time Intersection Traffic Queue Length Estimation System Kamil Bolek

Abstract Many systems which check queue length at intersections are inaccurate. Information about small/medium/large queues is not enough for intelligent transport systems, which could modify the current intersection program in a way that allows the traffic flow in the city to be optimised. Not only does the presented system provide information about the queue length in metres on every lane, but it is also based solely on the camera focal length, sensor size and number of lanes in the camera view, which minimizes the involvement of traffic operators in time-consuming camera setup. The system consists of several submodules, the first of which detects license plates and uses them to create a configuration of the camera. Subsequently, the second module, which detects the type of vehicle, can determine the current length of the queue on every lane. All this information is sent to the traffic management system in Wroclaw, which modifies the traffic lights controller programs, optimising the traffic flow in the city.

Keywords Queue length estimation · Traffic systems · Traffic optimisation · YOLO versus RetinaNet · Image processing · Object tracking · Cluster computing

1 Introduction
Every day, cities around the world are growing, and with them their needs. More and more roads are built in order to keep up with the ever-increasing number of cars. An intelligent transport system contains many submodules which are necessary for city management. In order to work properly, the submodule which optimizes traffic needs a lot of information about the current situation on the roads. The information is provided by induction loops, cameras and controllers at the intersections.

K. Bolek (B) Polish-Japanese Academy of Information Technology, Warszawa, Poland e-mail: [email protected]; [email protected] WASKO S.A., Gliwice, Poland
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. J. Chmielewski and A. Orłowski (eds.), Computer Vision and Graphics, Lecture Notes in Networks and Systems 598, https://doi.org/10.1007/978-3-031-22025-8_5


Cameras at the intersections have multiple functions. Not only do they allow traffic operators to see the current situation so that they can reorganize routing in the city, but they also continuously inform the traffic management module about the queue length on every lane. Every new intersection requires the installation of a few cameras. All of them need to be configured manually (by entering the lane number and the area which a particular lane covers, and by drawing the lines which delimit a "short", "medium" or "long" queue). Moreover, if a camera at a given intersection breaks down (which is quite common when there are roughly 600 cameras in the city), then the new one also needs to be configured. Manual configuration is a long and arduous process that has to be done with great care, as the subsequent correct response of the automatic traffic management system depends on it. Queue length is one of the main performance measures for a signalized intersection [1] and is critical for adjusting the traffic light program [2]. To date, many solutions have been developed to estimate queue length at intersections. Traditional fixed-location sensors (e.g. a radar or an induction loop) could provide information about queue length and were used in the past [3, 4], but in many Chinese cities these detectors upload their information with an aggregation interval of 30–60 s. Such a long interval does not allow the system to react quickly enough to changes at intersections [5]. Nowadays, such detectors are much faster and can provide information in real time, but they are not installed as standard at each intersection. Moreover, radars are able to detect vehicles only about 20–30 m from their mounting points (e.g. Siemens Sitraffic Heimdall). An induction loop provides information only on whether a vehicle is directly above it. Cameras, which provide information about the current situation to the traffic operator, are basic equipment at every intersection, and their range depends only on the camera coverage. Another way of estimating queue length is to use an input-output model. To date, many models of this type have been developed [6–10]. Such a model estimates the queue by analysing the cumulative traffic arrival-departure curve. Other types of models were implementations of shockwave theory [11–16]. They have received more attention recently, attempting to explain the formation and dissipation of queues using Lighthill-Whitham-Richards (LWR) shockwave theory [17]. All the models mentioned above work only under certain specified conditions and do not work properly when unforeseen events such as lane blockage occur on the roadway, which is obviously not a rare event in overcrowded cities. Yet another type of solution tries to estimate the length of the queue by determining the intersection delay pattern, which depicts the real-time delay experienced by a vehicle arriving at the intersection at a given time. On the grounds of these data, the queue in the next cycle is predicted [18]. Some of these solutions use a lane-based real-time queue length estimation model which employs LP (License Plate) recognition data. Included in this group are systems in which an interpolation method based on a Gaussian process is developed to reconstruct the equivalent arrival-departure curve for each lane. By using license plate cameras focused on the stop line area, the algorithm compares the timestamps of a detected car passing through an intersection and assesses how much time it takes to arrive at the second road crossing. In many cases, license plate detection returns a wrong plate number (it depends on


the weather, light, etc.), which prevents the comparison. The missing information for unrecognized or unmatched vehicles is obtained from the reconstructed arrival curve [19]. This solution needs at least two intersections that send information about the license plates detected on each lane to build the arrival-departure curve. Additionally, data from both intersections need to be aggregated and compared with each other to adjust the general model, the time needs to be precisely synchronised, and both cameras have to be manually configured by the traffic operator. Based on this research and my own experiments, the system presented in this article has been developed; it estimates queue length with high accuracy and detects lane directions automatically. The process no longer requires manual configuration, but only the definition of the parameters of a selected camera. This work contributes to the literature in the following aspects:
1. The proposed solution estimates queue length only on the basis of information from the camera (focal length, sensor size and camera stream). Thanks to this approach, the system is able to determine not only the current length of the queue in the city, but also the length of the queue in the past, based on the footage from the selected camera.
2. The data obtained from the system is the actual, true length of the queue, not the result of a statistical approach or an attempt at forecasting.
3. The system does not need manual configuration. The user (system administrator) has only to set a few technical camera parameters (focal length, sensor size and the number of lanes in the camera coverage). The system has a module which automatically determines the area occupied by each lane.
4. The neural network (YOLOv4) which is used to detect license plates on the road has a much higher mean average precision (mAP) [20] in comparison to the other tested models. To achieve this, a transfer learning process and the author's own dataset were used. The prepared dataset contains more than 500 labeled license plates. The produced bounding boxes are used for distance estimation.
The paper is organized as follows: the next section presents the methodology of the proposed solution, the third and fourth sections present the field experiment design together with the numerical results, and the final section concludes the paper.

2 Methodology
This section presents the details of the proposed system for estimating the queue length. The system contains two major modules:
1. The module which creates the configuration for a selected camera (system working in the configuration mode).
2. The module which estimates the queue length based on the configuration established in the previous module (system working in the production mode).


Fig. 1 System workflow

The module which performs the configuration for a selected camera consists of a few submodules described in the next section. As a result of this module's work, we obtain a model which determines on which lane the vehicles stop and estimates the distance between the first and the last of them, thus determining the length of the queue. This module can work on archival video from a recorder or on a live stream from the camera (Fig. 1). Once a sufficient amount of data has been collected and processed in the configuration mode, the produced output becomes the input for the system working in the production mode. This mode detects a few classes of vehicles. The sum of the estimated distance (between the first and the last vehicle) and the specified length for the class of the last vehicle on the lane is sent to the central system, which adjusts the intersection program to maximise capacity.

2.1 Module Which Creates Configuration for Selected Camera (Configuration Mode)
The first module processes the video stream from the camera frame by frame to create average distance lines and their coordinates in the camera view. Moreover, using a cloud of points which are the central points of the detected license plate bounding boxes, the module determines the area where a specified lane is.

2.1.1 License Plate Detection
The first step is the detection of a license plate in the camera view using a neural network. Many neural networks can recognize license plates in an image, but in this module object detection needs to be done precisely, regardless of the size of the object in the picture. Thus, the project takes into consideration only the FPN (Feature Pyramid Network) topology. Of several tested neural networks, the best results were achieved by RetinaNet (mAP = 0.81) and YOLOv4 (mAP = 0.79). The mAP [20] for a set of detections is the mean over classes of the interpolated AP [21] for each class. The mean average precision for both of these neural networks was calculated for detections with confidence greater than 80%. The second factor which was taken into account was the speed of detection.


Table 1 Comparison of AP and IoU of YOLO before (model pretrained to find the LP area for OCR) and after training

                                                  AP (%)   IoU (%)
Pretrained weights which detect the LP for OCR    56.30    51.42
Best result of own training                       78.98    55.65

RetinaNet was slower in comparison to YOLOv4 on an NVIDIA RTX 3090 (63 FPS vs. 46 FPS). Such a big difference in speed motivated the choice of YOLO as the NN (neural network) topology used in the presented solution. The YOLO NN used in the suggested solution has been trained on a dataset prepared by the author, which contains 150 frames from cameras in Wroclaw. Every frame has more than 3 license plates in view at once. In this way, we get a dataset which includes more than 500 bounding boxes of license plates at different distances (from 5 to 50 m) from the camera. Images of vehicles without plates and unrelated pictures from the city were also added to the dataset to eliminate false positives. That amount of data has proven to be enough to train a neural network which has more than 90% accuracy. A transfer learning process was used during the training. A pre-trained YOLO network that had additionally been trained to look for license plates for OCR was used as the basis for the process. Its accuracy and IoU (Intersection over Union) value were sufficient for OCR (since, for this process, the detected area can contain more than only the license plate), but not sufficient for precise license plate detection. During the training on the prepared dataset, both of the above properties were improved (Table 1).

2.1.2 Distance Estimation
On the basis of the detected bounding box size and information from the user about the camera parameters (focal length and sensor size), the submodule attempts to estimate the distance between the camera and the detected license plate. Knowing the size of license plates in Poland (which is 520 × 114 mm), the focal length of the camera and the width of the bounding box, the distance between the camera and the object can be calculated using the principle of similarity of triangles. The sensor size and focal length are provided in the manufacturer's specification. The selection of the sensor and lens depends on the task to be performed by the camera, but the indicated parameters may differ significantly between different models, as shown in Fig. 2. Once the operator has entered the required parameters, LP detection begins. Multiple detections of LPs create a cloud of points whose x, y coordinates are the centres of their bounding boxes. All of these points are clustered on the basis of the estimated distance. The algorithm finds straight lines passing through the highest possible number of points with the same distance. The distance for points for which no LP has been found is calculated by using a linear regression with a polynomial of fourth degree. The degree of the polynomial depends on the height and angle of the camera relative to the plane of the observed image; for the indicated polynomial, the lowest error was obtained relative to the reference values measured at the intersection.
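The similar-triangles computation described above can be written in a few lines. The sketch below is an illustration with assumed variable names (the paper does not publish its code), using the known Polish plate width of 520 mm.

```python
PLATE_WIDTH_MM = 520.0  # width of a Polish license plate

def distance_to_plate_m(focal_length_mm: float,
                        sensor_width_mm: float,
                        image_width_px: int,
                        bbox_width_px: float) -> float:
    """Pinhole-camera distance estimate from the LP bounding box width.

    By similarity of triangles:
        real_width / distance = bbox_width_on_sensor / focal_length
    where the bounding box width on the sensor (in mm) is the pixel width
    scaled by the physical pixel pitch (sensor_width / image_width).
    """
    bbox_width_on_sensor_mm = bbox_width_px * sensor_width_mm / image_width_px
    distance_mm = focal_length_mm * PLATE_WIDTH_MM / bbox_width_on_sensor_mm
    return distance_mm / 1000.0

# Example with made-up values: 8 mm lens, 1/2.8" sensor (~5.1 mm wide),
# 1920 px frame width, plate bounding box 40 px wide.
print(round(distance_to_plate_m(8.0, 5.1, 1920, 40.0), 1))  # ~39.2 m
```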


If the remaining lines are determined on the basis of LP detections, the error of the polynomial is minimised. As an argument for this polynomial, we use the coordinates of the lines which were determined previously during LP detection. This method enables us to determine the distance for any point in the camera view, as shown in Fig. 3.

2.1.3 Calibration of Camera Image Distortion
Any operations on an image which are intended to describe the three-dimensional space imaged by the camera depend on the camera's internal parameters. Internal parameters should be understood as features of the camera that do not depend on its relation to the space in which it is placed. Knowledge of the internal parameters of the camera allows us to determine what the two-dimensional projection of three-dimensional objects onto the image plane will look like, on the basis of their transformation in the camera frame.

Fig. 2 Popular sensor sizes

Fig. 3 Result of distance estimation visualised in the camera view—lines passing through the highest possible number of points with the same distance. The distances of the missing lines were determined using the polynomial


The basic intrinsic parameters of the camera are:
• f — the focal lengths,
• s — the pixel density along the sensor axis,
• c_x, c_y — the point of intersection of the image plane with the line perpendicular to it passing through the centre of the camera coordinate system,
• the obliquity of the sensor axis, causing a tilt of the acquired image (for today's cameras this parameter is usually ignored due to the quality of the sampling process).

The pixels of the image sensor may not be square, so we may have two different focal lengths (f_x, f_y). The intrinsic matrix K of internal parameters can thus be expressed as:

K = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}

When measuring distances in the real world from a camera image, it is useful to calibrate the distortions that occur in the image. The most common are radial and tangential distortion. Radial distortion means that as the pixel distance from the imaging centre increases, the distortion increases. Tangential distortion models the distortion resulting from the tilt of the sensor relative to the lens. A popular way to calibrate the internal camera parameters is Zhang's approach [22]. It is based on the assumption that a physical, three-dimensional plane containing a set of locatable points is used for the calibration process. The use of a plane allows the assumption that the coefficient along one of the base axes is equal to 0 for each point. The relationship between the three-dimensional homogeneous points of the plane and the two-dimensional homogeneous points of the image is therefore expressed by a scale factor and a homography. The use of multiple images of the calibration target allows the creation of a system of equations to estimate the matrix of internal parameters and the matrix of external parameters in each frame. However, due to acquisition noise, such a solution is insufficient; hence Zhang's approach is to minimise the sum of the differences between the projections of the plane points based on the acquired matrices and the acquired image, by using the Levenberg-Marquardt algorithm. In order to simplify the camera configuration process and to ensure wide applicability of the solution, the distortion resulting from the camera's internal parameters was not eliminated. This approach relies on the projection that takes place in the pinhole camera model.

2.1.4 Lane Detection Algorithm
Apart from the distance estimation, license plate detection is useful for determining the area in the camera coverage occupied by every lane. All of the detections (the centres of their bounding boxes) are processed after the distance estimation module has finished its work. The best result in comparison with manually determined lane areas was achieved by the DBSCAN algorithm with epsilon equal to 5.4 and min_samples equal to 8 (Fig. 6). Epsilon determines the radius of the circle within which all contained points are considered neighbourhood points.


Fig. 4 Visualisation of the sorted distances between points

Fig. 5 DBSCAN with epsilon = 3.6

Fig. 6 Result of the clustering method with epsilon = 5.4 (increased elbow-point value)

The radius was determined by looking for the elbow point on a chart representing the sorted distances between points (Fig. 4) and by increasing that value by 50%. The increased value was necessary, as the elbow point value alone did not give satisfying results (Fig. 5). The final result is visualised on a frame from the camera for better understanding (Fig. 7). The result of the clustering algorithm (the labels of the coordinates) was used to train a classifier which decides on which lane a vehicle stops. In the proposed solution, a k-nearest neighbours classifier was used.
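A minimal scikit-learn sketch of this lane-detection step might look as follows; the array names are illustrative, and the (x, y) bounding-box centres are assumed to be scaled the same way as when the eps = 5.4 value was tuned.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier

# points: (N, 2) array of bounding-box centres (x, y) of detected plates.
points = np.load("lp_centres.npy")          # hypothetical input file

# Cluster the point cloud into lanes (parameters from the paper).
labels = DBSCAN(eps=5.4, min_samples=8).fit_predict(points)

# Discard noise points (label -1) before training the lane classifier.
mask = labels != -1
lane_classifier = KNeighborsClassifier(n_neighbors=5)  # k is an assumption
lane_classifier.fit(points[mask], labels[mask])

# In production mode: assign a newly detected vehicle to a lane.
lane_id = lane_classifier.predict([[640.0, 420.0]])[0]
```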


Fig. 7 Result of the clustering method, which determines the area occupied by every lane

2.1.5 Result from the Configuration Module
The result of the module's work is the configuration of the camera, which contains:
1. A trained classifier which assigns any detected object (car, truck or bus) to a specific lane;
2. A function which, on the basis of the bounding box and the lane assignment, estimates the actual distance to the detected object and returns the length of the queue on each lane.
The produced configuration is used in further processing (in the production mode) to transmit information about current queue lengths to the central system that is responsible for traffic flow optimisation.

2.2 Module Which Estimates the Queue Length Based on Configuration
The module estimating the queue length uses YOLOv4, which is trained to look for a few classes of objects (in brackets are the average rounded lengths for a given vehicle category):
1. motorbike (2 m)
2. car (5 m)
3. truck (7 m)
4. bus (12 m)

Detection of a vehicle that stopped in a lane starts a procedure which uses the central point of the bottom edge of the bounding box to do the following:
1. Assign the detection to a specific lane using the k-nearest neighbour classifier;
2. Estimate the distance to the detected vehicle;


3. Check whether it is the last car that stopped in the particular lane;
4. Add the average length of the object class (e.g. 5 m for a car detection) to the estimated distance;
5. Estimate the distance to the first car stopped in the selected lane and subtract it from the queue length;
6. Send information to the central system about the lane number and the queue length on it.
This procedure is repeated constantly. When the cars move, the system waits for the next detection of a car that stops in the lane. If the central system is not available, the system caches the information, but only for one minute (older information is not useful for the central system, which modifies the intersection program according to the latest information).
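Putting the steps above together, the production-mode update for a single stopped vehicle could be sketched as follows (illustrative code; the detector interface, the distance function from Sect. 2.1.2 and the kNN classifier from Sect. 2.1.4 are assumed to be available).

```python
CLASS_LENGTH_M = {"motorbike": 2, "car": 5, "truck": 7, "bus": 12}

def update_queue(detection, lane_classifier, estimate_distance_m, queues):
    """Update the per-lane queue length for one stopped vehicle.

    detection: dict with 'class' and 'bbox' = (x1, y1, x2, y2);
    queues: dict lane_id -> {'first_m': ..., 'last_m': ...}.
    Returns (lane_id, queue_length_m) to be sent to the central system.
    """
    x1, y1, x2, y2 = detection["bbox"]
    anchor = ((x1 + x2) / 2.0, y2)          # centre of the bottom edge

    lane_id = int(lane_classifier.predict([anchor])[0])   # step 1
    dist_m = estimate_distance_m(anchor)                  # step 2

    lane = queues.setdefault(lane_id, {"first_m": dist_m, "last_m": dist_m})
    lane["first_m"] = min(lane["first_m"], dist_m)
    lane["last_m"] = max(lane["last_m"], dist_m)          # step 3: farthest stopped car

    queue_length_m = (lane["last_m"]
                      + CLASS_LENGTH_M[detection["class"]]  # step 4
                      - lane["first_m"])                    # step 5
    return lane_id, queue_length_m                          # step 6: send to central system
```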

3 Experiments
The system was tested in Wroclaw on a randomly selected intersection. The camera at the intersection (model HIKVISION DS-2CD5026G0-AP) had a sensor size of 1/2.8″. One hour of streaming from the camera was processed by the camera configuration generator module. This module detected license plates, estimated the distance to every one of them and saved the centres of the bounding boxes to train the lane number classifier and the distance estimation algorithm. This configuration was subsequently used as initialisation data for the system in production mode. A few seconds after the algorithm started, the module showed the first results, which were compared with the ground truth (measured manually beforehand in the city). The result of the measurements is shown in Fig. 8.

Fig. 8 Measurement at the intersection—ground-truth line visualisation


Table 2 Comparison of the estimation result with manually measured ground truth

Line y       Distance (m)                                     Error (%)
coordinate   Min      Max      Estimated   Ground truth
86           43.89    47.54    45          53                 18
118          35.66    43.89    40          45                 13
148          35.66    40.75    35          41                 17
186          31.7     33.56    30          34                 13
238          24.8     25.93    25          28                 12
336          17.83    21.94    20          20                 0
444          15.85    16.3     15          16                 6
660          10.97    11.18    10          10                 0
969          –        –        5           5.5                10

To check the result achieved in the production environment, the algorithm reported the distance between the camera and the front of the car (it is important to note that the queue length should be increased by the length of the car depending on its class). A few cars which stopped close to the ground-truth lines (at most 5 px of difference) were compared with the algorithm's estimation. The analysis indicated that the module had a maximum error of 18% at 45 m (Table 2). Due to image distortion, at very close range (5 m) the model did not recognize enough license plates to determine the distance line, but, using a linear regression with a fourth-degree polynomial, the method estimated the missing line coordinates. The tests were repeated in different weather conditions to check whether the system would give similar results. Approximately 2 h of video stream recorded during the afternoon rush hour were processed to create the camera configuration. The next half hour of video stream was compared with manual measurements. The system worked with the configuration produced for the same camera in sunny weather. No significant differences were observed (comparable errors). Despite changing conditions (rain, snow, sunny weather), the system continued to measure distances with satisfactory accuracy (below 20% error under 50 m). Using the measures defined in other papers, the proposed method was compared by means of MAPE (Mean Absolute Percentage Error). Many other solutions give information only about the number of cars in a given lane at a particular moment in time, which is not a fully measurable quantity, whereas the prepared system returns the queue length in metres. To calculate MAPE, the recording from the camera was checked manually. The results of the distance estimation module, which processed frames containing vehicles stopped near the defined distance lines, were compared with the distances measured in the city (Fig. 9). State-of-the-art MAPE values reported in different papers amounted on average to 14.93 [17] and 14.2 [23]. In the experiment, the average MAPE of the proposed algorithm was 13.94.
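For reference, MAPE as used here can be computed with a short NumPy function (the arrays below are placeholders for the estimated and manually measured queue lengths).

```python
import numpy as np

def mape(estimated, ground_truth):
    """Mean Absolute Percentage Error, in percent."""
    estimated = np.asarray(estimated, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    return float(np.mean(np.abs(estimated - ground_truth) / ground_truth) * 100)

# Toy example with made-up measurements (metres):
print(mape([9, 22], [10, 20]))  # -> 10.0
```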


Fig. 9 System estimation visualisation

The obtained results indicate that the proposed solution can effectively perform the function of verifying queue lengths at intersections without the need for manual camera configuration. As Table 2 indicates, the error in estimating queue lengths increases with distance. Moreover, the system operates with comparable accuracy regardless of weather conditions.

4 Conclusion and Further Plans
In this paper, a queue length estimation system was presented that uses LPR (License Plate Recognition) data for automatic camera calibration (detecting the lane areas in the camera view and the direction of passing vehicles) and for distance estimation. The system provides the user with an efficient estimation of the queue length in a lane in real time and sends this information to the central system, which can modify the intersection program and react to the actual situation in the city. The system was tested and verified in a production environment (Wroclaw). In all tests the maximum error was 18% at real-world distances shorter than 50 m. The designed system provides easy scalability. The length of the queue does not change very often, so it is sufficient to measure the length in a given lane once per second. If queue lengths for several lanes can be checked in a single frame from the camera, it is possible for one intersection to be operated by only 4 cameras. The system could simultaneously process 60 cameras from the city on one server node with an RTX 3090. Plans for further research include using a neural network based on the transformer architecture; as a result, we could obtain more precise bounding boxes for detected license plates and minimise the errors. Furthermore, the neural network will be quantised and will run locally on STM32 devices to reduce both the energy costs and the number of servers in the city.


References 1. Balke, K.N., Charara, H.A., Parker, R.: Development of a traffic signal performance measurement system (TSPMS). No. FHWA/TX-05/0-4422-2. College Station, Texas, Texas Transportation Institute, Texas A & M University System (2005) 2. Newell, G.F.: Approximation methods for queues with application to the fixed-cycle traffic light. Siam Rev. 7(2), 223–240 (1965) 3. Liu, H.X., Ma, W.: A virtual vehicle probe model for time-dependent travel time estimation on signalized arterials. Transp. Res. Part C: Emerg. Technol. 17(1), 11–26 (2009) 4. Sharma, A., Bullock, D.M., Bonneson, J.A.: Input-output and hybrid techniques for real-time prediction of delay and maximum queue length at signalized intersections. Transp. Res. Rec. 2035(1), 69–80 (2007) 5. Tan, C., et al.: Fuzing license plate recognition data and vehicle trajectory data for lane-based queue length estimation at signalized intersections. J. Intell. Transp. Syst. 24(5), 449–466 (2020) 6. Webster, F.V.: Traffic signal settings. No. 39 (1958) 7. May, A.D.: Traffic flow theory-the traffic engineers challenge. Proc. Inst. Traf. Eng. 290–303 (1965) 8. Akçelik, R.: A queue model for HCM 2000. ARRB Transportation Research Ltd., Vermont South, Australia (1999) 9. Sharma, A., Bullock, D.M., Bonneson, J.A.: Input-output and hybrid techniques for real-time prediction of delay and maximum queue length at signalized intersections. Transp. Res. Rec. 2035(1), 69–80 (2007) 10. Vigos, G., Papageorgiou, M., Wang, Y.: Real-time estimation of vehicle-count within signalized links. Transp. Res. Part C: Emerg. Technol. 16(1), 18–35 (2008) 11. Lighthill, M.J., Whitham, G.B.: On kinematic waves II. A theory of traffic flow on long crowded roads. Proc. R. Soc. Lond. Ser. A. Math. Phys. Sci. 229(1178), 317–345 (1955) 12. Richards, P.I.: Shock waves on the highway. Oper. Res. 4(1), 42–51 (1956) 13. Stephanopoulos, G., Michalopoulos, P.G., Stephanopoulos, G.: Modelling and analysis of traffic queue dynamics at signalized intersections. Transp. Res. Part A: Gen. 13(5), 295–307 (1979) 14. Skabardonis, A., Geroliminis, N.: Real-time monitoring and control on signalized arterials. J. Intell. Transp. Syst. 12(2), 64–74 (2008) 15. Liu, H.X., et al.: Real-time queue length estimation for congested signalized intersections. Transp. Res. Part C: Emerg. Technol. 17(4), 412–427 (2009) 16. Ban, X.J., Hao, P., Sun, Z.: Real time queue length estimation for signalized intersections using travel times from mobile sensors. Transp. Res. Part C: Emerg. Technol. 19(6), 1133– 1156 (2011) 17. Liu, H.X., et al.: Real-time queue length estimation for congested signalized intersections. Transp. Res. Part C: Emerg. Technol. 17(4), 412–427 (2009) 18. Zhan, X., Li, R., Ukkusuri, S.V.: Lane-based real-time queue length estimation using license plate recognition data. Transp. Res. Part C: Emerg. Technol. 57, 85–102 (2015) 19. Zhan, X., Li, R., Ukkusuri, S.V.: Lane-based real-time queue length estimation using license plate recognition data. Transp. Res. Part C: Emerg. Technol. 57, 85–102 (2015) 20. Everingham, M., et al.: The pascal visual object classes challenge: a retrospective. Int. J. Comput. Vis. 111(1), 98–136 (2015) 21. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. Mcgraw-Hill (1983) 22. Zhang, Z.: A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000) 23. Tiaprasert, K., et al.: Queue length estimation using connected vehicle technology for adaptive signal control. IEEE Trans. Intell. 
Transp. Syst. 16(4), 2129–2140 (2015)

Image Processing

Carotid Artery Wall Segmentation in Ultrasound Image Sequences Using a Deep Convolutional Neural Network Nolann Lainé , Hervé Liebgott , Guillaume Zahnd , and Maciej Orkisz

Abstract Intima-media thickness (IMT) of the common carotid artery is routinely measured in ultrasound images and its increase is a marker of pathology. Manual measurement being subject to substantial inter- and intra-observer variability, automated methods have been proposed to find the contours of the intima-media complex (IMC) and to deduce the IMT thereof. Most of them assume that these contours are smooth curves passing through points with strong intensity gradients expected between artery lumen and intima, and between media and adventitia layers. These assumptions may not hold depending on image quality and arterial wall morphology. We therefore relaxed them and developed a region-based segmentation method that learns the appearance of the IMC from data annotated by human experts. This deeplearning method uses the dilated U-net architecture and proceeds as follows. First, the shape and location of the arterial wall are identified in full-image-height patches using the original image resolution. Then, the actual segmentation of the IMC is performed at a finer spatial resolution, in patches distributed around the location thus identified. Eventually, the predictions from these patches are combined by majority voting and the contours of the segmented region are extracted. On a public database of 2676 images the accuracy and robustness of the proposed method outperformed state-of-the-art algorithms. The first step was successful in 98.7% of images, and the overall mean absolute error of the estimated IMT was of 100 ± 89 μm. Keywords Segmentation · Ultrasound images · Deep learning N. Lainé · H. Liebgott · M. Orkisz (B) Univ Lyon, Université Claude Bernard Lyon 1, INSA-Lyon, CNRS, Inserm, CREATIS UMR 5220, U1294, F-69621 Lyon, France e-mail: [email protected] N. Lainé e-mail: [email protected] H. Liebgott e-mail: [email protected] G. Zahnd Institute of Biological and Medical Imaging, Helmholtz Zentrum München, Neuherberg, Germany e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. J. Chmielewski and A. Orłowski (eds.), Computer Vision and Graphics, Lecture Notes in Networks and Systems 598, https://doi.org/10.1007/978-3-031-22025-8_6


1 Introduction
Atherosclerosis, the "silent killer" that can progress to acute events such as stroke or heart attack, begins to develop in the intima-media complex (IMC), the innermost structure of the artery wall. IMC thickening is widely considered as a marker of the atherosclerosis onset. Its screening is usually performed in ultrasound (US) images of the carotid artery. In the clinical routine, the measurement of the intima-media thickness (IMT) is performed manually, which may lead to substantial inter- and intra-observer variability. Numerous methods, compared in several surveys [2, 3, 10], have been developed to automate the measurement of the carotid artery IMT. Most of them infer the IMT as the distance between the lumen-intima (LI) and media-adventitia (MA) contours extracted within a region of interest (ROI) cropped manually or automatically by seeking the window where the characteristic dual-line pattern is discernible (Fig. 1). Contour extraction is performed using various "conventional" algorithms, such as snakes or dynamic programming. Virtually all assume that the contours are smooth curves passing through points with strong intensity gradients expected between artery lumen and intima, and between media and adventitia layers. Lately, deep-learning-based (DL) methods have also been proposed for this task and obtained promising results, but the comparisons [1] were performed on different datasets. A recent study [6] compared five state-of-the-art conventional methods

Fig. 1 Longitudinal US images of the common carotid artery, examples. The curves represent the lumen-intima (red) and media-adventitia (green) boundaries of the far wall interpolated from expert’s annotations (dots). The annotations were restricted to the exploitable region (ROI) where the characteristic double-line pattern of the IMC is discernible. Yellow arrows point examples of inconsistent annotations


on an open-access dataset CUBS1 [7] containing 2176 US images from two centers, and concluded that their accuracy is comparable to skilled human experts, while presenting substantially smaller variability. In that study, the best results were achieved by a method based on dynamic programming [15]. A subsequent study [4] confronted the same methods with two DL approaches on another dataset, CUBS2 [5], containing 500 US images from five different centers. This comparison has demonstrated the sensitivity of several methods to the arterial wall morphology, namely a decreased accuracy in curved or inclined arteries. One of the DL-based methods, previously unpublished, outperformed the others in almost all comparisons except for processing time. While the selected network architecture was the well-known U-net [12], the original contribution of the method resides in a specific patch-based strategy devised to cope with inclined and curved arteries. In this article, we describe for the first time the details of this method, and then report the main results in terms of accuracy and robustness on both open-access datasets CUBS1 [7] and CUBS2 [5], compared to the available expert annotations and the other methods. We also assess how much the processing time can be improved by changing the parameter setting, without degrading the accuracy.

2 Datasets
In this work, for training and evaluation purposes, we have used both previously mentioned publicly-available datasets CUBS1¹ and CUBS2², i.e., a total of 2676 static US images acquired with different clinical equipment and including a subset (n = 100) of simulated images. These images have pixel sizes ranging from 29 to 267 μm (mean size 64 μm in CUBS1 and 60 μm in CUBS2) and come along with LI and MA contours independently traced by two or more experts, one of them having traced them twice. The experts independently selected a ROI, where they considered the LI and MA interfaces as sufficiently perceptible, and traced control points within it. To obtain smooth contours, a piecewise cubic Hermite interpolating polynomial (PCHIP) was applied using MATLAB, Version 2020b (The MathWorks, Inc.). The same annotations as in the previous studies [4, 6] were used as reference. These will be referred to as A1. According to A1, the mean IMT was 725 μm in CUBS1 and 857 μm in CUBS2. The same expert's second set of annotations (A1') was used to calculate the intra-observer variability, while the inter-observer variability was calculated between A1 and the second expert's annotations (A2).

1 https://data.mendeley.com/datasets/fpv535fss7/1.
2 https://data.mendeley.com/datasets/m7ndn58sv6/1.


3 Segmentation Method Similarly to many existing methods, e.g. [15], the first step is a very simple user interaction (two mouse clicks) defining the exploitable ROI, where the IMC is perceptible. The remainder is fully automatic and split into two main steps: localization of the far wall and actual IMC segmentation (Fig. 2). The proposed solution is based on U-net architecture [12], which has widely demonstrated its ability to produce accurate results in medical image segmentation with limited annotated data available for supervised training. We have kept standard components, such as ReLU activation function, and experimentally selected the following settings: number of layers 5, initial number of convolution kernels 32, and size of the kernel 3 × 3 (with batch normalization and bias). We implemented dilated convolutions on the bottleneck, to increase the receptive field [9]. The outputs of the filters with different dilation factors {1, 2, 3, 4, 5, 6} are concatenated into one single tensor. The U-net operates in fixed-size (128-pixel width W , 512-pixel height H ) overlapping patches distributed within the ROI. The two first layers contract only the height, so as to achieve a square shape (128 × 128) of the feature space starting from the third layer. The horizontal overlap between patches is equal to W − x, where x is the horizontal shift. A post-processing combines the predictions made within the patches to extract smooth contours over the entire ROI regardless of its width. The core of the method consists of two steps: approximately localizing the far wall (Fig. 2c), and precisely segmenting the IMC around this location (Fig. 2d).
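As an illustration of the dilated bottleneck described above, the following PyTorch module applies parallel 3 × 3 convolutions with dilation factors 1–6 and concatenates their outputs into a single tensor; the channel counts in the example are assumptions, since only the initial number of kernels (32) is given in the text.

```python
import torch
from torch import nn

class DilatedBottleneck(nn.Module):
    """Parallel 3x3 convolutions with dilation factors 1..6, concatenated."""

    def __init__(self, in_channels, branch_channels):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                          padding=d, dilation=d, bias=True),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True),
            )
            for d in (1, 2, 3, 4, 5, 6)
        ])

    def forward(self, x):
        # Same spatial size for every branch (padding = dilation), so the
        # six outputs can be concatenated along the channel dimension.
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# Example with assumed bottleneck feature dimensions.
features = torch.randn(1, 512, 8, 8)
out = DilatedBottleneck(512, 128)(features)
print(out.shape)   # torch.Size([1, 768, 8, 8])
```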

Fig. 2 Flowchart of the proposed method: a Input image. b User delimitation of the left and right borders of the ROI. c Far wall detection: patches are extracted through a sliding window with overlap within ROI borders; post-processing of predicted overlapping masks leads to extraction of the median axis (green). d IMC segmentation: overlapping patches are picked along the median axis and post-processing of the predicted masks leads to extraction of the LI (red) and MA (yellow) interfaces


3.1 Detection of the Far Wall
Like in many state-of-the-art methods [8, 11, 13, 14], the initial localization of the far wall is performed to focus the subsequent actual segmentation. In this step, the algorithm attempts to separate the ROI into two regions: above and below the median axis of the IMC, respectively. Here, the patches are of full image height, and the corresponding U-net will be referred to as FW. We first describe how the images were pre-processed and how FW was trained, then we specify the post-processing chosen to obtain the curve approximately localizing the far wall over the entire ROI width from the patch-wise predictions inferred using FW.

Pre-processing and training: All images of the database were resampled to a constant height of 512 pixels; as the native height of all images in the database is around 600 pixels, the distortion thus introduced was minimal (regions containing alphanumerical information and/or the ECG curve overlaid onto the actual image were not clipped). For the training data, the median axis of the IMC was defined as the line halfway between the LI and MA annotations, interpolated across the entire width of the ROI, and a reference mask (M_ROI) was generated by setting all pixels below the median axis to one and the others to zero. Then the ROI and M_ROI were identically cut into patches; a 100-pixel overlap (x = 28) between patches aimed at data augmentation. The patches thus obtained, together with their associated masks (Fig. 3), were fed into the training process, which used the ADAM optimizer and the sum of the binary cross-entropy and the Dice loss as the loss function. The latter was experimentally chosen to minimize the boundary distance and maximize the overlay with respect to the reference masks.
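The loss described above (sum of binary cross-entropy and Dice loss) can be sketched as follows; this is a generic formulation written for this text, with a small smoothing constant added for numerical stability.

```python
import torch
from torch import nn

class BCEDiceLoss(nn.Module):
    """Sum of binary cross-entropy and Dice loss for binary masks."""

    def __init__(self, smooth=1.0):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()
        self.smooth = smooth

    def forward(self, logits, target):
        bce = self.bce(logits, target)
        probs = torch.sigmoid(logits)
        intersection = (probs * target).sum()
        dice = (2.0 * intersection + self.smooth) / (
            probs.sum() + target.sum() + self.smooth)
        return bce + (1.0 - dice)

# Usage with the ADAM optimizer mentioned in the text:
# criterion = BCEDiceLoss()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is an assumption
```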

Fig. 3 Data used during the far wall detection training phase: patch cut from the ROI (left) and the corresponding patch from M_ROI. Red and cyan dots represent the annotations for the LI and MA interfaces, respectively. The green curve is the median axis calculated from the interpolated annotations


horizontal and vertical flips applied to 50% of the images, and an affine transformation with random rotation [−2, 2] degrees, shearing [−2, 2] degrees, and translation (vertical [−20, 20] pixels, horizontal [−5, 5] pixels). The batch size was 32 and the number of epochs 300, with early stopping enabled so that the process stopped if the loss value on the validation subset did not improve during 50 epochs.
Inference and post-processing: Prior to inference, each image is resampled as described above, and then the corresponding ROI is cut into overlapping 128 × 512-pixel patches. Next, all patches are segmented using FW-net. Knowing the location and the size of each patch, two maps are created:
• prediction map: contains, for each pixel, the sum of the values predicted by FW-net in all patches;
• overlay map: contains, for each pixel, the number of overlapping patches it belonged to.
Dividing the prediction map by the overlay map provides, for each pixel, an average value in the range [0, 1], which is then binarized using a threshold of 0.5 to obtain the segmentation map. The latter is cleaned by retaining the largest connected component. The median axis we seek is the upper boundary of the segmented region. Finally, this boundary is smoothed using a third-order polynomial regression.
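The prediction-map/overlay-map combination and the median-axis extraction described above can be summarized by the following NumPy/SciPy sketch. The function names and the border handling are our assumptions; the sketch illustrates the logic rather than reproducing the authors' code.

```python
import numpy as np
from scipy import ndimage

def combine_patch_predictions(patch_masks, patch_origins, roi_shape, thr=0.5):
    """Merge overlapping patch-wise predictions into a single binary map."""
    prediction = np.zeros(roi_shape, dtype=np.float64)  # sum of predicted values
    overlay = np.zeros(roi_shape, dtype=np.float64)     # number of contributing patches
    for mask, (row, col) in zip(patch_masks, patch_origins):
        h, w = mask.shape
        prediction[row:row + h, col:col + w] += mask
        overlay[row:row + h, col:col + w] += 1.0
    average = prediction / np.maximum(overlay, 1.0)      # mean value in [0, 1]
    return average > thr                                 # threshold at 0.5

def median_axis_from_mask(seg_map, degree=3):
    """Largest connected component, upper boundary, polynomial smoothing."""
    labels, n_components = ndimage.label(seg_map)
    if n_components == 0:
        return None, None
    largest_label = np.bincount(labels.ravel())[1:].argmax() + 1
    largest = labels == largest_label
    cols = np.flatnonzero(largest.any(axis=0))
    rows = largest[:, cols].argmax(axis=0)               # first foreground row per column
    coeffs = np.polyfit(cols, rows, deg=degree)          # third-order regression
    return cols, np.polyval(coeffs, cols)
```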

3.2 Segmentation of the IMC

The above-described approximation of the far-wall median axis is used to initialize the actual segmentation of the IMC. This segmentation process presents many similarities with the concepts explained in Sect. 3.1: overlapping patches of 128 × 512 pixels, an overlay map, a prediction map, a similar post-processing except that two contours are extracted (the LI and MA interfaces), as well as the same optimizer, loss function, and U-net architecture. The dilated U-net trained here will be referred to as IMC-net. Hereafter, we emphasize the specific choices made for this step.
Pre-processing and training: The segmentation task has to be as accurate as possible, hence the algorithm works at a sub-pixel resolution. To this end, the vertical pixel size of the images was homogenized to 5 μm using linear interpolation. With this physical size, the patch height H = 512 pixels roughly corresponds to 2.6 mm, which aims to encompass the IMC, knowing that the average IMC thickness is about 0.8 mm. For training, the ground truth was then deduced from the interpolated images (Fig. 4): each pixel located between the annotated LI and MA interfaces was set to one, and the others to zero. Unlike the far wall detection, the patches were extracted along the median axis and, in addition to the 100-pixel horizontal overlap (i.e., x = W − 100 = 28-pixel shift), at each abscissa x_i the mean ordinate y_i of the median axis was computed over the patch width W = 128, and three patches were extracted, respectively centered at y_i and y_i ± Δy (Δy = 128 pixels).
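The three-patch sampling along the median axis can be pictured with the short sketch below; the clamping at the image borders and the rounding are our simplifications, and the real pipeline may handle such cases differently.

```python
import numpy as np

def extract_training_patches(image, median_axis, x_shift=28, patch_w=128,
                             patch_h=512, dy=128):
    """Cut three vertically shifted patches per abscissa along the median axis.

    `median_axis` maps a column index to the row of the far-wall median axis.
    """
    patches = []
    for x in range(0, image.shape[1] - patch_w + 1, x_shift):
        cols = np.arange(x, x + patch_w)
        y_mean = int(round(np.mean(median_axis[cols])))       # mean ordinate on the patch width
        for offset in (-dy, 0, dy):                           # centers y_i - dy, y_i, y_i + dy
            top = y_mean + offset - patch_h // 2
            top = max(0, min(top, image.shape[0] - patch_h))  # keep the patch inside the image
            patches.append(image[top:top + patch_h, x:x + patch_w])
    return patches
```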


Fig. 4 Data used during the training phase for IMC segmentation. Patches and their associated masks located at: a (xi , yi − 128), b (xi , yi ), and c (xi , yi + 128). Red and blue dots represent the corresponding annotations for LI and MA interfaces, respectively. The contours were obtained by interpolating the annotations. Of note: the actual size of the patches is 128 × 512 pixels, but they have been vertically compressed for display purposes

This data augmentation was intended to cope with a possibly inaccurate far-wall approximation as well as with tilted arteries.
Inference and post-processing: During inference, the patches are extracted along the far-wall approximation resulting from the first step (Sect. 3.1). At each abscissa x_i, three or more patches are captured at different ordinates, depending on the tilt of the median axis, with the goal of covering the whole expected extent of the IMC. The predictions made by IMC-net in all patches are combined into a prediction map, and then the segmentation map is derived from it, as described above. Finally, the LI and MA interfaces are defined as the upper and lower boundaries, respectively, of the region thus segmented.
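A minimal sketch of how the LI and MA interfaces could be read off the final segmentation map, column by column. It is illustrative only; columns without foreground pixels are returned as NaN and any further smoothing is omitted.

```python
import numpy as np

def extract_li_ma(seg_map):
    """Return the LI (upper) and MA (lower) boundaries of the segmented IMC."""
    height, width = seg_map.shape
    li = np.full(width, np.nan)
    ma = np.full(width, np.nan)
    for col in range(width):
        rows = np.flatnonzero(seg_map[:, col])
        if rows.size:
            li[col], ma[col] = rows[0], rows[-1]  # first and last foreground rows
    return li, ma
```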

4 Results

The evaluation was carried out using 5-fold cross-validation, so as to assess each network on data not seen during its training. In each fold, the database (combined data from the open-access datasets CUBS1 [7] and CUBS2 [5]) was split into training (60%), validation (20%), and testing (20%) subsets. Thus, five pairs of networks FW-net


and IMC-net were trained and tested independently, and the results reported here merge the test sets of these five pairs, thus evaluating the method on the entire database. In the proposed cascade approach, a failure of the first step (far wall detection) will trigger a failure of the second step (IMC segmentation). To conduct a fair evaluation of both steps, we first quantified the success rate of the first step alone, and then we quantified the accuracy of the second step by manually enforcing valid initial conditions when needed.
Robustness of the far wall detection: After visual inspection, 36 out of 2676 predicted median axes (1.3% of the database) were considered failures, i.e., curves unusable to initialize the IMC segmentation step. Hence, the success rate was 98.7% and, in the 36 images with failures, the median axis was manually redrawn using a home-made graphical interface.
Accuracy of the IMC segmentation: The segmentation error was quantified in two ways. To enable a fair comparison of our method with five state-of-the-art methods evaluated on the CUBS1 and CUBS2 datasets in the previously mentioned recent studies [4, 6], we asked the first authors of the latter (see Acknowledgements) to apply to the LI and MA contours extracted by our method exactly the same evaluation protocol and metrics as used in their publications. Hence, the IMT values and all errors were calculated on a region restricted to a common support where all participating methods succeeded in extracting the contours. The IMT was measured as the polyline distance between the LI and MA contours, and its error was calculated as the mean absolute difference (MAD) between the method output and the reference values calculated from the expert's (A1) annotations. The (worst case) errors for LI and MA were estimated separately by calculating the Hausdorff distances between the respective contours extracted by the method and the annotations performed by A1. These results are summarized in Table 1. The reported values were obtained with the same implementation and parameter settings as in the previous study restricted to the CUBS2 dataset [4]. In that study, the processing time was the only drawback of our method: 0.92 s for far-wall localization and 1.77 s for IMC segmentation, i.e., a total of 2.69 s per image, achieved with a very large horizontal overlap of 124 pixels (i.e., shift x = 4). Therefore, without modifying the implementation, we varied x ∈ {8, 16, 32, 64, 96} to check how much the horizontal overlap—and thereby the number of processed patches—can be decreased without degrading the results, and how much the processing time can thus be improved. The results obtained on the entire database are summarized in Table 2. Here the IMT and the errors in contour extraction were assessed using in-house tools, by measuring the column-wise distances instead of the polyline and Hausdorff distances. Also, the comparison was not restricted to the common support for all methods, and used the full extent of the extracted contours. In addition, the failures of the automatic far-wall localization step were not manually corrected. For all these reasons, the results for x = 4 differ from those in Table 1, but they can be consistently compared with those obtained for the remaining shifts.
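For readers wishing to compute similar error measures, the snippet below sketches a column-wise IMT MAD and a symmetric Hausdorff distance using SciPy. It is a simplified stand-in: Table 1 was produced with the original evaluation protocol (polyline distance, common support), so this sketch is not expected to reproduce those numbers exactly.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def imt_mad(li_pred, ma_pred, li_ref, ma_ref, pixel_size_um=5.0):
    """Column-wise mean absolute difference of IMT, in micrometres."""
    imt_pred = (ma_pred - li_pred) * pixel_size_um
    imt_ref = (ma_ref - li_ref) * pixel_size_um
    return np.nanmean(np.abs(imt_pred - imt_ref))

def contour_hausdorff(contour_a, contour_b):
    """Symmetric Hausdorff distance between two (N, 2) point sets."""
    return max(directed_hausdorff(contour_a, contour_b)[0],
               directed_hausdorff(contour_b, contour_a)[0])
```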


Table 1 Errors of segmentation results (mean ± standard deviation): mean absolute difference (MAD) for thickness quantification (IMT) and Hausdorff distance (HD) for contour locations (LI and MA). The comparisons are reported for the entire database and separately for the CUBS1 and CUBS2 datasets. For each dataset, the following differences are reported: the proposed method against the reference annotations, and the inter- and intra-observer variabilities

Dataset        | Comparison    | IMT MAD (μm) | LI HD (μm) | MA HD (μm)
CUBS1 ∪ CUBS2  | Method vs. A1 | 100 ± 89     | 317 ± 193  | 287 ± 152
CUBS1 ∪ CUBS2  | A2 vs. A1     | 204 ± 168    | 370 ± 197  | 348 ± 162
CUBS1 ∪ CUBS2  | A1’ vs. A1    | 147 ± 127    | 356 ± 194  | 324 ± 161
CUBS1          | Method vs. A1 | 99 ± 89      | 320 ± 193  | 287 ± 153
CUBS1          | A2 vs. A1     | 206 ± 168    | 380 ± 207  | 351 ± 161
CUBS1          | A1’ vs. A1    | 144 ± 123    | 357 ± 204  | 319 ± 155
CUBS2          | Method vs. A1 | 106 ± 89     | 305 ± 197  | 289 ± 147
CUBS2          | A2 vs. A1     | 192 ± 166    | 327 ± 138  | 338 ± 184
CUBS2          | A1’ vs. A1    | 160 ± 140    | 352 ± 140  | 346 ± 185

Table 2 Estimation errors (mean of absolute differences ± their standard deviation) with respect to the reference annotations A1, and processing time (FW stands for far-wall localization and IMC for the actual segmentation), as a function of the horizontal shift x. The horizontal overlap between patches is equal to 128 − x. The reported processing time was achieved on the following hardware: CPU Intel(R) Core(TM) i7-6700, 32 GB RAM, 3.40 GHz; GPU NVIDIA GeForce GTX 1070, 8 GB RAM

Shift x (pixels)    | 4        | 8        | 16       | 32       | 64       | 96
IMT MAD ± std (μm)  | 158 ± 88 | 158 ± 88 | 158 ± 88 | 159 ± 88 | 159 ± 89 | 160 ± 87
LI MAD ± std (μm)   | 116 ± 68 | 116 ± 68 | 116 ± 68 | 113 ± 65 | 114 ± 66 | 118 ± 71
MA MAD ± std (μm)   | 104 ± 62 | 104 ± 62 | 104 ± 63 | 107 ± 68 | 107 ± 68 | 105 ± 62
FW time (sec)       | 0.92     | 0.73     | 0.64     | 0.59     | 0.58     | 0.56
IMC time (sec)      | 1.77     | 1.14     | 0.80     | 0.65     | 0.58     | 0.55
Total time (sec)    | 2.69     | 1.87     | 1.44     | 1.24     | 1.16     | 1.11

5 Discussion

We developed and assessed a deep-learning method to extract the contours of the intima-media complex in longitudinal B-mode ultrasound images of the carotid artery. The method first approximately localizes the far wall, and then segments the anatomical interfaces of interest. The proposed approach allows for a variable width of the region of interest without rescaling the images. Robustness of the far-wall localization step is a prerequisite for an overall correct segmentation. This step was successful in 98.7% of the images, attesting to the robustness of the method. The actual segmentation step achieved good accuracy, with errors smaller than the inter-observer and intra-observer variability. The results reported in Sect. 4 were obtained


on combined data from the open-access datasets CUBS1 [7] and CUBS2 [5] and can be compared with the results of five state-of-the-art methods evaluated on the same datasets in recent studies [4, 6]. On both datasets, our method outperformed the remaining ones. In particular, on the CUBS1 dataset, the compared methods produced IMT errors (MAD) ranging from 114 ± 117 to 255 ± 230 μm, while for our method the MAD was 99 ± 89 μm. Similarly, on the CUBS2 dataset, the other methods produced IMT errors ranging from 139 ± 118 to 224 ± 178 μm, while for our method the MAD was 106 ± 89 μm. On average, the MAD produced by the proposed method represented 1.6 pixels (which corresponds to 13.7% of the mean IMT) in CUBS1, and 1.8 pixels (12.4%) in CUBS2, respectively. Although these inaccuracies remain relatively large compared to the target measurement (IMT), they are respectively twice and 1.5 times smaller than the inter-observer and intra-observer variabilities. The processing time was below 1 s for the far-wall localization step and below 2 s for the actual IMC segmentation step, when using a very large horizontal overlap and thereby a large number of patches. It can be seen that, on average, the errors were stable when the overlap was decreased, particularly up to the shift x = 16, but the resulting reduction of the processing time was limited. Of note, when segmenting a sequence of images, the far-wall localization step needs to be executed only once, at the beginning of the sequence, while the actual IMC segmentation needs to be repeated for each image. In the current implementation, the processing time of this step remains greater than 0.5 s even with a strongly reduced overlap. Further reduction of this time would require an optimization of the implementation, which has not yet been attempted. The largest errors occurred in the presence of atherosclerotic plaques. As the work presented here was oriented towards asymptomatic plaque-free subjects, images with plaques were not expected. Nevertheless, we anticipate that the results might be improved by increasing the number of such images in the database and re-training the networks. In conclusion, with a 98.7% success rate and errors smaller than the inter- and intra-observer variability, the proposed method is likely robust and accurate enough to be used in clinical practice, as well as to study the periodic compression-decompression of the arterial wall, which is another potential biomarker of atherosclerosis. Code and examples are available at https://github.com/nl3769/caroSegDeep.

6 Compliance with Ethical Standards Information

The data from human subjects used in this work were obtained and treated in line with the principles of the Declaration of Helsinki. Approval was granted by the Ethics Committees of the institutions involved in creating the multicentric database, from which these data were accessed.


Acknowledgments The authors thank K. M. Meiburger and F. Marzola for their help in calculating the comparisons with reference annotations on CUBS1 and CUBS2 datasets. This work was partly supported, via NL’s doctoral grant, by the LABEX PRIMES (ANR-11-LABX-0063) of Université de Lyon, within the program “Investissements d’Avenir” (ANR-11-IDEX-0007) operated by the French National Research Agency (ANR).

References
1. Biswas, M., Saba, L., Omerzu, T., Johri, A.M., Khanna, N.N., Viskovic, K., Mavrogeni, S., Laird, J.R., Pareek, G., Miner, M., Balestrieri, A., Sfikakis, P.P., Protogerou, A., Misra, D.P., Agarwal, V., Kitas, G.D., Kolluri, R., Sharma, A., Viswanathan, V., Ruzsa, Z., Nicolaides, A., Suri, J.S.: A review on joint carotid intima-media thickness and plaque area measurement in ultrasound for cardiovascular/stroke risk monitoring: Artificial intelligence framework. J. Digit. Imaging 34(3), 581–604 (2021)
2. Loizou, C.P.: A review of ultrasound common carotid artery image and video segmentation techniques. Med. Biol. Eng. Comput. 52(12), 1073–1093 (2014)
3. Meiburger, K.M., Acharya, U.R., Molinari, F.: Automated localization and segmentation techniques for B-mode ultrasound images: A review. Comput. Biol. Med. 92, 210–235 (2018)
4. Meiburger, K.M., Marzola, F., Zahnd, G., Faita, F., Loizou, C., Lainé, N., Carvalho, C., Steinman, D., Gibello, L., Bruno, R.M., Clarenbach, R., Francesconi, M., Nikolaides, A., Liebgott, H., Campilho, A., Ghotbi, R., Kyriacou, E., Navab, N., Griffin, M., Panayiotou, A., Gherardini, R., Varetto, G., Bianchini, E., Pattichis, C., Ghiadoni, L., Rouco, J., Orkisz, M., Molinari, F.: Carotid ultrasound boundary study (CUBS): Technical considerations on an open multi-center analysis of computerized measurement systems for intima-media thickness measurement on common carotid artery longitudinal B-mode ultrasound scans. Comput. Biol. Med. 144, 105333 (2022)
5. Meiburger, K.M., Marzola, F., Zahnd, G., Faita, F., Loizou, C., Lainé, N., Carvalho, C., Steinman, D., Gibello, L., Bruno, R.M., Clarenbach, R., Francesconi, M., Nikolaides, A., Liebgott, H., Campilho, A., Ghotbi, R., Kyriacou, E., Navab, N., Griffin, M., Panayiotou, A., Gherardini, R., Varetto, G., Bianchini, E., Pattichis, C., Ghiadoni, L., Rouco, J., Orkisz, M., Molinari, F.: DATASET for “Carotid Ultrasound Boundary Study (CUBS): Technical considerations on an open multi-center analysis of computerized measurement systems for intima-media thickness measurement on common carotid artery longitudinal B-mode ultrasound scans”. Mendeley Data, V1 (2022)
6. Meiburger, K.M., Zahnd, G., Faita, F., Loizou, C.P., Carvalho, C., Steinman, D.A., Gibello, L., Bruno, R.M., Marzola, F., Clarenbach, R., Francesconi, M., Nicolaides, A.N., Campilho, A., Ghotbi, R., Kyriacou, E., Navab, N., Griffin, M., Panayiotou, A.G., Gherardini, R., Varetto, G., Bianchini, E., Pattichis, C.S., Ghiadoni, L., Rouco, J., Molinari, F.: Carotid ultrasound boundary study (CUBS): An open multicenter analysis of computerized intima-media thickness measurement systems and their clinical impact. Ultrasound Med. Biol. 47(8), 2442–2455 (2021)
7. Meiburger, K.M., Zahnd, G., Faita, F., Loizou, C.P., Carvalho, C., Steinman, D.A., Gibello, L., Bruno, R.M., Marzola, F., Clarenbach, R., Francesconi, M., Nicolaides, A.N., Campilho, A., Ghotbi, R., Kyriacou, E., Navab, N., Griffin, M., Panayiotou, A.G., Gherardini, R., Varetto, G., Bianchini, E., Pattichis, C.S., Ghiadoni, L., Rouco, J., Molinari, F.: DATASET for “Carotid Ultrasound Boundary Study (CUBS): an open multi-center analysis of computerized intima-media thickness measurement systems and their clinical impact”. Mendeley Data, V1 (2021)
8. Menchón-Lara, R., Sancho-Gómez, J., Bueno-Crespo, A.: Early-stage atherosclerosis detection using deep learning over carotid ultrasound images. Appl. Soft Comput. 49, 616–628 (2016)
9. Meshram, N., Mitchell, C., Wilbrand, S., Dempsey, R., Varghese, T.: Deep learning for carotid plaque segmentation using a dilated U-net architecture. Ultrasonic Imaging 42(4–5), 221–230 (2020)
10. Molinari, F., Zeng, G., Suri, J.S.: A state of the art review on intima-media thickness (IMT) measurement and wall segmentation techniques for carotid ultrasound. Comput. Methods Programs Biomed. 100, 201–221 (2010)
11. Qian, C., Su, E., Yang, X.: Segmentation of the common carotid intima-media complex in ultrasound images using 2-D continuous max-flow and stacked sparse autoencoder. Ultrasound Med. Biol. 46(11), 3104–3124 (2020)
12. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W., Frangi, A. (eds.) International Conference on Medical Image Computing and Computer Assisted Intervention—MICCAI, vol. LNCS 9351, pp. 234–241. Springer, Cham (2015)
13. Wang, K., Pu, Y., Zhang, Y., Wang, P.: Fully automatic measurement of intima-media thickness in ultrasound images of the common carotid artery based on improved Otsu’s method and adaptive wind driven optimization. Ultrasonic Imaging 42(6), 245–260 (2020)
14. Zahnd, G., Kapellas, K., van Hattem, M., van Dijk, A., Sérusclat, A., Moulin, P., van der Lugt, A., Skilton, M., Orkisz, M.: A fully-automatic method to segment the carotid artery layers in ultrasound imaging: Application to quantify the compression-decompression pattern of the intima-media complex during the cardiac cycle. Ultrasound Med. Biol. 43(1), 239–257 (2017)
15. Zahnd, G., Orkisz, M., Sérusclat, A., Moulin, P., Vray, D.: Simultaneous extraction of carotid artery intima-media interfaces in ultrasound images: assessment of wall thickness temporal variation during the cardiac cycle. Int. J. Comput. Assist. Radiol. Surg. 9(4), 645–658 (2014)

Novel Co-SIFT Detector for Scanned Images Differentiation
Paula Stancelova, Zuzana Cernekova, and Andrej Ferko

Abstract We describe experiments with the hi-tech contactless scanner CRUSE CS 220ST1100 in the digitization of originals of natural and cultural heritage. The 2D scans guarantee high accuracy both in geometry and radiometry (48 bits for RGB colors). However, an inexperienced customer needs support in selecting the appropriate scan mode. To distinguish similar CRUSE scans, we proposed an image descriptor based on Harris corners and a topological structure embedding a planar subgraph. For some use-cases, the Harris approach did not perform well. We report on a novel SIFT type detector using concurrent color channels, hence the proposed name. We put our solution into the context of previous research and compare, on selected use-cases, the solution quality and/or disadvantages. Keywords CRUSE scanning modes · Harris corner · Hungarian algorithm · Image differentiation · Concurrent color channels

1 Introduction

Nowadays, the digitalization of cultural or natural heritage has become an essential part of the knowledge economy in the EU. State-of-the-art CRUSE scanners are now available to capture large-format originals in 2D and 3D and provide the highest image and color quality. This unique scanning system was developed for the special requirements of decorative and fine arts reproduction. It can be used in museums, archives and libraries (scanning maps and works of art), also for decor reproduction and design archives (objects such as wooden boards, natural stone tiles, even 3D


Fig. 1 Left. The larger scans display the same decor original, scanned using LRFB and LTx modes with the CCD “14.400 Pixel Tri Linear RGB Line-Sensor” with Schneider optics, CRUSE CS 220ST1100 contactless scanner, CSx software [9]. Right. The four areas with tens of Harris corners and Hungarian edges. Image courtesy Ján Jančok, 2022

objects, etc.) and to capture materials and surfaces to create patterns for floors, furniture, ceramics and many other applications. These scanners offer multiple scanning modes and may produce very similar scans. The owners of rare natural or cultural originals are confronted with limited accessibility and the sustainability of qualified decisions. Moreover, the time during a scanning session is very expensive. Therefore, we help CRUSE scanning beginners to differentiate similar CRUSE scans using a novel image detector and thus to easily choose the proper scanning mode. So far, we have tested Harris corner detectors to support, among others, the evidence of rustic parts in a wooden texture design for decors [8]. Figure 1 left illustrates the two 3D parts above, textured using data from two 2D scans (below), i.e. 4 misaligned areas in a single JPG photo. The grayscale Fig. 1 right presents several Harris corners, visualized using Hungarian edges (blue). More corners and more edges signal the “more rustic” scan of the decor, which was selected for production. Instead of the Harris corner detector, a different feature detector can be used to select vertices. In Figs. 2 and 3, local features calculated using the BRISK, FAST, and SURF detectors are shown and compared to the Harris corner detector. FAST appeared to be the least sensitive, at least in the number of points. All of them work with grayscale images. In our current research, we focus our attention on an improved SIFT method using color information. Many existing research methods have incorporated color into descriptors. We have chosen the approach of using the color information of the image already during the detection of interesting points. One of the main inspirations was the biologically inspired feature maps of the Itti model proposed in [20]. The rest of the paper is structured as follows. Section 2 describes relevant Related work, Sect. 4 explains Problem transformation. Section 5 introduces the new method Co-SIFT.


Fig. 2 Left. BRISK 40 points (green circles) [22]. Right. FAST 40 points (red markers) [29]. Image courtesy Ján Jančok, 2022

Fig. 3 Left. SURF 40 points (green circles) [4]. Right. The four areas with tens of Harris corners (red markers) [14] indicating the “most rustic” part of the decor. Image courtesy Ján Jančok, 2022

In Sect. 6 we discuss selected results. The paper is finished with Future work and a Conclusion.

2 Related Work

High-quality contactless scanning of 2D cultural or natural heritage originals is achieved by the CRUSE scanner [9]. The optical imaging technique is based on a pushbroom camera model [15]. The sensor is a CCD “14.400 Pixel Tri Linear RGB Line-Sensor” with Schneider optics. Our research is based on the CRUSE CS 220ST1100 model. There is little research on these geometrically exact, 48-bit RGB TIFF images.


The manual focusing analysis was described in [5], two alternatives of 3D reconstruction in [27], and fast visual comparison of two similar scans in [8]. Image quality measurement is based on the assumption that there exists one ideal, ground-truth image. If such an image is not available, we have to give up algorithmic problem solving and inevitably opt for a heuristic approach, selecting one of the available options as a relative quality prototype. For multiple sensors, the quality evaluation of nearly identical images was proposed in [26]. This scenario can be observed in CRUSE scanning using the multiple modes offered by the CSx software tool. The authors in [8] compared LRFB scans (in the role of ground truth) against other ones, using the Harris corner detector, Hungarian edges and SSIM comparisons. This detector suffered from a low count of Harris corners for certain hard-to-scan images. Our goal is to improve the method of [8] with a more sensitive corner detector.

2.1 Analysis of 8 Methods Utilizing Color for SIFT or SURF

The theory of opponent colors [17] was developed by Ewald Hering, who carried out physiological research based on the theory of J.W. Goethe. These studies confirmed Goethe’s assertion that some colors go very well with each other and others not at all, which led him to design the opponent principle of color-coding taking place on the retina. In the ganglion cells of the retina, three opponent channels are formed: one achromatic grey-scale channel and two chromatic ones, red-green and yellow-blue. From the literature, we analyze 8 methods that differ in their approach. Some authors have chosen to include the color information only when creating a descriptor. Others have already used it to detect interesting points. In Table 1, we present a simple analysis of all the methods and their important differences, such as the use of various color adaptations by creating new color models, the use of different or existing methods, and the use of various distance metrics when pairing the feature points. The first significant difference between the methods is the input image, which is either grayscale or color. Most methods using grayscale images process color information using only a selected histogram type. The obtained results are incorporated into the original descriptor by classical concatenation and evaluated. All of the methods with this approach had experimental results at a very good level. Authors usually choose a more demanding approach for color input. For good detection of interesting points and their subsequent pairing directly on a color image, a proper adjustment of the image colors is very important. There are several simple color models into which an image can be converted with subtle modifications before running the detection. However, it may happen that the results of detection or pairing will not be sufficient. The evaluations of the analyzed methods show that the results are not excellent, but overall sufficient. Each method met the requirements of its authors, using partially different techniques. The authors of the C-SIFT method focused on the detection of interesting points. Their result is a detector that, compared to the original SIFT method, gives better results in point detection. Very good results when pairing


Table 1 Analysis of color modifications of selected 8 methods

Method          | Input image | Color model                     | Representative of the color       | Metric
C-SIFT [1]      | Color       | Gaussian                        | Gradients of the color invariants | –
COLOR-SIFT [23] | Grayscale   | RGB                             | Local kernel hist.                | Euclidean, intersection
CH-SIFT [18]    | Grayscale   | RGB                             | Local kernel hist.                | Euclidean, Bhattacharyya dist.
SIFT-CCH [2]    | Grayscale   | RGB                             | Color co-occurrence matrix        | Euclidean
PC-SIFT [10]    | Color       | Color model based on perception | 3D color space                    |
COLOR-SURF [13] | Grayscale   | RGB                             | Local kernel hist.                | Euclidean, Bhattacharyya dist.
CW-SURF [19]    | Color       | CW model, Gaussian              | Gradients of the color invariants | Euclidean
LBP-SURF [28]   | Color       | Complex normalization           | –                                 | Euclidean, ambiguity dist.

interest points were recorded by the CH-SIFT method, which was the only one to give better results than both the SIFT and C-SIFT methods. We chose the CW-SURF method as the best among the color modifications of the SURF method. The method does not give much better results, but compared to other color SURF methods it gives the shortest processing time when pairing points. As high-precision color is a desirable property of CRUSE scans, we propose a novel Co-SIFT instead of Harris corners, and a relatively modest planar graph with a linear number of edges for structuring the input images into more understandable parts.

3 Contactless Scanning

The CRUSE scanner CS 220ST1100 offers 4 light sources LRFB (left, right, front, back); the LRFB scanning mode is the default one used for heating the lamps, which requires 30 minutes. Another mode is LTx, which better documents the relief of the given original. It is possible to scan a given scanline of real artwork under constant illumination accurately, up to 1200 ppi (pixels per inch), in the TIFF format.


The scanned original may weigh up to 300 kg, measuring 120×180×30 cm [9]. The current CSx software does not create a 3D model, but allows for precise scanning with variable illumination, 15° sensor rotation, and variable depth of field, thus offering dozens of scan versions of a single original. The ultimate and unique truth does not exist here. Consequently, there are many best scans, each one for a given mode and settings. The wooden decors shown above illustrate a sample from natural heritage. From cultural heritage we selected hard-to-distinguish cases like ambrotypes, daguerreotypes, or handmade papers [8].

4 Transformation to Assignment Problem

Originally, in operations research, this well-known combinatorial optimization problem minimizes the total cost of assigning given agents to given tasks with given costs. The graph theory formulation finds a minimum matching in a weighted bipartite graph. We construct the problem instance by weighting pairs of Harris points by their Euclidean distance. The symmetric distance matrix is completed by replacing the diagonal zero entries with a maximum value. This transformation allows applying the Hungarian algorithm [21]. The fast implementation by Yi Cao in the Matlab environment can be found at [7]. “This is an extremely fast implementation of the famous Hungarian algorithm (also known as Munkres’ algorithm)” [24]. Both steps, detecting Harris corners and computing Hungarian edges, are standard, well-established procedures. Illustrating the results on synthetic data in Fig. 4 left, we observe that triangulation offers too many edges. The Harris step filtered out unremarkable corners, like the lowest one; the Hungarian step reduced the edge count. Similarly, in Fig. 4 right, the results of the Hungarian algorithm (yellow edges) on the Lena image are shown. The triangulation offers too many edges and the Hungarian approach reduces the edge count. Obviously, there are multiple minima for the given input. Formally, for N points, we have a square N × N matrix A = (a_ij), with a_ij = a_ji, a_ij > 0, i, j = 1, 2, ..., N. The diagonal elements a_ii are set to a maximum value. In Fig. 5, N = 6. The task is to select exactly N non-diagonal elements, one in each row and each column, so as to minimize their sum. We explain the method on the following minimum-size example. We have modified the known 7-point set S = {A, B, C, D, E, F, G}, A(4, 6), B(0, 0), C(5, 0), D(11.1, 0.1), E(9, 2), F(7, 2), G(6, 10) to a 6-point one, as shown in Fig. 5. The original example served for a comparison of the properties of the Delaunay triangulation and the minimum weight triangulation [11]. For simplicity, we removed the leftmost point B. The standard Harris detector recognized (as expected) 6 corners, and their mutual distances formed the matrix in Table 2, with a maximum value of 100 for the diagonal elements to eliminate trivial (useless) zero-length edges. The solution of this instance of the assignment problem created 6 Hungarian edges, in fact three pairs (Fig. 5, the green edges). We denote them by the row and column indices of the above distance matrix: 1–6, 2–5, 3–4, 4–3, 5–2, 6–1. One can easily observe (and verify) that, for the given input, these edges form a Euclidean minimum matching (EMM). To obtain 6 Hungarian


Fig. 4 Left. The artificial dataset described by six Hungarian edges (yellow). The Harris corner detector did not select two of the apparent corners. Right. Lena output, Hungarian edges (yellow) and Delaunay triangulation (blue + yellow edges)

Fig. 5 Modified Dillencourt’s 7-point example with the leftmost point omitted. Result of the Hungarian algorithm–the green edges

edges, one should replace the redundant (symmetric) values below the diagonal with the maximum value. To speed up the matrix computation, we can use squared distances. Whether the doubled Hungarian edges always form the EMM edges remains an open question. The minimum total edge length criterion is the same, giving N Hungarian edges and N/2 EMM edges. To conclude the explanation, the symmetric construction of Hungarian edges offers an edge description with “between” N/2 and N edges.


Table 2 The input matrix values to the Hungarian algorithm for the Six-point example

100.0000   6.0828   9.2315   6.4031   5.0000   4.4721
  6.0828 100.0000   6.1008   4.4721   2.8284  10.0499
  9.2315   6.1008 100.0000   2.8320   4.5188  11.1364
  6.4031   4.4721   2.8320 100.0000   2.0000   8.5440
  5.0000   2.8284   4.5188   2.0000 100.0000   8.0623
  4.4721  10.0499  11.1364   8.5440   8.0623 100.0000
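The assignment problem on this matrix can be reproduced in a few lines of Python. SciPy's linear_sum_assignment is used here as a stand-in for the MATLAB implementation [7] mentioned above, and the diagonal is filled with the same maximum value of 100.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

# The six points of the modified Dillencourt example (B removed): A, C, D, E, F, G
points = np.array([[4, 6], [5, 0], [11.1, 0.1], [9, 2], [7, 2], [6, 10]], dtype=float)

cost = cdist(points, points)        # pairwise Euclidean distances (Table 2)
np.fill_diagonal(cost, 100.0)       # forbid trivial zero-length self-assignments

rows, cols = linear_sum_assignment(cost)   # Hungarian (Munkres) algorithm
print(list(zip(rows + 1, cols + 1)))       # 1-based "Hungarian edges"
# Expected optimum for this input: the pairs 1-6, 2-5, 3-4 and their mirrors
```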

5 Co-SIFT

Although color is perceived as an irreplaceable element describing the world around us, the techniques for extracting local features are mostly based on the description of shape, while the color information is entirely ignored. The pipeline of the proposed Co-SIFT [6] algorithm is illustrated in Fig. 6. The key idea of our solution is to incorporate color information from the image into the SIFT method by replacing the grayscale information, so that key points are detected separately on two chromatic channels (red-green (RG) and yellow-blue (YB)) and on the achromatic brightness channel. The individual steps of the algorithm are described in detail in the following subsections.

5.1 Preprocessing

Stephen Engel et al. [12] examined the human visual cortex and its color tuning using functional magnetic resonance imaging. The human visual cortex processes signals from the photoreceptors on the retina of the eye, the cones, and interprets color information.

Fig. 6 The pipeline of the Co-SIFT. [6]


The authors of [31] experimentally found that the strongest response is to red-green stimuli. For yellow-blue stimuli, the reaction is also strong, but compared to red-green stimuli it decreases rapidly. The combination of the trichromatic process and the opponent color-coding process was until recently considered impossible. The trichromatic process describes composing colors from several cone types, whereas the opponent-color process finds the color from their differences. However, the eye works on a much more complicated level, and it is these two processes that combine perfectly together. Therefore, as a basic model for our method we chose an approach based on the experiments in [12], using the chromatic opponent channels proposed in [20] and trichromatic color theory, for color image processing and the SIFT method. The intensity channel (1) is computed as the weighted average of the R, G and B values, where the weights were acquired by measuring the intensity perceived by people with undistorted trichromatic vision. The weights are the same as used in the standard sRGB model [3]:

I = 0.2126 ∗ R + 0.7152 ∗ G + 0.0722 ∗ B.    (1)

In this color space, the two chromatic channels (RG and YB) proposed in [30] are normalized by the intensity channel, which removes the effect of intensity variations. The channels are defined as follows:

RG = (R − G) / I    (2)

and

YB = (B − Y) / I,    (3)

where

Y = (R + G) / 2.    (4)
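A small NumPy sketch of Eqs. (1)-(4); the epsilon that guards against division by zero in dark pixels is our addition and not part of the original formulation.

```python
import numpy as np

def opponent_channels(rgb, eps=1e-6):
    """Intensity and chromatic opponent channels for an (H, W, 3) image in [0, 1]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    intensity = 0.2126 * r + 0.7152 * g + 0.0722 * b   # Eq. (1)
    y = (r + g) / 2.0                                   # Eq. (4)
    rg = (r - g) / (intensity + eps)                    # Eq. (2)
    yb = (b - y) / (intensity + eps)                    # Eq. (3)
    return intensity, rg, yb
```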

Now we proceed with the detection of the interesting points in the chromatic channels.

5.2 Interesting Point Detection

The SIFT algorithm introduced by [25] consists of a scale- and rotation-invariant detector and a HOG (histogram of oriented gradients) descriptor. The SIFT detector uses a Gaussian scale pyramid. The image is scaled to K sizes—octaves. Each octave is then recurrently filtered by a 2D Gaussian. Two consecutive images in each octave are then subtracted, and the resulting N − 1 DoG (difference of Gaussians) images are approximations of LoG (Laplacian of Gaussians) images. The points of interest are identified in a 3 × 3 × 3 neighborhood


in the DoG space. The octave in which the IP was found represents the “scale” of the IP and determines the size of the neighborhood for descriptor extraction of that point. In our method, the Co-SIFT IP detection is applied directly to each opponent chromatic channel.
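The per-channel detection can be prototyped, for instance, with OpenCV's stock SIFT applied to each opponent channel, as sketched below. This is only a rough stand-in for Co-SIFT: rescaling the chromatic channels to 8 bits and using the unmodified OpenCV detector are simplifications of the method described above.

```python
import cv2
import numpy as np

def keypoints_per_channel(rg, yb):
    """Detect SIFT keypoints separately on the RG and YB opponent channels."""
    sift = cv2.SIFT_create()
    keypoints = []
    for channel in (rg, yb):
        # Rescale the float channel to 8 bits before detection
        chan8 = cv2.normalize(channel, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
        keypoints.extend(sift.detect(chan8, None))
    return keypoints
```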

6 Experiments and Results

The Harris detector in Matlab offers 200 points as a default value for any image. We changed the Harris settings to 1000 points, but still no points were found in the interior parts. They were obtained thanks to the higher sensitivity of the novel Co-SIFT approach. The advantage of Co-SIFT over the Harris approach is clearly visible in the two pairs of scans in Fig. 7. Almost no Harris corners (red) were detected in the areas with faces in the portrait. Building the Hungarian edges makes no sense in this measurement. On the other hand, the Co-SIFT preprocessing promises reasonable and maybe decisive filtering of regions of interest and enables discussing image subregions given by Hungarian edges and a sufficient number of interior points, marked by blue markers. In Fig. 7c, 354 local features were detected, and in Fig. 7d, 269 local features. These numbers describe the quality of two scanning modes of the same original.

7 Conclusions

We report on the second set of experiments with image differencing for those CRUSE scans where the Harris corners did not perform well [8]. As an alternative to the Hungarian edges, multiple edge insertions are possible as well. We tested a novel feature detector named Co-SIFT. Contactless scanning is routinely used for documenting 2D assets of cultural or natural heritage, and it is worth objectivizing the differentiation of seemingly very similar images. The evaluation method is necessarily qualitative, based on expert opinion. In the case of the decors, our method confirmed the expert’s intuitive evaluation of rustic parts, while the proof-of-concept with the ambrotype is confirmed both by expert opinion and by an increased number of Co-SIFT points in areas where the previous method failed. Co-SIFT is obviously more efficient than Harris for the given data, and this observation was generally confirmed by an expert in the field.



Fig. 7 Colored ambrotype. Author unknown, 19th century, © Photographic reference collection at the Academy of Fine Arts and Design, 2022, p. 67 in [16]. a and b show the LRFB and LTx scans with Harris corners—red dots, and Hungarian edges—red lines; c and d show the same images with Co-SIFT markers—cyan crosses


8 Future Work

For grayscale input data, we have started to experiment with other feature detectors (FAST, BRISK, SURF). Their advantage over Harris corners is that they can be represented by their strength, visually indicated by the green circle radii. As this step does not depend on edge insertion, one can impose the number of points according to their “importance”. The novel Co-SIFT feature detector based on concurrent color channels offers better results for the selected “hard” examples. We intend to propose a detailed experimental evaluation of CRUSE scans to further study the promising proof-of-concept. The full set of scans contains another 15 items for the same ambrotype and 19 items for a daguerreotype with various settings.

Acknowledgment We express our gratitude to the anonymous referees for ideas on how to improve the written presentation. We are very thankful for providing data and image courtesy of selected “hard” scanning options. For the unique ambrography sample we thank doc. Mgr. art. Jana Hojstričová, ArtD. and Mgr. art. Jana Blaško Križanová, ArtD. from the Academy of Fine Arts and Design in Bratislava. Special thanks go to Ing. Ján Jančok (PARKET MANN, s.r.o) for selecting and documenting the decors. Last but not least, we highly appreciate the early discussions on the given approach with our late colleague doc. RNDr. Milan Ftáčnik, PhD.

References
1. Abdel-Hakim, A.E., Farag, A.A.: Csift: A sift descriptor with color invariant characteristics. In: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 1978–1983. CVPR ’06, IEEE Computer Society, USA (2006) https://doi.org/10.1109/CVPR.2006.95
2. Ancuti, C., Bekaert, P.: SIFT-CCH: Increasing the SIFT distinctness by color co-occurrence histograms. In: Image and Signal Processing and Analysis, 2007. ISPA 2007. 5th International Symposium, pp. 130–135. IEEE (2007)
3. Anderson, M., Motta, R., Chandrasekar, S., Stokes, M.: Proposal for a standard default color space for the internet - sRGB. In: Color and Imaging Conference, vol. 1996, pp. 238–245. Society for Imaging Science and Technology (1996)
4. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (surf). Comput. Vis. Image Underst. 110(3), 346–359 (2008)
5. Bohdal, R., Ferko, A., Fabián, M., Bátorová, M., Hrabovský, M., Lúčan, L., Zabadal, L.: Adaptive scanning of diverse heritage originals like synagogue interior, empty rare papers or herbarium items from the 19th century. In: Proceedings of the 18th Conference on Applied Mathematics (APLIMAT 2019), pp. 72–82. Curran Associates, Inc. (2019)
6. Budzakova, P., Sikudova, E., Berger Haladova, Z.: Color object retrieval using local features based on opponent-process theory. In: International Conference on Computer Vision and Graphics, pp. 275–286. Springer (2018)
7. Cao, Y.: Hungarian algorithm for linear assignment problems (v2.3). Matlab Central File Exchange, https://www.mathworks.com/matlabcentral/fileexchange/20652-hungarianalgorithm-for-linear-assignment-problems-v2-3 (2022)
8. Cerneková, Z., Berger Haladová, Z., Blaško Križanová, J., Hojstričová, J., Bohdal, R., Ferko, A., Zabadal, L.: “Hungarian” image (differencing) descriptor. In: DSAI 2022: International Conference on Software Development and Technologies for Enhancing Accessibility and Fighting Info-exclusion, August 31–September 2, 2022, NOVA-IMS, Lisboa, Portugal. ACM (2022)
9. CRUSE Spezialmaschinen GmbH, Wachtberg, Germany: Cruse software csx 3.9 manual (2018)
10. Cui, Y., Pagani, A., Stricker, D.: Sift in perception-based color space. In: Image Processing (ICIP), 2010 17th IEEE International Conference on, pp. 3909–3912. IEEE (2010)
11. Dillencourt, M.B.: Traveling salesman cycles are not always subgraphs of Delaunay triangulations or of minimum weight triangulations. Inf. Process. Lett. 24(5), 339–342 (1987). https://doi.org/10.1016/0020-0190(87)90160-8
12. Engel, S., Zhang, X., Wandell, B.: Colour tuning in human visual cortex measured with functional magnetic resonance imaging. Nature 388(6637), 68–71 (1997)
13. Fan, P., Men, A., Chen, M., Yang, B.: Color-SURF: A surf descriptor with local kernel color histograms. In: Network Infrastructure and Digital Content, 2009. IC-NIDC 2009. IEEE International Conference on, pp. 726–730. IEEE (2009)
14. Harris, C.G., Stephens, M., et al.: A combined corner and edge detector. In: Proceedings of the 4th Alvey Vision Conference, pp. 23.1–23.6 (1988)
15. Hartley, R.I., Gupta, R.: Linear pushbroom cameras. In: Eklundh, J. (ed.) Computer Vision–ECCV’94, Third European Conference on Computer Vision, Stockholm, Sweden, May 2–6, 1994, Proceedings, vol. I. Lecture Notes in Computer Science, vol. 800, pp. 555–566. Springer (1994) https://doi.org/10.1007/3-540-57956-7_63
16. Hojstričová, J.e.: Renesancia fotografie 19. Storočia. VSVU, Bratislava (2014)
17. Hurvich, L.M., Jameson, D.: An opponent-process theory of color vision. Psychol. Rev. 64(6p1), 384 (1957)
18. Jalilvand, A., Boroujeni, H.S., Charkari, M.M.: CH-SIFT: A local kernel color histogram sift based descriptor. In: Multimedia Technology (ICMT), 2011 International Conference on, pp. 6269–6272. IEEE (2011)
19. Jalilvand, A., Boroujeni, H.S., Charkari, N.M.: CWSURF: A novel coloured local invariant descriptor based on SURF. In: Computer and Knowledge Engineering (ICCKE), 2011 1st International eConference on, pp. 214–219. IEEE (2011)
20. Jost, T., Ouerhani, N., Von Wartburg, R., Müri, R., Hügli, H.: Assessing the contribution of color in visual attention. Comput. Vis. Image Underst. 100(1), 107–123 (2005)
21. Kuhn, H.W.: The hungarian method for the assignment problem. In: Jünger, M., Liebling, T.M., Naddef, D., Nemhauser, G.L., Pulleyblank, W.R., Reinelt, G., Rinaldi, G., Wolsey, L.A. (eds.) 50 Years of Integer Programming 1958–2008 — From the Early Years to the State-of-the-Art, pp. 29–47. Springer (2010) https://doi.org/10.1007/978-3-540-68279-0_2
22. Leutenegger, S., Chli, M., Siegwart, R.Y.: Brisk: Binary robust invariant scalable keypoints. In: 2011 International Conference on Computer Vision, pp. 2548–2555. IEEE (2011)
23. Li, D., Ke, Y., Zhang, G.: A SIFT descriptor with local kernel color histograms. In: Mechanic Automation and Control Engineering (MACE), 2011 Second International Conference on, pp. 992–995. IEEE (2011)
24. Lowe, D.G.: Object recognition from local scale-invariant features. In: Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2, pp. 1150–1157. IEEE (1999)
25. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
26. Perko, R., Ferko, A., Bornik, A., Cech, P.: Beyond image quality comparison. In: European Association for Computer Graphics–Eurographics, pp. 271–276 (2003)
27. Polak, M., Pecko, M., Bohdal, R., Ferko, A.: Experimental 3d shape reconstruction from high-precision 2d cruse scans using shadows and focusing. In: Proceedings of the 19th Conference on Applied Mathematics (APLIMAT 2020), pp. 878–887. Curran Associates, Inc. (2020)
28. Prabhakar, C., Kumar, P.: LBP-SURF descriptor with color invariant and texture based features for underwater images. In: Proceedings of the Eighth Indian Conference on Computer Vision, Graphics and Image Processing, p. 23. ACM (2012)
29. Rosten, E., Drummond, T.: Fusing points and lines for high performance tracking. In: Tenth IEEE International Conference on Computer Vision (ICCV’05), vol. 1, 2, pp. 1508–1515. IEEE (2005)
30. Swain, M., Ballard, D.: Color indexing. Int. J. Comput. Vis. 7(1), 11–32 (1991)
31. Thornton, J.E., Pugh, E.N.: Red/green color opponency at detection threshold. Science 219(4581), 191–193 (1983)

PointPillars Backbone Type Selection for Fast and Accurate LiDAR Object Detection
Konrad Lis and Tomasz Kryjak

Abstract 3D object detection from LiDAR sensor data is an important topic in the context of autonomous cars and drones. In this paper, we present the results of experiments on the impact of the backbone selection of a deep convolutional neural network on detection accuracy and computation speed. We chose the PointPillars network, which is characterised by a simple architecture, high speed, and modularity that allows for easy expansion. During the experiments, we paid particular attention to the change in detection efficiency (measured by the mAP metric) and the total number of multiply-add operations needed to process one point cloud. We tested 10 different convolutional neural network architectures that are widely used in image-based detection problems. For a backbone like MobilenetV1, we obtained an almost 4x speedup at the cost of a 1.13% decrease in mAP. On the other hand, for CSPDarknet we got an acceleration of more than 1.5x with an increase in mAP of 0.33%. We have thus demonstrated that it is possible to significantly speed up a 3D object detector in LiDAR point clouds with a small decrease in detection efficiency. This result can be used when PointPillars or similar algorithms are implemented in embedded systems, including SoC FPGAs. The code is available at https://github.com/visionagh/pointpillars-backbone.
Keywords LiDAR · PointPillars · MobilenetV1 · CSPDarknet · YOLOv4


1 Introduction

Object detection is an important part of several systems like Advanced Driver Assistance Systems (ADAS), Autonomous Vehicles (AV) and Unmanned Aerial Vehicles (UAV). It is also crucial for obstacle avoidance, traffic sign recognition, or object tracking (the tracking-by-detection approach). Usually, object detection is related to standard vision stream processing. Nevertheless, other sensors, such as radar, event cameras, and also LiDARs (Light Detection and Ranging), are used. The latter has many advantages: low sensitivity to lighting conditions (including correct operation at nighttime) and a fairly accurate 3D mapping of the environment, especially at a short distance from the sensor. This makes LiDAR a promising sensor for object detection, despite its high cost. Currently, it is used in autonomous vehicles (levels 3 and 4 of the SAE classification): Waymo, Mercedes S-Class and EQS, and many other experimental solutions. It should be noted that, due to the rather specific data format, processing the 3D point clouds captured by a LiDAR significantly differs from methods known from vision systems. The 3D point cloud is usually represented as an angle and a distance from the LiDAR sensor (polar coordinates). Each point is also characterised by the intensity of the reflected laser beam. Its value depends on the properties of the material from which the reflection occurred. An example point cloud from a LiDAR sensor with object detections is presented in Fig. 1. In object detection systems for autonomous vehicles, the most commonly used datasets are KITTI, Waymo Open Dataset, and NuScenes. The KITTI Vision Benchmark Suite (2012) [6] is the most popular. The training set consists of 7481 images along with the corresponding point clouds and annotated objects. KITTI maintains a ranking of object detection methods, where the annotated objects are split into three levels of difficulty (Easy, Moderate, Hard) corresponding to different occlusion levels, truncation, and bounding box height. The Waymo Open Dataset (2019) [28] includes 1950 sequences, which correspond to 200000 frames, but only 1200

Fig. 1 A sample point cloud from the KITTI data set [13] with the object detections marked


sequences are annotated. However, they contain as many as 12.6 million objects. Each year, Waymo holds a few challenges on several topics, e.g. 3D object detection and motion prediction. NuScenes [3] contains 1000 sequences—approximately 1.4 million images, 390 thousand LiDAR scans, and 1.4 million annotated objects. NuScenes also maintains a ranking of object detection methods. In this work, we decided to use the KITTI dataset because it still holds the position of the most widely used LiDAR database, where new solutions can be easily compared with those proposed so far.
Generally, two approaches to object detection in point clouds can be distinguished: “classical” and based on deep neural networks. In the first case, the input point cloud is subjected to preprocessing (e.g. ground removal), clustering, handcrafted feature vector calculation, and classification. “Classical” methods achieve only moderate accuracy on widely recognised test datasets—i.e. KITTI [13]. In the second case, deep convolutional neural networks (DCNN) are used. They provide excellent results (cf. the KITTI ranking [13]). However, the price for the high accuracy is the computational and memory complexity and the need for high-performance graphics cards (Graphics Processing Units—GPUs)—for training and, what is even more important, inference. This contrasts with the requirements for systems in autonomous vehicles, where detection accuracy, real-time performance, and low energy consumption are crucial. However, these requirements can be met by selected embedded platforms. They are characterised by low power consumption, which is usually related to lower computational power. For real-time performance on an embedded platform, not all algorithms that run on a high-end GPU can be used. One can choose a faster algorithm, although this usually has a negative impact on detection accuracy. Thus, a compromise has to be made—to speed up an algorithm for real-time performance we must accept a detection accuracy loss. However, using several techniques like quantisation, pruning, or careful network architecture redesign, this loss can be minimised.
Embedded platforms can be divided into two rough categories: fixed or variable (reconfigurable, reprogrammable). Examples of the former are embedded GPU devices (e.g., NVIDIA’s Jetson series) and various SoC solutions that include AI coprocessors (e.g., Coral, consisting of a CPU, GPU, and Google Edge TPU coprocessor). In this case, the task is to adapt the network architecture to the platform to maximise its capabilities. The second category of solutions consists mainly of SoC (System on Chip) devices containing so-called reprogrammable logic (FPGA—Field Programmable Gate Arrays). Examples include the Zynq SoC, Zynq UltraScale+ MPSoC and Versal/ACAP series from AMD Xilinx, and similar chips called SoC FPGAs from Intel. Reconfigurable resources allow for greater flexibility in network implementation. It is possible both to adapt the network to the architecture (an example is the AMD Xilinx DPU module [1]) and to build custom accelerators—then the hardware architecture is adapted to the network requirements.
Based on the initial analysis, we have selected the PointPillars [15] network for experiments, mainly due to the favourable ratio of detection precision to computational complexity. In our previous work [27] we described the process of running the PointPillars network on an embedded platform—the ZCU104 evaluation board.
However, we were not able to obtain real-time performance—a single point cloud


was processed in 374.66 ms, while it should be processed in less than 100 ms. Therefore, in the next step, we decided to optimise the network in terms of time performance. The most time-consuming part of each DCNN, in terms of multiply-add operations, is the so-called backbone. In the PointPillars network, 84% of the multiply-add operations are computed in the backbone. Therefore, potentially, speeding up this part will most strongly affect the overall performance of the algorithm. In this work, we investigate the impact of replacing the PointPillars backbone with a computationally lighter architecture on detection performance. We consider 10 different backbones, inspired by fast and lightweight algorithms for object detection in images. In particular, we are looking for solutions that significantly reduce the total number of multiply-add operations with a minimal decrease in detection performance. The main contributions of our work are:
• a review of 10 different versions of PointPillars in terms of detection performance and speed,
• the identification of several versions of the PointPillars network that are significantly faster than the original version with only a minimal decrease in detection performance.
To the best of our knowledge, this type of study for the PointPillars network has not been published before. We also share the code repository we used for the experiments. The remainder of this paper is organised as follows. In Sect. 2 we discuss two issues related to our work: DCNN approaches to object detection in LiDAR data and lightweight image processing DCNN backbones. Next, in Sect. 3 we present the backbones used in our experiment. The results obtained are summarised in Sect. 4. The paper ends with a short summary with conclusions and a discussion of possible future work.
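Since the argument above rests on the number of multiply-add operations in the backbone, the sketch below shows one simple way of estimating the MACs of the convolutional layers with PyTorch forward hooks. The input resolution is only a placeholder and the toy model is not the PointPillars backbone; a dedicated profiling library could be used instead.

```python
import torch
import torch.nn as nn

def count_conv_macs(model, input_shape=(1, 64, 496, 432)):
    """Rough multiply-add count of the Conv2d layers for one forward pass."""
    macs = []

    def hook(module, inputs, output):
        # MACs of a conv layer = (C_in / groups) * kH * kW * number of output elements
        kernel_ops = (module.in_channels // module.groups
                      * module.kernel_size[0] * module.kernel_size[1])
        macs.append(kernel_ops * output.numel())

    handles = [m.register_forward_hook(hook)
               for m in model.modules() if isinstance(m, nn.Conv2d)]
    with torch.no_grad():
        model(torch.zeros(input_shape))
    for h in handles:
        h.remove()
    return sum(macs)

# Toy example (a stand-in backbone, not PointPillars itself)
toy = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(64, 128, 3, stride=2, padding=1))
print(f"{count_conv_macs(toy) / 1e9:.2f} GMAC")
```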

2 Related Work

LiDAR data processing is often done with deep convolutional neural networks (DCNNs). DCNNs combine the entire processing workflow (end-to-end), including both feature extraction and classification. They provide high recognition performance at the cost of high computational and memory complexity. Neural networks for LiDAR data processing can be divided into two classes: 2D methods, where points from 3D space are projected onto a 2D plane under a perspective transformation, and 3D methods, where no dimension is removed. The latter are described in Sect. 2.1. In the former, the projection-based representations are split into Front-View (FV) [18] and Bird's Eye View (BEV) [40] representations.

2.1 DCNN Methods for 3D Object Detection on a LiDAR Point Cloud

Based on the representation of point clouds, LiDAR-based 3D detectors can be divided into point-based [25], voxel-based [15, 42] and hybrid methods [24]. Point-based methods process a point cloud in its original, unstructured form. Usually, LiDAR data are first subsampled and then processed by PointNet++ [19] inspired DNNs. An example of such a method is Point-RCNN [25]. It first subsamples a point cloud and processes it with a 3D PointNet++-like network. Based on point-wise features, it classifies each point as foreground or background and generates bounding box proposals for foreground points. The proposals are then filtered with Non-Maximum Suppression (NMS). In the refinement stage, point-wise features from 3D proposals (with a small margin) are gathered and processed to obtain the final bounding box and confidence prediction.

In voxel-based methods, point clouds are first voxelized, and then a tensor of voxels is processed by 2D/3D DCNNs. Examples of such methods are PointPillars [15] (described below) and VoxelNet [42]. In VoxelNet, the first step is to voxelize a point cloud and apply Voxel Feature Extraction (VFE) to extract a feature vector for each voxel. Afterwards, the voxel tensor is processed by a 3D DCNN. The output tensor is then flattened in the Z-axis direction and fed into a 2D DCNN to finally obtain 3D bounding boxes, class labels and confidence scores. SECOND [38] improves VoxelNet in terms of speed, as it changes the 3D convolutional layers in VoxelNet's backbone to sparse 3D convolutional layers. As LiDAR data are very sparse, this significantly speeds up the calculations.

Hybrid methods use elements of both aforementioned approaches. An example is PV-RCNN [24]. It is a two-stage detector. In the first stage, there are two feature extraction methods. The first one is a SECOND-like sparse 3D DCNN, which at the end generates 3D bounding box proposals. The second is inspired by PointNet++. First, a number of points, called keypoints, are sampled from the point cloud. Then, for each stage of the SECOND-like sparse 3D DCNN and for each keypoint, a set of neighbouring voxels from the stage's feature map is processed by a PointNet-like network. The outputs of these operations form a keypoint's feature vector. In the second stage of PV-RCNN, the keypoints' feature vectors are used to refine the bounding boxes.

Real-time methods focus on the speed of the algorithms, which usually results in a decrease in mAP. The authors of [34] base their solution on the SECOND [38] architecture. Prior to VFE, they added a fast and efficient module which fuses LiDAR data with vision data. The solution achieves better detection performance than SECOND and a speed of 17.8 fps (frames per second) on an Nvidia Titan RTX GPU. The authors of [36] also took advantage of the SECOND architecture. They calculate the initial features of a voxel as averages of the Cartesian coordinates and the intensity of the points inside the voxel, unlike SECOND, where VFE modules are used. They add submanifold 3D sparse convolutions to the backbone, in addition to sparse 3D convolutions. Besides, they use a self-attention mechanism and deformable convolutions.


Table 1 Comparison of the AP results for the 3D KITTI ranking (the Place column indicates the algorithm's place in the ranking). The best results in each column are those of TED, one of the top methods in June 2022. PointPillars, with an up to 11.7% lower AP, was ranked 229th.

Place  Method        Car Easy  Car Mod.  Car Hard
-      VoxelNet      77.47     65.11     57.73
229    PointPillars  82.58     74.31     68.99
191    Patches       88.67     77.20     71.82
134    STD           87.95     79.71     75.09
81     PV-RCNN       90.25     81.43     76.82
73     Voxel RCNN    90.90     81.62     77.06
68     SIENet        88.22     81.71     77.22
24     SE-SSD        91.49     82.54     77.15
1      TED           91.61     85.28     80.68
The solution achieves 26 fps on an Nvidia RTX 2080Ti with a detection efficiency better than SECOND. A different approach is presented by the authors of [31]. They build on the PointPillars architecture, which they accelerate using structured pruning. They use reinforcement learning methods to determine which weights should be pruned. They manage to obtain a slightly higher mAP and 76.9 fps on an Nvidia GTX 1080Ti, a 1.5x speedup over the original PointPillars version.

The detection performance of all aforementioned algorithms is measured using Average Precision (AP), given by the formula $AP = \int_{0}^{1} p(r)\,dr$, where $p(r)$ is the precision as a function of the recall $r$. Usually, AP is calculated per class of the evaluation dataset. The overall detection performance is measured with Mean Average Precision (mAP), which is AP averaged over all classes. In Table 1 we present a comparison of the AP results for Car detection on the KITTI dataset for several algorithms.

2.1.1 The PointPillars

The PointPillars network is a voxel-based method, but it removes the 3D convolutions by treating the pseudo-BEV (Bird's Eye View) map as the voxelized representation, so that end-to-end learning can be done using only 2D convolutions. The input to the PointPillars [15] algorithm is a point cloud from a LiDAR sensor. The results are oriented cuboids that denote the detected objects. A "pillar" is a 3D cell created by dividing the point cloud in the XY plane. The network structure is shown in Fig. 2. The first part, the Pillar Feature Net (PFN), converts the point cloud into a sparse "pseudo-image". The number of pillars is limited, as is the number of points in each pillar. The second part of the network, the Backbone (2D DCNN), processes the "pseudo-image" and extracts high-level features. It consists of two subnets: a "top-down" one, which gradually reduces the dimension of the "pseudo-image", and another that upsamples the intermediate feature maps and combines them into the final output map. The last part of the network is the Detection Head (SSD), whose task is to detect and regress the 3D cuboids surrounding the objects.


Fig. 2 An overview of the structure of the PointPillars network [15]. The Pillar Feature Network converts the point cloud into a “pseudo-image”, then using a 2D DCNN (with transposed convolutions), this image is transformed into a feature map used in the final detection (Single Shot Detector)

Objects are detected on a 2D grid using the Single-Shot Detector (SSD [16]) network. After inference, overlapping objects are merged using the NMS algorithm. The PointPillars network has a rather simple architecture and relatively small computational complexity. Moreover, it can be used as part of other, more complex DNNs like CenterPoint [39], which builds a two-stage detector based on features extracted by SECOND or PointPillars. In the KITTI 3D Car detection ranking (last access: 1st June 2022), it is in 229th place. For the Moderate difficulty level, it has an AP equal to 74.31%, while the best method [35] has 84.76%. In its basic form, PointPillars is certainly not the best available method for LiDAR-based object detection. However, the nuScenes and Waymo 3D object detection rankings contain many SOTA (State of the Art) variations of PointPillars and CenterPoint (which can incorporate PointPillars). Due to its speed, architectural simplicity, and flexibility for enhancement, it has been chosen as a case study for our research.
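For illustration, the modular structure described above can be sketched in a few lines of PyTorch. This is a simplified skeleton and not the mmdetection3d implementation; all class names, argument names and the grid size are illustrative, and the three submodules are assumed to be provided elsewhere.

import torch
import torch.nn as nn

class PointPillarsSkeleton(nn.Module):
    # Composition of the three PointPillars stages. The concrete submodules
    # (pillar_encoder, backbone_2d, ssd_head) are assumed to be given.
    def __init__(self, pillar_encoder, backbone_2d, ssd_head,
                 grid_size=(496, 432), num_pillar_features=64):
        super().__init__()
        self.pillar_encoder = pillar_encoder  # Pillar Feature Net (PFN)
        self.backbone_2d = backbone_2d        # "top-down" + upsampling 2D DCNN (swappable)
        self.ssd_head = ssd_head              # SSD-like detection head
        self.grid_size = grid_size
        self.num_pillar_features = num_pillar_features

    def scatter(self, pillar_features, coords, batch_size):
        # Scatter per-pillar feature vectors (N, C) back onto the BEV grid,
        # producing the sparse "pseudo-image" of shape (B, C, H, W).
        # coords is an int64 tensor of (batch index, y, x) triples.
        h, w = self.grid_size
        canvas = pillar_features.new_zeros(batch_size, self.num_pillar_features, h * w)
        flat_idx = coords[:, 1] * w + coords[:, 2]
        canvas[coords[:, 0], :, flat_idx] = pillar_features
        return canvas.view(batch_size, self.num_pillar_features, h, w)

    def forward(self, points_per_pillar, coords, batch_size):
        pillar_features = self.pillar_encoder(points_per_pillar)   # (N, C)
        pseudo_image = self.scatter(pillar_features, coords, batch_size)
        feature_map = self.backbone_2d(pseudo_image)                # the part replaced in Sect. 3
        return self.ssd_head(feature_map)                           # boxes, classes, scores

Because the 2D backbone is an isolated module operating on an image-like tensor, replacing it with a lighter architecture, as investigated in Sect. 3, does not require changes to the other two stages.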

2.2 Methods for Real-Time Object Detection on Images

Real-time object detection on images can be achieved in a number of ways. Among the most popular methods are those that form the family of YOLO single-stage detectors. Starting from the first YOLO [20], the authors of subsequent versions (YOLOv2 [21], YOLOv3 [22], YOLOv4 [2], ScaledYOLOv4 [32], YOLO-R [33], etc.) gradually improved both speed and accuracy. It is worth mentioning that the YOLO-R detection performance is comparable even to the best non-real-time methods. Another example of an efficient detector is EfficientDet [30]; it is, however, outperformed by ScaledYOLOv4 and its successors. Other real-time object detection algorithms are usually formed as a general-purpose fast backbone with some detection head (e.g. a Single Shot Detector head). The short history of DCNNs includes several families and types of fast backbones.


NASNet [43] tries to use Network Architecture Search (NAS), a reinforcement learning technique, to search for an optimal architecture for a particular task. ResNet [9] follows the concept of efficient skip-connected bottleneck blocks: first reducing the number of features with efficient 1x1 convolutions, then performing 3x3 convolutions, and finally expanding the channels with subsequent 1x1 convolutions. The authors of SqueezeNet [12] had a similar idea with the Fire module, which first reduces the number of input channels with 1x1 convolutions and then performs 3x3 ones. SqueezeNext [7] tries to improve the speed of its predecessor, SqueezeNet, by breaking 3x3 convolutions into subsequent 1x3 and 3x1 convolutions. There is also a whole family of Inception networks, concluded by Xception [4] and InceptionV4 [29]. Xception, very similar to MobilenetV1 [11], uses Depthwise Separable Convolutions to considerably lower the number of floating point operations in the network. Xception, in contrast to MobilenetV1, uses skip connections. MobilenetV2 [23] improves its predecessor mainly in terms of detection accuracy. It uses shortcut connections and introduces the Inverted Residual Layer, which incorporates a 1x1 convolution layer expanding the number of channels followed by a Depthwise Separable Convolution. MobilenetV3 [10], on the other hand, tries to enhance its predecessor by following the NAS approach (similarly to NASNet) and proposing a few changes such as a new activation function and the usage of squeeze-and-excitation modules. The authors of ShufflenetV1 [41] use grouped and depthwise convolutions to increase speed. To allow information flow between different groups of channels, they introduce the "channel shuffle" operation. ShufflenetV2 [17], on the other hand, splits the feature map channels into two groups. One of them is not processed at all, while the other is fed into a variant of the Inverted Residual Layer (as in MobilenetV2). After merging the groups, a "channel shuffle" operation is performed. Several of the above-mentioned DNN architectures were chosen for the PointPillars backbone replacement. Potentially, they should allow PointPillars to be sped up, as they sped up networks like VGG [26] or AlexNet [14].
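As an illustration of the bottleneck idea, a ResNet-style block can be sketched in PyTorch as follows (the layer sizes and the reduction factor are illustrative and not taken from [9]); setting groups to 32 in the 3x3 convolution turns it into a ResNeXt-style block.

import torch.nn as nn

class Bottleneck(nn.Module):
    # 1x1 reduce -> 3x3 (optionally grouped) -> 1x1 expand, plus a skip connection.
    def __init__(self, channels, reduction=4, groups=1):
        super().__init__()
        mid = channels // reduction
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.block(x))

For example, Bottleneck(128) is a plain bottleneck block, while Bottleneck(128, groups=32) is a grouped (ResNeXt-like) variant.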

3 Comparison of Backbone Types for the PointPillars Network

In the original version of the PointPillars network, most of the multiply-add operations (about 84%) are concentrated in the backbone, specifically in the "top-down" submodule. Thus, potentially, speeding up this part of the algorithm will have the greatest impact on the time results obtained. In this paper, we focus on experimenting with different computational architectures that can be used to replace the original backbone in order to reduce processing time at the expense of a minimal decrease in detection performance. Ten different types of architecture were selected to replace the "top-down" part of the backbone:


• SqueezeNext [7]: consists of sequentially connected parts denoted as "SqueezeNext blocks". One "SqueezeNext block" is a combination of a 1x1 convolution that decreases the number of channels, a 1x3 convolution, a 3x1 convolution, and a 1x1 convolution that increases the number of channels to a specified value. SqueezeNext in its original version consists of an input convolutional layer, 21 "SqueezeNext blocks" and a classifier,
• ResNet [9]: depending on the version, it can consist of two different types of basic blocks:
  – building block: two 3x3 convolutional layers with a skip connection,
  – bottleneck building block: a 1x1 convolution decreasing the number of features, a 3x3 convolution, a 1x1 convolution increasing the number of features, and a skip connection.
  In different ResNet versions, the total number of convolutional layers is 18, 34, 50, 101 or 152.
• ResNeXt [37]: the architecture is based on a modified version of ResNet's bottleneck building block, in which the 3x3 convolution is replaced by a 3x3 convolution with 32 groups. A grouped convolution, in contrast to an ordinary one, divides the input feature map into N groups in the channel dimension, applies the convolution operation to each of the groups separately, and concatenates the outputs in the channel dimension at the end.
• MobilenetV1 [11]: takes advantage of Depthwise Separable Convolutions (also called Separable Convolutions), a combination of "depthwise" and "pointwise" convolutions (see the sketch after this list). A "depthwise" convolution, in contrast to the usual one, is performed for each channel separately and the output always has as many channels as the input. It is a special case of a grouped convolution, where the number of groups is equal to the number of input channels. In MobilenetV1 the kernel size of the "depthwise" convolutions is always set to 3x3. A "pointwise" convolution, on the other hand, is a 1x1 convolution, which is used to make a given output channel dependent on all input channels and possibly to change the number of channels. The original architecture of MobilenetV1 includes a regular convolutional layer, 13 Separable Convolutions, and a classifier.
• MobilenetV2 [23]: the basic unit is the Inverted Residual Layer. It consists of a 1x1 convolution increasing the number of channels, a Separable Convolution and a skip connection (if the block's stride is equal to 1). MobilenetV2 consists of an ordinary convolution, 7 Inverted Residual Layers and a classifier.
• ShuffleNetV1 [41]: the basic block is the ShuffleNet unit. It consists of a 1x1 group convolution, a "channel shuffle" operation, a "depthwise" 3x3 convolution, a 1x1 group convolution with the same number of groups as the first, and a skip connection. The "channel shuffle" operation is designed to shuffle channels from the individual convolution groups so that they can interact. Otherwise, using only group convolutions with the same number of groups interleaved with "depthwise" convolutions, the processing paths of the individual channel groups would be
completely separated from each other. ShufflenetV1 consists of a convolutional layer and MaxPooling, 16 ShuffleNet units and a classifier.
• ShuffleNetV2 [17]: the basic unit's first operation is a "channel split", which splits the feature map in the channel dimension into two separate maps subject to two parallel processing tracks. The first track leaves the map unchanged. The second processes the map using 3 layers: a 1x1 convolution, a "depthwise" 3x3 convolution, and a 1x1 convolution. The maps from the ends of both tracks are combined into one by concatenation in the feature dimension. The final element is the "channel shuffle" operation, which mixes the features from both processing tracks. ShufflenetV2 consists of a 3x3 convolutional layer, MaxPooling, 16 ShuffleNet units, a 1x1 convolutional layer and a classifier.
• Darknet (Darknet53): the backbone of YOLOv3 [22]. The basic unit consists of a 1x1 convolutional layer, a 3x3 convolutional layer and a skip connection. Darknet53 has 5 blocks operating at different resolutions, containing 1, 2, 8, 8, and 4 basic units respectively.
• CSPDarknet (CSPDarknet53): the backbone of YOLOv4 [2]. It is based on Darknet but with a different block structure. First, the input feature map is split in the channel dimension into two groups (similarly to ShufflenetV2). One group is left unchanged, and the other is processed by a block of Darknet basic units. Both groups are finally concatenated into one and processed by a 1x1 convolution.
• Xception [4]: takes advantage of Separable Convolutions. The basic block consists of two or three separable convolutions with the ReLU activation function, surrounded by a skip connection. For blocks with a stride equal to 2, a 3x3 MaxPooling with stride 2 is used after the separable convolutions, and a 1x1 convolution with stride 2 is included in the skip connection to equalise the number of channels and the size of the feature map. The original version of Xception consists of two convolutional layers, 12 basic blocks and a classifier.
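The basic units of the Mobilenet and ShuffleNet families mentioned in this list can be sketched as follows (a minimal, illustrative PyTorch version; the activation functions, channel counts and other hyperparameters of the original networks are simplified):

import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    # Interleave channels from different convolution groups so that
    # information can flow between them ("channel shuffle").
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class DepthwiseSeparableConv(nn.Module):
    # A "depthwise" 3x3 convolution (one filter per channel) followed by a
    # "pointwise" 1x1 convolution mixing the channels.
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                            groups=in_ch, bias=False)
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.dw(x)))
        return self.relu(self.bn2(self.pw(x)))

class InvertedResidual(nn.Module):
    # 1x1 expansion followed by a depthwise separable convolution,
    # with a skip connection (used here only for stride 1).
    def __init__(self, channels, expansion=6):
        super().__init__()
        mid = channels * expansion
        self.expand = nn.Sequential(nn.Conv2d(channels, mid, 1, bias=False),
                                    nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.dw_sep = DepthwiseSeparableConv(mid, channels)

    def forward(self, x):
        return x + self.dw_sep(self.expand(x))

For example, channel_shuffle(torch.randn(1, 8, 4, 4), groups=2) interleaves the two channel groups, and InvertedResidual(64) processes a (1, 64, H, W) tensor while keeping its shape.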

Our choice was guided by both the speed of the individual networks and the relatively low complexity of the computational architecture. For implementation on embedded systems, e.g. SoC FPGAs, the complexity and irregularity of the computing architecture can make the implementation much more difficult and slower. For this reason, we have omitted, e.g., NASNet. The main novelty of MobilenetV3 compared to MobilenetV2 is the use of the NAS (Network Architecture Search) technique and the related NetAdapt technique, so as to tailor the architecture to a specific computational task. In this paper, we study a different computational task from the one MobilenetV3 was adapted to, and the network architectures are modified anyway. For this reason, MobilenetV3 is omitted and MobilenetV2 is examined. We have also omitted the Inception family [29], due to its comparable results with the architecturally simpler ResNet. Additionally, each of the architectures considered had to be modified so that it could be applied to the PointPillars network.


We assumed that the numbers of layers, the numbers of channels, and the number of blocks in the original version of PointPillars are at least roughly matched to the task of 3D object detection on the KITTI dataset. Thus, we decided on the following solution: each Conv-BN-ReLU layer sequence from the "top-down" part of the original backbone was considered as a so-called "basic unit". Then, for each of the considered computational architectures, a backbone of type X was created by replacing the basic units from the "top-down" part of the original backbone with the basic units from X. The basic units in the different backbone types are defined as follows:

• SqueezeNext: the SqueezeNext block,
• ResNet: the bottleneck building block,
• ResNeXt: the bottleneck building block with the ResNeXt modifications,
• MobilenetV1: the Depthwise Separable Convolution,
• MobilenetV2: the Inverted Residual Layer,
• ShuffleNetV1: the ShuffleNet unit,
• ShuffleNetV2: the basic unit of ShufflenetV2,
• Darknet: Darknet's basic unit,
• CSPDarknet: Darknet's basic unit; however, as in CSPDarknet, blocks of basic units in the modified PointPillars backbone are split into two parallel processing lanes,
• Xception: Xception's basic block.

In future work, once a particular backbone is selected, NAS techniques can be used to find the optimal number of layers, number of channels, and number of blocks. Currently, such experiments with 10 different networks would be too time-consuming.

The framework mmdetection3d [5], based on PyTorch, was chosen for the experiments. It contains implementations of the chosen 3D object detection methods, including PointPillars. In comparison with the original PointPillars implementation, it has better modularity and custom changes can be made more conveniently. These two features made the experiments easier to conduct. mmdetection3d is less optimised in terms of speed (for PointPillars, it offers about 40 fps instead of 60 fps; voxelization and NMS are particularly slow), but this does not affect the results, since the number of multiply-add operations is compared, not the number of fps obtained on a specific hardware platform. In terms of detection efficiency, the two implementations are comparable, since the mAP values for car detection on the KITTI-val set in mmdetection3d (77.1%) and in the original implementation (76.9%) are very close. However, it should be noted that in this work the PointPillars network is trained on all three classes at the same time, as opposed to the paper [15], where the network for car detection and the network for pedestrian and cyclist detection are trained separately. As a consequence, the results may be slightly worse than those obtained in [15]. All experiments were run with the same training parameters as the base PointPillars implementation from mmdetection3d, which in turn uses the same settings as [15], except for:
• optimiser: AdamW instead of Adam,
• weight decay equal to 0.01: the original PointPillars does not use weight decay,
• learning rate schedule: a cosine annealing learning rate schedule with an initial learning rate of 10^-3, rising to 10^-2 after 64 epochs and then falling to 10^-7 at the final, 160th epoch, instead of the original exponentially decaying learning rate.

We used an Nvidia RTX 2070S GPU; one training run lasted 14 hours on average.
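In plain PyTorch, such a training configuration could be sketched as follows. This is only an approximation, not the actual mmdetection3d configuration: the rise-and-fall cosine profile is reproduced here with OneCycleLR, and the data loading and loss computation are placeholders.

import torch

# `model` and `train_loader` are assumed to exist.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-2,                        # peak learning rate
    epochs=160,
    steps_per_epoch=len(train_loader),
    pct_start=0.4,                      # 64 of 160 epochs spent rising
    anneal_strategy='cos',
    div_factor=10,                      # initial lr = 1e-2 / 10 = 1e-3
    final_div_factor=1e4,               # final lr = 1e-3 / 1e4 = 1e-7
)

for epoch in range(160):
    for batch in train_loader:
        loss = model(batch)             # placeholder for the real loss computation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()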

4 Results

Detection efficiency is measured with the mAP (Mean Average Precision) metric, which is the average AP value over all classes (Car, Pedestrian, Cyclist) and all difficulty levels (Easy, Moderate, Hard). For the network speed comparison, we use the number of multiply-add operations (MAdd) needed to process one point cloud. This value is independent of the computing platform and of the computation's precision, whether floating point, fixed point, or binary values are used. The frame rate is not used directly for the speed comparison because, depending on the specific hardware platform and the optimisations in the computing libraries, the results can vary significantly. In the results, the original PointPillars network is denoted as "base". We adopt the real-time definition from [27], i.e. processing point clouds at a rate of 10 fps or greater.
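For reference, AP and mAP can be approximated from precision-recall pairs with a few lines of Python (the official KITTI evaluation uses interpolation over fixed recall points, so this is only a simple numerical sketch; the example values are copied from Tables 3 and 4):

import numpy as np

def average_precision(recall, precision):
    # AP = integral of p(r) over r in [0, 1], approximated with the trapezoidal rule.
    order = np.argsort(recall)
    return float(np.trapz(np.asarray(precision)[order], np.asarray(recall)[order]))

def mean_average_precision(ap_table):
    # ap_table: {class_name: {difficulty: AP}}; mAP is the average over all entries.
    values = [ap for per_class in ap_table.values() for ap in per_class.values()]
    return sum(values) / len(values)

base_ap = {  # AP values of the "base" backbone, taken from Table 3
    "Car":        {"Easy": 85.90, "Mod.": 73.88, "Hard": 67.98},
    "Pedestrian": {"Easy": 50.17, "Mod.": 45.11, "Hard": 41.09},
    "Cyclist":    {"Easy": 78.66, "Mod.": 59.51, "Hard": 56.02},
}
print(round(mean_average_precision(base_ap), 2))  # 62.04, as reported in Table 4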

Table 2 Backbone characterisation. Par. denotes the number of parameters (unit: 10^6 parameters). MAdd denotes the number of multiply-add operations (unit: 10^9 operations). fps-B denotes the number of frames per second measured for the backbone only. fps denotes the processing rate for the whole algorithm. MAdd-Su, fps-B-Su and fps-Su denote the speedup in terms of the number of multiply-add operations, the backbone fps and the whole-algorithm processing rate, respectively. All speedup values are calculated relative to the base backbone version. The fps and fps-B values are measured using an Nvidia RTX 2070S GPU (in the mmdetection3d environment). The number of parameters is 1.5x-5.5x smaller than in the original PointPillars. There is no strict correlation between the fps-B and MAdd values. Interestingly, the speedup measured in terms of fps-B and fps is significantly smaller than the speedup measured in terms of the number of multiply-add operations.

              Par.   MAdd   MAdd-Su  fps-B  fps-B-Su  fps   fps-Su
Base          4.83   34.91  1        128    1         47.8  1
CSPDarknet    2.33   20.10  1.74     127.9  1         46.2  0.97
Darknet       3.15   23.51  1.48     131.1  1.02      48.7  1.02
MobilenetV1   1.13   8.84   3.95     194.9  1.52      53    1.11
MobilenetV2   1.13   8.84   3.95     191.6  1.5       54.3  1.14
ResNet        1.49   12.28  2.84     173.3  1.35      51.7  1.08
ResNeXt       1.49   12.30  2.84     148.4  1.16      48.7  1.02
ShufflenetV1  1.11   9.78   3.57     161.3  1.26      50.9  1.06
ShufflenetV2  0.88   7.80   4.48     200    1.56      54    1.13
SqueezeNext   1.84   14.66  2.38     86.7   0.68      38.7  0.81
Xception      1.17   10.72  3.26     131.6  1.03      47.1  0.99
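The MAdd values above are analytical operation counts rather than measurements. For a standard 2D convolution they can be estimated as sketched below (bias terms and framework-specific optimisations are ignored; the feature map size used in the example is illustrative):

def conv2d_macs(c_in, c_out, k_h, k_w, h_out, w_out, groups=1):
    # Each output value needs (c_in / groups) * k_h * k_w multiply-adds and
    # there are c_out * h_out * w_out output values.
    return (c_in // groups) * k_h * k_w * c_out * h_out * w_out

# Example: a 3x3 convolution with 64 input and 64 output channels
# on a 248x216 feature map (an illustrative size).
print(conv2d_macs(64, 64, 3, 3, 248, 216) / 1e9)  # ~1.97 GMAD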


Table 3 AP results of 3D detection for various backbone types. Easy, Moderate (Mod.) and Hard are the KITTI object detection difficulty levels. Note that the original PointPillars does not always achieve the best AP results.

              Car                    Pedestrian             Cyclist
              Easy   Mod.   Hard     Easy   Mod.   Hard     Easy   Mod.   Hard
base          85.90  73.88  67.98    50.17  45.11  41.09    78.66  59.51  56.02
CSPDarknet    85.87  75.81  68.44    50.31  44.97  39.67    79.41  59.68  57.17
Darknet       84.94  75.49  68.42    45.85  40.99  36.53    79.78  59.45  55.78
MobilenetV1   82.95  73.15  67.8     49.35  45.08  41.47    76.76  58.50  54.99
MobilenetV2   83.85  73.15  67.88    48.45  43.76  39.12    75.69  57.36  53.88
ResNet        84.61  73.47  68.03    47.58  42.70  38.53    78.50  59.06  55.80
ResNeXt       84.60  73.86  68.13    47.11  42.37  37.73    75.80  57.07  52.99
ShufflenetV1  83.61  73.33  67.58    47.49  43.03  38.41    74.34  57.16  53.72
ShufflenetV2  82.85  72.62  67.42    43.41  38.82  35.22    74.68  57.47  54.03
SqueezeNext   77.05  64.87  58.5     43.47  38.83  35.72    62.68  45.49  43.11
Xception      84.81  75.37  68.21    46.12  41.43  37.13    78.97  61.20  56.81

The results for all KITTI classes and difficulty levels are shown in Table 3; an exemplary detection is shown in Fig. 3. It is worth noting that the original PointPillars version is not always the best. Table 4 shows the results in a more compact form: mAP versus the speed of each backbone type (in terms of the number of multiply-add operations). Table 2 shows more detailed backbone size and speed characteristics, including the number of multiply-add operations, the number of parameters, and the fps measured on an Nvidia RTX 2070S GPU. The most surprising backbone type is CSPDarknet, which has both a better mAP than the original PointPillars version and more than 1.5x fewer multiply-add operations. Figure 5 shows the dependence of mAP on the number of multiply-add operations for each backbone type. If we consider a multi-criteria optimisation task in which we maximise mAP and minimise the number of multiply-add operations, then the Pareto set (i.e. the set of non-dominated solutions) contains the following backbone types: CSPDarknet, MobilenetV1 and ShufflenetV2. Interestingly, the original version of PointPillars is not in the Pareto set, because it is dominated by CSPDarknet. The fastest backbone type is ShufflenetV2 and the most accurate is CSPDarknet; MobilenetV1 has intermediate values. The answer to the question of which one is best for implementation in an embedded system is not clear-cut and depends on the specific situation. MobilenetV1 offers a very good compromise between detection efficiency and processing speed. Its architecture is the simplest of those considered and is suitable for implementation on virtually any embedded platform and accelerator, including pipelined accelerators on FPGAs. Compared to the original PointPillars version, MobilenetV1 is nearly 4x faster with a mAP decrease of only 1.13%.
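The Pareto set discussed above can be computed directly from the numbers in Table 4; a short sketch is given below (the mAP and GMAD values are copied from Table 4):

# (Overall mAP [%], multiply-add count [GMAD]) per backbone, taken from Table 4.
results = {
    "Base": (62.04, 34.91), "CSPDarknet": (62.37, 20.10), "Darknet": (60.80, 23.51),
    "MobilenetV1": (61.12, 8.84), "MobilenetV2": (60.35, 8.84), "ResNet": (60.92, 12.28),
    "ResNeXt": (59.96, 12.30), "ShufflenetV1": (59.74, 9.78), "ShufflenetV2": (58.50, 7.80),
    "SqueezeNext": (52.19, 14.66), "Xception": (61.12, 10.72),
}

def pareto_set(points):
    # A backbone is dominated if another one has both a higher (or equal) mAP and
    # fewer (or equal) multiply-adds, with at least one strict inequality.
    def dominated(name):
        m, g = points[name]
        return any(m2 >= m and g2 <= g and (m2 > m or g2 < g)
                   for n2, (m2, g2) in points.items() if n2 != name)
    return [name for name in points if not dominated(name)]

print(pareto_set(results))  # ['CSPDarknet', 'MobilenetV1', 'ShufflenetV2']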


Fig. 3 An example detection result of the PointPillars version with the CSPDarknet backbone

Table 4 mAP results of 3D detection for various backbone types. The mAP for Car, Pedestrian and Cyclist is the AP averaged over the KITTI difficulty levels for the particular class. The Overall mAP is the AP averaged across all classes and all KITTI difficulty levels. It is worth noting that the original PointPillars is not the best in Overall mAP. The Speedup is computed as the ratio of the original PointPillars' multiply-add operation count to that of the given backbone type.

              Overall  Car    Pedestrian  Cyclist  GMADs  Speedup
Base          62.04    75.92  45.46       64.73    34.91  1
CSPDarknet    62.37    76.71  44.98       65.42    20.10  1.74
Darknet       60.80    76.28  41.12       65.00    23.51  1.48
MobilenetV1   61.12    74.63  45.30       63.42    8.84   3.95
MobilenetV2   60.35    74.96  43.78       62.31    8.84   3.95
ResNet        60.92    75.37  42.94       64.45    12.28  2.84
ResNeXt       59.96    75.53  42.40       61.95    12.30  2.84
ShufflenetV1  59.74    74.84  42.98       61.74    9.78   3.57
ShufflenetV2  58.50    74.30  39.15       62.06    7.80   4.48
SqueezeNext   52.19    66.81  39.34       50.43    14.66  2.38
Xception      61.12    76.13  41.56       65.66    10.72  3.26

However, if we care less about detection performance and more about speed, or conversely more about detection performance and less about speed, then ShufflenetV2 or CSPDarknet, respectively, may be a better choice. Compared to the original PointPillars version, ShufflenetV2 is nearly 4.5x faster with a 3.54% decrease in mAP, and CSPDarknet is more than 1.5x faster with a 0.33% increase in mAP.

If we want to detect objects of a particular class, the conclusions are slightly different from those drawn above. For particular classes, the Pareto sets of the corresponding multi-criteria optimisation problems (where the mAP metric applies to only one class) are:

• for the car class (see Fig. 4):
  – least multiply-add operations: ShufflenetV2,
  – intermediate values: MobilenetV2, Xception,
  – highest mAP: CSPDarknet,


Fig. 4 mAP results (for the car class) compared to the number of multiply-add operations for various backbone types. The GMAD unit denotes 10^9 multiply-add operations. The Pareto set includes ShufflenetV2, MobilenetV2, Xception and CSPDarknet

Fig. 5 mAP results (for all classes) compared to the number of multiply-add operations for various backbone types. The GMAD unit denotes 10^9 multiply-add operations. The Pareto set includes ShufflenetV2, MobilenetV1 and CSPDarknet


Fig. 6 mAP results (for the pedestrian class) compared to the number of multiply-add operations for various backbone types. The GMAD unit denotes 10^9 multiply-add operations. The Pareto set includes ShufflenetV2, MobilenetV1 and the base backbone

• for the pedestrian class (see Fig. 6):
  – least multiply-add operations: ShufflenetV2,
  – intermediate values: MobilenetV1,
  – highest mAP: base,
• for the cyclist class (see Fig. 7):
  – least multiply-add operations: ShufflenetV2,
  – intermediate values: MobilenetV1,
  – highest mAP: Xception.

ShufflenetV2 is always the fastest solution because, regardless of the class we are considering, the architecture of the individual solutions does not change. However, the mAP changes, and depending on the class we focus on, different types of backbones appear in the Pareto set, from which we should choose a solution for our particular application. If a particular solution imposes memory constraints, one should also consider the number of parameters of a model (see Table 2). Depending on the backbone type, the checkpoint size varies from 3.7 MB (ShufflenetV2, the smallest number of parameters) to 19.4 MB (base, the largest number of parameters). The model size can be further reduced by pruning (removing near-zero weights), quantisation (precision lower than the default 32-bit floating point) or compression. However, even the largest obtained model size, 19.4 MB, should not exceed the memory limits of most solutions.
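The reported checkpoint sizes can be roughly reproduced from the parameter counts in Table 2, assuming 32-bit floating point weights (a back-of-the-envelope sketch; real checkpoint files also contain some metadata):

def checkpoint_size_mb(params_millions, bytes_per_weight=4):
    # 4 bytes per weight for 32-bit floats; metadata and optimiser state are ignored.
    return params_millions * 1e6 * bytes_per_weight / 1e6

print(round(checkpoint_size_mb(0.88), 1))                      # ~3.5 MB, ShufflenetV2 (reported: 3.7 MB)
print(round(checkpoint_size_mb(4.83), 1))                      # ~19.3 MB, base (reported: 19.4 MB)
print(round(checkpoint_size_mb(4.83, bytes_per_weight=1), 1))  # ~4.8 MB for the base model after 8-bit quantisation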


Fig. 7 mAP results (for the cyclist class) compared to the number of multiply-add operations for various backbone types. The GMAD unit denotes 10^9 multiply-add operations. The Pareto set includes ShufflenetV2, MobilenetV1 and Xception

In Fig. 8, we compare the backbone fps with the number of multiply-add operations for the backbones of the different PointPillars versions (for detailed data see Table 2). The results were obtained with the mmdetection3d framework; the timing was measured on an Nvidia RTX 2070S GPU. Only the backbone duration was included in the fps count; an analysis of the whole-algorithm fps results is given below. As can be seen, in this implementation a lower multiply-add count does not always mean a faster network. This is probably due to the specifics of the computing libraries and the computing platform: some architectures are a better fit, others a worse one. On other embedded platforms, such as FPGAs, the dependence of fps on the number of multiply-add operations may look quite different.

The speedup in terms of backbone fps (see Table 2) translates into only a slight acceleration in terms of the whole-algorithm fps. For example, in the case of MobilenetV2, a 1.5x backbone speedup results in only a 1.14x whole-algorithm acceleration. This is related to the share of the backbone in the total processing time. In the mmdetection3d implementation, the backbone of the "base" version is responsible for 39% of the whole processing time. According to Amdahl's law [8], by optimising the backbone speed alone we can achieve up to a 1.61x speedup of the whole algorithm (when the backbone time approaches zero). Considering that the backbone contains 84% of all multiply-add operations, 39% is a small fraction. The other most time-consuming parts of the algorithm are the Pillar Feature Net (10%), the upsampling part after the "top-down" part of the backbone (in mmdetection3d called a neck, 15%) and NMS (25%). The high processing times of the PFN and the neck are probably related to the specifics of the computing libraries and the computing platform.


Fig. 8 Frames per second vs. the number of multiply-add operations for different types of PointPillars backbone. The GMAD unit denotes 10^9 multiply-add operations. The fps value takes into account only the backbone; the time of the other PointPillars parts is not included. The general trend is broken, probably due to the specifics of the computing libraries and the computing platform

In the case of NMS, the high time complexity is caused by its sequential nature. In this implementation, the processing time is spread across many parts of the algorithm; in order to achieve a large total speedup, one should accelerate all of them. In our previous FPGA implementation [27], the backbone takes about 70% of the algorithm's runtime. This allows us to achieve up to a 3.33x speedup, according to Amdahl's law. Accelerating the backbone alone is still not enough for real-time operation, as a theoretical 3.33x speedup of this algorithm would result in 8.89 fps. On the other hand, accelerating the other parts of the algorithm allows us to achieve up to a 1.43x speedup, which translates into up to 3.82 fps for the whole algorithm. Thus, one has to accelerate both the backbone and the other parts of the algorithm to achieve the desired 10 fps for real-time processing. One should also keep in mind that after quantisation of the examined PointPillars versions, the mAP may change and another model may become optimal in the mAP sense. Therefore, when quantisation is necessary (e.g. to deploy the model on an FPGA or to speed up inference on a GPU or eGPU), one should choose a subset of models suitable for the given hardware platform and re-evaluate the mAP afterwards.
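The whole-algorithm speedups quoted in this section follow directly from Amdahl's law; a small sketch is given below (the shares are those reported above; small differences with respect to the quoted 1.61x limit presumably come from rounding of the 39% share):

def amdahl_speedup(part_share, part_speedup):
    # Overall speedup when a fraction `part_share` of the runtime is accelerated
    # by a factor `part_speedup`.
    return 1.0 / ((1.0 - part_share) + part_share / part_speedup)

# mmdetection3d on a GPU: the backbone takes about 39% of the runtime.
print(round(amdahl_speedup(0.39, 1.5), 2))   # ~1.15x for a 1.5x faster backbone (cf. the measured 1.14x)
print(round(amdahl_speedup(0.39, 1e9), 2))   # ~1.64x upper bound when the backbone time approaches zero

# FPGA implementation from [27]: the backbone takes about 70% of the runtime.
print(round(amdahl_speedup(0.70, 1e9), 2))   # ~3.33x upper bound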

5 Conclusions

In this paper, we have presented experiments on changing the backbone in a LiDAR-based 3D object detector in order to speed up the computations.


We investigated 10 versions of the PointPillars network, where each architecture was inspired by solutions from the computer vision domain (object detection and classification). We used PyTorch and the mmdetection3d framework. The results obtained indicate that it is possible to significantly speed up PointPillars with only a small decrease in accuracy (mAP value). The version with the MobilenetV1-like backbone runs almost 4x faster than the original with a mAP decrease of 1.13%. Furthermore, the version with the CSPDarknet backbone runs more than 1.5x faster with an increase in mAP of 0.33%. Finally, an almost 4.5x speed-up with a mAP decrease of 3.54% was achieved with a ShufflenetV2-like backbone.

The results presented above cover all classes of the KITTI dataset (car, pedestrian, cyclist). However, when considering only one or selected classes, the changes in mAP may vary. The results are also likely to be different on other datasets, such as nuScenes or Waymo. Nevertheless, this research indicates that it is worth at least considering the use of a "lighter" backbone architecture. In the case of PointPillars, this increases the potential to implement the algorithm in a real-time embedded system while maintaining reasonably high detection performance. Our previous work [27] shows that such an implementation for the original version of the algorithm is very difficult, if not impossible. Compared to [34] and [36], all of the PointPillars versions we considered have a lower processing time on a comparable GPU. Given that [34] and [36] achieve fewer frames per second than the original PointPillars, it would potentially be harder to implement them on embedded devices in real time. On the other hand, the work [31] is focused on PointPillars pruning. We first focused on changing the backbone to a lighter one, so that in the future we can perform pruning on a model with fewer operations than the original. Potentially, this will allow us to obtain a higher algorithm speed than in [31], with a minimal decrease in mAP.

One should keep in mind that a speedup in terms of the number of multiply-add operations does not directly translate into a time speedup of the backbone or of the whole algorithm. The resulting backbone processing time depends on the hardware platform and implementation details, e.g. the specifics of the libraries used. In turn, the whole algorithm's speedup is conditional on the backbone's processing time reduction and the initial share of the backbone in the whole processing time, according to Amdahl's law.

As part of our future work we would like, based on the best of the architectures considered, to use Network Architecture Search techniques to find even better (faster and more accurate) versions of the PointPillars network. Afterwards, we will quantise the optimised versions to enable implementation on FPGA devices and to speed up inference on GPU/eGPU. To reduce the number of multiply-add operations even further, we would like to use structural pruning, which potentially removes unnecessary computations from a particular trained network model. The next step will be to optimise other parts of PointPillars, such as the PFN or NMS, to maximise the speed of the algorithm. We are also considering enhancing PointPillars, e.g. by using it as part of CenterPoint, and implementing it on an embedded system, e.g. an SoC FPGA, as a continuation of our previous work [27].


Acknowledgements The work presented in this paper was supported by the AGH University of Science and Technology project no. 16.16.120.773.

References

1. DPUCZDX8G for Zynq UltraScale+ MPSoCs product guide. Tech. Rep. PG338 (v3.4), Xilinx Inc. (2022). https://www.xilinx.com/content/dam/xilinx/support/documentation/ip_documentation/dpu/v3_4/pg338-dpu.pdf
2. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: YOLOv4: Optimal speed and accuracy of object detection (2020). https://arxiv.org/abs/2004.10934
3. Caesar, H., Bankiti, V., Lang, A.H., Vora, S., et al.: nuScenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027 (2019)
4. Chollet, F.: Xception: Deep learning with depthwise separable convolutions (2016). https://arxiv.org/abs/1610.02357
5. Contributors, M.: MMDetection3D: OpenMMLab next-generation platform for general 3D object detection (2020). https://github.com/open-mmlab/mmdetection3d
6. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. (IJRR) (2013)
7. Gholami, A., Kwon, K., Wu, B., Tai, Z., et al.: SqueezeNext: Hardware-aware neural network design (2018). https://arxiv.org/abs/1803.10615
8. Gustafson, J.L.: Amdahl's Law, pp. 53–60. Springer US, Boston, MA (2011). https://doi.org/10.1007/978-0-387-09766-4_77
9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition (2015). https://arxiv.org/abs/1512.03385
10. Howard, A., Sandler, M., Chen, B., Wang, W., et al.: Searching for MobileNetV3. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1314–1324 (2019)
11. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., et al.: MobileNets: Efficient convolutional neural networks for mobile vision applications (2017). https://arxiv.org/abs/1704.04861
12. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., et al.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and