Autonomous Driving Algorithms and Its IC Design [1st ed. 2023] 9789819928965, 9789819928972, 9787121436437, 9819928966

With the rapid development of artificial intelligence and the emergence of various new sensors, autonomous driving has g


English Pages 315 [306] Year 2023


Table of contents:
Foreword
Contents
List of Figures
List of Tables
Chapter 1: Challenges of Autonomous Driving Systems
1.1 Autonomous Driving
1.1.1 Current Autonomous Driving Technology
1.2 Autonomous Driving System Challenges
1.2.1 Functional Constraints
1.2.2 Predictability Constraints
1.2.3 Storage Limitations
1.2.4 Thermal Constraints
1.2.5 Power Is Constrained
1.3 Designing an Autonomous Driving System
1.3.1 Perception Systems
1.3.2 Decision-Making
1.3.3 Vehicle Control
1.3.4 Safety Verification and Testing
1.4 The Autonomous Driving System Computing Platform
1.4.1 GPU
1.4.2 DSP
1.4.3 Field Programmable Gate Array FPGA
1.4.4 Specific Integrated Circuit ASIC
1.5 The Content of This Book
1.5.1 3D Object Detection
1.5.2 Lane Detection
1.5.3 Motion Planning and Control
1.5.4 The Localization and Mapping
1.5.5 The Autonomous Driving Simulator
1.5.6 Autonomous Driving ASICs
1.5.7 Deep Learning Model Optimization
1.5.8 Design of Deep Learning Hardware
1.5.9 Self-Driving ASICs Design
1.5.10 Operating Systems for Autonomous Driving
1.5.11 Autonomous Driving Software Architecture
1.5.12 5G C-V2X
References
Chapter 2: 3D Object Detection
2.1 Introduction
2.2 Sensors
2.2.1 Camera
2.2.2 LiDAR
2.2.3 Camera + Lidar
2.3 Datasets
2.4 3D Object Detection Methods
2.4.1 Monocular Images Based on Methods
2.4.2 Point Cloud-Based Detection Methods
2.4.2.1 Projection Methods
2.4.2.2 Volumetric Convolution Methods
2.4.2.3 PointNet Method
2.4.3 Fusion-Based Methods
2.5 Complex-YOLO: An Euler-Region-Proposal for Real-Time 3D Object Detection on Point Clouds [31]
2.5.1 Algorithm Overview
2.5.2 Point Cloud Preprocessing
2.5.3 The Proposed Architecture
2.5.4 Anchor Box Design
2.5.5 Complex Angle Regression
2.5.6 Evaluation on KITTI
2.5.7 Training
2.5.8 Bird's Eye View Detection
2.5.9 3D Object Detection
2.6 Future Research Direction
References
Chapter 3: Lane Detection
3.1 Traditional Image Processing
3.2 Example: Lane Detection Based on the Hough Transform
3.2.1 Hough Transform
3.2.2 Lane Detection
3.3 Example: RANSAC Algorithm and Fitting Straight Line
3.3.1 Overview of the RANSAC Algorithm
3.3.2 Use Python to Implement Line Fitting
3.4 Based on Deep Learning
3.5 The Multi-Sensor Integration Scheme
3.6 Lane Detection Evaluation Criteria
3.6.1 Lane Detection System Factors
3.6.2 Offline Evaluation
3.6.3 Online Evaluation
3.6.4 Evaluation Metrics
3.7 Example: Lane Detection
3.7.1 Overview
3.7.2 Loss Function
3.7.3 Experimental Results
3.7.4 Conclusion
References
Chapter 4: Motion Planning and Control
4.1 Overview
4.2 Traditional Planning and Control Solutions
4.2.1 Route Planning
4.2.2 Example: Dijkstra's Algorithm for Path Planning
4.2.3 Example: Path Planning A* Algorithm
4.2.4 Behavioral Decision
4.2.5 Motion Planning
4.2.6 Example: Motion Planning
4.2.6.1 Get the Information of the Current Car and Surrounding Cars
4.2.6.2 Decide Whether to Change Lanes Based on the Current Car Position
4.2.6.3 Calculate the Trajectory Route of the Current Lane
4.2.6.4 Frenet Road Coordinate System
4.2.6.5 The Spline Function Generates the Candidate Trajectory Path
4.2.6.6 Collision Detection
4.2.7 Vehicle Control
4.2.8 Example: Model Predictive Control
4.2.8.1 Prediction Model
4.2.8.2 Rolling Optimization
4.2.8.3 Feedback Correction
4.2.9 Example: PID Control A
4.3 Integrated Perception and Planning
4.3.1 Project: Nvidia's End-to-End Autonomous Driving
4.3.2 Open-Source Project: Motion Prediction Model
4.3.2.1 L5Kit
4.3.2.2 Woven Planet's Prediction Dataset
4.3.2.3 Why Do We Need a Motion Prediction Model?
4.3.2.4 Train the Model
Download the Prediction Dataset
Obtaining Input and Output for the Task
Define the Model
Train the Model
4.3.2.5 Further Thoughts
4.4 Interaction Behavior Awareness Planning
4.4.1 Cooperation and Interaction
4.4.2 Game-theoretic Approaches
4.4.3 Probabilistic Approach
4.4.4 Partially Observable Markov Decision Process
4.4.5 Learning-Based Approaches
References
Chapter 5: SLAM in Autonomous Driving
5.1 SLAM Problem
5.1.1 Filter-Based SLAM
5.1.2 Extended Kalman Filter
5.1.3 Lossless Filter UKF
5.1.4 Optimization-Based SLAM
5.2 Maps in Autonomous Driving
5.2.1 Metric Map Models
5.2.2 Directions of Metric Map Research
5.2.3 Semantic Map Model
5.2.4 Directions of Semantic Map Research
5.3 HD Map Creation
5.3.1 Data Collection
5.3.2 Map Production
5.3.2.1 Point Cloud Mapping and Localization Using NDT (Normal Distribution Transform)
5.3.3 Map Labeling
5.3.4 Map Saving
5.3.4.1 Lanelet2
5.3.4.2 OpenDRIVE
5.3.4.3 Apollo Map
5.4 AVP: SLAM Application in Autonomous Driving
5.4.1 Main Components for AVP
5.4.1.1 Machine-Vehicle Interconnection
5.4.1.2 Find a Parking Space
5.4.1.3 Parking in a Parking Space
5.4.1.4 Summon a Vehicle
5.4.2 Key Technology in AVP
5.4.2.1 HD Map
5.4.2.2 SLAM
5.4.2.3 Fusion Perception
5.4.2.4 Fusion Positioning
5.4.2.5 Path Planning
5.4.3 AVP in the Autoware.Auto
5.4.3.1 Setup
5.4.3.2 Launching
5.5 Metrics for SLAM for Autonomous Driving
Reference
Chapter 6: Autonomous Driving Simulator
6.1 The Latest Simulator
6.1.1 AirSim
6.1.2 Apollo
6.1.3 Carla
6.1.4 Udacity AV Simulator
6.1.5 Deep Traffic
6.2 One Simulator Example: CARLA
6.2.1 The Simulation Engine
6.2.2 Modular Autonomous Driving
6.2.3 Imitation
6.2.4 Reinforcement Learning
6.2.5 Experimental Results
Reference
Chapter 7: Autonomous Driving ASICs
7.1 Mobileye EyeQ
7.2 NVIDIA
7.2.1 Example: Using the NVIDIA AGX Platform
7.3 TI's Jacinto TDAx
7.4 Example: 360-Degree View System and Automatic Parking System
7.4.1 Automatic Parking and Parking Assist
7.4.2 How the Jacinto Family of TDA4VM Processors Meets Surround View and Autonomous Parking Challenges
7.4.3 Jacinto TDA4VM SoC
7.5 Qualcomm
7.5.1 Snapdragon Ride Vision System
7.6 NXP
7.6.1 S32 Automotive Platform
7.6.2 S32V234
7.7 Xilinx Zynq-7000
7.8 Synopsys
References
Chapter 8: Deep Learning Model Optimization
8.1 Model Optimization Overview
8.2 Parameter Pruning and Sharing
8.2.1 Quantization and Binarization
8.2.2 Pruning and Sharing
8.2.3 Designing the Structure Matrix
8.3 Low-Rank Decomposition and Sparsity
8.4 Transfer/Compact Convolutional Filters
8.4.1 Grouped Convolution
8.4.2 MobileNet Structure
8.4.3 ShuffleNet Structure
8.5 Knowledge Distillation
8.6 AI Model Efficiency Toolkit
8.6.1 Large-Scale Energy-Efficient AI
8.6.2 Advancing AI Model Efficiency Through Collaboration Studying
8.7 Future Research Direction
References
Chapter 9: Deep Learning ASIC Design
9.1 Overview
9.2 Accelerating Kernel Computation on CPU and GPU Platforms
9.3 Deep Learning Chip Series from Institute of Computing Technology, Chinese Academy of Sciences
9.3.1 Introduction About CNN
9.3.1.1 Convolutional Layers
9.3.1.2 Pooling Layers
9.3.1.3 Normalization Layer
9.3.1.4 Classification Layer
9.3.2 DaDianNao
9.3.3 ShiDianNao
9.3.4 Cambricon-X
9.4 MIT Eyeriss Series
9.4.1 CNN Basics
9.4.2 Eyeriss
9.4.2.1 CNN Data Stream
9.4.2.2 Eyeriss Performance Metrics
9.4.3 Eyeriss v2
9.4.3.1 Challenges
9.4.3.2 Eyeriss v2 Architecture
9.4.3.3 Eyeriss v2 Innovation Points
9.4.3.4 Eyeriss v2 Performance Metrics
9.5 Google's TPU
9.5.1 TPU v1
9.5.2 TPU Instruction Set
9.5.3 TPU: Systolic Array
9.5.4 TPU v2/v3
9.5.5 Software Architecture
9.6 Near-Memory Computing
9.6.1 DRAM
9.6.2 SRAM
9.6.3 Nonvolatile Resistive Memory
9.6.4 Sensors
9.7 Metrics for DNN Hardware
References
Chapter 10: Autonomous Driving SoC Design
10.1 Autonomous Driving SoC Design Flow
10.2 TI Jacinto SoC Platform
10.2.1 System Architecture
10.2.2 Computing Cluster
10.2.3 Accelerator Cluster
10.3 Functional Safety Features of the Jacinto 7 Processor
10.3.1 Functional Safety
10.3.2 Software Functional Safety Overview
10.3.3 Deployment of Security Applications
10.4 Safety-Compliant Multicore SoCs with DNN and ISP to Design
10.4.1 ADAS Image Recognition SoC
10.4.2 DNN Accelerator
10.4.3 ISP with Secure BIST Controller
10.5 Example: Nvidia Deep Learning Accelerator
10.5.1 NVDLA
10.5.2 FireSim
10.5.3 NVDLA Integration
10.5.4 Performance Analysis
References
Chapter 11: Autonomous Driving Operating Systems
11.1 Overview
11.2 Open-Source Autonomous Driving Operating Systems
11.2.1 Linux RTOS
11.2.2 ROS Middleware
11.3 Companies Using Open Source Software for Autonomous Driving
11.3.1 Baidu
11.3.2 BMW
11.3.3 Voyage
11.3.4 Tier4 Autoware
11.3.5 PolySync OS
11.3.6 Perrone
11.4 Automotive Hard Real-Time Operating Systems and Frameworks
11.4.1 Blackberry QNX
11.4.2 Elektrobit
11.4.3 Green Hills Software Company
11.4.4 NVIDIA DriveWorks SDK
11.5 Summary
Chapter 12: Autonomous Driving Software Architecture
12.1 Overview
12.2 Software Development Based on ISO 26262
12.2.1 Introduction About ISO 26262
12.2.2 Synopsys Software Portfolio
12.2.3 ASIL
12.2.4 Software Architecture Design
12.2.5 Software Unit Design and Implementation
12.2.6 Testing
12.3 Component Architecture Design Based on SAE J3016
12.3.1 Functional Components
12.3.2 AutoSAR
12.4 Architecture Design and Implementation of Autonomous Vehicles
12.4.1 Hardware Framework
12.4.2 Software System Architecture
12.4.3 Modules for Data Transfer
12.4.4 Autonomous Driving Test Report
References
Chapter 13: Introduction to 5G C-V2X
13.1 Mobile Internet of Vehicles
13.2 C-V2X: How to Change Driving
13.2.1 Avoid Collision
13.2.2 Convoy Driving
13.2.3 Cooperative Driving
13.2.4 Queue Warning
13.2.5 Protecting Vulnerable Road Users
13.2.6 Support Emergency Services
13.2.7 Warning of Danger
13.2.8 Increasingly Autonomous Driving
13.3 C-V2X: The Advantages
13.4 C-V2X: How It Works
13.4.1 Direct Communication
13.4.2 Telecommunication
13.4.3 5G: How to Change C-V2X
13.5 C-V2X: Deployment Plan
13.5.1 China Leads the Way
13.5.2 Australia: Improving Road Safety
13.5.3 United States: Growth Momentum
13.5.4 Europe: Broad Support
13.6 Summary


Jianfeng Ren · Dong Xia

Autonomous Driving Algorithms and Its IC Design

Jianfeng Ren
Google
San Diego, CA, USA

Dong Xia
Vision Weiye Intelligent Technology Co., Ltd.
Changsha, Hunan, China

ISBN 978-981-99-2896-5    ISBN 978-981-99-2897-2 (eBook)
https://doi.org/10.1007/978-981-99-2897-2

Jointly published with Publishing House of Electronics Industry, Beijing, China.
The print edition is not for sale in China (Mainland). Customers from China (Mainland) please order the print book from: Publishing House of Electronics Industry.
ISBN of the Co-Publisher’s edition: 9787121436437

© Publishing House of Electronics Industry 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publishers, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publishers nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publishers remain neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Foreword

The current trends in automobile development are intelligence, electrification, and sharing. As the demand for self-driving cars continues to increase, a growing number of large companies, as well as start-ups, have invested considerable manpower and resources in the self-driving industry. As the heart of future autonomous vehicles, autonomous driving chips are a core technology that must be fully invested in and developed. Starting in 2010, I began working on automotive driver-assistance projects, engaging in assisted driving, imaging, computer vision algorithms, and chip design, and I have gained a certain understanding of how this industry has developed. Years of experience in the chip industry tell us that chip design requires an in-depth understanding of algorithms, the hardware development process, software architecture design, and the implementation and management of engineering processes.

This book has 13 chapters in total. Chapters 2–6 introduce the algorithm design aspects of autonomous driving. To develop an autonomous driving chip, it is necessary to research autonomous driving algorithms thoroughly, such as object detection and multisensor fusion, and to perform many simulation experiments to cover more scenarios. Chapters 2–4 introduce several algorithms for autonomous driving. Of course, many more algorithms are involved in autonomous driving, and this book can only serve as an introduction. I hope that readers will read the latest technical articles and run many experiments to ensure that the algorithms they develop have a certain degree of robustness and accuracy. Chapter 5 introduces the development of high-definition maps. At present, many start-up companies focus on the generation of high-definition maps. The technical level of autonomous driving is still in the state of "searching for information according to the map," so generating high-definition maps quickly and accurately is also a current business opportunity. Once self-driving algorithms have been developed, how do you test them? Chapter 6 presents some open-source simulators for testing and evaluating the algorithms readers develop.

Chapters 7–10 are related to chips. With a mature and stable algorithm, how can it be deployed in cars? One approach is to take advantage of off-the-shelf commercial chips.

At present, some large semiconductor companies, such as Texas Instruments, Nvidia, and Qualcomm, provide commercial-grade SoC chips that support automotive assisted-driving or autonomous driving applications. Chapter 7 of this book focuses on how some commercial chips support the implementation of autonomous driving algorithms. Since many current autonomous driving algorithms are developed based on deep learning, model optimization is a particularly important topic, and Chapter 8 introduces some existing model compression algorithms. Chapters 9 and 10 focus on how to develop a dedicated deep learning accelerator chip and an autonomous driving SoC. At the same time, the author believes that image sensor processing chips and deep learning accelerator chips tailored to autonomous driving are opportunities that currently exist. This book also introduces some open-source hardware code, such as NVDLA, which interested readers can refer to.

Chapters 11 and 12 are about the design of autonomous driving software architecture. A chip must be accompanied by a mature software architecture and development tools so that users can make better use of it. Chapter 11 introduces some operating systems emerging in the field of autonomous driving, and Chapter 12 introduces some software development architectures and processes based on functional safety. Chapter 13 briefly introduces how 5G is used in the Internet of Vehicles to improve the autonomous driving experience.

For each part, interested readers can also identify potential research projects and products from this book. For example, readers who are interested in algorithms can develop multisensor fusion algorithms (Chaps. 2 and 3); readers who are interested in high-definition maps can learn how to build them (Chap. 5), an area where start-up companies are already developing custom map generation. If you have a good autonomous driving algorithm, you can apply software model compression (Chap. 8) and implement a product prototype directly on an Nvidia, Qualcomm, or Texas Instruments autonomous driving platform (Chap. 7). Readers who are interested in hardware development can refer to the chapters on deep learning chip design (Chap. 9) and autonomous driving SoC design (Chap. 10), especially the open-source NVDLA; hardware designers can use this open-source code as a reference for designing custom ASICs. Regarding self-driving ASICs, the author believes that digital image processing chips and deep learning chips dedicated to self-driving are a promising direction. Handling HDR and fast-moving objects in the design of dedicated digital image processing chips is a major challenge, but it is also an opportunity. Readers who are interested in software development can refer to the software architecture design in Chap. 12 and learn how to develop their own autonomous driving software products based on open-source code; unlike conventional software development, safety performance must be considered.

Of course, this book draws on many open-source codebases and documents, and it took 5 months to complete, which consumed much of the author's time and energy. In addition, an important feature of this book is that it combines the latest published academic papers and open-source code to address in detail the general process of autonomous driving chip design.

The book not only has a theoretical basis but also includes a great deal of open-source code, which gives readers practical experience and exercises their hands-on skills. In addition, chip design involves cooperation and exchange among people from different departments, so process management, code management, design documentation, and effective cross-team communication are all particularly important during project implementation. Unfortunately, the author has not been able to fit these years of semiconductor experience into this book.

The purpose of writing this book is the hope that undergraduates, postgraduates, and even doctoral students can draw a little inspiration from it, fully verify each algorithm with a rigorous academic attitude, and design autonomous driving chips that contribute to China's autonomous driving.

The author would like to thank my teacher, Professor Jiang Liyuan from the School of Computer Science, Northwestern Polytechnical University. Without his encouragement and review, this book would have been impossible to write. It took me half a year to complete it in my spare time, and Mr. Jiang also spent much energy helping me revise the entire first draft. Due to limited time, errors and inadequacies in the book are unavoidable, and I hope readers will offer criticism and corrections.

San Diego, CA, USA
April 7, 2020

Jianfeng Ren


List of Figures

Fig. 1.1 The video captured by the camera will be streamed for the object detection
Fig. 2.1 Overview of the approach [11]
Fig. 2.2 The single-stream network for 3D object detection [13]
Fig. 2.3 Overview of our object category recognition framework [14]. (a) Training pipeline. (b) Testing pipeline
Fig. 2.4 Overview of the object detection framework [14]
Fig. 2.5 Overview of the deep MANTA approach [16]
Fig. 2.6 BirdNet 3D object detection framework [20]
Fig. 2.7 Overview of Complex-YOLO [21]
Fig. 2.8 A sample illustration of the 3D FCN structure used in [23]
Fig. 2.9 PointNet architecture [25]
Fig. 2.10 MV3D network architecture [27]
Fig. 2.11 Complex-YOLO pipeline [31]
Fig. 2.12 The ground truth spatial distribution, outlining the size of the bird’s-eye view region. Left: sample detection; right: 2D spatial histogram with annotation boxes [31]
Fig. 2.13 Simplified YOLOv2 CNN architecture with Euler-Region-Proposal
Fig. 2.14 Regression estimation of 3D bounding boxes and loss function
Fig. 3.1 Lane evaluation system
Fig. 3.2 Proposed framework with three main parts. 512 × 256 size input data are compressed by the resizing network architecture
Fig. 3.3 Details of the hourglass block, consisting of three types of bottle-neck layers
Fig. 3.4 Details of bottle-neck. The three kinds of bottle-necks have different first layers according to their purposes [55]
Fig. 4.1 Planning and decision-making architecture
Fig. 4.2 Illustration of the decision process hierarchy
Fig. 4.3 Find the shortest path from vertex a to vertex e
Fig. 4.4 MPC model predictive control
Fig. 4.5 The angle between the vehicle and the road centerline and the lateral deviation between the vehicle and the reference trajectory
Fig. 4.6 MPC system optimization model
Fig. 5.1 Overall overview of AVP architecture in Autoware [1]
Fig. 7.1 NVIDIA DRIVE software architecture
Fig. 7.2 Nvidia Drive high-resolution map
Fig. 7.3 TIDL development process
Fig. 7.4 Simplified surround view system based on TDA4VM [2]
Fig. 8.1 The three-stage compression method proposed in [10]: pruning, quantization, and encoding. The input is the original model, and the output is the compressed model
Fig. 8.2 Typical framework for low-rank regularization methods. On the left is the original convolutional layer, and on the right is the rank-K constrained convolutional layer
Fig. 8.3 Schematic diagram of MobileNet grouped convolution
Fig. 9.1 Highly parallel computing architecture
Fig. 9.2 Mapping convolutions to multiplication using Toeplitz matrices
Fig. 9.3 Shape parameters of the convolution/full connection layer
Fig. 9.4 The architecture comparison of the original Eyeriss and Eyeriss v2; the figure is from [10]
Fig. 9.5 Matrix multiplication implemented by systolic arrays
Fig. 9.6 Google TPU software architecture
Fig. 10.1 Hardware design flow
Fig. 10.2 Seventh-generation Jacinto SoC platform
Fig. 10.3 Compute cluster
Fig. 10.4 Accelerator cluster
Fig. 10.5 Vision accelerator
Fig. 10.6 The computation of the dense optical flow engine
Fig. 10.7 Typical vision system
Fig. 10.8 Schematic of ADAS image recognition SoC
Fig. 10.9 DNN execution process and DNN accelerator schematic
Fig. 10.10 ISP with a runtime BIST controller
Fig. 10.11 MPU connections and accessibility
Fig. 11.1 ROS system architecture
Fig. 11.2 The open-source vehicle control project (OSCC) at the core middleware layer of PolySync
Fig. 11.3 EB robinos architecture
Fig. 11.4 NVIDIA DriveWorks end-to-end processing pipeline
Fig. 12.1 The architecture design process of the software architecture for the autonomous driving category defined by SAE J3016
Fig. 12.2 Classification of functional components according to SAE J3016
Fig. 12.3 The functional architecture of autonomous driving proposed in [1]
Fig. 12.4 AutoSAR software architecture
Fig. 12.5 AutoSAR basic software
Fig. 12.6 System architecture
Fig. 12.7 Environmental map feedback
Fig. 12.8 Detailed positioning planning flowchart

List of Tables

Table 1.1 Autonomous vehicle status being tested by leading industry companies
Table 2.1 Comparison between 2D and 3D detection methods [1]
Table 2.2 Comparison of different sensors
Table 2.3 Evaluation results of bird’s-eye view detection performance [31]
Table 2.4 Evaluation results of 3D object detection performance [31]
Table 3.1 Factors affecting lane detection system performance
Table 3.2 Details of the proposed networks
Table 3.3 Evaluation results for the CULane dataset (first and second best results are highlighted in red and blue)
Table 7.1 Vision and automatic parking applications and requirements
Table 7.2 The processing stages of the surround view/automatic parking application and the main characteristics of the corresponding TDA4VM device
Table 8.1 Overview of different methods for model compression and acceleration
Table 10.1 Software functional safety
Table 10.2 Security mapping to applications
Table 11.1 Nvidia DriveWorks architecture
Table 12.1 Software architecture design principles
Table 12.2 Software unit design and implementation (the number of + indicates the degree of security function support)
Table 13.1 Technical advantages of C-V2X

Chapter 1

Challenges of Autonomous Driving Systems

Abstract  Designing autonomous driving systems is particularly challenging for several reasons. These systems must always make the "right" decisions to avoid accidents, so computationally intensive machine learning, computer vision, and robotic processing algorithms are used to provide the required high precision. Although processing is computationally intensive, it is crucial for such a mission-critical system to be able to react to hazardous situations in real time, which means that signal processing must always complete within strict time limits. In addition, the autonomous driving system needs to perform the necessary calculations within a certain power budget to avoid negatively affecting the vehicle's range and fuel/thermal efficiency. To address these challenges, the following key issues need to be addressed:
• Design constraints for building autonomous driving systems;
• Computing power and bottlenecks for advanced end-to-end autonomous driving systems; and
• Architectures for building such systems that meet all design constraints.
In this chapter, we mainly provide a brief discussion of the challenges of designing autonomous driving systems from the perspectives of functional safety, real-time computational constraints, storage limitations, and power/thermal constraints. We then briefly discuss several key components needed to design an autonomous driving system, such as the perception system, decision making, and vehicle control. We also describe several computing resources available for autonomous driving systems, including DSPs, GPUs, and FPGAs. Finally, the contents of the book are summarized to give the reader a brief overview.

1.1 Autonomous Driving

To promote the development of highly autonomous vehicles, the National Highway Traffic Safety Administration has issued a guideline for autonomous driving systems in which automated driving is classified according to the automation levels defined by SAE International, ordered from low to high. Autonomous driving is divided into six levels:
• Level 0: No automation. The driver must complete all driving tasks even if the vehicle issues a warning.
• Level 1: Driver assistance. In limited driving conditions (e.g., high-speed cruising), the automated system shares responsibility for steering and acceleration/deceleration with the driver, and the driver handles the rest of the driving tasks (e.g., changing lanes).
• Level 2: Partial automation. The automated system fully controls the steering and acceleration/deceleration of the vehicle under limited driving conditions, while the driver performs the rest of the driving tasks.
• Level 3: Conditional automation. The automated system handles all driving tasks in limited driving conditions and expects the driver to respond to a request for intervention (i.e., resume driving).
• Level 4: High automation. The automated driving system can handle all driving tasks under limited driving conditions, even if the driver does not respond to a request for intervention.
• Level 5: Full automation. The autonomous driving system can fully control all driving tasks in all driving conditions.
Obviously, levels 1 and 2 are primarily driver assistance, with the driver always handling most of the driving tasks. From level 3 to level 5, the automated system takes over the driving task under certain driving conditions, so these levels are considered autonomous driving. Since they represent the future of autonomous driving systems, this book focuses on autonomous driving system design for levels 3 to 5.
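As a quick illustration of this taxonomy, the sketch below (ours, not from the book) encodes the six SAE levels as a small Python enum together with a helper that marks the levels this book concentrates on; the class and function names are assumptions chosen for clarity.

from enum import IntEnum

class SAELevel(IntEnum):
    """SAE J3016 driving automation levels as summarized above."""
    NO_AUTOMATION = 0
    DRIVER_ASSISTANCE = 1
    PARTIAL_AUTOMATION = 2
    CONDITIONAL_AUTOMATION = 3
    HIGH_AUTOMATION = 4
    FULL_AUTOMATION = 5

def is_book_focus(level: SAELevel) -> bool:
    # Levels 3-5 are the ones this book treats as autonomous driving.
    return level >= SAELevel.CONDITIONAL_AUTOMATION

if __name__ == "__main__":
    for level in SAELevel:
        scope = "autonomous driving (covered in this book)" if is_book_focus(level) else "driver-centric (levels 0-2)"
        print(f"Level {level.value} ({level.name}): {scope}")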

1.1.1 Current Autonomous Driving Technology

To understand the status of autonomous driving, the computing platforms and sensor technology used by leading companies are investigated, as shown in Table 1.1. Even Tesla and Waymo reach only level 2 or level 3 automation, so the driver is still largely involved in controlling the vehicle. This demonstrates the challenges of building self-driving cars and motivates researchers to investigate this emerging field. Considering the computing platforms and sensors used in autonomous driving systems, most of them utilize a combination of SoC and GPU to provide the massive computing power required. Another interesting observation is that Nvidia, Audi and Waymo were able to build experimental self-driving cars with level 3 automation and use lidar (LIDAR) as the sensing device, sending beams of light for high-precision measurement of the vehicle’s surroundings. LIDAR has always been one of the main and critical sensors for the commercialization of such autonomous driving systems. Commercial LIDAR devices can cost as much as $75,000, in some cases more than the vehicle itself, even for luxury cars. As a result, the industry has been trying to move away from lidar devices and build vision-based autonomous driving systems, expecting to use only much cheaper cameras and radar to perceive the surroundings. Companies such as Mobileye and Tesla, for example, recently announced plans to focus on vision-based autonomous driving systems that use cameras and radar as sensing devices.

Table 1.1 Autonomous vehicle status being tested by leading industry companies

Manufacturer | Automation level | Platform  | Sensor
Mobileye     | Level 2          | SoC       | Camera
Tesla        | Level 2          | SoC + GPU | Camera + radar
Nvidia       | Level 3          | SoC + GPU | Camera + radar + lidar
Waymo        | Level 3          | SoC + GPU | Camera + radar + lidar

1.2 Autonomous Driving System Challenges

Google, Uber, Tesla, Mobileye, and many other automotive companies have recently made significant investments in the future of autonomous driving systems. Autonomous driving systems should allow the car to drive itself without human assistance. Vehicles equipped with self-driving capabilities can detect the environment, localize themselves, and operate the vehicle to safely reach the designated destination without human intervention. The demand for autonomous driving continues to grow, resulting in increasing industry investment. Intel recently acquired Mobileye, a leader in computer vision-based autonomous driving technology, for $15.3 billion. Market reports indicate that by 2035, cars with self-driving capabilities are expected to account for 25% of the car market, or 18 million vehicles, and the size of the self-driving car market is expected to reach $77 billion. Despite recent advances in self-driving systems from industry leaders such as Google, Tesla, and Mobileye, self-driving cars are still largely in the trial and research phase, so designing autonomous driving systems remains largely an open research problem. Designing autonomous driving systems is particularly challenging for several reasons. These systems must always make the “right” operational decisions to avoid accidents, so computationally intensive machine learning, computer vision and robotic processing algorithms are often used to provide the required high precision. Despite this computational burden, it is crucial for such a mission-critical system to be able to react to hazardous situations in real time, which means that signal processing must always be completed within a strict time limit. In addition, the system needs to perform the necessary calculations within a certain power budget to avoid negatively affecting the vehicle’s driving range and fuel efficiency. To address these challenges, the following key issues need to be addressed:


• Design constraints for building autonomous driving systems;
• Computing power and bottlenecks for advanced end-to-end autonomous driving systems; and
• Architectures for building such systems to satisfy all design constraints.
While there are very detailed regulations for conventional cars (covering crash testing, fuel economy, vehicle inspections, etc.), regulators have only recently begun to develop regulations for autonomous vehicles. In the “Federal Autonomous Vehicle Policy” issued by the US Department of Transportation, it is only mentioned that “the focus should be on software development and verification,” without any specific details. Therefore, many of the design constraints discussed in this section come from material published by practitioners such as Toyota [1, 2], Udacity and Mobileye.

1.2.1 Functional Constraints

To avoid traffic accidents, an autonomous driving system should be able to “understand” real-time driving conditions and react to them quickly enough. Although self-driving cars may reduce traffic accidents, the actual performance requirements of self-driving systems remain unclear. Based on existing work in driver assistance systems, the reaction time of an autonomous driving system is determined by two factors:
• Frame rate: the frame rate determines how quickly real-time sensor data enter the processing engine.
• Processing latency: the latency to identify a scenario and make an operational decision determines how quickly the system can react to captured sensor data.
Depending on the level of expectation and the action taken, a human driver takes varying amounts of time to respond. For example, when the driver expects a possible disturbance, it takes about 600 ms to react; otherwise, it takes about 850 ms. A typical driver takes 0.96 s to release the accelerator, 2.2 s to reach maximum braking, and 1.64 s to initiate a turn to avoid an accident. The fastest action a driver can take requires roughly 100–150 ms. To provide better safety, autonomous driving systems should be able to react faster than the driver, which means that the latency for processing the driving situation should be within 100 ms. This complies with Mobileye’s recently published industry standards, as well as Udacity’s design specifications. In addition to bounding latency, self-driving systems need to frequently update their understanding of road conditions to keep up with changing real-time traffic conditions. In other words, the frame rate must be high enough that real-time traffic conditions do not change drastically between two adjacent frames. To react quickly to changing hazardous conditions, the designed system should be able to react faster than the driver, so a new frame should be processed at least every 100 ms. This also aligns with the frame rate of the collision avoidance assist system that Mobileye has built.


Therefore, the autonomous driving system should be able to process the current driving situation within a latency of 100 ms, at a frequency of at least once every 100 ms.
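To make the timing requirement concrete, the following minimal Python sketch checks whether a perception-to-decision pipeline fits within the 100 ms reaction budget discussed above. It is an illustration only: the 100 ms budget comes from the text, but the per-stage latencies are hypothetical placeholders.

# Minimal sketch of a reaction-time budget check.
# The 100 ms end-to-end budget follows the discussion above; the
# per-stage latencies below are hypothetical placeholders.

REACTION_BUDGET_MS = 100.0   # target: react faster than a human driver

# Hypothetical per-stage processing latencies (ms) for one frame
stage_latency_ms = {
    "capture": 10.0,      # sensor readout at the chosen frame rate
    "detection": 45.0,
    "tracking": 15.0,
    "planning": 20.0,
}

def end_to_end_latency(stages):
    """Sum per-stage latencies for a single frame."""
    return sum(stages.values())

total = end_to_end_latency(stage_latency_ms)
print(f"end-to-end latency: {total:.1f} ms")
if total <= REACTION_BUDGET_MS:
    print("meets the 100 ms reaction budget")
else:
    print(f"exceeds the budget by {total - REACTION_BUDGET_MS:.1f} ms")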

1.2.2 Predictability Constraints

Autonomous driving is a critical application that must perform its tasks in real time. This means that failure to complete processing within a specific deadline is itself a failure, so predictability of performance is critical. Failing to react in real time can put passengers at risk and sometimes lead to fatal accidents. Therefore, the performance of autonomous driving systems should be highly predictable for them to be widely adopted. Predictability is defined both in the time domain (i.e., meeting a specified deadline) and in the functional domain (i.e., making the right operational decisions). Since the performance of autonomous driving systems varies widely, tail latency (e.g., the 99th or 99.99th percentile latency) should be used as the metric to evaluate performance and capture these stringent predictability requirements.
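As a simple illustration of how tail latency can be reported, the sketch below computes the 99th and 99.99th percentile latencies with NumPy. The synthetic log-normal samples are placeholders; in a real evaluation they would be per-frame latencies measured on the target platform.

import numpy as np

# Synthetic per-frame processing latencies in ms (placeholder data);
# in practice these would be measured on the target platform.
rng = np.random.default_rng(0)
latencies_ms = rng.lognormal(mean=3.5, sigma=0.3, size=100_000)

# The mean alone hides rare slow frames, so report tail percentiles.
for p in (50, 99, 99.99):
    print(f"p{p}: {np.percentile(latencies_ms, p):.1f} ms")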

1.2.3 Storage Limitations

Although GPS is commonly used to identify the vehicle’s position in navigation systems, it does not provide the required level of accuracy (e.g., millimeter-level accuracy is required to keep the vehicle within a certain lane) and therefore cannot be used alone to localize the vehicle for autonomous driving tasks. Instead, localization against prior maps has been widely used to provide positioning with centimeter-level accuracy. In this approach, the surrounding views are transformed into feature descriptors that are matched against the feature points stored in the prior map to identify the location of the vehicle. However, it is not feasible to always stream prior maps from the cloud, because the vehicle does not always have network access, and even with limited connectivity the vehicle still needs to perform the necessary autonomous driving tasks. Therefore, prior maps need to be stored on the autonomous vehicle. However, prior maps of large environments, such as entire countries, occupy substantial storage space; for example, a prior map of the entire United States occupies 41 TB of storage on the self-driving system. As a result, tens of terabytes of storage are required to store the prior maps of the large environments in which autonomous driving systems must localize the vehicle.
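As a rough back-of-the-envelope sketch, the snippet below scales the 41 TB United States figure quoted above down to smaller operating regions. The per-region areas and the assumption that map storage scales roughly with mapped area are simplifications for illustration, not figures from the text.

# Back-of-the-envelope prior-map storage estimate.
# From the text: ~41 TB covers the entire United States.
# Assumption (illustrative): storage scales roughly with mapped area.

US_MAP_TB = 41.0
US_AREA_KM2 = 9.8e6  # approximate land area of the United States

tb_per_km2 = US_MAP_TB / US_AREA_KM2

# Hypothetical operating regions and their approximate areas (km^2)
regions_km2 = {"metro area": 2.0e4, "one state": 4.0e5, "entire US": US_AREA_KM2}

for name, area in regions_km2.items():
    print(f"{name:>10}: ~{area * tb_per_km2:,.2f} TB of prior maps")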

1.2.4 Thermal Constraints

The autonomous driving system has the following thermal constraints: (1) the temperature of the space housing the computing system must be within the operating range of the system, and (2) the heat generated by the computing system should have only a small impact on the thermal behavior of the vehicle, so that it does not, for example, overheat the engine and affect the reliability of the vehicle. Modern self-driving cars typically have two temperature zones: inside the climate-controlled cabin and outside it. Outside the passenger compartment, ambient operating temperatures can reach up to +105 °C, which is higher than the temperature at which most general-purpose computer chips can safely operate; typical Intel processors, for example, can only operate at temperatures below 75 °C. Therefore, the autonomous driving computing system should be placed in the climate-controlled cabin to avoid overheating. However, when computing systems are placed in the passenger cabin without additional cooling infrastructure, passengers cannot tolerate the resulting rise in temperature. For example, a computing system consuming 1 kW of power (e.g., 1 CPU and 3 GPUs running at full utilization) will likely raise the cabin temperature by 10 °C. Consequently, additional air conditioning capacity needs to be added to remove the heat generated by the autonomous driving system. In summary, the computing systems of driverless cars need to be placed in a climate-controlled cabin to operate safely, and additional cooling capacity must be added to remove the extra heat generated by the computing system and maintain an acceptable cabin temperature.

1.2.5 Power Is Constrained

In gasoline-fueled vehicles, electrical power is typically provided by an automotive alternator rated at 1–2 kW. The electrical system can be enlarged, but only at the cost of reduced fuel efficiency; the exact reduction depends on the vehicle. As a rule of thumb, for a gasoline-powered vehicle, every additional 400 W of power consumption reduces the miles-per-gallon (MPG) rating by 1 (e.g., an additional 400 W of power consumption would reduce the MPG of a 2017 Audi A4 sedan by 3.23%, from 31 MPG). Likewise, additional power consumption reduces the total driving range of an electric vehicle (EV) due to its limited battery capacity. The total power consumption of the autonomous driving system includes the power consumption of the computing system plus the storage and cooling overhead. While the power consumption of the computing system is highly dependent on the computing platform (e.g., CPU or GPU), a typical storage system consumes approximately 8 W for every 3 TB of data stored. To remove the extra heat generated by the system, a typical automotive air conditioner consumes approximately 77% of the cooling load to dissipate the heat (i.e., an effective coefficient of performance of about 1.3, where the coefficient of performance is the ratio of useful cooling provided to the work required). In other words, about 77 W of cooling overhead is needed to remove the extra heat generated by a 100 W system. Furthermore, other constraints are not addressed in this book. For example, any equipment on a car should be able to withstand the vibrations and shocks of the vehicle: sudden pulses may range from 50 g to 500 g (where g denotes the acceleration of gravity), and vibrations may be as high as 15 g. Hardware reliability is also an important constraint for real-time systems; in avionics, triple-modular redundancy is usually used to provide safety assurance for aircraft. However, autonomous vehicles experience much less environmental variability (e.g., temperature, atmospheric pressure) than airplanes, making rare events such as radiation-induced errors less likely.
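The rules of thumb above (roughly 8 W per 3 TB of storage, a cooling coefficient of performance of about 1.3, and roughly 1 MPG lost per 400 W on a gasoline vehicle) can be combined into a simple power-budget estimate. The sketch below is illustrative only; the compute power figure is a hypothetical placeholder, not a measurement.

# Illustrative power budget for an on-vehicle autonomous driving system,
# combining the rules of thumb quoted above.

compute_w = 1000.0             # hypothetical compute platform draw (placeholder)
storage_tb = 41.0              # prior maps for the entire United States
storage_w = 8.0 * (storage_tb / 3.0)   # ~8 W per 3 TB stored

cop = 1.3                      # effective coefficient of performance of the A/C
heat_w = compute_w + storage_w
cooling_w = heat_w / cop       # ~77 W of cooling work per 100 W of heat

total_w = compute_w + storage_w + cooling_w
print(f"compute {compute_w:.0f} W, storage {storage_w:.0f} W, "
      f"cooling {cooling_w:.0f} W -> total {total_w:.0f} W")

# Rule of thumb for a gasoline vehicle: ~1 MPG lost per 400 W of extra load.
baseline_mpg = 31.0            # e.g., the 2017 Audi A4 figure quoted above
mpg_loss = total_w / 400.0
print(f"estimated fuel-economy impact: -{mpg_loss:.1f} MPG "
      f"({mpg_loss / baseline_mpg * 100:.1f}% of {baseline_mpg} MPG)")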

1.3 Designing an Autonomous Driving System

An autonomous driving system captures data from various real-time sensors, such as cameras, lidar, and millimeter-wave radar; it then performs the necessary processing to recognize the driving environment, make operational decisions, and finally operate the vehicle to reach the given destination. Autonomous driving systems typically consist of three main components: (1) scene recognition, which localizes the vehicle at the centimeter level and tracks objects; (2) path planning, which generates the future path; and (3) vehicle control, which physically operates the vehicle to follow the planned path [3]. These algorithms form the basis of most modern self-driving systems; they have been adopted in the self-driving car built by Udacity [4] and are consistent with the way Mobileye designs its self-driving systems [5]. A diagram of the components is shown in Fig. 1.1. The sensors are used to detect objects and to locate the vehicle in the localization engine. The detected objects are then passed to the object tracking engine to track moving objects. The fusion engine projects the vehicle position and the tracked objects into the same 3D coordinate space. Motion planning uses this spatial information to make operational decisions. The mission scheduler is called only when the vehicle deviates from the original route plan generated by a navigation service such as Google Maps. The fusion engine retrieves the coordinates of the objects being tracked by the tracker and merges them with the current vehicle position provided by the localization engine. The merged information is then transformed into the same three-dimensional coordinate space and sent to the motion planning engine to generate decisions about vehicle operation. Issues such as perception, decision-making, and control of autonomous driving systems are discussed below, as well as the limitations of existing approaches to safety verification and testing of autonomous driving systems.


Fig. 1.1 The video captured by the camera is streamed for object detection
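The data flow just described can be summarized in a short structural sketch. This is an illustration only: the stage names follow the components named above (detection, tracking, localization, fusion, motion planning, control), but every function body is a hypothetical placeholder rather than the system described in the text.

# Structural sketch of the processing pipeline described above.
# Each stage body is a placeholder; real systems use dedicated engines.

def detect_objects(frame):
    return []                      # placeholder: list of detected objects

def track_objects(detections, tracks):
    return detections              # placeholder: updated object tracks

def localize(lidar_scan, prior_map):
    return (0.0, 0.0, 0.0)         # placeholder: vehicle pose estimate

def fuse(pose, tracks):
    # Placeholder: a real fusion engine projects the pose and tracked
    # objects into one 3D coordinate space.
    return {"pose": pose, "objects": tracks}

def plan_motion(world_state, route):
    return "keep_lane"             # placeholder: operational decision

def control(decision):
    pass                           # placeholder: steering/throttle/brake commands

def step(frame, lidar_scan, prior_map, route, tracks):
    """One perception -> fusion -> planning -> control cycle."""
    detections = detect_objects(frame)
    tracks = track_objects(detections, tracks)
    pose = localize(lidar_scan, prior_map)
    world = fuse(pose, tracks)
    control(plan_motion(world, route))
    return tracks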

1.3.1 Perception Systems

Sensors in the perception component of autonomous driving systems are critical for vehicle localization, but they are limited by accuracy and cost. Global Navigation Satellite System (GNSS) sensors are expensive, inaccurate, and highly sensitive to interference in urban environments. For example, GPS-based GNSS sensors are still susceptible to positioning errors, and GNSS-based lane-level positioning methods suffer from “multipath interference” when external objects obstruct the GPS signal, as well as from satellite clock errors, which cause an inevitable inconsistency between GNSS coordinates and high-definition map coordinates. Vision sensors, while lower in cost, are still inaccurate in harsh weather conditions and against complex backgrounds because they are designed to operate on sharp images and video. Information from high-definition maps can be used to complement the imagery provided by vision sensors, but building high-definition maps requires extensive software and human effort. Finally, lidar sensors are expensive, and because of their limitations in identifying “ungrounded objects,” it is unclear whether they can detect unexpected movements, for example, of a person. Recent research therefore proposes fusing sensors of different types but with overlapping functions to reduce cost, achieve redundancy, and improve performance and safety.
In machine learning-based perception systems, such as neural networks, sensor inputs are easy to manipulate through adversarial examples. These adversarial examples are created by modifying the camera image to induce certain behaviors of the neural network, for example, reducing the system’s confidence in a prediction or causing the input to be misclassified. The inputs to the sensors of the self-driving system can be manipulated, for example, by slightly modifying road signs, causing the neural network of the self-driving system to misclassify these signs, misbehave and create road safety hazards. To improve resistance to such manipulation, methods have been proposed to protect neural networks from adversarial examples.
Safety risks also arise when errors in the autonomous driving system’s perception propagate to subsequent software components. Perception algorithms process sensor inputs and generate outputs based on their understanding of the autonomous driving system’s environment. These outputs may be inaccurate yet are used as input to decision-making algorithms, thereby influencing the control commands of the autonomous driving system and potentially resulting in unsafe driving behavior. Such error propagation from the perception component led to a fatal crash involving Tesla’s Autopilot system in 2016. For the system to take into account the uncertainty introduced by the sensing components, it is critical to estimate and minimize the uncertainty in each individual component. For example, a Bayesian probabilistic framework and Monte Carlo sampling have been used to estimate the prediction confidence scores generated by the perception system of an autonomous vehicle. Uncertainty in each component should also be communicated and well integrated across all software components to provide an overall measure of system uncertainty and facilitate decision making.
The performance of autonomous driving systems is also limited by computationally demanding perception algorithms. In [6], it was shown that perception algorithms such as object detection, tracking, and localization together account for more than 94% of the computation of autonomous driving systems. These computational constraints prevent further improvements in accuracy; adopting higher-resolution cameras and computing platforms such as graphics processing units (GPUs) to overcome them may generate additional heat, significantly increase power consumption, and reduce driving range and fuel efficiency. While machine learning techniques, such as deep neural network architectures, can improve object detection tasks such as bounding box detection (detecting objects and enclosing them in a box) and semantic segmentation (classifying every pixel in image space), they can introduce delays for real-time classification of high-resolution images [7].
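As a hedged illustration of the Monte Carlo approach to estimating prediction confidence mentioned above, the sketch below keeps dropout active at inference time and averages several stochastic forward passes, a common practical approximation to Bayesian inference. The tiny classifier and input are placeholders, not a perception model from the text.

import torch
import torch.nn as nn

# Tiny placeholder classifier with dropout; a real perception network
# would be far larger (e.g., an object detector backbone).
model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 3)
)

def mc_predict(model, x, n_samples=30):
    """Monte Carlo sampling with dropout left on at inference time."""
    model.train()                     # keep dropout active
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
        )
    mean = probs.mean(dim=0)          # averaged class probabilities
    std = probs.std(dim=0)            # spread serves as an uncertainty estimate
    return mean, std

x = torch.randn(1, 16)                # placeholder sensor feature vector
mean, std = mc_predict(model, x)
print("class probabilities:", mean.numpy().round(3))
print("uncertainty (std):   ", std.numpy().round(3))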

1.3.2 Decision-Making

In dynamic road environments, autonomous vehicles face many challenges. These environments are filled with uncertainty and unpredictable object movement, such as road closures and accident clean-ups. A major difficulty is that it may not be possible to correctly interpret the meaning of certain decision rules in complex driving scenarios.
Modelling and understanding human–machine interaction is critical for navigating safely in mixed traffic, building consumer trust in autonomous vehicles, and promoting their widespread adoption to achieve their full safety benefits, but this remains a challenge for decision-making algorithms. First, it is critical to understand whether the human in an autonomous vehicle is ready to regain control. For example, handing over control to the passenger creates a safety risk if the self-driving system fails to recognize behavioral traits such as exhaustion or distraction. Second, understanding the intentions of people near the autonomous vehicle and of other vehicles is key to safe navigation. For example, humans often use gestures and other social cues to indicate their intention to violate certain traffic rules to facilitate traffic flow. Drivers must also negotiate with other drivers, such as when keeping lanes, overtaking and merging, which requires balancing the uncertainty of human behavior against overly defensive driving behaviors that impede traffic flow. However, self-driving cars may not be able to interpret or express social cues correctly, which may prevent other road users from anticipating the self-driving car’s behavior; such mismatched expectations have contributed to the majority of crashes that have occurred. Despite these issues, several studies have explored the impact of human–machine interaction on the learning ability of autonomous driving systems, which is critical to address the above issues.
Decision-making algorithms are also constrained by computational complexity and by the algorithms in other software components, which can undermine the performance and safety of autonomous driving in dynamic environments. Although solutions such as graph search and rapidly exploring random trees have been developed for autonomous driving and other mobile robots, finding the optimal path is computationally time-consuming and not always feasible [8]. Additionally, in the presence of multiple dynamic obstacles such as pedestrians and other road users, computationally demanding perception algorithms reduce the time available for motion planning algorithms to continuously compute new collision-free trajectories [9]. The motion planning algorithm also needs to be well integrated with the control algorithm and to consider the constraints faced by the control algorithm, such as time, velocity and acceleration constraints and the evolution of the trajectory, but this requires more computing resources than existing processors can provide. Some researchers have begun to develop path-planning algorithms that can address perception uncertainties and control constraints, mitigating potentially dangerous situations. In addition, vehicle trajectories that were initially considered safe can become dangerous when unexpected environmental changes occur, such as when moving obstacles invalidate the initial plan; methods for incremental plan adjustment have therefore been proposed to enhance the adaptability of autonomous driving to unexpected situations [10]. The communication capabilities of 5G networks, combined with vehicle-to-everything (V2X) infrastructure, can provide autonomous vehicles with more information about nearby obstacles than existing 4G networks, supporting near-instant decision-making.

1.3.3 Vehicle Control

Control algorithms and the underlying models of vehicle motion are well developed and have achieved considerable success in trajectory tracking, ensuring that the autonomous vehicle follows the path determined by its decision-making algorithm. However, safety risks can arise from inaccuracies in how control algorithms model vehicle motion, especially under unexpected road conditions.
Geometric and kinematic control algorithms have been widely used because of their simplicity and relatively low computational cost. However, since they only model the geometry and kinematics of the vehicle (such as acceleration and velocity), they can lead to errors and vehicle instability. Because they disregard vehicle dynamics (e.g., friction and tire slip), geometric and kinematic control algorithms can lead to dangerous driving behaviors at high speeds, where dynamics strongly affect the motion of the vehicle, for example, during sudden lane changes or when trying to avoid unexpected obstacles. In the “pure pursuit” geometric algorithm, where the vehicle “constantly pursues a virtual moving point” [11], “rapid changes” in the vehicle’s path during high-speed driving can cause the algorithm to “overestimate” the steering input needed to correct the vehicle’s motion, causing the vehicle to oversteer and skid. Additionally, the control parameters tuned to “compensate” for the ignored dynamics make geometric and kinematic control algorithms highly sensitive to parameter changes. For example, tuning the optimal value of the “look-ahead distance” for pure pursuit algorithms is a challenge: this value determines how far ahead the vehicle chooses its “waypoint” from its current position, and a value that is too large will cause the vehicle to cut corners and deviate from the actual curved path during sharp turns [12].
Adaptive and model predictive control algorithms are now used in autonomous driving, but they remain inaccurate when their assumptions are violated and are computationally expensive. First, dynamic control algorithms incorporate linear or nonlinear models of vehicle dynamics, which are derived primarily from tire forces. Tire forces are generated by the friction between the tire and the road surface and are the main external influence on a moving vehicle. When the steering angle and sideslip angle exceed about 5°, linear models become inaccurate; nonlinear models may be more accurate in this regime, especially at high speed and large steering angles, but require more computation. Like the geometric pure pursuit algorithm, some dynamic control algorithms are still highly sensitive to changes in the “look-ahead distance” and to unknown vehicle parameters such as tire-road friction. Since these parameters are not available in real time, and installing additional sensors is costly, Amer et al. [13] developed an adaptive geometric controller that combines the low computational cost of geometric controllers with robustness. Finally, the model predictive control (MPC) algorithm considers system constraints, inputs and outputs to optimize actuator inputs and has been successfully used for autonomous driving trajectory tracking while satisfying both safety and time constraints. However, MPC requires highly complex and computationally demanding online optimization, especially when nonlinear vehicle dynamics are considered. Research has proposed linearizing nonlinear vehicle models and relaxing some of the collision avoidance constraints to reduce these computational requirements. Vehicle control technology can be further developed in the future.
Furthermore, all control algorithms face constraints caused by other software components, challenges in handling unexpected situations, and a lack of adequate practical testing. First, most control algorithms perform well only if the trajectory computed by the decision algorithm is continuous, and they do not yet take into account the time delays propagated by computationally expensive sensing, which can seriously destabilize the vehicle. Findings show that in unexpected situations (such as emergency collision avoidance), sudden path changes can cause the tires to skid, compromising trajectory tracking and vehicle stability. Finally, due to the large amount of computation, most control algorithms are only tested in simulation, not in actual autonomous driving, and are only validated with minimal parameter changes and few unexpected environmental changes [11]. To ensure that the controller is suitable for autonomous driving in real environments, researchers have proposed the use of so-called “hardware-in-the-loop” simulation, in which physical actuators are included in the simulation tests, and a V2X system has been developed that uses environmental data to update the hardware. Controllers have also been developed whose parameters vary with driving conditions and remain robust, integrating steering, braking, and suspension control under various road conditions.
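To make the pure-pursuit discussion above concrete, here is a minimal geometric sketch of the classic pure-pursuit steering law (arc curvature k = 2 sin(alpha) / Ld, with Ld the look-ahead distance). It is for illustration only and deliberately ignores the vehicle dynamics whose neglect is criticized above; the wheelbase and example values are placeholders.

import math

def pure_pursuit_steering(pose, waypoint, wheelbase=2.7, lookahead=None):
    """
    Classic pure-pursuit steering law for a kinematic bicycle model.
    pose     = (x, y, heading) of the rear axle in the world frame
    waypoint = (x, y) look-ahead point chosen on the reference path
    Returns the steering angle in radians.
    """
    x, y, yaw = pose
    dx, dy = waypoint[0] - x, waypoint[1] - y
    # Angle between the vehicle heading and the line to the look-ahead point
    alpha = math.atan2(dy, dx) - yaw
    ld = lookahead if lookahead is not None else math.hypot(dx, dy)
    # Curvature of the arc through the look-ahead point: k = 2*sin(alpha)/ld
    curvature = 2.0 * math.sin(alpha) / ld
    return math.atan(wheelbase * curvature)

# Example: vehicle at the origin heading along +x, target 10 m ahead, 2 m left
delta = pure_pursuit_steering((0.0, 0.0, 0.0), (10.0, 2.0))
print(f"steering angle: {math.degrees(delta):.1f} deg")

A larger look-ahead distance smooths the steering but makes the vehicle cut corners in sharp turns, which is exactly the parameter sensitivity discussed above.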

1.3.4 Safety Verification and Testing

Existing automated driving testing methods have many limitations that make it impossible to verify the safety of automated driving before deployment. First, many developers conduct extensive road testing and analyse data such as kilometers/miles driven, injuries and fatalities, improving the performance of the automated driving system until relatively low fatality and injury rates are reached. Second, existing safety requirement standards are designed for traditional systems engineering processes in which requirements are “known” and “clearly specified” (e.g., ISO 26262); they involve first creating functional requirements, annotating safety-related requirements, assigning them to safety-critical subsystems and designing to these requirements [14]. However, this is incompatible with the adaptive systems in self-driving vehicles, which learn from new data in real time rather than relying only on well-defined requirements. Therefore, a different approach is needed to specify the safety requirements of autonomous driving systems.
Validating autonomous driving systems is also challenging due to the nondeterminism of their algorithms and the adaptive nature of machine learning systems. First, it is difficult to assess whether self-driving test results are correct because the nondeterministic algorithms in the self-driving system produce nonrepeatable and probabilistic outputs. This implies potential differences in system behavior under near-identical tests, high sensitivity to small changes in environmental conditions, and potential differences in behavior between testing and certification and actual deployment. This calls for a new approach to testing that focuses on building sufficient confidence that the system exhibits the desired behavior rather than expecting precise, unique outputs for given inputs. Second, the training data of a machine learning system may contain accidental correlations leading to false predictions (overfitting), which must be detected to prevent significant unintended changes in the rules learned by the system; this currently requires expensive and complex manual data labelling. Machine learning algorithms can also exhibit erroneous corner-case behavior that can lead to fatal accidents in autonomous driving trials. However, methods for detecting and correcting these behaviors in the current testing process are still highly dependent on manually collected labelled test data, making them difficult to scale. Simulating extreme conditions is easier than testing them on the road, especially for self-driving radar systems, but there is a real risk of overfitting to simulated data because even experienced test designers have blind spots and cannot cover all driving situations. The high sensitivity of nondeterministic systems to small input changes exacerbates the challenge of finding the specific input combinations that expose the system’s extreme-case behavior. Finally, in machine learning-based software, detecting edge cases is more challenging than detecting bugs in conventional software: the logic of the latter is represented by control-flow statements that are easy to inspect, whereas the logic of a machine learning algorithm is learned from data and embedded in a highly nonlinear optimization function, which makes it more difficult to identify the inputs that trigger extreme-case behavior.
The limitations of existing testing and safety verification methods for autonomous driving and machine learning can be mitigated in a number of ways, such as through fault injection. Fault injection is a widely used tool for evaluating safety and validating the fault-tolerance mechanisms of autonomous driving systems in extreme cases, for example, by randomly modifying the weights of neural networks, simulating false inputs from sensors, and mapping defects in autonomous driving software that could be activated under unexpected circumstances. Synthesis methods and formal verification tools are popular approaches to verifying autonomous driving control systems, but their deployment is limited by their high computational cost. Formal verification tools, such as online verification of control algorithms and neural networks, must represent traffic conditions and the probabilistic behavior of road users, and traditional verification tools face challenges when modelling such complex environments. In [15], it is also highlighted that these verification methods must be made more scalable for the large-scale machine learning algorithms used in real-world applications.
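As a hedged sketch of the weight-perturbation style of fault injection mentioned above, the snippet below randomly corrupts a fraction of a model’s weights and compares predictions before and after. The tiny network, fault rate, perturbation scale, and evaluation data are all placeholders, not a method prescribed by the text.

import copy
import torch
import torch.nn as nn

def inject_weight_faults(model, fault_rate=0.01, scale=5.0, seed=0):
    """Return a copy of `model` with a random fraction of weights corrupted."""
    torch.manual_seed(seed)
    faulty = copy.deepcopy(model)
    with torch.no_grad():
        for param in faulty.parameters():
            mask = torch.rand_like(param) < fault_rate     # which weights to hit
            noise = scale * torch.randn_like(param)        # large perturbation
            param.add_(mask * noise)
    return faulty

# Placeholder classifier and inputs (stand-ins for a real perception model).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3))
x = torch.randn(100, 16)

clean = model(x).argmax(dim=-1)
faulty = inject_weight_faults(model, fault_rate=0.05)(x).argmax(dim=-1)
mismatch = (clean != faulty).float().mean().item()
print(f"predictions changed by injected faults: {mismatch:.1%}")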


Tips In conclusion, computational cost is a common problem in many algorithms across the autonomous driving software components; modelling and understanding human–machine interaction remains a major challenge for decision-making algorithms; and the uncertainty and adaptability of machine learning systems mean that existing testing and verification methods are not sufficient to ensure that autonomous driving is safe.

1.4 The Autonomous Driving System Computing Platform

The computing platform is a major challenge in autonomous driving research, and investigating vehicle hardware usually involves close interaction with chip design companies. A representative current vehicle computing platform consists of two computing boxes. Each computing box is equipped with an Intel Xeon E5 processor and four to eight Nvidia K80 GPU accelerators connected via a PCI-E bus. At peak performance, the 12-core CPU delivers 400 giga operations per second (GOPS) while consuming 400 W of power, and each GPU delivers 8 tera operations per second (TOPS) while consuming 300 W of power. Overall, the system delivers 64.5 TOPS at approximately 3000 W. One computing box is connected to 12 high-definition cameras around the vehicle for object detection and object tracking. LiDAR is mounted on top of the vehicle for vehicle localization, as well as for some obstacle avoidance functions. The second computing box performs exactly the same tasks for reliability: if the first box fails, the second box can immediately take over. When both boxes run at their peak (the worst case), they draw more than 5000 W, which generates an enormous amount of heat. Additionally, each box costs approximately $20,000–$30,000, making the entire solution unaffordable for the average consumer. To overcome these shortcomings, chip manufacturers have proposed several types of platforms and processing solutions based on GPUs, DSPs, FPGAs and ASICs.
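As a quick sanity check on the figures just quoted, the arithmetic below assumes the fully populated configuration of one CPU plus eight GPU accelerators per box; it is an illustration, not a precise specification.

# Rough aggregate throughput/power for one computing box, assuming the
# fully populated configuration (1 CPU + 8 GPU accelerators) quoted above.

cpu_tops, cpu_w = 0.4, 400.0       # 400 GOPS = 0.4 TOPS at 400 W
gpu_tops, gpu_w = 8.0, 300.0       # per accelerator
n_gpus = 8

box_tops = cpu_tops + n_gpus * gpu_tops
box_w = cpu_w + n_gpus * gpu_w
print(f"one box: ~{box_tops:.1f} TOPS at ~{box_w:.0f} W")   # about 64 TOPS, 2800 W

# With the redundant second box running the same workload at peak:
print(f"two boxes at peak: ~{2 * box_w:.0f} W")             # > 5000 W, as noted above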

1.4.1 GPU

The Nvidia PX2 platform is currently the leading GPU-based autonomous driving solution. Each PX2 contains two Tegra SoCs and two Pascal GPUs. Each GPU has dedicated memory as well as dedicated instructions for DNN acceleration. To provide high throughput, each Tegra is directly connected to a Pascal GPU using a PCI-E Gen 2 ×4 bus (4.0 GB/s total bandwidth). Additionally, the dual CPU-GPU cluster is connected via Gigabit Ethernet, providing 70 Gbit/s of aggregate bandwidth. With its optimized I/O architecture and DNN acceleration, each PX2 can perform 24 trillion deep learning operations per second. When running the AlexNet deep learning workload, it can process 2800 images per second.

1.4.2 DSP

Texas Instruments’ (TI) TDA is a DSP-based autonomous driving solution. The TDA2x SoC consists of two floating-point C66x DSP cores and four fully programmable vision accelerators dedicated to vision processing functions. Compared with an ARM Cortex-A15 CPU, the accelerators enable the TDA to complete vision tasks eight times faster while consuming less power. CEVA’s CEVA-XM4 is another DSP-based solution designed to perform computer vision tasks on video streams. Its main advantage is energy efficiency: for 1080p video at 30 frames per second (fps), the CEVA-XM4 requires less than 30 mW of power.

1.4.3 Field Programmable Gate Array FPGA

Altera’s Cyclone V SoC is an FPGA-based autonomous driving solution used in Audi products. Altera’s FPGA is optimized for sensor fusion, combining data from multiple sensors for highly reliable object detection. Another solution is the Zynq UltraScale MPSoC. When running CNN tasks, the UltraScale device achieves an energy efficiency of 14 images/s/W, which is significantly better than the Tesla K40 GPU (4 images/s/W). Additionally, for object tracking, it achieves 60 fps on a real-time 1080p video stream.

1.4.4 Specific Integrated Circuit ASIC

Mobileye’s EyeQ5 is currently the leading ASIC-based autonomous driving solution. EyeQ5 features heterogeneous, fully programmable accelerators, each optimized for its own family of algorithms, including computer vision, signal processing and machine learning tasks. With this architectural diversity, applications can use the most appropriate core for each task, saving computation time. To support system expansion with multiple devices, EyeQ5 implements two PCI-E ports for inter-processor communication.
Tips Each solution has its advantages, but no single platform is the best. As part of analyzing the current state of computer architecture for autonomous driving, we explore the following three questions: (1) Which computing units are best suited for which workloads? (2) Are mobile processors sufficient for autonomous driving tasks? (3) How can the most effective autonomous driving computing platform be designed?

1.5 The Content of This Book

This book focuses on the design of autonomous driving chips. First, we focus on the perception part of the autonomous driving stack, because perceiving the surrounding environment is the prerequisite for autonomous driving decision-making and control; in particular, we cover lane detection, 3D object detection, and localization and mapping for autonomous driving. We then present the latest developments in motion planning and control in Chap. 4. As the complexity of perception algorithms increases, specialized computing platforms are required to ensure that the algorithms run reliably. Since the core algorithms for autonomous driving are currently based on deep learning, and current deep learning models are relatively complex, model optimization and customized chip design are required to deploy them. Therefore, the first part of the book covers environment perception and planning and control; the second part covers model optimization and the design of deep learning chips; and the third part covers autonomous driving software design and testing.

1.5.1 3D Object Detection

Object detection is a traditional task in computer vision. Unlike image recognition, object detection must not only identify the objects present in an image and give their categories but also give the position of each object through the smallest enclosing box (bounding box). Detection methods can be distinguished by their outputs: methods that take RGB images and output the object category and the smallest enclosing box on the image are called 2D object detection, whereas methods that take RGB images, RGB-D depth images or laser point clouds and output the object category together with its length, width, height, rotation angle and other information in three-dimensional space are called 3D object detection. With the emergence of Faster R-CNN, 2D object detection has achieved unprecedented prosperity, and various new methods keep emerging. However, in application scenarios such as autonomous driving, robotics, and augmented reality, ordinary 2D detection cannot provide all the information needed to perceive the environment: 2D detection only provides the position of the target object in the 2D image and the confidence of the corresponding category, whereas in the real three-dimensional world, most applications require information such as the length, width, height and heading angle of the target object. In the autonomous driving scene, it is necessary to provide the three-dimensional size and rotation angle of the target object in the image, and this information, projected into the bird’s-eye view, plays a crucial role in subsequent path planning and control. Three-dimensional object detection is currently in a period of rapid development; it mainly uses monocular cameras, binocular cameras, and multiline lidars. The main difficulties in 3D visual object detection are as follows:
• Occlusion: occlusion can be divided into two cases: target objects occluded by each other and target objects occluded by the background.
• Truncation: some objects are truncated by the image boundary, so only part of the object is visible.
• Small targets: relative to the size of the input image, the target object occupies very few pixels.
• Rotation angle: objects with different orientations can have very similar appearance features, making it difficult to learn the rotation angle effectively.
This chapter reviews current 3D object detection algorithms and, combined with the code implementation of one algorithm, explains 3D object detection.
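To make the 3D box parameterization concrete, here is a minimal sketch assuming the common center/size/yaw encoding described above (length, width, height, rotation angle) and projecting the box footprint into the bird’s-eye view. The example values are arbitrary.

import math
from dataclasses import dataclass

@dataclass
class Box3D:
    """3D box: center (x, y, z), size (length, width, height), yaw about +z."""
    x: float
    y: float
    z: float
    length: float
    width: float
    height: float
    yaw: float

    def bev_corners(self):
        """Footprint corners (x, y) in the bird's-eye view."""
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        halves = [( self.length / 2,  self.width / 2),
                  ( self.length / 2, -self.width / 2),
                  (-self.length / 2, -self.width / 2),
                  (-self.length / 2,  self.width / 2)]
        # Rotate each local corner by yaw, then translate to the box center.
        return [(self.x + c * dx - s * dy, self.y + s * dx + c * dy)
                for dx, dy in halves]

# Arbitrary example: a car-sized box rotated 30 degrees
car = Box3D(x=10.0, y=2.0, z=0.8, length=4.5, width=1.8, height=1.5,
            yaw=math.radians(30))
for cx, cy in car.bev_corners():
    print(f"({cx:.2f}, {cy:.2f})")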

1.5.2 Lane Detection

Lane detection is the basis of many advanced driver assistance systems (ADAS), such as the lane departure warning system (LDWS) and the lane keeping assist system (LKAS). Successful ADAS and automotive companies, such as Mobileye, BMW and Tesla, have developed their own lane detection and lane keeping products and have achieved remarkable results in both research and practical applications. Both car companies and individual consumers have adopted the Mobileye series of ADAS products and Tesla Autopilot on the way to autonomous driving. Almost all mature lane assistance products currently use vision-based technology: lane markings are painted on the road for visual perception, and vision-based techniques detect lanes from camera images to prevent unintended lane departures. Therefore, accuracy and robustness are the two most important attributes of a lane detection system. A lane detection system should be able to identify unreasonable detections and adjust its detection and tracking algorithms accordingly. When a false alarm occurs, the ADAS should warn the driver to focus on the driving task. Conversely, a vehicle with a high level of automation continuously monitors its environment and should be able to handle low-accuracy detections on its own. Therefore, as the degree of vehicle automation increases, the evaluation of lane detection systems becomes more important.

1.5.3 Motion Planning and Control

Motion planning and control of vehicles is a maturing technology with the potential to reshape mobility by improving the safety, accessibility, efficiency, and convenience of road transportation. Safety-critical tasks that autonomous vehicles must perform include dynamic planning in an environment shared with other vehicles and pedestrians, and robust execution of the plan through feedback control. The purpose of this chapter is to survey the state of the art in planning and control algorithms, especially for urban environments. The discussion of planning and control provides insight into the strengths and limitations of various approaches and aids in the selection of system-level designs.

1.5.4 The Localization and Mapping

The SLAM problem is one of the keys to truly realizing self-driving robots, and it is therefore also a core technology of self-driving cars. However, many problems prevent SLAM algorithms from supporting driving over hundreds of kilometers under very different conditions. There are two main problems in applying SLAM to self-driving cars: (1) localization drifts over time, and (2) maps are not necessarily usable in every driving condition. The first problem is well known in the SLAM community: as the distance travelled by the car increases, the localization estimate given by the SLAM algorithm tends to deviate from the true trajectory, and without prior knowledge or absolute information it is almost impossible to ensure correct positioning over more than a few kilometers. The second problem is that, regardless of the conditions, there must be a map sufficient for the localization task. Recently, mapping that allows vehicles to be localized in different seasons, weather or traffic conditions has attracted much attention. Many solutions have been proposed to address these two problems, such as building maps that retain only distinctive information for later reuse, or leveraging new communication systems to share and enhance maps constructed by other road users. At present, a more practical solution in the field of autonomous driving is to use high-definition (HD) maps. The creation of HD maps for autonomous driving has therefore become the focus of many startups and has attracted considerable investment. Regarding HD maps, there is almost no open-source code on the web. Because of their importance, this book presents a methodology explaining how to create HD maps and how to represent them. In particular, the rapid development of deep learning provides semantic information, and how to integrate semantic information into HD maps is still an open problem.

1.5.5 The Autonomous Driving Simulator

The selection of an appropriate simulation environment is also a key step in autonomous driving, critical not only for the development stage of a project but also for the final autonomous driving solution. The simulator helps determine, from the information obtained from the models, how the problem can be solved. Today, because of their importance in the development of control algorithms, simulators exist in almost all fields of robotics. A simulator for autonomous driving is a challenge that covers a wide range of research areas, including machine learning, vehicle dynamics, and traffic simulation, which is why there are so many different simulators, each focusing on a specific area. Additionally, in the field of reinforcement learning, the use of simulators is mandatory: in some cases it takes many iterations to find a solution, which would be expensive on the one hand and could create dangerous situations on the other if real cars were used. Therefore, this chapter mainly introduces the status of the main open-source simulators and how to develop a qualified autonomous driving simulator. A good simulator can save considerable manpower and material resources in the final implementation of an autonomous driving solution, which makes it worth a company’s investment.

1.5.6 Autonomous Driving ASICs

Here, we focus on the current status of autonomous driving chips in industry, including those from Mobileye, Nvidia and Texas Instruments, which have successfully launched autonomous driving chips and begun to compete in an autonomous driving market worth hundreds of billions of dollars. Through an example, this chapter introduces how to use the Texas Instruments autonomous driving chip to implement and optimize some autonomous driving algorithms.

1.5.7 Deep Learning Model Optimization

Deep convolutional neural networks (CNNs) provide state-of-the-art performance for a variety of applications, especially in the field of computer vision. However, the large amount of storage required to deploy this technology has become an important obstacle to its widespread application. The original AlexNet architecture requires approximately 240 MB of memory to store the weight parameters needed to classify a single image; its deeper successor, VGG, requires even more memory (528 MB). In environments with abundant cloud computing power backed by multiple graphics processing units (GPUs), large memory requirements may not be considered limiting. However, on mobile or edge-based embedded devices with limited computing power, such resource-intensive deep neural networks cannot be easily deployed, which holds back the proliferation of deep learning applications on smartphones and IoT devices. Therefore, the design of deep neural networks that require less memory and computing power has established itself as a new research direction. One direction is to modify large, bulky models so that they retain as much performance as possible while reducing memory requirements, namely, neural network compression. Another direction is to design more memory-efficient network architectures from scratch. This chapter discusses these different methods in detail.
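The memory figures quoted above can be roughly reproduced from parameter counts. The sketch below is illustrative arithmetic only; the parameter counts are the commonly cited approximate values for AlexNet and VGG-16, and the 8-bit column assumes simple post-training quantization.

# Approximate weight-storage arithmetic for classic CNNs.
# Parameter counts are the commonly cited approximate values.

models = {"AlexNet": 61_000_000, "VGG-16": 138_000_000}

for name, n_params in models.items():
    fp32_mb = n_params * 4 / 2**20    # 4 bytes per 32-bit weight
    int8_mb = n_params * 1 / 2**20    # 1 byte per weight after 8-bit quantization
    print(f"{name}: ~{fp32_mb:.0f} MB at FP32, ~{int8_mb:.0f} MB at INT8")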

1.5.8 Design of Deep Learning Hardware

Given the current development trend of autonomous driving, CNN-based algorithms occupy a dominant position, so the design of specialized deep learning chips is of great significance for autonomous driving. Because of the popularity of DNNs, many recent hardware platforms have special features for DNN processing. For example, Intel Knights Mill CPUs include special vector instructions for deep learning, and the Nvidia Pascal GP100 GPU supports 16-bit floating-point (FP16) arithmetic, performing two FP16 operations on a single-precision core to speed up deep learning computations. There are also systems built specifically for DNN processing, such as the Nvidia DGX-1 and Facebook’s Big Basin custom DNN server. DNN inference has also been demonstrated on various embedded system-on-chip (SoC) platforms, such as Nvidia Tegra and Samsung Exynos. Therefore, it is important to fully understand how processing is performed on these platforms and how to design application-specific accelerators for DNNs to further improve throughput and energy efficiency.

1.5.9 Self-Driving ASICs Design

Implementing autonomous driving algorithms first requires deciding what type of hardware platform the self-driving car should use. Designers can express the algorithms as software running on a CPU, DSP, or GPU, but these platforms can be expensive and have other drawbacks: CPUs may not be fast or efficient enough, DSPs are good at image processing but lack sufficient performance for deep AI workloads, and GPUs, while good at training, are too power hungry for in-vehicle use. Another option for an in-vehicle solution is to instantiate the algorithm in hardware, in a specialized form, as an FPGA, ASIC or custom design. Although more upfront design effort is needed, FPGAs and ASICs provide the best power/performance results. This is why an increasing number of design teams choose ASICs to implement their algorithms.

1.5.10 Operating Systems for Autonomous Driving

This chapter provides insights into specific operating systems (real-time and general-purpose), middleware and frameworks, systems engineering, and architectural considerations for building autonomous driving systems. The development of autonomous driving systems involves domain-specific algorithms, architectures, systems engineering, and technical implementation.

1.5.11 Autonomous Driving Software Architecture

Autonomous driving projects currently underway are essentially testing increasingly complex sensors and the software algorithms needed to process the car’s vast amount of information, make the right decisions and act. This processing requires considerable software; by current estimates, roughly a billion lines of code are needed to power a fully self-driving car. The computational requirements for executing this massive amount of software approach server-class performance, far beyond traditional automotive embedded processing. There is currently a trend in both industry and academia to incorporate clusters of more powerful application processors and accelerators into higher-performance multicore SoCs rather than discrete CPUs. The sophistication of the software is much higher than in even the most advanced airliners, which are already full of autonomous features, because self-driving cars must deal with highly chaotic roads filled with unpredictable human drivers and pedestrians, while planes operate in relatively empty sky. As a result, a great deal of algorithmic processing is required in real time to understand what is going on around the car, make the right decisions, and safely execute a massive software stack across all the computational components of autonomous driving. This complexity lends itself to a common unified platform architecture upon which an easily upgradeable and portable software stack can be built.

1.5.12 5G C-V2X

By providing real-time information on conditions beyond the driver’s line of sight, C-V2X can be used in conjunction with other sensors on the vehicle to improve safety. Information captured by the C-V2X system can complement data captured by radar, lidar, and ultrasonic systems that help drivers keep their vehicle at a safe distance from the vehicle in front, and it also helps in severe weather and low-light situations. The vehicle’s onboard computer can combine data received via C-V2X with information captured by the onboard cameras to interpret road signs and objects. In addition, GNSS systems that pinpoint the vehicle’s position on 3D and HD maps can be updated in real time over the cellular network. The fusion of evolving and improving sensors and computer intelligence will eventually mimic and surpass the perceptual and cognitive abilities of human drivers.
Tips For each part, interested readers can also explore potential research projects and products based on this book. For example, readers who are interested in algorithms can develop multisensor fusion algorithms (Chaps. 2 and 3); readers who are interested in high-definition maps can learn how to build them (Chap. 5), and startups are developing custom high-definition map generation. If you have a good autonomous driving algorithm, you can also apply software-based model compression (Chap. 8) and run it directly on the NVIDIA/Qualcomm/Texas Instruments autonomous driving platforms to implement a product prototype (Chap. 7). Readers who are interested in hardware development can refer to the chapters on deep learning chip design (Chap. 9) and autonomous driving SoC chip design (Chap. 10), especially NVIDIA’s open-source design; hardware designers can refer to this open-source code to design custom ASIC chips. Regarding ASICs for autonomous driving, the author believes that developing digital image processing chips and deep learning chips dedicated to autonomous driving is a promising direction; designing special digital image processing chips that can handle HDR and fast-moving objects is a major challenge but also an opportunity. Readers who are interested in software development can refer to the software architecture design chapter (Chap. 12) to learn how to develop their own autonomous driving software products based on open-source code; unlike conventional software development, this must take safety into account. Of course, this book draws heavily on open-source code and literature, and it took the author 12 months of considerable time and energy to complete.


Chapter 2

3D Object Detection

Abstract This chapter provides an overview of 3D object detection methods, starting with an introduction to the sensors and datasets commonly used in autonomous driving in Sects. 2.2 and 2.3, respectively. Based on the existing work, the discussion is organized by sensor modality into single-sensor methods, point cloud-based methods, and fusion methods. In Sect. 2.5, a selected paper is discussed together with open-source code to illustrate 3D object detection. Finally, in Sect. 2.6, current research challenges and future research directions are discussed. In summary, this chapter:
• summarizes the datasets and simulation tools used to evaluate detection model performance;
• provides a summary of progress in 3D object detection for autonomous vehicles;
• compares the performance of 3D object detection methods on benchmarks;
• identifies future research directions.

2.1 Introduction

Connected networking and autonomous vehicles will undoubtedly help improve driving safety, traffic flow, and work efficiency. However, accurate environmental perception and awareness are critical for the safe operation of autonomous vehicles. Perception systems for autonomous vehicles convert sensor data into semantic information: the location, speed, and class of the main road objects (such as vehicles, pedestrians, and cyclists), lane markings, drivable areas, and traffic sign information. The object detection task is critical, as failure to identify road objects may lead to safety-related incidents; for example, failure to detect a vehicle ahead can lead to traffic accidents and even life-threatening situations. One factor in the failure of perception systems is sensor limitations and environmental changes such as lighting and weather conditions. Other challenges include generalization across driving areas such as highways, rural roads, and urban areas. While highway lanes are well structured, there are no prescribed directions for parked vehicles in urban areas.


Table 2.1 Comparison between 2D and 3D detection methods [1]

2D object detection
  Advantages: Well-established datasets and object detection architectures; usually, only RGB input is needed to obtain accurate results on the image plane.
  Disadvantages: Limited information: lack of object depth, occlusion, and 3D position information.
3D object detection
  Advantages: The 3D bounding box provides the size and position of the object in world coordinates; this detailed information allows a better understanding of the environment.
  Disadvantages: Requires depth estimation for precise localization; the extra dimensional regression increases the complexity of the model; 3D-labelled datasets are lacking.

Categories such as pedestrians and cyclists are more diverse and appear against more cluttered backgrounds. Another factor is occlusion: when one object blocks the view of another, part or all of that object may be invisible. Objects can also be of very different sizes (compare a truck to a trash can) and can be very close to or far away from the self-driving car. Because the scale of an object greatly affects the sensor readings, the same class of objects can produce very different representations. Despite these challenges, 2D object detection methods for autonomous driving have achieved over 90% average precision (AP) on the well-established KITTI object detection benchmark. Compared with 2D object detection, 3D methods add a third dimension to localization and size, providing depth information in world coordinates. However, in the case of autonomous driving, there is still a large gap between the performance of 2D and 3D methods, for several reasons: datasets for 3D object detection are limited, and 3D detection received comparatively little research attention over the past decade. Nevertheless, 3D object detection is essential for autonomous driving, and more research effort is required to improve it. A good survey [1] discusses the existing 3D object detection methods for autonomous driving from the past several years; based on the overall architecture of [1], this chapter adds more recent advances in 3D object detection for autonomous driving. Driven by the strong performance of deep learning on computer vision tasks, more and more autonomous driving tasks are shifting to end-to-end deep learning-based solutions, in which the end-to-end pipeline optimizes overall performance. This chapter focuses on such end-to-end learning methods, as they represent the state of the art in 3D object detection and have grown rapidly in recent years (Table 2.1).


The chapter starts with an introduction to the sensors and datasets commonly used in autonomous driving in Sects. 2.2 and 2.3, respectively. Based on the existing work, the discussion is organized by sensor modality into single-sensor methods, point cloud-based methods, and fusion methods. In Sect. 2.5, a paper is selected and discussed together with open-source code. Finally, in Sect. 2.6, current research challenges and future research directions are discussed. In summary, this chapter:
• summarizes the datasets and simulation tools used to evaluate detection model performance;
• provides a summary of progress in 3D object detection for autonomous vehicles;
• compares the performance of 3D object detection methods on benchmarks;
• identifies future research directions.

2.2 Sensors

Although people rely mainly on their visual and auditory systems when driving, machine perception usually relies on multiple modalities to overcome the limitations of any single sensor. Autonomous vehicles use a wide variety of sensors: passive sensors (such as monocular and stereo cameras) and active sensors (including lidar, radar, and sonar). Most research on autonomous driving perception has focused on cameras and lidar. There is some debate over whether cameras alone are sufficient or whether lidar should be treated as a mandatory sensor for autonomous driving. Camera-only solutions such as Tesla's provide a comprehensive advanced driver assistance system with functions such as lane keeping, but they remain far from L4 autonomous driving. Other companies, such as Waymo, insist on lidar as a mandatory sensor for perceiving the world around the car; however, current lidar solutions are very expensive compared to cameras, which prevents their wide deployment in production vehicles. This section therefore covers these two sensor categories in more detail. A more comprehensive survey of sensors for autonomous driving can be found in [1, 2]. Table 2.2 compares the different sensors.

Table 2.2 Comparison of different sensors

Monocular camera
  Advantages: RGB image with texture attributes; not expensive.
  Disadvantages: No depth information; susceptible to weather and light conditions.
Stereo camera
  Advantages: RGB image with texture attributes; not expensive; accurate depth information.
  Disadvantages: Susceptible to weather and light conditions.
LiDAR
  Advantages: Less susceptible to weather conditions; not susceptible to light conditions; panoramic observation.
  Disadvantages: Expensive; no texture attributes; no color information.

2.2.1 Camera

Monocular cameras provide detailed information in the form of pixel intensities and reveal the shape and texture properties of the scene at a relatively large scale. Traffic symbols such as traffic signs and lane markings are specifically designed with fixed shapes and textures. One disadvantage of monocular cameras is the lack of depth information, which is required for accurate size and position estimation of objects. Stereo cameras were therefore designed to mimic human binocular vision, perceiving 3D objects and recovering depth information about the world. A stereo camera mainly uses a matching algorithm to find correspondences between the two images and to compute the depth of each point relative to the camera; however, stereo camera calibration is very challenging, and synchronizing the two cameras is not an easy task. The granularity of camera recognition is relatively high, and rich texture and color can be obtained, so cameras enable fine-grained recognition; in this regard, lidar is not as good as a camera. The biggest disadvantage of the camera is that it is greatly affected by ambient light: in scenarios with strong light, highlighted objects, or low-light environments at night, it is difficult for algorithms to perceive the collected data effectively and reliably. Another camera mode that provides depth estimation is the ToF (Time of Flight) camera, which infers depth by measuring the delay between transmitted and received modulated infrared pulses. This technology finds application in vehicle safety: compared with stereo cameras, ToF cameras have a lower price and lower computational complexity, but also lower resolution. Additionally, camera sensors are susceptible to light and weather conditions, ranging from low brightness at night to sudden, extremely high brightness differences when entering or exiting a tunnel. More recently, the use of LED lights has created flickering problems on traffic signs and vehicle brake lights; this occurs because the camera's sensor cannot reliably capture the amount of light emitted due to the switching behavior of the LEDs.
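To make the matching-and-triangulation idea concrete, the sketch below estimates a depth map from a rectified stereo pair with OpenCV's block matcher. It is a minimal illustration, not a calibration-grade pipeline; the focal length, baseline, and file names are assumed values.

# Minimal stereo-depth sketch (assumes rectified grayscale images).
import cv2
import numpy as np

focal_px = 721.5    # assumed focal length in pixels
baseline_m = 0.54   # assumed baseline in meters

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block matching finds, for each left-image pixel, the best match in the right image.
matcher = cv2.StereoBM_create(numDisparities=128, blockSize=15)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0  # fixed-point -> pixels

# Triangulation: depth = f * B / disparity (valid only where disparity > 0).
depth = np.zeros_like(disparity)
valid = disparity > 0
depth[valid] = focal_px * baseline_m / disparity[valid]
print("median depth of valid pixels [m]:", np.median(depth[valid]))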

2.2.2 LiDAR

Due to their excellent imaging capabilities, lidar and cameras have always been regarded as the core sensors of autonomous driving. Compared with the camera, the advantage of lidar is that it obtains accurate three-dimensional information, and because it is an active light source it is not affected by ambient light and works equally well during the day and at night. Lidar performs active detection and imaging by laser, unaffected by ambient light, and directly measures the distance, azimuth, depth, reflectivity,


etc. of objects. The algorithm first identifies obstacles and then classifies them; the recognition accuracy and reliability far exceed those of cameras, while the computing resources consumed are lower. One of the most important applications of lidar in autonomous driving is high-precision localization. A self-driving vehicle must first determine its own location before it can address the question of "where to go"; determining "where am I" is therefore the first and a crucial step. Conventionally, localization is regarded as the task of GPS, and GPS is indeed used in autonomous driving, but its accuracy is insufficient and its signal stability is poor near tall buildings or when entering and exiting tunnels, so GPS alone cannot guarantee vehicle safety. Autonomous driving localization therefore combines lidar, GPS, IMU, and other sensors to achieve stable and reliable high-precision positioning. The lidar hardware, together with AI perception algorithms developed for autonomous driving, can identify surrounding obstacles, detect road edges, and perform high-precision localization; it can also classify and label obstacles, dividing them into trucks and cars, pedestrians, bicycles, and so on. Standard lidar models, such as the HDL-64 L [3], use a rotating array of laser beams to obtain a 3D point cloud. The sensor can output 120,000 points per frame, or 1.2 million points per second at a 10 Hz frame rate. Recently, Velodyne released the VLS-128 model with 128 laser beams, higher angular resolution, and a 300-meter radius range. The main obstacle to the widespread use of lidar is its cost: a single sensor can cost more than $70,000, which is higher than the price of many cars. However, with the mass production of solid-state lidar technology, prices are expected to drop in the next few years.
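To give a sense of the raw data such a sensor produces, a single Velodyne scan in the KITTI dataset (introduced in Sect. 2.3) is stored as a flat array of float32 (x, y, z, intensity) values. The sketch below loads one scan; the file path is only an example.

import numpy as np

# Each KITTI Velodyne scan is a binary file of float32 values in groups of
# (x, y, z, reflectance); the path below is hypothetical.
scan = np.fromfile("velodyne/000000.bin", dtype=np.float32).reshape(-1, 4)
points_xyz = scan[:, :3]   # 3D coordinates in the lidar frame [m]
reflectance = scan[:, 3]   # per-point intensity

print("points in this frame:", points_xyz.shape[0])
print("range of the farthest point [m]:", np.linalg.norm(points_xyz, axis=1).max())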

2.2.3 Camera + Lidar

There are also methods that rely on both the lidar and camera modalities. Before fusing these modalities, the sensors need to be calibrated to obtain a single spatial reference frame. Park et al. [4] proposed using polygonal planar targets, which can be detected by both sensors, to generate accurate 3D-2D correspondences and obtain a more accurate calibration; however, the need for a physical target makes this method difficult to apply in the field. As an alternative, Ishikawa et al. [5] devised a calibration method that iteratively calibrates the sensors using their odometry to estimate the environment, without spatial targets. Different sensors have different characteristics. Monocular cameras are inexpensive but lack the depth information necessary for accurate 3D object detection. Stereo cameras can be used for depth recovery but degrade in unfavorable lighting conditions and untextured scenes. ToF (Time of Flight) camera sensors have limited resolution and range. Conversely, lidar sensors can be used for accurate


depth estimation at night but are prone to noise in adverse weather (such as snow and fog) and cannot provide texture information. How to fuse different sensor data to achieve reliable target detection has therefore long been an area of continuous effort in industry.

2.3 Datasets

With the widespread use of learning methods, large amounts of labeled training data are required. For example, ImageNet accelerated the development and evolution of image classification and object detection models; autonomous driving needs similar datasets to speed up machine learning-based solutions. In particular, tasks such as object detection and semantic segmentation require precisely labelled data. This section introduces general datasets for driving tasks, especially 3D object detection. In the driving environment, one of the most commonly used datasets is KITTI [6], which provides many formats of data, including stereo color images, lidar point clouds, and GPS coordinates. All modalities are synchronized in time so that data from the different sensors correspond to the same instant. The KITTI recordings include well-structured highways, complex urban areas, and narrow country roads, and the dataset has been widely used for different autonomous driving tasks, including stereo matching, visual odometry, 3D tracking, and 3D object detection. The object detection subset contains 7481 training and 7518 testing frames, provided with sensor calibration information and annotated 3D boxes around objects. Although the KITTI dataset has been widely adopted, it has several notable limitations:

• It has limited sensor configurations and lighting conditions.
• All data were taken during the day, mostly in sunny conditions.
• All data were collected with the same set of sensors.
• The class distribution is very uneven, with roughly 75% cars, 4% cyclists, and 15% pedestrians.

The lack of diversity makes it hard to evaluate current methods in more general scenarios, reducing their reliability in practical applications. The existing KITTI dataset can therefore provide useful guidance for research on autonomous driving, but, as noted above, its limitations mean it cannot fully meet the requirements of practical autonomous driving. With the increasing demand for 3D object detection in autonomous driving, more and more 3D datasets have been proposed and shared across academia and industry. Other existing 3D datasets that can be used for 3D object detection in autonomous driving are listed below.


Among these datasets, the KITTI, nuScenes, and Waymo Open datasets are the most popular; interested readers can start their research from these datasets (a short example of inspecting the KITTI labels follows the list below).

• KITTI dataset: http://www.cvlibs.net/datasets/kitti/
• Argoverse dataset: https://www.argoverse.org/
• Lyft L5: https://level-5.global/data/?source=post_page
• ApolloScape dataset: http://apolloscape.auto/
• nuScenes: https://www.nuscenes.org/nuscenes#data-collection
• Waymo Open: https://waymo.com/open/
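The uneven class distribution mentioned above can be checked directly from the KITTI training labels, since each line of a label file starts with the object type. The sketch below counts the classes; the label directory path is an assumed local path to the official "label_2" folder.

from collections import Counter
from pathlib import Path

label_dir = Path("kitti/training/label_2")   # assumed local path, one .txt per frame

counts = Counter()
for label_file in label_dir.glob("*.txt"):
    for line in label_file.read_text().splitlines():
        if not line.strip():
            continue
        obj_type = line.split()[0]   # first field is the object class, e.g. "Car"
        counts[obj_type] += 1

total = sum(counts.values())
for obj_type, n in counts.most_common():
    print(f"{obj_type:15s} {n:7d}  ({100.0 * n / total:.1f}%)")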

Of course, even with the datasets above, more training data are needed to cover all scenarios of practical autonomous driving, including snow and fog. To generate additional training data, simulation tools can be used to produce data under specific conditions and to train end-to-end driving systems [7, 8]. During model training, more diverse data improve the performance of the detection model in the real environment, and such data can be obtained through a game engine or a simulated environment. Several simulation tools have been proposed in academia. For example, CARLA [9] is an open-source simulation tool for autonomous driving that offers flexibility in setting up the environment and configuring sensors. Another simulation tool, Sim4CV [10], enables easy environment customization and simultaneous multi-view rendering of driving scenes while providing ground-truth bounding boxes for object detection.
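As an illustration of how such a simulator is driven from Python, the sketch below connects to a locally running CARLA server and spawns a vehicle on autopilot. It assumes a simulator listening on the default port and a CARLA release whose Python API matches these calls; it is a minimal sketch, not a data-collection pipeline.

import time
import carla

# Connect to a CARLA simulator assumed to be running locally on the default port.
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()

# Pick an arbitrary vehicle blueprint and a predefined spawn point.
blueprint = world.get_blueprint_library().filter("vehicle.*")[0]
spawn_point = world.get_map().get_spawn_points()[0]

vehicle = world.spawn_actor(blueprint, spawn_point)
vehicle.set_autopilot(True)   # let the built-in traffic manager drive

try:
    # Let the simulation run for a short while; camera or lidar sensors could be
    # attached to the vehicle here to record training data.
    time.sleep(10)
finally:
    vehicle.destroy()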

2.4 3D Object Detection Methods

We can divide 3D object detection methods into three categories: monocular image, point cloud, and fusion-based methods.

2.4.1 Monocular Image-Based Methods

2D object detection has been addressed and successfully applied on multiple datasets. However, the KITTI dataset presents a particularly challenging setting for object detection. Furthermore, 2D detection on the image plane is not sufficient for a reliable actuation system; such applications require more precise 3D spatial positioning and size estimation. This section focuses on methods for estimating 3D bounding boxes from monocular images. Since no depth information is available, most methods first use neural networks, geometric constraints, or 3D model matching to detect 2D candidates before predicting the object's 3D bounding box. Candidate bounding boxes with typical physical sizes can be sampled in 3D space by assuming a prior on the ground plane; the boxes are then projected to the image


Fig. 2.1 Overview of the approach [11]

Fig. 2.2 The single-stream network for 3D object detection [13]

plane, thus avoiding a multi-scale search in the image. Candidate boxes are scored by exploiting multiple features: class semantics, instance semantics, contour, object shape, context, and location priors, and a final set of object proposals is obtained after non-maximum suppression. Chen et al. [11] proposed Mono3D, which designs simple region proposals using context, semantics, hand-designed shape features, and location priors. For any given region proposal, these features can be efficiently computed and scored by the model. Region proposals are generated by performing an exhaustive search in 3D space and are filtered using non-maximum suppression (NMS). Figure 2.1 shows an overview of the proposed approach [11]. The results are further refined by a Fast R-CNN [12] model that regresses the 3D bounding boxes. This work builds on the authors' previous work 3DOP [13], which uses depth images to generate region proposals in a similar framework; Mono3D slightly improves on that performance despite using only monocular images, whereas 3DOP uses depth images. Figure 2.2 shows an overview of the approach [13]. It should be noted that an important feature of the driving environment is the presence of severe occlusions in crowded scenes, where one vehicle may block the view of another. The authors of [14] introduced visibility patterns into the model and mitigated


Fig. 2.3 Overview of our object category recognition framework [14]. (a) Training pipeline. (b) Testing pipeline

Fig. 2.4 Overview of the object detection framework [14]

occlusion effects through object-level reasoning. They propose a 3D Voxel Pattern (3DVP) representation, which models appearance in 3D via RGB intensities. Figure 2.3 shows an overview of the approach [14]. With this representation, it is possible to recover objects that are partially visible, occluded, or truncated. A dictionary of 3DVPs is obtained by clustering the observed patterns in the data and, given 2D image segments of the vehicle, a classifier is trained for each specific pattern. In the testing phase, the patterns obtained by classification are used for occlusion reasoning and for 3D pose and localization estimation. The authors achieve 3D detection by minimizing the reprojection error between the 3D bounding box projected to the image plane and the 2D detection, but performance still depends on the region proposal network (RPN). While some RPNs improve on traditional proposal methods, they still struggle with occlusions, truncations, and varying object scales. Extending the 3DVP framework, the same authors propose SubCNN [15]; Fig. 2.4 shows this approach. SubCNN is a CNN that exploits subcategory information for object detection at the RPN level, where subcategories are object classes that share similar properties such as 3D pose or shape. After region-of-interest (ROI) proposals, the network outputs the subcategory classification as well as precise 2D bounding box estimates. Using 3DVPs [14] as subcategories of the pedestrian, cyclist, and vehicle categories, the model can recover 3D shape, pose, and


Fig. 2.5 Overview of the deep MANTA approach [16]

occlusion patterns. An extrapolation layer improves small-object detection by introducing a multi-scale image pyramid. However, the approach cannot generalize to arbitrary vehicle poses that differ from the existing patterns. By exploiting subcategory information, the authors propose a new CNN architecture for region proposals and a new object detection network for joint detection and subcategory classification. To overcome this limitation, Deep MANTA [16], proposed by Chabot et al., is another template-matching-based method. It uses an extensive library of 3D CAD models and matches the 2D bounding boxes of detected objects against these models; Deep MANTA scores all proposed 3D templates to find the best match of the object geometry in the database. The system has two main steps. First, the monocular image passes through the deep MANTA network to generate bounding boxes with matching vehicle geometry scores. The second step is the inference stage, in which the network outputs and the 3D models are used to extract 3D orientation and location [14]. As Fig. 2.5 shows, the first step has three levels: the first level produces region proposals using an RPN; level 2 refines the bounding box properties; and level 3 provides the object class, 3D part coordinates, part visibility, and a 3D model template. The entire input image is forwarded through the deep MANTA network, where convolutional layers with the same color in the figure share the same weights and the three convolutional blocks correspond to a split of an existing CNN architecture. The network provides object proposals {Bi,1}, which are iteratively refined ({Bi,2} and then the final detection set {Bi,3}). 2D part coordinates {Si}, part visibilities {Vi}, and template similarities {Ti} are associated with the final set of detected vehicles {Bi,3}. Non-maximum suppression (NMS) is then performed to remove redundant detections, providing the new set {Bj, Sj, Vj, Tj}. Using these outputs, the inference step chooses the best corresponding 3D template using


template similarity Tj and then performs 2D/3D pose computation using the associated 3D shape. Earlier attempts performed an exhaustive search over the 3D bounding box space, estimating 3D poses from clusters of appearance patterns or 3D templates. Mousavian et al. [17] instead extended a standard 2D object detector with 3D orientation (yaw) and bounding box size regression; most models use L2 regression for orientation angle prediction. First, the size and orientation of the 3D box are determined from the network predictions; the pose of the 3D object is then recovered by solving for the translation that minimizes the reprojection error of the 3D bounding box. All of the previous monocular methods detect objects only with the front camera, ignoring objects to the side and rear of the vehicle. Although lidar methods can be used effectively for 360-degree detection, [18] proposed the first 3D object detection method for 360-degree panoramic images. The authors estimate dense depth maps for panoramic images and adapt standard object detection methods to equirectangular representations. Due to the lack of labelled panoramic driving datasets, a projective transformation was used to transform the KITTI dataset, and benchmark detection results are also provided on synthetic datasets. The monocular approach has been extensively studied: although earlier work considered hand-designed features for region proposals, most approaches have turned to deep learning for region proposals, along with 3D model matching and reprojection, to obtain 3D bounding boxes.

Tips: The main disadvantage of monocular-based methods is the lack of depth information, which limits detection and localization accuracy for distant and occluded objects; their sensitivity to light and weather conditions also restricts their use largely to daytime.
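The reprojection step that several of these monocular methods rely on is easy to state in code: build the eight corners of a 3D box from its dimensions, yaw, and center, then project them with the camera's 3 × 4 projection matrix. The sketch below is a generic illustration with assumed box dimensions and an assumed projection matrix, not the exact formulation of [17].

import numpy as np

def box3d_corners(h, w, l, x, y, z, ry):
    """Eight corners of a 3D box (camera coordinates, y pointing down, yaw ry about y)."""
    # Box-local corners, origin at the bottom center of the box.
    x_c = np.array([ l/2,  l/2, -l/2, -l/2,  l/2,  l/2, -l/2, -l/2])
    y_c = np.array([ 0.0,  0.0,  0.0,  0.0,  -h,   -h,   -h,   -h ])
    z_c = np.array([ w/2, -w/2, -w/2,  w/2,  w/2, -w/2, -w/2,  w/2])
    R = np.array([[ np.cos(ry), 0, np.sin(ry)],
                  [ 0,          1, 0         ],
                  [-np.sin(ry), 0, np.cos(ry)]])
    corners = R @ np.vstack([x_c, y_c, z_c])      # rotate around the vertical axis
    return corners + np.array([[x], [y], [z]])    # translate: 3 x 8 matrix

def project(corners, P):
    """Project 3 x N camera-frame points with a 3 x 4 projection matrix to pixels."""
    homog = np.vstack([corners, np.ones((1, corners.shape[1]))])
    uvw = P @ homog
    return uvw[:2] / uvw[2]                        # 2 x N pixel coordinates

# Assumed example: a 1.5 x 1.6 x 4.0 m car, 15 m ahead, and a hypothetical P matrix.
P = np.array([[700.0, 0.0, 620.0, 0.0],
              [0.0, 700.0, 190.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
print(project(box3d_corners(1.5, 1.6, 4.0, 1.0, 1.6, 15.0, 0.3), P))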

2.4.2 Point Cloud-Based Detection Methods

Currently, point cloud-based 3D object detection methods are divided into three subcategories: (1) projection-based, (2) volumetric representation, and (3) point-net. Each category is explained and analyzed as follows:

2.4.2.1 Projection Methods

Image classification and object detection in 2D images are well-studied topics in computer vision, and the availability of datasets and benchmark architectures for 2D images makes these approaches attractive. Projection methods therefore first convert the 3D point cloud into a 2D image through planar, cylindrical, or spherical projection; a standard 2D object detection model can then be applied, with the position and size regressed to recover the 3D bounding box.


Fig. 2.6 BirdNet 3D object detection framework [20]

Reference [19] used a cylindrical projection and a fully convolutional network (FCN) to predict the 3D bounding boxes of vehicles. The projection produces an input image whose channels encode the height of each point and its distance from the sensor. This input is fed to a 2D FCN, which downsamples it through three consecutive layers and then uses transposed convolutional layers to upsample the feature maps into the predicted bounding box (BB) output. The first output indicates whether a given point belongs to a vehicle or to the background, so it can be used as a weak classifier; the second output, conditioned on the first, encodes the vertices of the 3D bounding box that encloses the vehicle. Since there are many bounding box estimates per vehicle, a non-maximum suppression (NMS) strategy is employed to reduce overlapping predictions. On the KITTI dataset, this detection model is trained end to end with loss balancing to avoid a bias towards the more frequent negative samples. Instead of cylindrical or spherical projections, [20] used a bird's-eye-view projection to generate 3D object detections; Fig. 2.6 shows the BirdNet 3D detection framework [20]. The input representations differ: one encodes each 2D cell using the minimum, median, and maximum height values of the points falling in the cell, while the others use height, intensity, and density channels. The first method uses the Faster R-CNN architecture as its base, with an adjusted refinement network, and outputs an oriented 3D bounding box. Although the bird's-eye-view results are reasonable, the method performs poorly on orientation (yaw) regression. Since most of these pipelines assume sensors with high point densities, applying the resulting models to low-end lidar sensors is limited. The three outputs of the network are the class (green), the 2D bounding box (blue), and the yaw angle (red). Complex-YOLO [21] builds on the efficiency of the YOLO architecture, extending it to predict the additional dimensions and the yaw angle; Fig. 2.7 shows the overall architecture [21]. While classical RPN methods further process each region for finer predictions, this architecture, classified as a one-shot detector, obtains detections in a single forward pass. Although its detection performance is lower, this allows Complex-YOLO to achieve a runtime of 50 fps, about five times faster than previous methods.


Fig. 2.7 Overview of Complex-YOLO [21]

Complex-YOLO [21] is a very efficient model that operates directly on lidar-only bird's-eye-view RGB maps to estimate and localize accurate 3D multiclass bounding boxes. The upper part of the figure shows a bird's-eye view based on a Velodyne HDL-64 point cloud together with the predicted objects; the lower part shows the reprojection of the 3D boxes into image space. Note that Complex-YOLO needs no camera image as input; it is lidar-based only. Quantifying the reliability of the predictions made by an object detection system is critical to the safe operation of the vehicle: as with a human driver, if the system has insufficient confidence in its predictions, it should enter a safe state to avoid risk. While most detection models provide a score for each prediction, they tend to use softmax normalization to capture the class distribution; since this normalization forces the probabilities to sum to one, it does not necessarily reflect the absolute confidence of the prediction. A Bayesian neural network [22] has been used to predict the category and 3D bounding box of a region of interest (ROI) while quantifying the network's confidence in both outputs. By measuring the model uncertainty associated with the observed objects as well as the observation noise arising from occlusion and low point density, and by adding constraints that limit the penalty for noisy training samples, an improvement in detection performance is observed when uncertainty is modelled.
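A full Bayesian neural network as in [22] is beyond a short example, but a common practical approximation of predictive uncertainty is Monte Carlo dropout: keep dropout active at test time and look at the spread of repeated predictions. The sketch below is a minimal illustration with a placeholder network and input; it is not the detector of [22].

import torch
import torch.nn as nn

# Placeholder scoring head; dropout stays stochastic because we keep train() mode.
head = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(128, 3))

features = torch.randn(1, 64)   # stand-in for ROI features
head.train()                    # keep dropout active at inference time

with torch.no_grad():
    samples = torch.stack([torch.softmax(head(features), dim=-1) for _ in range(30)])

mean_prob = samples.mean(dim=0)   # averaged class probabilities
uncertainty = samples.var(dim=0)  # high variance = low confidence in the prediction
print(mean_prob, uncertainty)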

2.4.2.2 Volumetric Convolution Methods

Volumetric methods represent objects or scenes in a 3D grid or voxel representation, where each cell has a property such as binary occupancy or continuous point density. One advantage of this approach is that it explicitly encodes shape information. However, most of the volume is empty, and processing the empty cells reduces efficiency; in addition, because the data are truly three-dimensional, 3D convolutions are necessary, greatly increasing the computational cost of such models. References [23, 24] address object detection for driving scenes using a one-stage fully convolutional network over the entire scene volume. One-stage detection differs from two-stage detection, in which the first stage generates region proposals that are then refined in a second processing stage. Li [23] used a binary volume input and detected only vehicles. The model has two outputs: an "objectness" output that predicts whether the estimated region belongs to an object of interest, and a second output that predicts the vertices of the object's box. The use of expensive 3D convolutions limits the runtime performance. For a more efficient implementation, reference [24] fixed the size of the object box for each category, detecting only cars, pedestrians, and cyclists (Fig. 2.8). Feature maps are first downsampled by three convolution operations with a stride of 1/2 (an overall factor of 1/2³) and then upsampled by deconvolution operations with the same stride. The output objectness map (o^a) and bounding box map (o^b) are collected from the deconv4a and deconv4b layers, respectively. Simplifying the architecture in conjunction with sparse convolution algorithms greatly reduces model complexity, and L1 regularization with a linear activation function is used to maintain sparsity between convolutional layers. During inference, a parallel network can be run independently for each class, while the fixed box size assumption allows the network to be trained directly on 3D crops of the positive samples. During training, the authors augment the data with rotation and translation transformations and employ hard negative mining to reduce false positives.

Fig. 2.8 A sample illustration of the 3D FCN structure used in [23]
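The binary volume input used by [23] can be produced with a few lines of array code: quantize each point to a voxel index and mark that voxel as occupied. A minimal sketch with assumed grid extents and resolution:

import numpy as np

def voxelize(points_xyz, x_range=(0, 40), y_range=(-40, 40), z_range=(-2, 2), voxel=0.2):
    """Binary occupancy grid from an N x 3 point array (assumed extents in meters)."""
    mins = np.array([x_range[0], y_range[0], z_range[0]])
    maxs = np.array([x_range[1], y_range[1], z_range[1]])
    shape = np.ceil((maxs - mins) / voxel).astype(int)

    # Keep points inside the volume and convert them to integer voxel indices.
    mask = np.all((points_xyz >= mins) & (points_xyz < maxs), axis=1)
    idx = ((points_xyz[mask] - mins) / voxel).astype(int)

    grid = np.zeros(shape, dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1   # mark occupied voxels
    return grid

occupancy = voxelize(np.random.uniform(-40, 40, size=(10000, 3)))
print("grid shape:", occupancy.shape, "occupied voxels:", int(occupancy.sum()))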

2.4.2.3 PointNet Method

A point cloud is composed of a variable number of 3D points sparsely distributed in space, so it is not obvious how to feed its structure into a traditional feed-forward deep neural network that assumes a fixed input size. Previous approaches use projection, which converts the point cloud to an image, or a voxel representation, which converts it into a volumetric structure. A third class of methods, called point networks (PointNets), handles this irregularity by using the raw points as input, reducing the information loss caused by projection or quantization in 3D space. The paper [25] presents the seminal work in this category, performing object classification and segmentation with raw 3D point clouds as input. The network applies pointwise transformations using shared fully connected (FC) layers and aggregates a global feature via max pooling; the experimental results show that this method outperforms volumetric methods. Figure 2.9 shows the PointNet architecture. The model is further extended in PointNet++ [26], where progressively more complex features are encoded layer by layer in a hierarchical structure. The classification network takes n points as input, applies input and feature transformations, and then aggregates point features by max pooling; the output is classification scores for k classes. The segmentation network is an extension of the classification network: it concatenates global and local features and outputs per-point scores. "mlp" stands for multilayer perceptron, and the numbers in brackets are layer sizes. Batch normalization is used for all layers with ReLU, and dropout layers are used for the last MLP in the classification network. Among point cloud-based methods, the projection subcategory has received much attention because it keeps object detection close to the standard image setting; however, most such methods rely on hand-designed features, such as density and height, when projecting the point cloud. In contrast, point-based methods use the raw 3D points to learn a representation directly in feature space, although it is still necessary to investigate new ways of using the whole scene point cloud as input, since regular PointNet [25] models operate on segmented objects. Likewise, volumetric methods convert point clouds into voxel representations, for which spatial information must

Fig. 2.9 PointNet architecture [25]


be explicitly encoded; these methods require 3D convolutions, and handling the sparse representation is inefficient.
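The core PointNet idea, a shared per-point MLP followed by a symmetric max-pooling aggregation, fits in a few lines of PyTorch. The sketch below is a stripped-down classifier without the input and feature transform networks of [25]; class count and layer sizes are illustrative.

import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal PointNet-style classifier: shared per-point MLP + max pooling."""
    def __init__(self, num_classes=3):
        super().__init__()
        # Conv1d with kernel size 1 applies the same MLP to every point.
        self.pointwise = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(), nn.Dropout(0.3), nn.Linear(256, num_classes)
        )

    def forward(self, points):               # points: (batch, 3, num_points)
        features = self.pointwise(points)    # (batch, 1024, num_points)
        global_feat = features.max(dim=2).values   # order-invariant aggregation
        return self.classifier(global_feat)

logits = TinyPointNet()(torch.randn(4, 3, 1024))   # 4 clouds of 1024 points each
print(logits.shape)                                # torch.Size([4, 3])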

2.4.3 Fusion-Based Methods

As mentioned earlier, point clouds do not provide texture information, which is very useful for object detection and classification; conversely, monocular images cannot capture the depth values needed for accurate 3D localization and size estimation. Additionally, the density of the point cloud decreases rapidly with distance from the sensor, while the image still allows the detection of distant vehicles and objects. To improve overall performance, some researchers have therefore combined the two modalities using different strategies and fusion schemes. In general, there are three types of fusion schemes:
• Early fusion: at the beginning of the process, the modalities are combined to create a new representation that depends on all of them.
• Late fusion: each modality is processed separately and independently, and fusion occurs only at the final stage. This scheme does not require all modalities, as it can rely on predictions from individual modalities.
• Deep fusion: modalities are mixed hierarchically within the neural network layers, allowing features from different modalities to interact across layers and resulting in a more general fusion scheme.
One fusion strategy is to use a point cloud projection together with the RGB channels of the front camera for higher detection performance. Two such methods [27, 28] use a 3D region proposal network to generate 3D regions of interest, which are then projected to a specific view and used to predict classes and 3D bounding boxes. The first method is MV3D [27], which uses the lidar bird's-eye-view and front-view projections as well as the RGB image of the forward-facing camera. This work targets high-accuracy 3D object detection in autonomous driving scenarios. The authors propose Multi-View 3D networks (MV3D), a sensor-fusion framework that takes both the lidar point cloud and RGB images as input and predicts oriented 3D bounding boxes. The sparse 3D point cloud is encoded with a compact multi-view representation. The network is composed of two subnetworks: one for 3D object proposal generation and another for multi-view feature fusion. The proposal network generates 3D candidate boxes efficiently from the bird's-eye-view representation of the 3D point cloud, and a deep fusion scheme combines region-wise features from multiple views, enabling interactions between the intermediate layers of the different paths. Figure 2.10 shows the MV3D network architecture proposed in [27]. The second method is AVOD [28], an Aggregate View Object Detection network for autonomous driving scenarios. The proposed architecture uses lidar point clouds and RGB images to generate features that are shared by two subnetworks: a region proposal network (RPN) and a second-stage


Fig. 2.10 MV3D network architecture [27]

detector network. The RPN uses a novel architecture capable of performing multimodal feature fusion on high-resolution feature maps to generate reliable 3D object proposals for multiple object classes in road scenes. Using these proposals, the second-stage detection network performs accurate oriented 3D bounding box regression and category classification to predict the extent, orientation, and class of objects in 3D space. The architecture produces state-of-the-art results on the KITTI 3D object detection benchmark while running in real time with a low memory footprint, making it a suitable candidate for deployment on autonomous vehicles; code is available at https://github.com/kujason/avod. The second strategy consists of obtaining 2D candidates from monocular images and lifting these detections to 3D space using the point cloud data. In this category, Frustum PointNet [29] generates region proposals on the image plane from monocular images; the bounding boxes obtained on the image plane are extruded into 3D space, resulting in frustum region proposals, and each frustum of points is then fed to a PointNet instance to perform classification and 3D box regression. Similarly, Du et al. [30] first select the points that fall inside a detection box when projected onto the image plane and then use these points for model fitting, resulting in preliminary 3D proposals; each proposal is processed by a two-stage refined CNN that outputs the final 3D boxes and confidence scores.

Tips Detection in both of these methods is constrained by the region proposals on monocular images, which can be a limiting factor under poor lighting conditions. Fusion methods obtain state-of-the-art detection results by exploiting complementary information from multiple sensor modalities: lidar point clouds provide accurate depth information, although the point density becomes sparse at long range, while cameras provide texture information that is valuable for class recognition.
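The calibration step that all of these fusion schemes depend on boils down to a rigid transform followed by the camera projection. The sketch below projects lidar points into the image with assumed extrinsic and intrinsic matrices; in practice these come from the sensor calibration (e.g., the KITTI calibration files), and the placeholder values here are purely illustrative.

import numpy as np

def lidar_to_image(points_xyz, T_cam_lidar, P):
    """Project N x 3 lidar points into pixel coordinates.
    T_cam_lidar: 4 x 4 lidar-to-camera transform, P: 3 x 4 camera projection matrix."""
    homog = np.hstack([points_xyz, np.ones((points_xyz.shape[0], 1))])   # N x 4
    cam = T_cam_lidar @ homog.T                                          # 4 x N, camera frame
    in_front = cam[2] > 0                                                # keep points ahead of the camera
    uvw = P @ cam[:, in_front]
    return (uvw[:2] / uvw[2]).T                                          # M x 2 pixel coordinates

# Assumed placeholder calibration (identity extrinsics, simple pinhole intrinsics).
T = np.eye(4)
P = np.array([[700.0, 0.0, 620.0, 0.0],
              [0.0, 700.0, 190.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
pixels = lidar_to_image(np.random.uniform(1, 30, size=(500, 3)), T, P)
print(pixels.shape)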

2.5 Complex-YOLO: A Euler-Region-Proposal for Real-Time 3D Object Detection on Point Clouds [31]

In recent years, point cloud processing has become increasingly important for autonomous driving due to dramatic improvements in automotive lidar sensors. Current sensors deliver real-time 3D points of the surrounding environment and have the advantage of directly measuring distances to objects, so lidar can be used in object detection algorithms for autonomous driving that accurately estimate the position and orientation of different objects in 3D space. Compared to images, lidar point clouds are sparse and distributed over the whole measurement area; the points are unordered, they interact locally, and they cannot be analyzed in isolation. Deep learning-based object detection and classification is a well-known task and is widely used for 2D bounding box regression in images, where research focuses mainly on the trade-off between accuracy and efficiency; for autonomous driving, efficiency is even more important. The best object detectors therefore use region proposal networks (RPNs) or similar grid-based methods, which are efficient and accurate enough to run even on specialized hardware or embedded devices. Object detection on point clouds, in contrast, still performs comparatively poorly but is increasingly important, and these applications must be able to predict 3D bounding boxes. Currently, there are three main families of deep learning methods:
• direct point cloud processing using multilayer perceptrons;
• convolutional neural networks (CNNs) applied to point clouds converted into voxels or image stacks;
• combined fusion methods.
Here, we select the literature [31] as a reference to describe the latest developments in 3D object detection and explain it together with its corresponding source code.

2.5.1 Algorithm Overview

Surprisingly, until this work no model had met the real-time requirements of autonomous driving; Complex-YOLO is therefore introduced as a first small and accurate model, running at more than 50 fps on an NVIDIA Titan X GPU. It adopts the multi-view idea of MV3D for point cloud preprocessing and feature extraction, but multi-view fusion is dropped: only a single lidar-based bird's-eye-view RGB map is generated, to ensure efficiency. Figure 2.11 shows Complex-YOLO, a 3D version of YOLOv2, one of the fastest state-of-the-art image object detectors. Complex-YOLO is supported by a specific E-RPN that estimates the orientation of each object, encoded by the imaginary and real parts of a complex number for each box.


Fig. 2.11 Complex-YOLO pipeline [31]

The idea is to have a closed mathematical space, free of singularities, that generalizes to exact angles. Even if an object has only a few points (e.g., pedestrians), the model is still able to predict well-localized 3D bounding boxes and the precise orientation of the object in real time; for this purpose, special anchor boxes are designed. Furthermore, it predicts all eight KITTI classes using only lidar input data. Evaluated on the KITTI benchmark suite, the model achieves accuracy comparable to existing methods for cars, pedestrians, and cyclists while being at least five times more efficient. Its main contributions are as follows:
• By using the new E-RPN, a reliable angle regression function is provided for 3D anchor box estimation.
• High-precision real-time performance is demonstrated on the KITTI benchmark suite, more than five times faster than the leading models at the time.
• The precise orientation of each 3D bounding box, supported by the E-RPN, enables trajectory prediction for surrounding objects.
• Compared with other lidar-based methods, the model efficiently estimates all classes simultaneously in one forward pass.
The overall pipeline is slim and allows fast and accurate 3D box estimation on point clouds: the RGB map is fed into the CNN, and the E-RPN grid runs on the last feature map, predicting five boxes per grid cell. Each box prediction is composed of the regression parameters t and object scores p, with a general probability p0 and n class scores p1...pn.

2.5.2 Point Cloud Preprocessing

Point cloud preprocessing first converts the single-frame 3D point cloud collected by the Velodyne HDL-64E laser scanner into a bird's-eye-view RGB map that covers an area of 80 m × 40 m directly in front of the vehicle (as shown in Fig. 2.12). Inspired by MV3D, the height, intensity, and density of the point cloud are encoded to compose the RGB map. The size of the grid map is defined as


Fig. 2.12 The ground truth spatial distribution, outlining the size of the bird’s-eye view region. Left: sample detection; right: 2D spatial histogram with annotation boxes [31]

n = 1024 and m = 512. Therefore, the 3D point cloud is projected and discretized into a 2D grid with a resolution of approximately g = 8 cm. Compared to MV3D, the cell size is slightly reduced to achieve smaller quantization errors while having a higher input resolution. For efficiency and performance reasons, only one height map is used instead of several. Thus, all three feature channels (z_r, z_g, z_b ∈ R^{m×n}) are computed for the part of the point cloud within the coverage area Ω, defined as

P_{\Omega} = \{ P = [x, y, z]^T \mid x \in [0\,\mathrm{m}, 40\,\mathrm{m}],\ y \in [-40\,\mathrm{m}, 40\,\mathrm{m}],\ z \in [-2\,\mathrm{m}, 1.25\,\mathrm{m}] \}   (2.1)

Considering the lidar mounting height of about 1.73 m, z ∈ [-2 m, 1.25 m] is chosen to cover an area approximately 3 m above the ground, with a truck as the tallest expected object. A mapping function S_j = f_{PS}(P_{\Omega i}, j), with S ∈ R^{m×n}, maps each point with index i to a specific grid cell S_j of the RGB map, and the set of points mapped to a given cell is P_{\Omega i \to j} = \{ P_{\Omega i} = [x, y, z]^T \mid S_j = f_{PS}(P_{\Omega i}, j) \}. With the Velodyne intensity denoted I(P_{\Omega}), the three channels of each cell are computed as

z_g(S_j) = \max(P_{\Omega i \to j} \cdot [0, 0, 1]^T)   (2.2)

z_b(S_j) = \max(I(P_{\Omega i \to j}))   (2.3)

z_r(S_j) = \min(1.0, \log(N + 1)/\log(64)), \quad N = |P_{\Omega i \to j}|   (2.4)

Here, N is the number of points mapped from P_{\Omega i} to S_j, and g is the grid cell size. Hence, z_g is the maximum height, z_b the maximum intensity, and z_r the normalized density of all points mapped into S_j.

Here, N describes the number of points mapped from PΩi to Sj, and g is the parameter for the grid cell size. Hence, zg is the maximum height, zb is the maximum intensity and zr is the normalized density of all points mapped into Sj. Code 2.1 Point Point Cloud Preprocessing def makeBVFeature(PointCloud_, Discretization, bc): Height = cnf. BEV_HEIGHT + 1 Width = cnf. BEV_WIDTH + 1 # quantize the feature map PointCloud = np.copy(PointCloud_)

2.5

Complex-YOLO: A Euler-Region-Proposal for Real-Time 3D Object Detection. . .

45

PointCloud[:, 0] = np.int_(np.floor(PointCloud[:, 0] / Discretization)) PointCloud[:, 1] = np.int_(np.floor(PointCloud[:, 1] / Discretization) + Width / 2) # sort indices = np.lexsort((-PointCloud[:, 2], PointCloud[:, 1], PointCloud[:, 0])) PointCloud = PointCloud[indices] # height map heightMap = np.zeros((Height, Width)) _, indices = np.unique(PointCloud[:, 0:2], axis=0, return_index=True) PointCloud_frac = PointCloud[indices] max_height = float(np.abs(bc['maxZ'] - bc['minZ'])) heightMap[np.int_(PointCloud_frac[:, 0]), np.int_ (PointCloud_frac[:, 1])] = PointCloud_frac[:, 2] / max_height # intensity map and density map intensityMap = np.zeros((Height, Width)) densityMap = np.zeros((Height, Width)) _, indices, counts = np.unique(PointCloud[:, 0:2], axis=0, return_index=True, return_counts=True) PointCloud_top = PointCloud[indices] normalizedCounts = np.minimum(1.0, np.log(counts + 1) / np.log(64)) intensityMap[np.int_(PointCloud_top[:, 0]), np.int_ (PointCloud_top[:, 1])] = PointCloud_top[:, 3] densityMap[np.int_(PointCloud_top[:, 0]), np.int_(PointCloud_top [:, 1])] = normalizedCounts # Generate an RGB image using the height map, intensity map, and density map generated by the point cloud RGB_Map = np.zeros((3, Height - 1, Width - 1)) RGB_Map[2, :, :] = densityMap[:cnf. BEV_HEIGHT, :cnf. BEV_WIDTH] # r_map RGB_Map[1, :, :] = heightMap[:cnf. BEV_HEIGHT, :cnf. BEV_WIDTH] # g_map RGB_Map[0, :, :] = intensityMap[:cnf. BEV_HEIGHT, :cnf. BEV_WIDTH] # b_map return RGB_Map

2.5.3 The Proposed Architecture

The Complex-YOLO network takes the bird's-eye-view RGB map (see the point cloud preprocessing above) as input. It uses a simplified YOLOv2 CNN architecture, shown in Fig. 2.13, extended by complex angle regression and the E-RPN to detect accurate, multiclass, oriented 3D objects while still operating in real time. The final model has 18 convolutional layers, 5 max-pooling layers, and 3 intermediate layers for feature reorganization [31]. From the input feature map, the E-RPN regresses the 3D position bx and by, the object size (width bw and length bl), the objectness probability p0, and the class scores p1, p2, ⋯, pn, and determines the orientation b∅.

Fig. 2.13 Simplified YOLOv2 CNN architecture with the Euler-Region-Proposal (E-RPN)

To obtain the correct orientation, the usual Grid-RPN approach is modified by adding a complex angle \arg(|z| e^{i b_{\phi}}) to it:

b_x = \sigma(t_x) + c_x   (2.5)

b_y = \sigma(t_y) + c_y   (2.6)

b_w = p_w \, e^{t_w}   (2.7)

b_l = p_l \, e^{t_l}   (2.8)

b_{\phi} = \arg(|z| e^{i b_{\phi}}) = \arctan_2(t_{im}, t_{re})   (2.9)

Here, tx and ty are the predicted offsets of the box center, pw and pl are the prior (anchor) dimensions of the box, and cx and cy are the offsets of the grid cell from the top-left corner of the image.


With this extension, the E-RPN can estimate the accurate object orientation from the imaginary and real parts directly embedded in the network. For each grid cell (32 × 16, see Fig. 2.13), five boxes are predicted, each with an objectness probability and class scores, so each grid cell yields 75 features. The corresponding code is shown below.

Code 2.2: Feature Extraction from the Proposed Network

# Obtain the output features from the prediction tensor
x = torch.sigmoid(prediction[..., 0])          # predicted x offset
y = torch.sigmoid(prediction[..., 1])          # predicted y offset
w = prediction[..., 2]                         # width
h = prediction[..., 3]                         # height
im = prediction[..., 4]                        # imaginary part of the angle
re = prediction[..., 5]                        # real part of the angle
pred_conf = torch.sigmoid(prediction[..., 6])  # objectness confidence
pred_cls = torch.sigmoid(prediction[..., 7:])  # class scores
if grid_size != self.grid_size:
    self.compute_grid_offsets(grid_size, cuda=x.is_cuda)

# decode the boxes using Eqs. 2.5-2.9
pred_boxes = FloatTensor(prediction[..., :6].shape)
pred_boxes[..., 0] = x.data + self.grid_x
pred_boxes[..., 1] = y.data + self.grid_y
pred_boxes[..., 2] = torch.exp(w.data) * self.anchor_w
pred_boxes[..., 3] = torch.exp(h.data) * self.anchor_h
pred_boxes[..., 4] = im
pred_boxes[..., 5] = re

2.5.4 Anchor Box Design

The object detector predicts five boxes per grid cell, and all of these boxes are initialized with beneficial priors (anchor boxes) for better convergence during training. The angle regression increases the degrees of freedom (i.e., the number of possible priors), but for efficiency reasons the number of predictions is not increased. Therefore, based on the distribution of bounding boxes within the KITTI dataset, only three different sizes and two orientation angles are defined as priors: (1) vehicle size, heading up; (2) vehicle size, heading down; (3) cyclist size, heading up; (4) cyclist size, heading down; (5) pedestrian size, heading left.

2.5.5 Complex Angle Regression

The azimuth b∅ of each object can be calculated from the regression parameters tim and tre, which characterize the phase of a complex number: using arctan2(tim, tre), the angle is obtained directly. On the one hand, singularities are avoided; on the other hand, a closed mathematical space is obtained.

Fig. 2.14 Regression estimation of 3D bounding boxes

Connecting the regression parameters of the angle directly to the loss function, as shown in Fig. 2.14, realizes the regression estimation of the 3D bounding box. The 3D bounding boxes are predicted from the regression parameters used in YOLOv2, together with the complex angle that encodes the bounding box orientation. The transition from 2D to 3D is performed using a predetermined height for each category [31]. The network optimization loss function L is based on the concepts of YOLO and YOLOv2, which define L_Yolo as a sum of squared errors using a multipart loss. This approach is extended by an Euler regression part L_Euler that exploits the complex numbers, which form a closed mathematical space for angle comparisons. This avoids the singularities that are common in single-angle estimation:

L = L_Yolo + L_Euler

Code 2.3: The Loss Function

# compute the loss function
loss_x = self.mse_loss(x[obj_mask], tx[obj_mask])
loss_y = self.mse_loss(y[obj_mask], ty[obj_mask])
loss_w = self.mse_loss(w[obj_mask], tw[obj_mask])
loss_h = self.mse_loss(h[obj_mask], th[obj_mask])
loss_im = self.mse_loss(im[obj_mask], tim[obj_mask])
loss_re = self.mse_loss(re[obj_mask], tre[obj_mask])
loss_euler = loss_im + loss_re
loss_conf_obj = self.bce_loss(pred_conf[obj_mask], tconf[obj_mask])
loss_conf_noobj = self.bce_loss(pred_conf[noobj_mask], tconf[noobj_mask])
loss_conf = self.obj_scale * loss_conf_obj + self.noobj_scale * loss_conf_noobj
loss_cls = self.bce_loss(pred_cls[obj_mask], tcls[obj_mask])
total_loss = loss_x + loss_y + loss_w + loss_h + loss_euler + loss_conf + loss_cls
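Given the pred_boxes tensor filled in Code 2.2, the orientation can be recovered with a single atan2 call. The short helper below is a hedged sketch of that decoding step; the function name is ours, and the only assumption is the channel layout used in Code 2.2.

import torch

def decode_yaw(pred_boxes):
    """Recover the yaw angle from the complex-number regression outputs.

    pred_boxes[..., 4] and pred_boxes[..., 5] are assumed to hold t_im and
    t_re, as in Code 2.2; atan2 maps them to an angle in (-pi, pi] without
    the singularity of a single-angle regression.
    """
    im, re = pred_boxes[..., 4], pred_boxes[..., 5]
    return torch.atan2(im, re)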

2.5.6

Evaluation on KITTI

Complex-YOLO is evaluated on the challenging KITTI object detection benchmark, which covers the three classes car, pedestrian, and cyclist and comprises 2D object detection, 3D object detection, and bird's-eye view detection tasks. Each class is evaluated at three difficulty levels, easy, moderate, and hard, which take object size, distance, occlusion, and truncation into account. This public dataset provides 7481 training samples (Velodyne laser scanner point clouds with annotated images) and 7518 test samples. Note that since only lidar is considered as input, the focus here is on the bird's-eye view and 3D detection benchmarks, and the 2D object detection benchmark is not run.

2.5.7

Training

The model is trained from scratch by stochastic gradient descent with a weight decay of 0.0005 and a momentum of 0.9. The implementation is based on a modified version of the Darknet neural network framework, and the Velodyne samples are first preprocessed to generate the bird's-eye view RGB maps. The publicly available ground truth of the training set is subdivided into 85% for training and 15% for validation, because the model is trained from scratch and targets multiclass prediction; in contrast, methods such as VoxelNet modify and optimize their models for individual classes. The class distribution of the available ground truth is unfavorable: more than 75% cars, less than 4% cyclists, and less than 15% pedestrians. Additionally, more than 90% of all annotated objects face the direction of the recording car. Nevertheless, surprisingly good results are observed on the validation set and on other unlabelled KITTI sequences covering multiple use-case scenarios such as cities, highways, and inner cities.
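A minimal sketch of this training setup in PyTorch is shown below, assuming a model and dataset object already exist; the learning rate is an assumption, while the momentum, weight decay, and 85%/15% split follow the values stated above.

import torch

def make_optimizer_and_split(model, dataset):
    # SGD configured with the hyperparameters given in the text
    optimizer = torch.optim.SGD(
        model.parameters(), lr=1e-3,        # learning rate is an assumption
        momentum=0.9, weight_decay=0.0005   # values stated in the text
    )
    # 85% / 15% train/validation split of the available ground truth
    n_train = int(0.85 * len(dataset))
    train_set, val_set = torch.utils.data.random_split(
        dataset, [n_train, len(dataset) - n_train]
    )
    return optimizer, train_set, val_set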

2.5.8

Bird’s Eye View Detection

In Table 2.3, the evaluation results of the bird's-eye view detection performance are listed. This benchmark uses bounding box overlap for comparison. For a better overview and to rank the results, similar current top methods are also listed; their results were obtained on the official KITTI test set.

Table 2.3 Evaluation results of bird's-eye view detection performance [31]

Method           Modality       FPS    Car (Easy/Mod/Hard)      Pedestrian (Easy/Mod/Hard)   Cyclist (Easy/Mod/Hard)
MV3D [27]        Lidar + Mono   2.8    86.02 / 76.90 / 68.49    – / – / –                    – / – / –
F-PointNet [30]  Lidar + Mono   5.9    88.70 / 84.00 / 75.33    58.09 / 50.22 / 47.20        75.38 / 61.96 / 54.68
AVOD [28]        Lidar + Mono   12.5   86.80 / 85.44 / 77.73    42.51 / 35.24 / 33.97        63.66 / 47.74 / 46.55
AVOD-FPN [28]    Lidar + Mono   10     88.53 / 83.79 / 77.90    50.66 / 44.75 / 40.83        62.39 / 52.02 / 47.87
Complex-YOLO     Lidar          50.4   85.89 / 77.40 / 77.33    46.08 / 45.90 / 44.20        72.37 / 63.36 / 60.27

In terms of runtime and efficiency, Complex-YOLO consistently outperforms all competitors while achieving comparable accuracy. It runs in approximately 0.02 s per frame on a Titan X GPU and is about five times faster than AVOD, even though AVOD uses the more powerful Titan Xp. Compared with the lidar-only VoxelNet, it is more than 10 times faster, and it runs 18 times faster than its slowest competitor, MV3D.
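The bounding-box-overlap criterion can be illustrated with a simple IoU computation. The sketch below uses axis-aligned BEV boxes for brevity; the official benchmark evaluates the overlap of rotated boxes, so this is only an approximation of the real metric.

def bev_iou_axis_aligned(box_a, box_b):
    """Bird's-eye-view IoU for axis-aligned boxes given as (x1, y1, x2, y2)."""
    xa1, ya1, xa2, ya2 = box_a
    xb1, yb1, xb2, yb2 = box_b
    inter_w = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    inter_h = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter = inter_w * inter_h
    union = (xa2 - xa1) * (ya2 - ya1) + (xb2 - xb1) * (yb2 - yb1) - inter
    return inter / union if union > 0 else 0.0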

2.5.9

3D Object Detection

The evaluation results of the 3D object detection performance are given in Table 2.4. Since the height information is not estimated directly through regression, this benchmark is run with a fixed spatial height position extracted from the ground truth, similar to MV3D. Furthermore, each object is assigned a predefined height computed as the average height of the objects of its class. This naturally reduces the accuracy for all classes but confirms the good results measured on the bird's-eye view detection benchmark.
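The height injection described above can be sketched as follows; the per-class heights and the ground height are placeholder values, since the text only states that class-wise averages from the ground truth are used.

# Average object heights per class (metres). These numbers are illustrative
# placeholders, not the averages computed from the KITTI ground truth.
CLASS_HEIGHT = {"Car": 1.5, "Pedestrian": 1.7, "Cyclist": 1.7}
GROUND_Z = 0.0  # fixed spatial height of the ground plane (assumption)

def lift_bev_box_to_3d(bev_box, cls):
    """Extend a BEV box (x, y, w, l, yaw) to a 3D box using a fixed class height."""
    x, y, w, l, yaw = bev_box
    h = CLASS_HEIGHT[cls]
    z = GROUND_Z + h / 2.0  # box centre sits half a height above the ground
    return (x, y, z, w, l, h, yaw)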

2.6

Future Research Direction

Further research to improve the performance of 3D object detection in autonomous vehicle environments should be considered. Based on the significant performance differences between 2D and 3D detectors and the gaps found in the literature, potential topics for future research are as follows:

1. Most research on 3D object detection focuses on improving the baseline performance of such methods. Although this is a promising target for research, little is known about the level of detection performance required for reliable driving applications. In this regard, a valid research opportunity is to investigate the relationship between detection performance, as measured by relevant key performance indicators (KPIs), and driving safety.

2. The latest progress in PointNets can be explored to verify their adaptability to missing points and occlusions, which are still the main reasons for poor performance on difficult samples. More specifically, geometric relationships between points can be exploited to obtain information that cannot be obtained by considering each point individually.

3. Currently, many methods use sensor fusion to improve the reliability of the perception system. Considering the differences in point density, possible contributions include collaborative perception methods in multiagent fusion schemes. Vehicles can use V2X or LTE communication technologies to share relevant perception information, which can improve and expand the visibility of the environment, thereby reducing uncertainty and improving the performance of perception methods.

Table 2.4 Evaluation results of 3D object detection performance [31]

Method           Modality       FPS    Car (Easy/Mod/Hard)      Pedestrian (Easy/Mod/Hard)   Cyclist (Easy/Mod/Hard)
MV3D [27]        Lidar + Mono   2.8    71.09 / 62.35 / 55.12    – / – / –                    – / – / –
F-PointNet [30]  Lidar + Mono   5.9    81.20 / 70.39 / 62.19    51.21 / 44.89 / 40.23        71.96 / 56.77 / 50.39
AVOD [28]        Lidar + Mono   12.5   73.59 / 65.78 / 58.38    38.28 / 31.51 / 26.98        60.11 / 44.90 / 38.80
AVOD-FPN [28]    Lidar + Mono   10     81.94 / 71.88 / 66.38    46.35 / 39.00 / 36.58        59.97 / 46.12 / 42.36
Complex-YOLO     Lidar          50.4   67.72 / 64.00 / 63.01    41.79 / 39.70 / 35.92        68.17 / 58.32 / 54.30

4. An important limitation of the KITTI dataset is that it contains mostly daylight scenes under standard weather conditions. Although the authors report testing at night and in heavy snow, only qualitative results are given. Further studies should evaluate the impact of such conditions on object detection and how to achieve reliable performance under them.

5. The reported running times show that most methods achieve less than 10 fps, the minimum rate required to keep up with the lidar frame rate in real time. Major improvements are needed to obtain recognition systems that operate quickly and reliably in real environments.

6. Most methods fail to output calibrated confidence estimates for their predictions, which may lead to dangerous behavior in real situations. The seminal work in [22] identified this gap and proposed a method to quantify the uncertainty of detection models, but it does not achieve real-time performance. More research should be done in this area to understand the sources of uncertainty and how to mitigate them.

References

1. E. Arnold, O. Y. Al-Jarrah, M. Dianati, S. Fallah, D. Oxtoby and A. Mouzakitis, "A survey on 3D object detection methods for autonomous driving applications," IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 10, pp. 3782–3795, Oct. 2019.
2. S. Kuutti, S. Fallah, K. Katsaros, et al., "A survey of the state-of-the-art localization techniques and their potentials for autonomous vehicle applications," IEEE Internet Things J., vol. 5, no. 2, pp. 829–846, Apr. 2018.
3. Velodyne HDL-64E LiDAR specification, Apr. 10, 2018.
4. Y. Park, S. Yun, C. S. Won, et al., "Calibration between color camera and 3D LiDAR instruments with a polygonal planar board," Sensors, vol. 14, no. 3, pp. 5333–5353, 2014.
5. R. Ishikawa, T. Oishi, K. Ikeuchi, "LiDAR and camera calibration using motion estimated by sensor fusion odometry," Apr. 2018. https://arxiv.org/abs/1804.05178.
6. A. Geiger, P. Lenz, R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 3354–3361.
7. A. Gaidon, Q. Wang, Y. Cabon, et al., "Virtual worlds as proxy for multi-object tracking analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4340–4349.
8. H. Xu, Y. Gao, F. Yu, et al., "End-to-end learning of driving models from large-scale video datasets," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3530–3538.
9. A. Dosovitskiy, G. Ros, F. Codevilla, et al., "CARLA: an open urban driving simulator," in Proc. 1st Conf. Robot Learn. (CoRL), Nov. 2017, pp. 1–16.
10. M. Müller, V. Casser, J. Lahoud, et al., "Sim4CV: a photorealistic simulator for computer vision applications," Int. J. Comput. Vis., vol. 126, no. 9, pp. 902–919, Sep. 2018.
11. X. Chen, K. Kundu, Z. Zhang, et al., "Monocular 3D object detection for autonomous driving," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2147–2156.
12. R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Washington, DC, USA, Dec. 2015, pp. 1440–1448. https://doi.org/10.1109/ICCV.2015.169.


13. X. Chen, K. Kundu, Y. Zhu, "3D object proposals for accurate object class detection," in Advances in Neural Information Processing Systems, 2015, pp. 424–432.
14. Y. Xiang, W. Choi, Y. Lin, et al., "Data-driven 3D voxel patterns for object category recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1903–1911.
15. Y. Xiang, W. Choi, Y. Lin, et al., "Subcategory-aware convolutional neural networks for object proposals and detection," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2017, pp. 924–933.
16. F. Chabot, M. Chaouch, J. Rabarisoa, et al., "Deep MANTA: a coarse-to-fine many-task network for joint 2D and 3D vehicle analysis from monocular image," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1827–1836.
17. A. Mousavian, D. Anguelov, J. Flynn, et al., "3D bounding box estimation using deep learning and geometry," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5632–5640.
18. D. Payen, A. A. Abarghouei, T. P. Breckon, "Eliminating the blind spot: adapting 3D object detection and monocular depth estimation to 360° panoramic imagery," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 812–830.
19. B. Li, T. Zhang, T. Xia, "Vehicle detection from 3D LiDAR using fully convolutional network," in Proc. Robotics: Science and Systems XII, Ann Arbor, MI, USA, Jun. 2016.
20. J. Beltrán, C. Guindel, F. M. Moreno, et al., "BirdNet: a 3D object detection framework from LiDAR information," May 2018.
21. M. Simon, S. Milz, K. Amende, et al., "Complex-YOLO: real-time 3D object detection on point clouds," Mar. 2018.
22. D. Feng, L. Rosenbaum, K. Dietmayer, "Towards safe autonomous driving: capture uncertainty in the deep neural network for LiDAR 3D vehicle detection," 2018.
23. B. Li, "3D fully convolutional network for vehicle detection in point cloud," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Sep. 2017, pp. 1513–1518.
24. M. Engelcke, D. Rao, D. Z. Wang, et al., "Vote3Deep: fast object detection in 3D point clouds using efficient convolutional neural networks," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2017, pp. 1355–1361.
25. R. Q. Charles, H. Su, M. Kaichun, et al., "PointNet: deep learning on point sets for 3D classification and segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2017, pp. 77–85.
26. J. Schlosser, C. K. Chow, Z. Kira, "Fusing LiDAR and images for pedestrian detection using convolutional neural networks," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2016, pp. 2198–2205.
27. X. Chen, H. Ma, J. Wan, et al., "Multiview 3D object detection network for autonomous driving," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6526–6534.
28. J. Ku, M. Mozifian, J. Lee, et al., "Joint 3D proposal generation and object detection from view aggregation," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), 2018, pp. 1–8.
29. C. R. Qi, W. Liu, C. Wu, et al., "Frustum PointNets for 3D object detection from RGB-D data," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 918–927.
30. X. Du, M. H. Ang, S. Karaman, et al., "A general pipeline for 3D detection of vehicles," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), Brisbane, QLD, Australia, May 2018, pp. 3194–3200.
31. M. Simon, S. Milz, K. Amende, et al., "Complex-YOLO: real time 3D object detection on point clouds," arXiv, 2018.

Chapter 3

Lane Detection

Abstract With the development of high-speed computing devices and advanced machine learning theories such as deep learning, end-to-end detection algorithms can be used to solve the problem of lane detection in a more efficient way. However, the key challenge for lane detection systems is to adapt to the demands of high reliability and diverse road conditions. An efficient way to construct a robust and accurate advanced lane detection system is to fuse multimodal sensors and integrate the lane detection system with other object detection systems. In this chapter, we briefly review traditional computer vision solutions and mainly focus on deep learning-based solutions for lane detection. Additionally, we present a lane detection evaluation system, including offline and online evaluation. Finally, we use a lane detection algorithm and its code to show how lane detection works in an autonomous driving system.

Traffic accidents are mainly caused by human error, for example, driver inattentiveness and misbehavior. To this end, many companies and institutions have put forward measures and technologies to ensure driving safety and reduce traffic accidents. Among these technologies, road perception and lane marking detection are crucial in helping drivers avoid mistakes. Lane detection is the basis of many advanced driver assistance systems, such as lane departure warning and lane keeping assist systems. Successful ADAS and automotive companies such as Mobileye, BMW, and Tesla have developed their own lane detection and lane keeping products and have achieved remarkable results in both R&D and practical applications. Both automotive companies and individual customers widely use Mobileye ADAS products and Tesla Autopilot for autonomous driving. Almost all of the current mature lane assist products use vision-related technologies: lane markings are painted on the road so that they can be perceived visually, and vision-based technology detects lanes from camera images and prevents drivers from making unintended lane changes. Therefore, accuracy and robustness are the two most important properties of a lane detection system. A lane detection system should be able to recognize unreasonable detections and adjust its detection and tracking algorithms accordingly. When a false alarm occurs, the ADAS should warn the driver that they

should focus on the driving task. On the other hand, a vehicle with a high level of automation continuously monitors its environment and should be able to handle low-accuracy detections. Therefore, as the degree of vehicle automation increases, the evaluation of lane detection systems becomes more important. Most vision-based lane detection systems are designed around image processing techniques within a similar framework. With the development of high-speed computing devices and advanced machine learning theories such as deep learning, however, end-to-end detection algorithms can solve the lane detection problem more efficiently. The key challenge for lane detection systems remains adapting to the demands of high reliability under diverse road conditions. An efficient way to construct a robust and accurate advanced lane detection system is to fuse multimodal sensors and to integrate the lane detection system with other object detection systems; for example, it should be able to detect surrounding vehicles and identify road areas. Existing R&D work has shown that lane detection performance can be improved using such multimodule integration techniques. However, high-precision sensors such as lidar and radar are expensive. More details can be found in Ref. [1].

3.1

Traditional Image Processing

Vision-based lane detection can be roughly divided into two categories: feature-based and model-based. Depending on the feature extraction method, lane marking features (lane color, texture, edges, etc.) can be detected. For example, in [2], complex lane edge features are detected using the Sobel operator, and the road image is divided into several subregions along the vertical direction. Suddamalla et al. [3] detected curved and straight lanes using pixel intensities and edge information and extracted lanes using adaptive thresholding techniques. To remove camera perspective distortion in the images and extract real lane features, a perspective transformation can be used to detect lane markings efficiently. Collado et al. [4] created a bird's-eye view of road images and proposed an adaptive lane detection and classification method based on spatial lane features and the Hough transform algorithm. A bird's-eye view and a particle filter method based on lane features are combined for multilane detection and estimation [5]. In addition to color images, images in other color formats can also be used to detect lanes. The general idea of color format conversion is that the yellow and white lanes become more vivid in other color domains, which increases the contrast. In [6], an extended edge linking algorithm is used to detect lane edges during the lane hypothesis stage; lane pixels, edge directions, and lane marking widths in YUV format are then used in the lane verification step to select candidate edge pairs. In [7], adaptive classifiers are used to identify lanes: the color image is first converted to HSV format to increase its contrast, and a threshold on the luminance value is then used to generate a binary feature image.
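A compact sketch of this classical pipeline, combining an inverse perspective mapping with Sobel-based edge features and adaptive thresholding, is given below. The source and destination points of the transform are assumed to come from camera calibration; all parameter values are illustrative.

import cv2
import numpy as np

def lane_feature_map(bgr_image, src_pts, dst_pts, out_size=(400, 600)):
    """Edge-based lane features on a bird's-eye view (a simplified sketch)."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    # Inverse perspective mapping removes the camera perspective distortion
    M = cv2.getPerspectiveTransform(np.float32(src_pts), np.float32(dst_pts))
    birdeye = cv2.warpPerspective(gray, M, out_size)
    # The horizontal gradient responds strongly to near-vertical lane markings
    sobel_x = np.abs(cv2.Sobel(birdeye, cv2.CV_64F, 1, 0, ksize=3))
    sobel_x = np.uint8(255 * sobel_x / (sobel_x.max() + 1e-6))
    # Adaptive thresholding keeps bright, locally contrasting pixels
    return cv2.adaptiveThreshold(sobel_x, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                 cv2.THRESH_BINARY, 15, -5)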


While color transforms can benefit lane detection in some normal cases, they are not robust and have limited ability to handle shadows and lighting changes. Borkar [8] et al. proposed a hierarchical approach to detect lanes at night. A blurring technique is used to reduce video noise, and the binary image is generated according to an adaptive local thresholding method. Lane representation in the frequency domain algorithm [9] exploits the lane features in the frequency domain. In the frequency domain, the algorithm captures lane intensities and orientations and uses deformable templates to detect lane markings. In [10], a spatiotemporal lane detection algorithm was introduced. It is characterized by accumulating certain row pixels from past frames to generate a series of spatiotemporal images, which are then applied to the Hough transform of the synthesized image to detect lanes. In [11], a real-time lane detection system based on FPGA and DSP was designed according to the magnitude characteristics of lane gradients and the improved Hough transform. Ozgunalp and Dahnoun [12] proposed an improved feature map and used it for lane detection. First, the lane orientation histogram is determined using the edge orientation; then, the feature map is improved and shifted based on the lane orientation estimation. In general, feature-based methods are more computationally efficient and can detect lanes accurately when the lane markings are clear. However, due to too many assumptions, such as lane color and shape, it suffers from robustness and poor visibility conditions compared to model-based methods. For model-based methods, it is generally assumed that they can use a specific model (e.g., linear, parabolic, or various spline models, etc.) to describe the lanes. Additionally, some assumptions about roads and lanes are required (e.g., flat ground). Among these models, the spline model is popular in previous research work. These models are flexible enough to restore curved lanes of any shape. Wang et al. [13] used different spline models to fit the lanes and used Catmull-Rom splines to model the lanes in the image. In [14], the lane model was improved to generate a B-snake model that can model arbitrary shapes by changing the control points. In [15], a combination of the Hough transform based on the near-field region and the river method in the far-field region was used to detect the boundaries of the lanes. Finally, using a B-spline model, the lanes are modelled and tracked using a Kalman filter. Jung and Kelber [16] describe the lanes with a linear parabolic model and classify lane types based on estimates of lane geometry. Aly [17] proposed a multilane fitting method based on the Hough transform, RANSAC and B-spline models. First, the initial lane positions are roughly detected using the Hough transform and then improved by RANSAC and B-spline models. In addition, a dataset of manually marked lanes (called the Caltech Lane dataset) is introduced. The RANSAC algorithm is currently the most popular method for the iterative estimation of lane model parameters. In [18], both the linear lane model and RANSAC are used to detect the lanes, while the Kalman filter is used to correct the noisy output. In [19], ridge features and RANSAC suitable for straight and curved lane fitting were proposed. A lane pixel ridge depends on local structure rather than contrast, so it is defined as the centerline of a bright structure in a region in a grayscale image. 
In [20, 21], the hyperbolic model and RANSAC are used for


lane fitting. In [21], the input image is divided into two parts, which are called the far-field region and the near-field region. In the near-field region, the lanes are considered straight lines detected using the Hough transform algorithm. In the far-field region, the lanes are assumed to be curved and fitted using a hyperbolic model and RANSAC. In [22], a conditional random field method was proposed to detect lane markings in urban areas. Bounini [23] et al. introduced a method for lane boundary detection in a simulated environment. Among them, the least squares method is used to fit the model, and by determining the dynamic region of interest, the computational overhead is reduced. In [24], an automatic multi-segment lane switching scheme and RANSAC lane fitting method were proposed, applying the RANSAC algorithm to fit lines according to edge images. The lane switching scheme was used to determine the lane curvature, and from the straight and curved models, the correct lane model was selected to fit the lane. In [25], a Gabor wavelet filter was applied to estimate the orientation of each pixel and match the second-order geometric lane model. Niu [26] et al. proposed a novel curve fitting algorithm for lane detection. It has a two-stage feature extraction algorithm and applies a density-based spatial clustering algorithm with noise to determine whether the candidate lane segment belongs to its own lane; the curve model can be used to fit the determined small lane segment. This method is particularly effective for small lane segment detection tasks. In general, model-based methods are more robust than feature-based methods due to the use of model fitting techniques. This model can generally ignore noisy measurements and abnormal pixels for lane markings. However, model-based methods usually require more computational overhead because RANSAC has no upper limit on the number of iterations. Furthermore, model-based methods are less easy to implement than feature-based systems.
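As an illustration of a simple model-based fit, the sketch below fits a linear-parabolic style lane model to candidate lane pixels with ordinary least squares; wrapping it in a RANSAC loop, as in the methods above, would add robustness to outliers.

import numpy as np

def fit_parabolic_lane(xs, ys):
    """Fit the lane model x = a*y^2 + b*y + c to candidate lane pixels.

    xs and ys are image coordinates of candidate lane pixels (e.g., from an
    edge or ridge detector); fitting x as a function of y suits lanes that
    are roughly vertical in the image.
    """
    a, b, c = np.polyfit(ys, xs, deg=2)
    return a, b, c

def eval_lane(coeffs, y):
    """Evaluate the fitted lane model at image row y."""
    a, b, c = coeffs
    return a * y**2 + b * y + c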

3.2

Example: Lane Detection Based on the Hough Transform

First, some basic concepts about the Hough transform are introduced below, and then the Hough transform from the open CV is used to find the lane.

3.2.1

Hough Transform

The Hough transform was originally just a technique for detecting lines in images, and now it has been widely used to detect 2D and 3D curves. The execution steps of its algorithm are as follows:


1. Corner or edge detection: for example, using Canny, Sobel, or adaptive thresholding, the resulting binary/gray image has 0 for non-edge pixels and 1 (or higher) for edge pixels. This is the input image.
2. ρ and θ range creation: the range of ρ is [-max_dist, max_dist], where max_dist is the diagonal length of the input image; the range of θ is from -90° to 90°.
3. Accumulator creation: a 2D array (the Hough space) whose rows correspond to ρ values and whose columns correspond to θ values.
4. Voting in the accumulator: for each edge point and each θ value, find the nearest ρ value and increment that index in the accumulator. Each element indicates how many points/pixels contribute "votes" to the corresponding candidate line segment.
5. Peak finding: local maxima in the accumulator indicate the parameters of the most prominent lines in the input image. Peaks are easily found by applying a threshold or a relative threshold (equal to or greater than some fixed percentage of the global maximum).

The following code is used to detect lines:

Code 3.1 Detect Lines Using the Hough Transform

import numpy as np

def hough_line(img):
    # Rho and Theta ranges
    thetas = np.deg2rad(np.arange(-90.0, 90.0))
    width, height = img.shape
    diag_len = int(np.ceil(np.sqrt(width * width + height * height)))  # max_dist
    rhos = np.linspace(-diag_len, diag_len, diag_len * 2)
    # Cache some reusable values
    cos_t = np.cos(thetas)
    sin_t = np.sin(thetas)
    num_thetas = len(thetas)
    # Hough accumulator array of theta vs rho
    accumulator = np.zeros((2 * diag_len, num_thetas), dtype=np.uint64)
    y_idxs, x_idxs = np.nonzero(img)  # (row, col) indexes of edge pixels
    # Vote in the Hough accumulator
    for i in range(len(x_idxs)):
        x = x_idxs[i]
        y = y_idxs[i]
        for t_idx in range(num_thetas):
            # Calculate rho; diag_len is added to obtain a positive index
            rho = int(round(x * cos_t[t_idx] + y * sin_t[t_idx])) + diag_len
            accumulator[rho, t_idx] += 1
    return accumulator, thetas, rhos

The following code shows how to call the above function:

Code 3.2 How to Call the Hough Transform to Detect the Lines

# Create a binary test image and call hough_line
image = np.zeros((50, 50))
image[10:40, 10:40] = np.eye(30)
accumulator, thetas, rhos = hough_line(image)
# Simplest peak finding based on the maximum number of votes
idx = np.argmax(accumulator)
rho = rhos[idx // accumulator.shape[1]]
theta = thetas[idx % accumulator.shape[1]]
print("rho={0:.2f}, theta={1:.0f}".format(rho, np.rad2deg(theta)))

3.2.2

Lane Detection

The example given below mainly uses the cv2.HoughLinesP function provided by OpenCV to perform line detection.

Code 3.3 cv2.HoughLinesP Function Provided by OpenCV

# hough_lines_detection and Line are helper utilities assumed to be defined
# elsewhere in the accompanying project.
def get_lane_lines(color_image, solid_lines=True):
    """
    This function takes as input a color road frame and tries to infer
    the lane lines in the image.
    :param color_image: input frame
    :param solid_lines: if True, only selected lane lines are returned.
                        If False, all candidate lines are returned.
    :return: list of (candidate) lane lines.
    """
    # resize to 960 x 540
    color_image = cv2.resize(color_image, (960, 540))
    # convert to grayscale
    img_gray = cv2.cvtColor(color_image, cv2.COLOR_BGR2GRAY)
    # perform Gaussian blur
    img_blur = cv2.GaussianBlur(img_gray, (17, 17), 0)
    # perform edge detection
    img_edge = cv2.Canny(img_blur, threshold1=50, threshold2=80)
    # perform Hough transform
    detected_lines = hough_lines_detection(img=img_edge,
                                           rho=2,
                                           theta=np.pi / 180,
                                           threshold=1,
                                           min_line_len=15,
                                           max_line_gap=5)
    # convert (x1, y1, x2, y2) tuples into Lines
    detected_lines = [Line(l[0][0], l[0][1], l[0][2], l[0][3])
                      for l in detected_lines]
    # if 'solid_lines', infer the two lane lines
    if solid_lines:
        candidate_lines = []
        for line in detected_lines:
            # consider only lines with slope between 30 and 60 degrees
            if 0.5 <= np.abs(line.slope) <= 2:  # assumed completion of the truncated slope filter
                candidate_lines.append(line)

3.3

Example: RANSAC Algorithm and Fitting Straight Line

The closing lines of the RANSAC line-fitting example plot the best line estimate (with parameters best_a and best_b found by the algorithm) over the sampled points:

# Use the best estimate we obtained to draw the graph
Y = best_a * RANDOM_X + best_b
# Line graph
ax1.plot(RANDOM_X, Y)
text = "best_a = " + str(best_a) + "\nbest_b = " + str(best_b)
plt.text(5, 10, text, fontdict={'size': 8, 'color': 'r'})
plt.show()

3.4

Based on Deep Learning

Although conventional methods based on image processing can usually be used to detect lane markings, some researchers are still working on using novel machine learning and deep learning methods to detect lane markings. Due to the development of deep learning, parallel computing and big data technology, deep learning technology has become one of the hottest research fields in the past decade. Compared with traditional methods, many deep learning algorithms have shown great advantages in computer vision tasks, and due to their greatly improved detection and recognition performance, convolutional neural networks (CNNs) are the most popular methods for object recognition research. CNNs offer some impressive properties, such as high detection accuracy, automatic feature learning, and “end-to-end” recognition. Recently, some researchers have successfully applied CNNs and other deep learning techniques to lane detection. It is reported that the accuracy of lane detection is greatly improved from 80% to 90% using the CNN model compared to traditional image processing methods. In [27], it is proposed to transform the lane detection problem into an instance segmentation problem, where each lane forms its own instance, which can be trained “end-to-end”. To parameterize the segmented lane instances prior to fitting the lanes,


it is further proposed to apply a learned, image-conditioned perspective transform, in contrast to a fixed "bird's-eye view" transform. In this way, a robust lane fit is ensured, unlike with existing methods that rely on a fixed, predetermined transformation. In summary, they present a fast lane detection algorithm running at 50 fps that can handle a variable number of lanes and cope with lane changes. The proposed method is validated on the TuSimple dataset, and good results are obtained. In [28], a distillation method, so-called "self-attention distillation", is proposed, which enables the model to learn from itself and obtain substantial improvements without any additional supervision or labels. Valuable contextual information can be used as a form of "free" supervision for further representation learning by performing top-down and layerwise attention refinement within the network. Self-attention distillation can be easily incorporated into any feed-forward convolutional neural network (CNN) without increasing the inference time. Self-attention distillation is validated with lightweight models such as ENet and ResNet-18 on three popular lane detection benchmarks (TuSimple, CULane and BDD100K). The lightest model, ENet-SAD, can surpass existing algorithms. Notably, compared to the state-of-the-art SCNN [16], ENet-SAD has 20× fewer parameters and runs 10× faster while still delivering excellent performance on all benchmarks. An open-source implementation can be found at https://github.com/cardwing/Codes-for-Lane-Detection. As the name suggests, lane detection is used to detect lanes on the road and provide the exact location and shape of each lane. It is one of the key technologies for realizing modern assisted and autonomous driving systems. However, several unique properties of lanes challenge detection methods: their lack of distinctive features makes lane detection algorithms easily confused by other objects with a similar local appearance. In addition, the varying number of lanes on the road and the different lane line patterns, such as solid lines, dashed lines, single lines, double lines, merged lines, and split lines, further affect performance. In [29], a deep neural network-based method called LaneNet is proposed, which divides lane detection into two stages: lane edge proposal and lane line localization. In the first stage, a lane edge proposal network classifies pixels as lane edges, and in the second stage, a lane line localization network detects lane lines based on the lane edge proposals. Note that LaneNet's goal is to detect only lane lines, which makes it harder to suppress false detections of lane-like markings on roads, such as arrows and characters. Nonetheless, the lane detection method shows strong performance in both highway and urban road scenes without relying on any assumptions about the number of lanes or the lane line patterns. High operating speed and low computational cost enable LaneNet to be deployed in vehicle-based systems. Experimental results show that LaneNet consistently provides excellent performance in real-world traffic situations. An open-source code reference is available at https://paperswithcode.com/paper/lanenet-real-time-lanedetection-networks-for#code. In [30], a unified "end-to-end trained" multitask network is proposed. The network enables lane and road marking detection and recognition in adverse weather


conditions, including rain and low light, guided by vanishing points. This setting has not been studied extensively thus far because of its obvious challenges. For example, images taken on rainy days have low illumination levels, while wet roads cause reflections and distort the appearance of lanes and road markings. At night, color distortion can occur under limited lighting. As a result, benchmark datasets barely exist, and only a few algorithms that work in harsh weather conditions have been put into practical operation. To address this problem, a lane and road marking benchmark was established consisting of approximately 20,000 images with 17 lane and road marking categories under four different conditions: no rain, rain, heavy rain and nighttime. The authors train and evaluate multiple versions of the multitask network proposed in [30] and verify the importance of each task. The final method, VPGNet, can detect and classify lane and road markings and predict vanishing points in real time (20 fps) under various conditions. Convolutional neural networks (CNNs) are usually built by stacking convolutional operations layer by layer. Although CNNs have demonstrated a powerful ability to extract semantics from raw pixels, their ability to capture spatial relationships between pixels across the rows and columns of an image has not been fully explored. These relationships are important for learning semantic objects with a strong shape prior but weak appearance coherence; for example, traffic lanes are often occluded or even not painted on the road surface. In [31], a spatial CNN (SCNN) is proposed, in which the traditional deep layer-by-layer convolution is generalized to slice-by-slice convolution within a feature map, so that messages can be passed between pixels across the rows and columns of a layer. Such an SCNN is particularly suitable for long, continuously shaped structures or large objects that have strong spatial relationships but few appearance cues, such as traffic lanes, poles and walls. The SCNN is applied to the newly released, extremely challenging traffic lane detection dataset (CULane) and to Cityscapes. The results show that the SCNN can learn the spatial relationships of structured outputs and significantly improve performance. Meanwhile, the SCNN also outperforms the recurrent neural network-based ReNet and MRF + CNN (MRFNet) on the lane detection dataset by 8.7% and 4.6%, respectively. It achieved first place with 96.53% accuracy in the TuSimple lane detection benchmark challenge. In conclusion, machine learning algorithms significantly improve the accuracy of lane detection and provide many effective detection architectures and techniques. Although these systems typically require more computational overhead and large amounts of training data, they are superior to conventional methods. Therefore, many new, efficient and robust lane detection methods with lower training and computational requirements can be expected in the near future.
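Many of the segmentation-based detectors above share a simple post-processing idea: sample image rows and keep the centres of activated pixel runs as lane points. The sketch below illustrates only that step; instance assignment, which methods such as LaneNet or SCNN handle with learned components, is omitted, and all parameters are assumptions.

import numpy as np

def mask_to_lane_points(prob_map, threshold=0.5, row_stride=10):
    """Turn a per-pixel lane probability map into sparse lane points."""
    points = []
    for row in range(0, prob_map.shape[0], row_stride):
        cols = np.where(prob_map[row] > threshold)[0]
        if cols.size == 0:
            continue
        # split the activated columns into contiguous runs (one per lane crossing)
        runs = np.split(cols, np.where(np.diff(cols) > 1)[0] + 1)
        points.extend((int(run.mean()), row) for run in runs)
    return points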

3.5

The Multi-Sensor Integration Scheme

Sensor fusion greatly improves the performance of lane detection. This is due to the use of more sensors and enhanced perception capabilities. Using multiple cameras (including monocular and stereo cameras) and combining multiple cameras with different fields of view is the most common method to enhance lane detection. In [32], a dense vanishing point detection method for lane detection using stereo cameras is proposed. Bertozzi and Broggi [33] proposed a general obstacle and lane detection system for obstacle and lane detection based on stereo cameras and IPM images. The system has been tested on roads of more than 3000 km and has shown robustness. In [34], three wide-angle cameras and one telephoto lens camera are combined and sampled at 14 Hz. Convert the original image to HSV format and perform IPM. In [35], a surround-view surveillance system with four naked-eye cameras and a single front-view camera is used for lane detection and vehicle localization. The advantage of using a surround view monitoring system is that a picture of the complete top view of the vehicle can be generated. Among them, the front view, surrounding view and rear view of the vehicle are included in a single image. Not only can the camera equipment be used alone, but the lane detection system can also be implemented by combining the camera with GPS and radar. In particular, radar is used for road boundary detection in poor lighting. Jung [36] et al. proposed an adaptive lane detection method based on the region of interest, aiming to design an integrated adaptive cruise control and lane keeping assist system. Distance data from adaptive cruise control will be used to determine dynamic regions of interest and improve the accuracy of monocular vision-based lane detection systems. Lane detection systems are designed using conventional methods, which include marginal distribution functions, steerable filters, model fitting and tracking. If a nearby vehicle is detected using the distance sensor, all edge pixels will be eliminated to enhance lane detection. The final experimental results show that identifying nearby vehicles based on distance data can improve the lane detection accuracy and simplify the detection algorithm. In [37], a GPS and vision system-based positioning system for autonomous vehicles is proposed. First, prior information such as road shape is extracted from GPS, which is then used to refine the lane detection system. Finally, the proposed method is extensively evaluated and found to be reliable under varying road conditions. In [38], a comprehensive lane detection system in a structured highway scene is proposed. The curvature of the road is initially determined using GPS and digital maps, and then the two lanes are designed for straight lanes and curves, respectively, in the detection module. Schreiber [39] et al. introduced a lane marking-based positioning system that uses a stereo camera to detect lane markings and curbs and a system for vehicle positioning by integrating the Global Navigation Satellite System (GNSS), high-precision maps, and stereo vision systems. In rural areas, the accuracy can be as high as a few centimeters.


In [40], an integrated lane departure warning system using GPS, inertial sensors, high-precision maps, and vision systems is presented. Vision-based lane departure systems are susceptible to varying road conditions and weather; using a sensor fusion scheme can increase the stability of the lane detection system and make it more reliable. In addition, vision-based lane detection systems and accurate digital maps help reduce GPS-induced position errors, leading to more accurate vehicle positioning and lane keeping. Lidar is another widely used sensor and was adopted by most autonomous vehicles in the major DARPA challenges, mainly because of its high precision and strong sensing capabilities. In the point cloud provided by lidar, road markings have higher reflectivity than the surrounding road surface, so lane markings can be detected from the highly reflective points on the road. Lidar uses multichannel laser beams to scan the surrounding surfaces and generate 3D images. Therefore, compared to vision-only systems, lidar-and-vision integrated lane detection systems are more accurate and more robust to shadows and lighting changes. Shin et al. [41] proposed a lane detection system using cameras and lidar. Its algorithms include ground road extraction, lane detection with multimodal data, and the combination of lane information. In practical experiments, the proposed method shows high detection accuracy (up to 90%). While camera- and lidar-based approaches can handle curved lanes, shadows, and lighting issues, both require complex co-calibration of the multimodal sensors. Amaradi et al. [42] proposed a vehicle tracking and obstacle detection system using cameras and lidar. First, lanes are detected using the Hough transform, and lidar is used to detect obstacles and measure the distance between the ego vehicle and obstacles ahead to plan unobstructed driving areas. In [43], a fusion system consisting of multiple cameras and lidars is proposed to detect lane markings in urban areas. The test vehicle was reportedly the only vehicle to use a vision-based lane detection algorithm during the final phase of the DARPA Urban Challenge. The system detects multiple lanes and then estimates and tracks the centerline. First, lidar and camera calibration are combined to detect road paint and curbs. Then, lidar is used to reduce the false alarm rate by detecting obstacles and drivable road areas. Other sensors, such as vehicle CAN bus sensors and inertial measurement units (IMUs), are also commonly used to build complete vehicle perception systems. While lidar-based lane detection systems may be more accurate than other systems, the cost is still prohibitive for mass production. Therefore, recent research has tended to fuse sensors such as GPS, digital maps, and cameras, which are already used in commercial vehicles, to design reliable lane detection and driver assistance systems.
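The reflectivity-based idea can be illustrated with a few lines of NumPy: keep lidar returns that are close to the ground and unusually reflective. The ground height and thresholds below are illustrative assumptions; real systems estimate the ground plane first.

import numpy as np

def lane_points_from_lidar(points, intensity, ground_z=-1.7, z_tol=0.3,
                           intensity_thresh=0.4):
    """Select candidate lane-marking returns from a lidar scan.

    points is an (N, 3) array of x, y, z coordinates and intensity the
    per-point reflectivity in [0, 1]. Lane paint reflects more strongly than
    asphalt, so near-ground points with high intensity are kept as candidates.
    """
    on_ground = np.abs(points[:, 2] - ground_z) < z_tol
    reflective = intensity > intensity_thresh
    return points[on_ground & reflective]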

3.6

Lane Detection Evaluation Criteria

Due to the lack of ground truth data, most previous lane detection studies have used visual verification to evaluate system performance, and only a few researchers have performed quantitative performance analysis and evaluation. This is because lane detection evaluation is a complex task, detection methods may vary by hardware and


Fig. 3.1 Lane evaluation system

algorithms, and there are no general metrics for comprehensive evaluation of lane detection algorithms. Because road and lane conditions vary significantly from country to country, there is no guarantee that a lane detection system will be accurate in one place. Some detection algorithms may even show significantly different detection results at different times of day and night. It is also unfair that a system based on monocular vision is inferior to a system with fusion of vision and lidar. Therefore, it is necessary to evaluate the performance of the lane detection system. It should be noted that important indicators of lane detection performance are driving safety issues and robustness. In this section, the evaluation methods used can be divided into two categories: online evaluation and offline evaluation. Online evaluation can be regarded as the process of calculating detection confidence in real time. The main evaluation architecture is shown in Fig. 3.1. As mentioned earlier, common vision-based lane detection systems can be roughly divided into three parts, namely, preprocessing, lane detection and tracking. Therefore, evaluation can be applied to these three parts, and the performance of these modules can be evaluated separately. In the following, the influencing factors affecting the performance of the lane detection system will be summarized first; then, the online and offline evaluation methods used in past research and other literature will be introduced. Finally, evaluation metrics are discussed.

3.6.1

Lane Detection System Factors

Previously, the research on vision-based lane detection systems was different in terms of hardware, algorithm and application. Some systems focus on highways, while others are used in urban areas. In urban road areas, the use of an accurate freeway guidance lane detection system cannot be guaranteed, as more disturbances and dense traffic will be observed in these areas. Therefore, it is not possible to use a single evaluation method or metric to evaluate all existing systems. In Table 3.1,


Table 3.1 Factors affecting lane detection system performance

Factors                 Description
Lane and road factors   Crosswalks, parking lanes, lane colors, lane styles, road curvature, bad lane markings, complex road textures
Hardware factors        Camera type, camera calibration, camera mounting position, other sensors
Traffic factors         Curb strips, guardrails, surrounding vehicles, shadows, lighting issues, vibration
Weather factors         Cloudy, snowy, rainy, foggy

some important factors that may affect the performance of the lane detection system are listed. These factors, as well as the operating environment of the system, should be considered when conducting an unbiased evaluation and comparison of lane detection systems. Since different lane detection algorithms are designed and tested for different locations, different road and lane factors in different locations will affect the detection performance. Additionally, data recording devices, cameras, or other vision hardware can significantly impact other aspects of the lane detection system. For example, lane detection systems may have different resolutions and fields of view when using different cameras, which can affect detection accuracy. Finally, some traffic and weather factors may also lead to different lane detection performances. Many other factors reduce the lane detection performance and make performance vary with other systems. For example, some lane detection systems are tested in complex traffic environments where there are more disturbances (such as crosswalks or poor-quality lane markings), while others are tested on standard highways with few contributing factors in the environment. Therefore, the ideal approach is to use a common platform for algorithm evaluation, which is almost impossible in real life. Therefore, a mature evaluation system should try to consider the influencing factors and comprehensively evaluate the performance of the system. A potential solution to these problems is to use parallel vision architectures. Below, we discuss the performance evaluation system and the proposed reasonable metrics.

3.6.2

Offline Evaluation

In previous literature, offline evaluation has often been used. After the framework of the lane detection system is determined, the performance of the system is first evaluated using still images or video sequences. There are some public datasets on the Internet, such as KITTI Road and Caltech Road. The KITTI Road dataset, including 289 training images and 290 testing images, is divided into three categories. In the dataset, road and ego lane areas are marked. Typically, the evaluation is performed by using the ROC curve, which accounts for the pixel-level true and false detection rate. The Caltech Road dataset contains 1224 labelled single frames captured in four different road conditions. Both datasets focus on evaluating urban


road and lane detection performance. The main drawbacks of image-based evaluation methods are that they are poorly reflective of the real traffic environment, and the dataset also contains limited annotated test images. On the other hand, video datasets can also describe more information and reflect real traffic situations. However, more human resources are usually required to mark the real route. To this end, Borkar et al. [44] proposed a semiautomatic method to label lane pixels in video sequences. Among them, time slice images and interpolation methods are used to reduce the labelling workload. A time-segmented image is constructed by selecting the same rows from each video frame and rearranging the row pixels according to the order of the frames. This requires two or more timesliced images, and the accuracy of marking lanes is proportional to the number of images. For the lane marking task, it boils down to labelling points in time sliced images. From each time slice image, after selecting the ground truth for the markers, the interpolated marker lanes can be restored to the video sequence accordingly. By converting lane marking into several point marking tasks, the marking effort can be greatly reduced. Despite human-annotated real data, some researchers use synthetic methods to generate images of lanes with known location and curvature parameters in a simulator. Lopez et al. [45] used a MATLAB simulator to generate video sequences and real lane lines by creating a lane frame with known lane parameters and locations. This method can generate arbitrary road and lane models with any number of video frames. Using simulators to generate real lane lines is an effective way to evaluate lane detection systems under ideal road conditions. However, there are currently few driving simulators that are able to fully simulate real-world traffic environments. Therefore, after evaluation with the simulator, the detection performance must still be tested with real lane images or videos. Another approach is to evaluate the lane detection system by testing the system on a real test track to compare with the accurate lane positions provided by GPS and high-resolution maps [46].

3.6.3

Online Evaluation

Online evaluation systems combine road and lane geometry information and integrate it with other sensors to generate a detection confidence. Lane geometry constraints are a reliable metric for online evaluation: once the camera is calibrated and mounted on the vehicle, the road and lane geometry (such as the ego lane width) can be determined. In [47], the authors propose a real-time lane assessment method based on detected lane width measurements, where the detected lanes are validated against three criteria: the slope and intercept of the straight-line lane model, the predetermined road width, and the location of the vanishing point. The authors analyse the distribution of the lane model parameters and create a lookup table to determine the correctness of detections. Once the detected lane width exceeds a threshold, a re-estimation of the lane width constraint is triggered.
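A lane-width plausibility check of this kind can be sketched as follows; the expected width, tolerance, and confidence shaping are illustrative assumptions rather than the exact criteria of [47].

def lane_width_confidence(left_x, right_x, px_per_m, expected_w=3.5, tol=0.5):
    """Online plausibility check based on the detected ego-lane width.

    left_x and right_x are the detected lane-boundary positions (pixels)
    evaluated at the same image row, and px_per_m is the calibration-derived
    scale. Returns 1.0 for a fully plausible width, decaying towards 0.0
    as the detected width leaves the expected band.
    """
    width_m = abs(right_x - left_x) / px_per_m
    error = abs(width_m - expected_w)
    if error <= tol:
        return 1.0
    return max(0.0, 1.0 - (error - tol) / expected_w)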


In [48], the authors used the world coordinate measurement error, rather than the image coordinate error, to evaluate the detection accuracy. Simultaneously, using roadside-facing cameras, lane information is directly recorded, ground truth is generated, and vehicle positions within the lane are estimated. In [49], the authors calculate the real-time confidence based on different detection algorithms, giving a similarity measure of the results. The function of the evaluation module is usually to evaluate whether the positions of detected lanes from different algorithms are within a certain distance. If similar results are obtained, the detections are averaged, and a higher detection confidence is reported. However, this method requires simultaneous execution of both algorithms at each step, which increases the computational burden. In [43], the authors combined vision and lidar-based algorithms to build a confidence probability network and employed distance travelled to determine lane detection reliability. If the vehicle can safely travel at that distance, the system has a high consistency of estimates at a certain distance in front of the vehicle. In previous studies, other online assessment methods were also used. For example, the metric evaluated is the offset between the centerline and the lane boundary. In addition to using a single sensor, vision-based lane detection results can be evaluated using other sensors, such as GPS, LiDAR, and high-accuracy road models [43]. In [50], the authors introduced the vanishing point lane detection algorithm. First, the vanishing points of the road segments are detected according to the probabilistic voting method. The vanishing point, along with the line orientation threshold, is then used to determine the correct lane segment. To further reduce the false detection rate, the author adopts a real-time frame similarity model to evaluate the lane position consistency.

3.6.4

Evaluation Metrics

Existing research works mainly use visual evaluation or the simple detection rate as the evaluation metric because there is still no general performance metric to evaluate lane detection performance. Li [51] et al. designed a complete test scheme for intelligent vehicles. This solution is mainly aimed at the performance of the whole vehicle, not just the lane detection system, so five main requirements for the lane detection system are given: insensitive to shadows, suitable for unpainted roads, handling of curvy roads, satisfying lane parallel constraints and reliability measurements. Veit [52] proposed a “feature-level evaluation method based on a hand-labelled dataset”. The authors compare six different lane feature extraction algorithms. The authors concluded that a feature extraction algorithm that combines photometric and geometric features achieves the best results. McCall and Trivedi [53] examined the most important evaluation metrics to evaluate lane detection systems. They concluded that it was inappropriate to use detection rates only as a metric. Therefore, the authors suggest using three different metrics, including standard deviation of error, mean absolute error, and standard deviation of rate-of-change error. Satzoda and


Trivedi [54] introduced five metrics to measure different properties of lane detection systems and examine the trade-off between accuracy and computational efficiency. The five metrics include lane feature accuracy, ego vehicle localization, lane position bias, computational efficiency and accuracy, and a measure of bias accumulation over time. Among these metrics, the accumulation of deviations over time helps to determine the maximum safe time and can be used to assess whether the proposed system meets the critical response time of an ADAS. However, all of these metrics emphasize detection accuracy, while robustness is not considered. In summary, a lane detection system can be evaluated separately with respect to its preprocessing, lane detection algorithm, and tracking stages. Moreover, the evaluation metrics are not limited to measuring the error between the detected lane and the real lane but can also be extended to evaluate the lane prediction range, shadow sensitivity, and computational efficiency. The specific evaluation metrics should be chosen based on the requirements of the actual application. Generally, a lane detection system should have three basic properties, namely, accuracy, robustness and efficiency. The main goal of a lane detection algorithm should be to meet real-time safety requirements with acceptable accuracy and low computational cost. Furthermore, the accuracy metric should measure whether the algorithm can detect both straight and curved lanes with small error. In the past, the problem of lane detection accuracy has been extensively studied, and many metrics can be found in the literature. Evaluating robustness, however, undoubtedly poses more challenges; therefore, urban road images are usually used to evaluate robustness, although many other factors also affect lane detection performance, such as weather, shadows and lighting, and traffic and road conditions.
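The error-based metrics suggested by McCall and Trivedi can be computed directly from paired detections and ground truth, as in the hedged sketch below; the function and its output names are ours.

import numpy as np

def lane_position_metrics(detected_x, truth_x):
    """Accuracy metrics in the spirit of the suggestions above.

    detected_x and truth_x are arrays of lateral lane positions sampled at
    the same rows or time steps (units are up to the evaluator, e.g. pixels
    or metres).
    """
    error = np.asarray(detected_x, dtype=float) - np.asarray(truth_x, dtype=float)
    return {
        "mean_absolute_error": float(np.mean(np.abs(error))),
        "error_std": float(np.std(error)),
        # rate-of-change of the error approximates how jittery the detection is
        "rate_of_change_error_std": float(np.std(np.diff(error))),
    }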

3.7

Example: Lane Detection

In this section, we refer to one paper [55] to illustrate how a lane detection algorithm runs in practice. In [55], the authors propose a traffic line detection method called the Point Instance Network (PINet); the method is based on key point estimation and instance segmentation. PINet includes several hourglass modules that are trained simultaneously with the same loss function, so the size of the trained model can be chosen according to the target environment's computing power. The clustering of the predicted key points is cast as an instance segmentation problem, so PINet can be trained regardless of the number of traffic lines. PINet achieves competitive accuracy and a low false positive rate on the CULane and TuSimple datasets, popular public datasets for lane detection. The code is available at https://github.com/koyeongmin/PINet_new.

3.7.1 Overview

The proposed algorithm has three output branches and predicts the exact locations and instance features of points on traffic lines. Users can obtain models of various sizes, according to their computing power, by network clipping without additional training or modification; a knowledge distillation technique reduces the performance gap between the full-size network and the short-clipped networks. The primary contributions of this study are:

1. Using the key point estimation approach, the authors propose a novel method for traffic line detection. It produces a more compact prediction output than other semantic segmentation-based methods.
2. The framework consists of several hourglass modules, so models of different sizes can be obtained by simple clipping, because each hourglass module is trained simultaneously with the same loss function.
3. The proposed method can be applied to scenes that include any orientation of traffic lines, such as vertical or horizontal lines, and arbitrary numbers of traffic lines.
4. The proposed method has low false positives and noteworthy accuracy, which supports the stability of an autonomous driving car.

Figure 3.2 shows details of the proposed framework, the Point Instance Network (PINet). PINet generates points on traffic lines and distinguishes individual instances. The 512 × 256 input is fed to the resizing network, which compresses it to a smaller size (64 × 32). The predicting network extracts features to generate outputs from the resized input; several hourglass modules are connected in series. Output branches that generate three kinds of outputs are applied to each hourglass module.

Fig. 3.2 Proposed framework with three main parts. 512 × 256 size input data are compressed by the resizing network architecture


Table 3.2 Details of the proposed networks

Type                   Layer               Size/Stride   Output size
Input data             -                   -             3*512*256
Resizing               Conv+Prelu+bn       3/2           32*256*128
                       Conv+Prelu+bn       3/2           64*128*64
                       Conv+Prelu+bn       3/2           128*64*32
Encoder                Bottle-neck(down)   -             128*32*16
                       Bottle-neck(down)   -             128*16*8
                       Bottle-neck(down)   -             128*8*4
                       Bottle-neck(down)   -             128*4*2
(Distillation layer)   Bottle-neck         -             128*4*2
                       Bottle-neck         -             128*4*2
                       Bottle-neck         -             128*4*2
                       Bottle-neck         -             128*4*2
Decoder                Bottle-neck(up)     -             128*8*4
                       Bottle-neck(up)     -             128*16*8
                       Bottle-neck(up)     -             128*32*16
                       Bottle-neck(up)     -             128*64*32
Output branch          Conv+Prelu+bn       3/1           64*64*32
                       Conv+Prelu+bn       3/1           32*64*32
                       Conv                1/1           C*64*32

From these outputs, the exact locations of traffic line points can be predicted and each instance can be distinguished. Because each hourglass module has the same output branch, users can make a lighter model by clipping some modules without additional training or modification. The knowledge distillation technique is used to limit the performance difference between the full-size and short-clipped models. Loss functions are calculated from the three outputs and the knowledge distillation part of every hourglass module, and the weighted summation of all loss functions is used for back-propagation.

The compressed input is fed to the predicting network, which includes four hourglass modules. Three output branches are applied at the end of each hourglass block; they predict confidence, offset, and embedding features. The loss function can be calculated from the outputs of each hourglass block, and the required computing resources can be adjusted by clipping several hourglass modules [55].

Table 3.2 shows the detailed layer configuration of the proposed network. The input RGB image size is 512 × 256, and it is fed to the resizing network, where a sequence of convolution layers compresses it to a smaller size (64 × 32); the output of the resizing network is fed to the predicting network. An arbitrary number of hourglass modules can be included in the predicting network; four hourglass modules are used in this study. All hourglass modules are trained simultaneously with the same loss function. After training, users can choose how many hourglass modules to use according to their computing power without any additional training. The following sections provide details about each network.
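To make the clipping idea concrete, the following sketch (not taken from [55]; the class name and the 128-channel width are assumptions chosen to match Table 3.2) shows how the resizing network of Code 3.6 and a stack of the hourglass blocks of Code 3.10 below could be chained, with each block returning its own confidence/offset/instance outputs and passing a feature map to the next block:

import torch.nn as nn

# Sketch only: chain the resizing network (Code 3.6) and several hourglass blocks (Code 3.10).
# Clipping trailing blocks at inference time simply shortens this list.
class lane_detection_network(nn.Module):
    def __init__(self, num_hourglass=4, channels=128):
        super(lane_detection_network, self).__init__()
        self.resizing = resize_layer(3, channels)        # 3*512*256 -> 128*64*32 (Table 3.2)
        self.blocks = nn.ModuleList(
            [hourglass_block(channels, channels) for _ in range(num_hourglass)])

    def forward(self, inputs):
        outputs = self.resizing(inputs)
        results = []
        for block in self.blocks:
            result, outputs = block(outputs)             # [confidence, offset, instance], features
            results.append(result)
        return results                                   # one set of outputs per hourglass module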


Fig. 3.3 Details of the hourglass block, which consists of three types of bottle-neck layers

Resizing Network: The resizing network reduces the size of the input image to save memory and inference time. The input RGB image size is 512 × 256. This network consists of three convolution layers, each with a filter size of 3 × 3, a stride of 2, and a padding size of 1. Table 3.2 shows details of the constituent layers. Let us check the code to understand the resizing network.

Code 3.6 Resize Layer

class resize_layer(nn.Module):
    def __init__(self, in_channels, out_channels, acti=True):
        super(resize_layer, self).__init__()
        self.conv = Conv2D_BatchNorm_Relu(in_channels, out_channels//2, 7, 3, 2)
        self.maxpool = nn.MaxPool2d(2, 2)
        self.re1 = bottleneck(out_channels//2, out_channels//2)
        self.re2 = bottleneck(out_channels//2, out_channels//2)
        self.re3 = bottleneck(out_channels//2, out_channels)

    def forward(self, inputs):
        outputs = self.conv(inputs)
        outputs = self.re1(outputs)
        outputs = self.maxpool(outputs)
        outputs = self.re2(outputs)
        outputs = self.maxpool(outputs)
        outputs = self.re3(outputs)
        return outputs
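The helper Conv2D_BatchNorm_Relu used above is not listed in the book. A plausible minimal implementation, consistent with how it is called here (argument order: input channels, output channels, kernel size, padding, stride, plus the optional acti flag used in Code 3.7 below) and with the Conv+Prelu+bn rows of Table 3.2, might look like this:

import torch.nn as nn

# Assumed helper (not from the book): convolution followed by batch normalization and PReLU.
class Conv2D_BatchNorm_Relu(nn.Module):
    def __init__(self, in_channels, out_channels, k_size, padding, stride, bias=True, acti=True):
        super(Conv2D_BatchNorm_Relu, self).__init__()
        layers = [nn.Conv2d(in_channels, out_channels, k_size,
                            stride=stride, padding=padding, bias=bias)]
        if acti:
            layers += [nn.BatchNorm2d(out_channels), nn.PReLU()]   # Conv + bn + PReLU (Table 3.2)
        self.block = nn.Sequential(*layers)

    def forward(self, inputs):
        return self.block(inputs)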

Predicting Network: The output of the resizing network is fed to the predicting part, which is described in this section. This part predicts the exact points on the traffic lines and the embedding features for instance segmentation. The network consists of several hourglass modules, each including an encoder, a decoder, and three output branches, as shown in Fig. 3.3.


Fig. 3.4 Details of bottleneck. The three kinds of bottle-necks have different first layers according to their purposes [55]

The hourglass block consists of same bottle-necks, down bottle-necks, and up bottle-necks. Output branches are applied at the ends of the hourglass layers, and the confidence output is forwarded to the next block [55]. Skip connections transfer information from various scales to the deeper layers. Each colored block in Fig. 3.3 is a bottle-neck module; these modules are described in Fig. 3.4. There are three kinds of bottle-necks: same, down, and up. The same bottle-neck generates an output that has the same size as its input.

Code 3.7 The Same Bottle-Neck Layer

class bottleneck(nn.Module):
    def __init__(self, in_channels, out_channels, acti=True):
        super(bottleneck, self).__init__()
        self.acti = acti
        temp_channels = int(in_channels/4)
        if in_channels < 4:
            temp_channels = in_channels
        self.conv1 = Conv2D_BatchNorm_Relu(in_channels, temp_channels, 1, 0, 1)
        self.conv2 = Conv2D_BatchNorm_Relu(temp_channels, temp_channels, 3, 1, 1)
        self.conv3 = Conv2D_BatchNorm_Relu(temp_channels, out_channels, 1, 0, 1, acti=self.acti)
        self.residual = Conv2D_BatchNorm_Relu(in_channels, out_channels, 1, 0, 1)

    def forward(self, x):
        re = x
        out = self.conv1(x)
        out = self.conv2(out)
        out = self.conv3(out)
        if not self.acti:
            return out
        re = self.residual(x)
        out = out + re
        return out

The down bottle-neck is applied for downsampling in the encoder; the first layer of the down bottle-neck is replaced by a convolution layer with filter size 3, stride 2, and padding 1.

Code 3.8 The Down Bottleneck Layer

class bottleneck_down(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(bottleneck_down, self).__init__()
        temp_channels = in_channels//4
        if in_channels < 4:
            temp_channels = in_channels
        # size/stride (1/1) convolution
        self.conv1 = Conv2D_BatchNorm_Relu(in_channels, temp_channels, 1, 0, 1)
        # size/stride (3/2) convolution
        self.conv2 = Conv2D_BatchNorm_Relu(temp_channels, temp_channels, 3, 1, 2)
        # size/stride (1/1) convolution
        self.conv3 = Conv2D_BatchNorm_Relu(temp_channels, out_channels, 1, 0, 1)
        # size/stride (3/2) residual convolution
        self.residual = Conv2D_BatchNorm_Relu(in_channels, out_channels, 3, 1, 2)

    def forward(self, x):
        re = x
        out = self.conv1(x)
        out = self.conv2(out)
        out = self.conv3(out)
        re = self.residual(x)
        out = out + re
        return out

The transposed convolution layer with filter size 3, stride 2, and padding 1 is applied for the up bottle-neck in the upsampling layers.

Code 3.9 The Upsampling Bottleneck Layer

class bottleneck_up(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(bottleneck_up, self).__init__()
        temp_channels = in_channels//4
        if in_channels < 4:
            temp_channels = in_channels
        # convolution layer (1/1)
        self.conv1 = Conv2D_BatchNorm_Relu(in_channels, temp_channels, 1, 0, 1)
        # deconvolution layer (3/2)
        self.conv2 = nn.Sequential(
            nn.ConvTranspose2d(temp_channels, temp_channels, 3, 2, 1, 1),
            nn.BatchNorm2d(temp_channels),
            nn.ReLU())
        # convolution (1/1)
        self.conv3 = Conv2D_BatchNorm_Relu(temp_channels, out_channels, 1, 0, 1)
        # deconvolution (3/2)
        self.residual = nn.Sequential(
            nn.ConvTranspose2d(in_channels, out_channels, 3, 2, 1, 1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU())

    def forward(self, x):
        re = x
        # three convolutions + residual layer
        out = self.conv1(x)
        out = self.conv2(out)
        out = self.conv3(out)
        re = self.residual(re)
        out = out + re
        return out

The final predicting network block is then assembled as follows:

Code 3.10 The Hourglass Block Layer

class hourglass_block(nn.Module):
    def __init__(self, in_channels, out_channels, acti=True, input_re=True):
        super(hourglass_block, self).__init__()
        self.layer1 = hourglass_same(in_channels, out_channels)
        self.re1 = bottleneck(out_channels, out_channels)
        self.re2 = bottleneck(out_channels, out_channels)
        self.re3 = bottleneck(1, out_channels)
        self.out_confidence = Output(out_channels, 1)
        self.out_offset = Output(out_channels, 2)
        self.out_instance = Output(out_channels, p.feature_size)
        self.input_re = input_re

    def forward(self, inputs):
        outputs = self.layer1(inputs)
        outputs = self.re1(outputs)
        out_confidence = self.out_confidence(outputs)
        out_offset = self.out_offset(outputs)
        out_instance = self.out_instance(outputs)
        out = out_confidence
        outputs = self.re2(outputs)
        out = self.re3(out)
        if self.input_re:
            outputs = outputs + out + inputs
        else:
            outputs = outputs + out
        return [out_confidence, out_offset, out_instance], outputs
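The encoder-decoder module hourglass_same used by hourglass_block is not reproduced in the book. The sketch below follows the layer counts of Table 3.2 (four down bottle-necks, four same bottle-necks at the lowest resolution, and four up bottle-necks); the skip connections between encoder and decoder used in [55] are omitted for brevity, so this is an approximation rather than the original implementation:

import torch.nn as nn

# Assumed sketch of hourglass_same; layer counts follow Table 3.2, skip connections omitted.
class hourglass_same(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(hourglass_same, self).__init__()
        self.encoder = nn.Sequential(
            bottleneck_down(in_channels, out_channels),    # 64*32 -> 32*16
            bottleneck_down(out_channels, out_channels),   # 32*16 -> 16*8
            bottleneck_down(out_channels, out_channels),   # 16*8  -> 8*4
            bottleneck_down(out_channels, out_channels))   # 8*4   -> 4*2
        self.same = nn.Sequential(
            *[bottleneck(out_channels, out_channels) for _ in range(4)])   # stays at 4*2
        self.decoder = nn.Sequential(
            bottleneck_up(out_channels, out_channels),     # 4*2   -> 8*4
            bottleneck_up(out_channels, out_channels),     # 8*4   -> 16*8
            bottleneck_up(out_channels, out_channels),     # 16*8  -> 32*16
            bottleneck_up(out_channels, out_channels))     # 32*16 -> 64*32

    def forward(self, inputs):
        return self.decoder(self.same(self.encoder(inputs)))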

Each output branch has three convolution layers and generates a 64 × 32 grid. For each cell of this grid, the output branches predict a confidence value for key point existence, an offset, and an embedding feature.
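The Output branch instantiated in Code 3.10 is likewise not listed. A sketch that follows the Output branch rows of Table 3.2 (two 3 × 3 Conv+Prelu+bn layers and a final 1 × 1 convolution, reducing 128 → 64 → 32 → C channels on the 64 × 32 grid) is given below; the exact implementation in [55] may differ in detail:

import torch.nn as nn

# Assumed sketch of the Output branch; channel widths taken from Table 3.2.
class Output(nn.Module):
    def __init__(self, in_size, out_size):
        super(Output, self).__init__()
        self.conv1 = Conv2D_BatchNorm_Relu(in_size, 64, 3, 1, 1)        # 3x3, stride 1
        self.conv2 = Conv2D_BatchNorm_Relu(64, 32, 3, 1, 1)             # 3x3, stride 1
        self.conv3 = nn.Conv2d(32, out_size, 1, stride=1, padding=0)    # 1x1, no activation

    def forward(self, inputs):
        return self.conv3(self.conv2(self.conv1(inputs)))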

3.7.2 Loss Function

For training, four loss functions are applied to each output branch of the hourglass modules. The following paragraphs provide details of each loss function. As shown in Table 3.2, the output branch generates a 64 × 32 grid, and each cell of the output grid consists of predicted values with 7 channels: the confidence value (1 channel), the offset (2 channels), and the embedding feature (4 channels). The confidence value determines whether a key point of a traffic line exists; the offset localizes the exact position of the key point predicted by the confidence value; and the embedding feature is used to separate key points into individual instances. Therefore, three loss functions, besides the distillation loss, are applied to each cell of the output grid. The distillation loss, which distills the knowledge of the teacher network, is applied at the distillation layer of each encoder, as shown in Table 3.2.

(1) Confidence Loss: The confidence output branch predicts the confidence value of each cell. If a key point is present in the cell, the confidence value is close to 1; if not, it is 0. The output of the confidence branch has 1 channel, and it is fed to the next hourglass module. The confidence loss consists of two parts: an existence loss and a nonexistence loss. The existence loss is applied to cells that include key points; the nonexistence loss reduces the confidence value of each background cell.

Code 3.11 Confidence Loss

# exist confidence loss
confidance_gt = ground_truth_point[:, 0, :, :]
confidance_gt = confidance_gt.view(real_batch_size, 1, self.p.grid_y, self.p.grid_x)
exist_condidence_loss = torch.sum((confidance_gt[confidance_gt==1] -
                                   confidance[confidance_gt==1])**2)/torch.sum(confidance_gt==1)

# non exist confidence loss
nonexist_confidence_loss = torch.sum((confidance_gt[confidance_gt==0] -
                                      confidance[confidance_gt==0])**2)/torch.sum(confidance_gt==0)

(2) Offset Loss: From the offset branch, PINet predicts the exact location of the key point for each output cell. The output of each cell has a value between 0 and 1, which indicates the position relative to the corresponding cell.

Code 3.12 Offset Loss

# offset loss
offset_x_gt = ground_truth_point[:, 1:2, :, :]
offset_y_gt = ground_truth_point[:, 2:3, :, :]
predict_x = offset[:, 0:1, :, :]
predict_y = offset[:, 1:2, :, :]
x_offset_loss = torch.sum((offset_x_gt[confidance_gt==1] -
                           predict_x[confidance_gt==1])**2)/torch.sum(confidance_gt==1)
y_offset_loss = torch.sum((offset_y_gt[confidance_gt==1] -
                           predict_y[confidance_gt==1])**2)/torch.sum(confidance_gt==1)
offset_loss = (x_offset_loss + y_offset_loss)/2

(3) Embedding Feature Loss: The loss function of this branch is inspired by SGPN, a 3D point cloud instance segmentation method. The branch is trained to bring the embedding features of cells closer together when the cells belong to the same instance.

Code 3.13 Embedding Feature Loss

# compute loss for similarity
sisc_loss = 0
disc_loss = 0
feature_map = feature.view(real_batch_size, self.p.feature_size, 1,
                           self.p.grid_y*self.p.grid_x)
feature_map = feature_map.expand(real_batch_size, self.p.feature_size,
                                 self.p.grid_y*self.p.grid_x,
                                 self.p.grid_y*self.p.grid_x).detach()
point_feature = feature.view(real_batch_size, self.p.feature_size,
                             self.p.grid_y*self.p.grid_x, 1)
point_feature = point_feature.expand(real_batch_size, self.p.feature_size,
                                     self.p.grid_y*self.p.grid_x,
                                     self.p.grid_y*self.p.grid_x)#.detach()
distance_map = (feature_map - point_feature)**2
distance_map = torch.norm(distance_map, dim=1).view(real_batch_size, 1,
                                                    self.p.grid_y*self.p.grid_x,
                                                    self.p.grid_y*self.p.grid_x)

# same instance
sisc_loss = torch.sum(distance_map[ground_truth_instance==1])/torch.sum(ground_truth_instance==1)

# different instance, same class
disc_loss = self.p.K1 - distance_map[ground_truth_instance==2]
disc_loss[disc_loss<0] = 0

bool is_in_front_of_us = vehicle.s > car_s;
bool is_closer_than_safety_margin = vehicle.s - car_s < safety_margin;
if (is_in_front_of_us && is_closer_than_safety_margin)
{
    is_too_close = true;
    prepare_for_lane_change = true;
}
}
}

4.2.6.3 Calculate the Trajectory Route of the Current Lane

This section mainly uses the Frenet coordinate system (the coordinate system of the road). The Frenet coordinates of the current car are converted into Cartesian coordinates, and the positions 30 m, 60 m, and 90 m ahead are added to the coordinate points of the current reference frame. Then, sampled points based on the curve model are used to generate candidate curves, the cost function is calculated, collision detection is performed, and the optimal safe curve is selected.

4.2.6.4 Frenet Road Coordinate System

When planning the path trajectory, the Frenet road coordinate system is used. The reason is that the planning curve is generated with the current road as the reference: the generated candidate curve is ultimately expressed relative to the current vehicle position, indexed along the reference line in its forward direction. That is, taking the vehicle itself as the origin, two perpendicular axes are used: the s direction (along the reference line, usually called longitudinal) and the d direction (the current normal direction of the reference line, usually called lateral). Compared with the Cartesian coordinate system, the Frenet coordinate system significantly simplifies the problem, because in road driving it is always possible to find the reference line of the road (i.e., the centerline of the road). A position can then be described simply by the longitudinal distance (the distance along the road) and the lateral distance (the distance from the reference line). The following code gives the mutual conversion between the Cartesian coordinate system and the Frenet road coordinate system:

Code 4.9 The Mutual Conversion Between the Cartesian Coordinate System and the Frenet Road Coordinate System

// Convert from Cartesian x, y coordinates to Frenet s, d coordinates
std::vector<double> cartesian_to_frenet(double x, double y, double theta,
                                        std::vector<double> maps_x,
                                        std::vector<double> maps_y)
{
    int next_wp = get_next_waypoint(x, y, theta, maps_x, maps_y);
    int prev_wp;
    prev_wp = next_wp - 1;
    if (next_wp == 0)
        prev_wp = maps_x.size() - 1;

    double n_x = maps_x[next_wp] - maps_x[prev_wp];
    double n_y = maps_y[next_wp] - maps_y[prev_wp];
    double x_x = x - maps_x[prev_wp];
    double x_y = y - maps_y[prev_wp];

    // Calculate the projection of x on n
    double proj_norm = (x_x * n_x + x_y * n_y) / (n_x * n_x + n_y * n_y);
    double proj_x = proj_norm * n_x;
    double proj_y = proj_norm * n_y;

    double frenet_d = distance(x_x, x_y, proj_x, proj_y);

    // See if the d value is positive or negative by comparing it to the center point
    double center_x = 1000 - maps_x[prev_wp];
    double center_y = 2000 - maps_y[prev_wp];
    double centerToPos = distance(center_x, center_y, x_x, x_y);
    double centerToRef = distance(center_x, center_y, proj_x, proj_y);
    if (centerToPos <= centerToRef)
        frenet_d *= -1;

    // Calculate the s value
    double frenet_s = 0;
    for (int i = 0; i < prev_wp; ++i)
        frenet_s += distance(maps_x[i], maps_y[i], maps_x[i + 1], maps_y[i + 1]);
    frenet_s += distance(0, 0, proj_x, proj_y);

    return { frenet_s, frenet_d };
}

// Convert from Frenet s, d coordinates to Cartesian x, y coordinates
std::vector<double> frenet_to_cartesian(double s, double d,
                                        std::vector<double> maps_s,
                                        std::vector<double> maps_x,
                                        std::vector<double> maps_y)
{
    int prev_wp = -1;
    while (s > maps_s[prev_wp + 1] && (prev_wp < (int)(maps_s.size() - 1)))
        prev_wp++;

    int wp2 = (prev_wp + 1) % maps_x.size();
    double heading = atan2((maps_y[wp2] - maps_y[prev_wp]),
                           (maps_x[wp2] - maps_x[prev_wp]));

    // The x, y, s along the segment
    double seg_s = (s - maps_s[prev_wp]);
    double seg_x = maps_x[prev_wp] + seg_s * cos(heading);
    double seg_y = maps_y[prev_wp] + seg_s * sin(heading);

    double perp_heading = heading - pi/2;
    double x = seg_x + d * cos(perp_heading);
    double y = seg_y + d * sin(perp_heading);

    return { x, y };
}

4.2.6.5 The Spline Function Generates the Candidate Trajectory Path

Based on the spatial position of the current vehicle and the positions ahead along the reference line, a spline function can be used to calculate the curve. For the specific implementation of the spline function, please refer to http://kluge.in-chemnitz.de/opensource/spline/, which is used here by including its header file.

Code 4.10 The Spline Function Generates Candidate Trajectory Paths

// A list of well-spaced (x, y) waypoints; these will later be interpolated with the spline,
// filling it with more velocity-controlling points
vector<double> pts_x;
vector<double> pts_y;

// x, y, and yaw reference values
double ref_x = car_x;
double ref_y = car_y;
double ref_yaw = deg2rad(car_yaw);

// If the previous path is almost empty, use the current car position as the starting reference
if (prev_size < 2)
{
    double prev_car_x = car_x - cos(car_yaw);
    double prev_car_y = car_y - sin(car_yaw);
    pts_x.push_back(prev_car_x);
    pts_x.push_back(car_x);
    pts_y.push_back(prev_car_y);
    pts_y.push_back(car_y);
}
// Otherwise, use the previous path's end point as the reference
else
{
    ref_x = previous_path_x[prev_size - 1];
    ref_y = previous_path_y[prev_size - 1];
    double ref_x_prev = previous_path_x[prev_size - 2];
    double ref_y_prev = previous_path_y[prev_size - 2];
    ref_yaw = atan2(ref_y - ref_y_prev, ref_x - ref_x_prev);
    pts_x.push_back(ref_x_prev);
    pts_x.push_back(ref_x);
    pts_y.push_back(ref_y_prev);
    pts_y.push_back(ref_y);
}

// In Frenet coordinates, evenly add points at 30 m intervals ahead of the starting reference
vector<double> next_wp0 = frenet_to_cartesian(car_s + 30, (lane_width * lane + lane_width/2),
                                              map_waypoints_s, map_waypoints_x, map_waypoints_y);
vector<double> next_wp1 = frenet_to_cartesian(car_s + 60, (lane_width * lane + lane_width/2),
                                              map_waypoints_s, map_waypoints_x, map_waypoints_y);
vector<double> next_wp2 = frenet_to_cartesian(car_s + 90, (lane_width * lane + lane_width/2),
                                              map_waypoints_s, map_waypoints_x, map_waypoints_y);
pts_x.push_back(next_wp0[0]);
pts_x.push_back(next_wp1[0]);
pts_x.push_back(next_wp2[0]);
pts_y.push_back(next_wp0[1]);
pts_y.push_back(next_wp1[1]);
pts_y.push_back(next_wp2[1]);

// Shift and rotate the coordinates into the vehicle's local frame
for (size_t i = 0; i < pts_x.size(); ++i)
{
    double shift_x = pts_x[i] - ref_x;
    double shift_y = pts_y[i] - ref_y;
    pts_x[i] = shift_x * cos(0 - ref_yaw) - shift_y * sin(0 - ref_yaw);
    pts_y[i] = shift_x * sin(0 - ref_yaw) + shift_y * cos(0 - ref_yaw);
}

// Create the spline function
tk::spline s;
s.set_points(pts_x, pts_y);

4.2.6.6 Collision Detection

Finally, collision detection is performed independently of the cost function. The candidate trajectories are sorted by cost from low to high and checked in order, and the first trajectory that passes collision detection is used.

Code 4.11 Collision Detection

// Calculate how to break up the spline points so that the vehicle travels at the reference velocity
double target_x = 30.0;
double target_y = s(target_x);
double target_dist = sqrt(target_x * target_x + target_y * target_y);
double x_add_on = 0.0;
for (size_t i = 1; i
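The listing above is cut off in the source. As a rough illustration only (not the book's code, and all names below are assumptions), the usual idea behind this step is that if the controller consumes one waypoint every 0.02 s, then splitting the distance target_dist into N points makes the vehicle travel at roughly target_dist / (N * 0.02) m/s, so N is chosen from the desired reference velocity:

def spline_point_spacing(target_x, target_y, ref_vel, dt=0.02):
    """Sketch (assumed, not from the book): x increment between spline samples so that
    consuming one point every dt seconds gives roughly the reference velocity ref_vel (m/s)."""
    target_dist = (target_x**2 + target_y**2) ** 0.5   # straight-line distance to the target point
    n_points = target_dist / (dt * ref_vel)            # number of points covering target_dist
    return target_x / n_points                         # step in x between consecutive samples

# Example: a target point 30 m ahead at 20 m/s with a 20 ms controller cycle
step_x = spline_point_spacing(30.0, 0.0, 20.0)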