Advances in Computer Vision: Proceedings of the 2019 Computer Vision Conference (CVC), Vol. 1 (ISBN 9783030177942, 9783030177959)


English · 833 pages · 2020


Table of contents:
Preface......Page 6
Contents......Page 8
1 Introduction......Page 14
2.1 Previous Approaches......Page 15
3.1 Dataset......Page 16
3.2 Railway vs Road Signs and Signals......Page 20
4.1 Faster R-CNN......Page 21
4.2 Evaluation Method......Page 22
5 Results......Page 23
References......Page 27
1 Introduction......Page 29
2 Background......Page 30
3 Network Architecture......Page 31
4 Data......Page 33
5 Experiment and Results......Page 34
Appendix......Page 37
References......Page 39
1 Introduction......Page 40
2 Image Color Representation......Page 42
3 Transfer Learning of DL Models for Image Classification......Page 43
4 Experimental Layout......Page 44
5.2 ResNet......Page 46
6 Findings and Further Work......Page 48
References......Page 50
1 Introduction......Page 52
2 Related Work......Page 53
3 Weak Supervision by NCC......Page 54
5.1 Alignment......Page 57
5.3 Training Stability Analysis......Page 59
A.1 Training Set......Page 63
A.2 Ground Truth Annotation......Page 64
A.3 Match Filtering......Page 65
References......Page 70
Abstract......Page 72
1 Introduction......Page 73
4 The Application of Nature Inspired Algorithms in Deep Learning......Page 74
4.1 Harmony Search Algorithm for Deep Belief Network......Page 75
4.5 Ant Colony Optimization for Deep Belief Network......Page 76
4.9 Genetic Algorithm for Restricted Boltzmann Machine......Page 77
5 The General Overview of the Synergy Between Nature Inspired Algorithms and Deep Learning......Page 78
6 Challenges and Future Research Directions......Page 79
7 Conclusions......Page 80
References......Page 81
1 Introduction......Page 84
3 RBM-SVR Algorithm......Page 87
4.1 Principles of GWO Algorithm......Page 88
4.2 Principles of the DE Algorithm......Page 89
5 Experimental Setup and Results......Page 91
6 Conclusions and Outlook......Page 95
References......Page 96
1 Introduction......Page 99
2 Related Work......Page 100
3 Model......Page 102
3.1 Visual Semantic Embedding......Page 103
3.2 Discriminator......Page 104
4 Experiments and Results......Page 105
4.1 Quantitative Analysis......Page 106
4.2 Qualitative Analysis......Page 107
5 Conclusions......Page 108
References......Page 109
1 Introduction......Page 112
2.1 Inception Architecture......Page 114
3.1 Architecture......Page 115
3.2 Training Methodology......Page 116
4.1 Data Sets......Page 117
4.3 Parameter Sensitivity......Page 118
4.4 Experimental Results......Page 119
5 Conclusion......Page 120
References......Page 121
1 Introduction......Page 122
1.1 Related Work......Page 124
1.2 Single Image Super-Resolution Using GAN......Page 126
2 Proposed Method......Page 128
2.1 Loss Functions......Page 130
3 Experiment Analysis......Page 131
References......Page 137
1 Introduction......Page 141
2.1 What Is Deep Learning......Page 142
2.2 Advantages of Deep Learning......Page 143
2.3 Advantages of Traditional Computer Vision Techniques......Page 145
3.1 Mixing Hand-Crafted Approaches with DL for Better Performance......Page 146
3.2 Overcoming the Challenges of Deep Learning......Page 147
3.4 Problems Not Suited to Deep Learning......Page 148
3.5 3D Vision......Page 149
3.6 Slam......Page 151
3.8 Dataset Annotation and Augmentation......Page 152
4 Conclusion......Page 153
References......Page 154
1 Introduction......Page 158
2 Related Research......Page 159
3.1 Overview of Proposed Methods......Page 160
3.3 Generation of Training/Testing Data......Page 161
3.4 Geometrically Calculate Location......Page 162
3.5 Self-localization by CNN......Page 165
3.6 Self-localization by Convolutional LSTM......Page 166
4 Evaluation of Self-localization......Page 167
5 Conclusion......Page 170
References......Page 171
1 Introduction......Page 172
2.1 Age and Gender Classification......Page 173
2.2 Deep Neural Networks......Page 175
3.2 Age Classification Approach......Page 176
4.1 Model Architecture......Page 177
4.2 Prediction Head and Cross-Modal Learning......Page 179
4.3 Training......Page 180
4.5 Iterative Model Refinement......Page 181
5 System and Evaluation......Page 183
5.2 Results......Page 184
References......Page 186
1 Introduction......Page 191
2 Related Work......Page 192
3.1 Markov Decision Process......Page 193
3.2 Architecture......Page 197
4 Experiments......Page 198
4.2 Results......Page 199
5 Conclusion......Page 201
A Algorithm......Page 202
B Evaluation Overview......Page 203
References......Page 204
1 Introduction......Page 205
1.1 Mobile Sourced Images......Page 206
1.3 Fixed Source Images from Closed-Circuit Television (CCTV)......Page 207
2.1 Model Description......Page 208
2.2 Road Weather Condition Estimation Using CCTV......Page 209
2.3 Road Weather Condition Estimation Using Mobile Cameras......Page 210
3 Experimental Results......Page 211
3.1 Experimental Results on Naturalistic Driving Images......Page 215
References......Page 216
1 Introduction......Page 218
2 Related Work......Page 219
3 Algorithm Description......Page 220
3.1 Design of Feature Extraction Network......Page 221
3.2 Clustering Scheme......Page 223
4.2 Effectiveness Analysis of Feature Extraction......Page 224
4.3 Analysis of the Effectiveness of a Parallel Network......Page 225
4.4 Analysis of Detection Results......Page 226
References......Page 232
1 Introduction......Page 235
2 Regeneration of Subkeys Using a Secret Key......Page 237
4 The Decryption Process......Page 238
5.1 Subkeys Regeneration......Page 239
5.2 Encryption Results......Page 241
5.3 Decryption Results......Page 242
6.2 Scatter Diagrams......Page 243
6.3 Correlation Coefficients......Page 244
6.4 Parameter Mismatch......Page 245
6.5 Differential Attack Analysis......Page 246
7 Conclusions......Page 247
References......Page 248
1 Introduction......Page 250
2 Graphs and Cuts......Page 253
3 Island-Free Reconstructions......Page 254
4 Simple Reconstruction Paths......Page 255
5 Map Reconstructions for 1-Run Boundaries......Page 256
6 Island-Free Reconstructions for 2-Run Boundaries......Page 260
7 MAP Reconstructions for 2-Run Boundaries......Page 263
References......Page 268
1 Introduction......Page 270
2.1 Characteristics of the 3D Neutron Data......Page 272
2.3 DBSCAN-Assisted DVR......Page 274
3 Application on Neutron Data......Page 276
3.1 Feature Extraction in Single Crystal Diffuse Scattering......Page 278
3.2 Interactive Visualization of Neutron Tomography......Page 279
4.1 Involvement of HPC......Page 281
4.2 Expanding DBSCAN’s Applications in Neutron Science......Page 282
References......Page 283
1 Introduction......Page 285
2.2 Segmentation......Page 286
3.1 Plate Characterization......Page 287
3.2 Selection of Best Binarization Technique......Page 288
3.3 License Number Segmentation......Page 289
5.1 Character Segmentation......Page 292
5.2 Classification Problems......Page 293
References......Page 295
1 Introduction......Page 297
2 Related Work......Page 298
3.1 Structure of the Human Eye......Page 300
3.2 3D Perception Considerations......Page 303
3.4 Snell Law......Page 304
3.5 Color Spaces......Page 305
3.6 Chroma Key......Page 306
4.2 Capture Module Design......Page 307
4.3 Phase of Coding of Images......Page 310
4.4 Representation Phase......Page 311
5 Experiments and Results......Page 312
References......Page 314
1.1 Background......Page 316
1.2 Related Work......Page 317
2.1 Superpixel Partitioning Using SLIC......Page 318
3.1 Overview......Page 320
3.2 Detection and Manipulation of Superpixels......Page 321
3.4 Crack Cleaning and Simplification......Page 324
4.1 Subjective Assessment......Page 325
4.2 Objective Assessment......Page 326
References......Page 328
1.1 Problem......Page 330
2.2 Previous Research......Page 331
4.1 Research Design......Page 332
5.1 Solution......Page 336
References......Page 338
1 Introduction......Page 339
2 Background......Page 340
3 Method and Approach of Scenario......Page 341
4.2 Image Processing......Page 342
5 Image Recognition Through CNN......Page 343
6.1 Function Evaluate......Page 346
6.2 Confusion Matrix......Page 347
6.3 ROC Curve......Page 348
7 Results Analysis......Page 349
8 Conclusions and Future Works......Page 350
References......Page 351
1 Introduction......Page 352
2 Related Works......Page 353
3 Datasets......Page 354
4.2 License Plate Detection......Page 355
4.4 Character Recognition......Page 356
5.2 License Plate Detection......Page 358
5.3 Recognition......Page 359
References......Page 361
1 Introduction......Page 363
2 The Motivation of Our Approach......Page 364
3.1 Objective Object Weight Map......Page 366
3.3 Segmentation Based on Iterative Graph Cut......Page 369
4 Experiment Result......Page 370
References......Page 372
1 Introduction......Page 374
2 Proposed System......Page 375
2.3 Measurement Stage......Page 376
3.2 System Performance......Page 377
4 Experimental Results......Page 378
References......Page 380
1 Introduction......Page 382
2.1 Problem Statement......Page 383
2.2 Overview......Page 384
3 Optimizations......Page 385
3.1 Element-Wise Addition Fusion......Page 386
3.3 Linear Transformation Fusion......Page 387
3.5 Padding Transformation......Page 388
4.2 Network Architectures......Page 389
4.4 Cost Efficiency......Page 391
4.5 Optimizations......Page 392
5 Related Work......Page 393
References......Page 394
1 Introduction......Page 397
2 Compressive Sensing......Page 398
3.1 The Laplacian Operator......Page 400
3.3 Gaussian Blur......Page 401
5 Results......Page 402
References......Page 406
1 Introduction......Page 407
2 Head of Hades Sample......Page 408
3 3D Reconstruction Model......Page 410
Acknowledgment......Page 413
References......Page 414
1 Introduction......Page 415
2.1 Models......Page 416
2.2 Data......Page 417
3.1 Segmentation Model......Page 419
3.2 Fine-Tuning on Expert Annotations......Page 421
4 Discussion......Page 424
References......Page 426
1 Introduction......Page 429
2.1 Denoising Quality Feature Design......Page 431
2.3 Automatic Parameter Tuning......Page 437
3.1 Image Denoising Quality Benchmark......Page 438
3.2 Regression Validation......Page 439
3.3 Evaluation on Denoising Quality Ranking......Page 440
3.4 Evaluation on Parameter Tuning......Page 441
References......Page 444
1 Introduction......Page 447
2.1 Plant Disease Database Creation......Page 448
2.2 Method......Page 449
3 Results and Discussion......Page 454
References......Page 455
1 Introduction......Page 457
2 Related Works......Page 458
3.1 Face Detection and Alignment......Page 460
3.3 COSFIRE......Page 461
3.4 Fusion Methods......Page 464
4.1 Data Sets......Page 465
4.2 Experiments......Page 466
5 Discussion......Page 467
References......Page 468
1 Introduction......Page 472
3 Methodology......Page 474
3.1 Design of GCP-Marker......Page 475
3.2 Detection of GCP......Page 476
4.1 Data Acquisition......Page 482
5 Results......Page 485
References......Page 487
1 Introduction......Page 490
3 Marker-Based Tracking During Robot-Assisted Radical Prostatectomy......Page 492
3.1 The Target Application......Page 494
3.2 The Simulation of the Surgical Operation......Page 496
4 The Optimization Procedure......Page 497
4.1 Validation......Page 500
5 Evaluation......Page 501
5.1 Considering Different Number of Markings......Page 502
5.2 Adding Deformation......Page 504
5.3 Filtering......Page 505
References......Page 507
1 Introduction......Page 510
2 Laser Scanning System......Page 511
4 3D Shape Deflection Monitoring......Page 512
References......Page 514
1 Introduction......Page 516
2 Related Work......Page 517
3.2 Template Matching......Page 520
3.3 Image Pre-processing......Page 521
3.4 Optical Character Recognition......Page 522
3.5 Text Post Processing......Page 523
4 Experimental Results......Page 524
References......Page 526
1 Introduction......Page 528
2 Literature Review......Page 530
3.2 Flowchart of the Proposed Approach......Page 533
4.1 Image Visibility and Contrast Improvement......Page 541
4.2 3D Quality Improvement......Page 542
5 Conclusion......Page 545
References......Page 546
1 Introduction......Page 548
1.2 Network Structure......Page 549
1.4 Proposed Model......Page 550
2.2 Loss Margin......Page 551
3.2 Dynamic Margin......Page 553
4.2 Model Training......Page 554
5.1 Toy Experiment......Page 556
5.2 Evaluation Metrics......Page 557
5.3 Recall-Time Data Filtering......Page 561
References......Page 562
1 Introduction......Page 564
2 Related Works......Page 566
3.1 Query to Social Media......Page 568
3.2 Image Matching Process and GPS Data Extraction......Page 569
4 Results......Page 571
References......Page 574
1 Introduction......Page 577
2 Methods......Page 579
2.2 Data Collection......Page 580
3 Preliminary Findings (n = 14)......Page 584
3.4 Learning/Fatigue Effects......Page 585
4 Discussion......Page 586
5 Limitations and Future Work......Page 588
Appendix—Pre-study Questionnaire......Page 589
References......Page 591
1 Introduction......Page 593
2 Problem of Alignment of 2-D Shapes......Page 594
3 Related Work......Page 596
4 Material Used for This Study......Page 598
5.1 Background......Page 599
5.2 Acquisition of Object Contours from Real Images......Page 600
5.3 Approximation......Page 602
6.1 The Alignment Algorithm......Page 603
6.2 Calculation of Point Correspondences......Page 605
6.3 Calculation of the Distance Measure......Page 607
7 Evaluation of Our Alignment Algorithm......Page 608
8 Conclusions......Page 610
References......Page 611
1 Introduction......Page 613
2 Algorithm Framework......Page 614
3 PLK Optical Flow......Page 616
4 EKF......Page 619
5 Experiments......Page 620
References......Page 625
1 Introduction......Page 627
2.2 Method of Sauvola......Page 629
3.1 The Framework of Proposed Method......Page 630
3.2 Character Edge Extraction with High and Low Contrast......Page 631
3.3 Local Threshold Calculation......Page 632
3.4 Image Binarization......Page 634
4.1 Algorithm Parameters......Page 635
4.2 Algorithm Parameters......Page 636
4.4 Time Performance Analysis......Page 638
References......Page 640
1 Introduction......Page 642
2.1 Displacement Accuracy Validation......Page 644
2.2 Laboratory Testing to Obtain Multipoint Displacement of Long-Medium Span Bridge......Page 645
3 Field Testing......Page 648
4 Discussion and Conclusions......Page 649
References......Page 650
1 Introduction......Page 652
2.1 SPM......Page 653
2.4 Image Source......Page 654
3.2 Hausdorf Distance (HD)......Page 655
4 Comparison Results......Page 656
4.1 Images with Different Noise Levels......Page 657
5 Conclusions......Page 658
References......Page 659
1.1 The Challenge of Chronic Respiratory Disease......Page 661
2 Background......Page 662
3.2 Software Development......Page 664
3.3 Development of Buddi-DL, the Open Web App for LibreHealth RIS......Page 666
4 Results......Page 668
6 Conclusion......Page 669
References......Page 670
1 Introduction......Page 671
2.1 Pixel Intensity Based Face Detection Algorithm......Page 673
2.2 Image Fusion......Page 674
2.3 Optimization......Page 676
2.5 MobileNet Architecture......Page 678
3.1 Dataset......Page 679
3.3 Face Recognition Accuracy......Page 680
4 Conclusion......Page 681
References......Page 682
1 Introduction......Page 684
3.2 Model Architecture......Page 685
4.1 Raw Signals......Page 687
4.2 Logarithmic Spectrograms......Page 688
References......Page 689
1 Introduction......Page 691
2.1 Boundary Extraction......Page 692
2.3 Quadratic Function for Curve Fitting......Page 693
3 Problem Representation and Its Solutions with the Help of GA......Page 695
3.1 Initialization......Page 696
3.3 Demonstration......Page 697
References......Page 698
1 Introduction......Page 700
2.1 Differomorphism in Image Registration......Page 701
2.3 Numerical Algorithm of LDDMM......Page 702
3 Fast Sequential Diffeomorphic Atlas Building (FSDAB)......Page 703
3.1 Numerical Algorithm of FSDAB......Page 704
4 Fast Bayesian Principal Geodesic Analysis (FBPGA)......Page 705
5.2 3D Brain Dataset......Page 707
7 Conclusion......Page 710
References......Page 711
1 Introduction......Page 713
2 Literature Review......Page 715
3.1 Linear Cellular Automata Transform......Page 717
3.2 Pre-processing of Data for 2D Vector Maps......Page 719
3.3 Computation of Relative Coordinates......Page 720
3.4 Binary Transform for Cover Data......Page 721
3.5 The Degree of Watermarking......Page 722
4 The Suggested Strategy of Reversible Watermarking......Page 723
4.2 Embedding Algorithm......Page 724
4.3 Extraction Algorithm......Page 725
5.1 Results of Experiments......Page 726
5.2 Robustness Assessment......Page 728
5.3 Assessment of the Capability for Content Authentication......Page 729
References......Page 730
1 Introduction......Page 733
2 Literature Review......Page 734
3 Proposed Methodology......Page 735
4.2 Results......Page 736
4.3 Discussion......Page 737
5 Conclusion......Page 740
References......Page 741
1 Introduction......Page 742
2.1 TPACK and Higher Education......Page 743
3 Study Context......Page 744
3.1 Possible ICT Technologies Used to Carry Out the Proposed Activities......Page 745
4.1 Participants......Page 748
4.3 User Acceptance Questionnaire for TPACK Structure......Page 749
4.4 Focus Group Interview......Page 750
5.1 Results About the Need for Technological Device Usage by University Learners......Page 751
5.2 User Acceptance Results of TPACK Model in Accordance with University Learners Need......Page 753
6.1 User Acceptance and Future of TPACK Approach......Page 754
6.2 Concerns About “TPACK Structure According to University Learners’ Need” Approach......Page 756
8 Conclusion......Page 757
References......Page 758
1 Introduction......Page 760
2.1 Online Restaurant Reviews......Page 762
2.2 Dining Experience......Page 763
2.3 Hallyu......Page 764
3.1 Data Collection......Page 765
3.2 Analysis Method......Page 766
4.2 Practical Contributions......Page 768
References......Page 769
Abstract......Page 771
2 Background......Page 772
2.2 Principles of Linked Data......Page 773
2.3 Linked Data Publishing Methodology......Page 774
2.4 Tools......Page 775
2.7 Linked Data in Higher Education......Page 776
3.1 Model of Bodies of Knowledge (BOK)......Page 777
3.3 Academic Offer Domain......Page 778
3.4 Design of HTTP URIs......Page 780
3.5 RDF Generation......Page 781
3.6 Publication and Exploitation of Data......Page 782
4 Conclusions......Page 784
References......Page 785
1.1 Context......Page 787
2 State of the Art......Page 788
3.1 Corpora......Page 791
3.2 Method and Algorithms......Page 792
3.3 Implemented Method System Architecture......Page 794
3.4 Model and Classification Training......Page 796
4 Solution Re-usability and Core Capabilities in a Nutshell......Page 797
5 Conclusions and Future Work......Page 799
References......Page 800
1 Introduction......Page 802
2.3 Modeling Preferences......Page 803
3.2 Selection Process......Page 804
5.1 Linguistic Variables......Page 805
5.2 Choice of the Set of Linguistic Terms......Page 806
6 Methodologies in the Development of New Products......Page 807
6.2 New Products Development Models......Page 808
6.3 Variables and Elements in the Analysis and Design of a New Product Development Model......Page 809
References......Page 810
1 Introduction......Page 812
2 Analysis of Related Work......Page 813
3.1 Scientific Data Sources......Page 817
4 Specification of Ontology Requirements......Page 818
5 Ontology Design......Page 819
5.2 Publication Class......Page 820
5.3 Researcher Profile Ontology......Page 821
6.2 Publication Class Population......Page 822
7 Ontology Enrichment......Page 823
8.1 Researcher Publications......Page 825
8.3 Qualified and Specialized Researchers......Page 826
8.4 Publications by Year......Page 827
8.6 Publications by Gender......Page 828
9 Conclusions......Page 829
References......Page 830
Author Index......Page 831


Advances in Intelligent Systems and Computing 943

Kohei Arai Supriya Kapoor Editors

Advances in Computer Vision Proceedings of the 2019 Computer Vision Conference (CVC), Volume 1

Advances in Intelligent Systems and Computing Volume 943

Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
Hani Hagras, Electronic Engineering, University of Essex, Colchester, UK
László T. Kóczy, Department of Automation, Széchenyi István University, Gyor, Hungary
Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA
Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan
Jie Lu, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia
Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico
Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil
Ngoc Thanh Nguyen, Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland
Jun Wang, Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong

The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, Perception and Vision, DNA and immune based systems, self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia. The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results. ** Indexing: The books of this series are submitted to ISI Proceedings, EI-Compendex, DBLP, SCOPUS, Google Scholar and Springerlink **

More information about this series at http://www.springer.com/series/11156

Kohei Arai • Supriya Kapoor

Editors

Advances in Computer Vision Proceedings of the 2019 Computer Vision Conference (CVC), Volume 1


Editors Kohei Arai Saga University Saga, Saga, Japan

Supriya Kapoor The Science and Information (SAI) Organization Bradford, West Yorkshire, UK

ISSN 2194-5357    ISSN 2194-5365 (electronic)
Advances in Intelligent Systems and Computing
ISBN 978-3-030-17794-2    ISBN 978-3-030-17795-9 (eBook)
https://doi.org/10.1007/978-3-030-17795-9

© Springer Nature Switzerland AG 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

It gives us great pleasure to welcome all the participants of the Computer Vision Conference (CVC) 2019, organized by The Science and Information (SAI) Organization, based in the UK. CVC 2019 offers a place for participants to present and discuss their innovative recent and ongoing research and its applications. The prestigious conference was held on 25–26 April 2019 in Las Vegas, Nevada, USA.

Computer vision is a field of computer science that works on enabling computers to identify, see and process information in a similar way to humans and provide an appropriate result. Nowadays, computer vision is developing at a fast pace and has gained enormous attention. The volume and quality of the technical material submitted to the conference confirm the rapid expansion of computer vision and CVC’s status as its flagship conference. We believe the research presented at CVC 2019 will contribute to strengthening the great success of computer vision technologies in industrial, entertainment, social and everyday applications.

The participants of the conference came from different regions of the world, with backgrounds in either academia or industry. The published proceedings have been divided into two volumes, which cover a wide range of topics in Machine Vision and Learning, Computer Vision Applications, Image Processing, Data Science, Artificial Intelligence, Motion and Tracking, 3D Computer Vision, Deep Learning for Vision, etc. These papers were selected from 371 submitted papers and have received instruction and help from many experts, scholars and participants during the preparation of the proceedings. Here, we would like to give our sincere thanks to those who have put great effort and support into the publication of the proceedings. After rigorous peer review, 118 papers were published, including 7 poster papers.

Many thanks go to the Keynote Speakers for sharing their knowledge and expertise with us and to all the authors who have spent the time and effort to contribute significantly to this conference. We are also indebted to the organizing committee for their great efforts in ensuring the successful implementation of the conference. In particular, we would like to thank the technical committee for their constructive and enlightening reviews on the manuscripts within the limited timescale.

We hope that all the participants and the interested readers benefit scientifically from this book and find it stimulating in the process. See you at the next SAI Conference, with the same amplitude, focus and determination.

Regards,
Kohei Arai

Contents

Deep Learning for Detection of Railway Signs and Signals . . . 1
Georgios Karagiannis, Søren Olsen, and Kim Pedersen

3D Conceptual Design Using Deep Learning . . . 16
Zhangsihao Yang, Haoliang Jiang, and Lan Zou

The Effect of Color Channel Representations on the Transferability of Convolutional Neural Networks . . . 27
Javier Diaz-Cely, Carlos Arce-Lopera, Juan Cardona Mena, and Lina Quintero

Weakly Supervised Deep Metric Learning for Template Matching . . . 39
Davit Buniatyan, Sergiy Popovych, Dodam Ih, Thomas Macrina, Jonathan Zung, and H. Sebastian Seung

Nature Inspired Meta-heuristic Algorithms for Deep Learning: Recent Progress and Novel Perspective . . . 59
Haruna Chiroma, Abdulsalam Ya’u Gital, Nadim Rana, Shafi’i M. Abdulhamid, Amina N. Muhammad, Aishatu Yahaya Umar, and Adamu I. Abubakar

Transfer Probability Prediction for Traffic Flow with Bike Sharing Data: A Deep Learning Approach . . . 71
Wenwen Tu and Hengyi Liu

CanvasGAN: A Simple Baseline for Text to Image Generation by Incrementally Patching a Canvas . . . 86
Amanpreet Singh and Sharan Agrawal

Unsupervised Dimension Reduction for Image Classification Using Regularized Convolutional Auto-Encoder . . . 99
Chaoyang Xu, Ling Wu, and Shiping Wang


ISRGAN: Improved Super-Resolution Using Generative Adversarial Networks . . . 109
Vishal Chudasama and Kishor Upla

Deep Learning vs. Traditional Computer Vision . . . 128
Niall O’Mahony, Sean Campbell, Anderson Carvalho, Suman Harapanahalli, Gustavo Velasco Hernandez, Lenka Krpalkova, Daniel Riordan, and Joseph Walsh

Self-localization from a 360-Degree Camera Based on the Deep Neural Network . . . 145
Shintaro Hashimoto and Kosuke Namihira

Deep Cross-Modal Age Estimation . . . 159
Ali Aminian and Guevara Noubir

Multi-stage Reinforcement Learning for Object Detection . . . 178
Jonas König, Simon Malberg, Martin Martens, Sebastian Niehaus, Artus Krohn-Grimberghe, and Arunselvan Ramaswamy

Road Weather Condition Estimation Using Fixed and Mobile Based Cameras . . . 192
Koray Ozcan, Anuj Sharma, Skylar Knickerbocker, Jennifer Merickel, Neal Hawkins, and Matthew Rizzo

Robust Pedestrian Detection Based on Parallel Channel Cascade Network . . . 205
Jiaojiao He, Yongping Zhang, and Tuozhong Yao

Novel Scheme for Image Encryption and Decryption Based on a Hermite-Gaussian Matrix . . . 222
Mohammed Alsaedi

MAP Interpolation of an Ising Image Block . . . 237
Matthew G. Reyes, David L. Neuhoff, and Thrasyvoulos N. Pappas

Volumetric Data Exploration with Machine Learning-Aided Visualization in Neutron Science . . . 257
Yawei Hui and Yaohua Liu

License Plate Character Recognition Using Binarization and Convolutional Neural Networks . . . 272
Sandeep Angara and Melvin Robinson

3D-Holograms in Real Time for Representing Virtual Scenarios . . . 284
Jesús Jaime Moreno Escobar, Oswaldo Morales Matamoros, Ricardo Tejeida Padilla, and Juan Pablo Francisco Posadas Durán


A Probabilistic Superpixel-Based Method for Road Crack Network Detection . . . 303
J. Josiah Steckenrider and Tomonari Furukawa

Using Aerial Drone Photography to Construct 3D Models of Real World Objects in an Effort to Decrease Response Time and Repair Costs Following Natural Disasters . . . 317
Gil Eckert, Steven Cassidy, Nianqi Tian, and Mahmoud E. Shabana

Image Recognition Model over Augmented Reality Based on Convolutional Neural Networks Through Color-Space Segmentation . . . 326
Andrés Ovidio Restrepo-Rodríguez, Daniel Esteban Casas-Mateus, Paulo Alonso Gaona-García, and Carlos Enrique Montenegro-Marín

License Plate Detection and Recognition: An Empirical Study . . . 339
Md. J. Rahman, S. S. Beauchemin, and M. A. Bauer

Automatic Object Segmentation Based on GrabCut . . . 350
Feng Jiang, Yan Pang, ThienNgo N. Lee, and Chao Liu

Vertebral Body Compression Fracture Detection . . . 361
Ahmet İlhan, Şerife Kaba, and Enver Kneebone

PZnet: Efficient 3D ConvNet Inference on Manycore CPUs . . . 369
Sergiy Popovych, Davit Buniatyan, Aleksandar Zlateski, Kai Li, and H. Sebastian Seung

Evaluating Focal Stack with Compressive Sensing . . . 384
Mohammed Abuhussein and Aaron L. Robinson

SfM Techniques Applied in Bad Lighting and Reflection Conditions: The Case of a Museum Artwork . . . 394
Laura Inzerillo

Fast Brain Volumetric Segmentation from T1 MRI Scans . . . 402
Ananya Anand and Namrata Anand

No-reference Image Denoising Quality Assessment . . . 416
Si Lu

Plant Leaf Disease Detection Using Adaptive Neuro-Fuzzy Classification . . . 434
Hiteshwari Sabrol and Satish Kumar

Fusion of CNN- and COSFIRE-Based Features with Application to Gender Recognition from Face Images . . . 444
Frans Simanjuntak and George Azzopardi


Standardization of the Shape of Ground Control Point (GCP) and the Methodology for Its Detection in Images for UAV-Based Mapping Applications . . . 459
Aman Jain, Milind Mahajan, and Radha Saraf

Non-linear-Optimization Using SQP for 3D Deformable Prostate Model Pose Estimation in Minimally Invasive Surgery . . . 477
Daniele Amparore, Enrico Checcucci, Marco Gribaudo, Pietro Piazzolla, Francesco Porpiglia, and Enrico Vezzetti

TLS-Point Clouding-3D Shape Deflection Monitoring . . . 497
Gichun Cha, Byungjoon Yu, Sehwan Park, and Seunghee Park

From Videos to URLs: A Multi-Browser Guide to Extract User’s Behavior with Optical Character Recognition . . . 503
Mojtaba Heidarysafa, James Reed, Kamran Kowsari, April Celeste R. Leviton, Janet I. Warren, and Donald E. Brown

3D Reconstruction Under Weak Illumination Using Visibility-Enhanced LDR Imagery . . . 515
Nader H. Aldeeb and Olaf Hellwich

DynFace: A Multi-label, Dynamic-Margin-Softmax Face Recognition Model . . . 535
Marius Cordea, Bogdan Ionescu, Cristian Gadea, and Dan Ionescu

Towards Resolving the Kidnapped Robot Problem: Topological Localization from Crowdsourcing and Georeferenced Images . . . 551
Sotirios Diamantas

Using the Z-bellSM Test to Remediate Spatial Deficiencies in Non-Image-Forming Retinal Processing . . . 564
Clark Elliott, Cynthia Putnam, Deborah Zelinsky, Daniel Spinner, Silpa Vipparti, and Abhinit Parelkar

Learning of Shape Models from Exemplars of Biological Objects in Images . . . 580
Petra Perner

A New Technique for Laser Spot Detection and Tracking by Using Optical Flow and Kalman Filter . . . 600
Xiuli Wang, Ming Yang, Lalit Gupta, and Yang Bai

Historical Document Image Binarization Based on Edge Contrast Information . . . 614
Zhenjiang Li, Weilan Wang, and Zhengqi Cai

Development and Laboratory Testing of a Multipoint Displacement Monitoring System . . . 629
Darragh Lydon, Su Taylor, Des Robinson, Necati Catbas, and Myra Lydon


Quantitative Comparison of White Matter Segmentation for Brain MR Images . . . 639
Xianping Li and Jorgue Martinez

Evaluating the Implementation of Deep Learning in LibreHealth Radiology on Chest X-Rays . . . 648
Saptarshi Purkayastha, Surendra Babu Buddi, Siddhartha Nuthakki, Bhawana Yadav, and Judy W. Gichoya

Illumination-Invariant Face Recognition by Fusing Thermal and Visual Images via Gradient Transfer . . . 658
Sumit Agarwal, Harshit S. Sikchi, Suparna Rooj, Shubhobrata Bhattacharya, and Aurobinda Routray

An Attention-Based CNN for ECG Classification . . . 671
Alexander Kuvaev and Roman Khudorozhkov

Reverse Engineering of Generic Shapes Using Quadratic Spline and Genetic Algorithm . . . 678
Misbah Irshad, Munazza Azam, Muhammad Sarfraz, and Malik Zawwar Hussain

Bayesian Estimation for Fast Sequential Diffeomorphic Image Variability . . . 687
Youshan Zhang

Copyright Protection and Content Authentication Based on Linear Cellular Automata Watermarking for 2D Vector Maps . . . 700
Saleh AL-ardhi, Vijey Thayananthan, and Abdullah Basuhail

Adapting Treemaps to Student Academic Performance Visualization . . . 720
Samira Keivanpour

Systematic Mobile Device Usage Behavior and Successful Implementation of TPACK Based on University Students Need . . . 729
Syed Far Abid Hossain, Yang Ying, and Swapan Kumar Saha

Data Analysis of Tourists’ Online Reviews on Restaurants in a Chinese Website . . . 747
Meng Jiajia and Gee-Woo Bock

Body of Knowledge Model and Linked Data Applied in Development of Higher Education Curriculum . . . 758
Pablo Alejandro Quezada-Sarmiento, Liliana Enciso, Lorena Conde, Monica Patricia Mayorga-Diaz, Martha Elizabeth Guaigua-Vizcaino, Wilmar Hernandez, and Hironori Washizaki

Building Adaptive Industry Cartridges Using a Semi-supervised Machine Learning Method . . . 774
Lucia Larise Stavarache


Decision Making with Linguistic Information for the Development of New Products . . . 789
Zapata C. Santiago, Escobar R. Luis, and Ponce N. Alvaro

Researcher Profile Ontology for Academic Environment . . . 799
Maricela Bravo, José A. Reyes-Ortiz, and Isabel Cruz

Author Index . . . 819

Deep Learning for Detection of Railway Signs and Signals

Georgios Karagiannis (1,2), Søren Olsen (1), and Kim Pedersen (1)

1 Department of Computer Science, University of Copenhagen, 2200 Copenhagen, Denmark, {geka,ingvor,kimstp}@di.ku.dk
2 COWI A/S, Parallelvej 2, 2800 Lyngby, Denmark, [email protected]

Abstract. Major railway lines need advanced management systems based on accurate maps of their infrastructure. Asset detection is an important tool towards automation of processes and improved decision support in such systems. Due to a lack of available data, limited research exists investigating railway asset detection, despite the rise of Artificial Neural Networks and the numerous investigations on autonomous driving. Here, we present a novel dataset used in real-world projects for mapping railway assets. We also implement Faster R-CNN, a state-of-the-art deep learning object detection method, for the detection of signs and signals on this dataset. We achieved a detection rate of 79.36% and a mAP of 70.9%. The results were compromised by the small size of the objects, the low resolution of the images and the high similarity across classes.

Keywords: Railway · Object detection · Object recognition · Deep learning · Faster R-CNN

1 Introduction

The ever-increasing modernisation of signal systems and electrification of major railway lines lead to increasingly complex railway environments. These environments require advanced management systems which incorporate a detailed representation of the network, its assets and surroundings. The aim of such systems is to facilitate automation of processes and improved decision support, minimising the requirement for expensive and inefficient on-site activities. Fundamental requirements are detailed maps and databases of railway assets, such as poles, signs, wires, switches, cabinets and signalling equipment, as well as the surrounding environment including trees, buildings and adjacent infrastructure. The detailed maps may also form the basis for a simulation of the railway as seen from the train operator's viewpoint. Such simulations/videos are used in the training of train operators and support personnel. Ideally, the maps should be constantly updated to ensure currency of the databases as well as to facilitate detailed documentation and support of maintenance and construction processes in the networks. However, with currently available methods, mapping of railway assets is largely a manual and highly labour-intensive process, limiting the possible levels of detail and revisit times.

The response to this challenge is to automate railway asset mapping based on different sensor modalities (2D images or 3D point clouds) acquired from the ground or air. Despite the high demand for automatic asset detection along railways, there is very little research in this field [1]. Here, we present an approach to the detection of signs and signals as a first step towards automatic generation and update of maps of railway environments. We implement an object detection model based on Faster R-CNN (Region-based Convolutional Neural Network), presented by Ren et al. in [2], on a dataset used to map 1,700 km of railway in 2015. The mapping was carried out manually by a private company (the authors' second affiliation, COWI A/S) with extensive experience in such projects. Currently, many such projects exist around the world and, to the best of our knowledge, are still carried out manually (people going through all images and marking objects of interest). Our approach aims to show the performance of an advanced object detection algorithm, such as Faster R-CNN, on a novel dataset used in a real-world project.

2 Literature Review

2.1 Previous Approaches

The research on automatic object detection along railways is sparse compared to the analogous, popular field of road furniture detection, mainly due to the lack of available railway traffic data [1]. Most of the research is focused on passenger detection [3,4] or track detection [5–9] for different purposes. The limited existing research has focused on detection of only a single type of object (sign recognition [10], sign detection [11] or wire detection [12]).

Marmo et al. [10] presented a classical approach for railway signal detection. It is focused on detecting a specific type of signal (single element) in video frames and classifying it according to the colour of the light (green: pass, red: no pass). The implementation is based on simple image processing techniques such as histogram analysis, template matching and shape feature extraction. The method resulted in 96% detection accuracy and 97% classification accuracy on a total of 955 images, which is impressive for this type of approach. The advantage of this method is efficiency; however, it is focused on a very specific type of signal and the examples presented are scenes of low complexity.

Arastounia [12] presented an approach for detection of railway infrastructure using 3D LiDAR data. The approach is focused on detection of cables related to the railway (i.e. catenary, contact or return current cables), track bed, rail tracks and masts. The data covers about 550 m of Austrian railways. The approach is based mainly on the topology of the objects and their spatial properties. Points on the track bed are first detected from a spatially local statistical analysis. All the other objects of interest are recognised depending on their spatial relation with the track bed. The overall average detection achieved by this method is 96.4%. The main drawback of this approach is that it depends on a sophisticated type of data that needs special equipment to capture and is more complicated to process.

Agudo et al. [1] presented a real-time railway speed limit and warning sign recognition method for videos. After noise removal, Canny edge detection is applied and an optimised Hough voting scheme detects the sign region of interest on the edge image. Based on gradient directions and distances of points on the edge of a shape, candidate central points of signs are identified. Then, recognition is achieved by applying shape criteria, since the signs are either circular, square or rectangular. The method scored 95.83% overall accuracy on classification. However, even though the dataset had more than 300,000 video frames, only 382 ground truth signs existed and the authors do not provide any score for detection accuracy.

2.2 Convolutional Neural Networks

All of the above approaches use traditional image analysis methods to solve object detection problems. To the best of our knowledge, there are no published methods that attempt to solve object detection in railways based on Convolutional Neural Networks (CNNs). CNNs represent the state-of-the-art concept in Computer Vision for classification, detection and semantic segmentation. Regarding object detection, we can divide most CNN-based methods into two categories: region-based and single-shot methods. The most characteristic representative of the first is the Region-based CNN (R-CNN) and its descendants Fast, Faster and the recent Mask R-CNN [2,13–15]. From the second category, the most representative methods are You Only Look Once (YOLO) [16] and the Single Shot multibox Detector (SSD) [17]. In general, region-based methods are considerably slower but more effective. Also, region-based methods show better performance on smaller objects [2,16]. Given the performance shown in competitive challenges and the fact that our dataset consists mainly of very small objects, we consider Faster R-CNN [2] the most suitable for our problem.

3 Data Analysis

3.1 Dataset

In our case, the dataset consists of 47,912 images acquired in 2013, showing the railway from Brisbane to Melbourne in Australia. The images were acquired on three different days in the morning with similar sunlight conditions. The camera used is a Ladybug3 spherical camera system and the images are panoramic views of size 5400 × 2700 pixels. The images were annotated manually by the production team of the company (the authors' second affiliation), resulting in 121,528 instances of railway signs and signals.

Fig. 1. Instances of signals and signs. First row from left to right (class name in parentheses when it is different from the description): speed sign (Sign S), unit sign (Sign U), speed standard letter (Sign Other), position light main (PL 2W), position light separately (PL 2W& 1R), signal with two elements (Signal2 F). Second row: diverging left-right speed sign (Sign LR), diverging left speed sign (Sign L), signal number (Sign Signal), signal with one element (Signal1 F), signal with three elements (Signal3 F), signal with four elements (Signal4 F).

Fig. 2. Instances of speed signs at different lighting conditions, viewpoints and scales.

The samples are originally separated into twenty-five classes. Each class is in fact a subclass of two parent classes, signs and signals. Fifteen classes correspond to signals: three different types of position lights with their back and side views, signals with one, two, three or four elements (lights), their side and back views, and other types of signals. Also, ten classes correspond to different types of signs: speed signs, diverging left speed signs, diverging right speed signs, diverging left-right speed signs, unit signs, signal number signs, other signs, back views of circular signs, back views of diamond signs and back views of rectangular signs. From the total amount of samples, 67,839 correspond to signals and 53,689 to signs. Figure 1 shows some instances of signs and signals; each of these instances corresponds to a different class. We can see the high similarity among the classes both for signs and signals. Specifically, diverging left, diverging right, diverging left-right and regular speed signs are very similar, especially when they are smaller than the examples shown in Fig. 1. Similarly, even for humans it is often hard to distinguish between signals with three or four elements when they are small. All examples shown here are larger than their average size in the dataset for clarity. Figure 2 shows examples of regular speed signs with different viewpoints, sizes and illumination. These examples illustrate the scale, viewpoint and illumination variation of the objects but at the same time the high similarity among the classes.


Fig. 3. Instances of signals with one element (Signal1). All four examples belong to the same class even though they have different characteristics. From left to right: some have a long top cover (first and second), no top cover at all (third) or a short cover (last). Also, some have no back cover (first), while others have a circular (second and third) or semicircular one (last).

Figure 3 shows examples of the same class of signals (Signal1). It is important to note that this class is one of the least represented in the dataset, with only a few hundred samples available. However, despite the low availability, we can see that there is significant intra-class variability. The same is observed in all the other classes of signals except the ones corresponding to position lights.

Figure 4 shows the amount of available samples for each class. These quantities vary widely across the different classes. For instance, there are about 23,000 samples available for the front view of 4-lamp signals but only a few hundred for position lights with two lamps or for diverging left-right speed signs. Our dataset reflects the real distribution of objects along a railway, which means that along a railway there exist very few position lights with two lamps and diverging left-right speed signs. Therefore, this level of imbalance among the classes is unavoidable in real applications. However, in deep learning, large amounts of samples are necessary to train a robust model. A common workaround to ensure adequate balance among classes is to apply data augmentation techniques (e.g. random crops, rotations, scaling, illumination changes etc. on the existing samples). However, such techniques cannot solve the problem in our case without biasing the dataset, because the difference in available samples is too high. Past observations [18] have found that non-iconic samples may be included during training only if the overall amount of samples is large enough to capture such variability. In any other case, these samples may act as noise and pollute the model. Thus, it is necessary for some classes with similar characteristics to be merged.

By merging all signs except speed and signal number signs into a general class (Sign Other), we get a class with about 30,000 samples. Also, we merged all position lights into a single class, resulting in about 10,000 samples for this class. Finally, the front and side views of each signal class were merged into a single class. The back views of all signals remained a separate class because there was no specific information available on which type of signal each back view belonged to. After these operations, we end up with ten classes of at least a few thousand samples each (Fig. 5). This way, the misrepresentation problem is softened; however, we introduce high intra-class variability. The least represented class is the single-element signals (Signal1), with about 6,000 samples, which is still about four times fewer than the most dominant class, but more manageable.
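The merging described above is essentially a relabelling of the annotation classes. The sketch below illustrates one way such a relabelling could be applied; only some of the raw label names are given in the paper (Fig. 1), so the remaining keys in this mapping are hypothetical placeholders rather than the actual annotation labels.

```python
# Hypothetical relabelling of raw annotation classes into the ten merged
# training classes described in the text. Labels marked "assumed" are
# illustrative placeholders, not names confirmed by the paper.
MERGE_MAP = {
    # speed signs and signal-number signs keep their own classes
    "Sign S": "Sign S",
    "Sign Signal": "Sign Signal",
    # every other sign type goes into the generic 'Sign Other' class
    "Sign U": "Sign Other",
    "Sign L": "Sign Other",
    "Sign LR": "Sign Other",
    "Sign Back Circular": "Sign Other",    # assumed raw label
    # all position-light variants are merged into a single class
    "PL 2W": "PL",
    "PL 2W& 1R": "PL",
    # front and side views of each signal type are merged; back views form
    # a separate class because the signal type is unknown from behind
    "Signal1 F": "Signal1",
    "Signal1 S": "Signal1",                # assumed raw label
    "Signal2 F": "Signal2",
    "Signal3 F": "Signal3",
    "Signal4 F": "Signal4",
    "Signal Back": "Signal Back",          # assumed raw label
}

def merge_label(raw_label: str) -> str:
    """Map a raw annotation label to its merged training class."""
    return MERGE_MAP.get(raw_label, raw_label)
```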


Fig. 4. Amount of sample instances per class before merging some classes. The imbalance among the classes is too high.

Fig. 5. Amount of sample instances per class. After merging some classes with similar characteristics we end up with a more balanced dataset.

Another important aspect of the samples is their size. Figure 6 is a histogram of the size of the samples in square pixels. About 65% of the samples have an area of less than 1000 pixels (≈32²) and 89% less than 2500 pixels (50²). Given the size of the panoramic images, a 50² sample corresponds to 0.018% of the whole image. In COCO [18], one of the most challenging datasets in Computer Vision, the smallest objects correspond to 4% of the entire image. The winners of the 2017 COCO Detection: Bounding Box challenge achieved less than 55% accuracy. This is an indication of the difficulty of our problem in terms of the relative size of objects.
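As a quick check of the relative-size figure quoted above, using the 5400 × 2700 panorama size given at the start of this section:

$$\frac{50^2}{5400 \times 2700} = \frac{2500}{14{,}580{,}000} \approx 0.017\%,$$

which is in line with the ≈0.018% stated in the text.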


Fig. 6. Amount of samples according to their size in pixels². Most samples (89%) are smaller than 50² pixels.

A reason behind the high amount of very small objects in our dataset is that the data was acquired by driving on a single track. However, in many sectors along the railway there are multiple parallel tracks, and the signs and signals corresponding to these tracks appear in the dataset only at small sizes since the camera never passed close to them. One way to limit the small-object-size problem in our dataset is to split each panoramic image into smaller patches with high overlap, small enough to achieve a less challenging relative size between objects and image. Specifically, in our approach, each image is split into 74 patches of size 600² pixels with 200 pixels of overlap on each side. Even at this level of fragmentation, a 50² object corresponds to 0.69% of a patch. A consequence of splitting the images into smaller patches is that the same object may now appear in more than one patch due to the overlap. In fact, while the panoramic images contain 121,528 object instances, the extracted patches contain 203,322 instances. The numbers shown in Fig. 6 correspond to the objects existing on the patches.
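A minimal sketch of such an overlapping tiling is shown below. The paper specifies only the patch size (600 × 600 pixels) and the 200-pixel overlap; the exact grid that produces 74 patches per panorama is not described, so the stride and border handling here are assumptions.

```python
# Sketch of an overlapping tiling of a 5400 x 2700 panorama into 600 x 600
# patches with 200-pixel overlap between neighbouring patches. One plausible
# implementation, not the paper's exact scheme.
def patch_origins(image_w=5400, image_h=2700, patch=600, overlap=200):
    stride = patch - overlap  # 400 pixels between consecutive patch origins
    xs = list(range(0, image_w - patch + 1, stride))
    ys = list(range(0, image_h - patch + 1, stride))
    # make sure the right and bottom borders are fully covered
    if xs[-1] + patch < image_w:
        xs.append(image_w - patch)
    if ys[-1] + patch < image_h:
        ys.append(image_h - patch)
    return [(x, y) for y in ys for x in xs]

def crop_patches(image, origins, patch=600):
    """Crop patches from an image array of shape (H, W, C)."""
    return [image[y:y + patch, x:x + patch] for (x, y) in origins]
```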

3.2 Railway vs Road Signs and Signals

Here, it is important to highlight the difference between the problem described in this paper and the more popular topic of the detection of road signs and signals. The most important difference is the size of the objects. The height of a road sign varies from 60 to 150 cm depending on the type of road [19], while in railways it is usually less than 40 cm [20]. Given also that, most of the time, the signs are located only a few centimetres above the ground, supported by a very short pole, they are much harder to detect. Also, in railways, signs are often very similar but have different meanings, like the first two examples of the second row in Fig. 1. At the same time, it is very common along the same railway for objects of the same class to look different, as shown in Fig. 3. Thus, a detector of railway signs and signals needs to be able to distinguish objects based on fine details. Finally, in railways the signs and the signals are often combined, creating more complex structures that pose an extra challenge to a detection algorithm (e.g. the detected signals shown in Fig. 10). Given the above differences, we consider railway signs a more challenging detection problem.

4 Methodology

4.1 Faster R-CNN

For the detection of signs and signals, we applied the Faster R-CNN presented by Ren et al. in [2] using ResNet-101 [21] as feature extractor. We decided to implement this approach mainly motivated by its high performance on competitive datasets such as Pascal VOC 2007 (85.6% mAP), Pascal VOC 2012 (83.8% mAP) and COCO (59% mAP). The main drawback of this approach compared to other acknowledged object detection methods such as YOLO [16]) or SSD is its high processing time (three to nine times slower depending on the implementation [17]). However, the sacrifice in time pays off in accuracy, especially in this dataset since this method performs better on small objects [2]. Here we will present some key points of Faster R-CNN. First, Faster RCNN is the descendant of Fast R-CNN [2] which in turn is the descendant of R-CNN [13]. As their names imply, Fast and Faster R-CNNs are more efficient implementations of the original concept in [13], R-CNN. The main elements of Faster R-CNN are: (1) the base network, (2) the anchors, (3) the Region Proposal Network (RPN) and (4) the Region based Convolutional Neural Network (RCNN). The last element is actually Fast R-CNN, so with a slight simplification we can state that Faster R-CNN = RPN + Fast R-CNN. The base network is a, usually deep, CNN. This network consists of multiple convolutional layers that perform feature extraction by applying filters at different levels. A common practice [2,14,16] is to initialize training using a pre-trained network as a base network. This helps the network to have a more realistic starting point compared to random initialization. Here we use ResNet [21]. The second key point of this method is the anchors, a set of predefined possible bounding boxes at different sizes and aspect ratios. The goal of using the anchors is to catch the variability of scales and sizes of objects in the images. Here we used nine anchors consisting of three different sizes (152 , 302 and 602 pixels) and three different aspect ratios 1:1, 1:2 and 2:1. Next, we use RPN, a network trained to separate the anchors into foreground and background given the Intersection over Union (IoU) ratio between the anchors and a ground-truth bounding box (foreground if IoU > 0.7 and background if IoU < 0.1). Thus, only the most relevant anchors for our dataset are used. It accepts as input the feature map output of the base model and creates two outputs: a 2 × 9 box-classification layer containing the foreground and background probability for each of the nine different anchors and a 4 × 9 box-regression layer containing the offset values on x and y axis of the anchor


Fig. 7. Right: the architecture of Faster R-CNN. Left: the region proposal network (RPN). Source: Figs. 2 and 3 of [2].

bounding box compared to the ground-truth bounding boxes. To reduce redundancy due to overlapping bounding boxes, non-maximum suppression is applied to the proposed bounding boxes based on their score in the box-classification output. A threshold of 0.7 on the IoU is used, resulting in about 1,800 proposal regions per image in our case (about 2,000 in the original paper). Afterwards, for every region proposal we apply max pooling on the features extracted from the last layer of the base network. Finally, Fast R-CNN is implemented, mainly by two fully connected layers as described originally in [14]. This network outputs a 1 × (N + 1) vector (a probability for each of the N classes plus one for the background class) and a 4 × N matrix (where 4 corresponds to the bounding box offsets across the x and y axis and N to the number of classes). Figure 7 shows the structure of Faster R-CNN and the RPN.

The RPN and R-CNN are trained according to the 4-step alternating training to learn shared features. At first, the RPN is initialized with ResNet and fine-tuned on our data. Then, the region proposals are used to train the R-CNN separately, again initialized by the pre-trained ResNet. Afterwards, the RPN is initialized by the trained R-CNN with the shared convolutional layers fixed, and the non-shared layers of the RPN are fine-tuned. Finally, with the shared layers fixed, the non-shared layers of the R-CNN are fine-tuned. Thus, the two networks are unified.
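To make the anchor construction concrete, the following sketch (not the authors' code) generates the nine anchors described above and tiles them over a feature map; the feature-map size and stride values are illustrative placeholders, not values reported in the paper.

```python
import numpy as np

def generate_anchors(base_sizes=(15, 30, 60), ratios=(1.0, 0.5, 2.0)):
    """Return the nine base anchors as (x1, y1, x2, y2) boxes centred on the
    origin. Sizes are the square roots of the anchor areas (15^2, 30^2, 60^2
    pixels); ratios are height/width, covering 1:1, 1:2 and 2:1."""
    anchors = []
    for s in base_sizes:
        area = float(s * s)
        for r in ratios:
            w = np.sqrt(area / r)          # width shrinks as height/width grows
            h = w * r
            anchors.append([-w / 2.0, -h / 2.0, w / 2.0, h / 2.0])
    return np.array(anchors)

def shift_anchors(anchors, feat_h, feat_w, stride):
    """Tile the base anchors over every cell of the base network's feature map."""
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx.ravel(), cy.ravel(), cx.ravel(), cy.ravel()], axis=1)
    return (anchors[None, :, :] + shifts[:, None, :]).reshape(-1, 4)

# feat_h, feat_w and stride are hypothetical values for illustration only.
all_anchors = shift_anchors(generate_anchors(), feat_h=38, feat_w=63, stride=16)
print(all_anchors.shape)  # (38 * 63 * 9, 4)
```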

4.2 Evaluation Method

For the evaluation of detection, an overlap criterion between the ground truth and the predicted bounding box is defined. If the Intersection over Union (IoU) of these two boxes is greater than 0.5, the prediction is considered a True Positive (TP). Multiple detections of the same ground truth object are not considered true positives; each predicted box is either a True Positive or a False Positive (FP). Predictions with IoU smaller than 0.5 count as False Positives, while ground truth objects with no matching prediction count as False Negatives (FN). Precision is defined as the fraction of the correct detections over the total


detections ($\frac{TP}{TP+FP}$). Recall is the fraction of correct detections over the total amount of ground truth objects ($\frac{TP}{TP+FN}$) [22]. For the evaluation of classification and the overall accuracy of our approach, we adopted mean Average Precision (mAP) [23] as a widely accepted metric [22]. For each class, the predictions satisfying the overlap criterion are assigned to ground truth objects in descending order of the confidence output. The precision/recall curve is computed and the average precision (AP) is the mean value of the interpolated precision at eleven equally spaced levels of recall [22]:

$$\mathrm{AP} = \frac{1}{11} \sum_{r \in \{0, 0.1, \dots, 1\}} p_{\mathrm{interp}}(r) \qquad (1)$$

where

$$p_{\mathrm{interp}}(r) = \max_{\tilde{r} : \tilde{r} \geq r} p(\tilde{r}) \qquad (2)$$

Then, the mean of all the APs across the classes is the mAP metric. This metric was used as the evaluation method for the Pascal VOC 2007 detection challenge and has since become the most common evaluation metric in object detection challenges [18].
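As an illustration of this evaluation protocol, the snippet below is a minimal NumPy sketch (not the evaluation code used in the paper) of the IoU overlap criterion and the 11-point interpolated AP of Eqs. (1) and (2); the toy precision/recall values at the end are made up.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def eleven_point_ap(recall, precision):
    """11-point interpolated AP (Eqs. 1 and 2): average, over recall levels
    0, 0.1, ..., 1, of the maximum precision at recall >= that level."""
    recall, precision = np.asarray(recall), np.asarray(precision)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        p_interp = precision[mask].max() if mask.any() else 0.0
        ap += p_interp / 11.0
    return ap

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))          # overlap criterion
print(eleven_point_ap([0.1, 0.4, 0.7], [1.0, 0.8, 0.6]))  # toy PR curve
```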

5 Results

The network was trained on a single Titan-X GPU for approximately two days. We trained it for 200k iterations with a learning rate of 0.003

Fig. 8. Percentage of objects detected according to their size in pixels². The algorithm performed very well on detecting objects larger than 400 pixels (more than 79% in all size groups). On the other hand, it performed poorly on very small objects (less than 200 pixels), detecting only 24% of them. The performance was two times better on objects with size 200–400 pixels (57%), which is the most dominant size group with about 45,000 samples (see Fig. 6).


and for 60k iterations with a learning rate of 0.0003. The model achieved an overall precision of 79.36% on the detection task and 70.9% mAP. The precision level of the detection is considered high given the challenges imposed by the dataset, as presented in Sect. 3. Figure 8 shows the detection performance of the algorithm with respect to the size of the samples. We can see that the algorithm fails to detect very small objects: about 76% of objects smaller than 200 pixels are not detected. A significantly better, but still low, performance is observed for objects of 200–400 pixels in size (57% detection rate). On the other hand, the algorithm performs uniformly well for objects

Fig. 9. Example of detection. Two successful detections and one false positive. We can see the illumination variance even in a single image with areas in sunlight and in shadow. The sign is detected successfully despite its small size and the partial occlusion from the bush. The entrance of the tunnel was falsely detected as a signal with three elements.


larger than 400 pixels (79% for sizes 400–600 and more than 83% for objects larger than 600 pixels). These results show that, in terms of object size, there is a threshold above which the performance of Faster R-CNN is stable and unaffected by the size of the objects. Therefore, if there are enough instances and the objects are large enough, object size is not a crucial factor. In our case, this threshold is about 500 pixels. An interesting metric for our problem would be to measure the number of unique physical objects detected. The goal of object detection in computer vision is usually to detect all the objects that appear in an image. In mapping, most of the time, it is important to detect the locations of the physical objects. If we have multiple images showing the same physical object from different points of view and at different scales, in mapping it would be sufficient to detect it at

Fig. 10. Example of detection. A successful detection of more complex object structures. Different signals mounted on top of each other did not confuse the algorithm.


least in one image and not necessarily in all images. We expect that our approach would perform better on this metric; however, information about unique physical objects is not available for this dataset. Figures 9 and 10 show two representative examples of the results. In these figures we can see the variance in illumination conditions and the small size of the objects. The speed signs that appear in these images are very small, but they belong to the most common object size interval for this dataset (200–400 pixels). By looking at these figures, we can also realise the difficulty of the classification task on this dataset. Even for a human, it is hard to decide which class the objects in these images belong to.

Fig. 11. Confusion matrix showing the classification AP of the detected objects. Rows represent the predictions and columns the ground truth objects. Last row and last column (red colour) summarise the performance per class. Bottom right cell (blue colour) is the mAP of the method. See Fig. 2 for class name explanation.


Figure 11 is a confusion matrix that summarises the performance of the method on classification. The diagonal represents the Average Precision (as described in Sect. 4.2) for each class, and the bottom right cell is the mean of these precisions. A first observation is that there is significant variation in accuracy across the different classes. As expected, the accuracy is lower for the classes trained with few samples (position lights) and for those with high similarities to other classes (back view of signals). An interesting exception to this observation is the class of signals with one element (S1). Despite the low amount of training samples and the fact that all classes of signals show high resemblance to each other, its classification accuracy is above the overall mAP and among the highest.

6 Conclusions

In this paper, we presented a novel dataset used in real-world projects for mapping of railway infrastructure. The dataset is very challenging, mainly because the objects are very small relative to the size of the entire images but also in absolute terms of pixels². We implemented a state-of-the-art deep learning method for detection and classification of signs and signals on this dataset, scoring 79.36% on the detection task and 70.9% on the classification task. We believe that the accuracy of the model would be higher on a similar dataset with better resolution images. Also, given that the object positions along a railway follow specific regulations, taking into account the spatial relationships among the objects could improve accuracy. Moreover, it would be interesting to apply other state-of-the-art methods to this dataset and analyse the strengths and weaknesses of each method on the same tasks. In addition, for mapping tasks such as the current one, a more appropriate metric would evaluate detections based on the number of physical objects and not on objects appearing in the images. In conclusion, our results are promising and suggest further investigation into the use of deep learning for railway asset detection and mapping.

References

1. Agudo, D., Sánchez, Á., Vélez, J.F., Moreno, A.B.: Real-time railway speed limit sign recognition from video sequences. In: 2016 International Conference on Systems, Signals and Image Processing (IWSSIP), pp. 1–4, May 2016
2. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, vol. 2015, pp. 91–99 (2015)
3. Zheng, D., Wang, Y.: Application of an artificial neural network on railway passenger flow prediction. In: Proceedings of 2011 International Conference on Electronic Mechanical Engineering and Information Technology, vol. 1, pp. 149–152, August 2011
4. Tsai, T.-H., Lee, C.-K., Wei, C.-H.: Neural network based temporal feature models for short-term railway passenger demand forecasting. Expert Syst. Appl. 36(2), 3728–3736 (2009). Part 2. http://www.sciencedirect.com/science/article/pii/S0957417408001516
5. Karthick, N., Nagarajan, R., Suresh, S., Prabhu, R.: Implementation of railway track crack detection and protection. Int. J. Eng. Comput. Sci. 6(5), 21476–21481 (2017). http://ijecs.in/index.php/ijecs/article/view/3535
6. Sadeghi, J., Askarinejad, H.: Application of neural networks in evaluation of railway track quality condition. J. Mech. Sci. Technol. 26(1), 113–122 (2012). https://doi.org/10.1007/s12206-011-1016-5
7. Gibert, X., Patel, V.M., Chellappa, R.: Robust fastener detection for autonomous visual railway track inspection. In: 2015 IEEE Winter Conference on Applications of Computer Vision, pp. 694–701, January 2015
8. Sinha, D., Feroz, F.: Obstacle detection on railway tracks using vibration sensors and signal filtering using Bayesian analysis. IEEE Sens. J. 16(3), 642–649 (2016)
9. Faghih-Roohi, S., Hajizadeh, S., Núñez, A., Babuska, R., De Schutter, B.: Deep convolutional neural networks for detection of rail surface defects. In: 2016 International Joint Conference on Neural Networks (IJCNN), pp. 2584–2589, July 2016
10. Marmo, R., Lombardi, L., Gagliardi, N.: Railway sign detection and classification. In: 2006 IEEE Intelligent Transportation Systems Conference, pp. 1358–1363, September 2006
11. Melander, M., Halme, I.: Computer vision based solution for sign detection. Eur. Railway Rev. (2016). https://www.globalrailwayreview.com/article/30202/computer-vision-based-solution-sign-detection/
12. Arastounia, M.: Automated recognition of railroad infrastructure in rural areas from LIDAR data. Remote Sens. 11, 14916–14938 (2015)
13. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 580–587 (2014)
14. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, vol. 2015, pp. 1440–1448 (2015)
15. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the International Conference on Computer Vision (ICCV) (2017)
16. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2016, pp. 779–788 (2016)
17. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C.: SSD: single shot multibox detector. In: Lecture Notes in Computer Science, vol. 9905, pp. 21–37 (2016)
18. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: Lecture Notes in Computer Science, vol. 8693, pp. 740–755 (2014)
19. Great Britain, Department for Transport: Traffic Signs Manual. The Stationery Office, London, United Kingdom (2013)
20. Rail Safety and Standards Board: Lineside Operational Safety Signs. Railway Group Standard, London (2009)
21. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2016, pp. 770–778 (2016)
22. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010). https://doi.org/10.1007/s11263-009-0275-4
23. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill Inc., New York (1986)

3D Conceptual Design Using Deep Learning

Zhangsihao Yang, Haoliang Jiang, and Lan Zou

Carnegie Mellon University, Pittsburgh, PA 15213, USA
{zhangsiy,haolianj,lzou1}@andrew.cmu.edu

Abstract. This article proposes a data-driven methodology to achieve fast design support in order to generate or develop novel designs covering multiple object categories. This methodology implements two state-of-the-art Variational Autoencoders dealing with 3D model data, together with a self-defined loss function. The loss function, which contains the outputs of individual layers in the autoencoder, obtains combinations of different latent features from different 3D model categories. This article provides a detailed explanation of how the Princeton ModelNet40 database, a comprehensive clean collection of 3D CAD models of objects, is utilized. After converting the original 3D mesh files to voxel and point cloud data types, the model feeds an autoencoder with data of the same dimension. The novelty is to leverage the power of deep learning methods as an efficient latent feature extractor to explore unknown design areas. The output is expected to show a clear and smooth interpolation between models from different categories in order to generate new shapes. This article explores (1) the theoretical ideas, (2) the progress in implementing Variational Autoencoders to attain implicit features from input shapes, (3) the resulting output shapes during training in selected domains of both 3D voxel data and 3D point cloud data, and (4) the conclusions and future work towards a more ambitious goal.

Keywords: Design support · Data analysis · 3D representation · Generative model · Computer vision

1 Introduction

Three-dimensional (3D) representations of real-life objects have become an essential element of applications in computer vision, robotics, medicine, virtual reality, and mechanical engineering. Due to the increase of interest in tapping into the latent information of 3D models in both academia and industry, generative neural networks for 3D objects have achieved impressive improvements in the last several years. Recent research achievements on generative deep learning neural network concepts include the autoencoder [4], the Variational Autoencoder (VAE) [1] and the generative adversarial model [6]. These models have led to great progress in natural language processing [3], machine translation and image generation [10]. With their capability of extracting features from data, these models are also widely used to generate embedding vectors [10]. In studies of 3D shapes, researchers have utilized these state-of-the-art generative neural networks to take latent feature vectors as the representation of 3D voxelized shapes and 3D point cloud shapes. This research shows promising results in experiments. To support the


3D object research community, several large 3D CAD model databases have been released, notably ModelNet [11] and ShapeNet [5]. Although generative deep learning neural networks can extract general latent features from a training database, as well as apply variable changes to the output shape, the output 3D models are still restricted to certain design areas: the network cannot compute or conclude a result when an important part of the input data is missing. As a result, we exploited recent advances in generative models of 3D voxel shapes [9], 3D point cloud shapes [2] and a series of research works on fast image and video style transfer proposed initially by Leon A. Gatys et al. [4]. In that paper, the authors came up with the concepts of the style and the content of two input images and used a loss function to combine the features extracted from both images. After that, convolutional neural networks were used for real-time style transfer [8] and video style transfer [7]. Motivated by those works, we propose a multi-model architecture with a pattern-balanced loss function to extract implicit information from multiple input objects and set the implicit information as constraints on the output of our deep learning neural network. By using our methodology, we can combine the latent features in our output 3D model and generate entirely new design areas. Briefly, the contributions of this project are as follows:

• To propose a novel multi-model deep learning neural network and define a multi-pattern loss function to learn implicit features from input objects of the same category using 3D point clouds, and from objects of different categories using 3D voxel data.
• To show that our model is capable of realizing shape transfer, which will be helpful to break current design domains and generate brand new novel designs.
• To demonstrate the potential of implementing a generative neural network to develop fast 3D model design support based on the methodology proposed in this project.

The general pipeline of our project is:

• Data processing: convert mesh files into 3D voxel data and 3D point cloud data.
• Building and training a Variational Shape Learner to verify its performance.
• Building and training a Point Net autoencoder to verify its performance.
• Implementing VSL and the Point Net autoencoder in our system.
• Experiments: applying different layer outputs in the loss function; taking multiple categories as the input pattern, such as furniture and transportation.

2 Background

3D Object Representation: In this project, the authors tap into voxel data and point clouds as 3D object representations. A voxel represents a value on a 3D regular grid. Voxel data are capable of representing a simple 3D structure with both empty and filled cubes, and are frequently used to display volumetric objects. Point cloud data are a type of 3D space representation using a set of 3D data points. They are generally produced by 3D scanners, which sample a large number of points from the surface of objects. Point clouds are widely used as a detailed 3D object representation and also serve as 3D CAD models for design and manufacturing.


Autoencoders: Autoencoders are generative deep neural network architectures designed to extract features from the input and reproduce the output. Usually, they contain a narrow bottleneck layer as a low-dimensional representation, or embedding code, of the input data. An autoencoder has two parts: an encoder and a decoder. The encoder downsamples the input to the latent code; the decoder expands the latent code to reproduce the input data. Variational Autoencoder (VAE) models are autoencoders that take advantage of the distribution of the latent variables. In a VAE, a variational approach is used to learn the latent representation; as a result, a VAE contains an additional loss component and a specific training algorithm.

Neural Style Transfer: Neural style transfer (NST) has been a trending topic since 2016. It can be interpreted as the process of using a Convolutional Neural Network (CNN) to embed certain styles into image or video contents. It is widely studied in the academic literature and in industrial applications, and currently receives increasing attention from industry. A variety of approaches, not limited to CNNs, has been used to solve the problem or explore new aspects of it. Nevertheless, 3D style transfer is still an unexplored topic that could be very useful in the 3D graphics area.
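For readers unfamiliar with the "additional loss component" of a VAE, the fragment below is a generic sketch, assuming TensorFlow 2; it is not the VSL or Point Net implementation used later in this paper, and the squared-error reconstruction term is only one possible choice.

```python
import tensorflow as tf

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping the sampling
    step differentiable with respect to the encoder outputs."""
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

def vae_loss(x, x_recon, mu, log_var):
    """Reconstruction term plus the KL divergence to a unit Gaussian prior,
    i.e. the extra loss component that distinguishes a VAE from a plain
    autoencoder. Illustrative only; not the loss used in this work."""
    recon = tf.reduce_mean(tf.reduce_sum(tf.square(x - x_recon), axis=-1))
    kl = -0.5 * tf.reduce_mean(
        tf.reduce_sum(1.0 + log_var - tf.square(mu) - tf.exp(log_var), axis=-1))
    return recon + kl
```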

3 Network Architecture

The network architecture described in this paper is inspired by fast style transfer. In this architecture, there are one auto-encoder and three loss providers, each of which has a different function. The auto-encoder is used to transform the input data into the desired output style and is called the transform machine. The three loss providers are themselves auto-encoders utilized to provide the terms of the loss function, and each of them can be called separately: the content loss provider, the style loss provider, and the prediction loss provider. The assumption in the network architecture is that the outputs of specific layers provide information about a 3D model's content, while the outputs of certain other layers offer information about a 3D model's style. In 2D, content and style refer to the content and the style of an image. For example, when offering one image of New York City in daylight and one image of Pittsburgh at night, an output that preserves the content of New York City and the style of Pittsburgh will be an image of New York City in moonlight. In this article, however, content and style simply refer to input 1 and input 2, so we can change the mixture level of content and style with higher flexibility by swapping content and style as needed, depending on the scenario. Take the style transform from an airplane to a car as an example. First, the airplane is put into the auto-encoder; the output is called the transformed airplane. Then the transformed airplane, the original airplane and the original car are put into three different loss providers to generate the loss terms for back-propagation in the network. One thing to notice is that the loss providers do not perform any back-propagation: their weights are fixed during training, and the only weights that change during the training period are those of the style-transform auto-encoder. For more detail on why part of the network is fixed, the fast style transfer article [7] is a useful reference for interested readers.
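The following fragment is a hedged sketch of how such a combined loss could be wired up, assuming TensorFlow 2; `content_features` and `style_features` stand for the frozen loss providers' chosen layer outputs, and the names and the `style_ratio` weighting are illustrative rather than taken from the authors' implementation.

```python
import tensorflow as tf

def transfer_loss(transform_net, content_features, style_features,
                  content_in, style_in, style_ratio=0.5):
    """Combined loss in the spirit of the content/style loss providers.
    `content_features` and `style_features` are callables that run a frozen
    loss-provider auto-encoder and return the chosen layer outputs; only
    `transform_net` should receive gradients (the providers are fixed)."""
    transformed = transform_net(content_in)
    c_target = tf.stop_gradient(content_features(content_in))
    s_target = tf.stop_gradient(style_features(style_in))
    content_loss = tf.reduce_mean(tf.square(content_features(transformed) - c_target))
    style_loss = tf.reduce_mean(tf.square(style_features(transformed) - s_target))
    # Changing style_ratio shifts the output between the two inputs,
    # as in the airplane-to-car interpolation reported in Sect. 5.
    return (1.0 - style_ratio) * content_loss + style_ratio * style_loss
```

In training, the gradients of this loss would be applied only to the variables of `transform_net`, which matches the fixed-provider design described above.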


Fig. 1. Architecture of neural network

Figure 1 shows the overall architecture of the different neural networks in the system. More specifically, our system can experiment with different state-of-the-art auto-encoder architectures (Fig. 2). The networks used are VSL and the Point Net auto-encoder [2]. VSL (Variational Shape Learner) takes voxels as input and outputs voxel-formatted data, while the Point Net auto-encoder uses point cloud data as both input and output.


(a) The network architecture of VSL

(b) The network architecture of Point Net

Fig. 2. The network architectures (a) and (b)

4 Data

The data used in the network architecture are voxel data and point cloud data. A visualization of the data is shown in Fig. 3. Each voxel value is either 0 or 1, and the voxel grid has shape (30, 30, 30). The point cloud data have shape (number of points, point dimension); each point has 3 dimensions in this case, and when using the Point Net network a fixed number of 2048 points is used.
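As an illustration of this preprocessing step, the sketch below converts a mesh file into the two representations, assuming the trimesh library; the exact voxelization and sampling procedure used by the authors is not specified, so the pitch and padding choices here are assumptions.

```python
import numpy as np
import trimesh

def mesh_to_training_data(mesh_path, grid=30, n_points=2048):
    """Convert a mesh file (e.g. from ModelNet40) into a binary voxel grid
    of shape (30, 30, 30) and a point cloud of shape (2048, 3)."""
    mesh = trimesh.load(mesh_path, force='mesh')

    # Voxels: choose the pitch so the longest side spans `grid` cells, then
    # place the occupancy matrix into a fixed (30, 30, 30) array.
    pitch = mesh.extents.max() / grid
    occupancy = mesh.voxelized(pitch).matrix.astype(np.float32)
    voxels = np.zeros((grid, grid, grid), dtype=np.float32)
    s = [min(grid, d) for d in occupancy.shape]
    voxels[:s[0], :s[1], :s[2]] = occupancy[:s[0], :s[1], :s[2]]

    # Point cloud: sample 2048 points uniformly from the mesh surface.
    points = np.asarray(mesh.sample(n_points), dtype=np.float32)  # (2048, 3)
    return voxels, points
```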

Fig. 3. Visualization of voxel data and point cloud data


5 Experiment and Results

In this task, we first pre-train the original auto-encoder for 3000 epochs on the cars and airplanes datasets only. We then use this network both as the transform auto-encoder and as the loss providers. The model uses the layer outputs of the content loss provider as the target content, and the layer outputs of the style loss provider as the target style. The predicted content and style are extracted from the corresponding layer outputs of the transformed object's loss provider. We then compute an MSE loss on these outputs for backpropagation. The results are shown in Figs. 4, 5 and 6.

Fig. 4. Shape transform between airplane and car

The results of using the car and airplane are shown in Fig. 4. When we change the ratio between the airplane loss and the car loss, the output of the auto-encoder changes as well. The output approaches a car as the airplane ratio becomes lower, shown from left to right. As evidence that this is not a simple merge of a car and an airplane, the wings of the airplane become smaller as the airplane ratio decreases.

Fig. 5. Shape transform between bed and chair

Additionally, the results of the transform between chairs and beds are shown in Fig. 5. It seems that at a specific ratio the chair and the bed become a recliner, which makes this transformation work even more promising. Following the VSL paper [9], where local features and global features are named for a situation similar to the one in this paper, this experiment uses local features for the style and global features for the content. Figure 6 looks beyond imagination because the outputs become something like spaceships.


Fig. 6. Transform between airplane and car when using global features and local features

When using the Point Net auto-encoder, the results are shown in Figs. 7, 8, 9 and 10. The content layer used is the output of the first 1D convolution, and the style layer used is the bottleneck feature of the network.

Fig. 7. Transform result between a civil and a bomber

In Fig. 7, the left airplane (a) is a civil airplane and the right one (b) is a bomber. When combining both of them, the wings merge the two wing types in the shape of the civil airplane while mimicking the style of the bomber's straight wings. The number of engines is the same as on the civil airplane; this means that the network


Fig. 8. Transform result between a civil and a fighter

truly learns the content of the civil airplane, while the wings and the horizontal stabilizer mimic the style of the bomber. In Fig. 8, combining a civil airplane (a) and a fighter airplane (c), the generated airplane has learned how an airplane can change: it has changed its length and its shape. The fuselage length mimics the civil airplane, which is regarded as the input of the content loss provider, while the fuselage's shape mimics the fighter's. Additionally, the wings and the horizontal stabilizer changed their shape to a combination of those of the civil aircraft and the fighter.

Fig. 9. Front view of the result between a civil and a fighter

Fig. 10. Top view of the result between a civil and a fighter

In Figs. 9 and 10, the front view and the top view provide more detailed information about how the wings and the horizontal stabilizers change.


6 Conclusion

This article proposes a data-driven methodology and implements two different 3D shape representations and VAEs within it. Despite the increasing interest in NST, 3D shape transfer is still an unexplored area for future research and experiments, and it could be beneficial both in academic areas and in industrial applications. The project tapped into voxel data, point cloud data, Point Net and VSL to examine their performance. In the experiments with voxel data, our system can generate new shapes which contain latent features of the inputs. By varying the ratio in the loss function and the layers from which the loss function extracts outputs, the output of the system changes locally and globally. In the experiments with point cloud data, the output shows more detailed changes: it mimics the styles of the inputs and combines them in a latent way. Furthermore, our system can generate an output shape very fast, usually within ten to twenty seconds, after preprocessing the data and pre-training the VAEs. In our experiments, the system can finish training of 2000 epochs in less than half a minute using an Nvidia GeForce GTX 1080 graphics card. To investigate resolution, this article experiments with both VSL and Point Net; Point Net provides better results.

7 Future Work

This article successfully experimented with both the Point Net neural network and VSL, separately: high-resolution point cloud data for inputs of the same category and low-resolution voxel data for different categories. Most of the output data need to be converted to mesh files, which industrial designers and conceptual graphic designers use in their daily design work. This means that the low-resolution voxel output should be converted to a smooth-surfaced CAD model that can be used in different scenarios: physics-based simulators, scenario-based simulators, and graphic design. In particular, with the help of a simulator, industrial designers can evaluate multiple outputs against specific technical specifications. For instance, a military aircraft designer could measure the speed, acceleration and engine efficiency of the different airplane outputs from the previous experiments based on their CAD models, in order to evaluate different types of airplanes such as attack aircraft, reconnaissance aircraft, and fighter aircraft. The training time is between ten and twenty seconds per output, and the pre-trained neural network takes about one day to train. This will save tremendous time and effort for designers who need to sketch and test models in different scenarios, and will help implement this system in different industries.

Appendix

See Fig. 11.


Fig. 11. Detail architecture of neural network


References

1. Makhzani, A., et al.: Adversarial autoencoders, 25 May 2016. http://arxiv.org/abs/1511.05644
2. Achlioptas, P., Diamanti, O., Mitliagkas, I., Guibas, L.J.: Representation learning and adversarial generation of 3D point clouds. CoRR, abs/1707.02392 (2017). http://arxiv.org/abs/1707.02392
3. Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A.C., Bengio, Y.: A recurrent latent variable model for sequential data. CoRR, abs/1506.02216 (2015). http://arxiv.org/abs/1506.02216
4. Gatys, L.A., Ecker, A.S., Bethge, M.: A neural algorithm of artistic style. CoRR, abs/1508.06576 (2015). http://arxiv.org/abs/1508.06576
5. Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Fisher, M.: ShapeNet: an information-rich 3D model repository, 09 December 2015. https://arxiv.org/abs/1512.03012
6. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Bengio, Y.: Generative adversarial networks, 10 June 2014. https://arxiv.org/abs/1406.2661
7. Huang, X., Belongie, S.J.: Arbitrary style transfer in real-time with adaptive instance normalization. CoRR, abs/1703.06868 (2017). http://arxiv.org/abs/1703.06868
8. Johnson, J., Alahi, A., Li, F.-F.: Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision (2016)
9. Liu, S., Ororbia II, A.G., Giles, C.L.: Learning a hierarchical latent-variable model of voxelized 3D shapes. CoRR, abs/1705.05994 (2017). http://arxiv.org/abs/1705.05994
10. Rezende, D.J., Eslami, S.M.A., Mohamed, S., Battaglia, P., Jaderberg, M., Heess, N.: Unsupervised learning of 3D structure from images. CoRR, abs/1607.00662 (2016). http://arxiv.org/abs/1607.00662
11. Wu, Z., Song, S., Khosla, A., Tang, X., Xiao, J.: 3D ShapeNets for 2.5D object recognition and next-best-view prediction. CoRR, abs/1406.5670 (2014). http://arxiv.org/abs/1406.5670

The Effect of Color Channel Representations on the Transferability of Convolutional Neural Networks

Javier Diaz-Cely, Carlos Arce-Lopera, Juan Cardona Mena, and Lina Quintero

Universidad Icesi, Calle 18 No. 122-135, 760031 Cali, Colombia
{jgdiaz,caarce,lmquintero}@icesi.edu.co, [email protected]

Abstract. Image classification is one of the most important tasks in computer vision, since it can be used to retrieve, store, organize, and analyze digital images. In recent years, deep learning convolutional neural networks have been successfully used to classify images, surpassing previous state-of-the-art performances. Moreover, using transfer learning techniques, very complex models have been successfully utilized for tasks different from the original task for which they were trained. Here, the influence of the color representation of the input images was tested when using a transfer learning technique with three different well-known convolutional models. The experimental results showed that color representation in the CIE-L*a*b* color space gave reasonably good results compared to the RGB color format originally used during training. These results support the idea that the learned features can be transferred to new models with images using different color channels such as the CIE-L*a*b* space, and they open up new research questions as to the transferability of image representations in convolutional neural networks.

Keywords: Color channel representation · Convolutional deep learning · Transfer learning

1 Introduction

Image classification is a classical, but still open, research problem in computer vision. The problem can be alleviated by constructing robust and discriminative visual feature extractors, enabling researchers to design algorithms that automatically classify objects in images. In the beginning, the common approach was to manually design and create sets of features. More recently, research has focused on designing deep learning (DL) algorithms that automatically learn these sets of features. This approach, which uses a large amount of data and consumes considerable resources, is surpassing state-of-the-art performances in classification tasks. Convolutional Neural Networks (CNNs) are DL models trained for a specific image classification task on an initial dataset. These models can be reused for new


different tasks with new datasets [1]. Training a CNN from scratch is expensive in resources because it needs a dataset of sufficient size for the particular target task. Therefore, using a previously trained model as a feature extractor is an interesting way to gain classification precision without consuming as many resources in training. For example, an input image can be turned into a feature vector. Then, the feature vector can be used as the input of any type of model, an artificial neural network or otherwise. Previous studies on this practice, which is called transfer learning, have focused on different aspects of the transferability, such as the level of generality of the layers being transferred [18], or the target task [11].

Furthermore, image classification algorithms typically receive as input digital images using the RGB color format. This color format translates the sensitivities of the camera sensor into three primary colors. However, this color format suffers from different problems, such as being hardware-dependent. This means that the same scene taken with two different cameras will result in images with different RGB values. To avoid such problems, the Commission Internationale de l'Eclairage (CIE) developed a standard color space, the CIE-XYZ, to enable camera calibration protocols. This way, images can be related to the same real scene. To further take advantage of human perception, other color spaces have been developed, such as the CIE-L*a*b* color space (hereafter LAB, for simplicity), which mimics some characteristics of human color perception. Conversion algorithms between different color spaces have been developed to transform the RGB values of digital images into any other color space. However, to our knowledge, the effect of the color representation in transfer learning has not been studied. This is of great interest and importance for the research community because digital images are typically stored in the RGB color format and, consequently, the majority of CNN models are RGB-trained networks. The effects of the image representation (i.e. converting the input image to another color space) can influence the performance of the transfer learning practice. Here, the effect of the image color channels (different color spaces) on the transferability is shown. A DL convolutional model, trained with a set of images encoded in an RGB space, is reused for a new image classification task dealing with a different color space.

Artificial neural networks are universal function approximators [2]. So, given that there is a function that maps the RGB color space to other color spaces, and given enough examples, it would be possible to train a model to translate from one space to another, and an RGB-trained model could be used over a LAB dataset after said conversion was made. However, object classification tasks may not need such a color conversion, and it remains unclear whether RGB-trained models take advantage of color information similarly to the way humans use their color perception. Therefore, requiring that new datasets be in an RGB channel representation could be critical for the successful transferability of an RGB-trained convolutional network.


Additionally, the effect of a particular color channel on the classification performance is unknown. For instance, transfer learning of an RGB-trained model where the images only contain one repeated channel (for example, LLL, AAA, BBB) can show how the model treats that particular color channel and its influence on the classification performance. By changing the input information of the model, the results can give clarity about the internal processing of the CNN model. This paper is structured as follows. After the introduction, Sect. 2 presents background on color space image representations. Section 3 briefly explains the process of transfer learning of CNNs for image classification. Then, in Sect. 4, the experimental layout used for the analyses of the transferability is described. The results are presented in Sect. 5. Finally, Sect. 6 presents the conclusions of the paper and suggests further work.

2 Image Color Representation

Usually, color is represented using a combination of three base channels. The RGB color space is by far the most commonly used color space. It linearly combines the output of the sensitivity of the three photo sensors in a camera to create color. This direct transformation from camera sensor sensitivity to color coding is straightforward and practical. However, this color space suffers from two critical problems: its hardware dependency and its perceptual irrelevance. The hardware dependency refers to the problem that the results of the RGB color space are critically bound to the characteristics of the photosensitive devices. This means that changing the camera, or even changing the camera configuration, can affect the results in the RGB color space. Therefore, computer vision algorithms based solely on RGB color space images are more likely to be hardware-dependent solutions. To avoid such problems, the CIE designed a standard color space, the CIE-XYZ, that translates the problem of the RGB color space into a standard hardware-independent color space. To transform RGB coordinates to CIE-XYZ, it is necessary to calibrate the photosensitive devices. This calibration is expensive in time and resources and normally requires special equipment. Note that hardware-dependent color spaces may be converted to the CIE-XYZ color space by characterizing the device, linearizing the device space and then applying a linear transformation based on the chromaticity values and luminance of the device (a matrix inner product). The perceptual irrelevance is a problem for computer vision solutions targeted at humans. As color is a human perception, it makes sense that the limits of human perception should be considered by the color representation model. An approximation to solving this problem is offered by the HSL and YUV color spaces, which try to model human color perception by separating their color channels into one achromatic channel and two chromatic channels. However, these color spaces are not perceptually uniform and are still hardware-dependent. The CIE solved this problem in 1976 with the development of the CIE-L*a*b* color space.


The CIE-L*a*b* color space was designed to approximate human color perception. L* corresponds to the luminance channel, while the a* and b* channels describe the chromatic values: the a* channel represents the color components from green to red and the b* channel represents the color components from blue to yellow. In computer vision, there is still a debate about which color representation is the best input for each task; the general rule has been that it depends on the application. Normally, images are obtained in the RGB format directly from the cameras. Any transformation to other color spaces costs computational time, so having the images fed to the network already converted can alleviate the expense of having the nodes do the conversion. The change in color spaces also allows taking advantage of the limits of human perception, which is characterized by poor color perception in comparison with luminance perception. This way, using full-resolution images in the luminance channel and low-resolution images for the chromatic channels may result in a considerable boost in performance while keeping the accuracy high. Computer vision researchers have tested different color channels for different tasks, such as image segmentation [7,12], human-machine interaction [10], image colorization [20] and image classification [14,19]. Depending on the problem, either the RGB color space [12], the LAB color space [7,14,19,20], or the YUV color space [10] has been more successful.

3 Transfer Learning of DL Models for Image Classification

CNN models excel at image classification. Since the AlexNet model [8] surpassed all other traditional approaches during the 2012 ImageNet challenge, all the following winners have used some form of convolutional model, continually improving the models, reducing the error rates and advancing the development of the field. Models were trained on the provided, continually growing image datasets (1.2 million images with 1000 classes in the 2012 challenge). While there is no denying that these models achieve top performance, the process of collecting such enormous datasets, and the time and computational resources needed for effectively training such models, easily becomes a serious difficulty. A traditional assumption with machine learning models is that the training and evaluation data must follow the same distribution; traditional supervised machine-learning models do not work well when used to make predictions on a dataset independent from the one employed for their own training. When trying to apply DL techniques where only a small dataset is available, or when there are resource limitations, this assumption is pragmatically dropped and a transfer learning approach [9] is followed: the knowledge learned by models previously trained on a generic dataset for a particular task can be transferred to another domain, and the models can be reused to perform a different task, with a different data distribution, at very little computational cost. The initial layers of a CNN show very broad and general feature abstractions, similar to low-level Gabor filters (lines, curves), which serve as the basis of more complex feature abstractions on subsequent layers (corners, circles, textures);


final layers show complex structures (faces, cars, etc.). Transfer learning can be applied to CNN models [1]: instead of starting the learning process from a randomly initialized network, there is an advantage in using a previously trained model. One of the simpler approaches to transfer learning is to strip the original network of its classification layers and use it as a mere feature extractor. The final-layer result for a given image is then an embedding, a transformation that can be used as the input for another independent classification model. This new classification model can be any sort of fully connected multilayer neural network or any other, non-connectionist model, such as a support vector machine. In this approach, the original network needs to be used only once for a set of images. The new representations are obtained, and the subsequent classification model can be trained using the embeddings as inputs. Another approach, called fine-tuning, is to append new final classification layers in place of the original ones, and then to train them and some of the later convolutional layers of the original model with the new dataset and target classes. The risk of such an approach is training too many parameters with only a small training dataset, incurring overfitting. There have been studies that analyze different aspects of the transferability of convolutional models. For instance, the number of layers whose weights should be transferred, the dependency of the transferability on the level (lower-level layers are more general than higher-level ones) and the difference between the original and the target tasks have been studied [18]. Also, the transferability of a convolutional network initially trained for object classification to other tasks over different datasets, such as object classification, scene recognition, fine-grained recognition and attribute detection, has been studied [11]. Moreover, [4] showed that a more complete embedding can be constructed if all layers contribute their own feature extraction as part of the image representation to be used as input to the final classification layers (or other machine learning models). Finally, [17] used an RGB-trained CNN to boost the performance of an RGB-D image detector over an original AlexNet [8] model. However, to our knowledge, the effects of color representation on the transferability of a convolutional network have not been reported.
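As a concrete illustration of the feature-extractor approach, the following sketch uses the current tf.keras API (the study itself used TensorFlow 1.2) to strip Inception-V3 of its classification layers and turn images into embeddings; the average-pooling choice and the random test images are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
import tensorflow as tf

# Frozen ImageNet-trained network used purely as a feature extractor: the
# classification layers are stripped (include_top=False) and the pooled
# activations become the embedding fed to a separate classifier.
base = tf.keras.applications.InceptionV3(
    weights='imagenet', include_top=False, pooling='avg',
    input_shape=(299, 299, 3))
base.trainable = False  # no fine-tuning, as in this study

def embed(images):
    """images: float array of shape (n, 299, 299, 3) in the range [0, 255]."""
    x = tf.keras.applications.inception_v3.preprocess_input(images)
    return base.predict(x)  # one embedding vector per image

embeddings = embed(np.random.rand(4, 299, 299, 3) * 255.0)  # dummy images
print(embeddings.shape)
```

The resulting embeddings are computed once per image and then reused as inputs to the small classification models described in the next section.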

4 Experimental Layout

In order to compare the performance of the models retrained on the different color space representations, a very popular dataset was chosen for this study, taken from the cats and dogs Kaggle competition (available at https://www.kaggle.com/c/dogs-vs-cats/data). The whole dataset consists of 25,000 images of different sizes, equally distributed between the two possible classes. The original dataset presents all of its images encoded according to the three color channels red, green, and blue (RGB color format). A software implementation in Java of well-known formulas to convert from the RGB color space to


four additional representations based on LAB, using D65 as the reference white point, was developed. All pixels in each image were converted from the hardware-dependent RGB color space to the perceptually based LAB color space. In addition to the LAB representation of the dataset, each of the LAB color channels was used to generate a new set of images in which the three color channels carry the same information: LLL, AAA, BBB. Figure 1 shows the process for a particular image. These image datasets were constructed to test the influence of each channel on the classification performance.
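The original conversion was implemented in Java; the following is a hedged Python equivalent using scikit-image, whose rgb2lab conversion uses a D65 white point by default, to build the five color variants of a single image. The function and variable names are illustrative.

```python
import numpy as np
from skimage import color, io

def make_color_variants(path):
    """Build the five color representations of one image: RGB, LAB, and the
    single-channel variants LLL, AAA, BBB (one LAB channel repeated three
    times), mirroring the dataset construction described above."""
    rgb = io.imread(path)                  # uint8 image of shape (H, W, 3)
    lab = color.rgb2lab(rgb)               # D65 white point by default
    L, a, b = lab[..., 0], lab[..., 1], lab[..., 2]
    return {
        'RGB': rgb.astype(np.float32),
        'LAB': lab.astype(np.float32),
        'LLL': np.stack([L, L, L], axis=-1).astype(np.float32),
        'AAA': np.stack([a, a, a], axis=-1).astype(np.float32),
        'BBB': np.stack([b, b, b], axis=-1).astype(np.float32),
    }
```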

Fig. 1. Experimental layout

Three popular CNN models, originally trained on the ImageNet dataset [3], were chosen for the test. The ImageNet dataset consists of more than 1.2 million RGB images classified over 1000 classes. All images were rescaled to the expected input sizes of each transferred model:

– Inception-V3 [16]: A performance-improved evolution of the Inception architecture used by GoogleNet [15], which won the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC). This is the first architecture that strays away from the idea that convolution-pooling layers must always be stacked sequentially, and introduces instead the idea of stacking blocks (Inception modules) that internally execute convolutional layers in parallel. The input image size is 299 × 299.
– ResNet [5]: This model is built on the basis of a very deep network (152 layers), with which it won the 2015 ILSVRC classification, detection and localization tasks. It chains blocks of two consecutive convolutions, which in turn are combined (added) with the original block input to form the block output, learning residual functions with reference to the original input. This idea is said to ease the training of deeper networks, as the models are not as complex as traditional shallower models. The input image size is 224 × 224.


– MobileNet [6]: This model was built with size and speed efficiency in mind, so that deployment on mobile devices and real-time applications becomes possible, by factorizing standard convolutions (reducing the number of operations needed) and by allowing for width and resolution reduction. The input image size is 224 × 224.

Simpler network architectures based only on the succession of convolutional and pooling layers, such as AlexNet [8] or VGG [13], were not considered for the test, as there have been several advances in CNNs since their creation. The three CNN models were transferred from their original ImageNet classification task to the target task of recognizing cats and dogs. They serve as image feature extractors (no fine-tuning was performed); the embeddings they produce for each of the 5 datasets (one per color representation) then become the inputs to new classification models. These models include two hidden fully connected layers of 64 neurons each and a final softmax layer. Using 85% of the images for training and 15% for test (following a stratified partition), the 15 models (3 architectures × 5 datasets) were trained over 500 epochs, using random weight initialization, the categorical cross-entropy loss function and stochastic gradient descent with momentum as the optimization algorithm (with the same configuration parameters). A machine with an 8-core Intel Xeon E3-1270 v5 @ 3.60 GHz, 16 GB RAM, and an Nvidia Quadro M400 GPU, running Ubuntu 16.04 with Python 3.6 and TensorFlow 1.2, was used.
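A minimal sketch of the classification head described above, written with the current tf.keras API rather than TensorFlow 1.2; the ReLU activations, learning rate and momentum value are assumptions, since the paper does not list them.

```python
import tensorflow as tf

def build_head(embedding_dim, n_classes=2):
    """Classifier trained on top of the transferred embeddings: two hidden
    fully connected layers of 64 neurons and a softmax output, optimised
    with SGD + momentum and categorical cross-entropy."""
    model = tf.keras.Sequential([
        # Activation and optimizer hyperparameters below are illustrative.
        tf.keras.layers.Dense(64, activation='relu', input_shape=(embedding_dim,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(n_classes, activation='softmax'),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
        loss='categorical_crossentropy', metrics=['accuracy'])
    return model

head = build_head(embedding_dim=2048)  # e.g. pooled Inception-V3 features
# head.fit(train_embeddings, train_labels, epochs=500,
#          validation_data=(test_embeddings, test_labels))
```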

5 Results

In this section, the results of the different experiments with the three CNN models are described.

5.1 Inception-V3

In Fig. 2, the evolution of the test accuracy over the training epochs for the models built on features transferred from Inception-V3 is presented. The Inception-V3 model trained on the features extracted from the RGB dataset considerably underperformed in comparison to the model trained on the features extracted from the LAB dataset. While the RGB model achieved an accuracy level of around 68.5%, the LAB model reached an accuracy of 75.5%. Even the one-channel color representations performed better than the RGB image dataset, with BBB showing accuracy levels just under the LAB dataset model (74%), and with LLL and AAA achieving 73% and 75%, respectively.

5.2 ResNet

In Fig. 3, the evolution of the test accuracy over the training epochs for the models built on features transferred from ResNet is shown.


Fig. 2. Evolution of test set accuracy over the training epochs for the Inception-V3 model

In this case, the RGB-trained ResNet model considerably outperformed the other color channel representation models, achieving an accuracy of 96.5%. Nevertheless, the LAB and the grayscale LAB-based models performed rather well, with accuracy levels around 91%.

Fig. 3. Evolution of test set accuracy over the training epochs for the ResNet model


5.3 MobileNet

In Fig. 4, the evolution of the test accuracy over the training epochs for the models built on features transferred from MobileNet is presented. The five models showed similar performance, achieving an accuracy level of around 77%. In order of best accuracy, they are the AAA, LLL, RGB, LAB, and BBB models.

Fig. 4. Evolution of test set accuracy over the training epochs for the MobileNet model

6 Findings and Further Work

The effects of different color representations and different network architectures on the transferability of a CNN were presented. To our knowledge, these results are the first systematic experiments applying RGB-trained networks to data in another color representation. The effects were relevant and showed that the performance was affected by both the color representation and the network architecture. These results can help to understand the underlying processes that take place inside a CNN by showing the influence of different input color representations. For example, the results of transferring RGB-trained models using one-channel color representations (e.g. LLL, AAA, BBB) showed different performance results depending on the network architecture. These results may indicate that some model architectures do not exploit the differences in color information. More specifically, for the MobileNet network architecture there were no considerable differences in the performance of the five models, with their respective accuracies converging at around 77%. Therefore, a one-channel color representation performed similarly to the RGB and LAB color representations. This fact may indicate that no color information, nor any difference between color channels, was coded


by the network. We suggest that this type of network uses only spatial information to achieve its classification. Also, we may argue that for this architecture there is no need to feed the model three-channel color information; it suffices to give a grayscale version, which may improve the time performance of the classification system. For the ResNet network architecture, the performance results were better for the RGB model. In the ResNet architecture, the original representation of the input image continues to be relevant over the deeper layers, as each pair of layers adds to the residual function based on the input channels. This is not the case for the Inception-V3 and MobileNet network architectures, where the influence of the original input over the extracted features becomes less important as the layers go deeper. Therefore, we may argue that having the original and target representations in the same encoding is more influential on performance in the ResNet architecture, achieving an impressive 97% accuracy with the RGB model. Nevertheless, the performance of the classification tasks with the LAB, LLL, AAA and BBB models was good, going over 91% accuracy. The ResNet network architecture is very deep, which may suggest that the trained model is very specific and tuned to exploit the differences in color for a specific color representation such as the RGB color space. However, due to its specificity, other color representations do not perform as well as the one that served in the training of the model. As future work, it is of great interest to investigate whether training the ResNet model on a different color representation would influence these results. Finally, the experimental results showed that the Inception-V3 architecture performed differently depending on the color representation of the input. The best classification accuracy was obtained by the LAB model, with approximately 75% accuracy, followed by the BBB, LLL and AAA models. The RGB model scored the worst performance, with a 68% accuracy. Consequently, we may argue that the Inception-V3 architecture may use human perceptual color differences, because there was a difference between the LAB model and the RGB model. Moreover, the results also suggest that the nodes in the network architecture are more sensitive to differences in the B, L and A channels, in that specific order. These results are the first reports on the effects of color representation on the transferability of a CNN. They showed that color representation influences performance in the Inception-V3 and ResNet network architectures, whereas no difference was shown for the MobileNet architecture. To be able to generalize our results, other datasets with different classification targets are needed. Another limitation is that these results may depend on the target classification: many objects may differ from others not only by their color information but by other visual information. The selected dataset (dogs-vs-cats) was chosen because the classification task is not primarily based on color information. Therefore, it would be interesting to compare these results with models trained on classification tasks that heavily depend on color information.

Acknowledgment. This work was funded by Universidad Icesi through its institutional research support program.


Weakly Supervised Deep Metric Learning for Template Matching

Davit Buniatyan, Sergiy Popovych, Dodam Ih, Thomas Macrina, Jonathan Zung, and H. Sebastian Seung
Princeton University, Princeton, NJ 08544, USA
{davit,popovych,dih,tmacrina,jzung,sseung}@princeton.edu

Abstract. Template matching by normalized cross correlation (NCC) is widely used for finding image correspondences. NCCNet improves the robustness of this algorithm by transforming image features with siamese convolutional nets trained to maximize the contrast between NCC values of true and false matches. The main technical contribution is a weakly supervised learning algorithm for the training. Unlike fully supervised approaches to metric learning, the method can improve upon vanilla NCC without receiving the locations of true matches during training. The improvement is quantified on patches of brain images from serial section electron microscopy. Relative to a parameter-tuned bandpass filter, siamese convolutional nets significantly reduce false matches. The improved accuracy of the method could be essential for connectomics, because emerging petascale datasets may require billions of template matches during assembly. Our method is also expected to generalize to other computer vision applications that use template matching to find image correspondences.

Keywords: Metric learning · Weak supervision · Siamese convolutional neural networks · Normalized cross correlation

1 Introduction

Template matching by normalized cross correlation (NCC) is widely used in computer vision applications, from image registration, stereo matching, motion estimation, object detection and localization, to visual tracking [11,15,20,26]. This paper describes how the robustness of the algorithm can be improved by applying deep learning. Namely, transforming the template and source images with "siamese" convolutional nets [5] can significantly reduce the rate of false matches and makes the NCC output more useful for rejecting false matches. The main novelty of this paper is a weakly supervised training approach. The only requirement for the training data is that a true match to the template should exist somewhere in the source image; the location of that match is not required as an input to the learning procedure. Weak supervision is important for applications without large ground truth annotations of image correspondences.


The proposed method is evaluated on images from serial section electron microscopy (ssEM). The error rate of template matching on raw ssEM images is on the order of 1 to 3%. Preprocessing with a bandpass filter lowers the error rate [2], and performing matching on images transformed by our approach further improves the error rate by a factor of 2–7x. The overall result is an error rate of 0.05 to 0.30%. A common strategy for reducing false matches is to reject those that are suspect according to some criteria [24]. This can be problematic if too many true matches are rejected. Once deep learning is incorporated, the NCC output provides superior rejection efficiency: achieving zero false matches requires rejecting only 0.12% of the true matches based on the NCC output, an improvement of 3.5x over the efficiency obtained with a bandpass filter.

2 Related Work

Established methods for correspondence matching typically consist of two stages: a feature extraction stage and a matching stage.

Feature Extraction: Local geometric features have been studied for over 20 years [1,17,19]. Due to advances in deep learning, a new set of learned descriptors has emerged [9,14,25,31]. A neural network is typically trained to produce descriptors such that matching image patches have similar descriptors and non-matching patches have dissimilar descriptors. The main novelty of this paper relative to these approaches is the use of weak supervision. MatchNet and similar approaches rely on strong supervision: they require datasets consisting of matching and non-matching pairs of image patches. Owing to expense, standard datasets such as the UBC patch dataset are not guaranteed to have pixel-perfect matches [4], and a system trained on these datasets may be limited by this noise. Our method, on the other hand, requires only weak supervision. A training example consists of a template image patch and a larger source image, with the guarantee that the template matches exactly one unspecified location within the source image. A rich source of such data is stereo image pairs, which have been exploited for weak supervision [28].

Matching: Once feature descriptors have been computed densely in both a template and a source image, a consensus algorithm is typically used to produce a global transformation from template to source. The goal is to find a simple transformation that maps each point in the template to a point in the source with a similar feature descriptor [7]. Normalized cross-correlation is a simple consensus algorithm that is differentiable and enables the system to be trained end-to-end. A second advantage of NCC matching is the fine spatial localization that it offers. Other methods compute and subsequently match interest points; these may save computation time, but are undesirable if pixel-perfect localization is preferred. Suppose that two interest points match. One may be able to improve the match by sliding the patch around one of the interest points in order to


search for a point with a higher similarity score. This kind of search would likely improve spatial localization and start to resemble our NCCNet. NCC effectively uses cosine similarity to measure the distance between feature embeddings. Other works such as MatchNet [9] use a neural net to represent the metric function; in practice, it is computationally infeasible to use a neural net to compare all pairs of image patches. While our approach might sacrifice some power by using cosine similarity, increasing the complexity of the embedding net might recover that power, as shown in [13]. Furthermore, a network using a learned metric can violate the triangle inequality, but it is unclear whether such power is desirable. An alternative to feature matching is to train a convolutional network that directly outputs a vector field describing exactly how each pixel moves between source and template (optical flow) [6,18,21]. This approach is well suited to computing dense correspondences, while template matching makes sense for computing sparse sets of corresponding points.

Deep Learning: The idea of using deep learning to improve NCC template matching for image correspondences is simple and obvious, but has been little explored as far as the authors are aware. The closest antecedent of our work introduced an NCC layer inside a network used for the person re-identification problem [27]. Unlike standard siamese networks [3,5,9,13,32], one of the siamese channels takes a smaller input. NCCNet still shares the same weights and computes the same transformation at a macro level, but differs at sub-pixel resolution because of the max-pooling layers (though the difference is negligible). Each channel is based on a residual U-Net architecture [10,23]; however, upsampling is performed using resize convolutions instead of transpose convolutions.

3 Weak Supervision by NCC

Given a template image T ∈ R^t and a larger source image F = {F_i : F_i ∈ R^t}, a set of n overlapping template-sized patches, define their normalized cross-correlogram by

$$\mathrm{NCC}(F_i, T) = \frac{(F_i - \mu_{F_i}) \cdot (T - \mu_T)}{\sigma_{F_i}\,\sigma_T} \qquad (1)$$

where μ and σ denote the pixelwise mean and standard deviation over template-sized images. Each pixel of the normalized cross-correlogram corresponds to a potential placement of the template within the source image. Such a placement gives a one-to-one correspondence between template pixels and a subset of the source pixels. The location of the maximum of the correlogram is taken to be the location at which the template matches the source.

Ideally, the correlogram should have a high and narrow peak only at the location of a true match, and should be low at all other locations. There should be no peak at all if there is no true match between the
template and source. In practice, the algorithm above fails in several ways. Peaks may exist at spurious locations corresponding to false matches. Another failure mode is a wide peak near a true match, leading to imprecise spatial localization. To reduce the failure rate, a feature transformation is applied to the template and source images prior to computing the NCC. A good transformation should enhance landmarks and suppress noise that could give rise to spurious matches. The key observation that allows us to achieve this goal is that the output produced by the NCC can be considered a similarity distance metric [12,30] between the template and the source patch centered around each pixel. Under the metric learning framework, given pairs of points in a space X that are known to be similar or dissimilar, and a similarity measure S : R^n × R^n → R, the objective is to learn an embedding ψ : X → R^n such that S(ψ(x), ψ(y)) is large for similar (x, y) and small for dissimilar (x, y). If the embedding function ψ is a neural network, the technique is known as siamese networks [3], because identical networks are applied to both x and y [5].

In our weakly supervised setting, spatial coordinates of similar and dissimilar image patches are not available during training. Instead of using ground truth match locations, we assume that the maximum peak of the correlogram is indeed the correct match; the non-peak correlogram values then correspond to dissimilar image pairs. Define the maximum peak P_1 and the second maximum peak P_2, constrained to be at least a minimum distance away from the first, as

$$P_1(F, T) = \max_i \mathrm{NCC}(F_i, T), \qquad P_2(F, T) = \max_{j \neq i^*} \mathrm{NCC}(F_j, T) \qquad (2)$$

where i^* denotes the location of the primary peak.

The objective of the optimization is to maximize the difference in the heights of the primary and secondary peaks, L(F, T) = P_1(F, T) − P_2(F, T), which can be labelled the correlation gap, or r_delta:

$$\max_{\psi \in \Psi} \; L(\psi(F), \psi(T)) \qquad (3)$$
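The following is a minimal, illustrative sketch (not the authors' implementation) of Eqs. (1)-(3): a differentiable NCC correlogram computed with grouped convolutions, and the peak-gap quantity r_delta = P1 - P2. The tensor shapes, box size and function names are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def ncc_correlogram(source, template, eps=1e-6):
    """source: (B, 1, H, W); template: (B, 1, h, w) -> correlogram (B, H-h+1, W-w+1)."""
    b, _, h, w = template.shape
    n = h * w
    t = template - template.mean(dim=(2, 3), keepdim=True)         # zero-mean template
    t_norm = t.flatten(1).norm(dim=1).clamp_min(eps)                # ||T - mu_T|| per example
    src = source.reshape(1, b, *source.shape[2:])                   # batch -> channels for grouped conv
    ones = source.new_ones(b, 1, h, w)
    s_sum = F.conv2d(src, ones, groups=b)                           # sum of every template-sized patch
    s_sqsum = F.conv2d(src ** 2, ones, groups=b)                    # sum of squares of every patch
    patch_norm = (s_sqsum - s_sum ** 2 / n).clamp_min(eps).sqrt()   # ||F_i - mu_{F_i}|| per patch
    num = F.conv2d(src, t, groups=b)                                # F_i . (T - mu_T); the mean term cancels
    return (num / (patch_norm * t_norm.view(1, b, 1, 1))).squeeze(0)

def peak_gap(ncc, box=5):
    """r_delta = P1 - P2, where P2 is the highest value outside a window around P1."""
    b, hh, ww = ncc.shape
    p1, idx = ncc.flatten(1).max(dim=1)
    masked = ncc.clone()
    for k in range(b):                                              # mask a (2*box+1)^2 window around P1
        y, x = int(idx[k]) // ww, int(idx[k]) % ww
        masked[k, max(0, y - box):y + box + 1, max(0, x - box):x + box + 1] = -1.0
    p2 = masked.flatten(1).max(dim=1).values
    return p1 - p2, p1, p2

# Training maximizes the gap, e.g. loss = -(p1 - p2).mean() for matching pairs, as in Eq. (3).
```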

The siamese convolutional network is trained by repeating the following steps for template-source pairs that are known to contain a true match:

1. Compute the correlogram for the source and template images.
2. Find the peak of the correlogram.
3. Make a gradient update to the convolutional net that increases the height of the peak.
4. Draw a small box around the peak of the correlogram.
5. Find the maximum of the correlogram outside the box, and label this the secondary peak.
6. Make a gradient update to the convolutional net that decreases the secondary peak.

The cost function has two purposes, depending on the shape of the correlogram (Fig. 1).


Fig. 1. Loss function 2D example: On the left, the second peak outside of the gray bounding box is pulled down. The correlogram with a wide peak becomes more narrow for better localization. On the right, optimization promotes the first peak and diminishes the second peak for less ambiguity.

If the primary peak is wider than the box, the secondary peak will not actually be a local maximum (Fig. 1). In this case, the cost function encourages narrowing of the primary peak, which is beneficial for precise spatial localization; the size of the box in the algorithm represents the desired localization accuracy. In other cases, the secondary peak will be a true local maximum, in which case the purpose of the cost function is to suppress peaks corresponding to false matches.

The above algorithm corresponds to similarity metric learning if the primary peak indeed represents a true match and the secondary peak represents a false match. In fact, the NCC has a nonzero error rate, which means that some of the training examples have incorrect labels. However, if the error rate starts small, the similarity metric learning will make it even smaller. Our algorithm requires a good match to exist between each source-template pair; however, the location of the match is not required as an input to the learning, so the supervision is fairly weak.

By itself, the above algorithm may lead to pathological solutions in which the network minimizes the cost function by ignoring the input image. For example, the network could transform the source and template images so that each contains a single white pixel, resulting in a perfectly peaked correlogram. To avoid these solutions, one can additionally train on source-template pairs that are known to contain no good match. Since these are dissimilar pairs, the goal of learning is to reduce the peak of the NCC:

$$\min_{\psi \in \Psi} \; P_1(\psi(F), \psi(T)) \qquad (4)$$

1. Compute the correlogram for the source and template images.
2. Find the peak of the correlogram.
3. Make a gradient update to the convolutional net that decreases the height of the peak.

Once negative examples are added, the setup changes and the pathological case is no longer a global optimum. By permuting the source images relative to the template images within a batch, dissimilar pairs can be generated artificially.
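A hedged sketch of one training iteration combining the two objectives above; it assumes the ncc_correlogram and peak_gap helpers from the earlier sketch and an embed network standing in for the siamese U-Net. The combination of the two losses in one step, the optimizer and the names are illustrative choices, not the authors' exact procedure.

```python
import torch

def training_step(embed, optimizer, source, template, box=5):
    """source, template: batches of matching pairs; embed: the shared siamese net."""
    optimizer.zero_grad()
    fs, ft = embed(source), embed(template)                  # same weights applied to both inputs
    r_delta, p1, p2 = peak_gap(ncc_correlogram(fs, ft), box=box)
    loss_pos = -r_delta.mean()                               # Eq. (3): raise P1, suppress P2

    # Dissimilar pairs: permute the sources within the batch so no template matches (Eq. 4).
    perm = torch.roll(torch.arange(source.size(0)), shifts=1)
    ncc_neg = ncc_correlogram(fs[perm], ft)
    loss_neg = ncc_neg.flatten(1).max(dim=1).values.mean()   # push down the best spurious peak

    loss = loss_pos + loss_neg
    loss.backward()
    optimizer.step()
    return float(loss)
```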

4 Network Architecture

NCCNet is a siamese network, but with a different input size for each of the two channels, as shown in Fig. 2 (512 × 512px and 160 × 160px). Each siamese channel is based on a U-Net architecture with shared weights [10]. Our U-Net consists of a sequence of residual blocks with max-pooling layers followed by a sequence of residual blocks with upsampling layers. A residual block is a sequence of 5 convolutions with a skip connection from the first to the last convolution. Each convolution uses a 3×3 kernel with a tanh non-linearity, followed by batch normalization.

Fig. 2. Siamese U-Net networks connecting to channel-wise NCC layer. The residual block consists of 5 stacked convolutions with skip connections from first to the last one. See Appendix Tables 4 and 5 for details

Between depth levels at the same image resolution, there is a skip connection between the max-pooling and upsampling paths; at those skip junctions, the two inputs are summed. There are four depth levels. As the depth increases, the number of feature channels doubles, starting at 8 channels and ending at 64. Upsampling is done with resize convolutions as opposed to the more common transpose convolutions. Transpose convolution can cause a checkerboard artifact due to striding with zeros, while a resize convolution avoids such artifacts through linear interpolation. Another residual block for processing the output is added at the final layer. The network also enforces symmetric input-output resolution for each channel. The outputs of the two siamese channels are processed by the NCC layer, which takes the transformed template image and the transformed source image and outputs a channel-wise normalized cross-correlogram.
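A minimal PyTorch reading of the building blocks just described: a five-convolution residual block whose last convolution also receives the first convolution's output (the "4 + 1" row of Table 4), and a resize convolution that upsamples by interpolation before a 3×3 convolution. This is an illustrative sketch of the text, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Five 3x3 convolutions, each followed by tanh and batch norm; the input of the
    last convolution is the sum of the outputs of convolutions 4 and 1."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        chans = [in_ch] + [out_ch] * 5
        self.convs = nn.ModuleList(nn.Conv2d(chans[i], chans[i + 1], 3, padding=1)
                                   for i in range(5))
        self.bns = nn.ModuleList(nn.BatchNorm2d(out_ch) for _ in range(5))

    def forward(self, x):
        acts = []
        for i, (conv, bn) in enumerate(zip(self.convs, self.bns)):
            if i == 4:
                x = x + acts[0]                  # skip connection from the first convolution
            x = bn(torch.tanh(conv(x)))
            acts.append(x)
        return x

class ResizeConv(nn.Module):
    """Interpolate, then convolve: avoids the checkerboard artifacts of transpose convolutions."""
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=self.scale, mode="bilinear", align_corners=False)
        return self.conv(x)
```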

5 Experiments

5.1 Alignment

The technique is evaluated using images acquired by serial section electron microscopy (EM). After a series of images is produced from consecutive brain tissue slices, the images have to be aligned to each other for subsequent processing. This is a difficult task because the images contain many deformations that occur during section collection and imaging (local stretches, tears, folds, etc.) [24]. These defects have to be corrected in order to produce a high quality alignment, and finding matches between consecutive image patches is an essential step in modern alignment pipelines [22,24]. Achieving highly precise alignment between successive sections is critical for the accuracy of the subsequent step of tracing fine neurites through the image volume. An image of a 1 mm³ brain volume is roughly a petavoxel [33], and a high quality assembly could require billions of template matches [16,24]. Every false match leads to tracing errors in multiple neurites, so even a small error rate across so many matches can have devastating consequences.

In total, the evaluation set consists of 216,000 image pairs. Three datasets are produced from the serial images: raw images (raw); images preprocessed with a circular Gaussian bandpass (difference of Gaussians) filter that was optimally tuned to produce a low number of false matches (bandpass); and images preprocessed with our convolutional net by applying the larger convolutional channel across the entire image, then upsampling and blending accordingly (NCCNet). We further vary the parameters of the template matching procedure: the template image size, small or large (160px and 224px), and matching between neighboring images (adjacent) as well as between next-nearest neighbors (across).

Fig. 3. Match criteria for adjacent 224px experiment. (a) Distributions of the three template match criteria for each image condition. Red bars represent counts of false matches that were manually identified. Green bars represent counts of true matches. (b) Percentage of true matches that must be rejected to reduce the error rate when varying the r delta criterion. See the Appendix A.3 for distributions & rejection curves from other experiments. Additionally please see Figs. 13, 14 and 15

In all experiments, the images preprocessed by the NCCNet consistently produced 2–7 times fewer false matches than the other two sets of images (see Table 1). To ensure that the lower number of false matches did not come at the expense of true matches, we evaluated the overlap between the true match sets obtained from the bandpass images and from the NCCNet images. Appendix Table 3 summarizes how many true matches were unique to the bandpass, unique to the NCCNet, or found by neither. The number of true matches found only by NCCNet significantly outweighs that of the other methods, especially for across pairs.


Table 1. The network outperforms other widely used methods for the given task. False matches for each image condition across experiments. Total possible adjacent matches: 144,500. Total possible across matches: 72,306.

           Adjacent 160px   Adjacent 224px   Across 160px    Across 224px
Raw        1,778 (1.23%)    827 (0.57%)      2,105 (2.91%)   1,068 (1.48%)
Bandpass   480 (0.33%)      160 (0.11%)      1,504 (2.08%)   340 (0.47%)
NCCNet     261 (0.18%)      69 (0.05%)       227 (0.31%)     45 (0.06%)

Rejection efficiency is critical for balancing the trade-off between complete elimination of false matches and retention of as many true matches as possible. We show a 3.5x improvement over filtering methods: eliminating only 0.12% of the total matches yields perfect correspondences for aligning EM slices (Fig. 3).

5.2 Suppression and Enhancement of Features

What is the reason for the superior NCC matching performance after processing by NCCNet? Some intuition can be gained by examining how NCCNet transforms images (Figs. 4 and 5). The output of NCCNet contains white and black spots, which can be interpreted as enhancements of certain features in the original images. For example, many of the white spots correspond to mitochondria. NCCNet learns to suppress the borders of blood vessels. When such borders are fairly straight, they tend to produce a ridge in the NCC, leading to large uncertainty in spatial localization. NCCNet also learns to suppress high-contrast image defects (Fig. 4).

For the empirical results in the previous section, we trained an NCCNet architecture with a single output image. Since NCCNet transforms an input image into an output image, it can be viewed as a simple "preprocessing" operation. The NCCNet architecture can be generalized to have multiple output channels. The NCC is computed for each output channel and then averaged over the output channels, and this channel-averaged NCC is used in the loss function for training. NCCNet with multiple output channels produces a richer representation of the input image: a feature vector with dimensionality equal to the number of output channels is assigned to each pixel of the input image. We trained an NCCNet with eight output channels (NCCNet-8) on the same ssEM images. In the example shown in Fig. 6, NCCNet-1 (single output channel) yields a false match, but NCCNet-8 yields the correct match. This is presumably because the multiple outputs of NCCNet-8 are richer in information than the single output of NCCNet-1.

5.3 Training Stability Analysis

Weak supervision allows the training labels to change as training progresses.


Fig. 4. Example of how NCCNet enhances some features and suppresses others, thereby eliminating a false NCC match. (a) Raw source and template images. The template contains “dirt” (black spot) at the upper left corner. Both source and template contain high contrast blood vessel borders. (b) NCC of raw images is maximized at a peak (red) that is a false match. (c) Suppressed the dirt in the template as well as the blood vessel borders after processing by NCCNet. NCCNet has detected features (black and white spots), which are somewhat interpretable. For example, some white spots appear to correspond to mitochondria in the original image. (d) NCC of images after processing by NCCNet is maximized at a peak (green) that matches correctly.

Especially at earlier stages, some correlograms might have incorrect peaks, creating a set of incorrect labels among the training examples. We observed this in our training procedure: an incorrect label early in training is corrected by later iterations (see Fig. 7 for an example). Noisy labels are rather frequent at the start, owing to the enrichment of the training set with pathological examples. We performed two additional experiments: (1) maliciously switching the labels of the primary and secondary NCC peaks dynamically throughout training, and (2) introducing nominally positive source-template pairs with no true match. Neither change makes a qualitative difference when 20–25% of the examples are noise.


Fig. 5. Example of how NCCNet reduces location uncertainty. (a) Raw source and template images both contain high contrast blood vessel borders, and the template contains a border that is relatively straight. (b) The NCC of raw images has a long ridge with a peak (red) that is far from the location of the correct match. The ridge indicates large location uncertainty along one axis corresponding to the blood vessel border. (c) Suppressed blood vessel borders and enhanced other features (black and white spots) after processing by NCCNet. (d) NCC of images after processing by NCCNet is maximized at a sharp peak (green) that is at the correct location.

Furthermore, if the adversary noise is increased to 30–40%, training still progresses, but the learned features and loss function cannot recover. As expected, training fails when the noise exceeds 50%. We can summarize the intuition for these results as follows: if more than half of the primary-secondary peak pairs in a batch are correct, then the batch gradient will point in the correct direction with high probability. In the following iterations, after the weight update, some of the wrong matches will turn into correct matches, which further increases the probability of descending the gradient in the correct direction. Inductively, this process recovers from the corruption. However, if the initial noise is high enough (30–40%), the gradient may orbit instead of converging, and it certainly diverges when the error exceeds 50%.


Fig. 6. NCCNet with single vs. multiple output channels. (a) NCC of raw source and template images locates the correct match. (b) After processing with NCCNet-1 (single output channel), the NCC peak yields a false match. This is presumably because the template is information-poor after processing by NCCNet-1. The template is dominated by the interior of the blood vessel, which is featureless. Few features are detected outside the blood vessel. (c) After processing with NCCNet-8 (eight output channels), the NCC peak indicates the correct match. This is presumably because more information about the template is encoded in the multiple output channels.

At the first training step, the inputs to the NCC layer are images randomly transformed by the U-Net, whose weights are initialized with the method of Glorot et al. [8]. We observe that randomly projecting source/template pairs through convolutional networks preserves the peak; however, a random projection does not correct a wrong peak by removing unnecessary or occluding information. This is learned by updating the weights using SGD. Training was mostly stable across our hyperparameter search. Furthermore, we compared Xavier initialization with randomly initializing the weights from N(0, 1) and observed that training is robust under different initializations.


Fig. 7. The same correlogram evaluated at different time steps during training. The circle marks the peak. The label is dynamic due to weak supervision: it starts as a wrong match and then turns into the correct one.

6 Conclusions

Notably, the reduction in error from applying the NCCNet generalized well to both the harder (across-section) and easier (adjacent-section) tasks. As different template matching parameters are often needed at different stages of the alignment process, this ability to generalize is crucial for applications. The results suggest that a single NCCNet may be used throughout the range of speed-accuracy tradeoffs, as well as for dealing with the missing (across) sections that typically appear in biomedical data.

Combining NCC with deep learning reduces false matches from template matching. It also improves the efficiency with which those false matches can be removed, so that a minimal number of true matches are rejected. This is a promising technique that offers the ability to significantly increase the throughput of our alignment process while maintaining the precision we require. We would like to further explore to what extent weak supervision generalizes to other template matching applications, such as object tracking [29].

Acknowledgment. This work has been supported by an AWS Machine Learning Award and by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC0005. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government.

A Appendix

A.1 Training Set

Our training set consists of 10,000 source-template image pairs from consecutive slices in an affine-aligned stack of images that contained non-affine deformations. We include only source-template pairs for which r_delta < 0.2 under simple NCC. This rejects over 90% of the pairs and increases the fraction of difficult examples,
similar to hard example mining in the data collection process. During training, we alternate between a batch of eight source-template pairs and the same batch with randomly permuted source-template pairings. We randomly cropped the source and template images so that the position of the peak is randomly distributed, and we used random rotations and flips of both channel inputs by 90°, 180° and 270°.
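A small sketch of the joint augmentation described above, applied identically to a source/template pair. The 90-degree rotations and flips come from the text; the array-based implementation and the function name are illustrative assumptions.

```python
import numpy as np

def augment_pair(source, template, rng):
    """Apply the same random rotation (0/90/180/270 degrees) and flip to both images."""
    k = rng.integers(0, 4)                       # number of 90-degree rotations
    source, template = np.rot90(source, k), np.rot90(template, k)
    if rng.random() < 0.5:                       # random horizontal flip
        source, template = np.fliplr(source), np.fliplr(template)
    return source.copy(), template.copy()
```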

A.2 Ground Truth Annotation

For the training data, each source image is 512 × 512px and each template is 224 × 224px. These pairs are sampled from 1040 super-resolution slices, each 33,000 × 33,000px. We validated our model on 95 serial images from an unpublished EM dataset with a resolution of 7 × 7 × 40 nm³. Each image was 15,000 × 15,000px and had been roughly aligned with an affine model, but still contained considerable non-affine distortions of up to 250px (full resolution).

Fig. 8. (Left) matches on the raw images; (Middle) matches on images filtered with a Gaussian bandpass filter; (Right) NCCNet: matches on the output of the convolutional network processed image. Displacement vector fields are a representation of output of template matching in a regular triangular grid (edge length 400px at full resolution) at each slice. Each node represents the centerpoint of a template image used in the template matching procedure. Each vector represents the displacement of that template image to its matching location in its source image. Matches shown are based on 224px template size on across (next-nearest neighbor) sections. Additionally see Figs. 10, 11 and 12

In each experiment, both the template and the source images were downsampled by a factor of 3 before NCC, so that the 160px and 224px templates were 480px and 672px at full resolution, while the source image was fixed at 512px downsampled (1,536px at full resolution). The template matches were taken on a triangular grid covering the image, with an edge length of 400px at full resolution (Fig. 8 shows the locations of template matches across an image). Our first method of evaluating performance was to compare error rates.
Fig. 9. Manual inspection difficulties. (a) The vector field around a match (circled in red) that prominently differs from its neighbors. (b) The template for the match, showing many neurites parallel to the sectioning plane. (c) The false color overlay of the template (green) over the source image (red) at the matched location, establishing the match as true.

Errors were detected manually with a tool that allowed human annotators to inspect the template matching inputs and outputs. The tool is based on the visualization of the displacement vectors that result from each template match across a section, as shown in Fig. 8. Any match that differed significantly (over 50px) from its neighbors was rejected, and matches that differed from their neighbors, but not significantly, were individually inspected for correctness by visualizing a false color overlay of the template over the source at the match location. The latter step was needed because there were many true matches that deviated prominently from their neighbors: the template patch could contain neurites or other features parallel to the sectioning plane, resulting in large motions of specific features in a random direction that may not be consistent with the movement of the larger area around the template (see Fig. 9 for an example of this behavior). Table 1 summarizes the error counts in each experiment.

Table 2. Image parameters for training and testing. Unless otherwise noted, resolutions are given after 3x downsampling, where 1px represents 21 × 21 nm.

                          Training    Adjacent        Across
Template size             160px       160px, 224px    160px, 224px
Source size               512px       512px           512px
Section depth             40 nm       40 nm           80 nm
Section size (full res.)  33,000px    15,000px        15,000px
No. of sections           1040        95              48
No. of matches            10,000      144,000         72,000
Bandpass σ (full res.)    N/A         2.0–12.0px      2.5–25.0px

A.3 Match Filtering

To assess how easily false matches could be removed, we evaluated matches with three criteria, listed below Table 3.


Table 3. Dissociation of the true match sets between bandpass and NCCNet. Counts of true matches per category. Total possible adjacent matches: 144,500. Total possible across matches: 72,306.

               Adjacent 160px   Adjacent 224px   Across 160px    Across 224px
Neither        144 (0.10%)      54 (0.04%)       162 (0.22%)     33 (0.05%)
Bandpass only  117 (0.08%)      15 (0.01%)       65 (0.09%)      12 (0.02%)
NCCNet only    336 (0.23%)      106 (0.07%)      1342 (1.86%)    307 (0.42%)

– norm: The Euclidean norm of the displacement required to move the template image to its match location in the source image, at full resolution.
– r_max: The first peak of the correlogram, which serves as a proxy for confidence in the match.
– r_delta: The difference between the first peak and the second peak (after removing a 5px square window surrounding the first peak) of the correlogram. It provides an estimate of the certainty that there is no other likely match in the source image, and it is the criterion the NCCNet was trained to optimize.

These criteria can serve as useful heuristics to accept or reject matches. They approximate the unknown partition into true and erroneous correspondences: the less the actual distributions overlap when projected onto a criterion dimension, the more useful that criterion. Figure 3 plots these three criteria across the three image conditions. The improvement in rejection efficiency also generalized well across experiments, as evident in Appendix Fig. 15. Achieving a 0.1% error rate on the most difficult task we tested (across, 160px template size) required rejecting 20% of the true matches with bandpass, while rejecting less than 1% of the true matches was sufficient with the convolutional networks.

False matches can be rejected efficiently with the match criteria. The NCCNet transformed the true match distributions for r_max and r_delta to be more left-skewed, while the erroneous match distribution for r_delta remains at lower values (see Fig. 3a), resulting in distributions more amenable to accurate error rejection. For the case of adjacent sections with 224px templates, we can remove every error in our NCCNet output by rejecting matches with an r_delta below 0.05, which removes only 0.12% of the true matches. The same threshold also

Table 4. Residual block architecture. i is the number of input channels; n ranges over {8, 16, 32, 64} on downsampling levels and {32, 16, 8} on upsampling levels.

Layer   Type          Kernel size     Input layer
1       Convolution   3 × 3 × i × n   0
2       Convolution   3 × 3 × n × n   1
3       Convolution   3 × 3 × n × n   2
4       Convolution   3 × 3 × n × n   3
5       Convolution   3 × 3 × n × n   4 + 1

Table 5. Each channel architecture.

Layer   Type             Kernel size       Striding   Input layer
1       Residual block   3 × 3 × 1 × 8                0
2       Max-pooling                        2 × 2      1
3       Residual block   3 × 3 × 8 × 16               2
4       Max-pooling                        2 × 2      3
5       Residual block   3 × 3 × 16 × 32              4
6       Max-pooling                        2 × 2      5
7       Residual block   3 × 3 × 32 × 64              6
8       Resize conv      3 × 3 × 64 × 32   2 × 2      7
9       Residual block   3 × 3 × 32 × 32              8 + 6
10      Resize conv      3 × 3 × 32 × 16   2 × 2      9
11      Residual block   3 × 3 × 16 × 16              10 + 4
12      Resize conv      3 × 3 × 16 × 8    2 × 2      11
13      Residual block   3 × 3 × 8 × 8                12 + 1
14      Residual block   3 × 3 × 8 × 1                13

Fig. 10. Bandpass vector field has no false matches

removes all false matches in the bandpass outputs, but rejects 0.40% of the true matches (see Fig. 3b). This 3.5x improvement in rejection efficiency is critical for balancing the trade-off between complete elimination of false matches and retention of as many true matches as possible. The NCCNet produced matches in the vast majority of cases where the bandpass produced matches. It introduced some false matches that the bandpass did not, but it correctly identified 3–20 times as many additional true matches (see Table 3).


Fig. 11. Comparable quality

Fig. 12. NCCNet output is slightly better

Fig. 13. Match criteria for adjacent, 160px.


Fig. 14. Match criteria for across, 160px.

Fig. 15. Match criteria for across, 224px.

The majority of the false matches in the convnet output were also present in the bandpass case, which establishes the NCCNet as superior to, and not merely different from, the bandpass. Defining whether a patch pair is a true match depends on the tolerance for spatial localization error. Our method makes this tolerance explicit through the definition of the secondary peak. Explicitness is required by our weakly supervised approach and is preferable, because blindly accepting a hidden tolerance parameter in the training set could produce suboptimal results in some applications (Table 2).
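A small sketch of the rejection-efficiency computation behind Fig. 3b: given r_delta values for manually labelled true and false matches, find the smallest threshold that removes every false match and report how many true matches it sacrifices. The arrays below are random placeholders, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)
r_delta_true = rng.beta(5, 2, size=10_000)        # placeholder: true matches skew toward high r_delta
r_delta_false = rng.beta(2, 6, size=30)           # placeholder: false matches skew toward low r_delta

threshold = r_delta_false.max()                   # reject everything at or below this value
rejected_true = np.mean(r_delta_true <= threshold)
print(f"threshold={threshold:.3f}, true matches rejected={100 * rejected_true:.2f}%")
```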


References 1. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: speeded up robust features. In: Computer vision–ECCV 2006, pp. 404–417 (2006) 2. Berg, A.C., Malik, J.: Geometric blur for template matching. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, vol. 1, p. I. IEEE (2001) 3. Bromley, J., Bentz, J.W., Bottou, L., Guyon, I., LeCun, Y., Moore, C., S¨ ackinger, E., Shah, R.: Signature verification using a “siamese” time delay neural network. IJPRAI 7(4), 669–688 (1993) 4. Brown, M., Hua, G., Winder, S.: Discriminative learning of local image descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 33(1), 43–57 (2011) 5. Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, vol. 1, pp. 539–546. IEEE (2005) 6. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T.: FlowNet: learning optical flow with convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2758–2766 (2015) 7. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981) 8. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256 (2010) 9. Han, X., Leung, T., Jia, Y., Sukthankar, R., Berg, A.C.: MatchNet: unifying feature and metric learning for patch-based matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3279–3286 (2015) 10. Hegde, V., Zadeh, R.: FusionNet: 3D object classification using multiple data representations. arXiv preprint arXiv:1607.05695 (2016) 11. Heo, Y.S., Lee, K.M., Lee, S.U.: Robust stereo matching using adaptive normalized cross-correlation. IEEE Trans. Pattern Anal. Mach. Intell. 33(4), 807–822 (2011) R Mach. Learn. 5(4), 12. Kulis, B., et al.: Metric learning: a survey. Found. Trends 287–364 (2013) 13. Kumar, B.G., Carneiro, G., Reid, I., et al.: Learning local image descriptors with deep Siamese and triplet convolutional networks by minimising global loss functions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5385–5394 (2016) 14. Lenc, K., Vedaldi, A.: Learning covariant feature detectors. In: Computer Vision– ECCV 2016 Workshops, pp. 100–117. Springer, Heidelberg (2016) 15. Lewis, J.P.: Fast template matching. In: Vision Interface, vol. 95, pp. 15–19 (1995) 16. Lichtman, J.W., Pfister, H., Shavit, N.: The big data challenges of connectomics. Nat. Neurosci. 17(11), 1448–1454 (2014) 17. Liu, C., Yuen, J., Torralba, A.: SIFT flow: dense correspondence across scenes and its applications. IEEE Trans. Pattern Anal. Mach. Intell. 33(5), 978–994 (2011) 18. Long, J.L., Zhang, N., Darrell, T.: Do convnets learn correspondence? In: Advances in Neural Information Processing Systems, pp. 1601–1609 (2014) 19. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)


20. Luo, J., Konofagou, E.E.: A fast normalized cross-correlation calculation method for motion estimation. IEEE Trans. Ultrason. Ferroelectr. Freq. Control 57(6), 1347–1357 (2010) 21. Pathak, D., Girshick, R., Doll´ ar, P., Darrell, T., Hariharan, B.: Learning features by watching objects move. arXiv preprint arXiv:1612.06370 (2016) 22. Preibisch, S., Saalfeld, S., Rohlfing, T., Tomancak, P.: Bead-based mosaicing of single plane illumination microscopy images using geometric local descriptor matching. In: SPIE Medical Imaging, p. 72592S. International Society for Optics and Photonics (2009) 23. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer, Heidelberg (2015) 24. Saalfeld, S., Fetter, R., Cardona, A., Tomancak, P.: Elastic volume reconstruction from series of ultra-thin microscopy sections. Nat. Methods 9(7), 717–720 (2012) 25. Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., Moreno-Noguer, F.: Discriminative learning of deep convolutional feature point descriptors. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 118–126 (2015) 26. Smeulders, A.W.M., Chu, D.M., Cucchiara, R., Calderara, S., Dehghan, A., Shah, M.: Visual tracking: an experimental survey. IEEE Trans. Pattern Anal. Mach. Intell. 36(7), 1442–1468 (2014) 27. Subramaniam, A., Chatterjee, M., Mittal, A.: Deep neural networks with inexact matching for person re-identification. In: Advances in Neural Information Processing Systems, pp. 2667–2675 (2016) 28. Tulyakov, S., Ivanov, A., Fleuret, F.: Weakly supervised learning of deep metrics for stereo reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1339–1348 (2017) 29. Valmadre, J., Bertinetto, L., Henriques, J., Vedaldi, A., Torr, P.H.S.: End-to-end representation learning for correlation filter based tracking. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5000–5008. IEEE (2017) 30. Yang, L., Jin, R.: Distance metric learning: a comprehensive survey. Michigan State Univ. 2(2), 4 (2006) 31. Yi, K.M., Trulls, E., Lepetit, V., Fua, P.: Lift: learned invariant feature transform. In: European Conference on Computer Vision, pp. 467–483. Springer, Heidelberg (2016) 32. Zagoruyko, S., Komodakis, N.: Learning to compare image patches via convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4353–4361 (2015) 33. Zheng, Z., Lauritzen, J.S., Perlman, E., Robinson, C.G., Nichols, M., Milkie, D., Torrens, O., Price, J., Fisher, C.B., Sharifi, N., Calle-Schuler, S.A., Kmecova, L., Ali, I.J., Karsh, B., Trautman, E.T., Bogovic, J., Hanslovsky, P., Jefferis, G.S.X.E., Kazhdan, M., Khairy, K., Saalfeld, S., Fetter, R.D., Bock, D.D.: A complete electron microscopy volume of the brain of adult drosophila melanogaster. bioRxiv (2017)

Nature Inspired Meta-heuristic Algorithms for Deep Learning: Recent Progress and Novel Perspective Haruna Chiroma1(&), Abdulsalam Ya’u Gital2, Nadim Rana3, Shafi’i M. Abdulhamid4, Amina N. Muhammad5, Aishatu Yahaya Umar5, and Adamu I. Abubakar6(&) 1

Department of Computer Science, Federal College of Education (Technical), Gombe, Nigeria [email protected] 2 Department of Mathematical Sciences, Abubakar Tafawa Balewa University, Bauchi, Nigeria [email protected] 3 College of Computer Science and Information Systems, Jazan University, Jazan, Kingdom of Saudi Arabia 4 Department of Cyber Security Science, Federal University of Technology, Minna, Nigeria [email protected] 5 Department of Mathematics, Gombe State University, Gombe, Nigeria [email protected] 6 Department of Computer Science, International Islamic University Malaysia, Gombak, Malaysia [email protected]

Abstract. Deep learning is presently attracting extraordinary attention from both industry and academia. The application of deep learning in computer vision has recently gained popularity. The optimization of deep learning models through nature inspired algorithms is a subject of debate in computer science. The application areas of hybrids of nature inspired algorithms and deep learning architectures include machine vision and learning, image processing, data science, autonomous vehicles, medical image analysis, biometrics, etc. In this paper, we present recent progress on the application of nature inspired algorithms in deep learning. The survey points out recent development issues, strengths, weaknesses and prospects for future research. A new taxonomy is created based on nature inspired algorithms for deep learning. The trend of publications in this domain is depicted; it shows that the research area is growing, but slowly. The deep learning architectures not yet exploited by nature inspired algorithms for optimization are also identified. We believe that this survey can facilitate synergy between the nature inspired algorithms and deep learning research communities. As such, massive attention can be expected in the near future.

Keywords: Deep learning · Deep belief network · Cuckoo search algorithm · Convolutional neural network · Firefly algorithm · Nature inspired algorithms



1 Introduction

Nature inspired algorithms are metaheuristic algorithms inspired by nature. The inspiration can come from natural biological systems, evolution, human activities, the group behaviour of animals, etc. For example, the biological human brain inspired the artificial neural network (ANN) [1], the genetic algorithm (GA) was inspired by the theory of evolution [2], the cuckoo search algorithm (CSA) was inspired by the behaviour of cuckoo birds [3], and the artificial bee colony (ABC) took its inspiration from the behaviour of bees [4], among many other algorithms. These nature inspired algorithms have been found to be very effective and efficient in solving real world optimization problems, often better than conventional algorithms, because of their ability to handle highly nonlinear and complex problems, especially in science and engineering [5].

The ANN was an early and major breakthrough in the field of artificial intelligence. The ANN model has been very active in solving complex real world problems in different machine learning application domains such as health [6, 7], agriculture [8], the automobile industry [9] and finance [10]. Currently, the ANN in its single, hybrid or ensemble form remains an active research area [11] and is expected to receive more attention in the future, for example through its role in self-driving vehicles [9]. However, the ANN is trained with the back propagation algorithm, which has limitations such as falling into local minima and overfitting the training data. As a result, many researchers have proposed the use of nature inspired algorithms for training the ANN to avoid these challenges. For example, the GA [12, 13], ABC [14], CSA [15] and particle swarm optimization (PSO) [16] have been applied to train ANNs and were found to perform better than the back propagation algorithm and to avoid the local minima problem.

Presently, deep learning [17] is a hot research topic in machine learning. Deep learning is the deep architecture of the ANN, with logistic rules on node weight updates and activation functions. When supplied with large scale data, deep learning models can extract high level abstractions from the data set [18]. However, deep learning faces many limitations, including but not limited to: the lack of a systematic procedure for realizing optimum parameter values, the manual configuration of the deep learning architecture, and the lack of a standard training algorithm. As such, many approaches, including nature inspired algorithms, have been proposed by researchers to mitigate these challenges. The application of nature inspired algorithms in deep learning is limited because of the lack of synergy between the deep learning and nature inspired metaheuristic communities [19]. [18] present the role of nature inspired algorithms in deep learning in the context of big data analytics. However, that study argued that few studies apply nature inspired algorithms in deep learning, and only one study incorporating a nature inspired algorithm in deep learning was reviewed. In this paper, we propose to extend the work of [18] by surveying a larger number of studies that hybridize nature inspired algorithms and deep learning architectures. This is to show the strength of the application of nature inspired algorithms in deep learning and to provide new perspectives for future research that encourage synergy between the nature inspired algorithms and deep learning research communities.
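To make the idea discussed above concrete, the following is a toy, hedged illustration of training a tiny neural network's weights with a small evolutionary loop instead of back propagation. The population size, mutation scale and the XOR task are arbitrary choices for the example, not taken from the surveyed papers.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])                     # XOR targets

def forward(w, x):
    w1, b1, w2, b2 = w[:8].reshape(2, 4), w[8:12], w[12:16], w[16]
    h = np.tanh(x @ w1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))

def fitness(w):
    return -np.mean((forward(w, X) - y) ** 2)          # negative MSE: higher is better

pop = rng.normal(size=(30, 17))                        # 17 weights per candidate network
for gen in range(200):
    scores = np.array([fitness(w) for w in pop])
    parents = pop[np.argsort(scores)[-10:]]            # keep the 10 fittest candidates
    children = parents[rng.integers(0, 10, size=20)] + rng.normal(scale=0.1, size=(20, 17))
    pop = np.vstack([parents, children])               # elitism plus Gaussian mutation

best = pop[np.argmax([fitness(w) for w in pop])]
print(np.round(forward(best, X), 2))                   # should approach [0, 1, 1, 0]
```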


2 Nature Inspired Meta-heuristic Algorithms

As already discussed in Sect. 1, these categories of algorithms are inspired by nature. The number of nature inspired algorithms is large, likely more than 200, as argued by [20]. In this section, only the nature inspired algorithms that have been incorporated into deep learning are outlined; a discussion of each algorithm is beyond the scope of this study. In the literature, little attention has been given to combining the strengths of nature inspired algorithms and deep learning to create more powerful algorithms for solving real world problems. The nature inspired algorithms that have been hybridized with deep learning include the GA, CSA, harmony search algorithm (HSA), simulated annealing (SA), gravitational search algorithm (GSA), ant colony optimization (ACO), firefly algorithm (FFA), evolutionary algorithm (EA) and PSO. Interested readers can refer to [21] for a description of these algorithms and of how their computational procedures operate to achieve a target goal.

3 Deep Learning

In machine learning, deep learning is considered one of the most vibrant research areas. Deep learning started gaining prominence in 2006, when it was presented in the literature [22, 23]. In a real sense, deep learning has existed since the 1940s; however, it came to the limelight from 2006 onward because of technological advances in computing, such as high performance computing systems and GPUs, and the advent of large scale data [24]. The success of a machine learning algorithm depends highly on data representation; as such, deep learning plays a vital role in processing large scale data because it can uncover valuable hidden knowledge [23]. The design of the deep learning architecture results from extending the feed forward ANN with multiple hidden layers [18]; the ANN forms the core of deep learning. The major deep learning architectures are the convolutional neural network (ConvNet), the deep belief network (DBN), the deep recurrent neural network (DRNN), the stacked autoencoder (SAE) and the generative adversarial network (GAN). Deep learning has performed excellently in different application domains, including image and video analysis [25–27], natural language processing [28], text analysis [29], object detection [30], speech processing [31] and dimension reduction [32].

4 The Application of Nature Inspired Algorithms in Deep Learning

The application of nature inspired algorithms to train the ANN is a subject of debate in the computer science community. Those arguing against it hold that the local minima problem targeted by nature inspired algorithms is not a serious one; it is also echoed by [22] that the local minima problem is not a grave issue. Therefore, applying a nature inspired algorithm to train the ANN in order to escape local minima would not be worth the effort.


The local minima problem is believed to be caused by permutations in the hidden layers of the ANN, which can be resolved amicably by minimizing errors. The other school of thought argues that applying nature inspired algorithms to train the ANN has its own strengths: optimum weights can be realised, and the best solution, which is otherwise very hard to reach, can be found at minimal computational cost. Therefore, the application of nature inspired algorithms to deep learning architectures warrants extensive investigation to unravel the benefits [18]. Efforts have been made by researchers to optimize the parameters of deep learning architectures through nature inspired algorithms. Figure 1 shows the taxonomy created from the combinations of nature inspired algorithms and deep learning architectures found in the literature. The efforts made by researchers are discussed as follows:

Fig. 1. Taxonomy of the nature inspired algorithms for deep learning architectures

4.1 Harmony Search Algorithm for Deep Belief Network

The HSA and its variants have been applied by researchers to optimise the parameters of the DBN. For example, [19] propose a quaternion HSA (QHSA) and an improved quaternion HSA (IQHSA) to optimise the parameters of the DBN. The QHSA and the IQHSA are used to optimise the learning rate, hidden layer neurons, momentum and weight decay of the DBN (QHSA-DBN and IQHSA-DBN). Both are evaluated on an image reconstruction problem and found to perform better than the standard algorithms based on the HSA. The author in [33] optimises the Bernoulli restricted Boltzmann machine (RBM) through the HSA to select suitable parameters that minimize reconstruction error. The HSA selects the learning rate, weight decay, penalty parameter and number of hidden units of the Bernoulli RBM (HSA-BRBM). The HSA-BRBM is evaluated on a benchmark image reconstruction dataset and is found to improve on the performance of state-of-the-art algorithms. The author in [34] applied meta-heuristic algorithms to the optimization of the discriminative RBM (DRBM) to move away from the commonly used random search technique for parameter selection. Variants of the HSA (VHSA) and the PSO were used to select the optimum parameters (learning rate, weight decay, penalty parameter and hidden units) of the DRBM (VHSA-DRBM and PSO-DRBM). The effectiveness of the VHSA-DRBM and PSO-DRBM was tested on benchmark datasets, and both were found to perform better than the commonly used random search technique; the HSA provides a trade-off between accuracy and computational load. The author in [33] also proposes the vanilla HSA (VaHSA) for optimizing the parameters of the DBN (VaHSA-DBN). The VaHSA-DBN is tested on multiple datasets and is found to perform better than the classical algorithms; however, its convergence time is expensive.
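To make the kind of parameter-selection setup described above concrete, the sketch below shows a minimal harmony-search-style loop tuning the learning rate and number of hidden units of an RBM. It is an illustration only, not the implementation used in any of the surveyed works: the RBM comes from scikit-learn, the training data are random binary vectors, and the harmony-memory size, bandwidth and iteration counts are assumed values.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
X = (rng.random((500, 64)) > 0.5).astype(float)   # stand-in binary training data

def reconstruction_error(learning_rate, n_hidden):
    """Train a small RBM and measure its one-step Gibbs reconstruction error."""
    rbm = BernoulliRBM(n_components=int(n_hidden), learning_rate=learning_rate,
                       n_iter=10, batch_size=32, random_state=0)
    rbm.fit(X)
    return float(np.mean((X - rbm.gibbs(X)) ** 2))

# Harmony memory: each row is a candidate (learning_rate, n_hidden).
bounds = np.array([[0.001, 0.1], [16.0, 128.0]])
memory = np.column_stack([rng.uniform(lo, hi, 5) for lo, hi in bounds])
scores = np.array([reconstruction_error(lr, nh) for lr, nh in memory])

hmcr, par, bandwidth = 0.9, 0.3, np.array([0.01, 8.0])   # assumed HSA settings
for _ in range(20):
    new = np.empty(2)
    for d in range(2):
        if rng.random() < hmcr:                       # draw from harmony memory
            new[d] = memory[rng.integers(len(memory)), d]
            if rng.random() < par:                    # pitch adjustment
                new[d] += rng.uniform(-1, 1) * bandwidth[d]
        else:                                         # random improvisation
            new[d] = rng.uniform(bounds[d, 0], bounds[d, 1])
        new[d] = np.clip(new[d], bounds[d, 0], bounds[d, 1])
    score = reconstruction_error(new[0], new[1])
    worst = int(np.argmax(scores))
    if score < scores[worst]:                         # replace the worst harmony
        memory[worst], scores[worst] = new, score

best = memory[int(np.argmin(scores))]
print("best learning_rate=%.4f, hidden units=%d" % (best[0], int(best[1])))
```

The same loop structure applies when other hyperparameters (weight decay, momentum, penalty terms) are added as extra dimensions of the harmony memory.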

4.2 Firefly Algorithm for Deep Belief Network

The FFA is one of the nature inspired algorithms used to optimise the parameters of deep learning architectures such as the DBN. For example, the author in [35] introduces the FFA into the DBN (FFA-DBN) to calibrate its parameters for image reconstruction. The calibration is performed automatically by the FFA, eliminating the manual method of calibrating the DBN. The FFA-DBN is applied to binary image reconstruction, and the results show that it outperforms the classical algorithms.

4.3 Cuckoo Search Algorithm for Deep Belief Network

The CSA is among the active algorithms that have gained prominence in the literature, and it has found its way into deep learning for parameter optimization. For example, the author in [36] fine-tunes the parameters of the DBN through the CSA (CSA-DBN). The performance of the CSA-DBN is evaluated on multiple datasets and compared with that of the PSO, HS and IHS for optimising DBNs with different numbers of layers. The results suggest that the CSA-DBN performs better than the compared algorithms.

4.4 Evolutionary Algorithm for Deep Belief Network

The EA has also been used to optimise the parameters of the DBN; to the best of the authors' knowledge, only one study has applied an EA for DBN parameter optimization. The author in [24] applied an adaptive EA to the DBN to tune its parameters automatically, without requiring prerequisite DBN domain knowledge. The EA-DBN (EADBN) is evaluated on both benchmark and real world datasets, and the results show that the EADBN enhances the performance of the standard variants of the DBN.

4.5 Ant Colony Optimization for Deep Belief Network

The ACO is an established algorithm for solving optimization problems and has been used for the optimization of DBN parameters. To avoid the challenges of finding the optimum parameters of the discriminative deep belief network (DDBN), the ACO is applied to automatically determine the best architecture of the DBN without heavy human effort. The numbers of neurons in the two hidden layers and the learning rate of the DBN were automatically determined by the ACO (ACO-DDBN). The ACO-DDBN is applied in prognosis to assess the state of health of components and systems; it performs better than the classical grid-based DDBN and the support vector machine in both accuracy and computational time [37].

4.6 Particle Swarm Optimization for Deep Restricted Boltzmann Machine

The PSO has been applied to solve many optimization problems, and the literature shows it being used to optimise deep RBM parameters. [38] applied the PSO to automatically determine the structure of a deep RBM (DeRBM). The PSO determines the optimal number of units and the parameters of the DeRBM, code-named PSO-DeRBM, which is used to predict time series. The performance of the PSO-DeRBM is evaluated against a multi-layer perceptron neural network (MLP), and the results show that the PSO-DeRBM is superior to the MLP.

4.7 Harmony Search Algorithm for Convolutional Neural Network

The HSA has been applied to the DBN as discussed in the preceding sections. In addition, [35] proposes variants of the HSA to fine-tune the large number of parameters in a ConvNet (HSA-ConvNet), avoiding the manual process because it is prone to error. The HSA-ConvNet is applied to handwriting and fingerprint recognition and to image classification, and it has been shown to perform better than the standard algorithms.

4.8 Genetic Algorithm for Convolutional Neural Network

The GA is one of the earliest nature inspired algorithms and motivated researchers to propose many other variants; it has been used extensively in solving optimization problems and in optimising the parameters of deep learning models. For example, the author in [39] applied the GA and grammatical evolution to reduce the manual trial-and-error procedure of determining the parameters of a ConvNet (GAConvNet and GEConvNet). The evolutionary algorithms are used to determine the ConvNet architecture and hyperparameters. The GAConvNet and GEConvNet are evaluated on benchmark datasets, and the results suggest that they enhance the performance of the classical ConvNet.

4.9 Genetic Algorithm for Restricted Boltzmann Machine

The GA is also applied to the optimization of RBM parameters; two studies were found to do so. First, [40] proposes the use of the GA for the automatic design of the RBM (GA-RBM). The GA initializes the RBM weights to determine the numbers of visible and hidden neurons, and it is able to realize the optimum structure of the deep RBM. The GA-RBM was tested on handwriting classification problems, and the results show that it performs better than the conventional RBM and the shallow RBM structure. The author in [41] incorporated the GA into the DeRBM. The weighted nearest neighbour (WNN) weights are evolved using the GA. The effectiveness of the proposed GA-based WNN (GAWNN) and DeRBM (GADeRBM) is evaluated on classification problems, and it is found to perform better than the SVM and the statistical nearest neighbour.

4.10 Simulated Annealing for Convolutional Neural Network

The author in [42] used SA to improve the performance of the ConvNet. The SA is applied to the ConvNet (SAConvNet) to optimize its parameters, and the performance of the SAConvNet is compared with that of the classical ConvNet. The SAConvNet enhances the performance of the ConvNet, but its convergence time increases.

4.11 Gravitational Search Algorithm for Convolutional Neural Network

The author in [43] incorporated the GSA into the ConvNet to improve its performance and avoid it becoming stuck in local minima. The GSA is used to train the ConvNet in conjunction with the backpropagation algorithm (GSA-ConvNet). The GSA-ConvNet is evaluated on an OCR application and is found to improve the performance of the conventional ConvNet.

5 The General Overview of the Synergy Between Nature Inspired Algorithms and Deep Learning
An overview of the research area is presented in this section to show the strength of incorporating nature inspired algorithms into deep neural network architectures. It is clear that the penetration of nature inspired algorithms into deep learning has received little attention from the research community. This is surprising given that both deep learning and nature inspired algorithms have individually received unprecedented attention (e.g. see [44] for deep learning and [21] for nature inspired algorithms), and both research areas are well established in solving challenging and complex real world problems. As stated earlier, a lack of synergy between the two research communities exists. Nevertheless, evidence from the literature clearly indicates that combining nature inspired algorithms and deep learning improves the performance of deep learning architectures: the reported results show that this synergy consistently improves the accuracy of the conventional architectures. In addition, the laborious trial-and-error technique of determining the large number of parameters of a deep learning architecture is eliminated, because the optimum parameters are realized automatically by the nature inspired algorithms, removing the human effort involved. Figure 2 depicts the trend of the synergy between nature inspired algorithms and deep learning. As shown in Fig. 2, although the two research areas predate 2012, the combination of nature inspired algorithms and deep learning started appearing in the literature in 2012 with two publications. In 2013 there was a break, without a single study in this direction. The years 2015 and 2017 witnessed the highest numbers of works, and 2018 is still active, with one publication having appeared at the time of writing this manuscript. We observe that Papa et al. are at the forefront of promoting the synergy between the nature inspired algorithms and deep learning research communities. The research area does not get the magnitude of attention it deserves; in general, it can be deduced that it is slowly gaining acceptance within the research community, because the number of studies over the last four years has increased. The trend is expected to grow at a faster rate in the near future.

Fig. 2. The trend of the integration of nature inspired algorithms and deep learning architectures

6 Challenges and Future Research Directions
Nature inspired algorithms require parameter settings of their own, and the best systematic way to find the optimum settings for them remains an open research problem. Adding a nature inspired algorithm to deep learning therefore introduces additional parameters to set, even though the parameter settings of the deep learning architecture itself can be reduced, since some of them are determined automatically by the nature inspired algorithm. If the parameter settings of the nature inspired algorithm are not good enough to provide good performance, the effect is multiplied in the deep learning architecture, reducing its performance and possibly causing the model to become stuck in local minima; the performance of nature inspired algorithms heavily depends on their parameter settings. Future work should target parameterless nature inspired algorithms that eliminate the need for human intervention in setting parameters. We expect future deep learning models to be autonomous.
The weights of a deep learning architecture play a critical role, because its performance heavily depends on optimal initial weights. The author in [32] argued that fine-tuning weights can be accomplished by gradient descent in an autoencoder and works well, especially if the initial weights are near the optimum. The critical influence of the weights on performance has prompted many researchers to propose various ways of obtaining optimum weights (e.g. [45–47]). Despite this, very little attention has been paid to optimizing the initial weights of deep learning architectures through nature inspired algorithms. Intensive study of nature inspired algorithms for optimising deep learning initial weights should be a major concern of future research.
Although nature-inspired algorithms improve accuracy, they sometimes increase the convergence time of the deep learning architecture. As such, real life applications where time is critical will not be suitable for deep learning models incorporating nature inspired algorithms; a typical example is a medical facility, where one second can mean a serious tragedy or death. There is, however, evidence that nature inspired algorithms can also improve the convergence speed of deep learning models, so it can be concluded that their effect on convergence speed is not consistent. In the future, researchers should work on the convergence speed of nature inspired algorithms for deep learning to ensure consistency.
One of the major challenges of meta-heuristic algorithms is that in some cases they require meta-optimization to enhance their performance. The meta-optimization procedure is excessive in deep learning applications, which already involve significant computational effort [34], and it can add to the complexity and challenges of deep learning. In the future, researchers should work towards reducing the meta-optimization effort in nature inspired algorithms. The author in [18] pointed out that excessive optimization of an ANN through a nature inspired algorithm reduces the flexibility of the ANN, which can cause overfitting of the training data. Excessive training of deep learning models with nature inspired algorithms should therefore be discouraged and controlled to a tolerable level. Finally, the survey revealed that some major deep learning architectures, such as the GAN, SAE, DRNN and deep echo state network, have not yet been exploited by nature inspired algorithms. The GAN [48] is a newly proposed deep learning architecture.

7 Conclusions
This paper presents recent developments regarding the incorporation of nature inspired algorithms into deep learning architectures. A concise view of the recent developments, strengths, challenges and opportunities for future research regarding the synergy between nature inspired algorithms and deep learning is given. It was found that the synergy between the nature inspired algorithms and the deep learning research communities is limited, considering the little attention it has attracted in the literature. We believe this paper has the potential to bridge the communication gap between the two research communities. Expert researchers can use it as a benchmark for developing the research area, while novice researchers can use it as initial reading material for starting research in this domain.

References 1. McCulloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115–133 (1943) 2. Holland, J.: Adaptation in natural and artificial systems: an introductory analysis with application to biology, Control and artificial intelligence (1975) 3. Yang, X.-S., Deb, S.: Cuckoo search via lévy flights. In: 2009 World Congress on Nature and Biologically Inspired Computing, NaBIC 2009, pp. 210–214 (2009) 4. Karaboga, D.: An idea based on honey bee swarm for numerical optimization, Technical report-tr06, Erciyes university, engineering faculty, computer engineering department (2005) 5. Yang, X.-S., Deb, S., Fong, S., He, X., Zhao, Y.: Swarm intelligence: today and tomorrow. In: 2016 3rd International Conference on Soft Computing and Machine Intelligence (ISCMI), pp. 219–223 (2016) 6. Chiroma, H., Abdul-kareem, S., Ibrahim, U., Ahmad, I.G., Garba, A., Abubakar, A., et al.: Malaria severity classification through Jordan-Elman neural network based on features extracted from thick blood smear. Neural Netw. World 25, 565 (2015) 7. Chaoui, H., Ibe-Ekeocha, C.C.: State of charge and state of health estimation for lithium batteries using recurrent neural networks. IEEE Trans. Veh. Technol. 66, 8773–8783 (2017) 8. Dolezel, P., Skrabanek, P., Gago, L.: Pattern recognition neural network as a tool for pest birds detection. In: 2016 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–6 (2016) 9. Nie, L., Guan, J., Lu, C., Zheng, H., Yin, Z.: Longitudinal speed control of autonomous vehicle based on a self-adaptive PID of radial basis function neural network. IET Intel. Transp. Syst. 12(6), 485–494 (2018) 10. Bahrammirzaee, A.: A comparative survey of artificial intelligence applications in finance: artificial neural networks, expert system and hybrid intelligent systems. Neural Comput. Appl. 19, 1165–1195 (2010) 11. Xu, Y., Cheng, J., Wang, L., Xia, H., Liu, F., Tao, D.: Ensemble one-dimensional convolution neural networks for skeleton-based action recognition. IEEE Sig. Process. Lett. 25(7), 1044–1048 (2018) 12. Lam, H., Ling, S., Leung, F.H., Tam, P.K.-S.: Tuning of the structure and parameters of neural network using an improved genetic algorithm. In: 2001 The 27th Annual Conference of the IEEE Industrial Electronics Society, IECON 2001, pp. 25–30 (2001) 13. Chiroma, H., Abdulkareem, S., Abubakar, A., Herawan, T.: Neural networks optimization through genetic algorithm searches: a review. Appl. Math. 11, 1543–1564 (2017) 14. Karaboga, D., Akay, B., Ozturk, C.: Artificial bee colony (ABC) optimization algorithm for training feed-forward neural networks. In: International Conference on Modeling Decisions for Artificial Intelligence 2007, pp. 318–329 (2007) 15. Nawi, N.M., Khan, A., Rehman, M.Z.: A new back-propagation neural network optimized with cuckoo search algorithm. In: International Conference on Computational Science and Its Applications 2013, pp. 413–426 (2013)


16. Juang, C.-F.: A hybrid of genetic algorithm and particle swarm optimization for recurrent network design. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 34, 997–1006 (2004) 17. Hinton, G.E., Osindero, S., Teh, Y.-W.: A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527–1554 (2006) 18. Fong, S., Deb, S., Yang, X.-s.: How meta-heuristic algorithms contribute to deep learning in the hype of big data analytics. In: Progress in Intelligent Computing Techniques: Theory, Practice, and Applications, pp. 3–25. Springer (2018) 19. Papa, J.P., Rosa, G.H., Pereira, D.R., Yang, X.-S.: Quaternion-based deep belief networks fine-tuning. Appl. Soft Comput. 60, 328–335 (2017) 20. Fister Jr., I., Yang, X.-S., Fister, I., Brest, J., Fister, D.: A brief review of nature-inspired algorithms for optimization, arXiv preprint arXiv:1307.4186 (2013) 21. Xing, B., Gao, W.-J.: Innovative computational intelligence: a rough guide to 134 clever algorithms. Springer (2014) 22. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436 (2015) 23. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intel. 35, 1798–1828 (2013) 24. Zhang, C., Tan, K.C., Li, H., Hong, G.S.: A cost-sensitive deep belief network for imbalanced classification. IEEE Trans. Neural Netw. Learn. Syst. 28(99), 1–4 (2018) 25. Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., Martinez-Gonzalez, P., Garcia-Rodriguez, J.: A survey on deep learning techniques for image and video semantic segmentation. Appl. Soft. Comput. 70, 41–65 (2018) 26. Liu, Y., Chen, X., Wang, Z., Wang, Z.J., Ward, R.K., Wang, X.: Deep learning for pixellevel image fusion: recent advances and future prospects. Inf. Fusion 42, 158–173 (2018) 27. Yaseen, M.U., Anjum, A., Rana, O., Antonopoulos, N.: Deep learning hyper-parameter optimization for video analytics in clouds. IEEE Trans. Syst. Man Cybern: Syst 15(99), 1–12 (2018) 28. Neterer, J.R., Guzide, O.: Deep learning in natural language processing. Proc. West Va. Acad. Sci. 90(1) (2018) 29. Tang, M., Gao, H., Zhang, Y., Liu, Y., Zhang, P., Wang, P.: Research on deep learning techniques in breaking text-based captchas and designing image-based captcha. IEEE Trans. Inf. Forensics Secur. 13, 2522–2537 (2018) 30. Pathak, A.R., Pandey, M., Rautaray, S.: Application of deep learning for object detection. Proc. Comput. Sci. 132, 1706–1717 (2018) 31. Ji, Y., Liu, L., Wang, H., Liu, Z., Niu, Z., Denby, B.: Updating the Silent Speech Challenge benchmark with deep learning. Speech Commun. 98, 42–50 (2018) 32. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006) 33. Papa, J.P., Rosa, G.H., Marana, A.N., Scheirer, W., Cox, D.D.: Model selection for discriminative restricted boltzmann machines through meta-heuristic techniques. J. Comput. Sci. 9, 14–18 (2015) 34. Papa, J.P., Rosa, G.H., Costa, K.A., Marana, N.A., Scheirer, W., Cox, D.D.: On the model selection of bernoulli restricted Boltzmann machines through harmony search. In: Proceedings of the Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Computation, pp. 1449–1450 (2015) 35. Rosa, G., Papa, J., Costa, K., Passos, L., Pereira, C., Yang, X.-S.: Learning parameters in deep belief networks through firefly algorithm. In: IAPR Workshop on Artificial Neural Networks in Pattern Recognition, pp. 138–149 (2016) 36. 
Rodrigues, D., Yang, X.-S., Papa, J.: Fine-tuning deep belief networks using cuckoo search. In: Bio-Inspired Computation and Applications in Image Processing, pp. 47–59 (2017)


37. Ma, M., Sun, C., Chen, X.: Discriminative deep belief networks with ant colony optimization for health status assessment of machine. IEEE Trans. Instrum. Measur. 66, 3115–3125 (2017) 38. Kuremoto, T., Kimura, S., Kobayashi, K., Obayashi, M.: Time series forecasting using restricted Boltzmann machine. In: International Conference on Intelligent Computing, pp. 17–22 (2012) 39. Baldominos, A., Saez, Y., Isasi, P.: Evolutionary convolutional neural networks: An application to handwriting recognition. Neurocomputing 283, 38–52 (2018) 40. Liu, K., Zhang, L.M., Sun, Y.W.: Deep Boltzmann machines aided design based on genetic algorithms. In: Applied Mechanics and Materials, pp. 848–851 (2014) 41. Levy, E., David, O.E., Netanyahu, N.S.: Genetic algorithms and deep learning for automatic painter classification. In: Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, pp. 1143–1150 (2014) 42. Rere, L.R., Fanany, M.I., Arymurthy, A.M.: Simulated annealing algorithm for deep learning. Proc. Comput. Sci. 72, 137–144 (2015) 43. Fedorovici, L.-O., Precup, R.-E., Dragan, F., David, R.-C., Purcaru, C.: Embedding gravitational search algorithms in convolutional neural networks for OCR applications. In: 7th IEEE International Symposium on Applied Computational Intelligence and Informatics (SACI) 2012, pp. 125–130 (2012) 44. Schmidhuber, J.: Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015) 45. Chaturvedi, I., Ong, Y.-S., Tsang, I.W., Welsch, R.E., Cambria, E.: Learning word dependencies in text by means of a deep recurrent belief network. Knowl.-Based Syst. 108, 144–154 (2016) 46. Mannepalli, K., Sastry, P.N., Suman, M.: A novel adaptive fractional deep belief networks for speaker emotion recognition. Alexandria Eng. J. 56(4), 485–497 (2016) 47. Qiao, J., Wang, G., Li, X., Li, W.: A self-organizing deep belief network for nonlinear system modeling. Appl. Soft Comput. 65, 170–183 (2018) 48. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)

Transfer Probability Prediction for Traffic Flow with Bike Sharing Data: A Deep Learning Approach

Wenwen Tu (Southwest Jiaotong University, Chengdu 610031, China, [email protected]) and Hengyi Liu (University of Waterloo, Waterloo, ON N2L 3G1, Canada, [email protected])

Abstract. As the fourth generation of bike sharing, the dockless shared bike is equipped with an electronic lock on the rear wheel. Since the dockless bike sharing system does not require specific infrastructure to be built, it has become one of the important travel modes for residents, and the distribution pattern of shared bike usage can capably represent residents' travel demands. However, because the number of shared bikes is very large, the trajectory data contain a large amount of sparsely distributed data and noise, which results in high computational complexity and low accuracy. To address this problem, a novel deep learning algorithm is proposed for predicting the transfer probability of shared bike traffic flow. A stacked Restricted Boltzmann Machine (RBM)-Support Vector Regression (SVR) deep learning algorithm is proposed, and a heuristic, hybrid optimization algorithm is utilized to optimize its parameters. In the experimental case, real shared bike data were used to confirm the performance of the proposed algorithm. The comparisons reveal that the stacked RBM-SVR algorithm, with the help of the hybrid optimization algorithm, outperforms both the plain SVR algorithm and the unoptimized stacked RBM-SVR algorithm.

Keywords: Shared bikes · Transfer probability of traffic flow · Stacked RBM

1 Introduction
Bike sharing has drawn attention in recent years as a new type of environmentally friendly sharing economy [1]. It is an innovative way to reduce traffic pollution and can effectively solve the "last mile" problem of resident trips. There are two main types of bike sharing in China: pile station (docked) bike sharing and pile-free (dockless) bike sharing. China's shared bike market has experienced three stages of development. In the first stage, from 2007 to 2010, the public bike sharing business model that started abroad began to enter the country; the systems mostly used pile station bikes and were mainly managed by the government. In the second stage, from 2010 to 2014, enterprises specializing in the bicycle market began to appear, but public bicycles were still dominated by pile station bikes. In the third stage, from 2014 to 2018, with the rapid development of the mobile Internet, commercial Internet bike sharing


companies arrived, and this more convenient type of pile-free bike began to replace the pile station bikes [2]. Pile-free bike sharing is the research object of this paper. A pile station bike sharing system requires parking piles, card swipe equipment and a cumbersome certification and registration process. Pile stations are usually built at urban transportation hubs or in large residential communities, bicycles must be picked up and parked at a fixed pile site, and users need to bring their ID card to a neighborhood committee to apply for a rental pass that activates the smart chip of the bike. In a pile-free bike sharing scheme, users can access and park the shared bikes at any time and in any place, so the travel demand for shared bicycles is not limited by the distribution of piles [3]. Therefore, the use of shared bikes can capably represent the travel demands of residents, and studying the distribution pattern of shared bike users' destinations is a meaningful issue [4, 5]. However, because of the large number of shared bikes, analyzing and predicting each user individually takes a great deal of computation time [6]. This paper divides the city into traffic zones and studies the pattern of shared bike flow in each zone. On this basis, we predict the ratio of shared bike transit flow between traffic zones, which can be understood as the probability that a shared bike user departing from traffic zone A arrives at traffic zone B. The transfer probability of shared bikes thus indicates the probability that a user travels from one traffic zone to another. If the total demand for shared bikes in a traffic zone is known, we can then predict the OD (Origin-Destination) distribution and traffic assignment in the city. Accurate prediction of the OD distribution and traffic assignment is beneficial for the transportation planning of bike sharing, and it also helps governments and bike sharing companies to schedule shared bikes more efficiently. In China, bike sharing companies have dropped an excessive number of bikes in cities to occupy the market; the excess bikes and inefficient scheduling have led to environmental damage and a waste of road resources. Through traffic assignment and traffic distribution, the scheduling problem can be better solved. At the same time, bikeways can be built between traffic zones with high transfer probability to improve the biking environment and increase the attractiveness of shared bikes. Additionally, based on the transfer probability of shared bikes, a recommended list of destinations can be created for shared bike users, improving the quality of service, increasing the appeal of shared bikes and reducing reliance on the car. As a newly developed travel mode, the algorithms for predicting the destinations of trips based on shared bikes need to be researched more deeply [7–9]. Traditional destination prediction algorithms are mainly developed from probability theory and neural networks. When a probabilistic model is used to process trajectory data that contain a large amount of sparsely distributed data and noise, a high-precision result cannot always be guaranteed [10–13]. In previous studies, most research was conducted on pile station bike sharing, but pile-free bike sharing differs from pile station bike sharing in its travel characteristics: the number of trajectories is enormous and the number of parking locations is large. Therefore, it is a challenge to utilize traditional methods to analyze pile-free bike sharing data. Destination prediction algorithms based on shallow neural networks have limited processing capability in the presence of massive data. In deep neural networks (DNN), however, a model with multiple hidden layers can be developed on top of the artificial neural network. The hidden layers of a DNN convert the input data into a


more abstract, composite representation. The "depth" in "deep learning" indicates the number of layers through which the data are transformed. Compared with shallow network models, a DNN can model more complex nonlinear relationships with fewer neurons [14–20]. The Restricted Boltzmann Machine (RBM) is a generative stochastic artificial neural network that can learn a probability distribution over its input set, and it can also be used as a building block in deep learning networks [21, 22]. Deep Belief Networks (DBN) can be formed by "stacking" RBMs and optionally fine-tuning the result with gradient descent and backpropagation [23–27]. In this paper, a hybrid stacked RBM-SVR algorithm, combining support vector regression (SVR) [28] with a stacked RBM, is developed to predict continuous outputs. The bottom layers of the model form a stacked RBM for feature learning, and the top layer is an SVR model that realizes the continuous output prediction. In this SVR model the kernel function is a radial basis function, and there are two crucial parameters: the error penalty factor C and the kernel function parameter γ. These two parameters affect the model's fitting and generalization capability [29–31], and their selection is crucial to the performance of the SVR. Therefore, these parameters are optimized to improve the prediction accuracy of the stacked RBM-SVR algorithm proposed in this study. In the field of optimization algorithms, Mirjalili, Mirjalili and Lewis [32] proposed the Grey Wolf Optimizer (GWO) in 2014 as a meta-heuristic algorithm for solving many multimodal functions. This algorithm has been widely used in the optimization of neural networks because it does not require gradient information, has a simple structure and has a good global search capability; however, it has some disadvantages, such as low accuracy and premature convergence [33–36]. Furthermore, the Differential Evolution (DE) algorithm, which searches randomly in continuous space using floating-point vector encoding, was proposed by Storn and Price [37]. Hence, to reach an optimal performance of the GWO algorithm, this paper combines the DE and GWO algorithms into a hybrid optimization algorithm abbreviated DEGWO [38]. The DE algorithm can be used to improve the search capability of the GWO and to prevent the GWO from falling into a local optimum [39–41]. In each iteration the DEGWO algorithm generates initial populations, subpopulations and mutant populations, and then uses the global search capability of the GWO to optimize the parameters C and γ. The main objective of this paper is to build a stacked RBM-SVR network and develop the DEGWO algorithm to optimize these two crucial parameters of the SVR model. The proposed model can estimate the probability of flow transfer between traffic zones by extracting information from the motion trajectories of shared bikes; in addition, it predicts the destinations and the flow transfer probabilities of shared bikes for commuters in a traffic zone according to real measurements.


2 Probability of Traffic Flow Transfer
In this paper, we assume that a city is divided into N traffic zones, each of which is mutually reachable. Origin and destination traffic zones are designated as I and J. The transfer probability of traffic flow from zone I to zone J on day d is denoted as $p_{I,J}^d$ and is given by Eq. (1).

$$p_{I,J}^d = \frac{d_{I,J}^d}{\sum_{J=1}^{N} d_{I,J}^d} \quad (1)$$

where $I = 1, 2, 3, \ldots, N$; $J = 1, 2, 3, \ldots, N$; and $d_{I,J}^d$ refers to the traffic of shared bikes from zone I to zone J on day d. The transfer probability of traffic flow between traffic zones characterizes the transfer and distribution of the traffic flow, and indirectly reflects the distribution of the Origin-Destination (OD) demand in the city.
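As a concrete illustration of Eq. (1), the following sketch computes the daily transfer probability matrix from a table of trips. The column names and the 10 x 10 zone grid are assumptions for the example, not the authors' exact preprocessing.

```python
import numpy as np
import pandas as pd

# Assumed trip table: one row per shared-bike trip recorded on a given day.
trips = pd.DataFrame({
    "origin_zone":      [3, 3, 3, 7, 7, 12],
    "destination_zone": [7, 7, 12, 3, 12, 3],
})

N = 100  # e.g. a 10 x 10 grid of traffic zones

# d[I, J] = number of trips from zone I to zone J on the day (numerator of Eq. (1)).
d = np.zeros((N, N))
np.add.at(d, (trips["origin_zone"].to_numpy(), trips["destination_zone"].to_numpy()), 1)

# p[I, J] = d[I, J] / sum_J d[I, J]; rows with no departures are left at zero.
row_totals = d.sum(axis=1, keepdims=True)
p = np.divide(d, row_totals, out=np.zeros_like(d), where=row_totals > 0)

print(p[3, 7])   # estimated probability that a trip starting in zone 3 ends in zone 7
```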

3 RBM-SVR Algorithm
In this study, a hybrid stacked RBM-SVR deep learning model is constructed to calculate the transfer probability of traffic flow among traffic zones. The model connects the trained RBM model to the SVR model to realize the prediction process. A basic RBM is a two-layer undirected graphical model whose visible and hidden layers can form arbitrary exponential-family distributions; here we assume that all hidden and visible units follow binary distributions. We denote the hidden layer and the visible layer by h and v, and the numbers of hidden and visible units by m and n, respectively. For a given set of states (v, h), the energy function of the RBM is given by Eq. (2).

$$E(v, h \mid \theta) = -\sum_{i=1}^{n} a_i v_i - \sum_{j=1}^{m} b_j h_j - \sum_{i=1}^{n}\sum_{j=1}^{m} v_i w_{ij} h_j \quad (2)$$

where $v_i, h_j \in \{0, 1\}$; $\theta = \{w_{ij}, a_i, b_j\}$ are the parameters of the RBM; $w_{ij}$ is the connection weight between visible unit i and hidden unit j; $a_i$ is the bias of the visible layer, and $b_j$ is the bias of the hidden layer. According to this energy function, the joint probability distribution of the states (v, h) is calculated by Eq. (3).

$$P(v, h \mid \theta) = \frac{e^{-E(v, h \mid \theta)}}{\sum_{v,h} e^{-E(v, h \mid \theta)}} \quad (3)$$

The marginal distribution of $P(v, h \mid \theta)$ is then obtained by Eq. (4).

$$P(v \mid \theta) = \frac{\sum_{h} e^{-E(v, h \mid \theta)}}{\sum_{v,h} e^{-E(v, h \mid \theta)}} \quad (4)$$

Since the units within the visible layer and within the hidden layer are conditionally independent, the activation probabilities of visible unit i and hidden unit j are given by Eqs. (5) and (6), respectively.

$$P(v_i = 1 \mid h, \theta) = \sigma\Big(a_i + \sum_{j} w_{ij} h_j\Big) \quad (5)$$

$$P(h_j = 1 \mid v, \theta) = \sigma\Big(b_j + \sum_{i} v_i w_{ij}\Big) \quad (6)$$

where $\sigma(x) = \frac{1}{1 + \exp(-x)}$ is the sigmoid function. A stacked RBM model is constructed according to the above theory; it has 3 layers with 300 nodes in each layer. The output vector obtained after feature learning on the input dataset via the stacked RBM is designated as H. The transfer probability of traffic among traffic zones, $y^d$, is then predicted by the SVR model. The output vector and the prediction are given by Eqs. (7) and (8).

$$H = \phi(\tilde{x}) \quad (7)$$

$$y^d = \mathrm{RBM\_SVR}(H) \quad (8)$$

where $\tilde{x}$ is the input dataset, $\phi$ represents the stacked RBM model, and $\mathrm{RBM\_SVR}$ represents the RBM-SVR model.
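A minimal sketch of this kind of pipeline using scikit-learn is shown below: two stacked BernoulliRBM layers play the role of the feature extractor φ, and an RBF-kernel SVR maps the learned features H to the transfer probability. The layer sizes, synthetic data and SVR settings here are placeholders, not the configuration reported in the paper (which uses 3 RBM layers of 300 nodes each).

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.random((1000, 9))   # 9 input features per OD pair (zone ids, coordinates, distance, ...)
y = rng.random(1000)        # target: transfer probability p^d_{I,J} in [0, 1]

# Stacked RBM feature learner (phi) followed by an RBF-kernel SVR predictor.
model = Pipeline([
    ("rbm1", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)),
    ("svr", SVR(kernel="rbf", C=1.0, gamma=0.1, epsilon=0.01)),
])

model.fit(X, y)             # H = phi(x) is produced by the RBM stack, then fed to the SVR
pred = model.predict(X[:5])
print(np.round(pred, 3))
```

In the pipeline the RBM layers are trained unsupervised on the inputs, so the SVR is the only component that sees the targets; this mirrors the two-stage feature learning plus regression structure described above.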

4 An Improved RBM-SVR Algorithm
In this section, the DEGWO algorithm is developed to optimize the SVR parameters used in calculating the probability of destination transfer, and to tackle the issue that the GWO optimization algorithm is susceptible to falling into a local optimum.

4.1 Principles of the GWO Algorithm
Assume that in a D-dimensional search space the population $X = (X_1, X_2, \ldots, X_{\tilde{N}})$ is composed of $\tilde{N}$ individuals, where $X_{i_h} = (X_{i_h}^1, X_{i_h}^2, \ldots, X_{i_h}^D)$ represents the position of grey wolf $i_h$ in dimension $d_h$. The best solution in the population is defined as wolf α, the solutions with the second and third best objective function values are defined as wolves β and δ, and the remaining candidate solutions are defined as ω. In the GWO algorithm the hunting operation is led by wolves α, β and δ, and the ω wolves follow these three wolves. The behaviour of grey wolves encircling the prey is simulated by Eqs. (9) and (10).

$$D = \lvert C\, X_p(t) - X(t)\rvert \quad (9)$$

$$C = 2 r_1 \quad (10)$$

where t is the current iteration, C is the swing factor, $X_p(t)$ denotes the position of the prey at iteration t, $X(t)$ represents the position of the grey wolf at iteration t, and $r_1$ is a random number within [0, 1]. The updated position of the grey wolf is given by Eqs. (11) and (12).

$$X(t+1) = X_p(t) - A\, D \quad (11)$$

$$A = 2 a r_2 - a \quad (12)$$

where A is the convergence factor, $r_2$ is a random number uniformly distributed within [0, 1], and the variable a decreases linearly from 2 to 0 as the number of iterations increases. The three best solutions obtained so far are saved and force the other searching individuals (including ω) to constantly update their positions according to the positions of these best solutions, as expressed in Eqs. (13)–(15).

$$D_p = \lvert C_l\, X_p(t) - X(t)\rvert \quad (13)$$

$$X_l(t+1) = X_p(t) - A_l\, D_p \quad (14)$$

$$X_p(t+1) = \frac{1}{3}\sum_{l=1}^{3} X_l \quad (15)$$

where $p = \alpha, \beta, \delta$ and $l = 1, 2, 3$. The distances between the other individual grey wolves and α, β and δ, together with their updated positions, are determined by Eqs. (13) and (14); the position of the prey is then determined by Eq. (15).
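The sketch below implements the update rules of Eqs. (9)–(15) for a toy 2-dimensional objective. It is a generic GWO loop for illustration only; the objective function, bounds, population size and iteration count are placeholders rather than the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def objective(x):
    """Toy multimodal function standing in for the SVR validation error."""
    return np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x) + 10)

dim, n_wolves, n_iter = 2, 20, 100
lb, ub = -5.12, 5.12
wolves = rng.uniform(lb, ub, (n_wolves, dim))

for t in range(n_iter):
    fitness = np.array([objective(w) for w in wolves])
    alpha, beta, delta = wolves[np.argsort(fitness)[:3]]   # three best wolves
    a = 2 - 2 * t / n_iter                                  # a decreases linearly from 2 to 0
    for i in range(n_wolves):
        new_pos = np.zeros(dim)
        for leader in (alpha, beta, delta):
            r1, r2 = rng.random(dim), rng.random(dim)
            A, C = 2 * a * r2 - a, 2 * r1                   # Eqs. (12) and (10)
            D = np.abs(C * leader - wolves[i])              # Eqs. (9) / (13)
            new_pos += leader - A * D                       # Eqs. (11) / (14)
        wolves[i] = np.clip(new_pos / 3, lb, ub)            # Eq. (15)

best = wolves[np.argmin([objective(w) for w in wolves])]
print("best position:", np.round(best, 3))
```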

4.2 Principles of the DE Algorithm

The principal idea of the DE algorithm is to use the weighted difference between two individual vectors, randomly selected from the population, as a perturbation of a third randomly chosen reference vector, producing the mutant vector. A crossover is then performed between the mutant vector and the reference vector to generate a trial vector. Finally, selection is performed between the reference vector and the trial vector, and the better of the two is kept in the next generation.

Assume a D-dimensional search space and a population of size NP, where Z(g) is the gth generation of the population and $Z_k(g)$ is the kth individual in that generation. Their relationship is given by Eqs. (16) and (17).

$$Z(g) = \left(Z_1(g), Z_2(g), \ldots, Z_{NP}(g)\right) \quad (16)$$

$$Z_k(g) = \left(z_{k,1}(g), z_{k,2}(g), \ldots, z_{k,D}(g)\right) \quad (17)$$

where $k = 1, 2, \ldots, NP$, $g = 1, 2, \ldots, g_{\max}$, and $g_{\max}$ is the number of the last iteration.

Initialization of the Population. Initially, the algorithm randomly generates the 0th generation of the population over the entire search space; the value of individual $z_{k,q}(0)$ in each dimension q is generated according to Eq. (18).

$$z_{k,q}(0) = z_{k,q}^{L} + \mathrm{rand}(0,1)\left(z_{k,q}^{U} - z_{k,q}^{L}\right) \quad (18)$$

where $q = 1, 2, \ldots, D$; rand(0, 1) is a random number uniformly distributed within [0, 1]; and $z_{k,q}^{L}$ and $z_{k,q}^{U}$ are the lower and upper bounds for the individual.

Mutation. A mutant individual is generated via Eq. (19).

$$s_{k,q}(g) = z_{p1} + F\left(z_{p2} - z_{p3}\right) \quad (19)$$

where $z_{p1}, z_{p2}, z_{p3}$ are three different parameter vectors randomly selected from the current population, with $p1 \ne p2 \ne p3 \ne k$, and F is an amplifying factor within [0, 1].

Crossover. The crossover process in the DE algorithm is expressed as Eq. (20).

$$l_{k,q}(g+1) = \begin{cases} s_{k,q}(g), & \mathrm{rand}(0,1) \le CR \ \text{or} \ q = q_{\mathrm{rand}} \\ z_{k,q}(g), & \mathrm{rand}(0,1) > CR \ \text{and} \ q \ne q_{\mathrm{rand}} \end{cases} \quad (20)$$

where CR is the crossover probability within [0, 1] and $q_{\mathrm{rand}}$ is a randomly chosen dimension index used to guarantee that at least one component of the trial vector is taken from the mutant rather than from the target vector $Z_k$.

Selection. The selection operation compares the trial vector $l_k(g+1)$ and the vector $z_k(g)$ using the evaluation function, as given by Eq. (21).

$$z_{k}(g+1) = \begin{cases} l_k(g+1), & f\left[l_k(g+1)\right] < f\left[z_k(g)\right] \\ z_k(g), & f\left[l_k(g+1)\right] \ge f\left[z_k(g)\right] \end{cases} \quad (21)$$

This mechanism ensures that the offspring population is at least no worse than the current one, so that the average performance of the population improves and the search converges towards the optimal solution.
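A compact illustration of the initialization, mutation, crossover and selection steps of Eqs. (18)–(21) is given below. It is a generic DE/rand/1/bin loop over a toy objective; the population size, F, CR and the bounds are arbitrary example values, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(3)

def objective(x):
    return np.sum(x ** 2)        # toy evaluation function f

NP, D, F, CR, g_max = 20, 2, 0.6, 0.9, 100
lb, ub = np.full(D, -5.0), np.full(D, 5.0)

Z = lb + rng.random((NP, D)) * (ub - lb)          # Eq. (18): initial population
fit = np.array([objective(z) for z in Z])

for g in range(g_max):
    for k in range(NP):
        p1, p2, p3 = rng.choice([i for i in range(NP) if i != k], 3, replace=False)
        s = Z[p1] + F * (Z[p2] - Z[p3])           # Eq. (19): mutation
        q_rand = rng.integers(D)
        cross = (rng.random(D) <= CR) | (np.arange(D) == q_rand)
        trial = np.clip(np.where(cross, s, Z[k]), lb, ub)   # Eq. (20): crossover
        f_trial = objective(trial)
        if f_trial < fit[k]:                      # Eq. (21): selection
            Z[k], fit[k] = trial, f_trial

print("best individual:", np.round(Z[np.argmin(fit)], 4))
```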

4.3 Hybrid DEGWO Algorithm

In the hybrid DEGWO algorithm, the parameter set is $S_{degwo} = (NP, g_{\max}, CR, D, ub, lb, F)$, where NP denotes the population size, $g_{\max}$ the maximum number of iterations, ub and lb the search range, and D the search dimension. We define $D_{train} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_{m_0}, y_{m_0})\}$ as the training set and $D_{test} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_{n_0}, y_{n_0})\}$ as the test set, and $r_{train}$ and $r_{test}$ denote the errors of the learning and test procedures, respectively. Table 1 gives the specific procedure for employing the DE and GWO algorithms to optimize the parameters C and γ in the RBM-SVR deep learning model.
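As a rough illustration of how such a hybrid procedure can be organised, the sketch below alternates a DE-style mutation/crossover/selection pass with a GWO-style position update to tune the SVR parameters C and γ by validation error. This is an interpretation for illustration only, not the authors' exact procedure from Table 1; the synthetic dataset, bounds, population size and iteration counts are assumed values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X, y = make_regression(n_samples=400, n_features=9, noise=0.1, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

def validation_rmse(params):
    C, gamma = params
    svr = SVR(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)
    return float(np.sqrt(np.mean((svr.predict(X_val) - y_val) ** 2)))

NP, n_iter, F, CR = 10, 15, 0.6, 0.9
lb, ub = np.array([0.1, 1e-4]), np.array([100.0, 1.0])   # assumed bounds for (C, gamma)
pop = lb + rng.random((NP, 2)) * (ub - lb)
fit = np.array([validation_rmse(p) for p in pop])

for t in range(n_iter):
    # DE pass: mutate and cross over each candidate, keep improvements.
    for k in range(NP):
        p1, p2, p3 = rng.choice([i for i in range(NP) if i != k], 3, replace=False)
        mutant = np.clip(pop[p1] + F * (pop[p2] - pop[p3]), lb, ub)
        mask = (rng.random(2) <= CR) | (np.arange(2) == rng.integers(2))
        trial = np.where(mask, mutant, pop[k])
        f_trial = validation_rmse(trial)
        if f_trial < fit[k]:
            pop[k], fit[k] = trial, f_trial
    # GWO pass: move every candidate towards the three current leaders.
    alpha, beta, delta = pop[np.argsort(fit)[:3]]
    a = 2 - 2 * t / n_iter
    for k in range(NP):
        new = np.zeros(2)
        for leader in (alpha, beta, delta):
            A, C_ = 2 * a * rng.random(2) - a, 2 * rng.random(2)
            new += leader - A * np.abs(C_ * leader - pop[k])
        pop[k] = np.clip(new / 3, lb, ub)
        fit[k] = validation_rmse(pop[k])

best_C, best_gamma = pop[np.argmin(fit)]
print("best C=%.3f, gamma=%.5f, RMSE=%.4f" % (best_C, best_gamma, fit.min()))
```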

5 Experimental Setup and Results
In this study, 2,468,059 trajectory records from Mobike, spanning more than 300,000 users and 400,000 shared bikes, were analyzed. Each record contains information such as the trip departure time, the geohash code of the location, the bike ID, the bike type and the user ID. All the data were collected between May 10, 2017 and May 18, 2017. The coordinates of the origin and destination traffic zones in the trajectory data were coded in the geohash format. Geohash is a geocoding scheme that alternately divides the Earth's surface into two halves along the latitude and longitude directions and uses a binary number to represent the non-overlapping meshes generated by the division. In this paper, the geohash code of each trajectory record has 7 digits, corresponding to a square area with the same longitudinal and latitudinal spacing of 0.001373. To simplify the calculation and statistical processing, we merged nine adjacent blocks into one square block; the side length of each merged square is approximately 411.9873 m. According to this method, we divided Beijing into 10 × 10 traffic zones. For convenience, each traffic zone is denoted by a traffic zone number, assigned sequentially from 1 to 4200. The latitude and longitude coordinates of the center point of each traffic zone were obtained via geohash decoding. The linear distances between the origin and destination traffic zones were measured, the absolute value of the geohash difference between traffic zones was calculated, and the daily transfer probability of traffic flow $p_{I,J}^d$ among the traffic zones was calculated according to Eq. (1). The prediction of the destination needs to be completed before predicting the transfer probability of the traffic flow among the traffic zones. In this paper, the origin and destination traffic zones for each initial traffic zone were computed and constitute the destination collection of that initial traffic zone, which is used as the set of destination candidates for each initial traffic zone on a given day. Based on this method, we used the data collected during the previous two days to predict the destinations on the third day. Figure 1 shows the error values of the predicted destinations and the actual destinations for the initial traffic zones according to the latitude and

Table 1. The procedure of the RBM_SVR_DEGWO algorithm

Fig. 1. Error values of destination prediction for the 6 test groups.

Table 2. Experimental group training data and test data

Test group | 1st day of training data (date, week, trajectories) | 2nd day of training data (date, week, trajectories) | Prediction data (date, week, trajectories)
1 | 2017/5/10, Wed., 262569 | 2017/5/11, Thur., 272210 | 2017/5/12, Fri., 265173
2 | 2017/5/13, Sat., 225281 | 2017/5/14, Sun., 236594 | 2017/5/15, Mon., 279554
3 | 2017/5/14, Sun., 236594 | 2017/5/15, Mon., 279554 | 2017/5/16, Tues., 288719
4 | 2017/5/15, Mon., 279554 | 2017/5/16, Tues., 288719 | 2017/5/17, Wed., 322201
5 | 2017/5/16, Tues., 288719 | 2017/5/17, Wed., 322201 | 2017/5/18, Thur., 315758
6 | 2017/5/10, Wed., 262569 | 2017/5/15, Mon., 279554 | 2017/5/16, Tues., 288719

longitude coordinates of Beijing. In the experiment, we selected data from different adjacent days to form 6 test groups (Table 2). For the improved probability prediction model based on the stacked RBM-SVR proposed in this study, the following input variables were selected: the origin traffic zone number, the destination traffic zone number, the longitude and latitude coordinates of the initial point, the longitude and latitude coordinates of the origin and destination traffic zones, the distance between the center points of the traffic zones, the absolute value of the difference between the traffic zone numbers, and the day number. The output of the model is the transfer probability $p_{I,J}^d$ of daily traffic flow for shared bikes among the traffic zones.
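A sketch of how such an input matrix could be assembled from a trip table is shown below; the column names, the zone-centre lookup and the haversine distance helper are assumptions made for the example, not the authors' preprocessing code.

```python
import numpy as np
import pandas as pd

# Assumed inputs: per-trip origin/destination zone ids and a lookup of zone centre coordinates.
trips = pd.DataFrame({
    "day": [1, 1, 2],
    "origin_zone": [12, 12, 45],
    "dest_zone": [45, 13, 12],
    "start_lon": [116.32, 116.33, 116.40],
    "start_lat": [39.98, 39.97, 39.92],
})
zone_centres = pd.DataFrame({
    "zone": [12, 13, 45],
    "lon": [116.32, 116.33, 116.41],
    "lat": [39.98, 39.98, 39.92],
}).set_index("zone")

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 6_371_000 * 2 * np.arcsin(np.sqrt(a))

o = zone_centres.loc[trips["origin_zone"]].to_numpy()
d = zone_centres.loc[trips["dest_zone"]].to_numpy()

features = np.column_stack([
    trips["origin_zone"], trips["dest_zone"],          # zone numbers
    trips["start_lon"], trips["start_lat"],            # initial point coordinates
    o[:, 0], o[:, 1], d[:, 0], d[:, 1],                # origin / destination zone centre coordinates
    haversine_m(o[:, 0], o[:, 1], d[:, 0], d[:, 1]),   # distance between zone centres
    np.abs(trips["origin_zone"] - trips["dest_zone"]), # |zone number difference|
    trips["day"],                                      # day number
])
print(features.shape)
```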


According to the proposed method, data collected from the previous two days were used as training data to predict the traffic transfer probability between the traffic zones on the third day. Figure 2 illustrates the root-mean-square errors between the predicted transfer probabilities of traffic flow and the actual values according to the latitude and longitude coordinates of Beijing. We can see that the closer a traffic zone is to the center of Beijing, the higher the number of trips and the larger the amount of trajectory data in that zone; correspondingly, the mean square error of the algorithm becomes smaller and the prediction accuracy higher.

Fig. 2. The root-mean-square errors of the predicted transfer probabilities of bike sharing traffic flow.

To assess the prediction performance of the proposed algorithm, two baselines, the traditional SVR model and the plain stacked RBM-SVR model, were compared against it in order to isolate the feature extraction capability of the stacked RBM and the parameter optimization capability of the DEGWO algorithm. In the comparison, the same training data, the same test data and the same stacked RBM parameter settings were used; in the traditional SVR model, all the optimized parameters were set to the default value of 1. The mean-square errors between the predicted and actual values for each comparison method under these settings are displayed in Fig. 3. Under the SVR, RBM_SVR and RBM_SVR_DEGWO algorithms, the average mean-square error over 100 traffic cells is 0.0916, 0.0542 and 0.0283, respectively. The comparison shows that the predictions obtained by the stacked RBM-SVR algorithm with the help of the DEGWO algorithm are superior to those of the traditional SVR algorithm and the plain stacked RBM-SVR algorithm: the DEGWO algorithm optimizes the parameters of the RBM-SVR model, finds a better parameter combination, and reduces the prediction error.


Fig. 3. The mean-square error bars of the predicted transfer probabilities of bike sharing traffic flow of 6 test groups for each type of comparison method.

Adding the DEGWO algorithm makes the model fit the data better. The lower error values are also maintained across the prediction results for all 100 traffic cells, which indicates that the DEGWO algorithm improves the robustness of the model. The traffic zone number increases from the lower left corner to the right; when it reaches 10 it moves up to the next row and again runs from left to right, until all traffic zones are numbered. The traffic zones numbered around zero and ten are therefore in remote areas of the city. Fewer people use shared bicycles in remote areas than in downtown areas, so the amount of data in these areas is small, which leads to higher forecasting errors on the urban outskirts than in the urban center. This has a lot to do with Beijing's urban layout and population distribution. The central area of Beijing contains cultural relics such as the Forbidden City, where the visitors are mostly tourists; since shared bicycles are not allowed inside the Forbidden City park, the amount of bike sharing data there is low, which causes a high prediction error in that area. Beijing has a radial, center-oriented layout. There are more mountains on the west side than on the east side, so the population and commercial land there are naturally sparser than in the east; most of the northern areas are residential communities, the east is an open plain, and the west also contains a major commercial center. According to the error results, the prediction errors in the east and west of Beijing are lower, which shows that the distribution of commercial areas has a strong connection with shared bicycle travel.

6 Conclusions and Outlook
The principal objective of this study is to use a deep learning algorithm to predict the transfer probability of shared bike traffic flow among traffic zones. In this paper, we propose a stacked RBM-SVR deep learning algorithm for this prediction task: a stacked RBM model performs feature extraction on the large dataset, and an SVR model realizes the prediction of the transfer probability. Furthermore, we utilize a hybrid optimization algorithm, named DEGWO, to optimize the parameters C and γ in the stacked RBM-SVR algorithm. The comparison results demonstrate that the proposed DEGWO-optimized algorithm outperforms the plain stacked RBM-SVR model. The mutation, crossover and selection operations of the DE algorithm eliminate the tendency of the GWO algorithm to fall into a local optimum, while improving the global search capability and the prediction accuracy of the algorithm. In high density urban commercial areas, the shared bicycle flow is higher, and the prediction accuracy of the transfer probabilities of bike sharing traffic flow is also likely to be higher. As for the practical implications of the work, if the total demand for shared bikes in a traffic zone is known, then the OD distribution and traffic assignment in the city can be predicted by multiplying the total demand by the transfer probability. Accurate prediction of the OD distribution is beneficial for the transportation planning of bike sharing. At present, this paper predicts the daily transfer probability; however, if a larger volume of training data can be obtained and more relevant information, such as weather, regional business distribution and resident information, can be utilized, the transfer probability per hour could be predicted. This would further improve the accuracy of OD distribution pattern prediction and the estimation of travelers

References 1. Vogel, P., Greiser, T., Mattfeld, D.C.: Understanding bike-sharing systems using data mining: exploring activity patterns. Procedia-Soc. Behav. Sci. 20, 514–523 (2011) 2. Wu, Y., Zhu, D.: Bicycle sharing based on PSS-EPR coupling model: exemplified by bicycle sharing in China. Procedia CIRP 64, 423–428 (2017) 3. Fishman, E.J.T.R.: Bikeshare: a review of recent literature. Transp. Rev. 36, 92–113 (2016) 4. Inferring dockless shared bike distribution in new cities. In: Proceedings of the Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (2018) 5. Where Will dockless shared bikes be Stacked?:—Parking hotspots detection in a New City. In: Proceedings of the Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018) 6. Dockless bike-sharing reallocation based on data analysis: solving complex problem with simple method. In: Proceedings of the 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC) (2018) 7. Chemla, D., Meunier, F., Calvo, R.W.: Bike sharing systems: solving the static rebalancing problem. Discrete Optim. 10, 120–146 (2013) 8. Contardo, C., Morency, C., Rousseau, L.-M.: Balancing a dynamic public bike-sharing system. Cirrelt Montreal (2012) 9. Schuijbroek, J., Hampshire, R.C., Van Hoeve, W.-J.: Inventory rebalancing and vehicle routing in bike sharing systems. Eur. J. Oper. Res. 257, 992–1004 (2017) 10. Traffic prediction in a bike-sharing system. In: Proceedings of the Sigspatial International Conference on Advances in Geographic Information Systems (2015) 11. Bicycle-sharing system analysis and trip prediction. In: Proceedings of the IEEE International Conference on Mobile Data Management (2016)


12. Come, E., Randriamanamihaga, N.A., Oukhellou, L., Aknin, P.: Spatio-temporal analysis of dynamic origin-destination data using latent dirichlet allocation: application to vélib’ bike sharing system of paris. In: Trb Annual Meeting (2014) 13. Mobility modeling and prediction in bike-sharing systems. In: Proceedings of the International Conference on Mobile Systems, Applications, and Services (2016) 14. Hinton, G.E., Osindero, S., Teh, Y.-W.: A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527–1554 (2006) 15. Hinton, G.E., Sejnowski, T.J.: Learning and releaming in boltzmann machines. In: Parallel Distributed Processing: Explorations in the microstructure of cognition, vol. 1, p. 2 (1986) 16. Greedy layer-wise training of deep networks. In: Proceedings of the Advances in neural information processing systems (2007) 17. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436 (2015) 18. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L. D.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1, 541–551 (1989) 19. Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., Thrun, S.J.N.: Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115 (2017) 20. Aggregated residual transformations for deep neural networks. In: 2017 IEEE Conference on Proceedings of the Computer Vision and Pattern Recognition (CVPR) (2017) 21. Shi, J., Zheng, X., Li, Y., Zhang, Q., Ying, S.: Multimodal neuroimaging feature learning with multimodal stacked deep polynomial networks for diagnosis of Alzheimer’s disease. IEEE J. Biomed. Health Inf. 22, 173–183 (2018) 22. Rafiei, M.H., Adeli, H.J.E.S.: A novel unsupervised deep learning model for global and local health condition assessment of structures. Eng. Struct. 156, 598–607 (2018) 23. Hinton, G.E.: Neural Networks: Tricks of the Trade, pp. 599–619. Springer, Heidelberg (2012) 24. Rectified linear units improve restricted boltzmann machines. In: Proceedings of the Proceedings of the 27th international conference on machine learning (ICML-10) (2010) 25. Restricted Boltzmann machines for collaborative filtering. In: Proceedings of the Proceedings of the 24th international conference on Machine learning (2007) 26. Classification using discriminative restricted Boltzmann machines. In: Proceedings of the Proceedings of the 25th international conference on Machine learning (2008) 27. Le Roux, N., Bengio, Y.: Representational power of restricted Boltzmann machines and deep belief networks. Neural Comput. 20, 1631–1649 (2008) 28. Awad, M., Khanna, R.: Support Vector Regression. Apress, New York (2015) 29. Drucker, H., Burges, C.J.C., Kaufman, L., Smola, A.J., Vapnik, V.: Support vector regression machines. Adv. Neural. Inf. Process. Syst. 28, 779–784 (1997) 30. Travel time prediction with support vector regression. In: Proceedings of the Intelligent Transportation Systems 2003 (2004) 31. A GA-based feature selection and parameters optimization for support vector regression. In: Proceedings of the International Conference on Natural Computation (2011) 32. Mirjalili, S., Mirjalili, S.M., Lewis, A.: Grey wolf optimizer. Adv. Eng. Softw. 69, 46–61 (2014) 33. Faris, H., Aljarah, I., Al-Betar, M.A., Mirjalili, S.: Grey wolf optimizer: a review of recent variants and applications, pp. 1–23 (2018) 34. 
Precup, R.-E., David, R.-C., Petriu, E.M.: Grey wolf optimizer algorithm-based tuning of fuzzy control systems with reduced parametric sensitivity. IEEE Trans. Ind. Electron. 64, 527–534 (2017)

Transfer Probability Prediction for Traffic Flow with Bike Sharing Data

85

35. Yang, B., Zhang, X., Yu, T., Shu, H., Fang, Z.: Grouped grey wolf optimizer for maximum power point tracking of doubly-fed induction generator based wind turbine. Energy Convers. Manag. 133, 427–443 (2017) 36. Rodríguez, L., Castillo, O., Soria, J., Melin, P., Valdez, F., Gonzalez, C.I., Martinez, G.E., Soto, J.J.A.S.C.: A fuzzy hierarchical operator in the grey wolf optimizer algorithm. Appl. Soft Comput. 57, 315–328 (2017) 37. Storn, R., Price, K.: Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. J. Global Optim. 11, 341–359 (1997) 38. Gupta, S., Deep, K.: Soft Computing for Problem Solving, pp. 961–968. Springer, Berlin (2019) 39. Pholdee, N., Bureerat, S., Yıldız, A.R.: Hybrid real-code population-based incremental learning and differential evolution for many-objective optimisation of an automotive floorframe. Int. J. Vehicle Des. 73, 20–53 (2017) 40. Maučec, M.S., Brest, J.J.S.: A review of the recent use of differential evolution for largescale global optimization: an analysis of selected algorithms on the CEC 2013 LSGO benchmark suite (2018) 41. Ho-Huu, V., Nguyen-Thoi, T., Truong-Khac, T., Le-Anh, L., Vo-Duy, T.J.N.: An improved differential evolution based on roulette wheel selection for shape and size optimization of truss structures with frequency constraints. Neural Comput. Appl. 29, 167–185 (2018)

CanvasGAN: A Simple Baseline for Text to Image Generation by Incrementally Patching a Canvas

Amanpreet Singh and Sharan Agrawal(B)
New York University, Jersey City, USA
[email protected]

Abstract. We propose a new recurrent generative model for generating images from text captions while attending on specific parts of the text captions. Our model creates images by incrementally adding patches on a "canvas" while attending on words from the text caption at each timestep. Finally, the canvas is passed through an upscaling network to generate images. We also introduce a new method for generating visual-semantic sentence embeddings based on self-attention over text. We compare our model's generated images with those generated by Reed's model and show that our model is a stronger baseline for text to image generation tasks.

Keywords: Image generation · GAN · Conditional generation · Attention · Generative networks

1 Introduction

With the introduction of Generative Adversarial Networks (GANs) by Goodfellow et al. [8] and recent improvements in their architecture and performance [1,10], the focus of the research community has shifted towards generative models. Image generation is one of the central topics among generative models. As a task, image generation is important because it exemplifies a model's understanding of visual-world semantics. We as humans take context from books, audio recordings or other sources and are able to imagine a corresponding visual representation. Our models should have the same semantic understanding of context and should be able to generate meaningful visual representations of it. Recent advances, quality improvements and successes of discriminative networks have enabled industrial applications of image generation [12,27,38]. In this paper, we propose a sequential generative model called CanvasGAN for generating images based on a textual description of a scenario. Our model patches a canvas incrementally with layers of colors while attending over different text segments at each patch. A variety of sequential generative models have been introduced recently and were shown to work much better in terms of visual quality, as the model gets multiple chances to improve over previous drawings. Similar to CanvasGAN's motivation from the human execution of painting, attention is motivated by the fact that we humans improve our performance by focusing on a particular aspect of a task at a given moment rather than the whole of it [2,30]. In recent work, attention alone has been shown to be effective without anything else [29]. In our model, at each timestep, the model focuses on a particular part of the text for creating a new patch rather than on the whole sentence.
Many models have been proposed in recent years for the text to image generation task [25,31,34]. Many of these models incorporate GANs and a visual-semantic embedding [6] for capturing text features which are semantically important to images. A generator network creates an image from sampled noise conditioned on the text, and a discriminator predicts whether the image is real or generated. However, the images generated by these networks are often not coherent and seem distant from the text's semantics. To overcome this incoherence, we propose a new method for generating image-coherent sentence embeddings based on self-attention over the text. In this work, we make two major contributions:
1. We propose a new model for text to image generation, called CanvasGAN, which, analogous to human painters, generates an image from text incrementally by sequentially patching an empty canvas with colors. Furthermore, CanvasGAN uses attention to focus on the part of the text used to generate a new patch for the canvas at each time-step.
2. We introduce a new visual-semantic embedding generation mechanism which uses self-attention to focus on the important hidden states of an RNN when generating a sentence embedding, instead of taking the hidden state at the last timestep as usual. This sentence-embedding generator is separately trained to be coherent with image semantics using a pairwise ranking loss function between the sentence embedding and the image.

2 Related Work

Generative Adversarial Networks (GANs) [8] and Variational Auto-Encoders (VAEs) [16] can be considered the two major categories of deep generative models. In conditional image generation, GANs have been studied in depth; initial work used simple conditioning variables like object attributes or class labels (MNIST) to generate images [22,28,32]. Multiple models have been introduced for image-to-image translation, which encompasses mapping from one domain to another [15,38,39], style transfer [7] and photo editing [3,37]. In the context of sequential generative models, Denton et al. [5] use a Laplacian pyramid of generators and discriminators called LAPGAN to synthesize images sequentially from low to high resolution levels. Similar to our work, the DRAW network [9] is a sequential version of an auto-encoder where images are generated by incrementally adding patches on a canvas. Closest to our work is Mansimov et al. [19], which uses a variational auto-encoder to patch a canvas and then uses an inference network to map back to the latent space. In CanvasGAN, we use a discriminator-based loss function with a GAN architecture and our new visual-semantic embedding to generate images from text.


Fig. 1. CanvasGAN for generating images from text through incremental patches on an empty canvas. Starting from an empty canvas, at each timestep we attend over the text features to get relevant features. These features are passed as input to a recurrent unit which uses the hidden state from the previous timestep to figure out what to patch. Finally, by upsampling the hidden state, we get the three channels which are combined and patched onto the canvas based on γi.

Image caption generation, which is the reverse of text to image generation, has seen many significant models with good performance. Karpathy and Fei-Fei [14] use alignments learned between a CNN over image regions and a BiRNN over sentences, through a multi-modal embedding, to infer descriptions of image regions. Xu et al. [30] use attention over the weights of a convolutional net's last layer to focus on a particular area of the image and generate captions as a time series via a gated recurrent unit [4,11].
In Reed et al. [25], a simple generative model was introduced for converting text captions into images. This model started a series of work on the text-to-image generation task. It contained a simple upscaling generator conditioned on text followed by a downsampling discriminator. The model also used a visual-semantic embedding [24] for efficiently representing text in a higher-dimensional continuous space which is further upsampled into an image. This was improved by Zhang et al. [34], who generate high-quality 256 × 256 images from a given text caption. Their model (StackGAN) employs a two-step process: they first use a GAN to generate a 64 × 64 image; then this image, along with the text embedding, is passed to another GAN which finally generates a 256 × 256 image. Discriminators and generators at both stages are trained separately. The most notable recent works on the text to image task are AttnGAN [31] and HDGAN [36]. In AttnGAN, the authors improved StackGAN by using attention to focus on relevant words in the natural language description and proposed a deep attentional multimodal similarity model to compute a fine-grained image-text matching loss for training the visual-semantic embedding generator. In HDGAN, the authors generate images at different resolutions with a single-streamed generator. At each resolution, there is a separate discriminator which tells (i) whether the image is fake or real, and (ii) whether it matches the text or not. Another important contribution is made by Zhang et al. [35], in which the authors generate details using cues from all feature locations in the feature maps. In their model (SAGAN), the discriminator can check that highly detailed features in distant portions of the image are consistent with each other, which leads to a boost in the inception score.

3 Model

We propose CanvasGAN, a new sequential generative network for generating images given a textual description. The model structure is shown in Fig. 1. We take motivation from how human painters create a painting iteratively instead of in a single step. Keeping that in mind, starting with an empty canvas, we paint it with patches iteratively. At each step, a patch is generated based on attended features from the text. First, we retrieve GloVe [23] embeddings $g$ for the words in the caption $w$ and encode them using our visual-semantic network $f^{vs}$, which we explain in Sect. 3.1. This provides us with sequentially encoded word embeddings $e$ and a sentence embedding $s$ for the whole sentence. We sample our noise $z \in \mathbb{R}^D$ from the standard normal distribution $\mathcal{N}(0, 1)$. We use conditional augmentation $f^{ca}$ over the sentence embedding $s$ to overcome the problem of discontinuity in the latent mapping in higher dimensions due to the low amount of data [34]. The conditionally augmented sentence vector $c$, along with the noise $z$, is passed through a neural network $f_0$ to generate the initial hidden state $h_0$ of the recurrent unit. Initially, the canvas is empty: $\mathrm{canvas}_0 = 0$. At each timestep $i \in \{1, \ldots, t\}$, we execute a series of steps to patch the canvas. Attention weights $\alpha_i$ are calculated by a neural network $f_i^{att}$ with inputs $c$, $z$ and $h_{i-1}$, and are scaled to $(0, 1)$ as $\beta_i$ by taking a softmax. The attended sentence embedding $\bar{e}_i$ for the current time-step is calculated as $\sum_j \beta_{ij} e_j$. The next hidden state $h_i$ is calculated by the recurrent unit $f_i$, which takes the previous hidden state $h_{i-1}$ and the attended sentence embedding $\bar{e}_i$ as inputs. The $r$, $g$ and $b$ channels for the next patch are produced by neural networks $f_i^r$, $f_i^g$ and $f_i^b$, which are concatenated to produce a flattened image. This image is passed through an upscaling network $f_i^{up}$ to generate a patch $\delta$ of the same size as the canvas. To add this patch to the canvas, we calculate a parameter $\gamma$ from a neural network $f_i^{\gamma}$, which determines how much of $\delta$ will be added to the canvas. Finally, $\gamma * \delta$ is added to $\mathrm{canvas}_{i-1}$ to generate $\mathrm{canvas}_i$. Our final image representation is $\mathrm{canvas}_t$ at the last time-step. Mathematically, our model can be written as:

$$\mathrm{canvas}_0 = 0$$
$$z \sim \mathcal{N}(0, 1), \quad g = \mathrm{GloVe}(w)$$
$$e, s = f^{vs}(g)$$
$$c = f^{ca}(s)$$
$$h_0 = f_0(c, z)$$

for $i = 1, \ldots, t$:

$$\alpha_i = f_i^{att}(c, z, h_{i-1})$$
$$\beta_i = \frac{\exp(\alpha_i)}{\sum_j \exp(\alpha_{ij})}, \quad \bar{e}_i = \sum_j \beta_{ij} e_j$$
$$h_i = f_i(\bar{e}_i, h_{i-1})$$
$$r_i = f_i^{r}(h_i), \quad g_i = f_i^{g}(h_i), \quad b_i = f_i^{b}(h_i), \quad \gamma = f_i^{\gamma}(h_i)$$
$$\delta = f_i^{up}([r; g; b])$$
$$\mathrm{canvas}_i = \mathrm{canvas}_{i-1} + \gamma * \delta$$

For our experiments, we implement $f^{ca}$, $f_i^r$, $f_i^g$, $f_i^b$ and $f_i^{\gamma}$ as simple affine networks followed by a rectified linear unit non-linearity. We choose $f_0$ and $f_i$ as Gated Recurrent Units [4]. For $f_i^{att}$, we first concatenate $c$, $z$ and $h_{i-1}$ and then pass them through an affine layer followed by a softmax. Finally, the upscaling network uses deconvolutional layers with residual blocks to scale the image to higher resolutions [18,33].
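As a concrete illustration of this loop (not the authors' implementation), the following is a minimal PyTorch sketch with hypothetical layer choices and sizes: the GRU cell plays the role of f_i, the linear heads play f_i^r, f_i^g, f_i^b and f_i^γ, and a small deconvolutional stack stands in for the upscaling network f_i^up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CanvasGenerator(nn.Module):
    """Sketch of the incremental canvas-patching loop (all sizes are illustrative)."""
    def __init__(self, word_dim=300, noise_dim=100, hidden_dim=256, max_words=16, steps=4):
        super().__init__()
        self.steps = steps
        self.init_h = nn.Linear(word_dim + noise_dim, hidden_dim)           # f_0
        self.att = nn.Linear(word_dim + noise_dim + hidden_dim, max_words)  # f_i^att
        self.gru = nn.GRUCell(word_dim, hidden_dim)                         # f_i
        self.r = nn.Linear(hidden_dim, 16 * 16)                             # f_i^r
        self.g = nn.Linear(hidden_dim, 16 * 16)                             # f_i^g
        self.b = nn.Linear(hidden_dim, 16 * 16)                             # f_i^b
        self.gamma = nn.Linear(hidden_dim, 1)                               # f_i^gamma
        self.up = nn.Sequential(                                            # f_i^up: 16x16 -> 64x64
            nn.ConvTranspose2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1))

    def forward(self, word_emb, sent_emb, z):
        # word_emb e: (B, max_words, word_dim); sent_emb c: (B, word_dim); noise z: (B, noise_dim)
        B = word_emb.size(0)
        h = torch.tanh(self.init_h(torch.cat([sent_emb, z], dim=1)))
        canvas = torch.zeros(B, 3, 64, 64, device=word_emb.device)
        for _ in range(self.steps):
            alpha = self.att(torch.cat([sent_emb, z, h], dim=1))            # scores over words
            beta = F.softmax(alpha, dim=1)
            e_bar = (beta.unsqueeze(-1) * word_emb).sum(dim=1)              # attended embedding
            h = self.gru(e_bar, h)
            rgb = torch.stack([self.r(h), self.g(h), self.b(h)], dim=1)     # (B, 3, 256)
            delta = self.up(rgb.view(B, 3, 16, 16))                         # upscaled patch
            gamma = torch.sigmoid(self.gamma(h)).view(B, 1, 1, 1)
            canvas = canvas + gamma * delta
        return canvas
```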

3.1 Visual Semantic Embedding

Standard models for representing words in a continuous vector space such as GloVe [23], Word2Vec [20] or fastText [13] are trained on very large corpora (e.g. Wikipedia) which are not visually focused. To overcome this issue, visual-semantic models have been proposed in recent works [17,24]. These representations perform well, but they lack the capacity to focus on important words in the text: as in machine translation [2], the focus is on the last encoded state of the recurrent unit. We propose a new method for calculating a visually semantic embedding using self-attention. By using self-attention over the encoded hidden states of a recurrent unit, our method generates a sentence embedding in latent space as shown in Fig. 2. To introduce the visual-semantic component, we compare the image with the sentence embedding using a pairwise ranking loss function similar to [17]. For a proper comparison, the image is encoded into the same latent space as the text by passing the features extracted from the last average-pooling layer of a pretrained Inception-v3 through an affine layer. This method allows the embedding to focus on the most important parts of the sentence when generating the sentence embedding.
Several earlier models were proposed to introduce a visual-semantic component. Some of them used smaller datasets with visual descriptions, such as Wikipedia articles' text along with their images. These embeddings did not perform well in practice. Also, these models could not be generalized to zero-shot learning tasks and had to be fine-tuned separately for each task. To overcome this, visual-semantic models [17,24] were proposed which take into account both the image and its representative text while training.


Fig. 2. Architecture of the self-attended visual-semantic embedding. Word representations (w_i) from text embeddings are passed through a bidirectional recurrent unit whose forward and backward hidden states are summed to get one hidden state (h_i^f) for each word. We pass these hidden states through a feed-forward network to calculate attention weights (α_i) for each hidden state. Finally, we linearly combine the hidden states multiplied by their respective attention weights to get the final hidden state. This is compared via a pairwise ranking loss with the image representation of the original image, downsampled through a CNN to the same number of dimensions. The whole network is trained end-to-end.

$$\overrightarrow{h_1}, \overleftarrow{h_1}, \overrightarrow{h_2}, \overleftarrow{h_2}, \ldots, \overrightarrow{h_n}, \overleftarrow{h_n} = \mathrm{GRU}(w_1, w_2, \ldots, w_n) \quad (1)$$
$$h_i^f = \overrightarrow{h_i} + \overleftarrow{h_i} \quad (2)$$
$$\alpha_i = f_{score}(h_i^f) \quad (3)$$
$$h^f = \sum_{i}^{n} \alpha_i h_i^f \quad (4)$$

where $w_1, w_2, \ldots, w_n$ are the vector representations of $y$, the original words in one-hot (1-of-K) encoded format, $y = (y_1, y_2, \ldots, y_n)$, where $K$ is the size of the vocabulary and $n$ is the length of the sequence. GRU is our gated recurrent unit, while $f_{score}$ is the scoring function for the attention weights $\alpha_i$. $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$ are the bidirectional hidden states for $w_i$, and $h_i^f$ is the sum of both bidirectional hidden states, which is combined with $\alpha_i$ to get the final hidden state $h^f$.
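A minimal PyTorch sketch of Eqs. (1)-(4) follows, with an illustrative hinge-based pairwise ranking loss in the spirit of [17]; the scoring network, dimensions and margin are assumptions, not the authors' exact choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentiveSentenceEncoder(nn.Module):
    """Bidirectional GRU whose forward/backward states are summed, then
    self-attention pools the per-word states into one sentence embedding."""
    def __init__(self, word_dim=300, hidden_dim=512):
        super().__init__()
        self.bigru = nn.GRU(word_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)       # f_score

    def forward(self, word_emb):                    # word_emb: (B, n, word_dim)
        states, _ = self.bigru(word_emb)            # (B, n, 2*hidden_dim)
        fwd, bwd = states.chunk(2, dim=-1)
        h_f = fwd + bwd                             # Eq. (2): sum of both directions
        alpha = F.softmax(self.score(h_f), dim=1)   # Eq. (3): attention weights
        return (alpha * h_f).sum(dim=1)             # Eq. (4): sentence embedding h^f

def pairwise_ranking_loss(sent, img, margin=0.2):
    """Hinge-based ranking loss between sentence and image embeddings (both (B, d))."""
    sent = F.normalize(sent, dim=1)
    img = F.normalize(img, dim=1)
    scores = sent @ img.t()                         # cosine similarities
    pos = scores.diag().view(-1, 1)
    cost_s = (margin + scores - pos).clamp(min=0)   # contrast sentences against wrong images
    cost_i = (margin + scores - pos.t()).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_s.masked_fill(mask, 0).sum() + cost_i.masked_fill(mask, 0).sum()
```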

3.2 Discriminator

Our model’s discriminator (Fig. 3) is a general downsampling CNN which takes an image and a text representation to predict whether image is semantically relevant to text or not. Image is downsampled and passed through residual branch for feature extraction. Text representation (hf ) is spatially replicated so that it and the image are of same dimensionality. These are concatenated and further passed through convolutional layers to generate a real number between 0 and 1 predicting the probability of semantic relevance.


Fig. 3. Architecture of our model's discriminator. The image is downsampled and passed through residual blocks to extract features. It is then concatenated with the spatially replicated text representation, and the combination is passed through convolutional layers to predict the probability that the text is semantically relevant to the image.

4 Experiments and Results

We train our model on the Oxford Flowers-102 dataset [21], which contains 103 classes of flowers. Each flower has associated captions describing different aspects, namely the local shape/texture, the shape of the boundary, the overall spatial distribution of petals, and the colour. We create extra augmented data by rolling the matching text within the batch, which pairs each text with a wrong image and thus creates a mismatching batch corresponding to each matching batch. We also create a batch of relevant text, in which we roll only half of the text with the rest of the batch; in this case the text is somewhat relevant to the image but still does not match it semantically. We directly pass mismatching and matching text and images to the discriminator while minimizing the binary cross entropy (BCE) loss with targets of one (relevant) and zero (non-relevant), respectively. For relevant text, we pass the text to the generator and then pass the generated image to the discriminator while minimizing the BCE loss with a target of zero (fake). We also calculate the generator loss in the case of matching text by passing the generated image, together with the text, to the discriminator and optimizing the BCE loss of the discriminator's prediction so that the generated image is classified as relevant. See Fig. 4 for an overview of the loss calculation.
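The following is a minimal PyTorch sketch of this batch construction and the BCE terms. The roll offsets, the generator/discriminator interfaces, and the use of a target of one for the generator term (the standard way of maximizing the discriminator's error) are assumptions for illustration, not the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def canvasgan_losses(D, G, images, word_emb, sent_emb, z):
    """Sketch: D(image, sent_emb) -> probability of semantic relevance,
    G(word_emb, sent_emb, z) -> generated image (see the generator sketch)."""
    B = images.size(0)
    ones = torch.ones(B, device=images.device)
    zeros = torch.zeros(B, device=images.device)

    # mismatching batch: roll the captions by one so every image gets a wrong caption
    mis_sent = torch.roll(sent_emb, 1, dims=0)
    # relevant batch: roll by half the batch, giving captions that are related but not matching
    rel_words = torch.roll(word_emb, B // 2, dims=0)
    rel_sent = torch.roll(sent_emb, B // 2, dims=0)

    # discriminator: matching -> 1, mismatching -> 0, image generated from relevant text -> 0
    d_loss = F.binary_cross_entropy(D(images, sent_emb), ones)
    d_loss = d_loss + F.binary_cross_entropy(D(images, mis_sent), zeros)
    fake_rel = G(rel_words, rel_sent, z).detach()
    d_loss = d_loss + F.binary_cross_entropy(D(fake_rel, rel_sent), zeros)

    # generator: push D to label the image generated from matching text as relevant
    fake = G(word_emb, sent_emb, z)
    g_loss = F.binary_cross_entropy(D(fake, sent_emb), ones)
    return d_loss, g_loss
```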

Fig. 4. Calculation of the generator and discriminator losses using matching, mismatching and relevant text. On the left side of the figure, we optimize our discriminator using matching text and mismatching text with the original training image. We also use relevant text with generator-generated images to further optimize the discriminator. On the right side of the figure, we use the discriminator's loss on the image generated from matching text and try to maximize that loss.

4.1 Quantitative Analysis

To analyze CanvasGAN quantitatively, we review the loss curves for the generator and the different losses of the discriminator. In Fig. 5, we show the loss curves for the generator and discriminator losses for both our model and that of Reed et al. [25]. In both models, the discriminator loss for matching and relevant text drops progressively with time, which shows that the discriminator gets better. For relevant text, the discriminator loss drops close to zero in the beginning because the generator is untrained and is not able to generate plausible images; however, it recovers from that after a few epochs. For CanvasGAN's loss curves, we can see the evident effect of applying an RNN-based incremental generator with attention: the generator loss drops quite drastically at the start as the discriminator fails to catch up with the generator. This shows that, with attention, the generator is able to produce plausible images from the start. After a few epochs, the discriminator eventually copes with the generator and the generator loss starts increasing, which should result in a better discriminator and generator overall; this is supported by the fact that our loss always remains below that of Reed et al. [25].

Fig. 5. Loss vs. steps curves for our model and that of Reed et al. [25]. (a) shows the loss curves for RCAGAN, including the generator loss and the discriminator losses with matching, mismatching and relevant text. (b) shows the corresponding loss curves for Reed et al. [25]. (c) compares the generator loss of RCAGAN and Reed et al. [25].

Fig. 6. Attention weights over 4 timesteps for the image caption "this orange flower has five fan shaped petals and orange stamen barely visible". We can see how the attention weights change across timesteps; significant weights are concentrated around "orange". The image on the left is the generated image.


Figure 6 shows the attention weights calculated per timestep for an image caption. Our results show that RCA-GAN almost always focuses on the color and shape mentioned in the image caption, which is very good for visual semantics. For more attention weight maps, see Fig. 7 in Appendix 1. To evaluate CanvasGAN's generative capabilities, we calculated the inception score [26] using a network pre-trained on ImageNet. We achieved an inception score [26] of 2.94 ± 0.18 using only 4 timesteps, which is close to the score of 3.25 achieved by the state-of-the-art model [34]. This shows that RCA-GAN has huge potential: with further optimizations and more timesteps, it can perform much better.

4.2 Qualitative Analysis

Our results show that the images generated by CanvasGAN are always relevant to the text and never nonsensical, as was the case we observed with Reed et al. [25]. Table 1 shows the images generated by both models for a given text description. We can see that CanvasGAN generates semantically relevant images almost all of the time, while Reed et al. [25] generates distorted but relevant images most of the time and fails badly on images with long captions. CanvasGAN's outputs, which have been generated incrementally and then sharpened using a CNN, are usually better and more expressive. Further quality improvements can be made by generating more patches, i.e. by increasing the number of time-steps.

Table 1. Comparison of images generated by our CanvasGAN model and those generated by Reed et al. [25]. We also provide the original image related to each caption for reference. For each of the following captions, the table shows the image generated by Reed et al., the image generated by CanvasGAN, and the original image:
- this flower has white petals with pointed tips and a grouping of thick yellow stamen at its center.
- this flower has large pink petals with a deep pink pistil with a cluster of yellow stamen and pink pollen tubes
- this flower has petals that are red and has yellow stamen
- the violet flower has petals that are soft, smooth and arranged separately in many layers around clustered stamens
- there is a star configured array of long thin slightly twisted yellow petals around an extremely large orange and grey stamen covered ovule

5 Conclusions

Text to image generation has become an important step towards models which better understand language and its corresponding visual semantics. Through this task we aim to create a model which can distinctly understand colors and objects in a visual sense and is able to produce coherent images that show it. There has been a lot of progress on this task and many innovative models have been proposed, but it is far from being solved. With each step we move closer to a human understanding of text as visual semantics. In this paper, we propose a novel architecture for generating images from text incrementally, like humans, by focusing on a part of the text at each incremental step. We use a GAN-based architecture with an RNN-CNN generator which incorporates attention, and we name it CanvasGAN. We show how the model focuses on important words in the text at each timestep and uses them to determine which patches to add to the canvas. Finally, we compare our model with previous prominent work and show our generator's comparatively better results.

Appendix 1: Attention Weights

(a) Text caption is: ”this is a white flower with pointy pedals and bright yellow stamen”

(b) Text caption is: ”The flower is pink i color with petals that are oval shaped wth striped.”

Fig. 7. Attention weights for captions recorded over 4 timesteps. The colormap displays how probable the corresponding hidden state is. The generated image is shown on the left side.


References 1. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017) 2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) 3. Brock, A., Lim, T., Ritchie, J.M., Weston, N.: Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093 (2016) 4. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014) 5. Denton, E.L., Chintala, S., Fergus, R., et al.: Deep generative image models using a Laplacian pyramid of adversarial networks. In: Advances in Neural Information Processing Systems, pp. 1486–1494 (2015) 6. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al.: Devise: a deep visual-semantic embedding model. In: Advances in Neural Information Processing Systems, pp. 2121–2129 (2013) 7. Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2414–2423. IEEE (2016) 8. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014) 9. Gregor, K., Danihelka, I., Graves, A., Rezende, D.J., Wierstra, D.: Draw: a recurrent neural network for image generation. arXiv preprint arXiv:1502.04623 (2015) 10. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.: Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028 (2017) 11. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997) 12. Isola, P., Zhu, J.-Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004 (2016) 13. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 (2016) 14. Karpathy, A., Fei-Fei, L., Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137 (2015) 15. Kim, T., Cha, M., Kim, H., Lee, J., Kim, J.: Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192 (2017) 16. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013) 17. Kiros, R., Salakhutdinov, R., Zemel, R.S.: Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539 (2014) 18. LeCun, Y., Bengio, Y., et al.: Convolutional networks for images, speech, and time series. In: The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10 (1995) 19. Mansimov, E., Parisotto, E., Ba, J.L., Salakhutdinov, R.: Generating images from captions with attention. arXiv preprint arXiv:1511.02793 (2015) 20. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)


21. Nilsback, M.-E., Zisserman, A.: Automated flower classification over a large number of classes. In: Sixth Indian Conference on Computer Vision, Graphics & Image Processing, ICVGIP 2008, pp. 722–729. IEEE (2008) 22. Odena, A., Olah, C., Shlens, J.: Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585 (2016) 23. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) 24. Reed, S., Akata, Z., Lee, H., Schiele, B.: Learning deep representations of finegrained visual descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 49–58 (2016) 25. Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396 (2016) 26. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems, pp. 2234–2242 (2016) 27. van den Oord, A., Kalchbrenner, N., Espeholt, L., Kavukcuoglu, K., Vinyals, O., Graves, A.: Conditional image generation with PixelCNN decoders. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29, pp. 4790–4798. Curran Associates, Inc., Red Hook (2016). http://papers.nips.cc/paper/6527-conditionalimage-generation-with-pixelcnn-decoders.pdf 28. van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al.: Conditional image generation with PixelCNN decoders. In: Advances in Neural Information Processing Systems, pp. 4790–4798 (2016) 29. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017) 30. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015) 31. Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., He, X.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. arXiv preprint arXiv:1711.10485v1 (2017) 32. Yan, X., Yang, J., Sohn, K., Lee, H.: Attribute2Image: conditional image generation from visual attributes. In: European Conference on Computer Vision, pp. 776–791. Springer, Cham (2016) 33. Zeiler, M.D., Krishnan, D., Taylor, G.W., Fergus, R.: Deconvolutional networks. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2528–2535. IEEE (2010) 34. Zhang, H., Xu, T., Li, H., Zhang, S., Huang, X., Wang, X., Metaxas, D.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1612.03242 (2016) 35. Zhang, H., Goodfellow, I., Metaxas, D., Odena, A.: Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318v1 (2018) 36. Zhang, Z., Xie, Y., Yang, L.: Photographic text-to-image synthesis with a hierarchically-nested adversarial network. CoRR, abs/1802.09178 (2018) 37. Zhu, J.-Y., Kr¨ ahenb¨ uhl, P., Shechtman, E., Efros, A.A.: Generative visual manipulation on the natural image manifold. In: European Conference on Computer Vision, pp. 597–613. Springer, Cham (2016)


38. Zhu, J.-Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593 (2017) 39. Zhu, J.-Y., Zhang, R., Pathak, D., Darrell, T., Efros, A.A., Wang, O., Shechtman, E.: Toward multimodal image-to-image translation. In: Advances in Neural Information Processing Systems, pp. 465–476 (2017)

Unsupervised Dimension Reduction for Image Classification Using Regularized Convolutional Auto-Encoder

Chaoyang Xu1, Ling Wu2,3(B), and Shiping Wang3
1 School of Information Engineering, Putian University, Putian 351100, China
2 School of Economics and Management, Fuzhou University, Fuzhou 350116, China
[email protected]
3 College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350116, China

Abstract. Unsupervised dimension reduction has gained widespread attention. Most previous work performs poorly on image classification because it takes no account of neighborhood relations and spatial localities. In this paper, we propose the 'regularized convolutional auto-encoder', a variant of the auto-encoder that uses convolutional operations to extract low-dimensional representations. Each auto-encoder is trained with cluster regularization terms. The contributions of this work are as follows: First, we perform different-sized filter convolutions in parallel and extract a low-dimensional representation from images across scales simultaneously. Second, we introduce a cluster-regularized rule on auto-encoders to reduce the classification error. Extensive experiments conducted on six publicly available datasets demonstrate that the proposed method significantly reduces the classification error after dimension reduction.

Keywords: Deep learning · Auto-encoder · Convolutional neural network · Dimension reduction · Unsupervised learning

1 Introduction

Real-world images usually have a high dimension, which leads to the well-known 'curse of dimensionality' in image classification tasks and other computer vision applications. Supervised deep learning networks have achieved great success based on millions of labeled training examples [1]. However, in real-world applications, obtaining more labeled training data tends to be unaffordable. Unsupervised learning methods use only unlabeled data; their mechanisms are closer to the learning mechanism of the human brain and simpler than those of supervised methods. Unsupervised dimension reduction has gained widespread attention because it can deal with the 'curse of dimensionality'.


In recent years, many unsupervised dimension reduction methods derived from various ideas have been presented. As a commonly used unsupervised dimension reduction algorithm, principal component analysis (PCA) [2] uses the singular value decomposition (SVD) of the data to transform high-dimensional data into low-dimensional subspaces while retaining most of the information in the data. As a linear dimension reduction method, PCA is not effective enough on non-linear datasets. Unlike PCA, locally linear embedding (LLE) [3] is a non-linear dimension reduction method which preserves distances to the k-nearest neighbors of high-dimensional data. Uniform manifold approximation and projection (UMAP) [4] is a novel unsupervised non-linear dimension reduction method for large-scale data based on Riemannian geometry and algebraic topology. Other unsupervised dimension reduction algorithms have also been presented, including multiple dimensional scaling (MDS) [5], isometric feature mapping (Isomap) [6], neighborhood preserving embedding (NPE) [7] and unsupervised discriminative feature selection (UDFS) [8].
An auto-encoder (AE) [9] uses a linear function for a linear projection or a sigmoid function for a non-linear mapping. Most research in this field aims at putting constraints on the hidden layer. The sparse auto-encoder [10] assumes that an image cannot contain all features, so the hidden layer should be sparse. Vincent et al. [11] proposed the denoising auto-encoder, which makes the learned representations robust to partial corruption of the input pattern. The contractive auto-encoder [12] introduced the Frobenius norm of the Jacobian matrix, which penalizes highly sensitive inputs to increase robustness. Zhang et al. [13] constructed neighborhood relations for each data sample and learned a local stacked contractive auto-encoder (SCAE) to extract deep local features. Previous research on auto-encoders ignored the image structure. To preserve the image neighborhood relations and spatial localities, the convolution operation forces each feature to be global and to span the entire image.
There have been many developments in the field of convolutional neural networks in recent years. To cover a large receptive field of the image, 11*11 sized filters are used in AlexNet [1]. NiN [14] used 1*1 filters for dimension reduction and substituted global average pooling for a fully connected layer. VGG [15] showed that a stack of three 3*3 convolutional layers has an effective receptive field of 7*7; one of the benefits is a decrease in the number of parameters and in time cost. Unlike the above networks, GoogLeNet [16] performs the convolutional operation and the max pooling operation in parallel. It introduced the inception module, which consists of a 1*1 filter convolution, a 3*3 filter convolution, a 5*5 filter convolution, and a max pooling operation. The concatenation of different-sized filter convolutions is able to extract different-scale representations from images. To extract image features directly, Ma et al. [17] proposed the MFFDN architecture, which takes five common image features, namely color, texture, shape, gradient and saliency, and incorporates them into hidden variables.
The main contributions are summarised as follows:
1. We use an inception module that consists of different-sized filter convolutions and extracts cross-scale representations from images.
2. We introduce a cluster-regularized rule to prevent non-linear auto-encoders from fracturing a manifold into many different domains, which leads to very different codes for nearby data points.

Fig. 1. Inception architecture

Fig. 2. Auto-encoder architecture

The remainder of this paper is organized as follows. Section 2 recalls the inception architecture and the auto-encoder paradigm. The proposed architecture and the regularization condition are presented in Sect. 3. Section 4 reports experiments that discuss the parameters of the algorithm and demonstrate the effectiveness of regularized convolutional auto-encoders. Finally, the paper is concluded in Sect. 5.

2 Related Works

In this section, we first provide a brief review of the inception architecture and then discuss the auto-encoder and its extensions.

2.1 Inception Architecture

Traditional deep neural networks perform the max pooling operation and the convolutional operation sequentially. To improve performance, the network has to increase the number of layers or use more units at each layer. More layers or more units mean more parameters, resulting in a network with a high risk of over-fitting. The inception architecture performs the max pooling operation and the convolutional operations in parallel; it consists of a 1*1 filter, a 3*3 filter, a 5*5 filter, and 3*3 max pooling, as seen in Fig. 1. The 1*1 convolutional operations provide a method of dimension reduction. The 3*3 and 5*5 filters cover receptive fields of different sizes and extract different information. The 3*3 max pooling operation helps to reduce spatial sizes and prevents the network from over-fitting. ReLU activations help improve the non-linearity of the network. In the top layer, the network concatenates the representations of all sizes and outputs them, thereby extracting cross-scale representations. The network is thus able to combine the functions of these different operations while the computational cost remains reasonable.
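A minimal PyTorch sketch of such an inception block follows; the channel counts, the 1*1 reductions inside the 3*3 and 5*5 branches, and the 1*1 projection after pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel 1*1, 3*3, 5*5 convolutions and 3*3 max pooling whose outputs
    are concatenated along the channel axis."""
    def __init__(self, in_ch, out_ch=64):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), nn.ReLU())
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, out_ch // 2, 1), nn.ReLU(),   # 1*1 reduction
                                     nn.Conv2d(out_ch // 2, out_ch, 3, padding=1), nn.ReLU())
        self.branch5 = nn.Sequential(nn.Conv2d(in_ch, out_ch // 2, 1), nn.ReLU(),
                                     nn.Conv2d(out_ch // 2, out_ch, 5, padding=2), nn.ReLU())
        self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(in_ch, out_ch, 1), nn.ReLU())

    def forward(self, x):
        # all branches keep the spatial size, so their feature maps can be concatenated
        return torch.cat([self.branch1(x), self.branch3(x), self.branch5(x), self.pool(x)], dim=1)
```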

2.2 Auto-Encoder Paradigm

The auto-encoder algorithm belongs to a special family of dimension reduction methods implemented using artificial neural networks. The generalized auto-encoder consists of two parts, an encoder and a decoder, as shown in Fig. 2. The encoder maps an input $x_i \in \mathbb{R}^{d_x}$ to a reduced latent representation $z_i \in \mathbb{R}^{d_z}$ by a deterministic function $f_{\theta}$ with parameters $\theta$: $z_i = f_{\theta}(x_i)$. For dimension reduction, the vector length is limited to $d_x > d_z$. This latent representation $z_i$ is then used to reconstruct the input by another deterministic function $g_{\theta}$: $y_i = g_{\theta}(z_i)$, where $f_{\theta}$ and $g_{\theta}$ can each be a linear projection, a sigmoid function for a non-linear mapping, or a neural network for a more complicated function and a deep auto-encoder. Each training pattern $x_i$ is then mapped onto its reduced latent representation $z_i$ and its reconstruction $y_i$. The reconstruction error is

$$J(\theta) = \frac{1}{2} \sum_i \|y_i - x_i\|^2 \quad (1)$$

The parameters are optimized by minimizing the reconstruction error $J$ over the training set $X = \{x_1, \cdots, x_n\}$. The aim is to learn a compressed representation of an input by minimizing its reconstruction error. By adding constraints to the hidden representation, we can prevent auto-encoders from fracturing the manifold into many different domains, which would result in very different codes for similar images, and we obtain a compressed representation of the input. Recently, the auto-encoder algorithm and its extensions [10–12] have demonstrated a promising ability to obtain low-dimensional representations from high-dimensional spaces, which can recover the 'intrinsic data structure' [18].
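A minimal PyTorch sketch of this encoder/decoder pair and the objective of Eq. (1); the layer widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Plain fully connected auto-encoder: d_x -> d_z -> d_x with d_z < d_x."""
    def __init__(self, d_x=784, d_z=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_x, 256), nn.ReLU(), nn.Linear(256, d_z))
        self.decoder = nn.Sequential(nn.Linear(d_z, 256), nn.ReLU(), nn.Linear(256, d_x))

    def forward(self, x):
        z = self.encoder(x)          # latent representation z_i = f_theta(x_i)
        y = self.decoder(z)          # reconstruction y_i = g_theta(z_i)
        return y, z

def reconstruction_error(y, x):
    # J(theta) = 1/2 * sum_i ||y_i - x_i||^2, as in Eq. (1)
    return 0.5 * ((y - x) ** 2).sum()
```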

3 Algorithm

To reduce the classification error after dimension reduction, we extend the auto-encoder with an inception module and a cluster regularization term into a new regularized convolutional auto-encoder called RCAE. The architecture of RCAE and the cluster regularization term are discussed as follows.

3.1 Architecture

The architecture of the regularized convolutional auto-encoder is designed to capture image representations at every scale with convolutional filters and bring them together in the output. The generalized auto-encoder consists of two parts: an encoder made up of an inception module group, global average pooling and constrained layers, and a decoder that transforms the hidden representation back into an image. The full architecture is illustrated in Fig. 3.
The inception module group in the encoder is set up as follows: an inception module that accepts the full image size as input is used to process and consolidate features across scales. At each inception module, the network performs convolution operations in parallel and combines representations of different sizes across scales. After reaching the output resolution of the network, filter concatenation is applied to produce the final hidden representation. The global average pooling layer between the inception module group and the constrained layers sums out the spatial information, making the representation more robust to spatial translations of the input. Unlike traditional fully connected layers, the global average pooling layer has no parameters to optimize, so over-fitting is avoided.
The constrained layers apply a regularization that pulls the hidden representations of similar images towards each other and thus prevents the manifold fracturing problem that is typically encountered in the embeddings learned by auto-encoders. The final hidden representation is obtained by adding the constrained representation y′ to the style representation z. The constrained representation y′ is obtained by multiplying the cluster representation y by an m ∗ n constrained matrix REM, where REM is learned with stochastic gradient descent (SGD) [19] by minimizing the regularized error. The regularized error is computed as

$$R(error) = -\frac{1}{2} \sum_{i,j} \mathrm{distance}_{i,j}^2 \quad (2)$$

where $\mathrm{distance}_{i,j}$ stands for the distance between all possible combinations of two clusters $i$ and $j$.
The decoder transforms the low-dimensional hidden representation into a high-dimensional representation using multiple layers of deconvolutional filters. Every layer produces a sparse representation. Note that the last layer of the decoder produces a multi-channel feature map, although the encoder input has 3 channels (RGB).
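A minimal PyTorch sketch of Eq. (2); it assumes the cluster representations are stacked into a single (n, d) tensor, so that minimizing this quantity pushes the clusters apart.

```python
import torch

def regularized_error(cluster_reps):
    """R(error) = -1/2 * sum over all cluster pairs (i, j) of distance_{i,j}^2.
    cluster_reps: (n_clusters, dim) tensor holding one representation per cluster."""
    sq_dist = torch.cdist(cluster_reps, cluster_reps) ** 2   # pairwise squared Euclidean distances
    return -0.5 * sq_dist.sum()
```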

Fig. 3. An illustration of the RCAE architecture. The inception nodes have the inception architecture shown in Fig. 1. The representation layer is the final low-dimensional hidden representation after dimension reduction.

3.2 Training Methodology

The main idea of the proposed approach is to perform different convolution operations in parallel and to extend the auto-encoder with a cluster constraint for dimension reduction. Given the high-dimensional images X = [x1, · · · , xN], the regularized convolutional auto-encoder (RCAE) procedure is summarized as Algorithm 1 and yields the low-dimensional hidden representation Z = [z1, · · · , zN].


Algorithm 1. The RCAE algorithm for dimension reduction.
Input: training images X, the number of low-dimensional hidden representations m, the number of clusters n
Output: trained model RCAE, low-dimensional hidden representation Z
step 1: randomly initialize the regularized matrix REM with a truncated normal distribution
step 2: feed a mini-batch of images from X to the network
step 3: calculate the reconstruction error J with Equation (1)
step 4: calculate the regularized error R(error) with Equation (2)
step 5: minimize J and R with SGD
step 6: loop steps 1-5 until the training error converges
step 7: return the model RCAE and the low-dimensional hidden representation Z of X
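A minimal PyTorch training-loop sketch of these steps, assuming the `regularized_error` helper above and a hypothetical `model(x)` that returns the reconstruction, the low-dimensional hidden representation and the cluster representations; the weighting `lambda_reg` is also an assumption.

```python
import torch

def train_rcae(model, loader, epochs=50, lr=0.001, lambda_reg=1e-3):
    """Sketch of Algorithm 1: minimise reconstruction error J (Eq. 1) plus the
    cluster regularization R (Eq. 2) with SGD."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)          # step 5: SGD
    for _ in range(epochs):                                   # step 6: loop until convergence
        for x, _ in loader:                                   # step 2: mini-batches of images
            recon, hidden, clusters = model(x)
            j = 0.5 * ((recon - x) ** 2).sum()                # step 3: reconstruction error J
            r = regularized_error(clusters)                   # step 4: regularized error R
            loss = j + lambda_reg * r
            opt.zero_grad()
            loss.backward()
            opt.step()
    with torch.no_grad():                                     # step 7: return model and Z
        z = torch.cat([model(x)[1] for x, _ in loader])
    return model, z
```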

The RCAE network weights and the regularized matrix REM are all initialized using the truncated normal distribution as described in [20]. The encoder takes a mini-batch of images as input and returns a low-dimensional hidden representation. The reconstruction loss at the end of the network is the mean squared error (MSE) [21]. An additional loss, the regularized error R(error), is computed to regularize the auto-encoder and prevent very different codes for similar images. The low-dimensional hidden representation is then passed to the KNN classification algorithm, and the classification accuracy is calculated to evaluate the performance of the dimension reduction method. More details about the architecture of the algorithm are described in Sect. 4.4. We implement a small inception layer to perform convolution. The decoder is a small 3-layer network with deconvolution operations followed by 2 fully connected layers with dropout. This is similar to GoogLeNet in structure but smaller, in the interest of faster training for evaluation. We explore how these convolution tricks reduce over-fitting, help the network extract image features and improve classification accuracy after dimension reduction.

4 Experiments and Discussion

4.1 Data Sets

In this section, we evaluate the predictive performance of the proposed RCAE on six standard datasets: F-Mnist, SVHN, Cifar-10, Flowers-102, Dogs-120, and CUB-200. Data augmentation is applied to these natural images in the experiments. The details of the datasets are summarized in Table 1. Having a good training dataset is a huge step towards a robust model; we perform whitening and local contrast normalization [22] on the RGB images of all datasets.


Table 1. The details of six datasets

ID  Dataset      # train  # test  # attributes  # classes  Data-type
1   F-Mnist      60,000   10,000  28*28         10         Gray-scale
2   SVHN         73,257   26,032  32*32*3       10         RGB
3   Cifar-10     50,000   10,000  32*32*3       10         RGB
4   Flowers-102  6,149    1,020   64*64*3       102        RGB
5   Dogs-120     15,000   5,580   64*64*3       120        RGB
6   CUB-200      3,000    3,033   64*64*3       200        RGB

4.2 Classification Algorithm and Evaluation Indicators

The kNN Algorithm. We use kNN to evaluate the classification accuracy on the different datasets. In classification, kNN predicts the class of an image using the information provided by its k nearest neighbors. Essentially, this algorithm is based on local information provided by the training images instead of constructing a global model from the whole data. In this paper, we choose k = 30.
Classification Accuracy. The purpose of the evaluation indicator is to assess the validity of dimension reduction methods. The commonly used evaluation indicator is classification accuracy; a better classifier achieves higher accuracy, which is computed as

$$accuracy = \frac{TP + TN}{TP + TN + FP + FN} \quad (3)$$

where TP is the number of instances correctly identified, FP is the number of instances incorrectly identified, TN is the number of instances correctly rejected, and FN is the number of instances incorrectly rejected.
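A minimal scikit-learn sketch of this evaluation step, assuming the low-dimensional representations of the training and test sets have already been computed.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def knn_accuracy(z_train, y_train, z_test, y_test, k=30):
    """Fit kNN (k = 30 as in the paper) on the reduced training representations
    and report classification accuracy on the reduced test representations."""
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(z_train, y_train)
    return accuracy_score(y_test, knn.predict(z_test))
```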

4.3 Parameter Sensitivity

As described in Algorithm 1, there are two important parameters: the number of low-dimensional hidden representations m and the number of predictive clusters n. The number of low-dimensional hidden representations was set to m = {16, 32, 64, 128, 256, 512}, and the number of predictive clusters was set to n = {3, 7, 10, 20, 50, 100} for the datasets F-Mnist, SVHN and Cifar-10, and to n = {20, 50, 100, 150, 200, 300} for the datasets Flowers-102, Dogs-120 and CUB-200. We test the two architectural parameters with a training mini-batch size of 128, optimizing on each of the six datasets until the training loss converges. We select the epoch at which the global accuracy is highest among the evaluations on each training set.

Fig. 4. Classification accuracy of RCAE with regard to m and n on the F-Mnist, SVHN, Cifar-10, Flowers-102, Dogs-120 and CUB-200 datasets.

Figure 4 compares the classification accuracy of RCAE on the six datasets with respect to the two parameters. For each dataset, we try different values of the parameters, resulting in a three-dimensional bar plot of accuracy over the number of predictive clusters and the number of hidden representations. On F-Mnist, the accuracy rises from 85.03% to 97.58% and then levels off as the number of predictive clusters varies from 3 to 10. On the other hand, accuracy does not vary notably with changes in the number of hidden representations. On Flowers-102, accuracy goes from a baseline of 24.02% to 29.09% and stays roughly constant as the number of predictive clusters varies from 20 to 150. Similar results are observed on the other datasets.

4.4 Experimental Results

For performance comparisons, we compare with PCA, LLE, UMAP, and AE. Table 2 compares the classification accuracy of the KNN classifier when the low-dimensional hidden representations are used as features. For the KNN classifier, the number of neighbors is set to 30. For AE and RCAE, the training set is shuffled, each mini-batch has 128 images, and optimization is performed for different numbers of epochs on the six datasets; each image is used only once per epoch. The encoder and decoder weights are all initialized using a truncated normal distribution. To train all the parameters we use stochastic gradient descent (SGD) with a fixed learning rate of 0.001 [23] using our Tensorflow implementation. We select the epoch at which the classification accuracy is highest among the evaluations on the six datasets.


Two important observations can be made from Table 2. First, the AE and RCAE algorithms clearly dominate the others. We use this benchmark to compare AE with several non-auto-encoder methods, including PCA, LLE and UMAP, to give the reader a perspective on the improvement in classification accuracy that is achieved by using an auto-encoder compared to classical dimension reduction methods. Second, AE and its extension always perform better than the non-auto-encoder methods. The difference between AE and RCAE is that the latter preserves the image neighborhood relations and spatial localities when reducing dimension and adds a constraint rule to the low-dimensional hidden representation. Note that the training time of RCAE is higher due to its more complicated network and larger number of parameters.

Table 2. Comparison of classification accuracy results for the test datasets

Method/dataset  F-Mnist  SVHN    CIFAR-10  Flowers-102  Dogs-120  CUB-200
Baseline        0.8339   0.3489  0.3812    0.1022       0.1191    0.018
PCA             0.8493   0.5035  0.3483    0.1762       0.1352    0.0231
LLE             0.825    0.3244  0.3283    0.1134       0.1136    0.0206
UMAP            0.8214   0.4048  0.3214    0.1213       0.1041    0.0212
AE              0.8589   0.5725  0.4283    0.2686       0.1412    0.0234
RCAE            0.8758   0.5993  0.4389    0.2909       0.1637    0.0293

5 Conclusion

We presented RCAE, a deep convolutional auto-encoder architecture for unsupervised dimension reduction. The main motivation behind RCAE was the need for an efficient architecture for unsupervised dimension reduction which preserves the image neighborhood relations and spatial localities. We analyzed RCAE with respect to two important parameters, the number of predictive clusters and the number of low-dimensional hidden representations, to reveal the practical trade-offs involved in designing architectures for dimension reduction. Architectures with a more complicated network performed best but consumed more memory and time. On six well-known benchmark datasets, RCAE performs competitively and achieves high accuracy for KNN classification after dimension reduction. Unsupervised dimension reduction for image classification remains a hard challenge in machine learning, and we hope to see more attention paid to this important problem.

Acknowledgments. This work is partly supported by the National Natural Science Foundation of China under Grant No. 61502104 and the Fujian Young and Middle-aged Teacher Education Research Project under Grant No. JT180478.


References 1. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: International Conference on Neural Information Processing Systems, pp. 1097–1105 (2012) 2. Hotelling, H.: Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24(6), 417–520 (1933) 3. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000) 4. Mcinnes, L., Healy, J.: UMAP: uniform manifold approximation and projection for dimension reduction (2018) 5. Wickelmaier, F.: An introduction to MDS. Sound Quality Research Unit (2003) 6. Tenenbaum, J.B., De Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000) 7. He, X., Cai, D., Yan, S., Zhang, H.J.: Neighborhood preserving embedding. In: Tenth IEEE International Conference on Computer Vision, pp. 1208–1213 (2005) 8. Wang, S., Wang, H.: Unsupervised feature selection via low-rank approximation and structure learning. Knowl.-Based Syst. 124, 70–79 (2017) 9. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006) 10. Nair, V., Hinton, G.E.: 3D object recognition with deep belief nets. In: International Conference on Neural Information Processing Systems, pp. 1339–1347 (2009) 11. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: International Conference on Machine Learning, pp. 1096–1103 (2008) 12. Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y.: Contractive auto-encoders: explicit invariance during feature extraction. In: ICML (2011) 13. Zhang, J., Yu, J., Tao, D.: Local deep-feature alignment for unsupervised dimension reduction. IEEE Trans. Image Process. 27(5), 1–1 (2018) 14. Lin, M., Chen, Q., Yan, S.: Network in network. Comput. Sci. (2013) 15. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. Comput. Sci. (2014) 16. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015) 17. Ma, G., Yang, X., Zhang, B., Shi, Z.: Multi-feature fusion deep networks. Neurocomputing 218, 164–171 (2016). http://www.sciencedirect.com/science/article/ pii/S0925231216309559 18. Levina, E., Bickel, P.J.: Maximum likelihood estimation of intrinsic dimension. In: International Conference on Neural Information Processing Systems, pp. 777–784 (2004) 19. Bottou, L., Bousquet, O.: The tradeoffs of large scale learning. In: International Conference on Neural Information Processing Systems, pp. 161–168 (2007) 20. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing humanlevel performance on ImageNet classification, pp. 1026–1034 (2015) 21. Berger, J.O.: Statistical Decision Theory and Bayesian Analysis, 2nd edn. Springer, New York (1986). ISBN 0-387-96098-8 22. Jarrett, K., Kavukcuoglu, K., Ranzato, M., Lecun, Y.: What is the best multistage architecture for object recognition? In: IEEE International Conference on Computer Vision, pp. 2146–2153 (2010) 23. Bottou, L.: Large-scale machine learning with stochastic gradient descent, pp. 177– 186 (2010)

ISRGAN: Improved Super-Resolution Using Generative Adversarial Networks

Vishal Chudasama and Kishor Upla(B)
Sardar Vallabhbhai National Institute of Technology, Surat 395007, Gujrat, India
[email protected], [email protected]

Abstract. In this paper, we propose an approach for single image super-resolution (SISR) using a generative adversarial network (GAN). SISR has been an attractive research topic over the last two decades; it refers to the reconstruction of a high-resolution (HR) image from a single low-resolution (LR) observation. Recently, SISR methods based on convolutional neural networks (CNNs) have obtained remarkable performance in terms of the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) metrics. Despite this, these methods suffer from a serious drawback in the visual quality of the SR images: the results look overly smoothed. This is because their loss functions measure pixel-level differences, which raises PSNR and SSIM but degrades perceptual quality. A GAN, in contrast, is able to generate visually appealing solutions and to recover high-frequency texture details thanks to the discrimination process it involves. Here, we propose improved single image super-resolution using GAN (ISRGAN), built on densely connected deep convolutional networks. Our proposed method consists of two networks: ISRNet and ISRGAN. ISRNet is trained with an MSE-based loss function to achieve higher PSNR and SSIM values, while ISRGAN is trained with a combination of a VGG-based perceptual loss and an adversarial loss in order to improve the perceptual quality of the SR images. This training step pushes the SR results closer to the natural image manifold. The efficiency of the proposed method is verified by experiments on different benchmark testing datasets, which show that ISRGAN outperforms other state-of-the-art GAN-based SISR techniques in terms of perception.

Keywords: Super-resolution · Densely connected residual network · Global residual learning · Perceptual loss · Adversarial loss

1 Introduction

The SISR is a classical yet challenging problem in the computer vision community which aims to reconstruct the HR image with high-frequency details from its LR test image. In SISR, the given LR image is prepared by downsampling its original HR image and experiments are performed on this LR test image.


Fig. 1. The comparison of the proposed ISRNet and ISRGAN with ground truth (original) images for ×4 resolution factor on the random data samples of Set5, Set14 and BSD100 datasets. (Reader can zoom-in for a better view)

During the downsampling operation, high-frequency details are lost due to aliasing, and these are what the SISR approach must recover. This is an ill-posed inverse problem for which no unique solution exists; many SR images correspond to a given LR observation. In order to solve this problem, many SR approaches form the inverse function by minimizing a pixel-wise MSE-based loss function to super-resolve the given LR image. While the MSE loss is easy to optimize, it returns the mean of the possible solutions, which yields SR images with over-smoothed regions. Therefore, it is very difficult to reconstruct all the fine details with pixel-wise accuracy in those SISR


approaches. Our proposed improved single image super-resolution network (ISRNet) is also based on an MSE loss function, but it yields SR images with better PSNR and SSIM values. However, the SR results obtained using ISRNet look blurry because of the MSE loss, as illustrated in Fig. 1, and a higher PSNR does not imply a perceptually plausible solution. Hence, reconstructing super-resolved images with high-frequency texture details at large upscaling factors remains an unsolved problem. A generative adversarial network (GAN) has the ability to generate visually appealing solutions [1,2]: the discriminator network forces the generator network to produce perceptually convincing results. Super-resolution using GAN (SRGAN) [3] was the first such framework for SR reconstruction and can produce perceptually better results at larger upscaling factors. However, SRGAN has a high computational cost and memory requirement, and some degradations are still observed in its SR results [3]. In this paper, we propose improved single image super-resolution using GAN (ISRGAN) with densely connected convolutional networks, in which the proposed ISRNet is used as the generator network. Instead of relying only on the MSE loss, we employ a combination of a VGG-based perceptual loss [4] and an adversarial loss [1] as the loss function of the proposed ISRGAN. The VGG-based perceptual loss is computed from the high-level feature maps of a pre-trained VGG-19 network [4]. Figure 1 compares the SR images obtained using the proposed ISRNet and ISRGAN methods with the ground truth. One can notice that the SR images obtained using the proposed ISRGAN are closer to the ground truth. Although the quantitative measures for ISRGAN are lower than for ISRNet, ISRGAN produces SR images with more perceptual fidelity. Our contributions in this paper are as follows:
– We propose the ISRNet model, which obtains SR results comparable to SRResNet [3] in terms of PSNR and SSIM measures.
– In order to improve the perceptual quality and the quantitative performance of the SR images, we propose ISRGAN, which outperforms the existing SRGAN and other GAN-based SR methods.

1.1 Related Work

Over the last two decades, image super-resolution (SR) has gained considerable attention from the computer vision research community and has a broad range of applications [5–9]. There are two main categories of SR approaches: multi-image SR (MISR) and single image SR (SISR). In MISR, the SR image is obtained from multiple LR images of the same scene. These methods have not proven effective since they require image registration and fusion stages which are computationally complex [10–13]. On the other hand, SISR methods aim to obtain the SR image from a single LR observation, which is more practical and cost-effective. Hence, we focus on SISR in this work.


Classical SISR Methods: Many SISR methods have been proposed in the computer vision community; a detailed review is given by Yang et al. in [10]. Among these, interpolation-based SR methods are very easy to implement and widely used. However, their representation power is very limited and they often generate SR images with blocky regions. Example-based SR methods [14–17] have been developed to improve performance by using rich prior information. These include SR approaches based on compressive sensing (CS), which assume that natural image patches can be sparsely represented by a dictionary of atoms; such a dictionary can be formed from a database of patches. CS-based SR methods have achieved performance comparable to state-of-the-art SR methods. However, their main drawback is that finding the sparse coefficients is computationally expensive. Moreover, for the super-resolution of color images, CS-based methods focus only on a single image channel or a gray-scale image, and none of them has analyzed performance on the multiple channels of an RGB image [18]. The performance of these SR methods is also poor in reconstructing high-quality details at higher upscaling factors.

Convolutional Neural Network (CNN) Based SISR Methods: Recently, CNNs have been applied effectively to SISR. CNN-based methods are also called deep learning based SR methods since they use many hidden (i.e., deep) layers between the input and output layers. Deep learning based SISR methods differ from existing example-based methods in that they can perform SR on color images with three channels at a time, and the dictionaries and manifolds are learned implicitly via the hidden layers of the network. The pioneering work in this category was carried out by Dong et al. [18] and is called SRCNN. This method learns an end-to-end mapping between LR and HR images: a bicubic interpolation upsamples the given LR image and a three-layer CNN is trained end-to-end to produce the SR result. After SRCNN, the very deep convolutional network (VDSR) [19] showed a significant improvement by increasing the network depth from 3 to 20 convolutional layers. In that work, the authors adopt the global residual learning paradigm to predict the difference between the bicubically upsampled LR image and the original HR image in order to achieve fast convergence. In [20], Kim et al. proposed the deeply-recursive convolutional network (DRCN) with 16 recursive layers. This model keeps the number of model parameters small and achieves better PSNR values than SRCNN and VDSR. The deep recursive residual network (DRRN) proposed by Tai et al. [21] trains a 52-layer network by extending the local residual learning approach of ResNet [22] with deep recursion to achieve state-of-the-art results. Lai et al. [23] introduce LapSRN, in which the sub-band residuals of HR images are progressively reconstructed at multiple pyramid levels. They train their network using the Charbonnier loss function [24] instead of L2 and obtain better SR images.


Adversarial Training and Perceptual Loss: All the afore-mentioned SR techniques outperform the classical state-of-the-art SISR methods in terms of PSNR and SSIM. Despite this, they fail to recover high-frequency texture details in the super-resolved images at larger upscaling factors, mainly because of the MSE-based loss functions they use. Generative Adversarial Networks (GANs) are a relatively new kind of model based on unsupervised learning, proposed by Goodfellow et al. [1] in 2014. Since then, many GAN variants have been proposed in the literature for understanding and improving the performance of GANs [2,25–28]. The limitation of MSE-based loss functions in deep learning based methods was addressed by Bruna et al. [29] and Johnson et al. [4], who propose a perceptual loss function which creates perceptually plausible solutions. In the perceptual loss, features are extracted from high-level feature maps of a pre-trained VGG-19 network [30] instead of using low-level pixel-wise losses, and the Euclidean distance is computed between the extracted feature maps. Recently, Ledig et al. [3] proposed an SR method called super-resolution using GAN (SRGAN) which set a new state-of-the-art performance in SISR. Instead of an MSE loss, they train their networks using a combination of perceptual and adversarial loss functions which produces perceptually better solutions. In addition, Sajjadi et al. propose an SISR method called EnhanceNet [31] in which automated texture synthesis is used as a texture loss in combination with perceptual and adversarial losses to generate images with realistic textures. In this paper, we propose a GAN-based SISR method which differs from the other GAN-based methods, i.e., SRGAN [3] and EnhanceNet [31], as follows:
– In comparison with EnhanceNet [31] and SRGAN [3], our method uses a different residual network. We propose densely connected convolutional networks inspired by Huang et al. [32] instead of the original residual network proposed in [22]. Such dense blocks offer implicit deep supervision with fewer trainable parameters, and the gradients flow back without much loss because of the short connections.
– In contrast to SRGAN [3], we add a global residual learning network in our generator, which helps to learn the identity function for the LR image and also helps to stabilize the training process. This additionally reduces color shifts in the SR results of the proposed method.
– We add three 1 × 1 convolution layers after the bicubic interpolation in the global residual learning network, which help to further extract useful features. In EnhanceNet [31], by contrast, the authors use only bicubic interpolation in their global residual learning network.

1.2 Single Image Super-Resolution Using GAN

In SISR, the super-resolved image I^SR is obtained from a given LR observation I^LR. The original HR image I^HR is passed through a downsampling operation with factor r in order to obtain the LR image I^LR. For a color image, I^LR can be described by a real-valued tensor of size w × h × c and its corresponding I^HR and


I^SR are described by real-valued tensors of size rw × rh × c. In a GAN, a generator network G is trained to estimate the SR image for a given LR input image. The generator network G is trained to optimize its parameters θ_G, which is given by

$$\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N}\sum_{n=1}^{N} l^{SR}\big(G_{\theta_G}(I_n^{LR}),\, I_n^{HR}\big), \qquad (1)$$

where l^{SR} is a loss function given by a weighted combination of perceptual and adversarial loss functions. In Eq. (1), I_n^{HR} and I_n^{LR}, n = 1, 2, ..., N, are the training HR images and their corresponding LR images, respectively. In addition to the generator network G, a discriminator network D with parameters θ_D is optimized in an adversarial manner, which can be represented as

$$\min_{\theta_G}\max_{\theta_D}\; \mathbb{E}_{I^{HR}\sim p_{\mathrm{train}}(I^{HR})}\big[\log D_{\theta_D}(I^{HR})\big] + \mathbb{E}_{I^{LR}\sim p_{G}(I^{LR})}\big[\log\big(1 - D_{\theta_D}(G_{\theta_G}(I^{LR}))\big)\big]. \qquad (2)$$

Here, G is trained to fool D by producing SR images highly similar to HR images, while D is trained to distinguish SR images from HR images. Through this procedure, G learns to generate perceptually plausible images which reside in the natural image manifold.
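To make the alternating optimization of Eqs. (1) and (2) concrete, the Python sketch below shows one possible training loop. The names `generator_step` and `discriminator_step` are hypothetical placeholders for framework-specific parameter updates and are not taken from the paper's implementation; only the alternation of generator and discriminator updates is what the text describes.

```python
# Minimal sketch of alternating adversarial training for super-resolution.
# `generator`, `discriminator`, `generator_step` and `discriminator_step` are
# assumed callables supplied by a concrete framework implementation.

def train_adversarial(batches, generator, discriminator,
                      generator_step, discriminator_step):
    """Alternate updates: maximise Eq. (2) w.r.t. theta_D, minimise w.r.t. theta_G."""
    for it, (i_lr, i_hr) in enumerate(batches):
        # Discriminator step: push D(I_HR) towards 1 and D(G(I_LR)) towards 0.
        d_loss = discriminator_step(discriminator, real=i_hr, fake=generator(i_lr))
        # Generator step: minimise the weighted combination of content
        # (perceptual) loss and adversarial loss described in the paper.
        g_loss = generator_step(generator, discriminator, i_lr, i_hr)
        if it % 1000 == 0:
            print(f"iter {it}: d_loss={d_loss:.4f}, g_loss={g_loss:.4f}")
```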

Fig. 2. Network architectures of the proposed method. Here, k indicates the number of filters and s indicates a stride value of convolutional layer.

2 Proposed Method

The architecture of the proposed GAN-based SR method, i.e., ISRGAN, is displayed in Fig. 2 for the ×4 upscaling factor. Inspired by the VGG architecture [30], we exclusively use filters of size 3 × 3 in the convolutional layers, which allows us to create a deeper model with a smaller number of parameters in the network.

Generator Network: The architecture of the generator network is depicted in Fig. 2. Here, we use M residual blocks; each residual block consists of densely connected convolutional networks. These densely connected CNNs are trained in a recursive manner as suggested by Huang et al. [32]. The use of such residual blocks increases the convergence speed during training compared to stacked convolutional networks [31]. In Fig. 3, we show the difference between the residual structure of the proposed method and those of other existing residual network based methods.

Fig. 3. Comparison of different residual networks: (a) original ResNet block used in SRResNet by Ledig et al. [3], (b) modified ResNet block used in EDSR by Lim et al. [33] and (c) the proposed ResNet block

Figure 3(a) shows the original residual network proposed by He et al. [22], which was used in the SRResNet architecture [3]. Figure 3(b) shows the residual block used in the EDSR network [33], in which the batch normalization (BN) layers are removed to simplify the structure of Fig. 3(a). Removing the BN layers reduces GPU memory usage during training, since a BN layer consumes the same amount of memory as the preceding convolutional layer [33]. The residual block structure of the proposed method is depicted in Fig. 3(c). Here, we remove the BN layers and also replace the first two convolutional layers of Fig. 3(a) with three dense blocks. As stated earlier, such dense


blocks can reduce the number of parameters by reusing features, which also reduces memory usage and overall computation. In comparison with the architecture proposed in [22], we use multiple output feature maps which are concatenated into a single tensor rather than summed directly as in [22]. Finally, one 1 × 1 transition convolutional layer is used to reshape the number of channels to the desired level. Local residual learning (LRL) is introduced as a skip connection in the residual block, as shown in Fig. 3(c), which improves the information flow. The LRL has the form x_l = x_{l-1} + F(x_{l-1}), where F(·) denotes the function of the 1 × 1 convolutional transition layer in the residual block. Using LRL, the network mitigates the problem of exploding or vanishing gradients, since higher-layer gradients are passed directly to the lower layers of the residual block. Similar to LRL, Kim et al. [19] propose global residual learning (GRL), in which the model's output is added to the bicubic interpolation of the input to generate the residual image. GRL helps the network learn the identity function for I^LR; during training it also helps to stabilize the training process and reduces color shifts in the output image. This idea of GRL is also used in the proposed method. However, we pass the bicubically interpolated image through three 1 × 1 convolution layers, which helps to further extract useful features of I^LR (see Fig. 2) and also reduces the number of training parameters compared to a 3 × 3 convolution layer [34]. Long et al. [35] use a transposed convolution layer to upsample the feature maps inside the network. However, such a layer produces checkerboard artifacts in the solutions, which then require additional regularization. Odena et al. [36] observed that the resize convolution, i.e. nearest-neighbor interpolation followed by a single convolution layer, can reduce the checkerboard artifacts. However, this approach still creates some checkerboard artifacts for specific loss functions [31]. Hence, in the proposed method we add two convolution layers after the resize convolution layer, which act as a regularizer and further reduce the checkerboard artifacts in the SR results.

Discriminator Network: The discriminator network is also displayed in Fig. 2 along with the generator network. The proposed discriminator follows the architecture guidelines suggested by Radford et al. [2]. We use leaky ReLU (α = 0.2) as the activation function and avoid any pooling layers in the network; strided convolutions are used whenever the number of feature maps is doubled. The proposed discriminator network consists of eight convolutional layers with the number of filters increasing by a factor of 2 from 32 to 256, followed by three fully connected layers and a sigmoid activation. The discriminator takes HR and SR images as inputs and discriminates between them by outputting a probability value between 0 and 1.
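As an illustration of the residual block in Fig. 3(c), the following is a minimal tf.keras sketch of a densely connected residual block with a 1 × 1 transition layer and local residual learning. The growth rate, the activation choice and the number of dense convolutions per block are assumptions made for the example and are not hyper-parameters stated in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_residual_block(x, filters=64, growth=32, num_dense=3):
    """Each conv sees the concatenation of all previous feature maps; a 1x1
    transition conv restores the channel count and a skip connection adds the
    block input (local residual learning: x_l = x_{l-1} + F(x_{l-1}))."""
    features = [x]
    for _ in range(num_dense):
        inp = features[0] if len(features) == 1 else layers.Concatenate()(features)
        h = layers.Conv2D(growth, 3, padding='same')(inp)
        h = layers.LeakyReLU(alpha=0.2)(h)       # activation is an assumption
        features.append(h)
    out = layers.Conv2D(filters, 1, padding='same')(layers.Concatenate()(features))
    return layers.Add()([x, out])

# Example: a chain of 16 identical residual blocks, as in the proposed generator.
inp = tf.keras.Input(shape=(None, None, 64))
h = inp
for _ in range(16):
    h = dense_residual_block(h)
model = tf.keras.Model(inp, h)
```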

2.1 Loss Functions

In this section, we describe the different loss functions used to train the network. In the proposed SR method, we formulate the loss function l^{SR} as a weighted combination of a content loss l_x and an adversarial loss l_{adversarial} [3,31]:

$$l^{SR} = l_x + 10^{-3}\, l_{adversarial}. \qquad (3)$$

The content loss may be either a pixel-wise MSE loss or a perceptual loss function.

Pixel-Wise MSE Loss Function: The pixel-wise MSE is the most widely used loss function to optimize the network and is defined as

$$l_{MSE} = \frac{1}{r^2 w h}\sum_{x=1}^{rw}\sum_{y=1}^{rh}\big(I^{HR}_{x,y} - G(I^{LR})_{x,y}\big)^2, \qquad (4)$$

where r is the downsampling factor while w and h indicate the width and height of the LR image, respectively. A higher PSNR value can be obtained by minimizing the MSE loss; hence, most state-of-the-art SISR methods use an MSE-based loss function to obtain better PSNR values [18–20].

Perceptual Loss Function: A higher PSNR value does not reflect the visual quality of the SR image. Johnson et al. [4] and Dosovitskiy and Brox [37] propose a perceptual loss function in which, instead of computing the loss directly in image space, both I^SR and I^HR are first mapped into a feature space by a feature map function φ obtained from a pre-trained VGG-19 network [30]. The VGG loss is then defined as the Euclidean distance between the feature maps of the reference HR image I^HR and the generated SR image I^SR:

$$l_{VGG19\text{-}MSE(i,j)} = \frac{1}{W_{i,j} H_{i,j}}\sum_{x=1}^{W_{i,j}}\sum_{y=1}^{H_{i,j}}\big(\phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}(G(I^{LR}))_{x,y}\big)^2. \qquad (5)$$

Here, φ_{i,j} denotes the feature map of the j-th convolution layer before the i-th max-pooling layer within the VGG-19 network, and W_{i,j} and H_{i,j} are the corresponding feature map dimensions.
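A minimal sketch of such a VGG-based perceptual loss using tf.keras is shown below. The choice of the 'block5_conv4' layer is an assumption here (it corresponds to the commonly used VGG54 features); the exact layer used by the paper's implementation is not specified at this point.

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG19
from tensorflow.keras.applications.vgg19 import preprocess_input

# Feature extractor phi_{i,j}: activations of a deep VGG-19 convolution layer.
_vgg = VGG19(weights='imagenet', include_top=False)
_features = tf.keras.Model(_vgg.input, _vgg.get_layer('block5_conv4').output)
_features.trainable = False

def vgg_perceptual_loss(i_hr, i_sr):
    """Mean squared error between VGG-19 feature maps of HR and SR images.
    Inputs are assumed to be RGB tensors scaled to [0, 255]."""
    f_hr = _features(preprocess_input(i_hr))
    f_sr = _features(preprocess_input(i_sr))
    return tf.reduce_mean(tf.square(f_hr - f_sr))
```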

Adversarial Loss Function: Adversarial training has recently proven very effective for generating SR images which are visually appealing [1]. In adversarial training, the generator network is trained to learn the mapping from the LR to the HR image space while, simultaneously, the discriminator network is trained to discriminate between HR and SR images. This leads to adversarial behaviour and a min-max game in which the generator is trained to minimize the loss

$$l_{generator} = \sum_{n=1}^{N} -\log\big(D(G(I^{LR}))\big), \qquad (6)$$

over N training samples, while the discriminator minimizes

$$l_{discriminator} = \sum_{n=1}^{N}\Big(-\log\big(D(I^{HR})\big) - \log\big(1 - D(G(I^{LR}))\big)\Big). \qquad (7)$$

Here, D(I^HR) is the probability assigned to the HR image and D(G(I^LR)) is the probability assigned to the reconstructed super-resolved image. The adversarial loss is thus the combination of the generator loss l_{generator} and the discriminator loss l_{discriminator} given by Eqs. (6) and (7), respectively.
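The two adversarial terms of Eqs. (6) and (7) can be written directly from the discriminator outputs. The NumPy sketch below is only illustrative; the small epsilon for numerical stability is an implementation detail assumed here, not something stated in the paper.

```python
import numpy as np

_EPS = 1e-8  # assumed numerical-stability constant

def generator_adversarial_loss(d_fake):
    """Eq. (6): -log D(G(I_LR)), summed over a batch of SR probabilities."""
    return np.sum(-np.log(d_fake + _EPS))

def discriminator_loss(d_real, d_fake):
    """Eq. (7): -log D(I_HR) - log(1 - D(G(I_LR))), summed over the batch."""
    return np.sum(-np.log(d_real + _EPS) - np.log(1.0 - d_fake + _EPS))

# Example with hypothetical discriminator outputs for a batch of 4 images
d_real = np.array([0.90, 0.80, 0.95, 0.85])   # D(I_HR)
d_fake = np.array([0.20, 0.30, 0.10, 0.25])   # D(G(I_LR))
print(generator_adversarial_loss(d_fake), discriminator_loss(d_real, d_fake))
```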

3 Experiment Analysis

We have tested the performance of the proposed method on three different benchmark datasets: Set5 [39], Set14 [38] and BSD100 [40]. The SR results obtained using the proposed method are compared with other existing state-of-the-art methods such as SRCNN [18]¹, VDSR [19]², DRCN [20]³, LapSRN [23]⁴, EnhanceNet [31]⁵ and SRGAN [3]⁶. The quantitative comparison is performed in terms of the PSNR [41] and SSIM [42] metrics, which are calculated after removing the boundary pixels of the Y-channel images in the YCbCr color space [3,31,33].

Training Details and Hyper-parameter Settings: All the experiments have been conducted on a computer with the following specifications: Intel i7-6850K processor @3 GHz × 12, 64 GB RAM and an NVIDIA TITAN X Pascal 12 GB GPU. Our implementation uses the TensorFlow library [43]. For training, we use two datasets: RAISE [44] and DIV2K [33]. The data samples in these datasets are augmented before training with different operations such as flipping, random rotation up to 270° and random downscaling by a factor of 0.5 to 0.7. In the proposed generator network, there are 16 identical residual blocks, and an Adam optimizer [45] with β1 = 0.9 is used to optimize the model. The LR test images are obtained by downsampling the available HR images with factor r = 4.

¹ https://github.com/jbhuang0604/SelfExSR
² http://cv.snu.ac.kr/research/VDSR/
³ http://cv.snu.ac.kr/research/DRCN/
⁴ http://vllab.ucmerced.edu/wlai24/LapSRN/
⁵ http://webdav.tuebingen.mpg.de/pixel/enhancenet/
⁶ https://twitter.app.box.com/s/lcue6vlrd01ljkdtdkhmfvk7vtjhetog
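As an illustration of the evaluation protocol (PSNR on the Y channel after removing boundary pixels), a NumPy sketch is given below. The BT.601 luma coefficients and a crop width equal to the scale factor are common conventions and are assumptions here rather than details stated in the paper.

```python
import numpy as np

def rgb_to_y(img):
    """Y channel of YCbCr (ITU-R BT.601) for an RGB image in [0, 255]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(hr, sr, scale=4):
    """PSNR between Y channels of HR and SR images, cropping `scale`
    boundary pixels on each side (crop width is an assumption)."""
    y_hr = rgb_to_y(hr.astype(np.float64))[scale:-scale, scale:-scale]
    y_sr = rgb_to_y(sr.astype(np.float64))[scale:-scale, scale:-scale]
    mse = np.mean((y_hr - y_sr) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```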


Fig. 4. The SR results obtained using the proposed and other SR methods for a single image of Set14 [38] testing dataset. The zoomed-in regions are also displayed with red and blue borders. The PSNR and SSIM are shown in brackets at the top of all the SR results.


The two proposed models, ISRNet and ISRGAN, are trained in the following manner. We adopt a two-stage training strategy as suggested by Kim et al. [19], which helps to avoid undesired local minima. The proposed ISRNet model is first trained on the RAISE dataset using the MSE-based loss function with a learning rate of 10^-4 for 8 × 10^5 iterations. In the second stage, we train the same ISRNet model on the DIV2K dataset with the same learning rate and loss function; it is initialized with the weights of the first stage and trained for a further 2 × 10^5 iterations. After this, we train the proposed ISRGAN on the RAISE dataset using the VGG54-based perceptual loss for 4 × 10^5 iterations with a learning rate of 10^-4, alternating updates between the generator and discriminator networks as suggested by Goodfellow et al. [1].

Result Analysis: The SR results obtained using the proposed and other existing state-of-the-art methods are depicted in Fig. 4 for a single image of the Set14 dataset. It is worth mentioning that testing of the proposed method is performed on all the datasets; for comparison purposes, however, we show the results for a single Set14 image here. The performance of the proposed method is compared with bicubic interpolation and six other existing state-of-the-art SR methods. In Fig. 4, the first row shows the SR results obtained using bicubic interpolation, SRCNN [18], MS-LapSRN [23] and two modes of EnhanceNet [31]. The second row shows the SR results obtained using SRResNet [3], SRGAN [3] and our proposed ISRNet and ISRGAN, followed by the original HR image. In order to show the improvement in high-frequency details, zoomed-in regions of all the SR results are also displayed along with the complete images, and the quantitative measures, i.e., PSNR and SSIM, are displayed at the top of the SR results (see Fig. 4). Looking at Fig. 4, one can notice the improvement in the results of the proposed ISRGAN method, whose texture details are closer to the original HR image than the SR results of the other techniques. The zoomed-in regions of the ISRGAN result also show less degradation and less noise than those of the other SR methods. This indicates that the proposed ISRGAN is capable of generating more perceptually convincing SR images than recent state-of-the-art methods such as EnhanceNet-PAT [31] and SRGAN [3], mainly because of the densely connected convolutional layers with short connections and a concatenation stage in the proposed residual blocks; such dense networks help to extract hierarchical features. The performance of the proposed ISRNet model is quantitatively compared in terms of PSNR and SSIM in Table 1, against existing CNN- and GAN-based methods. One can observe that the PSNR on the Set5 dataset is highest for the proposed ISRNet method.


Table 1. The quantitative comparison for the SR results obtained using the proposed ISRNet and other existing state-of-the-art approaches. Highest measures are indicated with red color and second highest measures are indicated in blue color.

| Methods            | PSNR Set5 | PSNR Set14 | PSNR BSD100 | SSIM Set5 | SSIM Set14 | SSIM BSD100 |
|--------------------|-----------|------------|-------------|-----------|------------|-------------|
| Bicubic            | 28.4302   | 26.0913    | 25.9619     | 0.8109    | 0.7043     | 0.6675      |
| SRCNN [18]         | 30.0843   | 27.2765    | 26.7046     | 0.8527    | 0.7425     | 0.7016      |
| VDSR [19]          | 31.3537   | 28.1101    | 27.2876     | 0.8839    | 0.7691     | 0.7250      |
| DRCN [20]          | 31.5405   | 28.1219    | 27.2380     | 0.8855    | 0.7686     | 0.7232      |
| LapSRN [23]        | 31.5417   | 28.1852    | 27.3175     | 0.8863    | 0.7706     | 0.7259      |
| MS-LapSRN [23]     | 31.7368   | 28.3559    | 27.4241     | 0.8899    | 0.7749     | 0.7300      |
| EnhanceNet-E [31]  | 31.7568   | 28.4297    | 27.5112     | 0.8886    | 0.7771     | 0.7319      |
| SRResNet [3]       | 32.0786   | 28.5991    | 27.6111     | 0.8936    | 0.7816     | 0.7366      |
| ISRNet             | 32.1154   | 28.5257    | 27.5381     | 0.8932    | 0.7800     | 0.7352      |

However, for the other two datasets, the PSNR and SSIM values of the proposed ISRNet are second highest, and the margin between the first and second highest values is small for all these measures. In addition, the quantitative comparison for the proposed ISRGAN method against existing GAN-based methods is given in Table 2. It shows that the quantitative performance of the proposed ISRGAN is better than that of the other EnhanceNet [31] and SRGAN [3] methods, and that the margin between the proposed ISRGAN and the other GAN-based methods is considerable for all the datasets.

Table 2. The quantitative comparison for the proposed ISRGAN method with the other existing GAN based state-of-the-art methods. Here, boldface values indicate the highest value among others.

| Dataset | Metric | EnhanceNet-PAT [31] | SRGAN [3] | ISRGAN  |
|---------|--------|---------------------|-----------|---------|
| Set5    | PSNR   | 28.5733             | 29.4263   | 29.4978 |
| Set5    | SSIM   | 0.8100              | 0.8353    | 0.8432  |
| Set14   | PSNR   | 25.7748             | 26.1179   | 26.8152 |
| Set14   | SSIM   | 0.6778              | 0.6954    | 0.7180  |
| BSD100  | PSNR   | 24.9358             | 25.1812   | 25.8368 |
| BSD100  | SSIM   | 0.6261              | 0.6403    | 0.6662  |


In addition to the quantitative comparison of our ISRGAN with the other existing GAN-based methods, we also show the SR results obtained using all those methods for different samples of the Set5, Set14 and BSD100 datasets. These SR results are displayed in Fig. 5. Here, one can notice that the SR results of the proposed ISRGAN yield sharper details and also reconstruct better textures when compared to other GAN-based SR approaches such as EnhanceNet [31] and SRGAN [3].

Fig. 5. The comparison of SR results for different GAN based SR methods obtained for random samples of Set5 [39], Set14 [38] and BSD100 [40] datasets.


Fig. 6. The performance improvement of the proposed ISRGAN approach during the training process.


Thus, the proposed ISRGAN reconstructs high-frequency details which are missing in the LR image and produces more realistic-looking SR images. Hence, one can say that the proposed ISRGAN approach is capable of generating sharper high-frequency details which look more natural in appearance. In order to observe the performance improvement of the ISRGAN generator network with respect to the number of iterations, Fig. 6 shows the SR results obtained at specific iteration intervals along with their PSNR and SSIM values. Since the generator network is initialized with the ISRNet weights, the first result corresponds to ISRNet. One can note that after only 20k iterations, the proposed ISRGAN already produces solutions that diverge substantially from those of ISRNet (see the first image in Fig. 6) and contain more high-frequency details. The reconstruction then approaches the original HR image as the number of training iterations increases.

3.1 Conclusion

In this paper, we propose a method for image super-resolution using a generative adversarial network (GAN). It consists of two networks: ISRNet and ISRGAN. First, ISRNet is trained using an MSE-based loss function to achieve higher PSNR and SSIM values. In the next stage, ISRGAN is trained using a combination of a VGG-based perceptual loss and an adversarial loss to capture high-frequency details. The proposed ISRGAN is able to generate SR images which are close to the original HR images. Different experiments have been carried out on benchmark testing datasets, and it is found that ISRGAN performs better in terms of perception when compared to the other existing GAN-based state-of-the-art methods.

Acknowledgment. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

References

1. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)
2. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
3. Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al.: Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802 (2016)
4. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision, pp. 694–711. Springer (2016)


5. Gunturk, B.K., Batur, A.U., Altunbasak, Y., Hayes, M.H., Mersereau, R.M.: Eigenface-domain super-resolution for face recognition. IEEE Trans. Image Process. 12(5), 597–606 (2003)
6. Goto, T., Fukuoka, T., Nagashima, F., Hirano, S., Sakurai, M.: Super-resolution system for 4K-HDTV. In: 2014 22nd International Conference on Pattern Recognition, ICPR, pp. 4453–4458. IEEE (2014)
7. Peled, S., Yeshurun, Y.: Superresolution in MRI: application to human white matter fiber tract visualization by diffusion tensor imaging. Magn. Reson. Med. 45(1), 29–35 (2001)
8. Thornton, M.W., Atkinson, P.M., Holland, D.: Sub-pixel mapping of rural land cover objects from fine spatial resolution satellite sensor imagery using super-resolution pixel-swapping. Int. J. Remote Sens. 27(3), 473–491 (2006)
9. Zhang, L., Zhang, H., Shen, H., Li, P.: A super-resolution reconstruction algorithm for surveillance images. Signal Process. 90(3), 848–859 (2010)
10. Yang, C.-Y., Ma, C., Yang, M.-H.: Single-image super-resolution: a benchmark. In: European Conference on Computer Vision, pp. 372–386. Springer (2014)
11. Hayat, K.: Super-resolution via deep learning. arXiv preprint arXiv:1706.09077 (2017)
12. Nasrollahi, K., Moeslund, T.B.: Super-resolution: a comprehensive survey. Mach. Vis. Appl. 25(6), 1423–1468 (2014)
13. Glasner, D., Bagon, S., Irani, M.: Super-resolution from a single image. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 349–356. IEEE (2009)
14. Kim, K.I., Kwon, Y.: Single-image super-resolution using sparse regression and natural image prior. IEEE Trans. Pattern Anal. Mach. Intell. 32(6), 1127–1133 (2010)
15. Timofte, R., De Smet, V., Van Gool, L.: Anchored neighborhood regression for fast example-based super-resolution. In: 2013 IEEE International Conference on Computer Vision, ICCV, pp. 1920–1927. IEEE (2013)
16. Yang, J., Lin, Z., Cohen, S.: Fast image super-resolution based on in-place example regression. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 1059–1066. IEEE (2013)
17. Peleg, T., Elad, M.: A statistical prediction model based on sparse representations for single image super-resolution. IEEE Trans. Image Process. 23(6), 2569–2582 (2014)
18. Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 38(2), 295–307 (2016)
19. Kim, J., Kwon Lee, J., Mu Lee, K.: Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1646–1654 (2016)
20. Kim, J., Kwon Lee, J., Mu Lee, K.: Deeply-recursive convolutional network for image super-resolution. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 1637–1645 (2016)
21. Tai, Y., Yang, J., Liu, X.: Image super-resolution via deep recursive residual network. In: The IEEE Conference on Computer Vision and Pattern Recognition, CVPR, vol. 1, no. 4 (2017)
22. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
23. Lai, W.-S., Huang, J.-B., Ahuja, N., Yang, M.-H.: Fast and accurate image super-resolution with deep Laplacian pyramid networks. arXiv preprint arXiv:1710.01992 (2017)


24. Barron, J.T.: A more general robust loss function. arXiv preprint arXiv:1701.03077 (2017)
25. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems, pp. 2234–2242 (2016)
26. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017)
27. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.: Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028 (2017)
28. Berthelot, D., Schumm, T., Metz, L.: BEGAN: boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717 (2017)
29. Bruna, J., Sprechmann, P., LeCun, Y.: Super-resolution with deep convolutional sufficient statistics. arXiv preprint arXiv:1511.05666 (2015)
30. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
31. Sajjadi, M.S., Schölkopf, B., Hirsch, M.: EnhanceNet: single image super-resolution through automated texture synthesis. In: 2017 IEEE International Conference on Computer Vision, ICCV, pp. 4501–4510. IEEE (2017)
32. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 2261–2269 (2017)
33. Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M.: Enhanced deep residual networks for single image super-resolution. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, vol. 1, no. 2, p. 3 (2017)
34. Lin, M., Chen, Q., Yan, S.: Network in network. arXiv preprint arXiv:1312.4400 (2013)
35. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
36. Odena, A., Dumoulin, V., Olah, C.: Deconvolution and checkerboard artifacts. Distill 1(10), e3 (2016)
37. Dosovitskiy, A., Brox, T.: Generating images with perceptual similarity metrics based on deep networks. In: Advances in Neural Information Processing Systems, pp. 658–666 (2016)
38. Zeyde, R., Elad, M., Protter, M.: On single image scale-up using sparse representations. In: International Conference on Curves and Surfaces, pp. 711–730. Springer (2010)
39. Bevilacqua, M., Roumy, A., Guillemot, C., Alberi-Morel, M.L.: Low-complexity single-image super-resolution based on nonnegative neighbor embedding (2012)
40. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proceedings of the Eighth IEEE International Conference on Computer Vision, ICCV 2001, vol. 2, pp. 416–423. IEEE (2001)
41. Hore, A., Ziou, D.: Image quality metrics: PSNR vs. SSIM. In: 20th International Conference on Pattern Recognition, ICPR, pp. 2366–2369 (2010)
42. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)


43. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D.G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: a system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, pp. 265–283 (2016). https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf
44. Dang-Nguyen, D.-T., Pasquini, C., Conotter, V., Boato, G.: RAISE: a raw images dataset for digital image forensics. In: Proceedings of the 6th ACM Multimedia Systems Conference, pp. 219–224. ACM (2015)
45. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

Deep Learning vs. Traditional Computer Vision

Niall O'Mahony(&), Sean Campbell, Anderson Carvalho, Suman Harapanahalli, Gustavo Velasco Hernandez, Lenka Krpalkova, Daniel Riordan, and Joseph Walsh

IMaR Technology Gateway, Institute of Technology Tralee, Tralee, Ireland
[email protected]

Abstract. Deep Learning has pushed the limits of what was possible in the domain of Digital Image Processing. However, that is not to say that the traditional computer vision techniques which had been undergoing progressive development in years prior to the rise of DL have become obsolete. This paper will analyse the benefits and drawbacks of each approach. The aim of this paper is to promote a discussion on whether knowledge of classical computer vision techniques should be maintained. The paper will also explore how the two sides of computer vision can be combined. Several recent hybrid methodologies are reviewed which have demonstrated the ability to improve computer vision performance and to tackle problems not suited to Deep Learning. For example, combining traditional computer vision techniques with Deep Learning has been popular in emerging domains such as Panoramic Vision and 3D vision for which Deep Learning models have not yet been fully optimised.

Keywords: Computer vision · Deep learning · Hybrid techniques

1 Introduction

Deep Learning (DL) is used in the domain of digital image processing to solve difficult problems (e.g. image colourization, classification, segmentation and detection). DL methods such as Convolutional Neural Networks (CNNs) mostly improve prediction performance using big data and plentiful computing resources and have pushed the boundaries of what was possible. Problems which were assumed to be unsolvable are now being solved with super-human accuracy. Image classification is a prime example of this. Since being reignited by Krizhevsky, Sutskever and Hinton in 2012 [1], DL has dominated the domain due to substantially better performance compared to traditional methods. Is DL making traditional Computer Vision (CV) techniques obsolete? Has DL superseded traditional computer vision? Is there still a need to study traditional CV techniques when DL seems to be so effective? These are all questions which have been brought up in the community in recent years [2], and which this paper intends to address. Additionally, DL is not going to solve all CV problems. There are some problems where traditional techniques with global features are a better solution. The advent of DL may also open many doors for using traditional techniques to overcome the


many challenges DL brings (e.g. computing power, time, accuracy, characteristics and quantity of inputs, among others). This paper will provide a comparison of deep learning to the more traditional hand-crafted feature definition approaches which dominated CV prior to it. There has been so much progress in Deep Learning in recent years that it is impossible for this paper to capture the many facets and sub-domains of Deep Learning which are tackling the most pertinent problems in CV today. This paper will review traditional algorithmic approaches in CV, and more particularly, the applications in which they have been used as an adequate substitute for DL, to complement DL and to tackle problems DL cannot. The paper will then move on to review some of the recent activities in combining DL with CV, with a focus on the state-of-the-art techniques for emerging technology such as 3D perception, namely object registration, object detection and semantic segmentation of 3D point clouds. Finally, developments and possible directions for getting the performance of 3D DL to the same heights as 2D DL are discussed, along with an outlook on the impact the increased use of 3D will have on CV in general.

2 A Comparison of Deep Learning and Traditional Computer Vision

2.1 What Is Deep Learning

To gain a fundamental understanding of DL we need to consider the difference between descriptive analysis and predictive analysis. Descriptive analysis involves defining a comprehensible mathematical model which describes the phenomenon we wish to observe. This entails collecting data about a process, forming hypotheses on patterns in the data, and validating these hypotheses by comparing the outcome of the descriptive models with the real outcome [3]. Producing such models is precarious, however, because there is always a risk of unmodelled variables that scientists and engineers neglect to include due to ignorance or a failure to understand some complex, hidden or non-intuitive phenomena [4]. Predictive analysis involves discovering the rules that underlie a phenomenon and forming a predictive model which minimises the error between the actual and the predicted outcome, considering all possible interfering factors [3]. Machine learning departs from the traditional programming paradigm: problem analysis is replaced by a training framework in which the system is fed a large number of training patterns (sets of inputs for which the desired outputs are known) which it learns from and uses to compute new patterns [5]. DL is a subset of machine learning and is based largely on Artificial Neural Networks (ANNs), a computing paradigm inspired by the functioning of the human brain. Like the human brain, an ANN is composed of many computing cells or 'neurons' that each perform a simple operation and interact with each other to make a decision [6]. Deep Learning is concerned with learning, or 'credit assignment', across many layers of a neural network accurately, efficiently and without supervision, and is of recent interest due to enabling advancements in processing hardware [7]. Self-organisation and the


exploitation of interactions between small units have proven to perform better than central control, particularly for complex non-linear process models, in that better fault tolerance and adaptability to new data are achievable [7].

2.2 Advantages of Deep Learning

Rapid progress in DL and improvements in device capabilities, including computing power, memory capacity, power consumption, image sensor resolution and optics, have improved the performance and cost-effectiveness of vision-based applications and further quickened their spread. Compared to traditional CV techniques, DL enables CV engineers to achieve greater accuracy in tasks such as image classification, semantic segmentation, object detection and Simultaneous Localization and Mapping (SLAM). Since the neural networks used in DL are trained rather than programmed, applications using this approach often require less expert analysis and fine-tuning and exploit the tremendous amount of video data available in today's systems. DL also provides superior flexibility because CNN models and frameworks can be re-trained using a custom dataset for any use case, contrary to CV algorithms, which tend to be more domain-specific. Taking the problem of object detection on a mobile robot as an example, we can compare the two types of algorithms for computer vision. The traditional approach is to use well-established CV techniques such as feature descriptors (SIFT, SURF, BRIEF, etc.) for object detection. Before the emergence of DL, a step called feature extraction was carried out for tasks such as image classification. Features are small "interesting", descriptive or informative patches in images. Several CV algorithms, such as edge detection, corner detection or threshold segmentation, may be involved in this step. As many features as practicable are extracted from images and these features form a definition (known as a bag-of-words) of each object class. At the deployment stage, these definitions are searched for in other images. If a significant number of features from one bag-of-words are found in another image, the image is classified as containing that specific object (i.e. chair, horse, etc.). The difficulty with this traditional approach is that it is necessary to choose which features are important in each given image. As the number of classes to classify increases, feature extraction becomes more and more cumbersome. It is up to the CV engineer's judgment and a long trial-and-error process to decide which features best describe different classes of objects. Moreover, each feature definition requires dealing with a plethora of parameters, all of which must be fine-tuned by the CV engineer. DL introduced the concept of end-to-end learning, where the machine is just given a dataset of images which have been annotated with the classes of object present in each image [7]. Thereby a DL model is 'trained' on the given data, where the neural network discovers the underlying patterns in classes of images and automatically works out the most descriptive and salient features for each specific class of object. It has been well established that DNNs perform far better than traditional algorithms, albeit with trade-offs with respect to computing requirements and training time. With all the state-of-the-art approaches in CV employing this methodology, the workflow of the CV engineer has changed dramatically, where the knowledge


and expertise in extracting hand-crafted features has been replaced by knowledge and expertise in iterating through deep learning architectures as depicted in Fig. 1.

Fig. 1. (a) Traditional computer vision workflow vs. (b) Deep learning workflow. Figure from [8].
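As a concrete illustration of the traditional pipeline in Fig. 1(a), the sketch below builds a toy bag-of-visual-words representation with ORB features (used here for availability in stock OpenCV, in place of the SIFT/SURF descriptors mentioned above) and a k-means vocabulary. The vocabulary size and the choice of classifier are arbitrary assumptions for the example.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def bovw_histograms(gray_images, n_words=50):
    """Extract ORB descriptors, cluster them into a visual vocabulary and
    describe each image as a histogram of visual-word occurrences."""
    orb = cv2.ORB_create()
    per_image = []
    for img in gray_images:
        _, desc = orb.detectAndCompute(img, None)
        per_image.append(desc if desc is not None else np.empty((0, 32), np.uint8))
    vocab = KMeans(n_clusters=n_words, n_init=10).fit(
        np.vstack([d for d in per_image if len(d)]).astype(np.float32))
    hists = []
    for desc in per_image:
        words = (vocab.predict(desc.astype(np.float32))
                 if len(desc) else np.empty(0, dtype=int))
        hists.append(np.bincount(words, minlength=n_words).astype(np.float32))
    return np.array(hists), vocab

# A classical classifier such as an SVM is then trained on the histograms, e.g.:
# hists, vocab = bovw_histograms(train_images)
# clf = SVC(kernel='linear').fit(hists, train_labels)
```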

The development of CNNs has had a tremendous influence in the field of CV in recent years and is responsible for a big jump in the ability to recognize objects [9]. This burst in progress has been enabled by an increase in computing power as well as an increase in the amount of data available for training neural networks. The recent explosion in, and wide-spread adoption of, various deep neural network architectures for CV is apparent in the fact that the seminal paper ImageNet Classification with Deep Convolutional Neural Networks has been cited over 3000 times [2]. CNNs make use of kernels (also known as filters) to detect features (e.g. edges) throughout an image. A kernel is just a matrix of values, called weights, which are trained to detect specific features. As their name indicates, the main idea behind CNNs is to spatially convolve the kernel on a given input image and check whether the feature it is meant to detect is present. To provide a value representing how confident it is that a specific feature is present, a convolution operation is carried out by computing the dot product of the kernel and the input area where the kernel overlaps (the area of the original image the kernel is looking at is known as the receptive field [10]). To facilitate the learning of kernel weights, the convolution layer's output is summed with a bias term and then fed to a non-linear activation function. Activation functions are usually non-linear functions like Sigmoid, TanH and ReLU (Rectified Linear Unit). Depending on the nature of the data and the classification task, these activation functions are selected accordingly [11]. For example, ReLUs are known to be more biologically plausible (neurons in the brain either fire or they don't). As a result, ReLU yields favourable results for image recognition tasks, as it is less susceptible to the vanishing gradient problem and it produces sparser, more efficient representations [7]. To speed up the training process and reduce the amount of memory consumed by the network, the convolutional layer is often followed by a pooling layer to remove redundancy present in the input feature. For example, max pooling moves a window


over the input and simply outputs the maximum value in that window, effectively reducing the input to its most important pixels [7]. As shown in Fig. 2, deep CNNs may have several pairs of convolutional and pooling layers. Finally, a fully connected layer flattens the previous layer's volume into a feature vector, and an output layer then computes the scores (confidence or probabilities) for the output classes/features through a dense network. This output is then passed to a regression function such as Softmax [12], for example, which maps everything to a vector whose elements sum up to one [7].

Fig. 2. Building blocks of a CNN. Figure from [13]
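The building blocks just described can be written in a few lines of NumPy. The sketch below is illustrative only (a single channel, no padding or stride options, and cross-correlation as used in CNN libraries), not an efficient or complete implementation.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution: dot product of the kernel with each receptive field."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(0.0, x)          # non-linear activation

def max_pool(x, size=2):
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

def softmax(scores):
    e = np.exp(scores - scores.max())  # stable softmax; outputs sum to one
    return e / e.sum()

image = np.random.rand(8, 8)
kernel = np.random.randn(3, 3)         # in a CNN these weights are learned
probs = softmax(max_pool(relu(conv2d(image, kernel))).ravel()[:3])
```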

But DL is still only a tool of CV. For example, the most common neural network used in CV is the CNN. But what is a convolution? It is in fact a widely used image processing technique (e.g. see Sobel edge detection). The advantages of DL are clear, and it would be beyond the scope of this paper to review the state-of-the-art. DL is certainly not the panacea for all problems either; as we will see in the following sections of this paper, there are problems and applications where more conventional CV algorithms are more suitable.
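As a reminder that convolution predates deep learning, the snippet below applies the classic hand-crafted Sobel kernels with the same convolution operation a CNN uses; SciPy's convolve2d is used here purely for brevity.

```python
import numpy as np
from scipy.signal import convolve2d

# Hand-crafted Sobel kernels: fixed weights, not learned from data
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

def sobel_edges(gray):
    gx = convolve2d(gray, sobel_x, mode='same', boundary='symm')
    gy = convolve2d(gray, sobel_y, mode='same', boundary='symm')
    return np.hypot(gx, gy)   # gradient magnitude highlights edges
```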

2.3 Advantages of Traditional Computer Vision Techniques

This section will detail how the traditional feature-based approaches such as those listed below have been shown to be useful in improving performance in CV tasks:
• Scale Invariant Feature Transform (SIFT) [14]
• Speeded Up Robust Features (SURF) [15]
• Features from Accelerated Segment Test (FAST) [16]
• Hough transforms [17]
• Geometric hashing [18]

Feature descriptors such as SIFT and SURF are generally combined with traditional machine learning classification algorithms such as Support Vector Machines and K-Nearest Neighbours to solve the aforementioned CV problems. DL is sometimes overkill: traditional CV techniques can often solve a problem much more efficiently and in fewer lines of code than DL. Algorithms like SIFT, and even simple colour thresholding and pixel counting algorithms, are not class-specific; that is, they are very general and perform the same for any image. In contrast, features


learned from a deep neural net are specific to the training dataset which, if not well constructed, probably won't perform well for images different from the training set. Therefore, SIFT and other algorithms are often used for applications such as image stitching/3D mesh reconstruction which don't require specific class knowledge. These tasks have been shown to be achievable by training on large datasets; however, this requires a huge research effort and it is not practical to go through this effort for a closed application. One needs to exercise common sense when it comes to choosing which route to take for a given CV application. For example, consider classifying two classes of product on an assembly line conveyor belt, one with red paint and one with blue paint. A deep neural net will work, given that enough data can be collected for training. However, the same can be achieved by using simple colour thresholding, as sketched below. Some problems can be tackled with simpler and faster techniques. What if a DNN performs poorly outside of the training data? If the training dataset is limited, then the machine may overfit to the training data and be unable to generalize for the task at hand. It would be too difficult to manually tweak the parameters of the model because a DNN has millions of parameters inside it, each with complex inter-relationships. For this reason, DL models have been criticised as being a black box [5]. Traditional CV has full transparency, and one can judge whether a solution will work outside of the training environment. The CV engineer can have insights into a problem that they can transfer to their algorithm, and if anything fails, the parameters can be tweaked to perform well for a wider range of images. Today, traditional techniques are used when the problem can be simplified so that they can be deployed on low-cost microcontrollers, or to limit the problem for deep learning techniques by highlighting certain features in the data, augmenting the data [19] or aiding in dataset annotation [20]. We will discuss later in this paper how many image transformation techniques can be used to improve neural net training. Finally, there are many more challenging problems in CV, such as robotics [21], augmented reality [22], automatic panorama stitching [23], virtual reality [24], 3D modelling [24], motion estimation [24], video stabilization [21], motion capture [24], video processing [21] and scene understanding [25], which cannot easily be implemented in a differentiable manner with deep learning but benefit from solutions using "traditional" techniques.
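A minimal sketch of the colour-thresholding classifier for the red-vs-blue example is given below; the HSV ranges are rough assumptions and would need tuning for real cameras and lighting conditions.

```python
import cv2
import numpy as np

# Approximate HSV ranges for 'red' and 'blue' paint (assumed values, tune per setup).
RED_LO, RED_HI = np.array([0, 120, 70]), np.array([10, 255, 255])
BLUE_LO, BLUE_HI = np.array([100, 120, 70]), np.array([130, 255, 255])

def classify_product(bgr_image):
    """Return 'red' or 'blue' by counting pixels inside each colour range."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    red_pixels = cv2.countNonZero(cv2.inRange(hsv, RED_LO, RED_HI))
    blue_pixels = cv2.countNonZero(cv2.inRange(hsv, BLUE_LO, BLUE_HI))
    return 'red' if red_pixels > blue_pixels else 'blue'
```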

3 Challenges for Traditional Computer Vision

3.1 Mixing Hand-Crafted Approaches with DL for Better Performance

There are clear trade-offs between traditional CV and deep learning-based approaches. Classic CV algorithms are well-established, transparent, and optimized for performance and power efficiency, while DL offers greater accuracy and versatility at the cost of large amounts of computing resources. Hybrid approaches combine traditional CV and deep learning and offer the advantages of both methodologies. They are especially practical in high-performance systems which need to be implemented quickly.


For example, in a security camera, a CV algorithm can efficiently detect faces or other features [26] or moving objects [27] in the scene. These detections can then be passed to a DNN for identity verification or object classification. The DNN need only be applied to a small patch of the image, saving significant computing resources and training effort compared to what would be required to process the entire frame. The fusion of machine learning metrics and deep networks has become very popular simply because it can generate better models. Hybrid vision processing implementations can introduce a performance advantage and 'can deliver a 130X–1,000X reduction in multiply-accumulate operations and about 10X improvement in frame rates compared to a pure DL solution. Furthermore, the hybrid implementation uses about half of the memory bandwidth and requires significantly lower CPU resources' [28].
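A minimal sketch of such a hybrid pipeline follows, assuming OpenCV 4 for the classic motion-detection front end; `classify_patch` is a hypothetical stand-in for whatever DNN performs the final classification, and the contour-area threshold is illustrative.

```python
import cv2

def classify_patch(patch):
    """Hypothetical DNN classifier; replace with an actual trained model."""
    return "person"  # placeholder result

# Classic CV front end: background subtraction flags moving regions cheaply.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

def process_frame(frame):
    mask = subtractor.apply(frame)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    results = []
    for c in contours:
        if cv2.contourArea(c) < 500:          # ignore small blobs (illustrative threshold)
            continue
        x, y, w, h = cv2.boundingRect(c)
        # Only the detected patch is handed to the (expensive) DNN.
        results.append(((x, y, w, h), classify_patch(frame[y:y + h, x:x + w])))
    return results
```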

3.2 Overcoming the Challenges of Deep Learning

There are also challenges introduced by DL. The latest DL approaches may achieve substantially better accuracy; however, this jump comes at the cost of billions of additional math operations and an increased requirement for processing power. DL requires these computing resources for training and, to a lesser extent, for inference. Dedicated hardware (e.g. high-powered GPUs [29] and TPUs [30] for training, and AI-accelerated platforms such as VPUs for inference [31]) is essential for developers of AI. Vision processing results using DL are also dependent on image resolution. Achieving adequate performance in object classification, for example, requires high-resolution images or video, with the consequent increase in the amount of data that needs to be processed, stored, and transferred. Image resolution is especially important for applications in which it is necessary to detect and classify objects in the distance, e.g. in security camera footage. The frame reduction techniques discussed previously, such as using SIFT features [26, 32] or optical flow for moving objects [27] to first identify a region of interest, are useful with respect to image resolution and also with respect to reducing the time and data required for training.

DL needs big data. Often millions of data records are required. For example, the PASCAL VOC dataset consists of 500K images with 20 object categories [26, 33], ImageNet consists of 1.5 million images with 1000 object categories [34] and Microsoft Common Objects in Context (COCO) consists of 2.5 million images with 91 object categories [35]. When big datasets or high-performance computing facilities are unavailable, traditional methods come into play.

Training a DNN takes a very long time. Depending on computing hardware availability, training can take a matter of hours or days. Moreover, training for any given application often requires many iterations, as it entails trial and error with different training parameters. The most common technique to reduce training time is transfer learning [36]. With respect to traditional CV, the discrete Fourier transform is another CV technique which once experienced major popularity but now seems obscure. The algorithm can be used to speed up convolutions, as demonstrated by [37, 38], and hence may again become of major importance.


However, it must be said that easier, more domain-specific tasks than general image classification will not require as much data (in the order of hundreds or thousands rather than millions). This is still a considerable amount of data, and CV techniques are often used to boost training data through data augmentation or to reduce the data down to a particular type of feature through other pre-processing steps.

Pre-processing entails transforming the data (usually with traditional CV techniques) so that relationships and patterns can be more easily interpreted before training the model. Data augmentation is a common pre-processing task which is used when there is limited training data. It can involve performing random rotations, shifts, shears, etc. on the images in the training set to effectively increase the number of training images [19]; a minimal sketch is given below. Another approach is to highlight features of interest before passing the data to a CNN with CV-based methods such as background subtraction and segmentation [39].
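A minimal sketch of such label-preserving augmentation, assuming OpenCV and NumPy; the rotation, shift, and shear ranges are illustrative choices, not values from the paper.

```python
import cv2
import numpy as np

def random_augment(image, rng=np.random.default_rng()):
    """Apply a random rotation, shift, and shear to one training image."""
    h, w = image.shape[:2]

    # Random rotation about the image centre (illustrative +/- 15 degrees).
    angle = rng.uniform(-15, 15)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)

    # Random shift of up to 10% of the image size, folded into the same matrix.
    M[0, 2] += rng.uniform(-0.1, 0.1) * w
    M[1, 2] += rng.uniform(-0.1, 0.1) * h

    # Small random horizontal shear.
    M[0, 1] += rng.uniform(-0.1, 0.1)

    return cv2.warpAffine(image, M, (w, h), borderMode=cv2.BORDER_REFLECT)

# Each call yields a new, slightly different view of the same labelled image,
# so the effective training set grows without additional annotation effort.
```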

3.3 Making Best Use of Edge Computing

If algorithms and neural network inferences can be run at the edge, then latency, costs, cloud storage and processing requirements, and bandwidth requirements are reduced compared to cloud-based implementations. Edge computing can also address privacy and security requirements by avoiding transmission of sensitive or identifiable data over the network. Hybrid or composite approaches involving conventional CV and DL take great advantage of the heterogeneous computing capabilities available at the edge. A heterogeneous compute architecture consists of a combination of CPUs, microcontroller coprocessors, Digital Signal Processors (DSPs), Field Programmable Gate Arrays (FPGAs) and AI accelerating devices [31], and can be power-efficient by assigning different workloads to the most efficient compute engine. Test implementations show a 10x reduction in latency for object detection when DL inferences are executed on a DSP versus a CPU [28].

Several hybrids of deep learning and hand-crafted feature based approaches have demonstrated their benefits in edge applications. For example, for facial expression recognition, [41] propose a new feature loss to embed the information of hand-crafted features into the training process of the network, which tries to reduce the difference between hand-crafted features and the features learned by the deep neural network. The use of hybrid approaches has also been shown to be advantageous in incorporating data from other sensors on edge nodes. Such a hybrid model, where deep learning is assisted by additional sensor sources like synthetic aperture radar (SAR) imagery and elevation data, is presented by [40]. In the context of 3D robot vision, [42] have shown that combining both linear subspace methods and deep convolutional prediction achieves improved performance along with several orders of magnitude faster runtime compared to the state of the art.

3.4 Problems Not Suited to Deep Learning

There are many more challenging problems in CV such as: robotics, augmented reality, automatic panorama stitching, virtual reality, 3D modelling, motion estimation, video stabilization, motion capture, video processing and scene understanding, which cannot simply be easily implemented in a differentiable manner with deep learning but need to be solved using other "traditional" techniques.


DL excels at solving closed-end classification problems, in which a wide range of potential signals must be mapped onto a limited number of categories, given that there is enough data available and the test set closely resembles the training set. However, deviations from these assumptions can cause problems, and it is critical to acknowledge the problems which DL is not good at. Marcus et al. present ten concerns for deep learning and suggest that deep learning must be supplemented by other techniques if we are to reach artificial general intelligence [43]. As well as discussing the limitations of the training procedure and the intense computing and data requirements, as we do in our paper, key to their discussion is identifying problems where DL performs poorly and where it can be supplemented by other techniques. One such problem is the limited ability of DL algorithms to learn visual relations, i.e. identifying whether multiple objects in an image are the same or different. This limitation has been demonstrated by [43], who argue that feedback mechanisms including attention and perceptual grouping may be the key computational components to realising abstract visual reasoning.

It is also worth noting that ML models find it difficult to deal with priors; that is, not everything can be learnt from data, so some priors must be injected into the models [44, 45]. Solutions that have to do with 3D CV need strong priors in order to work well, e.g. image-based 3D modelling requires smoothness, silhouette and illumination information [46].

Below are some emerging fields in CV where DL faces new challenges and where classic CV will have a more prominent role.

3.5 3D Vision

3D vision systems are becoming increasingly accessible and as such there has been a lot of progress in the design of 3D Convolutional Neural Networks (3D CNNs). This emerging field is known as Geometric Deep Learning and has multiple applications such as video classification, computer graphics, vision and robotics. This paper will focus on 3D CNNs for processing data from 3D vision systems. Whereas in 2D convolutional layers the kernel has the same depth as the input so as to output a 2D matrix, the depth of a 3D convolutional kernel must be less than that of the 3D input volume so that the output of the convolution is also 3D and spatial information is preserved; a minimal shape example is given below. The size of the input is much larger in terms of memory than conventional RGB images, and the kernel must also be convolved through the input space in 3 dimensions (see Fig. 3). As a result, the computational complexity of 3D CNNs grows cubically with resolution. Compared to 2D image processing, 3D CV is made even more difficult as the extra dimension introduces more uncertainties, such as occlusions and different camera angles, as shown in Fig. 4. FFT-based methods can optimise 3D CNNs and reduce the amount of computation, at the cost of increased memory requirements however. Recent research has seen the implementation of the Winograd Minimal Filtering Algorithm (WMFA) achieve a twofold speedup compared to cuDNN (NVIDIA's deep learning primitives library) without increasing the required memory [49].
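As a minimal sketch of the shape bookkeeping described above, assuming PyTorch; the channel counts and kernel size are arbitrary illustrative values.

```python
import torch
import torch.nn as nn

# A 3D convolution slides a k x k x k kernel through depth, height and width,
# so the output keeps a (reduced) depth dimension rather than collapsing to 2D.
conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3)

volume = torch.randn(1, 1, 32, 64, 64)   # (batch, channels, depth, height, width)
features = conv3d(volume)

print(features.shape)  # torch.Size([1, 8, 30, 62, 62]): still a 3D feature volume
```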


Fig. 3. 2DCNN vs. 3D CNN [47]

Fig. 4. 3D object detection in point clouds is a challenging problem due to discrete sampling, noisy scans, occlusions and cluttered scenes. Figure from [48].

The next section will include some solutions, with novel architectures and pre-processing steps for various 3D data representations, which have been proposed to overcome these challenges. Geometric Deep Learning (GDL) deals with the extension of DL techniques to 3D data. 3D data can be represented in a variety of different ways which can be classified as Euclidean or non-Euclidean [50]. 3D Euclidean-structured data has an underlying grid structure that allows for a global parametrization and a common system of coordinates, as in 2D images. This allows existing 2D DL paradigms and 2D CNNs to be applied to 3D data. 3D Euclidean data is more suitable for analysing simple rigid objects such as chairs, planes, etc., e.g. with voxel-based approaches [51]. On the other hand, 3D non-Euclidean data do not have a gridded array structure and there is no global parametrization.


Therefore, extending classical DL techniques to such representations is a challenging task and has only recently been realized with architectures such as PointNet [52]. Continuous shape information that is useful for recognition is often lost in the conversion to a voxel representation; a minimal voxelization sketch is given below. With respect to traditional CV algorithms, [53] propose a single-dimensional feature that can be applied to voxel CNNs: a novel rotation-invariant feature based on mean curvature that improves shape recognition for voxel CNNs. The method was very successful; when applied to the state-of-the-art voxel CNN Octnet architecture, a 1% overall accuracy increase was achieved on the ModelNet10 dataset.
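To make the voxel discussion concrete, below is a minimal, assumed NumPy sketch of converting a point cloud to an occupancy voxel grid; the grid resolution is an arbitrary choice, and the information loss mentioned above comes precisely from this discretization.

```python
import numpy as np

def voxelize(points, resolution=32):
    """Convert an (N, 3) point cloud to a binary occupancy grid of shape (res, res, res)."""
    mins = points.min(axis=0)
    maxs = points.max(axis=0)
    # Scale points into [0, resolution) along each axis.
    scaled = (points - mins) / np.maximum(maxs - mins, 1e-9) * (resolution - 1)
    idx = np.floor(scaled).astype(int)

    grid = np.zeros((resolution, resolution, resolution), dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1   # many points may collapse into one voxel
    return grid

# Example: 10,000 random surface points reduced to a 32^3 occupancy grid.
cloud = np.random.rand(10000, 3)
occupancy = voxelize(cloud)
print(occupancy.sum(), "occupied voxels out of", occupancy.size)
```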

3.6 SLAM

Visual SLAM is a subset of SLAM where a vision system is used instead of LiDAR for the registration of landmarks in a scene. Visual SLAM has the advantages of photogrammetry (rich visual data, low cost, light weight and low power consumption) without the associated heavy computational workload involved in post-processing. The visual SLAM problem consists of steps such as environment sensing, data matching, motion estimation, as well as location update and registration of new landmarks [54]. Building a model of how visual objects appear in different conditions such as 3D rotation, scaling and lighting, and extending from that representation using a strong form of transfer learning to achieve zero/one-shot learning, is a challenging problem in this domain. Feature extraction and data representation methods can be useful to reduce the number of training examples needed for an ML model [55].

A two-step approach is commonly used in image-based localization: place recognition followed by pose estimation. The former computes a global descriptor for each of the images by aggregating local image descriptors, e.g. SIFT, using the bag-of-words approach. Each global descriptor is stored in the database together with the camera pose of its associated image with respect to the 3D point cloud reference map. Similar global descriptors are extracted from the query image and the closest global descriptor in the database can be retrieved via an efficient search. The camera pose of the closest global descriptor gives a coarse localization of the query image with respect to the reference map. In pose estimation, the exact pose of the query image is calculated more precisely with algorithms such as the Perspective-n-Point (PnP) [13] and geometric verification [18] algorithms [56]; a minimal retrieval-plus-PnP sketch is given at the end of this subsection.

The success of image-based place recognition is largely attributed to the ability to extract image feature descriptors. Unfortunately, there is no algorithm to extract local features similar to SIFT for LiDAR scans. A 3D scene is composed of 3D points and database images. One approach associates each 3D point with a set of SIFT descriptors corresponding to the image features from which the point was triangulated. These descriptors can then be averaged into a single SIFT descriptor that describes the appearance of that point [57]. Another approach constructs multi-modal features from RGB-D data rather than from depth processing alone; for the depth processing part, they adopt the well-known colourization method based on surface normals, since it has been proved to be effective and robust across tasks [58].


Another alternative approach utilizing traditional CV techniques presents the Force Histogram Decomposition (FHD), a graph-based hierarchical descriptor that allows the spatial relations and shape information between the pairwise structural subparts of objects to be characterized. An advantage of this learning procedure is its compatibility with traditional bag-of-features frameworks, allowing for hybrid representations gathering structural and local features [59].
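A minimal sketch of the two-step localization idea described above, assuming OpenCV. Averaging SIFT descriptors is a crude stand-in for a proper bag-of-words aggregation, and the database layout, camera intrinsics, and 2D–3D correspondences are hypothetical placeholders.

```python
import cv2
import numpy as np

sift = cv2.SIFT_create()

def global_descriptor(gray_image):
    """Very crude global descriptor: the mean of the local SIFT descriptors."""
    _, descriptors = sift.detectAndCompute(gray_image, None)
    return descriptors.mean(axis=0)

def coarse_localize(query_gray, database):
    """Step 1 (place recognition): nearest global descriptor in the database.
    `database` is a list of (global_descriptor, camera_pose, points_3d, points_2d)."""
    q = global_descriptor(query_gray)
    return min(database, key=lambda entry: np.linalg.norm(entry[0] - q))

def refine_pose(points_3d, points_2d, camera_matrix):
    """Step 2 (pose estimation): PnP from known 2D-3D correspondences."""
    ok, rvec, tvec = cv2.solvePnP(points_3d, points_2d, camera_matrix, None)
    if not ok:
        return None
    return rvec, tvec
```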

3.7 360 Cameras

A 360 camera, also known as an omnidirectional, spherical or panoramic camera, is a camera with a 360-degree field of view in the horizontal plane, or with a visual field that covers (approximately) the entire sphere. Omnidirectional cameras are important in applications such as robotics where large visual field coverage is needed. A 360 camera can replace multiple monocular cameras and eliminate blind spots, which is obviously advantageous in omnidirectional Unmanned Ground Vehicles (UGVs) and Unmanned Aerial Vehicles (UAVs). Thanks to the imaging characteristics of spherical cameras, each image captures the 360° panorama of the scene, eliminating the limitation on available steering choices. One of the major challenges with spherical images is the heavy barrel distortion due to the ultra-wide-angle fisheye lens, which complicates the implementation of conventional human-vision-inspired methods such as lane detection and trajectory tracking. Additional pre-processing steps such as prior calibration and de-warping are often required; a minimal undistortion sketch is given below. An alternative approach has been presented by [60], who circumvent these pre-processing steps by formulating navigation as a classification problem of finding the optimal potential path orientation directly from the raw, uncalibrated spherical images. Panorama stitching is another open research problem in this area. A real-time stitching methodology [61] uses a group of deformable meshes and the final image, and combines the inputs using a robust pixel-shader. Another approach [62] combines the accuracy provided by geometric reasoning (lines and vanishing points) with the higher level of data abstraction and pattern recognition achieved by DL techniques (edge and normal maps) to extract structure and generate layout hypotheses for indoor scenes. In sparsely structured scenes, feature-based image alignment methods often fail due to a shortage of distinct image features. Instead, direct image alignment methods, such as those based on phase correlation, can be applied. Correlation-based image alignment techniques based on discriminative correlation filters (DCF) have been investigated by [23], who show that the proposed DCF-based methods outperform phase-correlation-based approaches on these datasets.
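As an illustration of the calibration/de-warping pre-processing step, below is a minimal, assumed sketch using OpenCV's fisheye module; the intrinsic matrix K and distortion coefficients D are placeholders that would normally come from a prior calibration.

```python
import cv2
import numpy as np

# Placeholder intrinsics and fisheye distortion coefficients from a prior calibration.
K = np.array([[320.0, 0.0, 640.0],
              [0.0, 320.0, 360.0],
              [0.0, 0.0, 1.0]])
D = np.array([[0.05], [-0.01], [0.0], [0.0]])   # k1..k4 of the fisheye model

def dewarp(fisheye_frame):
    """Remap a fisheye frame to an (approximately) rectilinear image."""
    h, w = fisheye_frame.shape[:2]
    map1, map2 = cv2.fisheye.initUndistortRectifyMap(
        K, D, np.eye(3), K, (w, h), cv2.CV_16SC2)
    return cv2.remap(fisheye_frame, map1, map2, interpolation=cv2.INTER_LINEAR)
```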

3.8 Dataset Annotation and Augmentation

There are arguments against the combination of CV and DL, and they boil down to the conclusion that we need to re-evaluate our methods and move from rule-based to data-driven approaches. Traditionally, from the perspective of signal processing, we know the operational connotations of CV algorithms such as SIFT and SURF, but DL carries no such meaning: all you need is more data. This can be seen as a huge step forward, but it may also be a step backward. Some of the pros and cons of each side of this debate have been discussed already in this paper; however, if future methods are to be purely data-driven, then focus should be placed on more intelligent methods for dataset creation.


The fundamental problem of current research is that there is often not enough data for advanced algorithms or models for special applications. Coupling custom datasets and DL models will be the future theme of many research papers, and many researchers' outputs consist not only of algorithms or architectures, but also of datasets or methods to amass data. Dataset annotation is a major bottleneck in the DL workflow which requires many hours of manual labelling. Nowhere is this more problematic than in semantic segmentation applications, where every pixel needs to be annotated accurately. There are many useful tools available to semi-automate the process, as reviewed by [20], many of which take advantage of algorithmic approaches such as ORB features [55], polygon morphing [63], semi-automatic Area of Interest (AOI) fitting [55] and combinations of the above [63].

The easiest and most common method to overcome limited datasets and reduce overfitting of deep learning models for image classification is to artificially enlarge the dataset using label-preserving transformations. This process is known as dataset augmentation and it involves the artificial generation of extra training data from the available data, for example by cropping, scaling, or rotating images [64]. It is desirable for data augmentation procedures to require very little computation and to be implementable within the DL training pipeline so that the transformed images do not need to be stored on disk. Traditional algorithmic approaches that have been employed for dataset augmentation include Principal Component Analysis (PCA) [1], adding noise, interpolating or extrapolating between samples in a feature space [65] and modelling the visual context surrounding objects from segmentation annotations [66].
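A minimal sketch of the feature-space interpolation idea from [65], assuming NumPy; the interpolation factor range and the notion of "feature vector" are kept abstract here.

```python
import numpy as np

def interpolate_features(features, labels, n_new=100, rng=np.random.default_rng()):
    """Generate synthetic feature vectors by interpolating between same-class samples."""
    new_feats, new_labels = [], []
    for _ in range(n_new):
        label = rng.choice(np.unique(labels))
        idx = np.flatnonzero(labels == label)
        a, b = features[rng.choice(idx, size=2, replace=True)]
        lam = rng.uniform(0.0, 1.0)          # lam > 1 would extrapolate instead
        new_feats.append(a + lam * (b - a))  # point on the segment between a and b
        new_labels.append(label)
    return np.array(new_feats), np.array(new_labels)
```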

4 Conclusion

A lot of the CV techniques invented in the past 20 years have become irrelevant in recent years because of DL. However, knowledge is never obsolete and there is always something worth learning from each generation of innovation. That knowledge can give you more intuition and more tools to use, especially when you wish to deal with 3D CV problems, for example. Knowing only DL for CV will dramatically limit the kind of solutions in a CV engineer's arsenal.

In this paper we have laid down many arguments for why traditional CV techniques are still very much useful even in the age of DL. We have compared and contrasted traditional CV and DL for typical applications and discussed how sometimes traditional CV can be considered as an alternative in situations where DL is overkill for a specific task. The paper also highlighted some areas where traditional CV techniques remain relevant, such as being utilized in hybrid approaches to improve performance. DL innovations are driving exciting breakthroughs for the IoT (Internet of Things), as are hybrid techniques that combine these technologies with traditional algorithms. Additionally, we reviewed how traditional CV techniques can actually improve DL performance in a wide range of applications, from reducing training time, processing and data requirements, to being applied in emerging fields such as SLAM, panoramic stitching, Geometric Deep Learning and 3D vision where DL is not yet well established.


The digital image processing domain has undergone some very dramatic changes recently and in a very short period, so much so that it has led us to question whether the CV techniques that were in vogue prior to the AI explosion are still relevant. This paper hopefully highlights some cases where traditional CV techniques are still useful and shows that there is something still to gain from the years of effort put into their development, even in the age of data-driven intelligence.

References

1. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of 25th International Conference on Neural Information Processing Systems, NIPS 2012, vol. 1, pp. 1097–1105 (2012)
2. Nash, W., Drummond, T., Birbilis, N.: A review of deep learning in the study of materials degradation. npj Mater. Degrad. 2 (2018). Article number: 37. https://doi.org/10.1038/s41529-018-0058-x
3. Bonaccorso, G.: Machine Learning Algorithms: Popular Algorithms for Data Science and Machine Learning, 2nd edn. Packt Publishing Ltd., Birmingham (2018)
4. O'Mahony, N., Murphy, T., Panduru, K., et al.: Improving controller performance in a powder blending process using predictive control. In: 2017 28th Irish Signals and Systems Conference (ISSC), pp. 1–6. IEEE (2017)
5. O'Mahony, N., Murphy, T., Panduru, K., et al.: Real-time monitoring of powder blend composition using near infrared spectroscopy. In: 2017 Eleventh International Conference on Sensing Technology (ICST), pp. 1–6. IEEE (2017)
6. O'Mahony, N., Murphy, T., Panduru, K., et al.: Adaptive process control and sensor fusion for process analytical technology. In: 2016 27th Irish Signals and Systems Conference (ISSC), pp. 1–6. IEEE (2016)
7. Koehn, P.: Combining genetic algorithms and neural networks: the encoding problem (1994)
8. Wang, J., Ma, Y., Zhang, L., Gao, R.X.: Deep learning for smart manufacturing: methods and applications. J. Manufact. Syst. 48, 144–156 (2018). https://doi.org/10.1016/J.JMSY.2018.01.003
9. Voulodimos, A., Doulamis, N., Doulamis, A., Protopapadakis, E.: Deep learning for computer vision: a brief review. Comput. Intell. Neurosci. 2018, 1–13 (2018). https://doi.org/10.1155/2018/7068349
10. Dumoulin, V., Visin, F., Box, G.E.P.: A guide to convolution arithmetic for deep learning. arXiv Prepr arXiv:1603.07285v2 (2018)
11. Hayou, S., Doucet, A., Rousseau, J.: On the selection of initialization and activation function for deep neural networks. arXiv Prepr arXiv:1805.08266v2 (2018)
12. Horiguchi, S., Ikami, D., Aizawa, K.: Significance of softmax-based features in comparison to distance metric learning-based features (2017)
13. Deshpande, A.: A beginner's guide to understanding convolutional neural networks. CS Undergrad at UCLA (2019). https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/. Accessed 19 July 2018
14. Karami, E., Shehata, M., Smith, A.: Image identification using SIFT algorithm: performance analysis against different image deformations (2017)
15. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: speeded up robust features, pp. 404–417. Springer, Heidelberg (2006)
16. Rosten, E., Drummond, T.: Machine learning for high-speed corner detection, pp. 430–443. Springer, Heidelberg (2006)


17. Goldenshluger, A., Zeevi, A.: The Hough transform estimator 32 (2004). https://doi.org/10.1214/009053604000000760
18. Tsai, F.C.D.: Geometric hashing with line features. Pattern Recogn. 27, 377–389 (1994). https://doi.org/10.1016/0031-3203(94)90115-5
19. Wang, J., Perez, L.: The effectiveness of data augmentation in image classification using deep learning
20. Schöning, J., Faion, P., Heidemann, G.: Pixel-wise ground truth annotation in videos - an semi-automatic approach for pixel-wise and semantic object annotation. In: Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods, pp. 690–697. SCITEPRESS - Science and Technology Publications (2016)
21. Zhang, X., Lee, J.-Y., Sunkavalli, K., Wang, Z.: Photometric stabilization for fast-forward videos (2017)
22. Alhaija, H.A., Mustikovela, S.K., Mescheder, L., et al.: Augmented reality meets computer vision: efficient data generation for urban driving scenes (2017)
23. Meneghetti, G., Danelljan, M., Felsberg, M., Nordberg, K.: Image alignment for panorama stitching in sparsely structured environments, pp. 428–439. Springer, Cham (2015)
24. Alldieck, T., Kassubeck, M., Magnor, M.: Optical flow-based 3D human motion estimation from monocular video (2017)
25. Zheng, B., Zhao, Y., Yu, J., et al.: Scene understanding by reasoning stability and safety. Int. J. Comput. Vis. 112, 221–238 (2015). https://doi.org/10.1007/s11263-014-0795-4
26. Zheng, L., Yang, Y., Tian, Q.: SIFT meets CNN: a decade survey of instance retrieval
27. AlDahoul, N., Md Sabri, A.Q., Mansoor, A.M.: Real-time human detection for aerial captured video sequences via deep models. Comput. Intell. Neurosci. 2018, 1–14 (2018). https://doi.org/10.1155/2018/1639561
28. Conventional computer vision coupled with deep learning makes AI better. Network World. https://www.networkworld.com/article/3239146/internet-of-things/conventional-computer-vision-coupled-with-deep-learning-makes-ai-better.html. Accessed 12 Sept 2018
29. Bahrampour, S., Ramakrishnan, N., Schott, L., Shah, M.: Comparative study of deep learning software frameworks (2015)
30. An in-depth look at Google's first tensor processing unit (TPU). Google cloud big data and machine learning blog. Google cloud platform (2017). https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu. Accessed 11 Jan 2018
31. Vision Processing Unit: Machine vision technology. Movidius. https://www.movidius.com/solutions/vision-processing-unit. Accessed 11 Jan 2018
32. Ng, H.-W., Nguyen, D., Vonikakis, V., Winkler, S.: Deep learning for emotion recognition on small datasets using transfer learning. https://doi.org/10.1145/2818346.2830593
33. Pepik, B., Stark, M., Gehler, P., Schiele, B.: Teaching 3D geometry to deformable part models. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2012)
34. Russakovsky, O., Deng, J., Su, H., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
35. Lin, T.-Y., Maire, M., Belongie, S., et al.: Microsoft COCO: common objects in context (2014)
36. CS231n convolutional neural networks for visual recognition. http://cs231n.github.io/transfer-learning/. Accessed 9 Mar 2018
37. Highlander, T.C.: Efficient training of small kernel convolutional neural networks using fast fourier transform
38. Highlander, T., Rodriguez, A.: Very efficient training of convolutional neural networks using fast fourier transform and overlap-and-add (2016)


39. Li, F., Wang, C., Liu, X., et al.: A composite model of wound segmentation based on traditional methods and deep neural networks. Comput. Intell. Neurosci. 2018, 1–12 (2018). https://doi.org/10.1155/2018/4149103
40. Nijhawan, R., Das, J., Raman, B.: A hybrid of deep learning and hand-crafted features based approach for snow cover mapping. Int. J. Remote Sens. 1–15 (2018). https://doi.org/10.1080/01431161.2018.1519277
41. Zeng, G., Zhou, J., Jia, X., et al.: Hand-crafted feature guided deep learning for facial expression recognition. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 423–430. IEEE (2018)
42. Burchfiel, B., Konidaris, G.: Hybrid Bayesian eigenobjects: combining linear subspace and deep network methods for 3D robot vision
43. Marcus, G.: Deep learning: a critical appraisal
44. Nalisnick, E., Smyth, P.: Learning priors for invariance, pp. 366–375 (2018)
45. Diligenti, M., Roychowdhury, S., Gori, M.: Integrating prior knowledge into deep learning. In: 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 920–923. IEEE (2017)
46. Zhu, H., Nie, Y., Yue, T., Cao, X.: The role of prior in image based 3D modeling: a survey. Front. Comput. Sci. 11, 175–191 (2017). https://doi.org/10.1007/s11704-016-5520-8
47. Tran, D., Bourdev, L., Fergus, R., et al.: Learning spatiotemporal features with 3D convolutional networks. arXiv Prepr arXiv:1412.0767 (2015)
48. Pang, G., Neumann, U.: 3D point cloud object detection with multi-view convolutional neural network. In: 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 585–590. IEEE (2016)
49. Lan, Q., Wang, Z., Wen, M., et al.: High performance implementation of 3D convolutional neural networks on a GPU. Comput. Intell. Neurosci. 2017, 1–8 (2017). https://doi.org/10.1155/2017/8348671
50. Ahmed, E., Saint, A., Shabayek, A.E.R., et al.: Deep learning advances on different 3D data representations: a survey. arXiv Prepr arXiv:1808.01462 (2018)
51. Zhou, Y., Tuzel, O.: VoxelNet: end-to-end learning for point cloud based 3D object detection. arXiv Prepr arXiv:1711.06396 (2017)
52. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: deep hierarchical feature learning on point sets in a metric space. arXiv Prepr arXiv:1706.02413v1 (2017)
53. Braeger, S., Foroosh, H.: Curvature augmented deep learning for 3D object recognition. In: 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 3648–3652. IEEE (2018)
54. O'Mahony, N., Campbell, S., Krpalkova, L., et al.: Deep learning for visual navigation of unmanned ground vehicles; a review (2018)
55. Karami, E., Prasad, S., Shehata, M.: Image matching using SIFT, SURF, BRIEF and ORB: performance comparison for distorted images
56. Angelina Uy, M., Hee Lee, G.: PointNetVLAD: deep point cloud based retrieval for large-scale place recognition
57. Camposeco, F., Cohen, A., Pollefeys, M., Sattler, T.: Hybrid scene compression for visual localization
58. Loghmani, M.R., Planamente, M., Caputo, B., Vincze, M.: Recurrent convolutional fusion for RGB-D object recognition
59. Clément, M., Kurtz, C., Wendling, L.: Learning spatial relations and shapes for structural object description and scene recognition. Pattern Recogn. 84, 197–210 (2018). https://doi.org/10.1016/J.PATCOG.2018.06.017


60. Ran, L., Zhang, Y., Zhang, Q., et al.: Convolutional neural network-based robot navigation using uncalibrated spherical images. Sensors 17, 1341 (2017). https://doi.org/10.3390/s17061341
61. Silva, R.M.A., Feijó, B., Gomes, P.B., et al.: Real time 360° video stitching and streaming. In: ACM SIGGRAPH 2016 Posters on - SIGGRAPH 2016, pp. 1–2. ACM Press, New York (2016)
62. Fernandez-Labrador, C., Perez-Yus, A., Lopez-Nicolas, G., Guerrero, J.J.: Layouts from panoramic images with geometry and deep learning
63. Schöning, J., Faion, P., Heidemann, G.: Pixel-wise ground truth annotation in videos - an semi-automatic approach for pixel-wise and semantic object annotation. In: Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods, pp. 690–697. SCITEPRESS - Science and Technology Publications (2016)
64. Ioannidou, A., Chatzilari, E., Nikolopoulos, S., Kompatsiaris, I.: Deep learning advances in computer vision with 3D data. ACM Comput. Surv. 50, 1–38 (2017). https://doi.org/10.1145/3042064
65. Devries, T., Taylor, G.W.: Dataset augmentation in feature space. arXiv Prepr arXiv:1702.05538v1 (2017)
66. Dvornik, N., Mairal, J., Schmid, C.: Modeling visual context is key to augmenting object detection datasets

Self-localization from a 360-Degree Camera Based on the Deep Neural Network

Shintaro Hashimoto and Kosuke Namihira

Japan Aerospace Exploration Agency (JAXA), 2-1-1 Sengen, Tsukuba, Japan
[email protected]

Abstract. This research aimed to develop a method that can be used for both self-localization and the correction of dead reckoning from photographed images. Therefore, this research applied two methods: estimating the position from the surrounding environment and estimating the position from the lengths between the own position and the targets. A convolutional neural network (CNN) and convolutional long short-term memory (CLSTM) were used as methods of self-localization, and panorama images and general images were used as input data. As a result, the method that uses "a CNN with the pooling layers partially eliminated and a panorama image for input, calculates the intersections of circles from the lengths between the own position and the targets, adopts the three points with the closest intersections, and does not estimate the own position if the closest intersections have a large error" was the most accurate. The total accuracy was 0.217 [m] for the x-coordinate and y-coordinate. As the room measured about 12 [m] by 12 [m] in size and only about 3,000 training data were used, the error was considered to be small.

Keywords: Self-localization · Machine learning · Convolutional neural network · Convolutional LSTM · 360-degree camera

1 Introduction

Self-localization is required in navigation systems, such as geographic information systems that assist a person in reaching a certain destination point, and in order for a robot to perform various tasks autonomously in real environments. For self-localization, a Global Positioning System (GPS) receiver is often used because it is easy to introduce and offers high-accuracy estimation. Self-localization by a GPS receiver requires signals from four or more GPS satellites, and it becomes unstable when the GPS signals from the satellites are interrupted by obstacles such as high-rise buildings, elevated roadways, subways, and tunnels. Therefore, by processing data from various sensors such as a gyro sensor, acceleration sensor, and wheel speed sensor, existing dead reckoning technology can estimate location with high accuracy even in environments where self-localization by GPS is difficult. Self-localization using an acceleration sensor or wheel speed sensor is a useful solution in an environment where GPS cannot be used, but in many cases errors accumulate because an integrated value is used to calculate distance and speed from these sensors. For this reason, correction using an absolute value is required at a certain timing. For correction based on an absolute value, a reference point that never changes


its position is often provided. For example, there is a method of estimating the self-position from the positional relation of landmarks such as a tower or mountains. There is also the Star Tracker (STT), which estimates attitude from the positional relationship of the stars; an optical sensor is often used for this method. Although this approach is applicable to both self-localization and the correction of dead reckoning, it needs landmarks and must accurately identify and recognize such landmarks. In noisy environments such as indoors, it is difficult to accurately identify and recognize landmarks. Therefore, in this research, self-localization is realized by using deep learning. By using deep learning, it is possible to automatically obtain geographical features to serve as landmarks, without having to manually set the landmarks, and to estimate positions. Alternatively, deep learning can be used only for finding landmarks; in that case, it is necessary to geometrically calculate the own position from the landmarks found. Either method should be selected according to the environment. However, if deep learning directly infers the self-position from the input image, the inference result includes uncertainty, and deep learning cannot judge whether the inference result is correct or not. This research converts the inference result of deep learning to the self-position by a geometric formula and judges whether the inference by deep learning is correct or not.

The sensors used in this research are the optical sensors of a 360-degree camera (panorama camera) and a non-360-degree camera (a general camera such as a smartphone camera). The reason for adopting a panorama camera was that a more precise positional relation of each object in the surrounding environment can be obtained than by using a general camera. JAXA developed in 2017 a small robot called Int-Ball (JEM Internal Ball Camera) that supports astronauts in the ISS (International Space Station). In order to increase the number of tasks that the Int-Ball can process, self-localization is necessary. This research therefore developed an algorithm with low processing requirements in order to enable self-localization with a small robot such as Int-Ball.

2 Related Research

There are existing studies on self-localization using images and deep learning. These studies do not require GPS positioning data in outdoor environments. Kuleshov et al. demonstrated appearance-based self-localization based on machine learning and a finite set of captured images taken at known positions [1]. Such an appearance-based learning framework has become very popular in the field of robot learning [2]. Further, there are existing studies on self-localization using 360-degree panoramic images. Nakagawa et al. estimated the position of the camera by estimating the optical flow of 360-degree panoramic images; they did not use deep learning techniques [3]. In contrast, there exists research called CNN-SLAM on self-localization using a Convolutional Neural Network (CNN) [4]. To summarize that research, the image obtained from a monocular camera (RGB) is converted to a depth map by a CNN, fused with direct monocular SLAM, and a 3D map is reconstructed; that is, self-localization is also possible.


That method differs from this research, because this research uses a panorama camera, does not require two frames of data like Simultaneous Localization and Mapping (SLAM), and uses Recurrent Neural Networks (RNNs) in order to analyze time series. CNN-SLAM needs substantial computational resources because it uses many point group data for self-localization in addition to obtaining depth information using a CNN. SLAM also needs attitude estimation with high accuracy, so many calibrations are necessary. As far as we know, there has been no report about self-localization based on 360-degree panoramic images and deep learning techniques. In recent years, cameras that can take 360-degree images have widely appeared on the market, and we have the opportunity to utilize them. Therefore, this study attempted self-localization using 360-degree panoramic images and some deep learning methods. When adopting a method of self-localization, the main concern is not only the availability but also the accuracy. In this research, we discuss the applicability and issues of self-localization.

3 Proposed Methods of Self-localization Using Deep Learning

This section describes the proposed methods of self-localization using deep learning as employed in this research.

3 Proposed Methods of Self-localization Using Deep Learning This section describes the proposed methods of self-localization using deep learning as employed in this research. 3.1

3.1 Overview of Proposed Methods

In this research, self-localization is performed using CNN and convolutional long short-term memory (CLSTM) deep learning methods, and the advantages and disadvantages of each method are identified [5, 6]. Two types of input image are used for each deep learning technique: one image is captured by a general camera with a fixed viewing angle, and the other is taken by an omnidirectional camera (panorama camera) at 360°. This research applies two methods for self-localization: (1) directly measuring the position and (2) calculating the intersecting points of circles using the relative positions of landmarks obtained from the CNN and CLSTM output results. Table 1 lists the combinations used for self-localization in this research. If the image is not a panorama, the lengths between the own position and the multiple objects that are necessary to compute the intersection of circles cannot be obtained. There are two variations in CNN and CLSTM output: (1) a pair that outputs the x-coordinate and z-coordinate directly (without the y-coordinate), and (2) a set of distances to three targets for calculating the intersection of circles.

Table 1. Combinations used for self-localization

Model   Camera     Direct   Intersection of circles
CNN     General    ✓        –
CNN     Panorama   ✓        ✓
CLSTM   General    ✓        –
CLSTM   Panorama   ✓        ✓

3.2 Development/Evaluation Environment

Table 2 lists the components of the development and evaluation environment.

Table 2. Development and evaluation environment

Name     Details
OS       Ubuntu 14.04
CPU      Intel(R) Core(TM) i7-5930K
GPU      GeForce GTX TITAN X × 2
Memory   32 GB

3.3 Generation of Training/Testing Data

The learning phase of deep learning requires a lot of input data and time. In consideration of regenerating input data, making modifications, and other tasks, simulation data that imitates the physical world was generated. Simulation data of panorama images and general images were prepared using the Unity game engine. The size of the room is about 12 [m] by 12 [m]; Figure 1 shows an overview of the room. A human operator manipulated this simulator, walked around the room with a stride of 5 [cm], and created data. As in the physical world, the data generated by the simulator is time series data (no teleporting). Figure 2 shows the panorama image and the general image generated by simulation. Figure 3 shows the path of the training data and the path of the testing data.

Fig. 1. Overview of the room


Fig. 2. Training images: Left is generated panorama image. Right is general image.

Fig. 3. The path data: Left is the path of training data. Right is the path of testing data.

About 3,000 training data and 412 testing data were prepared. The panorama images are 512 pixels by 256 pixels in size, and the general images are 256 pixels by 256 pixels in size. Both types of image contain landmarks; however, the positions of these landmarks are not given in the input data. In the three-dimensional space there are x, y and z coordinates, but y was removed because there is no movement along it; in this research, z is regarded as y.

3.4 Geometrically Calculating Location

This research considers a method based on the intersection of circles, used for indirect self-localization. This method estimates the own position on a map from the lengths between the own position and certain fixed targets that do not move on the map. It can estimate the own position from the simple Eqs. (1) to (5). The variables x and y are the positions of the targets, r_1 and r_2 are the lengths to the targets, and P_x and P_y are the own position to be estimated. Because the intersection of two circles is calculated, two candidate points are obtained, one of which is the own position (P_x, P_y).

l = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}    (1)

\theta = \tan^{-1}\left(\frac{y_2 - y_1}{x_2 - x_1}\right)    (2)

\theta_a = \cos^{-1}\left(\frac{l^2 + r_1^2 - r_2^2}{2 l r_1}\right)    (3)

P_x = x_1 + r_1 \cos(\theta - \theta_a)    (4)

P_y = y_1 + r_1 \sin(\theta - \theta_a)    (5)

Figure 4 shows each parameter of the governing equations (1) to (5).

Fig. 4. Intersection of circles
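A minimal NumPy sketch of Eqs. (1)–(5); the variable names follow the equations, and the second candidate point (using θ + θ_a) is also returned, since the intersection of two circles is generally a pair of points.

```python
import numpy as np

def circle_intersections(x1, y1, r1, x2, y2, r2):
    """Return the two intersection points of circles centred on two targets.
    (x1, y1), (x2, y2) are target positions; r1, r2 are the measured lengths
    from the own position to each target, following Eqs. (1)-(5)."""
    l = np.hypot(x2 - x1, y2 - y1)                              # Eq. (1)
    theta = np.arctan2(y2 - y1, x2 - x1)                        # Eq. (2), quadrant-safe arctan
    theta_a = np.arccos((l**2 + r1**2 - r2**2) / (2 * l * r1))  # Eq. (3)

    p_plus = (x1 + r1 * np.cos(theta - theta_a),                # Eq. (4)
              y1 + r1 * np.sin(theta - theta_a))                # Eq. (5)
    p_minus = (x1 + r1 * np.cos(theta + theta_a),
               y1 + r1 * np.sin(theta + theta_a))
    return p_plus, p_minus
```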

In general, self-localization improves in accuracy by calculating multiple intersection points from multiple targets. When there are multiple intersection points, one of them must be selected as the own position. In the case of three objects, once the lengths between the three objects and the own position are determined, up to six intersecting points are obtained, and this method must estimate the own position from these intersecting points (Fig. 5).

Fig. 5. Intersections of three circles


Although there are 20 possible combinations of three intersecting points, only 8 combinations remain when those containing both intersections of the same circle are eliminated (Fig. 6). A1 denotes (A1_x, A1_y), the first intersecting point of circle A, and A2 denotes (A2_x, A2_y), the second intersecting point of circle A; the same applies to B1, B2, C1 and C2. When the distances between the three objects and the own position are correctly determined, the error of the three distances becomes zero at a certain intersecting point. Therefore, this research adopts the combination of the three closest points among these 8 combinations and takes the average of their positions as the estimated own position. If even the combination with the smallest error has a large error, the own position should not be estimated from those intersections.

Fig. 6. Combinations of intersection

Equations (6) to (8) calculate the error distances of a certain combination n. Equation (9) finds the combination pattern with the minimum error distance. Equation (10) calculates the average position of the three intersecting points of the combination with the smallest error distance; min(n)_x and min(n)_y denote that combination pattern of minimum error distance (e.g. A1, B1, C2). Equation (11) decides whether the estimate is adopted.

AB_n = \sqrt{(A_{nx} - B_{nx})^2 + (A_{ny} - B_{ny})^2}    (6)

AC_n = \sqrt{(A_{nx} - C_{nx})^2 + (A_{ny} - C_{ny})^2}    (7)

BC_n = \sqrt{(B_{nx} - C_{nx})^2 + (B_{ny} - C_{ny})^2}    (8)

length = \min_n (AB_n + AC_n + BC_n)    (9)

P_{x,y} = \left( \frac{A_{\min(n)x} + B_{\min(n)x} + C_{\min(n)x}}{3},\; \frac{A_{\min(n)y} + B_{\min(n)y} + C_{\min(n)y}}{3} \right)    (10)

P_{estimate} = \begin{cases} P_{x,y} & \text{if } threshold \geq length \\ \text{discard } P_{x,y} & \text{if } threshold < length \end{cases}    (11)
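A minimal NumPy sketch of the combination search in Eqs. (6)–(11); `intersections` is assumed to hold the two candidate points per circle (e.g. from the circle-intersection sketch above), and the 3 [m] threshold follows the evaluation in Sect. 4.

```python
import itertools
import numpy as np

def estimate_position(intersections, threshold=3.0):
    """Select the tightest triple of intersection points, one per circle.

    intersections: dict like {"A": (A1, A2), "B": (B1, B2), "C": (C1, C2)},
    where each point is an (x, y) pair. Returns the averaged position, or
    None when even the best combination exceeds the error threshold (Eq. 11)."""
    best_length, best_points = np.inf, None

    # 2 x 2 x 2 = 8 combinations: one intersection chosen per circle.
    for ia, ib, ic in itertools.product((0, 1), repeat=3):
        a = np.asarray(intersections["A"][ia])
        b = np.asarray(intersections["B"][ib])
        c = np.asarray(intersections["C"][ic])
        # Eqs. (6)-(8): pairwise distances, summed as the combination error.
        length = (np.linalg.norm(a - b) + np.linalg.norm(a - c)
                  + np.linalg.norm(b - c))
        if length < best_length:                      # Eq. (9)
            best_length, best_points = length, (a, b, c)

    if best_length > threshold:                       # Eq. (11): discard
        return None
    return np.mean(best_points, axis=0)               # Eq. (10)
```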


It is judged whether the value estimated by (11) is correct or incorrect. Another way of verifying the estimated position is to use previous position information, for example by checking whether the position has moved significantly compared with the previous value. However, in that case there must always be a previous value that is correct; such a previous value is not necessary in the proposed methods of this research.

3.5 Self-localization by CNN

A CNN consists mainly of convolutional layers, pooling layers, and fully connected layers. Although the pooling layers of a CNN contribute to position invariance when extracting features, this position invariance makes the position information ambiguous when the positions of such features must be estimated. The closer a layer is to the input, the better the position information is preserved. This research therefore reduces the number of pooling layers, thereby improving the accuracy of estimating the own position. Figure 7 shows the network model.

Fig. 7. Overview of the model (A)

The model used in this research is as follows:

Model of CNN (A):
Conv (5, 5, 3, 3) → Pool → Conv (5, 5, 3, 3) → Conv (5, 5, 3, 3) → Pool → Fc ((128 * 64 * 3), 2048) → Fc (2048, output classes number)

Legend:
Conv: Convolution (Kernel_x, Kernel_y, Input, Output)
Pool: Max Pooling (kernel size: 2 × 2; stride: 2 × 2)
Fc: Fully Connected (Input, Output)
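A minimal PyTorch sketch of model (A) as specified above, assuming a 512 × 256 panorama input (so two 2 × 2 poolings yield the 128 * 64 * 3 flattened size); the ReLU activations and 'same' padding are assumptions, since the paper does not spell them out.

```python
import torch
import torch.nn as nn

class ModelA(nn.Module):
    def __init__(self, num_outputs):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 3, kernel_size=5, padding=2), nn.ReLU(),   # Conv (5, 5, 3, 3)
            nn.MaxPool2d(2, 2),                                     # Pool
            nn.Conv2d(3, 3, kernel_size=5, padding=2), nn.ReLU(),   # Conv (5, 5, 3, 3)
            nn.Conv2d(3, 3, kernel_size=5, padding=2), nn.ReLU(),   # Conv (5, 5, 3, 3)
            nn.MaxPool2d(2, 2),                                     # Pool
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 64 * 3, 2048), nn.ReLU(),               # Fc ((128*64*3), 2048)
            nn.Linear(2048, num_outputs),                           # Fc (2048, outputs)
        )

    def forward(self, x):            # x: (batch, 3, 256, 512) panorama tensor
        return self.regressor(self.features(x))

# e.g. two outputs for direct (x, z) regression, or three distances for Eqs. (1)-(5):
model = ModelA(num_outputs=2)
print(model(torch.randn(1, 3, 256, 512)).shape)   # torch.Size([1, 2])
```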

The model reduces the amount of convolution so that as large an input as possible can be fed into the dense layer. In this training/testing environment there are few obstacles and the colours of the objects are clear; in the opposite case, a greater amount of convolution is considered necessary. Moreover, this research attempted to combine the feature-extracting layers with an image representation closer to the input layer, just before the fully connected layer. This attempt reduces the influence of position invariance by propagating the output of a shallow layer to the output layer.


The features of the targets are considered extractable by the convolutional layers close to the output. Figure 8 shows the network model. The skip connection in Fig. 8 propagates the output of a layer forward; after skipping, the source layer and destination layer are combined, taking the vector sizes into consideration.

Fig. 8. Overview of the model (B)

The model used in this research is as follows:

Model of CNN (B):
Conv (5, 5, 3, 3) → Pool → Conv (5, 5, 3, 3) → Pool → Skip & Conv (3, 3, 3, 32) → Pool → Conv (3, 3, 32, 64) → Pool → Conv (3, 3, 64, 256) → Conv (3, 3, 256, 256) → Pool → Conv (3, 3, 256, 512) → Conv (3, 3, 512, 512) → Pool → Fc ((8 * 4 * 512) + Skip (128 * 64 * 3), 2048) → Fc (2048, output classes number)

3.6 Self-localization by Convolutional LSTM

Although CNNs are commonly used for image processing (classification and regression), LSTM is not commonly used for it. A common LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate, and it can learn longer-term dependencies in time series data than a plain RNN. However, image position information is not retained, so LSTM is not usually applied to time series of images. In order to accurately process time series image data, CLSTM, which combines convolutional layers with LSTM, was developed. Figure 9 shows the structure of CLSTM. This research adopted CLSTM and estimates the own position from time series images. In general, CLSTM is used for tasks such as predicting the next frame of a video; in this research, it estimates the own position by using a dense layer (neural network) as the final layer.


Fig. 9. Overview of CLSTM

The model used in this research is as follows:

Model of CLSTM:
ConvLSTM (5, 5, 3, 3) → ConvLSTM (3, 3, 3, 3) → Pool → ConvLSTM (3, 3, 8, 8) → Pool → ConvLSTM (3, 3, 8, 8) → Fc ((64 * 128 * 8), 1024) → Fc (1024, output classes number)

Legend:
ConvLSTM: Convolutional LSTM (Kernel_x, Kernel_y, Input, Output)

The loss function of models CNN (A), CNN (B), and CLSTM is the mean squared error. Moreover, we adopt Batch Normalization and He weight initialization in order to train effectively [7, 8].
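A minimal PyTorch sketch of these two training choices; the stand-in network, optimizer, and learning rate are illustrative assumptions rather than the paper's actual settings.

```python
import torch
import torch.nn as nn

def init_he(module):
    """He (Kaiming) initialization for conv and linear layers, as in [7]."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Stand-in network for a 512 x 256 panorama input; in practice this would be
# model (A), model (B), or the CLSTM described above.
model = nn.Sequential(
    nn.Conv2d(3, 3, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(4, 4), nn.Flatten(),
    nn.Linear(3 * 64 * 128, 2),
)
model.apply(init_he)

criterion = nn.MSELoss()   # mean squared error on the predicted position
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, targets):
    """One optimization step on a batch of panorama tensors and true positions."""
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```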

4 Evaluation of Self-localization

This section describes the results of the methods presented in Sect. 3. In the training phase, it could not be confirmed whether each method had been trained to its highest achievable accuracy. In the case of CNN, there were 30,000 to 50,000 iterations (about 10 training data per batch); in the case of CLSTM, there were 60,000 to 80,000 iterations (about 5 training data per batch). The evaluation used the most accurate weights obtained during this training.


Table 3. Results of self-localization by all methods (units: meters). The best result is No. 1.

No.  Model                      Camera    Method        Average error  x-error  y-error  STD
1    CNN (A) + Threshold 3 [m]  Panorama  Intersection  0.217          0.096    0.173    0.100
2    CNN (A)                    Panorama  Intersection  0.333          0.227    0.184    0.219
3    CNN (B)                    Panorama  Intersection  1.060          0.769    0.673    1.702
4    CLSTM                      Panorama  Intersection  1.610          1.190    0.864    0.832
5    CNN (A)                    Panorama  Directly      0.279          0.085    0.253    0.138
6    CNN (B)                    Panorama  Directly      0.398          0.290    0.230    0.148
7    CLSTM                      Panorama  Directly      1.830          1.150    1.210    0.891
8    CNN (A)                    General   Directly      0.420          0.396    0.106    0.216
9    CNN (B)                    General   Directly      0.434          0.253    0.314    0.287
10   CLSTM                      General   Directly      2.430          1.330    1.720    1.200

The results of self-localization by all methods are shown in Table 3. The most accurate method of estimating position was the combination of CNN (A), panorama images, and estimating position using the intersecting points of circles (No. 1). The average error of the estimated position was 0.217 [m]; as a breakdown, the x-axis showed an error of 0.096 [m] and the y-axis an error of 0.173 [m], with a standard deviation (STD) of 0.100 [m]. Only values estimated with an accumulated distance error of the intersecting points of less than 3 [m] (threshold) were adopted; as a result, 191 out of 412 test data were excluded. No. 2 shows the results when inferred values were not filtered by the threshold. Accuracy can be improved using stricter threshold filtering, but the amount of usable data decreases accordingly. As shown in No. 3, accuracy deteriorated when pooling layers were added relative to No. 2. When CNN was changed to CLSTM, overall accuracy got worse (No. 4). In CLSTM, a moving image lasting 0.25 s (15 frames) was given and the position estimate of the final frame was adopted as the estimated position. As a moving image is the input to CLSTM, there tend to be more parameters; for this reason, feeding the neural network (dense layer) while maintaining the original resolution is difficult, pooling layers become necessary, and the precision suffers.


Next, instead of indirectly estimating position from the intersecting points of circles, the results of directly estimating position are shown. No. 5 and No. 6 show the position estimation results obtained by CNN; type B of CNN tended to show lower accuracy. No. 5 shows the most accurate results among the methods without threshold filtering and the intersection of circles. No. 7 shows the results of changing CNN to CLSTM; the accuracy is confirmed to be lower than that of CNN in No. 5 and No. 6. No. 8, No. 9, and No. 10 show the self-localization results with general images. Self-localization using general images is less accurate than when using panorama images. Figure 10 shows an overlay of the most accurate result and the label data: the blue points are label data, the gray points are the inference results without the threshold (No. 2), and the orange points are the inference results with the threshold (No. 1). Thus, threshold filtering was confirmed to be able to eliminate estimates that were far from the correct answers (label data). Figure 11 shows the image with the largest estimation error. This image is not particularly abnormal and is not misleading; therefore, the limited number of training samples is thought to be a major cause of the error.

Fig. 10. Overlay of results with the highest accuracy and label data after threshold filtering

The combination of CNN (A), panorama images, and position estimation from the intersection points of circles can be run with fewer machine resources than CLSTM and CNN-SLAM, because the threshold processing is faster than SLAM.
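The paper does not give pseudocode for this estimator, but the idea can be sketched as follows. This is our own illustration under stated assumptions: the CNN outputs distances to known targets (here `landmarks`), circles around the targets are intersected pairwise, the tightest cluster of three intersection points is averaged, and the estimate is rejected if the accumulated error exceeds the 3 m threshold. All names and the exact error definition are hypothetical.

```python
# Illustrative sketch (not the authors' code): estimate position from CNN-predicted
# distances to known targets by intersecting circles and rejecting uncertain estimates.
import itertools
import numpy as np

def circle_intersections(c1, r1, c2, r2):
    """Return the 0-2 intersection points of two circles (centers c1, c2; radii r1, r2)."""
    c1, c2 = np.asarray(c1, float), np.asarray(c2, float)
    d = np.linalg.norm(c2 - c1)
    if d == 0 or d > r1 + r2 or d < abs(r1 - r2):
        return []                                   # circles do not intersect cleanly
    a = (d**2 + r1**2 - r2**2) / (2 * d)            # distance from c1 to the chord midpoint
    h = np.sqrt(max(r1**2 - a**2, 0.0))             # half chord length
    mid = c1 + a * (c2 - c1) / d
    perp = np.array([-(c2 - c1)[1], (c2 - c1)[0]]) / d
    return [mid + h * perp, mid - h * perp]

def estimate_position(landmarks, distances, threshold=3.0):
    """landmarks: (N, 2) known target positions; distances: (N,) CNN-predicted lengths.
    Returns the averaged position of the tightest cluster of three intersection points,
    or None when the accumulated error exceeds the threshold (estimate rejected)."""
    points = []
    for i, j in itertools.combinations(range(len(landmarks)), 2):
        points += circle_intersections(landmarks[i], distances[i],
                                       landmarks[j], distances[j])
    best, best_err = None, np.inf
    for triple in itertools.combinations(points, 3):
        center = np.mean(triple, axis=0)
        err = sum(np.linalg.norm(p - center) for p in triple)   # accumulated error
        if err < best_err:
            best, best_err = center, err
    return best if best is not None and best_err < threshold else None
```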


Fig. 11. Image of the largest estimation error

5 Conclusion

This research aimed to develop a method that can be used both for self-localization and for correcting dead reckoning from photographed images. To this end, two approaches were applied: estimating position directly from the surrounding environment, and estimating position from the lengths between the own position and the targets. CNN and CLSTM were used as self-localization models, and panorama images and general images were used as input data. Although CLSTM, which can process time-series data, was expected to offer higher accuracy, self-localization accuracy was in fact improved by using a CNN with the pooling layer eliminated. In particular, the most accurate method was the one that uses a CNN with the pooling layer partially eliminated and a panorama image as input, calculates the intersections of circles from the lengths between the own position and the targets, adopts the three closest intersection points, and does not estimate the position if the closest intersections have a large error. The total accuracy was 0.217 [m] over the x- and y-coordinates. Given that the room measured about 12 [m] by 12 [m] and only about 3,000 training samples were available, this error is considered small. Moreover, although directly estimating position with a CNN is simple, such a model cannot judge whether its estimated position is correct. A method that calculates multiple intersecting points from the CNN output and judges the correctness of the estimate from the error between those points was therefore found to be preferable. Panorama images proved better suited to outputting the lengths to multiple targets with CNN and CLSTM; with general images, there is a high possibility that multiple targets cannot be captured.

In the future, we would like to estimate position by installing a panorama/360-degree camera in HMDs, robots, spacecraft, and elsewhere. We hope that services using location information will be created even indoors, where GPS signals do not reach. For future work, we would like to conduct experiments in a real environment and to make self-localization possible with less training data and higher accuracy.



Deep Cross-Modal Age Estimation

Ali Aminian and Guevara Noubir

Northeastern University, Boston, MA 02115, USA
{aliiaminian,noubir}@ccs.neu.edu

Abstract. Automatic age and gender classification systems can play a vital role in a number of applications, including recommendation systems, face recognition across age progression, and security applications. Current age and gender classifiers lack the accuracy and reliability needed for real-world applications, since most real-time systems have zero fault tolerance. This paper develops an end-to-end deep architecture aiming to improve the accuracy and reliability of the age estimation task. We design a deep convolutional neural network (CNN) architecture for age estimation that builds upon a gender classification model: the system leverages a gender classifier to improve the accuracy of the age estimator. We investigate several architectures and techniques for cross-modal learning in the age estimator, including an end-to-end model that uses a gender embedding of the input image and leads to increased accuracy. We evaluated our system on the Adience benchmark, which consists of real-world, in-the-wild pictures of faces, and show that by training a cross-modal age classifier our system outperforms state-of-the-art age classifiers, such as [1], by 9%.

Keywords: Age and gender classification · Convolutional neural networks

1 Introduction

Age and gender classification plays an important role in numerous applications, including a variety of computer vision-based recommendation and human-computer interaction systems. In these systems, age is considered an essential factor, from improving face recognition across age progression, to understanding relations between individuals, to making further inferences towards grasping social interactions. Security systems can also benefit from a better understanding of the age, gender, and capabilities of actors. Recent developments in deep learning have enabled improvements in a variety of tasks. In computer vision, for instance, it is now possible to achieve very high accuracy for applications such as face detection and face recognition [2]. This progress benefited from the confluence of new techniques in deep neural network (DNN) architectures, the presence of highly parallel low-cost computing infrastructure such as GPGPUs, and the availability of feature-rich large datasets.


Despite this phenomenal progress, the performance of age classifiers still lacks reliability and accuracy [1]. Age classification is an intrinsically challenging problem, as it is hard even for humans to perform with high accuracy. Furthermore, it is difficult to obtain massive, accurately labeled datasets of pictures in-the-wild, unlike other computer vision tasks such as object classification. In this paper, we propose and investigate an approach to improve the accuracy of age classification through cross-modal learning with a pre-trained gender classifier. In addition to training the age classifier with the same deep model used for gender classification, we also feed the model the gender embedding, since the gender classifier extracts useful gender-related features. This approach is motivated by the fact that the aging process is not identical in men and women, due to biological and behavioral differences such as skin thickness, collagen density, rate of collagen loss at different stages of life, texture, and level of tissue hydration [3]. Beyond a straightforward integration of the gender classifier output, we explored several techniques to improve system accuracy by feeding the embedding of an improved model of each task to the other task, the aim being that the deep model can find more essential features at each round. We considered taking the gender embedding from different fully connected (FC) layers and integrating it into different layers of the age classification model. We also considered an end-to-end model with a more general integration of the two classifier networks. We evaluated our model on the Adience dataset, a thorough benchmark for age and gender classification [4]. The Adience dataset consists of in-the-wild, unfiltered face images that present the typical variations in appearance, noise, pose, and lighting expected of images taken without careful preparation or posing. We selected the Adience dataset because of its realism, although it creates additional challenges. Our evaluation results demonstrate that our techniques outperform state-of-the-art approaches in age classification. To summarize our contributions: we designed a simple architecture used as the underlying model for both age and gender classification; we designed, trained, and evaluated several cross-modal learning models for age classification built upon the gender classification model; we proposed an end-to-end model which automates the cross-modal learning process during training; and we iteratively refined both models in order to improve the accuracy of both tasks.

2 Related Work

Before describing our approach, detailed design, and performance, we briefly summarize the related work, both on DNNs and on age and gender classification.

2.1 Age and Gender Classification

Gender Classification. The problem of gender classification from facial images has received significant attention in recent years, and many approaches have been developed for this purpose. A survey of gender classification can be found in [5] and, more recently, in [6]. Below, we briefly survey relevant methods. Early methods for gender classification used a neural network on a small set of near-frontal face images [7]. Later work used the 3D structure of the head for classifying gender [8]. SVM classifiers were investigated in [9]. Other works used AdaBoost for gender classification [10]. Finally, viewpoint-invariant age and gender classification was introduced in [11]. More recently, Weber's Local Descriptor [12] was used in [13] for gender recognition on the Face Recognition Technology (FERET) benchmark [14]. In [15], intensity, shape, and texture features were used, again on the FERET benchmark. The FERET benchmark [14], developed by the DoD to facilitate face recognition, has been a popular performance evaluation method. It is worth noting that the FERET dataset was developed under highly controlled conditions; hence, FERET images are less challenging than in-the-wild face images. Furthermore, due to its extensive use for gender classification evaluation, the FERET benchmark is saturated, so a realistic comparison of these techniques has become difficult. Recent work started evaluating on newer datasets. In [16], a combination of LBP with an AdaBoost classifier was evaluated on the popular Labeled Faces in the Wild (LFW) [17] benchmark, although the main usage of LFW is face recognition. Due to recent advances in deep models, several methods have been proposed with significant improvements for gender prediction. For instance, [18-20] developed deep models for classification, and [21] proposed a network for both gender and smile prediction.

Age Classification. The problem of age classification from facial attributes has also recently attracted significant attention, due to its usefulness in real-world applications. Detailed surveys of age classification can be found in [22] and [23], and more recently in [24]. Early methods used the extraction of ratios [25], given the facial landmarks of each image; [26] used a similar method for age progression. Since all these methods need accurate localization, the benchmarks used consist of highly controlled, constrained images. In another line of work, a few methods represent the aging process as a subspace [27] or a manifold [28]. Since these methods require near-frontal faces, they are developed on constrained datasets with near-frontal faces (e.g., UIUC-IFP-Y [28,29], FG-NET [30], and MORPH [31]). Like the early approaches, these methods are therefore inadequate for in-the-wild datasets. Aside from the approaches described above, some methods use local features to represent face images. In [32], Gaussian Mixture Models (GMM) [33] were used to represent the distribution of local patches. In [34], GMM were also used, but with robust descriptors instead of pixel patches. Alternatively, Hidden Markov Model supervectors [35] were used in [36] to represent face patch distributions.


As an alternative to the previous methods, [37] used Gabor image descriptors [38]. In [39], a combination of Biologically-Inspired Features (BIF) [40] and various manifold-learning methods was used for age estimation. The Gabor image descriptor [38] and local binary patterns (LBP) [41] were used in [42] with a hierarchical age classifier based on SVM [43]. Moreover, [44,45] proposed improved versions of relevant component analysis [46] and locality preserving projections [47], which are used for dimensionality reduction with Active Appearance Models (AAM) [48] as the image feature. Up to this point, the best performing methods were demonstrated on the Group Photos benchmark [49]. All of these methods have proven effective on small and/or constrained benchmarks for age estimation. Aside from datasets taken under highly controlled conditions, AgeDB [50] is the first manually collected in-the-wild age database. In this paper, however, similar to [1], we focus on the more challenging in-the-wild Adience benchmark (for instance, in comparison to LFW [17]), and we train and report our system performance on this challenging dataset. Finally, researchers have leveraged deep neural networks to achieve better results on facial age classification tasks. In [51], a deep model for real and apparent age prediction was proposed using the VGG-16 architecture trained on ImageNet [52]. In addition, [53-56] achieved significant improvements in age prediction by using deep architectures. The work in [1] leveraged recent developments in DNNs and outperformed all previous methods by training simple, independent deep networks for both age and gender classification on the Adience dataset. However, these methods are still far behind human accuracy, due to their intrinsically simple architectures and the fact that age and gender are trained separately. In this work, we show that our proposed method, which leverages a deep CNN with a cross-modal learning approach, outperforms all previous methods for age classification.

2.2 Deep Neural Networks

A detailed survey of recent advances in CNNs can be found in [57]. The LeNet-5 network described in [58] for optical character recognition is considered one of the first applications of CNNs, although the network was relatively modest compared with modern deep networks, due to the limited computational resources of the time. AlexNet [59] was then introduced and significantly improved the classification accuracy in the ImageNet competition. To name a few applications, human pose estimation [60], face parsing [61], facial key-point detection [62-64], face analysis [65,66], speech recognition [67], and action classification [68] have all benefited from recent advances in the design of deep CNNs. More recently, beyond single deep architectures, the emergence of cross-modal deep architectures [69] and [24] has led to improvements where the model for one task can be built upon the model of another task.

3 Background

In this section, we first present how age and gender are associated and can provide useful information for each other in the classification task, and then we discuss our high-level designs and the underlying architecture.

3.1 Age and Gender Association

In our context, age and gender classification are associated in the sense that the output of each potentially contains valuable information for the other task. To illustrate, the skin of a man differs significantly from that of a woman. Aside from the ability to grow a beard, which is one obvious example, structural differences include skin thickness, collagen density, loss of collagen with age, texture, and hydration. The thickness of the skin varies with the location, age, and sex of the individual. In addition, androgens (i.e., testosterone), which cause an increase in skin thickness, account for why a man's skin is about 25% thicker than a woman's [3]. Consequently, knowing in advance that an individual is male can help the network better identify the age category, due to attributes that are specific to each gender at each age.

3.2 Age Classification Approach

Our proposed architecture is based on the work of [1], in which a simple CNN architecture was used to independently classify age and gender given the input image. The model consists of three convolutional layers and two FC layers. Despite its simplicity, it improved the state of the art in age and gender classification on the challenging, newly released Adience dataset by 5%. We propose to extend this prior work with cross-modal learning, training an age estimation model upon a pre-trained gender classifier. We feed the extracted embedding of the pre-trained gender classifier to our age classification model during the training phase; in other words, we help our model extract better features for age classification, given the gender embedding from a pre-trained model. The opposite also holds, since knowing the age can provide helpful information for detecting the gender. We considered several variants of the embedding and feeding in our system, and finally propose a deep end-to-end model which automatically chooses the best variant. In this work, we first re-train a gender classifier following [1] with the same architecture. Afterwards, we train our age classification model given the image and the gender embedding from the pre-trained gender classifier. Finally, we test our model following the same steps as [1] and other works, to compare our results and accuracy with proposed systems under the same conditions. We explain the detailed architecture as well as our proposed designs in the next section. Figure 1 illustrates the cross-modal learning idea in CNNs.


Fig. 1. Cross-modal learning. This shows how cross-modal learning works in general. We consider the model as a black box with any arbitrary design. We have a pre-trained model for task one. After obtaining the embedding of the input from the pre-trained model, we can concatenate the embedding at some layer in our primary model, aiming to obtain better results on the training task.

4 Technical Approach

In this section, we first present our model architecture along with possible designs, then discuss the training and testing process, and finally the iterative model refinement technique and its effect on accuracy. It is worth mentioning that since access to personal information on the subjects in the images is limited, the available datasets are small compared to face and object detection datasets; we are therefore exposed to the risk of over-fitting. As a consequence, we avoid over-fitting by using fewer layers and neurons, as well as dropout layers.

4.1 Model Architecture

We use the CNN network architecture introduced in [1] as the basis for our work. This underlying architecture is illustrated in Fig. 2 and follows the conventions of the AlexNet [59] architecture. It consists of three convolutional layers followed by two FC layers with a small number of neurons. Compared to other architectures, the proposed network clearly has fewer layers and neurons. This choice is motivated by the desire to avoid over-fitting, as well as by the nature of the problem: the output is of size eight, which is fairly small compared to other classification problems such as object detection, which consists of tens of thousands of output categories. Since we follow the AlexNet architecture conventions, we use the same parameters and hyper-parameters for the first three convolutional layers.


Fig. 2. Illustration of our underlying CNN architecture. The model contains three convolutional layers, each followed by a rectified linear operation and a pooling layer. The first two layers are also followed by local response normalization [59]. The first convolutional layer contains 96 filters of 7 × 7 pixels, the second contains 256 filters of 5 × 5 pixels, and the third and final convolutional layer contains 384 filters of 3 × 3 pixels. Finally, two FC layers are added, each containing 512 neurons.

For the last two FC layers, we choose 512 as the number of neurons in order to avoid over-fitting. All three color channels are processed through the network. Each input image is first re-scaled to 256 × 256, and a crop of 227 × 227 is then fed to the network, depending on the cropping method used. Below are the details of the three convolutional layers.

Layer 1: 96 filters of size 3 × 7 × 7 pixels are applied to the input image, followed by a rectified linear unit (ReLU), a max pooling layer that takes the maximum value of each 3 × 3 region with 2-pixel strides (the stride is the distance between the receptive field centers of neighboring neurons in a kernel map), and a local response normalization layer. In the first convolutional layer, we used neurons with receptive field size F = 11, stride S = 4 and no zero padding (P = 0). Since (227 − 11)/4 + 1 = 55, and because we have 96 filters (K = 96), the convolutional layer output volume has size 96 × 55 × 55. Each of the 96 × 55 × 55 neurons in this volume is connected to a region of size 3 × 11 × 11 in the input volume, and all 96 neurons in each depth column are connected to the same 3 × 11 × 11 region of the input, but with different weights. Since the max pooling region is 3 × 3 with S = 2, we have (55 + 1)/2 = 28 for both W and H; the output of this layer is therefore of size 96 × 28 × 28.

Layer 2: This layer takes the 96 × 28 × 28 output of the previous layer as input. It contains 256 filters of size 96 × 5 × 5 pixels with stride one (S = 1) and the same padding, followed by a ReLU, a max pooling layer, and again local response normalization with the same hyper-parameters as before. Hence, the convolutional output is 256 × 28 × 28, and since S = 2 in the pooling layer, the output of this step is 256 × 14 × 14.


Layer 3: The last convolutional layer operates on a 256 × 14 × 14 blob. It contains 384 filters of size 256 × 3 × 3 pixels (3 × 3 filters), followed by a ReLU and a max pooling layer, again with the same hyper-parameters as before. It is worth noting that we again use a stride of one (S = 1) and the same padding as the previous convolutional layer. As a consequence, the output is of size 384 × 6 × 6. The FC layers are then defined as follows.

Layer 4: This layer receives the output of the third convolutional layer (384 × 6 × 6), reshaped to a single array of size 13824, and contains 512 neurons followed by a ReLU and a dropout layer.

Layer 5: This layer receives the 512-dimensional output of the first FC layer and similarly contains 512 neurons followed by a ReLU and a dropout layer.
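The layer descriptions above can be summarized in a short model definition. The following Keras sketch is our own illustrative reading of Layers 1-5, not the authors' released TensorFlow code; the padding, pooling, and local-response-normalization parameters are assumptions where the text does not pin them down.

```python
# Illustrative Keras sketch of the underlying trunk (Layers 1-5): three conv blocks
# followed by two 512-neuron FC layers with dropout. A softmax head (8 age groups
# or 2 genders) is appended per task.
import tensorflow as tf
from tensorflow.keras import layers, models

def lrn(x):
    # local response normalization, applied after the first two conv blocks
    return tf.nn.local_response_normalization(x)

def build_trunk():
    return models.Sequential([
        layers.Conv2D(96, 7, strides=4, activation='relu',
                      input_shape=(227, 227, 3)),                  # Layer 1
        layers.MaxPooling2D(pool_size=3, strides=2),
        layers.Lambda(lrn),
        layers.Conv2D(256, 5, padding='same', activation='relu'),  # Layer 2
        layers.MaxPooling2D(pool_size=3, strides=2),
        layers.Lambda(lrn),
        layers.Conv2D(384, 3, padding='same', activation='relu'),  # Layer 3
        layers.MaxPooling2D(pool_size=3, strides=2),
        layers.Flatten(),
        layers.Dense(512, activation='relu'),                      # Layer 4 (FC1)
        layers.Dropout(0.5),
        layers.Dense(512, activation='relu'),                      # Layer 5 (FC2)
        layers.Dropout(0.5),
    ])
```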

4.2 Prediction Head and Cross-Modal Learning

We could simply feed the output of the last FC layer to a softmax classifier and interpret each output as the probability of the corresponding category. In this work, however, our prediction head uses, for a given input image, both the last FC layer and the gender embedding from the pre-trained gender classifier. There are several variants for choosing the gender embedding, since the last few layers of the gender classification model encode the image at different levels: intuitively, the first FC layer extracts less abstract features, whereas the second FC layer embeds more complex features. In this design, we use three separate architectures, which we discuss below.

FC Layer Embedding of Gender into Age Classifier (Manual Embedding). The age classifier network architecture remains the same, and one FC layer and a softmax layer are added to let the network learn domain-specific features.

Layer 6: This layer has 512 neurons and is fully connected to both the last FC layer of the model (512 dimensions) and the gender embedding of that particular image, followed by a ReLU and a dropout layer.

Layer 7: This is a softmax layer serving as the classifier output; it contains eight neurons (classes) for the age classifier and is connected to the last FC layer. Owing to the softmax function, the output vector can be interpreted as the probability of each class being true.

There are a few variants with regard to choosing the gender embedding for each image (each of the last two FC layers of the gender classification model). In addition, we propose two architectures for the age classifier to incorporate the gender embedding in our cross-modal learning. In our experiments, we tried both architectures to evaluate prediction accuracy. A schematic of this architecture can be seen in Figs. 3 and 4.
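As an illustration of the manual-embedding head (Layers 6-7), the sketch below concatenates a 512-dimensional gender embedding with the age trunk's FC2 activation before the final 512-unit FC layer and the 8-way softmax. It assumes a Keras trunk like the one sketched above and a precomputed gender embedding; all names are hypothetical, and this is not the authors' implementation.

```python
# Illustrative sketch of the "manual embedding" prediction head: the age model's FC2
# output is concatenated with a gender embedding before the final FC + softmax.
from tensorflow.keras import layers, Model

def build_cross_modal_head(age_trunk, gender_embedding_dim=512, num_age_classes=8):
    """age_trunk: a Keras model mapping an image to its 512-d FC2 activation."""
    image = layers.Input(shape=(227, 227, 3), name='image')
    gender_embedding = layers.Input(shape=(gender_embedding_dim,),
                                    name='gender_embedding')

    fc2 = age_trunk(image)                                   # 512-d age features
    merged = layers.Concatenate()([fc2, gender_embedding])   # 1024-d cross-modal vector
    fc3 = layers.Dense(512, activation='relu')(merged)       # Layer 6
    fc3 = layers.Dropout(0.5)(fc3)
    out = layers.Dense(num_age_classes, activation='softmax')(fc3)  # Layer 7
    return Model(inputs=[image, gender_embedding], outputs=out)
```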


Fig. 3. Illustration of our first architecture (Manual 1). We use an FC layer of our pre-trained gender classifier as the embedding of the input and feed it into FC3 of our prediction head along with FC2 of our underlying model.

End-to-End (E2E) Architecture. Following the inception idea of [70] by Google, we give the model the flexibility to combine all variants and to choose the embedding and concatenation layer that works best for the age classifier (out of the four possibilities here). To this end, we design an architecture in which the network considers all variations, and the last FC layer sets proper weights for each variation and outputs the predicted label. From a performance perspective, most computations remain shared and all convolutional layers remain unchanged, so the number of parameters does not grow significantly. The detailed structure can be seen in Fig. 5; the rest is the same as the architecture explained above.

4.3 Training

For re-training the gender classifier, weights are initialized randomly from a Gaussian distribution with zero mean and standard deviation 0.01 at all layers. For the age classification model, weights are then initialized from the pre-trained gender classifier, since the purpose of the prediction head is to make the model domain-specific. We use cross-entropy as our loss function and train the model without using any data outside of the dataset. Target values are represented as sparse binary vectors corresponding to the ground-truth classes, in which the correct class is one and all other entries are zero; for each task, the target vector is of size eight (age classification) or two (gender classification). We deploy two methods to limit the risk of over-fitting. First, we use dropout: based on the architecture, we have two dropout layers, each with a dropout ratio of 0.5 (a 50% chance of setting a neuron to zero). Second, we use data augmentation, taking a random crop of 227 × 227 from the image and randomly mirroring it in each forward-backward training pass. Training is performed using stochastic gradient descent with a batch size of 50 images and a learning rate of 1e−3, reduced to 1e−4 after 10,000 iterations.
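A minimal sketch of this training setup follows, assuming a tf.data pipeline that yields 256 × 256 images with one-hot labels; the batch size, learning-rate schedule, and augmentation follow the text, while the function names and pipeline details are our own illustration.

```python
# Illustrative training setup: SGD, batch size 50, lr 1e-3 dropped to 1e-4 after
# 10,000 iterations, cross-entropy loss, random 227x227 crops with random mirroring.
import tensorflow as tf

def augment(image_256, label):
    image = tf.image.random_crop(image_256, size=[227, 227, 3])
    image = tf.image.random_flip_left_right(image)
    return image, label

def train(model, dataset_256, steps=20_000):
    # dataset_256 is assumed to yield (256x256x3 image, one-hot label) pairs
    ds = dataset_256.map(augment).shuffle(1000).batch(50).repeat()
    lr = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
        boundaries=[10_000], values=[1e-3, 1e-4])   # drop lr after 10,000 iterations
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(ds, steps_per_epoch=steps, epochs=1)
```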


Fig. 4. Illustration of our second architecture (Manual 2). We use an FC layer of our pre-trained gender classifier as the embedding of the input and feed it into the softmax layer of our prediction head along with FC3 of our underlying model. This design gives the model less room to fine-tune.

4.4 Testing

We experiment with two methods for cropping the 256 × 256 input image, shown in Fig. 6:

– Center-Crop: crop 227 × 227 pixels from the center around the face and feed it to the network.
– Over-Sampling: crop the input image five times, taking four 227 × 227 crops from the corners and one 227 × 227 crop from the center; all five crops, along with their horizontal reflections, are fed to the network, and the final prediction is the average over all variations.

Small misalignments, which are common given the challenging nature of the Adience dataset, can negatively impact the final results. The over-sampling method compensates for these small misalignments by feeding the network multiple translated versions of the same face.
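A sketch of the over-sampling inference is shown below, under the assumption that the model maps a batch of 227 × 227 crops to softmax probabilities; the crop offsets and helper names are illustrative.

```python
# Illustrative over-sampling inference: five 227x227 crops (four corners + center)
# plus their horizontal mirrors; the final prediction averages the ten softmax outputs.
import tensorflow as tf

def oversample_predict(model, image_256):
    h = w = 227
    offsets = [(0, 0), (0, 256 - w), (256 - h, 0), (256 - h, 256 - w),
               ((256 - h) // 2, (256 - w) // 2)]
    crops = [image_256[y:y + h, x:x + w, :] for (y, x) in offsets]
    crops += [tf.image.flip_left_right(c) for c in crops]
    probs = model(tf.stack(crops), training=False)   # shape (10, num_classes)
    return tf.reduce_mean(probs, axis=0)             # averaged class probabilities
```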

4.5 Iterative Model Refinement

So far, we have kept the gender classifier unchanged, as proposed by [1]. Even so, we have achieved an improvement in the age classifier, which outperforms the current state-of-the-art methods.


Fig. 5. Illustration of the E2E system. This diagram represents how automation works in our design. We assume the gender classifier has two consecutive FC layers, which we call 1 and 2 and color green and yellow, respectively. In the age classifier, after obtaining the output of the convolutional layers, we split it into four copies corresponding to the different variations.

Similarly, we can build the gender classifier with the same approach upon the newly trained age classifier, thereby obtaining a better gender classifier than the one we currently have. In our age classification approach, the accuracy of the system depends strongly on the gender embedding, and consequently on the accuracy of the gender classifier; having a more accurate gender classifier therefore helps to re-train the age model more reliably. As a result, we can keep repeating the last two steps, which keeps providing better classifiers: each time we re-train the age (gender) model from scratch with a more accurate gender (age) classifier, we expect to train a better age (gender) classifier. Ultimately, at some point the improvement stops, which intuitively indicates that the system cannot infer further useful information from the embeddings. However, due to the limited size of the dataset and the limited variation among the pictures, we have not been able to achieve promising results with this technique; we strongly believe that a richer dataset would yield better results for both the age and gender classifiers. This approach can be seen in Fig. 7.
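The refinement loop can be sketched as follows; `train_age`, `train_gender`, and `evaluate` are hypothetical callables standing in for the full training and evaluation pipelines, and the stopping rule is our own reading of "as long as we have improvement over one or both tasks".

```python
# Illustrative sketch of iterative model refinement: alternate between re-training
# each classifier on the other's embeddings until neither accuracy improves.
def iterative_refinement(train_gender, train_age, evaluate, max_rounds=5):
    gender_model = train_gender(age_model=None)        # step 1: plain gender classifier
    age_model = train_age(gender_model=gender_model)   # step 2: cross-modal age classifier
    best = evaluate(age_model, gender_model)           # (age_acc, gender_acc)
    for _ in range(max_rounds):
        gender_model = train_gender(age_model=age_model)
        age_model = train_age(gender_model=gender_model)
        acc = evaluate(age_model, gender_model)
        if acc[0] <= best[0] and acc[1] <= best[1]:    # no improvement on either task
            break
        best = acc
    return age_model, gender_model
```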


Fig. 6. Two cropping methods. The left design shows the over-sampling method: for a random 256 × 256 image, we crop five squares of size 227 × 227, which improves alignment quality. The right design shows the center-crop method, which crops the 227 × 227 center around the face of any random 256 × 256 image.

5 System and Evaluation

Our system is implemented in TensorFlow [71], an open-source framework supported by Google, chosen for its advantages in parallelism, simplicity, and rich resources. We trained the network on a GeForce GTX 1060 GPU with 6 GB of memory.

Fig. 7. Iterative model refinement approach. This diagram represents the idea of iterative model refinement when we have two tasks where an improvement in one can also improve the other task's accuracy. Initially, gender classifier 1 is pre-trained; we can then build age classifier 1 upon it and improve on the first age classifier's accuracy, and we can repeat this as long as one or both tasks improve.


Table 1. Adience dataset breakdown by gender category

Gender   0–2    4–6    8–13   15–20  25–32  38–43  48–53  60–   Total
Male     745    928    934    734    2308   1294   392    442   8192
Female   682    1234   1360   919    2589   1056   433    427   9411
Both     1427   2162   2294   1653   4897   2350   825    869   19487

Our gender classifier takes roughly five hours to train, and our age classification model about 10 h. There are two reasons why the latter takes longer than the gender classification architecture. First, we have one more layer, which means more parameters to train at each step. Second, each image batch must first be passed through the gender classifier to obtain the embedding of each image, which is then fed to the FC layer at each training step. At test time, each image takes 300 ms on average to process, after which the system outputs the predicted label.

5.1 Adience Benchmark

We train and test the accuracy of our deep model on the newly released Adience benchmark [4]. The Adience dataset mostly contains images that were automatically uploaded to Flickr. Since these images were uploaded from mobile devices without filtering, viewing conditions are highly unconstrained. As opposed to other datasets like the LFW collection [17], whose images are taken, with filtering, from media web pages or social websites, Adience images are challenging by nature, due to extreme variations in pose, occlusion, lighting conditions, and other factors. The Adience dataset contains roughly 26K images of 2,284 subjects; the breakdown by category can be seen in Table 1. Testing for both age and gender classification is performed using the standard five-fold, subject-exclusive cross-validation protocol defined in [4]. In the next section, we compare previously reported results with the results our design has produced.

5.2 Results

In Table 2, we report the results of our gender classifier, which are the same as those of [1] since we follow their architecture. In Table 3, our results for the different designs can be observed. In addition, Table 4 shows the results of our iterative model refinement approach, which are not as good as expected due to the limited gender variation in the dataset. For our age classifier, we measure the accuracy of the system both when the design predicts the exact age group and when the prediction differs from the correct category by at most one (the correct category or the two adjacent ones), which we call "1-off", following prior work.
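For clarity, the two metrics can be computed as follows for integer-coded age groups; this is an illustrative helper, not part of the original evaluation code.

```python
# "Exact" and "1-off" accuracy over the eight age groups: a prediction is 1-off
# correct if it hits the true group or one of its two adjacent groups.
import numpy as np

def age_accuracies(predicted, true):
    predicted, true = np.asarray(predicted), np.asarray(true)  # group indices 0..7
    exact = np.mean(predicted == true)
    one_off = np.mean(np.abs(predicted - true) <= 1)
    return exact, one_off

# Example: age_accuracies([2, 5, 7], [2, 4, 0]) -> (0.333..., 0.666...)
```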

Table 2. Gender classification methods and accuracies

Method                      Accuracy
Best from [4]               77.8 ± 1.3
Best from [72]              79.3 ± 0.0
Best from [1] single-crop   85.9 ± 1.4
Best from [1] over-sample   86.8 ± 1.4

Both the proposed age and gender classifiers have been compared with the methods described in [4] and [1], since the Adience dataset is the benchmark on which we tested our systems. The proposed gender classifier has also been compared to the results of [72], which uses the same gender classification pipeline as [4] applied to more effectively aligned faces.

Table 3. Age classification methods and accuracies (over-sampling)

Method                  Exact        1-off
Best from [4]           45.1 ± 2.6   79.5 ± 1.4
single-crop [1]         49.5 ± 4.4   84.6 ± 1.7
over-sample [1]         50.7 ± 5.1   84.7 ± 2.2
Manual 1 architecture   53.3 ± 3.2   85.9 ± 2.5
Manual 2 architecture   55.1 ± 2.6   88.5 ± 2.9
E2E system              60.2 ± 1.8   92.5 ± 1.8

Based on our results in Table 3, we outperform the most accurate age classifiers by a considerable margin on the Adience benchmark. We point out that we use the over-sampling method in all of our experiments, since it performs better than the center-crop method.

Table 4. Model refinement iterations and accuracies

Classifier   Step   Accuracy
Gender       1      86.8 ± 1.4
Age          1      55.1 ± 2.6
Gender       2      84.8 ± 1.2

6 Conclusions

Age and gender classifiers still lack accuracy. In this work, we developed a set of techniques that leverage a gender classifier to improve the accuracy of age classification. We evaluated several variants on the Adience dataset, achieving an improvement of over 9% compared to current techniques. In future work, we plan to investigate further optimization of the gender, age, and other cross-modal learning, for instance by iteratively refining the models from one embedding to another.

References
1. Levi, G., Hassner, T.: Age and gender classification using convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2015
2. Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: DeepFace: closing the gap to human-level performance in face verification. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708, June 2014
3. Howard, D.: Is a man's skin really different? The International Dermal Institute
4. Eidinger, E., Enbar, R., Hassner, T.: Age and gender estimation of unfiltered faces. IEEE Trans. Inf. Forensics Secur. 9(12), 2170–2179 (2014)
5. Mäkinen, E., Raisamo, R.: Evaluation of gender classification methods with automatically detected and aligned faces. IEEE Trans. Pattern Anal. Mach. Intell. 30(3), 541–547 (2008)
6. Reid, D., Samangooei, S., Chen, C., Nixon, M., Ross, A.: Soft biometrics for surveillance: an overview, January 2013
7. Golomb, B.A., Lawrence, D.T., Sejnowski, T.J.: SEXNET: a neural network identifies sex from human faces. In: Advances in Neural Information Processing Systems 3, NIPS Conference, Denver, Colorado, USA, 26–29 November 1990, pp. 572–579 (1990)
8. O'Toole, A.J., Vetter, T., Troje, N.F., Bülthoff, H.H.: Sex classification is better with three-dimensional head structure than with image intensity information. Perception 26(1), 75–84 (1997). PMID: 9196691
9. Moghaddam, B., Yang, M.: Learning gender with support faces. IEEE Trans. Pattern Anal. Mach. Intell. 24(5), 707–711 (2002)
10. Baluja, S., Rowley, H.A.: Boosting sex identification performance. Int. J. Comput. Vision 71(1), 111–119 (2007)
11. Toews, M., Arbel, T.: Detection, localization, and sex classification of faces from arbitrary viewpoints and under occlusion. IEEE Trans. Pattern Anal. Mach. Intell. 31(9), 1567–1581 (2009)
12. Chen, J., Shan, S., He, C., Zhao, G., Pietikäinen, M., Chen, X., Gao, W.: WLD: a robust local image descriptor. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1705–1720 (2010)
13. Ullah, I., Aboalsamh, H., Hussain, M., Muhammad, G., Mirza, A., Bebis, G.: Gender recognition from face images with local LBP descriptor. 65, 353–360 (2012)
14. Phillips, P.J., Wechsler, H., Huang, J., Rauss, P.J.: The FERET database and evaluation procedure for face-recognition algorithms. Image Vision Comput. 16(5), 295–306 (1998)


15. Perez, C., Tapia, J., Estevez, P., Held, C.: Gender classification from face images using mutual information and feature fusion. Int. J. Optomechatronics 6(1), 92–119 (2012)
16. Shan, C.: Learning local binary patterns for gender classification on real-world face images. Pattern Recogn. Lett. 33, 431–437 (2012)
17. Huang, G.B., Mattar, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: a database for studying face recognition in unconstrained environments, October 2008
18. Akbulut, Y., Şengür, A., Ekici, S.: Gender recognition from face images with deep learning. In: 2017 International Artificial Intelligence and Data Processing Symposium (IDAP), pp. 1–4, September 2017
19. Mansanet, J., Albiol, A., Paredes, R.: Local deep neural networks for gender recognition. Pattern Recogn. Lett. 70, 80–86 (2016)
20. Antipov, G., Berrani, S., Dugelay, J.: Minimalistic CNN-based ensemble model for gender prediction from face images. Pattern Recogn. Lett. 70, 59–65 (2016)
21. Zhang, K., Tan, L., Li, Z., Qiao, Y.: Gender and smile classification using deep convolutional neural networks. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2016, Las Vegas, NV, USA, 26 June–1 July 2016, pp. 739–743 (2016)
22. Fu, Y., Guo, G., Huang, T.S.: Age synthesis and estimation via faces: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 32(11), 1955–1976 (2010)
23. Han, H., Otto, C., Jain, A.K.: Age estimation from face images: human vs. machine performance. In: International Conference on Biometrics, ICB 2013, Madrid, Spain, 4–7 June 2013, pp. 1–8 (2013)
24. Salvador, A., Hynes, N., Aytar, Y., Marín, J., Ofli, F., Weber, I., Torralba, A.: Learning cross-modal embeddings for cooking recipes and food images. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017, pp. 3068–3076 (2017)
25. Kwon, Y.H., da Vitoria Lobo, N.: Age classification from facial images. In: Conference on Computer Vision and Pattern Recognition, CVPR 1994, Seattle, WA, USA, 21–23 June 1994, pp. 762–767 (1994)
26. Ramanathan, N., Chellappa, R.: Modeling age progression in young faces. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), New York, NY, USA, 17–22 June 2006, pp. 387–394 (2006)
27. Geng, X., Zhou, Z.H., Smith-Miles, K.: Automatic age estimation based on facial aging patterns. IEEE Trans. Pattern Anal. Mach. Intell. 29(12), 2234–2240 (2007)
28. Guo, G., Fu, Y., Dyer, C.R., Huang, T.S.: Image-based human age estimation by manifold learning and locally adjusted robust regression. IEEE Trans. Image Processing 17(7), 1178–1188 (2008)
29. Fu, Y., Huang, T.S.: Human age estimation with regression on discriminative aging manifold. IEEE Trans. Multimedia 10(4), 578–584 (2008)
30. INRIA: The FG-Net ageing database (2002). www.prima.inrialpes.fr/fgnet/html/benchmarks.html
31. Ricanek Jr., K., Tesafaye, T.: MORPH: a longitudinal image database of normal adult age-progression. In: Seventh IEEE International Conference on Automatic Face and Gesture Recognition (FG 2006), Southampton, UK, 10–12 April 2006, pp. 341–345 (2006)
32. Yan, S., Zhou, X., Liu, M., Hasegawa-Johnson, M., Huang, T.S.: Regression from patch-kernel. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008), Anchorage, Alaska, USA, 24–26 June 2008 (2008)


33. Fukunaga, K.: Introduction to Statistical Pattern Recognition, pp. 1–592 (1991)
34. Yan, S., Liu, M., Huang, T.S.: Extracting age information from local spatially flexible patches. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2008, 30 March–4 April 2008, Caesars Palace, Las Vegas, Nevada, USA, pp. 737–740 (2008)
35. Ghahramani, Z.: An introduction to hidden Markov models and Bayesian networks. IJPRAI 15(1), 9–42 (2001)
36. Zhuang, X., Zhou, X., Hasegawa-Johnson, M., Huang, T.: Face age estimation using patch-based hidden Markov model supervectors. In: 2008 19th International Conference on Pattern Recognition, pp. 1–4, December 2008
37. Gao, F., Ai, H.: Face age classification on consumer images with Gabor feature and fuzzy LDA method. In: Proceedings of the Advances in Biometrics, Third International Conference, ICB 2009, Alghero, Italy, 2–5 June 2009, pp. 132–141 (2009)
38. Liu, C., Wechsler, H.: Gabor feature based classification using the enhanced fisher linear discriminant model for face recognition. IEEE Trans. Image Process. 11(4), 467–476 (2002)
39. Guo, G., Mu, G., Fu, Y., Dyer, C.R., Huang, T.S.: A study on automatic age estimation using a large database. In: IEEE 12th International Conference on Computer Vision, ICCV 2009, Kyoto, Japan, 27 September–4 October 2009, pp. 1986–1991 (2009)
40. Riesenhuber, M., Poggio, T.: Hierarchical models of object recognition in cortex. Nat. Neurosci. 2, 1019–1025 (1999)
41. Ahonen, T., Hadid, A., Pietikäinen, M.: Face description with local binary patterns: application to face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 28(12), 2037–2041 (2006)
42. Choi, S.E., Lee, Y.J., Lee, S.J., Park, K.R., Kim, J.: Age estimation using a hierarchical classifier based on global and local facial features. Pattern Recogn. 44(6), 1262–1281 (2011)
43. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
44. Chao, W., Liu, J., Ding, J.: Facial age estimation based on label-sensitive learning and age-oriented regression. Pattern Recogn. 46(3), 628–641 (2013)
45. Mirzazadeh, R., Moattar, M.H., Jahan, M.V.: Metamorphic malware detection using linear discriminant analysis and graph similarity. In: 2015 5th International Conference on Computer and Knowledge Engineering (ICCKE), pp. 61–66, October 2015
46. Bar-Hillel, A., Hertz, T., Shental, N., Weinshall, D.: Learning distance functions using equivalence relations. In: Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), 21–24 August 2003, Washington, DC, USA, pp. 11–18 (2003)
47. He, X., Niyogi, P.: Locality preserving projections. In: Advances in Neural Information Processing Systems 16, Neural Information Processing Systems, NIPS 2003, Vancouver and Whistler, British Columbia, Canada, 8–13 December 2003, pp. 153–160 (2003)
48. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. In: Computer Vision - ECCV 1998, 5th European Conference on Computer Vision, Freiburg, Germany, 2–6 June 1998, Proceedings, vol. II, pp. 484–498 (1998)
49. Gallagher, A.C., Chen, T.: Understanding images of groups of people. In: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, Florida, USA, 20–25 June 2009, pp. 256–263 (2009)


50. Moschoglou, S., Papaioannou, A., Sagonas, C., Deng, J., Kotsia, I., Zafeiriou, S.: AgeDB: the first manually collected, in-the-wild age database. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops, Honolulu, HI, USA, 21–26 July 2017, pp. 1997–2005 (2017)
51. Rothe, R., Timofte, R., Gool, L.V.: Deep expectation of real and apparent age from a single image without facial landmarks. Int. J. Comput. Vision 126(2–4), 144–157 (2018)
52. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M.S., Berg, A.C., Li, F.: ImageNet large scale visual recognition challenge. CoRR abs/1409.0575 (2014)
53. Pei, W., Dibeklioglu, H., Baltrusaitis, T., Tax, D.M.J.: Attended end-to-end architecture for age estimation from facial expression videos. CoRR abs/1711.08690 (2017)
54. Chen, J., Kumar, A., Ranjan, R., Patel, V.M., Alavi, A., Chellappa, R.: A cascaded convolutional neural network for age estimation of unconstrained faces. In: 8th IEEE International Conference on Biometrics Theory, Applications and Systems, BTAS 2016, Niagara Falls, NY, USA, 6–9 September 2016, pp. 1–8 (2016)
55. Xing, J., Li, K., Hu, W., Yuan, C., Ling, H.: Diagnosing deep learning models for high accuracy age estimation from a single image. Pattern Recogn. 66, 106–116 (2017)
56. Liu, H., Lu, J., Feng, J., Zhou, J.: Group-aware deep feature learning for facial age estimation. Pattern Recogn. 66, 82–94 (2017)
57. Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., Wang, G.: Recent advances in convolutional neural networks. CoRR abs/1512.07108 (2015)
58. LeCun, Y., Boser, B.E., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.E., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989)
59. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a Meeting Held 3–6 December 2012, Lake Tahoe, Nevada, United States, pp. 1106–1114 (2012)
60. Toshev, A., Szegedy, C.: DeepPose: human pose estimation via deep neural networks. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, 23–28 June 2014, pp. 1653–1660 (2014)
61. Luo, P., Wang, X., Tang, X.: Hierarchical face parsing via deep learning. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012, pp. 2480–2487 (2012)
62. Sun, Y., Wang, X., Tang, X.: Deep convolutional network cascade for facial point detection. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013, pp. 3476–3483 (2013)
63. Wu, Y., Hassner, T.: Facial landmark detection with tweaked convolutional neural networks. CoRR abs/1511.04031 (2015)
64. Lv, J., Shao, X., Xing, J., Cheng, C., Zhou, X.: A deep regression architecture with two-stage re-initialization for high performance facial landmark detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017, pp. 3691–3700 (2017)
65. Ranjan, R., Sankaranarayanan, S., Castillo, C.D., Chellappa, R.: An all-in-one convolutional neural network for face analysis. CoRR abs/1611.00851 (2016)


66. Dehghan, A., Ortiz, E.G., Shu, G., Masood, S.Z.: DAGER: deep age, gender and emotion recognition using convolutional neural network. CoRR abs/1702.04280 (2017)
67. Graves, A., Mohamed, A., Hinton, G.E.: Speech recognition with deep recurrent neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, 26–31 May 2013, pp. 6645–6649 (2013)
68. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Li, F.: Large-scale video classification with convolutional neural networks. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, 23–28 June 2014, pp. 1725–1732 (2014)
69. Xu, D., Ouyang, W., Ricci, E., Wang, X., Sebe, N.: Learning cross-modal deep representations for robust pedestrian detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017, pp. 4236–4244 (2017)
70. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. CoRR abs/1409.4842 (2014)
71. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I.J., Harp, A., Irving, G., Isard, M., Jia, Y., Józefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D.G., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P.A., Vanhoucke, V., Vasudevan, V., Viégas, F.B., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: large-scale machine learning on heterogeneous distributed systems. CoRR abs/1603.04467 (2016)
72. Hassner, T., Harel, S., Paz, E., Enbar, R.: Effective face frontalization in unconstrained images. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, 7–12 June 2015, pp. 4295–4304 (2015)

Multi-stage Reinforcement Learning for Object Detection

Jonas König1, Simon Malberg1, Martin Martens1, Sebastian Niehaus2, Artus Krohn-Grimberghe3, and Arunselvan Ramaswamy1

1 Paderborn University, Paderborn, Germany
2 AICURA Medical GmbH, Berlin, Germany
[email protected]
3 Lytiq GmbH, Paderborn, Germany

Abstract. We present a reinforcement learning approach for detecting objects within an image. Our approach performs a step-wise deformation of a bounding box with the goal of tightly framing the object. It uses a hierarchical, tree-like representation of predefined region candidates which the agent can zoom in on. This reduces the number of region candidates that must be evaluated, so that the agent can afford to compute new feature maps before each step to enhance detection quality. We compare an approach based purely on zoom actions with one that is extended by a second refinement stage to fine-tune the bounding box after each zoom step. We also improve the fitting ability by allowing for different aspect ratios of the bounding box. Finally, we propose different reward functions to better guide the agent along its search trajectories. Experiments indicate that each of these extensions leads to more correct detections. The best performing approach comprises a zoom stage and a refinement stage, uses aspect-ratio-modifying actions, and is trained using a combination of three different reward metrics.

Keywords: Deep reinforcement learning · Q-learning · Object detection

1 Introduction

For humans and many other biological systems, it is natural to extract visual information sequentially [1]. Even though non-biological systems are very different, approaches inspired by biological systems often achieve good results [2,3]. In traditional computer vision, brute-force approaches like sliding windows and region proposal methods are used to detect objects by evaluating all proposed regions and choosing the one or ones that are likely to contain an object [4-6]. We propose a sequential and hierarchical object detection approach similar to the ones proposed by [3] and [7], which is refined after every hierarchical step. As in biological visual systems, the next region to look at depends on the previous decisions and states.


Our approach uses deep reinforcement learning to train an agent that sequentially decides which region to look at next. In our most sophisticated approach, the agent repeatedly acts in two alternating action stages, zooming and moving, until it assumes that it adequately frames an object with a tightly fitting bounding box. The actions allow the bounding box to be deformed enough to reach many bounding boxes of different sizes, positions, and shapes. Because of the hierarchical zooming process, only a small number of region candidates must be evaluated, so it is affordable to compute new feature maps after bounding box transformations, as in [3], instead of cropping from a single feature map. This results in a higher spatial resolution of the considered regions, which are therefore more informative and allow for better decision making by the agent. We consider multiple reward functions using different metrics to better guide the training process of the agent compared to using only the Intersection-over-Union. With these metrics, the quality of a state is evaluated and rewarded not only with respect to the fit but also with respect to the future potential of reaching a valid detection. This speeds up the training process and results in better detection quality.

2 Related Work

Many works have already shown that reinforcement learning can be a powerful tool for object detection, without the need for object proposal methods like Selective Search [4] and Edge Boxes [5]. [7] casts the problem of object localization as a Markov decision process in which an agent makes a sequence of decisions, just as our approach does. They use deep Q-learning for class-specific object detection, which allows an agent to step-wise deform a bounding box in its size, position, and aspect ratio to fit an object. A similar algorithm was proposed by [8], which allowed for three-dimensional bounding box transformations in size and position to detect breast lesions; this approach reduced the runtime of breast lesion detection without reducing detection accuracy compared to other approaches in the field. Another approach is to perform hierarchical object detection with actions that only allow iteratively zooming in on an image region, as shown by [3]. This does not allow for changes to the aspect ratio or movements of the bounding box without zooming. The main difference to the previous approaches is that it imposes a fixed hierarchical representation that forces a top-down search, which can reduce the number of actions needed but also reduces the number of reachable bounding box positions. Because of the smaller number of actions, it becomes more affordable to extract high-quality descriptors for each region instead of sharing convolutional features. [2] proposed a method using a sequential approach related to 'saccade and fixate', the search pattern of biological systems. Instead of deforming bounding boxes, the agent fixates on interesting locations of an image and considers regions at those locations from a region proposal method. The agent may terminate and choose one of the already seen bounding boxes for which it has the highest confidence of fitting the object, or fixate on a new location based on already considered regions. Our approach combines some of these ideas to create a two-stage bounding box transformation process which uses zooming and allows for different aspect ratios and bounding box movements.

3 Model

In this section, we define the components of our model and how it is trained. We compare two general models. The 1-stage model is based on the Image-Zooms technique for hierarchical object detection as proposed by [3]. We introduce an extension to that model called 2-stage, which adds a way to perform refining movements of the bounding box. 2-stage compensates for 1-stage's drawback that only a very limited number of bounding boxes can be reached. Additionally, we extend each model by actions to change the aspect ratio of the bounding box to improve the agent's ability to fit narrow objects.

3.1 Markov Decision Process

We use Q-learning to train the agent [9]. This reinforcement learning technique estimates the values of state-action pairs. The agent chooses the action with the highest estimated future value regarding the current state. We formulate the problem as a Markov decision process by defining states, actions and rewards, on which Q-learning can be applied.

States. The state contains the image features encoded by a VGG convolutional neural network [10] and a history vector that stores the last four actions taken by the agent. The history vector is a one-hot vector as used by [7] and [3]. For the 2-stage model, we use two separate history vectors for the two stages. The zoom state contains only the last four zoom actions and the refinement state only the last four refinement actions. The refinement history vector is cleared after every zoom action.

Actions. 1-stage. Using the 1-stage model, the agent can choose between a terminal action and five zoom actions. The terminal action ends the search for an object and returns the current bounding box. Zooming actions shrink the bounding box and thereby zoom in on one of five predefined subregions of the image: top-left, top-right, bottom-left, bottom-right and center. Each zoom action shrinks the bounding box to 75% of its width and height so that possible subregions overlap. Such an overlap is beneficial in situations where an object would otherwise cross the border between two possible subregions and could therefore not be framed without cutting off parts of the object. By iteratively performing these zoom actions, the agent can further zoom into the image until it finally decides that the bounding box fits the object well.

1-stage-ar. The 1-stage-ar model extends the 1-stage model by adding two zoom actions that allow changing the aspect ratio of the bounding box. The bounding box can be compressed either in width or in height while holding the position of its center fixed. For both actions, the bounding box is shrunk to 56.25% of its width or height, respectively, to achieve the same reduction of the bounding box area as the other zoom actions.

2-stage. The 2-stage model adds a second stage to the 1-stage model that performs five consecutive refinement steps after each zoom action. In each refinement step, the agent can move the bounding box left, right, up or down by 10% of its width or height, respectively, or decide to do nothing at all (see Fig. 1).

2-stage-ar. The 2-stage-ar model analogously extends the 1-stage-ar model with a refinement stage.

Fig. 1. The agent can take up to ten zoom actions, which are determined by the first stage. For the 2-stage model, after each zoom step five refinement steps are taken, which are determined by the second stage.
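The bounding-box transformations behind these action sets can be summarized in a few lines. The following Python sketch is illustrative only: the box representation, the function names and the anchoring of the five zoom subregions are our assumptions, while the shrink factors (75%, 56.25%, 10%) follow the description above.

# Illustrative sketch of the bounding-box actions described above.
# A box is (x0, y0, x1, y1) in image coordinates; names are ours.

def zoom(box, subregion, factor=0.75):
    """Shrink the box to `factor` of its width/height, anchored at one of
    five overlapping subregions: tl, tr, bl, br, center."""
    x0, y0, x1, y1 = box
    w, h = (x1 - x0) * factor, (y1 - y0) * factor
    anchors = {
        "tl": (x0, y0),
        "tr": (x1 - w, y0),
        "bl": (x0, y1 - h),
        "br": (x1 - w, y1 - h),
        "center": (x0 + ((x1 - x0) - w) / 2, y0 + ((y1 - y0) - h) / 2),
    }
    nx0, ny0 = anchors[subregion]
    return (nx0, ny0, nx0 + w, ny0 + h)

def compress_aspect(box, axis, factor=0.5625):
    """Aspect-ratio zoom: shrink width or height around the box center."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    w, h = x1 - x0, y1 - y0
    if axis == "width":
        w *= factor
    else:
        h *= factor
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def refine(box, direction, step=0.10):
    """Second-stage refinement: move the box by 10% of its width/height."""
    x0, y0, x1, y1 = box
    dx = (x1 - x0) * step * {"left": -1, "right": 1}.get(direction, 0)
    dy = (y1 - y0) * step * {"up": -1, "down": 1}.get(direction, 0)
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)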

Reward. The reward function scheme used is similar to the one proposed by [3,7]. They only used the Intersection-over-Union (IoU) as a metric to calculate the reward. Although the correctness of a detection is evaluated using the IoU between the bounding box and the ground truth, a zoom reward that is positive if a zoom increases the IoU and negative otherwise often does not favor the best zoom action (see Fig. 2). Therefore, more sophisticated rewards can be useful to guide the training process [11]. Multiple metrics were considered for integration into the reward function to better estimate the quality of a zoom step. Two of them proved useful to increase the detection accuracy and led to faster convergence.

GTC. The Ground Truth Coverage (GTC) is the percentage of the ground truth that is covered by the current bounding box. The main difference between IoU and GTC is that the size of the bounding box has no direct impact on the GTC. If the whole ground truth is covered by the bounding box, the GTC is 1. If the bounding box covers nothing of the ground truth, the GTC is 0. The GTC metric can be used to put a higher weight on still covering a large portion of the ground truth after a zoom step.

CD. The Center Deviation (CD) describes the distance between the center of the bounding box and the center of the ground truth in relation to the length of the diagonal of the original image. Therefore, the highest possible center deviation is asymptotically 1. Unlike IoU and GTC, a small CD is preferred. The CD metric can be used to support zooms in the direction of the ground truth center, which reduces the probability of zooming away from the ground truth.

Reward Function. Instead of using the IoU alone to evaluate the quality of a zoom state, we propose a more sophisticated quality function which additionally uses the introduced metrics to better evaluate the quality of zooms in a single equation:

    r(b, g, d) = α1 · iou(b, g) + α2 · gtc(b, g) + α3 · (1 − cd(b, g, d))    (1)

r(b, g, d) estimates the quality of the current state and is calculated from the bounding box b, the ground truth g and the image diagonal d, weighted with the factors α. The reward for zoom actions is calculated as follows:

    Rz(b, b′, g, d) = sign(r(b′, g, d) − r(b, g, d))    (2)

If the quality function value for the bounding box b′ after a zoom step is larger than the value for the bounding box b before, the zoom is rewarded, otherwise it is penalized. The terminal action has a different reward scheme which assigns a reward if the IoU of the region b with the ground truth is larger than a certain threshold τ and a penalty otherwise. The reward for the terminal action is:

    Rt(b, g) = +η   if iou(b, g) ≥ τ
               −η   otherwise    (3)

In all our experiments, we set η = 3 and τ = 0.5, as in [3,7].

Sigmoid. In our experiments we also used a sigmoid function on each metric in the quality function to see if this nonlinear attenuation helps to improve the learning process. A Gaussian attenuation was also considered but did not achieve a better performance. This results in a new quality function using sigmoids:

    r(b, g, d) = α1 / (1 + e^(β1 (γ1 − iou(b, g)))) + α2 / (1 + e^(β2 (γ2 − gtc(b, g)))) + α3 / (1 + e^(β3 (γ3 − (1 − cd(b, g, d)))))    (4)

βi influences the strength of attenuation through the sigmoid and γi influences its centering. Different parameters were tested, and the best results were achieved with the following values:

    α1 = α2 = α3 = 1,    β1 = β2 = β3 = 8,    γ1 = γ2 = γ3 = 0.5    (5)
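For concreteness, the following Python sketch computes the three metrics and the quality and reward functions of Eqs. (1), (2) and (4) for axis-aligned boxes given as (x0, y0, x1, y1). Function and variable names are ours; the paper does not provide an implementation.

import math

def iou(b, g):
    """Intersection-over-Union of boxes b, g given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(b[0], g[0]), max(b[1], g[1])
    ix1, iy1 = min(b[2], g[2]), min(b[3], g[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(b) + area(g) - inter)

def gtc(b, g):
    """Ground Truth Coverage: fraction of the ground truth covered by b."""
    ix0, iy0 = max(b[0], g[0]), max(b[1], g[1])
    ix1, iy1 = min(b[2], g[2]), min(b[3], g[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    return inter / ((g[2] - g[0]) * (g[3] - g[1]))

def cd(b, g, d):
    """Center Deviation: center distance normalized by image diagonal d."""
    bc = ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)
    gc = ((g[0] + g[2]) / 2, (g[1] + g[3]) / 2)
    return math.hypot(bc[0] - gc[0], bc[1] - gc[1]) / d

def quality(b, g, d, alpha=(1, 1, 1)):                       # Eq. (1)
    return (alpha[0] * iou(b, g) + alpha[1] * gtc(b, g)
            + alpha[2] * (1 - cd(b, g, d)))

def quality_sigmoid(b, g, d, alpha=(1, 1, 1), beta=(8, 8, 8),
                    gamma=(0.5, 0.5, 0.5)):                  # Eq. (4)
    s = lambda a, b_, c_, x: a / (1 + math.exp(b_ * (c_ - x)))
    return (s(alpha[0], beta[0], gamma[0], iou(b, g))
            + s(alpha[1], beta[1], gamma[1], gtc(b, g))
            + s(alpha[2], beta[2], gamma[2], 1 - cd(b, g, d)))

def zoom_reward(b, b_next, g, d):                            # Eq. (2)
    diff = quality(b_next, g, d) - quality(b, g, d)
    return (diff > 0) - (diff < 0)                           # sign of the change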

Fig. 2. The first zoom increases the IoU by a small amount and therefore results in a positive reward, but also cuts off a large portion of the ground truth, which cannot be covered anymore in the future using the 1-stage approach.

Refinement Reward. The refinement steps of the second stage are learned with a simpler reward function. This is because the only goal of that stage is to center the object within the bounding box. Therefore, only the CD metric is used in the reward function:

    Rm(b, b′, g, d) = sign(cd(b, g, d) − cd(b′, g, d))   if cd(b, g, d) ≠ cd(b′, g, d)
                      −1   if cd(b, g, d) = cd(b′, g, d) while a CD-decreasing refinement is possible
                      +1   if cd(b, g, d) = cd(b′, g, d) while no CD-decreasing refinement is possible    (6)

Decreasing the Center Deviation is rewarded, increasing it is punished. If the agent does not move, the reward depends on the possibility to get a smaller CD. If the CD cannot be decreased using one of the moving refinements, then not moving is rewarded, otherwise it is punished.

Reward Assignment. In 1-stage, the zoom reward is calculated and assigned immediately after each zoom action. In 2-stage, the reward for zooming is calculated and assigned after all five refinement actions have been taken to facilitate the coordination between the two stages. The refinement reward is calculated and assigned directly after the refinement.

3.2 Architecture

Both the zoom stage and the refinement stage follow the layer architecture visualized in Fig. 3. At each step, they take the image and the current bounding box as input and output another bounding box that should fit the object better. The input image is cropped to the region determined by the bounding box and then resized to 224 × 224 pixels. From this resized image region, a VGG-16 [10] extracts 512 feature maps of size 7 × 7 describing the region's image features. These feature maps are flattened, concatenated with the agent's history vector containing the last four actions, and passed through two fully connected layers with 1024 neurons each, forming the Q-network. Each fully connected layer is followed by a ReLU nonlinearity [12] and trained with 20% dropout [13]. The Q-network's output is a vector of Q-values containing one estimation per possible action. The agent chooses the action with the highest Q-value and transforms the bounding box accordingly. The image region inside the transformed bounding box serves as the new input for the next step.

Fig. 3. At each step, an image region as framed by the current bounding box is input to a VGG-16 convolutional neural network. The extracted image features concatenated with a history vector are then input to a Q-network. The Q-value outputs determine the next action to manipulate the bounding box.
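A minimal Keras sketch of the per-stage network described above is given below. It is an approximation: the paper does not state the training framework, whether the VGG-16 backbone is fine-tuned or frozen, or the loss function; the action count, history length, ImageNet weights and MSE loss are our assumptions, while the two 1024-unit layers, 20% dropout and the Adam learning rate of 10^-6 follow the text.

# Sketch only; layer names and several settings are assumptions (see lead-in).
import tensorflow as tf
from tensorflow.keras import layers, Model

N_ACTIONS = 6                     # e.g. 1-stage: 5 zoom actions + terminal (assumed)
HISTORY_LEN = 4 * N_ACTIONS       # one-hot history of the last four actions

vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                  input_shape=(224, 224, 3))
vgg.trainable = False             # treated here as a fixed feature extractor (assumption)

region = layers.Input(shape=(224, 224, 3), name="image_region")
history = layers.Input(shape=(HISTORY_LEN,), name="action_history")

features = layers.Flatten()(vgg(region))          # 512 feature maps of size 7 x 7
x = layers.Concatenate()([features, history])
x = layers.Dropout(0.2)(layers.Dense(1024, activation="relu")(x))
x = layers.Dropout(0.2)(layers.Dense(1024, activation="relu")(x))
q_values = layers.Dense(N_ACTIONS, name="q_values")(x)      # one Q-value per action

q_network = Model([region, history], q_values)
q_network.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-6),
                  loss="mse")     # loss choice is ours, not stated in the paper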

3.3 Training

In this section, we describe our approach for training the model and the hyperparameters used.

Optimization. All model weights are initialized randomly from a normal distribution and optimized by Adam [14] with a learning rate of 10^-6.

Epsilon-Greedy Policy. We balance exploration against exploitation using an epsilon-greedy policy. Training starts by taking random actions only (ε = 1) and linearly increases the likelihood of taking actions based on the learned policy until ε = 0.1, meaning 10% of actions are random and 90% are based on the learned policy. For both models, ε decreases by 0.1 after each epoch.

Forced Termination. States in which the IoU is higher than 0.5 are relatively scarce when taking random actions. Nevertheless, to learn the terminal action of the zoom model quickly, we enforce taking the terminal action whenever the IoU exceeds 0.5, as recommended by [3].

Experience Replay. In both the zoom stage and the refinement stage, consecutive states are highly correlated, because a large portion of the considered image region is shared by both states. To train the models with independent experiences, we apply experience replay by storing experiences in a rolling buffer and training the model on a random subset of these stored experiences after each step. For both stages, the size of the experience buffer is 1,000 experiences, of which 100 are sampled randomly for the batch training.

Reward Discounting. Our Q-function considers future rewards with a discount factor of 0.9.
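The epsilon-greedy schedule and the experience replay described above can be sketched as follows; the buffer size of 1,000 and batch size of 100 follow the text, while the names and the layout of an experience tuple are our assumptions.

import random
from collections import deque

class ReplayBuffer:
    """Rolling buffer of (state, action, reward, next_state) experiences."""
    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)
    def add(self, experience):
        self.buffer.append(experience)
    def sample(self, batch_size=100):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def epsilon(epoch, start=1.0, end=0.1, step=0.1):
    """Epsilon drops by 0.1 per epoch from 1.0 until it reaches 0.1."""
    return max(end, start - step * epoch)

def choose_action(q_values, eps, n_actions):
    if random.random() < eps:                                  # explore
        return random.randrange(n_actions)
    return int(max(range(n_actions), key=lambda a: q_values[a]))  # exploit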

4 Experiments

For a quantitative and qualitative comparison of different variations of our previously defined models, we conduct experiments in which we train our agent on one set of images and test it against another set of images. To allow for a comparison with [3], we have used images of airplanes from the PASCAL VOC dataset [15]. Our agent has been trained on the 2012 trainval dataset and evaluated against the test dataset from 2007. The performance of each model is measured by its true positive (TP) rate. We consider a detection a true positive when the Intersection-over-Union of the predicted bounding box with the ground truth bounding box exceeds 0.5, as defined by the Pascal VOC challenge [15]. To measure the convergence over the course of multiple epochs of training, the models are evaluated against the test set after every tenth epoch of training. For the 1-stage models, we conduct four runs for each model-reward combination and average the results. For the 2-stage models, we take the average of three runs due to increased runtime requirements.

4.1 Experimental Procedure

1-stage. Our first series of experiments serves the purpose of establishing a well-performing zooming methodology. For this, we examine the effect of different reward functions and the effect of adding aspect ratio zoom actions. Each model has been trained for 200 epochs on a set of 253 images and tested on a disjoint set of 159 images. Convergence curves indicate that results do not improve significantly by training longer than 200 epochs.

2-stage. In a second series of experiments, the influence of a second-stage refinement is evaluated. Training builds on top of a 1-stage-ar model that is pretrained with an IoU, GTC and CD reward, namely the best-performing first-stage weights, which reached a result of 51 true positives after 200 epochs (see Figs. 4 and 5). For a duration of 100 epochs, only the refinement model is trained. Results indicate that during training the detection quality of the 2-stage model starts off worse than 1-stage's final results but surpasses them at around 100 epochs of refinement training. Finally, both networks, zoom and refinement, are trained simultaneously for 50 epochs so that an appropriate coordination can be learned.

4.2 Results

As previously hypothesized, a combination of Intersection-over-Union, Ground Truth Coverage and Center Deviation outperforms a reward that is purely based on the Intersection-over-Union. Experimental results indicate that these additional reward metrics help the agent to decide on its next zoom action, especially in situations where multiple zooms promise similar improvements in IoU. GTC helps to minimize cutting-off mistakes that cannot be reverted. CD helps to maintain focus on the object and prevents missing it along the zoom trajectory.

Table 1. Average results of all runs measured by their true positives (TP), false positives (FP) and false negatives (FN)

Model       |           IoU            |       IoU, GTC, CD       |  IoU, GTC, CD (Sigmoid)
            |   TP      FP      FN     |   TP      FP      FN     |   TP      FP      FN
1-stage     |  28.75  102.50   27.75   |  35.25   90.75   33.00   |  28.25   99.00   31.75
1-stage-ar  |  38.75   92.00   28.25   |  47.75   89.50   21.75   |  49.75   86.00   23.25
2-stage     |    -       -       -     |  40.00   95.00   24.00   |    -       -       -
2-stage-ar  |    -       -       -     |  61.67   89.67    7.67   |    -       -       -

No matter which reward is used, the 1-stage-ar approach achieves more correct detections than the simpler 1-stage approach (see Table 1). The fitting potential for narrow objects can apparently be exploited well enough to justify the two additional actions that must be learned by the agent. For the 1-stage-ar models, the positive effect of considering additional reward metrics is even stronger than for the 1-stage models. We hypothesize that GTC is particularly effective when aspect ratio zooms are possible. In situations where an aspect ratio zoom is the optimal action, all other zoom actions usually cut off a large portion of the object. This might result in an IoU increase but a larger decrease in GTC, so that taking an action other than the necessary aspect ratio zoom results in punishment. However, implementing these reward metrics with a sigmoid squashing function seems to have no significant impact on performance, and maybe even a small negative impact for 1-stage. Extending the model with a second stage for refinement leads to a further improvement in true positive detections. The refinement ability is especially helpful in two types of situations: (1) when the first stage makes an unfavorable zoom decision that can be corrected by the second stage and (2) when no predefined zoom region fits the object adequately but the second stage can refine the bounding box position enough to correct this offset.

Fig. 4. Convergence curves on the test set averaged over all runs: (a) mean test convergence for 1-stage, (b) mean test convergence for 1-stage-ar, each comparing the IoU, the IoU, GTC, CD and the sigmoid rewards, and (c) all models with the IoU, GTC, CD reward; each panel plots the percentage of true positives over the training epochs.

5 Conclusion

This paper has presented a reinforcement learning approach that detects objects by acting in two alternating action stages. The approach is based on a hierarchical object detection technique that makes it affordable to calculate feature maps of the image before each step. This leads to a large variety of feature maps being extracted from each image during training, so that the image data is already augmented by the model itself, which improves results especially on small data sets. By adding actions that change the aspect ratio of the bounding box and by refining the bounding box after each zoom step, our model is capable of detecting 75% more objects than without these supplementary actions. Additionally, our experiments indicate that the number of correct detections can be increased by training the agent with reward metrics that are more informative about the quality and future potential of current states, namely Ground Truth Coverage and Center Deviation. Combining aspect ratio zooms, refining movements and enhanced reward metrics, the detection rate was nearly doubled in our experiments compared to the pure zooming approach. Taking into account that we did not train our 2-stage-ar model until convergence, one may expect even higher detection rates with more training. Subsequent optimization of model parameters and fine-tuning of the reward weights should be considered to increase the number of correct detections even further. Also, the detection accuracy might improve if the agent could choose different step sizes for zooming, aspect ratio changes and refining movements. Admittedly, it is questionable whether the presented approach may be labeled as traditional reinforcement learning. Normally, the ground truth is not known or not used to train the agent; in our training method, however, the ground truth annotation is used to shape the rewards. Therefore, our approach is supervised and uses reinforcement learning techniques at the same time.

A Algorithm

Algorithm 1. Training procedure (the refinement-related lines, shown in blue in the original, are only necessary for the 2-stage models)

vggModel ← loadVggModel()
zoomModel ← loadZoomModel()
zoomExp ← initializeZoomExpReplay()
refModel ← loadRefModel()
refExp ← initializeRefExpReplay()
for epoch ← 1 to maxEpochs do
    for all image in trainingset with one object of interest do
        groundTruth ← loadGroundTruth(image)
        boundingBox ← initializeBoundingBox()
        historyVectorZoom ← initializeHistoryVectorZoom()
        zoomState ← getZoomState(vggModel, image, boundingBox, historyVectorZoom)
        zoomStep ← 0
        terminated ← False
        while not terminated and zoomStep < 10 do
            prevZoomBoundingBox ← boundingBox
            prevZoomState ← zoomState
            zoomAction ← chooseZoomAction(zoomModel, zoomState)
            zoomAction ← checkForcedTermination(boundingBox, groundTruth, zoomAction)
            if zoomAction ≠ terminal action then
                boundingBox ← performZoomAction(boundingBox, zoomAction, image)
                historyVectorRef ← initializeHistoryVectorRef()
                refState ← getRefState(vggModel, image, boundingBox, historyVectorRef)
                for refStep ← 1 to 5 do
                    prevRefBoundingBox ← boundingBox
                    prevRefState ← refState
                    refAction ← chooseRefAction(refModel, refState)
                    boundingBox ← performRefAction(boundingBox, refAction, image)
                    historyVectorRef ← updateHistoryVectorRef(historyVectorRef, refAction)
                    refReward ← calculateRefReward(boundingBox, prevRefBoundingBox, groundTruth, image)
                    refState ← getRefState(vggModel, image, boundingBox, historyVectorRef)
                    refExp ← memorizeRefExp(prevRefState, refState, refStep, refReward, refAction)
                end for
            end if
            zoomStep ← zoomStep + 1
            terminated ← zoomAction = terminal action
            historyVectorZoom ← updateHistoryVectorZoom(historyVectorZoom, zoomAction)
            zoomReward ← calculateZoomReward(boundingBox, prevZoomBoundingBox, groundTruth, image)
            zoomState ← getZoomState(vggModel, image, boundingBox, historyVectorZoom)
            zoomExp ← memorizeZoomExp(prevZoomState, zoomState, zoomReward, zoomAction)
            fitZoomModel(zoomModel, zoomExp)
            fitRefModel(refModel, refExp)
        end while
    end for
end for

B Evaluation Overview

Fig. 5. Convergence curves of the evaluated approaches (1-stage and 1-stage-ar with the IoU, the IoU, GTC, CD and the sigmoid reward variants; 2-stage and 2-stage-ar with the IoU, GTC, CD reward); each panel plots the percentage of true positives over the training epochs. The colored areas visualize the min-max value ranges, the lines represent the mean values.

References

1. Itti, L., Rees, G., Tsotsos, J.K.: Neurobiology of attention (2005)
2. Mathe, S., Pirinen, A., Sminchisescu, C.: Reinforcement learning for visual object detection. IEEE (2016)
3. Bueno, M.B., Nieto, X.G., Marqués, F., Torres, J.: Hierarchical object detection with deep reinforcement learning. arXiv:1611.03718v2 (2016)
4. Uijlings, J.R., Van De Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search for object recognition. Int. J. Comput. Vis. 104, 154–171 (2013)
5. Zitnick, C.L., Dollár, P.: Edge boxes: locating object proposals from edges. In: European Conference on Computer Vision (2014)
6. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015)
7. Caicedo, J.C., Lazebnik, S.: Active object localization with deep reinforcement learning. IEEE (2015)
8. Maicas, G., Carneiro, G., Bradley, A.P., Nascimento, J.C., Reid, I.: Deep reinforcement learning for active breast lesion detection from DCE-MRI. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer, Cham (2017)
9. Watkins, C.J.C.H.: Learning from Delayed Rewards (1989)
10. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556v6 (2015)
11. Mataric, M.J.: Reward functions for accelerated learning. Mach. Learn. Proc. 1994, 181–189 (1994)
12. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-2010), pp. 807–814 (2010)
13. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014)
14. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
15. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010)

Road Weather Condition Estimation Using Fixed and Mobile Based Cameras

Koray Ozcan1, Anuj Sharma1, Skylar Knickerbocker2, Jennifer Merickel3, Neal Hawkins2, and Matthew Rizzo3

1 Institute for Transportation, Iowa State University, Ames, IA, USA
{koray6,anujs}@iastate.edu
2 Center for Transportation Research and Education, Ames, IA, USA
{sknick,hawkins}@iastate.edu
3 University of Nebraska Medical Center, Omaha, NE, USA
{jennifer.merickel,matthew.rizzo}@unmc.edu

Abstract. Automated interpretation and understanding of the driving environment using image processing is a challenging task, as most current vision-based systems are not designed to work in dynamically changing and naturalistic real-world settings. For instance, road weather condition classification using a camera is a challenge due to high variance in weather, road layout, and illumination conditions. Most transportation agencies within the U.S. have deployed some cameras for operational awareness. Given that weather-related crashes constitute 22% of all vehicle crashes and 16% of crash fatalities, this study proposes using these same cameras as a source for estimating roadway surface condition. The developed model is focused on three road surface conditions resulting from weather: Clear (clear/dry), Rainy-Wet (rainy/slushy/wet), and Snow (snow-covered/partially snow-covered). The camera sources evaluated are both fixed Closed-Circuit Television (CCTV) and mobile (snow plow dash-cam) cameras. The results are promising, with 98.57% and 77.32% road weather classification accuracy achieved for CCTV and mobile cameras, respectively. The proposed classification method is suitable for autonomous selection of snow plow routes and verification of extreme road conditions on roadways.

Keywords: Road weather classification · Scene classification · VGG16 · Neural networks · CCTV · Mobile camera

1 Introduction

According to the Federal Highway Administration, there were 1.2 million weather-related crashes from 2005 to 2014 [1]. As a result, 445,303 individuals were injured and 5,897 people lost their lives. Weather-related crashes constitute 22% of all vehicle crashes and 16% of crash fatalities. It has also been shown that traffic flow rate and vehicle speeds are significantly reduced under inclement weather conditions [2]. Therefore, accurate and comprehensive weather condition monitoring has a key role in road safety, effective winter maintenance, and traveler information and advisories. Embedded road sensors have been used for the purpose of automated traffic control via variable traffic message signs [3]. However, road sensors are highly dependent on thresholds to estimate road surface conditions from sensor measurements such as wetness, temperature, etc. Lasers and other electro-optical technologies have been used to measure road surface grip, especially for snow and ice [4]. Surveillance and mobile cameras can also provide automated monitoring of road weather conditions. Recently, the Minnesota Department of Transportation established a snowplow camera network to share frequently captured images with the public [5]. This research paper considers providing road weather condition estimates from both snowplow and highway surveillance camera networks.

1.1 Mobile Sourced Images

With the increasing popularity of mobile cameras for driver assistance and autonomous driving applications, there has been an increasing demand for using in-vehicle camera systems. Son and Baek [6] proposed a design and implementation taxonomy using real-time in-vehicle camera data for driver assistance and traffic congestion estimation. A camera and other sensors were used to estimate the road's condition by capturing the vehicle's response to the roadway. Rajamohan et al. [7] developed road condition and texture estimation algorithms for classes of smooth, rough, and bumpy roads. To do so, they combined in-vehicle camera, GPS, and accelerometer sensors. Kutila et al. [8] proposed an algorithm for road condition monitoring by combining laser scanners and stereo cameras. They reported 95% accuracy with the help of sensor fusion methods for the road surface types dry, wet, snow, and ice. Laser scanners were helpful for estimating the depth of the road surface layer under various weather scenarios. Support vector machine (SVM) based classifiers have been used for classifying winter road surfaces in videos recorded with in-vehicle cameras; for three major roadway condition classes (bare, snow-covered, and snow-covered with tracks), these classifiers provided an overall accuracy of 81% to 89%. For distinguishing dry and wet road conditions, Yamada et al. performed a multi-variate analysis of images captured by vehicle-mounted cameras. Nguyen et al. [9] employed polarized stereo cameras for 3D tracking of water hazards on the road. Moreover, Abdic et al. [10] developed a deep learning based approach to detect road surface wetness from audio and noise. Kuehnle and Burghout [11] developed a neural network with three or four input features, such as the mean and standard deviation of color levels, to classify dry, wet, and snowy conditions from video cameras. They achieved 40% to 50% correct classification accuracy with this network architecture. Pan et al. [12] estimated road conditions for clear and various snow-covered conditions using a pre-trained deep convolutional neural network (CNN). They showed that CNN based algorithms perform better than traditional random tree and random forest based models for estimating whether the road is snow covered or not, while generally omitting the rainy/wet scenario. Finally, Qian et al. [13] tested various features and classifiers for road weather condition analysis on a challenging dataset collected from an uncalibrated dashboard camera. They achieved 80% accuracy for road images of the classes clear vs. snow/ice-covered and 68% accuracy for the classes clear/dry, wet, and snow/ice-covered. Our dataset consists of the classes clear/dry, wet, and snow/ice-covered, annotated manually for this project.

1.2 Fixed Source Images from Road Weather Information Stations (RWIS)

Another approach to estimating road conditions involves using a camera mounted above the roadway and looking directly down on the road surface. Jonsson [14] used weather data and camera images to group and classify weather conditions. The study used grayscale images and weather sensor features to estimate road condition groupings across dry, ice, snow, track (snowy with wheel tracks), and wet. They also provided a classification algorithm for the road condition classes dry, wet, snowy, and icy [15]. Similarly, using the road surface's color and texture characteristics provided suitable classification with k-nearest neighbor, neural network and SVM classifiers for dry, mild snow coverage, and heavy snow coverage [16]. However, these approaches are limited to the locations where cameras are installed and to a confined field of view. In other words, the cameras only look down on the road from above while attached to poles near road verges. It would be too costly to monitor every segment of the road with such installations.

1.3 Fixed Source Images from Closed-Circuit Television (CCTV)

CCTV cameras provide surveillance of road intersections and highways. They can be utilized as observation sensors for estimating features such as traffic density and road weather conditions. Lee et al. [17] proposed a method to extract weather information from the road regions of CCTV videos. They analyzed CCTV video edge patterns and colors to observe how they correlate with weather conditions. However, rather than estimating road weather conditions, they developed algorithms that estimated the overall weather conditions across the scene. They presented 85.7% accuracy with three weather classes: sunny, rainy, and cloudy. Moreover, a snowfall detection algorithm, particularly for low-visibility scenes, was developed in [18] using CCTV camera images. For modeling various road conditions, we selected VGGNet since it has been proven effective for object detection and recognition in stationary images [19]. Recently, Wang et al. implemented VGGNet models on the large-scale Places365 dataset [20]. The model was trained with approximately 1.8 million images from 365 scene classes and achieved 55.19% top-1 class accuracy and 85.01% top-5 class accuracy on the test dataset. With the model developed by Zhou et al. [21], it has been shown that place recognition can be achieved with high accuracy. As explained in the next sections, the features learned by this network are useful for differentiating road image features for the defined weather classes. This paper adapts place recognition models by fine-tuning the last three layers of the network for road condition estimation.

1.4 Objective

In this paper, we propose a method to estimate road surface weather conditions. To the authors' best knowledge, it is the first application of road weather condition estimation using CCTV cameras observing road intersections and highways. Our experimental results show feasibility and utility on CCTV datasets that monitor the road surface. Further algorithm development is also presented to improve road weather condition classification from forward-facing snow plow mobile cameras. We have developed a model for classifying the road condition based on three major weather classes: clear/dry (marked as Clear), rainy/slushy/wet (marked as Rainy-Wet), and snow-covered/partially snow-covered (marked as Snow). The model provides 77.32% accuracy for the mobile camera solution, improving upon the prior method of Qian et al. [13], which reported 68% classification accuracy.

2 Proposed Method

2.1 Model Description

The proposed model was derived from the promising VGG16 implementation on the Places365 dataset. Figure 1 shows the final classification layers of the network as modified for our application of weather condition classification. The last three network layers were replaced to adapt the classification, with a fully connected layer followed by softmax and output layers. These were then connected to the last pooling layer of the VGG16 network trained on the Places365 dataset. The model was adapted from a publicly available Keras implementation for place classification [22]. The backbone consists of five convolutional blocks applied to the input image batches. Features learned from the Places365 dataset are fine-tuned for classifying the defined weather conditions. The learned features are passed to the fully connected layers after the pooling operation. The softmax layer then converts the output into a categorical probability distribution over the final classification classes. The image dataset has been augmented with flipped images along with rotations of up to ±20°. Furthermore, the classes are balanced so that there is an equal number of images for each weather class.

Fig. 1. Final layers of the network after pooling, modified for 3-class classification.
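A hedged Keras sketch of this transfer-learning setup is shown below. The Places365-pretrained VGG16 base is assumed to be available through the Keras-Places implementation cited as [22]; since its exact interface may differ, a generic VGG16 backbone is used here as a stand-in, and the new fully connected, softmax and output layers are attached to the last pooling layer as described above.

# Sketch only; the Places365 backbone import is an assumption (see lead-in).
import tensorflow as tf
from tensorflow.keras import layers, Model
# from vgg16_places_365 import VGG16_Places365   # assumed interface from [22]

# Stand-in backbone: a VGG16 network with its classification top removed.
base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(224, 224, 3))

x = layers.Flatten()(base.output)                 # output of the last pooling layer
x = layers.Dense(4096, activation="relu")(x)      # new fully connected layer
outputs = layers.Dense(3, activation="softmax")(x)  # Clear / Rainy-Wet / Snow

model = Model(base.input, outputs)
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4),  # SGD, lr 0.0001
              loss="categorical_crossentropy", metrics=["accuracy"])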

To test the model, the dataset was divided randomly into training (80%), validation (10%), and testing (10%) subsets. The training and validation parts of the dataset were used during the transfer learning stage. In the next sections, we explain how training was performed for (i) CCTV surveillance camera images and (ii) snow plow camera images. The road conditions were separated into the three major classes snowy, rainy/wet, and clear. Validation accuracy is the percentage of images in the validation dataset that are correctly classified. The epoch number corresponds to how many times the model has processed all of the images in the training dataset. The loss is a summation of the errors over the examples in the training or validation dataset; the better the model's predictions, the closer it is to zero. Over the training period, the loss is expected to approach zero as long as the model is learning useful features for predicting the defined classes more accurately.

2.2 Road Weather Condition Estimation Using CCTV

Images from CCTV surveillance cameras were collected over a one-year period in Iowa. Images were also gathered while snow plow vehicles were operating on Iowa roads. The total number of images used for training was 8,391, with a balanced number of training examples per class (2,797 per class). Data augmentation was applied to the training data in the form of random pixel shifts (±20 pixels) in the horizontal direction. For VGG16, we used a 4,096-dimensional feature vector from the activation of the fully connected layer. Training, validation, and test images were resized to 224 × 224 pixels and cropped to a width of 60% and a height of 80% of the original image size. Stochastic Gradient Descent (SGD) was used with an initial learning rate of 0.0001, which allowed fine-tuning to make progress without destroying the initialization. Since road regions are mostly concentrated in the middle of the image for CCTV cameras, cropping the image allowed us to focus more on the road regions. Final validation accuracy reached 98.92% after five epochs of training. It took about 68 min to complete the training on a single NVIDIA GTX 1080 Ti GPU. Figure 2 shows the training and validation accuracy along with the loss while training the model for 5 epochs.
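A sketch of the corresponding input pipeline is given below; the ±20 pixel horizontal shifts, the 224 × 224 resizing, the 10% validation split and the five training epochs follow the text, while the directory layout, batch size and the generator-based implementation are our assumptions.

# Hedged input-pipeline sketch; paths and batch size are hypothetical.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    width_shift_range=20,        # random +/-20 pixel horizontal shifts, as described
    validation_split=0.1,        # hold out 10% of the images for validation
)

train_flow = datagen.flow_from_directory(
    "cctv_dataset",              # hypothetical folder with one subfolder per class
    target_size=(224, 224),      # resize before feeding the VGG16-based model
    batch_size=32,
    class_mode="categorical",
    subset="training",
)
val_flow = datagen.flow_from_directory(
    "cctv_dataset", target_size=(224, 224), batch_size=32,
    class_mode="categorical", subset="validation",
)

# model.fit(train_flow, validation_data=val_flow, epochs=5)  # 'model' from the previous sketch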

2.3 Road Weather Condition Estimation Using Mobile Cameras

Images from snow plow vehicles were also collected over a one-year period on Iowa roadways. The total number of images used for training was 4,443, divided evenly across classes (1,481 per class). Each image in the dataset was manually annotated with one of the selected roadway weather conditions. The training data were augmented to introduce more variety into the dataset by applying (i) ±20 pixel shifts in the horizontal direction and (ii) flipped image versions. Training, validation, and test images were resized to 224 × 224 pixels with red, green, and blue color channels before being fed into the network. We started with SGD at a learning rate of 0.0001. Final validation accuracy reached 71.16% after training for 10 epochs. It took about 53 min to complete the training on a single GPU. The trained model fit the training data well over time, but had difficulty generalizing to the validation and test data. Figure 3 shows the training and validation accuracy along with the loss while training the model for 10 epochs.

Fig. 2. Training accuracy and loss while training CCTV weather model.

Fig. 3. Training accuracy and loss while training mobile camera weather model.

3 Experimental Results

After training the proposed model via transfer learning, we used 10% of the dataset for testing. Figure 4 presents the confusion matrix for road weather condition estimation using CCTV camera feeds. Each row of the confusion matrix represents the instances of a target class, while each column represents the instances of a predicted (output) class. The top row of each box shows the percent accuracy for the correct classification, and the bottom row of each box shows the number of images for each target-output combination. All road weather condition classes had more than 96% correct classification accuracy. The trained model generalized well to the validation and test datasets.
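As an illustration, the confusion matrix and the per-class accuracies described above can be computed as follows (scikit-learn convention: rows are target classes, columns are predicted classes); the label lists below are placeholders, not the paper's data.

# Illustrative computation only; y_true/y_pred are placeholder values.
from sklearn.metrics import confusion_matrix

classes = ["Clear", "Rainy-Wet", "Snow"]
y_true = ["Clear", "Snow", "Rainy-Wet", "Clear"]   # manual annotations (placeholder)
y_pred = ["Clear", "Snow", "Clear", "Clear"]       # model outputs (placeholder)

cm = confusion_matrix(y_true, y_pred, labels=classes)
per_class_acc = cm.diagonal() / cm.sum(axis=1)     # diagonal over row sums
overall_acc = cm.diagonal().sum() / cm.sum()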

Example correctly classified images from the CCTV dataset are presented in Fig. 5. The resulting images and their highest classification probabilities are shown with examples from each class in the dataset. The proposed model is capable of predicting each class with a high degree of accuracy on the test dataset, which is consistent with the overall accuracy observed on the validation set while training the model. An overall accuracy of 98.57% is achieved for correctly classifying the three classes clear, rainy-wet, and snow-covered.

Fig. 4. Confusion matrix for CCTV camera weather estimations.

Figure 6 presents the confusion matrix for road weather condition estimation using snow plow vehicle camera feeds. The overall accuracy for correct classification was 77.32% across the three major weather classes. While the accuracy is high for snowy and clear roads, the model was less accurate on rainy/wet road condition classification. As seen during the training process, the model fit the training data well, but accuracy on the validation dataset did not improve over the training epochs. While the snow plow camera images are challenging to classify, increasing the training data may improve the model's generalizability and reduce the overall error on the validation dataset. Example images with the highest probability of correct classification are shown in Figs. 5 and 7. It can be observed that the model estimates the road class correctly when the road is snow-covered and when the surface is wet. Moreover, the model is able to classify a clear road condition even when snow has accumulated by the side of the road. Overall, when the frontal view of the camera clearly observes the road, the model classifies road conditions with high accuracy. Figure 8 presents example images that were misclassified as a result of an incorrect highest-probability condition estimate. For the snow plow camera views in the first row, the road layout and the camera location were significantly different from the straight roads in the frontal view. Moreover, we observed misclassifications when the camera view was blurry due to wetness. Also, light road colors were confused with snow-covered areas, resulting in misclassifications.

Fig. 5. Example images showing the classification results with highest probability.

Fig. 6. Confusion matrix for Snow Plow camera weather estimations.

Fig. 7. Example images showing the classification results with highest probability.

Fig. 8. Example images that are misclassified.

3.1 Experimental Results on Naturalistic Driving Images

With the trained model developed for snow plow vehicle images, we tested the model on a sample naturalistic driving dataset of video footage from in-vehicle cameras facing the roadway. Cameras were installed in personal vehicles and captured individuals' daily driving in their typical environment. Initial results show promising weather condition estimation on the road. Example frames from the naturalistic video recordings are presented in Fig. 9. In order to concentrate on the actual road regions while testing the developed method, we masked the road region in front of the vehicle with a trapezoid, as can be observed in the last row of Fig. 9. Since the other regions of the scene change significantly and contribute little to the road weather estimation task, they are not included for images from the naturalistic driving recordings. Although the model classifies rainy/wet and clear conditions, there is still room for improvement in correctly classifying snow-covered regions; the current dataset does not contain any snow-covered road regions in front of the vehicle. Also, the maximum confidence scores for correct classifications are comparably lower, since these naturalistic driving images are quite different from the initial training dataset.

Fig. 9. Example images showing the classification results on naturalistic driving video.
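A minimal OpenCV sketch of such a trapezoidal road mask is given below; the vertex coordinates are illustrative and not the values used by the authors.

# Sketch only; trapezoid vertices are assumptions, not the authors' values.
import cv2
import numpy as np

def mask_road_region(frame):
    """Keep only a trapezoidal road region in front of the vehicle."""
    h, w = frame.shape[:2]
    mask = np.zeros((h, w), dtype=np.uint8)
    trapezoid = np.array([[
        (int(0.42 * w), int(0.55 * h)),   # top-left of the road region
        (int(0.58 * w), int(0.55 * h)),   # top-right
        (int(0.95 * w), h - 1),           # bottom-right
        (int(0.05 * w), h - 1),           # bottom-left
    ]], dtype=np.int32)
    cv2.fillPoly(mask, trapezoid, 255)
    return cv2.bitwise_and(frame, frame, mask=mask)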

4 Conclusions

This study developed a customized model for estimating road weather conditions using both CCTV cameras and snow plow vehicle mobile cameras. The model is adapted from the Places205-VGGNet model for scene recognition. The developed model has been trained and tested on two new datasets prepared for road weather condition estimation. Both datasets include a large number of images collected from camera feeds at different locations across the state of Iowa. The developed model achieved 98.57% and 77.32% accuracy with images from CCTV cameras and snow plow cameras, respectively. While there is still room for improvement, the model provides promising results for weather condition estimation using CCTV surveillance cameras. It should be noted that snow plow camera images are highly variable depending on the vehicle's location, environmental brightness, and road layout. Still, the proposed model provides an 11% accuracy improvement over previous models for road weather classification on dashcam images. The model for mobile images was also tested on a naturalistic driving dataset and provided promising results for differentiating clear and wet road conditions. All in all, the developed model is shown to be suitable for CCTV camera feeds as well as mobile camera feeds from snow plow vehicles.

5 Future Work

To increase the overall accuracy of the weather classification results, the team is planning to extract the road regions in a robust manner, especially for the mobile images. Currently, the model suffers from the variety of challenging images under extreme weather, illumination, and road layout conditions. It would be worthwhile to have a less noisy dataset with more images for model training. Model results based on the CCTV images are encouraging, and we hope they can be supplemented and integrated with images from mobile sources. As the image datasets expand, further improvements could increase the number of weather conditions estimated, such as icy, slushy, foggy, etc. As a future implementation direction, we plan to run the developed models to estimate road conditions from CCTV surveillance camera feeds as well as from snow plow vehicles on the road. The developed model is beneficial for autonomous selection of snow plow routes and verification of extreme road conditions on roadways.

References

1. U.S. DOT Federal Highway Administration: How do weather events impact roads? https://ops.fhwa.dot.gov/weather/q1_roadimpact.htm. Accessed 01 Aug 2018
2. Rakha, H., Arafeh, M., Park, S.: Modeling inclement weather impacts on traffic stream behavior. Int. J. Transp. Sci. Technol. 1(1), 25–47 (2012)
3. Haug, A., Grosanic, S.: Usage of road weather sensors for automatic traffic control on motorways. Transp. Res. Procedia 15, 537–547 (2016)
4. Ogura, T., Kageyama, I., Nasukawa, K., Miyashita, Y., Kitagawa, H., Imada, Y.: Study on a road surface sensing system for snow and ice road. JSAE Rev. 23(3), 333–339 (2002)
5. Hirt, B.: Installing snowplow cameras and integrating images into MnDOT's traveler information system. National Transportation Library (2017)
6. Son, S., Baek, Y.: Design and implementation of real-time vehicular camera for driver assistance and traffic congestion estimation. Sensors 15(8), 20204–20231 (2015)
7. Rajamohan, D., Gannu, B., Rajan, K.S.: MAARGHA: a prototype system for road condition and surface type estimation by fusing multi-sensor data. ISPRS Int. J. Geo-Inf. 4(3), 1225–1245 (2015)
8. Kutila, M., Pyykönen, P., Ritter, W., Sawade, O., Schäufele, B.: Automotive LIDAR sensor development scenarios for harsh weather conditions. In: 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), Rio de Janeiro (2016)
9. Nguyen, C.V., Milford, M., Mahony, R.: 3D tracking of water hazards with polarized stereo cameras. In: IEEE International Conference on Robotics and Automation (ICRA) (2017)
10. Abdic, I., Fridman, L., Brown, D.E., Angell, W., Reimer, B., Marchi, E., Schuller, B.: Detecting road surface wetness from audio: a deep learning approach. In: 23rd International Conference on Pattern Recognition (ICPR) (2016)
11. Kuehnle, A., Burghout, W.: Winter road condition recognition using video image classification. Transp. Res. Rec. J. Transp. Res. Board 1627, 29–33 (1998)
12. Pan, G., Fu, L., Yu, R., Muresan, M.I.: Winter road surface condition recognition using a pre-trained deep convolutional neural network. In: Transportation Research Board 97th Annual Meeting, Washington, DC, United States (2018)
13. Qian, Y., Almazan, E.J., Elder, J.H.: Evaluating features and classifiers for road weather condition analysis. In: IEEE International Conference on Image Processing (ICIP), September 2016
14. Jonsson, P.: Road condition discrimination using weather data and camera images. In: 14th International IEEE Conference on Intelligent Transportation Systems (ITSC) (2011)
15. Jonsson, P.: Classification of road conditions: from camera images and weather data. In: IEEE International Conference on Computational Intelligence for Measurement Systems and Applications (CIMSA) Proceedings (2011)
16. Sun, Z., Jia, K.: Road surface condition classification based on color and texture information. In: Ninth International Conference on Intelligent Information Hiding and Multimedia Signal Processing (2013)
17. Lee, J., Hong, B., Shin, Y., Jang, Y.-J.: Extraction of weather information on road using CCTV video. In: International Conference on Big Data and Smart Computing (BigComp), Hong Kong (2016)
18. Kawarabuki, H., Onoguchi, K.: Snowfall detection under bad visibility scene. In: 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), October 2014
19. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
20. Wang, L., Lee, C.-Y., Tu, Z., Lazebnik, S.: Training deeper convolutional networks with deep supervision. arXiv preprint arXiv:1505.02496 (2015)
21. Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: a 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40(6), 1452–1464 (2018)
22. Kalliatakis, G.: Keras-places. https://github.com/GKalliatakis/Keras-VGG16-places365. Accessed 15 Nov 2018

Robust Pedestrian Detection Based on Parallel Channel Cascade Network

Jiaojiao He1,2, Yongping Zhang1, and Tuozhong Yao1

1 School of Electronic and Information Engineering, Ningbo University of Technology, Ningbo 315211, Zhejiang, China
[email protected], [email protected], [email protected]
2 School of Electronics and Control Engineering, Chang'an University, Xi'an 713100, Shaanxi, China

Abstract. Promoted by the Smart City initiative, pedestrian detection under wide-angle surveillance has attracted much attention. Because small-size pedestrians have poor resolution and varying degrees of distortion in images from a wide-angle field of view, a robust pedestrian detection algorithm based on a parallel channel cascade network is proposed. The algorithm, an improved Faster R-CNN (Faster Region Convolutional Neural Network), first obtains the difference image and the original image to construct a parallel input, then introduces a new feature extraction network called the Channel Cascade Network (CCN), and further designs a parallel CCN to fuse richer image features. Finally, in the Region Proposal Network, the size distribution of pedestrians in the images is estimated by clustering so that the anchors best fit the pedestrian data sets. Compared with the standard Faster R-CNN and FPN, the proposed algorithm is better suited to small-size pedestrian detection under wide-angle distortion.

Keywords: Parallel cascading channel network · Small-size pedestrian detection · Wide-angle surveillance · Regional proposal · Clustering

1 Introduction

Pedestrian detection, as a basis of target detection, has attracted wide attention in the last ten years. It is a basic task in many practical applications such as human behavior analysis, pose detection, intelligent video surveillance, automatic driving, intelligent traffic control, and intelligent robots. Pedestrian detection in wide fields of view is particularly valuable in settings such as large shopping centers, entertainment venues, railway stations, bus stations, and other large sites [1]. However, in the wide-angle perspective, several problems remain: (i) the multi-pose and multi-scale nature of pedestrians [2]; (ii) the large span of the image scene; and (iii) crowd occlusion in real-world scenarios [3]. All of these degrade the accuracy of pedestrian detection. In practical deployments, the distortion introduced by the camera's perspective must also be considered, which poses a further challenge to pedestrian detection.

Recently, a series of new algorithms based on neural networks have emerged, showing considerable accuracy gains [4]. Although Faster R-CNN is widely recognized for its high performance in object detection, the following problems remain for pedestrian detection in a wide-angle view. On the one hand, the pedestrian scale varies over a large range and distant pedestrians occupy only about ten pixels; on the other hand, due to image distortion, pedestrians near the image border also have very low resolution. Our goal is to better detect small-size pedestrians using Faster R-CNN, since very small pedestrians currently cause many false and missed detections. We first introduce differential information and study the channel relationships. We then introduce a novel architecture, which we call the Channel-Cascading Network (CCN) [5]. This new method makes full use of low-level image channel features, adopts a progressive cascade strategy to merge more image detail, and combines a clustering algorithm to select the best anchors, so that the detection rate of small-size pedestrians is improved. At the same time, false positives and missed detections in overall pedestrian detection are significantly reduced.

2 Related Work

Generally, pedestrian detection methods can be broadly divided into two categories: classic methods and deep learning methods. Classic methods extract features using hand-designed models, including early scale-invariant feature transforms [6] and histograms of oriented gradients (HOG) (see [7] and [8]). Classifiers generally use linear support vector machines [9] or adaptive boosting [10]. Subsequently, the deformable part model [11] and the combination of HOG and local binary patterns [12] were used as feature sets to improve the performance of pedestrian detection. However, these methods still generalize poorly and lack robustness. With the rise of deep neural networks, convolutional neural networks have been widely used in computer vision tasks [13]. In recent years, from R-CNN [14], SPPNet [15] and Fast R-CNN [16] to Faster R-CNN [17], object detection has reached a new level. Due to the introduction of the Region Proposal Network (RPN), Faster R-CNN has shown excellent performance for common object detection. However, problems remain in the detection of small targets. The anchors in Faster R-CNN work well for the Pascal VOC and COCO datasets; for small objects, however, the default anchors are too big, so it is difficult to detect small-size instances in the wide-angle view. The anchor sizes and ratios in Faster R-CNN are set empirically, which brings drawbacks and limitations and, in particular, leads to poor performance on small object detection. If suitable anchors are chosen from the beginning, the network will certainly detect better. YOLO9000 [18], YOLOv3 [19] and SSD [20] improve anchor selection using a clustering algorithm, and experiments show that suitable anchors are more advantageous for detection. However, compared with two-stage detectors, their accuracy is still limited. In our research, reasonable anchors are selected by clustering, and it is also verified that small anchors are more conducive to the detection of small-size pedestrians.


There are generally three main ways to improve the accuracy of multi-scale target detection. The first uses skip-layer connections to combine multi-layer features for detection (see [21] and [22]). The second (see [20], [23] and [24]) extracts features from different layers and combines their predictions. The third [25] combines the two approaches to perform multi-scale prediction. All of these exploit low-level detail and high-level semantic information in different ways to improve detection accuracy. For the wide-angle monitoring dataset, we take the VGG [26] network as the baseline. Our feature extraction method is similar to the second approach, but instead of directly fusing features extracted from each layer, we cascade shallow features into the subsequent layers, which promotes the flow of information. Because the channel information contained in different layers differs, a progressive cascading strategy is adopted to fully integrate the different layers, enhance the effective channel information, and promote information propagation.

3 Algorithm Description

For the detection of dense crowds in wide-angle monitoring, especially small-size pedestrians, we propose three improvements on top of Faster R-CNN: first, a parallel channel-cascading network that fuses differential information; second, an optimized feature extraction network that obtains enhanced features through weighted addition of image features; and third, a clustering algorithm that automatically finds suitable anchors to improve the RPN search mechanism. We denote the improved Faster R-CNN as Improved FRCNN, and Fig. 1 shows its overall architecture. In the proposed network, the feature fusion is designed around the parallel channel-cascading network. As shown in the figure, there are two inputs, an original image and a difference image; the parallel feature extraction networks are contained in the blue dotted boxes and the optimized RPN is shown in the green dotted frame.

Fig. 1. Overall framework of the proposed algorithm (the two inputs feed the parallel CCN branches; the enhanced feature is passed to the RPN, which outputs cls_score and bbox_regression, and to roi_pool_conv5, NMS and the detector).


3.1 Design of Feature Extraction Network

Object detection performance depends directly on how complete the extracted features are, and pedestrian detection is no exception. In general, the richer the image feature information, the better the classifier performs. In the traditional Faster R-CNN, VGG16 applies four pooling operations while extracting feature maps, so the feature map is reduced to 1/16 of the original size. If only the last layer of image features is used as the input of the next stage, a large amount of detailed information is lost. For small-size pedestrians, too much feature information is lost after this 1/16 reduction, and the pedestrian features cannot be reconstructed. When a deep network extracts features, low-level feature maps carry little semantic information but localize targets accurately, while high-level feature maps carry rich semantics but localize targets relatively poorly. In other words, shallower layers are better suited to predicting smaller objects and deeper layers to larger objects, so we make full use of the shallow information and propose a parallel Channel-Cascading Network (CCN). The two parallel CCNs share the same structure. Each CCN adopts a progressive cascading method that exploits the location information in the shallow layers to improve the localization of small-size pedestrians. The inputs of the network are the original image and the difference image obtained from a three-frame difference. To some extent, the three-frame difference sharpens the bilateral, coarse contour of the pedestrian and is faster than Gaussian background modeling. The two inputs are then fed into the parallel channel-cascading networks, and their outputs after the fifth layer are combined by weighted addition, which not only enhances the features but also captures more distortion-related information. The channel-cascading network structure is shown in Fig. 2.

Fig. 2. CCN network architecture (Conv1–Conv5 blocks with max pooling; the two concatenation points are marked 1 and 2).
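As a concrete illustration of the difference-image input described above, the following is a minimal sketch of a three-frame difference on grayscale frames stored as NumPy arrays. The thresholding value and the logical-AND combination are assumptions, since the paper does not spell out its exact implementation.

```python
import numpy as np

def three_frame_difference(prev_frame, cur_frame, next_frame, thresh=25):
    """Coarse moving-object contour from three consecutive grayscale frames."""
    # Absolute differences between consecutive frames (int16 avoids uint8 wrap-around).
    d1 = np.abs(cur_frame.astype(np.int16) - prev_frame.astype(np.int16))
    d2 = np.abs(next_frame.astype(np.int16) - cur_frame.astype(np.int16))
    # Keep pixels that changed in both differences, then binarize to a 0/255 mask.
    motion = np.logical_and(d1 > thresh, d2 > thresh)
    return (motion * 255).astype(np.uint8)
```

The resulting mask would serve as the second network input alongside the original frame.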


The idea behind the network is that the receptive field of a low convolution layer is smaller, so its response to a small target within a region is more pronounced, which helps to detect small-scale pedestrian instances. High-level feature maps have lower resolution because of the repeated max-pooling layers, but they carry more contour information, which helps to detect nearby and mid-range pedestrians. As shown in Fig. 2, instead of using the VGG outputs directly, we concatenate the pooled output of the first convolution block with the output of the second convolution block; the concatenated result replaces the second block's output in the original network and serves as the enhanced feature. This is the first concatenation. We then concatenate the pooled output of the second block with the output of the third block, and the concatenated result becomes the third-level output; after pooling, it is fed into the fourth convolution block. Progressive cascading may introduce useless ambient noise, which is not always helpful for small-scale pedestrians, so we use a 1 × 1 convolution to learn weights that suppress the noise and strengthen the network's ability to extract features. In addition, after each cascade level, the fused features are brought into the same space using local response normalization to highlight the image features. We use a two-dimensional planar diagram to show the details of the CCN implementation: the convolution feature maps are drawn with different widths, and the wider the rectangle, the lower the level and the more detail it contains. Figure 3 gives the details of the first cascade; labels 1 and 2 in Fig. 3 correspond to 1 and 2 in Fig. 2. The transformation relation is given in formula (1):

$$F_{tr}: I \rightarrow O,\quad I \in \mathbb{R}^{H \times W \times C},\; O \in \mathbb{R}^{H' \times W' \times C'},\qquad O = W \ast I = \sum_{n=1}^{c'} W_n^{c} \cdot X_n \tag{1}$$

Here $F_{tr}$ denotes the given transformation, implemented as a convolution; $I$ is the network input and $O$ the network output; $W = [W_1, W_2, \ldots, W_n]$ represents the filter weights, where $W_n$ is the $n$th filter weight. Note that the bias term is not taken into account in Eq. (1).

Fig. 3. Flowchart of progressive cascaded structure network


The dotted box in Fig. 3 shows the first concatenation process. In the figure, p denotes the pooling operation, and H, W, C are the height, width and channel count of the input image. $H'_1$, $W'_1$, $C'_1$ denote the height, width and channel count of the convolution feature map, with the subscript indicating the convolution layer; $H''_2$, $W''_2$, $C''_2$ are those of the enhanced feature map, where $C''_2 = C'_1 + C'_2$. First, Y is obtained through the mapping relationship, and the convolution output is then max-pooled to compress the image size, satisfying $H'_1/r = H'_2$, where r is the stride of the pooling operation. This reduces the size of the original image and simplifies the computation. The pooled output of the first layer is concatenated with the output feature map of the second layer, and the channel-compressed output of the concatenated result, obtained with a 1 × 1 convolution, is employed as the enhanced feature. The enhanced feature serves as the output of the second convolution layer of the new network. This is the detail of the first concatenation.
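A minimal tf.keras sketch of one cascade step as described above is given below: pool the previous block's output, concatenate it with the current block's output, compress channels with a 1 × 1 convolution, and apply local response normalization. The filter counts and pooling parameters are assumptions; the excerpt does not give the exact values used by the authors.

```python
import tensorflow as tf

def cascade_step(prev_feat, cur_feat, out_channels):
    """One progressive-cascade step: concatenate pooled low-level features with
    the current block output, then compress channels with a 1x1 convolution."""
    pooled = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)(prev_feat)
    merged = tf.keras.layers.Concatenate(axis=-1)([pooled, cur_feat])
    # The 1x1 convolution learns per-channel weights, suppressing ambient noise
    # introduced by the cascade while keeping the useful detail information.
    compressed = tf.keras.layers.Conv2D(out_channels, kernel_size=1,
                                        activation='relu')(merged)
    # Local response normalization brings the fused features into the same space.
    normalized = tf.keras.layers.Lambda(
        lambda t: tf.nn.local_response_normalization(t))(compressed)
    return normalized
```

Here `prev_feat` is assumed to have twice the spatial resolution of `cur_feat`, so the pooled map and the current block output line up for concatenation, matching the Conv1/Conv2 case in Fig. 2.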

3.2 Clustering Scheme

In two-stage detectors, region proposal is a vital step, and anchors play an important role in the region proposal stage of Faster R-CNN. An anchor in the RPN is a set of boxes; in the default configuration of Faster R-CNN there are 9 anchors at each image position, with sizes and ratios set empirically. Empirically chosen anchor scales and sizes introduce human error and are relatively time-consuming to tune, so we abandon this approach and use an unsupervised clustering algorithm to automatically obtain anchors that better match pedestrian characteristics. We also analyze the influence of small-scale anchors on small-scale pedestrian detection. We apply the k-means algorithm to the width–height ratios of the hand-labeled pedestrian boxes in the training set to automatically find the statistical law of the target boxes; the number of selected anchors is set by the number of clusters, and the cluster centers are used as the anchors. As shown in the green dotted frame in Fig. 1, after cluster analysis of the wide-angle monitoring dataset, the optimal number of anchors is selected with a hill-climbing procedure. The traditional k-means method uses the Euclidean distance, which means that larger boxes produce more error than smaller boxes and the clustering results may be biased. Therefore, we define our own cost function as

$$J(\text{box}, \text{center}) = 1 - \mathrm{IOU}(\text{box}, \text{center}) \tag{2}$$

Here IOU is the intersection over union of the box and the cluster center. The mathematical definition of IOU is

$$\mathrm{IOU}(\text{box}, \text{center}) = \frac{\text{area of overlap}(\text{box}, \text{center})}{\text{area of union}(\text{box}, \text{center})} \tag{3}$$
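A minimal NumPy sketch of k-means over ground-truth box widths and heights with the 1 − IOU cost of Eqs. (2)–(3) follows. Boxes are compared as if centered at the same point, which is the usual convention for anchor clustering and an assumption here; the function names are illustrative only.

```python
import numpy as np

def iou_wh(boxes, centers):
    """IOU between boxes and cluster centers given as (width, height) pairs,
    assuming all boxes share the same center point. Returns an (N, k) matrix."""
    w = np.minimum(boxes[:, None, 0], centers[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centers[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """k-means on (width, height) pairs with cost J = 1 - IOU(box, center)."""
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the center with the smallest 1 - IOU cost.
        assign = np.argmin(1.0 - iou_wh(boxes, centers), axis=1)
        new_centers = np.array([boxes[assign == j].mean(axis=0)
                                if np.any(assign == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign
```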


4 Experiment Evaluation

4.1 Dataset

To evaluate the optimized pedestrian detector, a wide-angle monitoring dataset was introduced. The dataset was sampled from panoramic monitoring camera videos recorded at a station in 2016, collected mainly for a pedestrian detection project in the wide-angle field. The image size is 960 × 1280 and sampling was done in the daytime. Images acquired under two camera angles are used as the training set, images from several other angles as the test set, and an additional dataset, a shopping-mall pedestrian detection dataset, is used for further verification. Figure 4 shows some examples. Pedestrian scales vary greatly and exhibit different degrees of distortion; in particular, a small-size pedestrian occupies only a few pixels in the image. The improved pedestrian detector is therefore applied, with transfer learning used for the pedestrian detection task. All of our experiments are based on TensorFlow.

Fig. 4. Example images from the wide-angle monitoring dataset

4.2 Effectiveness Analysis of Feature Extraction

In general, during feature extraction in a convolutional neural network, the foreground has high activation, which makes the feature map clearer and such images easier to classify and detect. Figure 5 shows the visualization of the features extracted from each convolution layer: (a) shows the feature maps extracted from the first to fifth layers by VGG, and (b) shows the corresponding output feature maps of the CCN. Comparing the second and third columns of (a) and (b), the pedestrian contours extracted by the CCN are clearer and the background is cleaner. The difference in the fourth-layer feature maps is relatively large; adding local response normalization in (b) strengthens the response before it is fed into the fifth convolution layer, which makes the pedestrian features more distinct. More visualization results are given in Fig. 6, with the original images in the first column, the fifth-layer feature maps extracted by VGG in the second, and the feature maps extracted by the CCN in the third.

Fig. 5. Visual comparison of feature maps: (a) feature maps of each layer of VGG; (b) feature maps of each layer of CCN

4.3 Analysis of the Effectiveness of the Parallel Network

Figure 7 is a schematic diagram of feature extraction in the parallel channel-cascading network. After the parallel CCNs, the original-image features and the difference-image features are extracted separately; the two outputs are then combined by pixel-wise weighted addition, and the resulting feature maps are much more pronounced in the target region. Obtaining good features is an important factor in improving detection performance.
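The pixel-wise weighted addition itself can be sketched in a few lines. The weighting coefficient below is an assumption, since the excerpt does not state the values used; the code works equally on NumPy arrays or TensorFlow tensors of matching shape.

```python
def fuse_parallel_features(orig_feat, diff_feat, alpha=0.5):
    """Pixel-wise weighted addition of the original-image and difference-image
    feature maps produced by the two parallel CCN branches."""
    return alpha * orig_feat + (1.0 - alpha) * diff_feat
```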


Fig. 6. Some visual contrast results. Column left: original images; Column middle: the feature images extracted by VGG; Column right: the feature images extracted by CCN

4.4 Analysis of Detection Results

The network parameters are a learning rate of 0.01 and a maximum of 40,000 iterations. The wide-angle monitoring pedestrian dataset is used in the experiments. We carried out several trials to design the most effective channel-cascading network. First, we explored the effect of cascading different layers on detection accuracy, and the effective concatenated convolution layers were used as the final choice. In Table 1, M_P stands for pedestrians at mid range, N_P for pedestrians nearby, and F_P for pedestrians in the distance. The network structures listed from top to bottom are structure one to structure four; 1, 2, 3, 4 and 5 denote the different convolution layers, and brackets denote cascading operations between layers. The table compares the influence of cascading different levels on pedestrian detection performance. Structure four shows that cascading high-level information brings no gain in accuracy. Based on the comparison of the different network structures, we select structure three as the optimized feature extraction network. Compared with structure one (VGG16), the mean average precision of the optimized structure improves by 15.4%, and the average precision for small-size pedestrians increases by 14.2%.

Fig. 7. A schematic diagram of feature extraction in the parallel CCN

Table 1. Comparison of detection performance in cascades between different layers

Network structure        AP(N_P)  AP(M_P)  AP(F_P)  MAP
1-2-3-4-5                66.1%    52.1%    29.4%    49.2%
1-(1,2)-3-(3,4)-5        75.7%    71.3%    39.5%    63.2%
1-(1,2)-(2,3)-4-5        77.9%    72.3%    43.6%    64.6%
1-(1,2)-(2,3)-(3,4)-5    77.6%    71.9%    30.3%    60.0%

Secondly, when using the k-means clustering algorithm, the selection of the value of k is very important. The cost curve in Fig. 8 shows how the loss changes: as k increases, the loss decreases gradually until it levels off, and the value at which the curve flattens is taken as the number of anchors. The cost function changes very little when k > 9, so we take k = 9. Anchors are then optimized by k-means, with the initial values generated from the labeled pedestrian dataset. The anchors in the Improved FRCNN correspond to scales = [2, 4, 16] and ratios = [0.5, 1, 2].
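The choice of k can be made by sweeping k and watching where the clustering cost (Eq. (2) averaged over all boxes) flattens out. The sketch below reuses the hypothetical `kmeans_anchors` and `iou_wh` helpers from the earlier sketch; `train_boxes_wh` is an assumed array of ground-truth (width, height) pairs.

```python
import numpy as np

def mean_cost(boxes, k):
    """Average J = 1 - IOU between each box and its assigned cluster center."""
    centers, assign = kmeans_anchors(boxes, k)
    ious = iou_wh(boxes, centers)
    return np.mean(1.0 - ious[np.arange(len(boxes)), assign])

# Sweep k and keep the value where the cost curve flattens (k = 9 in this paper).
# costs = {k: mean_cost(train_boxes_wh, k) for k in range(2, 15)}
```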


Fig. 8. Cost function curve

Table 2. Clustering results

Width/pixel   Height/pixel
7.235639      14.352453
26.708819     31.565810
4.800436      9.723116
15.494792     25.309570
6.129348      11.411626
11.619378     20.648536
9.334852      17.387967
19.955491     29.656258
3.832031      7.137695

Table 2 lists the clustered anchor widths and heights. Table 3 shows the anchor coordinates: the lower-left corner of an anchor is denoted (x1, y1) and the upper-right corner (x2, y2). The second column gives the anchor coordinates in the original network and the third column the optimized anchor coordinates. The results show that the improved anchors become smaller, better matching the shape of small-size pedestrians. Table 4 shows that, compared with the empirically set anchors, the overall detection rate increases by 20.6% after anchor-box optimization, and the small-size pedestrian detection rate increases by 21.8%. Thirdly, the control-variate method is applied to further study small-size pedestrian detection: we keep the large-scale anchors unchanged and use only the small anchors during detection. The results in Fig. 9 reveal that the small anchors selected by clustering are more representative.


Table 3. Coordinates of anchors

NO.  Anchor coordinates in Faster-RCNN/pixel (x1, y1, x2, y2)   Anchor coordinates in Improved FRCNN/pixel (x1, y1, x2, y2)
1    (−83, −39, 100, 56)       (−105, −138, 120, 152)
2    (−175, −87, 192, 104)     (−55, −112, 71, 127)
3    (−359, −183, 376, 200)    (−147, −181, 162, 196)
4    (−55, −55, 72, 72)        (−23, −54, 39, 69)
5    (−119, −119, 136, 136)    (−16, −37, 31, 52)
6    (−247, −247, 264, 264)    (−33, −74, 48, 89)
7    (−35, −79, 52, 96)        (−47, −90, 62, 104)
8    (−79, −167, 96, 184)      (−74, −132, 89, 147)
9    (−167, −343, 184, 360)    (−101, −181, 116, 196)
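As an illustration of how clustered (width, height) pairs might be turned into corner-format anchor boxes centered at the origin, a hedged sketch follows. The scaling factor is an assumption; the paper's exact mapping from Table 2 to Table 3 is not spelled out in this excerpt.

```python
import numpy as np

def centers_to_anchors(centers_wh, scale=16):
    """Turn clustered (width, height) pairs into (x1, y1, x2, y2) anchors
    centered at the origin. `scale` is an assumed feature-map stride."""
    anchors = []
    for w, h in centers_wh:
        w, h = w * scale, h * scale
        anchors.append((-w / 2.0, -h / 2.0, w / 2.0, h / 2.0))
    return np.round(np.array(anchors)).astype(int)
```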

Table 4. Detection results

Method                AP(N_P)  AP(M_P)  AP(F_P)  MAP
Experiential method   66.1%    52.1%    29.4%    49.2%
Clustering algorithm  80.6%    77.5%    51.2%    69.8%

Fig. 9. The detection result of F_P

Finally, the detector is optimized with all three improvements, giving the Improved FRCNN. Table 5 compares the results of Improved FRCNN, FPN and the classical Faster R-CNN. The results show that the average detection accuracy of Improved FRCNN is higher than that of the other two algorithms: the mean average precision improves by 25.2% over the classical Faster R-CNN, and the average precision for small-size pedestrians increases by 30.3%.

Table 5. Performance comparison of different algorithms

Algorithm       AP(N_P)  AP(M_P)  AP(F_P)  MAP
Faster-RCNN     66.1%    52.1%    29.4%    49.2%
FPN             45.5%    49.3%    30.2%    41.7%
Improved FRCNN  83.3%    80.1%    59.7%    74.4%


Fig. 10. Detection results in different wide-angle scenes: (a) Faster-RCNN detection results; (b) FPN detection results; (c) detection results of the optimized algorithm (Improved FRCNN)

The experimental results of the proposed algorithm, the classic Faster R-CNN and the FPN algorithm on the wide-angle monitoring dataset are shown in Fig. 10. Panel (a) shows the detection results of Faster R-CNN under different views of several datasets, (b) shows the corresponding results of FPN (Feature Pyramid Networks), and (c) shows the corresponding results of Improved FRCNN. On the wide-angle monitoring dataset, the classic Faster R-CNN still falls short on small-size pedestrian detection; FPN adds multi-scale prediction to improve small-size pedestrian detection, but it produces more false detections and performs poorly in some cases with large distortion. The proposed algorithm performs better on small-size pedestrians in the large field of view: the miss rate for small-size pedestrians is reduced and the overall detection rate is improved. Figure 11 shows the P-R curves of the three algorithms, with the red curve for Faster R-CNN, the blue curve for FPN and the cyan curve for the optimized classifier. The cyan curve completely encloses the blue and red curves, which means that the optimized classifier outperforms Faster R-CNN and FPN; the comparison shows that the optimized classifier works better in the wide-angle field of view.


Fig. 11. P-R curve contrast diagram

5 Conclusion

In this paper, we proposed a parallel channel-cascade pedestrian detector that uses difference information to guide the network to learn more about small targets. We first investigated a framework that connects different layers and generates enhanced feature maps. Compared with state-of-the-art small-size pedestrian detectors, our method achieves comparable accuracy on the wide-angle monitoring pedestrian dataset. With the parallel channel-cascade network, we have shown its advantages on different datasets, and the visualization of the feature maps shows the superiority of the parallel cascade network in feature extraction and the effectiveness of the weak supervisory information. The advantages of the proposed algorithm are that it not only introduces differential information to construct cascaded networks and optimizes the feature extraction network through progressive cascading, but also improves the RPN search mechanism effectively with an unsupervised learning algorithm, thus alleviating the problem that small-size pedestrians cannot be detected. The method has been applied in engineering practice and its reliability verified.

Acknowledgments. This work is supported in part by the National Natural Science Foundation of China (No. 61771270), in part by the Natural Science Foundation of Zhejiang Province (No. LY9F0001, No. LQ15F020004, No. LY19F010006), and by the Key Research and Development Plan of Zhejiang Province (2018C01086).

References

1. Zhang, Q.: Research on pedestrian detection methods on still images. University of Science and Technology of China (2015). (in Chinese)
2. Lin, T.Y., Dollar, P., Girshick, R., et al.: Feature pyramid networks for object detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 936–944 (2016)
3. Wang, B.: Pedestrian detection based on deep learning. Beijing Jiaotong University (2015). (in Chinese)


4. Zhang, J., Xiao, J., Zhou, C., et al.: A multi-class pedestrian detection network for distorted pedestrians. In: 2018 13th IEEE Conference on Industrial Electronics and Applications (ICIEA), May, pp. 1079–1083 (2018)
5. He, J., Liu, K., Zhang, Y., et al.: A channel-cascading pedestrian detection network for small-size pedestrians, pp. 325–338. Springer (2018)
6. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60, 91–110 (2004)
7. Dalal, N., Triggs, B., et al.: Histograms of oriented gradients for human detection. In: Computer Vision and Pattern Recognition, pp. 886–893 (2005)
8. Zhu, Q., Yeh, M.C., Cheng, K.T., et al.: Fast human detection using a cascade of histograms of oriented gradients. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1491–1498 (2006)
9. Burges, C.J.: A tutorial on support vector machines for pattern recognition. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 121–167 (1998)
10. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. In: European Conference on Computational Learning Theory, pp. 23–37. Springer, Berlin (1995)
11. Felzenszwalb, P.F., et al.: Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32, 1627–1645 (2010)
12. Wang, X.: An HOG-LBP human detector with partial occlusion handling. In: Proceedings of the IEEE International Conference on Computer Vision, Kyoto, Japan, September, pp. 32–39 (2009)
13. Kuo, W., Hariharan, B., Malik, J.: DeepBox: learning objectness with convolutional networks. In: IEEE International Conference on Computer Vision, pp. 2479–2487 (2015)
14. Girshick, R., et al.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1904–1916 (2014)
15. He, K., Zhang, X., Ren, S., et al.: Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37, 1904–1916 (2014)
16. Girshick, R.: Fast R-CNN. In: Computer Science, pp. 1440–1448 (2015)
17. Ren, S., He, K., Girshick, R., et al.: Faster R-CNN: towards real-time object detection with region proposal networks. In: International Conference on Neural Information Processing Systems, pp. 1137–1149 (2015)
18. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger, pp. 6517–6525 (2016)
19. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv:1804.02767 (2018)
20. Liu, W., Anguelov, D., Erhan, D., et al.: SSD: single shot MultiBox detector. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 21–37 (2015)
21. He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2015)
22. Kong, T., Yao, A., Chen, Y., et al.: HyperNet: towards accurate region proposal generation and joint object detection. In: Computer Vision and Pattern Recognition, pp. 845–853 (2016)
23. Cai, Z., et al.: A unified multi-scale deep convolutional neural network for fast object detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 354–370 (2016)


24. Hariharan, B., Arbelaez, P., Girshick, R., et al.: Hypercolumns for object segmentation and fine-grained localization. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 447–456 (2014)
25. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 640–651 (2014)
26. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Computer Science, pp. 730–734 (2014)

Novel Scheme for Image Encryption and Decryption Based on a Hermite-Gaussian Matrix

Mohammed Alsaedi
College of Computer Science and Engineering, Taibah University, Medina 30001, Kingdom of Saudi Arabia
[email protected], [email protected]

Abstract. Media security is an issue of great concern over the internet and during wireless transmissions. In this paper, a novel scheme for image encryption and decryption is proposed based on a Hermite-Gaussian matrix and an array of subkeys. The proposed scheme includes a secret key that is processed to extract an array of subkeys, which are employed with the extracted phase part of the inverse Fourier transform of a Hermite-Gaussian matrix to encrypt and decrypt a grayscale image. After key generation and the production of two columns of subkeys, the Hermite-Gaussian matrix is multiplied by the first group of subkeys and subjected to a modulus operation (the remainder) with the second group, and the resulting matrix is verified for singularity. If the singularity test is passed, then the resulting image is multiplied by the original image and the output is subjected to a modulus operation and is used for the following subkeys. However, if the singularity test fails, then a new subkey is chosen and the process is repeated until all subkeys are tested and used to produce the encrypted image. For subsequent decryption, the reverse process is implemented to recover the original image from the encrypted one. Statistical analysis shows that the proposed scheme is robust and is strong against attacks. The correlation factor, among other tests, shows that reshuffling the image pixels reduces the correlation between neighboring pixels to very low values (i.e.,

$\ldots = O_2^*$, and $O > O_2^* \;\Longrightarrow\; \tilde{O} > O_2^*$.   (6)


If one can establish proxies for each of the six terms in (4), then it will follow from (4)–(6) that

$$\tilde{O}_2^* \;\triangleq\; \min\{\tilde{O}_{2,b},\, \tilde{O}_{2,w},\, \tilde{O}_{bi,1},\, \tilde{O}_{bi,2},\, \tilde{O}_{bi,3},\, \tilde{O}_{bi,4}\} \;=\; O_2^*. \tag{7}$$

Moreover, if the proxy Õ for some particular type of reconstruction equals Õ2*, then Õ = O2*, and the set of all reconstructions of this type that are optimal is precisely the set of those with Õ odd bonds. Conversely, if Õ > Õ2* for some type of reconstruction, then O > O2*, and consequently, no reconstruction of this type can be optimal. In summary, the set of all optimal reconstructions for a given boundary r is the union, over the types whose proxy Õ equals Õ2*, of the reconstructions of those types that attain their respective proxy.

It remains now to find the six proxies. As will be seen, each will be based on generating reconstructions from 1-run optimal reconstruction connecting paths, so that the values of the proxies are straightforward to compute using Theorem 1. As will be discussed in more detail subsequently, there are situations when one can immediately rule out the existence of one of the six types of reconstructions from the values on the boundary, in which case one can set the value of the corresponding proxy to be infinite.

First, the proxy for O2,b is based on generating a reconstruction by connecting the endpoints of each black run with a 1-run optimal reconstruction path, and is defined as

$$\tilde{O}_{2,b} \;\triangleq\; \begin{cases} \infty & \text{if there is a single-pixel white run in a corner} \\ O_1^*(b_1) + O_1^*(b_2) & \text{else.} \end{cases} \tag{8}$$

The quantity Õ2,b is the number of odd bonds in a reconstruction generated by two non-touching simple black reconstruction paths, if such a pair of paths exist. If there exist such, then Õ2,b is obviously the least number of odd bonds among reconstructions generated by non-touching black paths, i.e., it equals O2,b. Note that boundaries with a single-pixel white run in a corner are the only boundaries for which O2,b is infinite, but are not the only ones for which O2,b ≠ O1*(b1) + O1*(b2). For example, if there is a single-pixel black run in a corner and each of the white runs is a single pixel, then while there is a reconstruction generated by non-touching black reconstruction paths, there is no reconstruction generated by non-touching simple black reconstruction paths. The proxy Õ2,w for O2,w is defined similarly. Note that if there are two single-pixel runs in corners, they must be the same color; therefore, at least one of Õ2,w and Õ2,b will be finite.

Lemma 4. Õ2,b and Õ2,w are proxies for O2,b and O2,w, respectively.

Proof. By symmetry it suffices to show that Õ2,b is a proxy for O2,b, which we do by demonstrating (5) and (6). First assume O2,b = O2*. Every pair of simple black paths do not touch, because if a pair did touch, then merging the 1-run reconstructions generated by each path would create a reconstruction of a different type that had fewer odd bonds than O2,b, in which case O2,b > O2*, which contradicts the assumption. Furthermore, any such non-touching simple black paths generate a reconstruction with Õ2,b odd bonds. Therefore Õ2,b = O2,b = O2*, demonstrating (5).

Now assume O2,b > O2*. If O2,b = ∞, there do not exist non-touching black reconstruction paths, which means that there is a single-pixel white run in a corner. Thus, Õ2,b = ∞ and hence Õ2,b > O2*. Now assume that ∞ > O2,b > O2*. If O2,b > Õ2,b, then all pairs of simple black paths touch, and again there is a reconstruction of a different type better than that counted by Õ2,b, so Õ2,b > O2*. If Õ2,b = O2,b, then Õ2,b > O2*. This completes the proof of (6) and the lemma.

In defining a proxy for Obi,i, first note that there are two cases in which there does not exist an island-free bi-connected reconstruction with a widget at odd bond i, in which cases Obi,i = ∞. The first is when a widget cannot exist at i, which happens when and only when one of the pixels of odd bond i is a corner pixel and is contained in a run of length greater than 1. The second is when a widget is possible at i but one of the sub odd bonds contains a single-pixel run in a corner. In this case, a feasible trio is not possible, and therefore no island-free reconstruction exists. Even when a feasible trio is possible, it may not be possible to form a simple feasible trio. Recall from Sect. 6 that a pair of odd bonds that can be connected with a connecting pair of simple paths is called properly oriented. If for boundary odd bond i one of the pairs of sub odd bonds is not properly oriented, then there does not exist a simple feasible trio for i.

Now define the following quantity, which will soon be demonstrated to be a proxy for Obi,i:

$$\tilde{O}_{bi,i} \;\triangleq\; \begin{cases} O_1^*(ob_1, OB_1) + O_1^*(ob_2, OB_2) + O_1^*(ob_3, OB_3) + 1 & \text{s.c.t.} \\ \infty & \text{not s.c.t.,} \end{cases}$$

where the condition s.c.t. indicates that there are three pairs of connecting simple paths between the pairs of sub odd bonds for boundary odd bond i. In other words, a widget is possible, all sub odd bonds for i are properly oriented, and no sub odd bond contains a single-pixel run in a corner. Note that even if condition s.c.t. holds, there still might not exist a simple feasible trio, for instance if any set of three connecting pairs of simple paths includes an adjacent connecting pair that overlap. However, the quantity Õbi,i is still computable. Moreover, as the following lemma argues, if a feasible trio is not simple, then the reconstruction generated by it is not optimal.

Lemma 5 (Bi-Connected MAP Implies Simple Feasible Trio). Suppose (w1, b1, w2, b2) is a 2-run boundary for which (W, B) is a bi-connected minimum odd bond reconstruction with a widget at the ith boundary odd bond. Then the feasible trio q1, q2, q3 that generates (W, B) is simple (i.e., each qj is 1-run optimal), and consequently, all sub odd bonds (obj, OBj) are properly oriented.

Proof. We will demonstrate the contrapositive, namely, that any bi-connected, island-free reconstruction (W, B) for a 2-run boundary (w1, b1, w2, b2) generated by a non-simple feasible trio is not optimal. Accordingly, let (W, B) be such a


reconstruction generated by a feasible trio q1, q2, q3 with at least one non-simple member, and let C_{b,1}, C_{b,2}, C_{w,1}, C_{w,2} denote its 4-connected monotone regions. For any C ⊂ V, let O(C) denote the number of odd bonds in the reconstruction with B = C and W = V \ C, or equally, the reconstruction with W = C and B = V \ C. Then, one can easily see that the number of odd bonds in (W, B) can be expressed in any of the following ways:

$$O(C_{w,1} \cup C_{w,2}) = O(C_{w,1}) + O(C_{w,2}) - 2 = O(C_{b,1}) + O(C_{b,2}) - 2 = O(C_{b,1} \cup C_{b,2}).$$

If q1 is not simple, one can replace it with a simple path $q'_1$, creating $C'_{w,1} = C(w_1, q_1'^{w})$ and generating the reconstruction $C'_{w,1} \cup C_{w,2}$ with $O(C'_{w,1} \cup C_{w,2})$ odd bonds. Since $q'_1$ is 1-run optimal and $q_1^{w}$ is not, $O(C'_{w,1}) < O(C_{w,1})$. Moreover, the sets $C'_{w,1}$ and $C_{w,2}$ may overlap or touch, and consequently,

$$O(C'_{w,1} \cup C_{w,2}) \le O(C'_{w,1}) + O(C_{w,2}) - 2.$$

It then follows that

$$O(C'_{w,1} \cup C_{w,2}) \le O(C'_{w,1}) + O(C_{w,2}) - 2 < O(C_{w,1}) + O(C_{w,2}) - 2 = O(C_{w,1} \cup C_{w,2}), \tag{9}$$

which implies that (W, B) is not a minimum odd bond configuration. The same argument shows that (W, B) is not minimum odd bond if q3 is not simple. Now assume that both q1 and q3 are simple, but q2 is not. First note that one can replace q1 with $\bar{q}_1$ such that $\bar{q}_1^{w}$ is the inner path for the white run w1; and likewise replace q3 with $\bar{q}_3$ such that $\bar{q}_3^{b}$ is the inner path for the black run b1. The resulting trio is still feasible, and the new $\bar{q}_1$ and $\bar{q}_3$ are still simple with the same numbers of odd bonds as q1 and q3, respectively, i.e., $O(\bar{q}_1) = O(q_1)$ and $O(\bar{q}_3) = O(q_3)$. Next, it can be easily seen that q2 can be replaced with a simple $\bar{q}_2$ such that either $\bar{q}_2^{w}$ does not cross $\bar{q}_3^{w}$, or $\bar{q}_2^{b}$ does not cross $\bar{q}_1^{b}$. If $\bar{q}_2^{w}$ does not cross $\bar{q}_3^{w}$, consider the reconstruction generated by $\bar{C}_{w,1} \cup C(w_2, \bar{q}_3^{w}, \bar{q}_2^{w})$. On the other hand, if $\bar{q}_2^{b}$ does not cross $\bar{q}_1^{b}$, consider the reconstruction generated by $\bar{C}_{b,1} \cup C(b_2, \bar{q}_1^{b}, \bar{q}_2^{b})$. In either case, the resulting three paths form a simple feasible trio, and an argument like that in (9) shows that the reconstruction has fewer odd bonds than the one with which we started, implying that the latter is not optimal, and completing the proof of the two-corner case, and the theorem itself.

We now argue that Õbi,i is a proxy for Obi,i.

Lemma 6. Õbi,i is a proxy for Obi,i.

Proof. We need to show that Obi,i = O2* implies Õbi,i = O2* and Obi,i > O2* implies Õbi,i > O2*. If Obi,i = O2*, then all simple trios for i are feasible. For if there was a simple trio that was not feasible, then there would be a reconstruction of a different

M. G. Reyes et al.

type with fewer odd bonds than a bi-connected, island-free reconstruction with a widget at i, which would contradict Obi,i = O2∗ . Since all simple trios are feasible, ˜ bi,i = Obi,i = O∗ . O 2 Now assume Obi,i > O2∗ . First, suppose that all pairs of sub odd bonds are ˜ bi,i = Obi,i > O∗ . If properly oriented. If all simple trios are feasible, then O 2 ˜ bi,i and there exists a there is a simple trio that is not feasible, then Obi,i > O reconstruction of a type other than an island-free bi-connected reconstruction ˜ bi,i > O∗ . Next suppose with a widget at i with fewer odd bonds. Therefore O 2 that one of the pairs of sub odd bonds for i is not properly oriented. By its ˜ bi,i = ∞ > O∗ . definition, O 2

Fig. 5. Minimum odd bond reconstructions for a given 2-run boundary.

˜ ∗ . The Having established proxies for each term in (4), it follows that O2∗ = O 2 following theorem summarizes the set of all minimum odd bond reconstructions for 2-run boundaries. Figure 5(a)–(e) illustrates a situation where the set of MAP reconstructions may consist of reconstructions of more than one type. Theorem 2 (2-Run MAP Reconstructions). Consider a 2-run boundary r = (b1 , w1 , b2 , w2 ) with no boundary run containing four corners.

Δ ˜∗ = ˜ 2,b , O ˜ 2,w , O ˜ bi,1 , O ˜ bi,2 , O ˜ bi,3 , O ˜ bi,4 . (A) O2∗ = O min O 2 ˜ 2,w = O ˜ ∗ , then all reconstructions generated by a pair of simple white (B) If O 2 reconstruction paths are optimal, and no pair of simple white reconstruction paths touch. ˜ ∗ , then all reconstructions generated by a pair of simple black ˜ 2,b = O (C) If O 2 reconstruction paths are optimal, and no pair of simple black reconstruction paths touch. ˜ ∗ , then all bi-connected, island-free reconstructions with a widget ˜ bi,i = O (D) If O 2 at boundary odd bond i generated by a simple trio of reconstruction paths are optimal. Such trios are feasible, and the reconstructions are the only bi-connected, island-free reconstructions with a widget at odd boundary bond i that are optimal. ˜ terms in (A) does not equal O ˜ ∗ , then no reconstruction of the (E) If an O 2 corresponding type will be optimal. (F) There are no other optimal reconstructions.


8 Concluding Remarks

This paper has derived MAP estimates for blocks conditioned on a boundary with 1 or 2 runs. Unlike traditional applications of Ford-Fulkerson, our solutions are closed-form and are semantically informative with respect to the motivating image reconstruction problem. Looking forward, the solutions in this paper for 1- and 2-run boundaries will be useful for boundaries with more than 2 runs, as island-free reconstructions are determined by those runs that contain a corner. Moreover, there are potentially fruitful connections to be explored between our motivating sampling and reconstruction problem and recent developments in the connections between Markov image models and deep learning [10].

References

1. Pappas, T.N.: An adaptive clustering algorithm for image segmentation. IEEE Trans. Signal Process. 40, 901–914 (1992)
2. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning, pp. 282–289 (2001)
3. Greig, D.M., Porteus, B.T., Seheult, A.H.: Exact MAP estimation for binary images. J. Roy. Stat. Soc. Ser. B 51(2), 271–279 (1989)
4. Lakshmanan, S., Zhao, D., Gallagher, A.M.: Multiresolution image compression using Gaussian-Markov random fields. In: Proceedings 36th Midwest Symposium Circuit System, vol. 1, August 1993
5. Forschhammer, S., Rasmussen, T.S.: Adaptive partially hidden Markov models with application to bilevel image coding. IEEE Trans. Image Proc. 8, 1516–1526 (1999)
6. Krishnamoorthi, R., Seetharaman, K.: Image compression based on a family of stochastic models. Signal Process. 87, 408–416 (2007)
7. Reyes, M.G., Neuhoff, D.L., Pappas, T.N.: Lossy cutset coding of bilevel images based on Markov random fields. IEEE Trans. Image Proc. 23, 1652–1665 (2014)
8. Reyes, M.G., Neuhoff, D.L.: Cutset width and spacing for Reduced Cutset Coding of Markov random fields. In: ISIT (2016)
9. Reyes, M.G., Neuhoff, D.L.: Row-centric lossless compression of Markov images. In: ISIT (2017)
10. Mehta, P., Schwab, D.J.: An exact mapping between variational renormalization group and deep learning, October 2014. https://arxiv.org/abs/1410.3831
11. Baxter, R.J.: Exactly Solved Models in Statistical Mechanics. Academic, New York (1982)
12. Ruzic, T., Pizurica, A.: Context-aware patch-based image inpainting using Markov random field modeling. IEEE Trans. Image Proc. 24, 444–456 (2015)
13. Prelee, M.A., Neuhoff, D.L.: Image interpolation from Manhattan cutset samples via orthogonal gradient method. In: ICIP 2014, pp. 1842–1846 (2014)
14. Picard, J.-C., Queyranne, M.: On the structure of all minimum cuts in a network and applications. In: Mathematical Programming Study, vol. 13, North-Holland (1980)


15. Vazirani, V.V., Yannakakis, M.: Suboptimal cuts: their enumeration, weight and number. In: ICALP 1992 Proceedings of 19th International Colloquium on Automata, Languages and Programming, pp. 366–377, July 1992
16. Ford Jr., L.R., Fulkerson, D.R.: Flows in Networks. Princeton University Press, Princeton (2015)

Volumetric Data Exploration with Machine Learning-Aided Visualization in Neutron Science

Yawei Hui1 and Yaohua Liu2
1 Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA, [email protected]
2 Neutron Scattering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA, [email protected]

Abstract. Recent advancements in neutron and X-ray sources, instrumentation and data collection modes have significantly increased the experimental data size (which could easily contain 10^8–10^10 data points), so that conventional volumetric visualization approaches become inefficient for both still imaging and interactive OpenGL rendition in a 3D setting. We introduce a new approach based on the unsupervised machine learning algorithm, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), to efficiently analyze and visualize large volumetric datasets. Here we present two examples of analyzing and visualizing datasets from the diffuse scattering experiment of a single crystal sample and the tomographic reconstruction of a neutron scanning of a turbine blade. We found that by using the intensity as the weighting factor in the clustering process, DBSCAN becomes very effective in de-noising and feature/boundary detection, and thus enables better visualization of the hierarchical internal structures of the neutron scattering data.

Keywords: Scientific visualization · Feature extraction · Unsupervised learning and clustering · Volumetric dataset

1 Introduction

It has been a long-term challenge to effectively visualize 3D objects derived from large volumetric datasets in many scientific disciplines, industry domains and medical applications [1–3].

(Footnote: This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).)

Most implemented techniques focus on the direct volume-rendering


(DVR) algorithm, which excels in its high sensitivity to the delicate structures of the 3D objects at the expense of computational cost. For moderately sized datasets (typically one to ten million data points) with simple density profiles, it is relatively easy to manipulate the transfer functions (TF) [4, 5] used in DVR (e.g., threshold cut-off and segmented alpha ranges) so that independent features of the 3D object and the boundaries between the signal and background noise can be well determined. However, when the complexity of the internal structures or simply the size of the dataset increases, one enters the domain of large datasets (typically a hundred million to ten billion data points), and a simple scheme involving only TF manipulation can no longer work efficiently.

In this work we propose a new visualization analysis approach that, besides the TF manipulation, takes into account the spatial statistics of the data points. This approach enables one to explore fine structures in the sense of spatial clustering of 3D objects. As a preliminary yet crucial step in the visualization workflow, this analysis plays the roles of noise filtering, feature extraction with boundary detection, and generating well-defined subsets of data for the final visualization. Among several algorithms that have been tested (including k-means, Independent Component Analysis, Principal Component Analysis, Blind Linear Unmixing, etc.), we found that the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [6] is very effective in accomplishing the clustering tasks for our visualization analysis. Surprisingly, there are very limited applications of this algorithm to 3D datasets so far, and among the several applications that did utilize DBSCAN [7–10], the authors exclusively took the vanilla form of this density-based clustering algorithm, i.e. applying DBSCAN without considering the physical value at each voxel in the 3D volume. This lack of thorough exploration of DBSCAN's applications inspires us to investigate, for the first time, its ability to detect and identify 3D features and create visualizations from large volumetric datasets with both photometric and spatial information.

Two exemplary applications of our method are presented with neutron datasets from a single crystal diffuse scattering experiment and a neutron tomography imaging reconstruction, acquired at the Spallation Neutron Source (SNS) and the High Flux Isotope Reactor (HFIR), respectively, at Oak Ridge National Laboratory (ORNL). Recent advancements in neutron sources, instrumentation and data collection modes have pushed the size of experimental datasets into the big data domain, which poses challenges for both still imaging and interactive OpenGL rendition in a 3D setting. In this work we show that, by using the intensity as the weighting factor in the clustering process, DBSCAN enables one to spatially separate and extract interesting scattering features from the bulk data. A single feature or a combination of many of them can be chosen to create concise yet highly informative 2D projections of the 3D objects (i.e. still imaging), or to render 3D OpenGL objects interactively so that one can explore the datasets in much more detail by moving, rotating and zooming in/out around them. In the following sections, we focus on the visualization analysis in Sect. 2, the applications of DBSCAN to neutron datasets in Sect. 3, some discussions and perspectives on future work in Sect. 4, and a brief conclusion in Sect. 5.


2 Visualization Analysis

The goal of our visualization procedure is to explore and identify independent features for visualization and eliminate the noise at the same time, with as little user intervention as possible. In this section, we first introduce some specific characteristics of the neutron datasets used in our analysis and show the traditional DVR visualization obtained solely by TF manipulation on these datasets. We then lay out the details of the DBSCAN-aided visualization analysis.

Fig. 1. Exemplary 2D cross-section images of the single crystal diffuse scattering data from CZO. The 2D slices were cut perpendicular to the K axis in the sample’s reciprocal space with a thickness of 0.02 rlu. Data are plotted with relative scattering intensity in the logarithmic scale.

2.1 Characteristics of the 3D Neutron Data

The first dataset used in our analysis was collected at the elastic diffuse scattering spectrometer beamline CORELLI at SNS on a sample of single-crystal calcium-stabilized zirconia of composition Zr0.85Ca0.15O1.85 (CZO hereafter). The experimental data have been reduced into the sample's reciprocal space using Mantid [11, 12]. The reduced scattering dataset is saved as a 701 × 701 × 701 single-precision 3D matrix


with dimensions along the H, K and L axes of the (evenly spaced) reciprocal lattice. Figure 1 shows several exemplary 2D slices cut perpendicularly to the K axis. The intensity of CORELLI data typically spans a high dynamic range (~6 orders of magnitude). There exist both Bragg peaks (strong and sharp spots seen in slices K = −7 and K = 0) and diffuse scattering (broad and weak features pervasive in all slices). All "NaN" values, which represent no experimental data, are pre-emptively removed from the analysis. These sliced images clearly show intricate features existing in the 3D space, which are exactly what researchers are most interested in extracting and visualizing.
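A sketch of how such a reduced volume can be flattened into a point list for the later clustering step is shown below. It assumes the 701 × 701 × 701 array is already available as a NumPy array; the function name and data layout are illustrative, not the authors' actual code.

```python
import numpy as np

def volume_to_points(volume):
    """Flatten a 3D intensity volume into (h_index, k_index, l_index, intensity)
    rows, dropping NaN voxels that carry no experimental data."""
    mask = ~np.isnan(volume)          # keep only voxels with measured intensity
    h, k, l = np.nonzero(mask)        # voxel indices of the valid points
    return np.column_stack([h, k, l, volume[mask]])
```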

Fig. 2. 2D cross-section images along the Z-axis in the real space after the tomography reconstruction for an Inconel 718 turbine blade, imaged at the CG-1D cold neutron imaging prototype facility at HFIR of ORNL. The background noise can be clearly seen in these images.

The second dataset was created by a tomographic reconstruction of an Inconel 718 turbine blade (Turbine hereafter), imaged at the CG-1D cold neutron imaging prototype facility at HFIR of ORNL [2]. After the reconstruction, the Turbine dataset was saved in a 1997 × 1997 × 1997 single-precision matrix with dimensions along the X, Y, Z axes of the 3D real space. Figure 2 shows selected 2D sliced images at different Z-positions. The noise can be seen in these images both as a bulky background and as filaments, which are relics of the tomography reconstruction algorithm. An efficient way to filter out this background noise before the 3D visualization is needed.


In both cases, the intensity of each data point has a physical meaning. For the Turbine data, the intensity reflects the amplitude of the interaction potential between neutrons and the sample in real space, while for the single crystal diffuse scattering data, the intensity is the 3D spatial Fourier transform of the neutron-sample interaction potential. Strong localization of the measured intensities corresponds to certain physical properties of the objects under investigation, which inspires us to exploit their spatial correlations in our visualization analysis.

2.2 Traditional DVR with TF Manipulation

A close look at the intensity profile (e.g. a histogram of the intensity), together with the intuition gained from the 2D cross-section images (Figs. 1 and 2), reveals that many interesting structures/features are often mingled together and mixed with the vast background noise, so investigating the intensity profile alone is not effective for feature detection and extraction, as illustrated in Fig. 3. Figure 3(a) shows the intensity histogram of the CZO dataset with several local extrema identified, and Fig. 3(b) shows a scatterplot of the 3D still image. For clarity, the visualization space is limited to a cuboid with dimensions of (301, 501, 301) along the H, K and L axes of the reciprocal space. The first extremum, marked "CUSP" in the histogram, reflects the random fluctuation of the background noise in the "empty" reciprocal space where the scattering signal from the sample and instrument is vanishingly small. We therefore set the cutoff intensity at CUSP, which marks and removes 0.5% of the data points as "noise". For the remaining "signal" points, the dynamic range of the intensities spans more than 6 orders of magnitude and the total number of data points remains ~45 million (301 × 501 × 301). To avoid the overlapping problem in a traditional 3D scatterplot, we manipulate the TF by using a non-uniform "alpha" (transparency) when plotting; otherwise, the scatterplot simply appears as a solid colored block. On the other hand, to visualize both the weak diffuse features and the strong Bragg peaks, which sit at opposite ends of the intensity profile, we divide the alpha range at the value THRESHOLD (chosen to be 50 × TOP after many trials) into two segments: the first covers the weak signals in the intensity range [CUSP, THRESHOLD], with alpha changing linearly from 0 to 1; the second keeps a constant alpha value (= 1) for all points with intensities above THRESHOLD. In practice, choosing a proper value for THRESHOLD is a matter of trial and error and requires a good understanding of the dataset. Even though some structured 3D diffuse scattering features show up in Fig. 3(b), the patterns are overall vague and cloudy with noise, making it very difficult to characterize the morphological features.
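A minimal NumPy sketch of the segmented transfer function described above follows; the CUSP and THRESHOLD values are taken from the histogram as in the text, so the numbers passed in here are placeholders.

```python
import numpy as np

def segmented_alpha(intensity, cusp, threshold):
    """Map intensity to opacity: alpha = 0 below the noise cusp, a linear ramp
    from the cusp up to the threshold, and fully opaque (alpha = 1) above it."""
    return np.clip((intensity - cusp) / (threshold - cusp), 0.0, 1.0)
```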

2.3 DBSCAN-Assisted DVR

In our new approach, before employing traditional DVR for 3D visualization, we reduce the data via the unsupervised clustering algorithm DBSCAN to remove noise points by default and to facilitate the feature extraction and object-boundary detection.


Fig. 3. (a) Intensity histogram of the CZO dataset in a cuboid with dimensions (301, 501, 301) along the (H, K, L) axes. (b) 3D still image of the CZO dataset obtained solely by TF segmentation. Data are shown on the linear scale of intensities with noise removed below "CUSP" and a segmented TF divided at THRESHOLD = 50 × TOP.


The general application of DBSCAN takes two parameters in the clustering procedure: (1) ε, the maximum distance between two points for them to reside in the same neighborhood; and (2) minPts, the minimum number of points required to form a dense region. The distance between two points is usually defined in the Euclidean metric. In the neutron datasets discussed here, the coordinates of the data points are taken as the uniformly distributed voxel indices, which scale linearly with the physical positions of the data points in either the reciprocal space for the CZO dataset or the real space for the Turbine. The calculation of the parameter ε is then greatly simplified. For a Cartesian coordinate system, if we consider only the smallest ε-neighborhood of an arbitrary point, we can set ε in [1, √2] so that only the six first nearest neighbors are included; to expand the ε-neighborhood, we can choose ε in [√2, √3] to include the twelve second nearest neighbors, and so on. In the following visualization analysis, we keep ε fixed at 1.7 so that only the 18 nearest neighbors (first and second) enter the calculation of the density for local clustering. The most critical adaptation when applying DBSCAN to our neutron datasets is to use the intensity as a measure of weight in calculating the second DBSCAN parameter, minPts. As mentioned above, the intensity of neutron scattering data is physically meaningful, yet the traditional DVR algorithm does not take it into account when designing the TF. The diffuse scattering features shown in both the 2D (Fig. 1) and 3D images (Fig. 3(b)) are as much spatially correlated as they are photometrically. To utilize both kinds of information, i.e. the intensity and the spatial location, we have the algorithm compute minPts with varying weights, so that each data point contributes to the weighted-minPts in proportion to its intensity. With this change, the DBSCAN algorithm becomes very effective in de-noising and feature/boundary detection for neutron scattering data. For example, for the CZO dataset, with a proper weighted-minPts value, one can detect both the Bragg peaks (sharp spots with a few high-intensity points) and the diffuse scattering features (broad features with many low-intensity points) and label them as different clusters, provided they are sufficiently separated spatially, as shown in the next section. Among the many controls one could apply in the clustering process to tailor DBSCAN to a particular need, we use its native ability to distinguish the "core" points of a cluster from its boundary points [6]. This feature plays the critical role of intelligently reducing the size of the dataset, making it practical to interactively manipulate the 3D object created from the Turbine dataset.
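A minimal scikit-learn sketch of the weighted clustering described above is given below. The arrays `points` (voxel indices) and `intensity` are assumed names; scikit-learn's `sample_weight` argument plays the role of the intensity weighting in the minPts calculation.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# points: (N, 3) voxel indices; intensity: (N,) per-voxel weights.
db = DBSCAN(eps=1.7, min_samples=70)               # 18-neighbor shell, weighted minPts
labels = db.fit_predict(points, sample_weight=intensity)

noise_mask = labels == -1                          # DBSCAN labels noise as -1
core_mask = np.zeros_like(labels, dtype=bool)
core_mask[db.core_sample_indices_] = True          # core points, useful for data reduction
```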

3 Application on Neutron Data

Without loss of generality, we use 3D scatterplots for visualization after the DBSCAN clustering analysis. These scatterplots use a simple TF which maps the relative intensities of points in a cluster to a continuous alpha range in [0, 1], in sharp contrast to the painstaking TF manipulation process demonstrated in Sect. 2.2.


Fig. 4. (a) A 3D still image of the CZO dataset with all clusters identified by DBSCAN using ε = 1.7 and weighted-minPts = 70. (b, c) 3D scatterplots of clusters grouped per symmetry – (b) the most prominent two clusters; and (c) the next group of eight prominent clusters. Data are shown on the logarithmic scale of relative intensities.

3.1 Feature Extraction in Single Crystal Diffuse Scattering

Figure 4 shows the results of the DBSCAN clustering and the final visualization of the CZO dataset. Specifically, Fig. 4(a) shows a 3D still image which includes all clusters identified by the DBSCAN algorithm using ε = 1.7 and weighted-minPts = 70. With this set of parameters, DBSCAN identifies 668 clusters in total, which cover 6.5% of the total ~45 million data points. In other words, 93.5% of the data points are identified as noise and will not be used for visualization. In comparison to the traditional DVR result shown in Fig. 3(b), DBSCAN has removed the cloudy background so efficiently that the spatially isolated features stand out, which is of tremendous help in visualizing the morphological structures of the diffuse scattering patterns in the 3D reciprocal space.

Fig. 5. Top panels – scatterplots of one of the most prominent clusters identified in Fig. 4(b) from different spatial perspectives; bottom panels – iso-surfaces of the same object revealing finer internal structures.


More importantly, DBSCAN provides an easy way to extract distinct 3D diffuse scattering features from the volumetric data and makes it possible to examine each individual feature independently and in detail. For example, Fig. 4(b) and (c) show the first prominent group of two clusters and the second prominent group of eight clusters, respectively. Detailed close-ups can easily be achieved by simply selecting the desired clusters for visualization; such an example is given in Fig. 5. The top two panels present scatterplots of the most prominent two clusters identified in Fig. 4(b) from different spatial perspectives. The bottom two panels show the same object plotted in its iso-surface format, which reveals finer internal structures of the cluster.

Fig. 6. Intensity histogram of the Turbine dataset. Negative intensity data points are not shown in the histogram.

3.2 Interactive Visualization of Neutron Tomography

Besides usual tasks like denoising, the neutron tomography dataset brings another challenge with its gigantic size to the visualization analysis. In our case, the raw Turbine dataset contains 8 billion data points in total and its intensity histogram is shown in Fig. 6. Similar to the CZO dataset, the data points with low intensities (

letter height maximum then
9: if height < letter height minimum then
10: if height × aspect ratio − width > height × 0.25 then
11: Compute center of each contour using parameters obtained in step 8.
12: end if
13: end if
14: end if
15: Remove the objects that do not satisfy the above conditions.
16: num letters[i] ← count of objects after removing unwanted objects
17: end for
18: end for
19: count ← number of images having maximum number of characters and also save their i value.
20: if count == 1 then
21: Best binarized image is i
22: else if count > 1 then
23: Calculate foreground pixel to background pixel ratio, r(i)
24: Best binarized image is i, where i gives the image having maximum foreground to background pixel ratio.
25: end if
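A minimal OpenCV sketch of the character-counting heuristic behind the binarization-selection steps above is given below (assuming OpenCV 4; the height and shape limits are illustrative placeholders, not the paper's tuned values).

import cv2

def count_letter_like_objects(binary_img, h_min=15, h_max=80):
    # Count connected contours whose bounding boxes are plausibly characters.
    contours, _ = cv2.findContours(binary_img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    count = 0
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if h_min <= h <= h_max and w < h:   # simple size/shape gate standing in for steps 8-14
            count += 1
    return count

# The best binarization is the candidate producing the most character-like objects;
# ties could then be broken by the foreground-to-background pixel ratio (steps 20-25).
# best = max(candidate_binarizations, key=count_letter_like_objects)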

3.3 License Number Segmentation

After binarization we must isolate the characters from the rest of the plate. We use horizontal histogram projection, which sums the pixels along the horizontal direction; the summation of all the pixels along each row gives the horizontal projection values. We use vertical projection to extract characters from the license plate candidate; the vertical projection gives the coordinates of the characters. The horizontal projection profile allows us to determine the row numbers of the license plate characters. Vertical projection helps to extract characters independently of position and rotation [4]. To achieve fast computation and to reduce memory requirements during the horizontal and vertical projection procedure, the image is converted to a more compact version called the skeleton. The skeleton of an image preserves the structure of the object but removes all redundant pixels, as shown in Fig. 3. Figure 6 shows the license number that is extracted from the plate, Fig. 4 displays the extracted state information from the license plate, and Fig. 5 displays the extracted license number from the license plate using horizontal projection.
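A possible numpy sketch of this projection-profile segmentation follows (the row threshold is illustrative; binary_img is assumed to hold 0/1 pixel values).

import numpy as np

def projection_segments(binary_img):
    # Horizontal projection: sum along each row to find the band of rows holding characters.
    h_proj = binary_img.sum(axis=1)
    rows = np.flatnonzero(h_proj > 0.5 * h_proj.max())
    r_start, r_end = rows.min(), rows.max()

    # Vertical projection inside that band: contiguous non-empty column runs are characters.
    v_proj = binary_img[r_start:r_end + 1].sum(axis=0)
    segments, in_char = [], False
    for i, occupied in enumerate(v_proj > 0):
        if occupied and not in_char:
            start, in_char = i, True
        elif not occupied and in_char:
            segments.append((start, i))
            in_char = False
    if in_char:
        segments.append((start, len(v_proj)))
    return (r_start, r_end), segments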


Fig. 2. Plots 1–5 display the binarized images for various techniques. In this case, edge-based thresholding is the best binarization technique.


Fig. 3. Skeletonized result of the license number portion

Fig. 4. Extracted state information from the license plate which shows ‘Texas’

License plate characters are located between rows Rstart and Rend whose pixel values are selected from the binary image. State information will exist outside this range. The characters are in binary format and the state information is in gray scale format. In order to extract characters, the license plate region obtained from the above operation is input to a recognizer: the Convolutional Neural Network.

Fig. 5. Extracted license number from the license plate

Fig. 6. Segmented characters from license number portion

4 Convolutional Neural Networks

The Convolutional Neural Network (CNN), first proposed by LeCun [5], is a special type of multilayer perceptron trained in supervised mode using backpropagation. It is one of the most successful learning machines and has achieved state-of-the-art results in a diverse range of computer vision tasks [5–9]. Convolutional neural networks represent a radical departure from traditional feature extraction methods such as SIFT [10], HOG [11] and SURF [12] used as input to a classifier. Our CNNs are implemented using Caffe, a deep learning framework developed by the Berkeley Vision and Learning Center (BVLC) [13]. Table 1 shows the architecture of the convolutional neural network that we used.
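For orientation, a compact PyTorch sketch of a character classifier in the spirit of Table 1 is shown below (32 × 32 grayscale input, 34 output classes). The layer sizes are simplified placeholders; this is not the exact Caffe network used in the paper.

import torch
import torch.nn as nn

class CharCNN(nn.Module):
    # Simplified stand-in for the Table 1 architecture: conv/ReLU/pool stages
    # followed by a 584-unit fully connected layer and a 34-way output.
    def __init__(self, num_classes=34):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 60, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(60, 120, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(120, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(384 * 8 * 8, 584), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(584, num_classes),   # softmax is applied by the training loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = CharCNN()(torch.zeros(1, 1, 32, 32))   # sanity check: output shape (1, 34)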

5 Results and Discussion

The complex nature of license plate character recognition requires assessing performance on two interdependent metrics: character segmentation performance and character recognition.

Table 1. Architectural layout of CNN for license number recognition using binary images

Layer                                      Input size   Filter size   Number of filters   Pad     Stride
Conv1                                      32 × 32      3 × 5         60                  1 × 1   1
Conv2 + ReLU2                              32 × 30      5 × 7         120                 2 × 3   1
Pool2 + Norm2                              32 × 30      8 × 8         120                 -       1
Conv3 + ReLU3                              32 × 30      5 × 1         384                 2 × 0   1
Pool3 + Norm3                              32 × 30      2 × 2         384                 -       2
Fully connected (ip1) + ReLU4 + Dropout4   32 × 30      -             584                 -       -
Fully connected (ip2)                      584          -             34                  -       -
Softmax layer                              34           -             34                  -       -

5.1 Character Segmentation

To evaluate the effectiveness of our segmentation methods, we randomly chose a sample of 394 plates to evaluate manually by comparing the license number and the extracted information. The database of images also includes images intentionally degraded in contrast. Of the 394 images considered, 372 are segmented satisfactorily, which gives an accuracy of 94.4% for license number segmentation. Further inspection shows that some of the failed images were partially segmented. Successfully segmented images can include unwanted objects such as state symbols, as shown in Fig. 7. The state symbol is included due to the vertical projection technique discussed in Sect. 3.3.


Fig. 7. Extracted license number from the given input license plate. An unwanted symbol representing the state of Texas appears due to vertical projection.

5.2 Classification Problems

Characters with similar geometric properties can cause misclassification, and license plate data can have different representations of the same character, as shown in Fig. 9. Illustrations of this are shown in Table 3, where the first column shows the character class and the second column shows the class estimated by the neural network.

Table 2. Classification rate of each character and number using CNN trained with binary images

Class              Number classified correctly   Number misclassified   Accuracy (%)
0                  20                            0                      100
1                  19                            1                      95
2                  19                            1                      95
3                  18                            2                      90
4                  9                             11                     45
5                  19                            1                      95
6                  20                            0                      100
7                  4                             16                     20
8                  20                            0                      100
9                  20                            0                      100
A                  20                            0                      100
B                  15                            5                      75
C                  18                            2                      90
D                  9                             11                     45
E                  19                            1                      95
F                  20                            0                      100
G                  18                            2                      90
H                  20                            0                      100
I                  20                            0                      100
J                  19                            1                      95
K                  4                             16                     20
L                  20                            0                      100
M                  16                            4                      80
N                  20                            0                      100
P                  19                            1                      95
R                  20                            0                      100
S                  18                            2                      90
T                  20                            0                      100
U                  20                            0                      100
V                  20                            0                      100
W                  20                            0                      100
X                  20                            0                      100
Y                  20                            0                      100
Z                  1                             19                     5
Overall accuracy   584                           96                     85.88

Figure 8 displays different rotations of the template '5'. When convolution is applied to all of these images with the same kernel, the results have similar properties, as the convolution operation is invariant to rotation [14]. Hence the main cause of misclassification is similar geometrical properties: 'W' looks similar to 'H', and 'D' is similar to 'O' or '0'. The other problems include multiple formats of representation, such as for 'I' and '4', as displayed in Fig. 9. Some badly or partially segmented characters are also included.

Table 3. Misclassified characters for binary image

Desired class   Estimated class
I               1
D               0
J               L
M               W or H
Z               2
K               X
7               T


Fig. 8. Example of learned invariance

Table 2 shows the results of the system. Overall the accuracy is 85.88% with most of the individual classes performing at 100%.

Fig. 9. Different representations of the same character. In these plates, the number 4 has different styles

6 Conclusion

We have developed a license plate character recognition pipeline using multiple binarization techniques and convolutional neural networks to achieve an 85.88% recognition rate. For this research we generated our own dataset, and we believe that more license plate data could improve overall recognition accuracy. More data could also allow for deeper convolutional neural networks to improve accuracy further. Finally, with much more training data the network's accuracy could be improved and the classes 'Q', 'O', and '0' could be made distinct. Overall, we have demonstrated that, with preprocessing, deep learning can be an effective platform for an Automatic License Plate Recognition system.

References

1. Simard, P.Y., Steinkraus, D., Platt, J.C.: Best practices for convolutional neural networks applied to visual document analysis. In: ICDAR, vol. 3, pp. 958–962 (2003)
2. Weickert, J.: A review of nonlinear diffusion filtering. In: Scale-Space Theory in Computer Vision, pp. 1–28. Springer (1997)
3. Sauvola, J., Pietikäinen, M.: Adaptive document image binarization. Pattern Recogn. 33(2), 225–236 (2000)


4. Du, S., Ibrahim, M., Shehata, M., Badawy, W.: Automatic license plate recognition (ALPR): a state-of-the-art review. IEEE Trans. Circ. Syst. Video Technol. 23(2), 311–325 (2013)
5. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
6. Duffner, S.: Face image analysis with convolutional neural networks (2008)
7. Zhao, Z., Yang, S., Ma, X.: Chinese license plate recognition using a convolutional neural network. In: Pacific-Asia Workshop on Computational Intelligence and Industrial Application, PACIIA 2008, vol. 1, pp. 27–30. IEEE (2008)
8. Chen, Y.-N., Han, C.-C., Wang, C.-T., Jeng, B.-S., Fan, K.-C.: The application of a convolution neural network on face and license plate detection. In: 18th International Conference on Pattern Recognition, ICPR 2006, vol. 3, pp. 552–555. IEEE (2006)
9. Han, C.-C., Hsieh, C.-T., Chen, Y.-N., Ho, G.-F., Fan, K.-C., Tsai, C.-L.: License plate detection and recognition using a dual-camera module in a large space. In: 2007 41st Annual IEEE International Carnahan Conference on Security Technology, pp. 307–312. IEEE (2007)
10. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
11. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, vol. 1, pp. 886–893. IEEE (2005)
12. Bay, H., Ess, A., Tuytelaars, T., Gool, L.V.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110(3), 346–359 (2008). http://www.sciencedirect.com/science/article/pii/S1077314207001555
13. Jia, Y., et al.: Caffe: convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
14. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016). http://www.deeplearningbook.org

3D-Holograms in Real Time for Representing Virtual Scenarios

Jesús Jaime Moreno Escobar, Oswaldo Morales Matamoros, Ricardo Tejeida Padilla, and Juan Pablo Francisco Posadas Durán

ESIME-Zacatenco, Instituto Politécnico Nacional, Mexico, Mexico
[email protected]

Abstract. The present work describes a methodology for the capture and representation of three-dimensional holograms in high definition. The proposal makes use of computational tools such as Computer Vision and Artificial Intelligence, and the methodology is divided into five phases or steps. The first step presents the problems of the different techniques of 3D visualization, such as the use of 3D devices, and how this project solves them. Phase two explains the background of holography and holograms, and how devices, methods and techniques based on stereoscopic vision have been created that allow the visualization of 3D objects. The next phase gives the theoretical basis of the elements used in the construction of this project; we also explain how the human eye obtains natural images, color spaces, and the principles used to extract a chromatically pure color (ChromaKey green) as a scene background. The mathematical methodology is proposed in the fourth phase, where all the techniques used for the construction and design of a capture module, which performs the coding of the final hologram, are considered. In addition, the considerations for the construction of a module for the representation of the hologram and its visualization in a holographic pyramid are exposed. Finally, the main experiments are presented and explained: tests with the holographic pyramid at two different angles of inclination, an illumination test of the capture module, the calibration of the turntable, and the process of coding and holographic representation.

Keywords: Hologram · Computer Vision · Real Time Systems · ChromaKey

1 Introduction

Currently there are techniques to create three-dimensional images or give the illusion of depth in an image. Stereoscopy consists of collecting three-dimensional visual information of an image: the eyes (right and left), due to their separation, obtain two images with differences between them, and the brain processes the differences of both images, giving a sense of the depth of the objects. The problem with this method is that a true three-dimensional appreciation is not achieved; rather, the depth sensation of an image is obtained, thus producing a three-dimensional illusion. The proposed project creates three-dimensional holographic images giving a volume to specific objects, thus facilitating their study and manipulation, like a hologram. This gives an appreciation of three-dimensional depth that can be observed from different angles without losing the form. These are created from two-dimensional images giving the sensation of depth through flat images reflected on the faces of a holographic pyramid [1]. The hologram generation system consists of three phases: capture, coding and representation. In the capture phase, videos of the object are obtained with a chromatically green background, which allows its representation in three dimensions in the holographic pyramid. In the coding phase, a series of images or frames is obtained from the videos captured in the first phase, with the green Chroma background of the object replaced by an absolute black background. The last phase allows a hologram to be observed in the center of a holographic pyramid, due to the refraction and reflection of the light on its faces. The capture module obtains, from its four sensors, images of four profiles of the object (front, back, left profile and right profile). Thus, four sensors are used, which are activated separately since the PC used only allows one sensor to be used at a time. This system creates holograms at a 2:1 scale of the original object; that is, the hologram is reduced to half its resolution with respect to the original. The holographic pyramid shows in its center, by the refraction of light on its faces, the hologram of an object at half its real size, without loss of its characteristics or colors.

2 Related Work

Stereoscopy is a technique to obtain images that generate the sensation of three dimensions. The word stereo comes from the Greek meaning relative to space. In 1838, stereoscopy was officially defined by Sir Charles Wheatstone through his explanation of binocular vision, and he was the first to devise an apparatus for providing relief or three-dimensional vision, the stereoscope. The stereoscope allowed the vision of two images, each corresponding to the 65 mm disparity of the eyes. In 1849, Sir David Brewster created and built a binocular camera that took two images synchronously, which allowed stereoscopic portraits. The stereoscopic technique evolved in the second half of the century and was adapted to improvements in procedures, from the stereo daguerreotype to the vérascope of Richard. It had great acceptance at the beginning of the 20th century, and many photographers from the 19th and early 20th centuries made stereoscopic shots [3]. In the year 1600, Giovanni Battista Della Porta presented the technique of stereoscopic drawing, which consisted in drawing two images of an object seen with a slight horizontal shift to give the sensation of depth. This perception provokes in the observer a sensation of immersion in the scene in front of him.

3D visualization in its beginnings used special lenses called anaglyphs, because it was the only way to generate a sense of depth in the viewer's brain. With technological advances, new artifacts were created for 3D visualization, such as:

– Anaglyph Lenses: Their design is based on the phenomenon of binocular vision. Louis Ducos du Hauron created a procedure called anaglyphe, which consists of the joint reproduction of two images, with the corresponding disparity distance for each eye, each image complemented with a chromatically opposed color.
– Display by Polarization: This method uses polarized light to separate the left and right images that correspond to each eye. The polarization system does not alter colors, but there is some loss of luminosity. This technique is used in 3D cinema projection.
– Display by Active or Alternating Stereo Sequence: This system presents the left and right images sequentially and alternately, synchronized with lenses built with liquid crystal shutters, called LCS (Liquid Crystal Shutter glasses) or LCD (Liquid Crystal Display glasses), so that each eye sees only its corresponding image.
– Head Mounted Display (HMD): This device is a stereoscopic helmet designed with two optical systems with integrated screens for each eye, so that the image is injected by the device. Its main use so far has been for Virtual Reality and video games, at a high cost, and it is only used experimentally.
– AutoStereo Monitors: Autostereoscopic monitors produce 3D images without the need for special lenses. All of them use variants of the lenticular system, that is, micro lenses placed in parallel and vertically on the monitor screen. These lenses generate some deviation between two or more images, usually from 2 to 8, with which the sensation of three dimensions is generated.

There are four techniques to create the depth sensation, which have evolved but whose basis is mainly stereoscopy:

1. Pulfrich Effect: This technique consists in the perception of a stereoscopic effect from an image in horizontal movement on a plane, with a dark filter in front of one of the eyes. Due to the lower luminosity perceived through the filter, the image reaches the brain with a delay of hundredths of a second.
2. Chromadepth: This system, developed by ChromaTek Inc., is based on the deviation of colors produced by the light spectrum. In a prism, the light deviates depending on its wavelength, that is, more deviation in the red channel and less in the blue channel. The depth information is color coded. There are special lenses to see this type of image, designed with a pair of transparent crystals with microprisms, Fig. 1.


Fig. 1. Image with chroma Depth effect.

3. Photo sculpture: The photo sculpture was designed by the sculptor François Willème in 1860. It consisted of simultaneously obtaining 24 photographs of the desired model with machines placed around it. The 24 clichés, projected and reproduced with a pantograph, allowed 24 aspects of the character to be modeled, which were then reconstructed into a single sculpture.
4. Reliefography: Reliefography allows images of the two profiles and of the front of a model to be obtained with a sensor that moves on a rail to decompose the movement. To these images a lenticular network, constructed from a transparent plastic surface with tiny striae, is applied, through which the viewer selects an infinity of different and successive angles from which to observe.

3 Theoretical Foundations of a Holographic System

3.1 Structure of the Human Eye

The human eye is a complex structure, sensitive and able to detect with great precision color, shape and intensity of light reflected in different objects.

Fig. 2. Anatomy of the eyeball.

Figure 2 shows the eyeball, which is composed of three layers and three chambers:


– Layers: sclerocornea, uvea and retina.
– Chambers: anterior, posterior and vitreous.

The layers of the human eye will now be described, since they are the areas where light energy is transformed into the electrical impulses that travel to the visual cortex. First, the sclerocornea is the outer layer, composed of the sclera and the cornea. The sclera is the fibrous part that forms the white of the eye and serves to protect it, while the cornea is the transparent part of the outer layer, known as the optical window of the eye [4]. The uvea, the middle layer, is composed of four parts:

1. The iris is the colored structure that lies below the cornea and whose central hole constitutes the pupil. The function of the iris is to regulate the amount of light that penetrates the inside of the eye, and it varies its size depending on the intensity of light.
2. The choroid is located in the back of the eye and functions both as nourishment and as a pigmentary screen, so that the light strikes where it should and does not spread elsewhere.
3. The ciliary body is in the middle zone and is formed by the ciliary processes and the ciliary muscle, in charge of varying the curvature of the lens to be able to focus at different distances.
4. The crystalline lens is the lens of the eye, which has the shape of a biconvex lens and is capable of varying its curvature and diopter power by the action of the ciliary muscles.

Finally, the retina is the sensitive area of the visual apparatus; it is where the images that the eye detects are formed, Fig. 3. Its anterior part is blind, and its sensitivity increases moving away from that area; its point of maximum sensitivity is a small pit called the fovea, the part with the highest concentration of the cells responsible for the sensitivity of the retina, which are the cones and rods. The rods are sensitive only to the intensity of the light; therefore, they operate well with attenuated light and do not produce a sharp or color image. The cones work with the brightness of the light and generate a well-defined, sharp image, with a decomposition of the light into three color frequencies [5,6]. The brain, and more specifically the visual cortex, is responsible for coding the information received from the entire visual system; this process of image interpretation is composed of five stages:

1. A beam of light is reflected by the object to be observed and this reflected light travels to the human eye.
2. The iris and the pupil serve as regulators of the light entering the eye: if the amount of light that enters is large, the pupil reduces its size, and if the amount of light is smaller, the pupil increases its size to capture as much light as possible.


Fig. 3. Rods and cones.

3. The image that the eye captures is projected onto the retina, which acts as a screen; here the image is inverted.
4. The retina is formed by cones and rods; these receive the information of the observed image and transform it into electrical pulses that are sent to the brain through the optic nerves.
5. The electrical impulses reach the thalamus and pass from the thalamus to the visual cortex of the brain, which composes the final image as it is observed.

The formation of the image in the retina is not a static or simple process. A human eye focusing at infinity is at rest, apart from the possible contraction of the iris to regulate the amount of light, while the dynamic part of the optical system, the crystalline lens, is at rest; this shows that the eye does not strive to see from afar [8,9]. Making an analogy between the human eye and a camera (digital sensor): an eye focused at infinity, since its optical system does not change, sees nearby objects at its sides as blurry, just as when a digital sensor is focused at one distance and we capture something closer, the image is likewise blurred; what varies is the thickness of the lens. As shown in Fig. 4, when focusing at a close distance the ciliary muscles come into action and cause an increase in the thickness of the crystalline lens, increasing its power accordingly; this is possible because it is a biconvex lens, which allows the correct focus.

Fig. 4. Analogy eye human-photographic camera (digital sensor).

The human eye is compared to a digital camera or sensor to easily understand the interpretation of images.


Lens. The cornea is like the lens of the camera; together with the crystalline lens behind the iris, they are the elements of focus.

Shutter. The iris and pupil act as the shutter of the camera. When the iris closes, it covers most of the lens to control the amount of light that enters the eye, and it can work well in a wide range of visual conditions, from the darkest to the brightest.

Sensor. The retina is the sensory layer at the back of the eye, and acts as the image sensor of a digital camera. It has numerous photo-receptor cells that convert light beams into electrical impulses and send them through the optic nerve to the brain, where the image is finally perceived.

3.2 3D Perception Considerations

For the correct creation of a three-dimensional experience in a scene, there are four neurophysiological conditions that the brain commonly uses to achieve 3D. These conditions allow differences in the characteristics of the objects to be perceived depending on the distance at which they are located:

1. Parallax: This is the angular deviation of the apparent position of an object, depending on the point of view. Measuring depth in this way relies on the fact that the view is different from different positions. As shown in Fig. 5(a), the position of the object observed at O varies with the position of the point of view, at A or at B, when projecting O against a distant background. From A, the observed object appears to be on the right side of the far star, while from B it is seen on its left side. The angle AOB is the parallax angle, that is, the angle that spans segment AB from O.
2. Binocular disparity: Binocular disparity means that each eye captures a slightly different view of the same object, due to the separation between them; this is because each eye has a different angle of vision (Fig. 5(b)), and therefore the fields of vision overlap significantly.
3. Depth disparity: This is a neurophysiological process that adjusts the shape of the crystalline lens to focus the image of the observed object on the retina (Fig. 5(c)). The adjustment of the lens to focus the eye on an object indicates the distance at which it is placed; that is, if an object is clearly observed while the eyes are relaxed, it is known that the object is far away, whereas if, to see the object, it is necessary to tighten the ciliary muscles to focus on it, then the object is at a near point.
4. Convergence: This is the superposition of the views obtained by each eye when focusing on an object at a certain distance. The angle of an object between the visual axes and the fovea decreases as the distance of the object increases. The visual axis does not exactly coincide with the optical axis of the crystalline lens, since there is a difference between them of only 1.5° to 5°.

Fig. 5. 3D perception considerations: (a) parallax; (b) binocular disparity; (c) depth disparity.

3.3 Artificial Stereoscopic Vision

The aim of stereoscopic vision is to obtain information about a specific 3D space through the use of stereoscopic images; that is, to obtain characteristics of a person's environment that would normally be received through their eyes. In addition, with this information obtained from a conventional type of sensor, the depth of the object with respect to the two observation sensors is calculated. Stereoscopic vision consists of obtaining two images of the same scene from two different points and establishing the correspondence between points of the two images that correspond to the same point of the captured scene; in this way, by means of triangulation, the distance from this point to the cameras can be found. There are two basic types of matching algorithms:

1. Feature-based: these consist in the extraction of edges in the image, on which the matching is carried out.
2. Area-based: these carry out the correlation of the gray levels in windows of the different images, considering that in the neighborhoods of corresponding points the intensity patterns should be similar.

3.4 Snell's Law

Snell’s law is a relation that is used to calculate the angle of refraction of light when crossing a surface between two means of propagation, that is, how light

292

J. J. M. Escobar et al.

behaves when it passes from one medium to another, varying its speed since different Means offer different resistance to the displacement of light, which produces the refraction phenomenon, Fig. 6 and Eq. 1. In this way, suppose two different media, characterized by a refractive index (n1 and n2 ). This index is dependent on the medium, that is, characteristic for each medium. When a ray of light meets a medium different from the medium through which it is spreading, 2 things happen. If it encounters a medium through which it can pass, part of the beam is deflected (refraction) and part of the rebound (reflection). n1 sin (α) = n2 sin (β)

(1)

Snell’s law indicates that the product of the refractive index by the sine of the angle of incidence is constant for any ray of light incident on the separating surface of two media. This law was formulated to explain refraction phenomena of light for all types of waves, crossing a separation surface between two media in which the velocity of propagation of the wave varies. The propagation of the light depends on the medium that it crosses, in this propagation there are two important phenomena for Snell’s law that are: – The refraction of light: A wave that impinges on the separation surface between two media is partially reflected, that is, new waves are generated that move away from the surface. The incident ray and the normal to the surface determine the plane of incidence, these lines form the angle of incidence. – The reflection of light: When a wave hits the separation surface between two media, part of the energy is reflected and part enters the second medium. The transmitted ray is contained in the plane of incidence, but changes direction (refracted ray) forming an angle with the normal to the surface.

Fig. 6. Snell law.
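As a small worked example of Eq. 1 (the values are chosen for illustration only; acrylic with n ≈ 1.49 is a typical material for such pyramids, not a figure taken from this paper):

import math

def refraction_angle(n1, n2, alpha_deg):
    # Solve n1*sin(alpha) = n2*sin(beta) for beta; return None on total internal reflection.
    s = n1 * math.sin(math.radians(alpha_deg)) / n2
    if abs(s) > 1.0:
        return None
    return math.degrees(math.asin(s))

print(refraction_angle(1.0, 1.49, 45.0))   # light entering acrylic from air at 45 degrees: about 28.3 degrees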

3.5 Color Spaces

The human eye recognizes thousands of colors in an image, all of them combinations of the primary colors red, green and blue, known as RGB.


A color model is the relationship of a three-dimensional coordinate system and a subsystem in which each color is represented by a point [11]. The color spaces most used for image processing are the following. The brain, in its processing, divides color spaces into two levels:

1. Receptional color spaces at the retinal level, such as:
(a) RGB: This is composed of the basic additive colors, that is, Red (R), Green (G) and Blue (B); it is based on the Cartesian coordinate system, and adds insensitivity to the absence of light.
(b) CMYK: This color space is composed of Cyan (C), Magenta (M), Yellow (Y) and Black (K, Key), as shown in Fig. 7. They are secondary colors of light, called subtractive basic colors, since they are used as filters to subtract colors from white light when CMYK is compared against RGB.

Fig. 7. CMYK vs RGB.

2. Post-receptional color spaces at the visual cortex level, such as:
(a) YCbCr: A color space used in digital photography and video systems, which defines color with three components, one of luminance and two of chrominance. It is composed of Y, which is the luminance, and Cb and Cr, which indicate the color tone, where Cb spans colors between blue and yellow and Cr spans colors between red and green; these components are represented on 3 axes.
(b) YIQ: This color space is composed of the luminance (Y) and the color information (I and Q); it is based on linear transformations of images with RGB data and is mainly used in color coding for television transmission.
(c) HSI: A tool used in image processing based on some of the color perception properties of the human visual system. Its components are Hue (H), Saturation (S) and Intensity (I).

3.6 Chroma Key

The Chroma Key is a visual technique which extracts a color from the image, either blue or chromatic green, and replaces the area occupied by that color with another image. For this technique to work, the object or person in front of the background must not be the same color as the background. Those colors are used because they are far from skin tones. The Chroma green color is mainly used because the sensors of the cameras are more sensitive to it [2,10].

4 Methodological Proposal of the Holographic System

4.1 General Operation of the System

Figure 8 shows that the holographic capture and projection system consists of three phases in its design: 1. Capture, 2. Processing and 3. Display.

Fig. 8. Representation of the general functioning of a holographic capture and reproduction system.

4.2 Capture Module Design

The capture phase consists of a module (box) with dimensions of 100 cm in length, 100 cm in width and 20 cm in height, four webcams, a rotating base and uniform lighting. This phase is subdivided into five sub-phases:

(a) Dimensions of the module: Figure 9 shows the considerations for the dimensions of the module. The dimensions are determined according to the capture objects and the webcams used. The objects must measure at most 10 cm in height and 7 cm in width so that the frames are correctly taken. The final dimensions of the module are shown in Fig. 10.
(b) Selection and position of the USB sensor: This type of sensor resembles a small digital camera that connects to a computer so that it can capture and transmit images, through the Internet for example. The USB sensors used here do not transmit images through the Internet, but through the USB port of the computer to the MATLAB program by means of a video socket, Fig. 11.
(c) Choice of chromatically green paint: The inside of the capture module is uniformly painted a green or Chroma Key color, Fig. 12. This color key, either green or white, is necessary because it makes it computationally simpler to segment the things that are not this color.


(d) Base turn control: The rotating base is a circular base placed in the center of the capture module; its function is to rotate the desired object at a certain number of revolutions per minute. The revolutions per minute are determined according to the number of frames required for each view of the object, Fig. 13.

Fig. 9. Web camera at the correct distance for an adequate capture.

Fig. 10. Capture module with final dimensions.

Fig. 11. Position of the USB sensor on the walls of the capture module.


Fig. 12. Painting of the capture module without cover.

Algorithm for Holographic Capture

1. Assign the names and numbers of the pins on the Arduino board to the motor pins: pin 12 (α), pin 11 (β), pin 10 (γ) and pin 9 (δ).
2. Set the speed of rotation υ = 2 ms.
3. Define the counter κ of integer type.
4. Declare an integer variable for the number of turns, λ.
5. Assign the digital outputs on the Arduino port: pins α, β, γ and δ.
6. While the motor Δ has not made φ = 4560 steps, it will not complete a full turn of the rotating base; thus, λ = λ + 1.
7. Sequentially enable the digital outputs to activate the coils of motor Δ step by step to turn it.
8. Increase κ, that is, κ = κ + 1.
9. If κ is less than 4, repeat steps 2 through 8; otherwise, terminate the algorithm.

Light Conditioning. The installed lighting was yellow light, since white light reflects the green color onto the objects, Fig. 14. The illumination is uniform, since it is necessary to avoid reflections on the walls and the lenses of the sensors, and excess brightness on the objects, the walls of the module or the rotating base.

Fig. 13. Pulley system, diameters and speeds ratio.


Fig. 14. Focus and the incidence of light on the object.

4.3 Phase of Coding of Images

The image coding phase is a subsystem that complements the general operation of the prototype. The subsystem uses toolboxes in MATLAB to detect the pixels closest to the Chroma Key green and replace them with black pixels. As with any system or subsystem, hologram processing is carried out in three phases, as seen in Fig. 15: 1. Reading, 2. Coding, and 3. Representation.

General Coding Algorithm

(1) Read the image that contains the black background.
(2) Declare a variable of type videoinput. This variable will capture images from the USB sensor (1, 2, 3 or 4) with its characteristics.
(3) Create a video file in avi format and assign it a name.
(4) Initialize the created video, in order to later insert the images; at the same time, execute the one-turn initialization command for the rotary table.

Fig. 15. Cybernetic diagram of first order, that is, input, process, output of the image processing subsystem.


(5) Start a cycle for the acquisition of the images; this cycle is determined by the number of frames to be captured, in this case one thousand frames.
(6) Within the cycle, create a variable that saves the image taken by the USB sensor, I.
(7) Assign the threshold value μ obtained from the experimentation.
(8) Obtain the dimensions of the image and save them in two variables, ColI and RowsI.
(9) Resize the black background image with the dimensions ColI and RowsI, to the size of the captured image I.
(10) Transform the captured image I from the RGB color space to YCbCr.
(11) Eliminate the green background and replace it with the image with a black background, obtaining a new image Ichroma.
(12) Convert the image Ichroma to the RGB color space.
(13) Depending on the number assigned to the USB sensor, rotate the image by a specific angle.
(14) Add the Ichroma image to the video file in the corresponding frame.
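The paper's coding subsystem is implemented in MATLAB; as a non-authoritative illustration of steps (10)-(12), the following OpenCV/numpy sketch replaces ChromaKey-green pixels with black in the YCbCr space. The Cb/Cr bounds are illustrative placeholders, not the experimentally obtained threshold μ.

import cv2
import numpy as np

def remove_green_background(frame_bgr, cb_max=120, cr_max=120):
    # Green pixels have low Cb and low Cr; mask them and paint them absolute black.
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)   # OpenCV channel order is Y, Cr, Cb
    _, cr, cb = cv2.split(ycrcb)
    green_mask = (cb < cb_max) & (cr < cr_max)
    out = frame_bgr.copy()
    out[green_mask] = (0, 0, 0)
    return out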

4.4 Representation Phase

The representation phase consists of two main elements for the correct visualization. The first element is the projection monitor, which projects a video with the four views of the object (front, back, right profile, left profile), with a black background, onto the holographic pyramid. The second element is the holographic pyramid, which allows us to visualize the desired object in three dimensions.

(a) Projection monitor: To know which pyramid to use, the projection of the hologram should be considered, namely how the four videos, each of one view of the object to be modeled, are obtained.
(b) Holographic pyramid: Once the projection mode has been determined, the holographic pyramid to be used is constructed. Since the projection consists of four views of the object, the projection must be on four surfaces; thus, the pyramid must have four faces on which the projection light is refracted at 45 degrees.

Regarding the technical considerations for the construction of the holographic pyramid, on the one hand, the angle at which the viewer observes it is considered first; a straight viewing angle was taken, so as to observe the hologram in a comfortable position and visualize it in the center of the pyramid. On the other hand, the dimensions of the monitor to be used are considered. To calculate the dimensions in the design of the holographic pyramid, the maximum height for a pyramid of 45° is estimated. In such a pyramid, the maximum height is equal to the distance from the tip of the pyramid to the center of its base, Fig. 16.


Fig. 16. Design of the holographic pyramid.

5 Experiments and Results

As proof that the holograms fulfill the function of giving volume to an object in a holographic pyramid, the following experiment is carried out: 50 people are invited to observe the holograms of Kittens (Figs. 17 and 18) and Wall-E [7] (Figs. 19 and 20) and to give them a quality rating. The rating is from 1 to 10, where 1 means it is not considered a hologram, that is, without volume and without quality, and 10 means it is considered a hologram of good quality. In order to better understand the quality of the holograms, we study the data obtained by calculating the mean, mode, variance and standard deviation.

– Statistical mode for Hologram Kittens = 9.09270683
– Average for Hologram Kittens = 9.3
– Variance for Hologram Kittens = 0.29089184
– Standard deviation for Hologram Kittens = 0.544819611
– Statistical mode for Hologram Wall-E = 9.059231373
– Average for Hologram Wall-E = 9

Fig. 17. Scatterplot of the satisfaction level of the Kittens hologram.


Fig. 18. Scatterplot with standard deviation of the data sample for the hologram Kittens.

Fig. 19. Scatterplot of satisfaction level of the Wall-E hologram.

Fig. 20. Scatterplot with standard deviation of the data sample for the hologram Wall-E .


– Variance for Hologram Wall-E = 0.27170484
– Standard deviation for Hologram Wall-E = 0.52654519

Now, the standard deviation is taken as an important measure, given that it determines the average fluctuation or dispersion of the data with respect to its central or mean value. The fluctuations, or data that fall outside this deviation, indicate the deficiencies of the holograms. From the graphs it is concluded that the observer has a good impression and experience of the final holograms, because the fluctuations shown in the graphs (points outside the standard deviation curves) are few and the average rating is greater than 9.
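For reference, these summary statistics can be computed with a few lines of Python; the ratings below are synthetic placeholders, not the study's actual 50 scores.

import numpy as np
from statistics import mode

ratings = np.random.default_rng(1).integers(8, 11, size=50)   # hypothetical 1-10 quality scores

print("mean =", ratings.mean())
print("mode =", mode(ratings.tolist()))
print("variance =", ratings.var())
print("standard deviation =", ratings.std())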

6 Conclusions

It is possible to conclude that this system does not depend on any special 3D projection device to give volume to the high-definition hologram, nor does it depend on a 3D visualization device to observe the hologram. Transparent objects and green, black or very bright colors give unwanted results. With uniform illumination in the capture module, good results are obtained when extracting the green background of the image, obtaining an HD image. A capture module was built capable of receiving specific information about the characteristics of an object through its cameras. In addition, a holographic pyramid was constructed in which 2:1 scale holograms are visualized with respect to the original object. The designed and constructed system facilitates the creation of holograms of almost any object in a short time, unlike other systems, which optimizes the application of the system in different branches.

Acknowledgment. This article is supported by the National Polytechnic Institute (Instituto Politécnico Nacional) of Mexico by means of Project No. 20180514, granted by the Secretariat of Graduate and Research, and by the National Council of Science and Technology of Mexico (CONACyT). The research described in this work was carried out at the Superior School of Mechanical and Electrical Engineering (Escuela Superior de Ingeniería Mecánica y Eléctrica), Campus Zacatenco. It should be noted that the results of this work were obtained by the Bachelor Degree students Leslie Marie Ramírez Álvarez and Luis Omar Hernández Vilchis. Ing. Daniel Hazet Aguilar Sánchez is also thanked for his logistical and technical support.

References

1. Jiao, S., Tsang, P.W.M., Poon, T.C., Liu, J.P., Zou, W., Li, X.: Enhanced autofocusing in optical scanning holography based on hologram decomposition. IEEE Trans. Ind. Inf. PP(99), 1 (2017)
2. Kotgire, P.P., Mori, J.M., Nahar, A.B.: Hardware co-simulation for chroma-keying in real time. In: 2015 International Conference on Computing Communication Control and Automation, pp. 863–867, February 2015


3. Lee, H.M., Ryu, N.H., Kim, E.K.: Depth map based real time 3D virtual image composition. In: 2015 17th International Conference on Advanced Communication Technology (ICACT), pp. 217–220, July 2015
4. Nishitsuji, T., Shimobaba, T., Kakue, T., Ito, T.: Review of fast calculation techniques for computer-generated holograms with the point light source-based model. IEEE Trans. Ind. Inf. PP(99), 1 (2017)
5. Pang, H., Wang, J., Cao, A., Zhang, M., Shi, L., Deng, Q.: Accurate hologram generation using layer-based method and iterative Fourier transform algorithm. IEEE Photonics J. 9(1), 1–8 (2017)
6. Song, X., Zhou, Z., Guo, H., Zhao, X., Zhang, H.: Adaptive retinex algorithm based on genetic algorithm and human visual system. In: 2016 8th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), vol. 01, pp. 183–186, August 2016
7. Stanton, A.: Wall-E: Walt Disney Pictures-Pixar Animation Studios (2008)
8. Su, P., Cao, W., Ma, J., Cheng, B., Liang, X., Cao, L., Jin, G.: Fast computer-generated hologram generation method for three-dimensional point cloud model. J. Disp. Technol. 12(12), 1688–1694 (2016)
9. Wang, J., Zheng, H.D., Yu, Y.J.: Achromatization in optical reconstruction of computer generated color holograms. J. Disp. Technol. 12(4), 390–396 (2016)
10. Yamashita, A., Agata, H., Kaneko, T.: Every color chromakey. In: 2008 19th International Conference on Pattern Recognition, pp. 1–4, December 2008
11. Yang, Y., Yang, M., Huang, S., Que, Y., Ding, M., Sun, J.: Multifocus image fusion based on extreme learning machine and human visual system. IEEE Access 5, 6989–7000 (2017)

A Probabilistic Superpixel-Based Method for Road Crack Network Detection

J. Josiah Steckenrider and Tomonari Furukawa

Virginia Tech, Blacksburg, VA 24060, USA
{jsteck95,tomonari}@vt.edu

Abstract. This paper presents a probabilistic superpixel-based method for detecting road crack networks. The proposed method includes the techniques of skeletonization and end-growing at the superpixel level, which lend to the extraction of slender crack features from road images. Probabilistic crack pixel refinement is implemented, followed by geometry filters and binary crack cleaning operations, with the end goal of presenting cracks in their simplest form for further high-level characterization. The performance study used to characterize this crack detection algorithm was not constrained by crack type, pavement type, or even image resolution. This approach boasts a median pixel-wise distance error rate of less than one pixel, and for a 100-image dataset, the average detected crack length was within 18% of the ground truth crack length.

Keywords: Image processing · Probabilistic methods · Superpixels · Geometry filtering · Crack detection

1 Introduction

1.1 Background

Road condition monitoring is an important aspect of infrastructural maintenance. Not only does road repair present a burden to the public, but damaged pavement can also affect the safety and ride comfort of drivers [1]. Although expensive commercial automated assessment systems have been deployed for interstate highways [2], these existing systems are sometimes unreliable due to road variation, and they have not been validated on local roadways. Furthermore, the majority of local roads are still examined manually by trained workers, which also yields issues in reliability. With the rise of capable computing power and superior sensory systems in recent years, automated road condition monitoring has become a common area of interest in the computer vision and machine learning communities. Because road disrepair can be judged chiefly by the presence and extent of pavement cracks, automated road crack network detection systems are critical to leading-edge infrastructural maintenance. This paper proposes crack network detection techniques which offer the distinct and novel advantage of combining probabilistic region modeling with superpixel image partitioning, making use of geometrical operations to extract dark, slender features.

1.2 Related Work

Computer vision-based crack detection has been studied for over two decades. Past work on the automatic detection of cracks relies primarily on twodimensional (2-D) computer vision-based methods because cracks generally do not disrupt the road plane. Since this is an open-ended problem with an enormous domain of possible approaches, the techniques investigated in each case vary substantially. Nevertheless, some of the most relevant approaches will be summarized here. As early as the 1990’s, Tanaka, et al. [3] successfully developed a morphological approach to crack detection in which a series of dilations, erosions, convolutions, and thresholdings were combined to estimate pixel state (crack vs. non crack). This early work required certain assumptions about crack images that limit its versatility, particularly with respect to larger cracking networks. Gavil´ an, et al. [4] more recently explored a technique in which non-crack features are removed through similar morphological filters and then cracks are detected based on a path-tracing approach. This work included a support vector machine for classifying pavement type and setting detection parameters accordingly. Whereas these results demonstrated reliable performance under a variety of pavement conditions, specialized equipment was required, and once again, complex cracking networks were not explicitly addressed. Another common family of approaches uses frequency-domain transformations to extract crack features. One prime example of a transform technique is given by Chambon, et al. [5]. This group used a continuous wavelet transform (CWT) and Markov networks to extract thin crack features from a road image. Although this transform method shows promise, the resulting crack estimates tend to be quite fragmented. Very recently, a simple and effective approach using selective pre-processing and geometry filters was shown to give superior results for both single-crack and multi-crack road images [6,7]. However, this technique requires the manual fine tuning of a few parameters which is a detriment to the ultimate goal of automation. The final philosophy towards crack detection summarized here most closely aligns with the improved methods presented in this paper. Oliveira, et al. [8] coarsely divide a road image into a grid, then cast the grid cells into a 2-D feature space where the feature vector is given by the mean and standard deviation of the pixels in each cell. Feature-space classification is used to designate rough crack regions for further thresholding operations. The results obtained by this group are promising, though not without limitations. First, complex crack networks are not addressed. Second, the thinnest cracks in an image are not easily differentiable from background pavement given the coarseness of the grid cells. Finally, masking the raw image based on the coarse pre-processing step excludes any regions that were not adequately classified, and refinement by thresholding tends to fragment crack images.

1.3 Objectives and Proposed Approach

The objectives of this paper are as follows: – To establish effective means of subdividing road images for region segmentation and classification. – To introduce novel geometrical operations at the superpixel level for coarse crack cleaning. – To present a probabilistically robust method for crack pixel refinement. The approach proposed in this paper for road crack network detection newly employs superpixels, which were originally developed by Ren and Malik in 2003 [9]. Superpixelization segregates an image into clusters of pixels (superpixels) based on natural image contours. The proposed superpixel-based method uses the Simple Linear Iterative Clustering (SLIC) technique [15], which is favorable in tunability, versatility, and computational complexity, to create a rough crack mask. The superpixels of this mask are then manipulated by skeletonization and end-growing operations which, to the extent of the knowledge of the authors, have not been shown in the literature thus far and are therefore a novel contribution of this work. The result is then used to probabilistically detect both foreground (crack) and background (non-crack) regions. Each pixel in the raw image is subsequently assigned a likelihood of belonging to a crack by means of a Gaussian model, and the resulting pixel map is binarized and filtered. This probabilistic modeling improves on most existing deterministic methods because road crack images themselves are highly stochastic. Final skeletonization and cleaning operations yield a simple binary representation of cracks in an image. This paper is organized as follows: First, the fundamentals of superpixel partitioning (using the SLIC method) and Gaussian modeling will be summarized. Next, the specifics of the full approach proposed in this paper will be presented. Third, the performance of the proposed approach will be investigated by comparison with ground truth. Finally, conclusions about this approach will be drawn and future work discussed.

2 Superpixels and Gaussian Modeling

2.1 Superpixel Partitioning Using SLIC

Although superpixelization algorithms were originally developed simply as a tool for image segmentation, new and improved methods of constructing superpixels are now widely sought after in their own right. A recent exhaustive study of the 28 most prominent published superpixel algorithms recommends the top-performing six methods, each of which is superior in a slightly different way [10]. Of these methods, some take an energy optimization approach [11,12], whereas others use graph-based methods [13] or contour evolution [14], among others. The technique used here is the ever-popular SLIC method [15].

SLIC superpixel partitioning requires an initialization step in which the user specifies an approximate number K of superpixels to be generated. In order to seed the algorithm, the raw image pixels must be clustered into grid cells. To accomplish this, the input image is divided into a grid of even spacing S in the x- and y-directions. If the width res_x and height res_y of the image are unequal, the number of rows of subdivisions n_y will not equal the number of columns n_x. To approximate these values, the following formulas are implemented:

n_x = \left[ \sqrt{K \, res_x / res_y} \right]    (1)

n_y = \left[ n_x \, res_y / res_x \right]    (2)

where the [ ] operator represents nearest-integer rounding. The distance between neighboring grid cells is then given by:

S = res_x / n_x = res_y / n_y    (3)

The center pixels of the grid cells are therefore spaced at intervals of S pixels throughout the image. Each grid cell center is assigned a cluster of pixels with a width and height of 2S, with the edge cases being truncated, such that there is some overlap between grid cells. This way, every pixel is accounted for. An additional step is oftentimes necessary to ensure that superpixel boundaries match natural image contours. If any center pixel happens to fall on a contour, it is likely that the pixels contained in its cluster should in reality belong to two different superpixels. The correction proposed originally in [15] moves cluster centers off of boundaries to the lowest-gradient position in a 3 × 3 pixel neighborhood.

Next, the substantial portion of SLIC partitioning takes place. Each cluster of N pixels can be represented by an N × 5 array, where the first three columns correspond to the three color channels and the last two columns correspond to the x- and y-position of each pixel. Each pixel is given a score based on a weighted distance between its 5-D position and the 5-D position of the cluster center. This weighting is described by the following equations:

D = d_{rgb} + \frac{C}{S} \, d_{xy}    (4)

d_{rgb} = \lVert rgb_i - rgb_c \rVert    (5)

d_{xy} = \lVert xy_i - xy_c \rVert    (6)

where C is a tunable compactness parameter that adjusts the “flexibility” of the superpixel boundaries. Each pixel i is then assigned to the cluster with center c that yields the lowest value of D. When all pixels in the raw image have been assigned, the cluster centers are updated to the 5-D mean of all their constituent pixels, and the process is repeated either until convergence or until a threshold criterion is met. These procedures are highly effective in assigning superpixels according to natural image contours, and the algorithm is much faster than many other methods [10]. SLIC partitioning has only O(N) complexity owing to the simple linear iterative clustering approach, which makes it well suited to the fast crack detection described here.
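As a concrete illustration of Eqs. (1)–(6), the following is a minimal Python/NumPy sketch of the grid seeding and the weighted 5-D distance. The function names and example values are illustrative only; in practice, an existing implementation such as skimage.segmentation.slic can be used instead of coding the full iterative clustering loop.

```python
import numpy as np

def slic_seed_grid(res_x, res_y, K):
    """Grid seeding per Eqs. (1)-(3): number of columns/rows and grid spacing S."""
    n_x = int(round(np.sqrt(K * res_x / res_y)))   # Eq. (1)
    n_y = int(round(n_x * res_y / res_x))          # Eq. (2)
    S = res_x / n_x                                # Eq. (3); equal to res_y / n_y
    return n_x, n_y, S

def slic_distance(rgb_i, xy_i, rgb_c, xy_c, C, S):
    """Weighted 5-D distance of Eqs. (4)-(6) between a pixel i and a cluster center c."""
    d_rgb = np.linalg.norm(np.asarray(rgb_i, float) - np.asarray(rgb_c, float))  # Eq. (5)
    d_xy = np.linalg.norm(np.asarray(xy_i, float) - np.asarray(xy_c, float))     # Eq. (6)
    return d_rgb + (C / S) * d_xy                                                # Eq. (4)

# Illustrative seeding of a 204 x 153 image with roughly K = 300 superpixels
n_x, n_y, S = slic_seed_grid(204, 153, 300)   # gives n_x = 20, n_y = 15, S = 10.2
```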

2.2 Gaussian Modeling

Gaussian-mixture modeling has been useful in the field of computer vision for creating functional models to fit observed data. The principle used in this paper is similar, but simpler; a mixture of Gaussians is unnecessary to model pixel data, since the proposed method assumes an approximately unimodal distribution of crack pixel values. Therefore, models composed of a single Gaussian for each class are used, which is advantageous in that it is far faster than mixture-modeling techniques. Given an N × M array x_i of observed M-dimensional data from class i, an M × 1 mean vector \mu_i and an M × M covariance matrix \Sigma_i can be computed. Assuming the data is normally distributed (a fair assumption for large N and a uniform signal with white noise), a corresponding probability density function (PDF) describing the stochastics of interest can be approximated by a Gaussian function of the form:

p_i(\mathbf{x}) = \frac{1}{\sqrt{|2\pi\Sigma_i|}} \exp\!\left( -\frac{1}{2} [\mathbf{x} - \boldsymbol{\mu}_i]^{T} \Sigma_i^{-1} [\mathbf{x} - \boldsymbol{\mu}_i] \right)    (7)

For a test vector x' of unknown class, the relative probability that x' corresponds to class 1 versus all other I − 1 classes can be estimated by

P(\mathrm{class}_1 \mid \mathbf{x}') = \frac{p_1(\mathbf{x}')}{\sum_{i=1}^{I} p_i(\mathbf{x}')}    (8)
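A minimal NumPy sketch of Eqs. (7) and (8) for the two classes used here (crack and non-crack) follows; the sample arrays are synthetic placeholders used only to show the calling pattern.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Eq. (7): multivariate Gaussian density of an M-dimensional point x."""
    diff = x - mu
    norm = np.sqrt(np.linalg.det(2.0 * np.pi * sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm

def relative_class_probability(x, models):
    """Eq. (8): probability that x belongs to the first class versus all I classes.
    `models` is a list of (mu, sigma) pairs, with class 1 first."""
    p = np.array([gaussian_pdf(x, mu, sigma) for mu, sigma in models])
    return p[0] / p.sum()

# Synthetic example: dark (crack-like) vs. bright (background-like) RGB samples
crack = np.random.rand(500, 3) * 0.3
background = 0.5 + np.random.rand(500, 3) * 0.5
models = [(c.mean(axis=0), np.cov(c, rowvar=False)) for c in (crack, background)]
print(relative_class_probability(np.array([0.1, 0.1, 0.1]), models))
```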

Although formulated in the general sense for an arbitrary number of classes, the context in which Eqs. (7) and (8) are used here is for only two classes: crack and non-crack. The specifics of this approach will be presented in more detail in the following section.

3 SLIC-based Probabilistic Crack Network Detection

3.1 Overview

Figure 1 shows a schematic of the proposed approach. Given a partitioned image, coarse crack detection first takes place by casting superpixels into a mean–standard deviation feature space, recursively removing outliers and remodeling the image. The resulting inliers and outliers are then fed to a Quadratic Bayesian Classifier (QBC) to classify background (inlier) and crack (outlier) regions. Further superpixel manipulation consists of skeletonization, end-growing, and noise removal. The resulting binary mask is closely representative of the true crack, but it lacks fine detail. Probabilistic crack refinement based on Gaussian modeling then takes place, followed by coarse–fine belief fusion to attenuate background noise. Finally, the resulting probability map is binarized, skeletonized, and cleaned, yielding the final crack-detected image. The following subsections describe the detection and manipulation of crack superpixels as well as probabilistic refinement in more detail.

Fig. 1. Basic diagram of crack network detection algorithm

3.2 Detection and Manipulation of Superpixels

In order to ensure computational efficiency and favorable superpixel segmentation, the proposed method down-resolves and smooths raw RGB road images with a Gaussian filter. At this stage, the SLIC algorithm is executed to obtain pixel clusters for further processing. Each superpixel is cast into a 2-D mean–standard deviation feature space, where these two metrics are extracted from the grayscale-converted image. Relatively little information is lost by removing the color channels, as color variation over road pavement is normally negligible. At this juncture, it is assumed that cracks are generally darker than background pavement, and that they are the only non-background objects in an image. The latter assumption is sometimes invalid, but it is handled by geometry filters later in the process.

A Gaussian model for the dataset consisting of all superpixels is constructed using the dataset’s mean and covariance, and each individual point x is scored by its value according to Eq. (7). Superpixels whose score falls below a threshold are set aside as outliers, while the rest of the points are retained and the process is repeated. This continues until no change in the partitioned dataset occurs. Figure 2 shows the result of this procedure on a sample image. To create a mechanism by which only a single parameter needs tuning, a QBC is trained with the inlier and outlier classes established in the previous step.
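The recursive outlier-removal loop described above can be sketched as follows, assuming an (N, 2) array of per-superpixel (mean, standard deviation) features and a fixed score threshold; both the function name and the stopping behavior are a simplified reading of the procedure, not the authors' exact implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def recursive_outlier_split(features, score_threshold):
    """Fit a single Gaussian (Eq. (7)) to the current inliers, move low-scoring
    superpixels to the outlier set, and repeat until the partition is stable."""
    inliers = np.ones(len(features), dtype=bool)
    while True:
        mu = features[inliers].mean(axis=0)
        sigma = np.cov(features[inliers], rowvar=False)
        scores = multivariate_normal.pdf(features, mean=mu, cov=sigma)
        new_inliers = scores >= score_threshold
        if np.array_equal(new_inliers, inliers):
            return inliers   # False entries are the outlier (crack-candidate) superpixels
        inliers = new_inliers
```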


Fig. 2. Raw image, raw image with superimposed superpixels, and superpixels in the 2-D feature space. Note that green and red points denote outliers, where green points are not considered to correspond to cracks since they do not meet the criterion of being darker than background pixels.

With the parameter of crack prior probability specified, the QBC establishes a decision boundary that minimizes misclassification given the constraints imposed by its quadratic form. The resulting partition is similar to that shown in Fig. 2c, but it can be easily adjusted with this single parameter in order to optimize detection depending on the predicted crack severity. Superpixels classified as belonging to cracks are then assigned a value of 1, and non-crack regions are assigned a 0.

Next, the width of the coarsely estimated crack is reduced to its single-superpixel-wide representation, which requires superpixel skeletonization. This is accomplished with a morphological approach in which each superpixel’s border-sharing neighbors are interrogated. This can be difficult, however, since no superpixel is guaranteed to share boundaries with any particular number of other superpixels. The solution we propose scans the superpixel-labeled image at the true-pixel level, tracking locations where pixels transition from one superpixel to the next. Each time a boundary is encountered, the label of the superpixel to which the first true-pixel belongs is stored in an array associated with the next pixel. This array is only updated when a new neighbor is encountered. Although this is a relatively simple way to find superpixel neighbors, it does not preserve the order in which the neighbors occur. This issue is resolved by finding the angles between the centroid of the center superpixel of interest and that of each of its neighboring superpixels (illustrated in Fig. 3b). The neighboring superpixels are then ordered clockwise from the upper-left corner. The surrounding superpixels of the binary image are subsequently “unwrapped” as shown by the white arrow in Fig. 3b, and the state transitions are counted (i.e., when a superpixel value differs from the one before it). If more than two transitions occur, the center superpixel is retained as belonging to the crack skeleton; otherwise, it is removed. As the figure shows, false positives are removed because these criteria are not met, whereas adjacent crack pixels are kept.
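A compact sketch of the neighbor-ordering and transition-counting criterion is given below. The neighbor extraction is simplified here (axis-aligned shifts on the label image, with image borders ignored for brevity), and the dictionaries `crack_mask` and `centroids` are assumed inputs; this is an illustration of the criterion, not the authors' boundary-scanning implementation.

```python
import numpy as np

def ring_transition_count(center_label, labels, crack_mask, centroids):
    """Order the neighbors of `center_label` by centroid angle and count the
    crack/non-crack state transitions around the ring (with wrap-around)."""
    region = labels == center_label
    touched = set()
    for shifted in (np.roll(region, 1, 0), np.roll(region, -1, 0),
                    np.roll(region, 1, 1), np.roll(region, -1, 1)):
        touched.update(np.unique(labels[shifted & ~region]).tolist())
    touched.discard(center_label)
    neighbors = sorted(
        touched,
        key=lambda n: np.arctan2(*(centroids[n] - centroids[center_label])))
    states = [int(crack_mask[n]) for n in neighbors]
    return sum(states[i] != states[i - 1] for i in range(len(states)))

def skeletonize_superpixels(labels, crack_mask, centroids):
    """Keep a crack superpixel only if more than two transitions occur around it."""
    keep = dict(crack_mask)
    for lab, is_crack in crack_mask.items():
        if is_crack and ring_transition_count(lab, labels, crack_mask, centroids) <= 2:
            keep[lab] = False
    return keep
```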


Although these methods do well to ensure the removal of false positives, true positives are sometimes removed in the process. In order to repair these issues and selectively enhance long and slender features, a subsequent step is taken to “grow” crack endpoints. Past work shows results at the true-pixel level [7,16], but implementing end-growing with superpixels is substantially different and more complex.

Fig. 3. Raw image with overlaid superpixels; magnified region to show neighbor extraction; labeled binary image with three missing crack regions, and one incorrectly classified superpixel. The misclassified superpixel is rejected.

Fig. 4. End-growing search window, with θ = −75° and δ = 60°. In this case, three superpixels are candidates for becoming the new crack endpoint.

Figure 4 shows an end-growing example where three candidates for the advancing endpoint are shown to fall within a given search window. First, endpoints are detected as crack superpixels which have only one neighboring crack superpixel. This is only possible if the crack has been skeletonized, which makes the previous step critical. The angle θ between the endpoint and the previous crack superpixel is computed, and an angular window of (θ + 180°) ± δ is established, centered at the centroid of the endpoint. δ is a variable parameter, where smaller values increase the rigidity with which cracks are propagated. Neighboring superpixels whose angles made with the superpixel of interest fall within this range are considered as possible candidates for crack class assignment. An additional requirement is that the RGB mean vector of all true-pixels in the endpoint-candidate superpixels be sufficiently closer, in the Euclidean sense, to the mean of all labeled crack pixels than to the mean of all background pixels. If this is satisfied, the candidate whose RGB-space location is closest is then reclassified as a crack endpoint. This can be iterated until all endpoints have disappeared or the criteria are not met, though in this case only one iteration is executed, since it is only necessary to reclaim pixels lost in the skeletonization step. Though it is not explicitly shown in Fig. 4, the new endpoint is found to be the center superpixel in this example, bringing the two ends of the fragmented crack closer together.
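The angular-window test and RGB-mean criterion can be sketched as follows; all names (`candidates`, `mean_rgb`, `crack_mean`, `bg_mean`) are illustrative inputs, and the geometry is one plausible reading of the description above rather than the authors' exact code.

```python
import numpy as np

def grow_endpoint(end_lab, prev_lab, candidates, centroids, mean_rgb,
                  crack_mean, bg_mean, delta_deg=60.0):
    """Pick the candidate superpixel (if any) that continues the crack past an endpoint."""
    # Direction of propagation: from the previous crack superpixel through the endpoint,
    # which is equivalent to the (theta + 180 deg) direction in the text.
    v = centroids[end_lab] - centroids[prev_lab]
    theta = np.degrees(np.arctan2(v[0], v[1]))
    best, best_dist = None, np.inf
    for lab in candidates:
        u = centroids[lab] - centroids[end_lab]
        ang = np.degrees(np.arctan2(u[0], u[1]))
        if abs((ang - theta + 180.0) % 360.0 - 180.0) > delta_deg:
            continue                                   # outside the angular window
        d_crack = np.linalg.norm(mean_rgb[lab] - crack_mean)
        d_bg = np.linalg.norm(mean_rgb[lab] - bg_mean)
        if d_crack < d_bg and d_crack < best_dist:     # closer to crack colors, closest overall
            best, best_dist = lab, d_crack
    return best                                        # None if no candidate qualifies
```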

3.3 Probabilistic Crack Refinement

The coarse mask obtained by superpixel operations is now used to partition true-pixels from the raw image probabilistically into two classes: background and crack. To do this, the mean and covariance matrices are extracted from the three-dimensional (3-D) RGB-space measurement array for each class in the manner described in Sect. 2.2. Under the assumption of unimodal normal distributions, Eq. (7) is used to create PDFs for each class. Then Eq. (8) establishes the probability P(bg|x') of any given pixel x' belonging to the background vs. the probability P(cr|x') of it belonging to a crack. The result is a pixel map where each pixel’s value is its crack probability. In order to increase contrast, the pixel map is then squared. To attenuate background noise, the resulting image is multiplied by a similar, though coarser, image in which each superpixel’s value is given by its crack probability as determined by the QBC.

A subsequent binarization occurs, with nonlinear thresholding criteria, described as follows. First, the image’s histogram is computed with optimized bin size and number. Since at this stage a crack-enhanced image can be assumed to be bimodal, with higher frequencies at values near one and zero, the aim of thresholding is to separate these two modes while upholding crack preservation. This is accomplished heuristically by finding the threshold value corresponding to the bin that first falls below 10% of the maximum bin and adding the square of the ratio of the bin-count range to the number of all pixels. Mathematically, this is stated as:

T = V_{0.10\,\max(H)} + \left( \frac{\max(H) - \min(H)}{\mathrm{sum}(H)} \right)^{2}    (9)

where T is the desired threshold value (between 0 and 1), H is the array containing the number of pixel values in each bin, and V is the array of values at which the bins are positioned. Finally, the resulting binary image is cleaned and simplified as described below.
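A minimal sketch of the thresholding heuristic of Eq. (9) is shown below. The bin count is a placeholder, since the paper's "optimized bin size and number" is not specified here.

```python
import numpy as np

def heuristic_threshold(prob_map, bins=64):
    """Eq. (9): value of the first bin whose count falls below 10% of the maximum bin,
    plus the squared ratio of the bin-count range to the total pixel count."""
    H, edges = np.histogram(prob_map, bins=bins, range=(0.0, 1.0))
    V = 0.5 * (edges[:-1] + edges[1:])            # bin center values
    below = np.nonzero(H < 0.10 * H.max())[0]
    v_star = V[below[0]] if below.size else V[-1]
    return v_star + ((H.max() - H.min()) / H.sum()) ** 2

# binary = prob_map >= heuristic_threshold(prob_map)
```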

3.4 Crack Cleaning and Simplification

Much of the process of cleaning and simplifying cracks follows previous work by Steckenrider, et al. [7,17]. First, small particles are considered noise and removed. Then a series of geometry filters are implemented to isolate only long and slender features, without excluding multi-branched cracking networks. Next, a topological skeletonization is performed to reduce cracks to single-pixel-wide features. Gaps are closed and small false branches are removed according to the procedures described in [17]. The final result is the most basic form of a crack that can then be used for higher-level pavement assessment.
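Only the noise-removal and topological-skeletonization steps are sketched below; the geometry filters and the gap-closing/branch-pruning procedures of [17] are not reproduced, and the `min_size` value is purely illustrative.

```python
from skimage.morphology import skeletonize, remove_small_objects

def clean_and_simplify(binary, min_size=20):
    """Remove small noise particles and reduce the remaining crack regions to
    single-pixel-wide skeletons (geometry filtering and gap closing omitted)."""
    cleaned = remove_small_objects(binary.astype(bool), min_size=min_size)
    return skeletonize(cleaned)
```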

4 Results

This section contains both subjective and objective assessments of the proposed crack detection approach: the first subsection deals with subjective assessment, and objective assessment follows in the second. It is to be noted that the proposed approach is not compared to other crack detection methods in this paper, since each method relies on a different set of techniques and there is no common ground for a valid comparison; furthermore, a common pavement image database is unavailable for comparison. Therefore, the numerical analysis here focuses on investigating the capability of the proposed approach as compared to human-based ground truth assessment.

4.1 Subjective Assessment

Subjective assessment is given in the form of a series of images showing the progressive steps of the algorithms described above on the image given previously, followed by some before-and-after images of various representative cracks. Figure 5 shows the effect of superpixel skeletonization and end growing, followed by probabilistic crack refinement.

Fig. 5. Initial superpixel class assignment, superpixel skeletonization and end growing, probabilistic refinement.

Note the gap-closing effect of the end-growing algorithm; this technique favors long and slender features like cracks, though in some places ends can elongate in the wrong direction (note especially the regions near the top of Fig. 5b). False positives like those at the bottom corners are eliminated by the subsequent probabilistic refinement. Figure 6 shows the thresholding, cleaning, and simplification operations, respectively. The resulting pixel map contains the most fundamental form of the crack, which lends itself to further perception and classification schemes that will be addressed in the future work section. Figure 7 offers four examples of processed road crack images with varying levels of severity. In an effort to be truthful about the capability of these methods in an unbiased way, these images were chosen from an even spread of good and bad detection instances.


Fig. 6. Crack thresholding, cleaning, and simplification. Some crack detail is sacrificed for better resilience against false positives.

Fig. 7. Four examples of detected cracks of varying severity.

4.2 Objective Assessment

The objective measure of superpixel-based crack detection is given by means of a rigorous pixel-wise analysis. A human analyst established ground truth by carefully drawing the single-pixel-wide skeletons of all visible cracks in 100 road images taken from a wide variety of randomized locations. Although this sample size is relatively small, for processed images with a resolution of 153 × 204 pixels the number of individual data points (pixels) available for validation exceeds 3.1 million, a satisfactory verification set. The images ranged from single cracks to complex crack networks, and many different pavement types (including concrete, asphalt, cement, aggregate, etc.) were included. The accuracy of the assigned crack pixels is determined by the average distance between the positive ground truth pixels and the positive detection pixels. This metric was computed by taking the distance transform of the ground truth crack images and summing the product of the distance-transformed ground truth and the detected skeleton.


The summation was divided by the total number of positive pixels in the detected image to obtain an average distance deviation (designated Δ, in pixels) from ground truth. The aforementioned distance measure is sensitive to false positives, but it does not give much insight into false negatives. Therefore, the total crack length of the detected image was compared with the total crack length of the ground truth image. The ideal ratio L of these values is 1, so the true measure of error is given simply as E = |L − 1|.

To arrive at an overall score for crack detection in a particular image, the false-positive metric Δ and the false-negative metric E are combined into a net score in such a way that the relative contributions of both metrics hold the same weight. Therefore, Δ is equalized to the range of all observed values of E and the average of the two values is found. We recommend that this be the standard for quantifying pixel-wise crack detection accuracy, as none currently exists. The following equations explicitly define pixel-wise crack detection scoring:

\Delta' = \frac{\mathrm{range}(E)}{\mathrm{range}(\Delta)}\,\Delta + \min(E) - \frac{\mathrm{range}(E)}{\mathrm{range}(\Delta)}\,\min(\Delta)    (10)

\mathrm{Error}_{total} = \frac{\Delta' + E}{2}    (11)

For reference, the images shown in Fig. 7 (from upper left to lower right) scored 0.0052, 0.1611, 0.2863, and 0.3350, respectively. A score of zero would mean perfect detection, so the lower the value, the better the performance of the detection algorithm. As the examples show, there is a strong penalty on false positives that are located far from the true crack (bottom-right side of the lower-right example in Fig. 7), even when detection is fairly competent. This highlights the robustness of the approach, in that the mean pixel distance error is affected greatly by small false positives, and yet most images perform well in this regard.
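A sketch of these metrics is given below, assuming boolean single-pixel-wide skeleton arrays for the detection and the ground truth; using the skeleton pixel count as a proxy for crack length is an assumption made here for illustration.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def crack_metrics(detected, truth):
    """Per-image Delta (mean distance of detected crack pixels from the GT skeleton)
    and length-ratio error E = |L - 1|."""
    dist_to_truth = distance_transform_edt(~truth)           # distance to nearest GT crack pixel
    delta = (dist_to_truth * detected).sum() / max(detected.sum(), 1)
    L = detected.sum() / max(truth.sum(), 1)                  # length ratio via skeleton pixel counts
    return delta, abs(L - 1.0)

def overall_scores(deltas, errors):
    """Eqs. (10)-(11): rescale Delta to the range of E, then average the two metrics."""
    deltas, errors = np.asarray(deltas, float), np.asarray(errors, float)
    r = np.ptp(errors) / np.ptp(deltas)                       # range(E) / range(Delta)
    delta_p = r * deltas + errors.min() - r * deltas.min()    # Eq. (10)
    return (delta_p + errors) / 2.0                           # Eq. (11)
```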

Fig. 8. Histogram of pixelwise error scores.


The average overall score on the 100-image dataset was 0.108, with a highly skewed distribution as shown in Fig. 8. This distribution shows that 96% of cracks in crack images were detected with more accuracy than the bottom two images in Fig. 7. In addition, the average (non-scaled) distance metric Δ was 1.509, whereas the median Δ was only 0.795; this demonstrates the substantial skew in this metric as well. The mean E score was 0.178, which means that the average length of a detected crack was within 18% of the ground truth length.

5 Conclusions and Future Work

The superpixel-based crack detection approach proposed here fuses probabilistic methods with superpixel partitioning to offer robust, probabilistically sound results for the automatic detection of road pavement cracks. Additionally, novel superpixel manipulation techniques have been introduced which may have uses far beyond crack detection. With an average non-optimized pixel-wise error score of only 0.108, these techniques are shown to handily accomplish the desired objectives. Although comparison-based results are elusive in this area of research, the provided objective means of evaluation sufficiently demonstrates the capability of the proposed methods.

The next step of this work is to consider a more general segmentation approach, in which the assumption that cracks are always dark is no longer made. Although rare, pavement cracks can sometimes appear lighter than the background. Few to no existing methods handle such cases, so improving the versatility of these algorithms will not only improve crack detection, but also offer a distinct advantage over current technologies. In addition, a full recursive probabilistic classification scheme is being developed for the purpose of higher-level crack severity characterization. One ultimate goal is to provide an open online resource which would be useful not only for drivers, but also for road maintenance experts when repairs are being considered. Such a resource would feature an interactive local map with road condition clearly marked by color-coded roads.

Acknowledgments. The authors would like to acknowledge Murata Manufacturing Co., Ltd. for supporting the work presented in this paper.

References

1. Erickson, S.W.: Street pavement maintenance: road condition is deteriorating due to insufficient funding. Office of the City Auditor, San Jose (2015)
2. Vlacich, B.: State of the pavement 2016. Virginia Department of Transportation, Richmond, VA (2016)
3. Tanaka, N., Uematsu, K.: A crack detection method in road surface images using morphology. In: IAPR Workshop on Machine Vision Applications, pp. 154–157 (1998)
4. Gavilán, M., et al.: Adaptive road crack detection system by pavement classification. Sensors 11, 9628–9657 (2011)


5. Chambon, S., Subirats, P., Dumoulin, J.: Introduction of a wavelet transform based on 2D matched filter in a Markov random field for fine structure extraction. Application on road crack detection. SPIE-IS&T, vol. 7251 (2009)
6. Steckenrider, J.J., Furukawa, T.: Selective pre-processing method for road crack detection with high-speed data acquisition. Int. J. Autom. Eng. (2017)
7. Steckenrider, J.J., Furukawa, T.: Detection and classification of stochastic features using a multi-Bayesian approach. In: IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI) (2017)
8. Oliveira, H., Correia, P.L.: Automatic road crack detection and characterization. IEEE Trans. Intell. Transp. Syst. 14(1), 155–168 (2013)
9. Ren, X., Malik, J.: Learning a classification model for segmentation. In: Proceedings Ninth IEEE International Conference on Computer Vision, Nice, France, vol. 1, pp. 10–17 (2003)
10. Stutz, D., Hermans, A., Leibe, B.: Superpixels: an evaluation of the state-of-the-art. Comput. Vis. Image Underst. 166, 1–27 (2017)
11. Yao, J., Boben, M., Fidler, S., Urtasun, R.: Real-time coarse-to-fine topologically preserving segmentation. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, pp. 2947–2955 (2015)
12. Van den Bergh, M., Boix, X., Roig, G., Van Gool, L.: SEEDS: superpixels extracted via energy-driven sampling. Int. J. Comput. Vis. 111(3), 298–314 (2012)
13. Liu, M.Y., Tuzel, O., Ramalingam, S., Chellappa, R.: Entropy rate superpixel segmentation. In: CVPR 2011, Providence, RI, pp. 2097–2104 (2011)
14. Buyssens, P., Gardin, I., Ruan, S.: Eikonal based region growing for superpixels generation: application to semi-supervised real time organ segmentation in CT images. IRBM 35(1), 20–26 (2014)
15. Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Süsstrunk, S.: SLIC superpixels, pp. 15–29. EPFL (2010)
16. Fernandez, J.: Image processing to detect worms. Uppsala Universitet, Uppsala, Sweden (2010)
17. Steckenrider, J.J.: Multi-Bayesian approach to stochastic feature recognition in the context of road crack detection and classification. Virginia Polytechnic Institute and State University, Blacksburg, Virginia, USA (2017)

Using Aerial Drone Photography to Construct 3D Models of Real World Objects in an Effort to Decrease Response Time and Repair Costs Following Natural Disasters

Gil Eckert, Steven Cassidy, Nianqi Tian, and Mahmoud E. Shabana

Monmouth University, West Long Branch, USA
{geckert,s1163543,s1079297,s1134010}@monmouth.edu

Abstract. When a natural disaster occurs, there is often significant damage to vitally important infrastructure. Repair crews must quickly locate the structures with the most damage that are in need of immediate attention. These crews need to determine how to allocate their resources most efficiently to save time and money without having to assess each area individually. To streamline this process, drone technology can be used to take photographs of the affected areas. From these photographs, three-dimensional models of the area can be constructed; these models can include point clouds, panoramas, and other three-dimensional representations. This process is called photogrammetry. The first step in constructing a three-dimensional model from two-dimensional photographs is to detect key features that match throughout all the photos. This is done using David Lowe’s Scale Invariant Feature Transform (SIFT) algorithm, which detects the key features. Pairwise matches are then computed by using a k-nearest-neighbor algorithm to compare the images one pair at a time, finding pixel coordinates of matching features. These pixel matches are then passed to an algorithm which calculates the relative camera positions of the photos in a 3D space. These positions are then used to orient the photos, allowing us to generate a 3D model. The purpose of this research is to determine the best method to generate a 3D model of a damaged area with maximum clarity in a relatively short period of time at the lowest possible cost, therefore allowing repair crews to allocate resources more efficiently.

Keywords: Photogrammetry · Point cloud · Structure from motion

1 Introduction

1.1 Problem

When natural disasters occur, there is often substantial damage done to crucial structures. Repair crews must act fast to locate the infrastructure that requires foremost attention because society is dependent on it for survival. Traditionally, this is time-consuming and cost-ineffective because these areas must be individually assessed by repair crews, prolonging the time it takes to restore these areas to their former condition and increasing the cost of the repair. Therefore, it is necessary to develop a more efficient way to assess the damage and allocate repair resources more effectively.

1.2 Objectives

The objective of this research is to develop a more efficient way to assess damage occurring from natural disasters by using drone photography to develop a 3D model of an affected area that will allow crews to determine what is in need of immediate repair without having to assess each area individually in the field. This will allow crews to reduce the time and cost needed to make the repairs by allowing them to allocate resources more effectively.

2 Review of the Literature

2.1 Theories, Concepts and Models

Photogrammetry is a technique commonly used to map areas and objects from input photographs. The first step in using photogrammetry to create a three-dimensional model from two-dimensional aerial drone photographs is to match similar features in the images. Similar features of photographs can be matched by using David Lowe’s Scale-Invariant Feature Transform algorithm (SIFT). SIFT detects key points by applying Gaussian filters to an image at different scale values [1]. The Laplacian of Gaussian (LoG) is then used to find similar regions, or blobs, within the Gaussian filter [2]. As the scale changes, these blobs change in size. In order for the algorithm to work in a timely manner, the LoG is approximated using the Difference of Gaussians (DoG), which is found by calculating the difference of two Gaussian filters at different values applied to the image [2]. This is then used to find the extreme values in the image, which may be key points if they reach a certain threshold [2]. After key points are found, they are matched across all of the images using a nearest neighbor algorithm [1]. The matches are then used to calculate the camera positions from which the photos were taken. Camera positions are used to orient the photos and create a 3D model, which can be a point cloud, 3D panorama, or other model.
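A minimal OpenCV sketch of SIFT detection and k-nearest-neighbor matching with Lowe's ratio test is shown below; the file names are hypothetical, the ratio value is a common default rather than a value stated here, and older OpenCV builds expose SIFT through cv2.xfeatures2d instead of cv2.SIFT_create.

```python
import cv2

def sift_matches(img1, img2, ratio=0.75):
    """Detect SIFT keypoints in two grayscale images and keep the pairwise
    matches that pass the ratio test (kNN matching with k = 2)."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    raw = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m for m, n in (p for p in raw if len(p) == 2)
            if m.distance < ratio * n.distance]
    # pixel coordinates of the matched features in each image
    return [(kp1[m.queryIdx].pt, kp2[m.trainIdx].pt) for m in good]

img1 = cv2.imread("tower_001.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical file names
img2 = cv2.imread("tower_002.jpg", cv2.IMREAD_GRAYSCALE)
pairs = sift_matches(img1, img2)
```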

2.2 Previous Research

Previous research on constructing 3D models has been conducted by Changchang Wu, PhD, who specializes in structure from motion and 3D computer vision [3]. He created VisualSFM, a structure-from-motion 3D reconstruction application that generates 3D point clouds [6].


3 Research Methodology

3.1 Introduction

Throughout this project our goal has been to develop a new technology using the capability of aerial photography drones to decrease response time and repair costs following natural disasters. This would increase effectiveness of repair crew deployment to areas that require the most immediate need. The initial direction of the project was to use artificial intelligence to recognize 2D photographs and determine what is in need of repair. We also determined that it was necessary to research photogrammetry to create a 3D representation of the photos in their proper locations. It would also be necessary to create a solution with an organized user experience that shows a representation of the affected area that is useful in quickly and efficiently determining how to allocate resources.

4 Theoretical Framework

Due to modern advancements in aerial drone photography, aerial photographs of an object or area can be used to construct a 3D model that can in turn be used to create a plan for the repair process. Guiding our research was the objective of developing an effective and efficient method to assess damage using photographs taken by an aerial drone. Due to the vast amount of pre-existing research on computer vision, this meant that our research would be broad-based, multidimensional, and would include some complex technologies that we did not yet understand.

4.1 Research Design

First, we needed to learn more about the business of using drone photography for infrastructure damage assessment. For that we sought input and guidance from Aerial Applications, a leader in the field. They sent us 170 photographs of a cell tower taken by a drone; each photograph was 3000 × 4000 pixels. These photographs became the focus of our research. The objective was to create a method by which the “health” of the cell tower could be quickly and accurately determined from the photographs.

Generate a 3D Point Cloud. To do this, the team needed to understand photogrammetry, edge detection, feature detection and matching, and camera positions. We examined three methods of feature detection: ORB, SIFT, and SURF (see Fig. 1).


Fig. 1. The image above shows a comparison of the three feature matching algorithms. The first algorithm ORB is the least useful because it generates the least number of matches with the lowest accuracy. The second algorithm SIFT is the best because it has an ideal number of matches with the most accuracy. Although the final algorithm SURF has more matches than SIFT, they are not as accurate and therefore tend to be less reliable.

Each of these methods relies on edge detection algorithms to detect features in photographs, which are then matched using a k-nearest-neighbor algorithm. We determined that SIFT is the best method for detecting and matching features because it results in the most accurate matches (see Fig. 2). Feature matches were then used to calculate the camera positions of the photos, which allowed us to generate 3D models using two methods of analysis: point cloud generation using structure from motion, and 3D panorama generation. We also tested various software platforms that use these methods to generate 3D models (see Figs. 3, 4, and 5) in an effort to see how they work, the things they do well, and the things that could be improved upon. We then used these findings in an attempt to come up with a better way to generate a 3D model.

We quickly determined that aerial photos of a cell tower taken by a basic drone would challenge our ability to quickly and accurately generate a 3D reconstruction of the tower. The benefits of generating a 3D point cloud using the methods we explored were outweighed by excessive processing time and lack of clarity in the resultant model. We suspect that the complexity of the subject matter, pictures of a cell tower in space, contributed to this deficiency.


Fig. 2. The figure above shows a larger example of the SIFT algorithm used to detect and find matching features of two very similar images of a cell tower. This algorithm has a greater amount of accurate matches as compared to the other two algorithms. Although it can be seen that this algorithm is not 100% accurate, the number of accurate matches is large enough for the camera position calculation to account for the error in the inaccurate matches.

Fig. 3. The figure above shows an image of a point cloud of a cell tower generated using VisualSFM. This software is an open source application developed by Changchang Wu. This version of the point cloud was highly inaccurate because it could not correlate all the photos to locate camera positions and therefore much detail was missing from the cell tower.

A Way to Classify Photographs. As we became convinced that generating a 3D point cloud was not the answer to our problem, we started instead to think about how we might order pictures so they could be placed into a collage or stitched into a panorama. The majority of the panorama stitching algorithms we examined worked marginally well with photographs of fixed landscapes, but not well on the cell tower or with many of the other real-world objects that Aerial Applications needed to examine.


Fig. 4. The picture above shows a version of the cell tower point cloud generated using Pix4D, a professional photogrammetry software package. This point cloud is better than the VisualSFM version (shown in Fig. 3): while the landscape around the cell tower is very accurate, there is clutter on the cell tower itself. Therefore, this program is best for reconstructing large surfaces but not small intricate objects.

Fig. 5. The figure above illustrates the third version of the cell tower using ContextCapture. This is the best version of the cell tower point cloud when compared to VisualSFM and Pix4D (Figs. 3 and 4) because the cell tower is the clearest in this version. There are still some imperfections of intricate details such as wires and other connections that are missing. Therefore, it would still be difficult to use this point cloud to determine what components need to be repaired because not all components are clearly defined.

Color histograms (see Fig. 6) were effective in grouping photos taken at similar horizons but ineffective in visualizing the cell tower in a three-dimensional space. Therefore, we used our research and knowledge of GPS data, feature matching, camera positions, and histograms in an effort to develop a better 3D model from 2D aerial photographs.
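A sketch of histogram-based grouping is shown below; the function names and file paths are illustrative, and correlation is just one of the comparison measures OpenCV provides.

```python
import cv2
import numpy as np

def rgb_histogram(path, bins=256):
    """Per-channel color histogram of an image, normalized so images of
    different sizes can be compared."""
    img = cv2.imread(path)                       # OpenCV loads images in BGR order
    hist = [cv2.calcHist([img], [c], None, [bins], [0, 256]) for c in range(3)]
    hist = np.concatenate(hist).ravel()
    return (hist / hist.sum()).astype("float32")

def similarity(path_a, path_b):
    """Correlation between two histograms; values near 1 suggest photos taken
    at a similar horizon."""
    return cv2.compareHist(rgb_histogram(path_a), rgb_histogram(path_b),
                           cv2.HISTCMP_CORREL)
```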


Fig. 6. The graph above is a color histogram, which illustrates the distribution of RGB colors in an image. On the x-axis are the bins; a bin is a range of intensity of a color in an image, and there are 256 bins so that each intensity is represented. The y-axis represents the number of pixels in each bin (intensity) for each color. Blue and green are the most intense colors in this analyzed image because the sky and vegetation made up most of the image. Histograms are useful in grouping similar images because similar photos usually have similar histograms.

5 Presentation of Findings

We concluded that although point cloud generation is the most common photogrammetric method of generating a 3D model, it is ineffective in making natural disaster response more efficient due to a tendency for point clouds to lack clarity and accuracy. Instead, we concluded that using the photographs to generate a 3D panoramic model or interactive plot is more effective because it keeps the original integrity of the photographs.

5.1 Solution

A Better Way to Classify Photographs. Ultimately, the solution we developed is efficient, effective, relatively low-tech, and scalable. We used latitude, longitude, and altitude to create a 3D plot that represents the flight path of the drone. Each point in the plot represents a picture of the cell tower (see Fig. 7). The 3D plot can be rotated by the viewer who can essentially fly around the tower by clicking points to reveal the original photograph in a viewing panel. The clarity of the original photographs is kept, and the examiner can focus on key parts of the tower by interacting with the 3D plot without having to worry about any loss of detail. This allows the user to quickly look at all the photographs and easily determine what is in need of repair (see Fig. 8). When linked with other such 3D plots, this solution could be used to analyze much larger sections of infrastructure.
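A minimal Matplotlib sketch of this kind of interactive flight-path plot follows. The data structure (`records` of latitude, longitude, altitude and image path) is assumed for illustration, and the exact picking behaviour of 3D scatter plots can vary between Matplotlib versions.

```python
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3d projection)

def flight_path_browser(records):
    """Plot drone GPS positions as a rotatable 3D scatter; clicking a point
    opens the corresponding photograph in a separate viewer figure."""
    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")
    lats, lons, alts, paths = zip(*records)
    ax.scatter(lons, lats, alts, picker=True)
    ax.set_xlabel("Longitude"); ax.set_ylabel("Latitude"); ax.set_zlabel("Altitude")

    def on_pick(event):
        idx = event.ind[0]                     # index of the clicked point
        viewer = plt.figure()
        viewer.add_subplot(111).imshow(mpimg.imread(paths[idx]))
        viewer.show()

    fig.canvas.mpl_connect("pick_event", on_pick)
    plt.show()
```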


Fig. 7. The figure above shows a three-dimensional scatterplot of locations of the drone from when it took each cell tower image. Each point in the scatterplot represents an image plotted by the latitude, longitude, and altitude of the drone at the time the image was captured. It can be seen that the drone flew a circular path at different heights of the cell tower while taking the pictures in an attempt to get every angle of the tower.

Fig. 8. The figure above shows the interface created using the 3D scatter plot method (see Fig. 7). This interface gives the user the ability to select an image set. It then plots the GPS location of the drone for each image. The user can then click on each point and the program shows the corresponding image. The plot has zoom and rotation capability to allow the user to move around the plot and quickly examine each image. The benefit of this method is that no detail is lost in a reconstruction process.


6 Reflections/Further Research

Throughout the project, time was a limiting factor; therefore, there are many ways that our solution could be enhanced. The next step for this research project would be to improve our existing interface by adding more features that will make it more user-friendly. Some of these features may include allowing the user to select specific images rather than an entire directory of images. Also, it would be important to add an automation feature to the interface that would show the user the pictures represented in the scatter plot in a certain order. In order to make this solution more effective for large-scale disaster relief operations, we would research and incorporate artificial intelligence to allow the computer to determine, with computer vision, which areas and objects in the model are damaged and in need of the most repair and resources. This removes the need for a human to analyze the model and therefore allows repair crews to solve the issues quicker.

7 Appendices

“Photogrammetry is the art, science and technology of obtaining reliable information about physical objects and the environment through processes of recording, measuring and interpreting images and patterns of electromagnetic radiant energy and other phenomena” [4].

“Point clouds are collections of 3D points located randomly in space” [5].

“Structure from Motion or SfM is a photogrammetric method for creating three-dimensional models of a feature or topography from overlapping two-dimensional photographs taken from many locations and orientations to reconstruct the photographed scene” [6].

References

1. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
2. Introduction to SIFT (Scale-Invariant Feature Transform). OpenCV documentation. https://docs.opencv.org/3.3.0/da/df5/tutorial_py_sift_intro.html
3. Wu, C.: http://ccwu.me
4. What is ASPRS? ASPRS website. https://www.asprs.org/organization/whatis-asprs.html
5. Point Cloud Data. U.S. Naval Academy website. https://www.usna.edu/Users/oceano/pguth/md_help/html/pt_clouds.htm
6. Shervais, K.: Structure from Motion Introductory Guide. UNAVCO website. https://www.unavco.org/education/resources/modules-and-activities/field-geodesy/modulematerials/sfm-intro-guide.pdf

Image Recognition Model over Augmented Reality Based on Convolutional Neural Networks Through Color-Space Segmentation

Andrés Ovidio Restrepo-Rodríguez, Daniel Esteban Casas-Mateus, Paulo Alonso Gaona-García, and Carlos Enrique Montenegro-Marín

Universidad Distrital Francisco José de Caldas, Bogotá, Colombia
{aorestrepor,decasasm}@correo.udistrital.edu.co, {pagaonag,cemontenegrom}@udistrital.edu.co

Abstract. Currently, image recognition and classification implementing Convolutional Neural Networks is widely used, where one of the most important factors is the identification and extraction of characteristics and events; but in many situations this task is left entirely to the neural network, without establishing and applying a previous image processing phase that facilitates the identification of patterns. This can cause errors at the time of image recognition, which in mission-critical scenarios such as medical evaluations can be highly sensitive. The purpose of this paper is to implement a prediction model based on convolutional neural networks for geometric figure classification, applying a previous phase of color-space segmentation as the image processing method for the test dataset. For this, a scenario focused on image acquisition, processing and recognition using an AR-Sandbox and data analysis tools is proposed, developed and tested. Finally, the results, conclusions and future work are presented.

Keywords: Image acquisition · Image processing · Image recognition · Convolutional neural network · Dataset · Loss function · Accuracy · ROC curve · AR-Sandbox

1 Introduction

Artificial intelligence is the area that seeks to replicate the capabilities of human beings in machines, and it is therefore involved in human faculties such as learning, reasoning, adaptation and self-correction [9]. Within this field, neural networks and image processing work together in order to generate an accurate classification model, improving learning through the extraction of characteristics and patterns. The combination of these two fields can be seen applied in regenerative medicine, microbiology, hematology, precision agriculture, tumor identification, among others. When a convolutional neural network is implemented together with image processing, results of similar quality are produced for different file types and sizes.


For example, semantic segmentation by means of a Fully Convolutional Network (FCN), in which objects are delimited and separated, yields an improvement in the classification and learning speed of a model that adapts to the characteristics of the input data [8]. On the other hand, there is the difficulty of improving the resolution of high-quality images and videos; using convolutional neural networks and analyzing each of the contained pixels, it was demonstrated that implementing a convolutional layer at the pixel level improves the quality of images and videos without a great expense of computational resources [9]. That is why the combination of CNNs with image segmentation methods offers the opportunity to optimize results.

The motivation of this study is framed in contributing to image recognition within immersive techniques, through deep learning and image processing, based on the AR-Sandbox [10, 11], a technique that closes the gap between two-dimensional (2D) and three-dimensional (3D) visualization by projecting a digital topographic map onto a sand surface, improving people's spatial thinking and modeling skills, with the purpose of placing it within early childhood education and rehabilitation through motor therapy.

This article presents the preliminary results of the implementation of a prediction model based on convolutional neural networks for the classification of geometric figures, contemplating previous steps such as the acquisition of images using the AR-Sandbox augmented reality device and the processing of these images by means of color-space segmentation, whose purpose is to improve the preparation, extraction and identification of characteristics. The prediction model is built using the Keras neural network library under the TensorFlow framework, through an implementation made in Python.

The rest of the article is organized as follows. Section 2 consists of a background, where works related to the topics to be developed are presented. Next, Sect. 3 presents a macro scenario subdivided into three components: approach, development and testing, presenting the characteristics and implementation of each of these components. In the test component of the scenario, two datasets with different characteristics are used, in order to perform a comparative analysis and establish results from them. Finally, the conclusions and future works are presented.

2 Background

Image recognition through convolutional neural networks, together with prior image processing to optimize pattern recognition, has had a large number of applications. CNNs have improved continuously since their creation, through the innovation of new layers and the use of different computer vision techniques [2]. In the field of regenerative medicine, a study carried out by [3] developed automatic cell culture systems, where a CNN was implemented as a deep learning method for automatic recognition of cellular differences by means of the contrast in the different images. On the other hand, a study carried out in China implemented white blood cell segmentation, proposing a segmentation method based on color-space with a color adjustment prior to segmentation, where an accuracy of 95.7% was achieved for segmentation of the nucleus and an overall accuracy of 91.3% for segmentation of the cytoplasm [4]. In addition, in a study carried out by [5], a convolutional neural network is proposed for thermal image enhancement, incorporating the brightness domain with a residual learning technique and increasing performance and convergence speed.

The fast development of precision agriculture has generated the need for the management and estimation of agriculture through the classification of crops by means of satellite images; but, due to the complexity and fragmentation of the characteristics, traditional methods have not been able to meet the standard of agricultural problems. For this reason, [6] propose a classification method for agricultural remote sensing images based on convolutional neural networks, where the correct classification rate obtained is 99.55%.

According to the above, neural networks together with image processing are a commonly used alternative for image classification. On the basis of this panorama, the present study intends to carry out image recognition by means of deep learning techniques such as convolutional neural networks and image processing such as color-space segmentation, with the purpose of determining the performance variation of a convolutional neural network when color-space segmentation is applied to one of the test datasets. In the next section, the selected study scenario is presented.

3 Method and Approach of Scenario

Figure 1 shows the general approach of the scenario to be worked on, which is composed of three main components: (1) Image Acquisition, (2) Image Processing and (3) Image Recognition; each one makes use of different methods and tools. In image acquisition, augmented reality is implemented by means of the AR-Sandbox device, which projects a color elevation map in real time, capturing depths by means of the depth camera of a first-generation Kinect, in addition to making use of a standard projector [1]. On the other hand, image processing is based on color-space segmentation, previously applying saturation to the image, using Python and OpenCV. Finally, there is the Image Recognition component, where a Convolutional Neural Network (CNN) is implemented, which is the central topic of this research [12]. The CNN is developed with Keras, an open-source neural network library written in Python, using the TensorFlow framework as backend. The purpose of the scenario implementation is to predict geometric figures in different contexts.


Fig. 1. Scenario approach

4 Development of the Scenario

4.1 Image Acquisition

As a first instance, the images were acquired through the AR-Sandbox augmented reality tool, which, through a projector, a Kinect and a sandbox, provides the means to make colorful three-dimensional representations over the sand. For the acquisition of the image, the projection generated by the AR-Sandbox is used, which consists of a projector and a Kinect located at a distance of 40 inches (102 cm) from the sand. The Kinect uses infrared sensors and cameras to detect the depth of each of the points on the sand, so that the higher points are assigned a green color and the lower points a blue color; in other words, the green color is projected where there are mounds in the sand and the blue color elsewhere. In this way the geometric figures are made on the sand and a screenshot of the projected image is taken; in Fig. 2, the result of this process is shown for a circle made in the sand.

4.2 Image Processing

Once image acquisition is completed, the image processing stage begins. For this purpose, the Python library OpenCV is used, which provides tools to perform color-space segmentation. The chosen color space was BGR, since the colors of the projections are known in that scale, which is a requirement for processing; in this case, the reference color is green, equivalent in BGR to (47, 122, 16). From the target color and a range of close colors, color-space segmentation is carried out in order to separate the colors in the image.


Fig. 2. Circle projection over an AR Sandbox

Values within these parameters are painted white and the rest black; this is called a mask, which provides contrast in the image. Making use of the mask, the white part takes the green color while the rest of the image remains black, in order to return to the reference color range and demarcate the contour of the figure, facilitating the identification of shapes in the prediction model presented below. In Fig. 3, the process carried out with three images is presented: a circle, a square and a triangle, showing the captured image, the obtained mask, and the resulting image after applying the filter.
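A minimal OpenCV sketch of this masking step is shown below; the reference BGR green is taken from the text, while the tolerance range around it is an assumed, illustrative value.

```python
import cv2
import numpy as np

REFERENCE_BGR = np.array([47, 122, 16])   # reference projection green, in BGR
TOLERANCE = np.array([40, 40, 40])        # illustrative range around the reference color

def segment_projection(image_path):
    """Build the binary mask and the green-on-black result image described above."""
    img = cv2.imread(image_path)
    lower = np.clip(REFERENCE_BGR - TOLERANCE, 0, 255).astype(np.uint8)
    upper = np.clip(REFERENCE_BGR + TOLERANCE, 0, 255).astype(np.uint8)
    mask = cv2.inRange(img, lower, upper)   # white where the color is in range, black elsewhere
    result = np.zeros_like(img)
    result[mask > 0] = REFERENCE_BGR        # paint the masked region back in the reference green
    return mask, result
```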

Fig. 3. Processing of three geometric figures

5 Image Recognition Through CNN

Finally, in the scenario development, the architecture and main characteristics of the CNN are presented, together with the compilation, training and metrics parameters implemented.


Architecture. When considering the architecture of the convolutional neural network, the type of each layer, its functionality, the output shape, the number of filters and the activation parameter must be considered. Table 1 presents these elements.

Table 1. CNN's architecture

Layer            | Type          | Feature          | Output shape           | Activation
Conv2d_7         | Conv2D        | 32 filters       | (None, 128, 128, 32)   | relu
Conv2d_8         | Conv2D        | 32 filters       | (None, 126, 126, 32)   | relu
Max_pooling2d_4  | MaxPooling2D  | Pool_size (2,2)  | (None, 63, 63, 32)     | N/A
Dropout_5        | Dropout       | Rate 0.25        | (None, 63, 63, 32)     | N/A
Conv2d_9         | Conv2D        | 64 filters       | (None, 63, 63, 64)     | relu
Conv2d_10        | Conv2D        | 64 filters       | (None, 61, 61, 64)     | relu
Max_pooling2d_5  | MaxPooling2D  | Pool_size (2,2)  | (None, 30, 30, 64)     | N/A
Dropout_6        | Dropout       | Rate 0.25        | (None, 30, 30, 64)     | N/A
Conv2d_11        | Conv2D        | 64 filters       | (None, 30, 30, 64)     | relu
Conv2d_12        | Conv2D        | 64 filters       | (None, 28, 28, 64)     | relu
Max_pooling2d_6  | MaxPooling2D  | Pool_size (2,2)  | (None, 14, 14, 64)     | N/A
Dropout_7        | Dropout       | Rate 0.25        | (None, 14, 14, 64)     | N/A
Flatten_2        | Flatten       | N/A              | (None, 12544)          | N/A
Dense_3          | Dense         | 512 units        | (None, 512)            | relu
Dropout_8        | Dropout       | Rate 0.5         | (None, 512)            | N/A
Dense_4          | Dense         | 3 units          | (None, 3)              | softmax

Figure 4 shows a representation of the convolutional neural network architecture implemented, in which the layers are shown with their corresponding types, denoting the characteristics used; in addition, the output shape passed to the next layer is displayed. The network consists of four general blocks. The first block consists of two convolution layers, a pooling layer, which reduces the number of parameters while keeping the most common characteristics, and a Dropout layer to avoid overfitting; this composition of sublayers is repeated three times, so there are three such blocks with different parameters. In each of these layers, ReLU is used as the activation function, since it is the function commonly used in computer vision; the objective of these three blocks is feature extraction. Finally, the last block consists of a Flatten layer that converts the elements of the image matrix to a flat array, followed by a Dense layer with 512 hidden units, then a Dropout layer, and, as the output layer, another Dense layer with 3 units corresponding to the number of classes in the problem. This last layer uses a Softmax activation function, since a representation of the categorical distribution is needed in order to generate the classification.
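A Keras sketch that reproduces the layer sequence and output shapes of Table 1 is shown below. The 3 × 3 kernel size, the 128 × 128 × 3 input shape and the use of padding='same' on the first convolution of each block are inferred from the reported output shapes rather than stated explicitly in the table.

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(128, 128, 3), num_classes=3):
    """CNN matching the layer sequence and output shapes of Table 1."""
    m = models.Sequential()
    m.add(layers.Conv2D(32, (3, 3), padding="same", activation="relu",
                        input_shape=input_shape))
    m.add(layers.Conv2D(32, (3, 3), activation="relu"))
    m.add(layers.MaxPooling2D((2, 2)))
    m.add(layers.Dropout(0.25))
    m.add(layers.Conv2D(64, (3, 3), padding="same", activation="relu"))
    m.add(layers.Conv2D(64, (3, 3), activation="relu"))
    m.add(layers.MaxPooling2D((2, 2)))
    m.add(layers.Dropout(0.25))
    m.add(layers.Conv2D(64, (3, 3), padding="same", activation="relu"))
    m.add(layers.Conv2D(64, (3, 3), activation="relu"))
    m.add(layers.MaxPooling2D((2, 2)))
    m.add(layers.Dropout(0.25))
    m.add(layers.Flatten())                   # 14 * 14 * 64 = 12544 features
    m.add(layers.Dense(512, activation="relu"))
    m.add(layers.Dropout(0.5))
    m.add(layers.Dense(num_classes, activation="softmax"))
    return m
```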


Fig. 4. CNN’s structure

Compilation. The compilation parameters used for the convolutional neural network are presented in Table 2. The categorical_crossentropy function was chosen for the Loss parameter, since it is recommended when there are more than two classes (in this case, three), and Accuracy was chosen for the Metrics parameter, because this function gives the total percentage of successes of the CNN on a test dataset.

Table 2. Compilation parameters

Parameter  | Value
Optimizer  | RMSProp
Loss       | categorical_crossentropy
Metrics    | accuracy

Training. To train the convolutional neural network, the following parameters must be considered:

Training Data. The training dataset is composed of a total of 300 images, distributed as 100 circles, 100 squares and 100 triangles; each class contains images drawn manually as well as images acquired through the AR-SANDBOX.

Target of Training Data. The target of the training data is a NumPy array holding the class value of each element of the training dataset; in other words, it contains 100 labels of class 0, 100 of class 1 and 100 of class 2.


Batch Size. The number of samples per gradient update is 10, given that the training set is not very large; the training dataset is therefore divided into 30 batches.

Epochs. The number of times the training dataset is passed through the neural network is 10.
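A hedged sketch of the compilation and training calls matching the parameters above (RMSProp, categorical_crossentropy, accuracy, batch size 10, 10 epochs); it reuses the build_model sketch from the previous section, and x_train/y_train are placeholder arrays for the 300 training images and their labels.

# Compilation and training with the reported parameters.
# x_train (300 images) and y_train (labels 0/1/2) are placeholders.
from tensorflow.keras.utils import to_categorical

model = build_model()
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

y_train_onehot = to_categorical(y_train, num_classes=3)  # labels 0,1,2 -> one-hot
model.fit(x_train, y_train_onehot, batch_size=10, epochs=10)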

6 Test of Scenario

Once the scenario development is completed, the next step is to test it. A group of 6 people each produced 15 images in the AR-SANDBOX, distributed among circles, squares and triangles, for a total of 90 images, i.e., 30 of each class. With these 90 images, two test datasets were built: dataSetOriginal consists of the 90 images in their original state, without applying any image processing algorithm; dataSetFilter consists of the same 90 images, each segmented by color-space. Figure 5 shows an example of the datasets used. With each of these test sets, the CNN was evaluated in order to determine the accuracy, the loss function and the confusion matrix, and to observe the behavior of the ROC curve.

Fig. 5. Used datasets

6.1 Function Evaluate

DataSetOriginal. Table 3 shows the loss value, 0.72, and the percentage of successes, 70%, according to the established metrics.

Table 3. Function evaluate: DataSetOriginal

Parameter  Function                  Value
Loss       categorical_crossentropy  0.72
Metrics    accuracy                  0.70

DataSetFilter. Table 4 shows the loss value, 0.36, and the percentage of successes, 87%, according to the same metrics.

Table 4. Function evaluate: DataSetFilter

Parameter  Function                  Value
Loss       categorical_crossentropy  0.36
Metrics    accuracy                  0.87

6.2 Confusion Matrix

DataSetOriginal. Table 5 shows the number of hits the CNN obtained when tested with 30 images of each class, presented in varied order; analyzing the table, the neural network had a success rate of 61%.

Table 5. Confusion matrix: DataSetOriginal

           Circle  Square  Triangle
Circle       24      0        6
Square       13     13        4
Triangle      7      5       18

DataSetFilter. Table 6 shows the number of hits the CNN obtained when tested with 30 images of each class, presented in varied order; analyzing the table, the neural network had a success rate of 87.8%.

Table 6. Confusion matrix: DataSetFilter

           Circle  Square  Triangle
Circle       28      0        2
Square        5     22        3
Triangle      0      1       29

6.3 ROC Curve

DataSetOriginal. Figure 6 shows the ROC curve for the dataset with the original images. There are five curves, two at a general level and three at a specific level: the general ones show the micro- and macro-average areas under the curve, while the specific curves show the AUC of each of the classes 0, 1 and 2, corresponding to circle, square and triangle respectively.

Fig. 6. ROC curve: DataSetOriginal

DataSetFilter. Figure 7 shows the ROC curve for the dataset with the processed images, displaying the micro- and macro-average areas under the curve as well as the AUC of each of the geometric figure classes that compose the test dataset.


Fig. 7. ROC curve: DataSetFilter

7 Results Analysis

From the evaluation of the CNN model with the dataSetOriginal and the dataSetFilter, the following aspects were analyzed: loss function, hit metric, confusion matrix and ROC curve, with the following results. When evaluating the CNN with the dataSetOriginal, a loss of 72% (0.72) was obtained, while with the dataSetFilter it was 36% (0.36), a decrease of 36%. Likewise, the dataSetOriginal yielded a hit percentage of 70% while the dataSetFilter yielded 87%, an increase of 17% between the two test datasets. Regarding the confusion matrix, the percentage of correct answers with the dataSetOriginal was 61% while with the dataSetFilter it was 87.8%, an increase of 26.7%. Table 7 presents the specific analysis for each of the geometric figures.

Table 7. Confusion matrix analysis

Dataset          Figure    Hits (%)  Failures (%)
dataSetOriginal  Circle    80        20
dataSetOriginal  Square    43.3      56.7
dataSetOriginal  Triangle  60        40
dataSetFilter    Circle    93.3      6.7
dataSetFilter    Square    73.3      26.7
dataSetFilter    Triangle  96.7      3.3
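The per-class and overall hit rates in Tables 5-7 can be checked directly from the confusion matrices; a small NumPy sketch (treating rows as actual classes and columns as predictions, which is consistent with the reported percentages):

# Reproducing the overall and per-class hit rates of Tables 5-7.
import numpy as np

cm_original = np.array([[24, 0, 6],
                        [13, 13, 4],
                        [7, 5, 18]])
cm_filter = np.array([[28, 0, 2],
                      [5, 22, 3],
                      [0, 1, 29]])

for name, cm in [('dataSetOriginal', cm_original), ('dataSetFilter', cm_filter)]:
    overall = np.trace(cm) / cm.sum()            # 0.611 and 0.878
    per_class = np.diag(cm) / cm.sum(axis=1)     # hit rate per figure
    print(name, round(overall * 100, 1), np.round(per_class * 100, 1))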


A ROC curve is a graph that shows the performance of a classification model at all classification thresholds [7]. When analyzing a ROC curve, the determining parameter is the area under the curve (AUC), so that is the factor analyzed next. Figure 6 represents the ROC curve when using the dataSetOriginal, where according to the AUC the minimum average is 0.87 and the maximum average is 0.91, giving an average performance of 0.89. Figure 7 presents the ROC curve when using the dataSetFilter, where the minimum average AUC is 0.97 and the maximum average is 0.98, giving an average performance of 0.975, an increase of 0.085 in performance. Through a specific analysis, Table 8 presents the AUC values and variations of each of the classes using the two test datasets.

Table 8. ROC curve analysis

Dataset          Class  Figure    AUC
dataSetOriginal  0      Circle    0.87
dataSetOriginal  1      Square    0.93
dataSetOriginal  2      Triangle  0.92
dataSetFilter    0      Circle    0.98
dataSetFilter    1      Square    0.94
dataSetFilter    2      Triangle  0.98

From Table 8, it can be seen that the performance of each class varies between the two test datasets, but the AUC values are higher for the images that previously underwent color-space segmentation.
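For reference, per-class and micro-average ROC/AUC values of the kind plotted in Figs. 6 and 7 can be computed with scikit-learn as sketched below; y_test (labels 0/1/2) and y_score (the CNN softmax outputs) are placeholder names, not variables from the paper.

# Sketch of per-class and micro-average ROC/AUC computation.
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

y_bin = label_binarize(y_test, classes=[0, 1, 2])   # one column per class

auc_per_class = {}
for c in range(3):
    fpr, tpr, _ = roc_curve(y_bin[:, c], y_score[:, c])
    auc_per_class[c] = auc(fpr, tpr)

# Micro-average: pool all class/score pairs before computing the curve
fpr_micro, tpr_micro, _ = roc_curve(y_bin.ravel(), y_score.ravel())
auc_micro = auc(fpr_micro, tpr_micro)
print(auc_per_class, auc_micro)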

8 Conclusions and Future Works

On the basis of the analysis presented, it can be determined that evaluating the CNN with a dataset previously processed by color-space segmentation reduces the loss value by 36% and increases the hit rate given by the accuracy metric by 17%, since the distance between the predicted value and the expected value decreases. By applying color-space segmentation to a set of test data, a positive contribution is made to the identification and extraction of the patterns and characteristics needed for image classification. Taking the area under the ROC curve as reference, an average of 0.975 for CNN performance is obtained with the processed dataset, an increase of 0.085 with respect to the average AUC generated from the raw dataset. The implementation of these tools shows that the combination of areas such as multimedia and artificial intelligence provides a broad field of action for further research and proposals. The technologies developed here provide the opportunity to make proposals in sectors such as health and education, where work can be proposed on the development of motor skills,


development of basic knowledge in early childhood, reaction to certain situations, psychology, among others. As future work, it is proposed to implement an immersive environment using augmented reality tools and devices to support motor therapy in children, continuously monitoring the child’s emotional behavior through brain-computer interfaces. On the other hand, it is planned to expand the scope of the developed CNN, achieving the identification of other geometric figures, numbers, letters and symbols.

References 1. Rosyadi, H., Çevik, G.: Augmented reality sandbox (AR sandbox) experimental landscape for fluvial, deltaic and volcano morphology and topography models (2016) 2. Neha, S., Vibhor, J., Anju, M.: An analysis of convolutional neural networks for image classification. Procedia Comput. Sci. 132, 377–384 (2018) 3. Niioka, H., Asatani, S., Yoshimura, A., Ohigashi, H., Tagawa, S., Miyake, J.: Classification of C2C12 cells at differentiation by convolutional neural network of deep learning using phase contrast images. Hum. Cell 31, 87–93 (2018) 4. Zhang, C., et al.: White blood cell segmentation by color-space-based K-means clustering. Sensors 14(9), 16128–16147 (2014) 5. Lee, K., Lee, J., Lee, J., Hwang, S., Lee, S.: Brightness-based convolutional neural network for thermal image enhancement. IEEE Access 5, 26867–26879 (2017) 6. Yao, C., Zhang, Y., Liu, H.: Application of convolutional neural network in classification of high resolution agricultural remote sensing images. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. (2017) 7. Wald, N.J., Bestwick, J.P.: Is the area under an ROC curve a valid measure of the performance of a screening or diagnostic test? J. Med. Screen. 21, 51–56 (2014) 8. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Computer Vision Foundation (2015) 9. Shi, W., Caballero, J., Husz, F., Totz, J., Aitken, A.: Real-time single image and video superresolution using an efficient sub-pixel convolutional neural network. In: Computer Vision Foundation (2015) 10. Giorgis, S., Mahlen, N., Anne, K.: Instructor-led approach to integrating an augmented reality sandbox into a large-enrollment introductory geoscience course for nonmajors produces no gains. J. Geosci. Educ. 65, 283–291 (2017) 11. Woods, T., Reed, S., His, S., Woods, J., Woods, M.: Pilot study using the augmented reality sandbox to teach topographic maps and surficial processes in introductory geology labs. J. Geosci. Educ. 64, 199–214 (2016) 12. Restrepo Rodríguez, A.O., Casas Mateus, D.E., García, G., Alonso, P., Montenegro Marín, C.E., González Crespo, R.: Hyperparameter optimization for image recognition over an ARsandbox based on convolutional neural networks applying a previous phase of segmentation by color–space. Symmetry 10, 743 (2018). https://doi.org/10.3390/sym10120743

License Plate Detection and Recognition: An Empirical Study Md. J. Rahman(B) , S. S. Beauchemin(B) , and M. A. Bauer(B) The University of Western Ontario, London, ON N6A 5B7, Canada {mrahm46,sbeauche,bauer}@uwo.ca

Abstract. Vehicle License Plate Detection and Recognition has become critical to traffic, security and surveillance applications. This contribution aims to implement and evaluate different techniques for License Plate Detection and Recognition in order to improve their accuracy. This work addresses various problems in detection such as adverse weather, illumination change and poor quality of captured images. After detecting the license plate location in an image, the next challenge is to recognize each letter and digit. In this work, three different approaches have been investigated to find which one performs best: characters are classified through template matching, a multi-class SVM, and a convolutional neural network. The performance was measured empirically, with 36 classes and 400 images per class used for training and testing, and the empirical accuracy of each algorithm was assessed. Keywords: Image processing · License Plate Detection · License Plate Recognition (LPR) · License plate segmentation · Optical Character Recognition (OCR) · Deep learning · HOG with SVM · Template matching

1 Introduction

There is a continuous need for managing traffic intelligently, and there are many ways of identifying vehicles. In cities, highways, and in parking areas, any vehicle can be identified by recognizing its license plate for the automated parking or toll fees payment system. License Plate Recognition (LPR) has been a research interest for more than a decade. From the surveillance videos, license plate images are acquired from the front or back of a vehicle. The recognition process is completed in three steps. First, the license plate region is localized from the input image. From the extracted license plate, each character is segmented. Finally, the characters are recognized to obtain the plate information. There are several challenges both in the detection and recognition tasks. Weather conditions play a crucial role in the image quality resulting from the acquisition process. Variance in illumination also poses difficulties for the LP detection process. Therefore, most of the previous work posed constraining assumptions regarding the conditions of image acquisition [1]. In addition, there is a demand for such LP


recognition systems to operate in real-time, with a generally accepted frame rate of 20fps. It is expected that algorithms be sufficiently efficient to meet such execution time criteria. The aim of this research is to detect and recognize license plates from rear images of vehicles, and to compare different character recognition methods based on accuracy. Another goal of this work is to investigate whether it is possible to avoid any specialized hardware and improve accuracy.

2 Related Works

The number of vehicles in service grows worldwide every year. Managing an increasing number of vehicles has become a major concern that includes issues such as compliance, security, and maintenance. Automated LP identification is an important part of the ongoing vehicular management effort. Researchers have used various techniques in every stage of license plate recognition. In this three-step process, LP localization is the most important part since the overall accuracy mostly depends on it [2]. Ponce et al. [3] have designed systems that learn license plate patterns in order to search for them in whole images. This idea is limited to specific types of license plates because the shape and aspect ratio of the LP vary by region. Kim et al. [4] and Vichik [5] have tried to match shape and color. According to Halina et al. [6], these are not appropriate for localizing LPs; instead, more importance should be given to finding the features of the object that are common to all LPs. Bai et al. [7,8] correctly remarked that LP sub-areas contain additional features in a usual traffic environment, as they display several distinctive characters and colored shapes located in small areas. Indeed, different countries use various code bindings in their LPs. Some researchers have taken interest in identifying and segmenting LP regions based on such features [9,10]. There currently exist three lines of approach to LP localization. First, there are edge-based methods where edges are used to identify LP regions. Second, there are region-based methods in which certain areas of the image are searched for a match with the LP model. Finally, there are hybrid approaches that combine both methods. Researchers have also experimented with deep learning techniques for LP localization [11–13]. However, these frameworks need large amounts of training data to progressively make the networks perform correctly, and this data may or may not be available for local license plates. Once the LP is localized in the scene, the next stage is to identify the LP characters. First, there is a need to acquire an adequate set of local license plate images. Different regions have different LP designs, and character set modeling is complicated by the fact that the font types are not generally known. Due to the lack of a local LP character dataset, we adopt a deep learning approach using Convolutional Neural Networks (CNNs) that is able to learn from any simple set of fonts and map it to the local character set. Many CNN implementations have focused on recognizing a single object in a queried image [14–16,21]. Our problem is to recognize multiple objects in a single pass, since an LP contains multiple symbols.

3 Datasets

Two datasets have been used for this research. The first dataset (used for LP localization) was published by Ribarić et al. [17]. It contains 500 rear colour images of parked vehicles taken by an Olympus C-2040 Zoom digital camera. These pictures were taken under various weather conditions such as sunny, cloudy, rainy, dusk and night time. Samples from this dataset are shown in Fig. 1. In the character recognition phase we use a font dataset for machine learning training purposes (for our use of Support Vector Machines and Convolutional Neural Networks). This dataset is part of the Char74K-15 dataset [18], shown in Fig. 2. It includes various font styles such as bold and italic. There are 36 letters and digits possible for a license plate (0–9 and A–Z). For each letter and digit, 1000 sample images are used. All these images are grayscale, which allows us to train with color invariance.

Fig. 1. Croatian LP dataset

Fig. 2. Font dataset

Fig. 3. System overview

Fig. 4. A. Input image B. Grey image C. Thresholded and binarized image D. Detected edges from C.

4 Methodology

We proceed to describe the methodology we established for this comparative work. The workflow for our methods is depicted in Fig. 3 and includes the following stages: pre-processing, detection, segmentation, and character recognition. We present these stages in detail below.

4.1 Preprocessing

The input images are color RGB images with various aspect ratios. These input images are first converted to greyscale, in an attempt to create a form of color invariance in our processing (color is not descriptive of license plate information). The first step is to locate the LP within the image. Note that the edge information from letters and digits is sharply distinctive from the LP background. Here we employed the idea of locating areas where the edge map density is the highest. Naturally, the border lines and the set of characters inside an LP constitute an image area where edge density is high. The image is first thresholded and then binarized prior to computing the edge map. The edge map is obtained with Robert's operator [19], which considers high spatial frequencies in the greyscale image. Figure 4 demonstrates the preprocessing phase.

4.2 License Plate Detection

Traditional object detection algorithms based on machine learning require a good dataset of positive and negative samples of the object. Due to the lack of such an expansive dataset, machine learning techniques could not be applied here. However, Naikur et al. [20] proposed an image processing technique where the edges of objects in the image are processed and the license plate is localized. The idea behind this technique is to build two histograms, one horizontal and the other vertical. The horizontal histogram goes through every row of the image and calculates a total difference based on each pixel value. Starting from the second

Fig. 5. Horizontal and vertical edge processing


pixel, every pixel value is subtracted from its previous one and then added into a sum. The vertical histogram does the same for every column. Generally, more edges are expected in the LP area than in other areas. Figure 5 illustrates the algorithm. As anticipated, the dense edge area found by the algorithm is the LP area. However, this algorithm has several limitations. The most common problem arises when the image includes background clutter around the LP, causing the algorithm to find an area other than that of the true LP. Therefore, we use images mainly focused on the license plate with a minimum size of 130 × 25 pixels.
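A minimal sketch of the row/column edge-density projections described above; the threshold at half of the maximum projection value is an illustrative choice, not a parameter taken from the paper.

# Row/column edge-density projections for LP localization.
import numpy as np

def edge_projections(gray):
    # Sum of absolute differences between consecutive pixels
    row_hist = np.abs(np.diff(gray.astype(int), axis=1)).sum(axis=1)  # per row
    col_hist = np.abs(np.diff(gray.astype(int), axis=0)).sum(axis=0)  # per column
    return row_hist, col_hist

def candidate_plate_region(gray):
    row_hist, col_hist = edge_projections(gray)
    rows = np.where(row_hist > 0.5 * row_hist.max())[0]
    cols = np.where(col_hist > 0.5 * col_hist.max())[0]
    # Bounding box of the densest edge area: (top, bottom, left, right)
    return rows.min(), rows.max(), cols.min(), cols.max()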

Fig. 6. Character segmentation

4.3 Segmentation

Once the license plate is localized from the image, the next step is the segmentation of the characters from the plate. In the segmentation phase, the LP image is converted to greyscale and thresholded. In order to find the individual characters inside the plate, we perform connected component analysis, as illustrated in Fig. 6. All the dark pixels which are 8-neighbour connected are considered, to the exclusion of all others. Every individual segment is saved as greyscale images of size 16 × 16, as shown in Fig. 7.
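A hedged sketch of this segmentation step using 8-connected component analysis; the Otsu binarization and the minimum-area filter are assumptions added for illustration.

# Character segmentation via 8-connected components.
import cv2

def segment_characters(plate_gray):
    # Characters are dark on a light background: invert after Otsu threshold
    _, binary = cv2.threshold(plate_gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    chars = []
    for i in range(1, n):                       # label 0 is the background
        x, y, w, h, area = stats[i]
        if area < 20:                           # discard small noise blobs
            continue
        char = plate_gray[y:y + h, x:x + w]
        chars.append((x, cv2.resize(char, (16, 16))))
    return [c for _, c in sorted(chars, key=lambda t: t[0])]  # left-to-right order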

Fig. 7. Template matching

4.4 Character Recognition

With the characters segmented from LP in the recognition phase, every individual letter and digit needs to be classified in one of 36 classes. For this purpose, three different algorithms are implemented and their performance evaluated. Template Matching. Template matching [17] is a well-known image comparison technique. One of the solutions for this classification problem is matching the segmented images with letter and digit templates. A straightforward way of matching the letter with the template is by computing the logical AND for all the 36 character images. The region with the highest matching score to a template yields the recognized symbol.
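A minimal sketch of the logical-AND template matching described above; 'templates' is assumed to be a dictionary mapping each of the 36 symbols to a binary 16 x 16 template.

# Logical-AND template matching over 36 symbol templates.
import numpy as np

def match_character(char_binary, templates):
    # char_binary: 16x16 boolean array of the segmented character
    scores = {sym: np.logical_and(char_binary, tpl).sum()
              for sym, tpl in templates.items()}
    return max(scores, key=scores.get)   # symbol with the highest overlap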


HoG with SVM. Characters and digits have unique shapes. This shape information can be represented with gradients. Histograms of Oriented Gradients (HoG), proposed by Dalal and Triggs [20] adequately capture shapes. More importantly, this feature is invariant to rotation and illumination [20]. These characteristics enhance the detection process. Every character image in the dataset has 1000 instances. While training the SVM, it was found that 320 images per character are sufficient (and 80 images for testing). Hence out of 400 images, a ratio of 80%–20% is used for training and testing. A multiclass SVM is used to classify the 36 symbols. Figure 8 illustrates the process. The character images are first passed as input and preprocessed as before. The HoG feature is calculated on the whole image. Among the cell sizes 8 × 8, 4 × 4 and 2 × 2, 4 × 4 yields the best result. The length of the feature is 324. Among the 400 × 36 features, 320 × 36 are used for training and 80 × 36 for testing. The multiclass SVM uses the one versus all approach to classify symbols.
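A sketch of the HoG + multiclass SVM pipeline described above. The 4 x 4 cell size follows the text, but the block size and number of orientations are assumptions, so the resulting feature length may differ from the reported 324; train_images, train_labels and test_images are placeholders.

# HOG features + linear multiclass SVM (one-vs-rest).
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(images):
    # images: list of 2D grayscale character images
    return np.array([hog(img, pixels_per_cell=(4, 4), cells_per_block=(2, 2),
                         orientations=9) for img in images])

clf = LinearSVC()                       # one-vs-rest multiclass SVM
clf.fit(hog_features(train_images), train_labels)
pred = clf.predict(hog_features(test_images))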

Fig. 8. Character classification using a combination of HOG features and an SVM classifier

Convolutional Neural Network. Finally, a deep learning approach is investigated. Chun et al. [22] have adopted several techniques with Convolutional Neural Networks (CNNs) on various sets of images; they used a CNN as a feature extractor and then performed the classification using K-Nearest Neighbour. Similarly, Lauer et al. [14] have proposed a system where the CNN is used as a feature extractor and the classification is performed with an SVM, achieving 97.06% classification accuracy. In this contribution, we use a CNN directly. As the CNN itself extracts the features, only the preprocessed character images are passed to it. Figure 9 illustrates the stacked deep learning architecture. The first two stages use two convolution filters of size 5 × 5. The third stage consists of the fully connected layers. There are three fully connected layers with three different filters: the first one is of size 20 × 20 to cover a larger area, the second one is 10 × 10, and lastly there is a 1 × 1 filter which goes through every pixel of the feature image. The last layer is softmax.

5 Experiments and Results

5.1 Experimental Setup

All the experiments ran on a machine with a Core i3 processor, 12 GB of RAM and a 500 GB HDD. Matlab was used for the implementation, together with its built-in toolboxes.

Fig. 9. Deep learning with CNN for character classification

5.2 License Plate Detection

Figure 10 shows a sample of the results from the Croatian car rear image dataset. Notably, the LPs are detected under various illumination and weather conditions; the change of illumination across the images can clearly be seen. It is of note that the larger and clearer LPs are well detected even under rotational variance.

Fig. 10. Montage of detected license plates

5.3 Recognition

Template Matching. Figure 11 shows the character segmentation results. The upper image is the LP extracted from the edge processing; the one below is the collection of individual characters segmented from the plate.

Fig. 11. Segmentation

HoG with SVM. The table illustrated in Fig. 12 is the confusion matrix built from the test results of the HoG with SVM classification. Figure 13 shows a comparison between the template matching and SVM test results. It has been observed that HoG with SVM obtains better results than template matching.

Fig. 12. HOG with SVM classification confusion matrix

CNN Classification. For each class, 400 images were considered for training and testing. The images were resized to 200 × 200. The batch size varied from 50 to 200, the learning rate from 0.1 to 0.001, and the number of epochs from 10 to 50. The best test error obtained is 2.94% (shown in Fig. 14). A comparative table for the algorithms evaluated herein is given in Fig. 15.


Fig. 13. SVM and template matching result comparison

Fig. 14. CNN classification

Fig. 15. Time and accuracy comparison of algorithms for 50 epochs


6 Conclusions

From these experiments it has become evident that license plate detection and recognition can be performed on digital images of license plates using a single average computer; the technique does not require any specialized hardware. Secondly, three machine learning techniques have been implemented and their performance has been compared. The CNN outperforms the other two with an overall 97% accuracy. The classification accuracy of all three methods is given in Fig. 15.

References 1. Anagnostopoulos, C.N.E., Anagnostopoulos, I.E., Psoroulas, I.D., Loumos, V., Kayafas, E.: License plate recognition from still images and video sequences: a survey. Trans. Intell. Transp. Sys. 9(3), 377–391 (2008) 2. Jia, W., Zhang, H., He, X.: Region-based license plate detection. J. Netw. Comput. Appl. 30(4), 1324–1333 (2007) 3. Roberts, L.G.: Machine perception of three-dimensional solids. Ph.D. thesis, Massachusetts Institute of Technology. Department of Electrical Engineering (1963) 4. Sandler, R., Vichik, S., Rosen, A.: License plate recognition - final report 5. Sandler, R., Vichik, S., Rosen, A.: Moving car license plate recognition - semesterial project, final report 6. Saha, S., Basu, S., Nasipuri, M.: License plate localization using vertical edge map and hough transform based technique, pp. 649–656. Springer, Heidelberg (2012) 7. Hongliang, B., Changping, L.: A hybrid license plate extraction method based on edge statistics and morphology, vol. 2, pp. 831–834 (2004) 8. Bai, H., Zhu, J., Liu, C.: A fast license plate extraction method on complex background. In: 2003 IEEE Intelligent Transportation Systems, Proceedings, vol. 2, pp. 985–987. IEEE (2003) 9. Wei, W., Wang, M., Huang, Z.: An automatic method of location for numberplate using color features. In: 2001 International Conference on Image Processing, Proceedings, vol. 1, pp. 782–785. IEEE (2001) 10. Kim, K.I., Jung, K., Kim, J.H.: Color texture-based object detection: an application to license plate localization. In: Pattern Recognition with Support Vector Machines, pp. 293–309. Springer (2002) 11. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: Overfeat: integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229 (2013) 12. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014) 13. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015) 14. Lauer, F., Suen, C.Y., Bloch, G.: A trainable feature extractor for handwritten digit recognition. Pattern Recognit. 40(6), 1816–1824 (2007). 19 pp 15. Kim, K.I., Kim, K.K., Park, S.H., Jung, K., Park, M.H., Kim, H.J.: Vega Vision: a vision system for recognizing license plates (1999) 16. Ribaric, S., Adrinek, G., Segvic, S.: Real-time active visual tracking system, vol. 1, pp. 231–234, May 2004


17. Brunelli, R.: Template Matching Techniques in Computer Vision: Theory and Practice. Wiley, Hoboken (2009) 18. De Campos, T.E., Babu, B.R., Varma, M.: Character recognition in natural images. In: proceedings of the International Conference on Computer Vision Theory and Applications, Liston, Portugal, February 2009 19. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995) 20. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, vol. 1, pp. 886–893. IEEE (2005) 21. Goodfellow, I.J., Bulatov, Y., Ibarz, J., Arnoud, S., Shet, V.: Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082 (2013) 22. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)

Automatic Object Segmentation Based on GrabCut Feng Jiang1(B) , Yan Pang2 , ThienNgo N. Lee1,2 , and Chao Liu1,2 1

Metropolitan State University Denver, Denver, CO 80217, USA [email protected] 2 University of Colorado Denver, Denver, CO 80203, USA

Abstract. Object segmentation is used in multiple image processing applications, but it is generally difficult to perform fully automatically. Most object segmentation schemes are developed based on prior information, a training process, existing annotations, special mechanical settings or human visual system modeling. We propose a fully automatic segmentation method that does not rely on any training/learning process, existing annotations, special settings or the human visual system. The automatic object segmentation is accomplished by an objective object weight detection and a modified GrabCut segmentation. The approach we propose is developed only from the inherent image features; it is independent of the dataset and can be applied to different scenarios. The segmentation result is illustrated by testing a large dataset.

Keywords: Segmentation · Saliency · GrabCut · Graph cuts

1 Introduction

Object segmentation is the essential task of many applications in computer vision, such as object recognition, image parsing, and scene understanding. However, most object segmentation algorithms are constructed based on some prior information, a training process, or human interaction: the segmentation process requires users to provide segmentation cues manually [1–6]. Manual labeling can be very time consuming when the dataset is large, and its accuracy cannot be guaranteed. Automatic segmentation, which does not require human interaction or a training process, is more efficient and effective for large datasets, but it is generally hard to perform fully automatic object segmentation without a training process or human interaction. Several automatic object segmentation methods have been proposed recently [7–12]. These methods fall into three main categories: (1) automatic segmentation with ground truth input or annotations from a database; (2) automatic segmentation based on special mechanical settings or designed only for special images; (3) automatic segmentation based on visual saliency models.


Carreira [7,8] has proposed an automatic foreground/background segmentation which is performed by ranking plausible figure-ground object hypotheses. However, the segmentation process is performed utilizing ground truth annotations available in object class recognition datasets. Object segmentation using ground truth annotations cannot be applied to all types of images and is thus not dataset independent. Won [9] proposed a segmentation algorithm detecting the object of interest automatically based on the block-wise maximum a posteriori (MAP) algorithm. However, this method is only designed for images with low depth of field: the background of the testing image must be defocused and blurred. Campbell [10] designed an automatic 3D object segmentation algorithm requiring no user input. However, this method relies on the camera fixating on the object of interest in the sequence. Though there is no user input involved, the segmentation process requires special camera settings. Segmentation algorithms taking advantage of stereo correspondence or based on special camera settings cannot be applied to general use. Human visual system models such as saliency maps [11,12] are also used in automatic object segmentation. Fu [12] proposed automatic object segmentation based on saliency calculation and performs an auto-labeling approach using the saliency map; the labeling result is set to be the initial input of graph cuts segmentation. Jung [11] proposed an automatic object segmentation method based on spectral-domain saliency detection and a graph-cut-based segmentation algorithm. These two segmentation methods are constructed based on the assumption that the salient area is consistent with the target area of the segmentation, which is not always true in certain scenarios (the reason is discussed in the next section). To satisfy more practical requirements, automatic object segmentation should be unsupervised, dataset independent, not reliant on special mechanical settings, and applicable to different scenarios. To satisfy these practical requirements, we propose an automatic object segmentation approach using object weight map detection and a variation of GrabCut segmentation. We call it "Objective Weight Cut".

2 The Motivation of Our Approach

Object segmentation based on human visual system modeling approaches, such as the saliency map, does not require human interaction and can be applied to general images or large datasets without special mechanical requirements. This type of segmentation assumes that the salient part of an image is usually consistent with the object to be segmented. However, such an assumption is not always true in real applications. For example, we calculated the saliency map using the graph-based visual saliency model [13] and two typical spectral saliency models [14,15] for the image "sailboat on lake", as shown in Fig. 1. This is an image of natural scenery; the banks of the shore appear to be a golf course, so the image could be a photograph of the sailboat or of the golf course. The most salient part of this image is


Fig. 1. Saliency weight map for the image (a) “sailboat on lake” generated by (b) GBVS saliency model and spectral saliency model using (c) DCT and (d) FFT transform

measured to be the trees in the front. It is not difficult to understand that the trees in the front have more outstanding image features, which are more easily captured by human attention; objects with more outstanding image features are considered to have higher saliency weight. However, the object of interest in this image could be a number of items, such as the sailboat or the golf course. It is not rigorous to argue that the object to be segmented is always the most salient part of the image; it is common that an important object area within the image, such as the golf course, is not a salient area. As introduced in [16], the saliency value is determined by first generating the weighted dissimilarity of local image features and then simulating the human visual system by calculating the steady state of the Markov chain formed by the local image feature locations. It is a bottom-up modeling approach for human attention fixation. However, modeling the human visual system to detect the most attractive location does not always help to detect the most significant image region, since the global structure of the object is neglected when modeling human attention. We should therefore not rely on the saliency calculation for automatic object segmentation. If the cues of segmentation are instead determined based on both the inherent local and global image features, a fair detection of the most significant image feature region can be performed. The contributions of this paper are that we apply our own weight map generation algorithm and develop a variation of GrabCut segmentation for grayscale images. While a saliency map and graph-cut based segmentation are used in [11,12], our method is different in that we generate a totally objective weight map instead of modeling the human visual system, and we use a variation of GrabCut segmentation that models the intensity, texture and structure distribution of the image. Our method can detect the object regions accurately and automatically.

3 Objective Weight Cut

3.1 Objective Object Weight Map

An objective object weight map detection method based on both local and global image feature extraction was proposed in [17]. This weight map generation algorithm was initially proposed for quality assessment of secret images. The weight map highlights the local features within an image that are very different from neighboring locations, and at the same time emphasizes the global contour structure of the object. Compared with the saliency map, the weight map we generate is fully objective, based only on the inherent image features, as shown in Fig. 2. No human visual system model is involved. The objective weight map detects outstanding image features whether or not they attract human attention: outstanding local image features are detected uniformly at any location, and the global object contour feature is detected and combined with the local image features to form the final objective weight map.

Fig. 2. The objective weight map compared with GBVS [16] saliency weight map

As shown in Fig. 2, the objective weight maps we generated for the tested images do not place any stress on the attention-grabbing image features but instead produce an overall smooth weight map for the main object of the image. Both the outstanding local features and the outstanding global contour features are marked out. The detection produces a totally objective weight map based only on the inherent image features. Our automatic object detection method, "Objective Weight Cut", is formed by performing a variation of GrabCut segmentation using the objective weight map.


Fig. 3. Flow chart of proposed “Objective Weight Cut”

The flow chart of the proposed "Objective Weight Cut" is shown in Fig. 3. GrabCut [2] segmentation uses the power of the "Graph Cut" [4] algorithm, which was designed to solve the "Min Cut" optimization problem. The main idea of the algorithm is to define an optimization problem (through an energy cost function E) which can be solved by creating a specific graph model; GrabCut generates the final optimized segmentation result by performing iterative Graph Cuts. An energy function E is defined so that its minimum corresponds to a good segmentation [2]:

E(α, k, θ, z) = U(α, k, θ, z) + V(α, z).                                               (1)

The smoothness term V is defined as

V(α, z) = γ Σ_{(m,n) ∈ C} dis(m, n)^{-1} [αn ≠ αm] exp(−β ||zm − zn||²),               (2)

and the data term U is defined as

U(αn, kn, θ, zn) = −log π(αn, kn) + 0.5 log det Σ(αn, kn)
                   + 0.5 [zn − μ(αn, kn)]^T Σ(αn, kn)^{−1} [zn − μ(αn, kn)].           (3)

Here, the set of all the parameters (the modeling parameters) is denoted by θ:

θ = {π(αn, kn), μ(αn, kn), Σ(αn, kn)},  αn ∈ {0, 1},  kn ∈ {1, ..., K}.                (4)

The smoothness term V measures how smooth the labeling αn of neighboring pixels is. The operator [·] in [αn ≠ αm] denotes the indicator function, taking the value 0 or 1, so the energy function is increased by neighboring pixels that do not have the same labeling values; αn ∈ {0, 1} distinguishes foreground from background pixels. z represents the pixel values (such as color values), and C is the set of pairs of neighboring pixels. β ensures that the exponential part in the smoothness term switches appropriately between high and low contrast, γ is a constant, π denotes the Gaussian mixture weighting coefficients and Σ(·) the covariance matrices. The data term U measures how well the distribution of the pixel values fits a given distribution model θ; in practice, U is computed as the log-likelihood under the existing model. For example, in the original GrabCut [2], the pixel value refers to the color of the pixels and the corresponding model is a Gaussian Mixture Model (GMM) with K components; each pixel with labeling αn is covered by a unique GMM component with parameter kn, kn ∈ {1, ..., K}. The values of the parameters K and β are set as suggested by Itay Blumenthal [18]: β affects the shape of the exponent function within the smoothness term, and a higher K allows a better description of the distribution. The constant γ was set to 50, which has proved to be a versatile setting for a wide variety of images [19].
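For reference, the standard (color) GrabCut iteration of Eq. (1) is available in OpenCV; the sketch below shows the usual rectangle-initialized call on which the grayscale variant described in the following subsections builds. The input file name and the initial rectangle are placeholders.

# Standard (color) GrabCut as implemented in OpenCV.
import cv2
import numpy as np

img = cv2.imread('input.jpg')                       # 8-bit BGR image (placeholder path)
mask = np.zeros(img.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)           # GMM parameters (background)
fgd_model = np.zeros((1, 65), np.float64)           # GMM parameters (foreground)
rect = (10, 10, img.shape[1] - 20, img.shape[0] - 20)  # placeholder initial rectangle

cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)
foreground = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0)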

Fig. 4. The structure, texture and intensity distribution modeling

The data term U is actually calculating the log-likelihood of both the foreground and the background distribution models. Here we modified the original GrabCut algorithm for grayscale images: we evaluate the structure, texture and intensity distributions with GMMs over three different grayscale image components, the objective weight map, the normalized local entropy and the intensity value. The three components are then fed into the log-likelihood calculation to measure the level of coherence among these features. The data


term of the energy function is determined by this distribution modeling approach. The smoothness term is determined by calculating the horizontal and vertical gradients of the image. The data term and the smoothness term are then used as the initial input of the iterative Graph Cut. The iterative Graph Cut segmentation can be viewed as a "Min Cut" optimization problem performed by taking initial "inter-neighboring-pixels" weights as input. The optimization is performed iteratively until convergence; the process stops when the difference between two loops is less than 0.1% or the maximum number of iterations is reached. Good optimization results are obtained by segmenting together neighboring pixels with similar texture, similar intensity and similar structure.

3.2 The Texture, Structure, Intensity Distribution Modeling

To determine the data term of the energy function, as shown in Fig. 4, three different elements are used to determine the likelihood of the distributions for both the potential foreground object and the background region. For grayscale images there is no color information; inspired by the color distribution modeling in the original GrabCut algorithm, we model the global structure, local texture and intensity distribution of the grayscale image. The objective weight map, the normalized local entropy map and the intensity of the original image are used to represent the texture distribution, the structure distribution and the intensity distribution of the original image, respectively. Similar to the work on tattoo segmentation in [20], which developed a skin color model for tattoo images, we developed our own distribution model for grayscale images. The objective weight map is the output of the object detection, the local entropy map is generated by performing entropy filtering on the entire image, and the objective weight map and the intensity map are normalized to [0, 1]. The distribution likelihood is then calculated by K-component Gaussian Mixture Models (GMMs). Note that, unlike Graph Cut approaches that calculate the data term using histogram modeling, we model not only the distribution of the intensity values but also the global structure and local texture characteristics of the image.
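A hedged sketch of this grayscale data-term modeling: each pixel is described by (objective weight, local entropy, intensity) and foreground/background GMMs provide the log-likelihoods. The weight map is assumed to be precomputed by the objective weight detection, fg_idx/bg_idx are placeholder index sets of initial foreground/background pixels, and K = 5 is an illustrative choice.

# Three-component feature stack and GMM log-likelihoods for the data term.
import numpy as np
from skimage.filters.rank import entropy
from skimage.morphology import disk
from sklearn.mixture import GaussianMixture

def pixel_features(gray_uint8, weight_map):
    ent = entropy(gray_uint8, disk(5)).astype(float)       # local texture measure
    feats = np.stack([weight_map / weight_map.max(),
                      ent / ent.max(),
                      gray_uint8 / 255.0], axis=-1)        # H x W x 3, all in [0, 1]
    return feats.reshape(-1, 3)

def data_term(feats, fg_idx, bg_idx, K=5):
    fg_gmm = GaussianMixture(n_components=K).fit(feats[fg_idx])
    bg_gmm = GaussianMixture(n_components=K).fit(feats[bg_idx])
    # Per-pixel log-likelihood under each model (higher = better fit)
    return fg_gmm.score_samples(feats), bg_gmm.score_samples(feats)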

3.3 Segmentation Based on Iterative Graph Cut

We then detect the "cumulated weight center" by checking each pixel location and finding the pixel whose cumulated weight within a large neighborhood is maximal; the size of the neighborhood is set to 50% of the width of the image. We define a candidate circle region for the subsequent segmentation by gradually increasing the radius from the cumulated weight center until the sum of the cumulated weight reaches 75% of the sum of all the weights. The parameter used here can be set differently without changing the final segmentation result; 75% is good enough to maintain segmentation efficiency and large enough to cover most object areas in our experiments on a large dataset of 101 different types of objects [21]. The inside part of the candidate circle region


is set to be the initial foreground input and the outside part of the candidate circle region is set to be the background input of the iterative graph cut segmentation. The iterative Graph Cut segmentation takes the candidate circle region as the initial data term. To calculate the smoothness term of the energy function, the horizontal and vertical gradients are used as the intensity difference between pixels. The Graph Cut segmentation is performed iteratively until convergence. We then perform post-processing of the original segmentation result. As shown in Fig. 5, the initial segmentation result may not be very accurate because the candidate region is set to be large, and for some grayscale images the target object and the background area may have very similar intensity and texture. We first perform a morphological "open" to generate an overall smooth region. Then we eliminate large areas of false positive detection, named "false positive patches"; the target patch and the false positive patches are differentiated by their location relative to the cumulated weight center. The result is then optimized by eliminating the small patches to generate a smooth and integrated segmentation result, as shown in Fig. 5.
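A sketch of the cumulated-weight-center search and the growing candidate circle described above (neighborhood of 50% of the image width, radius grown until 75% of the total weight is covered); the uniform box filter used to accumulate the weights is an implementation assumption.

# Cumulated weight center and candidate circle region.
import numpy as np
from scipy.ndimage import uniform_filter

def candidate_circle(weight_map):
    h, w = weight_map.shape
    # Cumulated weight within a large square neighborhood around each pixel
    neigh = uniform_filter(weight_map, size=w // 2)
    cy, cx = np.unravel_index(np.argmax(neigh), neigh.shape)

    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)
    total = weight_map.sum()
    for r in range(1, max(h, w)):
        if weight_map[dist <= r].sum() >= 0.75 * total:
            return (cy, cx), r              # centre and radius of the candidate region
    return (cy, cx), max(h, w)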

Fig. 5. Post-processing of iterative Graph Cut segmentation for image “car”

4 Experiment Result

We tested 105 images using our Objective Weight Cut. The test set is a combination of four grayscale images from [17] and 101 images randomly selected from the "Caltech 101" dataset [21]. Caltech 101 is a dataset of digital images created in September 2003 and compiled by Fei-Fei Li, formed by images of 101 different categories of objects. The segmentation is performed fully automatically, with no training, no annotations and no initial seeding. As there is no ground truth or existing annotation against which to compare our segmentation performance, we demonstrate the effectiveness of our automatic segmentation by presenting a number of segmentation results of the tested images in Fig. 6. The first row of every two rows is the original image and the second row of every two rows is the segmentation result.


Fig. 6. The final segmentation result.

First, for images with a large intensity difference between foreground and background, such as the images "lotus" and "car", the final segmentation detects the object area whether the background is clear or noisy. For images without a large foreground-background intensity difference, if the object to be detected has a clear contour structure, such as the "Garfield" images, the main part of the object can still be segmented out. Even images without an obvious foreground-background intensity difference or an obvious contour structure can still be segmented correctly if the texture features of the foreground and the background are different. For example,


the "hedgehog" and the "owl" do not have a great intensity difference with the background, and their contour structures are not strong either; however, the automatic segmentation still performs very well because the texture of the object is very different from the background. For images without obvious intensity change, contour structure or texture features, the segmentation task is not easy even for human eyes. The most important difference between our approach and segmentation methods based on a learning/training process is that our approach segments the objects objectively, using only the input image features. Being objective is especially useful when the input dataset is huge, computing resources are limited and a learning/training process is not affordable; the objective object segmentation can detect the regions with significant image features faster and more efficiently than object detection algorithms based on a learning/training process. Our approach is also different from all the schemes based on human visual system modeling. Unlike the human visual system models, no assumption about the object location is made in our approach and no concentration process is adopted; each location of the input image is weighted equally. Our objective object segmentation can perform the segmentation task efficiently without any training/learning process, human interaction, existing dataset annotation or additional mechanical setting requirements.

5 Conclusion

We proposed an automatic segmentation method based on objective weight detection and a modified GrabCut segmentation. The approach realizes fully automatic processing without a training/learning process, human interaction, dataset annotation or special mechanical settings. It is a fully automatic segmentation approach that is suitable for various datasets. The segmentation results are very promising, as illustrated by testing the well-known dataset [21] for object detection.

References 1. Kohli, P., Torr, P.H.: Dynamic graph cuts for efficient inference in markov random fields. IEEE Trans. pattern Anal. Mach. Intel. 29(12), 2079–2088 (2007) 2. Rother, C., Kolmogorov, V., Blake, A.: GrabCut: interactive foreground extraction using iterated graph cuts. In: ACM Transaction Graphics (TOG), vol. 23, ACM, pp. 309–314 (2004) 3. Boykov, Y., Funka-Lea, G.: Graph cuts and efficient N-D image segmentation. Int. J. Comput. Vis. 70(2), 109–131 (2006) 4. Boykov, Y.Y., Jolly, M.-P.: Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In: 2001 Proceedings Eighth IEEE International Conference on Computer Vision, ICCV 2001, vol. 1, pp. 105–112. IEEE (2001) 5. Chuang, Y.-Y., Curless, B., Salesin, D.H., Szeliski, R.: A Bayesian approach to digital matting. In: 2001 Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, vol. 2, p. II. IEEE (2001)


6. Mortensen, E.N., Barrett, W.A.: Intelligent scissors for image composition. In: Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, pp. 191–198. ACM (1995) 7. Carreira, J., Sminchisescu, C.: Constrained parametric min-cuts for automatic object segmentation. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3241–3248. IEEE (2010) 8. Carreira, J., Sminchisescu, C.: CPMC: automatic object segmentation using constrained parametric min-cuts. IEEE Trans. Pattern Anal. Mach. Intel. 34(7), 1312– 1328 (2012) 9. Won, C.S., Pyun, K., Gray, R.M.: Automatic object segmentation in images with low depth of field. In: 2002 Proceedings 2002 International Conference on Image Processing 2002, vol. 3, pp. 805–808. IEEE (2002) 10. Campbell, N.D., Vogiatzis, G., Hern´ andez, C., Cipolla, R.: Automatic 3D object segmentation in multiple views using volumetric graph-cuts. Image Vis. Comput. 28(1), 14–25 (2010) 11. Jung, C., Kim, C.: A unified spectral-domain approach for saliency detection and its application to automatic object segmentation. IEEE Trans. Image Process. 21(3), 1272–1283 (2012) 12. Fu, Y., Cheng, J., Li, Z., Lu, H.: Saliency cuts: an automatic approach to object segmentation. In: 2008 19th International Conference on Pattern Recognition ICPR 2008, pp. 1–4. IEEE (2008) 13. Harel, J., Koch, C., Perona, P., et al.: Graph-based visual saliency. In: NIPS, vol. 1, p. 5 (2006) 14. Schauerte, B., Stiefelhagen, R.: Quaternion-based spectral saliency detection for eye fixation prediction. In: Computer Vision–ECCV 2012, pp. 116–129. Springer (2012) 15. Schauerte, B., Stiefelhagen, R.: Predicting human gaze using quaternion DCT image signature saliency and face detection. In: 2012 IEEE Workshop on Applications of Computer Vision (WACV), pp. 137–144. IEEE (2012) 16. Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: Advances in Neural Information Processing Systems, pp. 545–552 (2007) 17. Jiang, F., King, B.: A novel quality assessment for visual secret sharing schemes. EURASIP J. Inf. Secur. 2017(1), 1 (2017) 18. Blumenthal, I.: Grab cut (2012). www.grabcut.weebly.com/ 19. Blake, A., Rother, C., Brown, M., Perez, P., Torr, P.: Interactive image segmentation using an adaptive GMMRF model. In: Computer Vision-ECCV, 2004, pp. 428–441 (2004) 20. Kim, J., Parra, A., Li, H., Delp, E.J.: Efficient graph-cut tattoo segmentation. In: SPIE/IS&T Electronic Imaging, International Society for Optics and Photonics, pp. 94100H–94100H (2015) 21. Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. Computer vision and Image understanding 106(1), 59–70 (2007)

Vertebral Body Compression Fracture Detection Ahmet İlhan1(&), Şerife Kaba2, and Enver Kneebone3 1

Department of Computer Engineering, Near East University, Near East Boulevard, P.O. Box: 99138 Nicosia, TRNC, Mersin 10, Turkey [email protected] 2 Department of Biomedical Engineering, Near East University, Near East Boulevard, P.O. Box: 99138, Nicosia, TRNC, Mersin 10, Turkey [email protected] 3 LETAM EMAR Radiology Clinic, Radiology Consultant, Nicosia, Cyprus [email protected]

Abstract. The spinal column is one of the crucial parts of the human anatomy, and its essential function is the protection of the spinal cord. Each of the individual bones that compose the spinal column is called a vertebra. A vertebral compression fracture is one of the types of fractures that occur in the spinal column; this fracture type causes loss of bone density alongside pain and loss of mobility. In recent years, image processing has become an effective tool widely used in the analysis of medical images. In this study, a novel system is proposed for the detection of vertebral body compression fractures using spine CT images in the sagittal plane. For the detection of the fracture, the longest and shortest heights of the vertebral bodies are measured using image processing techniques. The aim of this study is to develop an automated system to help radiology specialists by facilitating the process of vertebral fracture diagnosis. The proposed system detected the fractured vertebrae successfully in all images used in this study.

Keywords: Vertebra · Compression fracture · Image processing

1 Introduction

The main function of the vertebra is to support and protect the spinal cord. Each vertebra's body is a large, rounded portion of bone, and these vertebral bodies are connected to a bony ring. When the vertebrae are stacked one on another, this ring forms a hollow tube through which the spinal cord passes [1]. The vertebrae are divided into five regions: cervical (pertaining to the neck), thoracic (chest and rib cage), lumbar (lower back), sacrum and coccyx (tailbone). There are 24 movable bones, including seven cervical, twelve thoracic and five lumbar vertebrae, numbered C1–C7, T1–T12 and L1–L5 respectively; the sacrum and coccyx vertebrae are fused [1]. Compression fractures are one of the familiar types of fracture that affect the vertebra. A compression fracture of a vertebra (spine bone) causes a collapse in the height


of bone. The majority of vertebral compression fractures are seen in the inferior (lower) section of the thoracic spine (T11 and T12) and the initial section of the lumbar spine (L1) [1]. Serious compression fractures resulting from a strong impact on the spine, as might occur in a car accident, can cause parts of the vertebral body to be pushed into the spinal canal and press on the spinal cord. In the United States, approximately 700,000 compression fractures arising from osteoporosis (bone loss) occur each year. Healthy and strong bones can withstand the strength and pressure of normal activity. Compression fractures occur in the spine when the stresses are too great or the spine bones are not strong enough, and there is a high probability of cracking in the vertebral body under the pressure. Strong-impact fractures tend to crack the posterior (back) section of the vertebral body, whereas osteoporotic fractures generally occur in the anterior (front) section of the vertebral body [2]. A vertebral fracture is considered present when there is a height loss of more than 20% in the anterior (front), middle, or posterior (back) dimension of the vertebral body. Based on the vertebral height loss, osteoporotic vertebral fractures can be classified into three grades: if the percentage of height loss is 20–25%, the fracture is called "mild"; if it is 25–40%, it is called "moderate"; and above 40% it is called "severe" [3]. Medical imaging techniques such as X-ray and computed tomography (CT) are used in the diagnosis of vertebral compression fractures. If the physician suspects that a patient has a compression fracture, X-rays are requested. X-rays can be used to visualize vertebral fractures. When an X-ray confirms a fracture, a CT image might be requested. CT is a detailed X-ray that allows the physician to see slices of the body's tissue, and the image shows whether the compression fracture created an unstable area from the injury [2]. The authors in [4] presented an automated model-based system for vertebra detection using Curved Planar Reformation (CPR) and Generalized Hough Transform (GHT) techniques. The authors in [5] presented a semi-automated system for vertebra detection using an iterative normalized-cut algorithm and region-based active contours. The authors in [6] presented a manual system for segmentation of the vertebral body using superpixel segmentation, Otsu's thresholding and region growing methods. The authors in [7] presented an automated segmentation system to analyze wedge compression fractures using a two-level probabilistic model, morphological operations and a Hough-transform-based line detection method. In this study, Sect. 2 presents the proposed system, Sect. 3 describes the methodology, Sect. 4 presents the experimental results, and the conclusion is addressed in the final section.

2 Proposed System

The proposed system is divided into three stages. The first stage describes the pre-processing that makes the images more suitable for segmentation. The second stage describes the segmentation that distinguishes the vertebral bodies from the rest of the image. The last stage describes the measurement of the percentage loss of height in the vertebral bodies.

2.1 Pre-processing Stage

In this stage, the power-law transformation method is described.
Power-Law Transform. Power-law transformation converts a narrow range of dark input values into a wider range of output values, or a wider range of input values into a narrower range of dark output values [8]. In this study, the power-law transformation is used to eliminate excess brightness in the image.

2.2 Segmentation Stage

In this stage, thresholding, hole filling, the watershed segmentation algorithm, border cleaning, morphological operations, and object removal are described.
Thresholding. In image segmentation, thresholding is the most commonly used technique [9]. In this study, thresholding is used to segment the vertebral bodies from the image. The mean value of the image pixels is used as the threshold value.
Hole Filling. The hole filling operator uses a flood-fill algorithm. In binary images, the algorithm converts background pixels (0's) to foreground pixels (1's) and stops when it reaches the object boundaries [10]. In this study, the hole filling operator is used to fill small black holes among the white pixels of the image.
Watershed Segmentation Algorithm. The watershed transform can be described as a morphological gradient segmentation technique. The basic idea of the algorithm is to segment the image by the dams, which are named watersheds [11]. In this study, the watershed segmentation algorithm is used to separate connected vertebral bodies.
Border Cleaning. The border cleaning operator is required to remove particles that are connected to the image boundary [12]. In this study, the border cleaning operator is used to remove the sacrum and an undesired vertebral body from the image.
Morphological Operations. The boundaries and skeletons of objects are defined through morphological operations on images [13]. Dilation and erosion are the most common morphological operators. In this study, the dilation operator is used to connect objects that have similar shape characteristics. Dilation and erosion can be combined into more complicated sequences; the opening operator is one of these sequences [14]. In this study, the opening operator is used to remove undesired small objects that are connected to the vertebral bodies.
Object Removal. In this step, the largest object (the vertebral body region) is segmented from the image.

2.3 Measurement Stage

In this stage, corner detection and the calculation of the percentage loss of height in the vertebral body are described. Corner Detection. Corner Detection is divided into two steps. The first step is the division of each vertebral body into separate blocks. The second step is the detection of the corners of each vertebral body.


The corners of a polyhedron maximize or minimize the intercept of lines of a certain slope. In this study, the lines used are x + y, x − y, −x + y and −x − y, which are oriented at 45° to the x and y axes. As long as the rectangle edges are approximately parallel or perpendicular to the x and y axes, its corners minimize the intercept of one of these lines [15]. The distances between the detected corner points, which are used to calculate the longest and shortest heights of the vertebral body, are measured with the Euclidean distance.
Calculation of the Percentage Loss of Height in the Vertebral Body. The following equation is used to calculate the percentage loss of height of each vertebral body. The system determines whether or not a vertebra is fractured according to the minimum fracture rate (20%) mentioned above [3]; it detects that the vertebra has a compression fracture if there is at least 20% height loss. The calculation of the height loss can be formulated as:

$$P(\%) = \frac{a - b}{a} \times 100 \qquad (1)$$

where P is the percentage loss of height, a is the longest and b is the shortest height of the vertebral body, calculated in pixels.
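As a concrete illustration, the sketch below applies Eq. (1) and the severity grades cited from [3]. It is a hedged example rather than the authors' code; the function names and example values are hypothetical.

```python
# Illustrative sketch (not the authors' implementation) of Eq. (1) and the
# fracture grades from [3]; inputs are pixel distances between corner points.
def height_loss_percent(a: float, b: float) -> float:
    """P(%) = ((a - b) / a) * 100, with a the longest and b the shortest height."""
    return (a - b) / a * 100.0

def grade_fracture(p: float) -> str:
    if p < 20.0:
        return "no compression fracture detected"
    if p <= 25.0:
        return "mild"
    if p <= 40.0:
        return "moderate"
    return "severe"

print(grade_fracture(height_loss_percent(100.0, 70.0)))  # 30% loss -> "moderate"
```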

3 Methodology

3.1 Database

The images were collected from “ESR” and “LETAM” (European Society of Radiology [16], LETAM EMAR Radiology Clinic). The database consists of 5 spine CT images, including 4 fractured and 1 non-fractured. All images used are in JPEG format.

3.2 System Performance

In this study, sensitivity, specificity and accuracy are calculated to evaluate the system performance using the following attributes:
• True Positive (TP): Fractured and detected correctly.
• True Negative (TN): Non-fractured and not detected.
• False Positive (FP): Non-fractured and detected.
• False Negative (FN): Fractured and not detected.
The table below shows the proposed system's performance (Table 1).

Table 1. Performance of the proposed system.
Number of total images | TP | TN | FP | FN | Sensitivity | Specificity | Accuracy
5                      | 4  | 1  | 0  | 0  | 100%        | 100%        | 100%
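The formulas behind these metrics are not spelled out in the paper; the short sketch below uses the standard definitions (an assumption on our part) applied to the counts reported in Table 1.

```python
# Standard confusion-matrix metrics (assumed definitions, not from the paper),
# evaluated on the Table 1 counts: TP=4, TN=1, FP=0, FN=0.
def evaluate(tp: int, tn: int, fp: int, fn: int):
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy

print(evaluate(4, 1, 0, 0))  # (1.0, 1.0, 1.0) -> 100% / 100% / 100%
```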


4 Experimental Results

The simulation of the proposed system is described in this section. Figures below illustrate the implementation of the proposed system (Figs. 1, 2, 3, 4, and 5).

Fig. 1. (a) Original image, (b) transformed image and (c) thresholded image.

Fig. 2. (a) Filled image, (b) watershed segmented image and (c) border cleaned image.


Fig. 3. (a) Opened image, (b) dilated image and (c) object removed image.

Fig. 4. Split-block image.

Fig. 5. Resultant image.


5 Conclusion

The proposed system is designed to aid physicians in examining images for possible fractures more easily and to save their valuable time. In this study, 5 spine CT images were evaluated. The system was developed using widely used medical image processing techniques. Radiologists diagnose vertebral compression fractures by eyeball estimation: they need to check the images to find and evaluate the fractured region of the vertebral body. The principal advantage of the proposed system is that it detects the fractured region automatically; furthermore, it aims to give more scientific and definite results by measuring the height of the vertebral body to determine whether or not there is a fracture in the vertebra region. The overall success rate of the proposed system is 100% on the images with and without a vertebral compression fracture. To improve this work, it is planned to make more visits to several local hospitals and increase the number of images in order to fine-tune the system. For future work, the developed system could be installed on medical imaging devices to automatically detect all types of vertebral fractures and display them on the screen when spine CT images are taken.

References 1. University of Maryland Medical Center. https://www.umms.org/ummc/healthservices/ orthopedics/services/spine/patient-guides/anatomy-function . Accessed 22 July 2018 2. Esses, S.I., McGuire, R., Jenkins, J., Finkelstein, J., Woodard, E., Watters, W.C., Goldberg, M.J., Keith, M., Turkelson, C.M., Wies, J.L., Sluka, P.: The treatment of osteoporotic spinal compression fractures. J. Bone Joint Surg. Am. 93(20), 1934–1936 (2011) 3. Radiopaedia. https://radiopaedia.org/articles/osteoporotic-spinal-compression-fracture. Accessed 30 July 2018 4. Klinder, T., Ostermann, J., Ehm, M., Franz, A., Kneser, R., Lorenz, C.: Automated modelbased vertebra detection, identification, and segmentation in CT images. Med. Image Anal. 13(3), 471–482 (2009) 5. Patrick, J., Indu, M.G.: A semi-automated technique for vertebrae detection and segmentation from CT images of spine. In: International Conference on Communication Systems and Networks (ComNet), pp. 44–49. IEEE, July 2016 6. Barbieri, P.D., Pedrosa, G.V., Traina, A.J.M., Nogueira-Barbosa, M.H.: Vertebral body segmentation of spine MR images using superpixels. In: 28th International Symposium on Computer-Based Medical Systems. Institute of Electrical and Electronics Engineers–IEEE (2015) 7. Ghosh, S., Raja’S, A., Chaudhary, V., Dhillon, G.: Automatic lumbar vertebra segmentation from clinical CT for wedge compression fracture diagnosis. In: Medical Imaging 2011: Computer-Aided Diagnosis, vol. 7963, p. 796303. International Society for Optics and Photonics, March 2011 8. Dhawan, V., Sethi, G., Lather, V.S., Sohal, K.: Power law transformation and adaptive gamma correction: a comparative study. Int. J. Electron. Commun. Technol. 4(2), 118–123 (2013) 9. Senthilkumaran, N., Vaithegi, S.: Image segmentation by using thresholding techniques for medical images. Comput. Sci. Eng.: Int. J. 6(1), 1–13 (2016)


10. Abinaya, G., Banumathi, R., Seshasri, V., Kumar, A.N.: Rectify the blurred license plate image from fast moving vehicles using morphological process. In: IEEE International Conference on Electrical, Instrumentation and Communication Engineering (ICEICE 2017), pp. 1–6. IEEE, April 2017 11. Saikumar, T., Yugander, P., Murthy, P.S., Smitha, B.: Image segmentation algorithm using watershed transform and fuzzy C-Means clustering on level set method. Int. J. Comput. Theor. Eng. 5(2), 209 (2013) 12. Soille, P.: Texture analysis. In: Morphological Image Analysis, pp. 317–346. Springer, Heidelberg (2004) 13. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Pearson Prentice Hall, Upper Saddle River (2008) 14. Ilhan, U., Ilhan, A.: Brain tumor segmentation based on a new threshold approach. Proc. Comput. Sci. 120, 580–587 (2017) 15. Dantzig, G.B., Thapa, M.N.: Linear Programming 1: Introduction. Springer, Heidelberg (2006) 16. European Society of Radiology. https://posterng.netkey.at/esr/viewing/index.php?module= viewing_poster&task=viewsection&pi=114839&si=1166&searchkey=. Accessed 21 July 2018

PZnet: Efficient 3D ConvNet Inference on Manycore CPUs Sergiy Popovych1(B) , Davit Buniatyan1(B) , Aleksandar Zlateski2(B) , Kai Li1(B) , and H. Sebastian Seung1(B) 1

Princeton University, Princeton, NJ 08544, USA {popovych,davit,sseung}@princeton.edu, [email protected] 2 Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA [email protected]

Abstract. Convolutional nets have been shown to achieve state-of-the-art accuracy in many biomedical image analysis tasks. Many tasks within the biomedical analysis domain involve analyzing volumetric (3D) data acquired by CT, MRI and microscopy acquisition methods. To deploy convolutional nets in practical working systems, it is important to solve the efficient inference problem. Namely, one should be able to apply an already-trained convolutional network to many large images using limited computational resources. In this paper we present PZnet, a CPU-only engine that can be used to perform inference for a variety of 3D convolutional net architectures. PZnet outperforms MKL-based CPU implementations of PyTorch and Tensorflow by more than 3.5x for the popular U-net architecture. Moreover, for 3D convolutions with low featuremap numbers, cloud CPU inference with PZnet outperforms cloud GPU inference in terms of cost efficiency.

Keywords: Image segmentation · 3D convolutions · SIMD · Intel Xeon

1 Introduction

Convolutional neural networks (ConvNets) are becoming the primary choice for automated biomedical image analysis [11,13,20,21], achieving superhuman accuracy for tasks such as chest X-ray anomaly detection [19], neuron segmentation [14] and more [17]. Many tasks within the biomedical analysis domain involve analyzing volumetric data acquired by CT, MRI and microscopy acquisition methods. As a result, tasks such as organ/substructure segmentation, object/lesion detection and exam classification commonly employ 3D ConvNets [16]. Computational costs of ConvNet inference, which involves applying an already trained ConvNet to new images, are of particular concern to the biomedical image analysis community. Each super-resolution volumetric specimen can grow to the order of 10,000³ voxels [23,24], resulting in vast amounts of data


to be analyzed. Additionally, convolution operations often require more computation per pixel in 3D than 2D, which increases computational demand of 3D ConvNet inference. High computational costs of processing datasets can serve as a limiting factor in the quality of analysis. Thus, increasing utilization of available hardware resources for 3D ConvNet inference is a critical task. Modern deep learning frameworks such as Theano [5], Caffe [12], MXNet [6], Tensorflow [4] and Pytorch [8] are mostly optimized for processing of 2D images, and achieve lower hardware utilization on both CPU and GPU platforms for 3D tasks. In this work we show that CPU efficiency for 3D ConvNet inference can be improved by up to 4x, which results in higher utility of existing CPU infrastructure and makes CPU inference a competitive choice in the cloud setting. The main contribution of this work is an inference-only deep learning engine called PZnet, which is specifically optimized for 3D inference on Intel Xeon CPUs. PZnet utilizes ZnnPhi [26], a state-of-the-art direct 3D convolution implementation. ZnnPhi relies on template-based metaprogramming and requires a custom data layout, which makes it not compatible with mainstream deep learning frameworks. For this reason, we created a special inference-only framework PZnet. A convolutional net can be trained using a mainstream deep learning framework, and then imported to PZnet when large-scale inference is required. PZnet implements a number of unique optimizations that complete operations required by several layers in one memory traversal. Reducing the number of memory traversals can be critical for achieving high performance on CPU platforms. These layer fusions are applicable to a number of ConvNet architectures, and can reduce inference time by up to 12%. PZnet outperforms MKL-based [1] CPU implementations of PyTorch and Tensorflow by 3-8x, depending on the network architecture and hardware platform. Moreover, we show that based on current cloud compute prices, PZnet CPU inference is competitive with cuDNN based GPU inference. For inference of a real-world residual 3D Unet architecture [14], PZnet is able to outperform GPU inference in terms of cost efficiency by over 50%. To the best of our knowledge, this is the first work to show CPU inference to beat GPU inference in terms of cloud cost.

2 PZnet

2.1 Problem Statement

The goal of this work is to reduce the computational costs of running large scale 3D dense prediction tasks. We achieve this goal by maximizing efficiency of 3D ConvNet inference on CPUs. More specifically, we focus on Intel Xeon processors. A high efficiency CPU inference engine would improve the utility of existing CPU cluster infrastructure and reduce the costs spent on cloud resources.

2.2 Overview

PZnet is a deep learning engine for 3D inference on Intel Xeon processors. PZnet achieves high efficiency through employing a specialized convolution implementation and performing a series of inter-layer optimizations. PZnet is compatible with the standard Caffe prototxt network specification format, which simplifies importing models trained in other frameworks. PZnet provides support for the following layers:
– Convolution, strided and non-strided
– Deconvolution, strided and non-strided
– Batch Normalization
– Scale
– ReLU
– ELU
– Sigmoid
– Elementwise (Addition, Division, Multiplication)
– MergeCrop
– Pooling (Average, Max)

PZnet consists of two parts: a network generator and an inference API. The network generator compiles the provided model specifications into so-called network files. Network files are shared library objects that are distributed to worker machines. Workers run inference by accessing the models within the network files through the PZnet Python inference API. PZnet employs ZnnPhi, which, to the best of our knowledge, is the most efficient 3D direct convolution implementation to date. ZnnPhi achieves high performance through utilizing SIMD instructions in a cache-efficient way, and is compatible with the SSE4, AVX, AVX2 and AVX512 SIMD instruction families. ZnnPhi requires image and kernel data to conform to a specific data layout. An image tensor with B batches, F featuremaps, and X, Y, Z spatial dimensions has to be stored as an array with dimensions B × F/S × X × Y × Z × S, where S is the width of the SIMD unit. A kernel tensor of size F′ × F × Kx × Ky × Kz will be stored as an array with dimensions F′/S × F/S × Kx × Ky × Kz × S × S. ZnnPhi requires such a data layout in order to maximize the efficiency of SIMD instruction utilization, and this prevents ZnnPhi from being pluggable into mainstream deep learning frameworks. Additionally, ZnnPhi heavily relies on metaprogramming through C++ templates, which means that layer parameters, such as image and kernel sizes, have to be known at compile time. This allows ZnnPhi to rely on compile-time optimizations in order to produce maximally efficient code for each parameter configuration. However, this adds another obstacle to integrating ZnnPhi into mainstream deep learning frameworks. Most deep learning frameworks allow user-supplied C++ layer implementations, but they require them either as a compiled shared object or as generic source code. Neither option is compatible with ZnnPhi, because ZnnPhi kernels need to be recompiled for each layer configuration.
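For illustration, the following NumPy sketch shows the blocked image layout described above. It is not PZnet or ZnnPhi source code, and it assumes the featuremap count is divisible by the SIMD width.

```python
import numpy as np

# Sketch: convert a dense image tensor (B, F, X, Y, Z) into the blocked layout
# (B, F/S, X, Y, Z, S) that ZnnPhi expects, where S is the SIMD width.
def to_blocked(image: np.ndarray, simd_width: int) -> np.ndarray:
    b, f, x, y, z = image.shape
    assert f % simd_width == 0, "featuremap count must be divisible by SIMD width"
    blocked = image.reshape(b, f // simd_width, simd_width, x, y, z)
    # move the SIMD sub-axis to the innermost (fastest-varying) position
    return np.ascontiguousarray(blocked.transpose(0, 1, 3, 4, 5, 2))

img = np.random.rand(1, 16, 32, 32, 32).astype(np.float32)
print(to_blocked(img, simd_width=8).shape)  # (1, 2, 32, 32, 32, 8)
```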


Fig. 1. PZnet optimization flow

In order to support ZnnPhi, PZnet compiler generates C++ source code which directly corresponds to the provided model. Then, Intel C++ Compiler is invoked in order to produce optimized shared library object files. All PZnet layers support ZnnPhi blocked memory layout. PZnet implicitly performs memory layout transformations for the input and output data tensors in order to provide standard input and output formats.

3 Optimizations

Optimizations performed by PZnet are mainly aimed at reducing the number of memory traversals introduced by non-convolutional layers (batchnorm, scale, activation, etc.). We modify the ZnnPhi convolutional kernel in order to be able to perform several layer operations in one memory pass. The overall optimization flow of PZnet is shown in Fig. 1. First, we fuse convolution layers with element-wise addition layers, which are commonly introduced by residual connections. After element-wise layers are fused into convolutions, more convolution layers immediately precede batch normalization and scale layers (Fig. 1). During inference time, both batch normalization and scale layers perform linear transformations of tensors. Weights of the convolution layers can be modified to account for subsequent linear transformation layers, and so we are able to fuse batch normalization and scale layers into the preceding convolution. After linear transformation layers are fused, convolutions are commonly followed by activation layers. We modify ZnnPhi primitives in order to apply the activation function to the convolution outputs before they are written out to memory. Finally, after element-wise addition, linear transformation and activation layers are removed, most convolution layer outputs are used as inputs to the subsequent convolution layers. Thus, we can eliminate explicit padding of the inputs by making convolution layers produce padded outputs, which saves additional memory traversals. Overall, these optimizations improve performance by 7–12%, depending on CPU parameters and network architecture.

3.1 Element-Wise Addition Fusion

State-of-the-art 3D segmentation models often include residual connections [14], resulting in up to 1 element-wise addition layer per 3 convolution layers. We observe that we can eliminate these layers by modifying the ZnnPhi convolution implementation. ZnnPhi achieves cache efficiency by implementing convolution as a series of hierarchical primitives, with each higher-level primitive utilizing the lower-level ones in order to complete more complex tasks. The lowest-level ZnnPhi primitive, called the sub-image primitive, will be extended in this work. The goal of the sub-image primitive is to compute the contribution of S consecutive input feature maps to a small patch in S consecutive output feature maps, where S is the SIMD width for the given instruction family. For example, Fig. 2 illustrates an application of 2 sub-image primitives, black and gray, for the case when S = 2. Both applications of the primitive target the same patches in output feature maps 3 and 4. The gray application initializes the output patches to the bias values for the given output feature maps and adds to them the result of convolving kernels with patches in input feature maps 1 and 2. The black application finishes the computation by adding on the result of convolving kernels with patches in input feature maps 3 and 4. The sub-image primitive is designed to maximize register reuse and L1 cache hit rate. It is used repeatedly by higher-level primitives in order to generate full output feature maps.

Fig. 2. Application of two ZnnPhi sub-image primitives to the black patches in the output feature maps (SIMD width = 2). Each application computes the contribution of SIMD width consecutive feature maps to the black output patch regions. After A and B are applied, the output patches will hold their final values.

Element-wise addition fusion is performed as follows. Assume that element-wise addition layer E receives two input tensors which are produced by layers L1 and L2. Without loss of generality, we assume that L1 is a convolution layer. The following conditions need to be satisfied for fusion of E and L1. First, element-wise addition E has to be the only consumer of the output produced by convolution L1. Second, there must exist a topological ordering of the network graph in which L1 appears after L2. If both of these conditions are satisfied, element-wise addition E can be fused into convolution L1. After the fusion, the tensor produced by L2 is passed to L1 as an additional Base argument. Lastly, the additive flag for L1 is set to true. When the additive flag is set to true, the first application of the sub-image primitive initializes the output values to be the sum of the layer bias and the Base argument.
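A minimal sketch of the two eligibility checks just described, on a network graph stored as a producer-to-consumers dictionary, is given below. The function and variable names are hypothetical; this is not the PZnet compiler code.

```python
# graph: {layer_name: [consumer layer names]} for a directed acyclic network graph
def only_consumer(graph, producer, consumer):
    return graph.get(producer, []) == [consumer]

def reachable(graph, src, dst):
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return False

def can_fuse_addition(graph, conv_l1, other_l2, add_e):
    # (1) E is the only consumer of L1's output;
    # (2) some topological order places L1 after L2, i.e. there is no path L1 -> L2.
    return only_consumer(graph, conv_l1, add_e) and not reachable(graph, conv_l1, other_l2)
```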


3.2 Linear Transformation Fusion

Let us define a notation in which a convolution layer takes an input of N feature maps and produces an output of M feature maps, with each output feature map $O_m$ described as

$$O_m = \sum_{n=1}^{N} I_n * K_{nm} + B_m$$

where $I_n$ denotes input feature map $n$, $K$ denotes the convolution kernel and $B$ denotes the bias. For layers that perform a linear transformation during inference, such as batch normalization and scaling, the computation of each output feature map can be described as $O_m = I_m \cdot M_m + A_m$, where $M_m$ and $A_m$ denote the multiplicative and additive weights for feature map $m$. The combined computation of a convolution layer followed by a linear transformation layer can be described as

$$O_m = \Big( \sum_{n=1}^{N} I_n * K_{nm} + B_m \Big) \cdot M_m + A_m = \sum_{n=1}^{N} I_n * (K_{nm} \cdot M_m) + (B_m \cdot M_m + A_m)$$

which is equivalent to performing a convolution with kernel $K'_{nm} = K_{nm} \cdot M_m$ and bias $B'_m = B_m \cdot M_m + A_m$. Thus, we can eliminate linear transformation layers that immediately follow convolutions by performing the corresponding weight modifications. Note that this optimization applies only to inference, as the presence of batch normalization and scale significantly affects network behavior during training.
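A minimal NumPy sketch of this weight folding follows. Shapes and names are illustrative assumptions; this is not the PZnet implementation.

```python
import numpy as np

# Fold a per-featuremap linear transform (multiplicative M, additive A) into a
# convolution with kernel K of shape (F_out, F_in, kx, ky, kz) and bias B (F_out,).
def fold_linear_into_conv(K, B, M, A):
    K_folded = K * M[:, None, None, None, None]   # K'_nm = K_nm * M_m
    B_folded = B * M + A                          # B'_m  = B_m * M_m + A_m
    return K_folded, B_folded
```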

3.3 Activation Fusion

In ZnnPhi the sub-image primitive is applied repeatedly to each output patch, with the final application storing the output patch values to memory. We observe that when a convolution is immediately followed by an activation, the activation function can be applied to the output during the final application of the sub-image primitive to each patch. This way the computed output values are already in cache when the activation is applied, which eliminates the memory traversal overhead caused by activation layers.

3.4 Padding Transformation

ZnnPhi does not support input padding, and so explicit padding layers have to be added to the IR of the network after the parsing phase. Padding cannot be done in place, which means that the memory traversals introduced by padding layers especially hurt inference efficiency. Moreover, due to the specific data layout required by ZnnPhi, the memory must be moved in small disjoint chunks in order to perform padding, which further hurts efficiency. However, ZnnPhi allows the user to specify strides for each of the output dimensions. We observe that by manipulating the stride values for the spatial dimensions we can make ZnnPhi convolutions produce padded outputs. Thus, when two convolutions follow each other, the first convolution can produce output which is already padded for the second convolution. More formally, whenever all consumers of a convolution layer output require the same spatial padding, explicit input padding for the consumers can be avoided by generating padded output at the producer layer.

Fig. 3. Left: operation of the padding layer. Right: memory layout of the image before and after padding. The padded output can be viewed as a representation of the input image with initial offset = x_side + 3 and y_stride = x_side + 2


Fig. 4. Comparison between CPU performance of PZnet, Tensorflow and PyTorch (lower is better).

Figure 3 illustrates how manipulating the initial offset and strides can be used to generate padded outputs for a row-major memory layout. In the illustrated case, a 2D image is to be padded by 1 in both the x and y dimensions. This can be achieved by setting the initial offset to size(y) + 3, and the stride for the y dimension to size(y) + 2. Analogously, padding can be generated for 3D images with the ZnnPhi blocked memory layout.
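The same effect can be mimicked in NumPy by letting the producer write into an interior view of a pre-allocated padded buffer; this is a rough stand-in (an assumption, not ZnnPhi code) for the offset and stride manipulation described above.

```python
import numpy as np

# Allocate the padded buffer once; the producer writes into an interior view
# (same strides, shifted start offset), so no separate padding pass is needed.
def padded_output_buffer(shape_xyz, pad):
    padded = np.zeros(tuple(s + 2 * pad for s in shape_xyz), dtype=np.float32)
    interior = padded[pad:-pad, pad:-pad, pad:-pad]
    return padded, interior

padded, out_view = padded_output_buffer((4, 4, 4), pad=1)
out_view[:] = np.ones((4, 4, 4), dtype=np.float32)       # the "convolution" writes here
print(padded.shape, out_view.strides == padded.strides)  # (6, 6, 6) True
```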

4 Evaluation

4.1 Setup

The experiments in this section are performed on the major types of CPU and GPU compute instances from Amazon Web Services (AWS), namely the c4, c5, p2 and p3 instance types. Processing of volumetric datasets is generally done by breaking the volume into a large number of smaller patches. The segmentation result of each patch is computed as a separate task. When all of the patches are computed, the results are concatenated to obtain the final result. Small task granularity minimizes distortions caused by instance termination, which makes spot instances a perfect choice for large-scale inference. Small task granularity also encourages the use of small instances for the execution of each individual task. It is easier to achieve high utilization on a small number of cores, as this reduces the effects of inter-core communication and synchronization, and simplifies parallelization. In other words, it is more efficient to allocate a large number of small workers than a small number of bigger workers. Consequently, our experimentation uses the smallest CPU instances that can support sufficient RAM to hold the network.

4.2 Network Architectures

The evaluation is done on 3 variants of 3D Unet, which is the state-of-the-art architecture for 3D image segmentation. The original 3D Unet uses convolutions


with no padding and high numbers of feature maps. Several works propose using a modification of this architecture in which padding is introduced in order to keep the input and output image sizes the same [15]. Recent works [9,14,18] in 3D segmentation also include residual connections at each level of the UNet. For our evaluation, we use the original 3D Unet as proposed by [7], a symmetric variation with padded convolutions, a reduced number of features and addition instead of merging layers, and a residual variation used in [14]. These architectures will be referred to as Original, Symmetric, and Residual for the rest of this paper. As modern practice suggests, we include batch normalization and scaling layers after each convolution. We also use ELU as the activation function. Figure 5 depicts the structure of the Unet architecture families. Each network is composed of multiple levels of so-called convolutional blocks, connected to each other through either an upsampling or a downsampling layer (Fig. 5, left). The convolutional blocks of the Original and Symmetric variations contain two consecutive convolutional layers, each followed by batch normalization and activation. The Residual variation adds an additional residual connection and an additional convolutional layer to each block (Fig. 5, right). The three architectures also differ in the number of featuremaps used for the convolutions. The Original architecture uses 64 featuremaps for the convolutions on the first level, and doubles the number of featuremaps on each consequent level. The Symmetric architecture uses half as many featuremaps on each level; the reduced number of featuremaps compensates for the fact that the Symmetric architecture uses convolution input padding, and so the dimensions of each featuremap are larger. The Residual architecture reduces the number of features even further, starting at 28 and going up in increments of 8 and 16 at each level, as described in [14]. For the timing experiments, weights were initialized with Xavier initialization [10] for the convolutional layers and as an identity for the linear transformation layers. Each measurement was repeated for 60 iterations after a warm-up period of 10 iterations.

Fig. 5. Left: generic Unet architecture with N-1 downsamples, upsamples and skip connections, Right: Convolutional block types for each model used in evaluation. BN corresponds to Batchnorm followed by Scale and Activation function

4.3 CPU Performance

In this experiment, we show that PZnet outperforms Tensorflow and Pytorch for 3D inference. Figure 4 compares the CPU performance of PZnet, Tensorflow, and Pytorch. The Symmetric architecture has the largest intermediate featuremaps, and the matrix-multiplication 3D convolution implementation used in Tensorflow and Pytorch grows quadratically with featuremap sizes. This leads to a RAM requirement that cannot be fulfilled with the low-core AWS instances used for experimentation. The Tensorflow version used for evaluation was compiled with MKL, FMA and AVX2 support. Despite our best efforts, the master branch of Tensorflow could not be compiled with AVX512 support without producing segmentation faults during network inference. Tensorflow inference optimization scripts were used prior to timing. The Pytorch implementation was also compiled with MKL and AVX2 support. The results show that PZnet outperforms Tensorflow by more than 3.4x for all of the experiment settings, and outperforms Pytorch by more than 2.27x. PZnet achieves its maximum speedups over Tensorflow and Pytorch for the Residual architecture. As will be shown later in this section, the Residual architecture benefits most from the optimizations.

4.4 Cost Efficiency

Basic Blocks. In this experiment, we show that when the convolutional feature number is small, CPU inference with PZnet can outperform GPU inference in terms of cost efficiency. Inference of convolutional blocks with 8 to 32 featuremaps is timed on CPU and GPU instances. The results of the experiment are shown in Fig. 6. The X axis corresponds to the number of featuremaps used in each of the convolutional layers. The Y axis corresponds to the effective dollar price per patch. The price per patch is obtained by multiplying the execution time by the hourly price of the AWS instance used. The results show that when the featuremap number is small, CPUs can outperform GPUs in terms of price

Fig. 6. Cloud cost comparison for running 3D Unet basic block on various hardware (lower is better). Dashed lines correspond to GPU inference with Tensorflow, solid lines correspond CPU inference with PZnet. Left: featuremap range 8 to 32. Right: featuremap range 8 to 64.


efficiency. In particular, when the featuremap number is 16, CPUs can be more than twice as cheap. GPUs outperformed CPUs for featuremap numbers greater than 32.
Full Networks. In this experiment, we study the cost efficiency comparison between CPUs and GPUs on full network architectures. Three network architectures are tested overall: the Original, Symmetric, and Residual variations of 3D Unet. The results of the experiment are presented in Table 1. The price per patch is obtained by multiplying the execution time by the hourly price of the AWS instance used. CPU inference is run with PZnet. The CPU price per patch is taken to be the lowest among the c4 and c5 instances for each data point. The GPU price per patch is taken as the best of running Pytorch and Tensorflow across the p2 and p3 instances for each data point. For the networks with high numbers of featuremaps (Original, Symmetric), the GPU cost per patch is lower than the CPU cost per patch by factors of 2.76x and 1.82x. However, when the feature number becomes small (Residual), the CPU cost per patch is lower than the GPU cost per patch by a factor of 1.49x. This confirms the hypothesis that CPU inference can be competitive with GPU inference in terms of cost efficiency when the featuremap number is low.

Table 1. Inference cost efficiency
Price per patch (USD)
Platform      | Original | Symmetric | Residual
CPU           | 2.45E−05 | 4.47E−05  | 2.53E−05
GPU           | 8.86E−06 | 2.45E−05  | 3.77E−05
Best Platform | GPU      | GPU       | CPU
Margin        | 2.76x    | 1.82x     | 1.49x
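For reference, the cost metric used in Table 1 can be computed as below; the timing and hourly price in the example are placeholders, not measured values or actual AWS quotes.

```python
# Price per patch = execution time (s) * hourly instance price, converted to per-second.
def price_per_patch(seconds_per_patch: float, hourly_price_usd: float) -> float:
    return seconds_per_patch * (hourly_price_usd / 3600.0)

print(f"{price_per_patch(0.8, 0.10):.2e} USD")  # e.g. 0.8 s on a $0.10/h instance
```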

4.5 Optimizations

Table 2 presents the effects of the optimizations performed by PZnet. The Original variation of 3D Unet benefits the least from the optimizations. The reason for that is twofold. First, the Original Unet uses the largest numbers of featuremaps, which makes convolutions take a larger proportion of the total network runtime; the additional memory traversals and linear transformations take a smaller proportion of the total inference time, and so their elimination is less critical for achieving high performance. Second, the Original variation of 3D Unet does not use convolution input padding or element-wise addition, so neither the Output Padding Transformation nor Addition Fusion can be applied to it. The optimizations provide the most benefit to the Residual version of the UNet. This can be explained by the fact that the Residual UNet uses the smallest number of feature maps, which makes the eliminated operations a larger fraction of the runtime, and it contains a large number of element-wise addition layers.

Table 2. Full optimization speedup (ELU)
Instance | Original | Symmetric | Residual
c4       | 11.2%    | 14.9%     | 18.0%
c5       | 9.4%     | 17.2%     | 20.0%

Figure 7 breaks down the contribution of each optimization stage to the total speedup for each of the studied network architectures. Since Addition Fusion is used as an enabling optimization for Activation Fusion and Linear Fusion, its effects cannot be isolated. The results show that all three of the remaining optimizations, namely Linear Fusion, Activation Fusion and Padding Elimination, contribute significantly to the total optimization speedup. The exception is the Original variation, which does not use convolution input padding, and so Padding Elimination can have no effect.

Fig. 7. Effects of the optimization stages on the final performance. Measured on Intel Haswell (AWS c4)

5 Related Work

Existing deep learning frameworks perform graph-level and memory allocation optimizations [22]. Theano [5] and MXNet [6] implement arithmetic optimizations, such as folding of layers that multiply or add by scalar constants. Tensorflow [4] provides an option to freeze the network graph for inference. The optimizations performed by Tensorflow during freezing include removing train-only operations and removing nodes of the graph that are never reached from the given output node. Recently, Tensorflow introduced an optimization that fuses batch normalization weights into the convolution kernel, similar to PZnet. NVIDIA TensorRT [3] provides fusions of convolution, bias and ReLU layers. However, the package does not provide 3D convolution support, which makes it inapplicable to 3D biomedical image processing.


There have been several efforts to build CPU-efficient convolution implementations. ZNNi [25] combines CPU and GPU primitives for 3D ConvNet inference. The greater amount of RAM available on CPU workers allows CPU primitives to process bigger input volumes, which reduces the wasted computation at the volume border. Intel provides a Caffe branch [1] which utilizes the MKL and MKL-DNN libraries [2]. Like PZnet, Intel Caffe supports a blocked data format. Unlike PZnet, Intel Caffe supports both training and inference. The crucial difference between PZnet and Caffe is that Intel Caffe and its underlying libraries do not support 3D convolutions.

6 Conclusion

We presented PZnet, an inference-only deep learning engine which is optimized for 3D inference on Intel Xeon CPUs. It utilizes ZnnPhi [26], the state-of-the-art direct 3D convolution implementation. Additionally, we perform a series of inter-layer optimizations which reduce inference time by up to 20%, and we evaluate on state-of-the-art dense prediction neural network architectures frequently used for biomedical image analysis. PZnet outperforms MKL-based [1] CPU implementations of PyTorch and Tensorflow by 3-8x. Moreover, we show that PZnet CPU inference can be competitive with cuDNN-based GPU inference in terms of cloud price efficiency. In the specific case of [14], PZnet is able to outperform GPU inference in terms of cost efficiency by 50%.
Acknowledgments. This work has been supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC0005. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government. Additionally, this work was partially funded by TRI.

References
1. Intel® distribution of caffe*. https://github.com/intel/caffe. Accessed 11 Apr 2018
2. Intel® math kernel library for deep neural networks (Intel® MKL-DNN). https://github.com/intel/mkl-dnn. Accessed 11 Apr 2018
3. NVIDIA TensorRT. https://developer.nvidia.com/tensorrt. Accessed 11 Apr 2018
4. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: Tensorflow: a system for large-scale machine learning. In: OSDI 2016, pp. 265–283 (2016)
5. Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron, A., Bouchard, N., Warde-Farley, D., Bengio, Y.: Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590 (2012)


6. Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., Zhang, Z.: MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015)
7. Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 424–432. Springer (2016)
8. Collobert, R., Kavukcuoglu, K., Farabet, C.: Torch7: a matlab-like environment for machine learning. In: BigLearn, NIPS Workshop, number EPFL-CONF-192376 (2011)
9. Drozdzal, M., Vorontsov, E., Chartrand, G., Kadoury, S., Pal, C.: The importance of skip connections in biomedical image segmentation. In: Deep Learning and Data Labeling for Medical Applications, pp. 179–187. Springer (2016)
10. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256 (2010)
11. Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 447–456 (2015)
12. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675–678. ACM (2014)
13. Lee, J.-G., Jun, S., Cho, Y.-W., Lee, H., Kim, G.B., Seo, J.B., Kim, N.: Deep learning in medical imaging: general overview. Korean J. Radiol. 18, 570–584 (2017)
14. Lee, K., Zung, J., Li, P., Jain, V., Seung, H.S.: Superhuman accuracy on the SNEMI3D connectomics challenge. arXiv preprint arXiv:1706.00120 (2017)
15. Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR, vol. 1, p. 4 (2017)
16. Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., van der Laak, J.A., van Ginneken, B., Sánchez, C.I.: A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017)
17. Madani, A., Arnaout, R., Mofrad, M., Arnaout, R.: Fast and accurate view classification of echocardiograms using deep learning. npj Digit. Med., 6 (2018)
18. Quan, T.M., Hilderbrand, D.G., Jeong, W.-K.: FusionNet: a deep fully residual convolutional neural network for image segmentation in connectomics. arXiv preprint arXiv:1612.05360 (2016)
19. Rajpurkar, P., Irvin, J., Zhu, K., Yang, B., Mehta, H., Duan, T., Ding, D., Bagul, A., Langlotz, C., Shpanskaya, K., et al.: CheXNet: radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225 (2017)
20. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer (2015)
21. Seyedhosseini, M., Sajjadi, M., Tasdizen, T.: Image segmentation with cascaded hierarchical models and logistic disjunctive normal networks. In: 2013 IEEE International Conference on Computer Vision (ICCV), pp. 2168–2175. IEEE (2013)
22. Sze, V., Chen, Y.-H., Yang, T.-J., Emer, J.S.: Efficient processing of deep neural networks: a tutorial and survey. Proc. IEEE 105(12), 2295–2329 (2017)


23. Tomer, R., Ye, L., Hsueh, B., Deisseroth, K.: Advanced CLARITY for rapid and high-resolution imaging of intact tissues. Nat. Protoc. 9(7), 1682 (2014)
24. Zheng, Z., Lauritzen, J.S., Perlman, E., Robinson, C.G., Nichols, M., Milkie, D., Torrens, O., Price, J., Fisher, C.B., Sharifi, N., et al.: A complete electron microscopy volume of the brain of adult Drosophila melanogaster. BioRxiv, p. 140905 (2017)
25. Zlateski, A., Lee, K., Seung, H.S.: ZNNi: maximizing the inference throughput of 3D convolutional networks on CPUs and GPUs. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 73:1–73:12. IEEE
26. Zlateski, A., Seung, H.S.: Compile-time optimized and statically scheduled N-D convnet primitives for multi-core and many-core (Xeon Phi) CPUs. In: Proceedings of the International Conference on Supercomputing, ICS 2017, pp. 8:1–8:10. ACM (2017)

Evaluating Focal Stack with Compressive Sensing Mohammed Abuhussein(B) and Aaron L. Robinson(B) University of Memphis, Memphis, TN 38125, USA {mbhssein,alrobins}@memphis.edu

Abstract. Compressive sensing (CS) has demonstrated the ability, in the field of signal and image processing, to reconstruct signals from fewer samples than prescribed by the Shannon-Nyquist sampling theorem. In this paper, we evaluate the results of applying a compressive-sensing-based fusion algorithm to a Focal Stack (FS), a set of images of different focal planes, to produce a single all-in-focus image. This model tests ℓ1-norm optimization to reconstruct a focal stack so as to reproduce the scene with all points in focus. This method can be used with any Epsilon Photography algorithm, such as Lucky Imaging, Multi-Image Panorama Stitching, or Confocal Stereo. The images are aligned and blocked first for faster processing time and better accuracy. We evaluate our results by calculating the correlation of each block with the corresponding focus plane. We also discuss the shortcomings of this simulation as well as potential improvements to this algorithm.

Keywords: Compressive sensing · Focal stack · Image Fusion · Sparsity

1 Introduction

Image Fusion provides a method of determining and extracting the useful information in a set of input images and combining the extracted data into one resultant scene with more relevant information than any single input image. This technique is particularly useful when an imaging system's output is dependent on multiple parameters, modalities, or varying conditions at the time of image capture. To be applicable to all of the above-mentioned cases, the resultant fused image can be composed of information from multiple sensors, multiple views of the same scene, multiple focal planes, or multi-temporal data [1]. For example, focal stacking is a technique where a set of images is taken of the same scene with different focal planes and possibly with different apertures or lenses. The focal stack is the result of combining the multiple images of the scene into one resultant scene with a greater Depth of Field (DOF) than any one of the individual images. Similar to an individual photo, a focal stack has well-defined parameters such as an exposure time and a depth of field, which is the union of the DOFs of all the photos in the stack [2].


In many instances, such as image capture in turbulent or other degraded visual environmental conditions, the captured image quality may be a function of time. In these cases, it can be advantageous to implement some form of image processing enhancement on the individual images before the application of the fusion process. The result would be a fused image of processed data that would ideally be of higher quality than a fusion of the raw inputs. Recently, compressive sensing has been shown to possess the ability to reconstruct signals from missing or noisy data [3,4]. This property is very useful in image de-noising and in super-resolution imaging. Additionally, compressive sensing (CS) has been used in many Image Fusion algorithms, as shown in [5] and [6], with very promising results. In this paper, we apply Compressive Sensing (CS) as a major component of a reconstruction method applied to a focal stack to produce an all-in-focus image, as shown in Fig. 1. The results will show that CS is capable of recreating an image using far fewer coefficients and with similar quality to conventional reconstruction techniques.

Fig. 1. Algorithm sequence

In the next section we explain the fundamentals of CS. In Sect. 4, we explain the algorithms we used to process the images. In Sect. 5, we explain our method to produce an all-in-focus image using CS. Results are discussed in Sect. 6. The paper concludes with a discussion of the limitations and future work.

2 Compressive Sensing

In the compressive sensing model, the signal f is a linear combination of columns of a dictionary D, often called atoms. Linear measurements are taken of the form

$$x_i = \langle \varphi_i, f \rangle \quad \text{for } i = 1, 2, \ldots, m \qquad (1)$$

where φ is an m × d matrix with fewer rows (m) than columns. The vectors φ_i are called the sampling operator, and the measurement vector x is of the form x = φf. Without any further assumptions, the reconstruction of f is in this case an ill-posed problem, since m ≪ d, because φ is not invertible. An additional assumption is that the signal f is k-sparse, meaning it has at most k non-zero coefficients. This can be quantified as follows:

$$\|f\|_0 \le k \ll d \qquad (2)$$

where ‖·‖₀ denotes the ℓ₀ norm. The signal f is assumed to be compressible if it follows a power-law decay, meaning its coefficients decay rapidly according to

$$|f^*_k| \le R\, k^{-1/q} \qquad (3)$$

where f* is the rearranged signal starting with the highest coefficients, R is a constant, and q is the compression ratio. For small values of q, compressibility becomes similar to sparsity. Since most of the energy in such signals is stored in very few components, compressible signals can be approximated as sparse signals. That is, if the signal f is sorted in descending order, retaining the first s entries of f* with the highest coefficients gives a sparse approximation f_s of the signal. We find that the sparse approximation f_s is very close to the original signal f:

$$\|f - f_s\|_2 \le R\, s^{1/2 - 1/q} \quad \text{and} \quad \|f - f_s\|_1 \le R\, s^{1 - 1/q} \qquad (4)$$

It is required for the signal f to have few non-zero coefficients. This can be generalized by making the signal f sparse in terms of a sparsifying matrix: we construct an orthonormal basis as the columns of a matrix D such that the signal f = Dx is sparse, i.e.

$$\|x\|_0 \le s \ll d. \qquad (5)$$

According to [7], if the sampling operator φ satisfies the RIP (as explained later in the paper), then using linear programming to solve y = φx + e will guarantee the construction of an approximation of the original signal with error bounds

$$\|\hat{f} - f\|_2 \le C\,[\varepsilon + R\, s^{1/2 - 1/q}] \quad \text{with} \quad \|e\|_2 \le \varepsilon \qquad (6)$$

where f̂ is the approximated signal, e is the noise in the measurements, and ε is the noise bound.
Sampling Matrices and RIP. The concept of tight frames implies that for any φ, if two columns were chosen arbitrarily, those two columns will be close to orthonormal. In other words, if two random vectors f₁, f₂ were projected into a lower-dimensional subspace, f₁ and f₂ will remain approximately the same distance apart.


It has been shown in CS that when φ is drawn from a Gaussian matrix, or formed by random row sampling without replacement from an orthogonal matrix (e.g., a Fourier matrix), then the m × n matrix φ is well conditioned, in the sense that if the signal is s-sparse (s ≪ n), then on the order of s log n measurement samples in the Fourier domain suffice to recover ŷ. Next we will define over-complete representations and explain their construction.
Construction of an Overcomplete Representation. Large basis functions with higher dimensions and many redundancies can incorporate more patterns. This redundancy is used to enhance the signal reconstruction, as well as denoising and pattern recognition. Many types of arbitrary matrices satisfy the RIP, such as matrices with Gaussian or Bernoulli entries. It has also been shown in [8] that if ψ has independent rows, then ψ satisfies the RIP. The Fourier basis provides a simple, good, but non-optimal approximation of the sensed signals. A Discrete Fourier Transform (DFT) matrix is defined as a p × p matrix, denoted F_p. F_p is a complex matrix whose entries (m, n) are given by:

$$\{F_p\}_{m,n} = \frac{1}{\sqrt{p}} \exp\!\left(\frac{2\pi i m n}{p}\right) \qquad (7)$$

3

Calculating Intensity

In order to find which pixel is in focus, we need to isolate the pixel with the highest intensity to add to the stack. We will explain the methods we used in our algorithm next. 3.1

The Laplacian Operator

In order to create an all-in-focus image, we need to determine which part of the image is in focus. Most commonly used methods are variation of edge detection techniques [9] that are used to highlight content with high frequency. The parts of the image which are most in focus have highest frequencies. The Laplacian operator technique is used to determine regions of high intensity. The Laplacian operator is a differential operator given by the divergence of a gradient of a function. It is used in edge detection because it detects regions of rapid high intensities. the The Laplacian L(x, y) of an image with pixel intensity I is given by: L(x, y) =

∂2I ∂2I − ∂ 2 x2 ∂ 2 y2

(8)

As each block in the image is represented as a set of discrete pixels we can use a discrete convolution kernel that approximates the derivatives defined by the Laplacian function.

3.2 Difference of Gaussians

Difference of Gaussians (DoG) is a feature enhancement algorithm that subtracts one blurred version of a grayscale image from a more strongly blurred version of itself, where both are produced with a Gaussian convolution filter. In our application, DoG subtracts two Gaussian kernels, one with a larger standard deviation than the other [10]. This step leaves us with an array of the highest-intensity coefficients in each block x. DoG is calculated by the following equation:

DoG(x, y) = (1/(2πσ_1²)) e^{−(x²+y²)/(2σ_1²)} − (1/(2πσ_2²)) e^{−(x²+y²)/(2σ_2²)}   (9)

Fig. 2. DoG applied to one of the images in the stack: (a) Original image (b) Highest intensity points

Figure 2 shows the points of highest intensity, extracted by applying DoG to one of the images in the stack. We can see that only the region in focus remains visible after thresholding the DoG result.
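A minimal sketch of DoG built from two Gaussian blurs is shown below (OpenCV; the σ values are illustrative and not the parameters used in the paper).

```python
import cv2
import numpy as np

def difference_of_gaussians(gray, sigma1=1.0, sigma2=2.0):
    # Subtract a more-blurred copy from a less-blurred one; what survives is
    # band-pass detail, strongest in the in-focus regions (then threshold it).
    img = gray.astype(np.float32)
    g1 = cv2.GaussianBlur(img, (0, 0), sigmaX=sigma1)
    g2 = cv2.GaussianBlur(img, (0, 0), sigmaX=sigma2)
    return g1 - g2
```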

3.3 Gaussian Blur

Gaussian blur, or Gaussian smoothing, filters the image by replacing the value of each pixel with a weighted average of its neighboring pixels [11]. The weights follow a Gaussian distribution given by:

G(x, y) = (1/(2πσ²)) e^{−(x²+y²)/(2σ²)}   (10)

where σ is the standard deviation of the distribution. This step is performed with the cv2.GaussianBlur() function provided by the OpenCV library in Python.
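For example (the kernel size, σ, and the stand-in block below are illustrative, not the paper's settings):

```python
import cv2
import numpy as np

block = np.random.randint(0, 256, (64, 64), dtype=np.uint8)  # stand-in image block
blurred = cv2.GaussianBlur(block, (5, 5), sigmaX=1.5)        # Gaussian-weighted neighbourhood average
```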

4 Algorithm Design

In the implemented Python code, the input is a set of images of different focal planes; the sparsity s and the block size b × b must be specified. First, all images are spatially mapped and aligned. Next, the images are divided into blocks, and the total number of blocks is calculated from the image size and the specified block size. Difference of Gaussians is applied first to sharpen the edges in the images. The corresponding blocks are then aligned; padding is not required since all images are the same size and are blocked identically. To detect sharp edges, the Laplacian and DoG are applied to all blocks, and the result is the sum of the outcomes over the stack. The result is saved in a matrix, which is rearranged to obtain the highest coefficients. The sparsity s then specifies how many coefficients are used for reconstruction. A tight frame is constructed using the block size, and each block is reconstructed individually using the Fourier basis matrix. Gaussian blur is used to enhance the features in the reconstructed blocks, and the reconstructed blocks are stitched together to form the AIF image. The steps are summarized in Algorithm 1 below.

5 Results

Sample images were downloaded from https://www.eecis.udel.edu/~nianyi/LFSD files/ and used for the results discussed below. The set contains 10 images of size 1080 × 1080 with different focus planes. Our algorithm was implemented in Python using the OpenCV library. The MSE was calculated between the output images and an AIF image obtained with a conventional reconstruction method, which serves as the accuracy reference.

Algorithm 1. Compressive Focal Stack Procedure
1: procedure CFS(BlockSize(b × b), SparsityFactor(s))
   inputs: a set of N images                ▷ images of different focal depths
2:   Block images
3:   Calculate total number of blocks K
4:   Align corresponding blocks
5:   for i = 1 to K do                      ▷ perform for every block
6:     Create a blank block                 ▷ size b × b
7:     for j = 1 to N do                    ▷ for corresponding blocks in all images
8:       Apply Gaussian blur
9:       Compute the Laplacian for all blocks to generate a gradient map
10:      Return coefficients matrix x
11:    end for
12:    Solve the minimization problem: min ‖x‖_1 s.t. f = ψx
13:  end for
14:  Stitch reconstructed blocks
15:  output: all-in-focus image
16: end procedure
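The sketch below illustrates only the block-wise focus-measure part of this procedure (selecting, for each block, the sharpest image by Laplacian energy); it omits the compressive-sensing reconstruction step and is our simplified illustration, not the authors' implementation.

```python
import cv2
import numpy as np

def simple_block_focus_stack(images, b=32):
    # Simplified sketch: for every b x b block, keep the block from the image
    # with the largest Laplacian energy (sharpest), then stitch the winners.
    # Algorithm 1 additionally keeps only the s largest coefficients and
    # reconstructs each block with a Fourier basis; that CS step is omitted here.
    h, w = images[0].shape[:2]
    out = np.zeros_like(images[0])
    for y in range(0, h, b):
        for x in range(0, w, b):
            blocks = [img[y:y + b, x:x + b] for img in images]
            energy = [np.abs(cv2.Laplacian(blk.astype(np.float64), cv2.CV_64F)).sum()
                      for blk in blocks]
            out[y:y + b, x:x + b] = blocks[int(np.argmax(energy))]
    return out
```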


The results section is organized into the following parts. First, we test the results by varying the sparsity level and the number of input images. In the second part, noise is introduced to all input images. The last part shows the algorithm's performance when one or more images are corrupted by noise. To test the robustness of the algorithm in noisy environments, Gaussian noise was added to all input images and the algorithm was run on this noisy set. A plot of the MSE as a function of sparsity is given in Fig. 3. It shows that an AIF image can be reconstructed with good accuracy from very few input images using only 35% of the coefficients.

Fig. 3. Calculating MSE for reconstructions with different number of input images

In Fig. 4 we demonstrate the denoising effect of compressive sensing, which comes from discarding the lower coefficients. For noisy images, results improve with a higher number of samples: with few samples the performance is relatively poor, but it works well for a sparsity of 15% or more. When the focal stack is interrupted by one or more noisy images, a noticeable drop in accuracy occurs. In this experiment, we set the sampling ratio to 0.4 and the number of input images to 10, replace one or more input images with an 80% noisy image, and compare results as the number of noisy images in the stack increases. Results are shown in Fig. 7 and the error graph is shown in Fig. 5.

Fig. 4. Calculating MSE for reconstructions with different number of noisy input images


Fig. 5. MSE vs. number of noise-interrupted images


Fig. 6. Results with varying sparsity levels: (a) Sparsity Factor = 0.0; (b) Sparsity Factor = 0.09; (c) Sparsity Factor = 0.30; and, (d) Sparsity Factor = 1.0.


The results show that we can still reconstruct the image at a very low sampling ratio and with a limited number of images, but performance improves as the number of images used increases from 4 to 10. A visual set of experimental results is shown in Fig. 6. Figure 7 shows a sample noisy image (Fig. 7(a)) side by side with the reconstructed image (Fig. 7(b)).


Fig. 7. (a) Sample input image with 10% noise, (b) Reconstructed image from 10 input images with 10% noise, (c) Sample input image with 80% noise, (d) Reconstructed image from 1 noisy input image, (e) Reconstructed image from 5 noisy input images, (f) Reconstructed image from 10 noisy input images.

6 Limitations and Future Work

Compressive sensing shows its ability to provide better reconstructions for lossy or noisy signals. A key limitation of this algorithm is that it does not account for motion in the scene: it requires a fixed scene and a fixed camera, with only the aperture or lighting varying. Another limitation is that all images need to be perfectly aligned or cropped in a pre-processing stage. In addition, this method is not sensitive to texture-less regions. Using a trained dictionary and greedy sampling algorithms could provide better results with a smaller number of input images.

References
1. Kalaivani, K., Phamila, Y.A.: Analysis of image fusion techniques based on quality assessment metrics. Indian J. Sci. Technol. 9(31), 1–8 (2016)
2. Kutulakos, K., Hasinoff, S.W.: Focal stack photography: high-performance photography with a conventional camera. In: MVA, 20 May 2009, pp. 332–337 (2009)
3. Candès, E.J., Wakin, M.B.: An introduction to compressive sampling. IEEE Signal Process. Mag. 25(2), 21–30 (2008)
4. Candes, E.J., Romberg, J., Tao, T.: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 52(2), 489–509 (2006). https://doi.org/10.1109/TIT.2005.862083
5. Divekar, A., Ersoy, O.: Image fusion by compressive sensing. In: 2009 17th International Conference on Geoinformatics, Fairfax, VA, pp. 1–6 (2009). https://doi.org/10.1109/GEOINFORMATICS.2009.5293446
6. Ito, A., Tambe, S., Mitra, K., Sankaranarayanan, A.C., Veeraraghavan, A.: Compressive epsilon photography for post-capture control in digital imaging. ACM Trans. Graph. (TOG) 33(4), 88 (2014)
7. Candès, E.J.: The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique 346(9–10), 589–592 (2008). https://doi.org/10.1016/j.crma.2008.03.014
8. Mendelson, S., Pajor, A., Tomczak-Jaegermann, N.: Uniform uncertainty principle for Bernoulli and subgaussian ensembles. Constr. Approx. 28(3), 277–289 (2008)
9. Broad, T., Grierson, M.: Light field completion using focal stack propagation. In: ACM SIGGRAPH 2016 Posters, 24 July 2016, p. 54. ACM (2016)
10. Assirati, L., Silva, N.R., Berton, L., Lopes, A.D., Bruno, O.M.: Performing edge detection by difference of Gaussians using q-Gaussian kernels. J. Phys.: Conf. Ser. 490(1), 012020 (2014)
11. OpenCV: Smoothing Images — OpenCV 2.4.13.5 documentation. https://docs.opencv.org/2.4/doc/tutorials/imgproc/gausianmedianblurbilateralfilter/gausianmedianblurbilateralfilter.html

SfM Techniques Applied in Bad Lighting and Reflection Conditions: The Case of a Museum Artwork

Laura Inzerillo
University of Palermo, 90100 Palermo, Italy
[email protected]

Abstract. In recent years, SfM techniques have been widely used especially in the field of Cultural Heritage. Some applications, however, remain undefined in cases where the boundary conditions are not suitable for the technique. Examples of this are instances where there are poor lighting conditions and the presence of glass and reflective surfaces. This paper presents a case study where SfM is applied, using a DSLR camera (Nikon D5200), to the “Head of Hades” inside a glass theca and under a large number of light sources at different distances and of different intensities and sizes. The geometric evaluation has been made comparing the DSLR camera model against the 3D data acquired with structured light systems.

Keywords: Photogrammetry · Museum · 3D model

1 Introduction

SfM techniques have seen widespread use and application, especially in the 3D acquisition of Cultural Heritage and in particular of architectural and archaeological works [1–6]. Thanks to numerous applications and experiments, it has been possible to compile a sort of manual indicating the ideal and optimal conditions for obtaining a metrically trustworthy model [7, 8]. This study, on the contrary, presents a very particular application to a specimen of inestimable historical value: the Head of Hades, exhibited in a museum inside a theca. The presence of five light sources at the top, each with a diameter of 6 cm, and an unlit background made the use of the flash necessary, which created reflectance problems due to the glass. The data set, produced according to the photogrammetric guidelines [9–12], does not explicitly account for the real problems of using the flash or of changing lens and focal length. Nevertheless, the model obtained reveals colors and geometries of the work that are not always perceptible to the naked eye. Under these operating conditions, neither geometric nor radiometric accuracy could be taken for granted. The geometric evaluation has been performed by comparing the DSLR camera (Nikon D5200) model against 3D data acquired with a structured light system.


1.1 Aim and Structure of the Paper

This paper aims to evaluate the potential of a DSLR camera for SfM studies, verifying its imaging capability to provide accurate 3D geometries in bad lighting and reflection conditions. For this reason, the method is applied to the Head of Hades sample and the result obtained is compared to ground truth data collected with a high-resolution structured light scanner. The method adopted requires skills in the fields of image acquisition and processing [13, 14]; however, it would be desirable to identify a repeatable methodological approach for these extreme conditions.

1.2 Significance of the Research

Life Cycle Analysis (LCA) in the Cultural Heritage field is a new concept being introduced to management staff to guarantee the future conservation of historically important artwork. The possibility of obtaining a high-accuracy 3D survey using user-friendly technology with a low-cost investment is an important goal in the management of Cultural Heritage. Until now, it has been possible to apply these techniques to exhibited and accessible works, but not to those placed inside a glass box, and this has limited the completeness of such surveys. The research therefore has a fundamental impact on the monitoring of artworks that are contained inside a glass box.

2 Head of Hades Sample

The Head of Hades (or Barbablù) is a polychrome terracotta head dating from the Hellenistic age, most likely depicting Hades, the Greek god of the underworld, and coming from the archaeological park of San Francesco Bisconti at Morgantina (Enna, Sicily, Italy). The use of color in the Head of Hades has a clear symbolic value: the blue of the beard, devoid of realistic references, recalls the concept of eternity through its assimilation with the color of the sky, but it also carries funerary references relating to the image of the god of the underworld. The Head of Hades was stolen at the end of the Seventies from the archaeological site of Morgantina, in the territory of Aidone. Between the end of the Seventies and the Eighties, the archaeological park was the subject of numerous clandestine excavations, resulting in the removal of finds that were illicitly exported and have been returned to Italy in recent years. The following figures show the Head of Hades within the glass theca, where the light sources are overhead. In Fig. 1, we see the environment and the visibility of the object as the naked eye perceives it: without the flash, the camera reproduces the same colors that you see with your eyes. The colors are a little faded and difficult to see in some parts of the head. In Fig. 2, we can see the black environment and the contrasting colors due to the activation of the camera's flash.


Fig. 1. The Head of Hades exhibited at Salinas museum inside a glass theca, under light sources. These pictures were taken without flash.

Fig. 2. The Head of Hades flash pictures: in these pictures it is possible to admire the hair and beard colors.

The final model is affected by these shades and intensities of colors from both data sets, with and without flash. Therefore, the shadows due to the light sources (without flash) and the contrasted colors of the hair and beard (with flash) can be observed. To obtain color fidelity in the 3D model, a color checker should have been used close to the object and under the different light sources, but this was impossible for the Head of Hades sample.


Table 1. Nikon D5200 features.
Sensor name/type: APS-C CMOS
Sensor size: 23.5 × 15.6 mm
Image resolution: 6000 × 4000 px
Pixel size: 4 µm
Focal length: 55 mm
Distance from the object: 1 m
GSD: 0.07272 mm

Table 2. Structured light scanner features.
3D resolution: 0.1 mm
3D point accuracy: 0.05 mm
3D accuracy over distance: 0.03% over 100 cm
Texture resolution: 1.3 Mp
Colors: 24 bpp
Structured light source: Blue LED
Data acquisition speed: 1 mln points/s
Video frame rate: 7.5 fps

3 3D Reconstruction Model

The data set has been acquired by considering both the resolution of the structured light scanner model [15] and a series of photographs taken at a 1 m distance from the object, to obtain a GSD of 0.07272 mm according to the Nikon and scanner specifications (Tables 1 and 2). The data set was captured at two different heights, along two horizontal layers that follow a circle of approximately 1 m radius around the object. The model produced has been scaled using the side of the glass theca as an accurate measurement; it was forbidden to use any other markers. The calibration and scaling of the model, performed before the processing that produces the dense reconstruction, allowed the comparison in the CloudCompare environment [16]. The data set consists of 284 shots at 24 Mpx, all of which were aligned. The parametric calculation required 22 h 27 min 51 s, and the final 3D model has 7,433,829 faces and 3,781,806 vertices (Fig. 3). In Figs. 4, 5, 6 and 7, the different visualizations of the photogrammetric model can be seen: the textured and meshed models, the point and dense clouds, and a textured detail.
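For reference, the reported GSD is consistent with the usual pinhole relation GSD = pixel size × distance / focal length; the short computation below restates the Table 1 values (the formula itself is our assumption, as the paper does not state it explicitly).

```python
# Ground sampling distance from the Nikon D5200 parameters in Table 1.
pixel_size_mm = 0.004      # 4 micrometres
distance_mm = 1000.0       # 1 m object distance
focal_length_mm = 55.0

gsd_mm = pixel_size_mm * distance_mm / focal_length_mm
print(round(gsd_mm, 5))    # 0.07273, matching the ~0.07272 mm reported
```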


Fig. 3. Top and front view of the 284 shots.

Fig. 4. Right, front, back and left side of Head of Hades 3D textured model.

Fig. 5. Right, front, back and left side of Head of Hades 3D meshed model.


Fig. 6. Point and dense cloud of Head of Hades.

Fig. 7. Detail of Head of Hades 3D textured model.


4 Geometric Results

The geometric evaluation has been performed in the CloudCompare environment through the alignment of the dense cloud obtained from the photogrammetric reconstruction with the one from the structured light scanner. The alignment was made by picking at least 4 equivalent point pairs in the two clouds. The result exceeded every expectation, with an RMS value of 0.00689649. As can be seen in Fig. 8, the worst parts of the model are those containing the hair and beard. This result was expected, as those are the most geometrically complex parts of the head.

Fig. 8. Two different views of the alignment phase on Cloud Compare.

5 Conclusions

This paper has presented an evaluation of the photogrammetric technique using image-based modeling software, to verify the capability of a DSLR camera to provide accurate 3D geometries of the final models produced under bad lighting and reflective conditions. The comparison of the models shows that the 3D model from the Nikon D5200 camera is suitable to meet the geometric requirements for replicating small objects in an unfavorable environment. This outcome fits the needs of small and medium heritage museums in producing accurate 3D catalogues of their small artifacts and objects, even if they are inside a glass theca. The results from the experiments show potential for the future of 3D modeling within the fields of archaeological and architectural artefacts in any lighting and reflection condition.

Acknowledgment. This research has received funding from the European Union's Horizon 2020 Programme, SMARTI ETN, under the Marie Skłodowska-Curie actions for research, technological development and demonstration, under grant agreement number 721493.


References
1. Santagati, C., Inzerillo, L., Di Paola, F.: Image-based modeling techniques for architectural heritage 3D digitalization: limits and potentialities. In: International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. XL-5/W2, pp. 550–560 (2013)
2. Inzerillo, L.: Smart SfM: Salinas archaeological museum. In: International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. XLII-2/W5, pp. 369–374 (2017). https://doi.org/10.5194/isprs-archives-XLII-2-W5-369-2017
3. Scopigno, R., et al.: Digital fabrication techniques for cultural heritage: a survey. Comput. Graph. Forum 36(1), 6–21 (2017)
4. Garstki, K.: Virtual representation: the production of 3D digital artifacts. J. Archaeol. Method Theory 24(3), 726–750 (2017)
5. Gonizzi Barsanti, S., Guidi, G.: 3D digitization of museum content within the 3D icons project. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. II-5/W1, 151–156 (2013)
6. Wachowiak, M.J., Karas, B.V.: 3D scanning and replication for museum and cultural heritage applications. J. Am. Inst. Conserv. 48(2), 141–158 (2009)
7. Inzerillo, L., Di Paola, F.: From SfM to 3D print: automated workflow addressed to practitioner aimed at the conservation and restoration. In: International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. XLII-2/W5, pp. 375–382 (2017). https://doi.org/10.5194/isprs-archives-XLII-2-W5-375-2017
8. Barsanti, S.G., Remondino, F., Visintini, D.: Photogrammetry and laser scanning for archaeological site 3D modeling – some critical issues. In: Roberto, V., Fozzati, L. (eds.) Proceedings of the 2nd Workshop on The New Technologies for Aquileia, pp. 1–10 (2012). http://ceur-ws.org/Vol-948/paper2.pdf
9. Mikhail, E.M., Bethel, J.S., McGlone, C.: Introduction to Modern Photogrammetry. Wiley, Hoboken (2001)
10. Remondino, F., Fraser, C.: Digital camera calibration methods: considerations and comparisons. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 36(5), 266–272 (2006)
11. Schonberger, J.L., Frahm, J.-M.: Structure-from-motion revisited. In: Proceedings of CVPR (2016)
12. Menna, F., et al.: 3D digitization of an heritage masterpiece – a critical analysis on quality assessment. In: International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences – ISPRS Archives, vol. 41, pp. 675–683 (2016). https://doi.org/10.5194/isprsarchives-xli-b5-675-2016
13. Gaiani, M., Remondino, F., Apollonio, F.I., Ballabeni, A.: An advanced pre-processing pipeline to improve automated photogrammetric reconstructions of architectural scenes. Remote Sens. 8(3), 178 (2016)
14. Di Paola, F., Milazzo, G., Spatafora, F.: Computer aided restoration tools to assist the conservation of an ancient sculpture. The colossal statue of Zeus enthroned. In: International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. XLII-2/W5, pp. 177–184 (2017). https://doi.org/10.5194/isprs-archives-XLII-2-W5-177-2017
15. Georgopoulos, A., Ioannidis, C., Valanis, A.: Assessing the performance of a structured light scanner. In: Commission V Symposium, International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 38(5). Citeseer (2010)
16. CloudCompare (version 2.8) [GPL software] (2017)

Fast Brain Volumetric Segmentation from T1 MRI Scans

Ananya Anand¹ and Namrata Anand²
¹ Warren Alpert School of Medicine, Brown University, Providence, RI, USA
ananya [email protected]
² Bioengineering Department, Stanford University, Stanford, CA, USA
[email protected]

Abstract. In this paper, we train a state-of-the-art deep neural network segmentation model to do fast brain volumetric segmentation from T1 MRI scans. We use image data from the ADNI and OASIS image collections and corresponding FreeSurfer automated segmentations to train our segmentation model. The model is able to do whole brain segmentation across 13 anatomical classes in seconds; in contrast, FreeSurfer takes several hours per volume. We show that this trained model can be used as a prior for other segmentation tasks, and that pre-training the model in this manner leads to better brain structure segmentation performance on a small dataset of expert-given manual segmentations.

Keywords: Magnetic resonance imaging · Supervised machine learning · Neural networks (computer) · Artificial intelligence · Computer vision systems

1 Introduction

Mapping anatomical changes in the brain is helpful for diagnosing neurodegenerative and psychiatric disorders, identifying tumors, as well as identifying healthy development and aging [1–4]. Manual segmentation of brain anatomical structures is time-consuming and requires expertise; therefore, there is a need for automated segmentation methods that are robust and fast. The most widely-used method for automated brain structure segmentation is FreeSurfer [5], a probabilistic method which is comparable to manual segmentations but takes several hours per brain volume. In this paper, we use deep convolutional neural network models to predict FreeSurfer segmentations of transverse, sagittal, and coronal sections from T1-weighted MRI scans. Deep convolutional neural networks have achieved great success in image classification and segmentation. These models are capable of learning complex functions of input data, without the need for hand-engineered features. Moreover, the forward inference for these models is very fast, on the order of seconds for even the most complex models. Therefore, we train a deep convolutional neural network model to predict anatomical segmentations, which allows for very fast


segmentation of MRI brain volumes. Several deep learning models have been proposed for semantic segmentation [6,7], and in this paper we use the Deeplab-v3+ architecture (Fig. 1) [8], the current state-of-the-art for segmentation, which is described in Sect. 2.1 below. In addition to fast volumetric segmentation, we are interested in a model which can improve when given expert-labeled manual segmentations. We show that after pre-training the network to do anatomical segmentation using FreeSurfer labels, the network is able to achieve better performance on an extremely small dataset of expert-labeled scans compared to the same model that is not pre-trained on FreeSurfer labels. Several papers have used deep neural networks specifically for classification and segmentation of brain MRI images, as we do in this paper, with variation in terms of datasets, architectures, and objectives [9–14]. Most similar to this paper is QuickNat, which also uses FreeSurfer labels to train a fast anatomical segmentation model [15]. A comparison between our method and QuickNat is given in the discussion. Our main contributions are as follows:
1. We train a state-of-the-art segmentation model, Deeplab-v3+, to predict anatomical segmentations from a large open-source dataset.
2. We show that this trained model can be used as a prior for other segmentation tasks, and that pre-training the model in this way leads to better brain structure segmentation performance on a small dataset of expert-labeled images.

2 Methods

2.1 Models

Deeplab-V3+ Segmentation Model. The Deeplab-v3+ model (Fig. 1) [8] is a deep neural network segmentation model, which achieves state-of-the-art performance on the PASCAL VOC 2012 semantic image segmentation test dataset without any post-processing techniques. This model employs atrous or dilated convolutions to allow for multi-scale feature learning. The primary feature extractor used in the model is the Xception network [16], which extends the Inception v3 model by adding depthwise separable convolutions [17]. Furthermore, Deeplab-v3+ includes atrous spatial pyramid pooling (ASPP) and a decoder module for refining segmentation results. A schematic of the model is given in Fig. 1. In this paper, we fine-tune a Deeplab-v3+ model trained on the PASCAL VOC 2012 dataset to predict FreeSurfer segmentations and report results in Sect. 3.1. We use as inputs rescaled 2D transverse, coronal, and sagittal sections from T1 MRI scans; we found that a single model could learn segmentations for all three types of sections. We further fine-tune the trained model on expert-labeled datasets.
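To illustrate the atrous-convolution idea, a generic TensorFlow/Keras sketch is given below; the rates [6, 12, 18] match those reported later for training, while the filter counts and layer arrangement are illustrative assumptions rather than the authors' architecture.

```python
import tensorflow as tf

def aspp_branches(x, rates=(6, 12, 18), filters=256):
    # Parallel atrous (dilated) 3x3 convolutions, the core idea behind ASPP:
    # each branch sees a different effective receptive field on the same input.
    branches = [tf.keras.layers.Conv2D(filters, 3, padding="same",
                                       dilation_rate=r, activation="relu")(x)
                for r in rates]
    return tf.keras.layers.Concatenate()(branches)

inputs = tf.keras.Input(shape=(513, 513, 3))
features = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same")(inputs)
aspp = aspp_branches(features)
```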

Fig. 1. Deeplab-v3+ model architecture. Figure adapted from [8]

Table 1. Input MRI image datasets and labels.
Dataset | Patients | MRI scans | Segmentations | Labeled images
OASIS-3 train set | 996 | 2528 (T1) | 2514 (Freesurfer) | 772676
OASIS-3 test set | 100 | 239 (T1) | 179 (Freesurfer) | 140592
ADNI train set | 826 | 14112 (T1) | 5208 (Freesurfer) | 2289824
ADNI test set | 92 | 380 (T1) | 143 (Freesurfer) | 99441
MRBrainS13 train set | 5 | 5 (T1, T1-IR, T2-FLAIR) | 5 (Manual) | 240
MRBrainS18 train set | 7 | 7 (T1, T1-IR, T2-FLAIR) | 7 (Manual) | 336

2.2 Data

Input Data for Segmentation Model. Our input data are T1-weighted MRI scans from the Open Access Series of Imaging Studies (OASIS-3) project [18] and the Alzheimer's Disease Neuroimaging Initiative (ADNI-3) project [19]. Statistics for the datasets are given in Table 1. The ADNI data is from patients aged 34–96 with a male to female ratio of 1.3 : 1; the OASIS data is from patients aged 42–97, with a male to female ratio of 1 : 0.77. Both datasets include longitudinal data for individual patients. The ADNI dataset includes T2, fluid-attenuated inversion recovery (FLAIR), diffusion tensor imaging (DTI), functional magnetic resonance imaging (fMRI), and positron emission tomography (PET) scans for some of the patients, as well as additional multi-modal clinical, genetic, and biospecimen data available online. The OASIS dataset also includes T2-weighted, FLAIR, and DTI sequences, among others. We expect model performance to improve after incorporating this additional data, but for the purpose of this paper, we solely use 3T and 1.5T T1 scans as input data for our segmentation models. We divide the ADNI and OASIS scans into training (∼90%) and test sets (∼10%), ensuring that scans corresponding to the same patient appear in either the train or the test set, but not both.
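A minimal sketch of such a patient-level split is shown below; the 'patient_id' field and the data structure are hypothetical and are not the authors' data-loading code.

```python
import random

def split_by_patient(scans, test_fraction=0.1, seed=0):
    # 'scans' is a list of dicts with a 'patient_id' key (assumed structure).
    # All scans of a patient land entirely in train or entirely in test.
    patients = sorted({s["patient_id"] for s in scans})
    random.Random(seed).shuffle(patients)
    n_test = max(1, int(len(patients) * test_fraction))
    test_ids = set(patients[:n_test])
    train = [s for s in scans if s["patient_id"] not in test_ids]
    test = [s for s in scans if s["patient_id"] in test_ids]
    return train, test
```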


As labels for our segmentation model, we use the FreeSurfer cortical reconstruction and volumetric segmentations available for a subset of the ADNI and OASIS scans [5]. FreeSurfer is a Bayesian method for automated segmentation of the brain, including subcortical structures such as the thalamus, hippocampus, and basal ganglia. The FreeSurfer processing pipeline includes intensity normalization and atlas registration [20], segmentation of white matter and gray matter structures from T1 images [5], and cerebral cortex parcellation [21]. Although FreeSurfer circumvents the need for manual labeling of MRI images, the automated process is still fairly time-intensive, requiring several hours for full scan segmentation. In contrast, the trained neural network presented in this paper takes seconds to segment the entire brain volume. Moreover, the trained network does not require that the input scans be atlas-registered.

Input Data for Fine-Tuning and Assessing Segmentation Model. To demonstrate the usefulness of the trained segmentation model as a prior for other segmentation tasks, we also make use of two additional datasets – the MRBrainS13 [22] and MRBrainS18 [23] datasets. Both datasets are available as part of challenges associated with the Medical Image Computing and Computer Assisted Intervention (MICCAI) conference. Statistics for the datasets are given in Table 1. The MRBrainS13 and MRBrainS18 datasets consist of 5 and 7 brain MRI scans, respectively, each with T1, T1 inversion recovery (T1-IR), and T2-FLAIR sequences. These scans have corresponding segmentations of 10 anatomical brain structures, made manually by experts. Some of the patient scans also include images with pathologies, such as white matter lesions. We fine-tune our segmentation models on this small dataset in order to show improved generalization after pre-training on FreeSurfer automated segmentations; results are given in Sect. 3.1. The scans are bias-field corrected using the N4ITK algorithm [24] and all scans are aligned.

Different MRI modalities provide different imaging perspectives through manipulation of various parameters. T1-weighted images tend to have short echo and relaxation times and, as such, create images that most closely resemble the macroscopic appearance of tissue (fat – high intensity, fluid – low intensity, white matter intensity > grey matter intensity); therefore we primarily used T1 scans for anatomical segmentation. T2-weighted images have long echo and relaxation times, which cause grey matter to appear more hyperintense compared to white matter and fluid to appear hyperintense as well. FLAIR (fluid-attenuated inversion recovery) imaging is similar to T2 imaging; however, the signal from fluid (e.g. cerebrospinal fluid) is removed. T1 inversion recovery (T1-IR) improves T1 contrast and can be used to suppress signals from specific types of tissues. In combination, these different imaging methods might be utilized to better segment anatomical structures as well as to identify abnormalities such as tumors, bleeds, and areas of infarction. We had access to far more T1 scans compared to T2, FLAIR, or T1-IR scans, and, as a result, used T1 scans only for our primary segmentation model.


Fig. 2. Deeplab-v3+ segmentation results. Example segmentation results across 13 anatomical classes for the (a) OASIS test set and (b) ADNI test set. Results were randomly selected from respective test sets. The segmentation model can segment transverse, coronal, and sagittal sections.

3 Experiments

3.1 Segmentation Model

Data Processing and Training. As described above, we use ADNI and OASIS T1 scans and corresponding FreeSurfer labels to train a segmentation model. The FreeSurfer segmentations give labels corresponding to subcortical segmentation (‘aseg’) and cortical parcellation (‘aparc’). We map these labels onto the 13 classes given in Table 2. We define the “Other” class as those regions which would ordinarily be skull-stripped; these regions are those which overlap with the complement of the brain mask generated by FreeSurfer but are not low-intensity background pixels. Example FreeSurfer-derived segmentation labels are shown in Fig. 2. We pre-process the input data as follows. We first use FreeSurfer software to rescale and preprocess all of the T1 scans. These steps include registering the scans to the MNI305 atlas [25], correcting for intensity non-uniformity in the


MRI data, and doing intensity normalization; we omit the skull-strip step. We then normalize the entire volume to have mean 0 and standard deviation 1, shift the entire volume to lie between the values 0 and 1, multiply these values by 255, and cast the volume to integers to save as image files. We save the segmentations as PNG files as well, where the pixel class is the explicit value of the corresponding pixel in the image file. Finally, as we did our experiments in the Tensorflow framework [26], we build custom data loaders to save the images and corresponding segmentations as TFRecord files for fast data loading during training.

We fine-tune the Deeplab-v3+ model pre-trained on the PASCAL VOC 2012 dataset on our T1 2D MRI training data, extending the open-source Tensorflow code available online [8]. We train with a batch size of 4 images for 1.8 million steps. Note that this corresponds to just over one training epoch given the size of our input dataset. After this point, test set accuracy for a subset of classes started to decline, likely due to over-fitting, while test set accuracy for other classes continued to improve. The atrous rates are set to [6, 12, 18] and the output stride and decoder output stride are set to 16 and 4, respectively. The 256 × 256 input images are resized to 513 × 513 and the batch normalization parameters are fine-tuned as well during training. The model is trained using softmax loss, and we optimize the model with stochastic gradient descent (SGD) with momentum. We use a base learning rate of 1e−4 and reduce the learning rate by 10% every 2000 steps. We use L2-regularization of the weights with a weighting of 4e−5 on the regularization term. Importantly, rather than re-learn the last layer of the segmentation network, we simply fine-tune the existing last layer, which is set up for a 21-class segmentation task. The network learns to ignore those 8 classes for which we have no corresponding labeled data. In practice, this network learns correct segmentations much faster this way. We augment our dataset during training by including random scalings up to ±10% and random horizontal flips of the images. We found that including random scalings allows the network to segment scans that are not atlas-registered.

Segmentation Performance. In Fig. 2 we show example segmentations for the different anatomical planes, and in Table 2 we report the mean Intersection over Union (IOU) for each class. IOU is defined as IOU = TP / (TP + FP + FN), where TP, FP, and FN are the numbers of pixel-wise true positives, false positives, and false negatives, respectively. We achieve a mean IOU across 13 classes of 0.7744 and 0.7720 for the OASIS and ADNI test datasets, respectively. Among the anatomical classes, the lowest IOU is attained for segmentations of white matter lesions. This is likely due to the fact that white matter hyperintensities are much easier to detect on T2 and FLAIR scans, which were not used as input to our segmentation model; therefore we do not expect strong performance for white matter lesion detection. Relative to the other classes, lower IOU is also attained for segmentations of the corpus callosum and hypothalamus, which might be due to the small volume

Table 2. Segmentation classes and performance for OASIS and ADNI test datasets after 1.8 million iterations.
Class | Class name | OASIS test IOU | ADNI test IOU
0 | Background | 0.9893 | 0.9857
1 | Cortical gray matter | 0.7469 | 0.7759
2 | Basal ganglia | 0.6984 | 0.6621
3 | White matter | 0.8549 | 0.8647
4 | White matter lesions | 0.4779 | 0.4499
5 | Cerebrospinal fluid | 0.7560 | 0.7156
6 | Ventricles | 0.8647 | 0.8584
7 | Brain stem — see row 8 below; 7 | Cerebellum | 0.8727 | 0.8862
8 | Brain stem | 0.8603 | 0.8399
9 | Thalamus | 0.7740 | 0.7625
10 | Hypothalamus | 0.6445 | 0.6611
11 | Corpus callosum | 0.5716 | 0.6407
12 | Other | 0.9554 | 0.9335
Total mean IOU | | 0.7744 | 0.7720
(IOU = intersection-over-union)

of these structures relative to the entire brain volume, as well as possible errors in the FreeSurfer automated segmentation process. Overall, the network can robustly detect and segment the other anatomical structures, which we can see quantitatively from the reported IOU values (Table 2) as well as qualitatively from the correspondence between the labels and predictions for the test set images (Fig. 2). Unlike FreeSurfer segmentations, which can take 20–40 h to complete, our model can segment 256 × 256 × 256 brain volumes in under 30 s on average on a GeForce GTX 1080 Ti GPU. We provide both our pretrained model and our data loading, normalization, and evaluation scripts online at https://github.com/nanand2/deeplab_mri_seg.
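For completeness, the per-class IOU defined above can be computed from a pair of label maps as in the following NumPy sketch (our restatement of the standard definition, not the released evaluation script).

```python
import numpy as np

def per_class_iou(pred, label, num_classes=13):
    # IOU = TP / (TP + FP + FN), evaluated independently for each class.
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (label == c))
        fp = np.sum((pred == c) & (label != c))
        fn = np.sum((pred != c) & (label == c))
        denom = tp + fp + fn
        ious.append(tp / denom if denom > 0 else float("nan"))
    return ious  # the mean IOU is np.nanmean(ious)
```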

3.2 Fine-Tuning on Expert Annotations

Improved Performance with Pre-training. One of the benefits of building a neural network segmentation model is our ability to fine-tune that model on other tasks and datasets. We attempt to learn more refined segmentations from the MRBrainS18 dataset, which provides expert-given manual anatomical segmentations across 9 classes. The main issue with this dataset is its size; there are only 7 segmented brain scans given in the training set. We expect that simply training a segmentation model on these 7 scans without a strong prior would overfit and yield sub-optimal results. Therefore, we present results with and without pre-training on the FreeSurfer labels.


Fig. 3. MRBrains18 Segmentation results. Example test set segmentation results across 9 anatomical segmentation classes for MRBrains18 challenge.

The MRBrains18 dataset contains T1-weighted, T2-IR, and FLAIR scans, along with the corresponding anatomical segmentation. We experimented with fine-tuning models only on T1 scans or using all three scans (multi-modal). The experts used the FLAIR scans to segment the white matter lesions, and they used both T1-weighted scans and the T1-weighted inversion recovery scans to segment the cerebrospinal fluid (CSF) in the extracerebral space. We therefore expected that white matter lesion and CSF segmentation performance would improve for the multi-modal model. We normalized data as described above in Sect. 3.1. The input scans contained only 48 scans in the transverse plane; therefore we restricted ourselves to segmenting transverse planes. We zero-padded the transverse sections to 256 × 256 and created data loaders for both T1-only and multi-modal inputs. We generated 7 different train/test splits for leave-one-out cross-validation. We fine-tuned both the Deeplab-v3+ model pretrained on the PASCAL VOC 2012 dataset as well as the Deeplab-v3+ model further trained on the FreeSurfer segmentation labels; we distinguish these two cases as “without pre-training” and “with pre-training”, to indicate pre-training with anatomical segmentations. We fine-tune these networks for 50K iterations with a base learning rate of 1e − 4 and learning rate decay of 10% every 2000 steps. As before, the 256 × 256 input images are resized to 513 × 513, the batch normalization parameters are finetuned during training. The model is trained with SGD with momentum, and we retain the full last layer. The batch size for each step is 4 images. We use


L2-regularization of the weights with a weighting of 4e−5 on the regularization term. Unlike other fine-tuning approaches, we fine-tune the entire network, not only the last layer(s).

Table 3. MRBrainS18 segmentation results averaged across all 7 test set folds.
Description | T1 scans only (0K / 20K / 50K iters) | T1, T2, and FLAIR scans (0K / 20K / 50K iters)
Background | 0.843 / 0.990 / 0.990 | 0.802 / 0.992 / 0.992
Cortical gray matter | 0.620 / 0.707 / 0.702 | 0.000 / 0.705 / 0.698
Basal ganglia | 0.239 / 0.648 / 0.639 | 0.000 / 0.640 / 0.630
White matter | 0.669 / 0.720 / 0.727 | 0.000 / 0.736 / 0.720
White matter lesions | 0.262 / 0.364 / 0.319 | 0.000 / 0.322 / 0.305
Cerebrospinal fluid | 0.544 / 0.646 / 0.646 | 0.029 / 0.669 / 0.655
Ventricles | 0.722 / 0.852 / 0.841 | 0.000 / 0.829 / 0.817
Cerebellum | 0.506 / 0.828 / 0.808 | 0.000 / 0.825 / 0.814
Brainstem | 0.431 / 0.563 / 0.527 | 0.005 / 0.620 / 0.576
Mean IOU | 0.531 / 0.696 / 0.684 | 0.093 / 0.699 / 0.685
(iters = iterations; IOU = intersection-over-union)

Results for models fine-tuned on the pre-trained network are given in Table 3 and shown in Fig. 3. Results in Table 3 are averaged across all cross-validation folds. We see that the pre-trained network immediately performs well when the input is T1 scans only, as the network has only seen T1 scans and not multi-modal input. After fine-tuning, performance improves for both single- and multi-modal inputs, and we see over-fitting around 50K iterations, where performance no longer improves and begins to diminish. We also see that across cross-validation folds, there is not a dramatic improvement in performance for the multi-modal input model vs the single-modal model. We attribute this to the fact that the original segmentation network is trained on T1 scans only. We compare performance with and without pre-training in Table 4 and Fig. 4 for one cross-validation fold. In Table 4 we report mean IOU across classes for one randomly selected test set fold with and without pre-training on FreeSurfer labels for both single- and multi-modal inputs. For both input types there is clear improvement in performance using the pretrained network. In Fig. 4 we see the issue of overfitting for small datasets – over time the test accuracy flat-lines and ultimately starts to worsen as the network overfits to the small training set; however, with pre-training, the network generalizes better than without. Comparison to Other Methods. In order to compare our segmentation model performance to other methods, we present results on the completed MRBrains13 challenge [22]. The MRBrains13 training dataset contains 5 T1weighted, T2-IR, and FLAIR scans, along with corresponding expert-given anatomical segmentations. We fine-tune our Deeplab-v3+ model pre-trained with FreeSurfer segmentations on T1 scans and segmentations from this dataset


Table 4. MRBrainS18 segmentation results for one randomly selected test set fold, with and without the pre-trained network, after 50K iterations.
Class | T1 only, no pre-training | T1 only, pre-training | T1, T2, and FLAIR, no pre-training | T1, T2, and FLAIR, pre-training
Background | 0.984 | 0.989 | 0.985 | 0.990
Cortical gray matter | 0.628 | 0.710 | 0.619 | 0.710
Basal ganglia | 0.369 | 0.700 | 0.232 | 0.646
White matter | 0.674 | 0.731 | 0.650 | 0.726
White matter lesions | 0.230 | 0.430 | 0.223 | 0.451
Cerebrospinal fluid | 0.624 | 0.708 | 0.610 | 0.712
Ventricles | 0.567 | 0.886 | 0.637 | 0.864
Cerebellum | 0.699 | 0.861 | 0.748 | 0.863
Brainstem | 0.350 | 0.747 | 0.572 | 0.735
Mean IOU | 0.569 | 0.751 | 0.586 | 0.744
(IOU = intersection-over-union)

for 200K iterations, with the same optimization scheme as described above. In Table 5, we report the performance of our fine-tuned model on the test dataset, which contains scans for 15 patients, compared to other methods which predict segmentations from T1 scans. The methods are ranked according to segmentation performance for 15 patient test scans across three metrics (dice coefficient (%), 95th percentile of hausdorff distance (mm), and absolute volume difference (%)) and three anatomical classes (gray matter, white matter, and cerebrospinal fluid). The metrics are defined in [22]. Out of 18 reported methods, our method ranks fourth in terms of overall performance. As shown in Table 5, the fine-tuned model outperforms FreeSurfer automated labeling in adhering to the expert-given manual segmentations. Importantly, our fine-tuned model performs well compared to other algorithms despite no skull-stripping of inputs, no post-processing after segmentation, and no hyperparameter tuning.

4 Discussion

In this paper, we train a state-of-the-art deep neural network segmentation model to do fast volumetric brain segmentation of T1 MRI scans. We show that this model is a good prior for transfer learning for other brain MRI anatomical segmentation tasks and can be trained to outperform FreeSurfer segmentations with respect to expert-labeled data. There are many steps that can be taken to improve the performance of our segmentation models. Since the Freesurfer labels are not error-free, a valuable next step would be to make the segmentation model objective robust to errors by not propagating loss for predictions where the network is highly confident in


Fig. 4. MRBrains18 test accuracy curves with and without pre-training. Test set IOU (Intersection-over-union) versus iteration for one randomly selected test set fold for segmentation with T1 scans as input. Fine-tuning network pre-trained on FreeSurfer labels leads to better performance across segmentation classes.

segmentations that deviate from the ground-truth label. Moreover, the segmentation model can be trained to identify more brain substructures, which might lead to better fine-tuning on other datasets. Segmentation performance can be improved by using Conditional Random Fields (CRFs) [27–29] or Recurrent Neural networks RNNs [30–32] as (learnable) post-processing steps. In addition, we can improve the 2-D patchwise segmentations by taking into account the global 3-D structure of the MRI scans using 3D convolutions or 3D CRFs. Finally, we did not do extensive hyperparameter tuning, which would likely improve the performance of the models. The other existing method similar to the method in this paper is QuickNat [15]. The training methodology for QuickNat also involves pre-training on FreeSurfer labels and fine-tuning on expert-given annotations. Their model architecture is an encoder-decoder fully convolutional network (FCN) model with skip connections. The authors use data from the Multi-Atlas labeling challenge


Table 5. Comparison of the DeepLab-v3+ fine-tuned segmentation model and other reported algorithms on the MRBrainS13 dataset with T1 MRI scans as input. DC = Dice coefficient (%), HD = 95th-percentile Hausdorff distance (mm), AVD = absolute volume difference (%).
Model | Rank | Time | Gray matter DC / HD / AVD | White matter DC / HD / AVD | Cerebrospinal fluid DC / HD / AVD
Fully convolutional network | 1 | ∼2 s | 86.03 / 1.47 / 5.71 | 89.29 / 1.95 / 5.84 | 82.44 / 2.41 / 7.69
Modified U-Net3 | 2 | ∼40 s | 85.24 / 1.62 / 5.98 | 89.00 / 2.00 / 5.94 | 81.27 / 2.66 / 8.28
Atlas of classifiers | 3 | ∼6 s | 84.59 / 1.71 / 6.19 | 88.68 / 2.08 / 6.21 | 80.46 / 2.81 / 8.73
Deeplab-v3+ finetune | 4 | ∼15 s | 84.21 / 1.92 / 6.56 | 88.32 / 2.32 / 6.96 | 80.80 / 2.71 / 8.70
Multi-stage voxel classification | 5 | ∼1.5 h | 85.77 / 1.54 / 5.77 | 88.66 / 2.10 / 6.28 | 81.08 / 2.66 / 8.63
VBM12 r738 with WMHC=2 | 6 | ∼6 min | 82.29 / 2.48 / 6.82 | 87.95 / 2.49 / 7.06 | 74.56 / 3.44 / 14.31
MAP-based framework | 7 | ∼6 s | 82.96 / 2.27 / 6.77 | 87.88 / 2.70 / 7.13 | 78.86 / 3.03 / 10.07
VGG based FCN | 8 | ∼2 s | 74.53 / 4.35 / 15.52 | 80.75 / 4.94 / 17.39 | 70.95 / 4.01 / 16.06
MAP on priors | 9 | ∼10 min | 80.85 / 2.90 / 7.23 | 86.73 / 3.06 / 7.74 | 76.97 / 3.25 / 12.78
Hybrid modified entropy-based | 10 | ∼6 min | 80.09 / 3.03 / 8.61 | 86.76 / 3.01 / 7.69 | 68.03 / 4.59 / 20.40
(remaining rows of the table, from rank 11 onward, are not present in the extracted text)

11

= α si |S| i=1 i=1

(1)

To make our measurement robust against the choice of α, we try three different α values, namely 0.97, 0.98, and 0.99, and accordingly compute three local selfsimilarity values. Figures 2 and 3 show the local self-similarity values of different denoising results of the same input image generated by BM3D with different parameter settings. This example shows that the local self-similarity value correlates well with the smoothness of the denoising results instead of the actual denoising quality. We address this problem with two methods. First, we use a data-driven approach to find the right decision boundary values w.r.t. this local self-similarity feature to identify the right feature value range of a good denoising result. Second, as discussed before, we do not expect that a single feature will be sufficient and we jointly use other features to measure image denoising quality. As shown in our



Fig. 3. SS values for denoising results with various smoothness. All denoising results are generated by BM3D with different parameters. Note that the PSNR values in this figure are normalized to (0, 1).

experiment in Sect. 3, although this feature alone is insufficient, it contributes to the overall performance of image denoising quality prediction.

Structural Residual. The original image structure in an input noisy image should be preserved as much as possible during denoising [2]. Thus, the noise layer I_n, which is the difference between the input noisy image and the denoised result, should contain as little structural residual as possible. Our method estimates the residual structure as a feature to measure image denoising quality. It first uses the method from Chen et al. [4] to compute a structure residual map that captures the amount of structure in the noise layer I_n. Specifically, the structural residual at pixel p is computed as a weighted average of a number of nearby and "similar" noise samples in I_n:

I_n^r(p) = Σ_{q∈Ω} w(p, q, σ_d, σ_c, σ_s) I_n(q)   (2)

where Ω is the neighborhood of pixel p. w(p, q, σd , σc , σs ) is a weight function that computes how the noise value at pixel q contributes to the residual structure computed at pixel p. σd , σc and σs are the standard deviation parameters that control the relative impacts of the spatial distance, color distance, and image structure difference. After we obtain the structural residual map Inr , we compute the structure residual feature SR as follows:



Fig. 4. SR for two different denoising results.

SR = (Σ_p I_n^r(p)²) / N   (3)

where N is the total number of pixels in the image.

Fig. 5. SR-PSNR distribution.

Fig. 6. V R-PSNR distribution.

To capture structural residuals at different image scales, we use three sets of parameters (SR1 : σd = σs = 1.0, σc = 4.0, SR2 : σd = σs = 4.0, σc = 10.0, SR3 : σd = σs = 10.0, σc = 30.0), as suggested in [4], and accordingly estimate three structure residual feature values for each denoising result. As shown in Fig. 5, the SR value of a denoising result is statistically negative correlated with the image denoising quality. Figure 4 shows the SR values of both a good denoising result and a bad denoising result. There is nearly no structural residual in the good denoising result.



Fig. 7. SGM values of different denoising results: noisy input; SGM = 4.59 (PSNR = 19.15); SGM = 2.22 (PSNR = 22.05); SGM = 0.18 (PSNR = 27.01).

Fig. 8. SC values for different denoising results: SC = 0.607 (PSNR = 26.30) and SC = 0.692 (PSNR = 30.03).

Small Gradient Magnitude. According to Liu et al. [20], gradients of small magnitude often correspond to noise in flat regions. We follow their approach and use the standard deviation of the m% smallest non-zero gradient magnitudes in the denoising result to measure the denoising quality. To make this small gradient magnitude feature (SGM ) robust against the parameter m, we set m = 40, 50, 60 and compute three corresponding feature values. Figure 7 shows an example of this feature. Adaptive Structure Correlation. The structure similarity between a well denoised image Iˆ and the input noisy image I should be low in the flat region and high in the highly textured region, while the structure similarity between the resulting noise map In and the input noisy image I should be high in the flat region and low in the textured region [16]. Then for a good denoising result, the structure similarity map between the denoised image and the input noisy image and the structure similarity map between the resulting noisy map and the input noisy image should be negatively correlated. Our method follows the denoising


Fig. 9. VR values for different denoising results: noisy input; VR = 2.28e4 (PSNR = 21.58) for a good denoising result and VR = 2.45e4 (PSNR = 15.49) for a bad denoising result.

1

2

2

2

2

ls 1

1

1

1

2

2

λ 0.5

1

0.5

1

0.5

1

quality metric in [16] and uses the negative correlation between the two structure similarity maps to encode denoising quality as follows:

SC = −corr(I_ss(Î, I), I_ss(I_n, I))   (4)

where I_ss(·) computes the structure similarity map between two input images according to the structure similarity metric SSIM [34]. To better capture structure correlation among different image scales, we choose three different patch sizes, namely 6 × 6, 8 × 8 and 10 × 10, to compute three feature values. Figure 8 shows SC values of denoised images with different qualities. The performance of the variational denoising formulation below depends on both the parameter λ and the selection of the norm operator; we therefore use various combinations of values for λ and the norm operators, as reported in Table 1, and compute multiple feature values accordingly.

Variational Denoising Residual. A range of variational methods have been developed for image denoising [28,40]. These methods formulate image denoising as an optimization problem with the pixel values in the denoising result as variables. Many of these methods formulate denoising as the following optimization problem:

min_Î ( ‖I − Î‖_{l_d} + λ‖∇Î‖_{l_s} ) / N   (5)



where the first term is the data term, which encourages the denoising result Î to be as close to the input image I as possible, and the second term is a regularization term that aims to minimize the gradient ∇Î of the denoising result. λ is a parameter, and l_d and l_s indicate the norm operator used in each term; there are three popular norm operator combinations used in existing denoising methods, as reported in Table 1. N is the number of pixels in the image. Variational methods minimize the above energy function to obtain the denoising result. Therefore, we compute the above energy function value, denoted VR, by plugging in the denoising result and use it to measure the denoising quality. As shown in Fig. 6, the VR value of a denoising result is statistically negatively correlated with the image denoising quality. Figure 9 likewise shows that VR values are negatively correlated with denoising quality. Note that when the ℓ2 norm is used, the corresponding term is computed as the square of the norm.

Gradient Histogram Preservation. A good denoising result should preserve as many image structures as possible. A recent image denoising method from Zuo et al. first estimates a target image gradient histogram H_g and then aims to find a denoising result whose gradient histogram is as similar to the target gradient histogram as possible [45]. Accordingly, our method uses the difference between the gradient histogram of a denoising result Î and the target gradient histogram to indicate the denoising quality:

GH = ‖H_g − Ĥ‖_1   (6)

where H_g is the normalized target gradient histogram estimated using the method from [45] and Ĥ is the normalized gradient histogram of the denoising result Î.

2.2 Denoising Quality Prediction

Given an input noisy image I and its k denoising results {Î_i}_{i=1,···,k}, we aim to predict the denoising quality without ground truth. Specifically, given an input noisy image I and its denoising result Î, we model the image denoising quality as a Random Forests Regression [19] of the quality feature vector f_{I,Î_i} extracted from I and Î using the methods described in Sect. 2.1:

q_{I,Î_i} = RFR(f_{I,Î_i})   (7)

where q_{I,Îi} indicates the denoising quality and RFR is the Random Forests Regression model trained on our training dataset. We directly use the output of this Random Forests Regression model to indicate the denoising quality and obtain the final quality ranking according to this regression value. For color images, we independently assess the denoising quality of each channel and then average the three values to obtain the final quality assessment.
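The regression step simply maps each quality feature vector to a scalar score. A minimal sketch of this idea with scikit-learn's RandomForestRegressor is shown below; the variable names and the number of trees are placeholders, not values from the paper.

```python
# Sketch only: trains a random-forest regressor that maps quality feature
# vectors to PSNR (or SSIM) values, mirroring Eq. (7).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_quality_model(feature_vectors, quality_labels, n_trees=100):
    """feature_vectors: (num_results, num_features); quality_labels: (num_results,)."""
    model = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    model.fit(np.asarray(feature_vectors), np.asarray(quality_labels))
    return model

def rank_denoising_results(model, candidate_features):
    """Returns indices of candidate denoising results, best predicted quality first."""
    scores = model.predict(np.asarray(candidate_features))
    return np.argsort(-scores), scores
```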

2.3 Automatic Parameter Tuning

Our denoising quality metric can be used to select a good denoising result that a denoising method can produce. Like previous work [16], we can try all the


reasonable algorithm parameters to denoise an input noisy image and use the prediction model to pick the best one. This brute-force approach, however, is time-consuming. Instead, we formulate parameter tuning as the following optimization problem:

θ* = arg max_θ q(θ)    (8)

where q(·) is the denoising quality computed according to Eq. 7 with a denoising parameter setting θ. We then use a gradient ascent method to find the optimal parameter setting:

θ_{k+1} = θ_k + λ∇q(θ_k)    (9)

where λ is the step size and ∇q(θ_k) is the denoising quality gradient computed using finite differences:

∇q(θ_k) = (q(θ_k + dθ) − q(θ_k − dθ)) / (2dθ)    (10)
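As a concrete illustration of Eqs. (8)–(10), the following sketch runs finite-difference gradient ascent over a single denoising parameter. The quality function is assumed to wrap the denoiser and the trained regression model, and the step size and stopping rule are illustrative choices, not values prescribed by the paper.

```python
# Sketch only: finite-difference gradient ascent over one denoising parameter.
# `quality_fn(theta)` is a hypothetical callable that denoises the image with
# parameter theta and returns the predicted quality from the regression model.
def tune_parameter(quality_fn, theta0, step=2.0, dtheta=1.0,
                   max_iters=20, tol=1e-3):
    theta = theta0
    for _ in range(max_iters):
        grad = (quality_fn(theta + dtheta) - quality_fn(theta - dtheta)) / (2.0 * dtheta)
        new_theta = theta + step * grad      # Eq. (9)
        if abs(new_theta - theta) < tol:     # converged
            break
        theta = new_theta
    return theta
```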

3 Experiments

We developed an image denoising quality assessment benchmark and experimented with our denoising quality ranking method on it. Our experiments compare our method with the state-of-the-art methods Q [41] and SC [16]. Below we first describe our image denoising benchmark. We then validate our regression model and evaluate the performance of our method w.r.t. overall quality ranking. We finally evaluate how our method can be used to tune the parameter setting of a denoising algorithm. As PSNR and SSIM are two popular metrics that reflect different aspects of image quality, we trained two different models to estimate the denoising quality according to PSNR and SSIM, respectively. We then test our method on denoising quality assessment according to both PSNR and SSIM in our experiments.

3.1 Image Denoising Quality Benchmark

We download 5,000 high-quality images from Flickr and use them as the noise-free ground-truth images. These images cover a wide range of scenes and objects, such as landscapes, buildings, people, flowers, animals, food, vehicles, etc. For convenience, we downsample these images so that their maximum height is 480 pixels. We then add synthetic noise to each of the original images. We use three types of noise, namely Additive White Gaussian noise, Poisson noise, and Salt & Pepper noise. For the Gaussian noise, we use three different standard deviation values, namely σ = 10, 20, and 30. For the Poisson noise, we use the Poisson random number generation function poissrnd in Matlab to generate noisy images with Poisson noise: In = k · poissrnd(I/k) with k = 0.05, 0.10 and 0.15.


For the Salt & Pepper noise, we use three density levels, namely d = 0.1, 0.2, and 0.3. Thus, we create 9 noisy images for each input image and obtain 5000 × 3 × 3 = 45000 noisy images in total. We then denoise each noisy image using seven representative denoising algorithms, namely Gaussian Filter, Bilateral Filter [33], Median Filter [35], Non-Local Means [2], Geodesic denoising [5], DCT denoising [38], and BM3D [7]. For each algorithm, we choose several different parameter settings to generate denoising results. The numbers of parameter settings for these algorithms are reported in Table 2. For each noisy image, we therefore have 23 denoising results. For each denoising result, we compute the PSNR and SSIM values and use them as the ground-truth labels to train different models for denoising quality assessment.

Table 2. Numbers of parameter settings of the denoising methods.

Algorithm       GF  BF  MED  NLM  GE  DCT  BM3D
# of settings   3   4   3    4    3   3    3
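The three noise models described above are straightforward to reproduce; the sketch below is an illustrative NumPy version. The parameter values shown match those listed in the text, but the function names and the assumption that images are floats in [0, 1] are our own.

```python
# Sketch only: synthesizes the three noise types used to build the benchmark.
# Images are assumed to be float arrays in [0, 1].
import numpy as np

def add_gaussian(img, sigma=10 / 255.0, seed=0):
    rng = np.random.default_rng(seed)
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def add_poisson(img, k=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # I_n = k * poissrnd(I / k), following the text.
    return np.clip(k * rng.poisson(img / k), 0.0, 1.0)

def add_salt_pepper(img, density=0.1, seed=0):
    rng = np.random.default_rng(seed)
    out = img.copy()
    mask = rng.random(img.shape)
    out[mask < density / 2] = 0.0        # pepper
    out[mask > 1 - density / 2] = 1.0    # salt
    return out
```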

We partition our benchmark into a training set and a testing set. We randomly select 250 (5%) of the 5,000 clean images to generate the training set and use the rest to generate the testing set. Since each noisy image has 23 denoising results, we obtain 250 × 9 × 23 = 51750 denoising results in total and use them to train the denoising quality prediction model. We repeat this random partition 10 times and take the average as the final result.

3.2 Regression Validation

We evaluate our regression model for PSNR/SSIM prediction in terms of the Root Mean Squared Error (RMSE) and the Relative Squared Error (RSE) between the predicted values and the ground truth, as reported in Fig. 10. We also compare our quality prediction using all the features to our method using individual features. The results show that by aggregating all features together, our method performs significantly better than with any individual feature.

Fig. 10. Regression model evaluation: (a) PSNR prediction; (b) SSIM prediction.

Fig. 11. Leave-one-out experiment results: (a) prediction error; (b) ranking performance.

To further study how each individual feature contributes to the final PSNR/SSIM prediction, we leave out one feature at a time, train our model with the remaining features, and report the results in Fig. 11(a). These results show that removing any single feature only slightly degrades the performance.

3.3 Evaluation on Denoising Quality Ranking

We adopt the Kendall τ correlation [14] between our ranking result and the ground truth to evaluate the performance of our ranking method. We directly compare our method with Q [41], SC [16], BRISQUE [23] and NIQE [24] in Table 3. As can be seen, our approach ranks denoising results better than these state-of-the-art image quality metrics. Figure 13 shows some denoising quality ranking results of our method. To evaluate how our method performs across different noise levels, we select 100 clean images from the testing set and corrupt them with Gaussian noise of σ = 15, 25, Poisson noise of k = 0.075, 0.125 and Salt & Pepper noise of d = 0.15, 0.25. Note that images with these noise levels are not used during the training process. We report the performance of our method on these images in Table 4. Our method outperforms the existing metrics on these noise levels as well, which indicates that it can effectively and robustly estimate image denoising quality across different noise levels. We compare our method using all the features to our method using individual features in Fig. 12. It can be seen that the model trained using all features significantly outperforms the models trained using individual features. This result

Fig. 12. All features vs. individual features on denoising quality ranking: (a) PSNR prediction; (b) SSIM prediction.


Table 3. Comparison of our method and the existing denoising metrics Q, SC, BRISQUE and NIQE. τPSNR and τSSIM indicate that the ground-truth labels used for training are PSNR and SSIM values, respectively.

Metric   Ours    Q      SC     BRISQUE  NIQE
τPSNR    0.854   0.483  0.241  0.326    0.506
τSSIM    0.784   0.453  0.191  0.250    0.399

Table 4. Performance of our method on images with noise levels different from those in the training set. τPSNR and τSSIM indicate that the ground-truth labels used for training are PSNR and SSIM values, respectively.

Metric   Ours    Q      SC     BRISQUE  NIQE
τPSNR    0.681   0.436  0.239  0.363    0.482
τSSIM    0.507   0.488  0.200  0.199    0.356

supports our observation that although no individual feature alone is enough to correctly estimate the denoising quality, the features complement each other well and can thus produce a robust quality assessment. To further evaluate how each individual feature contributes to denoising quality ranking, we leave out one feature at a time, rank the denoising results with the remaining features, and report the results in Fig. 11(b). This leave-one-out test shows that removing any single feature only degrades the ranking performance very slightly, which indicates that any single feature can almost be replaced by the combination of the remaining features. In addition, by exploring multiple SC values for training, our method improves the Kendall τ ranking performance of the SC metric from 0.239/0.200 to 0.348/0.321 for denoising quality assessment according to PSNR/SSIM. These tests show the capability of our method in aggregating multiple weak features into a more powerful quality ranking method.
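Ranking quality throughout this section is measured with the Kendall τ correlation; a minimal check of a predicted ranking against ground-truth PSNR values can be written as follows, assuming SciPy's kendalltau (variable names are illustrative).

```python
# Sketch only: Kendall tau between predicted quality scores and ground-truth PSNR.
from scipy.stats import kendalltau

def ranking_agreement(predicted_scores, ground_truth_psnr):
    """Both inputs are sequences over the same set of denoising results."""
    tau, p_value = kendalltau(predicted_scores, ground_truth_psnr)
    return tau

# Example: perfect agreement yields tau = 1.0
print(ranking_agreement([0.9, 0.7, 0.4], [32.1, 30.5, 28.2]))
```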

3.4 Evaluation on Parameter Tuning

We selected the BM3D [7] algorithm to evaluate our automatic parameter tuning method. We randomly select 50 noise-free images, add the three types of noise to each of them in the same way as described above, and create 150 noisy images. We obtain the ground-truth optimal parameter setting for each noisy image using a brute-force search according to the PSNR and SSIM values. To test the performance of our parameter tuning method, we randomly select half of the noisy images as the training set and the other half as the testing set. For each noisy image in the training set, we use BM3D to denoise it with 80 different parameter settings, from σ = 1 to σ = 80. We use all images in the training set to train dedicated prediction models for BM3D for the purpose of predicting PSNR and SSIM. We then use these models in our automatic parameter tuning algorithm to select an optimal parameter setting for each noisy image in the testing set. The step size λ is set to 2 for the PSNR prediction model. Since the SSIM value


Fig. 13. Our method reliably ranks the denoising results in terms of PSNR and SSIM. The denoising results are listed from left to right according to our predicted quality (from high to low). Due to the space limit, we only show 7 denoising results for each example.

ranges from 0 to 1, which is a much smaller range than that of PSNR values, we use a larger λ value (20) for the SSIM prediction model. For the initial guess σ0, we find that simply selecting the center of the parameter search space works well for all our testing images. A straightforward way to evaluate the performance of our method is to compute the error between our estimated algorithm parameters and the ground-truth optimal parameters. However, this parameter error does not intuitively convey how well our parameter tuning algorithm finds parameters that produce the optimal denoising result. Therefore, we evaluate our method w.r.t. the quality of the final denoising result obtained with the estimated parameter setting. Specifically, we compute the mean PSNR/SSIM differences (DiffPSNR and DiffSSIM) between the denoising results using the estimated parameter setting and the results using the optimal parameter setting. We compare our parameter tuning method with the state-of-the-art denoising metrics Q and SC in Table 5. For Q and SC, we use brute-force search to directly select their estimated parameter settings according to the Q and SC values. It can be seen that our approach outperforms Q and SC on optimal parameter tuning. The denoising results produced with the parameters estimated by our method are closer to the results produced with the ground-truth parameters: on average, the mean DiffPSNR value is 0.426 and the mean DiffSSIM value is 0.033. Figure 14 shows three examples of our parameter tuning method.


Fig. 14. Parameter tuning results for BM3D using our PSNR and SSIM prediction models. Compared to Q and SC, our predicted parameter setting is closer to the ground-truth optimal parameter settings σpsnr and σssim.

Fig. 15. Parameter tuning process of Fig. 14(f) and (k).

Table 5. Parameter tuning performance.

            Our method        Q                 SC
            Mean    Std       Mean    Std       Mean    Std
DiffPSNR    0.426   0.673     1.165   1.817     1.650   2.659
DiffSSIM    0.033   0.047     0.066   0.140     0.097   0.183


Table 6. Computation cost of our tuning algorithm.

Number of iterations   Mean   Std
PSNR prediction        2.53   0.98
SSIM prediction        1.56   2.84

To evaluate the computation cost of our parameter tuning, we record the number of iterations during the tuning process and report the results in Table 6. For each noisy input image, our method converges in fewer than 3 iterations on average. It takes about 5 seconds for our method to predict the quality of one denoised image. Thus, compared to a brute-force search that requires denoising the image with all possible parameter settings, our method saves a large amount of computation on the denoising process while introducing only a small amount of computation for quality prediction. Figure 15 shows the parameter search process for the noisy images from Fig. 14(f) and (k). This example shows that our method effectively and efficiently reaches a parameter setting that is very close to the true optimum selected according to the ground-truth (GT) PSNR/SSIM. Acknowledgements. This work was supported by NSF grant IIS-1321119.

References 1. Aharon, M., Elad, M., Bruckstein, A.: K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 54(11), 4311–4322 (2006) 2. Buades, A., Coll, B., Morel, J.: A non-local algorithm for image denoising. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 60–65 (2005) 3. Chen, F., Zhang, L., Yu, H.: External patch prior guided internal clustering for image denoising. In: IEEE International Conference on Computer Vision (ICCV), pp. 603–611 (2015) 4. Chen, J., Tang, C., Wang, J.: Noise brush: interactive high quality image-noise separation. ACM Trans. Graph. 28(5), 146:1–146:10 (2009) 5. Chen, X., Kang, S.B., Yang, J., Yu, J.: Fast patch-based denoising using approximated patch geodesic paths. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1211–1218 (2013) 6. Chen, Z., Jiang, T., Tian, Y.: Quality assessment for comparing image enhancement algorithms. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3003–3010 (2014) 7. Dabov, K., Foi, A., Katkovnik, V., Egiazarian, K.: Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans. Image Process. 16(8), 2080–2095 (2007) 8. Elad, M., Aharon, M.: Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Process. 15(12), 3736–3745 (2006)


9. Gu, K., Wang, S., Yang, H., Lin, W., Zhai, G., Yang, X., Zhang, W.: Saliencyguided quality assessment of screen content images. IEEE Trans. Multimed. PP(99), 1 (2016) 10. Gu, K., Wang, S., Zhai, G., Ma, S., Yang, X., Lin, W., Zhang, W., Gao, W.: Blind quality assessment of tone-mapped images via analysis of information, naturalness, and structure. IEEE Trans. Multimed. 18(3), 432–443 (2016) 11. Gu, K., Zhai, G., Yang, X., Zhang, W.: Using free energy principle for blind image quality assessment. IEEE Trans. Multimed. 17(1), 50–63 (2015) 12. Gu, S., Zhang, L., Zuo, W., Feng, X.: Weighted nuclear norm minimization with application to image denoising. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2862–2869 (2014) 13. Ke, Y., Tang, X., Jing, F.: The design of high-level features for photo quality assessment. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 419–426 (2006) 14. Kendall, M.: A new measure of rank correlation. Biometrika 30, 81–93 (1938) 15. Knaus, C., Zwicker, M.: Dual-domain image denoising. In: 2013 IEEE International Conference on Image Processing (ICIP), pp. 440–444 (2013) 16. Kong, X., Li, K., Yang, Q., Wenyin, L., Yang, M.H.: A new image quality metric for image auto-denoising. In: IEEE International Conference on Computer Vision (ICCV), pp. 2888–2895 (2013) 17. Levin, A., Nadler, B., Durand, F., Freeman, W.T.: Patch complexity, finite pixel correlations and optimal denoising. In: European Conference on Computer Vision (ECCV), pp. 73–86 (2012) 18. Li, S., Zhang, F., Ma, L., Ngan, K.N.: Image quality assessment by separately evaluating detail losses and additive impairments. IEEE Trans. Multimed. 13(5), 935–949 (2011) 19. Liaw, A., Wiener, M.: Classification and regression by randomforest. R News 2(3), 18–22 (2002) 20. Liu, Y., Wang, J., Cho, S., Finkelstein, A., Rusinkiewicz, S.: A no-reference metric for evaluating the quality of motion deblurring. ACM Trans. Graph. 32(6), 175:1– 175:12 (2013) 21. Lu, X., Lin, Z., Jin, H., Yang, J., Wang, J.Z.: Rating image aesthetics using deep learning. IEEE Trans. Multimed. 17(11), 2021–2034 (2015) 22. Luo, Y., Tang, X.: Photo and video quality evaluation: focusing on the subject. In: European Conference on Computer Vision (ECCV), pp. 386–399. Springer (2008) 23. Mittal, A., Moorthy, A., Bovik, A.: No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 21(12), 4695–4708 (2012) 24. Mittal, A., Soundararajan, R., Bovik, A.: Making a “completely blind” image quality analyzer. IEEE Sig. Process. Lett. 20(3), 209–212 (2013) 25. Moorthy, A., Bovik, A.: A two-step framework for constructing blind image quality indices. IEEE Sig. Process. Lett. 17(5), 513–516 (2010) 26. Mosseri, I., Zontak, M., Irani, M.: Combining the power of internal and external denoising. In: IEEE International Conference on Computational Photography (ICCP), pp. 1–9 (2013) 27. Roth, S., Black, M.: Fields of experts: a framework for learning image priors. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 860–867 (2005) 28. Rudin, L., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Phys. D: Nonlinear Phenom. 60(1), 259–268 (1992) 29. Saad, M., Bovik, A., Charrier, C.: A DCT statistics-based blind image quality index. IEEE Sig. Process. Lett. 17(6), 583–586 (2010)


30. Tang, H., Joshi, N., Kapoor, A.: Learning a blind measure of perceptual image quality. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 305–312 (2011) 31. Tang, X., Luo, W., Wang, X.: Content-based photo quality assessment. IEEE Trans. Multimed. 15(8), 1930–1943 (2013) 32. Tian, X., Dong, Z., Yang, K., Mei, T.: Query-dependent aesthetic model with deep learning for photo quality assessment. IEEE Trans. Multimed. 17(11), 2035–2048 (2015) 33. Tomasi, C., Manduchi, R.: Bilateral filtering for gray and color images. In: Sixth International Conference on Computer Vision (ICCV), pp. 839–846 (1998) 34. Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004) 35. Weiss, B.: Fast median and bilateral filtering. ACM Trans. Graph. 25(3), 519–526 (2006) 36. Xu, J., Zhang, L., Zuo, W., Zhang, D., Feng, X.: Patch group based nonlocal selfsimilarity prior learning for image denoising. In: IEEE International Conference on Computer Vision (ICCV), pp. 244–252 (2015) 37. Ye, P., Kumar, J., Kang, L., Doermann, D.: Unsupervised feature learning framework for no-reference image quality assessment. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1098–1105 (2012) 38. Yu, G., Sapiro, G.: DCT image denoising: a simple and effective image denoising algorithm. Image Process. Line 1, 292–296 (2011) 39. Yue, H., Sun, X., Yang, J., Wu, F.: CID: combined image denoising in spatial and frequency domains using web images. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2933–2940 (2014) 40. Zhang, L., Vaddadi, S., Jin, H., Nayar, S.: Multiple view image denoising. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1542–1549 (2009) 41. Zhu, X., Milanfar, P.: Automatic parameter selection for denoising algorithms using a no-reference measure of image content. IEEE Trans. Image Process. 19(12), 3116– 3132 (2010) 42. Zontak, M., Irani, M.: Internal statistics of a single natural image. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 977–984 (2011) 43. Zontak, M., Mosseri, I., Irani, M.: Separating signal from noise using patch recurrence across scales. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1195–1202 (2013) 44. Zoran, D., Weiss, Y.: From learning models of natural image patches to whole image restoration. In: IEEE International Conference on Computer Vision (ICCV), pp. 479–486 (2011) 45. Zuo, W., Zhang, L., Song, C., Zhang, D.: Texture enhanced image denoising via gradient histogram preservation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1203–1210 (2013)

Plant Leaf Disease Detection Using Adaptive Neuro-Fuzzy Classification Hiteshwari Sabrol1(&) and Satish Kumar2 1

Department of Computer Science and Applications, DAV University Jalandhar, Punjab 144012, India [email protected], [email protected] 2 Department of Computer Science and Applications, P.U. SSG Regional, Hoshiarpur 146023, Punjab, India

Abstract. This paper deals with the classification of different types of diseases of tomato and brinjal/eggplant plants. The texture patterns of the diseases are used as features, since diseases can often be recognized by their texture. A method that uses the texture patterns of the diseases in pure grayscale is applied for feature extraction, with a dedicated GLCM matrix used to compute the features. An ANFIS-based classification model is then used for disease recognition. The pattern-based features with ANFIS recognition give accuracies of 90.7% and 98.0% for the TPDS 1.0 and BPDS 1.0 datasets, respectively.

Keywords: Plant disease recognition · GLCM · Adaptive neuro-fuzzy inference system



1 Introduction

A method proposed to recognize diseases of one plant species, or one type of disease, is not necessarily a good recognizer for diseases of another plant. This is mainly because of the variation in the appearance of diseases among different plant species. In some cases, diseases can be classified by variations in their texture patterns. Image analysis techniques have been used to diagnose the level of mineral deficiency in rice plants [1]. In that study, color and texture analysis were used to classify the percentage of disease-affected pixels, and the color and texture features were submitted to two different multi-layer back-propagation networks for pixel classification [1]. A related procedure was used to detect two diseases of the grape plant [2]. There, k-means clustering is used for segmentation and the images are subdivided into six clusters. The gray level co-occurrence matrix (GLCM) is used to compute color texture features for infected and healthy regions, and spatial gray-level dependence matrices (SGDMs) are used for color co-occurrence texture analysis. The classification is performed using a feed-forward back-propagation neural network (BPNN) [2]. Another method was proposed for the detection of five types of paddy diseases [3]. Fractal descriptors based on the Fourier spectrum are computed to extract texture features


and the Laplacian filter is used to sharpen the images. The fractal-descriptor-based method is combined with a probabilistic neural network for classification [3]. In [4], the adopted procedure consists of color-texture feature extraction, with the features submitted to two classifiers, i.e., a support vector machine and a minimum distance criterion; the recognition results for rice blast and brown spot are better than for rice sheath blight. Three kinds of paddy crop diseases in digital images are classified using segmentation methods, i.e., Otsu's thresholding and local entropy; texture analysis is used for feature extraction and a set of predefined rules is applied to discriminate among the selected regions [5, 6]. Maize disease recognition on corn plant leaves is carried out using texture characteristics of the corn leaf, with a BPNN used for disease recognition [7]. Statistical methods are used for detecting and classifying fungal diseases of horticulture crops [8]. The statistical features include block-wise features, the Gray Level Co-occurrence Matrix (GLCM), and the Gray Level Run Length Matrix (GLRLM), and a Nearest Neighbor (NN) classifier using Euclidean distance classifies images into affected and healthy [8]. An extended version of this work is available in [9], where GLCM-based features are classified using SVM and ANN. A web-based tool for fruit disease identification is proposed in [10]; color, morphology, CCV and k-means clustering are used for feature extraction, SVM is used for classification of fruit diseases, and morphology-based feature extraction reports the best identification accuracy [10]. A recent convolutional neural network for plant leaf disease classification is used in [11]. The model is designed to recognize 13 different diseases from healthy leaf images, and image augmentation is applied, including affine transformations, perspective transformations and image rotations. The method presented in [12] to classify leaf diseases includes the following steps: (1) remove the masked cells from disease-infected clusters; (2) use a genetic algorithm to segment the infected leaf; (3) compute the color co-occurrence matrix for feature extraction; and (4) use the Minimum Distance Criterion and a support vector machine (SVM) for classification. The SVM classifier with the color co-occurrence matrix reported the best overall accuracy [12].

2 Materials and Methods

2.1 Plant Disease Database Creation

The raw sample images were collected from Chhattisgarh College of Agriculture, Bhilai, Chhattisgarh, India. Chhattisgarh is one of the major tomato-exporting provinces (NHB Database 2013–14) [13]. The raw images were 5152 × 3864 pixels in JPG format. Pre-processing is performed on the collected raw images and includes cropping according to the region of interest, resizing and segmentation. The size of all images is fixed to 1001 × 801 pixels. The images are also separated into disease-infected and non-infected using Otsu's method [14, 15] for segmentation. Sample images of disease-infected leaves and stems are used in the study. The datasets created have been named the Tomato plant disease dataset (TPDS 1.0) and the Brinjal/Eggplant plant disease dataset (BPDS 1.0). The total number of samples used in this study is given in Table 1.


Table 1. Total number of samples for the TPDS 1.0 and BPDS 1.0 datasets

TPDS 1.0              Sample images    BPDS 1.0                Sample images
Bacterial canker      130              Bacterial wilt          130
Bacterial leaf spot   130              Fungal late blight      130
Fungal late blight    130              Cercospora leaf spot    130
Leaf curl             130              Little leaf             130
Septoria leaf spot    130
Total                 650              Total                   520

Standard pre-processing, namely Otsu's segmentation, is applied to the resized images for binarization [14, 15]. Otsu's method applies thresholding to convert a color disease-infected or non-infected image into its color-segmented equivalent and its binary image. After segmentation, we therefore have disease-infected images consisting of color-segmented images and their binary masks. Figures 1 and 2 show sample images of infected tomato and brinjal/eggplant leaves.
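For reference, Otsu binarization of a leaf image can be done with scikit-image in a few lines. This is an illustrative sketch, not the authors' exact pre-processing code; the assumption that the leaf is darker than the background is ours.

```python
# Sketch only: Otsu thresholding of a leaf image into a binary mask
# and a color-segmented image (background zeroed out).
import numpy as np
from skimage import io, color, filters

def otsu_segment(path):
    rgb = io.imread(path)
    gray = color.rgb2gray(rgb)
    threshold = filters.threshold_otsu(gray)    # Otsu's method [15]
    mask = gray < threshold                     # assume leaf darker than background
    segmented = rgb * mask[..., np.newaxis]     # keep only foreground pixels
    return mask, segmented
```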

2.2 Method

The texture of an image is characterized by the spatial distribution of gray levels in a neighborhood. Thus, the feature extraction process in this paper uses the gray-level spatial dependence matrix to compute statistical features. The color-segmented images are converted to grayscale for texture analysis. The proposed method, called Gray-Level Spatial Dependence based Feature Extraction for Plant Disease Recognition by Classification (GLSMPDR), is used together with an adaptive neuro-fuzzy inference system classifier for plant disease recognition. The overall steps for plant disease feature extraction are given in Algorithm 2.1.

Algorithm 2.1. PDR_GLSM_based_Feature_Extraction
1. Convert the segmented RGB plant image to grayscale.
2. Compute GLSM = [Correlation, Contrast, Energy, Homogeneity].
3. Store the extracted features into the feature vector Fgm = i × GLSM, where i is the number of sample images of the corresponding plant image class and GLSM is the vector of the 4 computed features.
4. Return Fgm.
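A minimal version of this feature extraction can be written with scikit-image's gray-level co-occurrence routines; the offset and angle choices below are assumptions, since the paper does not specify them, and recent scikit-image releases name the functions graycomatrix/graycoprops (older ones spell them greycomatrix/greycoprops).

```python
# Sketch only: GLCM-based texture features (correlation, contrast,
# energy, homogeneity) for one grayscale leaf image.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glsm_features(gray_uint8, distances=(1,), angles=(0.0,)):
    """gray_uint8: 2-D uint8 image. Returns a 4-element feature vector."""
    glcm = graycomatrix(gray_uint8, distances=distances, angles=angles,
                        levels=256, symmetric=True, normed=True)
    return np.array([graycoprops(glcm, prop).mean()
                     for prop in ("correlation", "contrast",
                                  "energy", "homogeneity")])
```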

Fig. 1. Tomato plant disease infected images

Fig. 2. Brinjal/Eggplant plant disease infected images

The four common gray-level spatial dependence matrix properties [13], computed from the normalized matrix p(i, j), are as follows:

• Correlation:
  Σ_{i=1..K} Σ_{j=1..K} (i − m_r)(j − m_c) p(i, j) / (σ_r σ_c)

• Contrast:
  Σ_{i=1..K} Σ_{j=1..K} |i − j|² p(i, j)

• Energy:
  Σ_{i=1..K} Σ_{j=1..K} [p(i, j)]²

• Homogeneity:
  Σ_{i=1..K} Σ_{j=1..K} p(i, j) / (1 + |i − j|)

In this study, we computed four texture feature vectors. These computed features are submitted to a hybrid approach called the adaptive neuro-fuzzy inference system (ANFIS). The ANFIS architecture for plant disease recognition consists of 5 layers. Layer 1 takes the labeled computed features as input parameters, which are passed through the input membership functions. In Layer 2, each node outputs the product of all its inputs. The Layer 3 nodes compute the ratio of the ith rule's firing strength to the sum of all rules' firing strengths, so the output of this layer is normalized. In Layer 4, each node is an adaptive node that estimates the consequent parameters. The final Layer 5 computes the overall output as the sum of all its inputs, which classifies the disease category. The proposed ANFIS architecture is depicted in Fig. 3.

Fig. 3. ANFIS architecture based tomato plant disease recognition by classification


The ANFIS [16, 17] classifier uses a set of rules with fixed premise arguments and a linear least squares estimation method to identify an optimal fuzzy model. In this method, a least squares estimation technique is combined with cluster estimation to build a fuzzy model, which helps to find the important input variables. ANFIS modeling consists of two steps: (1) a cluster estimation method is used to extract an initial fuzzy model from the input-output data; (2) the significant input variables are identified by testing the importance of each selected variable in the initial fuzzy model. In the proposed ANFIS model, the first step uses the grid partition technique [17, 18].

Fig. 4. Proposed ANFIS structure for tomato plant disease recognition by classification

The technique is suited to situations where the numbers of input parameters and membership functions are small. The fuzzy rules are formed by partitioning the input space into a number of fuzzy regions. The four input parameters are computed using the gray-level spatial dependence matrix. In the second step, we take three of the four computed features as inputs and one as the output parameter. Four Gaussian-type membership functions are used per input, i.e., 4 × 3 = 12 membership functions are generated. The corresponding ANFIS structure is given in Fig. 4. For the input parameters, three features are computed from the five categories of tomato plant images and the four categories of brinjal/eggplant images. The output parameter is set to the fourth computed feature for training. The same features computed to train the ANFIS structure are also computed for testing. The minimum distance between the output parameter of training and that of testing is computed and used for classification: the class with the minimum distance is taken as the predicted classification.


3 Results and Discussion

We propose and investigate the GLSMPDR scheme with an adaptive neuro-fuzzy inference system for recognition. Here, statistical features are used for gray-level spatial analysis, i.e., an investigation of the texture patterns found in plant leaf and stem images. In this scheme, pure grayscale images are used for analysis; no color-texture combination is used in the experiments. The computed statistical features are fed to an adaptive neuro-fuzzy model based on a hybrid approach. In the proposed GLSMPDR scheme, an initial fuzzy model is derived using the grid partitioning technique, which contributes the final input variables and rules that finalize the fuzzy model. The ANFIS model with the minimum root mean square error (RMSE) is considered final. In the initial fuzzy model, the consequent parameters are updated with the least squares estimation (LSE) method, and the rules generated from the grid-partitioning clusters are updated by a gradient descent algorithm (back-propagation). The final ANFIS model is designed with the updated premise parameters of the membership functions to obtain optimized data for classification. The data for the experiments are taken from the two datasets, TPDS 1.0 and BPDS 1.0, and the input-output data are obtained by computing their gray-level spatial features. The training and testing sets are taken in a ratio of 90:10. The total number of image samples is 650 for tomato and 520 for brinjal/eggplant. For tomato, 585 images are used for training and 65 for testing; similarly, for brinjal/eggplant, 468 images are used for training and 52 for testing. Overall recognition accuracies of 90.7% and 98.0% are achieved for the tomato and brinjal/eggplant plants, respectively, which is quite satisfactory (see Table 3). Table 2 shows the comparison of the proposed scheme with various competing schemes.

Table 2. Overall accuracy comparison for the TPDS 1.0 and BPDS 1.0 datasets

Method                                    Disease/deficiency/damage   Recognition accuracy (%)
CCM + MLP [1]                             Deficiencies                88.5
Fractal Descriptor + PNN [3]              Bacterial, fungal, virus    83.0
CCM + Mahalanobis Distance Criteria [9]   Bacterial, fungal, virus    83.1
CCM + ANN [75]                            Bacterial, fungal, virus    87.4
CCM + SVM [16]                            Bacterial, fungal, virus    85.3
GLSMPDR + ANFIS (TPDS 1.0)                Bacterial, fungal, virus    90.7
GLSMPDR + ANFIS (BPDS 1.0)                Bacterial, fungal, virus    98.0


Table 3. Misclassification rates (%) using the GLSMPDR scheme with ANFIS for the TPDS 1.0 and BPDS 1.0 datasets

Tomato plant disease    GLSMPDR + ANFIS    Brinjal/Eggplant disease   GLSMPDR + ANFIS
Bacterial canker        0.0                Bacterial wilt             0.0
Bacterial leaf spot     7.7                Fungal leaf spot           7.7
Fungal late blight      7.7                Fungal late blight         0.0
Fungal septoria spot    23.1               Little leaf                0.0
Leaf curl virus         7.7

4 Conclusion

The proposed GLSMPDR scheme with an ANFIS classifier proves efficient for recognition. In this scheme, basic statistical features are extracted from grayscale images and submitted to the ANFIS classifier for classification. The ANFIS uses the grid partitioning technique to generate the important input variables and rules. This approach follows a hybrid learning method and is executed until the root mean squared error reaches its minimum. Tests were conducted on both datasets, and recognition accuracies of 90.7% for TPDS 1.0 and 98.0% for BPDS 1.0 were achieved. A comparative analysis of the existing competing schemes against the proposed scheme was carried out. The experimental results show that the proposed scheme improves both recognition accuracy and misclassification rates.

References 1. Sanyal, P., Bhattacharya, U., Parui, S.K., Bandyopadhyay, S.K., Patel, S.: Color texture analysis of rice leaves diagnosing deficiency in the balance of mineral levels towards improvement of crop productivity. In: Proceeding of 10th International Conference on Information Technology (ICIT 2007), pp. 85–90. IEEE, Orissa (2007) 2. Sannakki, S.S., Rajpurohit, V.S., Nargund, V.B., Kulkarni, P.: Diagnosis and classification of grape leaf diseases using neural networks. In: Proceeding of 4th International Conference (ICCCNT), pp. 1–5. IEEE, Tiruchengode (2013) 3. Asfarian, A., Herdiyeni, Y., Rauf, A., Mutaqin, K.M.: Paddy diseases identification with texture analysis using fractal descriptors based on fourier spectrum. In: Proceeding of International Conference on Computer, Control, Informatics and Its Applications, pp. 77–81. IEEE, Jakarta (2014) 4. Arivazhagan, I.S., Shebiah, R.N., Ananthi, S., Varthini, S.V.: Detection of unhealthy region of plant leaves and classification of plant leaf diseases using texture features. Agric. Eng. Int.: CIGR J. 15(1), 211–217 (2013) 5. Kurniawati, N.N., Abdullah, S.N.H.S., Abdullah, S.: Investigation on image processing techniques for diagnosing paddy diseases. In: Proceeding of 2009 International conference on Soft Computing and Pattern Recognition, pp. 272–277. IEEE, Malacca (2009)


6. Kurniawati, N.N., Abdullah, S.N.H.S., Abdullah, S.: Texture analysis for diagnosing paddy disease. In: Proceeding of 2009 International Conference on Electrical Engineering and Informatics, pp. 23–27. IEEE, Selangor (2009) 7. Kai, S., Zhikun, L., Hang, S., Chunhong, G.: A research of maize disease image recognition of corn based on BP networks. In: Third International Conference on Measuring Technology and Mechatronics Automation, Shangshai, pp. 246–249 (2011) 8. Pujari, J.D., Yakkundimath, R., Byadgi, A.S.: SVM and ANN based classification of plant diseases using feature reduction technique. Int. J. Interact. Multimed. Artif. Intell. 3(7), 3–14 (2016) 9. Pujari, J.D., Yakkundimath, R., Byadgi, A.S.: Identification and classification of fungal disease affected on agriculture/horticulture crops using image processing techniques. In: IEEE International Conference on Computational Intelligence and Computing Research, Coimbatore, pp. 1–4 (2014) 10. Bhange, M., Hingoliwala, H.A.: Smart farming: pomegranate disease detection using image processing. In: Proceedings of Second International Symposium on Computer Vision and Internet (VisionNet’ 2015), Procedia Computer Science, vol. 58, pp. 280–288 (2015) 11. Sladojevic, S., Arsenovic, M., Anderla, A., Culibrk, D., Stefanovic, D.: Deep neural networks based recognition of plant diseases by leaf image classification. Comput. Intell. Neurosci. 2016, 1–11 (2016) 12. Singh, V., Mishra, A.K.: Detection of plant leaf diseases using image segmentation and soft computing techniques. Inf. Process. Agric. 4, 41–49 (2017) 13. NHB Homepage. http://nhb.gov.in/area-pro/NHB_Database_2015.pdf 14. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 3rd edn. Prentice-Hall Inc., Englewood Cliffs (2006) 15. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9(1), 62–66 (1979) 16. Saki, F., Tahmasbi, A., Soltanian-Zadeh, H., Shokouhi, S.B.: Fast opposite weight learning rules with application in breast cancer diagnosis. Comput. Biol. Med. 43(1), 32–41 (2013) 17. Jang, J.S.R.: ANFIS: adaptive network based fuzzy inference system. IEEE Trans. Syst. Man Cybern. 23(3), 665–685 (1993) 18. Jang, J., Sun, C., Mizutani, E.: Neuro-Fuzzy and Soft Computing. Prentice Hall, New York (1997)

Fusion of CNN- and COSFIRE-Based Features with Application to Gender Recognition from Face Images Frans Simanjuntak and George Azzopardi(B) University of Groningen, Groningen, The Netherlands [email protected], [email protected]

Abstract. Convolution neural networks (CNNs) have been demonstrated to be very effective in various computer vision tasks. The main strength of such networks is that features are learned from some training data. In cases where training data is not abundant, transfer learning can be used in order to adapt features that are pre-trained from other tasks. Similarly, the COSFIRE approach is also trainable as it configures filters to be selective for features selected from training data. In this study we propose a fusion method of these two approaches and evaluate their performance on the application of gender recognition from face images. In particular, we use the pre-trained VGGFace CNN, which when used as standalone, it achieved 97.45% on the GENDER-FERET data set. With one of the proposed fusion approaches the recognition rate on the same task is improved to 98.9%, that is reducing the error rate by more than 50%. Our experiments demonstrate that COSFIRE filters can provide complementary features to CNNs, which contribute to a better performance.

Keywords: VGGFace · COSFIRE · Fusion · Gender recognition

1 Introduction

Convolutional neural networks (CNNs) are advanced versions of the original networks introduced by Fukushima [1]. They are inspired by the feline visual processing system. The architecture of a CNN is built on its predecessor, the ordinary neural network, which has layers that receive input, has activation functions, contains neurons with learnable weights and biases, and performs forward and backward propagation to adjust the network. The fundamental difference between a CNN and an ordinary neural network is that a CNN contains a set of stacked convolutional-pooling pairs of layers whose last output is fed to a set of stacked fully connected layers. In recent years, CNNs have become the state of the art for object detection and image classification. Their effectiveness is attributable to their ability to learn features from training data instead of handcrafting them. In applications where training data is abundant, a


new network can be entirely trained from such data. In other applications, however, where training data is limited, one may apply transfer learning techniques to fine-tune only the last layer(s) of the network, for instance. Another approach that uses trainable features is the Combination of Shifted Filter Responses (COSFIRE) filter approach. Unlike CNNs, COSFIRE configures nonlinear filters that achieve tolerance to rotation, scale and reflection. The complexity of the preferred pattern can vary from a simple edge or a line to corners, curvatures, bifurcations and shapes of whole objects like traffic signs. So far, COSFIRE filters have been configured by presenting single training examples. In principle, however, learning algorithms can be applied in order to determine the selectivity of the filters from multiple training examples. One major challenge for CNNs is dealing with adversarial attacks. The linear nature of the convolutional layers makes them vulnerable to such attacks [2]. On the other hand, COSFIRE filters rely on nonlinear connections between the outputs of low-level filters and, in principle, they are more robust to adversarial attacks. Considering the fact that both the CNN and COSFIRE approaches are based on features determined from training data, it is intriguing to investigate a fusion approach that maximizes their strengths. In this study, we fuse the trainable features from CNNs and COSFIRE filters by applying two types of fusion, namely feature fusion and decision fusion. In the former strategy, we concatenate CNN- and COSFIRE-based features and use the resulting feature vector as input to another classification model, and in the latter, we learn a stacked classification model without merging the CNN- and COSFIRE-based features. In general, the fusion approaches that we investigate are applicable to any classification task; however, for the sake of demonstration, we use the application of gender recognition from face images to quantify their effectiveness. The rest of the paper is organized as follows. In Sect. 2 we give an account of related works. In Sect. 3 we describe the proposed approach, followed by Sect. 4 where we explain the experiments and report the results. In Sect. 5 we provide a discussion and finally we draw conclusions in Sect. 6.

2 Related Works

CNNs have been applied in many computer vision tasks, such as face recognition [3,4], scene labelling [5–7], image classification [8–12], action recognition [13–15], human pose estimation [16–18], and document analysis [19–22]. CNNs made a breakthrough in image classification in the ImageNet Large-Scale Visual Recognition Challenge in 2014 (ILSVRC14). In that competition, GoogleNet was able to perform the classification and detection of large scale images by achieving an error rate of 6.67% [23], which is very close to human level performance. Various CNN architectures have been proposed, namely Lenet [24], AlexNet [11], VGGNet [25], and ResNet [26]. Closer to the application at hand, Parkhi et al. [4] also proposed an architecture called VGGFace, which is designed to recognize the identity of a person from a single photograph or a set of faces tracked


in a video. CNNs have also been adapted to natural language processing applications, such as speech recognition [27–29] and text classification [30–33]. In general, only 4% Word Error Rate Reduction (WERR) was obtained when the speech was trained on 1000 h of Kinect distance [34] using deep neural networks (DNNs) proposed in [35]. The other approach that we are concerned with, namely, Combination of Shifted Filter Responses (COSFIRE), is also a brain-inspired visual pattern recognition method, which has been effectively applied in various applications, including traffic sign recognition [36], handwritten digit recognition [37], architectural symbol recognition [38], quality visual inspection [39], contour detection [40], delineation of curvilinear structures [41], such as the blood vessels in retinal fundus images and cracks in pavements [42], butterfly recognition [36], person identification from retinal images [43], and gender recognition from face images [44,45]. COSFIRE is a filtering approach whose selectivity is determined in a configuration stage by the automatic analysis of given prototype patterns. In its basic architecture, a COSFIRE filter takes input from a set of low-level filters, such as orientation-selective or filters with center-surround support, and combines them by a nonlinear function [46]. In [47], it has also been demonstrated that hierarchical or multi-layered COSFIRE filters can be configured to be selective for more deformable objects. Similar to CNNs, the number of layers is an architectural design. In relation to gender recognition from face images, several studies have been conducted using CNNs and COSFIRE. For instance, Liew et al. [48] proposed a CNN architecture which focuses on reducing CNN layers to four and performs cross-correlation to reduce the computation time. The proposed method was evaluated using two public face data sets, namely SUMS and AT&T, and achieved accuracy rates of 98.75% and 99.38%, respectively. Levi et al. [49] introduced a CNN architecture which focuses on age and gender classification. They proposed an architecture, so-called deep convolutional neural networks (DCNN), which has two approaches, namely single crop and over samples and achieved accuracy rates of 85.9% and 86.8%, respectively, on the Audiance benchmark dataset. Dhomne et al. [50] also proposed a deep CNN architecture for gender recognition that focuses mainly on enhancing VGGNet architecture to be more efficient. Studies have shown that CNNs are vulnerable to adversarial attacks. For instance, Narodytska et al. [51] proposed a simple attack by adding perturbation and applying a small set of constructed pixels using greedy local-search to a random location of the image. That method is able to fool a CNN and increases the misclassification rate. Moosavi-Dezfooli et al. [52] proposed DeepFool based on iterative linearization procedure to generate adversarial attacks. Tang et al. [53] employed a steganographic scheme that aims at hiding a stego message and fooling a CNN at the same time. The experiments showed that it is secure and adequate to cope with powerful CNN-based steganalysis. Several methods based on the COSFIRE approach have also been proposed for the recognition of gender from face images. The first experiment was performed in 2016 [44] where a set of COSFIRE filters were used and encoded


by a spatial pyramid to form a feature vector that was fed to a classification model. In the experiments, they used two data sets, namely GENDER-FERET and Labeled Faces in the Wild (LFW), and the results show that COSFIRE is able to achieve 93.7% classification rate on the GENDER-FERET and 90% on the LFW. In the following year, they continued the work by conducting another experiment that combines the features from domain-specific and trainable COSFIRE [45]. In that study, the domain specific part uses SURF descriptors from 51 facial landmarks related to the nose, eyes, and mouth. The extracted features from those landmarks were fused with features from COSFIRE filters and achieved accuracy rates of 94.7%, 99.4%, and 91.5% on the GENDER-FERET, LFW, and UNISA-Public data sets, respectively.

3 Methods

In the following, we describe the proposed fusion methods within the context of gender recognition from face images. First, we describe how we perform face detection and if necessary correct the orientation of the face to an upright position. Then, we describe VGGFace as one of the most widely used CNN architectures for face recognition, followed by the COSFIRE filter approach. Finally, we elaborate on our fusion approaches. Figure 1 illustrates the high-level architecture of our pipeline.

Fig. 1. A high level diagram of the proposed system.

3.1 Face Detection and Alignment

For a given face image, we apply the algorithm proposed by Uricar et al. [54] that gives us 68 fiducial landmarks. For the purpose of this work we determine the two centroids of the locations of the two sets of points that characterize the eyes and discard the remaining landmarks. Next, we compute the angle between these two centroids and use it to rotate the face image in such a way that the angle between the eyes becomes zero. Thereby, we ensure that all face images are appropriately aligned.
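The alignment step can be reproduced with a few lines of OpenCV once the two eye centroids are known; the landmark detector itself [54] is treated as a black box here, and the helper name is our own.

```python
# Sketch only: rotate a face image so the line joining the eye centroids
# becomes horizontal. `left_eye` and `right_eye` are (x, y) centroids
# obtained from the 68 fiducial landmarks.
import cv2
import numpy as np

def align_face(image, left_eye, right_eye):
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = np.degrees(np.arctan2(dy, dx))             # angle between the eyes
    center = ((left_eye[0] + right_eye[0]) / 2.0,
              (left_eye[1] + right_eye[1]) / 2.0)
    rot = cv2.getRotationMatrix2D(center, angle, 1.0)  # rotation that levels the eyes
    h, w = image.shape[:2]
    return cv2.warpAffine(image, rot, (w, h))
```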


Finally, we apply the Viola-Jones algorithm [45] to crop the close-up of the face. Figure 2 illustrates the preprocessing pipeline that we employ.

Fig. 2. (a) Input image containing the fiducial landmarks detected by the algorithm proposed in [54]; the blue spots indicate the centroids of the landmarks that describe the eyes. (b) Image rotated appropriately based on the angle found between the blue landmarks in (a). (c) Face detection with the Viola-Jones algorithm and (d) the cropped face image.

3.2 VGGFace

VGGFace is a CNN that extends VGGNet, developed by Parkhi et al. [4], who showed that the depth of the network is a critical component for good performance. The goal of the VGGFace architecture is to deal with face recognition either from a single photograph or from a set of faces tracked in a video [4]. The input to VGGFace is a face image of size 224 × 224 pixels and the network consists of 13 convolutional layers, 15 Rectified Linear Units (ReLU), 5 sub-sampling (max pooling) layers, 3 fully connected layers, and 1 softmax output layer, as shown in Fig. 3. For further technical details on VGGFace we refer the reader to [4]. In this study, we use the pre-trained VGGFace to extract features from face images. Following the input requirement of VGGFace, we resize our pre-processed images to 224 × 224 pixels. We apply VGGFace to every given image and take the 4096-element feature vector from the FC7 layer. Finally, we stretch the feature vectors to the range 0 to 1 such that they share the same range of values as the COSFIRE approach.
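A sketch of this feature-extraction step is shown below using the community keras_vggface package; the package, the layer name 'fc7', and the exact preprocessing are assumptions about one possible implementation, not the authors' code.

```python
# Sketch only: extract 4096-D FC7 features from a pre-trained VGGFace model
# and stretch each vector to [0, 1]. Assumes the keras_vggface package and a
# layer named 'fc7'; adjust the layer name to the actual model definition.
import numpy as np
from keras.models import Model
from keras_vggface.vggface import VGGFace

base = VGGFace(model='vgg16')                        # pre-trained VGGFace (assumed API)
fc7 = Model(inputs=base.input,
            outputs=base.get_layer('fc7').output)    # 4096-element layer

def vggface_features(face_batch):
    """face_batch: (N, 224, 224, 3) preprocessed float array."""
    feats = fc7.predict(face_batch)
    lo = feats.min(axis=1, keepdims=True)
    hi = feats.max(axis=1, keepdims=True)
    return (feats - lo) / np.maximum(hi - lo, 1e-12)  # stretch to [0, 1]
```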

3.3 COSFIRE

Combination Of Shifted Filter Responses (COSFIRE) is a trainable filter approach which has been demonstrated to be effective in various computer vision tasks. Here we apply the COSFIRE filters in the same way as proposed in [44] where a spatial pyramid was employed to form a feature vector with COSFIRE features. For completeness sake we briefly describe this method in the following sub-sections.


COSFIRE Filter Configuration. A COSFIRE filter is nonlinear and it is automatically configured to be selective for a given single prototype pattern of interest. The automatic configuration procedure consists of two main steps: convolution followed by keypoint detection and description. In the convolution step a bank of orientation-selective (Gabor) filters is applied with different scales λ and orientations θ, and the resulting feature maps are superimposed on top of each other. In the second step, a set of concentric circles with given radii ρ is considered and the local maximum Gabor responses along those circles are identified as keypoints. Each keypoint i is described with four parameters: (λi , θi , ρi , φi ), where λi and θi are the parameters of the Gabor filter that achieves the maximum response in the location with a distance ρi and polar angle φi with respect to the center of the prototype pattern.

Fig. 3. The architecture of the VGGFace as proposed by Parkhi et al. [4]. The red bounding box indicates the FC7 layer that we use to extract features from the network.

Therefore, a COSFIRE filter is defined as a set of 4-tuples: Sf = {(λi , θi , ρi , φi ) | i = 1...k}

(1)

where the subscript f represents the prototype pattern and k denotes the number of keypoints. In our experiments we configure multiple COSFIRE filters with equal number of randomly selected local patterns from male and female face training images. Figure 4 illustrates the configuration of a COSFIRE filter with a local pattern selected from a face image, as well as its application to the same image. COSFIRE Filter Response. The response of a COSFIRE filter is computed by combining with a geometric mean the intermediate response maps generated from the tuples describing the filter. There is a pipeline of four operations applied for each tuple in a given COSFIRE filter. It consists of convolution, ReLU, blurring and shifting. The pipelines of the tuples can be run in parallel as they are independent of each other. In the convolution step, the given image is filtered


Fig. 4. Configuration example of a COSFIRE filter using a training female face image of size 128 × 128 pixels. The encircled region in (a) shows a randomly selected pattern of interest that is used to configure a COSFIRE filter. (b) Superposition of a bank of antisymmetric Gabor filters with 16 orientations (θ = {0, π/8, ..., 15π/8}) and a single scale (λ = 4). (c) The structure of the COSFIRE filter that is selective for the encircled pattern in (a). (d) The inverted response map of the COSFIRE filter to the input image in (a).

with the Gabor filter with scale λi and orientation θi as specified in tuple i. Next, similar to CNNs, the Gabor response map is rectified with the ReLU function. Unlike the pooling layer of CNNs, COSFIRE applies a max-blurring function to the rectified Gabor responses in order to allow for some tolerance with respect to the preferred position of the concerned keypoint, followed by shifting by ρi pixels in the direction opposite to φi. The blurring operation uses a sliding window on the Gabor response maps, where the Gabor responses in every window are weighted with a Gaussian function whose standard deviation σ grows linearly with the distance ρi: σ = σ0 + αρi, where σ0 and α are determined empirically. The output of the blurring operation is the weighted maximum. Finally, all blurred and shifted Gabor responses s_{λi,σi,ρi,φi}(x, y) are combined using the geometric mean, and the result is denoted by r_{Sf}:

r_{Sf}(x, y) = ( ∏_{i=1}^{n} s_{λi,σi,ρi,φi}(x, y) )^{1/n}    (2)

Face Descriptor. In contrast to CNNs, instead of downsizing the feature maps and eventually flattening the last map into a feature vector, we process the COSFIRE response maps with a spatial pyramid of three levels. In level zero we consider one tile, which has the same size as the COSFIRE response map, and take the maximum value. In the next two levels we consider 2 × 2 and 4 × 4 tiles, respectively, and take the maximum COSFIRE response in each tile. For n COSFIRE filters and (1 + 4 + 16 =) 21 tiles, we describe a face image with a 21n-element feature vector. Moreover, the set of n COSFIRE filter responses per tile is normalized to unit length [45]. An example of the COSFIRE face descriptor using a single filter is shown in Fig. 5.
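The pyramid pooling just described amounts to tile-wise max pooling followed by per-tile L2 normalization across filters; a NumPy sketch with illustrative function names is given below.

```python
# Sketch only: 3-level spatial pyramid (1, 2x2, 4x4 tiles) over n COSFIRE
# response maps, followed by per-tile unit-length normalization.
import numpy as np

def pyramid_descriptor(response_maps):
    """response_maps: (n, H, W) array of COSFIRE responses. Returns a (21*n,) vector."""
    n, H, W = response_maps.shape
    tiles = []                                    # each entry: (n,) responses of one tile
    for grid in (1, 2, 4):                        # pyramid levels 0, 1, 2
        hs, ws = H // grid, W // grid
        for r in range(grid):
            for c in range(grid):
                patch = response_maps[:, r*hs:(r+1)*hs, c*ws:(c+1)*ws]
                tiles.append(patch.max(axis=(1, 2)))
    tiles = [t / max(np.linalg.norm(t), 1e-12) for t in tiles]  # unit length per tile
    return np.concatenate(tiles)
```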


Fig. 5. Application of a COSFIRE filter to a face image using a spatial pyramid of three layers, with square grids of 1, 4 and 16 tiles. The red circles in (a–c) indicate the maximum values within the tiles which are shown in the bar plot in (d).
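A minimal sketch of the three-level spatial-pyramid descriptor described above (the function name and the NumPy-based layout are illustrative assumptions):

```python
# 1 + 4 + 16 = 21 tiles per COSFIRE response map, maximum response per tile,
# and L2 normalisation of the n filter responses within each tile [45].
import numpy as np

def pyramid_descriptor(response_maps):
    """response_maps: array of shape (n_filters, H, W) -> vector of length 21 * n_filters."""
    n, H, W = response_maps.shape
    tiles = []
    for grid in (1, 2, 4):                                # pyramid levels 0, 1, 2
        hs, ws = H // grid, W // grid
        for r in range(grid):
            for c in range(grid):
                patch = response_maps[:, r * hs:(r + 1) * hs, c * ws:(c + 1) * ws]
                v = patch.reshape(n, -1).max(axis=1)      # max response per filter
                v = v / (np.linalg.norm(v) + 1e-12)       # unit length per tile
                tiles.append(v)
    return np.concatenate(tiles)                          # 21n-element feature vector
```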

3.4 Fusion Methods

We investigate two fusion strategies that combine CNN- and COSFIRE-based features: one combines the features themselves, while the other combines the decisions of two separate classifiers.

Feature Fusion. In the first approach we concatenate the 4096-element feature vector of VGGFace, extracted from the FC7 layer, with the 21n-element feature vector obtained by the spatial pyramid approach applied to the COSFIRE feature maps. The value of n represents the number of COSFIRE filters; here we set n = 240, a value that was determined empirically. Combining the two sets of features results in a fused feature vector of (4096 + 21 × 240 =) 9136


elements. Finally, we use the resulting fused feature vectors from the training data to learn an SVM classification model with a linear kernel.

Decision Fusion. The other approach that we investigate is called stacked classification. In this approach we keep the CNN and COSFIRE feature vectors separate and learn an SVM classification model (with a linear kernel) for each set of features. We then apply the SVMs to the training data and combine the values returned by both SVMs into a feature vector, which we use to learn another SVM classification model with a linear kernel. An SVM returns as many values as there are classes; in our case there are two classes (male and female), so each SVM in the lower layer returns two values related to the probability of each gender. The SVM in the top layer is therefore fed with a vector of four values.
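The decision-fusion (stacked classification) strategy can be sketched as follows. Scikit-learn is an assumed tool here, since the paper only specifies linear-kernel SVMs, and the per-class probability outputs stand in for the values returned by the base SVMs:

```python
# Two base linear SVMs (CNN features, COSFIRE features) feed a 4-D meta-vector
# of per-class scores into a second-level linear SVM.
import numpy as np
from sklearn.svm import SVC

def fit_decision_fusion(X_cnn, X_cosfire, y):
    svm_cnn = SVC(kernel="linear", probability=True).fit(X_cnn, y)
    svm_cos = SVC(kernel="linear", probability=True).fit(X_cosfire, y)
    meta = np.hstack([svm_cnn.predict_proba(X_cnn),       # 2 values per sample
                      svm_cos.predict_proba(X_cosfire)])  # 2 values per sample
    svm_top = SVC(kernel="linear").fit(meta, y)           # top-layer SVM on 4 values
    return svm_cnn, svm_cos, svm_top

def predict_decision_fusion(models, X_cnn, X_cosfire):
    svm_cnn, svm_cos, svm_top = models
    meta = np.hstack([svm_cnn.predict_proba(X_cnn),
                      svm_cos.predict_proba(X_cosfire)])
    return svm_top.predict(meta)
```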

4 Experiments and Results

Here, we describe the experimental design along with the data sets that we used and the results obtained for both the standalone methods and the fusion strategies.

4.1 Data Sets

We used two benchmark data sets, namely GENDER-FERET [55] and Labeled Faces in the Wild (LFW) [56]. The GENDER-FERET data set consists of 946 face images of people with different expressions, ages, races, and poses, captured in a controlled environment. The data set, which is publicly available1, is already divided equally into training and test sets. The LFW data set gives us the opportunity to validate the proposed methods in an unconstrained environment. It consists of 13,000 images of 5,749 celebrity, athlete, and politician faces collected from websites while the subjects were engaged in their daily activities, such as playing sports, doing a fashion show, giving a speech, or being interviewed. Because the photographs were taken in the subjects' natural environments, multiple faces may appear in the same image. The data set also shows variability in illumination, pose, background, occlusion, facial expression, gender, age, race, and image quality. Following the recommendations in [44] and [45], 9,763 grayscale images were chosen for the experiments, of which 2,293 are female and the rest are male. We labeled the gender manually, as it is not provided with the data set. The images were aligned to an upright position using facial landmark tracking [54], as explained in the previous section, and faces whose gender was not easy to establish were discarded. Since the numbers of male and female images are not balanced, we applied 5-fold cross-validation by partitioning the images into five subsets of similar size while keeping the same ratio between male and female [57]. The accuracy was then computed as the average over all folds.

https://www.nist.gov/programs-projects/face-recognition-technology-feret.
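A brief sketch of the stratified 5-fold protocol described above; StratifiedKFold from scikit-learn is an assumed convenience, as the paper only specifies equal-sized folds that preserve the male/female ratio and accuracy averaged over the folds:

```python
# Stratified 5-fold evaluation of a linear SVM on a feature matrix X and labels y.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def cross_validated_accuracy(X, y, n_splits=5, seed=0):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))       # average accuracy over the folds
```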


Table 1. Results of the COSFIRE- and VGGFace-based standalone approaches and the fusion strategies on the GENDER-FERET (GF) and LFW data sets using an SVM classifier.

Method           Data set   Accuracy (%)
COSFIRE-only     GF         93.85
                 LFW        99.19
CNN-only         GF         97.45
                 LFW        99.71
Feature fusion   GF         98.30
                 LFW        99.28
Decision fusion  GF         98.94
                 LFW        99.38

4.2 Experiments

Below, we report the evaluation of the standalone approaches, followed by the evaluation of the two above-mentioned fusion strategies. Following similar procedures to those explained in [44] and [45], we conducted several experiments with the COSFIRE-based method on the GENDER-FERET and LFW data sets. Instead of employing 180 filters as in the prior works, we configured 240 COSFIRE filters in order to have more variability. Of the 240 COSFIRE filters, 120 are configured from randomly selected local patterns (of size 19 × 19 pixels) of male face training images and the other half from randomly selected local patterns of female training faces. If a randomly selected local pattern was sufficiently salient and resulted in a COSFIRE filter consisting of at least five keypoints (tuples), we considered it a valid prototype; otherwise we discarded it and chose another random pattern. As suggested in [44], we set the parameters of the COSFIRE filters as follows: t1 = 0.1, t2 = 0.75, σ0 = 0.67, α = 0.1, and we selected keypoints from a set of concentric circles with radii ρ = {0, 3, 6, 9}. For the CNN-based approach we used the 4096-element feature vectors along with an SVM with a linear kernel. Moreover, we conducted further experiments in which we evaluated the two fusion strategies that combine both approaches: the first is referred to as feature fusion, where we concatenated the VGG-Face and COSFIRE feature vectors into longer ones, and the other is decision fusion, where we used the classification stacking approach explained in Sect. 3.4. Table 1 reports the results obtained by the two standalone and the two fusion approaches on the GF and LFW data sets. For the GF data set, the standalone CNN-based approach performs significantly better than the standalone COSFIRE approach, while for the other data set the accuracies of both standalone methods are roughly the same, with the marginal difference not being statistically significant. As to the fusion, we observe that both strategies improve the accuracy with high statistical significance on the GF data set, while there is no statistically significant difference between the results of all methods on the LFW data set.


Comparison with Other Methods. We also compare the results of our approaches with those already published in the literature (Table 2). For the GF data set, both fusion strategies that we propose outperform existing works with high statistical significance. For the LFW data set, we do not observe a statistically significant difference between any of the methods.

Table 2. Comparison of the results between the proposed approaches and existing ones on both the GF and LFW data sets.

Data set  Method                        Description        Accuracy (%)
GF        Azzopardi et al. [58]         RAW, LBP, HOG      92.60
GF        Azzopardi et al. [44]         COSFIRE            93.70
GF        Azzopardi et al. [45]         COSFIRE, SURF      94.70
GF        Proposed 1 (Feature fusion)   COSFIRE, VGGFace   98.30
GF        Proposed 2 (Decision fusion)  COSFIRE, VGGFace   98.90
LFW       Tapia et al. [59]             LBP                92.60
LFW       Dago-Casas et al. [60]        Gabor              94.00
LFW       Shan et al. [57]              Boosted LBP        94.81
LFW       Azzopardi et al. [45]         COSFIRE, SURF      99.40
LFW       Proposed 1 (Feature fusion)   COSFIRE, VGGFace   99.28
LFW       Proposed 2 (Decision fusion)  COSFIRE, VGGFace   99.38

5 Discussion

The most important contribution of this study is the finding that COSFIRE features and features from a pre-trained CNN can indeed complement each other for gender recognition from face images. The experiments on the GENDER-FERET data set demonstrate this complementarity: the decision fusion approach reduced the error rate by more than 50%. For the other data set, the fact that both standalone methods (CNN and COSFIRE) achieved very high recognition rates (above 99%) left very little room for further improvement when they were fused. We are eager to find out whether the same or a similar improvement can be observed in other challenging recognition applications, and we aim to investigate this in future work. Both COSFIRE and CNNs are approaches that learn features directly from the given training data. CNNs with a high number of layers, such as the VGG-Face network that we use here, rely on deep learning with gradient descent to determine the best features, while COSFIRE is conceptually simpler, as it configures the selectivity of a filter from a single prototype pattern by establishing the mutual spatial arrangement of keypoints within the given local pattern. So far, COSFIRE filters have been configured from single examples with empirically determined hyperparameters, such as the standard deviations of the blurring functions. In future work, we will investigate a learning mechanism that can determine better filters by analyzing multiple prototypes. COSFIRE filters share important steps with CNNs, including the convolution and ReLU layers, along with the possibility of designing architectures with multiple layers. The fundamental difference between the two approaches lies in


the fact that COSFIRE is a filter, while a CNN is a fully embedded classification technique. The input to the COSFIRE approach can be any complex scene, while for a CNN to detect an object of interest, the given input image must contain the concerned object roughly in the center and the object must occupy the majority of the image. The latter requirement is due to the downsizing that CNNs implement, a step that is not present in the COSFIRE approach. Instead, COSFIRE applies a blurring function that allows some tolerance with respect to the mutual spatial arrangement of the defining features of an object of interest.

6 Conclusions

The proposed fusion strategies prove to be very effective in combining the COSFIRE and CNN approaches. We used a case study of gender recognition to evaluate our methods and it turned out that with the fusion approaches the error rate drops by more than 50% on the GENDER-FERET data set. Considering the simplicity of the COSFIRE filters, the achieved results are very promising. The proposed fusion approaches are independent of the application at hand and thus they can be adapted to any image classification task.

References 1. Fukushima, K.: Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36(4), 193–202 (1980) 2. Goodfellow, I., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: International Conference on Learning Representations (2015) 3. Lawrence, S., Giles, C.L., Tsoi, A.C., Back, A.D.: Face recognition: a convolutional neural-network approach. IEEE Trans. Neural Netw. 8(1), 98–113 (1997) 4. Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: British Machine Vision Conference (2015) 5. Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1915–1929 (2013) 6. Pinheiro, P., Collobert, R.: Recurrent convolutional neural networks for scene labeling. In: Xing, E.P., Jebara, T. (eds.) Proceedings of the 31st International Conference on Machine Learning Research, PMLR, 22–24 June 2014, Beijing, China, vol. 32, pp. 82–90 (2014) 7. Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 640–651 (2017) 8. Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., Wang, G., Cai, J., Chen, T.: Recent advances in convolutional neural networks. Pattern Recogn. 77(C), 354–377 (2018) 9. Strigl, D., Kofler, K., Podlipnig, S.: Performance and scalability of GPU-based convolutional neural networks. In: 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, pp. 317–324, February 2010 10. Uetz, R., Behnke, S.: Large-scale object recognition with CUDA-accelerated hierarchical neural networks. In: 2009 IEEE International Conference on Intelligent Computing and Intelligent Systems, vol. 1, pp. 536–541, November 2009


11. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 25, pp. 1097– 1105. Curran Associates, Inc. (2012) 12. Yan, Z., Jagadeesh, V., DeCoste, D., Di, W., Piramuthu, R.: HD-CNN: hierarchical deep convolutional neural network for image classification. CoRR, abs/1410.0736 (2014) 13. Kim, H.-J., Lee, J.S., Yang, H.-S.: Human action recognition using a modified convolutional neural network. In: Proceedings of the 4th International Symposium on Neural Networks: Part II–Advances in Neural Networks, ISNN 2007, pp. 715– 723. Springer, Heidelberg (2007) 14. Le, Q.V., Zou, W.Y., Yeung, S.Y., Ng, A.Y.: Learning hierarchical invariant spatiotemporal features for action recognition with independent subspace analysis. In: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, pp. 3361–3368. IEEE Computer Society, Washington (2011) 15. Wang, P., Cao, Y., Shen, C., Liu, L., Shen, H.T.: Temporal pyramid pooling based convolutional neural networks for action recognition. CoRR, abs/1503.01224 (2015) 16. Ferrari, V., Marin-Jimenez, M., Zisserman, A.: Progressive search space reduction for human pose estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (2008) 17. Weiss, D.J., Sapp, B., Taskar, B.: Sidestepping intractable inference with structured ensemble cascades. In: NIPS, pp. 2415–2423. Curran Associates, Inc. (2010) 18. Toshev, A., Szegedy, C.: Deeppose: human pose estimation via deep neural networks. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, pp. 1653–1660. IEEE Computer Society, Washington (2014) 19. Guyon, I., Albrecht, P., Le Cun, Y., Denker, J., Hubbard, W.: Design of a neural network character recognizer for a touch terminal. Pattern Recogn. 24(2), 105–119 (1991) 20. Zhu, R., Mao, X., Zhu, Q., Li, N., Yang, Y.: Text detection based on convolutional neural networks with spatial pyramid pooling. In: 2016 IEEE International Conference on Image Processing (ICIP), pp. 1032–1036, September 2016 21. Bengio, Y., LeCun, Y., Henderson, D.: Globally trained handwritten word recognizer using spatial representation, convolutional neural networks, and hidden Markov models. In: Cowan, J.D., Tesauro, G., Alspector, J. (eds.) Advances in Neural Information Processing Systems, vol. 6, pp. 937–944. Morgan-Kaufmann (1994) 22. Yin, X., Yin, X., Huang, K., Hao, H.: Robust text detection in natural scene images. IEEE Trans. Pattern Anal. Mach. Intell. 36(5), 970–983 (2014) 23. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Computer Vision and Pattern Recognition (CVPR) (2015) 24. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. In: Proceedings of the IEEE, pp. 2278–2324 (1998) 25. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR (2014) 26. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR, abs/1512.03385 (2015) 27. Abdel-Hamid, O., Mohamed, A.-R., Jiang, H., Deng, L., Penn, G., Yu, D.: Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Proc. 22(10), 1533–1545 (2014)


28. Mao, Q., Dong, M., Huang, Z., Zhan, Y.: Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Trans. Multimedia 16, 2203–2213 (2014) 29. Santos, R.M., Matos, L.N., Macedo, H.T., Montalv˜ ao, J.: Speech recognition in noisy environments with convolutional neural networks. In: 2015 Brazilian Conference on Intelligent Systems (BRACIS), pp. 175–179, November 2015 30. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, 25–29 October 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1746–1751 (2014) 31. Kalchbrenner, N., Grefenstette, E., Blunsom, P.: A convolutional neural network for modelling sentences. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), June 2014, Baltimore, Maryland, pp. 655–665. Association for Computational Linguistics (2014) 32. Wang, P., Xu, J., Xu, B., Liu, C., Zhang, H., Wang, F., Hao, H.: Semantic clustering and convolutional neural network for short text categorization. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 352–357. Association for Computational Linguistics (2015) 33. Johnson, R., Zhang, T.: Effective use of word order for text categorization with convolutional neural networks. In: The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2015, 31 May–5 June 2015, Denver, Colorado, USA, pp. 103–112 (2015) 34. Wasenm¨ uller, O., Stricker, D.: Comparison of kinect v1 and v2 depth images in terms of accuracy and precision, November 2016 35. Huang, J., Li, J., Gong, Y.: An analysis of convolutional neural networks for speech recognition. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4989–4993, April 2015 36. Gecer, B., Azzopardi, G., Petkov, N.: Color-blob-based COSFIRE filters for object recognition. Image Vis. Comput. 57(C), 165–174 (2017) 37. Azzopardi, G., Petkov, N.: A shape descriptor based on trainable COSFIRE filters for the recognition of handwritten digits. In: Wilson, R., Hancock, E., Bors, A., Smith, W. (eds.) Computer Analysis of Images and Patterns, pp. 9–16. Springer, Heidelberg (2013) 38. Guo, J., Shi, C., Azzopardi, G., Petkov, N.: Recognition of architectural and electrical symbols by COSFIRE filters with inhibition. In: CAIP (2015) 39. Fern´ andez-Robles, L., Azzopardi, G., Alegre, E., Petkov, N., Castej´ on-Limas, M.: Identification of milling inserts in situ based on a versatile machine vision system. J. Manuf. Syst. 45, 48–57 (2017) 40. Azzopardi, G., Rodr´ıguez-S´ anchez, A., Piater, J., Petkov, N.: A push-pull CORF model of a simple cell with antiphase inhibition improves SNR and contour detection. PLOS One 9(7), 1–13 (2014) 41. Strisciuglio, N., Petkov, N.: Delineation of line patterns in images using BCOSFIRE filters. In: 2017 International Conference and Workshop on Bioinspired Intelligence (IWOBI), pp. 1–6, July 2017 42. Strisciuglio, N., Azzopardi, G., Petkov, N.: Detection of curved lines with BCOSFIRE filters: a case study on crack delineation. CoRR, abs/1707.07747 (2017) 43. 
Azzopardi, G., Strisciuglio, N., Vento, M., Petkov, N.: Trainable COSFIRE filters for vessel delineation with application to retinal images. Med. Image Anal. 19(1), 46–57 (2015)


44. Azzopardi, G., Greco, A., Vento, M.: Gender recognition from face images with trainable COSFIRE filters. In: 2016 13th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 235–241, August 2016 45. Azzopardi, G., Greco, A., Saggese, A., Vento, M.: Fusion of domain-specific and trainable features for gender recognition from face images. IEEE Access 6, 24171– 24183 (2018) 46. Azzopardi, G., Petkov, N.: Trainable COSFIRE filters for keypoint detection and pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(2), 490–503 (2013) 47. Azzopardi, G., Petkov, N.: Ventral-stream-like shape representation: from pixel intensity values to trainable object-selective COSFIRE models. Front. Comput. Neurosci. 8, 80 (2014) 48. Liew, S.S., Khalil-Hani, M., Radzi, F., Bakhteri, R.: Gender classification: a convolutional neural network approach. Turkish J. Electr. Eng. Comput. Sci. 24, 1248– 1264 (2016) 49. Levi, G., Hassncer, T.: Age and gender classification using convolutional neural networks. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 34–42, June 2015 50. Dhomne, A., Kumar, R., Bhan, V.: Gender recognition through face using deep learning. Procedia Comput. Sci. 132, 2–10 (2018). International Conference on Computational Intelligence and Data Science 51. Narodytska, N., Kasiviswanathan, S.P.: Simple black-box adversarial perturbations for deep networks. CoRR, abs/1612.06299 (2016) 52. Moosavi-Dezfooli, S.-M., Fawzi, A., Frossard, P.: Deepfool: a simple and accurate method to fool deep neural networks. CoRR, abs/1511.04599 (2015) 53. Tang, W., Li, B., Tan, S., Barni, M., Huang, J.: CNN based adversarial embedding with minimum alteration for image steganography. CoRR, abs/1803.09043 (2018) 54. Uricar, M., Franc, V., Hlavac, V.: Facial landmark tracking by tree-based deformable part model based detector. In: 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), pp. 963–970, December 2016 55. Gender recognition dataset. http://mivia.unisa.it/datasets/video-analysisdatasets/gender-recognition-dataset/. Accessed 28 May 2018 56. Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical report 07-49, University of Massachusetts, Amherst, October 2007 57. Shan, C.: Learning local binary patterns for gender classification on real-world face images. Pattern Recogn. Lett. 33(4), 431–437 (2012). Intelligent Multimedia Interactivity 58. Azzopardi, G., Greco, A., Vento, M.: Gender recognition from face images using a fusion of SVM classifiers. In: Campilho, A., Karray, F. (eds.) Image Analysis and Recognition, pp. 533–538. Springer, Cham (2016) 59. Tapia, J.E., Perez, C.A.: Gender classification based on fusion of different spatial scale features selected by mutual information from histogram of LBP, intensity, and shape. IEEE Trans. Inf. Forensics Secur. 8(3), 488–499 (2013) 60. Dago-Casas, P., Gonz´ alez-Jim´enez, D., Yu, L.L., Alba-Castro, J.L.: Single- and cross- database benchmarks for gender classification under unconstrained settings. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 2152–2159, November 2011

Standardization of the Shape of Ground Control Point (GCP) and the Methodology for Its Detection in Images for UAV-Based Mapping Applications

Aman Jain1, Milind Mahajan2, and Radha Saraf3

1 Visvesvaraya National Institute of Technology, Nagpur, India
[email protected]
2 Medi-Caps Institute of Technology and Management, Indore, India
[email protected]
3 Skylark Drones Pvt. Ltd., Bangalore, India
[email protected]

Abstract. The challenge of georeferencing aerial images for an accurate object-to-image correspondence has gained significance over the past couple of years. There is an ever-increasing need to establish accurate georeferencing techniques for Unmanned Aerial Vehicles (UAVs) for tasks like aerial surveys of mines and construction sites, change detection along national highways, inspection of major pipelines, and intelligent farming, among others. With this paper, we aim to establish a standard method of georeferencing by proposing the design of a simple, white, L-shaped marker along with a pipeline for its detection. In a first, the less common DRGB color space is used along with the RGB color space to segment the characteristic white color of the marker. To carry out recognition, a scale- and rotation-invariant modification of the edge oriented histogram is used. To allow for accurate histograms, improvements are made on Canny edge detection using adaptive approaches and exploiting contour properties. The histogram obtained displays a characteristic distribution of peaks for GCP-markers. Thus, a new peak-detection and verification methodology is also proposed, based on normalized cross-correlation. Finally, a CNN model is trained on the regions of interest around the GCP-markers that remain after the filtering. The results from the EOH and the CNN are then used for classification. Regions with a diverse range of locality, terrain, and soil quality were chosen to test the developed pipeline. The results of the design and the pipeline combined were quite impressive, with regard to the accuracy of detection as well as its reproducibility in diverse geographical locations.

Keywords: Ground control point · Differential RGB · Edge oriented histogram · LeNet model · CNN · Normalized cross-correlation

1 Introduction

The use of UAVs equipped with cameras has increased rapidly in recent years, especially in the field of geomatics. This is because of cost-efficient data acquisition at a high spatiotemporal resolution, which supports point cloud synthesis, 3D-mesh, digital surface


model (DSM) and digital elevation model (DEM) generation [2]. The raw imagery cannot be used directly for photogrammetric applications as it is not accurately georeferenced, owing to the limited accuracy (about 2 m) of the commercial-grade GPS mounted on UAVs. Moreover, terrain interaction and local elevation changes introduce additional errors into the data [3]. To make use of the aerial data for scalable geomatic applications, an ortho-rectification stage is a necessity. It removes erroneous image displacements caused by the interaction between terrain relief or elevation changes and sensor orientation variation [4–6]. In order to orthorectify the image data, various methods have been proposed, of which establishing ground control seems the most favorable [7]. Ground control points (GCPs) are points on the ground with a known location in the world, obtained using survey-grade instruments such as a Differential Global Positioning System (DGPS) or a Real Time Kinematic Global Positioning System (RTK-GPS) [8]. The GCPs influence the aerial triangulation and help in maximizing photogrammetric precision and accuracy [1, 9, 10]; they also aid in producing accurately geo-referenced products. Currently, the majority of Geographic Information System (GIS) based enterprises use different types of markers to establish ground control, some of which are shown in Fig. 1. The markers are such that they can be easily distinguished from the background region in aerial surveys, and a specific point on each marker is labeled as the GCP. Currently, there is no standard for the design of the marker, and every enterprise creates its own custom design. Consequently, no standard algorithms have been developed so far to automatically detect the marker and give an accurate image-to-world correspondence, owing to which GCP detection is carried out manually by an operator [11, 12]. It comprises detecting the GCP-marker in an image and clicking on the point labeled as the GCP; corresponding to that point, its location is then identified in the world. This has to be done for all images captured by the UAV near GCPs. Evidently, the task is highly labor-intensive, time-consuming, monotonous, and prone to human error, and it is therefore neither scalable nor profitable on large-area projects.


Fig. 1. (a) Spray painted marker with permanent center. (b) Glare resistant ceramic tile. (c) Black circle with white background and provision for center marking

Specific markers like ArUco and April Tags [31] have been used previously in robotic applications for pose estimation and navigation [13]; however, their usage in geomatics has been largely limited because of operational difficulties. There is, therefore, an


urgent need for a standard marker whose use is operationally feasible in geomatic applications. Its design should also provide features that allow an illumination-, scale- and rotation-invariant detection algorithm, so that the mapping process can be automated. In this paper, a generic design of GCP-marker is proposed, aimed at solving the operational difficulties in the survey and at supporting automation in mapping. The paper also proposes a pipeline to automate the mapping task with the proposed GCP design. It is accomplished using Edge Orientation Histograms [12] with some variations. There is also a provision for integrating a learning model into this pipeline to increase its robustness.

2 Related Work

The problem of automatically detecting a GCP-marker in an image can be considered an object detection and recognition problem, which can be tackled in two ways. One is to determine the feature vector manually, based on shape, size or texture, and then apply a method such as Support Vector Machines [23] for classification. The other is to let the machine learn the features of the marker and generate its own classification algorithm. The first is a traditional computer vision technique with feature engineering as a prerequisite, while the latter is known as supervised learning, with computing power and datasets as prerequisites. This section describes some of the related work done to date on detecting an object in an image based on its shape, color, and texture. One of the first methodologies to detect varying shapes in an image was proposed by Ballard [24] using the generalized Hough transform. It was followed by Ullman [25], who used basic visual operators such as color, shape and texture to detect objects in images. A breakthrough in feature engineering came when Lowe [26] introduced SIFT (Scale Invariant Feature Transform) to extract a large collection of feature vectors invariant to scale, rotation, and illumination. Malik et al. [27] discussed efficient shape matching using shape contexts. Hu moments [28], computed from image moments, are also among the most powerful shape features that are invariant to scale and rotation. Real-time object detection with low processing power became possible using a cascade of weak classifiers known as the Haar cascade [29]. Edges, which carry shape information, gave rise to algorithms such as the Histogram of Oriented Gradients (HOG) [30] and Edge Oriented Histograms (EOH) [12]. The current state-of-the-art technologies use deep neural networks, which require a large amount of data and computing power; these are based on different architectures, among which the CNN is the most common for images. This work is mainly inspired by Ullman's visual routines, EOH, and the LeNet model [14], a CNN-based architecture.

3 Methodology

Placement of Ground Control Points (GCPs) in the survey arena is a tedious task, and a lot of work has been done in this domain to optimize their number and distribution [15–18]. In a UAV-based mapping/surveying project, the drone mapping software needs to


accurately position the ground map in relation to the real world around it. For this, it is required to establish the correspondence between the map and the real world. This correspondence is achieved with the help of GCPs. GCP-markers are markers of a specific design placed at strategic intervals on the ground that is to be mapped aerially. GCP-markers are peculiar in the sense that they can be distinguished from their surroundings in terms of their shape, color, texture or pattern. A point on the GCP-marker (usually the center) is accurately detected in the aerial image, and its corresponding geolocation in the world is computed (using DGPS or RTK-GPS) to millimeter-level accuracy. This correspondence can then be fed to the ortho-rectification software. It also helps to create a dataset of accurately geo-referenced images. The following section is divided into two parts: one regarding the design of the proposed GCP-marker, which aids its operational feasibility, and the other regarding the pipeline developed for its detection.

3.1 Design of GCP-Marker

The design of the GCP-marker primarily depends on two factors – the operational difficulty in laying the markers in the survey arena and the technical difficulty in detecting the GCP in the image. The latter is described in the section devoted to the detection pipeline. The workflow of mapping a region established with ground control commences with estimating the number and the distribution of GCPs. Based on the Ground Sampling Distance (GSD) requirement of the client [19], the GCP-markers are fixed onto the ground at pre-decided intervals in accordance with the distribution. This is done by the field vendor. The survey team then measures the exact geolocation of all the laid GCPs to millimeter-level accuracy using DGPS/RTK-GPS. Finally, the flight is planned and the UAV captures images of the region. To date, different organizations have come up with different designs of GCP-markers to accomplish these tasks involved in mapping. Very few designs amongst them, however, are scalable enough to survey hundreds of kilometers of land with varying topology without a trade-off in speed, ease, and reliability. There are operational difficulties associated with the design of these GCP-markers. If the shape of the GCP-marker is complex, affixing it to the ground is more time-consuming. It is difficult to carry such markers and to maintain them if they get soiled or damaged. Moreover, if the center of the GCP needs to be marked, or if the marker has to be placed in a particular orientation, it demands extra care on the part of the field vendor. Then there is the opposition from local residents who are unwilling to allow a survey of their lands. Recollection of GCPs in large-scale projects creates occasions for coming face to face with uncooperative people a second time; it is also a waste of time and labor and creates a lag in the workflow. The operational difficulties thus compel us to choose a design for the marker which is simple and portable. Disposable markers require zero maintenance and avoid wasting time in recollecting them. They also lower the risk of project incompletion by reducing the interactions with local residents. Lastly, the process must be as streamlined as possible so that as little as possible is left to the judgment of the field vendors, ensuring smooth data acquisition. The proposed design of the GCP-marker, shown in Fig. 2 below, was chosen keeping in mind the aforementioned aspects. It is a simple L-shape made from identical


rectangular strips of dimensions 65 cm × 15 cm placed perpendicular to each other. The strips are white in color and have a high coefficient of reflectance, making them distinctly visible from higher altitudes. The strips can be rolled, making storage compact and portable, and they can be used directly in the survey without any modifications. The point to be detected on this marker is the inner corner of the letter L. This eliminates the need for the field vendor to worry about the orientation of the marker or to estimate center points, which is the case with most other GCP-markers with convex geometric shapes. The strips, being disposable, do not require any maintenance, unlike GCPs printed on cloth or other materials. The GCP strips are cheap, so recollection is also out of the question.

Fig. 2. Proposed design of the GCP-marker.

3.2 Detection of GCP

The features that could be used in the detection process with the proposed GCPs are the white color of the GCP, its plain texture and its L-shape. The proposed pipeline uses this information for detection and consists of the steps mentioned below. The specifications of each step, necessary to replicate the pipeline, are discussed in the implementation section.

RGB and Differential RGB Color Space Thresholding. The pipeline commences with the simultaneous color thresholding of the image in the RGB color space and the Differential RGB (DRGB) color space. The DRGB color space is an alteration of the existing RGB space and is designed to overcome noise and illumination effects. An RGB image can be converted into DRGB space by performing the following operations:

DR(u, v) = |R(u, v) − G(u, v)|    (1)
DG(u, v) = |G(u, v) − B(u, v)|    (2)
DB(u, v) = |B(u, v) − R(u, v)|    (3)

A pure white pixel in an 8-bit RGB image has the values (255, 255, 255), and white or near-white pixels have almost the same values in all three channels. As a result, whitish pixels in the DRGB space take low values, because there is little to no difference between the values of the three channels. Thus, we can extract white pixels by applying a low threshold on all three channels in the DRGB space. However, the same is true for black pixels and for any other pixels that have similar values in all three channels; these cannot be differentiated from white pixels in the DRGB space. Thresholding only the DRGB space is therefore bound to produce many noisy pixels apart from the GCP ones. To overcome this, we simultaneously perform a color thresholding in the RGB color space with a higher threshold, obtained as a function of the image intensity distribution. The resulting binary images from both thresholds are then bitwise ANDed to finally yield only whitish pixels.

Contour Approximations and Checks. Once we obtain a binary image after the simultaneous thresholding, we perform morphological operations (erosion and dilation) to fill holes and remove noisy pixels. This leads to the formation of solid blobs in the image. Contours are then fitted around these blobs. Every fitted contour goes through various checks, which include the area enclosed by the contour, the perimeter of the contour, the convexity of the contour, and the area and aspect ratio of the bounding box estimated around the contour. The contours that qualify all these checks are passed on further, and the corresponding bounding boxes are stored for later use as regions of interest (ROIs). Figure 3 summarizes the algorithm for ROI generation in a flowchart.

Fig. 3. Flowchart of the algorithm for ROI generation.
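A condensed sketch of the ROI-generation stage (thresholding in RGB and DRGB, bitwise AND, morphological opening/closing, and contour checks). The numeric limits are the defaults reported later in the implementation section; whether the area limits apply before or after the 15-pixel padding, and the exact order of the checks, are assumptions:

```python
# ROI generation: DRGB (Eqs. 1-3) + RGB thresholds -> AND -> morphology -> contour checks.
import cv2
import numpy as np

def generate_rois(img_bgr):
    b = img_bgr[:, :, 0].astype(np.int16)
    g = img_bgr[:, :, 1].astype(np.int16)
    r = img_bgr[:, :, 2].astype(np.int16)
    drgb = np.dstack([np.abs(r - g), np.abs(g - b), np.abs(b - r)]).astype(np.uint8)
    white_rgb = cv2.inRange(img_bgr, (180, 180, 180), (255, 255, 255))
    white_drgb = cv2.inRange(drgb, (0, 0, 0), (30, 30, 30))
    mask = cv2.bitwise_and(white_rgb, white_drgb)
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # remove small freckles
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill holes in blobs
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    rois = []
    for cnt in contours:
        if not 25 <= cv2.contourArea(cnt) <= 600:           # contour-area check
            continue
        x, y, w, h = cv2.boundingRect(cnt)
        aspect = max(w, h) / max(min(w, h), 1)
        box_area = (w + 30) * (h + 30)                      # 15 px padding on all sides
        if aspect > 2 or not 1200 <= box_area <= 4500:      # box checks
            continue
        rois.append((x - 15, y - 15, w + 30, h + 30, cnt))
    return rois
```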


Edge Detector. A region of interest (ROI) is extracted from the original image using each bounding box, as these are possible containers of GCP-markers. The extracted ROIs and the corresponding contours are passed to the edge-detector function. The edge detector is built on the conventional Canny edge detection [22] algorithm with some modifications. The conventional Canny algorithm requires arguments to be tuned manually; most of these take default values, but others, like the lower and higher thresholds, vary from one image to another. To overcome this manual tuning of the Canny arguments, an adaptive implementation of Canny is used [20, 21]. The output of this step is further checked on the number of detected edge pixels, which is compared with the perimeter of the GCP in the image; the perimeter in the image can be approximated using photogrammetry. If the number of edge pixels is less than the perimeter, a decremented threshold is used. The decrement value was experimentally determined by checking it across the different datasets. If the check still fails, the ROI and the contour are rejected; those which pass the check are retained. The ROIs, after edge detection, may contain some noisy edge pixels apart from the GCP-marker. To eliminate these noisy pixels we use the shape information furnished by the contour: for every edge pixel, we check whether it lies in the vicinity (3–5 pixels) of the contour corresponding to that ROI, and if it does not, it is eliminated. Thus, a refined edge image is obtained. Figure 4 summarizes the algorithm for edge detection in a flowchart.

Fig. 4. Flowchart of the algorithm for Edge Detection.
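A sketch of the adaptive edge-detection step with the Otsu-seeded feedback loop described above. The ratio of 3 between the thresholds, the floor of 150 and the decrement of 5 follow the implementation section; the contour-vicinity filtering via a dilated contour mask, and the assumption that the contour is given in ROI coordinates, are illustrative simplifications:

```python
# Adaptive Canny with a feedback loop, followed by contour-vicinity filtering.
import cv2
import numpy as np

def detect_marker_edges(roi_gray, contour, expected_perimeter, floor=150, step=5):
    high, _ = cv2.threshold(roi_gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    high = float(high)
    edges = np.zeros_like(roi_gray)
    while high >= floor:
        edges = cv2.Canny(roi_gray, high / 3.0, high)       # lower threshold = higher / 3
        if cv2.countNonZero(edges) >= expected_perimeter:
            break
        high -= step                                        # feedback: decrement by 5
    # keep only edge pixels within a few pixels of the stored contour
    band = np.zeros_like(roi_gray)
    cv2.drawContours(band, [contour], -1, 255, thickness=3)
    band = cv2.dilate(band, np.ones((3, 3), np.uint8))
    return cv2.bitwise_and(edges, band)
```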


Edge Oriented Histogram. Once the edge pixels have been identified in an ROI, the corresponding gradient magnitudes and directions are computed and stored. The gradients in the x and y directions are calculated using the Scharr operator, a more accurate variant of the Sobel operator. The resultant directions are stored as angles with values in [0, 360). A 36-bin histogram is built from the frequency of occurrence of the angles, and a line plot is then formed from it by averaging each bin with its neighboring bins. The line plot stores the necessary information regarding the shape of the GCP-marker. The proposed GCP-marker is an L-shaped marker having 6 sides, and its 6 edges are aligned in 4 major directions, almost ninety degrees apart, as shown in Fig. 5(a). This remains true irrespective of scale, rotation or contrast, within certain boundary conditions discussed in the Conclusion. The descriptor can be considered scale independent as long as the shape of the GCP is clearly visible in the image. It is rotation invariant in the ground plane because, irrespective of the rotation of the GCP, there will always be 4 major edge directions that are ninety degrees apart. Also, since only the gradient direction is considered and not its magnitude, contrast does not have any major influence over the histogram shown in Fig. 5(c). Thus, when we plot an edge orientation histogram, we see 4 peaks, approximately 90° apart, as shown in Fig. 5(b). This is a very useful property for identifying a GCP-marker of the proposed shape.


Fig. 5. (a) Quiver plot of edge gradients, (b) Line plot with 4 peaks, in interval (0–36), (c) Edge oriented histogram of GCP-marker with 360 bins.
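A sketch of the edge-orientation-histogram computation (Scharr gradients, 36 bins of 10°, neighbour-averaged smoothing); the function signature is an assumption:

```python
# Edge-oriented histogram of the surviving edge pixels in an ROI.
import cv2
import numpy as np

def edge_orientation_histogram(roi_gray, edge_mask):
    gx = cv2.Scharr(roi_gray, cv2.CV_32F, 1, 0)
    gy = cv2.Scharr(roi_gray, cv2.CV_32F, 0, 1)
    theta = (np.degrees(np.arctan2(gy, gx)) + 360.0) % 360.0   # angles in [0, 360)
    angles = theta[edge_mask > 0]
    hist, _ = np.histogram(angles, bins=36, range=(0.0, 360.0))
    smoothed = (np.roll(hist, 1) + hist + np.roll(hist, -1)) / 3.0  # neighbour averaging
    return smoothed                                             # 36-bin base array
```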

Multiple Peak Detection and Verification. From the histogram obtained, the next step is to check for the existence of 4 peaks that are nearly ninety degrees apart from one another, like those shown in Fig. 6(a). We cannot directly use existing peak detection algorithms, since they address the general case of finding a single peak: here, we need to find peaks in intervals of nearly 90° each and ensure not only that they are close to the corresponding maxima in those intervals but also that they are ninety degrees apart from the peaks in the other intervals. To solve this problem, the proposed solution uses a 1D normalized cross-correlation technique. Firstly, the 360 different orientations are divided into 36 bins of 10 consecutive degrees each. The angles extracted from the edge image are used to fill this 1D array of 36 bins, with each bin storing the frequency of the corresponding orientation range; this is known as the base array. Next, nine different 1D template arrays of the same size as the base array are created. In every template, 4 indices are set to a high value and the rest are initialized to zero. These indices are exactly nine units apart, i.e. the first template has peaks at (0, 9, 18, 27) as shown in Fig. 6(b), the second one has peaks at (1, 10, 19, 28), and so on until (8, 17, 26, 35) in the last template, as shown in Fig. 6(c). Next, a normalized cross-correlation is performed between each template and the base array, and a score is computed. The template with the maximum score amongst the nine corresponds to the peak in the 0–8 interval; this is regarded as the first peak with the desired properties, and the remaining peaks are assumed to be 9, 18 and 27 units apart from it. Once these peaks are computed, they are verified by ensuring that each individual peak is in the vicinity of the local maximum of its interval. A score is then computed on the basis of the number of peaks (out of four) that satisfy this property. This score, along with the probability score from the learning model, is used to calculate the final score, which classifies the edge image as containing a GCP or not, depending on whether it exceeds a pre-decided threshold.


Fig. 6. (a) Line plot of base array. (b) Line plot of template with peak indexes at (0, 9, 18, 27). (c) Line plot of template with peak indexes at (8, 17, 26, 35).
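A sketch of the template-based peak detection and verification; the fixed 9-bin intervals used for the local-maximum check and the one-bin tolerance standing in for the ±5° vicinity are assumptions:

```python
# Nine templates with peaks 9 bins (90 degrees) apart, scored by normalized
# cross-correlation (Eq. 4) against the 36-bin base array, followed by peak verification.
import numpy as np

def four_peak_score(base_array):
    base = np.asarray(base_array, dtype=np.float64)
    templates = np.zeros((9, 36))
    for offset in range(9):
        templates[offset, [offset, offset + 9, offset + 18, offset + 27]] = 100.0
    scores = templates @ base / (np.linalg.norm(templates, axis=1)
                                 * np.linalg.norm(base) + 1e-12)
    first = int(np.argmax(scores))                     # first peak in the 0-8 interval
    peaks = [first, first + 9, first + 18, first + 27]
    verified = 0
    for k, p in enumerate(peaks):
        local_max = 9 * k + int(np.argmax(base[9 * k:9 * (k + 1)]))
        if abs(p - local_max) <= 1:                    # within roughly one bin of the maximum
            verified += 1
    return verified / 4.0                              # normalised EOH score
```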

Learning Model. To increase the robustness of the detection, a learning model is also integrated into the pipeline; its score is computed simultaneously with the score computed from the edge-oriented histogram. An ROI is given as input to the model. As the size of an ROI containing a GCP-marker is small, with height and width varying between 20–60 pixels, a shallow six-layer Convolutional Neural Network (CNN) is used. The architecture of the network is inspired by the LeNet model [14]; however, the input to the CNN is a 32 × 32 × 3 pixel RGB image rather than a single-channel grayscale image, as was the case in the LeNet model. Out of the 6 layers, 3 are convolutional layers and the rest are fully connected layers. Each convolutional layer is followed by max pooling and batch normalization [21]. Each max-pool operation uses a stride of 2 pixels, which reduces the spatial dimension of the input by a factor of 2 while maintaining the number of channels. The output layer of the CNN consists of a softmax activation function which gives the probability score of a particular ROI containing a GCP. To reduce the chances of overfitting on the dataset, dropout [16] is used in the first 4 layers of the model. As shown in Fig. 7, the first convolutional layer filters the 32 × 32 × 3 input image with 32 kernels of size 7 × 7 with 'same' padding (this generates an output of the same size as the input) and a dropout factor of 0.5. Thus we get an output of size 32 × 32 × 32 pixels, which is then max-pooled and normalized. The second convolutional layer takes the output of the first layer and filters it with 64 kernels of size 5 × 5, also with 'same' padding and a dropout factor of 0.4. The third convolutional layer then filters the output of the second layer with 128 kernels of size 3 × 3 with 'same' padding and a dropout factor of 0.3. The resulting output is then flattened into a single-dimensional layer of 2048 nodes. This layer is connected to the first fully connected layer with 512 nodes and a dropout of 0.2. The subsequent fully connected layers have 256 and 128 nodes, respectively, and do not use dropout. The last fully connected layer is connected to a 2-way softmax layer, which finally gives the probability score for a GCP's presence in an ROI.

Fig. 7. Architecture of CNN model.
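A sketch of the six-layer LeNet-inspired network in Keras (the library named by the authors). Kernel counts, sizes, dropout rates, the 2048-node flatten and the 2-way softmax follow the text; the ReLU activations and the exact ordering of dropout relative to pooling and batch normalization are assumptions:

```python
# Shallow CNN: 3 conv layers (32x7x7, 64x5x5, 128x3x3) + 3 dense layers + 2-way softmax.
from tensorflow import keras
from tensorflow.keras import layers

def build_gcp_cnn():
    model = keras.Sequential([
        layers.Input(shape=(32, 32, 3)),
        layers.Conv2D(32, 7, padding="same", activation="relu"),
        layers.Dropout(0.5),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.BatchNormalization(),
        layers.Conv2D(64, 5, padding="same", activation="relu"),
        layers.Dropout(0.4),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.BatchNormalization(),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.Dropout(0.3),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.BatchNormalization(),
        layers.Flatten(),                       # 4 x 4 x 128 = 2048 nodes
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(2, activation="softmax"),  # probability of GCP vs. background
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.0006),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```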

Isometric Weights. The score computed from the edge-oriented histogram is normalized by dividing it by 4. The score computed from the learning model is a probability, which lies between 0 and 1. A final score is computed from the normalized EOH score and the probability from the learning model, with both scores given equal weights (0.5 each). The final score is then used to classify the ROI as containing a GCP-marker or not. The methodology is summarized in the flowchart in Fig. 8 below.


Fig. 8. Flowchart of methodology.
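The isometric weighting step reduces to a few lines; a minimal sketch, assuming the EOH score is reported as the number of verified peaks (0–4) and the threshold of 0.75 given in the implementation section:

```python
# Equal-weight fusion of the normalised EOH score and the CNN probability.
def fuse_scores(eoh_peaks_satisfied, cnn_probability, threshold=0.75):
    eoh_score = eoh_peaks_satisfied / 4.0
    final_score = 0.5 * eoh_score + 0.5 * cnn_probability
    return final_score >= threshold, final_score
```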

4 Implementation

4.1 Data Acquisition

In order to test the accuracy of the proposed GCP design and its detection pipeline, datasets were obtained (courtesy of Skylark Drones Pvt. Ltd.) and chosen keeping in mind that they should encompass as many terrain variations as possible. Thus, a diverse collection of datasets consisting of rural areas, farmlands, roads, a national highway, mines, construction sites, and deserts was chosen. The details of the datasets are given in Table 1 in the Results section. GCPs were laid at pre-decided locations, and the ground truth data was collected by field measurements using a differential GPS receiver with millimeter-level accuracy. On the software side, the pipeline was implemented in Python using OpenCV, Matplotlib, NumPy, piexif, Keras and other auxiliary libraries. At certain places throughout the detection pipeline, values for specific parameters have been hard-coded after careful calculations, corresponding observations and subsequent

Table 1. Dataset description.

Dataset name        Latitude (deg.)  Longitude (deg.)  Relative altitude (range in m)  Average image intensity  Dataset region     Objects present
Amplus_Bikaner      28.14 N          72.97 E           60–70                           178.48                   Desert             Solar panels
Cleanmax            17.11 N          77.23 E           105–115                         66.92                    Farmlands          Stones/trees
Goa_Vedanta         15.34 N          74.13 E           140–150                         182.39                   Mines              Trucks
NH_150              12.93 N          76.75 E           55–65                           155.79                   National highway   Farms/settlements
Sterling&Wilson_U1  14.27 N          77.46 E           105–115                         123.26                   Farms/barren land  Large stones
Sterling&Wilson_U2  14.25 N          77.46 E           90–100                          112.35                   Solar farms        Roads/trees
TUV                 13.98 N          76.54 E           55–65                           166.33                   Orange soil        White construction materials
L&T_Navi_Mumbai     18.95 N          73.02 E           90–100                          131.76                   Mines              White dust
L&T_NIMZ            17.77 N          77.67 E           180–200                         29.73                    Agricultural land  Crops

validation across all 9 datasets. Most of these parameters can be estimated from the pinhole camera model given the camera intrinsics and the relative altitude at which the images were taken, allowing for some variation. However, UAVs rarely maintain a constant elevation throughout a flight, because of which unexpected results can always show up. This is mainly because the effect of the variation in elevation on the different parameters is non-linear and cannot easily be modeled, as it depends on more than a single constraint. Nevertheless, one can always proceed with the hard-coded values as defaults, or tune them manually according to the application to further improve the accuracy. For an 8-bit, 3-channel image, the threshold range for the RGB color space is set to 180–255, whereas for the DRGB color space it is set to 0–30. However, in cases where the average image intensity is below 100, it is advisable to use a threshold of (average intensity + 100). The thresholded images in both color spaces are bitwise ANDed to obtain only white pixels in the resultant binary image. However, as explained in the methodology section, the thresholded image is replete with noisy patches similar to a GCP-marker patch, and tiny noise freckles always show up. To remove these smaller noisy pixels, we perform a morphological opening operation (erosion followed by dilation), followed by a morphological closing operation, in which the order of dilation and erosion is reversed. All morphological operations are carried out with a square kernel of size 3. After running the morphological operations on the image, contours are extracted. The extracted contours go through various checks to remove false positives. For the area enclosed by the contour, the accepted range of values is 25 to 600 pixels. The range is intentionally kept large so that a likely GCP-marker patch qualifies the check for images taken at different altitudes ranging from 50 m to 150 m. To check for convexity, it is essential that the contour first be approximated by a polygon which best represents its shape. The parameter epsilon quantitatively determines the


number of vertices in the polygon; it is indicative of the difference between the perimeters of the approximated polygon and the original contour. The value chosen for this approximation was 10 percent of the perimeter of the contour. This step is essential because the shape of the contour might not be exactly the same as that of the GCP-marker owing to variation in pixel intensity, but it can be converted to the shape of the GCP-marker if we allow for such an approximation. The next check is on the bounding box area and aspect ratio. Given the shape of the GCP-marker, it was experimentally found that the aspect ratio of the bounding box does not exceed 1:2. Also, when constructing bounding boxes around contours, a padding of 15 pixels is added on all sides so that edges do not get trimmed off. Constraints can also be imposed on the bounding box area, as is evident from the experimental findings; the lower and upper limits for this constraint were set to 1200 and 4500 pixels, respectively. Once these checks are performed, the ROIs are extracted from the image using the bounding-box values. They can now be used in the edge detector and the learning model. The ROIs extracted from the image and the corresponding contours are fed to the edge detector. The detector is designed to take advantage of the information furnished by the contours. Using the adaptive Canny approach, we get a lower and a higher threshold: the Otsu implementation gives us the higher threshold, and assuming a constant ratio of 3 between the higher and lower thresholds, the lower threshold can be derived from it. But this adaptive thresholding might fail in some cases, for example when the Otsu threshold value is above 200 in an 8-bit image. To overcome such scenarios, a feedback loop is implemented. The aim of this loop is to extract edges until the number of edge pixels reaches the perimeter of the GCP-marker, without letting the higher threshold fall below 150; the value of 150 was also experimentally determined and validated across all datasets. The starting point of the loop is the threshold given by the Otsu algorithm, and the threshold value is then decremented by 5 in every iteration until the number of detected edge pixels is greater than the perimeter of the GCP-marker or the lower limit of the higher threshold is reached. The perimeter can be determined using photogrammetric constraints; in case these are not available, the empirical relation equating the perimeter to 15% of the area enclosed by the marker contour can be used. After the feedback loop ends, we are left with edges containing a lot of noise. To remove the noisy edges we use the contour information: edge pixels that are at a distance of more than 3 pixels from the contour stored in that ROI are removed. This gives us a clean edge image without any unrelated noise pixels. After getting the edge image, we compute the gradients Gx and Gy in the x and y directions using the Scharr operator. The resultant gradient magnitude (G) and orientation (theta, in degrees) are calculated at each edge pixel. From the orientations, a histogram with 36 bins is formed and smoothed by averaging the previous, current and next bin values. Next, a line plot is constructed from the histogram. As discussed in the methodology, we see 4 peaks in the line plot that are nearly ninety degrees apart, representing a GCP-marker in the image.
To check this condition, peak detection and peak verification algorithm is implemented. Nine templates with peaks at indexes (0, 9, 18, 27) till (8, 17, 26, 35) are created. A higher value of 100 is given at these peak-indexes. Also, an empty array of nine elements is created


to store the score of the normalized cross-correlation for every template. The normalized cross-correlation between a template and the base array is given by:

Normalized cross-correlation = Σ_i (base_array[i] × template[i]) / (‖base_array‖ × ‖template‖)    (4)

The index at which the highest score is obtained is the position of the first peak; the other peaks are then computed from it as the index of the first peak + 9, + 18 and + 27, respectively. Once these peaks are computed, it is checked whether the global maximum of the corresponding 90° interval lies in the vicinity of each peak, where the vicinity is defined as ±5° in this case. The number of peaks that satisfy this property is then divided by 4 (the total number of peaks) to compute the score. Simultaneously, a score is computed from the CNN model. The CNN has a total of 1,342,018 parameters. It was trained on a dataset containing 30,588 images of size 32 × 32 pixels, of which 11,000 images contain a GCP and 19,588 contain random objects. For the CNN model to generalize over all types of image intensities and background colors, different variations of the same GCP-markers were generated by modifying the brightness and rotating the images. The machine learning library Keras, along with the Adam optimization algorithm [22], was used to train the model with a batch size of 48 and a learning rate of 0.0006. It was observed that a small learning rate and dropout made the training process longer but eventually resulted in better accuracy. The final score is obtained by the isometric weighting of both scores. To classify an ROI as containing a GCP-marker, the pre-decided threshold score was chosen as 0.75.

5 Results

The information regarding all the datasets is given in Table 1. It consists of the mean latitude, mean longitude and mean altitude of each dataset. There are additional columns that specify the lighting conditions, the type of region encompassing the dataset and the presence of objects in those regions that are plausible causes of false detections. All the tests were conducted on these 9 datasets. Table 2 covers the analysis of the implementation. It consists of columns that cover the accuracy of the ROI generation algorithm, i.e. the accuracy of bounding box detection, and the accuracy metrics of classification. For classification of the ROIs, defining the accuracy of an algorithm is not straightforward; various metrics have been proposed to date. This paper makes use of the F-Score, which is the most generic criterion used in classification problems. Mathematically, it is the harmonic mean of precision and recall, where

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}, \qquad \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$


$$\text{F-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Table 2. Implementation analysis.

| Dataset name | B-box detection accuracy (%) | Precision (CNN) | Precision (Proposed) | Recall (CNN) | Recall (Proposed) | F-Score (CNN) | F-Score (Proposed) |
|---|---|---|---|---|---|---|---|
| Amplus_Bikaner | 1.00 | 1.0000 | 1.0000 | 0.9745 | 0.9809 | 0.9871 | 0.9904 |
| Cleanmax | 0.92 | 1.0000 | 1.0000 | 0.7619 | 0.8571 | 0.8649 | 0.9231 |
| Goa_Vedanta | 0.98 | 0.9742 | 0.9771 | 1.0000 | 0.9901 | 0.9869 | 0.9836 |
| NH_150 | 0.68 | 0.8000 | 0.9393 | 0.9664 | 0.9934 | 0.8753 | 0.9662 |
| Sterling&Wilson_U1 | 1.00 | 0.9556 | 0.9595 | 0.8784 | 0.9595 | 0.9265 | 0.9595 |
| Sterling&Wilson_U2 | 0.96 | 1.0000 | 1.0000 | 0.8462 | 0.9412 | 0.9167 | 0.9697 |
| TUV | 1.00 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| L&T_Navi_Mumbai | 0.93 | 0.2222 | 0.9333 | 1.0000 | 0.9655 | 0.3636 | 0.9492 |
| L&T_NIMZ | 0.92 | 0.6590 | 0.7982 | 0.9344 | 0.7459 | 0.7729 | 0.7712 |

A higher Precision value is indicative of fewer false positive cases, whereas a higher Recall value is indicative of fewer false negative cases. Ideally, we would want our algorithm to give fewer false cases (both positive and negative ones). For GIS purposes, it is beneficial to have high values of Precision and F-score. This is because even if a few GCPs get missed (a false negative case) in some of the images, i.e. low recall, they can possibly be traced back from other nearby images of the same GCP-marker, which allows for a lower Recall. However, it should not be very low, as it will affect the final F-score, which is a balance between the two. A comparison of the accuracy metrics between the CNN-based classification and the proposed implementation is tabulated in Table 2. It can be seen that there is little difference in the accuracy of the two methods across all the datasets, with the proposed implementation being slightly better. During experimentation, however, it was observed that the learning model took significantly less time to process an image in comparison to the combined EOH and learning model. Hence, for usage on large-scale projects, if the area to be mapped is similar to the one on which the model was trained, it is better to use the CNN-based classification for operational efficiency. If it is an exploration project, the proposed methodology is to be used, because of its better accuracy in terms of Precision and F-score. It can be observed that there is a lower Recall in the case of the Cleanmax dataset. This is because of the presence of a shadow on the GCP-marker, as shown in Fig. 9(a) below. A drastic fall in the accuracy of bounding box detection is observed in the NH_150 flight because of the overlap of the GCP-marker with the white stripes on the road. The overlap resulted in a poor accuracy of ROI generation. One such case is shown in Fig. 9(b) below. Very poor precision of the CNN is observed in L&T_Navi_Mumbai, on account of the presence of white dust in the surroundings. This can be taken care of if more data of a similar kind is fed to the system. We observe poor Precision and Recall in L&T_NIMZ, which is because of the very high relative altitude, which results in


distortion of the shape of the GCP in the image, as shown in Fig. 9(c) below. The distorted shape will now not necessarily obey the properties mentioned in the methodology section. False positives also increase because of small white blobs that arbitrarily satisfy the properties mentioned in the methodology.


Fig. 9. Edge cases in the proposed algorithm: (a) False classification as a negative owing to shadowing of the GCP-marker. (b) Failure in ROI generation because of overlap with objects of similar visual appearance. (c) Distortions owing to extreme altitudes.

6 Conclusion

The proposed methodology was successful in segmenting the GCP-markers from the image with a decent F-score. This was true irrespective of scale, rotation and illumination, subject to certain boundary conditions. It can be observed that the best results were obtained at altitudes ranging from 50 to 150 m. The marker orientation does not affect the result as long as the L-shape is clearly visible in the image, i.e. a rotation of 0-360° is permitted on the ground plane, irrespective of slight terrain variations. It is preferable to use bright images for processing, as this helps the detection process. Certain precautions need to be taken on the operational side to ensure a smooth workflow. These include avoiding placement of the GCP-marker on or close to white objects, minimal to zero occlusion, and the absence of shadows on the GCP-marker. If camera parameters and photogrammetric constraints are known, it is advisable to use them for tuning the hard-coded values; that will lead to improved accuracy metrics. On the learning side, it is advisable to furnish the model with as varied a dataset as possible to improve its accuracy on unknown datasets.

References
1. Oniga, V.-E., Breaban, A.-I., Statescu, F.: Determining the optimum number of ground control points for obtaining high precision results based on UAS images. In: Proceedings, vol. 2, no. 7 (2018)
2. Tiwari, A., Dixit, A.: Unmanned aerial vehicle and geospatial technology pushing the limits of development. Am. J. Eng. Res. (AJER) 4, 16-21 (2015)
3. Eltohamy, F., Hamza, E.H.: Effect of ground control points location and distribution on geometric correction accuracy of remote sensing satellite images (2009)
4. Tucker, C.J., Grant, D.M., Dykstra, J.D.: NASA's global orthorectified Landsat dataset. Photogram. Eng. Remote Sens. 70, 313-322 (2004)
5. Koeln, G.T., Dykstra, J.D., Cunningham, J.: Geocover and Geocover-LC: orthorectified Landsat TM/MSS data and derived land cover for the world (1999)
6. Armston, J.D., Danaher, T.J., Goulevitch, B.M., Byrne, M.I.: Geometric correction of Landsat MSS, TM, and ETM+ imagery for mapping of woody vegetation cover and change detection in Queensland (2002)
7. Liba, N., Berg-Jürgens, J.: Accuracy of orthomosaic generated by different methods in example of UAV platform MUST Q. In: Proceedings of IOP Conference Series: Materials Science and Engineering (2015)
8. Landau, H., Chen, X., Klose, S., Leandro, R., Vollath, U.: Trimble's RTK and DGPS solutions in comparison with precise point positioning. In: Sideris, M.G. (ed.) Observing our Changing Earth. International Association of Geodesy Symposia, vol. 133. Springer, Berlin (2009)
9. Alhamlan, S., Mills, J.P., Walker, A.S., Saks, T.: The influence of ground control points in the triangulation of Leica ADS 40 data (2004)
10. Kim, K.-L., Chun, H.-W., Lee, H.-N.: Ground control points acquisition using SPOT image: the operational comparison. In: International Archives of Photogrammetry and Remote Sensing, vol. XXXIII, Part B3, Amsterdam (2000)
11. Zhou, G.: Determination of ground control points to subpixel accuracies for rectification of SPOT imagery (2018)
12. Ren, H., Li, Z.: Object detection using edge histogram of oriented gradient. In: IEEE International Conference on Image Processing (ICIP), Paris, pp. 4057-4061 (2014)
13. Nguyen, T.: Optimal ground control points for geometric correction using genetic algorithm with global accuracy. Eur. J. Remote Sens. 48, 101-120 (2015)
14. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278-2324 (1998)
15. Tahar, K.N., Ahmad, A., Abdul, W., Wan, A., Akib, M., Mohd, W., Wan, N.: Assessment on ground control points in unmanned aerial system image processing for slope mapping studies. Int. J. Sci. Eng. Res. 3, 1-10 (2012)
16. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)
17. Federman, A., Santana Quintero, M., Kretz, S., Gregg, J., Lengies, M., Ouimet, C., Laliberte, J.: UAV photogrammetric workflows: a best practice guideline. In: ISPRS International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLII-2/W5, 237-244 (2017). https://doi.org/10.5194/isprs-archives-xlii-2-w5-237-2017
18. Huo, Y.-K., Wei, G., Zhang, Y., Wu, L.: An adaptive threshold for the Canny operator of edge detection. In: International Conference on Image Analysis and Signal Processing, Zhejiang, pp. 371-374 (2010)
19. Fang, M., Yue, G., Yu, Q.: The study on an application of Otsu method in Canny operator (2009)
20. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-8(6), 679-698 (1986)
21. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
22. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
23. Cortes, C., Vapnik, V.N.: Support vector networks. Mach. Learn. 20, 273-297 (1995)
24. Ballard, D.H.: Generalizing the Hough transform to detect arbitrary shapes. Pattern Recogn. 13, 111-122 (1987)
25. Ullman, S.: Visual routines. Cognition 18, 97-159 (1985)
26. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91-110 (2004)
27. Mori, G., Belongie, S., Malik, J.: Efficient shape matching using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1832-1837 (2005)
28. Hu, M.-K.: Visual pattern recognition by moment invariants. IRE Trans. Inf. Theory 8(2), 179-187 (1962)
29. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2001)
30. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2005)
31. Olson, E.: AprilTag: a robust and flexible visual fiducial system. In: Proceedings of IEEE International Conference on Robotics and Automation (2011)

Non-linear-Optimization Using SQP for 3D Deformable Prostate Model Pose Estimation in Minimally Invasive Surgery

Daniele Amparore¹, Enrico Checcucci¹, Marco Gribaudo², Pietro Piazzolla³(✉), Francesco Porpiglia¹, and Enrico Vezzetti³

¹ Division of Urology, Department of Oncology, School of Medicine, University of Turin-San Luigi Gonzaga Hospital, Regione Gonzole 10, 10043 Orbassano, Turin, Italy
² Dipartimento di Elettronica, Informatica e Bioingegneria, Politecnico di Milano, via Ponzio 51, 20133 Milan, Italy
³ Department of Management and Production Engineering, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Turin, Italy
[email protected]

Abstract. Augmented Reality began to be used in the last decade to guide and assist the surgeon during minimally invasive surgery. In many AR-based surgical navigation systems, a patient-specific 3D model of the surgical procedure target organ is generated from preoperative images and overlaid on the real views of the surgical field. We are currently developing an AR-based navigation system to support robot-assisted radical prostatectomy (AR-RARP) and in this paper we address the registration and localization challenge of the 3D prostate model during the procedure, evaluating the performance of a Successive Quadratic Programming (SQP) non-linear optimization technique used to align the coordinates of a deformable 3D model to those of the surgical environment. We compared SQP results in solving the 3D pose problem with those provided by the Matlab Computer Vision Toolkit perspective-three-point algorithm, highlighting the differences between the two approaches.

Keywords: Augmented Reality · Robotic surgical procedures · Prostatectomy · Computer-assisted surgery · Successive Quadratic Programming · Performance evaluation

1 Introduction

Laparoscopic surgery is a popular technique that benefits patients by minimizing surgical invasiveness. Augmented Reality (AR) began to be used in the last decade to help the surgeon overcome the limitations of this technique in terms of a restricted field of view and limited depth perception, and to guide and assist him or her during the surgical procedure. In these AR-based surgical navigation systems, a patient-specific 3D model of the surgical procedure target organ is generated from preoperative images and overlaid on the real


views of the surgical field. Nowadays the introduction of image-guided navigation systems in robot-assisted surgery, which has recently become a standardized medical procedure for several types of operations, is providing the surgeon with the ability to access the radiological images and the surgical planning alongside the anatomy of the real patient [4]. According to the taxonomy proposed in [8], the major components required to implement an AR-based Image Guided Surgery (IGS) system are: the data, patient specific, visually processed and shown within a particular surgical scenario, for a specific surgical step; the visualization processing techniques, or transformations on the data, used to provide the best visual representation of the data; and the view of the processed data, ready for the end-user to interact with, usually by means of a custom-tailored application or system. The main challenges that AR-based surgical navigation systems face in visualization processing concern both the registration accuracy between the 3D model and the surgical environment, and how to maintain this correspondence over time in the presence of intraoperative dynamics, such as laparoscopic camera movements and scene deformation [1]. We are currently working on developing an AR-based surgical navigation system to support both robot-assisted radical prostatectomy (AR-RARP) and partial nephrectomy (AR-RAPN); preliminary results about patient-specific 3D organ reconstruction have been published in, e.g., [15-18]. In this paper we present the results obtained in this ongoing study, specifically addressing the registration and localization challenge of the 3D prostate model during the surgical procedure. We are currently investigating for our system a solution that relies on optical tracking of spherical markers applied to the prostate surface. In this work, we evaluate the performance of a Successive Quadratic Programming (SQP) non-linear optimization technique used to align the coordinates of the 3D prostate model to those of the surgical environment. The input data for the SQP are collected using the feature extraction technique for detecting circles in the laparoscopic camera real-time stream, described in [14] and implemented by OpenCV [3]. We show how the SQP technique we adopted can reduce the mean error in determining the position and rotation matrix used for the 3D prostate model alignment compared to the technique presented in [6] and implemented in standard computer vision libraries such as those of, e.g., MatLab [11] or OpenCV. We will also show that our approach is less error prone when faced with issues related to the deformation of prostatic tissues during surgery. To obtain a precise analysis of the error estimation, we use an accurate synthetic reconstruction of the prostate and of the laparoscopic camera movements during the surgical procedure. Thanks to this reconstruction, we have an exact and consistent three-dimensional reference that we use to develop precise metrics and an accurate evaluation of the performance of the proposed tracking procedure.

The paper is organized as follows. After presenting an overview of the related literature in Sect. 2, we present in Sect. 3 the AR-based navigation system we are developing, as well as a discussion on the method used to generate the synthetic video used to evaluate the SQP technique. In Sect. 4 the SQP optimization technique used to solve the constrained non-linear optimization problem of finding the 3D pose of the prostate model is presented.
We evaluate the quality of the tracking under different parameters in Sect. 5, while in Sect. 6 we discuss future improvements to our system.

2 Related Works

There is currently much interest in the surgical community in adapting Augmented Reality technology into routine surgical practice, especially in minimally invasive IGS. This has stimulated researchers to find solutions for the challenges faced, using different ad-hoc methods, as the literature on topics related to AR in surgery attests. The number of reviews recently published accounts for a quite vast subject, e.g.: in [1], the authors reviewed the different methods proposed concerning AR in intra-abdominal minimally invasive surgery; in [5] the review focused on the use of AR in open surgery, both in clinical and simulated settings; the review in [23] evaluates whether augmented reality can presently improve the results of surgical procedures, finding promising results in a large number of papers; the utility of wearable head-up displays in different surgical settings is evaluated by the review in [24]. Different solutions for aligning the 3D virtual model and the patient's real anatomy have been proposed in the literature. When the 3D model of the organ is not generated contextually with the surgical procedure, as in our system, alignment usually leverages landmarks visible in both the preoperative data and the laparoscopic image. In some approaches, e.g. [21], natural landmarks are used, while in others the landmarks are artificial, such as in e.g. [9], where helix-shaped fluorescent gold fiducials are used, or e.g. [20], where custom-made markers with spherical heads are instead used. Solutions that adopt a markerless tracking method have also been tested, as in e.g. [13], where arbitrary surface points on the real organ are selected and then tracked. The alignment procedure itself is then performed at different levels of automatism: from an interactive approach that totally relies on the manual input of an expert to register preoperative data to the laparoscopic image using visible landmarks as reference, to systems able to determine automatically every degree of freedom (DoF) that parametrizes the 3D model to match the real one during the surgical procedure, exploiting feature-extraction algorithms on the selected landmarks. In the latter type of systems, the well-known problem of the 3D pose estimation of the virtual organ model is approached using different Perspective-n-Point problem solution methods, such as those proposed most notably in e.g. [6,10,26].

3 Marker-Based Tracking During Robot-Assisted Radical Prostatectomy

In this Section, we first present the current state of advancement of the system, with a focus on the application that performs prostate tracking and 3D model visualization; then we discuss the methods we followed to generate the synthetic video used in Sect. 5. The AR-based surgical navigation system we are currently developing is aimed at integrating the 3D virtual reconstruction of the patient's prostate inside the Tile-Pro of the da Vinci [7] surgical console and at automating the localization of the virtual model with respect to the in-vivo anatomy using a marker tracking technique. The Intuitive TilePro system enables the surgeon to

Fig. 1. The main elements that compose the context of operation of our AR system.

view additional information in the surgeon console from additional sources during surgery, below the laparoscopic image. In Fig. 1 the main elements composing the setting in which our navigation system operates are shown. The video stream from the laparoscope, one of the instruments of the da Vinci Surgical Robot, is sent to the vision system of the da Vinci Console in order to allow doctors a clear view of the intraoperative environment. Using the DVI port of the console, the video is streamed at the same time to the Workstation, a typical average laptop available on the market, where it is augmented with the overlay of the 3D prostate model by a dedicated application, named pViewer and described in Sect. 3.1. The Augmented Video Stream is then sent back to the console, where its visualization can be requested by the surgeon in the Tile-Pro. For the augmentation to be enabled, an accurate 3D model of the patient's prostate is required. Using high-resolution preoperative medical imaging techniques, a team of biomedical engineers constructs the model, which is loaded by the pViewer application at the beginning of the operation. In particular, the prostate model is created using the Hyper-Accuracy 3D (HA3D™) reconstruction method, based on multiparametric MRI and provided by M3dics (Turin, Italy). As artificial landmarks, we are testing a set of optical markers, with sterilizable plastic colored spherical heads, that are inserted in the prostate during the medical procedure, after the organ has been exposed. Since the organ has to be removed at the end of the procedure, this invasive solution can be utilized. The doctor operating the pViewer application interactively registers the initial 3D model position as soon as the markers are in place. The patient's prostate undergoes extreme rotations and stretching before removal, hence the registration procedure has to be recalibrated when the organ's back face is exposed for the first time and more markers are added to it. This continuous deformation of the tracked organ is one of the main differences between standard marker-based 3D pose estimation techniques and the same techniques as required in the minimally invasive surgery context, where soft-tissue organs are involved. Such deformations may easily confuse the


optimization algorithms inferring the positioning parameters from the intraoperative video stream. In Sect. 5.2 we show that our method is able to remain robust even under such circumstances.

3.1 The Target Application

The several tasks required to create the Augmented Video Stream, performed by the pViewer application, are presented in Fig. 2. This application has been implemented using the Unity platform [22] and C-Sharp [12]. The software is engineered to limit the resources it consumes on the machine where it runs, so as not to increase the complexity of the system as a whole, without losing visual quality or real-time performance. After the Laparoscopic Video Stream is acquired by the application, the Feature Extraction component identifies the markers. We use the word stream to indicate a flux of data that is received, processed or produced with a frequency of several times per second. The exact rate may vary according to different factors but, in the case of images, never drops below 24 fps. The Feature Extraction component reduces the video image dimensions, originally 1920 × 1080 pixels, to a quarter in order to speed up subsequent computations. The frame, in RGB coding, is converted to the HSV color space to ease the application of the cvHoughCircles OpenCV function, which detects circular patterns in an image and returns each circle's on-image position and radius (measured in pixels). Using these data it is possible to extract the color of the marker's head as the average color of the pixels inside the circle. This allows the application to identify each marker and its location. The extracted features, i.e. circle position, diameter and color, compose the Landmarks 2D Position Stream, which is sent to the Optimization step. Here the stream is used to infer the position of the 3D prostate model, taking into account its initial pose as well as the lens distortion in the laparoscopic image. This step is the main contribution of this work and will be described in depth in Sect. 4. The returned translation and rotation matrices for the 3D model are sent to the 3D Model Localization component, where they are used to perform the positioning and

Fig. 2. The components of the pViewer application. The Laparoscopic Camera Stream is augmented with the overlay of the 3D prostate model thanks to the Optimization algorithm that correctly estimates its pose.


Fig. 3. Target application augmentation steps.

orientation of the prostate model. In order to accurately reproduce the movement and rotation behavior of the real prostate in this type of surgical procedure, all transformations are applied from the base of the urethra, or prostate apex. This is because that point is the last to be severed, hence all movements of the organ during the procedure have it as their starting point. After the 3D prostate model is aligned with the actual organ, a rendered image of the mesh is produced and used by the Mixer component to augment the laparoscopic video stream. In Fig. 3 different stages of the application processing are schematically shown. In Fig. 3a the Feature Extraction component detects the markers in the laparoscopic video image. In Fig. 3b the Optimization component infers the 3D pose of the prostate model, whose position is represented using white edges only, and in Fig. 3c the result of the 3D Model Localization component is shown. The process is supervised by an operator at the workstation. When necessary, the operator can perform a manual re-registration (Manual Correction) of the 3D model on the patient's prostate, help the optimization step achieve optimal results when the number of visible markers is not high enough to perform the algorithm-based tracking, and interactively deform the 3D model mesh to replicate actual prostate deformations during the procedure. Realistic modeling of nonlinear soft tissue deformation in real time is a challenging research topic in surgical simulation [25], and we addressed it by simulating the application of forces that modify the prostate shape. In particular, we approximate the deformation of the 3D model mesh by applying to it non-linear parametric deformations [19]. These parametric transformations are able to twist, bend, stretch or taper the model and can be summed together in order to combine deformation effects. The use of these formulas gives the workstation operator a fast and intuitive means of deforming the prostate model while only slightly increasing the computational effort required. As the input device to modify the deformation values we adopt a commercially available 3D mouse with six degrees of freedom. Even if the resulting visual effect has improved the 3D-onto-real prostate overlay accuracy, the anatomical accuracy of this procedure is currently under investigation by our team. Of the available parametric deformations, we selected bend and stretch as those most suited for our visualization and overlay purposes. Their functioning is shown in Fig. 4.
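For illustration, a Python/OpenCV sketch of the Feature Extraction processing chain described above (quarter-size resize, HSV conversion, Hough circle detection, average color inside each circle). pViewer itself is a Unity/C# application, so this is only an outline: the detector parameters and the choice to run detection on the grayscale channel are assumptions, not the authors' settings.

```python
import cv2
import numpy as np

def extract_marker_features(frame_bgr):
    """Detect spherical marker heads; return (x, y, diameter, mean HSV color) per circle."""
    small = cv2.resize(frame_bgr, None, fx=0.25, fy=0.25)      # quarter-size per axis (assumed)
    hsv = cv2.cvtColor(small, cv2.COLOR_BGR2HSV)
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
    circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1.2, minDist=10,
                               param1=100, param2=20, minRadius=2, maxRadius=30)
    features = []
    if circles is not None:
        for x, y, r in np.round(circles[0]).astype(int):
            mask = np.zeros(gray.shape, np.uint8)
            cv2.circle(mask, (x, y), r, 255, -1)
            mean_hsv = cv2.mean(hsv, mask=mask)[:3]            # marker identity comes from its color
            features.append((x, y, 2 * r, mean_hsv))
    return features                                            # the "Landmarks 2D Position Stream"
```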

Fig. 4. Target application chosen deformers: the front-back bend and the up-down stretch, each bounded by a deformation hull.

3.2 The Simulation of the Surgical Operation

In order to perform the error analysis of the algorithm described in Sect. 4, we synthetically generate a set of laparoscopic video streams to be used as the algorithm's input data. Generating the videos, instead of using real surgery recordings, allowed us to control many numerical parameters of interest, such as the starting positions, in world coordinates, of the markers applied to the prostate, or the camera parameters, and in particular to develop the precise error metric discussed in Sect. 5. The accuracy and realism of the synthetic videos were discussed with our medical team in order to realistically represent the main stages of the prostatectomy procedure. For these simulated videos, we randomly selected one of the 3D prostate models reconstructed for real surgery, stripping it of any reference to the patient's name. The 3D prostate model was loaded into Blender [2] and we simulated laparoscopic camera movements by animating the 3D scene's main camera. We simulated the reaction of the organ to the surgical tools using the Simple Deform Modifier and applying the same bend and stretch deformations introduced in Sect. 3.1. In Fig. 5 the 3D prostate model used for the synthetic video stream is shown, together with the markers applied to it. Markers have been implemented as colored spheres parented to the vertices of the prostate mesh. A total of 21 markers are applied to the model's surface, divided into 3 sets of 7 markers each. This eases control over the total number of detectable markers: since the elements of each set were evenly scattered across the surface,


excluding a set from the detection algorithm reduced the density of markers on the surface but not the coverage of it. The animations were rendered as a 1920 × 1080 MP4 movie, then sent to a program using the OpenCV library that performed the very same feature extraction as the pViewer application. The Landmarks 2D Position Stream described in Sect. 4 was saved to file and then used by the MatLab script described in Sect. 5 to retrieve the 3D pose matrices. Several different synthetic laparoscopic videos have been created, in particular with and without the mesh deformation enabled, to allow a better study of the performance of the proposed tracking technique.

Fig. 5. The 3D prostate model used for the synthetic video stream.

4 The Optimization Procedure

Let us call $x = (x_1, \ldots, x_7)$ the vector we use to express the position and the orientation of the object in the 3D world space. In particular, the vector $t = (x_1, x_2, x_3)$ represents the position of the object, while the quaternion

$$q = x_7 + x_4 i + x_5 j + x_6 k, \tag{1}$$

with $\sum_{n=4}^{7} x_n^2 = 1$, encodes the rotation of the object. We have used the SQP optimization technique to solve the following constrained non-linear optimization problem:

$$\min_{x} f(x) \quad \text{s.t.} \quad b(x) \ge 0, \; c(x) = 0, \tag{2}$$

where $f(x)$ is the metric we want to optimize, $b(x)$ is a vector of inequality constraints, and $c(x)$ is a single equality constraint. In particular, let us call $P$ the projection matrix, that is, a $4 \times 4$ matrix that transforms a homogeneous coordinate $p_W$ into the corresponding screen coordinates $p_S$, that is:

$$p_S = \begin{pmatrix} n_x/n_w \\ n_y/n_w \end{pmatrix}, \qquad \begin{pmatrix} n_x \\ n_y \\ n_z \\ n_w \end{pmatrix} = P \cdot p_W. \tag{3}$$


Let us suppose we have $K$ markers, all with the same diameter $d$, and that their positions in local space $m_i$, $i \in 1, \ldots, K$, expressed in homogeneous coordinates, are known. Let us now consider a frame $f$, where the $n_f$ markers belonging to the set $M_f = \{m_1, \ldots, m_{n_f}\}$ are visible. Let us call $s_x(f, m_i)$, $s_y(f, m_i)$ and $s_d(f, m_i)$ the horizontal position, vertical position and diameter on-screen values of the labeled marker $m_i \in M_f$ as returned by the preliminary video analysis component (i.e. the Feature Extraction component described in Sect. 3.1). Let us consider a given solution vector $x$: we call $u_x(m_i, x)$, $u_y(m_i, x)$ and $u_d(m_i, x)$ the horizontal, vertical and diameter on-screen values that marker $m_i \in M_f$ will be projected to when the corresponding object is subject to the transform encoded by $x$. Such positions can be computed as:

$$\begin{pmatrix} u_x(m_i, x) \\ u_y(m_i, x) \end{pmatrix} = \begin{pmatrix} n_x/n_w \\ n_y/n_w \end{pmatrix}, \qquad \begin{pmatrix} n_x \\ n_y \\ n_z \\ n_w \end{pmatrix} = P \cdot T(x) \cdot R(x) \cdot m_i, \tag{4}$$

$$u_d(m_i, x) = n_x/n_w, \qquad \begin{pmatrix} n_x \\ n_y \\ n_z \\ n_w \end{pmatrix} = P \cdot T(x) \cdot R(x) \cdot \begin{pmatrix} d \\ 0 \\ 0 \\ 1 \end{pmatrix}. \tag{5}$$

(5)

Matrices T (x) and R(x) represents respectively the translation and the rotation of the object identified by vector x, and in particular: ⎛

1 ⎜0 T (x) = ⎜ ⎝0 0

0 1 0 0

0 0 1 0

⎞ ⎛ x1 1 − 2x25 − 2x26 2x4 x5 − 2x6 x7 2x4 x6 + 2x5 x7 ⎟ ⎜ 2x4 x5 + 2x6 x7 1 − 2x2 − 2x2 2x5 x6 − 2x4 x7 x2 ⎟ 4 6 , R(x) = ⎜ ⎝ 2x4 x6 − 2x5 x7 2x5 x6 + 2x4 x7 1 − 2x2 − 2x2 x3 ⎠ 4 5 0 0 0 1

⎞ 0 0⎟ ⎟ 0⎠ 1

(6)

The optimization function f (x) aims at minimizing the on-screen distance between labeled marking and the corresponding projected point with the considered vector x. It is defined as: 2 f (x) = αc (sc (f, mi ) − uc (mi , x)) (7) mi ∈Mf c∈{x,y,d}

Constants αx , αy and αd are used to give a different weight to the distance between the labeled and the projected markers on the different axis. In particular, we have set αx = αy = 1 and αd = 0.1 to give a smaller influence on the difference of the marker diameters. Constraints are used to make sure that the SQP algorithm does not look for a solution that moves the object too far away from the camera, and that returns a unitary quaternion, as required to encode the rotation of an object in a 3D space. The equality constraint c(x) is a single scalar function that returns 0 if the last four components of the solution can represent a valid rotation quaternion, that is: c(x) = 1 − x24 − x25 − x26 − x27

(8)

486

D. Amparore et al.

The inequality constraints b(x) ensure that the three coordinates are in a valid range (xmin ≤ x ≤ xmax , ymin ≤ y ≤ ymax and zmin ≤ z ≤ zmax ), as well as the quaternion components are in the range [−1, 1]. In particular, we have: ⎛ ⎛ ⎞ ⎞ xmin xmax ⎜ ymin ⎟ ⎜ ymax ⎟ ⎜ ⎜ ⎟ ⎟ ⎜ zmin ⎟ ⎜ zmax ⎟

⎜ ⎜ ⎟ ⎟ x − xmin −1 ⎟ 1 ⎟ , xmax = ⎜ b(x) = , with xmin = ⎜ ⎜ ⎜ ⎟ ⎟ (9) xmax − x ⎜ −1 ⎟ ⎜ 1 ⎟ ⎜ ⎜ ⎟ ⎟ ⎝ −1 ⎠ ⎝ 1 ⎠ −1 1 The optimization is repeated in every frame to follow the moving markers. The SQP algorithm requires an initial point from which start the search. At the first time instant, this is defined by the operator that manually matches the marking on the 3D model with the one on the real body. In the following frames, considering that the movement of the camera object is physically limited, their next solution will be very close to the one obtained in the previous frame. For this reason, we use the solution at the previous time instant as the starting point for the next iteration. Although the SQP algorithm has constraint c(x) to ensure that rotation quaternion remains unitary, small round-off errors might sum up, leading after a few frames to an unfeasible solution. To overcome this problem, the quaternion part of the solution is normalized after each step to ensure its unitary length. Algorithm 1 summarizes the proposed procedure. Figure 6, shows the results of the positioning procedure at two extreme time instants. The position inferred by the SQP algorithm from the values of the detected markers is used to overlay the 3D prostate model, rendered as a white contour line, over the synthetic laparoscopic stream prostate, which is rendered as a black surface with colored markers on it. In Fig. 6a the overlay is very accurate because of the 14 visible markers, while in Fig. 6b the number of visible markers is 4, the lowest number before which the algorithm error probability starts to increase dramatically. Algorithm 1. TrakcingWithSQP 1: j ← 0, x0 given by the marker calibration 2: loop 3: j ← j + 1; set frame j 4: Minimize f (x) using SQP, given Mj = {mi }, sx (j, mi ), sy (j, mi ) and sd (j, mi ), starting ⎛ ⎞ ⎛ ⎞ from x0 x4 x4 ⎜ x5 ⎟ ⎜ x5 ⎟ 1 ⎟ ⎜ ⎟ ⎜ 5: ⎝ x6 ⎠ ⎝ x6 ⎠ =  2 2 2 2 x4 + x5 + x6 + x7 x7 x7 6: x0 = x 7: end loop

SQP for Virtual Prostate Registration a.

487

b.

Fig. 6. The results of the Optimization method (the white-edged model) over the synthetic video stream prostate model (in black), under different visible markers conditions.

4.1 Validation

In this section we compare our results with the P3P algorithm in the reference implementation provided by the Matlab Computer Vision Toolkit, which solves the perspective-n-point (PnP) problem using the perspective-three-point (P3P) algorithm and eliminates spurious correspondences using the M-estimator sample consensus (MSAC) algorithm. In particular, we compare the average distance between the screen coordinates of the markers, $s_x(f, m_i)$, $s_y(f, m_i)$, and those obtained by projecting the points using the solution vector $x$, $u_x(m_i, x)$, $u_y(m_i, x)$, for both the reference technique and the one proposed in Algorithm 1. The metric used for comparing the techniques is the following:

$$\epsilon(f, x) = \frac{\displaystyle\sum_{m_i \in M_f} \sqrt{\big( s_x(f, m_i) - u_x(m_i, x) \big)^2 + \big( s_y(f, m_i) - u_y(m_i, x) \big)^2}}{|M_f|} \tag{10}$$
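A direct transcription of Eq. (10) for one frame, with the array names assumed:

```python
import numpy as np

def frame_error(screen_xy, projected_xy):
    """Average on-screen distance of Eq. (10) over the markers visible in a frame."""
    s = np.asarray(screen_xy, dtype=float)      # rows (s_x, s_y) from feature extraction
    u = np.asarray(projected_xy, dtype=float)   # rows (u_x, u_y) from the estimated pose
    return float(np.mean(np.linalg.norm(s - u, axis=1)))
```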

Figure 7 shows the results when considering a non-deformable or a deformable object with the two techniques, focusing on the first 500 frames of the simulation. The P3P solution technique provided by Matlab might fail due to the inability to find enough reference points: this is shown by arbitrarily assigning a negative value to the distance function, $\epsilon(f, x) = -10$. Instead, the SQP technique always converges to a solution, even if it might be very distant from the actual position. In particular, we have experienced a failure of the P3P technique in 55.5% of the cases when there is no deformation, which increases to 63.2% when deformation is considered. Figure 8 shows the distribution of the average screen error for the considered scenario: due to the lack of failures, the CDFs of the SQP technique start from point 0, while the ones concerning the P3P technique place a large probability mass in the origin to account for frames in which the position cannot be estimated. It is also clear that our proposed method experiences a smaller average error at the 90th percentile. However, P3P seems to win when considering lower percentiles of the error distribution: in other words, P3P is better than SQP for points with a small error, but it is inferior when considering frame coverage and when handling difficult cases.


Fig. 7. The average distance in pixel between projected markers and screen positions per frame.

5 Evaluation

In this Section, we evaluate the effectiveness of the proposed technique, as a function of the number of markers used for tracking. In order to perform a fair evaluation, we have defined the specific metric η(x, f ) to compare the quality of tracking under different marking settings. Basically, we compare the average distance in the 3D world, for a set of N target points pi , between the actual position, and the one when applying the transforms obtained by the SQP procedure. Let us call pi (f ) the real position of point pi at frame f , and qi (f, x) the position inferred using the solution x of the SQP problem. If |a| represents the length of vector a, we have:

Fig. 8. Distribution of the average distance in pixel between projected markers and screen positions.

$$\eta(x, f) = \frac{\displaystyle\sum_{i=1}^{N} \big| q_i(f, x) - p_i(f) \big|}{N} \tag{11}$$

5.1 Considering Different Number of Markings

We started by focusing on a scene where no tissue deformation is considered. Figure 9(a) and (b) show respectively the x, y, z coordinates of the object and the components of the quaternion representing the rotation determined by the SQP procedure when 16 different markers are used. Peaks represent regions where the SQP tracking fails and returns values that are not consistent with the real positions. The tracking error for the configurations with 16, 12 and 7 markers is instead considered in Fig. 10(a) as a function of time. The secondary axis represents the number of visible markers in the configuration with 16 points: as expected, the error increases as the number of markers decreases. It is instead interesting to see how the 12-marker configuration drifts away from the others

Fig. 9. Tracking results per frame (no deformations considered): (a) Position, and (b) Rotation.


Fig. 10. The average distance per frame in the 3D world between actual marker positions and data obtained by the tracking procedure (no deformations considered): (a) Full video, and (b) First 750 frames.

in the final frames of the experiment (f > 1600). In this case, the use of the previous solution as the initial point of the search moves the solution away from the wanted result. We note, however, that in the real application the operator at the workstation has the ability to manually correct this inconvenience and resolve this type of issue. Figure 10(b) zooms in on the results by displaying only the first 750 frames: this shows that a small number of markers creates big fluctuations that can distract the users. The error distribution is instead shown in Fig. 11(a) for the complete range, and in (b) focusing on small-scale errors. It is interesting to see that for small-scale errors the system behaves as expected: the best results are provided with a larger number of markings. The drift that affects the case with 12 markers makes that configuration much worse in the long run.

5.2 Adding Deformation

When deformation is also included, tracking becomes more unstable, and wobbling images are created even with a large number of markers, as shown in Fig. 12 for the position.¹ As opposed to the case discussed in Sect. 5.1, in this case the configurations with 7 and 16 markers are affected by drift (as shown in Fig. 13), while the one with 12 remains stable even through the end of the experiment. While the case with 7 markers can be easily explained by the lack of information, for the 16-marker case the problem seems due to the deformation confusing the SQP algorithm. In particular, as shown in Fig. 14, the cases with 7 and 16 markers seem to have similar behavior, while the configuration with 12 provides the best results.

Fig. 11. Distribution of the average distance per frame in the 3D world between actual marker positions and data obtained by the tracking procedure (no deformations considered): (a) Entire domain, (b) Limiting to errors below 8 pixels.

Fig. 12. Tracking results per frame of the position, when full deformations are considered.

¹ Similar results are obtained also for the rotation but are not shown here for space issues.


Fig. 13. The average distance per frame in the 3D world between actual marker positions and data obtained by the tracking procedure (all deformations considered): (a) Full video, and (b) First 750 frames.

From this experiment, we thus see that when deformation occurs but is not considered in the underlying tracking model, it is better to have a smaller number of properly placed markers than a larger, equally distributed set.

5.3 Filtering

As shown in Fig. 12, deformation tends to produce wobbling images with high-frequency oscillations that can make the output of the application quite annoying for the users to watch. This problem can, however, be easily corrected with a low-pass digital filter applied to the solution of the SQP problem. Figure 15 shows the frequency spectrum of the x coordinate of the results obtained by the SQP


Fig. 14. Distribution of the average distance per frame in the 3D world between actual marker positions and data obtained by the tracking procedure (all deformations considered): (a) Entire domain, (b) Limiting to errors below 8 pixels.

Fig. 15. Frequency spectrum of the x coordinate of the object in the 3D space computed with a FFT.

algorithm2 computed using FFT (Fast Fourier Transforms). As it can be seen, the signal seems to reach a constant strength at around 5 Hz. We have thus applied a simple two-pole Butterworth filter with a cutoff at 5 Hz to reduce the impact of the high frequencies. Figure 16 shows the results by comparing the original solution of the SQP with the filtered version. Filtering indeed reduces wobbling, by creating a more pleasant view: however it introduces a slight delay in the tracked image (about 2 frames, that is 80 ms). This does not create real problems for the users since it is smaller than the typical reaction time of the doctors using the system.

² Again, the other components of the solution vector show similar behavior.


Fig. 16. Comparison of the object position before and after filtering for the first 300 frames.

6 Conclusion and Future Work

In this paper, we presented the AR-based navigation system to support robot-assisted radical prostatectomy we are currently developing. In particular, we addressed the 3D pose estimation problem of the prostate model displayed during the procedure by investigating a Successive Quadratic Programming (SQP) non-linear optimization technique. When compared to the perspective-three-point (P3P) algorithm jointly used with the M-estimator sample consensus (MSAC) algorithm by the MatLab toolkit, the SQP method showed a higher precision in inferring the 3D pose when the organ to be tracked is subject to tissue deformations. In future implementations of our navigation system, the marker-based technique currently used will be substituted by a more advanced markerless solution, still under investigation to account for the very specific characteristics involved in prostatectomy surgery, in terms of organ visibility and elasticity.

References
1. Bernhardt, S., Nicolau, S.A., Soler, L., Doignon, C.: The status of augmented reality in laparoscopic surgery as of 2016. Med. Image Anal. 37, 66-90 (2017)
2. Blender Online Community: Blender (2017). http://www.blender.org
3. Bradski, G.: The OpenCV library. Dr. Dobb's J. Softw. Tools 25(11), 120, 122-125 (2000)
4. Cutolo, F.: Augmented Reality in Image-Guided Surgery, pp. 1-11. Springer, Cham (2017)
5. Fida, B., Cutolo, F., di Franco, G., Ferrari, M., Ferrari, V.: Augmented reality in open surgery. Updates Surg. 70, 389-400 (2018)
6. Gao, X.S., Hou, X.R., Tang, J., Cheng, H.F.: Complete solution classification for the perspective-three-point problem. IEEE Trans. Pattern Anal. Mach. Intell. 25(8), 930-943 (2003)
7. Intuitive: da Vinci Surgical Systems. https://www.intuitive.com. Accessed 30 Aug 2018
8. Kersten-Oertel, M., Jannin, P., Collins, D.L.: DVV: a taxonomy for mixed reality visualization in image guided surgery. IEEE Trans. Vis. Comput. Graph. 18(2), 332-352 (2012)
9. Kong, S.H., Haouchine, N., Soares, R., Klymchenko, A., Andreiuk, B., Marques, B., Shabat, G., Piechaud, T., Diana, M., Cotin, S., Marescaux, J.: Robust augmented reality registration method for localization of solid organs' tumors using CT-derived virtual biomechanical model and fluorescent fiducials. Surg. Endosc. 31(7), 2863-2871 (2017)
10. Lepetit, V., Moreno-Noguer, F., Fua, P.: EPnP: an accurate O(n) solution to the PnP problem. Int. J. Comput. Vis. 81(2), 155 (2008)
11. MATLAB: version 8.6.0 (R2015b) (2015)
12. Microsoft Inc.: C# Language Specification (2018). https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/language-specification/. Accessed 30 Aug 2018
13. Nguyen, T.T., Jung, H., Lee, D.Y.: Markerless tracking for augmented reality for image-guided endoscopic retrograde cholangiopancreatography. In: 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 7364-7367, July 2013
14. O'Gorman, F., Clowes, M.B.: Finding picture edges through collinearity of feature points. IEEE Trans. Comput. 25(4), 449-456 (1976)
15. Porpiglia, F., Bertolo, R., Amparore, D., Checcucci, E., Artibani, W., Dasgupta, P., Montorsi, F., Tewari, A., Fiori, C.: Augmented reality during robot-assisted radical prostatectomy: expert robotic surgeons' on-the-spot insights after live surgery. Minerva Urologica e Nefrologica 70(2), 226-229 (2018)
16. Porpiglia, F., Bertolo, R., Checcucci, E., Amparore, D., Autorino, R., Dasgupta, P., Wiklund, P., Tewari, A., Liatsikos, E., Fiori, C., The ESUT Research Group: Development and validation of 3D printed virtual models for robot-assisted radical prostatectomy and partial nephrectomy: urologists' and patients' perception. World J. Urol. 36(2), 201-207 (2018)
17. Porpiglia, F., Checcucci, E., Amparore, D., Autorino, R., Piana, A., Bellin, A., Piazzolla, P., Massa, F., Bollito, E., Gned, D., De Pascale, A., Fiori, C.: Augmented-reality robot-assisted radical prostatectomy using hyper-accuracy three-dimensional reconstruction (HA3D™) technology: a radiological and pathological study. BJU Int. (2018)
18. Porpiglia, F., Fiori, C., Checcucci, E., Amparore, D., Bertolo, R.: Augmented reality robot-assisted radical prostatectomy: preliminary experience. Urology 115, 184 (2018)
19. Sederberg, T.W., Parry, S.R.: Free-form deformation of solid geometric models. SIGGRAPH Comput. Graph. 20(4), 151-160 (1986)
20. Simpfendörfer, T., Baumhauer, M., Müller, M., Gutt, C.N., Meinzer, H.P., Rassweiler, J.J., Guven, S., Teber, D.: Augmented reality visualization during laparoscopic radical prostatectomy. J. Endourol. 25, 1841-1845 (2011)
21. Thompson, S., Schneider, C., Bosi, M., Gurusamy, K., Ourselin, S., Davidson, B., Hawkes, D., Clarkson, M.J.: In vivo estimation of target registration errors during augmented reality laparoscopic surgery. Int. J. Comput. Assist. Radiol. Surg. 13(6), 865-874 (2018)
22. Unity Technologies ApS: Unity3D (2017). https://unity3d.com. Accessed 30 Aug 2018
23. Vávra, P., Roman, J., Zonča, P., Ihnát, P., Němec, M., Kumar, J., Habib, N., El-Gendi, A.: Recent development of augmented reality in surgery: a review. J. Healthcare Eng. 2017, Article ID 4574172 (2017)
24. Yoon, J.W., Chen, R.E., Kim, E.J., Akinduro, O.O., Kerezoudis, P., Han, P.K., Si, P., Freeman, W.D., Diaz, R.J., Komotar, R.J., Pirris, S.M., Brown, B.L., Bydon, M., Wang, M.Y., Wharen, R.E., Quinones-Hinojosa, A.: Augmented reality for the surgeon: systematic review. Int. J. Med. Robot. Comput. Assist. Surg. 14(4), e1914 (2018)
25. Zhang, J., Zhong, Y., Smith, J., Gu, C.: Energy propagation modeling of nonlinear soft tissue deformation for surgical simulation. Simulation 94(1), 3-10 (2018)
26. Zheng, Y., Kuang, Y., Sugimoto, S., Åström, K., Okutomi, M.: Revisiting the PnP problem: a fast, general and optimal solution. In: 2013 IEEE International Conference on Computer Vision, pp. 2344-2351, December 2013

TLS-Point Clouding-3D Shape Deflection Monitoring

Gichun Cha¹, Byungjoon Yu¹, Sehwan Park¹, and Seunghee Park²(✉)

¹ Department of Convergence Engineering for Future City, Sungkyunkwan University, Seobu-ro, Suwon-si, Gyeonggi, Republic of Korea
[email protected], [email protected], [email protected]
² School of Civil and Architectural Engineering, Sungkyunkwan University, Seobu-ro, Suwon-si, Gyeonggi, Republic of Korea
[email protected]

Abstract. As high-rise buildings and aging structures increase, research is underway in Korea to inspect structures effectively and prevent accidents. A structure cannot maintain its original shape under natural disasters and loads, and deformation occurs. Accidents caused by the deformation of structures have become common news. Therefore, this study proposes a non-contact shape-management monitoring method that can inspect structures and detect deformation.

Keywords: Terrestrial Laser Scanning · Deflection Monitoring · Point cloud · ICP algorithm

1 Introduction

Recently, the planning and construction of high-rise buildings has been increasing due to the development of construction technology and the improved perception of buildings. High-rise construction is a key evaluation criterion for the construction technology and the competitiveness of the construction industry of countries and companies, and it is being accelerated in line with national promotional effects, including technology demonstration and the tourism industry. However, such high-rise buildings can cause enormous loss of life and property in the event of accidents caused by natural disasters such as earthquakes and typhoons, or by accumulated structural damage. Therefore, a monitoring system in the maintenance stage is needed to detect the micro-behavior of structures that cannot be confirmed with the naked eye and to prevent collapse accidents in advance [1, 2]. Sensors that are currently in commercial use are mostly contact-type sensors with a narrow measuring range, which are difficult to use in places, such as high-rise buildings and hazardous facilities, that are not easily accessible. In addition, a complicated wired network configuration is required to acquire data from each sensor installed in


the structure. In addition to the cost of the sensors, the wired network construction requires additional cost for the wires connecting each sensor and space for the wired network. Therefore, there is a need for a method of acquiring the shape without building a wired network [2, 3]. Terrestrial Laser Scanning (TLS) is a system that can acquire three-dimensional coordinate information of a target remotely using a laser, and it has advantages over the commercially available sensors. First, it is a remote system using a laser. A monitoring method using TLS can be applied to places that are difficult to access because it obtains the shape information of a structure remotely. Also, it is not necessary to construct a wired network for data acquisition, which reduces the additional cost and hassle. Second, monitoring of the entire surface of the structure is possible. TLS can overcome the limitations of local monitoring techniques because it can acquire shape information on the entire surface of the structure, rather than information on specific parts of the structure only. Therefore, in this paper, TLS is used to find the transformation of the structure surface information and use it for monitoring.

2 Laser Scanning System

The principle of position measurement of a three-dimensional scanner is to measure the distance by emitting a laser beam toward a measurement object and calculating the pulse round-trip time or the phase shift of the returning beam, in order to calculate the three-dimensional coordinates X, Y, and Z:

$$\begin{pmatrix} x \\ y \\ z \end{pmatrix} = r \begin{pmatrix} \cos\alpha \cos\beta \\ \sin\alpha \cos\beta \\ \sin\beta \end{pmatrix} \tag{1}$$
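A direct transcription of Eq. (1), with the angle names assumed to be the scanner's horizontal and vertical deflection angles:

```python
import numpy as np

def polar_to_cartesian(r, alpha, beta):
    """Eq. (1): convert range r and the two scan angles (alpha, beta, in radians)
    of a TLS measurement into x, y, z coordinates."""
    x = r * np.cos(alpha) * np.cos(beta)
    y = r * np.sin(alpha) * np.cos(beta)
    z = r * np.sin(beta)
    return np.stack([x, y, z], axis=-1)
```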

The measurement points calculated here, typically numbering 10 to 50 million, are called a point cloud. They are stored in a proprietary format according to the equipment manufacturer; in actual use, they are converted into three-dimensional coordinate data formats such as ASCII or DXF [3-5]. The 3D scanning system can be divided into static scanning and dynamic scanning. The static system is a general scanning method in which the scanning system is installed in one place, as mentioned above. In this case, since the machine point of each scan is recognized as x = 0, y = 0, and z = 0, registration of each scan is required. Figure 1 shows the classification according to the data acquisition method. In this study, the experiment was conducted using the time-of-flight data acquisition method.


Fig. 1. Classification by data acquisition method (California Department of Transportation, 2011)

3 Point Cloud Registration

When measuring an object with a 3D laser scanner, the instrument is typically moved to several positions around the fixed object, or the object itself is rotated by a certain angle between scans. Measuring the complete shape therefore involves several local coordinate systems, and to express the entire shape with the point clouds obtained in each local coordinate system, a coordinate conversion is required in which all local coordinate systems are matched to one common coordinate system. The process of obtaining this transformation matrix and matching the local coordinate systems with it is called registration. In this paper, the Iterative Closest Point (ICP) algorithm is used for point cloud matching. ICP defines the closest points between the two input data sets as corresponding points and iteratively finds the transformation that minimizes the distance between them. The transformation in three-dimensional space consists of a rotation matrix and a translation vector. Equation (2) represents the relation between the source data and the reference data; the goal is the transformation that minimizes E, where p is a reference point, q is a source point, R is the rotation matrix, and t is the translation vector.

$$E = \sum_{i}^{N} \left\| p_i - (R q_i + t) \right\|^2 \tag{2}$$
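The following is a minimal Python sketch of the ICP idea described above (closest-point matching followed by an SVD-based rigid-transform estimate). It is intended only to illustrate the procedure, not to reproduce the authors' implementation; the helper names, the SciPy k-d tree, and the 15-iteration cap (chosen to mirror the convergence behaviour reported later) are our own assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_fit_transform(P, Q):
    """Rigid transform (R, t) minimizing sum ||p_i - (R q_i + t)||^2 (Eq. 2)."""
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    H = (Q - q_mean).T @ (P - p_mean)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                   # avoid reflections
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = p_mean - R @ q_mean
    return R, t

def icp(source, reference, max_iter=15, tol=1e-6):
    """Align 'source' (Nx3) to 'reference' (Mx3) by iterating closest-point matching."""
    tree = cKDTree(reference)
    src = source.copy()
    prev_err = np.inf
    for _ in range(max_iter):
        dist, idx = tree.query(src)            # closest reference point per source point
        R, t = best_fit_transform(reference[idx], src)
        src = src @ R.T + t                    # apply the current increment
        err = dist.mean()
        if abs(prev_err - err) < tol:          # stop when the mean distance stabilizes
            break
        prev_err = err
    return src, err
```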

4 3D Shape Deflection Monitoring

In this study, a shape measurement experiment was carried out on a bridge. The Noseong Bridge is constructed as a three-span continuous steel box girder bridge, and its lower structure is of an alternating T-type. The figures below show the measurement positions on the lower part of the Noseong Bridge.


Dump trucks of 27.29 tonf and 26.14 tonf were used for the load test, corresponding to about 63.2% and 60.5% of the design load, respectively. With the dump trucks placed statically on the structure, scanning measurements were performed. The variation of the structure's shape was estimated from the laser scanning data by applying the TLS data matching algorithm and the Hausdorff distance, a distance estimation technique. For the experiment, four scans were performed for each of Load Cases 1–3. For LC3, the maximum deflection measured by LVDT was −4.826 mm and the TLS shape estimation was −5.049 mm, a difference of 4.6%. Figure 2 shows the locations of the laser scanner and LVDT installation, and Fig. 3 shows the scan data alignment using the ICP algorithm.

Fig. 2. Laser scanning installation and measurement experiment of Noseong bridge

Fig. 3. Point cloud matching using ICP algorithm


By repeatedly minimizing the distance between corresponding points when matching the reference data and the source data of each load case, it was confirmed that the distance between the two scans reached its minimum within 15 iterations, as shown in Fig. 4.

Fig. 4. Matching iteration result of scan data

5 Conclusion

In this study, we performed data matching for shape change estimation using laser scanning. Laser scanning can quickly acquire three-dimensional point cloud data of an object, can measure structures that are difficult to access, and can acquire shape information over the whole surface of a structure. The variation of the structure's shape was calculated using the ICP algorithm and the Hausdorff distance. For the experiment, four scans were performed for each of Load Cases 1–3. For LC3, the maximum deflection measured by LVDT was −4.826 mm and the TLS shape estimation was −5.049 mm, a difference of 4.6%. Research on shape change estimation using laser scanning is still at an early stage in Korea, and studies at laboratory scale are actively proceeding. The results indicate that it is possible to manage the shape of a structure using the coordinate values obtained from laser scanning. Based on the results of the field experiments, future research will continue to advance the visualization of the shape change estimation algorithm and to develop a shape change monitoring system.

Acknowledgment. This work is financially supported by the Korea Ministry of Land, Infrastructure and Transport (MOLIT) through the “Smart City Master and Doctor Course Grant Program” and by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (No. NRF-2017R1A2B3007607).

References
1. Lee, H.M., Park, H.S.: Estimation of deformed shapes of beam structures using 3D coordinate information from terrestrial laser scanning. Comput. Model. Eng. Sci. 29(1), 29–44 (2008)
2. Huber, D., Akinci, B., Tang, P., Adan, A., Okorn, B., Xiong, X.: Using laser scanners for modeling and analysis in architecture, engineering and construction. In: 44th Annual Conference on Information Sciences and Systems, pp. 1–6 (2010)


3. Kumar, P., McElhinney, C., Lewis, P., McCarthy, T.: An automated algorithm for extracting road edges from terrestrial mobile LiDAR data. ISPRS J. Photogramm. Remote Sens. 85, 44–55 (2013)
4. Beger, R., Gedrange, C., Hecht, R., Neubert, M.: Data fusion of extremely high resolution aerial imagery and LiDAR data for automated railroad centre line reconstruction. ISPRS J. Photogramm. Remote Sens. 66, 40–51 (2011)
5. Melzer, T., Briese, C.: Extraction and modeling of power lines from ALS point clouds. In: Proceedings of the 28th OAGM Workshop of the Austrian Association for Pattern Recognition, pp. 47–54. Österreichische Computer Gesellschaft, Hagenberg (2004)

From Videos to URLs: A Multi-Browser Guide to Extract User's Behavior with Optical Character Recognition

Mojtaba Heidarysafa1(B), James Reed2, Kamran Kowsari1, April Celeste R. Leviton2,5, Janet I. Warren2,4, and Donald E. Brown1,3

1 Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA, USA
[email protected]
2 Institute of Law, Psychiatry, and Public Policy, University of Virginia, Charlottesville, VA, USA
3 Data Science Institute, University of Virginia, Charlottesville, VA, USA
4 Department of Psychiatry and Neurobehavioral Sciences, University of Virginia, Charlottesville, VA, USA
5 Department of Sociology, University of California, Riverside, CA, USA

Abstract. Tracking users' activities on the World Wide Web (WWW) allows researchers to analyze each user's internet behavior over time and the amount of time spent on a particular domain. This analysis can be used in research design, as researchers may access their participants' behaviors while they browse the web. Web search behavior has been a subject of interest because of its real-world applications in marketing, digital advertisement, and identifying potential threats online. In this paper, we present an image-processing based method to extract the domains visited by a participant over multiple browsers during a lab session. This method provides another way to collect users' activities during an online session, given that the session recorder collected the data. The method can also be used to collect the textual content of web pages that an individual visits for later analysis.

Keywords: Web search · User behavior · Image processing · Optical character recognition

1 Introduction

Since the invention of the World Wide Web (WWW) in the 1980s by Tim Berners-Lee, the internet has continued to impact our society, culture, and everyday activities. Given the explosive increase in information on the internet, it has become the first resource people turn to when seeking information. One side effect of using the web for information retrieval is that looking at users' behavior patterns on the internet opens doors to understanding their interests. Researchers work to leverage this insight to improve multiple facets of user experiences on the web. Areas such as interface design, marketing, and digital advertisement benefit directly from such research. Furthermore, it is also of interest to understand how people surf the internet and the pathways that lead them to specific places. One such place is the dark web, where illegal activities, including the promotion of terrorism and other cyber crimes, have been spreading [18]. In order to use the dark web, Tor, a special browser, is needed; it provides anonymous access to dark web content using multiple nodes that reroute the connection. In our study, the usual scenario of information retrieval via the internet and dark web consists of the user beginning their search in a regular internet browser and then switching to the dark web using Tor. This process may repeat multiple times until the user finds the information they are seeking. Understanding the time spent on each domain in this circumstance is not a trivial task. However, given that this is a research experiment, one can use videos of these online search sessions, which record the whole screen during the interaction. To accomplish this, we utilized a general framework that takes a stream of screen images and outputs the domain or URL visited in each frame, regardless of the type of browser (internet or Tor). These URLs were collected for each user and can be analyzed separately. The purpose of this paper is to demonstrate an image-based approach that can be used in other scenarios where multiple browsers are working in parallel, given that the computer screen has been recorded. Such conditions might be suitable for research conducted in controlled environments, such as online searching sessions in a lab. Moreover, since the screen has been recorded, other analyses can be performed using the same approach. As an example, one can analyze the textual content of web pages by using optical character recognition and the technique provided in this paper. The method described in this paper is fully open-source and available to use for similar tasks. This paper is structured as follows. In Sect. 2 we discuss related work on tracking users' activities on the web as well as image processing techniques, specifically optical character recognition (OCR). Section 3 describes our implementation in detail. Subsequently, in Sect. 4, we present the results of this approach. Finally, in Sect. 5 we discuss possible improvements to this approach and provide our concluding remarks.

2 Related Work

As mentioned before, users' interactions with material on the internet reveal details about their interests. Since the early days of the internet, researchers have used this insight to try to understand online user behaviour [7]. Discovering user patterns by mining web usage behaviors has been addressed by Srivastava et al. [19]. Specifically, web search patterns have been investigated extensively by researchers [9,10,15]. The results of this scholarship have provided ways for improving search engine rankings, advertisement placement, and search engine performance [1,4]. At a high level, this research relies on log files of user interactions which are gathered from the client or server side.


Fig. 1. Structure of an OCR system (raw image → pre-processing → segmentation → feature extraction → classification → post-processing, e.g. the ASCII code for the letter “A”)

Usual approaches include the use of browser add-ons such as Firefox Slogger or DéjàClick (which rely on the browser history log files) or standalone software such as Track4Win, which tracks internet usage and computer activities. Although these tools may provide insight into user activities, they cannot be used in every situation we might encounter during experiments with users. For example, most add-ons are specific to one browser, and a different browser, like Tor, does not allow add-on installation. Similarly, as its name suggests, Track4Win is mostly suitable for machines running Windows. Alternatively, one can design the experiment so that the interaction of users with the internet is recorded with a video recorder such as OBS Studio for Windows machines or the built-in QuickTime screen recorder on a Mac. In such scenarios, these videos combined with image processing tools and Optical Character Recognition (OCR) techniques can be used to extract user interactions with the internet, regardless of browser and operating system. A major cornerstone of the previously mentioned approach is Optical Character Recognition. OCR is used in many practical applications, such as scanners in ATMs and office scanning machines, which use OCR to understand characters [13]. OCR is referred to either as off-line (where the writing or printing is already completed) or as on-line (where character recognition is performed simultaneously with writing). Different tasks such as hand-writing recognition or hand-written script verification may be performed with OCR techniques; however, in this work we focus on character recognition in printed text, as found in the URL field of a browser. Figure 1 shows the structure of an OCR system. Pre-processing usually includes steps such as binarization, noise removal, and skew detection [8]. Next, segmentation is performed to deconstruct an image into lines or characters. Feature extraction is another important step in the OCR pipeline, and different approaches have been suggested to perform this task [11]. Finally, the most important step is classification of the characters with high accuracy. Traditionally, this has been done by template matching or correlation-based techniques [8]. Other researchers have presented techniques such as fast tree-based clustering, HMMs based on a combination of frequency and time domains, and K-Nearest Neighbours [3].

Fig. 2. General preview of our implementation steps (video → per-second frames → URL-area cropping via browser-anchor template matching → image pre-processing → OCR with pytesseract/Tesseract → text post-processing with regular expressions → URL and domain written to a spreadsheet)

Moreover, other machine learning classifiers for this step have also gained a lot of attention. In recent years, Support Vector Machines (SVM) have been used as powerful classifiers in the last part of an OCR system [20]. Artificial Neural Networks (ANN) have also been used because of their high tolerance to noise, provided that correct features are supplied [2]. In this work, we use Tesseract OCR, which performs recognition with a two-pass process and an adaptive classifier. We describe the structure of this open-source OCR engine, along with other necessary steps such as image pre-processing and target selection with template matching, in the next section.

3 Method

In this section, we discuss the implementation of URL retrieval using OCR as used in this work. First, we describe the general idea and its steps; then each step is explained in more detail. The approach relies on a screen recording of a user's interaction with multiple browsers. Therefore, the input to the pipeline is a video recording or consecutive screen shots of a computer screen. Free tools such as OBS Studio (Windows) and the built-in QuickTime player (Mac) can record computer screens while users interact with their machines. Alternatively, software such as PC Screen Capture can take screen shots every second of the online searching sessions of an experimental study. Figure 2 shows the general picture of our implementation structure and steps. The rest of this section describes each of the steps shown in Fig. 2.

3.1 Extracting Images

In order to retrieve the URLs, we first convert the videos into images taken once per second. One can easily calculate the correct number of images to generate by considering the frames-per-second (fps) rate at which the video was recorded. In this work, the Python OpenCV library (cv2) was used to convert the videos to images. The last step is to crop these images so that the template matching method has a lower chance of picking an area by mistake. We make the assumption that at any given time the URL text field is in the top one third of each screen shot; therefore, we can crop the top 1/3 of each image and safely discard the rest.
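A minimal sketch of this step with OpenCV is shown below. The function name and the output-file naming scheme are hypothetical, the output directory is assumed to exist, and the top-third crop follows the assumption stated above.

```python
import cv2

def video_to_frames(video_path, out_dir):
    """Extract roughly one frame per second and keep only the top third,
    where the browser URL bar is assumed to be."""
    cap = cv2.VideoCapture(video_path)
    fps = int(round(cap.get(cv2.CAP_PROP_FPS))) or 1   # fall back to 1 if fps is unknown
    count, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if count % fps == 0:                           # one frame per second
            top_third = frame[: frame.shape[0] // 3, :]
            cv2.imwrite(f"{out_dir}/frame_{saved:05d}.png", top_third)
            saved += 1
        count += 1
    cap.release()
    return saved
```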

3.2 Template Matching

After retrieving the images per second, we need to narrow down and specify the URL area for the subsequent steps. Doing so both makes it easier for the OCR engine to convert characters into text in less time and produces less noisy data from which to grab URLs and domains later on. Consequently, we define specific anchors around the URL text field which can be detected with template matching algorithms [5]. These templates were carefully selected based on the type of browser used for the interaction. Python OpenCV includes six template matching algorithms. In this work, we used the TM_CCOEFF_NORMED method with a threshold of 80% on the maximum value of the match. This threshold may need to be altered for other applications depending on the quality of the images, and fine tuning may be needed to get the best results. Considering the source image I, the template image T, and the result matrix R, normalized correlation coefficient matching is computed by Eq. (1).

$$R(x,y) = \frac{\sum_{x',y'} \big( T'(x',y') \cdot I'(x+x', y+y') \big)}{Z(x,y)} \tag{1}$$

where

$$T'(x',y') = T(x',y') - \frac{1}{w \cdot h} \sum_{x'',y''} T(x'',y'') \tag{2}$$

$$I'(x+x',y+y') = I(x+x',y+y') - \frac{1}{w \cdot h} \sum_{x'',y''} I(x+x'',y+y'') \tag{3}$$

$$Z(x,y) = \sqrt{\sum_{x',y'} T'(x',y')^2 \cdot \sum_{x',y'} I'(x+x',y+y')^2} \tag{4}$$

Here w and h are the width and height of the template image, and Z is the normalization term of the algorithm. Finding a match provides the position of the anchor, which corresponds to the upper corner of the browser text field. Using the position of the best match and the size of the template, we can crop the image such that mostly only the address bar remains. The resulting cropped image undergoes several image processing steps, which are described in the following section.
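The hedged sketch below illustrates this anchor-based cropping with OpenCV's TM_CCOEFF_NORMED and the 80% threshold mentioned above. The exact crop geometry relative to the anchor is an assumption made for illustration, not the authors' exact layout.

```python
import cv2

def crop_url_bar(image_gray, template_gray, threshold=0.8):
    """Locate a browser-specific anchor with TM_CCOEFF_NORMED (Eqs. 1-4) and
    crop a strip to its right where the URL text is expected to be."""
    res = cv2.matchTemplate(image_gray, template_gray, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(res)
    if max_val < threshold:                      # no sufficiently good match
        return None
    x, y = max_loc                               # top-left corner of the best match
    h, w = template_gray.shape[:2]
    # assumed geometry: the URL text sits to the right of the anchor, roughly anchor-high
    return image_gray[y : y + h, x + w : image_gray.shape[1]]
```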

3.3 Image Pre-processing

Depending on the quality of the video or screen shots taken during a session, the browser text field may require pre-processing. The process used in this implementation is as follows:

1. Convert RGB images to gray scale. Different algorithms exist to convert an RGB image into its gray scale counterpart. The average method simply averages the values of all three channels, while the lightness method averages only the maximum and minimum values of the channels. The OpenCV library uses the luminosity technique, in which specific weights are applied to each channel as given by Eq. (5):

$$Y = 0.299 \cdot R + 0.587 \cdot G + 0.114 \cdot B \tag{5}$$

2. Re-scale the image to a bigger size. Although we cannot change the quality of the pictures, this step increases the number of pixels and thus makes it possible to improve the result with other image processing filters. Our experiments showed that re-sizing the image to three times the original yielded the best results.

3. De-noise the resulting images. De-noising is a significant task in image processing and different algorithms have been proposed to perform it effectively. These algorithms fall into three main categories: spatial domain methods, transform domain methods, and learning based methods [16]. We used the Non-Local Means Denoising algorithm, which comes in the same library, to keep the implementation purely in Python. Considering a noisy image $v = \{v_i \mid i \in \Omega\}$, the resulting intensity of a pixel $u_i$ is computed as a weighted average of the neighboring pixels within a certain neighborhood $I$ of that pixel:

$$u_i = \sum_{j \in I} \omega(i,j) \, v_j \tag{6}$$

where the weights are computed by Eq. (7),

$$\omega(i,j) = \frac{1}{\sum_{j} \omega(i,j)} \exp\left( - \frac{\lVert v_{N_i} - v_{N_j} \rVert_{2,a}^{2}}{h^{2}} \right) \tag{7}$$

such that $N_i$ refers to the patch centered at $i$ and $a$ represents the standard deviation of a Gaussian kernel [6]. Generally, the method averages over all similar pixels, and the similarity is measured by comparing patches of the same size around pixels in the search window. To perform denoising in this work, a patch size of 7 and a search window of 21 were selected. These parameters, along with a filter strength of 10, produced the best results in our experiments.

4. Sharpen the final images. To perform sharpening, a 2D kernel filter is used. The kernel slides over the original image and is applied to a window around each pixel; Fig. 3 shows an example of this process. In this work, we applied a 5 × 5 Gaussian-based kernel similar to the example in Fig. 3. Performing the last two steps improves the shape and contrast of the characters against their white surroundings and thus improves the accuracy of the OCR engine output.

Fig. 3. Example of applying a kernel filter to achieve a sharpened image [12]
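A compact sketch of the four pre-processing steps with OpenCV is given below, using the patch size, search window, filter strength, and 3× re-scaling reported above. The sharpening kernel shown is a generic 3 × 3 example standing in for the 5 × 5 kernel used in the paper, and the function name is hypothetical.

```python
import cv2
import numpy as np

def preprocess_url_strip(bgr_strip):
    """Pre-processing used before OCR: grayscale, 3x upscale, non-local means
    denoising (filter strength 10, patch 7, search window 21), then sharpening."""
    gray = cv2.cvtColor(bgr_strip, cv2.COLOR_BGR2GRAY)              # Eq. (5) weights
    big = cv2.resize(gray, None, fx=3, fy=3, interpolation=cv2.INTER_CUBIC)
    denoised = cv2.fastNlMeansDenoising(big, None, 10, 7, 21)       # h, patch, search window
    sharpen_kernel = np.array([[0, -1, 0],
                               [-1, 5, -1],
                               [0, -1, 0]], dtype=np.float32)       # illustrative 3x3 kernel
    return cv2.filter2D(denoised, -1, sharpen_kernel)
```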

3.4 Optical Character Recognition

The main technology used to find the URLs and domains visited by a user is Optical Character Recognition (OCR). Different OCR engines are available, but in order to have a consistent open-source solution we chose to work with Tesseract-OCR as the main engine for character recognition. Tesseract is an open-source OCR engine written in C and C++ at Google [17]. Despite being open-source, Tesseract performs well even in comparison with commercial OCR engines such as Transym OCR [14]. Moreover, there exists a Python wrapper library, “pytesseract”,


Fig. 4. Architecture of Tesseract OCR (adaptive thresholding → connected component analysis → character outlines organized into lines and words → two-pass word recognition)

which provides direct access to this engine from Python. As a result, the implementation is completely in Python. The Tesseract OCR architecture is briefly explained here; Fig. 4 shows its architecture and the steps that are performed. As a first step, the image is converted to a binary image by applying an adaptive threshold. Next, using connected component analysis, character outlines are extracted. The outlines are then converted into blobs, which in turn are converted into text lines. Text lines are analyzed for fixed pitch and broken into words using character spacing: fixed-pitch text is chopped into character cells immediately, and proportional text is broken using definite and fuzzy spaces. Finally, Tesseract uses a two-pass word recognition to improve its results: the satisfactory results of the first pass are given to an adaptive classifier as training data to increase the accuracy of the second pass.
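A minimal example of invoking Tesseract through pytesseract is shown below. It assumes Tesseract is installed and on the PATH; the single-line page segmentation mode (--psm 7) is our assumption for URL strips and is not specified in the paper.

```python
import pytesseract
from PIL import Image

def ocr_url_strip(image_path):
    """Run Tesseract on a pre-processed URL strip; --psm 7 treats the image
    as a single text line (an assumed, not paper-specified, setting)."""
    text = pytesseract.image_to_string(Image.open(image_path), config="--psm 7")
    return text.strip()
```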

3.5 Text Post Processing

OCR engines try to convert any visual clue in the given image into a character, which leads to unwanted characters appearing in the results. For example, websites that use Hypertext Transfer Protocol Secure (HTTPS) usually show a lock sign in the browser next to the URL text field. Other cases include when the browser is not in full-screen mode, so that the computer desktop's text or other browsers' text might be fed into OCR. OCR tries to convert these shapes into characters, which affects the result. Thus, we post-process the output of the OCR engine. Regular expressions (regex) are a powerful tool for handling these cases, and the Python regular expression library (the re module) provides access to this tool. Regex can also be used to fix typos and unify the results (e.g., dropping www from the URL if it exists).
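The sketch below illustrates this kind of regex-based clean-up; the specific patterns are illustrative assumptions, not the exact expressions used in the study.

```python
import re

URL_RE = re.compile(r"(?:https?://)?(?:www\.)?([A-Za-z0-9\-]+(?:\.[A-Za-z0-9\-]+)+)")

def extract_domain(ocr_text):
    """Strip characters OCR hallucinated from icons, then pull out the domain."""
    cleaned = re.sub(r"[^A-Za-z0-9.\-_/:]", "", ocr_text)   # keep URL-legal characters only
    match = URL_RE.search(cleaned)
    return match.group(1) if match else None

print(extract_domain("© https://www.americanthinker.com"))  # -> americanthinker.com
```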


Fig. 5. Template matching with anchors specific to Tor and Chrome

4 Experimental Results

We applied the above approach to the results of a lab experiment in which participants were instructed to look up sensitive topics, in order to investigate the pathways they used to find their topic of interest given their religious and political backgrounds. Due to the nature of the experiment, participants could use either a regular web browser or the Tor browser to perform their search (all sessions were conducted in a secure research lab). A video (or consecutive screen shots) was recorded from each computer screen during the online search sessions and stored for later analysis. Following the steps explained in Sect. 3, we created a set of images and cropped the top 1/3 of each image. In our experiments the user could browse the web with either Google Chrome or Tor, so we defined the anchors for these two browsers as shown in Fig. 5: the onion icon in the Tor browser and an extension icon added to the Chrome browser.

Fig. 6. Image processing steps used to improve results: original image, denoised image, sharpened image, and OCR output (https://www.americanthinker.com)


Depending on the match, the URL text field is cropped from the image.

Fig. 7. Distribution of domains and the time spent on them (in seconds) for a participant

The cropped text field was gray-scaled, re-sized, de-noised, and finally sharpened. The final result was fed into Tesseract OCR to extract the text; Fig. 6 shows an example of detecting a domain. Finally, text post-processing was performed on the output as needed. Aggregating the results over all images, we were able to determine on which domains participants spent most of their time, as can be seen in Fig. 7.

Fig. 8. The path that the participant took in the same experiment. The horizontal axis represents the elapsed session time

Using this process, we can investigate the paths each participant took during their interaction with the internet regardless of the type of browser they chose to use. Figure 8 shows a visualization of the same user path along with the duration of the experiment that had passed. This visualization helps to understand where a particular participant starts and ends, given their interests and backgrounds.

5 Conclusion and Future Works

This paper presents a different approach for tracking user behaviour during interaction with the World Wide Web (WWW). We used image processing methods along with an OCR engine (Tesseract-OCR) to extract the URLs and domains visited by a participant in a session. This was particularly difficult due to the nature of these experiments, where each participant was allowed to use multiple browsers (Chrome or Tor) and could switch between them at will. The computer screen and audio recorded during the interaction session were kept for separate analysis later on. To extract visited domains, we converted videos into consecutive images (where needed) and cropped only the URL area of each browser using template matching. The resulting area went through image pre-processing, was fed into the OCR engine, and finally, using regular expressions, we extracted the URL and domain within each image. With very few exceptions, this method allowed us to collect correct URLs and track the path a participant took during the experiment. The exceptions were mainly due to the quality of the videos and images, which introduced noise that was extremely difficult to remove. However, since most of these pictures were taken consecutively, one can find the correct domain by checking the very similar results extracted from other frames. As future research, one could try to automate such cases as well, for example by sending requests to the candidate domains and replacing imperfect results with the domains that correctly answer the requests. Moreover, this approach could be used to extract the content of a web page as well as the advertisements around the page, depending on their importance in experiments. Lastly, this methodology could aid the effort to teach machines to identify users' patterns of behaviour online and respond to potential online threats.

References
1. Agichtein, E., Brill, E., Dumais, S.: Improving web search ranking by incorporating user behavior information. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 19–26. ACM (2006)
2. Barve, S.: Optical character recognition using artificial neural network. Int. J. Adv. Res. Comput. Eng. Technol. (IJARCET) 1(4), 131 (2012)
3. Berchmans, D., Kumar, S.: Optical character recognition: an overview and an insight. In: 2014 International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), pp. 1361–1365. IEEE (2014)
4. Borisov, A., Markov, I., de Rijke, M., Serdyukov, P.: A context-aware time model for web search. In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 205–214. ACM (2016)
5. Bradski, G., Kaehler, A.: Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly Media, Inc., Sebastopol (2008)
6. Buades, A., Coll, B., Morel, J.M.: Image denoising methods. A new nonlocal principle. SIAM Rev. 52(1), 113–147 (2010)


7. Catledge, L.D., Pitkow, J.E.: Characterizing browsing behaviors on the world-wide web. Technical report, Georgia Institute of Technology (1995)
8. Chandarana, J., Kapadia, M.: Optical character recognition. Int. J. Emerg. Technol. Adv. Eng. 4(5), 219–223 (2014)
9. Hölscher, C., Strube, G.: Web search behavior of internet experts and newbies. Comput. Netw. 33(1–6), 337–346 (2000)
10. Hsieh-Yee, I.: Research on web search behavior. Libr. Inf. Sci. Res. 23(2), 167–185 (2001)
11. Kumar, G., Bhatia, P.K.: A detailed review of feature extraction in image processing systems. In: 2014 Fourth International Conference on Advanced Computing and Communication Technologies (ACCT), pp. 5–12. IEEE (2014)
12. Lowe, D.: CPSC 425: Computer Vision (January–April 2007) (2007)
13. Mori, S., Nishida, H., Yamada, H.: Optical Character Recognition. Wiley, New York (1999)
14. Patel, C., Patel, A., Patel, D.: Optical character recognition by open source OCR tool Tesseract: a case study. Int. J. Comput. Appl. 55(10), 50–56 (2012)
15. Rose, D.E., Levinson, D.: Understanding user goals in web search. In: Proceedings of the 13th International Conference on World Wide Web, pp. 13–19. ACM (2004)
16. Shao, L., Yan, R., Li, X., Liu, Y.: From heuristic optimization to dictionary learning: a review and comprehensive comparison of image denoising algorithms. IEEE Trans. Cybern. 44(7), 1001–1013 (2014)
17. Smith, R.: An overview of the Tesseract OCR engine. In: 2007 Ninth International Conference on Document Analysis and Recognition, ICDAR 2007, vol. 2, pp. 629–633. IEEE (2007)
18. Spalevic, Z., Ilic, M.: The use of dark web for the purpose of illegal activity spreading. Ekonomika 63(1), 73–82 (2017)
19. Srivastava, J., Cooley, R., Deshpande, M., Tan, P.N.: Web usage mining: discovery and applications of usage patterns from web data. ACM SIGKDD Explor. Newsl. 1(2), 12–23 (2000)
20. Xue, Y.: Optical character recognition. Department of Biomedical Engineering, University of Michigan (2014)

3D Reconstruction Under Weak Illumination Using Visibility-Enhanced LDR Imagery

Nader H. Aldeeb(B) and Olaf Hellwich

Computer Vision and Remote Sensing, Technische Universität Berlin, Berlin, Germany
[email protected], [email protected]

Abstract. Images of objects captured under poor illumination conditions inevitably contain noise and under-exposed regions where important geometric features may be hidden. Using these images for 3D reconstruction may impair the quality of the generated models. To improve 3D reconstruction under poor illumination, this paper proposes a simple solution for reviving buried features in dark images before feeding them into 3D reconstruction pipelines. Many approaches for improving the visibility of details in dark images exist today; however, to our knowledge, none of them fulfills the requirements for a successful 3D reconstruction. The approach proposed in this paper aims to enhance not only the visibility but also the contrast of features in dark images. Experiments conducted on challenging datasets of dark images demonstrate a significant improvement of the generated 3D models in terms of visibility, completeness, and accuracy. They also show that the proposed methodology outperforms state-of-the-art approaches that tackle the same problem.

Keywords: Dark image · Photogrammetry · Visibility enhancement

1 Introduction

Image-based acquisition of 3D shapes of objects has become an important part of a diversity of modern applications, including but not limited to cultural heritage [1,2], where archaeological pieces in museums are digitally archived for visualization in virtual museums, medicine [3], where 3D shapes are employed as guidance for physicians in plastic surgery, and the mobile industry [4], where 3D reconstruction pipelines are built into smart-phone apps as a creativity feature. 3D shape acquisition can also be implemented in ways other than image-based techniques, e.g. by range-based modeling [5]. Nevertheless, image-based techniques are distinguished from many other techniques in being fast, easy to use, flexible, and cost-efficient [6]. This has made them very attractive for a huge diversity of applications. However, they work well only under good lighting conditions, especially if standard cameras are used for capturing images.


Given dark images like the one seen in the top row (left) of Fig. 1, it is difficult for humans to observe small image details because of the imperceptibility of under-exposed regions. The observation problem becomes worse when the task is given to a computer vision algorithm, e.g. 3D shape estimation, which is confronted with the challenge of capturing sufficient geometric detail, specifically in dark regions. As a result, due to the lack of detailed geometric features, 3D reconstruction may not complete successfully. While 3D shape estimation techniques will sometimes succeed in finding feature points in dark image regions and generate 3D models, the generated models will show weak texture with non-realistic color information. To investigate this situation, we conducted an experiment in which five datasets of the same scene were captured. They are identical but acquired using different lighting levels (i.e. each has its own darkness level). Each dataset is independently fed to a 3D reconstruction pipeline computing structure from motion (SfM) and multi-view stereo (MVS) using the open-source implementations VSFM [7–9] and PMVS2 [10]. The first row of Fig. 1 shows sample views taken from the five datasets, and the second row shows snapshots of the generated models. As can be seen, the darker the dataset, the lower the quality of the generated model in terms of visibility, completeness, and even accuracy, as we will see later in Sect. 4.

Fig. 1. Darkness vs. 3D quality: first row shows sample images from five datasets that differ in lighting (darkest to brightest), second row shows corresponding 3D models.

Though designed to solve a different problem, High Dynamic Range (HDR) imaging can be employed to work around this issue. By using special cameras and fusing different exposures, the generated HDR image captures a wide range of intensity variations. Tone-mapping approaches for HDR images try to exploit the limited target range in a way that preserves image details as much as possible. As can be seen in Fig. 2 (right), HDR imaging reveals the frog that was completely buried in the dark image (left). This increases image detail and will potentially increase the number of inliers during the image matching procedure that is the core of any 3D reconstruction technique. Consequently, HDR imaging improves the 3D reconstruction of objects and scenes that are exposed to weak lighting conditions. However, floating-point HDR images require relatively massive storage compared to Limited Dynamic Range (LDR) images. In addition, using native floating-point HDR images increases the processing time, and capturing HDR images requires more effort than capturing LDR ones. Therefore, HDR imaging techniques have not been widely exploited in computer vision applications, especially in photogrammetry. In this research, as an alternative to the intricacies of HDR imaging, we investigate enhancing the visibility of the available dark LDR images before feeding them to the reconstruction pipeline.

Fig. 2. Dark image enhancement: (left to right) dark image example, enhanced using exposure fusion EF approach [11], using LIME approach [12], and using HDR imaging.

2 Literature Review

Image-based 3D reconstruction means estimating the geometrical structure of an object or a scene using a set of its acquired images. It is one of the vision-based techniques that are highly affected by the visibility of details in the acquired images. Hence, darkness in images does not only impair the pleasing visibility of details, but may also degrade the quality of the estimated 3D models. It is therefore necessary to find a way to improve the visibility of details in dark images. If it did not lead to saturating the intensity of pixels in relatively bright areas, intensity magnification of all pixels would be the easiest approach to enhance the visibility of details in dark images. Therefore, many more sophisticated approaches for improving the visibility of details in dark images have been developed in the last decades. For example, the works of Li et al. [13], Guo et al. [12], and Ying et al. [11] make use of Retinex theory, which mimics the response of the human visual system (HVS), to estimate the light coming from an observed scene. By modeling dark image formation as a product of the real scene and the light's transmitting medium (illumination-map), the real scene can easily be recovered, given the illumination-map. Usually, illumination-maps are not given. Therefore, Retinex-based approaches try to first estimate the illumination-map from each captured image, and then use it to achieve the desired image enhancement. Dark pixels and haze are highly correlated; ideally, a purely dark (black) image is a haze-free image [14]. This means that the inverted version of any dark image should look like a hazy image. Exploiting this correlation, Li et al. [13] try to lighten dark images indirectly by de-hazing the inverted version of them, and then invert the resulting de-hazed image to obtain the required lightened image. In these methods, the formation of dark images is usually represented by a model that transforms the desired well-exposed image into a dark one using an illumination-map and the global atmospheric light. Such approaches are straightforward and can achieve satisfactory results.


However, because of the indirect enhancement, color distortions and information loss might be present in the final result. In addition, these methods need to estimate two unknowns, the illumination-map and the atmospheric light, and estimation of the latter is a time-consuming process. Guo et al. [12] proposed a method called LIME, which stands for low-light image enhancement. The method depends on the previously mentioned image construction model that decomposes the captured image into a product of the desired scene and an illumination-map that represents the light's transmission medium and hence controls the darkness of the captured image. Based on that model, and attributing the cause of image darkness to the illumination-map, the enhancement of dark images becomes a straightforward process given that map. Hence, illumination-map estimation is the key and most important component for recovering the desired real scene. An initial estimate of the illumination at each individual pixel is first approximated; then the initial illumination-map is refined and used for enhancing the input dark image. More concretely, for each pixel in the input image, the maximum intensity among the three color channels (Red, Green, and Blue) is chosen as the initial illumination for that pixel. The generated initial illumination-map is then refined by employing detail-aware smoothing. Some former approaches like [15–17] have a different strategy for estimating the illumination-map, or at least its initial version. To achieve higher accuracy and consistency, they estimate the illumination for a given pixel based on the light intensities falling in its neighborhood. For efficiency reasons, the authors of LIME do not consider the neighborhood of pixels; as a compensation, they propose to refine the initial illumination using a method that preserves the main structure while smoothing fine details of the image. We consider LIME one of the most promising and simple approaches for dark image enhancement. However, we believe that the measure it uses for deciding where to smooth is not robust to noise. The measure can easily be biased by the existing noise, which is inevitable, especially since we are dealing with dark images. Important image structures like edges exist at different scales, and the employed measure may generate larger responses to noise than to small-scale structures. This leads to counterproductive enhancement results (i.e. noise is preserved, and small-scale structures are smoothed). Consequently, visibility-enhanced images generated using LIME may contain noise and weak textures; an example can be seen in Fig. 2, second image from the right. In photogrammetry, strong textures and low noise are necessary to obtain accurate 3D models. Hence, without noise suppression, LIME's visibility-enhanced images may not be good candidates for 3D reconstruction. As mentioned before, compared to LDR imaging, HDR imaging is not intensively employed in computer vision, because it requires relatively more storage, longer processing time, and more effort during image capture. However, Ying et al. [11] proposed an approach that acts as an alternative to HDR imaging for enhancing image contrast. They use a new Exposure Fusion (EF) framework that approaches the same goal of mitigating under- and over-exposed regions as HDR imaging.


Based on a single input image, they synthetically generate a new image with a different exposure, which is then fused with the original input to generate an enhanced version. In detail, a weight matrix is designed for the image fusion, where well-exposed pixels are assigned high weights and poorly exposed pixels low weights. Based on the fact that highly illuminated regions are mostly well-exposed regions and vice versa, they use the illumination-map of the input image as the weight matrix for the exposure fusion. They consider the brightness of the input image as an initial estimate of the illumination-map, which is then refined by optimization, following the same refinement procedure used by the previously mentioned work, LIME. At this point the weight matrix (the refined illumination-map) is ready to be used for exposure fusion, but the new synthetic exposure is not generated yet. Using their own camera response model as a brightness transform function, they generate from the input image a new image with a different exposure. In their model, they carefully set a parameter called the exposure ratio in order to guarantee that the new synthetic image well exposes the pixels that were not well exposed in the input image. To do so, they first extract an image that contains only the dark pixels of the input image. Then they formulate an optimization problem to find the best exposure ratio that allows their camera response model to generate an enhanced version of the extracted dark image with the maximum possible entropy. The theory behind this work is conceptually simple and robust, and can easily be implemented. However, from our point of view it has some drawbacks. First, the way they perform fusion not only enhances under-exposed regions and protects well-exposed regions from being changed, but unfortunately also prevents enhancing the contrast in over-exposed regions. In addition, to speed up the refinement of the initial illumination-map, they use a down-sampled version of the input dark image and, at the end, up-sample the resulting illumination-map to match the resolution of the original image. Unfortunately, this way of estimating the refined illumination-map does not capture all significant details of the original image; it is well known that down-sampling discards details in order to reduce the resolution. We believe that the generated illumination-map will not be very informative. Accordingly, using this map to weight the fusion process leads to images that have low detail and contrast, see Fig. 2, second image from the left. This often impairs the quality of the generated 3D models if such images are used for 3D reconstruction. To summarize, after many investigations and according to the aforementioned analytical review, we conclude that unfortunately none of the available dark-image enhancement methods generates images that fulfill the requirements for a successful 3D reconstruction. They only tend to generate images that are pleasing to the HVS. In addition, despite dealing with dark images that are mostly noisy, most of the investigated methods do not consider avoiding the enhancement of the existing noise. Also, they do not focus on generating sharp image details with high contrast, which is crucial for a successful 3D reconstruction.


In this paper, we propose a new dark-image enhancement methodology that facilitates achieving good 3D results by enhancing both the visibility and the contrast of details in images while avoiding noise enhancement.

3 Proposed Methodology

3.1 Overview

Regardless of their drawbacks, it is obvious that Retinex-based methods like the EF approach [11] and the LIME approach [12] are able to enhance the visibility of details hidden in dark regions (cf. Fig. 2, the two middle images). In both approaches, the illumination-map is the key component for improving the visibility of the input images. In LIME, the illumination-map is estimated from a full-resolution version of the input image, while in the EF approach it is estimated from a down-sampled version; LIME is therefore relatively superior in enhancing the visibility of the input images. In this paper, we first improve the illumination-map estimation of LIME by preserving important image details of different scales and decreasing the noise level. Then, we use the improved version of LIME to enhance the visibility of dark images. The generated enhanced images are expected to have higher visibility, more detail, and a higher signal-to-noise ratio. However, because of smoothing during the illumination-map refinement, the generated images may still have some regions of low contrast that make some details indistinguishable. This could impair the performance of any subsequent 3D reconstruction process. Hence, it is important to further enhance the contrast of details in the generated images. Yet, because the existence of noise is still probable, our contrast enhancement should be noise-aware in order to prevent noise from being enhanced again. To this end, we adapt the Contrast Limited Adaptive Histogram Equalization (CLAHE) approach of Zuiderveld [18] to expand the dynamic range of images. Our adapted CLAHE focuses only on enhancing textural regions while avoiding homogeneous regions where we expect to have noise.

3.2 Flowchart of the Proposed Approach

The flowchart of the proposed approach can be seen in Fig. 3. The input is assumed to be an image of a poorly illuminated object or scene, and the output is an enhanced version of it. We expect not only a lack of detail visibility in the input image, but also severe noise. First, the visibility of the dark image is improved using LIME, but employing our illumination-map estimation process. The estimation of the illumination-map is the core component, based on which the visibility of the input image is enhanced. As already mentioned, the generated visibility-enhanced images may still contain regions of low contrast and noise. Therefore, we further enhance the contrast of the generated images using a noise-aware contrast enhancement method, which is obtained by adapting CLAHE. In the following, the main components of the proposed methodology are discussed in more detail.


Fig. 3. Flowchart of the proposed approach.

Illumination-Map Initialization and Refinement. As mentioned before, the idea of enhancing image illumination using Retinex approaches depends mainly on the well-known image-formation model in (1). It models the formation of any captured image I as the product of the real exposure E, which we intend to estimate, and the light transmission medium (illumination-map) T. From this model, given a captured image I, it is obvious that the illumination-map is the key to recovering the real scene exposure. Hence, our main concern is to find an optimal estimate of this map.

$$I = E \circ T \tag{1}$$

As in LIME, for each individual pixel at location (x, y), we start with an initial estimate $\hat{T}(x,y)$ of the illumination-map as in (2), where c indexes the three color channels.

$$\hat{T}(x,y) = \max_{c \in \{R,G,B\}} I^{c}(x,y) \tag{2}$$

Then the refined illumination-map is obtained by solving for T in the following optimization problem:

$$\min_{T} \ \lVert \hat{T} - T \rVert_{2}^{2} + \beta \, \lVert W \circ \nabla T \rVert_{1} \tag{3}$$

where β is a parameter used to control the contribution of the second term, which is responsible for smoothing the desired illumination-map T. W is a weight that will be set so as to protect the major structures from being smoothed, and ∇T = (∇_h T, ∇_v T) is the gradient containing the horizontal and vertical central differences (first-order derivatives) of T. The first term of the optimization problem maintains consistency between the refined map and the initial one.

Structure-Aware Weight Estimation. The weight W = (W_h, W_v) is central to the illumination-map refinement process, as it is the only component that protects important image details from being smoothed by discriminating them from other details.


In LIME, for a given pixel at (x, y), the weight components are estimated from the horizontal and vertical gradients of the initial illumination-map at the same point, as in (4), where ε is a tiny value used to avoid division by zero.

$$W_{d}(x,y) = \frac{1}{\lvert \nabla_{d} \hat{T}(x,y) \rvert + \varepsilon}, \qquad d \in \{h, v\} \tag{4}$$
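For illustration, the following NumPy sketch computes the initial map of Eq. (2), the single-scale weights of Eq. (4), and the Retinex recovery of Eq. (1). It is a minimal sketch: the solver for the refinement problem (3) is omitted, np.gradient stands in for the central differences, and the function names and ε value are our own assumptions.

```python
import numpy as np

def initial_illumination(img_bgr):
    """Initial illumination map T_hat of Eq. (2): per-pixel max over color channels."""
    return img_bgr.astype(np.float32).max(axis=2) / 255.0

def lime_weights(t_hat, eps=1e-3):
    """Single-scale weights of Eq. (4); in our method these gradients are
    replaced by the multi-resolution propagated gradients of Eqs. (5)-(6)."""
    grad_v, grad_h = np.gradient(t_hat)            # axis 0 = vertical, axis 1 = horizontal
    w_h = 1.0 / (np.abs(grad_h) + eps)
    w_v = 1.0 / (np.abs(grad_v) + eps)
    return w_h, w_v

def recover_exposure(img_bgr, t_refined, eps=1e-3):
    """Recover the exposure E from the model I = E o T of Eq. (1)."""
    t = np.clip(t_refined, eps, 1.0)[..., None]    # broadcast over channels
    return np.clip(img_bgr.astype(np.float32) / 255.0 / t, 0.0, 1.0)
```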

Ideally, important image details (at different scales) should have relatively larger gradients than homogeneous regions. Accordingly, using (4), important image details are assigned smaller weights (i.e. protected from smoothing), while homogeneous regions are penalized (i.e. will be smoothed). However, as already mentioned, we believe that the measure (the standard gradient) employed by LIME in (4) may respond more strongly to noise in homogeneous regions than to small-scale details in non-homogeneous regions. As a result, noise may be protected from smoothing and important fine details may be smoothed, leading to enhanced images containing weak textures and noise, which impairs the quality of 3D reconstruction. Accordingly, it is important to investigate image details more deeply in order to avoid noise and exclude small-scale edges from smoothing. To this end, we adapt the LIME approach and propose to estimate gradients in a multi-resolution fashion. This guarantees the contribution of all significant image information to the weight estimation process while trying to suppress the noise. So, our goal now is to find the gradients $\nabla_d \hat{T}$ for a given input image $\hat{T}$ in a multi-resolution fashion. Based on the input image, we use a Gaussian kernel to build the pyramid G as in (5), such that $G^0$ is the input image itself (full resolution) and $G^n$ is its coarsest resolution. The image $G^{k+1}$ is produced by first convolving the image $G^k$ with a Gaussian kernel and then eliminating even rows and columns. The size of any image in the pyramid is one-quarter the size of its predecessor, and the value n is selected such that neither the width nor the height of the image $G^n$ is less than 256. Then, for each level k in the pyramid, we find the k-th level gradient $\nabla G^k(x,y)$ at point (x, y) as in (6).

$$G = \{G^{0}, G^{1}, \ldots, G^{n}\} \tag{5}$$

$$\nabla G^{k}(x,y) = \left[\nabla_{h} G^{k}(x,y), \ \nabla_{v} G^{k}(x,y)\right], \quad
\nabla_{h} G^{k}(x,y) = \frac{G^{k}(x+1,y) - G^{k}(x-1,y)}{2^{k+1}}, \quad
\nabla_{v} G^{k}(x,y) = \frac{G^{k}(x,y+1) - G^{k}(x,y-1)}{2^{k+1}} \tag{6}$$

Finally, for a given direction d ∈ {h, v}, to get an estimate of the gradient $\nabla_d G$ of the pyramid G, we simply propagate all its level gradients starting from the coarsest to the finest resolution. The entire procedure for propagating gradients is summarized in Algorithm 1. This multi-resolution gradient estimation guarantees capturing all significant information.


Finally, we estimate the weight components $W_d(x,y)$ as in (4), but now based on our multi-resolution (propagated) gradient $\nabla_d G(x,y)$ instead of $\nabla_d \hat{T}(x,y)$. The estimated weight W is then used in (3) to generate the refined illumination-map. Given the original dark image I, substituting the obtained illumination-map T into the model shown in (1), the desired exposure E can easily be recovered.

Algorithm 1. Gradients Propagation
Input: A finite set $A_d = \{\nabla_d G^0, \nabla_d G^1, \ldots, \nabla_d G^n\}$ of level gradients
Output: Propagated gradient $\nabla_d G$ of the input
for i ← n − 1 to 0 do
    Temp ← scale $A_d[i+1]$ up to the size of $A_d[i]$ using linear interpolation
    $A_d[i]$ ← $A_d[i]$ ∘ Temp
end for
return $A_d[0]$
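A possible Python rendering of Eqs. (5)–(6) and Algorithm 1 is sketched below, assuming OpenCV's pyrDown for the Gaussian pyramid and linear interpolation for up-scaling. It is an illustrative sketch under these assumptions, not the authors' code; np.gradient uses one-sided differences at the borders.

```python
import cv2
import numpy as np

def propagated_gradient(t_hat, axis, min_size=256):
    """Multi-resolution gradient of Eqs. (5)-(6) combined as in Algorithm 1.
    axis=1 gives the horizontal component, axis=0 the vertical one."""
    # Gaussian pyramid G^0 ... G^n (Eq. 5); stop before a side drops below min_size
    pyramid = [np.asarray(t_hat, dtype=np.float32)]
    while min(pyramid[-1].shape[:2]) // 2 >= min_size:
        pyramid.append(cv2.pyrDown(pyramid[-1]))

    # per-level central differences scaled by 2^(k+1) (Eq. 6);
    # np.gradient already divides by 2, so multiply back first
    grads = [np.gradient(level, axis=axis) * 2.0 / (2 ** (k + 1))
             for k, level in enumerate(pyramid)]

    # Algorithm 1: propagate from the coarsest to the finest level by
    # up-scaling and element-wise multiplication
    acc = grads[-1]
    for k in range(len(grads) - 2, -1, -1):
        h, w = grads[k].shape[:2]
        acc = grads[k] * cv2.resize(acc, (w, h), interpolation=cv2.INTER_LINEAR)
    return acc
```

The weights of Eq. (4) would then be obtained as 1 / (|propagated_gradient(...)| + ε) per direction.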

Noise-Aware Contrast Enhancement. For efficiency in this processing stage, we use the L component of the visibility-enhanced image after converting it to the Lab color space; afterwards, the contrast-enhanced image is converted back to RGB. As mentioned before, owing to the smoothing during illumination-map refinement by (3), the images generated after visibility enhancement may still have low-contrast regions. In addition, despite our noise suppression during visibility enhancement, they may still contain noise. Accordingly, although the visibility is considerably enhanced, these images may not be good candidates to be used as inputs for 3D reconstruction pipelines. For better 3D performance, it is necessary to further enhance their contrast. The Histogram Equalization (HE) technique expands the dynamic range of images to enhance their quality. More precisely, it redistributes pixels among gray levels such that they span a wider dynamic range than before, which has the effect of enhancing both the contrast and the illumination of images. However, standard HE methods do not consider avoiding noise enhancement; they usually enhance existing unwanted noise and generate over-enhanced areas (areas with incomplete and distorted details). Hence, standard HE techniques are not suitable for further enhancing the contrast of our generated images. Zuiderveld [18] proposed the CLAHE approach, which addresses the noise problem associated with standard HE methods. Like any other HE technique, based on the input image, CLAHE first estimates the image histogram, the Probability Density Function (PDF) in (7), from which the Cumulative Distribution Function (CDF) in (8) is estimated and used to map any gray level l as in (9), where $n_j$ is the number of pixels having gray level j, N is the total number of pixels in the image, and L is the maximum gray level value in the image.


$$PDF(j) = \frac{n_j}{N}, \qquad j = 0, 1, \ldots, L-1 \tag{7}$$

$$CDF(j) = \sum_{k=0}^{j} PDF(k) \tag{8}$$

$$T(l) = (L - 1) \cdot CDF(l) \tag{9}$$
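For reference, the following short NumPy sketch implements the plain HE mapping of Eqs. (7)–(9); the function name is hypothetical. CLAHE additionally clips the histogram before the CDF step, as discussed next.

```python
import numpy as np

def he_mapping(gray_u8, L=256):
    """Plain histogram-equalization mapping of Eqs. (7)-(9) for a uint8 image."""
    hist = np.bincount(gray_u8.ravel(), minlength=L)
    pdf = hist / gray_u8.size                              # Eq. (7)
    cdf = np.cumsum(pdf)                                   # Eq. (8)
    mapping = np.round((L - 1) * cdf).astype(np.uint8)     # Eq. (9): T(l) = (L-1)*CDF(l)
    return mapping[gray_u8]                                # remap every pixel
```

OpenCV's cv2.createCLAHE(clipLimit=..., tileGridSize=(8, 8)) exposes the fixed clipping value and block grid that the adaptive scheme described below replaces.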

As can be seen in (9), the gray level mapping depends ultimately on the histogram (PDF) of the input image. Histograms of homogeneous regions are usually characterized by a high peak; the higher the histogram's peak, the larger the slope of the corresponding CDF, and a larger slope of the CDF means a stronger enhancement (higher magnification) of gray levels. This may lead to enhancing existing noise and over-enhancing relatively bright details in such regions. Hence, the idea of CLAHE basically involves eliminating high PDF peaks before estimating the CDF. This is achieved by employing a clipping value which defines the maximum number of pixels a bin in the PDF may have. If for a given bin the number of pixels exceeds the maximum allowed limit, the excess is redistributed equally over all bins, including the current one. It should be noted that a higher clipping value leads to a larger CDF slope and consequently a stronger contrast enhancement, and vice versa. For better performance, CLAHE subdivides the input image into blocks (commonly 8 × 8) and enhances each block individually; blocking artifacts are avoided by using intra-block bilinear interpolation. CLAHE can noticeably enhance image contrast. However, the clipping value has to be set manually and remains constant for all blocks. Besides the problem of being manually set, we believe that a fixed clipping value may not fit all blocks at the same time: textured blocks should be assigned higher clipping values to enhance their contrast, while homogeneous blocks should be assigned a lower clipping value to limit the enhancement and avoid noise amplification. In this paper, we propose an improved version of CLAHE, where the clipping value is set automatically and adaptively for each block depending on its content. Accordingly, to achieve good enhancement results, our goal is to use a low clipping value that prevents magnifying details inside homogeneous blocks, while increasing that value for blocks that appear to have relatively more detail. Meanwhile, the selected clipping value should not lead to over-enhancement of originally bright details in the blocks. To achieve this, we adaptively set the clipping value for each block based on its Texture intensity and relative Brightness, as in (10), where N is the total number of pixels in the block, D is the dynamic range of the block, and α is a parameter used to control the contribution of the Texture term. In the following, we show how to estimate the Texture and relative Brightness terms.

$$Clip = \frac{N}{D} \times \Big( \min(Brightness, \ 1.5) + \alpha \times Texture \Big) \tag{10}$$


Image gradients are well-known tools for measuring image structural details such as edges, corners, and textures [19]. Unfortunately, noise in our images may look like genuine image detail and mislead standard gradient-based tools. This introduces confusion and inaccuracy when deciding whether a block is homogeneous. In fact, image signal and noise have different properties: reducing image resolution by convolving neighboring pixels with a Gaussian kernel reduces the noise. Therefore, we again employ our proposed multi-resolution technique to estimate block gradients. For a given block in the input image, we first build its Gaussian pyramid G as in (5), but here the number of levels n is selected such that neither the width nor the height of the image Gn is less than 32. After that, for each level k in the pyramid, we estimate the horizontal and vertical components, ∇hGk and ∇vGk, of the gradient ∇Gk as in (6), and then compute the gradient magnitude Mk(x, y) for each point (x, y) as in (11). Finally, all level magnitudes are propagated from the coarsest to the highest resolution to obtain an estimate of the block's propagated gradient magnitude Mp, as shown in Algorithm 2.

$$M^k(x, y) = \left\|\nabla G^k(x, y)\right\| = \sqrt{\big(\nabla_h G^k(x, y)\big)^2 + \big(\nabla_v G^k(x, y)\big)^2} \qquad (11)$$

Algorithm 2. Magnitudes Propagation
Input: A finite set A = {M^0, M^1, ..., M^n} of level-gradient magnitudes
Output: Propagated gradient magnitude Mp
1: for i ← n − 1 to 0 do
2:     Temp ← Scale A[i+1] up to the size of A[i] using linear interpolation
3:     A[i] ← A[i] ◦ Temp
4: end
5: return A[0]
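A possible implementation of the multi-resolution gradient of a block (Eq. (11) and Algorithm 2) is sketched below in Python with OpenCV. The use of cv2.pyrDown and Sobel filters, and the reading of the "◦" operator in Algorithm 2 as an element-wise product, are assumptions on our part.

```python
import cv2
import numpy as np

def propagated_gradient_magnitude(block):
    """Sketch of the multi-resolution gradient of a block (Eq. (11), Algorithm 2)."""
    # Build the Gaussian pyramid G^0..G^n; stop before width or height drops below 32
    pyramid = [block.astype(np.float32)]
    while min(pyramid[-1].shape[:2]) // 2 >= 32:
        pyramid.append(cv2.pyrDown(pyramid[-1]))

    # Gradient magnitude M^k at each level, Eq. (11)
    mags = []
    for level in pyramid:
        gx = cv2.Sobel(level, cv2.CV_32F, 1, 0, ksize=3)
        gy = cv2.Sobel(level, cv2.CV_32F, 0, 1, ksize=3)
        mags.append(np.sqrt(gx ** 2 + gy ** 2))

    # Algorithm 2: propagate magnitudes from the coarsest to the finest resolution
    for i in range(len(mags) - 2, -1, -1):
        h, w = mags[i].shape[:2]
        up = cv2.resize(mags[i + 1], (w, h), interpolation=cv2.INTER_LINEAR)
        mags[i] = mags[i] * up
    return mags[0]  # Mp
```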

Finally, from (12) we obtain an estimate of the block's texture intensity, where Mp is the block's multi-resolution-based propagated gradient magnitude, Mf is the block's full-resolution-based gradient magnitude, and ε is a small constant that prevents division by zero. Investigations of the reliability of this measure are discussed in the next subsection.

$$Texture = \frac{mean(M_p)}{mean(M_f) + \epsilon} \qquad (12)$$

To measure the relative brightness of a given block B, we simply use the ratio of the block's average gray value to that of the entire image E, as in (13), where ε again prevents division by zero. The contribution of the block's Brightness term in (10) falls into one of three cases. First, it is less than 1 whenever the average gray level of the block is lower than that of the entire image, which yields a lower clipping value for a relatively darker block. Second, it equals 1 when the average gray values of the block and the image are equal, yielding a neutral contribution of the Brightness term. Third, it is greater than 1 whenever the average gray level of the block exceeds that of the entire image, which yields a higher clipping value for a relatively brighter block. However, as can be seen in (10), to prevent over-enhancement of blocks that are very bright relative to the entire image, the contribution of the Brightness term is limited to 1.5.

$$Brightness = \frac{Avg(B)}{Avg(E) + \epsilon} \qquad (13)$$
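Continuing the previous sketch (same imports and the propagated_gradient_magnitude helper), the per-block clipping value of (10) can be assembled from the Texture (12) and Brightness (13) terms. The ε terms, the handling of the block's dynamic range D, and the grouping of terms in (10) follow the reconstruction above and should be read as assumptions.

```python
def adaptive_clip(block, image_mean, alpha=0.2, eps=1e-6):
    """Per-block clipping value, Eqs. (10), (12), (13); illustrative sketch."""
    # Texture term, Eq. (12): multi-resolution vs. full-resolution gradient energy
    mp = propagated_gradient_magnitude(block)
    gx = cv2.Sobel(block.astype(np.float32), cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(block.astype(np.float32), cv2.CV_32F, 0, 1, ksize=3)
    mf = np.sqrt(gx ** 2 + gy ** 2)
    texture = mp.mean() / (mf.mean() + eps)

    # Relative brightness term, Eq. (13), capped at 1.5 as in Eq. (10)
    brightness = block.mean() / (image_mean + eps)

    # Eq. (10): N is the number of pixels in the block, D its dynamic range
    n_pixels = block.size
    dyn_range = float(block.max()) - float(block.min()) + eps
    return (n_pixels / dyn_range) * (min(brightness, 1.5) + alpha * texture)
```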

Reliability of Multi-resolution Based Gradients. This subsection checks the performance of our multi-resolution-based gradient estimation method by investigating its response on both weakly-textured and homogeneous noisy blocks. Comparisons are also made with the response of the single-resolution-based gradient estimation method. Figure 4a shows a sample dark image, and Fig. 4b shows its L component after visibility enhancement and subdivision into blocks. The responses of both gradient estimation methods are shown in Fig. 4c and d for some randomly selected homogeneous and weakly-textured blocks, respectively. Note that we use a logarithmic scale to represent the results because of the large range of values obtained for the two gradient estimation methods.

Fig. 4. Performance of the two gradient estimation methods: (a) sample dark image, (b) enhanced version of (a) with blocks visualized on it, (c) gradient responses based on homogeneous blocks, and (d) gradient responses based on textured blocks (log scale).

As mentioned before, the visibility-enhanced image is expected to contain noise, especially in homogeneous regions such as the blocks colored orchid in Fig. 4b. Despite that, the highest gradient response recorded by our multi-resolution-based gradient is less than 0.1, while it is around 3.5 using the single-resolution-based gradient. For all homogeneous blocks, the single-resolution-based gradient recorded much higher responses (1.76 → 3.5) than the multi-resolution-based gradient (0.02 → 0.09). This means that the multi-resolution-based gradient estimation approach is more robust to noise than the single-resolution-based one, which stems from the fact that it uses down-sampling, which by its nature already reduces noise. On the other hand, for blocks that have some texture (no longer homogeneous), such as the blocks colored green in Fig. 4b, we noticed that the response of the single-resolution-based gradient does not change much (2.21 → 3.37), compared to (40.84 → 260.63) when using our multi-resolution-based gradient. Among all blocks that have some texture, the highest response of the single-resolution-based gradient is 3.37, which is less than the highest response of the same method (3.5) for homogeneous blocks. This leads to counterproductive enhancement results and happens because single-resolution gradient estimation is not robust to noise: it treats the noise present in homogeneous blocks as genuine detail, and because the noise is severe, it produces a larger response than for blocks that actually contain some texture. This means that, even though it employs smoothing, our multi-resolution-based gradient estimation method surpasses the single-resolution-based method in its ability to detect blocks with little texture. To verify the usefulness of our gradient estimation method for finding a suitable clipping value, we conducted an experiment to observe the Clip value for different types of blocks. Figure 5 shows six histograms corresponding to six different blocks (taken from Fig. 4b) along with the Clip values estimated using (10), based on our multi-resolution gradient estimation (red) and on standard single-resolution gradient estimation (blue). Note that the top row of Fig. 5 shows histograms of homogeneous blocks, while the bottom row shows histograms of textured blocks. For all homogeneous blocks, our gradient estimation leads to a smaller Clip value than that obtained using single-resolution-based gradient estimation, and vice versa for the textured blocks. It is also clear that our proposed gradient estimation approach is much more robust to noise than the standard one. This can easily be seen in Fig. 5c, which corresponds to the homogeneous and noisy block B58. The noise in this block appears to the single-resolution-based gradient estimation method as real detail; accordingly, the Clip it generates is very high (35000) compared to (6896) using our gradient estimation method. If this high Clip value were used for enhancing the contrast of this block with CLAHE, no clipping of the input histogram would occur, leading to the highest possible magnification of the details in this block. Since this is a homogeneous block, the noise would be magnified in this case.


Fig. 5. Adaptive histogram clipping: employing the multi-resolution based gradient (red) and the single-resolution based gradient (blue) for some selected blocks: (a) B0, (b) B48, (c) B58, (d) B2, (e) B11, (f) B33.

4 Experimental Results

The ultimate goal of this paper is to improve the quality (visibility, completeness, and accuracy) of 3D models estimated from images acquired under poor lighting conditions. This is achieved by improving the visibility and contrast of the acquired images. Therefore, we first explore the results of image improvement and then investigate the outcome of our proposed methodology in improving 3D quality. We used α = 0.2 and β = 0.15 for all experiments.

4.1 Image Visibility and Contrast Improvement

Figure 6 (top row) shows a dark image (left) and its improved versions using the EF approach (second), the LIME approach (third), and the proposed approach (fourth). The bottom row of the figure shows the corresponding histograms of the images after conversion to gray-scale; histograms are used to evaluate image contrast. Regarding visibility, it is obvious that all three approaches are able to reveal the details that were hidden in darkness. However, in preparation for 3D reconstruction, the contrast of details concerns us more. Comparing the histograms, we can easily notice that the original dark image has low contrast because its histogram spans a very narrow dynamic range (≈ 0 → 37). After enhancement using the EF and LIME approaches, the spanned dynamic range becomes wider. However, the image generated using our approach has a histogram that is spread over the whole gray-scale (≈ 0 → 255), meaning that pixels are redistributed among all possible gray levels. This yields higher variations in luminance, which makes image details more distinguishable. Hence, among the compared images, our generated image presents the highest contrast, which will facilitate obtaining better 3D reconstructions.

Fig. 6. Visibility and contrast investigations: First row (left to right), original dark image along with its enhanced versions using three different approaches (EF, LIME, and Proposed). Second row shows corresponding gray-scale histograms.

4.2 3D Quality Improvement

3D Visibility and Completeness Assessment. Figure 7 shows sample views (first row) taken from three different datasets of the same scene along with snapshots (second row) of their corresponding point clouds. The first dataset (left) is an originally dark dataset, the second (middle) is a version of it enhanced using our approach, and the third (right) is an originally bright dataset. Regarding visibility, it is clear that the point cloud generated using our enhanced images has stronger texture and more realistic color information than the point cloud generated using the non-enhanced (dark) images. Moreover, the details of the point cloud generated using our enhanced images are almost as clear as those of the point cloud generated using the originally bright images.

Fig. 7. 3D visibility and completeness assessment: First row, sample views of a dark dataset (left), enhanced using our approach (middle), and originally bright dataset (right). Second row, corresponding point clouds.


Regarding completeness, it is also clear that the point cloud generated using our enhanced images is more complete than the one generated using the dark images. Likewise, it is more complete than the one generated using the originally bright images. This is because we enhance not only the visibility but also the contrast of details. Quantitatively speaking, the point cloud generated using our enhanced images contains 257,400 points, while the one generated using the bright images contains 203,671 points, and the one generated using the dark images contains 157,729 points. It is worth mentioning that this is a challenging scene because it contains untextured surfaces, which is why some regions still lack 3D data; this problem, however, has been addressed in our earlier work [20]. For further assessment, we compared the quality of 3D models generated using images of our approach to the quality of 3D models generated using images of the EF approach [11] and of the LIME approach [12]. Comparisons are also made to 3D models generated using the standard approach (based on dark images). Figure 8 summarizes the results obtained on 9 datasets. Datasets DS1 → DS5 are our own, while DS6 → DS9 are taken from the freely available repository (MVS Data Set - 2014) [21], which contains 128 datasets of different scenes acquired under different lighting conditions; each dataset also contains a 3D model that can be used as ground truth in experiments. We selected the darkest scenes for our experiments. As can be seen in Fig. 8, it is evident that our approach for image enhancement leads to more complete models with clearer textures than the models generated using the other approaches. Numerically speaking, Fig. 9a summarizes the number of 3D points obtained when using each of the four approaches for all the datasets shown in Fig. 8. For all datasets DS1 → DS9, the 3D models generated from images improved using our approach are denser than the models generated from images of all other approaches. On average, using our approach leads to a 46% increase in the number of 3D points over those obtained using the standard approach (based on dark images). This percentage is 20% when using the LIME approach and 4.5% when using the EF approach. In some datasets, such as DS1, DS2, DS4, and DS5, the EF approach leads to fewer 3D points than those obtained using dark images. This happens because it does not generate high-contrast images, as already mentioned. For all datasets, LIME performs better than the standard approach; however, for some datasets, such as DS3, DS7, and DS9, the EF approach outperforms LIME.


Fig. 8. 3D quality assessment: Columns from left to right, sample images, point clouds based on dark images, based on enhanced images using EF approach, based on enhanced images using LIME approach, and based on enhanced images using the proposed approach.


3D Accuracy Assessment. Having a large number of 3D points does not necessarily mean that we are getting higher quality 3D models; the obtained 3D points can also be partially or entirely noisy (outliers). Thus, this subsection is dedicated to measuring the accuracy of the 3D models generated by the different approaches. To assess the accuracy of the generated 3D models, we compare them to the available ground truth models. The open-source project CloudCompare [22] is employed to compute the cloud-to-cloud distance (nearest-neighbor Euclidean distance). However, because the ground truth models are not dense enough, the default measured distance may not be accurate. Hence, for each point of the compared model, we measure the distance to a sphere that fits the nearest point and its surrounding points in the ground truth model. Figure 9b summarizes the accuracy results of the four approaches for datasets DS6 → DS9, as they contain ground truth models. As can be seen, the accuracy of the 3D models generated from dark images (standard approach) is the lowest among all approaches. This confirms in practice that using dark images impairs both the visibility and the quality of the generated 3D models. Also, even though they contain more 3D points, the models generated from images enhanced with our approach recorded the highest accuracy compared to those generated from images of the other three approaches. The average error (mean distance) equals 3.32 when using our approach, 5.27 when using the EF approach, 5.55 when using the LIME approach, and 7.51 when using the standard approach (based on dark images), with average standard deviations of 8.47, 18.64, 18.01, and 25.39, respectively.

Fig. 9. Charts to quantitatively compare the completeness (a) and accuracy (b) of the generated 3D models using different approaches for the datasets already shown in Fig. 8.

5 Conclusion

3D reconstruction of scenes and objects based on images acquired under poor illumination conditions may lead to unclear, incomplete, and inaccurate 3D models. This paper proposes an approach that improves the quality of 3D reconstruction by reviving buried features in dark images, enhancing image visibility and contrast at the same time.


Image visibility can easily be recovered given a good estimate of the unknown illumination map. This paper proposes a structure-aware illumination-map estimation method that aims at protecting small-scale image details from being smoothed while also suppressing noise. Although the visibility of dark images is significantly enhanced using our illumination maps, their contrast may not be sufficient for use in 3D reconstruction pipelines. Consequently, it is important to further enhance the contrast of the generated visibility-enhanced images. However, because visibility-enhanced images may still contain noise that was hidden in dark regions, a noise-aware contrast enhancement method is adopted in this paper. It enhances the contrast of regions that are relatively bright and contain more texture, and it avoids enhancing the contrast of regions that are relatively dark and homogeneous, as these may contain noise. After extensive investigations, we concluded that gradients estimated in a multi-resolution fashion can be used as accurate classifiers of image blocks to decide whether they are homogeneous or textured. It has been shown qualitatively and quantitatively that our proposed approach outperforms other state-of-the-art approaches in generating 3D models with higher visibility, completeness, and accuracy. Finally, we did not investigate the impact of changing the number of levels n on the performance of our multi-resolution-based gradient estimation approach. In addition, we used parameters such as α and β that were set empirically. These issues need to be addressed in future research. Acknowledgment. The authors would like to thank the German Academic Exchange Service (DAAD) for supporting this research.

References 1. Arias, P., Ord´ on ˜ez, C., Lorenzo, H., Herraez, J., Armesto, J.: Low-cost documentation of traditional agro-industrial buildings by close-range photogrammetry. Build. Environ. 42(4), 1817–1827 (2007) 2. Kim, J.-M., Shin, D.-K., Ahn, E.-Y.: Image-based modeling for virtual museum. In: Multimedia, Computer Graphics and Broadcasting, pp. 108–119. Springer (2011) 3. Ali, M.J., Naik, M.N., Kaliki, S., Dave, T.V., Dendukuri, G.: Interactive navigation-guided ophthalmic plastic surgery: the techniques and utility of 3dimensional navigation. Can. J. Ophthalmol./J. Can. d’Ophtalmologie 52(3), 250– 257 (2017) 4. Nocerino, E., Lago, F., Morabito, D., Remondino, F., Porzi, L., Poiesi, F., Rota Bulo, S., Chippendale, P., Locher, A., Havlena, M., et al.: A smartphonebased 3D pipeline for the creative industry-the replicate EU project. 3D Virtual Reconstr. Vis. Complex Arch. 42(W3), 535–541 (2017) 5. Remondino, F., El-Hakim, S.: Image-based 3D modelling: a review. Photogramm. Rec. 21(115), 269–291 (2006) 6. Aguilera, D.G., Lahoz, J.G.: Laser scanning or image-based modeling? A comparative through the modelization of San Nicolas Church. In: Proceedings of ISPRS Commission V Symposium of Image Engineering and Vision Metrology (2006)


7. Wu, C.: SiftGPU: a GPU implementation of scale invariant feature transform (sift) (2007). http://cs.unc.edu/∼ccwu/siftgpu 8. Wu, C.: Towards linear-time incremental structure from motion. In: International Conference on 3D Vision - 3DV 2013, pp. 127–134 (2013) 9. Wu, C., Agarwal, S., Curless, B., Seitz, S.M.: Multicore bundle adjustment. In: Computer Vision and Pattern Recognition, pp. 3057–3064 (2011) 10. Furukawa, Y., Ponce, J.: Accurate, dense, and robust multi-view stereopsis. IEEE Trans. Pattern Anal. Mach. Intell. 32(8), 1362–1376 (2010) 11. Ying, Z., Li, G., Ren, Y., Wang, R., Wang, W.: A new image contrast enhancement algorithm using exposure fusion framework. In: International Conference on Computer Analysis of Images and Patterns, pp. 36–46. Springer (2017) 12. Guo, X., Li, Y., Ling, H.: LIME: low-light image enhancement via illumination map estimation. IEEE Trans. Image Process. 26(2), 982–993 (2017) 13. Li, L., Wang, R., Wang, W., Gao, W.: A low-light image enhancement method for both denoising and contrast enlarging. In: 2015 IEEE International Conference on Image Processing (ICIP), pp. 3730–3734. IEEE (2015) 14. Shi, Z., Mei Zhu, M., Guo, B., Zhao, M., Zhang, C.: Nighttime low illumination image enhancement with single image using bright/dark channel prior. EURASIP J. Image Video Process. 2018(1), 13 (2018) 15. Forsyth, D.A.: A novel algorithm for color constancy. Int. J. Comput. Vis. 5(1), 5–35 (1990) 16. Funt, B., Shi, L.: The rehabilitation of MaxRGB. In: Color and Imaging Conference, vol. 1, pp. 256–259. Society for Imaging Science and Technology (2010) 17. Joze, H.R.V., Drew, M.S., Finlayson, G.D., Rey, P.A.T.: The role of bright pixels in illumination estimation. In: Color and Imaging Conference, vol. 1, pp. 41–46. Society or Imaging Science and Technology (2012) 18. Zuiderveld, K.: Contrast limited adaptive histogram equalization. In: Graphics Gems IV, pp. 474–485. Academic Press Professional, Inc. (1994) 19. Sanaee, P., Moallem, P., Razzazi, F.: A structural refinement method based on image gradient for improving performance of noise-restoration stage in decision based filters. Digit. Signal Process. 75, 242–254 (2018) 20. Aldeeb, N.H., Hellwich, O.: Reconstructing textureless objects - image enhancement for 3D reconstruction of weakly-textured surfaces. In: Proceedings of the 13th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISAPP, vol. 5, pp. 572–580. INSTICC, SciTePress (2018) 21. Jensen, R., Dahl, A., Vogiatzis, G., Tola, E., Aanæs, H.: Large scale multi-view stereopsis evaluation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 406–413. IEEE (2014) 22. Cloudcompare, version 2.10, GPL software. http://www.cloudcompare.org/. Accessed 15 Sept 2018

DynFace: A Multi-label, Dynamic-Margin-Softmax Face Recognition Model

Marius Cordea¹, Bogdan Ionescu¹, Cristian Gadea², and Dan Ionescu²

¹ Mgestyk Technologies, Ottawa, ON, Canada, [email protected]
² University of Ottawa, Ottawa, ON, Canada

Abstract. Convolutional neural networks (CNNs) have recently greatly increased the performance of face recognition due to their high capability in learning discriminative features. Many of the initial face recognition algorithms reported high performance on the small Labeled Faces in the Wild (LFW) dataset but fail to deliver the same results on larger or different datasets. Ongoing research tries to boost the performance of face recognition methods by modifying either the neural network structure or the loss function. This paper proposes two novel additions to the typical softmax CNN used for face recognition: a fusion of facial attributes at the feature level and a dynamic-margin softmax loss. The new network, DynFace, was extensively evaluated on extended LFW and the much larger MegaFace, comparing its performance against known algorithms. DynFace achieved state-of-the-art accuracy at high speed. Results obtained during the carefully designed test experiments are presented at the end of this paper.

Keywords: Face recognition · Face verification · Face identification · Convolutional neural networks · Deep learning · Multi-label · Softmax · Additive margin softmax

1 Introduction

Biometrics represents the automated identification of a person based on physiological or behavioral characteristics. For quite some time, due to its use in a series of applications spanning from Facebook applications to smart cities, border control, airport security, surveillance, and many others, face recognition has been and still is the main biometric technique present in the architecture of the above applications. This active domain has been revived by the reborn convolutional neural networks (CNNs) applied to face recognition. Implementations of such multi-layer neural networks have been reported to surpass human performance and achieve outstanding results [1, 2] on the Labeled Faces in the Wild (LFW) benchmark [3]. Due to the limitations of the original LFW matching protocol (only 6,000 image pairs), recognition results became saturated and difficult to compare. Moreover, algorithms performing well on LFW perform poorly on much larger datasets. In order to solve this issue,


several face recognition challenges have been created. For instance, the MegaFace challenge [4] collected a gallery dataset of one million distractors (images not present in the probe set) to build a more realistic setup for training and testing face recognition algorithms. The standard practices used to improve face recognition performance on large datasets are: (1) augmenting the training data, (2) modifying the structure of the neural network (e.g., increasing the depth), and (3) modifying the loss function.

1.1 Training Data

The most commonly employed publicly available datasets for training a face neural model are VGGFace2 [5], CASIA-WebFace [6], MS-Celeb-1M [7], and MegaFace [4]. The number of images contained in these datasets ranges from thousands to millions. For example, CASIA-WebFace contains 494,414 training images from 10,575 identities, but the stored images contain a large amount of noise. The MegaFace training dataset (challenge-2) has a large number of identities, but also suffers from annotation noise and long-tail distributions. Microsoft released the large MS-Celeb-1M dataset with 10 million images from 100,000 celebrities. Despite its significant number of images, the intra-identity variation is restricted to an average of 80 images per person, and the dataset exhibits identity noise. The VGGFace2 dataset contains 3.31 million images from 9,131 celebrities covering a wide range of ethnicities and professions. The images expose large variations in pose, age, lighting, and background. The dataset is gender-balanced, with 59.7% males, and has between 87 and 843 images per identity, with an average of 362.6 images. Facebook and Google have much larger in-house, private datasets. For instance, Facebook [8] trained a face recognition model on 500 million images of over 10 million subjects. The Google model [2] was trained on 200 million images of 8 million identities.

1.2 Network Structure

Higher-capacity deep neural networks such as ResNet [14] deliver better prediction accuracy compared to older configurations such as VGG2 [5]. However, deeper networks require large storage and increased training and recall time. Other attempts try to inject knowledge-based information to help the training of a given network structure. Nowadays, a large amount of facial data comes annotated with various facial attributes (location, pose, expression, gender, age, ethnicity, etc.). Since humans recognize facial identity mostly by identifying facial traits [16], it is logical to use facial attributes to help the face recognition task. Recently, research attempts were made to use facial annotations in a multi-task learning (MTL) framework [21, 22]. In these cases, face attribute prediction and recognition share the same CNN structure. Arguably, forcing the CNN to simultaneously learn the face feature vector used in face recognition and the face characteristics leads to increased network complexity and training times. For instance, in [22] the all-in-one network takes 3.5 s to process one image, which is prohibitive for real-time applications. Moreover, applications rarely need face recognition and face analysis in one shot.

1.3 Loss Functions

The most widely used classification loss is the log-likelihood cost using a softmax function, often called the softmax loss. The softmax function is adopted by many CNNs due to its simplicity, probabilistic interpretation, and excellent performance when used in a loss function. The main goal of a classifier is to increase intra-class compactness and enlarge the inter-class separation. Despite its popularity, the softmax loss does not have this discriminative power. In order to overcome this, researchers developed other Euclidean loss functions. The contrastive loss [17] and the triplet loss [2] use an image pair/triplet approach to increase recognition performance; they employ positive and negative pairs of trained feature vectors to bring genuine pairs together and separate impostor ones. These techniques are limited by the complexity of training and the effectiveness of selecting adequate image pairs. The center loss [18] adds a penalty during the main training process to compress same-class face clusters while separating clusters of different classes. Recently, researchers moved the loss function from the Euclidean to the angular or cosine domain. Liu et al. [15] proposed a large-margin softmax (L-Softmax) that imposes multiplicative angular constraints on each true class to improve feature separation. SphereFace, or A-Softmax [9], applies L-Softmax to face recognition using weight normalization. However, these loss definitions need complex backpropagation gradient computations and contain additional parameters to be learned. Additive margin softmax (AM-Softmax), or CosFace [11, 19], tries to overcome the optimization complexity of SphereFace by bringing the angular margin into the cosine space. The expression of AM-Softmax is simple since it modifies only the forward network pass without the need for backpropagation derivatives, which makes it very effective. ArcFace, or InsightFace [20], imposes a stricter margin penalty compared to SphereFace and AM-Softmax and argues that the approach has a better geometric interpretation. However, the ArcFace loss adds implementation complexity without a clear performance improvement. AM-Softmax (CosFace) and ArcFace showed outstanding face recognition performance on the MegaFace challenge.

1.4 Proposed Model

Simply increasing the amount of training data and the depth of the neural network to boost face recognition performance comes at a greater cost in computational resources. This paper takes another path and presents two novel additions to existing CNN architectures in an attempt to increase the accuracy of face recognition while keeping the computational cost under control. The network structure is kept light, making it easier to train and suitable for real-time face recognition applications. The paper describes the implemented model and architecture, named the DynFace model, whose contributions can be summarized as follows:


1. Starting from the idea that facial attributes can help face recognition, a multi-label (ML) model has been implemented by concatenating face attributes, namely pose (yaw, pitch), to the learned feature vector.
2. A dynamic-margin softmax (DM-Softmax) loss has been proposed, where the experimentally fixed margin is now controlled by the pose variation. The pose modifies the margin, signaling the network to pay more attention to non-frontal faces, hence compressing the variability of same-class identities while increasing the inter-class separation.
3. The efficiency of the proposed approach has been demonstrated in a small-size toy example, and extensive evaluation on extended LFW and MegaFace datasets against benchmarked, state-of-the-art algorithms has been carried out.
4. The trained DynFace model was used to curate the face datasets in preparation for further training sessions. The employed clustering of matching results found an impressive number of noisy identities in VGGFace2, which was known as a clean dataset.

2 Preliminaries

2.1 Multi-Label

Multi-task methods in [21] and [22] implement joint learning of facial attributes and face identity in a single framework with a shared deep network architecture. The loss function is a weighted sum of the two corresponding task losses: $L_{total} = \alpha_1 L_{identity} + \alpha_2 L_{attributes}$. Normally, $L_{identity}$ is a softmax loss and $L_{attributes}$ is a mean squared error loss. Training these multi-task configurations did not result in significant improvements compared to the softmax loss. Moreover, due to the added network complexity and computation times, another, simpler strategy for using multiple labels, without the need to predict them, is proposed.

2.2 Loss Margin

The angular loss functions modify the typical softmax loss, which is expressed for one sample as:

$$L_{Softmax} = -\log\left(\frac{e^{W_{y_i}^{T} x_i}}{\sum_{j}^{K} e^{W_j^{T} x_i}}\right) = -\log\left(\frac{e^{\|W_{y_i}\|\|x_i\|\cos\theta_{y_i}}}{e^{\|W_{y_i}\|\|x_i\|\cos\theta_{y_i}} + \sum_{j\neq y_i} e^{\|W_j\|\|x_i\|\cos\theta_j}}\right) \qquad (1)$$

where $x_i \in \mathbb{R}^d$ is the deep feature of the i-th sample, belonging to the $y_i$-th class, and d is the deep feature dimension. $W_j \in \mathbb{R}^d$ denotes the j-th column of the weights $W \in \mathbb{R}^{d\times K}$ in the last fully connected layer, which acts as a classifier, and K represents the number of classes. $\theta_j$ is the angle between the weight and feature vectors when the logit function $W_j^T x_i$ is formulated as an inner product. The bias is omitted to simplify the analysis. The L-Softmax loss, introduced in [15], employs a hard angular margin constraint a in the original softmax loss to improve its discriminative ability:

$$L_{L\text{-}Softmax} = -\log\left(\frac{e^{\|W_{y_i}\|\|x_i\|\cos a\theta_{y_i}}}{e^{\|W_{y_i}\|\|x_i\|\cos a\theta_{y_i}} + \sum_{j\neq y_i} e^{\|W_j\|\|x_i\|\cos\theta_j}}\right) \qquad (2)$$

Considering a binary example with a vector x from class 1, in order to classify it correctly, the softmax loss enforces the weight computation such that $W_1^T x > W_2^T x$, or $\|W_1\|\|x\|\cos\theta_1 > \|W_2\|\|x\|\cos\theta_2$ when expressed in terms of inner products. L-Softmax constrains the classification by introducing the margin a:

$$\|W_1\|\|x\|\cos\theta_1 \geq \|W_1\|\|x\|\cos a\theta_1 > \|W_2\|\|x\|\cos\theta_2 \qquad (3)$$

SphereFace [9] with the A-Softmax loss adds an $L_2$-norm on the weights to L-Softmax and replaces $\cos a\theta_{y_i}$ by a piece-wise monotonic function. However, due to the presence of the angular margin and the extra parameters to tune, the forward and especially the backward computation become complicated. In order to overcome this complexity, AM-Softmax transformed the angular margin into an additive one using a constant hyper-parameter m:

$$W_1^T x \geq W_1^T x - m > W_2^T x \qquad (4)$$

and

$$L_{AM\text{-}Softmax} = -\log\left(\frac{e^{s(W_{y_i}^T x_i - m)}}{e^{s(W_{y_i}^T x_i - m)} + \sum_{j\neq y_i} e^{s W_j^T x_i}}\right) \qquad (5)$$

540

M. Cordea et al.

3 Multi-label, Dynamic-Margin Model 3.1

Label Fusion

Differently than the existing multi-task techniques that share a CNN to jointly train for face recognition and analysis [21, 22], our model uses a feature level fusion approach of facial attributes to improve face recognition performance. The columns of weights of the last fully connected layer W ¼ ½W1 ; W2 ; W3 . . .; WK , K being number of classes, can be seen as linear classifiers for the last feature vector: x ¼ ½x1 ; x2 ; x3 . . .; xd . The d values of the feature vector might be seen as representing the facial structure. It seems logical adding at this level more variables characterizing the face (pose, gender, age, expression, etc.). After concatenating the face pose, namely yaw and pitch, to the x vector, hence there is a need to learn two more rows in the weight matrix W of the final classifier, W 2 Rðd þ 2ÞxK . At validation stage one extracts only the d-dimensional facial feature vector to be used in face recognition matching. The yaw and pitch take discrete values [−2, −1, 0, 1, 2] corresponding to pose intervals: [−90, −50], [−50, −20], [−20, 20], [20, 50], [50, 90]. 3.2

Dynamic Margin

The proposed procedure starts with the AM-Softmax due to its simplicity and high performance. The hyper-parameter m in (6) loss function is derived empirically and is constant for all face images. It was found that m can vary between 0.2 and 0.5 [11]. In this paper a value of m = 0.35 is considered. The hyper-parameter m controls the degree of punishment. When m increases, the loss becomes larger. Intuitively, we want the margin to be more restrictive for difficult images, which in our case are faces exposing deviations from frontal pose. We express m as a function of pose: uðmÞ ¼ m þ a  ðabsðyawÞ þ absðpitchÞÞ

ð6Þ

where m is the default value 0.35, and a is a scaling parameter, needed to keep uðmÞ bounded, set to 0.05. Yaw and pitch are the five discrete values [−2, −1, 0, −1, −2], corresponding to the face pose intervals. When the face image is displayed in a frontal pose, uðmÞ will approach the default value. For rotated faces uðmÞ will increase, forcing the network to better adjust the training, so those faces will get more weight. This approach seems reasonable since softmax tends to discard hard samples and train on the majority of good ones. The new loss function becomes: 0

LMLDMSoftmax

1 sðW Tyi xi uðmÞÞ e A ¼  log@ P T sW Tj xi e esðW yi xi uðmÞÞ þ K j6¼yi

ð7Þ

where the weight matrix W 2 Rðd þ 2ÞxK has now d þ 2 rows, with the extension corresponding to face attributes.

DynFace: A Multi-label, Dynamic-Margin-Softmax Face Recognition Model

541

4 Implementation First step in developing a deep learning face recognition model is selecting a training dataset. Most of existing face datasets expose lot of identity noise hard to remove manually due to the large amount of images to review. 4.1

Face Selection

The training of the proposed neural network was realized on VGGFace2 database due to its great variability in pose, age, quality, illumination etc. Moreover, the dataset went through a process of automatic and manual filtering to remove near duplicates, overlapped subjects, and outlier faces from each identity. A quick exploring showed that VGGFace2 is the cleanest face dataset among the publically available ones for face recognition training. First stage in data preparation for training is face/feature/pose detection. A modified version of the MTCNN [12] was used because it delivers face and facial landmark location with a 99.5% detection rate in our experiments. The minimum face size was set as a third of the image width in order to discard small background faces. The face rotation has been computed from detected features using PnP method [10]. VGGFace2 exposes a single, centred face in most images. However, there are many images with multiple faces, which will increase the risk of extracting the wrong identity for training. While running the detector the centered face with the highest confidence has been chosen. Also, we discard faces on image boundaries if they are cut (width/height ratio < 0.7 or >1.4). More than 10,000 ‘cut’ faces were removed from the VGGFace2 dataset. Before starting the training, techniques to further clean the VGGFace2 dataset were implemented. Feature descriptors in a grid manner have been computed on images of each identity. As a result, images have been clustered based on similarity. Only one face per cluster was kept which ended up discarding more than 200,000 duplicates (Fig. 1).

Fig. 1. Discarding similar faces from each identity

4.2

Model Training

In the works described in this paper a deep neural network has been implemented using Caffe framework [13]. After detecting the face and its features, the face has been

542

M. Cordea et al.

aligned to a canonical template. Using the middle point between the eyes and the mouth center, face images are warped to a canonical image of size 160  160. Choosing two anchor points on the vertical axis has ensured that yaw-rotated faces will not change in size. Finally, face images have been converted to the gray level and then have been normalized to [−1, 1] interval by subtracting 128 and by dividing by 128. The reason for choosing grey images was for lowering the storage needs and for boosting the speed of training. It was shown that despite containing more information, color images do not deliver a significant performance improvement [6]. Another reason of preferring gray over color images for training was the authors’ observation that networks trained on color faces have a drop in accuracy when matching gray images. ResNet-20 [14] is employed as the backbone architecture of DynFace, the ML-DMSoftmax network to train on VGG2Face dataset. Figure 2 illustrates the network configuration. The backbone is composed of 8 ResNet units (4 of 3 and 4 of 2 convolution blocks). Multi-label input is implemented in block “ML:concat”, and the margin is represented by the “DM:dynamic_margin” block. The first convolution layer uses stride = 2 to halve the input image and reduce the storage requirements. The training batch size is set to 64 and test size to 32. The learning rate starts at 0.01 and after 45k iterations is divided by 2 every 10k iterations. The training is performed using SGD with momentum 0.9 and weight decay of 0.0005 for 100k iterations. We implemented our network on a machine with 4 CPU cores and one GTX 1070 GPU. We also trained an AM-Softmax network on same VGGFace2 dataset to evaluate the impact of our two contributions.

Fig. 2. Simplified DynFace configuration (one unit of 3 convolution blocks is shown)

In test mode, given a face image, it takes an average of 135 ms to extract a deep feature vector (template) from normalized inner-product “fc5” block, with a size of d ¼ 512:

DynFace: A Multi-label, Dynamic-Margin-Softmax Face Recognition Model

543

5 Experiments In this section, the experimental designs of the DynFace is introduced. The design starts with a toy example in which a visually comparison of the proposed neural network discriminative capability against existing network structures has been done. Finally, a comparison of the performance of the proposed neural network against several state-ofthe-art face recognition algorithms in LFW and MegaFace datasets has been also done. 5.1

Toy Experiment

In order to better understand the impact of our multi-label, dynamic-margin ML-DMSoftmax framework, a toy experiment for visualizing the trained feature distributions of four trained networks: Softmax, AM-Softmax, ML-AM-Softmax, and ML-DMSoftmax has been designed. A number of 10 identities (Fig. 3) from VGGFace2 dataset that contain most images and expose largest variation on head pose (yaw, pitch) has been selected. From each identity, randomly 80% (over 4000) of images for training have been selected, while the rest of 20% (about 1000) of images have been used for testing.

Fig. 3. Sample images from the 10-identity toy experiment

A simple 7-layer CNN model on the four frameworks has been trained, and a 3D normalized feature vector for visualization has been produced in the output. The 3D features from the test set have been drawn onto a sphere to visually compare the classification clusters of the four trained scenarios (Fig. 4). Using the yaw and pitch face labels (ML-AM-Softmax), leads to a clearer separation of identities in 3D space. Adding a DM-Softmax layer will further improve the classification.

544

M. Cordea et al.

Fig. 4. Feature clusters on the test set: Softmax, AM-Softmax, ML-AM-Softmax, ML-DMSoftmax. Each color represents different identity.

5.2

Evaluation Metrics

Face verification and identification are two main scenarios of face recognition. Verification matches two face images to decide if they belong to the same identity. Identification determines the identity of a face image (probe) by matching it against a gallery of images. Face verification uses the Receiver Operating Characteristic (ROC) curve to measure the performance. ROC plots the verification rate or True Acceptance Rate (TAR) versus the False Acceptance Rate (FAR) by changing the decision match/non-match threshold. Face identification uses the Cumulative Matching Characteristic (CMC) curve to measure the performance. CMC plots cumulative matching score versus the rank of the probe. When comparing multiple biometric systems a single-point performance measure can be useful. Such a number is the Equal Error Rate (EER), which is the operating point on the ROC curve where FAR equals False Rejection Rate (FRR). A lower value indicates a better biometric system. Extensive experiments on two unconstrained datasets, LFW and Megaface, show that state-of-the-art results on face recognition tests has been achieved. A comparison of DynFace, ML-DM-Softmax model, against CASIA [6], VGG2 [5], CosFace [11], and SphereFace [9] networks has been obtained. CASIA model was trained on CAISAWebFace dataset, has 10 convolution layers, uses 100  100 gray images and employs a contrastive loss. VGG2 model implementation [27] is a ResNet-50 network, trained on VGGFace2, inputs 224  224 color images, and uses a softmax loss function. CosFace model implementation [26] trained a ResNet-20 network on 96  112 color images from CAISA-WebFace, and employs an AM-Softmax loss function. Similarly, SphereFace model implementation [28] trained same backbone ResNet-20 network on 96  112 color images from CAISA-WebFace, and employs an A-Softmax loss function. CosFace, SphereFace and the proposed DynFace models have same backbone ResNet-20 network. They differ in loss functions and training datasets. A version of DynFace called also DynFace-AM, which has same architecture as CosFace has been included in the tests for assessing the impact of our network additions on performance. An evaluation framework based on OpenBR [25] has been designed and developed for testing the proposed new model DynFace. The face images used in evaluation were first aligned to the algorithm-specific canonical face, and templates extracted in order to compare the performance of each algorithm introduced. LFW-ALL Labeled Faces in the Wild (LFW) [3] contains 13,233 images of 5,749 people. The standard LFW evaluation is not statistically sound because at low FAR only a limited

DynFace: A Multi-label, Dynamic-Margin-Softmax Face Recognition Model

545

number of impostor scores are available. In order to overcome this issue, the matching protocol reported in this paper uses the whole LFW database for both verification and identification scenarios (LFW-ALL). Before starting the evaluation, the trained DynFace model was run in LFW dataset to remove potential noise. Low genuine and high impostor matches were recorded and manually reviewed. A total of 35 mislabelled identities were found and removed from the test set. Verification/Identification Performance The entire image set has been compared against itself, discarding identical and symmetrical matches, resulting in 242,180 genuine matches and over 86 million impostor matches. The performances of all engines are illustrated in Table 1 and Figs. 5 and 6, indicating DynFace as the top performer. Table 1. Verification/Identification Performance on LFW-ALL FR algorithm Verification Identification TAR@ TAR@ TAR@ Rank 1 Rank 5 FAR = 1e − 3 FAR = 1e − 5 FAR = 1e − 6 CosFace 0.981 0.848 0.718 0.980 0.994 SphereFace 0.979 0.835 0.690 0.976 0.993 VGG2 0.961 0.670 0.450 0.968 0.987 Casia 0.550 0.185 0.092 0.751 0.848 DynFace 0.996 0.952 0.874 0.992 0.997

Fig. 5. Verification and identification performance graphs of FR engines in LFW-ALL

546

M. Cordea et al.

Fig. 6. Equal Error Rate of evaluated FR engines in LFW-ALL. DynFace, with the lowest value, is the best FR system

MegaFace MegaFace test gallery contains 1 million images from 690,000 individuals with unconstrained pose, expression, lighting, and exposure and is used for evaluating face recognition algorithms. The probe set of MegaFace used in the first challenge consists of two databases: Facescrub [23] and FGNet [24]. Facescrub dataset contains more than 100,000 face images of 530 people and FG-NET contains 975 images of 82 individuals, with ages spanning from 0 to 69. The MegaFace challenge evaluates performance of face recognition algorithms by increasing the numbers of distractors (from 10 to 1 million) in the gallery set. Similarly to the majority of participants in the challenge, in the experiments described in this paper, Facescrub selected set (over 3500 images) has been used as a probe set for performing the tests. A simple cleaning procedure, similar to one from LFW, was performed on FaceScrub dataset by looking at the low genuine and high impostor matches. Over 700 mislabelled identities were found and removed from the entire FaceScrub probe set. Verification/Identification Performance Facescrub selected image set that has been compared against itself and against MegaFace 10,000 distractor images using provided probe/gallery lists. This results in over 75,000 genuine matches and over 41 million impostor matches. The face verification and identification performances are illustrated in Table 2 and Fig. 7, indicating DynFace as the best system.

DynFace: A Multi-label, Dynamic-Margin-Softmax Face Recognition Model

547

Table 2. Verification/Identification Performance on MegaFace-10k FR Algorithm Verification Identification TAR@ TAR@ TAR@ Rank 1 Rank 1 FAR = 1e − 3 FAR = 1e − 5 FAR = 1e − 6 CosFace 0.954 0.791 0.661 0.993 0.997 SphereFace 0.962 0.836 0.742 0.994 0.997 VGG2 0.888 0.609 0.452 0.981 0.992 Casia 0.625 0.294 0.196 0.899 0.950 DynFace-AM 0.967 0.856 0.778 0.995 0.998 DynFace 0.980 0.899 0.838 0.996 0.998

Fig. 7. Verification and identification performance graphs of FR engines in MegaFace-10k

Another evaluation test was performed using MegaFace 100,000 distractor images to compare DynFace against main contenders. The ranking is kept with DynFace on top. The verification TAR increases, especially at low security levels FAR = 1e − 5: DynFace = 0.906, CosFace = 0.826, SphereFace = 0.852. In this paper, the evaluation results were reported for the 10,000 distractor setup, the most challenging one. Result Interpretation The new model DynFace introduced in this paper, archives state-of-art performance on two datasets, LFW-ALL and MegaFace. It surpasses the accuracy of the latest, highperformance architectures (SphereFace, CosFace), in verification and identification tests. The difference is more evident in verification mode at high security levels (FAR = 1e − 5). The DynFace ML-DM contribution can be seen when comparing also against DynFace-AM (Table 2), which uses the same network backbone as DynFace but the classical loss AM-Softmax. The better performance of DynFace-AM compared to CosFace is mostly the effect of using curated VGGFace2 training dataset, since both have the same architecture.

548

5.3

M. Cordea et al.

Recall-Time Data Filtering

The trained DynFace model was used to curate face image datasets. As mentioned above, image pairs of low genuine and high impostor matches where extracted during testing on LFW-ALL and MegaFace. A low genuine score happens in cases of mislabeled identity (different persons under same name), displaced facial landmarks or differences in facial appearance due to extreme poses, age, lighting etc. A high impostor score will indicate a mislabeled identity (same person under different names) or very similar facial appearances (e.g. twins, siblings). In order to ensure a correct evaluation, wrong landmarks were manually adjusted and mislabeled identities removed. In [5] the authors removed the VGGFace2 identity outliers by repetitively training/testing a deep network model, ordering images based on matching scores and removing the noise. We continued the cleaning of training dataset by implementing a highly effective graph clustering method [29, 30] based on DynFace matching scores. In each identity folder, the image templates were created, compared all against all with a matching threshold of 0.35, and built the identity clusters. Consequently, the biggest face cluster was kept as the true identity, while the rest were discarded (Fig. 8). On a second stage, a weight on each face was increased every time it matched with a very low score, bellow 0.15. Finally, the top 350 identity folders with most weight were manually reviewed to filter the remaining noise. Over 70,000 identity outliers were discovered and deleted from the VGGFace2 training dataset. Folders with extreme noise (e.g. n000671, n002535, n005321) were removed entirely. Also, most of Asian training identities were found to be severely noisy. This explains the large number of high impostor scores for Asian faces during LFW-ALL evaluation.

Fig. 8. Sample images of 4 clusters found under same identity n002078. The true-ID, which is the top cluster and most populated one, was kept

DynFace: A Multi-label, Dynamic-Margin-Softmax Face Recognition Model

549

This filtering method provides much cleaner identity images for further training sessions of a higher face recognition performance model.

6 Conclusions and Future Work In this paper, two strategies for boosting the performance of a deep network trained for face recognition have been proposed. First addition is the fusion of facial properties (pose in this case) at the feature level of last fully connected layer. Second method builds upon existing AM-Softmax loss. The resulting loss function is a dynamic margin controlled by pose values. This way, rotated faces will receive more weight in classification process. Comprehensive experiments on LFW-ALL and MegaFace datasets show that the proposed DynFace network has better accuracy than deeper networks trained with typical Softmax loss (VGG2), Contrastive loss (CASIA) or latest high-performance angular loss (SphereFace) and additive loss (CosFace). Finally, the trained DynFace model was used to remove the noise from test (LFW, MegaFace) and train VGGFace2 datasets. The presented filtering method delivered a clean dataset for next training sessions. Future work may include in training process new face attributes known to influence negatively face recognition performance such as age, expression, and lighting.

References 1. Taigman, Y., Yang, M., Ranzato, M.A., Wolf, L.: DeepFace: closing the gap to human-level performance in face verification. In: CVPR, pp. 1701–1708 (2014) 2. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: CVPR, pp. 815–823 (2015) 3. Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical Report, pp. 7–49 (2007) 4. Kemelmacher-Shlizerman, I., Seitz, S.M., Miller, D., Brossard, E.: The MegaFace benchmark: 1 million faces for recognition at scale. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4873–4882 (2016) 5. Cao, Q., Shen, L., Xie, W., Parkhi, O.M., Zisserman, A.: VGGFace2: a dataset for recognising faces across pose and age. arXiv:1710.08092 (2017) 6. Yi, D., Lei, Z., Liao, S., Li, S.Z.: Learning face representation from scratch. arXiv preprint arXiv:1411.7923 (2014) 7. Guo, Y., Zhang, L., Hu, Y., He, X., Gao, J.: MS-Celeb-1M: a dataset and benchmark for large-scale face recognition. In: European Conference on Computer Vision, pp. 87–102. Springer (2016) 8. Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: Web-scale training for face identification. In: CVPR, pp. 2746–2754 (2015) 9. Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: SphereFace: deep hypersphere embedding for face recognition. In: CVPR (2017) 10. Lepetit, V., Moreno-Noguer, F., Fua, P.: EPnP: an accurate O(n) solution to the PnP problem. Int. J. Comput. Vis. 81(2), 155–166 (2009)

550

M. Cordea et al.

11. Wang, F., Liu, W., Liu, H., Cheng, J.: Additive margin softmax for face verification. arXiv: 1801.05599 (2018) 12. Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 23(10), 1499–1503 (2016) 13. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675–678. ACM (2014) 14. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 15. Liu, W., Wen, Y., Yu, Z.: Large-margin softmax loss for convolutional neural networks. In: ICML (2016) 16. Rudd, E.M., Gunther, M., Boult, T.E.: MOON: a mixed objective optimization network for the recognition of facial attributes. In: ECCV (2016) 17. Sun, Y., Chen, Y., Wang, X., Tang, X.: Deep learning face representation by joint identificationverification. In: Advances in Neural Information Processing Systems, pp. 1988–1996 (2014) 18. Wen, Y., Zhang, K., Li, Z., Qiao, Y.: A discriminative feature learning approach for deep face recognition. In: European Conference on Computer Vision, pp. 499–515. Springer (2016) 19. Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W.: CosFace: large margin cosine loss for deep face recognition. Tencent AI Lab (2017) 20. Deng, J., Guo, J., Zafeiriou, S.: ArcFace: additive angular margin loss for deep face recognition. In: arXiv:1801.07698 (2018) 21. Wang, Z., He, K., Fu, Y., Feng, R., Jiang, Y.-G., Xue, X.: Multi-task deep neural network for joint face recognition and facial attribute prediction. In: Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, pp. 365–374. ACM (2017) 22. Ranjan, R., Sankaranarayanan, S., Castillo, C.D., Chellappa, R.: An all-in-one convolutional neural network for face analysis. In: Proceedings of the 12th International Conference on Automatic Face & Gesture Recognition (FG), Washington, DC, USA, pp. 17–24 (2017) 23. Ng, H.-W., Winkler, S.: A data-driven approach to cleaning large face datasets. In: IEEE International Conference on Image Processing (ICIP), pp. 343–347 (2014) 24. Fu, Y., Hospedales, T.M., Xiang, T., Gong, S., Yao, Y.: Interestingness prediction by robust learning to rank. In: European Conference on Computer Vision, pp. 488–503. Springer (2014) 25. Klontz, J., Klare, B., Klum, S., Burge, M., Jain, A.: Open source biometric recognition. Biometrics: Theory Appl. Syst. (2013) 26. https://github.com/happynear/AMSoftmax 27. http://www.robots.ox.ac.uk/*vgg/data/vgg_face2 28. https://github.com/wy1iu/sphereface 29. Biemann, C.: Chinese whispers: an efficient graph clustering algorithm and its application to natural language processing problems. In: Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing, pp. 73–80 (2006) 30. King, D.E.: DLib-ml: a machine learning toolkit. J. Mach. Learn. Res. 10, 1755–1758 (2009)

Towards Resolving the Kidnapped Robot Problem: Topological Localization from Crowdsourcing and Georeferenced Images

Sotirios Diamantas
Department of Engineering and Computer Science, Tarleton State University, Texas A&M System, Box T-0390, Stephenville, TX 76402, USA
[email protected]

Abstract. In this research, we address the kidnapped robot problem, a fundamental localization problem in which a robot has been carried to an arbitrary unknown location for which no prior maps exist. In our approach, topological maps are created using a single sensor, namely a camera, with the aim of localizing the robot and driving it back to its initial, home, position. Our approach differs significantly from other approaches to this localization problem. In order to localize a robot within an unknown environment, we exploit the potential of social networks and extract GPS information from georeferenced images. The experiments carried out within a university campus affirm the validity of our approach and provide the means for resolving similar problems with the methods presented.

Keywords: Kidnapped robot problem · Robot localization · Georeferenced images · Crowdsourcing

1 Introduction

The problem of estimating the location of a robot has engaged the robotics community for years, as it lies at the heart of robot navigation and mobile robotics. Various techniques have been developed and deployed throughout the years for resolving the localization and mapping problem. Most of the approaches tackle the localization problem within unknown environments and employ several sensors for building metric or topological maps. Our research is focused on building and interpreting topological maps with the aim of aiding a robot to return to its home position. In today's world, there is a plethora of information found on the web from sensors such as cameras, lidar, and others. Most modern digital cameras as well as smart phones are equipped with built-in GPS receivers. This research was carried out while the author was a postdoctoral researcher at the University of Nebraska at Omaha. The contribution by Prof. Raj Dasgupta is acknowledged along with the financial support of the U.S. Office of Naval Research grant no. N000140911174 as part of the COMRADES research project.


An image, therefore, apart from the visual information it carries, can provide geographical information as to the place it was taken. Moreover, there exists a multitude of image-sharing sites that contain such images. It is this information that we try to exploit in our research to solve the problem of location recognition and thereby the kidnapped robot problem. Furthermore, the information that social networks provide is enormous; it is only relatively recently that the scientific community has developed tools and methods to analyze the vast information deployed on social networks such as Twitter and Facebook. One portion of our method looks at the information a social network can provide in order to interpret and recognize images a robot perceives with its camera. Therefore, in our proposed research we try to solve the kidnapped robot problem by having a robot interact with a social network and at the same time extract information from web sites that hold a large number of images. We believe that our proposed approach to solving the kidnapped robot problem will provide inspiration for tackling other known problems in robotics such as search and rescue, path planning, and others.

In this paper we consider the problem of localizing a robot in an a priori unknown environment with a view to identifying the exact location of the robot and, at the same time, providing waypoints to drive the robot to its home position. In particular, the kidnapped robot problem refers to a situation where the robot has been moved to an unknown location for which no information about the environment is provided. For a human, localizing in an unknown environment, such as a city center, may be a trivial task: a human is able to read and interpret maps or visual cues, and may easily ask questions such as "How can I get to place x?" or "Where is this building on the map?". But for a robot, the semantic understanding of an image and, moreover, the identification of its position within a 3D map is certainly not a trivial task; it constitutes a fundamental problem in computer vision and image understanding. In our proposed method the localization of a robot could be achieved by providing an input to the robot as to where it is approximately located, e.g., in a city center or at a university campus, and then deploying methods for estimating its exact location. Nevertheless, in order to make things more realistic and challenging at the same time, we make no assumptions as to where the robot is located on a global map.

In order to resolve the initial localization problem of the robot within a map, we rely upon the power of the crowds. In particular, in our research the robot tries to imitate the actions a human would take in order to localize herself in an unknown environment. Therefore, the robot in our research is able to ask people, through a social network, where particular images have been taken. This task is carried out by the robot for the purpose of initializing its position in the environment. The response from the social network is used as input to a query to a database of images, in particular on Flickr, from which the GPS coordinates, i.e., longitude and latitude, stored in the EXIF data of the images are extracted.

The paper consists of five sections. In the following section, Sect. 2, related works are presented on the techniques used to solve the kidnapped robot problem; the related works section also identifies works that exploit the power of the crowds as well as applications with georeferenced images. In Sect. 3 we present the methodology followed for tackling the kidnapped robot problem by means of crowdsourcing for location initialization of a robot and a query search of images through Flickr. Section 4 presents the results obtained from our method. Finally, the paper concludes with Sect. 5 and a discussion of the conclusions reached from this research; the prospects of using collective data for solving other problems of interest in robotics research are also discussed.

2 Related Works

The kidnapped robot problem is considered one of the most difficult localization problems in robotics. Its difficulty lies in the fact that a robot is moved to an unknown location with no maps preserved of its surroundings. Situations in which a kidnapped robot problem occurs include cases where a robot has gone through a series of localization failures. In [1] the authors present a robust Monte Carlo Localization (MCL) method based on the Mixture-MCL algorithm. Their approach is tested on a series of localization problems, mainly the more difficult global localization and kidnapped robot problems, and the Mixture-MCL algorithm is shown to surpass regular MCL algorithms. Other Monte Carlo methods for solving the kidnapped robot problem appear in [2] and [3]. Also within the context of filter-based solutions is the work presented in [4], where the purpose is to first detect whether a kidnapping has occurred and then to identify its type. In an extended work, the authors in [5] present a method called probabilistic double guarantee kidnapping detection, which combines feature positions and the robot's pose; an Extended Kalman Filter and Particle Filters are employed to show that the proposed method can be applied in existing filter-based Simultaneous Localization and Mapping (SLAM) solutions. A purely image-based approach to the kidnapped robot problem is presented in [6], where 2D Scale-Invariant Feature Transform (SIFT) image features of 3D landmarks are stored prior to the kidnap of the robot; upon localization failure, the robot searches for the 3D SIFT landmarks and compares them with the online landmarks. Feature matching using Oriented FAST and Rotated BRIEF (ORB) as well as SIFT, along with Adaptive Monte Carlo Localization (AMCL), is presented in [7]. In [8] a solution to the kidnapped robot problem using a laser range finder and a Wi-Fi signal is presented: particles are sampled based on the Wi-Fi signal and a likelihood is calculated based on the laser range finder particles using probability density functions from both sensors. In [9], the authors present a method based on the entropy extracted from measurement likelihoods; their approach is tested in indoor environments. In [10] the authors present a solution to the localization problem, and in particular to the kidnapped robot problem, using string matching algorithms, similar to the work in [11] where visual features extracted from an environment are converted into strings.

Exploiting the potential of crowdsourcing is the first of the two main methods introduced in this research for tackling the kidnapped robot problem. The aim of this method is to localize a mobile robot shortly after the kidnapped robot phase has ended and the localization process is about to begin. The exploitation of the vast information crowds can provide to robots can be very beneficial and has been employed in an increasing number of robotics research projects. A well-known robot that interacts with humans through face recognition and natural language capabilities, and publishes information online, in particular on Facebook, is the FaceBot robot [12]. FaceBot, the first robot of its kind on a social network, is able to establish dialogues and to socialize and interact with humans; those relationships are further enhanced by a pool of shared memories embedded in a social network [13]. Another study that makes use of crowdsourcing appears in [14], where the authors employ crowdsourcing through the Amazon Mechanical Turk to enhance robot grasping of novel objects. The authors in [15] present a model that is based on the patterns of action and dialogue generated by online game players; these patterns from human-human interaction are evaluated on a robot operating and interacting with humans in a real-world environment, such as a museum. In another attempt, the authors in [16] use crowds in the loop to tackle maze-solving problems within a closed-loop control system.

Over the past decade an increasing number of technology companies have incorporated GPS receivers into their consumer products. As a result, GPS information exists either standalone or within images, i.e., in EXIF data. Image database querying and extracting GPS information from EXIF data comprise the second method of this research. The exploitation of these data for the purpose of locating an image or a landmark on a global scale has been considered by several researchers. In particular, in [17] the authors present one of the first attempts at estimating geographic information from images: they use a dataset of over 6 million images from Flickr in order to estimate the location of a single image, extracting GPS information and geographic keywords from the images on Flickr to locate the position of an image on a global scale. In a work based on the same dataset, presented in [18], the authors demonstrate a method for estimating geographic location based on the time stamps of the photographs. More specifically, their training algorithm is based on the GPS information of 6 million images; they can then geolocate an image by taking into consideration human travel priors, i.e., the temporal information across sequences of images. Their approach is modeled as a Hidden Markov Model variant using the Forward-Backward algorithm for inference of geographic tags. The geolocation of static cameras has been studied in [19], where weather maps provided by satellites are correlated with the video streams of Web cameras; two other localization schemes are presented in the same paper, namely the localization of a camera in an unknown environment and the addition of a camera to a region with many other cameras. In [20], the authors solve the problem of scene summarization by using multi-user image collections from the Internet. In an interesting work [21], the authors have gathered images from photo-sharing Web sites with the aim of co-registering online photographs of famous landmarks.


Fig. 1. An image of the mobile robot, Corobot, used in the experiments, along with a camera mounted sideways.

Their system provides 3D views of landmarks for browsing and provides photo tours of scenic or historic locations. The kidnapped robot problem also has similarities with visual place recognition approaches. In [22] the authors recognize locations captured by a camera-equipped mobile device; their approach is based on a hybrid image and keyword search technique and has been tested on a university campus, allowing users to browse web pages that match the image of a location captured by the mobile device. Similarly, in [23] the authors present a method where, given an image of an object, the objective is to retrieve all instances of that object from large databases.

3 Methodology

This section provides an overview of the developed system for tackling the kidnapped robot problem. In our research, we have deployed a Corobot mobile robot equipped with a Logitech C920 Web camera mounted on the side of the robot and a laptop with a CPU of 2.4 GHz/3.4 GHz (quad-core), 8 GB of RAM, 2 GB of graphics RAM, and Wi-Fi access. Figure 1 provides a snapshot of the mobile robot used in the experiments.

3.1 Query to Social Media

The kidnapped robot problem begins by transferring the robot outdoors, while no maps or other information are available apart from the initial home coordinates of the robot. The initial coordinates will be used by the robot to recover and eventually drive back to its home position. In contrast to other kidnapped robot approaches, we do not maintain any prior images of the environment.


As soon as the robot has moved to a completely unknown place, the first component of our method is activated. More specifically, the robot uses its camera to take various snapshots of its surroundings and, through a Wi-Fi network, uploads those images onto an online social network. We have created a Twitter account through which the robot is able to make posts via a MATLAB interface. As mentioned in the first section of this paper, the robot tries to localize itself by acting similarly to what a human would do in a similar situation. The robot first asks where those images have been taken and awaits a response from Twitter. It is important that the social network account is connected to various teams, organizations, and bodies, so that information can be provided fairly quickly.
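A minimal sketch of this posting step is given below, assuming Python with the tweepy client and valid Twitter API credentials; the robot in our experiments posted through a MATLAB interface, so this is only an illustrative equivalent, and the function name and placeholder credentials are ours.

```python
# Illustrative sketch only: the study posted through a MATLAB interface to Twitter.
# Assumes the tweepy library and valid API credentials (placeholder strings below).
import tweepy

def post_localization_query(image_path, question="Where was this photo taken?"):
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
    api = tweepy.API(auth)
    media = api.media_upload(image_path)                      # upload the robot's snapshot
    api.update_status(status=question, media_ids=[media.media_id])  # ask the crowd
```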

3.2 Image Matching Process and GPS Data Extraction

Our Twitter interface is responsible for the communication established between the robot and the crowd, that is, people connected to Twitter trying to help the robot establish an initial location. Therefore, as soon as a response is received from Twitter, it is used as a query for searching images. Flickr and Google Image Search are two of the most popular sites for eliciting images from the web; in our work we make queries to Flickr based on the responses received from the social network. This action activates the second major component of our method. Initially, the first 50 images returned by the query are downloaded and matched against the current perceived view of the robot. At this stage, we consider only the images that are GPS-tagged. For the image matching process we employ the SIFT algorithm [24]. So long as the matching process is successful, we then extract the longitude and latitude information stored in the EXIF data of the images. If, however, the matching process yields no results, we increase the number of images to be downloaded until a satisfactory match occurs. Upon completion of the image matching process we proceed to building a graph based on the GPS information of the images. The graph consists of Si nodes and Vi edges. Each node is weighted based on the time and date the image was taken. This weight gives us the ability to favor images that were taken recently and during daytime: we want to discard images taken, e.g., 10 years ago, since an environment is likely to have changed since that time, and we also wish the images to have been taken during daytime rather than during evening or early-morning hours, since matching against such images would be error-prone and, moreover, time-consuming. We have developed two functions for this, one for the date stamp and another for the time stamp of the images.
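As an illustration of the EXIF step described above, GPS coordinates could be read from a downloaded photograph roughly as follows. This is a sketch assuming the Pillow imaging library and a standard GPSInfo block; the helper name is ours, and depending on the Pillow version the rational values may need explicit numerator/denominator handling.

```python
# Sketch: read longitude/latitude from a photo's EXIF GPSInfo block (assumes Pillow).
from PIL import Image, ExifTags

def gps_from_exif(path):
    exif = Image.open(path)._getexif() or {}
    gps_raw = exif.get(34853)                      # 34853 is the GPSInfo tag id
    if not gps_raw:
        return None                                # image is not GPS-tagged
    gps = {ExifTags.GPSTAGS.get(k, k): v for k, v in gps_raw.items()}

    def to_degrees(dms, ref):
        deg, minutes, seconds = (float(x) for x in dms)
        value = deg + minutes / 60.0 + seconds / 3600.0
        return -value if ref in ("S", "W") else value

    lat = to_degrees(gps["GPSLatitude"], gps["GPSLatitudeRef"])
    lon = to_degrees(gps["GPSLongitude"], gps["GPSLongitudeRef"])
    return lat, lon
```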

Equation (1) expresses the relationship between the date an image was taken and the weight Wd assigned to it:

Wd = 1 / exp((D / Dh) · ln(2))    (1)


where D denotes the date, in days from the current date, on which an image was taken, and Dh denotes the number of days after which the weight falls to half, that is, the halftime. The halftime in this equation has been set to 365 days. In a similar manner, Eq. (2) gives the relationship between the weight assigned to an image and the time the image was taken:

Wt = 1 / exp((T / Th) · ln(2))    (2)

In Eq. (2), Wt denotes the weight, T is the time of day at which an image was taken, and Th is the halftime at which the weight equals 0.5; the reference point in this equation is 5 h from noon. The product of the two weights Wd and Wt gives the final weight assigned to each node Si in the graph (Eq. (3)):

Si = Wd · Wt    (3)

Equations (1) and (2) convey the exponential relationship between the weights and the date and time at which images are taken. Figure 2 shows an example of the relationship between nodes and edges of the graph: Fig. 2(a) shows a Google Maps image of the University of Nebraska campus in Omaha, with red circles depicting the locations at which images were taken, and Fig. 2(b) shows the corresponding graph representation. For finding the appropriate route from the current robot location to the initial (home) position we simply use the Euclidean distances between the nodes of the graph.


Fig. 2. (a) An image from Google Maps showing the university campus and its surrounding area; red circles (nodes) denote the locations at which images were taken. (b) The corresponding representation as a graph.
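A minimal sketch of the weighting and routing steps described above is given below, assuming Python; the timestamp arguments, node representation, and greedy nearest-neighbour walk are our own illustration of the simple Euclidean scheme, not code from the original system.

```python
# Sketch: node weights from Eqs. (1)-(3) and a simple Euclidean route between nodes.
import math

D_HALF = 365.0   # halftime for the date weight, in days (as in the text)
T_HALF = 5.0     # halftime for the time-of-day weight, in hours from noon

def node_weight(age_days, hours_from_noon):
    w_d = 1.0 / math.exp((age_days / D_HALF) * math.log(2))         # Eq. (1)
    w_t = 1.0 / math.exp((hours_from_noon / T_HALF) * math.log(2))  # Eq. (2)
    return w_d * w_t                                                # Eq. (3)

def route_home(nodes, start, home):
    """Greedy nearest-neighbour walk over (lat, lon) nodes using Euclidean distance."""
    path, current = [start], start
    remaining = set(nodes) | {home}
    remaining.discard(start)
    while current != home and remaining:
        current = min(remaining, key=lambda n: math.dist(current, n))
        remaining.discard(current)
        path.append(current)
    return path
```

For example, node_weight(365, 0) evaluates to 0.5, matching the one-year date halftime, while an image taken five hours from noon on the current day also receives a weight of 0.5.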

4 Results

This section describes the results obtained from the outdoor experiments carried out at the Peter Kiewit Institute at the University of Nebraska at Omaha. At first, a query is made on Twitter about an image the robot has taken at a starting point, and according to the response received a query to Flickr is performed. An example of the steps followed is shown in Fig. 3. At the top of the figure (first image), the current robot view is shown, which is posted on Twitter in order to recognize the robot's initial location. The second image shows the robot's interaction with a Twitter user. The responses from the user are used as input to query Flickr's photo-sharing database. The results retrieved from Flickr's database are depicted in the last image of Fig. 3. From the pool of images retrieved from Flickr, the ones that fulfill the criteria, i.e., georeferenced images, are selected (shown in green in the last row). Assigning weights to nodes, i.e., images, and calculating Euclidean distances among nodes are the remaining steps before finalizing the route to be followed.

Our outdoor experiments took place during daytime hours; for this reason, in Eq. (2) we have used noon as the reference point. Figure 4 shows a comparison of matching the same point of view at different times. In particular, Fig. 4(a) presents the outcome of image matching of a scene with images taken during day and night from the same point of view. The image matching process under different lighting conditions provides far less accurate results than when images are taken at similar times of day. Therefore, given that our experiments are carried out during daytime, we 'exponentially neglect' images taken at night hours and vice versa. Matching a scene during night hours, however, yields satisfactory results, as shown in Fig. 4(b). The equation parameters can therefore change according to the time images are captured.

The images downloaded from Flickr have been downsampled to 640 × 480 to match the image resolution of the camera used in the experiments. According to [24], lower-resolution images yield stronger descriptors; moreover, lower-resolution images are processed faster and are less computationally expensive than higher-resolution images during the matching process. For our experiments, the query for 'University of Nebraska' yielded several hundred images, whereas refining the search to 'Peter Kiewit Institute' yields tens of images. The process of matching every single downloaded image with the view perceived by the robot required the most processing power: for every image compared, about 3 s were required for processing, while our camera could capture approximately 30 images per second. Once there is a match between a perceived image and a downloaded image, the robot extracts GPS information from all downloaded images and builds a graph upon which the path is created, drawing at the same time on information from Google Maps. In our experiments, the route followed by the robot is shown in Fig. 5. The red line denotes the path followed by the robot, whereas the red circles with letters depict the points at which an image matching between the camera view and the downloaded images has been successful.
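For reference, the matching step could be reproduced with an off-the-shelf SIFT implementation; the sketch below assumes OpenCV (version 4.4 or later, where SIFT_create is available) and Lowe's ratio test, with a 0.75 ratio threshold that is our assumption rather than a value reported here.

```python
# Sketch: SIFT matching between the robot's current view and a downloaded image.
# Assumes opencv-python >= 4.4; the ratio-test threshold is our own choice.
import cv2

def count_sift_matches(robot_view_path, downloaded_path, ratio=0.75):
    img1 = cv2.resize(cv2.imread(robot_view_path, cv2.IMREAD_GRAYSCALE), (640, 480))
    img2 = cv2.resize(cv2.imread(downloaded_path, cv2.IMREAD_GRAYSCALE), (640, 480))
    sift = cv2.SIFT_create()
    _, des1 = sift.detectAndCompute(img1, None)
    _, des2 = sift.detectAndCompute(img2, None)
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [pair[0] for pair in matches
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]
    return len(good)   # number of matches surviving Lowe's ratio test
```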


Fig. 3. Flow of steps followed for initializing the robot's position after the robot has been carried away (kidnapped). At first, the current robot view is uploaded onto Twitter (upper image). Upon receiving a response from Twitter (middle image), this is used as input for querying Flickr's photo-sharing database. The results from the query (lower image) are further refined (i.e., georeferenced images, shown in the green rectangular grid) to be used for recognizing the robot's initial location.


Fig. 4. (a) Image matching between two images taken from the same standing point at different times of day, in particular during day and night, and (b) image matching between images taken at the same or similar times of day, in particular during night hours.


Fig. 5. The environment in which the experiments were carried out. The path traversed by the Corobot robot is shown in red. The red circles, accompanied by letters A to D, are the locations at which there was a successful match between a perceived robot view and the downloaded images from the Web. The points where matching has occurred contain GPS locations (longitude and latitude) extracted from the EXIF data of the downloaded images (image adapted from Google Maps). The starting point of the robot is denoted with 'X' whereas the final point is shown with an 'O'.

Figure 6 shows the results of matching current robot views with images retrieved from Flickr's photo-sharing database. The four different images correspond to the points A to D shown in Fig. 5. Figure 6(a) shows the first matching between the perceived robot image and the downloaded image, which takes place


Fig. 6. Recognition of locations at which a successful matching occurs between perceived robot views and GPS-tagged images. (a)–(d) correspond to points A to D shown in Fig. 5.


at point A. This is the figure in which the largest number of matches between the two images occurs. Following is Fig. 6(b), with only 4 matches due to the significantly different points of view (point B). In Fig. 6(c) location C is recognized, whereas in Fig. 6(d) the final point, D, is shown. As can be seen from the figures, well-identified locations (or landmarks) can be matched fairly accurately.

5 Conclusions and Future Work

In this research we presented a new approach to solving one of the most challenging problems in robot navigation, the localization of a kidnapped robot within an unknown and unstructured environment. It is our belief that localizing a robot with information found on the web is a challenging yet excellent topic for solving other similar problems of interest under the umbrella of cloud robotics, and, to our knowledge, this is the first attempt to solve this particular problem using social networks and georeferenced images. Our path planning model has been kept simple, i.e., going from one node to the next assumes no obstacles in the vicinity of the robot, as this domain is beyond the scope of this research and is left for future work. In spite of the contributions made in this research, there is still plenty of room for improvement on this particular problem. More specifically, this method can be coupled with traditional SLAM techniques, supported by additional sensors, with a view to creating metric maps of the environment, a task we plan to implement in the near future. Moreover, our approach can easily be used in multi-robot scenarios where mobile robots exchange information about their positions within a map and the state they are in (e.g., moving towards node 'x'). Further to this, the street view maps provided by Google are an additional interesting application to which our methods can be applied.

References

1. Thrun, S., Fox, D., Burgard, W., Dellaert, F.: Robust Monte Carlo localization for mobile robots. Artif. Intell. 128(1–2), 99–141 (2000)
2. Bukhori, I., Ismail, Z.H.: Detection of kidnapped robot problem in Monte Carlo localization based on the natural displacement of the robot. Int. J. Adv. Robot. Syst. 14(4), 1–6 (2017)
3. Bukhori, I., Ismail, Z.H., Namerikawa, T.: Detection strategy for kidnapped robot problem in landmark-based map Monte Carlo localization. In: Proceedings of the IEEE International Symposium on Robotics and Intelligent Sensors (IRIS), pp. 1609–1614 (2015)
4. Tian, Y., Ma, S.: Kidnapping detection and recognition in previous unknown environment. J. Sens. 2017, 1–15 (2017)
5. Tian, Y., Ma, S.: Probabilistic double guarantee kidnapping detection in SLAM. Robot. Biomimetics 3(20), 1–7 (2016)
6. Majdik, A., Popa, M., Tamas, L., Szoke, I., Lazea, G.: New approach in solving the kidnapped robot problem. In: Proceedings of the 41st International Symposium on Robotics (ISR) and 6th German Conference on Robotics (ROBOTIK), pp. 1–6 (2010)


7. Wei, L.: A feature-based solution for kidnapped robot problem. Ph.D. thesis, Electrical and Computer Engineering, Auburn University (2015)
8. Seow, Y., Miyagusuku, R., Yamashita, A., Asama, H.: Detecting and solving the kidnapped robot problem using laser range finder and wifi signal. In: Proceedings of the IEEE International Conference on Real-time Computing and Robotics (RCAR), pp. 303–308 (2017)
9. Yi, C., Choi, B.U.: Detection and recovery for kidnapped-robot problem using measurement entropy. Grid Distrib. Comput. 261, 293–299 (2011)
10. Gonzalez-Buesa, C., Campos, J.: Solving the mobile robot localization problem using string matching algorithms. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2475–2480 (2004)
11. Lamon, P., Nourbakhsh, I., Jensen, B., Siegwart, R.: Deriving and matching image fingerprint sequences for mobile robot localization. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 1609–1614 (2001)
12. Mavridis, N., Emami, S., Datta, C., Kazmi, W., BenAbdelkader, C., Toulis, P., Tanoto, A., Rabie, T.: FaceBots: steps towards enhanced long-term human-robot interaction by utilizing and publishing online social information. J. Behav. Robot. 1(3), 169–178 (2011)
13. Mavridis, N., Datta, C., Emami, S., Tanoto, A., BenAbdelkader, C., Rabie, T.: FaceBots: robots utilizing and publishing social information in Facebook. In: Proceedings of the IEEE Human-Robot Interaction Conference (HRI), pp. 273–274 (2009)
14. Sorokin, A., Berenson, D., Srinivasa, S.S., Hebert, M.: People helping robots helping people: crowdsourcing for grasping novel objects. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2117–2122 (2010)
15. Chernova, S., DePalma, N., Morant, E., Breazeal, C.: Crowdsourcing human-robot interaction: application from virtual to physical worlds. In: Proceedings of the 20th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pp. 21–26 (2011)
16. Osentoski, S., Crick, C., Jay, G., Jenkins, O.C.: Crowdsourcing for closed-loop control. In: Proceedings of the Neural Information Processing Systems (NIPS 2010) Workshop on Computational Social Science and the Wisdom of Crowds, pp. 1–4 (2010)
17. Hays, J., Efros, A.A.: IM2GPS: estimating geographic information from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2008)
18. Kalogerakis, E., Vesselova, O., Hays, J., Efros, A.A., Hertzmann, A.: Image sequence geolocation with human travel priors. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2009)
19. Jacobs, N., Satkin, S., Roman, N., Speyer, R., Pless, R.: Geolocating static cameras. In: Proceedings of the International Conference on Computer Vision (ICCV), pp. 1–6 (2007)
20. Simon, I., Snavely, N., Seitz, S.M.: Scene summarization for online image collections. In: Proceedings of the International Conference on Computer Vision (ICCV), pp. 1–8 (2007)
21. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: exploring photo collections in 3D. ACM Trans. Graph. (SIGGRAPH Proc.) 25(3), 835–846 (2006)


22. Yeh, T., Tollmar, K., Darrell, T.: Searching the web with mobile images for location recognition. In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. II-76–II-81 (2004)
23. Chum, O., Philbin, J., Sivic, J., Isard, M., Zisserman, A.: Total recall: automatic query expansion with a generative feature model for object retrieval. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1–8 (2007)
24. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)

Using the Z-bellSM Test to Remediate Spatial Deficiencies in Non-Image-Forming Retinal Processing

Clark Elliott1, Cynthia Putnam1, Deborah Zelinsky2, Daniel Spinner1, Silpa Vipparti1, and Abhinit Parelkar1

1 DePaul University, 1 DePaul Center, Chicago, IL 60604, USA
{elliott,cputnam}@depaul.edu
2 Mind-Eye Institute, 1414 Techny Rd, Northbrook, IL 60062, USA
[email protected]

Abstract. Preliminary evidence from a larger study is presented demonstrating that non-image-forming retinal processing takes place even through closed eyelids. The Z-bellSM test, which has been in clinical use for more than twenty years, shows that these processing channels affect how we perceive context in the space around us when forming visual imagery. By using therapeutic eyeglasses and pitched bells, we can measure changes in a subject's spatial processing and remediate deficiencies among non-image-forming neural channels that operate even in the low-light conditions produced by closed eyelids. Using what we know of both the top-down feedback filtering of retinal input triggered purely by aural signals and the characteristic difficulties that brain-injured patients have in organizing visual scenes (which the Z-bellSM test links to difficulties with non-image-forming retinal processing), it is argued that the non-image-forming retinal channels demonstrated in this study may be critical in any human-centric model of computer vision. Spatial coding as a basis for human cognition is also briefly discussed.

Keywords: Peripheral vision · Retina · Z-bell · Spatial cognition · Context

1 Introduction

In working to develop models of computer vision we most naturally focus on what humans see. Indeed, the label "computer vision" itself tends to limit a broader view of how humans translate retinal images into cognitive meaning. In this paper, preliminary evidence is presented for a collection of non-image-forming retinal pathways that set the context for the peripheral and central vision systems, and also contribute to the forming of 3D spatial images derived from what we hear. The data presented show that retinal spatial processing can be measured and altered through the filter of closed eyelids, which in turn suggests that any full human-based model of computer vision must account for retinal processing that takes place independently of the cognitive apprehension of visual artifacts. Our study is among the first attempts to gather scientific evidence for the efficacy of the Z-bellSM test, which is used to measure—and prescribe remediation for—deficiencies in this mode of sensory processing.


Deborah Zelinsky, O.D., F.N.O.R.A., F.C.O.V.D., is a recipient of the Founding Father's Award of the Neuro-Optometric Rehabilitation Association. She emphasizes neurodevelopmental optometric rehabilitation techniques in her clinical practice. Such techniques draw on a broad range of neuroscience findings and are based on the idea that retinal processing is a critical component of deeper brain processing, both at conscious—and also beneath conscious—levels. The Neuro-Optometric Rehabilitation Association (NORA) has listed more than 500 research papers relating to neuro-optometric processing at their website [1]. As part of her work, Zelinsky developed the Z-bellSM test to help her diagnose appropriate glasses for her patients who are experiencing post-concussion syndrome (PCS). The Z-bellSM test has been in clinical use for over twenty years, and is now in use in more than ten countries worldwide. Zelinsky estimates that it has been used as a diagnostic tool in the treatment of more than 4,000 PCS cases. Optometrists and other rehabilitation neuroscientists from around the world come to Chicago for training in using Z-bellSM testing in their practices.

Most optometrists mainly consider stimuli to the central eyesight retinal pathway; by contrast, developmental optometrists additionally focus on peripheral retinal processing. Peripheral processing can be divided into three components: the peripheral eyesight, the non-image-forming signals to the brainstem (for controlling such processes as posture and spatial awareness, and on which this paper focuses), and the non-image-forming signals to the hypothalamus (for controlling such processes as sleep cycles). Various optometrists emphasize each of these three portions of peripheral retinal processing. Zelinsky, and others who follow her work, extend these assessments by focusing on the diagnostic testing and selective stimulation of a set of non-image-forming retinal processes that she argues are critical for setting the context for both central and peripheral visual processes, and for integrating retinal processing with the internal visual 3D interpretation of audio input. The Z-bellSM test, which filters out eyesight processing, is used to test this aspect of retinal processing.

In the Z-bellSM test, a patient sits in a chair with her eyes closed throughout the duration of the test; the clinician rings a pitched bell in an area on each side of the patient (slightly to the front), asking the patient to reach out and touch the bell. The patient (with eyes closed throughout) is fitted with various prescription glasses (possibly including tinted lenses and glasses with partially translucent occlusions on the lenses) and the bell ringing is repeated. Different lighting conditions, different postures, and differently pitched bells are typically used in the test. If the prescription glasses are effective in correcting non-image-forming retinal processing problems, the patient's ability to reach and touch the bell improves. The key to this test is that the closed eyelids filter out the light used for eyesight processing, but let through enough light to trigger retinal responses in the non-image-forming retinal processes (which can operate at very low light levels). Anecdotal clinical evidence—in the form of many PCS patients who have improved—suggests that the Z-bellSM test, and the remediation of problems with non-image-forming retinal processing as part of neurodevelopmental rehabilitation, have been effective in the practices that use it.
The principles of the non-image-forming retinal processes have also been discussed in the literature [1, 2]. But there is an absence of empirical evidence about the kinds of processing the Z-bellSM test measures.


Such evidence may provide further insight to support models of retinal processing that can help us build working systems that can be implemented on computers. Humans use cognitive mapping to form 3D spatial representations that are the basis of thought [3, 4]. We access this cognitive machinery when we form the symbols of thought (and the relationships between them), and when we interpret real-time input from our retinas as well as from our ears and our proprioceptive pathways. In this way we can consider one destination of all this input (including abstract creative input) to be the 3D visual/spatial processing system. But an important piece of input that sets the context for knowing who we are, and where we are in space, is often overlooked. This non-image-forming input comes through the retinas but branches off to the brain stem and elsewhere before reaching the visual cortex. It affects visual processing, and it also alters the way we feel and hear the world around us. The theory behind our interpretation of the following evidence is that closed eyelids filter out light to the peripheral and central visual systems so that, as humans, we can no longer see, but enough light passes through the closed-eyelid filters to affect how we form internal images based on our 3D interpretation of how we hear the world around us, how we form spatial images of that world, and how we reach for objects located within that 3D space.

In this study the following three hypotheses were explored: (1) that the Z-bellSM test shows a repeatable pattern in how close a subject comes to touching various bells in space, in various (light, bell-tone) conditions; (2) that this pattern is repeatable with different non-communicating administrators (i.e., bell-ringers); and (3) that the pattern is altered in repeatable ways with various prescriptions (i.e., helpful versus non-helpful versus null prescriptions). In the following sections, our methods and our preliminary findings with 14 participants are summarized.

2 Methods

The study was approved by DePaul University's Institutional Review Board. The data collection team included Dr. Deborah Zelinsky, a neurodevelopmental optometrist, three graduate student research assistants (RAs), and DePaul professors Clark Elliott and Cynthia Putnam, who acted as project supervisors. Prior to the study, Zelinsky worked with the three RAs to teach the bell-ringing techniques; the RA team practiced administering the test for five hours prior to conducting the first experiment. Elliott worked with the RA team on precise methods to guarantee absolute isolation in the knowledge of which glasses were which, making tester bias and inter-administrator telegraphing impossible.

When the Z-bellSM test is observed in clinical settings, it can be striking how differently clinical PCS (and other) patients reach for the bell with prescription correction for the non-image-forming retinal processes compared to how they reach for the bell without correction—including with repeated glasses-on/glasses-off conditions. In designing this study to use primarily healthy young subjects and 96 bell-rings for each, it was understood that some of these dramatic effects would be lost. (Healthy young brains can adapt even within a few minutes.) Additionally, a trained clinician will find the "sweet spot" in the space around a patient that highlights differences in the two prescription conditions for that patient, whereas the RAs in this study used the same positioning throughout. Nevertheless, this model was chosen to guarantee absolute isolation of the effects. The study does, after all, support the claim that all people can "see" (perceive) through closed eyelids, and that this has important and measurable effects on cognitive appraisals of the world around us. It was felt that accepting likely smaller effects, by using a general subject population and rigid bell positioning, in exchange for an iron-clad study design was a reasonable compromise. In the next sections we discuss our participants, our data collection, and our data analysis methods. During testing there were always at least five researchers present overseeing the environment.

2.1 Participants

Forty-one participants were recruited between September and November 2017. Most participants (n = 38) were graduate and undergraduate students (mean age = 25.9 years, sd = 6.3 years) from the College of Computing and Digital Media at DePaul University; students were recruited through an online 'participant pool,' which allows DePaul students to gain extra credit in courses. Among the 38 student participants, 22 had uncorrected vision and the remaining 16 wore corrective glasses or contact lenses. Three additional participants were recruited by Zelinsky; all three had experienced a brain injury. They ranged in age from 57 to 73 (mean 64.6), and all had uncorrected vision. Testing took place on five different dates. At the time of this paper's submission, 14 of the 41 sessions had been analyzed, all with student participants (mean age = 27.6 years, sd = 6.7 years). Among those 14 participants, 9 had uncorrected vision and the remaining 5 wore glasses or contact lenses.

2.2 Data Collection

Overview
All trials were conducted in the School of Computer Science building at DePaul University. Three physical stations were used in the study: an isolated foyer area, a clinical testing room, and an empirical testing room located across the hallway. None of the activity in any of the areas could be seen or heard from either of the others, though the hallway between them was shared. None of the three stations had external windows, so there were no complications from outside light. The three stations were typically used simultaneously for the series of participants, with each participant moving from one station to the next in sequence. The clinical testing room and the empirical testing room each had identical five-bulb floor lamps used for illumination (bright light and dim light conditions), and pitch- and timbre-identical sets of hand bells (pitched high and low). The clinical testing and empirical testing sessions were all recorded on video.

Two sets of two pairs of optometric frames were used (four frames). One set of frames was marked A and B (under paper flaps that hid the marking from casual view); the other set of frames was marked C and D. The frames held temporary lenses as randomly determined by Zelinsky, and she was the only person who ever knew whether a frame (A, B, C or D) held lenses intended to improve or impair retinal processing for a particular subject. Prescription lenses and (occasional) translucent partial occlusions were used; although the Z-bellSM test typically also works with tints, they were not used in this study because of concerns about telegraphing which lenses were which. Zelinsky's clinical notes, which included which lenses and occlusions were used, were sealed until data collection was completed. Zelinsky had no access to the data or results until after her clinical notes were formally entered in the study archive and data collection for all subjects was completed.

The three RAs rotated among three roles. The first RA role was the "glasses chooser," responsible for retrieving the box with the prepared glasses in it, bringing it along with the participant from the clinical testing room to the empirical testing room, and selecting the order in which the glasses would be tested based on rolls of a die. (E.g., for two pairs with markings [hidden under the flaps] of A and B we might get the order A, B, B, A—but also see below.) This role was also responsible for handing the succession of two glasses (four trials) to the tester in the right order so that the tester never knew what the marking was on the frames. (And note that knowing the marking would still not give any information about the respective prescriptions.) Lastly, this role also watched to ensure that the participant's eyes were closed throughout the Z-bellSM testing. The second RA role was that of "bell-ringer," who, in addition to ringing the bells, was also responsible for communicating with the participant and giving instructions about when to reach for the bell. Communication additionally involved handing the optometry glasses to the participant, instructing participants to keep their eyes closed and their feet flat on the ground, and announcing for the recording which light and bell condition was being tested for which participant number. The third RA role was responsible for oversight, which included assuring that a participant's eyes were closed throughout and that all of the conditions (explained in the next sections) of the experiment were completed. Additionally, the third RA recorded a real-time casual assessment of how close the participant was to the bell for each condition. These data have not yet been coded.

Pre-processing
Elliott or Putnam greeted participants in the isolated foyer and explained the informed consent procedures. Participants were assigned a unique number chosen by drawing a name-tag from a box. They completed a pre-study questionnaire of fourteen questions relating to age, gender, prior head injuries, sleep and study preferences, organizational habits relative to ADHD, etc. The questionnaire is included as an appendix. The questionnaire data have not yet been processed relative to the Z-bellSM clinical test results or the experimental test results, and that analysis is not included in this paper.

Clinical Evaluation
Participants were taken to the clinical testing room, along with their completed questionnaire, each identified only by the tag number. Zelinsky scanned each questionnaire to reduce the time it took her to determine therapeutic eyeglass prescriptions that would (a) improve the participant's ability to touch the bells (accurate non-image-forming retinal processing prescription) and (b) worsen the participant's ability to touch the bells (impaired prescription)—with their eyes closed.
Zelinsky then administered the Z-bellSM test for the participant, trying various sets of prescription lenses in two light conditions, while taking clinical notes for later use. She determined the participant's two prescriptions (accurate and impaired), ultimately setting up the prescriptions in wearable optometry frames (i.e., creating eyeglasses that accepted temporary lenses of different prescriptions), which were labeled either A and B or C and D. (The two sets were used to keep the eyeglass sets sorted out during the simultaneous use of the clinical testing room and the experimental testing room.) The determination of which label was used (e.g., randomly matching "A" with "impaired prescription") was made prior to the testing of the subject and entered into the clinical testing notes. When testing was complete, the two pairs of appearance-identical, labeled glasses were placed in a closed box. The box and the participant were then taken to the experimental testing room by the "glasses chooser" RA.

Experimental Testing
The RAs performed six cycles of bell ringing in two rounds of three. Cycles (1) and (6) were the baseline (neutral) condition, in which participants wore optometry glasses with clear lenses (no prescription) so that they presented as appearance-identical to the two test prescriptions. All the RAs knew that cycles (1) and (6) were with clear lenses, but the participants did not. Cycles (2) and (3), and independently cycles (4) and (5), each contained a pair of randomly assigned accurate and impaired prescriptions (one of each). The experimental conditions (accurate vs. impaired) were randomly chosen by the RA performing the "glasses chooser" role, who used a rolled die to determine the order (accurate vs. impaired) for each of the four middle bell-ringing cycles. The design choice to use two random accurate/impaired sets rather than one random quad was made to reduce adaptation by—and the effects of testing fatigue on—the subject. See Table 1 for the four possible prescription assignment sequences for cycles (2)–(5).

Table 1. Four possible bell ringing sequences.

Round  Cycle     Sequence 1  Sequence 2  Sequence 3  Sequence 4
One    Cycle 1   Neutral     Neutral     Neutral     Neutral
       Cycle 2   Accurate    Accurate    Impaired    Impaired
       Cycle 3   Impaired    Impaired    Accurate    Accurate
Two    Cycle 4   Accurate    Impaired    Accurate    Impaired
       Cycle 5   Impaired    Accurate    Impaired    Accurate
       Cycle 6   Neutral     Neutral     Neutral     Neutral
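As an illustration of the counterbalancing described above, the die-roll assignment of the four middle cycles could be simulated as follows; this is a sketch in plain Python, and the function and string encoding are ours rather than part of the study protocol.

```python
# Sketch: reproduce the random accurate/impaired ordering of cycles 2-3 and 4-5,
# yielding one of the four sequences of Table 1 (encoding is illustrative).
import random

def draw_sequence():
    round_one = random.choice([("Accurate", "Impaired"), ("Impaired", "Accurate")])
    round_two = random.choice([("Accurate", "Impaired"), ("Impaired", "Accurate")])
    return ("Neutral",) + round_one + round_two + ("Neutral",)  # cycles 1-6
```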

Each bell-ringing cycle included 16 bell rings covering two light conditions (dim vs. bright) and two bell-tone conditions (low vs. high). The "bell-ringer" RA located the bell rings in each of four quadrants: (quadrants 1 and 2) upper left and upper right—approximately aligned to the participant's knees and slightly above their shoulders; and (quadrants 3 and 4) lower left and lower right—approximately aligned to the participant's knees and waist. Thus, for each (accurate or impaired) cycle the bell was rung as follows. Each 16 rings: 8 rings in low light, 8 rings in bright light. Each 8 rings: 4 rings with the high-pitched bell, 4 rings with the low-pitched bell. Each 4 rings: 1 in each spatial quadrant. As such, for each participant, data was collected for 96 bell rings (6 cycles × 16 rings). The sessions were videotaped (30 fps) from the side (for determining the y and z measures) and the back (for determining the x measures); we therefore had a total of 288 measures (three dimensions × 96) per participant. Participants' eyes were closed throughout testing during the experimental testing phase.

Data Analysis
For each session, the back and side videos were imported into an Adobe After Effects editor in order to render them as a combined sequence. A grid was placed over each video that equated to an approximate ½-inch lattice for the combined renders; see Fig. 1.

Fig. 1. Rendered image.

Additional graduate student RAs (who received independent study credit) were enlisted to assist with data analysis; they were not involved in the data collection. Combined renders (back + side with overlay grids) were used to identify the one second of video (i.e., the 30 frames) in which the participant made their first forward movement towards pointing to the bell. Those thirty frames were then rendered as still images, and a single frame was determined capturing the point at which the participant ended their initial forward movement to the bell. (This technique mitigated issues with participants who waved around after not touching the bell initially.) Once the frame marking the end of the participant's forward motion was isolated, measurements were made of the x, y, and z distances from each participant's finger to the nearest point on the bell. That is, by counting the intervening squares in the overlay grids, distances were calculated for the x (right-to-left), y (up-down), and z (in-out towards the body) dimensions.

Some problems were encountered with estimating the x-distances for two reasons: (1) at times the bell ringer positioned the bell too low, so the camera (from the back) could not see the bell, and (2) the camera's auto-focus occasionally malfunctioned in the bright light condition. This occurred 34 times—about 5% of the bell-rings; as such, the decision was made to exclude the x-distance data. This approach is justified because there were minimal variances in the x-distances (as far as it was possible to measure them) when compared to the y and z distances—probably due to the alignment of the bell rings to the participants' knees and shoulders. The variances (sd²) for the first 14 participants, which this summary includes, in each dimension are as follows: (a) for x = 15.2 (mean = 3.2, sd = 3.9), (b) for y = 45.0 (mean = 5.9, sd = 6.7), and (c) for z = 31.0 (mean = 4.2, sd = 5.6).

For each bell-ring, the y and z distances were averaged (using approximately ½-inch units); we also noted if the participant touched the bell and assigned a subjective 'confidence' score from 1–3. A confidence score of 1 was assigned to bell-rings where the participant was tentative (e.g., waved around a lot), and a 3 was assigned when participants pointed directly to where they felt the bell was (regardless of accuracy). Five measures were then created for each participant per bell-ringing cycle: (1) an average of the y and z distances (in approximately ½-inch units) when wearing the neutral, accurate, and impaired glasses prescriptions; this equated to two measures for each prescription because each prescription was tested twice—see Table 1; (2–3) averages of the y and z distances under the dim-light and bright-light conditions for the corrected versus impaired prescriptions; and (4–5) averages of the y and z distances under the high-bell and low-bell conditions for the corrected versus impaired prescriptions.

Six hypotheses were tested in this preliminary evaluation:
• H1 and H2: Participants will point closer to the bell (smaller distances) in the accurate prescription eyeglass condition when compared to the neutral (H1) and impaired (H2) eyeglass prescriptions, though their eyes are closed throughout the testing;
• H3: Participants will point farther from the bell (larger distances) in the impaired condition when compared to the neutral (and accurate) prescriptions;
• H4: Lighting (bright versus dim) will not have any significant effect on participants' performance when comparing the accurate versus the impaired prescription;
• H5: Bell tone (high and low) will not have any significant effect on participants' performance when comparing the accurate versus the impaired prescription;
• H6: There will be no significant changes from the first cycle (wearing neutral glasses) to the last bell-ringing cycle, indicating minimal learning and/or fatigue.

For this initial analysis we evaluated our hypotheses through paired t-tests using SPSS version 22.
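The analysis itself was run in SPSS; an equivalent computation in Python, assuming NumPy and SciPy and using one common convention for paired-design effect sizes, would look roughly like this (the function and variable names are illustrative).

```python
# Sketch: paired t-test and Cohen's d for a comparison such as neutral vs. accurate.
# The study used SPSS v22; this SciPy version is only an illustrative equivalent.
import numpy as np
from scipy import stats

def paired_comparison(dist_a, dist_b):
    a = np.asarray(dist_a, dtype=float)     # per-measure combined y-z distances, condition A
    b = np.asarray(dist_b, dtype=float)     # matched distances, condition B
    t, p = stats.ttest_rel(a, b)            # paired t-test
    diff = a - b
    cohens_d = diff.mean() / diff.std(ddof=1)   # effect size on the paired differences
    return t, p, cohens_d
```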

3 Preliminary Findings (n = 14)

In this section, the findings for our six hypotheses for the initial 14 participants (28 data points) are presented.

3.1 Prescription Differences

Recall that three paired t-tests were used to explore whether there were differences in the combined y-z distances when comparing the neutral, accurate, and impaired prescriptions.
• A statistically significant difference was found when comparing the distances in the neutral-accurate condition, t(27) = 2.12, p < .05, d = .404 (a small to medium effect). The combined average distance for the neutral prescription was 5.06 units (sd = 2.33; recall that units were approximately 1/2 inch), while the average distance for the accurate prescription was significantly smaller (4.42 units, sd = 2.31), indicating that the accurate glasses improved participants' abilities to locate the bell in space.
• The difference between the accurate and impaired glasses narrowly failed significance, t(27) = 2.12, p = .073; the combined average distance for the impaired prescription was 5.06 (sd = 2.72).
• Finally, the difference between the impaired and neutral glasses was not significant (t(27) = −0.009, p > .05); they had almost equivalent combined average distances.

Effect of Light Conditions

Recall, a paired t-test was used to explore whether the two different light conditions (dim versus bright) had any effect on the average combined y–z distances in both the accurate and impaired conditions.
• Brightness of light did not appear to affect participants’ accuracy. Differences were not significant between the bright and dim light conditions when wearing the accurate prescription (t(27) = 0.71, p > .05) or the impaired prescription (t(27) = 0.29, p > .05).

3.3 Effect of Bell Conditions

Recall, a paired t-test was used to explore whether the two different bell conditions (high versus low) had any effect on the average combined y–z distances in both the accurate and impaired conditions.
• Bell tone did not appear to affect participants’ accuracy. Differences were not significant between the low and high bell-tone conditions when wearing the accurate prescription (t(27) = 0.47, p > .05) or the impaired prescription (t(27) = 1.14, p > .05).

3.4 Learning/Fatigue Effects

Recall, we used a paired t-test to compare the average combined y–z distances from the first baseline (i.e., the first bell-ringing cycle, with the neutral prescription) to the last baseline (the last bell-ringing cycle) in order to explore whether there were any learning or fatigue effects.


• There did not appear to be significant learning or fatigue effects (t(27) = 1.17, p > .05). The distances were slightly larger for the first baseline (mean = 5.4, sd = 2.3) than in the final bell-ringing cycle (mean = 4.7, sd = 2.4).

4 Discussion

Our findings indicated that, with eyes closed throughout, accurate therapeutic lens prescriptions for non-image-forming retinal processing significantly improved participants’ abilities to locate the bell in space when compared to the neutral prescriptions; participants also tended to perform better when wearing the accurate prescription as compared to the impaired prescriptions, although this difference narrowly missed significance. However, there was not any decrease in performance when comparing the impaired and neutral prescriptions. Combined, this indicated that the Z-bellSM test could accurately assess a prescription to improve non-image-forming retinal processing but was not as successful at assessing a prescription that would worsen non-image-forming retinal processing. We also found that neither light nor bell tone significantly changed participants’ accuracy in locating the bell in space. Finally, there did not appear to be any learning and/or fatigue effects (though it is our intuition that short-term adaptation probably did take place in these primarily healthy brains, starting as early as the clinical testing; see below).
Central visual acuity is often tested as 20/20 in many brain-trauma patients (corrected as needed by prescription eyeglasses), and indeed the visual part of retinal processing is most often studied when looking at feedforward and feedback systems for scene understanding [5]. But many such trauma patients still suffer mild to extreme deficits in being able to interpret the world around them, in organizing their thoughts [6], and in effecting normal movement through space [7]. External (and internal) visual scenes can easily deteriorate into a flattened collage of unrelated features—yet at such times PCS sufferers will still be able to describe the details of a visual scene with great accuracy, and their (possibly corrected) central eyesight will also still test at 20/20 [8, 9]. During these “context” breakdowns, such people may find it necessary to reconstruct meaning via intentional internal dialogs. For example, holding a printed page of this article in front of them, and while fully capable of reading, they might say to themselves, “OK. This is a white, flat object. I know that I know what it is used for, so I just have to retrieve that information. It has ninety-degree angled corners so it forms a rectangle. But the fact that it is a rectangle is not functionally important. It is flat and flexible and there are a front and a back side. It has printed writing on the front side. There is a top-to-bottom orientation to the writing, and currently the top is more important than the bottom…”.
A neurodevelopmental optometrist will test many non-central eyesight processes to find which parts of context-setting for (peripheral and) central eyesight are either not working correctly, or are not being integrated sufficiently with other processes. When deficiencies are identified, these can sometimes be remediated using therapeutic


eyeglasses (of the kind described in this article) and other interventions, in the same way that central eyesight can be corrected. These processes, including the non-image-forming retinal processes, are important in scene understanding. Furthermore, Bellmund et al. [3] have viewed hippocampal–entorhinal place- and grid-processing mechanisms from a cognitive perspective to propose a neuroscience-based spatial representation for cognition. This approach is consistent with the complex difficulties in cognition described in the ten-year case study of PCS and retinal-based recovery described by Elliott [8]. So, for a full human-centered model of computer vision (including the internal spatial representation of cognition), we will also need to model the kind of spatial context-setting produced by the non-image-forming retinal processes.
Important questions arise: (1) “How do we model non-image-forming retinal processing, such that bending the light entering the retinas (e.g., using therapeutic eyeglasses)—while simultaneously filtering out visual image processing—alters the 3D internal visual/spatial representation of what we hear?” (2) “How do we model the relationship between the 3D world we see and the 3D world we hear, based on non-image-forming retinal processing?” and lastly, (3) “How do these non-image-forming retinal processing systems help humans to set the context for the integration of disparate objects into a cohesive visual/spatial scene?”
Using fMRI, Vetter et al. [10] have shown that top-down cognitive pre-filtering of visual scenes depends in part on how we hear the world. They argue that because feedback filtering of input visual signals far outnumbers the feed-forward inputs themselves, the “nonretinal influence of an early visual cortex” must critically be studied. They describe the use of auditory signals to bias visual input interpretation. But our data here suggest this is not a one-directional influence, because non-image-forming retinal inputs also bias the way we hear.
Some approaches to computer vision have hypothesized that we limit the search space for the interpretation of senses through various forms of early visual processing that return, e.g., basic visual features such as orientation, contrast and simple shape. Pylyshyn [11] has even argued that this takes place in a cognitively impenetrable module. Whether or not this takes place in a separate module, this functionality is certainly a necessary component of later cognitive interpretation. But the Z-bellSM data suggest there is also a purely spatial (non-image) component to this early (possibly cognitively impenetrable) processing that relies on retinal input and helps set the context for how we interpret the 3D world around us. We might suppose that this argues in favor of Pylyshyn’s theorized cognitively impenetrable early-vision module. Such a module would then also provide generalized, chunked information for the integration of retinal processing with the 3D spatial interpretation of aural images from our hearing system as well.


5 Limitations and Future Work

This initial analysis was conducted as a preliminary query into what we might find; obviously, with only 14 participants (28 data points) there are limits on the statistical power available to detect significance. However, these early findings are extremely promising, and it is a reasonable expectation that with the increased power of the additional 24 student participants (48 more data points) we will find more definitive evidence of the Z-bell’s ability to assess an accurate prescription to improve non-image-forming retinal processes. Our preliminary results are consistent with clinical practice, and together these suggest the need for expanded models in computer vision research.
The inability to reliably find prescriptions that worsened non-image-forming processing could be related to the fact that all of the participants represented in these preliminary results reported themselves free of brain injury and were also relatively young when compared to the entire adult population. This may indicate that younger, non-injured healthy people have an ability to rapidly adapt to the impaired prescriptions; i.e., these young subjects may already have begun to significantly adapt to the experimental lenses by the end of the prescribing phase, before the testing phase began. Along with evaluating the data from the remaining 24 student participants, we will also evaluate the three participants who had a brain injury; however, because the samples are so small, we will only be able to do so descriptively. In future work, we would like to include a larger sample of people who have had brain injuries of different types [12] to explore how they might differ from people who have not had a brain injury, as this may eliminate the quick adaptation and give us further insight into how healthy brains process non-image-forming retinal input.
In future similar studies it seems likely that using only the therapeutic prescription and baseline non-correcting lenses, limiting the number of bell-rings, and restricting the total time for the testing of each subject to just a few minutes will show stronger results. In addition to using prescriptions and occlusions, treatments using differently colored tints might also show improved results. Lastly, we plan on assessing how participants’ answers to the pre-study questionnaire (see Appendix) were associated with their performance in our study, and how questions regarding dispositional attention style may associate with clinical readings of the health of the non-image-forming retinal processing.


Appendix—Pre-study Questionnaire

ID #__________

Participant Initials ________  Date ________  Time ________

Do you wear glasses?   YES   NO
Do you wear contacts?   YES   NO
Birthday: Month _______ Year ________
Have you had (i.e., aware of) a head injury in the past?   YES   NO
If yes, can you provide some detail below including the date and type of your head injury.

Which statement best describes your sleep in the last six months? (Pick one)

o I fall asleep as soon as I'm in bed and sleep solidly all night.
o I have trouble falling asleep, because my mind won't turn off.
o I fall asleep easily, but can't stay asleep through the night.

Which statement best describes your waking from sleep in the last six months if you do NOT set an alarm? (Pick one)

o I wake up as soon as light comes into the room.
o No matter what time I go to sleep, I wake up at the same time each morning.
o I wake up whenever my body is rested; the light doesn't bother me.

Which statement best describes how you fall asleep in the last six months? (Pick one)

o I can easily fall asleep anywhere.
o I have trouble falling asleep if not in my own bed.
o I can easily fall asleep as long as there is no noise.

If you are involved in something you enjoy, do you forget to eat? In other words, would you NOT be aware of hunger pangs? (Pick one)

o Yes, I would forget to eat and not feel hunger.
o I would be aware of the hunger, but choose to ignore it.
o I would be aware of the hunger and have to stop what I was doing because my body has to eat.

Which statement best describes your current morning eating habits? (Pick one)

o I love having a big breakfast.
o I can't eat much in the morning.
o I'm not really too hungry when I wake up, but I eat because I'm supposed to eat.

Which statement best describes how you think? (Pick one)

o I prefer learning the big picture first. Details can come whenever.
o I prefer learning in small steps. Learning details first helps me organize information more accurately.

Which statement best describes how you learn? (Pick one)

o I learn best when I have a hands-on task.
o I learn best when I can listen to instructions.
o I learn best when I can watch an example.

How do you feel about clutter? (Pick one)

o I can ignore visual clutter around me and still concentrate.
o I need to have a clean desk in order to concentrate.
o I can tolerate visual clutter, but prefer a neat desk in order to concentrate.
o I actually concentrate better when there is clutter surrounding me.

How do you feel about auditory (sound) clutter? (Pick one)

o I can ignore auditory clutter around me and still concentrate.
o I need to have sounds (such as music in the background) in order to concentrate.
o I can tolerate surrounding sounds, but prefer quiet in order to concentrate.
o I cannot study well unless the room is silent. Sounds distract me, as I cannot tune them out.
o I actually concentrate better when there is noise surrounding me (such as a noisy restaurant).

Which statement best describes your relationship to rules: (Pick one)

o I object to silly or unfair rules and tend not to follow them.
o Rules can be annoying but I’ll usually follow them if they make sense.
o I follow rules.
o I like the order that rules provide and rarely have problems with them.

Which statement about singing best describes you? (Pick one)

o I am a poor singer, who cannot sing on key.
o I am a good singer who can maintain my part and stay on key.
o I am a great singer who can maintain my harmony even when someone next to me is singing a different tune.

Which statement best describes how distracted you are when you work? (Pick one)

o I am easily distracted when I work. I have trouble focusing on any problem for long.
o I prefer an environment with few distractions because otherwise sometimes I get distracted.
o I don’t really notice getting distracted more than others do.
o I can tolerate some chaos in the environment and can usually focus well anyway.
o I am calm and focus well even when working on poorly defined problems. A chaotic environment around me doesn't matter.

What goes on in your head when reading fiction? (Select all that apply)

o I visualize the characters.
o I hear a narration by a narrator.
o I hear my own voice speaking for the characters.
o I hear the characters speaking.
o I both see and hear the characters.
o I just see the words on the page.

References
1. NORA: Neuro-Optometric Rehabilitation Association. https://noravisionrehab.org/, https://noravisionrehab.org/healthcare-professionals/bibliography. Accessed 27 Nov 2018
2. Super, S.: Brain imaging and therapeutics congress: key points & extended bibliography. In: Elliott, C. (ed.) The 15th Annual World Congress of Society for Brain Mapping & Therapeutics, Proceedings of the Workshop on Neuro-optometry, Neuromodulation and Cognition, pp. 59–73 (2018)
3. Bellmund, J., Gardenfors, P., Moser, E., Doeller, C.: Navigating cognition: spatial codes for human thinking. Science 362(6415) (2018)
4. Balkenius, C., Gärdenfors, P.: Spaces in the brain: from neurons to meanings. Front. Psychol. 7, 1820 (2016)
5. Kafaligonul, H., Breitmeyer, B., Ogmen, H.: Feedforward and feedback processes in vision. Front. Psychol. 6, 279 (2015)
6. Markus, D.: Designs for strong minds’ cognitive rehabilitation for mild and moderate posttraumatic head injuries. Phys. Med. Rehabil. Clin. N. Am. 18(1), 109–131 (2007)
7. Elliott, C.: The brain is primarily a visual-spatial processing device: altering visual-spatial cognitive processing via retinal stimulation can treat movement disorders. Funct. Neurol. Rehabil. Ergon. 7(3), 24–38 (2017)
8. Elliott, C.: The Ghost In My Brain: How a Concussion Stole My Life and How The New Science of Brain Plasticity Helped Me Get it Back. Viking Penguin, New York (2015)
9. Zelinsky, D.: Neuro-optometric diagnosis, treatment and rehabilitation following traumatic brain injuries: a brief overview. Phys. Med. Rehabil. Clin. N. Am. 18(1), 87–107 (2007)


10. Vetter, P., Smith, F.W., Muckli, L.: Decoding sound and imagery content in early visual cortex. Curr. Biol. 24(11), 1256–1262 (2014)
11. Pylyshyn, Z.: Is vision continuous with cognition? The case for cognitive impenetrability of visual perception. Behav. Brain Sci. 22(3), 341–423 (1999)
12. Traumatic Brain Injury and Concussion, Centers for Disease Control and Prevention. https://www.cdc.gov/traumaticbraininjury/pubs/index.html. Accessed 27 Nov 2018

Learning of Shape Models from Exemplars of Biological Objects in Images

Petra Perner
Institute of Computer Vision and Applied Computer Sciences, IBaI, Arno-Nitzsche-Str. 45, 04277 Leipzig, Germany
[email protected]
http://www.ibai-institut.de

Abstract. Generalized shape models of objects are necessary to match and identify an object in an image. To acquire these kinds of models, special methods are necessary that allow learning the similarity pair-wise between shapes. Their main concern is the establishment of point correspondences between two shapes and the detection of outliers. Known algorithms assume that the aligned shapes are quite similar. But special problems arise if we align shapes that are very different, for example aligning concave to convex shapes. In such cases, it is indispensable to consider the order of the point-sets and to enforce legal sets of correspondences; otherwise the calculated distances are incorrect. We present our novel shape alignment algorithm, which can also handle such cases. The algorithm establishes symmetric and legal one-to-one point correspondences between arbitrary shapes, represented as ordered sets of 2D-points, and returns a distance measure which runs between 0 and 1.

Keywords: Shape alignment · Correspondence problem · Shape acquisition

1 Introduction

The analysis of shapes and shape variation is of great importance in a wide range of disciplines. Initially, in 1917, Thompson [1] studied the field of geometrical shape analysis from a biological point of view. Intuitively it is especially interesting for biologists, since shape is one of the most concise features of an object class and may change over time due to growth or evolution. The problems of shape spaces and distances have been intensively studied by Kendall [2] and Bookstein [3] in a statistical theory of shape. We follow Kendall [2], who defined shape as all the geometrical information that remains when location, scale, and rotational effects are filtered out from the object. Sometimes it is also interesting to retain the size of the objects, but we will not consider this case here. In most applications this invariance is not given a priori; thus it is indispensable to transform the acquired shapes into a common reference frame. Different terms are used equivalently for this operation, i.e. superimposition, registration, and alignment. Only if the shapes are aligned will it be possible to compare them, to describe deformations, or to define a distance measure between them. Bookstein geometrically analyzed shapes and measured their change in many biological applications, e.g. bee wings, skulls, and human schizophrenic brains [4].


To give an example, with his morphometric studies he found out that the region connecting the two hemispheres is narrower in schizophrenic brains than in normal brains. In digital image processing the statistical analysis of shape is a fundamental task in object recognition and classification. It concerns applications in a wide range of fields, e.g. [5]. The paper is organized as follows. We describe the problem of 2D-shape alignment in Sect. 2. Related work is described in Sect. 3. The image material used for this study is presented in Sect. 4. We describe how image information can be extracted from an image and mapped into a case description in Sect. 5. The algorithm for pair-wise alignment of the shapes and calculation of distances is proposed in Sect. 6. Results for our algorithm are given in Sect. 7. Finally, we give conclusions in Sect. 8.

2 Problem of Alignment of 2-D Shapes

Consider two shape instances $P$ and $O$ defined by the point-sets $p_i \in \mathbb{R}^2$, $i = 1, 2, \ldots, N_P$ and $o_k \in \mathbb{R}^2$, $k = 1, 2, \ldots, N_O$, respectively. The basic task of aligning two shapes consists of transforming one of them (say $P$) so that it fits in some optimal way the other one (say $O$) (see Fig. 1).

Fig. 1. Alignment of two shape instances with superimposition and similarity transformation

Generally, the shape instance $P = \{p_i\}$ is said to be aligned to the shape instance $O = \{o_k\}$ if a distance $d(P, O)$ between the two shapes cannot be decreased by applying a transformation $w$ to $P$. Various alignment approaches are known [6–12]. They mainly differ in the kind of mapping (i.e. similarity [8], rigid [13], affine [14]) and the chosen distance measure. A survey of different distance measures used in the field of shape matching can be found in [15]. For calculating a distance between two shape instances the knowledge of corresponding points is required. If the shapes are defined by sets of landmarks [16, 17], the knowledge of point correspondences is implicit. However, at the beginning of many


applications this condition does not hold, and often it is hard or even impossible to assign landmarks to the acquired shapes. Then it is necessary to automatically determine point correspondences between the points of two aligned shapes $P$ and $O$, see Fig. 2.

Fig. 2. Aligned shapes with established point correspondences

One of the most essential demands on these approaches is symmetry. Symmetry means obtaining the same correspondences when mapping instance $P$ to instance $O$ and, vice versa, instance $O$ to instance $P$. This requirement is often bound up with the condition to establish one-to-one correspondences. This means a point $o_k$ in shape instance $O$ has exactly one corresponding point $p_k$ in shape instance $P$. If we compare point sets with unequal point numbers under the condition of one-to-one mapping, then some points will not have a correspondence in the other point set. These points are called outliers. Special problems arise if we must align shapes that are very different, for example aligning concave to convex shapes. In these cases, it is indispensable to consider the order of the point-sets and to enforce legal sets of correspondences; otherwise the calculated distances are incorrect. This paper presents our novel algorithm for aligning arbitrary 2D-shapes, represented as ordered point-sets. In our work natural shapes are acquired manually from real images. The object shapes can appear with varying orientation, position, and scale in the image. The shapes are arbitrary and there is nothing special about them. They might have a great natural variation. They also might be very similar or very dissimilar. They might even have concave or convex shapes. The algorithm establishes symmetric and legal one-to-one point correspondences between arbitrary shapes, represented as ordered sets of 2D-points, and returns a distance measure which runs between 0 and 1.


3 Related Work

The problems of shape spaces and distances have been intensively studied [2, 3] in a statistical theory of shape. The well-known Procrustes distance [8, 16] between two point-sets $P$ and $O$ is defined as the sum of squared distances between corresponding points,

$$d(P, O) = \sum_{i=1}^{N_{PO}} \left\| \frac{p_i - \mu_P}{\sigma_P} - R(\theta)\,\frac{o_i - \mu_O}{\sigma_O} \right\|^2, \qquad (1)$$

where $R(\theta)$ is the rotation matrix, $\mu_P$ and $\mu_O$ are the centroids of the objects $P$ and $O$ respectively, $\sigma_P$ and $\sigma_O$ are the standard deviations of the distance of a point to the centroid of the shapes, and $N_{PO}$ is the number of point correspondences between the point-sets $P$ and $O$. This example shows that the knowledge of correspondences is an important prerequisite for the calculation of shape distances.
A lot of work has been done concerning the problem of automatically finding point correspondences between two unknown shapes. Hill et al. [14] presented an interesting framework for the determination of point correspondences. First, the algorithm calculates pseudo-landmarks based on a set of two-dimensional polygonal shapes. The polygon approximation is controlled by an automatically calculated threshold to identify a subset of points on each shape. In the next step they establish an initial estimate of correspondences based on the arc-path length of the polygons. Finally, the greedy algorithm is used as an iterative local optimization scheme to modify the correspondences to minimize the distance between both polygons. The complexity of the greedy algorithm is $O(n \log n)$ for sorting the elements and $O(n)$ for the selection. Brett and Taylor [12] presented the extension of this method to three-dimensional surfaces. However, the algorithm was only applied to groups of objects from the same category. They assume that all acquired shapes are similar and compared them under non-rigid transformation.
Another popular approach to solving the correspondence problem is called Iterative Closest Point (ICP), developed by Besl and McKay [18] and further improved in [10, 11, 19]. Given a set of initial, estimated registrations, the ICP automatically converges to the nearest local minimum of a mean squared distance metric. It establishes correspondences by mapping the points of one shape to their closest points on the other shape. In the original version of the ICP [18] the complexity of finding for each point $p_k$ in $P$ the closest point in the point-set $O$ is $O(N_P N_O)$ in the worst case. Marte et al. [11] improved this complexity by applying a spatial subdivision of the points in the set $O$. They used clustering techniques to limit the search space of correspondences for a point $p_k$ to points which are located within a defined range around $p_k$. But this can only be done because they assume that the point-sets are already in proximity, i.e. a reasonably good initial registration state is already given. In general, the ICP is a very simple algorithm which can also be applied to shapes with geometric data of different representations (e.g. point-sets, line segments, curves, surfaces). On the other hand, it is not very robust with respect to noise and outliers.
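Before continuing the review, the Procrustes distance of Eq. (1) can be made concrete with a short NumPy sketch. It is illustrative only and not code from the cited works: the rotation is computed here with an SVD (Kabsch) step, which is an assumed concrete choice since Eq. (1) only names a rotation matrix $R(\theta)$, and the function name is hypothetical.

```python
# Illustrative sketch of the Procrustes distance in Eq. (1); not code from the
# cited works. Assumes one-to-one correspondences (row i of P matches row i of O).
import numpy as np

def procrustes_distance(P, O):
    """P, O: (N, 2) arrays of corresponding 2D points."""
    Pc = P - P.mean(axis=0)                      # remove translation (centroids mu)
    Oc = O - O.mean(axis=0)
    Pc = Pc / np.linalg.norm(Pc, axis=1).std()   # divide by sigma_P (std of radii)
    Oc = Oc / np.linalg.norm(Oc, axis=1).std()   # divide by sigma_O
    # Rotation minimizing the sum of squared distances (assumed Kabsch/SVD step).
    U, _, Vt = np.linalg.svd(Oc.T @ Pc)
    D = np.diag([1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return float(np.sum((Pc - Oc @ R.T) ** 2))
```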


Fitzgibbon [10] replaced the closed-form inner loop of the ICP with the Levenberg-Marquardt algorithm, a non-linear optimization scheme. His results showed an increased robustness and a reduced dependence on the initial estimated registrations without a significant loss of speed. The main problem of the ICP is that it does not guarantee to produce a legal set of correspondences. By a legal set of correspondences is meant that there are no inversions between successive pairs of correspondences (see Table 1). In detail, when starting from a reference pair of corresponding points and traveling successively around the complete boundaries to successive pairs of correspondences, the arc path lengths in relation to the reference points must always be increasing.

Table 1. Illegal and legal sets of correspondences

(A) Illegal sets of correspondences due to inversions at the points $o_2$ and $o_4$. (B) Legal sets of correspondences without any inversions.
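The legality constraint illustrated in Table 1 can be stated compactly: listed in the traversal order of one contour, the indices of the corresponding points on the other (closed) contour must advance monotonically, modulo the contour length. The following check is a sketch of one possible formulation; the function name and the modular-arithmetic criterion are assumptions, not code from any of the cited works.

```python
# Sketch of a legality (no-inversion) check for correspondences between two
# closed, ordered contours; illustrative only.
def is_legal(js, n_o):
    """js: indices on contour O of successive correspondences, listed in the
    traversal order of contour P (starting from a reference pair); n_o: number
    of points on O. Assumes one-to-one correspondences (distinct indices).
    Returns True if the indices advance monotonically around O (no inversions)."""
    steps = [(js[k + 1] - js[k]) % n_o for k in range(len(js) - 1)]
    return sum(steps) < n_o   # a full lap or more would require an inversion

print(is_legal([2, 4, 7, 1], n_o=10))   # True: advances monotonically (with wrap)
print(is_legal([2, 7, 4, 1], n_o=10))   # False: 7 -> 4 is an inversion
```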

An extension of the classical Procrustes alignment to point-sets of differing point counts is known as the Softassign Procrustes Matching algorithm [8]. It alternates between solving for the correspondence, the spatial mapping, and the Procrustes rescaling. The Softassign Procrustes Matching algorithm is also an iterative process and uses deterministic annealing to solve the optimization problem. Indeed, it is a very time-consuming and computationally expensive procedure. They applied the algorithm to dense point sets, not to closed boundaries. One-to-one correspondences between points were established, and points without correspondences are rejected as outliers. The establishment of point correspondences is handled only in a nearest neighbor framework, so they do not guarantee to produce legal sets of correspondences.
Another solution of the correspondence problem was presented by Belongie et al. [20]. They added to each point in the set a descriptor called shape context. The shape context of a point is a histogram which contains information about the relative distribution of all other points in the set. The histogram can provide invariance to translation, scale, and rotation. The cost of mapping two points is calculated by comparing their histograms using the $\chi^2$ test statistic. The $\chi^2$ distances between all possible pairs of points between the two shapes must be calculated, which results in a square distance matrix. The best match is found where the sum of distances between all matched histograms reaches its minimum. To solve this square assignment


problem, they applied a shortest augmenting path algorithm for bipartite graph matching, which has a time complexity of $O(n^3)$. The result is a set of one-to-one correspondences between points with similar shape contexts. The algorithm can also handle point-sets with different point counts by integrating dummy points. But it is also not guaranteed that legal sets of correspondences are established.
None of these works describes the problem of aligning a convex to a concave shape. Special problems arise there: let us suppose that the concave shape representing the letter C is compared with the shape of the letter O (see Table 2). If the pair-wise correspondences were established between nearest neighboring points, the resulting distance between both shape instances would be very small (see Table 2A). But intuitively we would say that these shapes are not very similar. In such cases it is necessary to regard the order of point correspondences and to remove correspondences if they produce inversions. In addition to the legality of correspondences it is important to consider the complete contours of both shapes (see Table 2B). As a result, large distances arise between corresponding points, which leads to an increased distance measure.

Table 2. Establishing correspondences while mapping a concave and a convex shape

(A) An unconstrained nearest neighbor framework may result in a set of illegal correspondences with inversions and a small distance measure. (B) Enforcing legal correspondences and considering the complete contours results in an increased and more suitable measure.

4 Material Used for This Study

The materials we used for this study are fungal strains, which are naturally 3-D objects but are acquired in a 2-D image. These objects have a great variance in the appearance of the shape of the object because of their nature and the imaging constraints. Six fungal strains representing species with different spore types were used for the study. Table 3 shows one of the acquired images for each analyzed fungal strain.

Table 3. Images of six different fungal strains

Alternaria Alternata, Aspergillus Niger, Rhizopus Stolonifer, Scopulariopsis Brevicaulis, Ulocladium Botrytis, Wallemia Sebi

The strains were obtained from the fungal stock collection of the Institute of Microbiology, University of Jena/Germany and from the culture collection of JenaBios GmbH. All strains were cultured in Petri dishes on 2% malt extract agar (Merck) at 24 °C in an incubation chamber for at least fourteen days. For microscopy, fungal spores were scraped off from the agar surface and placed on a microscopic slide in a drop of lactic acid. Naturally hyaline spores were additionally stained with lactophenol cotton blue (Merck). A database of images from the spores of these species was produced.

5 Acquisition of Shape Cases

5.1 Background

The acquisition of object shapes from real images is still an essential problem of image segmentation. For automated image segmentation, low-level methods such as edge detection [8] and region growing [21, 22] are often used to extract the outline of objects from an image. Low-level methods yield good results if the objects have conspicuous boundaries and are not occluded. In the case of complex backgrounds and occluded or noisy objects, the shape acquisition may result in strongly distorted and incorrect cases. Therefore, the acquisition is often performed manually, at the cost of a very subjective, time-consuming procedure. In some studies, it might be enough to reduce the shapes of objects to some characteristic points which are common features of the object class. Landmark coordinates [2, 4, 17, 23] are manually assigned by an expert to some biologically or anatomically significant points of an organism. This must be done for every single object separately. Additional landmarks can be automatically constructed using a combination of existing landmarks. The calculation of the position of these


constructed landmarks has to be precisely defined, e.g. landmark C is located at the midpoint of the shortest line between landmark A and landmark B. Every single landmark is not only defined by its position but also by the specific and unique feature that it represents. Landmarks might provide information about special features of an object contour but cannot capture the shape of an object, because important characteristics might be lost. In many applications it is insufficient or impossible to describe the shape of an object only by landmarks. Then it is a common procedure to trace and capture the complete outlines of the objects in the images manually [14]. Indeed, manual image segmentation may be a very time-consuming and inaccurate procedure. Therefore, new semi-automatic approaches were developed [24, 25] for interactive image segmentation. These approaches use live-wire segmentation algorithms which are based on a graph search to locate mathematically optimal boundaries in the image. If the user moves the mouse cursor in the proximity of an object edge, the labeled outline is automatically adjusted to the boundary. The manual acquisition of objects from real images with the help of semi-automatic segmentation approaches is faster and more precise.
As described in Sect. 4, we are studying the shape of airborne fungal strains. The great natural variance in the shape of these objects and the imaging constraints have the effect that a manual assignment of landmark coordinates is quite impossible. We also reject a fully automatic segmentation procedure because this might produce outlier shapes, as the objects might be touching and overlapping. Therefore, we decided to use a manual labeling procedure in our application.

5.2 Acquisition of Object Contours from Real Images

As a shape we consider the outline of an object, but not the appearance of the object inside the contour. Therefore, we want to elicit from the real image the object contour $P$ represented by a set of $N_P$ boundary points $p_k$, $k = 1, 2, \ldots, N_P$. The user starts labeling the shape of the object at an arbitrary pixel $p_1$ of the contour $P$. In our application we are only interested in the complete and closed outline of the objects. So after having traced the object, the labeling should end at a pixel $p_m$, $k = 1, 2, \ldots, m, \ldots, N_P$, in the 8-neighbourhood of $p_1$. Because it might be difficult to exactly meet this pixel manually, the contour will be closed automatically using the Bresenham [26] procedure. This algorithm connects two pixels in a raster graphic with a digital line. In addition to that, we demand that two successive points may not be defined by the same point coordinates in the image. In consequence we obtain an ordered, discrete, and closed sequence of points where the Euclidean distance between two successive points is either $1$ or $\sqrt{2}$. In Fig. 3, the scheme of the acquisition of a shape is shown for illustration. Figure 4 presents a screenshot from our developed program Case Acquisition and Case Mining - CACM while labeling a shape of the strain Ulocladium Botrytis, with the point coordinates on the right side of the screenshot.
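For reference, a generic textbook formulation of Bresenham's line [26] is sketched below, only to illustrate how the gap between the last labeled pixel and the starting pixel can be closed; it is not the CACM implementation.

```python
def bresenham(x0, y0, x1, y1):
    """Return the pixels of a digital line from (x0, y0) to (x1, y1),
    endpoints included; works for all octants (integer arithmetic only)."""
    points = []
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy
    return points
```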


Fig. 3. Scheme of the labeled object outline

Fig. 4. Labeled shape with coordinates

In fact, image digitization and human imprecision always imply an error during the acquisition of the object shapes. It might be very difficult to exactly determine and meet every boundary pixel of an object when manually labeling the contour of an object. Also, the quantization of a continuous image constitutes a reduction in resolution, which causes considerable image distortion (Moiré effect). Furthermore, the contour of an object in a digitized image may be blurred, which means the contour is extended over a set of pixels with decreasing grey values.

5.3 Approximation

The number of acquired contour points $N_P$ of a shape $P$ depends on the resolution of the input image and on the area and the contour length of the object. To speed up the computation time of the following alignment process, we introduced a polygonal approximation. The resulting number of points from a polygonal approximation of the shape will be influenced by the chosen order of the polygon and the allowed approximation error. We apply the approximation only to shapes which consist of more than 200 points. This limit was introduced because the contour of very small objects in images might be defined by a few pixels only; a further reduction of information about the contour of these objects would be disadvantageous. For the polygonal approximation, we used the approach based on the area/length ratio according to Wall and Daniellson [27], because it is a very fast and simple algorithm without time-consuming mathematical operations. Suppose the set of $N_P$ points $p_1, p_2, \ldots, p_{N_P}$ defines the contour of the object $P$, for which a polygonal approximation is desired. We use the first labeled point $p_1$ as the starting point for the first approximation. Next, we virtually draw a line segment from this starting point to the successor point in $P$. The area $A$ between this line and the corresponding contour segment of $P$ is measured. If the area $A$ divided by the length of the line $L$ is smaller than a predefined threshold $T$, then the same process is repeated for the next successor point in the set $P$ (see Fig. 5).

Fig. 5. Polygonal approximation based on the area/length ratio [27]

This procedure repeats until the ratio exceeds the threshold $T$. In that case, the current point $p_m$ becomes the end of the approximated line and the starting point for the next approximation. The same process is then repeated until the first point $p_1$ is reached. The result of the approximation is a subset of points of the contour $P$. The ratio $A/L$ controls the maximal error of the approximation, since $A$ is the area and $L$ the side length of a virtual rectangle. If the ratio is low, then the other side of the virtual rectangle is small.
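One possible reading of this area/length criterion in code is sketched below. It is a simplified illustration under the stated threshold $T$; the exact incremental bookkeeping in [27] and the handling of the closed contour in CACM may differ.

```python
import numpy as np

def approximate_polygon(points, T):
    """Area/length polygonal approximation of an ordered contour (sketch only).
    points: (N, 2) array of ordered contour points; T: area/length threshold."""
    pts = np.asarray(points, dtype=float)
    keep = [0]                         # indices of retained points; start at p_1
    anchor = pts[0]
    area = 0.0
    for k in range(1, len(pts)):
        prev, cur = pts[k - 1], pts[k]
        # Incremental signed area between the chord (anchor -> cur) and the
        # contour: add the signed triangle (anchor, prev, cur).
        v1, v2 = prev - anchor, cur - anchor
        area += 0.5 * (v1[0] * v2[1] - v1[1] * v2[0])
        L = np.linalg.norm(cur - anchor)
        if L > 0 and abs(area) / L > T:
            keep.append(k)             # cur ends this segment and starts the next
            anchor, area = cur, 0.0
    return pts[keep]
```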

5.4 Normalization of Shape Cases

In our application we demand the distance measure between two shapes to be invariant under translation, scale, and rotation. Thus, in a preprocessing step we remove differences in translation and we rescale the shapes so that the maximum distance of each point from the centroid of the shape will not be larger than one. The invariance to rotation will be calculated during the alignment process (see Sect. 6). The centroid $(\bar{x}_\mu, \bar{y}_\mu)$ of a set of $N_P$ points is given by

$$\bar{x}_\mu = \frac{1}{N_P}\sum_{i=1}^{N_P} x_i, \qquad \bar{y}_\mu = \frac{1}{N_P}\sum_{i=1}^{N_P} y_i. \qquad (1)$$

To obtain invariance under translation we translate the shape so that its centroid is at the origin:

$$x_i' = x_i - \bar{x}_\mu, \qquad y_i' = y_i - \bar{y}_\mu, \qquad i = 1, 2, \ldots, N_P. \qquad (2)$$

Furthermore, for each shape we calculate the scale factor $t_P$:

$$t_P = \frac{1}{\max_i \sqrt{x_i'^2 + y_i'^2}}, \qquad i = 1, 2, \ldots, N_P. \qquad (3)$$

This scale factor is applied to all points of the shape $P$ to obtain the transformed shape instance $P'$. The set of acquired shapes is now superposed on the origin of a common coordinate system. The maximum Euclidean distance of a point to the origin of the coordinate system is one.
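In code, the normalization of Eqs. (1)–(3) amounts to a few lines; the sketch below is illustrative, not the CACM implementation.

```python
import numpy as np

def normalize_shape(points):
    """Translate a shape so that its centroid is at the origin (Eqs. (1)-(2))
    and rescale it so that the maximum distance of a point from the origin
    is one (scale factor t_P of Eq. (3))."""
    P = np.asarray(points, dtype=float)
    P = P - P.mean(axis=0)
    t_P = 1.0 / np.linalg.norm(P, axis=1).max()
    return P * t_P
```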

6 Shape Alignment and Distance Calculation

6.1 The Alignment Algorithm

In the Procrustes distance (see Eq. (1)) each of the shapes is rescaled in such a way that the standard deviation of all points to the centroid is identical ($\sigma_P = \sigma_O$). This normalization does not ensure that the resulting distance runs between 0 and 1. Since we are interested in comparing objects of different categories, we rescale the shapes so that the maximum distance of a point to the centroid is not larger than 1. In comparison to the Procrustes distance we also do not need $\mu_P$ and $\mu_O$, because we have already removed the translation by transforming the objects into the origin (see Sect. 5.4). The differences in rotation will be removed during our iterative alignment algorithm.
In each iteration of this algorithm the first shape is rotated stepwise while the second shape is kept fixed. For every transformed point in the first shape we try to find


a corresponding point on the second shape. Based on the distance between these corresponding points, the alignment score is calculated for this specific iteration step. Once the first shape has been rotated fully around its centroid, the rotation for which the minimum alignment score was calculated is selected and applied.
We are regarding arbitrary shapes with varying orientations and different point counts and do not have any information about how the points of two shape instances must be mapped onto each other. As already stated, we must solve the correspondence problem before we are able to calculate the distance between two shape instances. In summary, for the establishment of point correspondences we demand the following:
• Produce only legal sets of correspondences,
• Produce one-to-one point correspondences,
• Determine points without a correspondence as outliers, and
• Produce a symmetric result, i.e. obtain the same correspondences when aligning instance P to instance O as when aligning instance O to P.

It was shown in Sect. 3 that the establishment of legal sets of correspondences is important to distinguish between concave and convex shapes. The drawback of this requirement is that the acquired shapes must be ordered point-sets. The demand for establishing only legal sets of correspondences is also the main reason why we decided to extend the initial version of our alignment algorithm [28], which can only align concave with concave and convex with convex shapes. We extended our nearest-neighbor search algorithm presented there so that it can handle the correspondence problem while aligning concave with convex object shapes.
The input for our alignment algorithm is a pair of two normalized shapes $P$ and $O$ as described in Sect. 5. We define the shape instance $P$ as the one which has fewer contour points than the second shape instance $O$. The instance with more points is always the one which will be aligned to the shape instance with fewer points. That means shape $P$ will be kept fixed while shape $O$ will be stepwise rotated to match $P$. Before the iterative algorithm starts, we define a range in which to search for potential correspondences. This range is defined by a maximum deviation of the orientation with respect to the centroid (see Fig. 6). This restriction will help us to produce legal sets of correspondences. The maximum permissible deviation of orientation $\gamma_{dev}$ is calculated in dependence on the number of contour points $n_O$ of the shape $O$, which is the instance that has more points than the other one. Our investigations showed that the following formula leads to a well-sized range:

$$\gamma_{dev} = \pm \frac{4\pi}{n_O}. \qquad (4)$$


Fig. 6. The permissible deviation of orientation for finding a point correspondence of the point $p_k$ is illustrated. The line marks the orientation of the point $p_k$. The range in which to search for a correspondence is shown

The outline of our iterative alignment algorithm is as follows:

Initialize ψ                         /* stepwise rotation angle */
SET ψ_i = 0                          /* actual rotation angle */
REPEAT UNTIL ψ_i ≥ 2π or SCORE(P, O_i) = 0
    (A) Rotation of O with ψ_i = ψ_{i-1} + ψ
    (B) Calculate point correspondences between P and O
    (C) Calculate distance SCORE(P, O_i)
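The loop above can be sketched in Python as follows. The placeholder score used here is a plain nearest-neighbour distance standing in for the procedures of Sects. 6.2 and 6.3, so the sketch shows only the rotation search itself; all names are illustrative and this is not the CACM implementation.

```python
import numpy as np

def placeholder_score(P, O_rot):
    """Stand-in for Sects. 6.2-6.3: mean nearest-neighbour distance only."""
    d = np.linalg.norm(P[:, None, :] - O_rot[None, :, :], axis=2)
    return d.min(axis=1).mean()

def align(P, O, step=np.deg2rad(1.0), score=placeholder_score):
    """Rotate O stepwise about the origin (shapes normalized as in Sect. 5.4)
    and return the minimal alignment score and the rotation angle at which
    it was found."""
    best_score, best_psi, psi = np.inf, 0.0, 0.0
    while psi < 2 * np.pi:
        c, s = np.cos(psi), np.sin(psi)
        O_rot = O @ np.array([[c, -s], [s, c]]).T   # rotate every point by psi
        sc = score(P, O_rot)
        if sc < best_score:
            best_score, best_psi = sc, psi
        if sc == 0.0:                               # identical shapes: stop early
            break
        psi += step
    return best_score, best_psi
```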

6.2 Calculation of Point Correspondences

The most difficult and time-consuming part is the establishment of point correspondences. For each point $p_k$ with $1 \le k \le n_P$ in the set $P$ we try to establish exactly one corresponding point $o_j$ with $1 \le j \le n_O$ in the set $O$. To create a list of potential correspondents, we first select all points in $O$ which are located in the permissible range of orientation (see Fig. 6) and insert them into the list of potential correspondents $\{CorrList(p_k)\}$ of $p_k$. Each of the selected points meets the following condition

$$\gamma_{p_k} - \gamma_{dev} \;\le\; \gamma_{o_j} \;\le\; \gamma_{p_k} + \gamma_{dev}, \qquad (5)$$

with $\gamma_{p_k}$ the angle of the point $p_k$ and $\gamma_{dev}$ the permissible range of orientation around $p_k$. Figure 7 shows a part of two aligned shape instances at this state of registration. Each point in shape $P$ is connected by lines to all its potential correspondences. It can also be seen that there are outliers on shape $O$. As an outlier we define all points where no correspondence could be established. Points without correspondences are a logical consequence if we demand one-to-one correspondences when mapping shapes with unequal numbers of points. Such points do not have any influence on the calculated distance.

Fig. 7. Part of a screenshot made during the alignment of two shape cases. The outer dotted shape ($P$) is aligned to the inner dotted shape ($O$). Each point $p_k$ in $P$ is connected by lines with all its potential correspondences that are stored in $\{CorrList(p_k)\}$

If no point in $O$ is in the permissible range ($\{CorrList(p_k)\}$ is empty), $p_k$ is marked as an outlier. Normally, in the case of aligning convex objects, there is no point in $P$ which does not have at least one potential correspondence; the permissible range of orientation is chosen big enough anyway. Indeed, this kind of outlier occurs only in cases of aligning convex to concave objects. Each point $p_k$ which does not have a single potential corresponding point in $O$ is included with the maximum distance of one when calculating the distance measure. If there is at least one potential correspondent, we calculate the squared Euclidean distances between $p_k$ and each point in this list. Then the points in the list $\{CorrList(p_k)\}$ are sorted with ascending distances in relation to $p_k$. In succession we check for each point in this list whether a corresponding point in $P$ has already been assigned to it. The first point found without a correspondence is selected and we establish a one-to-one correspondence between them. If all points in the list $\{CorrList(p_k)\}$ already have a correspondence, the point $p_k$ is marked as an outlier. In contrast to those


outliers in $P$ which do not even have one potential correspondence in $O$, this kind of outlier will not be included in calculating the distance measure.
Let us assume we have gone through the complete set $P$ and assigned to each point $p_k$ either a correspondence in $O$ or marked it as an outlier. In the next step we want to ensure that we produce a legal set of correspondences. Therefore, we go through the set $P$ again and remove all one-to-one correspondences that are inversions (see Table 1). Suppose we have found a point $p_k$ assigned to a correspondence $o_g$ in $O$ that produces an inversion with the point $p_m$ which is assigned to the point $o_f$ $(k, m \le n_P;\; g, f \le n_O)$. First we try to remove the inversion by switching the correspondences, i.e. $p_k$ will be assigned to $o_f$ and $p_m$ will be assigned to $o_g$. Otherwise, we remove the correspondence of $p_k$ and check in succession all the remaining points in the list $\{CorrList(p_k)\}$ to see whether it is possible to assign a correspondence with one of these points that does not produce an inversion. If we find such a point in this list, we establish the one-to-one correspondence between these two points. Otherwise, we must mark $p_k$ as an outlier.
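A condensed sketch of the candidate search of this section is given below. It assumes normalized shapes (Sect. 5.4) with orientations measured about the origin, and it omits the inversion-removal pass just described (a check like the legality sketch given after Table 1 could be applied afterwards). Names and details are illustrative, not the CACM implementation.

```python
import numpy as np

def find_correspondences(P, O):
    """Greedy one-to-one candidate search within the angular range of Eqs. (4)-(5).
    Returns (pairs, no_candidates): index pairs (k, j) and the points of P that
    had no candidate at all (these later enter the score with distance one).
    The inversion-removal pass of this section is omitted. Sketch only."""
    n_o = len(O)
    gamma_dev = 4 * np.pi / n_o                         # Eq. (4)
    ang_P = np.arctan2(P[:, 1], P[:, 0])
    ang_O = np.arctan2(O[:, 1], O[:, 0])
    taken = np.zeros(n_o, dtype=bool)
    pairs, no_candidates = [], []
    for k, p in enumerate(P):
        # Candidates: points of O within +/- gamma_dev of p's orientation (Eq. (5)).
        diff = np.angle(np.exp(1j * (ang_O - ang_P[k])))    # wrap to (-pi, pi]
        cand = np.flatnonzero(np.abs(diff) <= gamma_dev)
        if cand.size == 0:
            no_candidates.append(k)
            continue
        # Try candidates in order of increasing distance; take the first free one.
        cand = cand[np.argsort(np.linalg.norm(O[cand] - p, axis=1))]
        free = cand[~taken[cand]]
        if free.size:
            j = int(free[0])
            taken[j] = True
            pairs.append((k, j))
        # else: p_k stays unmatched and is excluded from the distance measure
    return pairs, no_candidates
```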

6.3 Calculation of the Distance Measure

The distance between two shapes $P$ and $O$ is calculated based on the sum of Euclidean distances. If a correspondence was assigned to the point $p_k$, $1 \le k \le n_P$, in $P$, the Euclidean distance between this point and its corresponding point $o_k$ in $O$ is calculated by

$$d(p_k, o_k) = \sqrt{(x_p - x_o)^2 + (y_p - y_o)^2}. \qquad (6)$$

If the point $p_k$ is an outlier because not even one potential correspondence could be assigned and the list $\{CorrList(p_k)\}$ is empty, the distance is set to the maximal value of one:

$$d(p_k, o_k) = 1. \qquad (7)$$

The sum of all pair-wise distances between the two shapes $P$ and $O$ is defined by

$$e(P, O) = \sum_{k=1}^{n_P} d(p_k, o_k). \qquad (8)$$

Since we are interested in obtaining a distance measure which runs between zero and one, we normalize the sum according to $n_P$, the number of points in $P$:

$$e(P, O) = \frac{1}{n_P}\, e(P, O). \qquad (9)$$

The mean Euclidean distance alone is not a sufficiently satisfactory measure of distance. Particularly in cases of wavy or jagged contours, important information gets lost using the averaged distance. Therefore, we also calculate the maximum distance over the established correspondences:

$$e_{max}(P, O) = \max_{k=1,\ldots,n_P} d(p_k, o_k). \qquad (10)$$

The final similarity measure is the weighted sum of the mean Euclidean distance and the maximum Euclidean distance:

$$Score(P, O) = \alpha\, e(P, O) + \beta\, e_{max}(P, O). \qquad (11)$$

The choice of the values for the weights $\alpha$ and $\beta$ depends on the importance the user wants to give to the respective distance measure. We chose a value of 0.5 for both $\alpha$ and $\beta$, which results in an equal influence of both distances on the overall distance.
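Putting Eqs. (6)–(11) together, the score can be computed as in the sketch below; `pairs` and `no_candidates` are assumed to come from a correspondence step like the one sketched in Sect. 6.2, and α = β = 0.5 as chosen above. This is an illustration, not the CACM implementation.

```python
import numpy as np

def alignment_score(P, O, pairs, no_candidates, alpha=0.5, beta=0.5):
    """Score(P, O) of Eq. (11): weighted sum of the normalized mean distance
    (Eqs. (6), (8), (9)) and the maximum distance (Eq. (10)). Points of P without
    any candidate correspondence enter with the maximal distance one (Eq. (7))."""
    dists = [np.linalg.norm(P[k] - O[j]) for k, j in pairs]   # Eq. (6)
    dists += [1.0] * len(no_candidates)                       # Eq. (7)
    if not dists:
        return 1.0
    e_mean = np.sum(dists) / len(P)                           # Eqs. (8)-(9)
    e_max = np.max(dists)                                     # Eq. (10)
    return alpha * e_mean + beta * e_max                      # Eq. (11)
```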

7 Evaluation of Our Alignment Algorithm

First we demonstrate that the algorithm is symmetric (see Table 4), i.e. the same distance is calculated when aligning instance P to instance O as when aligning instance O to P. As described in Sect. 6.1, the instance with more contour points is always the one which will be aligned to the shape instance with fewer points. Therefore, the order in which the cases are given as input to the algorithm doesn't matter. A special case where the order matters is if both shape instances are defined by the same number of contour points.

Table 4. Evaluation of symmetry

ulocladium_12 is aligned to ulocladium_13 (each shape is defined by 340 contour points): ψ_min = 0.2094, ε = 0.0842, ε_max = 0.1835, Score = 0.1339, outliers included: 0
ulocladium_13 is aligned to ulocladium_12 (each shape is defined by 340 contour points): ψ_min = −0.2094, ε = 0.0856, ε_max = 0.1822, Score = 0.1339, outliers included: 0

Table 5 presents some results of pair-wise aligned shape cases. In the left column of the table the visual results are shown, with connecting lines between


Table 5. Different shapes with corresponding distance measures and the alignment scores

concave is aligned to concave: ε = 0, ε_max = 0, Score = 0, outliers included: 0
ulocladium_10 is aligned to ulocladium_01: ε = 0.1015, ε_max = 0.1557, Score = 0.1286, outliers included: 0
ulocladium_03 is aligned to ulocladium_01: ε = 0.2367, ε_max = 0.4303, Score = 0.3335, outliers included: 0
aspernig_01 is aligned to alternaria_42: ε = 0.3766, ε_max = 0.6305, Score = 0.5035, outliers included: 0
rectangle middle is aligned to ulocladium_01: ε = 0.2835, ε_max = 0.5011, Score = 0.3932, outliers included: 0
point is aligned to ulocladium_01: ε = 0.3751, ε_max = 0.3751, Score = 0.3751, outliers included: 0

corresponding points. The right column of the table presents the calculated scores and the number of outliers which are included in the alignment score with a distance value of one. An alignment score of zero means identity, and a value of 0.5 can be understood as neutral. Up to a value of one the shapes become more and more dissimilar.
Again, we take a closer look at the pair-wise alignment of a concave and a convex shape. Reconsider the example from above where the shape of the letter O must be aligned to the shape of the letter C. Figure 8 presents a screenshot from our program


where a similar situation occurred. Twelve outliers were determined on the convex shape, which are included in the calculation of the alignment score. The more of these outliers are included, the more dissimilar the two shapes become, because each of them is assigned the maximum distance of one.

Fig. 8. Screenshot with outliers, marked as green dots, that occurred during the alignment of a circle and a concave shape

Under these circumstances the points on the inside of the concave shape may not have any influence on the resulting alignment score. This is because they were not mapped onto opposite points of the contour of the convex shape (see Table 2B). Indeed, this is an error of the algorithm, but otherwise it may happen that the alignment score exceeds the value one.

8 Conclusions

We have proposed a method for the acquisition of shape instances and our novel algorithm for aligning arbitrary 2D-shapes, represented by ordered point-sets of varying size. Our algorithm aligns two shapes under similarity transformation; differences in rotation, scale, and translation are removed. It establishes one-to-one correspondences between pairs of shapes and ensures that the found correspondences are symmetric and legal. The method detects outlier points and can handle some amount of noise. We have shown that the algorithm also works well if the aligned shapes are very different, e.g. for the alignment of concave and convex shapes. A distance measure which runs between 0 and 1 is returned as the result. The methods are implemented in the program CACM Version 1.4, which runs on a Windows PC.


Acknowledgment. The project “Development of methods and techniques for the image acquisition and computer-aided analysis of biologically dangerous substances BIOGEFA” is sponsored by the German Ministry of Economy BMWA under the grant number 16IN0147.

References
1. Thompson, D.A.: On Growth and Form. Cambridge University Press, Cambridge (1917)
2. Kendall, D.G.: A survey of the statistical theory of shape. Statistical Science 4(2), 87–120 (1989)
3. Bookstein, F.L.: Size and shape spaces for landmark data in two dimensions. Stat. Sci. 1(2), 181–242 (1986)
4. Bookstein, F.L.: Landmark methods for forms without landmarks: morphometrics of group differences in outline shape. Med. Image Anal. 1(3), 225–244 (1997)
5. Alon, J., Athitsos, V., Sclaroff, S.: Online and offline character recognition using alignment to prototypes. In: Proceedings of the 8th International Conference on Document Analysis and Recognition ICDAR 2005. IEEE Computer Society Press (2005)
6. Huttenlocher, D., Klanderman, G., Rucklidge, W.: Comparing images using the Hausdorff distance. IEEE Trans. Pattern Anal. Mach. Intell. 15(9), 850–863 (1993)
7. Alt, H., Guibas, L.J.: Discrete geometric shapes: matching, interpolation and approximation. In: Sack, J.-R., Urrutia, J. (eds.) Handbook of Computational Geometry, pp. 121–153. Elsevier Science Publishers B.V. (1996)
8. Rangarajan, A., Chui, H., Bookstein, F.L.: The softassign procrustes matching algorithm. In: Proceedings of Information Processing in Medical Imaging, pp. 29–42 (1997)
9. Sclaroff, S., Pentland, A.: Modal matching for correspondence and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 17(6), 545–561 (1995)
10. Fitzgibbon, A.W.: Robust registration of 2D and 3D point sets. In: Proceedings of British Machine Vision Conference, Manchester, UK, vol. II, pp. 411–420 (2001)
11. Marte, O.-C., Marais, P.: Model-based segmentation of CT images. S. Afr. Comput. J. 28, 54–59 (2002)
12. Brett, A.D., Taylor, C.J.: A framework for automated landmark generation for automated 3D statistical model construction. In: Proceedings of Information Processing in Medical Imaging 1999, pp. 376–381 (1999)
13. Feldmar, J., Ayache, N.: Rigid, affine and locally affine registration of free-form surfaces. Int. J. Comput. Vis. 18(3), 99–119 (1996)
14. Hill, A., Taylor, C.J., Brett, A.D.: A framework for automatic landmark identification using a new method of nonrigid correspondence. IEEE Trans. Pattern Anal. Mach. Intell. 22(3), 241–251 (2000)
15. Veltkamp, R.C.: Shape matching: similarity measures and algorithms. In: Shape Modelling International, pp. 188–197 (2001)
16. Lele, S.R., Richtsmeier, J.T.: An Invariant Approach to Statistical Analysis of Shapes. Chapman & Hall/CRC, Boca Raton (2001)
17. Dryden, I.L., Mardia, K.V.: Statistical Shape Analysis. Wiley, Chichester (1998)
18. Besl, P., McKay, N.: A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. Mach. Intell. 14(2), 239–256 (1992)
19. Aksenov, P., Clark, I., Grant, D., Inman, A., Vartikovski, L., Nebel, J.-C.: 3D thermography for quantification of heat generation resulting from inflammation. In: Proceedings of 8th 3D Modelling Symposium, Paris, France (2003)

Learning of Shape Models from Exemplars

599

20. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell. 24(24), 509–522 (2002) 21. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: active contour models. In: 1st International Conference on Computer Vision, London, pp. 259–268 (1987) 22. Cheng, D.-C., Schmidt-Trucksäss, A., Cheng, K.-S., Burkhardt, H.: Using Snakes to Detect the Intimal and Aventitial Layers of the Common Carotid Artery Wall in Sonographic Images. Comput. Methods Programs Biomed. 67, 27–37 (2002) 23. Cootes, T.F., Taylor, C.J.: A mixture model for representing shape variation. Image Vis. Comput. 17(8), 567–574 (1999) 24. Mortensen, E.N., Barrett, W.A.: Intelligent scissors for image composition. In: Computer Graphics Proceedings, pp. 191–198 (1995) 25. Haenselmann, T., Effelsberg, W.: Wavelet-based semi-automatic live-wire segmentation. In: Proceedings of the SPIE Human Vision and Electronic Imaging VII, vol. 4662, pp. 260–269 (2003) 26. Bresenham, J.E.: Algorithm for computer control of a digital plotter. IBM Syst. J. 4(1), 25–30 (1965) 27. Wall, K., Daniellson, P.-E.: A fast sequential method for polygonal approximation of digitized curves. Comput. Graph. Image Process. 28, 220–227 (1984) 28. Perner, P., Jänichen, S.: Learning of form models from exemplars. In: Fred, A., Caelli, T., Duin, R.P.W., Campilho, A., de Ridder, D. (eds.) Structural, Syntactic, and Statistical Pattern Recognition, Proceedings of the SSPR 2004, LNCS 3138, p. 153. Springer Verlag, Lisbon/Portugal (2004)

A New Technique for Laser Spot Detection and Tracking by Using Optical Flow and Kalman Filter

Xiuli Wang1, Ming Yang2, Lalit Gupta2, and Yang Bai3

1 Anhui University, Hefei 230039, Anhui, China
2 Southern Illinois University Carbondale, Carbondale, IL 62901, USA
3 Chongqing University of Technology, Chongqing 400054, China
[email protected]

Abstract. Laser spots are widely used in numerous fields, such as pointing during presentations, guiding robots in human machine interaction (HCI), and guiding munitions to targets in military applications. The interest of this paper is to develop an effective method for laser spot detection and tracking that exploits only a few features. A new technique is presented that jointly combines pyramid Lucas–Kanade (PLK) optical flow for detection and an extended Kalman filter for accurate tracking of the laser spot in low-resolution video frames with varying backgrounds. Using only the intensity of the laser spot together with its displacement, this technique achieves very good results. It could be embedded into inexpensive multi-core processing devices for potentially high benefits in the future.

Keywords: Laser spot · Detection · Tracking · Low-resolution · PLK optical flow · Extended Kalman filter

1 Introduction

Lasers can be practically exploited in different fields for industrial and commercial purposes. The laser spot is widely used for guidance in HCI. In [1], the authors used librarian robots to identify the position of a target by relying on the laser point. For accurate tracking when launching munitions at targets, a system can seek and lock the aim point without operator intervention by exploiting a beam of laser energy, as in [2]. Other applications include surveillance, mapping, remote sensing, etc. Consequently, many interesting studies have focused on laser spot detection and tracking. In this paper, the interest is to develop an attainable algorithm to detect and track the laser spot precisely. Detecting a laser spot is challenging because it is very tiny, and the spot can deform into an ellipse, a strip, or other irregular shapes during movement. It is therefore hard to detect the whole spot under general conditions; instead, the center of the spot is usually detected. The laser equipment could emit a red, blue, green or other colored beam as well. Besides, the luminance of the laser spot depends on the output power and background intensity. It is therefore difficult to extract general features or to segment all possible laser spots against a real environmental background. In other words, it is very easy to miss the position of the laser spot or to falsely locate noisy points. In the tracking part, the moving speed and direction of the laser spot are unknown. The movement is obviously nonlinear, which makes tracking more difficult. The spot may be drowned by noisy obstructions in some frames or even move out of the frame, and it is a big challenge to track such missing spots accurately. Several methods have been proposed for object detection and tracking, but all have deficiencies and their performance is limited. In [3], the reported success rate is low when using a template matching technique, because that technique is not suitable for tiny object tracking. In [4], color segmentation and brightness features were applied to color frames, which increases the computational complexity and does not work well in the presence of similarly colored noise. In [5], the authors reported good results, but still used template matching. The algorithm in this paper addresses the problem under some conditions that weaken these difficulties. Grayscale frames are processed to avoid color limitations, and only a single spot location per frame is considered. The intensity of the laser spot is assumed to be brighter than the local background, and a displacement difference is assumed to exist between the laser spot and the background. The proposed method makes use of the relative intensity ratio between the spot and the local background. In addition, the movement trajectory of the laser spot is taken into account to detect the small spot in low-resolution images in time. The detailed algorithm is described in Sect. 2. Because it neither needs a spot template for matching nor suffers from color noise, the new method is fast, attainable and very effective. The rest of this paper is organized as follows. In Sect. 2, the algorithm framework is presented in detail. In Sect. 3, the pyramid Lucas-Kanade (PLK) optical flow technique is introduced. In Sect. 4, the extended Kalman filter (EKF) is described. The experimental results are given in Sect. 5, and Sect. 6 concludes the paper.

2 Algorithm Framework

Before dealing with the data, the color video frames are converted to grayscale in double precision [0, 1]. First, two adjacent frames (f_pre, f_next) are subtracted to get the difference frame, and all negative pixels are set to zero. This adjacent-frame subtraction removes noise and most of the background. Then, the image gradient technique is applied to detect laser spot candidates in the difference frame. The gradient, denoted by \nabla f, is defined by the vector components g_x in the horizontal and g_y in the vertical direction of frame f,

$$\nabla f \equiv \mathrm{grad}(f) = \begin{bmatrix} g_x \\ g_y \end{bmatrix} = \begin{bmatrix} \partial f / \partial x \\ \partial f / \partial y \end{bmatrix} \qquad (1)$$


where g_x and g_y are the corresponding partial derivatives of frame f. They can be obtained by various approximation operators in digital image processing, such as the Sobel or Prewitt operators; see [6]. Then, the magnitude of the gradient vector, M(x, y), is computed by

$$M(x, y) = \mathrm{mag}(\nabla f) = \sqrt{g_x^2 + g_y^2} \qquad (2)$$
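As an illustration of Eqs. (1) and (2), the short Python sketch below computes the difference frame and its gradient magnitude with Sobel operators. It is an assumption of this rewrite rather than the authors' code; the function name and the subtraction order are hypothetical.

```python
import cv2
import numpy as np

def gradient_magnitude(f_pre, f_next):
    """Difference frame of two adjacent frames and its gradient magnitude (Eqs. 1-2)."""
    # Grayscale, double precision in [0, 1]
    pre = cv2.cvtColor(f_pre, cv2.COLOR_BGR2GRAY).astype(np.float64) / 255.0
    nxt = cv2.cvtColor(f_next, cv2.COLOR_BGR2GRAY).astype(np.float64) / 255.0

    # Adjacent-frame subtraction; negative pixels are set to zero
    diff = np.clip(pre - nxt, 0.0, None)

    # Partial derivatives g_x, g_y via Sobel operators (Eq. 1)
    gx = cv2.Sobel(diff, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(diff, cv2.CV_64F, 0, 1, ksize=3)

    # Gradient magnitude (Eq. 2)
    return diff, np.sqrt(gx ** 2 + gy ** 2)
```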

This yields the gradient image. Next, the gradient image is binarized with a suitable threshold, and dilation is used to obtain backup areas. Because the gradient only stresses the variation of intensity, the gradient is still large when the intensity of an area changes from bright to dark. Then, by comparing the areas extracted from the binarized frame f_pre with the backup areas from the gradient image of the difference frame, spot candidates are picked up if they belong to both areas. Assuming the spot can be neither too tiny nor too large, only the k regions of medium size are kept as laser spot candidates. The shape of each region R_k is different, so only the region center is computed as

$$x_{kc} = \frac{1}{m} \sum_{j=1}^{m} x_{kj} \qquad (3)$$

$$y_{kc} = \frac{1}{m} \sum_{j=1}^{m} y_{kj} \qquad (4)$$

where m is the number of pixels in each region (generally 5 < m < 150) and (x_{kj}, y_{kj}) is the position of each pixel in region R_k. The point (x_{kc}, y_{kc}) is a candidate location of the laser spot center. After all candidates are captured, PLK is exploited to estimate a series of new positions {(x̃_{1c}, ỹ_{1c}), ..., (x̃_{kc}, ỹ_{kc})} of the tracked candidates in the adjacent frame f_next. Thus, the displacement d_k is obtained from the difference of the two positions in the two adjacent frames,

$$d_k = \sqrt{(x_{kc} - \tilde{x}_{kc})^2 + (y_{kc} - \tilde{y}_{kc})^2} \qquad (5)$$

Then, the displacements are sorted to form a set D = {d_1, d_2, ..., d_k}. Because the method assumes that the largest displacement difference exists between the spot and the background, the median d_m of D is chosen as the background motion to calculate the displacement difference d_{ik},

$$d_{ik} = |d_k - d_m| \qquad (6)$$

The candidates are ranked again by displacement difference, and the top-ranked candidate (x_{1c}, y_{1c}) is marked as the center of the laser spot. The corresponding (x̃_{1c}, ỹ_{1c}) is also merged into the candidate set of the adjacent frame. If there is more than one candidate with the highest displacement difference and gradient, we mark the laser spot as missed in that frame and leave the problem to the tracking part. This technique can detect the laser spot whether it moves faster than the background or stays stationary, but it may not be efficient when the laser spot moves at a speed similar or equal to that of the background. After the position of the laser spot is detected, the EKF is applied for position fine-tuning and missing-spot tracking. The whole processing structure is shown in Fig. 1.

Fig. 1. The whole structure of laser spot detection and tracking: difference frame → image gradient, binarization and dilation (the first frame is binarized with a high intensity threshold) → segmentation → candidates chosen → PLK → location of laser → EKF.
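The following Python sketch strings the stages of Fig. 1 together for one pair of frames, reusing the gradient_magnitude() helper from the earlier sketch. It is an illustrative reconstruction only; the threshold values, the connected-component filtering and the specific OpenCV calls are assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

def detect_spot(f_pre, f_next, grad_thresh=0.05, bright_thresh=0.8):
    """Candidate detection and displacement ranking for one frame pair (Fig. 1)."""
    diff, mag = gradient_magnitude(f_pre, f_next)          # Eqs. (1)-(2)

    # Binarize the gradient image and dilate to obtain the backup areas
    backup = cv2.dilate((mag > grad_thresh).astype(np.uint8),
                        np.ones((3, 3), np.uint8))

    # Binarize the first frame with a high intensity threshold
    pre = cv2.cvtColor(f_pre, cv2.COLOR_BGR2GRAY).astype(np.float64) / 255.0
    bright = (pre > bright_thresh).astype(np.uint8)

    # Candidate regions must lie in both areas and have medium size
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(bright & backup)
    cand = [tuple(centroids[i]) for i in range(1, n)
            if 5 < stats[i, cv2.CC_STAT_AREA] < 150]       # region centers, Eqs. (3)-(4)
    if not cand:
        return None                                        # spot marked as missed

    # Track candidate centers into the next frame with pyramidal LK (Sect. 3)
    nxt = cv2.cvtColor(f_next, cv2.COLOR_BGR2GRAY)
    p0 = np.float32(cand).reshape(-1, 1, 2)
    p1, st, err = cv2.calcOpticalFlowPyrLK((pre * 255).astype(np.uint8), nxt, p0, None)

    # Displacements d_k (Eq. 5) and displacement differences d_ik (Eq. 6)
    d = np.linalg.norm(p1.reshape(-1, 2) - p0.reshape(-1, 2), axis=1)
    dik = np.abs(d - np.median(d))
    return cand[int(np.argmax(dik))]     # top-ranked candidate as the spot center
```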

3 PLK Optical Flow

Optical flow is the distribution of the instantaneous velocity of apparent motion observed in a visual system. It expresses the variation of brightness in images, which contains important spatial information about relative motions, so it can be used to determine the movement of targets. Optical flow is widely applied in many fields such as object detection and tracking, robot navigation, background extraction and separation, etc. Traditional optical flow computation methods are differential techniques such as the Horn-Schunck (HS) global algorithm and the Lucas-Kanade (LK) local algorithm [7, 8], region-based matching [8, 9], energy-based methods [7, 8], phase-based techniques, etc. [7, 8]. They all have their own advantages and disadvantages, and many methods have been introduced to improve their performance [10, 11]. PLK is famous for quickly computing sparse optical flow. In this paper, only the candidates, which form a sparse feature set, need their optical flow estimated, so the PLK technique can be used to tackle it. The computation of optical flow is based on two assumptions. First, the luminance does not change, or changes only very slightly, over time between adjacent frames. In other words, the brightness constancy constraint holds:

$$I(x + u, y + v, t + 1) \approx I(x, y, t) \qquad (7)$$

where the flow velocity (u, v) is defined as the optical flow at frame t. Second, the movement is very small, which means that the displacement does not change rapidly. Using a Taylor expansion,

$$I(x + u, y + v, t + 1) = I(x, y, t) + I_x u + I_y v + I_t + \text{H.O.T.} \qquad (8)$$

For small displacements, after omitting the higher order terms (H.O.T.):

$$I_x u + I_y v + I_t = 0 \qquad (9)$$

where the subscripts denote partial derivatives. Equation (9) alone is not enough to compute the two unknown velocity components (u, v) uniquely. To solve this problem, an extra assumption is introduced to constrain the equations. By assuming that the moving pixels displace within a small neighborhood, the LK algorithm is used, which is a local differential technique with a weighted least-squares fit that minimizes the squared error function

$$E_{LK}(u, v) = \sum_{p \in \Omega} W^2(p)\,\bigl[I_x u + I_y v + I_t\bigr]^2 \qquad (10)$$

Here W(p) is a window weight function for the pixel (x, y) associated with the spatial neighborhood \Omega of size p. The motion of the central pixel can then be easily computed from the neighboring spatial and temporal gradients by setting \partial E_{LK}/\partial u = 0 and \partial E_{LK}/\partial v = 0. This sets up the linear system

$$\begin{bmatrix} \sum W^2(p) I_x^2 & \sum W^2(p) I_x I_y \\ \sum W^2(p) I_x I_y & \sum W^2(p) I_y^2 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = - \begin{bmatrix} \sum W^2(p) I_x I_t \\ \sum W^2(p) I_y I_t \end{bmatrix} \qquad (11)$$


This system is not only easy and fast to compute but also robust under noise. In more general situations, however, large and non-coherent displacements are typical, violating the small-displacement assumption. We therefore recommend the PLK method to circumvent this assumption; the details are described in [10]. The motion is first estimated over the largest spatial scale at the top layer of an image pyramid, and the initial motion estimate is then iteratively refined down, layer by layer, until the raw image pixels at the bottom layer are reached. In short, it is a coarse-to-fine optical flow. The structure is shown in Fig. 2: the optical flow is run over the top layer to estimate the motions, and for each following layer the resulting estimate from the previous upper layer is used as the starting point for further estimation until the bottom layer is reached. Combined with the image pyramids, it can estimate faster motions from image sequences while greatly reducing the computational cost, so it is very suitable for our case.

Fig. 2. Coarse-to-fine optical flow estimation: iterative LK is run on the pyramids of images I_{t-1} and I_t, from the top layer down to the original images.
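As a concrete illustration of Eqs. (10) and (11), the sketch below solves the LK normal equations for a single pixel window with NumPy, assuming uniform weights W(p) = 1 and a window away from the image border. In practice the pyramidal version is available directly, e.g. through OpenCV's calcOpticalFlowPyrLK.

```python
import numpy as np

def lk_flow_window(I_prev, I_next, x, y, r=7):
    """Solve Eq. (11) for the flow (u, v) of pixel (x, y) over a (2r+1)x(2r+1) window."""
    I_prev = I_prev.astype(np.float64)
    I_next = I_next.astype(np.float64)

    # Spatial and temporal derivatives I_x, I_y, I_t (central differences)
    Ix = (np.roll(I_prev, -1, axis=1) - np.roll(I_prev, 1, axis=1)) / 2.0
    Iy = (np.roll(I_prev, -1, axis=0) - np.roll(I_prev, 1, axis=0)) / 2.0
    It = I_next - I_prev

    # Neighborhood Omega around (x, y); uniform weights W(p) = 1 assumed
    win = (slice(y - r, y + r + 1), slice(x - r, x + r + 1))
    ix, iy, it = Ix[win].ravel(), Iy[win].ravel(), It[win].ravel()

    # Normal equations of the weighted least-squares fit (Eqs. 10-11)
    A = np.array([[np.sum(ix * ix), np.sum(ix * iy)],
                  [np.sum(ix * iy), np.sum(iy * iy)]])
    b = -np.array([np.sum(ix * it), np.sum(iy * it)])
    u, v = np.linalg.solve(A, b)
    return u, v
```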


4 EKF

A Kalman filter (KF) is a highly effective recursive data processing algorithm that can produce an optimal estimate of the unknown variables underlying a system state. It is widely applied in navigation, tracking and estimation because of its simplicity and robustness. Let us define a linear dynamic system as follows:

$$x_{k+1} = A x_k + w_k \qquad (12)$$

$$y_k = H x_k + v_k \qquad (13)$$

where x_k is the state variable at time k, y_k is the measurement at time k, A is the state transition matrix, H is the observation matrix of the linear dynamic system, and w_k and v_k are the process and measurement noise, respectively. The basic Kalman filter works in two steps for a linear system. In the prediction step, the Kalman filter gives an uncertain estimate of the current state variables using

$$\hat{x}_k^- = A \hat{x}_{k-1} \qquad (14)$$

$$P_k^- = A P_{k-1} A^{T} + Q \qquad (15)$$

where \hat{x}_k^- is the prediction of the state variables at time k, \hat{x}_{k-1} is the estimate of the state variables at time k-1, P_k^- is the prediction of the error covariance at time k, P_{k-1} is the estimate of the error covariance at time k-1, and Q is the covariance matrix of w_k. In the correction step, the estimates are updated by compensating for the difference between the measurement from the system and the previous prediction:

$$K_k = P_k^- H^{T} \bigl( H P_k^- H^{T} + R \bigr)^{-1} \qquad (16)$$

$$P_k = P_k^- - K_k H P_k^- \qquad (17)$$

$$\hat{x}_k = \hat{x}_k^- + K_k \bigl( y_k - H \hat{x}_k^- \bigr) \qquad (18)$$

where K_k is the Kalman gain, R is the covariance matrix of v_k, and \hat{x}_k is the updated estimate of the state variables. With the mathematical model above, the Kalman filter works very well for a linear dynamic system, but it is difficult to apply to nonlinear systems in practical engineering applications. In this experiment, with the initial position, instantaneous velocity, acceleration and direction all unknown, we simply wave the laser pointer aimlessly, so the trajectory of the laser spot is obviously nonlinear. Therefore, the traditional KF has to be adjusted to fit the nonlinear situation. Many extensions and generalizations of the Kalman filter have been developed to mitigate the effects of nonlinearities. The EKF simply approximates a linear model for the nonlinear dynamic system and exploits the Kalman filter theory at each state [12–14]. In the general nonlinear system model, nonlinear functions replace the parameters (A, H) of the linear system:

$$x_{k+1} = f(x_k) + w_k \qquad (19)$$

$$y_k = h(x_k) + v_k \qquad (20)$$

In the prediction and correction steps, the EKF estimates the state variables by

$$\hat{x}_k^- = f(\hat{x}_{k-1}) \qquad (21)$$

$$\hat{x}_k = \hat{x}_k^- + K_k \bigl( y_k - h(\hat{x}_k^-) \bigr) \qquad (22)$$

Also, instead of A and H in the two steps, the Jacobian matrices of the nonlinear functions f(\cdot) and h(\cdot) are computed as

$$A \approx \left. \frac{\partial f}{\partial x} \right|_{\hat{x}_k} \qquad (23)$$

$$H \approx \left. \frac{\partial h}{\partial x} \right|_{\hat{x}_k} \qquad (24)$$

where \partial/\partial x denotes the partial derivative with respect to x. When the consecutive linearizations approximate the system smoothly, the EKF converges well. The EKF works recursively, so only the last state estimate is needed rather than the whole history, and its computational cost is affordable, which makes it very suitable for our case.
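The sketch below shows a generic EKF predict/update cycle corresponding to Eqs. (19)–(24). It is a hedged illustration only: the paper does not specify its state vector, the functions f and h, or the noise covariances, so all of these are left as arguments.

```python
import numpy as np

class SimpleEKF:
    """Generic EKF cycle for x_{k+1} = f(x_k) + w_k, y_k = h(x_k) + v_k."""

    def __init__(self, x0, P0, Q, R, f, h, F_jac, H_jac):
        self.x, self.P = x0, P0                  # state estimate and error covariance
        self.Q, self.R = Q, R                    # covariances of w_k and v_k
        self.f, self.h = f, h                    # nonlinear transition / observation
        self.F_jac, self.H_jac = F_jac, H_jac    # their Jacobians (Eqs. 23-24)

    def predict(self):
        A = self.F_jac(self.x)
        self.x = self.f(self.x)                          # Eq. (21)
        self.P = A @ self.P @ A.T + self.Q               # Eq. (15) with A = df/dx
        return self.x

    def update(self, y):
        H = self.H_jac(self.x)
        S = H @ self.P @ H.T + self.R
        K = self.P @ H.T @ np.linalg.inv(S)              # Eq. (16)
        self.x = self.x + K @ (y - self.h(self.x))       # Eq. (22)
        self.P = self.P - K @ H @ self.P                 # Eq. (17)
        return self.x
```

For laser spot tracking, one plausible (but assumed) choice is a state [x, y, v_x, v_y] with a constant-velocity f and a position-only h, so that a missed detection can be bridged by running predict() without an update.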

5 Experiments

In this section, the performance of the technique is demonstrated through experiments on live video frames. A web camera was used to record several short videos, each containing one laser spot moving against different backgrounds outside our lab, at a recording speed of 25 frames per second. Each frame has 720 × 1280 pixels; a few unstable frames at the beginning and at the end are discarded, and 300 frames are picked up from each video. The spot is projected by a 640–660 nm red laser pointer with a tiny output power (

> r2, the time complexity of this method is O(n). The algorithm is implemented in MATLAB 2014b. The operating system is Windows 7, the memory is 4 GB, the CPU is an Intel Core i5, and the average processing speed is 47.34 s per million pixels.


Fig. 8. Comparison of binarization results on three sample document images (a)–(c): original image and the results of Otsu, Yen, Sauvola, Phansalkar, and the proposed method.


5 Conclusion

There are many challenges in the binarization of historical documents. In order to obtain a better binarization result, this paper works on two aspects: (1) using edge information in the binarization process; (2) adding contrast information to the calculation of the local threshold. Firstly, in the edge extraction step, we exploit the fact that pixels on a single edge are connected to each other, which effectively alleviates the contradiction between denoising and maintaining the integrity of the edge, and yields more accurate text edges. Secondly, to address the uneven brightness distribution in historical documents, the contrast value at the edges is added to the calculation of the local threshold, so that the obtained local threshold better reflects the brightness distribution at the current position. Finally, binarization is carried out using the local thresholds and the edge positions. The proposed method can suppress noise while retaining low-contrast foreground information. All kinds of degradation in low-quality document images, such as ink stains, page defacement and complex backgrounds, can be processed effectively. Compared with other methods on the DIBCO databases, the proposed method performs well in terms of FM, p-FM, PSNR and DRD. Further work will focus on two aspects: (1) adaptive determination of the neighborhood size; (2) optimization of the algorithm's efficiency.

Acknowledgments. This work is supported by the National Natural Science Foundation of China (No. 61772430) and by the Gansu Provincial first-class discipline program of Northwest Minzu University. The work is also supported by the Program for Leading Talent of the State Ethnic Affairs Commission.


Development and Laboratory Testing of a Multipoint Displacement Monitoring System

Darragh Lydon1, Su Taylor1, Des Robinson1, Necati Catbas2, and Myra Lydon1

1 School of Natural and Built Environment, Queen's University Belfast, Belfast BT9 5AG, UK
{Dlydon01,s.e.taylor,des.robinson,m.lydon}@qub.ac.uk
2 University of Central Florida, Orlando, FL, USA
[email protected]

Abstract. This paper develops a synchronized, multi-camera, contactless vision-based multiple point displacement measurement system using wireless action cameras. Displacement measurements can provide valuable insight into the structural condition and service behavior of bridges under live loading. Traditional means of obtaining displacement readings include displacement gauges or GPS-based systems, which can be limited in terms of accuracy and access. Computer vision systems can provide a promising alternative means of displacement calculation; however, existing systems are limited in scope by their inability to reliably track multiple points on a long-span bridge structure. The system introduced in this paper provides a low-cost, durable alternative which is rapidly deployable. Commercial action cameras were paired with an industrially validated solution for synchronization to provide multiple point displacement readings. The performance of the system was evaluated in a series of controlled laboratory tests. This included the development of displacement identification algorithms which were rigorously tested and validated against fiber optic displacement measurements. The results presented in this paper provide the knowledge for a step change in the application of current vision-based structural health monitoring (SHM) systems, which can be cost prohibitive, and provide a rapid method of obtaining data which accurately relates to measured bridge deflections.

Keywords: Computer vision · Structural health monitoring · Bridge monitoring

1 Introduction

The civil infrastructure of a nation is a key indicator of economic growth and productivity; possession of a reliable transport infrastructure facilitates production, tourism and many other commercial interests [1]. In 2016, the United Kingdom (UK) government invested over £18.9 billion in infrastructure, with over 85% of this figure allocated to transport infrastructure [2]. Facilitating over 90% of motorized passenger
travel and 65% of domestic freight, the road network is the most popular means of transport in the UK. The road network is under continuous levels of stress from loading and environmental impacts whose effects can be detrimental to the integrity of the network. UK transport infrastructure is rated as second worst among the G7 countries [3], with the bridge maintenance backlog valued at £6.7bn in 2019 [4]. Bridges are considered a critical element in any road system, any failure or efficiency degradation in such infrastructure results in a negative effect on the daily life of the population and over the long term slows down the economic and social progress of the country. In the UK, the budget for core bridge maintenance has been reduced by up to 40% in recent years [5]. This problem is extensible to most western countries. For instance, according to the 2017 Infrastructure report card, the corresponding figure in the USA is $123bn resulting in 188 million daily trips across structurally deficient bridges [6]. This budgetary shortfall means that cost effective and accurate structural information on bridge condition is becoming increasingly important. Bridges must be monitored periodically to avoid dangerous incidents and ensure public safety. Efficient monitoring of bridge structures can reduce the short and long term cost of maintenance on the road network, as improved information on bridge stock condition will allow for prioritization of budget allocation for necessary improvements. According to literature [7, 8] the prevalent method for bridge monitoring continues to be visual inspections which can be highly subjective and differ depending on climatic conditions. In addition, bridge inspection leads to overly conservative decision making, putting further strain on budgets as unnecessary maintenance is ordered on functional bridge stock. This problem would not be solved by hiring additional inspectors, as the subjective nature of visual inspections would not result in a uniform rating approach across the entire bridge network. It is necessary to facilitate the work of existing engineers; in this sense, technology allows for a more scalable and consistent decision-making solution for bridge monitoring. Structural Health Monitoring (SHM) systems provide a valuable alternative to traditional inspections and overcome many of the previous limitations. SHM can provide an unbiased means of determining the true state of our ageing infrastructure. Sensor systems are used to monitor bridge deterioration and provide real information on the capacity of individual structures, hence extending the safe working life of bridges and improving safety. In particular, monitoring of the displacement of a structure under live loading provides valuable insight into the structural behavior and can provide an accurate descriptor of bridge condition. Testing under live loading conditions also removes the requirement for bridge closure, an expensive undertaking which has knock on effects on other bridge structures as vehicles are diverted to alternative routes, increasing their loading. This paper presents a Computer Vision method for SHM. The basic principle involves using a camera to monitor the behavior of a bridge as it experiences various effects – traffic load, varying temperature etc. Once it has been suitably post processed, the data collected from a camera monitoring system can be used to generate statistical information on bridge condition. 
Computer vision methods are used in this research as they are low cost, accurate and easily deployable in the field in comparison to traditional instrumentation [9]. Previous research in this area has employed cameras to determine displacement, strain, vibration and response to temperature loads of bridges with varying span lengths [10–13]. The bulk of the existing research is centered on single camera systems [14–16]; this means that a trade-off between pixel accuracy and monitoring accuracy must be taken into consideration on bridges with a longer span. To reduce this factor, multi-point/multi-camera systems for displacement monitoring have been explored in the literature, but existing multi-camera methods require extensive cabling or impose range limitations on the system [17–19]. This paper expands upon the current state of the art by successfully demonstrating a wireless, long-range, fully time-synchronized application of displacement monitoring using computer vision. The findings presented in this work show that, with the system development methodology presented in this paper, displacement monitoring using computer vision methods can be used on long span bridges.

2 Laboratory Validation

2.1 Displacement Accuracy Validation

The accuracy of this low-cost vision-based system has been previously presented by the authors [9]. However, prior to the field trial of the multipoint displacement system, laboratory trials were conducted to confirm the accuracy of the time synchronization between the vision sensors. This element is critical for determining the complete displacement profile of long span bridges. The initial trial involved the use of two GoPro cameras, which will later be built on to demonstrate the ability to integrate an unlimited number of vision sensors into the proposed system. The time synchronization method presented in this paper has been applied in several commercial projects, specifically in the TV and film production industry, but its application to SHM has never been explored in previous research. The method utilizes SyncBac hardware in conjunction with the GoPro camera system, is verified through a series of experiments, and has been proven to be suitable for use in laboratory and field conditions [20]. The hardware has been designed as an accessory for the GoPro camera system and has a maximum scanning capability of 25 frames per second (fps), which determined the overall scanning frequency for the trial. In this test the accuracy of vision-based displacement is confirmed via a fiber optic displacement gauge [21]. The accuracy of the time synchronization is determined by using all three instruments to monitor the displacement of a single location. The displacement apparatus is presented in Fig. 1; a target was attached to the monitoring location, which was manually displaced during the test. The distance between the monitoring location and the cameras was 4.3 m, and the pixel-mm conversion was established by comparing a physical measurement in the view frame of the camera to the equivalent distance in pixels in the image scene. The Root Mean Square Error (RMSE) and correlation coefficient (CC) in comparison to the validation sensor and between each camera are shown in Table 1, with the time-displacement series for all instrumentation plotted in Fig. 2.
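The accuracy metrics in Table 1 are standard; the snippet below (illustrative only, with hypothetical variable and function names) shows how the pixel-to-mm conversion, RMSE and correlation coefficient between a camera displacement series and the fibre optic reference could be computed.

```python
import numpy as np

def pixel_to_mm(displacement_px, known_length_mm, known_length_px):
    """Scale pixel displacements using a physical measurement visible in the camera view."""
    return np.asarray(displacement_px) * (known_length_mm / known_length_px)

def rmse(camera_mm, reference_mm):
    camera_mm, reference_mm = np.asarray(camera_mm), np.asarray(reference_mm)
    return np.sqrt(np.mean((camera_mm - reference_mm) ** 2))

def correlation_coefficient(camera_mm, reference_mm):
    return np.corrcoef(camera_mm, reference_mm)[0, 1]
```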


Fig. 1. Setup of displacement apparatus for multiple cameras at single monitoring location trial

Table 1. Results: multiple cameras at a single monitoring location.
  GoPro 1 RMSE vs FOS (mm): 0.1533
  GoPro 1 CC vs FOS: 0.9914
  GoPro 2 RMSE vs FOS (mm): 0.0928
  GoPro 2 CC vs FOS: 0.9975
  GoPro 1 CC vs GoPro 2: 0.9869

Fig. 2. Results of multiple cameras at single node trial

2.2 Laboratory Testing to Obtain Multipoint Displacement of Long-Medium Span Bridge

On confirmation of the synchronization accuracy of the system, further cameras were added to increase the field capability to monitor long-medium span bridges. This laboratory trial was carried out at the University of Central Florida using a test structure representing a 4-span bridge, as shown in Fig. 3. This provided an opportunity to test the multipoint system in a controlled environment prior to the full-scale implementation presented in this paper. For this test series the inner spans were monitored and the two outer spans (shown in blue in Fig. 3) acted as approach spans. The spans of the sections are 120 cm, 304.8 cm, 304.8 cm and 120 cm, respectively, and the overall width is 120 cm. The 3.18 cm thick steel deck is supported by HSS 25 × 25 × 3 girders and is fixed in place using four quarter-inch bolts representing simply supported end conditions.

Fig. 3. 4-Span structure in lab and schematic of structure

A moving load was applied to the surface of the bridge structure and a number of nodes were monitored as the load travelled along the inner spans of the structure. The moving load was repeated for a series of different node sets (Fig. 4) and the results showing the perfectly synchronized captured response are presented in Fig. 5. The vertical displacement of the selected node is shown on the Y axis and has been plotted against time as the load crossed the structure. As the accuracy of the system in terms of displacement calculation has been rigorously tested in the authors' previous work, it was not the focus of this test series. However, there has been substantial testing of this bridge structure, and the response captured from the vision-based system in this paper corresponds to the expected behavior [22].

Fig. 4. Plan layout of internal spans of test structure showing node locations (1–16)

The results presented in this section provided confidence on the accuracy of the synchronization of the system in a controlled environment. The following section details a field trial which was carried out to assess its robustness for real bridge monitoring applications.


Fig. 5. Vertical displacement of bridge structure under moving load. The panels plot displacement (mm) against time for Run 1 (Nodes 1–4), Run 2 (Nodes 1, 5, 6, 7) and Run 3 (Nodes 1, 8, 9, 10).


Fig. 5. (continued) Run 4 (Nodes 1, 11, 12, 13).

3 Field Testing

The field testing of the system was carried out in Northern Ireland, where a three-span reinforced concrete bridge with an overall span of 62.2 m was selected for testing. The bridge, known as Governors Bridge, was constructed in 1973 and crosses the River Lagan in the south of Belfast City. The image shown in Fig. 6 is taken from a concrete pier south of the structure. This pier provided an ideal location for setting up the cameras, and the west span of the bridge was chosen for monitoring; the monitoring distance was 11.3 m and the monitoring points are highlighted in Fig. 6.

Fig. 6. Cameras and monitoring locations from Governors bridge trial


Monitoring Points 1, 2 and 3 (MP1, MP2 and MP3) correspond to the ¼, ½ and ¾ span of the support beam, respectively. Natural features on the bridge were utilized for feature extraction and tracking, with all three cameras set to record at 25 fps. The recordings from each camera were analyzed using the algorithm previously developed by the authors [9], and the results from a number of vehicles passing the bridge are presented in Fig. 7.
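The displacement algorithm itself is the authors' earlier work [9] and is not reproduced here. The sketch below is only a generic illustration of tracking natural features between frames with OpenCV and converting the pixel motion to millimetres with a scale factor such as the one described in Sect. 2.1; the function name, region-of-interest handling and parameter values are assumptions.

```python
import cv2
import numpy as np

def track_natural_features(video_path, roi, mm_per_px):
    """Cumulative vertical displacement (mm) of natural features inside a region of interest."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Detect strong corner features inside the monitored region (x, y, w, h)
    x, y, w, h = roi
    mask = np.zeros_like(prev)
    mask[y:y + h, x:x + w] = 255
    pts = cv2.goodFeaturesToTrack(prev, maxCorners=50, qualityLevel=0.01,
                                  minDistance=5, mask=mask)

    displacement_mm = [0.0]
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        new_pts, st, err = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None)
        good = st.ravel() == 1
        # Mean vertical motion of the tracked features, converted to millimetres
        dy = np.mean(new_pts[good, 0, 1] - pts[good, 0, 1])
        displacement_mm.append(displacement_mm[-1] + dy * mm_per_px)
        prev, pts = gray, new_pts
    cap.release()
    return displacement_mm
```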

Fig. 7. Multipoint monitoring of Governors Bridge: displacement (mm) plotted against time for GoPro 1 (¾ span), GoPro 2 (¼ span) and GoPro 4 (mid-span).

The results show that the response of the bridge has been captured successfully. A recording error with the camera intended for traffic monitoring meant that images of vehicles could not be captured to correspond with the measured displacement events. This initial test of the field applicability of the multipoint system provided confidence in the potential for the use of this system for field monitoring of medium to long span bridges.

4 Discussion and Conclusions

This paper has presented the results from a number of lab trials which validate the development and accuracy of a wireless, fully synchronized, contactless vision-based bridge monitoring system. In the initial trial the system was shown to accurately obtain single point displacement measurements from two independent camera locations. The system was then further developed to include additional cameras, and the results show that in laboratory conditions the system was capable of capturing the response of a structure representing a long-medium span bridge under a moving load. The success of the laboratory trials instigated field testing, which was carried out on a bridge in Northern Ireland. The results from this field test confirmed the laboratory testing was repeatable in the field, and successful multipoint monitoring of a bridge structure was presented. This verifies that the multipoint implementation of our system can perform at a high level of accuracy in real-world field conditions. In summary, the work carried out in the experimental trials gave confidence in the accuracy of the system. This allowed for rapid deployment on site and minimized the equipment needed for site measurement.

References 1. Romp, W., De Haan, J.: Public capital and economic growth: a critical survey. Perspektiven der Wirtschaftspolitik 8(Spec. Issue), 6–52 (2007). https://doi.org/10.1111/j.1468-2516. 2007.00242.x 2. Office of National Statistics: Developing new statistics of infrastructure: August 2018 (2018) 3. World Economic Forum: The global competitiveness Report 2018 - Reports - World Economic Forum (2018) 4. RAC Foundation: Council road bridge maintenance in Great Britain. https://www. racfoundation.org/media-centre/road-bridge-maintenance-2400-council-bridges-substandard-press-release. Accessed 14 Mar 2018 5. OECD: Transport infrastructure investment and maintenance spending (2016) 6. ACSE: Report card for America’s infrastructure (2017) 7. Graybeal, B.A., Phares, B.M., Rolander, D.D., Moore, M., Washer, G.: Visual inspection of highway bridges. J. Nondestruct. Eval. 21(3), 67–83 (2002) 8. See, J.E.: SANDIA REPORT Visual Inspection: A Review of the Literature (2012) 9. Lydon, D., et al.: Development and field testing of a time-synchronized system for multipoint displacement calculation using low cost wireless vision-based sensors. IEEE Sens. J., 1 (2018) 10. Lee, J.J., Shinozuka, M.: Real-time displacement measurement of a flexible bridge using digital image processing techniques. Exp. Mech. 46(1), 105–114 (2006) 11. Jin, Y., Feng, M., Luo, T., Zhai, C.: A sensor for large strain deformation measurement with automated grid method based on machine vision. In: International Conference on Intelligent Robotics and Applications, pp. 417–428. Springer, Heidelberg (2013) 12. Kromanis, R., Kripakaran, P.: Predicting thermal response of bridges using regression models derived from measurement histories. Comput. Struct. 136, 64–77 (2014) 13. Goncalves, P.B., Jurjo, D.L.B.R., Magluta, C., Roitman, N.: Experimental investigation of the large amplitude vibrations of a thin-walled column under self-weight. Struct. Eng. Mech. 46(6), 869–886 (2013) 14. Khuc, T., Catbas, F.N.: Computer vision-based displacement and vibration monitoring without using physical target on structures. Struct. Infrastruct. Eng. 13(4), 505–516 (2017) 15. Feng, D., Feng, M.Q., Ozer, E., Fukuda, Y.: A vision-based sensor for noncontact structural displacement measurement. Sensors 15(7), 16557–16575 (2015) 16. Lages Martins, L.L., Rebordão, J.M., Silva Ribeiro, A.S.: Structural observation of longspan suspension bridges for safety assessment: Implementation of an optical displacement measurement system. In: Journal of Physics: Conference Series, vol. 588, no. 1, p. 12004 (2015) 17. Feng, D., Feng, M.Q.: Vision-based multipoint displacement measurement for structural health monitoring. Struct. Control Heal. Monit. 23(5), 876–890 (2016) 18. Ho, H.-N., Lee, J.-H., Park, Y.-S., Lee, J.-J.: A synchronized multipoint vision-based system for displacement measurement of civil infrastructures. Sci. World J. 2012, 1–9 (2012)


19. Park, J.-W., Lee, J.-J., Jung, H.-J., Myung, H.: Vision-based displacement measurement method for high-rise building structures using partitioning approach. NDT E Int. 43(7), 642– 647 (2010) 20. Timecode Systems: SyncBac Pro Home | Timecode Systems (2016). https://www. timecodesystems.com/syncbac-pro/. Accessed 09 Mar 2018 21. Micron Optics (2019). http://www.micronoptics.com/product/long-range-displacementgage-os5500/ 22. Celik, O., Terrell, T., Gul, M., Catbas, F.N.: Sensor clustering technique for practical structural monitoring and maintenance. Struct. Monit. Maint. 5(2), 273–295 (2018)

Quantitative Comparison of White Matter Segmentation for Brain MR Images

Xianping Li and Jorgue Martinez

University of Missouri-Kansas City, Kansas City, MO 64110, USA
[email protected]

Abstract. The volume of white matter in a brain MR image is important for medical diagnosis; therefore, it is critical to obtain an accurate segmentation of the white matter. We quantitatively compare the up-to-date versions of three software packages, SPM, FSL, and FreeSurfer, for brain MR image segmentation, and then select the package that performs the best for white matter segmentation. Dice index (DSC), Hausdorff distance (HD), and modified Hausdorff distance (MHD) are chosen as the metrics for comparison. A new computational method is also proposed to calculate HD and MHD efficiently.

Keywords: Image segmentation · Brain MRI · White matter · SPM · FSL · FreeSurfer · Dice index · Hausdorff distance

1 Introduction

Modern imaging techniques such as ultrasound, computed tomography (CT) and magnetic resonance imaging (MRI) offer radiologists high quality digital images revealing patients' internal tissues or organs non-invasively. Manually reading a large amount of image data, detecting abnormalities or making accurate measurements can be very time-consuming. Therefore, computer-aided diagnostic (CAD) systems have become necessary tools in many clinical settings to provide radiologists a second opinion when making final clinical decisions regarding disease diagnosis, monitoring of lesion activities, treatment assessment and surgical planning. Processing medical images is not an easy task because medical images are complex in nature. Firstly, different soft tissues may have a small range of difference in image intensities (low contrast). Secondly, anatomical structures vary in size and shape. Thirdly, lesions (i.e. abnormal tissues) are often subtle, especially in the early stage of disease. Lastly, image intensities can be distorted by noise and other imaging artifacts. It is thus necessary to pre-process images before extracting relevant information for further computer-aided diagnostic image processing and analysis. Segmentation is one of the most common steps in the pre-processing pipeline. Here we focus on soft tissue segmentation, especially white matter, for its clinical importance. Segmentation is to identify a set of voxels satisfying certain criteria that define an anatomical structure, such as an organ or a type of tissue. Meaningful information can then be extracted from the output of segmentation, such as volume/shape and motion tracking of organs, for detecting abnormalities and assisting surgery and treatment planning. In the past few decades, many effective algorithms have been proposed for segmentation. However, boundaries between organs/tissues may not be clearly defined in medical images, due to noise and the partial volume effect, so achieving highly accurate segmentation in medical images still remains challenging. There are several software packages that are able to perform image segmentation. In this paper, we compare three widely used software packages with their up-to-date versions: Statistical Parametric Mapping (SPM) [1] version 12.0, FMRIB Software Library (FSL) [2] version 5.0, and FreeSurfer [3] version 6.0. Similar comparison efforts were made in [4] and [5], which only focused on the overall performance of segmentation including white matter (WM), gray matter (GM) and/or cerebrospinal fluid (CSF). On the other hand, the software packages have also been updated significantly in recent years. In this work, we use the up-to-date versions of the software packages to segment brain MR images into WM, GM and CSF, but only focus on the accuracy of the white matter segmentation.

2 Software Packages and Data Source

In this section, we briefly introduce the software packages to be investigated. The source of the MR image data used in this work is also described.

2.1 SPM

The latest version of SPM is 12.0 released on October 1st, 2014. It provides a major update to the software packages, containing substantial theoretical, algorithmic, structural and interface enhancements. Specifically, the new version uses additional tissue classes, allows multi-channel segmentation such as T2 and PD-weighted MR images, and incorporates a more flexible image registration component. Re-scaling of the tissue probability maps is also re-introduced in the new version. More details can be found on the software website [1]. In our investigation, we use the default parameters in SPM12 for segmentation with two exceptions. One is the parameter “cutoff for Bias FWHM” whose value is changed from “60 mm” to “No correction” if it provides better results for the investigated images. The other is the “sampling distance” and its value is changed from “3 mm” to “1 mm”.

2.2 FSL

The latest version of FSL is 5.0.11, released on March 23, 2017. To minimize the possible confusion of CSF with the edge of the skull, the Brain Extraction Tool (BET) in FSL is used on all T1-weighted brain MR images to separate the tissues from the skull. We used the FMRIB Automated Segmentation Tool (FAST) in FSL to segment the tissues into 3 classes. The segmentation is performed by modeling the intensity of each tissue class with a mixture of Gaussian distributions, each with its own mean and variance. For further segmentation smoothness, each voxel is labeled with respect to its local neighbors, utilizing a hidden Markov random field and an associated Expectation-Maximization algorithm. FAST is able to segment T1, T2, and PD-weighted MR brain images into WM, GM, and CSF tissue types; however, we only focus on T1 images and white matter segmentation in this work.

2.3 FreeSurfer

The latest version of FreeSurfer is 6.0, released on January 23, 2017. The software package provides a full processing stream for structural MRI data including skull stripping, B1 bias field correction, reconstruction of cortical surface models, labeling of regions on the cortical surface as well as subcortical brain structures, and nonlinear registration of the cortical surface of an individual with a stereotaxic atlas [3]. Since we only focus on white matter segmentation, we used the command "recon-all" with default parameters for the segmentation and only extracted and analyzed the results for white matter.

2.4 Image Source

Due to the high variability of lesion location, shape, size and lesion image intensities, in addition to MR imaging artifacts, we decided to compare the software packages with respect to normal white matter segmentation first. The accuracy of segmentation algorithms can be evaluated through volume and spatial agreement measures with respect to the ground truth. We acquired our ground truth images from the BrainWeb Simulated Brain Database [6,7], which contains simulated brain MRI data based on two anatomical models: normal and multiple sclerosis (MS). In the current work, the images are from the Normal Brain dataset. The data were obtained from BrainWeb with T1 modality and 1 mm slice thickness. Two sets of data were investigated: one set with different noise levels while fixing the intensity non-uniformity at RF = 0%, and the other set with different levels of RF and the noise fixed at 3%. The image dimension is 181×217×181. The downloaded files were in ".mnc" format and were converted into ".nii" format to serve as the input for segmentation. The ITK-SNAP tool was used to display the ".nii" files.
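The paper does not state how the ".mnc" files were converted to ".nii"; one possible approach, shown below as an assumption, is to use the nibabel library, which reads MINC volumes directly. The example filename is hypothetical.

```python
import nibabel as nib

def mnc_to_nii(mnc_path, nii_path):
    """Convert a BrainWeb MINC volume to NIfTI for use as segmentation input."""
    minc = nib.load(mnc_path)                          # nibabel reads .mnc directly
    img = nib.Nifti1Image(minc.get_fdata(), minc.affine)
    nib.save(img, nii_path)

# Example: mnc_to_nii("t1_normal_1mm_pn3_rf20.mnc", "t1_normal_pn3_rf20.nii")
```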

3 Comparison Metrics

To compare the accuracy of the segmentations from the different software packages, we consider three metrics that are commonly used in the literature - the Dice similarity coefficient (DSC), the spatially based Hausdorff distance (HD) [8], and the modified Hausdorff distance (MHD) [9]. These metrics allow us to compare our segmented MR brain images to a ground truth model quantitatively [10].

3.1 Dice Index

The Dice similarity coefficient was developed independently by Thorvald Sorensen in 1948 and Lee Raymond Dice in 1949. It can quantitatively measure spatial overlap and has been applied clinically to validate segmentation of white matter in MRIs. The coefficient is defined as

$$\mathrm{DSC}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|} \qquad (1)$$

where |A| and |B| are the cardinalities of the two sets and ∩ is the intersection. The values of DSC range over [0, 1], with 0 representing no binary overlap and 1 representing perfect binary overlap. A larger value of DSC indicates better segmentation quality. DSC has been used in many works, including [11–13].
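A direct NumPy implementation of Eq. (1) for two binary white matter masks could look as follows (an illustrative sketch, not the authors' code):

```python
import numpy as np

def dice(seg, truth):
    """Dice similarity coefficient (Eq. 1) for two binary 3D masks."""
    seg = np.asarray(seg, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    intersection = np.logical_and(seg, truth).sum()
    return 2.0 * intersection / (seg.sum() + truth.sum())
```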

3.2 Hausdorff Distance (HD)

The Hausdorff distance (HD) is used as a measure of dissimilarity between two images and has been used to evaluate the quality of image segmentation [12–15] as well as image registration [8, 16]. It is defined as

$$\mathrm{HD}(A, B) = \max\Bigl(\max_{a \in A} \min_{b \in B} d(a, b),\; \max_{b \in B} \min_{a \in A} d(a, b)\Bigr) \qquad (2)$$

where d(a, b) is the distance between points a and b, for example the Euclidean distance. The HD is generally sensitive to noise and outliers, so different variations of HD have been developed to address this issue [9, 17]. The modified Hausdorff distance (MHD) has been shown to be stable and less sensitive to outliers than the HD [9]. It is defined as

$$\mathrm{MHD}(A, B) = \max\Bigl(\frac{1}{|A|} \sum_{a \in A} \min_{b \in B} d(a, b),\; \frac{1}{|B|} \sum_{b \in B} \min_{a \in A} d(a, b)\Bigr) \qquad (3)$$

Smaller values of HD or MHD indicate better segmentation quality. To calculate HD and MHD efficiently, a modified version of the nearest neighbor algorithm was proposed in [18]. The major idea was to build a 3D cell grid on the point cloud so that the search for the nearest neighbor is restricted to a subset of the cell grid. In our computations, we follow a similar idea but with a different approach.

Smaller values of HD or MHD indicates better segmentation quality. To calculate HD and MHD efficiently, a modified version of the nearest neighbor algorithm was proposed in [18]. The major idea was to build a 3D cell grid on the point cloud and the search for nearest neighbor is restricted to a subset of the cell grid. In our computations, we follow the similar idea but with a different approach.

Quantitative Comparison of WM Segmentation

643

Since A and B have exactly the same dimensions, we first locate the points that are different in the sets A and B. For example, some voxels are labeled as white matter in set A but are considered background in set B; the locations of those voxels form a set denoted A_d. Similarly, for the voxels labeled as white matter in B but as background in A, we denote the corresponding set B_d. We then compute the HD between A_d and B_d. Secondly, we construct 3D Delaunay meshes for the points in A_d and B_d, denoted "meshA" and "meshB", respectively. With the mesh structure, the search for the nearest neighbor becomes very efficient. Thirdly, for all points in meshA, we search for the nearest points in meshB and find the maximum distance $hd_A = \max_{a \in A_d} \min_{b \in B_d} d(a, b)$ or the average distance $mhd_A = \frac{1}{|A_d|} \sum_{a \in A_d} \min_{b \in B_d} d(a, b)$. Then we swap meshA and meshB and perform the same computations to obtain $hd_B$ and $mhd_B$. Lastly, the HD and MHD are computed as $HD = \max(hd_A, hd_B)$ and $MHD = \max(mhd_A, mhd_B)$, respectively.
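The sketch below reproduces the overall flow of this computation in Python. It is an approximation of the method described above: a SciPy k-d tree is used for the nearest-neighbor queries instead of the authors' Delaunay mesh structure, which changes the data structure but not the resulting distances.

```python
import numpy as np
from scipy.spatial import cKDTree

def hd_mhd(seg, truth):
    """HD and MHD between two same-sized binary masks via their difference sets."""
    seg = np.asarray(seg, dtype=bool)
    truth = np.asarray(truth, dtype=bool)

    # A_d: white matter in seg but background in truth; B_d: the converse
    A_d = np.argwhere(seg & ~truth).astype(float)
    B_d = np.argwhere(~seg & truth).astype(float)
    if len(A_d) == 0 or len(B_d) == 0:
        return 0.0, 0.0

    # Nearest-neighbor distances in both directions (k-d tree instead of Delaunay mesh)
    d_ab, _ = cKDTree(B_d).query(A_d)   # min_b d(a, b) for every a in A_d
    d_ba, _ = cKDTree(A_d).query(B_d)   # min_a d(a, b) for every b in B_d

    hd = max(d_ab.max(), d_ba.max())        # Eq. (2) restricted to A_d, B_d
    mhd = max(d_ab.mean(), d_ba.mean())     # Eq. (3) restricted to A_d, B_d
    return hd, mhd
```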

4 Comparison Results

In this section, we present the results obtained from the three software packages for the two sets of simulated T1-weighted brain MR image data. The dimensions of the images are 181×217×181. Figure 1 shows three slices in different directions of the image with 3% noise level and 20% RF level.


Fig. 1. A slice of simulated T1-weighted brain MR image in different directions [7]. (a) x-direction; (b) y-direction; (c) z-direction.

Figure 2 shows the ground truth white matter segmentation of the slice in Fig. 1(a) and the corresponding results obtained from the three software packages.

4.1 Images with Different Noise Levels

For this set of image data, the intensity non-uniformity is fixed as RF = 0%. The noise levels considered are 0%, 1%, 3%, 5%, 7%, and 9%. The white matter segmentations using SPM, FSL and FreeSurfer are compared with the ground truth in Fig. 2(a). The values of DSC, HD, and MHD are presented in Fig. 3.


Fig. 2. White matter ground truth and segmentations obtained from the three software packages for Fig. 1(a). (a) ground truth; (b) segmentation using SPM; (c) segmentation using FSL; (d) segmentation using FreeSurfer.

As indicated by the Dice index in Fig. 3(a), the segmentation obtained using FSL performs the best among the three software packages. The segmentation quality decreases for both FSL and SPM as the noise level in the original image increases, while FreeSurfer is not affected significantly by the noise. In other words, FreeSurfer can handle noisy data relatively well. When the noise level is more than 7%, the segmentation qualities are comparable among FreeSurfer, SPM and FSL. However, for images with lower noise levels, the Dice index for FreeSurfer is much lower than that for FSL and SPM. The HD and MHD values for the three software packages are shown in Fig. 3(b) and (c), respectively, and the results are comparable. In terms of MHD, FreeSurfer performs the best for images with lower noise levels and FSL performs the best for images with higher noise levels, but the differences are not significant.


Fig. 3. Quantitative comparison of white matter segmentation for brain MR images with different noise levels. (a) Dice index (DSC); (b) Hausdorff distance (HD); (c) modified Hausdorff distance (MHD).

4.2 Images with Different Intensity Non-uniformities

For this set of image data, the noise level is fixed as 3%. Three intensity nonuniformity values are considered: RF = 0%, 20%, and 40%. The results are shown in Fig. 4.


Fig. 4. Quantitative comparison of white matter segmentation for brain MR images with different intensity non-uniformities. (a) Dice index (DSC); (b) Hausdorff distance (HD); (c) modified Hausdorff distance (MHD).

As can be seen, for fixed noise level, the DSC and MHD values do not change significantly with respect to the change of RF values. All the three metrics show that FSL performs the best among the three software packages at 3% noise level.

5 Conclusions

In this paper, we have considered the white matter segmentation of brain MR images due to its importance in medical diagnosis. The segmentation qualities of three commonly used software packages are compared with their up-to-date versions: SPM v12, FSL v5, and FreeSurfer v6. Simulated brain MR image data from BrainWeb are chosen for the investigation due to the availability of the ground truth segmentation. Images with different noise levels and intensity nonuniformities have been studied.


Three widely used metrics have been chosen to compare the qualities of segmentations quantitatively - Dice index (DSC), Hausdorff distance (HD), and modified Hausdorff distance (MHD). A new computational method is proposed to calculate HD and MHD efficiently, which utilizes the mesh structure formed by the triangulation of the points. Our results indicate that for images with lower noise level, FSL performs the best in terms of DSC; while for images with higher noise level, FreeSurfer starts to perform better than others. The performances of all three software packages are not affected significantly by the intensity non-uniformity at a fixed noise level.

References

1. The FIL Methods Group: Statistical Parametric Mapping (SPM). http://www.fil.ion.ucl.ac.uk/spm/. Accessed 15 Sept 2017
2. Analysis Group: FMRIB Software Library (FSL). http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/. Accessed 15 Sept 2017
3. Laboratory for Computational Neuroimaging: FreeSurfer. http://freesurfer.net. Accessed 15 Sept 2017
4. Klauschen, F., Goldman, A., Barra, V., Meyer-Lindenberg, A., Lundervold, A.: Evaluation of automated brain MR image segmentation and volumetry methods. Hum. Brain Mapp. 30, 1310–1327 (2009)
5. Kazemi, K., Noorizadeh, N.: Quantitative comparison of SPM, FSL, and brainsuite for brain MR image segmentation. J. Biomed. Phys. Eng. 4(1), 13 (2014)
6. Kwan, R.K.-S., Evans, A.C., Pike, G.B.: MRI simulation-based evaluation of image-processing and classification methods. IEEE Trans. Med. Imaging 18(11), 1085–1097 (1999)
7. BrainWeb: Simulated Brain Database: McConnell Brain Imaging Centre. http://www.bic.mni.mcgill.ca/brainweb/. Accessed 15 Sept 2017
8. Besl, P.J., McKay, N.D.: A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. Mach. Intell. 14(2), 239–256 (1992)
9. Dubuisson, M.-P., Jain, A.K.: A modified Hausdorff distance for object matching. In: Proceedings of 12th International Conference on Pattern Recognition, vol. 1, pp. 566–568 (1994)
10. Taha, A.A., Hanbury, A.: Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med. Imaging 15, 29 (2015)
11. Zou, K.H., Wells, W.M., Kikinis, R., Warfield, S.K.: Three validation metrics for automated probabilistic image segmentation of brain tumours. Stat. Med. 23(8), 1259–1282 (2004)
12. Babalola, K.O., Patenaude, B., Aljabar, P., Schnabel, J., Kennedy, D., Crum, W., Smith, S., Cootes, T.F., Jenkinson, M., Rueckert, D.: Comparison and evaluation of segmentation techniques for subcortical structures in brain MRI. In: Medical Image Computing and Computer-Assisted Intervention, 11(Pt 1), pp. 409–416 (2008)
13. Cardenes, R., de Luis-Garcia, R., Bach-Cuadra, M.A.: A multidimensional segmentation evaluation for medical image data. Comput. Methods Programs Biomed. 96(2), 108–124 (2009)
14. Morain-Nicolier, F., Lebonvallet, S., Baudrier, E., Ruan, S.: Hausdorff distance based 3D quantification of brain tumor evolution from MRI images. In: Conference Proceedings of IEEE Engineering in Medicine and Biology Society, vol. 2007, pp. 5597–5600 (2007)
15. Narendran, P., Narendira Kumar, V.K., Somasundaram, K.: 3D brain tumors and internal brain structures segmentation in MR images. Int. J. Image Graph. Sig. Process. 2012(1), 35–43 (2012)
16. Chui, H., Rangarajan, A.: A new point matching algorithm for non-rigid registration. Comput. Vis. Image Underst. 89(2), 114–141 (2003)
17. Zhao, C., Shi, W., Deng, Y.: A new Hausdorff distance for image matching. Pattern Recognit. Lett. 26, 581–586 (2005)
18. Taha, A.A., Hanbury, A.: An efficient algorithm for calculating the exact Hausdorff distance. IEEE Trans. Pattern Anal. Mach. Intell. 37(11), 2153–2163 (2015)

Evaluating the Implementation of Deep Learning in LibreHealth Radiology on Chest X-Rays

Saptarshi Purkayastha1, Surendra Babu Buddi1, Siddhartha Nuthakki1, Bhawana Yadav1, and Judy W. Gichoya2

1 Indiana University Purdue University Indianapolis, Indianapolis, IN 46202, USA [email protected]
2 Oregon Health & Science University, Portland, OR 97239, USA [email protected]

Abstract. Respiratory diseases are the dominant cause of deaths worldwide. In the US, the number of deaths due to chronic lung infections (mostly pneumonia and tuberculosis), lung cancer and chronic obstructive pulmonary disease has increased. Timely and accurate diagnosis is imperative to reduce these deaths. The Chest X-ray is a vital diagnostic tool for lung diseases, yet delays in X-ray diagnosis are common, largely because the radiographs are difficult to interpret owing to the complex visual content of superimposed anatomical structures. A shortage of trained radiologists further increases the workload and thus the delay. We integrated CheXNet, a neural network algorithm, into the LibreHealth Radiology Information System, which allows physicians to upload Chest X-rays and obtain diagnosis probabilities. The uploaded images are evaluated against labels for 14 thoracic diseases. The turnaround time for each evaluation is about 30 s, which does not affect the clinical workflow. A web service hosted by a Python Flask application is used to upload radiographs to a GPU server containing the algorithm; the use of this system is therefore not limited to clients that have their own GPU server. To evaluate the model, we randomly split the dataset into training (70%), validation (10%) and test (20%) sets. With over 86% accuracy and a turnaround time under 30 s, the application demonstrates the feasibility of a web service for machine-learning-based diagnosis of 14 lung pathologies from Chest X-rays.

Keywords: Deep learning · Radiology · LibreHealth · Chest X-ray · CheXNet

1 Introduction

1.1 The Challenge of Chronic Respiratory Disease

Chronic respiratory diseases include asthma, chronic obstructive pulmonary disease (COPD), emphysema, cystic fibrosis, tuberculosis, lung cancer and pneumonia, which affect the airways and other parts of the lungs.


The deaths due to these diseases have been increasing each year for the past 35 years in the United States of America. Many reports estimate that more than 4.6 million Americans died of chronic respiratory illness between 1980 and 2014 [1]. Late detection or misdiagnosis is the main reason for increased mortality rates, length of stay and health care costs [2]. In the past, lung diseases were detected through a physical chest examination, as it is routinely available and cost-effective [2]. In recent times, diagnosis relies mainly on laboratory reports, microbiological tests, and radiographs, mainly Chest X-ray images. The WHO notes that the Chest X-ray remains the most commonly used technique because of its wide availability, low radiation dose and relatively low cost, and it is the best available method for diagnosing pneumonia [8]. However, Chest X-ray images are hard to interpret due to the superimposed anatomical structures of the lungs and the infiltrates formed in the blood vessels.

1.2 The Promise of Machine Learning Methods for Reading Chest X-Rays

In recent years, artificial intelligence has gained tremendous importance in assisting physicians to make better clinical decisions. The CheXNet algorithm is a state-of-the-art machine learning algorithm that detects pneumonia at the level of a practicing radiologist. It is a 121-layer convolutional neural network trained on the Chest X-ray 14 dataset, containing over 100,000 frontal-view X-ray images, published by the National Institutes of Health. The CheXNet algorithm can identify 14 pathologies from a Chest X-ray. CheXNet reported an F1 score of 0.435, which exceeded the average F1 score of 0.38 achieved by four radiologists in previous studies [6]. Automated detection of diseases from Chest X-rays at this level of performance is invaluable for healthcare delivery in populations with limited access to diagnostic imaging specialists. However, there is no implementation of these algorithms in real-world electronic health record (EHR) or radiology information systems. We believe that such algorithms should be made available to physicians and health care professionals through EHRs to evaluate the feasibility of such tools in clinical workflows and the usability of such an integrated AI system. This paper describes the work of integrating the CheXNet deep learning algorithm into the LibreHealth Radiological Information System (RIS), which is an open-source distribution of an EHR system.

2 Background

Wang et al. [5] proposed a 2D ConvNet for classifying abnormalities in Chest X-ray images using a simple binary approach and published the most extensive publicly available ChestX-ray8 dataset. The dataset unified multi-label classification and disease localization techniques for identifying eight thoracic diseases. The eight diseases were used as keywords to pull information from radiology reports and the related images from a Picture Archiving and Communication System (PACS). This database consists of 108,948 frontal-view Chest X-ray images that were labeled according to their pathology keywords using natural language processing (NLP). The images were labeled by searching for diseases in the findings and impressions of the radiology reports, and were labeled as "Normal" if no disease was found. The reports were further mined using DNorm and MetaMap, and negative pathological statements were eliminated using the Bllip parser in NLTK. Quality control of the disease labeling was performed with human annotators, and a subset of reports was used as the gold standard. The images in the ChestX-ray8 database are 1024 × 1024 pixels in size and are accompanied by detailed information about their contents. To evaluate disease localization performance, some of the reports were hand labeled by adding bounding boxes (B-boxes). To detect images with multiple labels, a DCNN classification setup containing four pre-trained models (AlexNet, ResNet-50, VGGNet-16, GoogLeNet) was used to generate a heat map showing the likelihood of pathologies. The network takes the weights from these models, while the prediction and transition layers are trained as part of the DCNN. For classifying and localizing pathologies, the dataset was divided into three subgroups: training (70%), testing (20%) and validation (10%). The DCNN model was trained on a Linux server containing 4 Titan GPUs. Due to the limited memory size of the GPUs, the batch size had to be reduced to load the model.

Yao et al. [3] addressed the problem of conditional dependencies among labels, both at inference and during training, when predicting multiple labels. They used a 2D ConvNet model to process the images and a recurrent neural network (RNN) to encode the information from the model's previous predictions. They used a sigmoid output because it addressed the issue conveniently at each prediction step. They evaluated the model by splitting the dataset into training (70%), testing (20%) and validation (10%), similar to Wang et al. [5].

Rajpurkar et al. [6] designed a 121-layer dense convolutional neural network (DenseNet) trained on the Chest X-ray 14 dataset containing 112,120 frontal Chest X-ray images labeled with 14 thoracic diseases, using the same labeling process as Wang et al. [5]. They used batch normalization and dense connections to improve the flow of gradients and information through the network. The network was initialized with weights from a model pre-trained on ImageNet [4] and further trained end-to-end using the standard parameters of Adam. Following the ImageNet training setup, the images from Chest X-ray 14 were downscaled from 1024 × 1024 to 224 × 224. For training, the dataset was split into training (98,637 images), testing (420 images) and validation (6,351 images) sets. The test dataset was then labeled by four radiologists practicing at Stanford University. The model was trained with a batch size of 16, and its performance was compared with that of the radiologists: the CheXNet model exceeded the average radiologist performance in terms of the F1 metric.


Yadong Mu (2017) then improved this model by adopting a ten-crop method in both the validation and test sets. This slightly improved implementation reports the mean AUROC, and its per-class AUROC values are similar to those of the original CheXNet. We used this model and its PyTorch implementation to integrate CheXNet into the LibreHealth RIS.
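
For illustration, ten-crop evaluation of this kind could look like the sketch below; the input size, the omission of normalization and the explicit sigmoid are simplifying assumptions, not details taken from that implementation.

```python
# Illustrative ten-crop evaluation: the four corner crops, the centre crop and
# their horizontal flips are scored, and the per-crop outputs are averaged.
import torch
from torchvision import transforms

ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),
    transforms.Lambda(lambda crops: torch.stack(
        [transforms.ToTensor()(c) for c in crops])),
])

def predict_ten_crop(model, pil_image):
    crops = ten_crop(pil_image)              # tensor of shape (10, 3, 224, 224)
    with torch.no_grad():
        out = model(crops)                   # per-crop scores for the 14 labels
        probs = torch.sigmoid(out)           # skip if the model already ends in a sigmoid
    return probs.mean(dim=0)                 # average over the ten crops
```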

3 Methodology

3.1 Project Objective

The work reported in this paper is an important addition to the above-mentioned long-term progress of CheXNet. Our main objectives were the following:

1. To implement the work done on CheXNet, which had proved to be a more efficient diagnostic technique than other algorithms, in a real-world EHR system and to validate its performance on other Chest X-ray images.
2. To achieve the fastest possible time for diagnosis from a Chest X-ray, so that the system could be deployed in locations where there is a lack of radiologists. LibreHealth RIS is commonly used in low- and middle-income countries, and we therefore decided to integrate our work with this EHR system.
3. To build an architectural innovation in which the CheXNet algorithm is deployed as a web service, so that physicians and other users of the LibreHealth EHR/RIS do not have to own a powerful graphics processing unit (GPU) or server and can merely upload images and get the diagnosis results back from the web service. This enables simplified distribution of our innovation and allows a more substantial impact of our work.

3.2 Software Development

While initiating the work, we reviewed all the existing models; we wanted a model with a related dataset and a strong, proven, and easily distributable algorithm. We selected the CheXNet model by Yadong Mu (2017), which is a PyTorch re-implementation of CheXNet. CheXNet currently works on the largest publicly available Chest X-ray dataset: it is a 121-layer convolutional neural network trained on Chest X-ray 14, containing over 100,000 frontal-view X-ray images with 14 diseases. For this project, we used the rapid prototyping methodology [7]. In the prototype that we built, the LibreHealth radiology module, which provides all the capabilities of the radiology information system, was used as the base module. The radiology module required the installation of a series of prerequisites, including Java JDK 8, the Maven build tool, Node JS and Docker. We decided to use Maven for this project, as it helped us obtain all the required libraries and plugins to handle the module's routine tasks. To use Maven, we had to ensure its compatibility with the JDK version, for which we had to make some minor changes to the LibreHealth core codebase.


In the development process, we followed an iterative cycle of design, build, test, feedback and re-design based on the feedback from the previous cycle. In the initial phase of the module design, we emphasized designing and evaluating the requirements of the project and demonstrating the functionality and performance of the system. The use of Docker helped streamline the development lifecycle by allowing us to work in a standardized development environment using local containers with all parts of the LibreHealth core platform and its MySQL database. We used containers for continuous integration and continuous delivery workflows, such that any development we did could be automatically tested against performance metrics, which we planned to improve after every development cycle. The radiology module and its Docker image are also built using Maven, which made the build process easy and helped provide a uniform build system. We forked the LibreHealth RIS codebase (https://gitlab.com/librehealth/radiology/lhradiology) to create a development branch. This proved hard to manage because of parallel work done by other contributors from the open-source community, and thus we switched to a modular approach so that our web application is not impacted by code changes to other parts of the LibreHealth RIS. The LibreHealth RIS code repository contains useful tools for radiology and imaging, which proved very helpful in this project. After cloning the repository, we used Docker to build and run the radiology module, on top of which we started to develop our web application. The assembled Docker image was later published to Docker Hub so that all developers could share the starting codebase to begin working on the app. The new Docker image was created based on the open-source lh-radiology-docker, integrating the required modules described below. To retrieve all the radiology terminology services required by the radiology module, the core concept dictionary was installed, and radiology concepts were also imported from the CIEL dictionary. We also integrated four other publicly available open-source modules, as shown in Fig. 1 below. The v1.4.0 package file (called OMOD) of the radiology module, along with the LibreHealth core called lh-toolkit, was integrated into this Docker image. The radiology module provides the core RIS features, such as procedures for placing radiology orders; the functionality to view DICOM images and to create reports after completing radiology orders is also part of this module. Secondly, the OMOD of the Legacy UI module was downloaded from the add-ons repository of OpenMRS, another open-source EHR platform on which LibreHealth has been developed. This module consists of all the administrative functions of OpenMRS along with the patient dashboards, including features such as finding/creating patients, concept dictionary management and all the essential functions of an EHR system. Then the OMOD of the REST Web Services module was downloaded from OpenMRS and integrated; it runs the RESTful web services of the OpenMRS EHR, which can be used by client-side HTML applications for data exchange with the EHR system.


Lastly, the Open Web Apps (OWA) v1.4 module of OpenMRS was integrated into the Docker image. This module allows users to deploy open web apps, which consist of HTML, JavaScript, CSS and a manifest file packaged as a zip, so that they can be launched as client-side applications. The module was useful for building our user interface on top of the REST web services.

Fig. 1. The architecture of the project using LibreHealth Toolkit as the base and OpenMRS modules to deploy an open web app, which communicates with the CheXNet web service.

The best-performing CheXNet was cloned from GitHub (https://github.com/arnoweng/CheXNet). Labeled ChestX-ray14 dataset images were downloaded from the National Institutes of Health (https://nihcc.app.box.com/v/ChestXray-NIHCC) onto a server with four Nvidia 1080Ti graphics processing units (GPUs). We uploaded model.py and model.pth.tar (the trained model) to the GPU server and updated the model to use PyTorch 0.4.0. All the medical terms and concepts for the 14 thoracic diseases were created in LibreHealth RIS. These concepts were also grouped into a ConvSet (Convention Set) of clinical terms as a list so that CheXNet can report back accurately under the appropriate diagnosis concept.

3.3 Development of Buddi-DL, the Open Web App for LibreHealth RIS

A rough skeleton of an open web app was created and named Buddi-DL (Deep Learn) after the primary developer and one author of this paper. We scaffolded the open web app by installing Node JS and generating the boilerplate Open Web App (OWA) with a NodeJS tool called Yeoman, following the development workflow defined at https://wiki.openmrs.org/display/docs/Open+Web+App+Development+Workflow. The HTML, CSS and JavaScript for the OWA could then be uploaded into the OpenMRS server using the Open Web Apps module, as shown in Fig. 2. The code for the buddi-dl OWA is available at https://gitlab.com/Surendra04buddi/buddi-dl.


Fig. 2. The LibreHealth user-interface to upload the Buddi-DL OWA

The main innovation of the project was creating a web service with the Python-based Flask framework. The Flask application hosts a web service that receives an image, executes the CheXNet evaluation on that image in the backend on the GPU server, and returns the list of lung diagnoses and their probabilities. This web service means the client-side application does not need its own GPU server to run the CheXNet algorithm. The Flask application serves as a medium between the OWA and the CheXNet model on the GPU server, so that the user can use the OWA to upload the images. Using the RESTful API deployed with Flask, the OWA posts the image to the URL endpoint, and this triggers the execution of the CheXNet model. All computation-intensive tasks are completed on the server, and a response is sent back once the processing is completed. The OWA receives a push callback from the Flask REST service once processing is completed, and the output diagnosis is shown in the OWA.
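
A minimal sketch of this kind of service is shown below. The endpoint name, the run_chexnet() placeholder and the use of the standard ChestX-ray14 label set are illustrative assumptions rather than the authors' actual code.

```python
# Minimal sketch of a Flask web service that accepts a Chest X-ray upload and
# returns per-disease probabilities; run_chexnet() stands in for the model
# inference performed on the GPU server.
import os
import tempfile

from flask import Flask, jsonify, request

app = Flask(__name__)

LABELS = [
    "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
    "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
    "Emphysema", "Fibrosis", "Pleural_Thickening", "Hernia",
]

def run_chexnet(image_path):
    """Placeholder for CheXNet inference; in the described system this step
    executes the model (e.g. via model.py) and returns 14 probabilities."""
    raise NotImplementedError

@app.route("/predict", methods=["POST"])
def predict():
    # The OWA posts the Chest X-ray as a multipart file upload.
    upload = request.files["image"]
    with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
        upload.save(tmp.name)
        image_path = tmp.name
    try:
        probs = run_chexnet(image_path)  # list of 14 floats
    finally:
        os.remove(image_path)
    # Return label -> probability pairs so the client can show the top diagnoses.
    return jsonify(dict(zip(LABELS, probs)))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```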

Fig. 3. The form on the LibreHealth RIS using the OWA through which the Chest X-Ray image is uploaded, and the Impression is returned into the form


4 Results

The radiologist or clinical user can log in to the EHR system, open the patient record and then upload the images into the OWA using a mobile phone or low-end laptop with any web browser. We tested the OWA with all major web browsers, including Internet Explorer, Chrome, Firefox and Edge; all of them were able to upload the images and receive the results back from the Flask CheXNet web service. A total of 180 new images were uploaded using the OWA for the test process. These images were different from the Chest X-rays on which the CheXNet algorithm had been trained. They were labeled images, of which about 70% (n = 126) were positive for pneumonia. This made the test dataset somewhat biased towards a positive detection of pneumonia, but we deliberately selected those images because we wanted to test whether, out of the 14 diagnoses, we would get accurate probabilities back from the CheXNet algorithm. Our system performed with 86% accuracy, in that 108 images were returned with pneumonia as the diagnosis with the highest probability. In the remaining 18 positive images, pneumonia was among the top-3 diagnoses but not the one with the highest probability. Thus, the integrated system worked reasonably accurately and showed performance similar to the model performance reported in previous studies on CheXNet. All the calls made by the OWA were correctly passed on by the Flask app to the algorithm. After the images are successfully posted, the OWA makes a remote call that executes model.py in a subprocess. After the model finishes execution, the Flask app pushes a callback and sends the probabilities for the 14 thoracic diseases; the diseases with the highest probability should be considered by the clients. The model is altered such that, after successful execution, the posted images are moved into other directories so that new images can be uploaded. All of the functionality worked correctly, and even with a poor and intermittent internet connection we were able to receive data back whenever the client and server had reconnected: the Flask application continued to push until a successful acknowledgment came back from the client, and the same applied on the client side when images were being sent to the Flask web service. The primary metrics for the evaluation of our project were the turnaround time of the AI system, since the accuracy of the model has been established by other researchers, and the usability of such an integrated system. All 180 images returned their diagnosis within a 30-s timeframe, with a minimum time of 22 s and a maximum of 29 s when the internet connection was kept stable. We deliberately disconnected and reconnected the internet and verified that the push callback worked correctly. Formal usability tests were performed in each iterative cycle during the development phase with one informatician and one clinical user, and that feedback was used to improve the usability of the system. Finally, as can be seen in Fig. 3, the form is very straightforward and does not feel any different from a regular LibreHealth or OpenMRS EHR system. The intent is to black-box the backend of the AI system, so that the user only sees a simple front-end and all the heavy lifting is done on the GPU server without the user knowing the internal complexities.
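
The turnaround time reported above could be measured from the client side as in the sketch below; the URL and form-field name follow the hypothetical Flask sketch in Sect. 3.3 and are assumptions, not values taken from the paper.

```python
# Sketch: time a single round trip to the hypothetical /predict endpoint.
import time
import requests

def measure_turnaround(image_path, url="http://gpu-server:5000/predict"):
    with open(image_path, "rb") as f:
        start = time.perf_counter()
        response = requests.post(url, files={"image": f})
        elapsed = time.perf_counter() - start
    return elapsed, response.json()  # seconds taken and the 14 label probabilities
```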
After the image is uploaded, we show a progress bar for the upload and a waiting icon until the diagnosis impression is shown. Due to the complexity of implementing an accurate progress bar for the detection model in the CheXNet web service, we do not display one for the user.

5 Discussion

The integrated system met all the objectives that we had set for integrating the AI system with LibreHealth RIS. The performance of the models was as expected and did not require much effort to achieve. The main integration challenge was creating a Flask web service that can deal with push callbacks and a RESTful API generic enough to support different algorithms in the future. Another major challenge was simplifying the OWA user experience. With such a state-of-the-art backend, the developers were strongly motivated to showcase a fancy internal view of what happened within the model and how the deep learning model worked in its hidden layers to make the classification. Yet the simplification made it clear to clinical and informatics users that this was better designed as a black box.

5.1 Limitations

Composing the Docker image, which required a long startup time for Tomcat to launch on its own port, was a major bottleneck at the start of the project. A persistent issue was dealing with CAS authentication (the internal authentication system of Indiana University), which prevented posting images to the GPU server: when we tried to post images to the server, the requests were redirected to the CAS authentication page, and the OWA failed to verify the authenticity of the login, particularly in getting a ticket for the secure connection. Cross-Origin Resource Sharing (CORS) restrictions in most modern browsers also prevented the OWA from posting images and receiving the callback; since the OWA is launched on a different domain from the Flask application, the CORS mechanism blocked the API requests from the OWA. We dealt with this by implementing a JSONP format for data exchange. CUDA run-time errors when running PyTorch continue to be an issue, and the model had to be retrained due to backward-compatibility issues between different versions of PyTorch. We used a server with 128 GB RAM and 4 Nvidia GTX 1080Ti GPUs, which is somewhat limited if a large number of models have to start evaluating images uploaded by many users simultaneously. The server showed out-of-memory errors multiple times due to a large batch size.

6 Conclusion

The detection of diseases from Chest X-ray images at the level of expert radiologists might be advantageous in assisting clinicians in healthcare settings, particularly in places with very few radiologists. In other places, this might augment radiologists and help them evaluate Chest X-rays with a second perspective. The impressions obtained from this application can be overridden by physicians or radiologists, and treatment can be given accordingly.


This is amongst the first attempts at integrating Artificial Intelligence into a clinical workflow in an open-source product. The primary goal of this project is to design a clinically meaningful automated system that can assist physicians in analyzing radiology images before they receive detailed reports from radiologists. We have developed an OWA that can be launched in any Docker image containing the Radiology module and the Open Web Apps module for the OpenMRS EHR or LibreHealth. This OWA may serve as a template for the integration of Artificial Intelligence in radiology.

References

1. World Health Organization: Noncommunicable diseases: progress monitor 2017 (2017)
2. Fazal, M., Patel, M., Tye, J., Gupta, Y.: The past, present and future role of artificial intelligence in imaging. Eur. J. Radiol. 105, 246–250 (2018)
3. Yao, L., Poblenz, E., Dagunts, D., Covington, B., Bernard, D., Lyman, K.: Learning to diagnose from scratch by exploiting dependencies among labels. arXiv preprint arXiv:1710.10501 (2017)
4. Fei-Fei, L., Deng, J., Li, K.: ImageNet: constructing a large-scale image database. J. Vis. 9(8), 1037 (2009)
5. Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3462–3471. IEEE (2017)
6. Rajpurkar, P., Irvin, J., Zhu, K., Yang, B., Mehta, H., Duan, T., Ding, D., et al.: CheXNet: radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225 (2017)
7. Jones, T.S., Richey, R.C.: Rapid prototyping methodology in action: a developmental study. Educ. Technol. Res. Dev. 48(2), 63–80 (2000)
8. WHO: Standardization of interpretation of chest radiographs for the diagnosis of pneumonia in children (2001)

Illumination-Invariant Face Recognition by Fusing Thermal and Visual Images via Gradient Transfer

Sumit Agarwal, Harshit S. Sikchi, Suparna Rooj, Shubhobrata Bhattacharya, and Aurobinda Routray

Indian Institute of Technology, Kharagpur, India [email protected]

Abstract. Face recognition in real-life situations, such as low-illumination conditions, is still an open challenge in biometric security. It is well established that state-of-the-art face recognition methods provide low accuracy under poor illumination. In this work, we propose an algorithm for more robust, illumination-invariant face recognition using a multi-modal approach. We propose a new dataset consisting of aligned thermal and visual face images of a hundred subjects. We then perform face detection on the thermal images using a biggest-blob extraction method and use the detected faces for fusing images of the two modalities for the purpose of face recognition. An algorithm is proposed to implement the fusion of thermal and visual images, and we discuss why relying on only one modality can give erroneous results. We use a lighter and faster CNN model called MobileNet for face recognition, with faster inference so that it can be used in real-time biometric systems. We test the proposed method on our own dataset and show that real-time face recognition on fused images gives far better results than using visual or thermal images separately.

Keywords: Biometrics · Face recognition · Image fusion · Thermal face detection · Gradient transfer · MobileNet

1 Introduction

In this age of smart technologies, biometrics plays an important role in keeping us secure. Devices that use face recognition as a biometric are non-intrusive, reliable and convenient, and face recognition is considered to be best suited for identification [1,2]. Since the pioneering work of [3], a number of approaches have been proposed, such as [4–8], but some problems remain unsolved. The performance of such algorithms is vulnerable to poor illumination conditions [9], disguises and spoofing attacks [10]. The literature shows that infrared imaging has the potential to counter the aforementioned problems [11–13].


Fig. 1. Flowchart

However, thermal imagery has its own drawbacks: it is opaque to glasses, sensitive to the surrounding temperature, and the distribution of heat changes with an individual's facial expression. Also, face detection and holistic feature extraction remain a challenge in thermal images [14,15]. One of the objectives of this paper is to propose an efficient face recognition algorithm under the mentioned constraints. The proposed method uses the positive characteristics of both the visible and thermal spectra to recognize a given face. The limitations of both domains are addressed by fusing the thermal and visible images optimally to get a reliable result. Our experimental results reinforce our claim of solving the problems of losing eye information under spectacles and poor recognition under poor illumination. Also, thermal maps cannot be generated artificially by playing videos or showing images to the biometric system, which can be done with the intention of breaching security. A database was needed to test the performance of our methodology. While a vast number of databases designed for various tasks exist for the visual spectrum, only a few relevant thermal face databases prevail. In the past, the most prominent databases used for facial image processing in the thermal infrared domain were the EQUINOX HID Face Database [16] and the IRIS databases. The NIST/Equinox database contains image pairs co-registered through the hardware setup, whereas the image pairs in the UTK-IRIS Thermal/Visible Face Database are not, and therefore spatial alignment is required before fusion. However, both resources are no longer available. The Kotani Thermal Facial Expression (KTFE) Database [17] is another such database, but it is small and consists of limited examples. A database currently available upon request is the Natural Visible and Infrared facial Expression database (USTC-NVIE) [18]. The database is multimodal, containing both visible and thermal videos acquired simultaneously. The spatial resolution of the infrared videos is 320 × 240 pixels. However, the thermal and visual images were captured at different angles, which makes manual annotation difficult as the images are not in the same orientation. To fulfill the need for such a dataset, we developed a simultaneous thermal and visual face dataset. The dataset contains a hundred subjects, whose thermal and visual images are taken simultaneously to avoid face alignment issues.


In this paper, Sect. 1 describes the proposed dataset and its protocol and shows a few samples; Sect. 2 explains the procedure used in this paper, which is subdivided into the face detection part for thermal imagery, the visible and thermal image fusion part with its optimization, and the face recognition part; Sect. 3 shows the experimental results and the accuracy obtained using the proposed method; and Sect. 4 concludes the paper. The methodology is also presented in Fig. 1.

1.1 Proposed Dataset

The database presented and used here is multimodal, containing visible and thermal images acquired simultaneously. It contains images of 100 participants, with 10 image sets of visual and thermal images each, at different illumination conditions for the objective of illumination-invariant facial recognition. Rather than emphasizing the acquisition of data in various modes, we focus on the accuracy of the alignment of the visible and thermal image sets. Therefore, our database provides:

– High-resolution data at 640 × 480 pixels, much higher than currently available databases that usually work with 320 × 240 pixel data.
– Facial images at a given head pose captured with the same alignment, which ensures that they can be superimposed on each other without misalignment in a low-light setting. This is possible using a specific hardware arrangement of the thermal and visible cameras.
– A wide range of head poses instead of the usually fully frontal recordings provided elsewhere.

To the best of our knowledge, our database is the only set available with simultaneously aligned facial images in the visual and infrared spectra at variable illumination. All images for our dataset were recorded using a FLIR One Pro high-resolution thermal infrared camera with a 160 × 120 pixel microbolometer sensor working in an infrared spectrum range of 8 µm–14 µm. Sample images from the created dataset are shown in Fig. 2.

2 Proposed Methodology

2.1 Pixel Intensity Based Face Detection Algorithm

The first step in our proposed pipeline is face detection in thermal images. This algorithm operates on the pixel intensity profile of the thermal image, from which the face along with the neck region is extracted. We make two assumptions: first, that the facial region covers the biggest part of the whole image, and second, that the pixel intensity of the face differs markedly from the parts below the neck due to clothing or other reasons. After extracting the face from the thermal image, we obtain pre-processed data for the next step, which is image fusion to get a fused face. Our proposed method is very simple and does not depend on the detection of any other facial part, a curve of the face or anthropometric relationships, but only on the pixel intensity, which is readily available. The steps followed are described in Algorithm 1.


Fig. 2. Sample images from the dataset

Algorithm 1. Face Detection in Infrared Images

1: A thermal image is obtained in grayscale format, as shown in Figure 3a.
2: Use histogram equalization to improve the contrast of the thermal image; the top 1% and the bottom 1% of the pixel values are saturated (shown in Figure 3b).
3: Use a two-dimensional median filter to smooth the image. The result (shown in Figure 3c) smooths out minor discontinuities in the pixel values within a region of the image.
4: Apply histogram equalization again to improve the contrast of the image (shown in Figure 3d); the top 1% and the bottom 1% of the pixel values are saturated.
5: The resulting image is multimodal, usually having three to four modes. Select the mode containing the most pixels as well as a bright intensity; to do this, threshold the image using the information from the histogram (shown in Figure 3e).
6: Process the thresholded image to fit the smallest rectangle possible and extract the face region from the original image (shown in Figure 3f).
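
A rough OpenCV/NumPy sketch of these steps is given below; the exact contrast-stretching percentiles, median-filter size and the use of Otsu thresholding as a stand-in for the histogram-based mode selection are assumptions made for illustration.

```python
# Sketch of Algorithm 1 (assumes OpenCV 4.x); parameter choices are illustrative.
import cv2
import numpy as np

def detect_face(thermal_gray):
    # Step 2: contrast enhancement, saturating ~1% of pixels at both ends.
    lo, hi = np.percentile(thermal_gray, (1, 99))
    stretched = np.clip((thermal_gray - lo) * 255.0 / (hi - lo + 1e-6), 0, 255).astype(np.uint8)
    # Step 3: 2-D median filtering to smooth minor discontinuities.
    smoothed = cv2.medianBlur(stretched, 5)
    # Step 4: second round of contrast enhancement.
    lo, hi = np.percentile(smoothed, (1, 99))
    enhanced = np.clip((smoothed - lo) * 255.0 / (hi - lo + 1e-6), 0, 255).astype(np.uint8)
    # Step 5: keep the bright, populous mode of the histogram by thresholding
    # (Otsu is used here in place of the histogram-based mode selection).
    _, mask = cv2.threshold(enhanced, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Step 6: fit the smallest rectangle around the largest blob and crop it.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    return thermal_gray[y:y + h, x:x + w]
```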


Fig. 3. Step-wise result of face extraction: (a) captured thermal image in grayscale format, (b) histogram equalized image, (c) result after median filtering, (d) contrast enhanced image, (e) region of our interest and (f) extracted face

2.2 Image Fusion

After obtaining the corresponding visible and thermal images from the preprocessing step, we follow Ma et al. [19], who present a method of image fusion that uses gradient transfer and optimizes the fusion using total variation minimization. Our aim is to create a fused image that preserves both the infrared radiation information and the visual information of the two images, given that the images are co-registered and aligned. An example can be seen in Fig. 4. Both the visual and thermal images are considered to be grayscale. Let the size of the thermal, visible and fused images be m × n, and let their column-vector forms be denoted by ir, vi, x ∈ R^{mn×1}, respectively. Infrared images can reliably distinguish between targets and background through pixel intensity differences.


This motivates forcing the fused image to have pixel intensities similar to those of the given thermal image, which can be achieved by minimizing the empirical error measured by some $\ell_r$ norm ($r \geq 1$):

$$\varepsilon_1(x) = \frac{1}{r}\left\| x - ir \right\|_r^r \qquad (1)$$

We also want the fused image to preserve the characteristics of the visual image; ideally, the fused image would have pixel intensities similar to those of the visual image in order to fuse the detailed appearance information. However, the infrared and visual images represent different phenomena, which leads to different pixel intensities at the same pixel location, making it unsuitable to generate x by simultaneously minimizing $\frac{1}{r}\| x - ir \|_r^r$ and $\frac{1}{s}\| x - vi \|_s^s$. We instead consider the gradients of the image to characterize the detailed appearance information about the scene, and therefore propose to constrain the fused image to have pixel gradients, rather than pixel intensities, similar to those of the visible image:

$$\varepsilon_2(x) = \frac{1}{s}\left\| \nabla x - \nabla vi \right\|_s^s \qquad (2)$$

where $\nabla$ is the gradient operator, defined in detail later. In the case of $s = 0$, Eq. 2 is defined as $\varepsilon_2(x) = \| \nabla x - \nabla vi \|_0$, which equals the number of non-zero entries of $\nabla x - \nabla vi$. Hence, from Eqs. 1 and 2, the fusion problem is formulated as the minimization of the following objective function:

$$\varepsilon(x) = \varepsilon_1(x) + \lambda\,\varepsilon_2(x) = \frac{1}{r}\left\| x - ir \right\|_r^r + \lambda\,\frac{1}{s}\left\| \nabla x - \nabla vi \right\|_s^s \qquad (3)$$

where the first term constrains the fused image x to have pixel intensities similar to those of the infrared image ir, the second term requires that the fused image x and the visible image vi have similar gradients at corresponding positions, and λ is a positive parameter controlling the trade-off between the two terms. In essence, the objective function aims to transfer the edges of the visible image onto the corresponding positions in the infrared image. This increases the fidelity of a standard infrared image by fusing important information from the visual image.

2.3 Optimization

We now consider the choice of the norms $\ell_r$ and $\ell_s$ in the objective function of Eq. 3. A Gaussian difference between the fused image x and the infrared image ir would make r = 2 the natural choice, whereas a Laplacian or impulsive difference leads to r = 1. Specifically, in our problem we expect to keep the thermal radiation information of ir, which means that most entries of x − ir should be zero since the fused image contains the thermal information, while a small fraction of the entries can be large due to the gradient transfer from the visible image vi. This leads the difference between x and ir to be Laplacian or impulsive rather than Gaussian, i.e., r = 1.


Fig. 4. Image fusion example: (a) Thermal Image, (b) Visual Image representing the same alignment and region as the Thermal Image, (c) Fusion result (Parameters: λ = 7)

Natural images often exhibit piecewise smoothness, so their gradients are sparse and of large magnitude at the edges. Encouraging sparseness of the gradients corresponds to minimizing the $\ell_0$ norm, i.e., s = 0. Since $\ell_0$ minimization is NP-hard, a standard convex relaxation replaces $\ell_0$ by $\ell_1$; the exact recovery of sparse solutions by $\ell_1$ is guaranteed under the restricted isometry property condition. Therefore, we minimize the gradient differences with the $\ell_1$ norm, i.e., s = 1, and the $\ell_1$ norm of the gradient is the total variation. Letting y = x − vi, the optimization problem in Eq. 3 (with r = s = 1) can be rewritten as

$$y^{*} = \arg\min_{y} \sum_{i=1}^{mn} \left| y_i - (ir_i - vi_i) \right| + \lambda J(y), \qquad J(y) = \sum_{i=1}^{mn} \left| \nabla_i y \right| = \sum_{i=1}^{mn} \sqrt{(\nabla_i^h y)^2 + (\nabla_i^v y)^2}, \qquad (4)$$

where $|x| := \sqrt{x_1^2 + x_2^2}$ for every $x = (x_1, x_2) \in \mathbb{R}^2$, and $\nabla_i = (\nabla_i^h, \nabla_i^v)$ denotes the image gradient $\nabla$ at pixel i, with $\nabla^h$ and $\nabla^v$ being linear operators corresponding to the horizontal and vertical first-order differences, respectively. More specifically, $\nabla_i^h x = x_i - x_{r(i)}$ and $\nabla_i^v x = x_i - x_{b(i)}$, where r(i) and b(i) denote the nearest neighbors to the right of and below pixel i; if pixel i is located in the last row or column, r(i) and b(i) are both set to i. The objective function in Eq. 4 is convex and thus has a global optimal solution. The second term acts as a regularizer, with λ controlling how much of the detailed appearance information of the visual image is taken into account. The problem in Eq. 4 is a standard l1-TV minimization problem and can be solved efficiently using the algorithm proposed in [20].


The resulting gradient transfer fusion (GTF) algorithm is very simple yet efficient. The global optimal solution for the fused image x* is then recovered as x* = y* + vi.
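
The sketch below implements the discrete gradient operators exactly as defined above (boundary pixels act as their own right/below neighbours) and evaluates the objective of Eq. 4. It only evaluates the objective; the l1-TV minimization itself would be done with a standard solver such as the one in [20], after which the fused image is recovered as x* = y* + vi.

```python
# Discrete gradients and the objective of Eq. (4); evaluation only, not a solver.
import numpy as np

def grad_h(img):
    img = np.asarray(img, dtype=float)
    g = img - np.roll(img, -1, axis=1)   # x_i - x_{r(i)}
    g[:, -1] = 0.0                       # last column: r(i) = i
    return g

def grad_v(img):
    img = np.asarray(img, dtype=float)
    g = img - np.roll(img, -1, axis=0)   # x_i - x_{b(i)}
    g[-1, :] = 0.0                       # last row: b(i) = i
    return g

def J(y):
    return np.sqrt(grad_h(y) ** 2 + grad_v(y) ** 2).sum()

def objective_eq4(y, ir, vi, lam=7.0):
    # lam corresponds to the lambda = 7 setting used for the example in Fig. 4.
    return np.abs(np.asarray(y, dtype=float) - (np.asarray(ir, dtype=float) - np.asarray(vi, dtype=float))).sum() + lam * J(y)
```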

2.4 Face Recognition on Fused Images

Convolutional neural networks [21] are designed to process data that come in the form of multiple arrays, for example a colour image composed of three 2D arrays containing pixel intensities in the three colour channels. The key ideas that form the foundation of CNNs are local connections, shared weights, pooling and the use of many layers. A typical CNN comprises convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called filters. The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank. This kind of architecture is beneficial because, in array data like images, local groups of values are correlated, forming distinctive templates that can be easily detected by the filters. Using such filters also brings about translation invariance: if a template can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name. CNNs have proven very useful in computer vision since they preserve the spatial relationship between pixels. In many real-world applications such as biometrics, recognition tasks need to be carried out in a timely fashion on a computationally limited platform; real-time and offline inference are major challenges in modern biometrics for robust fine-grained classification and increased reliability in the absence of internet access. The class of efficient CNN models called MobileNet is small and has low latency for mobile and embedded applications. Two hyperparameters, the 'width multiplier' and the 'resolution multiplier', dictate the properties of the resulting architecture.

2.5 MobileNet Architecture

The trend towards deeper and more complicated CNNs for higher accuracy also leads to high memory requirements and slow inference. We use the MobileNet architecture by Howard et al. [22] to make the architecture readily deployable on mobile systems. A typical convolutional layer takes an input feature map F of size HF × HF × M and produces a feature map of size HG × HG × N, where HF and HG are the spatial width and height of the square input and output feature maps respectively, M is the number of input channels and N is the number of output channels. The standard convolutional layer is parameterized by a convolution kernel K of size HK × HK × M × N, where HK is the spatial dimension of the (square) kernel and M and N are the numbers of input and output channels as defined previously.


A standard convolution thus has a computational cost of $H_K \cdot H_K \cdot M \cdot N \cdot H_F \cdot H_F$. In the MobileNet architecture, the same convolution operation is divided into two convolutions, called the depthwise and the pointwise convolutions. The depthwise convolution applies a single filter per input channel, and the pointwise convolution, a simple 1 × 1 convolution, creates linear combinations of the outputs of the depthwise layer. This modification reduces the computational cost to $H_K \cdot H_K \cdot M \cdot H_F \cdot H_F + M \cdot N \cdot H_F \cdot H_F$, the sum of the costs of the depthwise and pointwise computations. The reduction in computation is given by:

$$\frac{H_K \cdot H_K \cdot M \cdot H_F \cdot H_F + M \cdot N \cdot H_F \cdot H_F}{H_K \cdot H_K \cdot M \cdot N \cdot H_F \cdot H_F} = \frac{1}{N} + \frac{1}{H_K^2} \qquad (5)$$

Equation 5 shows that, with a kernel of size 3 × 3, we can increase the computation speed by 8 to 9 times compared to standard convolutions, with only a small reduction in accuracy. The MobileNet architecture is shown in Fig. 5.
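
As an illustration, a depthwise-separable block of this kind can be written in PyTorch as below (the paper uses a stock MobileNet rather than this code), and the cost ratio of Eq. 5 can be checked numerically.

```python
# Sketch of a depthwise-separable convolution block and the Eq. (5) cost ratio.
import torch.nn as nn

def depthwise_separable(m_in, n_out, k=3):
    return nn.Sequential(
        nn.Conv2d(m_in, m_in, kernel_size=k, padding=k // 2, groups=m_in, bias=False),  # depthwise
        nn.BatchNorm2d(m_in),
        nn.ReLU(inplace=True),
        nn.Conv2d(m_in, n_out, kernel_size=1, bias=False),  # pointwise 1x1
        nn.BatchNorm2d(n_out),
        nn.ReLU(inplace=True),
    )

def cost_ratio(hk, n):
    # (HK*HK*M*HF*HF + M*N*HF*HF) / (HK*HK*M*N*HF*HF) = 1/N + 1/HK^2
    return 1.0 / n + 1.0 / hk ** 2

print(cost_ratio(hk=3, n=64))  # about 0.13, i.e. roughly the 8-9x saving noted above
```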

3 Experiment and Results

3.1 Dataset

We use our proposed dataset for the face recognition pipeline since, to our current knowledge, there are no well-aligned, simultaneously captured thermal and visual facial image datasets available.

Fig. 5. Body architecture of MobileNet


The images are already spatially registered due to the hardware arrangement of the visual and thermal lenses in the FLIR One camera. There are 10 images at different head poses in each modality per person, and the total number of subjects is 100. Since there are only a limited number of fused images per person (10 images), dataset augmentation is required to increase the generalization capability of the trained MobileNet. For augmentation, the faces are rotated in steps of 10° from −90° to 90°, thus increasing the dataset size by about 19 times.
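
A sketch of this augmentation step is shown below; the use of scipy.ndimage.rotate and its interpolation settings are assumptions made for illustration.

```python
# Rotation augmentation: each fused face is rotated in 10-degree steps from
# -90 to +90 degrees, yielding 19 orientations per image.
import numpy as np
from scipy.ndimage import rotate

def augment_by_rotation(face, step=10, limit=90):
    angles = np.arange(-limit, limit + step, step)   # -90, -80, ..., 90
    return [rotate(face, angle, reshape=False, mode="nearest") for angle in angles]
```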

3.2 Image Fusion Parameters

As shown in Fig. 6, increasing the regularization parameter λ increases the value of the optimization objective in Eq. 4. Increasing λ can also be seen as giving more importance to the visual gradient features than to the thermal intensity features in the fused image, turning the image more towards the visual domain. The trade-off between these two properties results in fused images representing varying types of features.

3.3 Face Recognition Accuracy

To assess the performance of our proposed framework, we compare our model with two other contemporary deep learning models, namely NasNet [23] and MobileNet V2 [24]. The face recognition results for the different modes of acquisition with varying λ are shown in Fig. 7, and the recognition accuracies for the different image modes using the different models at λ = 8 are shown in Table 1. Since CNNs perform best on image data, the MobileNets, which are faster and smaller CNN variants, achieve the best accuracy. We can see that face recognition on the thermal images alone gives much less accurate results because of the lack of sharp edges, i.e., gradient features, in the images. During the optimization in the final steps of image fusion, small gradients in the visual images are turned into larger values, making hidden edges in the visual images clearer in the fused images.

Fig. 6. Objective function vs Lambda characteristics


The facial heat maps of individuals in the thermal images contain features unique to each person, which, together with the visual features, contribute to a higher recognition accuracy.

Table 1. Face recognition accuracy for image modes by different models at λ = 8

Model           Thermal   Visual   Fusion
NasNet Mobile   62.6%     66.1%    75.4%
MobileNet V2    79.3%     87.3%    93.3%
MobileNet V1    82.6%     90.7%    95.7%

Fig. 7. Accuracy of recognition of fused images for different models against varying λ

4 Conclusion

In this paper, we propose a novel methodology for thermal face recognition and experimentally show the superior performance of our approach by evaluating recognition on the fused images. We create a dataset of thermal and visual face images captured simultaneously, which minimizes errors due to face alignment. Visual images have limitations in scenarios such as face spoofing and liveness detection; hence we have incorporated the strengths of the two modalities, thermal and visual, to create a merged representation of the face. We further deploy MobileNet to extract robust features from the merged face, achieving higher face recognition accuracy. We have shown that the accuracy for the merged face is better than for face images in the individual modalities. The face data were captured under certain constraints on pose, expression and distance from the camera; in the future, we propose to extend the dataset to unconstrained environments. We also invite the research community to address face recognition in light of merged-modality representations.


References

1. Ekenel, H.K., Stallkamp, J., Gao, H., Fischer, M., Stiefelhagen, R.: Face recognition for smart interactions. In: 2007 IEEE International Conference on Multimedia and Expo, pp. 1007–1010. IEEE (2007)
2. Pentland, A., Choudhury, T.: Face recognition for smart environments. Computer 33(2), 50–55 (2000)
3. Galton, F.: Personal identification and description. J. Anthropol. Inst. Great Br. Irel. 18, 177–191 (1889)
4. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. fisherfaces: recognition using class specific linear projection. Technical report, Yale University, New Haven, United States (1997)
5. He, X., Yan, S., Hu, Y., Niyogi, P., Zhang, H.-J.: Face recognition using Laplacianfaces. IEEE Trans. Pattern Anal. Mach. Intell. 27(3), 328–340 (2005)
6. Gao, Y., Leung, M.K.: Face recognition using line edge map. IEEE Trans. Pattern Anal. Mach. Intell. 24, 764–779 (2002)
7. Kirby, M., Sirovich, L.: Application of the Karhunen-Loeve procedure for the characterization of human faces. IEEE Trans. Pattern Anal. Mach. Intell. 12(1), 103–108 (1990)
8. Bartlett, M.S., Movellan, J.R., Sejnowski, T.J.: Face recognition by independent component analysis. IEEE Trans. Neural Netw. 13(6), 1450 (2002)
9. Adini, Y., Moses, Y., Ullman, S.: Face recognition: the problem of compensating for changes in illumination direction. IEEE Trans. Pattern Anal. Mach. Intell. 19(7), 721–732 (1997)
10. Wen, D., Han, H., Jain, A.K.: Face spoof detection with image distortion analysis. IEEE Trans. Inf. Forensics Secur. 10(4), 746–761 (2015)
11. Cutler, R.G.: Face recognition using infrared images and eigenfaces. University of Maryland (1996)
12. Bebis, G., Gyaourova, A., Singh, S., Pavlidis, I.: Face recognition by fusing thermal infrared and visible imagery. Image Vis. Comput. 24(7), 727–742 (2006)
13. Socolinsky, D.A., Selinger, A., Neuheisel, J.D.: Face recognition with visible and thermal infrared imagery. Comput. Vis. Image Underst. 91(1–2), 72–114 (2003)
14. Forczmański, P.: Human face detection in thermal images using an ensemble of cascading classifiers. In: International Multi-Conference on Advanced Computer Systems, pp. 205–215. Springer (2016)
15. Wong, W.K., Hui, J.H., Desa, J.B.M., Ishak, N.I.N.B., Sulaiman, A.B., Nor, Y.B.M.: Face detection in thermal imaging using head curve geometry. In: 5th International Congress on Image and Signal Processing (CISP), pp. 881–884. IEEE (2012)
16. Selinger, A., Socolinsky, D.A.: Appearance-based facial recognition using visible and thermal imagery: a comparative study. Technical report, Equinox Corp., New York, NY (2006)
17. Nguyen, H., Kotani, K., Chen, F., Le, B.: A thermal facial emotion database and its analysis. In: Pacific-Rim Symposium on Image and Video Technology, pp. 397–408. Springer (2013)
18. Wang, S., Liu, Z., Lv, S., Lv, Y., Wu, G., Peng, P., Chen, F., Wang, X.: A natural visible and infrared facial expression database for expression recognition and emotion inference. IEEE Trans. Multimedia 12(7), 682–691 (2010)
19. Ma, J., Chen, C., Li, C., Huang, J.: Infrared and visible image fusion via gradient transfer and total variation minimization. Inf. Fusion 31, 100–109 (2016)


20. Chan, T.F., Esedoglu, S.: Aspects of total variation regularized L1 function approximation. SIAM J. Appl. Math. 65(5), 1817–1837 (2005)
21. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)
22. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861 (2017)
23. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012 (2017)
24. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.-C.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)

An Attention-Based CNN for ECG Classification

Alexander Kuvaev and Roman Khudorozhkov

Gazprom Neft, Saint Petersburg, Russia {a.kuvaev,r.khudorozhkov}@analysiscenter.org

Abstract. The paper considers the problem of improving the interpretability of a convolutional neural network, using an ECG classification task as an example. This is done by using an architecture based on attention modules. Each module generates a mask that selects only those features that are required to make the final prediction. By visualizing these masks, areas of the signal that are important for decision-making can be identified. The model was trained both on raw signals and on their logarithmic spectrograms. In the case of raw signals, the generated masks did not perform any meaningful feature map filtering, but in the case of spectrograms, interpretable masks responsible for noise reduction and arrhythmic part detection were obtained.

Keywords: Convolutional neural networks · Attention mechanism · ECG classification

1 Introduction

In the last few years, deep neural networks have been successfully applied to a large variety of tasks, achieving state-of-the-art results. In fields where the cost of a mistake is high, however, such as medical diagnostics, model performance is not the only thing that matters: the person who makes the final decision needs to understand the model's behavior and the particular features of the input data that affected its prediction. Thus, it is crucial to obtain not only an accurate model but also an interpretable one.

This paper focuses on building such a model for an ECG classification task. To accomplish this, a convolutional neural network architecture based on an attention mechanism was used. This mechanism allows the network to discover the most informative features of an ECG signal at various network depths. Models with different forms of attention have already been successfully applied to a large variety of tasks, including neural machine translation, speech recognition, caption generation and image classification. The model was trained both on raw signals and on their logarithmic spectrograms. As a result, interpretable masks responsible for noise reduction and arrhythmic part detection were obtained.

2 Datasets

Two datasets were used for model training and testing: the publicly available part of the 2017 PhysioNet/CinC Challenge dataset [1,2] and the MIT-BIH Atrial Fibrillation Database [3].

The PhysioNet dataset contains 8,528 single-lead ECG recordings lasting from 9 to 61 s, all sampled at 300 Hz. All ECGs were collected from portable heart monitoring devices. A significant part of the signals had their R-peaks directed downwards, since the device did not require the user to hold it in any specific orientation; these signals were flipped during the preprocessing stage. All recordings were manually classified by a team of experts into 4 classes: atrial fibrillation, normal rhythm, other rhythm, or too noisy to be classified. For this dataset, the model was trained to predict the correct class.

The MIT-BIH Atrial Fibrillation Database contains 23 ECG recordings of approximately 10 h each. Each signal was sampled at 250 Hz and has two leads, but only the first one was used for model training. The signals are divided into segments marked with one of four labels: atrial fibrillation, atrial flutter, junctional rhythm or other rhythms. Such annotation allows segments to be sampled around the points of heart rhythm change and the model's target to be defined as the fraction of the segment occupied by the arrhythmic part.

3 Model

The model is based on Residual Attention Network [4] and is constructed by stacking multiple ResNet blocks [5] and slightly modified attention modules.

3.1 Attention Module

Each attention module is a neural network that starts with a pre-processing ResNet block, then splits into a trunk branch and a mask branch, which are then joined together and passed through a post-processing ResNet block. The module's architecture is shown in Fig. 1. The trunk branch performs feature processing by passing its input through two consecutive ResNet blocks. The mask branch creates a soft binary mask that highlights the most informative features of an input ECG signal. Leading and trailing convolutions in this branch are used to squeeze and expand the number of feature maps, and upsampling is performed by linear interpolation. The computed mask is subsequently multiplied by the output of the trunk branch. In order to help the gradient flow and preserve the good properties of the features, the output of the trunk is then added to the result of this multiplication, as suggested in [4].

Fig. 1. Attention module architecture. The trunk branch is shown on the left, the mask branch on the right.
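As a rough illustration of this residual attention scheme, the following PyTorch sketch reproduces the described data flow (pre-processing block, trunk and mask branches, mask applied as trunk + mask · trunk). The channel counts, pooling factor and block internals are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of a 1D residual attention module, assuming simplified blocks.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResNetBlock(nn.Module):
    """Simple 1D residual block: two convolutions with a skip connection."""

    def __init__(self, channels, kernel_size=5):
        super().__init__()
        padding = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=padding)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=padding)
        self.bn2 = nn.BatchNorm1d(channels)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        return x + out


class AttentionModule(nn.Module):
    """Pre-processing block, trunk/mask branches, post-processing block."""

    def __init__(self, channels, mask_channels=4):
        super().__init__()
        self.pre = ResNetBlock(channels)
        # Trunk branch: two consecutive ResNet blocks.
        self.trunk = nn.Sequential(ResNetBlock(channels), ResNetBlock(channels))
        # Mask branch: squeeze channels, process at lower resolution, expand back.
        self.squeeze = nn.Conv1d(channels, mask_channels, kernel_size=1)
        self.mask_block = ResNetBlock(mask_channels)
        self.expand = nn.Conv1d(mask_channels, channels, kernel_size=1)
        self.post = ResNetBlock(channels)

    def forward(self, x):
        x = self.pre(x)
        t = self.trunk(x)
        # Mask branch: downsample, process, upsample by linear interpolation.
        m = F.max_pool1d(self.squeeze(x), kernel_size=2)
        m = self.mask_block(m)
        m = F.interpolate(m, size=t.shape[-1], mode="linear", align_corners=False)
        m = torch.sigmoid(self.expand(m))  # soft mask in [0, 1]
        # Residual attention: trunk + mask * trunk, i.e. (1 + mask) * trunk.
        return self.post(t + m * t)
```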

3.2 Model Architecture

The model trained on raw signals has two leading ResNet blocks with 16 filters, three attention modules with 20 filters and, finally, two trailing ResNet blocks with 24 filters. The model trained on logarithmic spectrograms consists of two attention modules and a trailing ResNet block, with 12 filters each. All of these attention modules and top-level ResNet blocks are followed by a dropout operation [6] with a drop rate of 0.2 and perform downsampling along the spatial dimensions. In a ResNet block, the downsampling is performed after the first convolution using a max pooling operation with a kernel size and a stride of 2. In an attention module, the downsampling is performed by the post-processing ResNet block.

In both models, the first convolutions in the mask branches squeeze the number of feature maps to 4, while the last convolution expands them back to match the shape of the corresponding trunk branch. These convolutions have a kernel size of 1; all others have a kernel size of 5. The convolution stride in all layers is fixed to 1. The input of each convolutional layer is padded so that the spatial resolution is preserved. ReLU activations [7] are used throughout both networks, and batch normalization [8] is applied before each activation.

Both networks end with a global max pooling operation and a fully-connected layer. In the case of the PhysioNet dataset, this layer is followed by a softmax activation to predict class probabilities, and categorical cross-entropy is used as the loss function. In the case of the MIT-BIH Atrial Fibrillation Database, a sigmoid activation and a binary cross-entropy loss are used. All weights were initialized with the scheme proposed in [9], and Adam [10] was used as the optimizer. The described models were trained on 10-s crops from the original signals with the help of the CardIO framework for ECG processing [11].
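As a hedged sketch of how the two output heads and losses could be wired, the fragment below uses a placeholder feature extractor (`backbone` is an assumption, not the authors' network) with softmax plus categorical cross-entropy for the 4-class PhysioNet task and sigmoid plus binary cross-entropy for the MIT-BIH fraction target, both optimized with Adam.

```python
# Sketch of the two training objectives; the backbone is a stand-in model.
import torch
import torch.nn as nn

backbone = nn.Sequential(             # placeholder feature extractor over 10-s crops
    nn.Conv1d(1, 16, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.AdaptiveMaxPool1d(1),          # global max pooling, as described above
    nn.Flatten(),
)

# PhysioNet: 4 classes, softmax + categorical cross-entropy.
physionet_head = nn.Linear(16, 4)
physionet_loss = nn.CrossEntropyLoss()     # applies log-softmax internally

# MIT-BIH AF: target is the arrhythmic fraction in [0, 1], sigmoid + binary CE.
afdb_head = nn.Linear(16, 1)
afdb_loss = nn.BCEWithLogitsLoss()         # applies sigmoid internally

params = list(backbone.parameters()) + list(physionet_head.parameters())
optimizer = torch.optim.Adam(params)

x = torch.randn(8, 1, 3000)                # batch of 10-s crops at 300 Hz
labels = torch.randint(0, 4, (8,))
optimizer.zero_grad()
loss = physionet_loss(physionet_head(backbone(x)), labels)
loss.backward()
optimizer.step()
```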

4 Results

4.1 Raw Signals

Figure 2 shows one signal from the PhysioNet dataset and a predicted mask responsible for R-peaks detection. The problem lies in the fact that the trunk branch is able to detect the peaks on its own. Almost all of the meaningful attention masks behave this way: they highlight only those areas where the corresponding feature map from the trunk branch is already active. This means that the mask branches do not perform feature selection, but only scale the output of the trunk.

Fig. 2. An example of a mask responsible for R-peaks detection: (a) input ECG signal; (b) attention mask responsible for R-peaks detection; (c) output of the trunk; (d) input to the post-processing ResNet block.

4.2 Logarithmic Spectrograms

More useful and interpretable masks can be obtained by switching from raw signals to their logarithmic spectrograms. Figure 3 illustrates one of the masks predicted by the network, trained on the PhysioNet dataset. In the first two seconds, it suppresses almost all the frequencies in the recording, while in the rest of the signal, it acts as a low-pass filter with a cutoff frequency of about 60 Hz. This mask can be applied to the original signal by taking its short-time Fourier transform, multiplying it by the mask, and performing the inverse transform. This operation results in noise removal at the beginning of the recording.
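The mask-application step described here can be reproduced with a standard STFT/inverse-STFT round trip; the sketch below uses scipy with a random placeholder signal and mask, since the actual mask comes from the network's mask branch and the STFT window length is an assumption.

```python
# Apply a spectrogram-domain mask to a raw ECG signal: STFT -> multiply -> ISTFT.
import numpy as np
from scipy.signal import stft, istft

fs = 300                                    # PhysioNet sampling rate, Hz
signal = np.random.randn(10 * fs)           # stand-in for a 10-s ECG crop

f, t, spec = stft(signal, fs=fs, nperseg=64)
mask = np.random.rand(*spec.shape)          # placeholder for the predicted mask

# Suppress masked frequencies and reconstruct the time-domain signal.
filtered_spec = spec * mask
_, filtered_signal = istft(filtered_spec, fs=fs, nperseg=64)
```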

Fig. 3. An example of a mask responsible for noise reduction: (a) input ECG signal; (b) log-spectrogram of the signal; (c) input signal with the mask applied; (d) predicted mask.

Figure 4 shows an ECG from the MIT-BIH Atrial Fibrillation Database and two masks responsible for the detection of arrhythmic and non-arrhythmic parts of the signal. Recall that in this case the model did not know the position of the boundary of heart rhythm change, but only the fraction of the segment occupied by the arrhythmic part.


Fig. 4. An example of masks responsible for arrhythmic and non-arrhythmic segment detection: (a) input ECG signal; (b) attention mask responsible for arrhythmic segment detection; (c) attention mask responsible for non-arrhythmic segment detection. The green line represents the boundary of heart rhythm change.

5 Conclusion

In this work, an interpretable convolutional neural network based on the attention mechanism was considered, using the ECG classification task as an example. Such an architecture allows a better understanding of the model's decision-making process by visualizing the masks generated by the attention modules. The model was trained both on raw signals and on their logarithmic spectrograms. In the first case, the generated masks did not perform any meaningful feature map filtering, but in the second case, interpretable masks responsible for noise reduction and arrhythmic part detection were obtained. This means that the attention mechanism behaves differently depending on the type of input data and that it is important to determine whether the predicted masks actually perform feature selection.

References
1. Goldberger, A.L., Amaral, L.A.N., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.-K., Stanley, H.E.: PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101(23), e215–e220 (2000)
2. Clifford, G., Liu, C., Moody, B., Lehman, L.H., Silva, I., Li, Q., Johnson, A., Mark, R.G.: AF classification from a short single lead ECG recording: the PhysioNet computing in cardiology challenge 2017. Comput. Cardiol. 44 (2017)
3. Moody, G.B., Mark, R.G.: A new method for detecting atrial fibrillation using R-R intervals. Comput. Cardiol. 10, 227–230 (1983)
4. Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., Tang, X.: Residual attention network for image classification. ArXiv e-prints, April 2017
5. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. ArXiv e-prints, December 2015
6. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014)
7. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: International Conference on Machine Learning (2010)
8. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. ArXiv e-prints, February 2015
9. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: International Conference on Artificial Intelligence and Statistics, pp. 249–256 (2010)
10. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. ArXiv e-prints, December 2014
11. Khudorozhkov, R., Illarionov, E., Kuvaev, A., Podvyaznikov, D.: CardIO library for deep research of heart signals (2017)

Reverse Engineering of Generic Shapes Using Quadratic Spline and Genetic Algorithm

Misbah Irshad, Munazza Azam, Muhammad Sarfraz, and Malik Zawwar Hussain

Lahore College for Women University, Lahore, Pakistan
Department of Information Science, Kuwait University, Kuwait City, Kuwait
University of the Punjab, Lahore, Pakistan

Abstract. An approach for reverse engineering of generic shapes is proposed which is useful for the vectorization of such shapes. The recommended scheme comprises different steps, including extracting the outlines of images, identifying feature points from the detected outlines, and curve fitting. Quadratic spline functions are used to find the optimal curve-fitting solution with the help of a soft computing technique, the genetic algorithm (GA), which gives the best suitable values of the shape parameters. The genetic algorithm, a technique usually used to find optimal solutions of fairly complicated problems, has been utilized to calculate the optimal values of the parameters in the quadratic spline representation, which give minimum error between the detected boundary of the image and the fitted spline curve.

Keywords: Spline · Reverse engineering · Genetic algorithm · Generic shapes · Images

1 Introduction

The activity of designing, manufacturing, assembling, and maintaining products and systems is known as engineering. Reverse engineering (RE) is a newer phenomenon that involves different activities; it may be defined as the reconstruction of products from extracted information. Scientifically, RE is relevant to computer science and computer aided geometric design (CAGD). Reverse engineering of shapes is the process of describing an existing object geometrically in the form of a computer aided design (CAD) model; it also helps in investigating and understanding the structure through interpreting the model of the object. Scanned digital data is reused in contour styling for creating a CAD model and requires adopting some curve or surface approximation scheme. In the process of reverse engineering, curve fitting is used extensively for replicating curves from accurate geometric data obtained from bitmap images [1–16]. Thus, new curve fitting algorithms are always welcome.


Researchers emphasize using splines when seeking curve and surface approximations, especially if piecewise fitting is needed. From the viewpoint of reverse engineering, splines approximate complex geometry more accurately than ordinary polynomials [2]. Soft computing plays a significant role in science and engineering. It mirrors the ability of the human mind in cognitive learning and consists of numerous paradigms such as Fuzzy Systems, Neural Networks and Genetic Algorithms. These techniques are used to figure out solutions to problems which are too complicated or difficult to handle with established mathematical techniques. Soft computing is advancing day by day. It is not a fantasy, a fusion, or a mere collection; rather, soft computing is a partnership in which every partner supplies a distinct methodology for tackling problems in its area. In general, the fundamental methodologies in soft computing are analogous. Soft computing may be considered a fundamental component of the developing field of conceptual aptitude [3, 4, 6]. The rest of the paper consists of three sections. Section 2 contains the phases of the reverse engineering approach using quadratic splines. Problem mapping and optimal curves using the genetic algorithm are discussed and demonstrated in Sect. 3. The paper is concluded in Sect. 4.

2 Phases of Reverse Engineering of Generic Shapes Using Quadratic Splines

This section discusses the reverse engineering method using the soft computing technique of genetic algorithms together with spline functions. This framework is useful for the vectorization of bitmap generic shapes. In computer graphics, expressing the boundaries of a bitmap as curves is a fundamental problem. Computer Aided Geometric Design is an area of research concerned with computational aspects of shapes, with different approaches and algorithms. Spline functions are especially helpful in computer graphics, the creation of curves and surfaces, CAGD and geometric modeling, because of the flexible way in which pieces of curves are connected. The proposed scheme comprises the following steps:
• extracting outlines of images,
• identifying feature points from the detected outlines,
• fitting curves.
The suggested scheme uses quadratic spline functions with two parameters to calculate the optimal curve-fitting solution with the help of a genetic algorithm, which gives the best suitable values of the shape parameters.

2.1 Boundary Extraction

The suggested scheme begins by detecting the boundary of the bitmap image and using the result to find corner points. The objective of boundary detection is to produce the shape of planar objects in a graphical depiction. Some familiar representations are chain codes, syntactic methods, boundary approximations and scale-space techniques. The chain code is a broadly used representation; its advantage is that it presents the direction of the edges. The boundary points are chosen as contour points based on their corner strength and variations. A bitmap image and its extracted boundary are shown in Figs. 1 and 2, respectively.
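As a rough prototype of this boundary-extraction step, the sketch below uses OpenCV's contour extraction as a stand-in for an explicit chain-code implementation; the file name and threshold value are assumptions.

```python
# Extract the outline of a bitmap shape (OpenCV contours instead of chain codes).
import cv2
import numpy as np

image = cv2.imread("airplane.bmp", cv2.IMREAD_GRAYSCALE)  # hypothetical input file
assert image is not None, "input bitmap not found"
_, binary = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)

# CHAIN_APPROX_NONE keeps every boundary pixel, like a dense chain-code trace.
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
boundary = max(contours, key=cv2.contourArea).squeeze()    # (m, 2) array of (x, y)
print("boundary points:", boundary.shape[0])
```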

Fig. 1. Real airplane image

2.2 Detecting Corner Points

In this phase, corner points are beneficial for sketching an image. They are important for the following reasons:
• They diminish the intricacy of the boundary and shorten the fitting method.
• Every shape contains natural break points, generally points of discontinuity, at which the boundary splits into smaller segments for better approximation.
The approach used to find corner points is explained in [7]. It comprises a two-pass algorithm. In the first pass, the algorithm assigns a corner strength to every contour point, treating every point as a candidate corner point. In the second pass, excessive points are discarded, leaving only feasible corners. Corners split the contour into pieces, each piece bounded by two successive corners. After dividing the data into small pieces, the data values can be expressed as

P_{i,j} = (x_{i,j}, y_{i,j}), \qquad i = 1, 2, \ldots, n,\; j = 1, 2, \ldots, m_i,

where n is the number of pieces, m_i is the number of data points in the i-th piece and P_{i,j} represents the j-th point of the i-th piece. Figure 3 presents the corner points of the bitmap image.

2.3 Quadratic Function for Curve Fitting

For curve fitting, the quadratic spline function constructed by Sarfraz, an alternative to cubic spline functions, is used [16]. Let F_i, Z_i, F_{i+1}, i ∈ Z, be the control points of the i-th segment, and D_i, D_{i+1} the corresponding tangents at the corner points. The quadratic spline consists of two conic pieces such that conic 1 passes through F_i and Z_i and conic 2 passes through Z_i and F_{i+1}.


Fig. 2. Determination of borderline points from airplane image

Fig. 3. Determination of corner points from the boundary of the plane image

Then the conic spline functions with shape control parameters r_i and s_i are as follows. Conic 1, with control points F_i, V_i and Z_i, is defined as

R_i(t) = F_i (1 - \theta)^2 + 2 V_i \theta (1 - \theta) + Z_i \theta^2, \qquad (1)

with

t_i^* = \frac{t_i + t_{i+1}}{2}, \qquad V_i = F_i + \frac{1}{2} r_i h_i D_i, \qquad Z_i = \frac{V_i + W_i}{2}.

Conic 2, with control points Z_i, W_i and F_{i+1}, is defined as

R_i^*(t) = Z_i (1 - \theta^*)^2 + 2 W_i \theta^* (1 - \theta^*) + F_{i+1} (\theta^*)^2, \qquad (2)

with

t_i^* = \frac{t_i + t_{i+1}}{2}, \qquad W_i = F_{i+1} - \frac{1}{2} s_i h_i D_{i+1}.


Here r_i and s_i are shape control parameters, h_i is the step size, and t_i and t_{i+1} denote the values of t at the points F_i and F_{i+1}. The midpoint t_i^* of t_i and t_{i+1} denotes the value of t at Z_i, and \theta and \theta^* are normalized parameters. Figures 4, 5 and 6 show the curves of both conics.
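To make the two conic pieces concrete, the following numpy sketch evaluates Eqs. (1) and (2) for a single segment; the sample control points, tangents and parameter values are illustrative assumptions.

```python
# Evaluate the two quadratic (conic) pieces of one spline segment, Eqs. (1)-(2).
import numpy as np

def conic_segment(F_i, F_ip1, D_i, D_ip1, r_i, s_i, h_i, num=50):
    """Return points on conic 1 (F_i -> Z_i) and conic 2 (Z_i -> F_{i+1})."""
    V_i = F_i + 0.5 * r_i * h_i * D_i          # inner control point of conic 1
    W_i = F_ip1 - 0.5 * s_i * h_i * D_ip1      # inner control point of conic 2
    Z_i = 0.5 * (V_i + W_i)                    # joint point of the two conics

    theta = np.linspace(0.0, 1.0, num)[:, None]       # normalized parameter
    conic1 = F_i * (1 - theta) ** 2 + 2 * V_i * theta * (1 - theta) + Z_i * theta ** 2
    conic2 = Z_i * (1 - theta) ** 2 + 2 * W_i * theta * (1 - theta) + F_ip1 * theta ** 2
    return conic1, conic2

# Illustrative data: one segment between two corner points with unit step size.
F_i, F_ip1 = np.array([0.0, 0.0]), np.array([4.0, 1.0])
D_i, D_ip1 = np.array([1.0, 1.0]), np.array([1.0, -1.0])
c1, c2 = conic_segment(F_i, F_ip1, D_i, D_ip1, r_i=2.0, s_i=3.0, h_i=1.0)
```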

Fig. 4. Fitted conic 1 for r_i = 2

3 Problem Representation and Its Solution with the Help of GA

The purpose of this work is to seek optimal methods for finding the best curve fit to the original contour. Thus, appropriate parameter values are required to minimize the sum of squared errors. Suppose that for data segments i = 1, 2, ..., n the given boundary points are P_{i,j} = (x_{i,j}, y_{i,j}), j = 1, 2, ..., m_i, and their corresponding parametric curves are P_i(t_j). Mathematically, the sum of squared distances is determined as

S_i = \sum_{j=1}^{m_i} \bigl| P_i(t_j) - P_{i,j} \bigr|^2, \qquad i = 0, 1, 2, \ldots, n-1,\; j = 1, 2, \ldots, m_i.

Here the parameter values U_{i,j} are assigned on the basis of chord-length parameterization.

Conic 1: The sum of squared errors S_i for the conic (1) can be calculated as

S_i = \sum_{j=1}^{m_i} \bigl| P_i(U_{i,j}) - P_{i,j} \bigr|^2, \qquad i = 1, \ldots, m_i.

Conic 2: Correspondingly, the sum of squared errors for the conic described by the quadratic spline function (2) can be calculated as

S_i^* = \sum_{j=1}^{m_i} \bigl| P_i^*(U_{i,j}) - P_{i,j} \bigr|^2, \qquad i = 1, \ldots, m_i.


Fig. 5. Fitted conic 2 for s_i = 3

Fig. 6. Fitted curve for r_i = 2 and s_i = 3

3.1 Initialization

The curve-fitting method requires a bitmap image of a generic shape as input; the boundary of the image is extracted using the technique explained in Sect. 2.1. The second phase is detecting corner points as described in Sect. 2.2; this corner detection method assigns a corner strength to every boundary point of the image. The interpolating spline function stated in Eqs. (1) and (2) is used for fitting the curve in every segment, where the initial values of the parameters r_i and s_i are chosen arbitrarily and the tangent vectors are computed with some approximation method. To minimize the sum of squared errors, suitable values of these parameters must be found; a genetic algorithm has been applied to calculate the optimal parameter values. The parameters of the GA used in this work are as follows (a sketch of this setup is given after the list):
• Size of population = 25
• Genome size = 5
• Pick ratio = 0.8
• Mutation rate = 0.001
• r_min = 0
• r_max = 3
• s_min = 0.5
• s_max = 3
• Threshold = 3
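A minimal sketch of a genetic algorithm over the shape parameters (r_i, s_i) is given below, using the population size, mutation rate and parameter bounds listed above; the fitness function is a placeholder standing in for the sum of squared errors S_i + S_i^* of the fitted segment, and the selection and crossover operators are assumptions rather than the authors' exact choices.

```python
# Minimal GA over the shape parameters (r, s) of one segment.
import numpy as np

rng = np.random.default_rng(0)
POP_SIZE, MUT_RATE, PICK_RATIO = 25, 0.001, 0.8
R_MIN, R_MAX, S_MIN, S_MAX = 0.0, 3.0, 0.5, 3.0

def fitness(r, s):
    # Placeholder: in the real scheme this evaluates S_i + S_i* for the segment.
    return (r - 2.0) ** 2 + (s - 3.0) ** 2

# Initial population of (r, s) pairs within the stated bounds.
pop = np.column_stack([rng.uniform(R_MIN, R_MAX, POP_SIZE),
                       rng.uniform(S_MIN, S_MAX, POP_SIZE)])

for generation in range(100):
    scores = np.array([fitness(r, s) for r, s in pop])
    parents = pop[np.argsort(scores)[: int(PICK_RATIO * POP_SIZE)]]  # truncation selection
    # Arithmetic crossover of random parent pairs, plus rare Gaussian mutation.
    idx = rng.integers(0, len(parents), size=(POP_SIZE, 2))
    alpha = rng.random((POP_SIZE, 1))
    pop = alpha * parents[idx[:, 0]] + (1 - alpha) * parents[idx[:, 1]]
    mutate = rng.random(pop.shape) < MUT_RATE
    pop = np.where(mutate, pop + rng.normal(0, 0.1, pop.shape), pop)
    pop[:, 0] = np.clip(pop[:, 0], R_MIN, R_MAX)
    pop[:, 1] = np.clip(pop[:, 1], S_MIN, S_MAX)

best_r, best_s = pop[np.argmin([fitness(r, s) for r, s in pop])]
```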

3.2 Breaking Segments

In many situations corner points alone do not yield the best results. Such cases are handled by introducing additional points into the catalogue of corner points. Thus, where the gap between the parametric curve and the boundary curve exceeds some predefined threshold, the pieces are subdivided into smaller parts at the worst-fitting points. The distance between the digitized curve points P_{i,j} and the parametric points P(t_{i,j}) is defined as

d = \max\bigl( \bigl| Px_{i,j} - Px(t_{i,j}) \bigr|, \bigl| Py_{i,j} - Py(t_{i,j}) \bigr| \bigr),

where Px_{i,j} and Py_{i,j} are the x and y coordinates of the boundary points and Px(t_{i,j}) and Py(t_{i,j}) are their corresponding parametric points on the curve. For every new segment a new parametric curve is fitted. The subdivision process continues until the distance between the boundary points and the parametric curve drops below the threshold.
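The subdivision test can be expressed in a few lines; the sketch below assumes the boundary points and their corresponding parametric points are already available as arrays, and uses the threshold of 3 from the GA parameter list.

```python
# Split a segment at the worst-fitting point when the max coordinate-wise
# deviation d exceeds the threshold.
import numpy as np

def split_if_needed(boundary_pts, curve_pts, threshold=3.0):
    """boundary_pts, curve_pts: (m, 2) arrays of corresponding points."""
    deviations = np.abs(boundary_pts - curve_pts).max(axis=1)   # d per point
    worst = int(np.argmax(deviations))
    if deviations[worst] <= threshold:
        return [boundary_pts]                                   # fit is good enough
    # Introduce a break point at the worst deviation and refit both halves.
    return [boundary_pts[: worst + 1], boundary_pts[worst:]]
```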

3.3 Demonstration

In this section the proposed scheme is demonstrated on the image of a plane. The original bitmap image of the plane is shown in Fig. 1; Figs. 2 and 3 provide the boundary of the image and the boundary along with the corner points. Figure 7 shows the initial fit, obtained by applying the functions in Eqs. (1) and (2) and the first iteration of the GA. Figure 8 gives the fitted curve after applying 4 iterations of the genetic algorithm with the boundary and corner points; this figure also includes breaking points where needed.

Fig. 7. Fitted quadratic function (2) for first iteration of genetic algorithm

Fig. 8. Fitted curve at iteration 4


4 Conclusion

In this work an approach for reverse engineering of generic shapes is proposed, which is beneficial for the vectorization of generic shapes. The proposed scheme comprises several steps, including extracting the outline of images, identifying feature points from the detected outlines, and curve fitting. A quadratic spline function with two parameters is utilized to compute the optimal curve fitted to the corner points with the aid of a heuristic approach, the genetic algorithm. The genetic algorithm (GA) helps find the optimal values of the parameters in the quadratic spline representation that give the least error between the detected boundary of the image and the fitted quadratic spline curve. In the future this work might be extended to reverse engineer and design 3D shapes.

References
1. Kumar, A.: Encoding schemes in genetic algorithm. Int. J. Adv. Res. IT Eng. 2(3), 1–7 (2013)
2. Kvernes, B., Andersson, F.: Bezier and B-spline Technology (2013)
3. Brujic, D., Ainsworth, I., Ristic, M.: Fast and accurate NURBS fitting for reverse engineering. Int. J. Adv. Manuf. Technol. 54, 691–700 (2011)
4. Joshi, G.: Review of genetic algorithm: an optimization technique. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 4(4), 802–805 (2014)
5. Juhasz, I., Hoffmann, M.: On parametrization of interpolating curves. J. Comput. Appl. Math. 216, 413–424 (2008)
6. Borna, K., Hashemi, V.H.: An improved genetic algorithm with a local optimization and an extra mutation level for solving travelling salesman problem. Int. J. Comput. Sci. Eng. Inf. Technol. (IJCSEIT) 4(4), 47–53 (2014)
7. Chetrikov, D., Zsabo, S.: A simple and efficient algorithm for detection of high curvature points in planar curves. In: Proceedings of the 23rd Workshop of the Australian Pattern Recognition Group, pp. 1751–2184 (1999)
8. Krogmann, K., Kuperberg, M.: Using genetic search for reverse engineering of parametric behaviour models for performance prediction. IEEE Trans. Softw. Eng. 36(6), 865–877 (2010)
9. Shao, L., Zhou, H.: Curve fitting with Bezier cubics. Graphical Models Image Process. 58(3), 223–232 (1996)
10. Hristakeva, M., Shrestha, D.: Solving the 0–1 Knapsack Problem with Genetic Algorithms (2003)
11. Gleicher, M.: A Curve Tutorial for Introductory Computer Graphics. Department of Computer Science, University of Wisconsin, Madison (2004)
12. Irshad, M., Khalid, S., Hussain, M.Z., Sarfraz, M.: Outline capturing using rational functions with the help of genetic algorithm. Appl. Math. Comput. 274, 661–678 (2016)
13. Reddy, M., Swami, V.: Evolutionary computation of soft computing engineering, progress. Sci. Eng. Res. J. ISSN 2347-6680 (E)
14. Sarfraz, M., Sait, S.M., Balah, M., Baig, M.H.: Computing optimized NURBS curves using simulated evolution on control parameters. In: Tiwari, A., Roy, R., Knowles, J., Avineri, E., Dahal, K. (eds.) Applications of Soft Computing. Advances in Intelligent and Soft Computing, vol. 36. Springer, Heidelberg (2006)
15. Sarfraz, M., Irshad, M., Hussain, M.Z.: Reverse engineering of planar objects using GAs. Sains Malaysiana 42(8), 1167–1179 (2013)
16. Sarfraz, M., Hussain, M.Z., Chaudary, F.S.: Shape preserving cubic spline for data visualization. Comput. Graph. CAD/CAM 1(6), 185–193 (2005)

Bayesian Estimation for Fast Sequential Diffeomorphic Image Variability

Youshan Zhang

Computer Science and Engineering, Lehigh University, Bethlehem, PA 18015, USA
[email protected]

Abstract. In this paper, we analyze diffeomorphic image variability using a Bayesian method to estimate the low-dimensional feature space of a series of images. We first develop a fast sequential diffeomorphic image registration for atlas building (FSDAB) to reduce the computation time. To analyze image variability, we propose a fast Bayesian version of the principal geodesic analysis (PGA) model that avoids the trivial expectation maximization (EM) framework. The sparse BPGA model can automatically select the relevant dimensions by driving unnecessary principal geodesics to zero. To show the applicability of our model, we use 2D synthetic data and 3D MRIs. Our results indicate that the automatically selected dimensions from our model can reconstruct unobserved testing images with lower error, and that our model can show the shape deformations that correspond to an increase of time.

Keywords: Bayesian estimation · Principal geodesic analysis · Diffeomorphic image registration · Dimensionality reduction

1 Introduction

Medical image registration is an essential branch of computer vision and image processing, and it plays a vital role in medical research, disease diagnosis, surgical navigation, and other medical treatment [1–3]. For effective information integration, the fusion of information from various images, or from different time-series images of the same patient, is particularly valuable. It can markedly improve the level of clinical diagnosis, treatment, disease monitoring, surgery, and therapeutic effect evaluation, for example through the fusion of anatomical and functional images, which can provide an accurate description of the anatomical location of abnormal physiological regions. The fusion of images from different modalities can also be applied to radiation therapy, surgical navigation, and tumor growth monitoring [4]. Therefore, image registration for atlas (template) building is essential in the medical field.

Many works have addressed the image registration problem. Elsen et al. summarized several medical image registration technologies that realize the alignment of different images [5]. Other methods include mutual information for multi-modality image registration [6] and the Fourier transform [7]. Image registration consumes substantial computation time, especially for 3D images; Plishker et al. discussed acceleration techniques for medical image registration [8]. Nevertheless, one crucial criterion in medical image registration is that anatomical structures be in one-to-one correspondence after registration, i.e. the transformation has to be topology-preserving (diffeomorphic). If a geometric shape differs significantly between two or more images, a topology-preserving transformation is hard to generate. To solve this problem, several geodesic registration methods on manifolds have been proposed, e.g., Large Deformation Diffeomorphic Metric Mapping (LDDMM) [9,10]. LDDMM provides a mathematically robust solution to large-deformation registration problems by finding geodesic paths of transformations on the manifold of diffeomorphisms. Its advantage is that it can solve the large-deformation registration problem, but the transformation is computationally very costly if the shape change is relatively large. Zhang et al. proposed a fast geodesic shooting algorithm for atlas building based on the metric of the original LDDMM, which was faster and less memory intensive than the original LDDMM method [11]. However, the original LDDMM algorithm is time-consuming and can become stuck in a local minimum if there is a significant difference between the two images.

To overcome these issues, we first propose a fast sequential diffeomorphic image registration for atlas building (FSDAB) to reduce the computation time. To analyze image variability, we propose a fast Bayesian version of the principal geodesic analysis (PGA) model that avoids the trivial expectation maximization (EM) framework. To validate our model, we use 2D synthetic data and 3D MRIs. Our model can reconstruct the ground truth image with fewer selected dimensions and can show the image variability as time increases.

2 Background

In this section, we briefly review the mathematical background for diffeomorphic atlas building.

2.1 Diffeomorphism in Image Registration

Given a source image I_0 (a.k.a. the template or fixed image) and a target image I_1 (a.k.a. the moving image), the aim of image registration is to find a transformation φ : Ω → Ω, where Ω ⊂ R^n is the domain of the data (n = 2 for 2D images and n = 3 for 3D images), so that I_1 = φ I_0 = I_0 ∘ φ^{-1}. The transformation should not only guarantee that corresponding spatial locations and anatomical positions coincide in the two images, but should also preserve topology (diffeomorphism) under large deformations. A diffeomorphic transformation φ is a globally one-to-one, continuous and smooth mapping with a continuous and smooth inverse; in particular, the inverse transformation φ^{-1} exists and both φ and φ^{-1} are invertible. The diffeomorphisms form a group Diff under the composition operation, i.e. φ_1 ∘ φ_2 ∈ Diff if φ_1, φ_2 ∈ Diff:

\mathrm{Diff} = \{ \varphi : \Omega \to \Omega \mid \varphi \text{ and } \varphi^{-1} \text{ are differentiable} \}. \qquad (1)

By using the composition operation, we can recursively form φ_k as a polygonal line in Diff (φ_{k+1} = φ_k ∘ ψ_k, where φ_0 = Id and ψ ∈ Diff). We then denote this polygonal line as a curve φ(x, t), 0 ≤ t ≤ 1, where t is the time variable and φ(x, t) is the transformation of x at time t. For a small deformation, we use a small displacement field u to model the transformation: φ = x + u. In contrast, for a large deformation, we introduce an extra time variable t to encode the warping transformation path φ(x, t) between the source and target images. When φ(x, t) is differentiable in t, we obtain Eq. (2), which generates a diffeomorphism:

\frac{d}{dt} \varphi_t(x) = v_t \circ \varphi_t(x), \qquad (2)

where v satisfies continuity conditions that guarantee the existence of the solution. Therefore, optimizing the diffeomorphic transform φ is equivalent to optimizing the time-varying velocity field v_t.
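As a rough numerical illustration of Eq. (2), the sketch below integrates a velocity field with explicit Euler steps to build the transformation φ on a 2D grid; for simplicity a stationary toy field is used, and the grid size, step count and interpolation choices are assumptions.

```python
# Euler integration of d(phi)/dt = v o phi for a 2D velocity field.
import numpy as np
from scipy.ndimage import map_coordinates

H, W, steps = 64, 64, 20
ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")

# Toy smooth velocity field of shape (2, H, W): a gentle horizontal shear.
v = np.zeros((2, H, W))
v[1] = 2.0 * np.sin(np.pi * ys / H)

phi = np.stack([ys, xs]).astype(float)       # phi_0 = identity map
dt = 1.0 / steps
for _ in range(steps):
    # Sample v at the current positions phi (i.e. v o phi), then step forward.
    v_at_phi = np.stack([
        map_coordinates(v[c], phi, order=1, mode="nearest") for c in range(2)
    ])
    phi = phi + dt * v_at_phi

# phi now maps grid points to their positions after flowing along v for t = 1;
# an image can be warped by sampling it at phi with map_coordinates.
```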

2.2 LDDMM

The Large Deformation Diffeomorphic Metric Mapping (LDDMM) model is a standard registration method for measuring large deformations between a source image (I_0) and a target image (I_1) [9]. It aims to minimize the energy function in Eq. (3), which has two terms: a regularity term, measuring the smoothness of the transformation, and a similarity term, measuring the similarity between the estimated image and the target image:

E(v) = \int_0^1 \| v_t \|_L^2 \, dt + \frac{1}{\sigma^2} \, \| I_0 \circ \varphi_1^{-1} - I_1 \|^2, \qquad (3)

where v is the time-varying velocity and L is a differential operator that controls the spatial regularity of the deformation fields, defined as L = -\alpha \nabla^2 + \gamma I_{n \times n}, where \nabla^2 is the Laplacian operator and I_{n \times n} is the identity operator. σ controls the similarity term, t is time, and I_0 ∘ φ_1^{-1} denotes the warped source image I_0.

2.3 Numerical Algorithm of LDDMM

In the numerical implementation, standard steepest gradient descent is used to minimize the energy in Eq. (3). Specifically, the time-varying velocity fields are discretized into N time points (v_{t_i})_{0 \le i \le N-1}. For each time point i, the velocity is updated with

v_{t_i} \leftarrow v_{t_i} - \nabla_{v_{t_i}} E_{t_i}, \qquad (4)


where \nabla_{v_t} E_t is the gradient of Eq. (3) with respect to v:

\nabla_{v_t} E_t = 2 v_t - K * \left( \frac{2}{\sigma^2} \, |D\varphi_{t,1}| \, (J_t^0 - J_t^1) \, D(J_t^0) \right), \qquad (5)

where K = (L^\dagger L)^{-1}, * is the convolution operation, |Dφ_{t,1}| is the determinant of the Jacobian matrix, φ_{s,t} = φ_t ∘ φ_s^{-1}, J_t^0 = I_0 ∘ φ_{t,0} and J_t^1 = I_1 ∘ φ_{t,1}. However, LDDMM requires a long computation time and large memory to store the N velocity fields. In each iteration, it also needs to compute N gradient fields, N compositions for φ_{t,1} and N inverse problems; it is therefore a very expensive algorithm. Moreover, the warped source image I_0 ∘ φ_1^{-1} may get stuck in a local optimum and cause a high mismatch error ||I_0 ∘ φ_1^{-1} − I_1||. As shown in Fig. 1, the warped source image (c) is stuck in a local minimum and did not recover the full "C" shape. To overcome these issues, we develop a fast sequential diffeomorphic image registration method for atlas building.

Fig. 1. Circle registration results using LDDMM. (a): source image, (b): target image, (c): LDDMM results, (d): difference between (b) and (c).

3 Fast Sequential Diffeomorphic Atlas Building (FSDAB)

Given input images I^1, ..., I^N, the atlas building task is to find a template image I that minimizes the difference between I and the N input images. Differing from minimizing the sum-of-squared-distances function min_I (1/N) Σ_{i=1}^N ||I − I^i||^2 as in [12], we aim to minimize the following energy function:

E = \arg\min_I \sum_{i=1}^{N} \sum_{k=1}^{\kappa} \left( \frac{1}{2} \int_0^1 \| v_t^i \|_L^2 \, dt + \frac{1}{2\sigma^2} \, \| I_k^i \circ (\varphi_k^i)^{-1} - I^i \|^2 \right). \qquad (6)

Atlas building needs to find the optimal v_t^i and update the atlas. Differing from Eq. (3), we have a sequential I_κ in the similarity term, and the template image I^i = I_κ^i. This aims to resolve the local-minimum problem of the warped source image and to avoid the situation in Fig. 1: the I_0 in Eq. (3) never changes, whereas in our new FSDAB model the template can be updated in each iteration.

Similar to LDDMM, the next step is to take the gradient of Eq. (6) with respect to v. The key step in the proof is to introduce the Gateaux variation of φ_{s,t} with respect to v (Lemma 2.1 from [9]). Here, a small perturbation of v at time r (i.e. h_r) affects all the transforms φ_t for t > r cumulatively. We have

\partial_h \varphi_{s,t} = D\varphi_{s,t} \int_s^t (D\varphi_{s,r})^{-1} \, h_r \circ \varphi_{s,r} \, dr. \qquad (7)

For the similarity term, we have (see the Fréchet derivative in the proof of Theorem 2.1 in [9]):

\partial_h S(v) = - \sum_{i=1}^{N} \sum_{k=1}^{\kappa} \int_0^1 \left\langle \frac{1}{\sigma^2} \, |D\varphi_{t,1}| \, (J_{t_i}^{0k} - J_{t_i}^{1i}) \, D(J_{t_i}^{0k}), \; h_{t_i} \right\rangle dt_i \qquad (8)

= - \sum_{i=1}^{N} \sum_{k=1}^{\kappa} \int_0^1 \left\langle K \left( \frac{1}{\sigma^2} \, |D\varphi_{t,1}| \, (J_{t_i}^{0k} - J_{t_i}^{1i}) \, D(J_{t_i}^{0k}) \right), \; h_{t_i} \right\rangle dt_i, \qquad (9)

where J_{t_i}^{0k} = I_k^i ∘ φ^i_{t_i,0}, J_{t_i}^{1i} = I^i ∘ φ_{t_i,1}, D is the Jacobian matrix and |·| is the determinant of the matrix. For the regularization term, the Gateaux variation is easy to compute:

\partial_h R(v) = \sum_{i=1}^{N} \int_0^1 \langle v_{t_i}, h_{t_i} \rangle_V \, dt_i. \qquad (10)

By collecting both the regularization and similarity terms, the Gateaux variation \partial_h E(v) can be represented as \partial_h E(v) = \sum_{i=1}^{N} \int_0^1 \langle \nabla_{v_{t_i}} E_{t_i}, h_{t_i} \rangle \, dt_i, where \nabla_{v_{t_i}} E_{t_i} is defined as

\nabla_{v_{t_i}} E_{t_i} = \sum_{i=1}^{N} \left( v_{t_i} - \sum_{k=1}^{\kappa} K\!\left( \frac{1}{\sigma^2} \, |D\varphi_{t,1}| \, (J_t^0 - J_t^1) \, D(J_t^0) \right) \right), \qquad (11)

where K = (L^\dagger L)^{-1}, J_{t_i}^{0k} = I_k^i ∘ φ^i_{t_i,0} and J_{t_i}^{1i} = I^i ∘ φ_{t_i,1}.

3.1 Numerical Algorithm of FSDAB

As before, we use steepest gradient descent to minimize the energy in Eq. (6). The time-varying velocity fields are discretized into N time points (v_{t_i})_{0 \le i \le N-1}, and for each time point i the velocity is updated with

v_{t_i} \leftarrow v_{t_i} - \nabla_{v_{t_i}} E_{t_i}, \qquad (12)

where \nabla_{v_t} E_t is the gradient of Eq. (6) with respect to v, given in Eq. (11). Using the updated images I_k^i, we obtain the closed-form solution for our template I:

I = \frac{1}{N} \sum_{i=1}^{N} \{ I_k^i \circ \varphi_\kappa^i \}. \qquad (13)


To realize a fast version of sequential diffeomorphic atlas building, we calculate the correlation between the warped source image and the target image; if the correlation does not change after a certain number of iterations, we proceed to the next stage.

Algorithm 1. Fast Sequential Diffeomorphic Atlas Building
Input: Source images I^1, I^2, ..., I^N, noise α, number of iterations itr, and smooth stage κ
Output: Template image I, and warped images I_k^i ∘ φ_κ^i
1: Initialize the transformation field φ, velocity v and template image I
2: For i = 1 to N
3:   For k = 1 to κ
4:     Repeat
5:       Calculate φ_k^i according to Eq. (2)
6:       Calculate v_t^i according to Eq. (12)
7:       Update image I_k^i = I_k^i ∘ φ_κ^i
8:     Until corr(I_k^i ∘ φ_κ^i, I^i) does not change
9:   end
10: end
11: Calculate the template image I according to Eq. (13)

4 Fast Bayesian Principal Geodesic Analysis (FBPGA)

To analyze the image variability, we develop a fast Bayesian principal geodesic analysis model. The PGA model was proposed by Fletcher et al. [13] and is used to reduce the dimensionality of data on manifolds. We first calculate the intrinsic mean of the data using Algorithm 2; afterward, we can perform PGA using Algorithm 3.

Algorithm 2. Intrinsic Mean for Principal Geodesic Analysis
Input: Warped images I_k^1 ∘ φ_κ^1, I_k^2 ∘ φ_κ^2, ..., I_k^N ∘ φ_κ^N ∈ M from Alg. 1
Output: μ ∈ M, the intrinsic mean
1: μ_0 = I_k^1 ∘ φ_κ^1
2: Do: Δμ = (τ/N) Σ_{i=1}^N Log_{μ_j}(x_i)
3:   μ_{j+1} = Exp_{μ_j}(Δμ)
4: While ||Δμ|| > ε

To avoid the trivial EM algorithm of the Bayesian PGA model and to automatically select the principal geodesics from images, we propose a fast version of the Bayesian PGA model. Unlike [14], which defined a Gaussian prior on the slope of their model, our FBPGA includes a parameter γ that can automatically choose the optimal dimensionality.


Algorithm 3. Principal Geodesic Analysis
Input: Warped images I_k^1 ∘ φ_κ^1, I_k^2 ∘ φ_κ^2, ..., I_k^N ∘ φ_κ^N ∈ M from Alg. 1
Output: Eigenvectors vec and eigenvalues λ of the input data
1: μ = intrinsic mean of {x_i}
2: x_i = Log_μ(x_i)
3: S = (1/N) Σ_{i=1}^N x_i x_i^T
4: vec_k, λ_k = eigenvectors/eigenvalues of S

Algorithm 4. Fast Bayesian Principal Geodesic Analysis
Input: Eigenvectors vec and eigenvalues λ from Alg. 3
Output: Image variability of the registered images
1: γ = λ_d^2
2: Choose the reduced dimension d
3: I_α = μ + α Σ_{i=1}^d vec_i √λ_i

The value of γ is estimated iteratively as λ_d^2 in this model, and it thus enforces sparsity by driving the corresponding component eigenvectors to zero. More specifically, if γ is large, eigenvectors will effectively be removed from the latent space. This arises naturally because the larger γ is, the lower the probability of the eigenvectors will be. Here, we only consider the Log and Exp maps on the spherical manifold; for other manifolds (Kendall's and Grassmannian manifolds), please refer to [15] for the detailed calculation of the Log and Exp maps.

Sphere Manifold. A well-known spherical manifold is the 3D sphere (a 2D surface embedded in 3D space); let r be the radius of the sphere, u the azimuth angle and v the zenith angle. Any point on the 3D sphere can be expressed as X = (r sin u sin v, r cos u sin v, r cos v). The generalized (n−1)-dimensional hypersphere embedded in R^{n+1} Euclidean space (X_1, X_2, ..., X_n) has the constraint \sum_i x_i^2 = r^2, where r is the radius of such a hypersphere; we set r = 1. Let X_S and X_T be points on an n-dimensional sphere embedded in R^{n+1}, and let v be a tangent vector at X_S. Please refer to [16] for details of the Log and Exp maps on the sphere manifold. The Log map between two points p, p' on the sphere can be computed as

v = \mathrm{Log}(p, p') = \frac{\theta \cdot L}{\|L\|}, \qquad \theta = \arccos(\langle p, p' \rangle), \qquad L = p' - p \, \langle p, p' \rangle, \qquad (14)

where p ⟨p, p'⟩ denotes the projection of the vector p' onto p, and ||L|| = √⟨L, L⟩ is the Riemannian norm.


Given a base point p, its estimated tangent vector v from Eq. (14) and a time t, we can compute the Exp map as

\mathrm{Exp}(p, vt) = \cos\theta \cdot p + \frac{\sin\theta}{\theta} \, vt, \qquad \theta = \|vt\|. \qquad (15)
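The Log and Exp maps in Eqs. (14) and (15) translate directly into code; the sketch below implements them for unit-norm vectors with numpy and checks that Exp inverts Log (the test points and numerical tolerances are arbitrary).

```python
# Log and Exp maps on the unit hypersphere, following Eqs. (14) and (15).
import numpy as np

def sphere_log(p, q):
    """Tangent vector at p pointing towards q (both unit vectors)."""
    inner = np.clip(np.dot(p, q), -1.0, 1.0)
    theta = np.arccos(inner)
    L = q - p * inner                      # component of q orthogonal to p
    norm_L = np.linalg.norm(L)
    if norm_L < 1e-12:                     # p and q (nearly) coincide
        return np.zeros_like(p)
    return theta * L / norm_L

def sphere_exp(p, v):
    """Walk from p along tangent vector v for unit time."""
    theta = np.linalg.norm(v)
    if theta < 1e-12:
        return p.copy()
    return np.cos(theta) * p + np.sin(theta) / theta * v

p = np.array([1.0, 0.0, 0.0])
q = np.array([0.0, 1.0, 0.0])
v = sphere_log(p, q)
print(np.allclose(sphere_exp(p, v), q))    # True: Exp inverts Log
```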

5 Results

Our BEFSDIV model can not only accurately estimate the template image from a population of images, but can also reveal the diffeomorphic image variability of the estimated template. We demonstrate the effectiveness of our model using synthetic 2D data and real 3D T2 MRI brain data.

5.1 Synthetic 2D Data

In this synthetic 2D experiment, we want to estimate the template of the circle shapes and test whether our FBPGA model can automatically reduce the dimensionality of the images. We simulated a 2D synthetic dataset with 20 subjects, starting from a "standard" circle image; these images have a resolution of 50 × 50, as shown in Fig. 2(a).

Figure 2(b) compares our estimated template circle (left) and the ground truth circle (middle). The difference between the estimated template and the ground truth cannot be seen from the left and middle images alone, but the right image in Fig. 2(b) shows their difference: blue indicates little difference between the estimated template and the ground truth image, while yellow represents a significant difference. We can also visualize the image-difference variability of the template (Fig. 2(c)); these images are generated as I_α − I_true, where I_α is estimated from Algorithm 4. Figure 2(c) demonstrates that the color changes as α increases, and there is an obvious difference between the colors of the first and second principal geodesic models; here, the color also represents the range of the difference between the reconstructed images and the ground truth image. In addition, Fig. 2(d) compares the dimensionality of the BPGA and PGA models: our BPGA model automatically reduces the number of eigenvalues. These results illustrate the ability of our BPGA model to reduce high-dimensional features.

Fig. 2. The results of the synthetic 2D data using the BEFSDIV model: (a) synthetic 2D data; (b) left: our estimated template circle, middle: ground truth, right: the difference between the estimate and the ground truth image; (c) image-difference variability with α = −3, −2, −1, 0, 1, 2, 3 for the first and second principal geodesic models; (d) dimensionality of Bayesian PGA and PGA.

5.2 3D Brain Dataset

To demonstrate the effectiveness of our method on real 3D data, we apply our BEFSDIV model to a set of 3D T2 MRIs (Fig. 3(a)), a multiple sclerosis dataset [17]. The average of the MRIs (Fig. 3(b)) is blurry, whereas our estimated template image is clearly sharper; this demonstrates that our BEFSDIV model captures the general information of the T2 images well and that our method can be used to estimate a template image that provides a reliable reference for image fusion. We can also observe significant shape deformation in Fig. 3(d); these images are generated as I_α, and there are observable differences among them. Moreover, our BPGA model uses a smaller number of eigenvalues (only nine) than the PGA model, which also illustrates that the model can automatically reduce the number of features needed, as shown in Fig. 4.

Fig. 3. The results of the 3D MRIs using the BEFSDIV model: (a) axial slices from the MRIs; (b) average of the MRIs; (c) estimated template image; (d) image variability with α = −3, −2, −1, 0, 1, 2, 3 for the first and second principal geodesic models, showing different shape changes for the two models.

Fig. 4. Dimensionality of the Bayesian PGA and PGA models using the 3D MRIs.

6 Discussion

One apparent strength of the BEFSDIV method is that it can accurately estimate the template with less computation time. From the results on the synthetic images (Fig. 2), we observe that our estimated template has a small matching error, and the visualized image variability demonstrates how the shape changes. From the 3D MRI results, we can conclude that our model can be used to analyze shape changes; it could be useful for predicting brain deformations with increasing age. However, the fast stage k is determined by the correlation between the warped source image and the target image, which changes over time. Although we obtain a good estimated MRI template, our study has a limited sample size, as we only validated our model on ten MRIs.

7 Conclusion

In this paper, we propose a BEFSDIV model to analyze diffeomorphic image variability. We first develop a fast sequential diffeomorphic image registration method for atlas building (FSDAB) to reduce the computation time. To analyze image variability, we propose a fast Bayesian version of the principal geodesic analysis (PGA) model that avoids the trivial expectation-maximization (EM) framework. We test our model using 2D synthetic data and 3D MRIs. Our results indicate that the automatically selected dimensions from our model can reconstruct unobserved testing images with lower error, and that our model can show the shape deformations. In the future, we expect that the matching accuracy and efficiency of our models can be further improved by using a less memory-intensive version of our FSDAB model.

References
1. Lucas, B.D., Kanade, T., et al.: An iterative image registration technique with an application to stereo vision (1981)
2. Maes, F., Collignon, A., Vandermeulen, D., Marchal, G., Suetens, P.: Multimodality image registration by maximization of mutual information. IEEE Trans. Med. Imaging 16(2), 187–198 (1997)
3. Jordan, P., Maurer Jr., C.R., Myronenko, A., Chappelow, J.C.: Image registration of treatment planning image, intrafraction 3D image, and intrafraction 2D x-ray image. US Patent App. 15/862,438, 12 July 2018
4. Histed, S.N., Lindenberg, M.L., Mena, E., Turkbey, B., Choyke, P.L., Kurdziel, K.A.: Review of functional/anatomic imaging in oncology. Nucl. Med. Commun. 33(4), 349 (2012)
5. Van den Elsen, P.A., Pol, E.-J.D., Viergever, M.A.: Medical image matching - a review with classification. IEEE Eng. Med. Biol. Mag. 12(1), 26–39 (1993)
6. Collignon, A., Maes, F., Delaere, D., Vandermeulen, D., Suetens, P., Marchal, G.: Automated multi-modality image registration based on information theory. Inf. Process. Med. Imag. 3, 263–274 (1995)
7. Reddy, B.S., Chatterji, B.N.: An FFT-based technique for translation, rotation, and scale-invariant image registration. IEEE Trans. Image Process. 5(8), 1266–1271 (1996)
8. Plishker, W., Dandekar, O., Bhattacharyya, S., Shekhar, R.: A taxonomy for medical image registration acceleration techniques. In: 2007 IEEE/NIH Life Science Systems and Applications Workshop, LISA 2007, pp. 160–163. IEEE (2007)
9. Beg, M.F., Miller, M.I., Trouvé, A., Younes, L.: Computing large deformation metric mappings via geodesic flows of diffeomorphisms. Int. J. Comput. Vis. 61(2), 139–157 (2005)
10. Cao, Y., Miller, M.I., Winslow, R.L., Younes, L.: Large deformation diffeomorphic metric mapping of vector fields. IEEE Trans. Med. Imag. 24(9), 1216–1230 (2005)
11. Zhang, M., Fletcher, P.T.: Finite-dimensional Lie algebras for fast diffeomorphic image registration. In: International Conference on Information Processing in Medical Imaging, pp. 249–260. Springer (2015)
12. Zhang, M., Singh, N., Fletcher, P.T.: Bayesian estimation of regularization and atlas building in diffeomorphic image registration. In: International Conference on Information Processing in Medical Imaging, pp. 37–48. Springer (2013)
13. Fletcher, P.T., Lu, C., Pizer, S.M., Joshi, S.: Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Trans. Med. Imaging 23(8), 995–1005 (2004)
14. Zhang, M., Fletcher, P.T.: Bayesian principal geodesic analysis for estimating intrinsic diffeomorphic image variability. Med. Image Anal. 25(1), 37–44 (2015)
15. Zhang, Y., Xie, S., Davison, B.D.: Generalized geodesic sampling on Riemannian manifolds (2018). https://www.researchgate.net/publication/328943977 Generalized Geodesic Sampling on Riemannian Manifolds. Accessed 15 Nov 2018
16. Wilson, R.C., Hancock, E.R.: Spherical embedding and classification. In: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pp. 589–599. Springer (2010)
17. Loizou, C.P., Murray, V., Pattichis, M.S., Seimenis, I., Pantziaris, M., Pattichis, C.S.: Multiscale amplitude-modulation frequency-modulation (AM-FM) texture analysis of multiple sclerosis in brain MRI images. IEEE Trans. Inf. Technol. Biomed. 15(1), 119–129 (2011)

Copyright Protection and Content Authentication Based on Linear Cellular Automata Watermarking for 2D Vector Maps

Saleh AL-ardhi, Vijey Thayananthan, and Abdullah Basuhail

King Abdulaziz University, Jeddah, Saudi Arabia
[email protected]

Abstract. Copyright protection and content authentication are security problems affecting applications of geographical information systems (GIS) based on two-dimensional vector maps and constitute major obstacles limiting the use of such maps. By permitting original data recovery after watermark extraction, reversible watermarking can eliminate such obstacles. Cellular automata are parallel computation models that provide an effective approach, yielding intricate outcomes from a basic structure. The various existing types of cellular automata differ in how complex they are and how they behave, since the number of parameters that need to be configured is substantial. Unlike other multimedia forms, 2D vector maps are more difficult to watermark with cellular automata due to their distinct features. To overcome this difficulty, a new approach underpinned by a linear cellular automata (LCA) transform is suggested in this study, comprising the development of a new coordinate system consisting of relative coordinates to yield cover data, the embedding of an encrypted watermark key in the LSB of every relative coordinate, and the application of the LCA transform to conceal the location of the embedded encrypted watermark key in every relative coordinate. The results confirm that the approach can prevent prevalent security risks and provide complex, hidden and reversible computation, thus protecting 2D vector map integrity.

Keywords: Reversible watermarking · RST invariance attacks · Linear cellular automata · 2D vector map · Copyright protection · Content authentication

1 Introduction

The high accuracy exhibited by vector map data has made it an integral component of geographical information systems (GIS) that is compatible with various applications, such as electronic distance metres, unmanned aerial vehicles, global navigation systems such as GPS, and location-based services (LBS). However, expensive tools are required to collate, process and store the various existing data alternatives. Protecting the data from access by people to whom the map data owners have not granted permission is critical, given how costly it is to generate vector maps. Thus, in the context of GIS vector maps, copyright protection and tamper detection are most commonly achieved via watermarking methods. Watermarking technology is among the various approaches available for security tasks, such as secret communication, content authentication and copyright protection [1]. Figure 1 illustrates the three components of the watermark structure.

Several aspects must be taken into account to effectively create a watermark strategy, including robustness, fidelity, invisibility and blindness. Robustness is intended to make sure that data cannot be accessed or interfered with in any way by unauthorised individuals; there are two categories of robustness watermarking techniques, namely fragile watermarking, which ensures the integrity of data, and robust watermarking, which safeguards ownership rights during illegal access attempts. Fidelity refers to the fact that the general map quality should not be diminished by watermark application, while invisibility refers to the fact that data users must not be able to see the watermark after embedding, so that covert data is not disclosed. Last but not least, blindness refers to the fact that neither the original map nor the original watermark needs to be accessed when the watermark is extracted [2].

Fig. 1. Basic watermarking structure

Unlike multimedia data such as raster maps, which can employ integer numbers, 2D vector maps have distinctive characteristics and rely on real numbers (such as double-precision floating point with constant accuracy), as well as supplying crucial data about topology and geometry. However, watermark insertion is more challenging in the case of 2D vector maps than for video, audio and images because of minimal data redundancy. This difficulty affecting copyright protection and tamper detection can be surmounted with reversible watermarking [3–15], otherwise known as lossless data hiding, which is capable of fully recovering the original data after extraction whilst keeping the original carrier intact [16]. Furthermore, reversible algebraic or geometric functions, such as histogram shifting [17], difference expansion [18], space features [19] and lossless compression [20–22], are essential for reversible watermarking.

Every computable operation can be solved with the unique calculation model embodied in cellular automata (CA). Numerous applications, including pattern detection, random number generation and music in art, can have a CA component. In the digital image domain, CA implementation involves integration of encryption, compression [28, 29], authentication [30], scrambling [31, 32], image enhancement [33] and image watermarking [23–27]. CA underpins several multimedia watermarking techniques, which are used exclusively in multimedia. By contrast, direct implementation of the conventional CA technique is impossible owing to vector data discrepancies. The reversibility property is granted to vector data by the original watermarking strategy based on a singular CA case known as the linear cellular automata method, which is the technique that the present study suggests as a way of overcoming the issues associated with copyright protection and tamper detection in 2D vector maps.

The suggested methodology is advantageous for several reasons. First of all, it ensures that a watermark key can be embedded in every relative coordinate while preventing relative coordinates that are poor correlates from interfering too much. Secondly, it offers satisfactory invisibility, reversibility and computational complexity. Thirdly, it affords watermarks that are adequately concealed and position-embedded owing to LCA complexity. Fourthly, it effectively safeguards against geometric attacks that entail rotation, scaling and translation, whilst ensuring that the 2D vector map retains its original integrity.

The rest of the study is organised in the following way. The second part includes a review of the relevant literature. The third part focuses on LCA and data processing, while the fourth part presents the suggested approach to reversible watermarking. The fifth part is concerned with assessing and discussing the results of the experiments. Last but not least, the sixth part presents concluding remarks and suggestions for further research.

2 Literature Review

To the best of the researchers' knowledge, a description of the first reversible data-concealment strategy for 2D vector maps was provided by [34]. Watermark embedding involved modification of the integer discrete cosine transform (DCT) coefficients of the map coordinates in each group. However, marked distortion arose from the outcomes and the watermark capacity was inadequate. Difference expansion was the principle underpinning two reversible watermarking schemes that were proposed afterwards for 2D vector maps [35]. In this approach, watermarking data was integrated into the Manhattan distance or neighbouring coordinates, but extra space for storage of location maps was necessary. In the interim, the need for amendment of the difference expansion became apparent. Meanwhile, the difference histogram is the basis of the reversible data-concealing strategy put forth in [36], involving installation of all data in the difference of two neighbouring vertices. Although it affords a high rate of embedding and effective capacity regulation, this approach engenders marked distortion as well, prompting the creation of another strategy based on difference expansion for scalable vector graphics (SVG) [37]. This strategy enables better invisibility because it does not employ mark locations and lossless compression. The initial description of the reversible watermarking strategy based on the composite difference expansion integer transform was provided by [38]. In this strategy, subsection monotonicity of curve and polygon directions serves as the basis for the creation of the multi-dimensional vectors, which additionally rely on the composite difference expansion integer transform. Furthermore, unlike the strategy suggested in [35], this strategy provides optimal payload and transparency. A different study [39] later recommended an amended method, which helped to attenuate distortion and improve capacity by introducing the watermark not into pairs of neighbouring vertices but into the nearest pairs of vertex sequences in the x or y direction. A description of the reversible watermark strategy with the greatest capacity was offered in [40]. This strategy minimised distortion and improved capacity even more by implementing a watermark in the discrepancy between the initial value and the approximated value.

Several other studies proposed reversible watermarking strategies [41–44]. For instance, [41] explored lossless watermarking based on global features, with feature and non-feature points of each polyline being extracted through the Douglas-Peucker strategy. Furthermore, adequate vector designs with suitable polylines were achieved by generating watermark data based on relation modelling of the BP neural framework and singular value decomposition (SVD). A description of nonlinear scrambling-based reversible watermarking was provided in [42]. The assumption that feature point locations had to be consistent was what determined the nonlinear scrambling of the relative locations of the feature points, with embedding of the watermark in the scrambled feature points. Besides improving the method capacity, such an approach helps to deal with uncomplicated attacks as well. On the downside, the application of the approach is challenging and can be accompanied by significant distortions. Meanwhile, a perception-based reversible watermarking strategy was the focus of the investigation conducted in [43]. Cover data were based on vertex directions of areas without noise sensitivity, and watermark embedding involved selective modification of the alternating current (AC) coefficients following the integer DCT, as emphasised by a trigger point established beforehand. Although it provides optimal robustness and invisibility, this strategy is compatible solely with vector maps with adequate polylines and does not provide high-capacity watermarking. The author in [44] subsequently put forth recursive embedding reversible watermarking for 2D vector maps, with cover data taking the form of correlated information units of large dimensions and recursive alteration of vertex coordinates for the purpose of watermark implementation. Such an approach improves performance to some extent, but expanding iteration cycles leads to a decrease in reversibility. Meanwhile, an approach for data embedding based on iterative alteration of the mean coordinate value of every feature vertex set of high correlation was proposed in [45]. A capacity of nearly 0.667 bpv was achieved through a single performance of the embedding procedure. The author in [45] put forth a different strategy, whereby high capacity (around 2c bpv, with c equal to or greater than 1) and satisfactory invisibility were achieved by taking advantage of virtual coordinates. On the downside, the approach was susceptible to RST changes. A comparable approach is the one proposed by [46], which was unaffected by RST changes but afforded a capacity of around c bpv. Albeit with room for improvement, acceptable robustness is exhibited by both approaches, but neither of them addresses error rectification after the watermark is extracted. Meanwhile, reversible watermarking based on a BP neural network and reversible watermarking based on a domain transform with FFT have also been proposed [47]. They involve independent modification of the wavelet transform and Fast Fourier Transform (FFT) coefficients of vector maps for the purpose of implementing watermark bits.

Although watermarking strategies demonstrating reversibility and invisibility have been put forth, there has been no advancement in correcting errors after watermark extraction, and a notable improvement in robustness to resist hostile attacks is yet to be achieved. The author in [48] suggested a watermarking algorithm for 2D vector data on the basis of normalisation, whereby the space domain watermarking algorithms associated with 2D vector data were made more usable, invisible, blind and robust through implementation
Although watermarking strategies demonstrating reversibility and invisibility have been put forth, there has been no advancement in correcting errors after watermark extraction, and a notable improvement in robustness against hostile attacks is yet to be achieved. The author in [48] suggested a watermarking algorithm for 2D vector data on the basis of normalisation, whereby the space-domain watermarking algorithms associated with 2D vector data were made more usable, invisible, blind and robust through implementation of watermarks in the normalised map coordinates. However, the approach failed to address and rectify watermark bits that were extracted incorrectly. The present study suggests a general LCA-based watermarking strategy capable of resisting RST changes, controlling embedding distortion and offering greater capacity. Unlike other types of CA, the LCA is less complicated both in terms of its one-dimensional grid and its restricted varieties, as only a few possible rules exist and the permitted neighbourhood can consist of just the two direct neighbours, yet reversibility is maintained. LCA implementation led to the conclusion that better watermarking can be achieved with this basic CA than with alternative reversible watermarking methods that do not depend on CA concepts.

3 Linear Cellular Automata and Data Pre-processing

3.1 Linear Cellular Automata Transform

Stanislaw Ulam and John von Neumann introduced CA during the 1940s as formal simulations of organisms capable of self-replication [49]. Since then, attention has continued to be paid to CA because they are uncomplicated yet powerful: their responses can be highly complex regardless of how basic the underlying computational technique is. CA are usually made up of a fixed number of homogeneous cells that can each take on a certain number of modes and that are spatially distributed over at least one dimension. A set of rules, otherwise known as transition functions, updates the modes of all cells simultaneously at every stage; the succeeding mode adopted by a cell is determined by the rules from the mode of the cell itself and the modes of its adjacent cells. CA differ according to dimension, potential modes, neighbourhood correlation and rules. The fundamental type of one-dimensional CA is the LCA. Every cell can take a value of either 0 or 1, and the values of the closest neighbours are the only determinants of the rules. Consequently, a table of the possible ways in which every cell can combine with its two neighbours (i.e. eight potential modes) can serve to outline the manner in which an LCA evolves. Such a table is generally the basis of LCA referencing, as only 256 LCA exist and an 8-bit binary number can be employed to index all of them [50]. The rule number for reference to a specific LCA is determined with the technique in Fig. 2. As can be seen in the figure, the three binary numbers in every cell of the first row pertain to the cell and its two direct neighbours, and the XOR of these three modes defines the local transition function of rule 150. Hence, if the reading of a cell and its neighbours in the table is (000), it yields a 0 in the next generation.

Fig. 2. Rule number 150
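As a minimal illustration of how rule 150 evolves a configuration, the following Python sketch applies the XOR update to a one-dimensional binary array. The periodic boundary conditions and the array length are assumptions made here for brevity; they are not taken from the paper, which works with a specific finite matrix form introduced below.

```python
import numpy as np

def rule150_step(cells: np.ndarray) -> np.ndarray:
    """One synchronous update of an elementary (linear) CA under rule 150:
    the next state of each cell is the XOR of the cell and its two direct
    neighbours (periodic boundary assumed for brevity)."""
    return np.roll(cells, 1) ^ cells ^ np.roll(cells, -1)

# Example: a reading of (0, 0, 0) yields 0 and (1, 0, 0) yields 1,
# matching the XOR rule described above.
state = np.array([0, 0, 0, 1, 0, 0, 0, 0], dtype=np.uint8)
for generation in range(4):
    print(generation, state)
    state = rule150_step(state)
```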


CA are distinguished by the fact that the simplicity of the computational technique employed to produce them has no impact on the complexity of the responses they are capable of. Reversibility is considered essential in CA intended for different applications, since it concomitantly ensures modelling of complex behaviour and reversal to the original data, especially when digital watermarking methods are employed. Ample research has been dedicated to the property of reversibility. It can be resolved with transition polynomials based on rule number 150 [51, 52], while it can be described on the basis of a particular instance of the 150 basic CA by implementing circulant transition matrices with radius k and every coefficient of value 1 [53]. Furthermore, to obtain the explicit expression of the inverse CA for reversible LCA with k = 2, pentadiagonal transition matrices were used in [54] for the resolution of a particular case of LCA. In the present study, the focus is on the LCA A_n, defined over a cellular space of n cells with radius k = 2 and coefficients k_j = 1 for j between −2 and 2. Pentadiagonal transition matrices are employed in the present study to explore solutions [54]. If the current configuration C is multiplied by a fixed matrix, such as a pentadiagonal matrix, under modulo-2 addition through Eq. (1), the succeeding LCA configuration can be attained:

$(C^{t+1})^T = M_n \cdot (C^t)^T \pmod 2$   (1)

In the above, the pentadiagonal transition matrix and the transpose of the configuration (a column of binary values) are respectively denoted by $M_n$ and $(C^t)^T$. If n is considered to have a value of 5k, then $M_n$ is the local transition matrix of order n whose nonzero coefficients, all equal to 1, lie on the main diagonal and on the two diagonals on either side of it:

$M_n = \begin{pmatrix}
1 & 1 & 1 & 0 & \cdots & 0 & 0 \\
1 & 1 & 1 & 1 & \cdots & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & \cdots & 0 \\
\vdots & \ddots & \ddots & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & 1 & 1 & 1 & 1 & 1 \\
0 & 0 & \cdots & 1 & 1 & 1 & 1 \\
0 & 0 & \cdots & 0 & 1 & 1 & 1
\end{pmatrix}$
If the cellular automaton A_n has $M_n$ as its defining matrix, then $M_n$ constitutes a pentadiagonal matrix of order n with a value of 1 for all coefficients different from zero, as shown above. The equation below provides the definition of the inverse LCA:

$(C^t)^T = M_n^{-1} \cdot (C^{t+1})^T \pmod 2$   (2)

The inverse cellular automaton of A_n has a transition matrix $M_n^{-1}$, which is expressed modulo 2 in terms of two fixed 5 × 5 binary blocks, denoted $M_5^{-1}$ and B (see [54]). The following equation indicates when the transition matrix is invertible modulo 2, i.e. when the LCA is reversible:

$|M_n| \bmod 2 = \begin{cases} 1, & \text{if } n = 5k \text{ or } n = 5k + 1, \text{ with } k \in \mathbb{N} \\ 0, & \text{otherwise} \end{cases}$   (3)
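The reversibility condition in Eq. (3) can be checked numerically. The sketch below builds an n × n pentadiagonal binary band matrix (an assumed reading of $M_n$, with 1s wherever |i − j| ≤ 2) and computes its determinant over GF(2) by Gaussian elimination; the helper names are illustrative and not taken from the paper.

```python
import numpy as np

def pentadiagonal_matrix(n: int, radius: int = 2) -> np.ndarray:
    """n x n binary matrix with 1s on the main diagonal and on the
    `radius` diagonals on either side of it; all other entries are 0."""
    idx = np.arange(n)
    return (np.abs(idx[:, None] - idx[None, :]) <= radius).astype(np.uint8)

def det_mod2(m: np.ndarray) -> int:
    """Determinant of a binary matrix over GF(2) (1 means invertible),
    computed by Gaussian elimination with XOR row operations."""
    a = m.copy() % 2
    n = a.shape[0]
    for col in range(n):
        pivots = np.nonzero(a[col:, col])[0]
        if pivots.size == 0:
            return 0                       # singular over GF(2)
        pivot = col + pivots[0]
        a[[col, pivot]] = a[[pivot, col]]  # move a pivot row into place
        below = np.nonzero(a[col + 1:, col])[0] + col + 1
        a[below] ^= a[col]                 # clear the column below the pivot
    return 1

# Per Eq. (3), the result should be 1 exactly when n = 5k or n = 5k + 1:
for n in range(4, 13):
    print(n, det_mod2(pentadiagonal_matrix(n)))
```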

Since the reversible watermarking relies on reversibility, the host map elements are converted during the embedding Algorithm 4.2 on the basis of LCA reversibility, in keeping with the previous equations. The key principle is that the LCA can be employed to transform the host map coordinate $v'_{x_1}$ after the encrypted watermark bit has been embedded through $v'_{x_i} = v_{x_i}(1 + a w_i)$, $i = 1, \ldots, N_w$ (see Eq. 4):

$v''_{x_1} = M_n \cdot v'_{x_1} \pmod 2$   (4)

Equation (5) enables the inverse conversion of the watermarked coordinate $v''_{x_1}$:

$v'_{x_1} = M_n^{-1} \cdot v''_{x_1} \pmod 2$   (5)
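A hedged sketch of how Eqs. (4) and (5) might be exercised in code follows: the pentadiagonal form of $M_n$, the GF(2) Gauss-Jordan inverse and the random 25-bit pattern are all assumptions introduced for illustration, not an implementation taken from the paper.

```python
import numpy as np

def pentadiagonal_matrix(n: int, radius: int = 2) -> np.ndarray:
    idx = np.arange(n)
    return (np.abs(idx[:, None] - idx[None, :]) <= radius).astype(np.uint8)

def gf2_inverse(m: np.ndarray) -> np.ndarray:
    """Inverse of a binary matrix over GF(2) via Gauss-Jordan elimination
    on the augmented matrix [M | I] (assumes M is invertible)."""
    n = m.shape[0]
    aug = np.concatenate([m.copy() % 2, np.eye(n, dtype=np.uint8)], axis=1)
    for col in range(n):
        pivot = col + np.nonzero(aug[col:, col])[0][0]
        aug[[col, pivot]] = aug[[pivot, col]]
        rows = np.nonzero(aug[:, col])[0]
        rows = rows[rows != col]
        aug[rows] ^= aug[col]              # clear the column above and below
    return aug[:, n:]

n = 25                                     # matrix size used in the paper's example
Mn = pentadiagonal_matrix(n)               # invertible, since 25 = 5k (Eq. 3)
Mn_inv = gf2_inverse(Mn)

rng = np.random.default_rng(0)
v_prime = rng.integers(0, 2, size=n, dtype=np.uint8)  # watermarked bit pattern v'
v_double = Mn.dot(v_prime) % 2                        # Eq. (4): v'' = Mn . v' (mod 2)
recovered = Mn_inv.dot(v_double) % 2                  # Eq. (5): v'  = Mn^-1 . v''
assert np.array_equal(recovered, v_prime)
```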

3.2 Pre-processing of Data for 2D Vector Maps

The watermark will be susceptible to rotation and scaling if its embedding is undertaken through direct alteration of the vertices. There are several entities incorporated in 2D vector maps and their coordinates are their key components. Pre-processing of data is conducted in keeping with the original data of the 2D vector in order to make sure that all the characteristics of the 2D vector data and LCA transforms are used.

3.3 Computation of Relative Coordinates

To improve robustness against geometrical attacks, relative coordinates are assembled as cover data, since translation, rotation and scaling are frequently applied to 2D vector maps. The following part explains how the relative coordinates are determined. First step: every vertex of a set $V = \{v_1, v_2, \ldots, v_i, \ldots, v_n\}$ is acquired with a secret key k by scanning the 2D vector map M. Two reference vertices $V_{f1}(x_{f1}, y_{f1})$ and $V_{f2}(x_{f2}, y_{f2})$ are then chosen, with particular importance attributed to the two end-points of a line segment parallel to the x or y axis. Second step: the segment S is determined in keeping with the Euclidean distance $|V_{f1} - V_{f2}|$ between $V_{f1}$ and $V_{f2}$ and the vector map precision tolerance s (see Fig. 3). Equation (6) permits calculation of the segment:

Fig. 3. The principle of vertex difference



$d = \dfrac{|V_{f1} - V_{f2}|}{2s}, \qquad S = \dfrac{|V_{f1} - V_{f2}|}{d}$   (6)

In the above, S and d respectively denote the segment and the Euclidean distance between $V_{f1}$ and $V_{f2}$ divided by 2s. Third step: the origin of the new system of coordinates is indicated by $V_{f1}$ and $V_{f2}$. The new x-axis is established to be the straight line passing through $V_{f1}$ and $V_{f2}$, in keeping with the Euclidean distance $|v_1 - V_{f1}|$ between $v_1$ and $V_{f1}$. Equation (7) enables calculation of the unit values $nv_i$ on the new x-axis and y-axis:

$nv_i = \dfrac{|v_1 - V_{f1}|}{S}$   (7)

Fourth step: Eq. (8) helps to determine the coordinate of any vertex $v_i(x_i, y_i)$ in V (apart from $V_{f1}$ and $V_{f2}$) in the new system of coordinates:

$v'_i = nv_i + v_i$   (8)

The new system of coordinates is created by adding the $nv_i$ value to every vertex. Subsequently, a set of relative vertices $V' = \{v'_1, v'_2, \ldots, v'_i, \ldots, v'_n\}$ (apart from $V_{f1}$ and $V_{f2}$) is derived, with every relative vertex lying between $V_{f1}$ and $V_{f2}$.


Equation (9) gives the reverse transform back to the initial system of coordinates:

$v_i = v'_i - nv_i$   (9)

The parameter d and the $nv_i$ values produced here serve as input parameters for the extraction Algorithm 4.3. Fifth step: after implementation of the embedding Algorithm 4.2, the value of every relative vertex $v'_i(x_i, y_i)$ in V' (apart from $V_{f1}$ and $V_{f2}$) must not exceed the value of the segment S, which has to be equal to or lower than 2s. This constitutes the new coordinate obtained for the watermarked vector map M'.
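The relative-coordinate steps (Eqs. 6-9) can be summarised in a small sketch. Whether $nv_i$ is computed once from $v_1$ or per vertex is ambiguous in the text above; the sketch computes it per vertex, and the function names are illustrative.

```python
def to_relative(vertices, vf1, vf2, tol):
    """Sketch of Eqs. (6)-(8): map each scalar x (or y) coordinate to the
    relative system defined by the reference vertices vf1 and vf2;
    `tol` is the map precision tolerance s."""
    d = abs(vf1 - vf2) / (2 * tol)               # Eq. (6), first part
    S = abs(vf1 - vf2) / d                       # Eq. (6), second part
    nv = [abs(v - vf1) / S for v in vertices]    # Eq. (7)
    rel = [n + v for n, v in zip(nv, vertices)]  # Eq. (8)
    return rel, nv, d, S

def from_relative(rel, nv):
    """Eq. (9): reverse transform back to the original coordinates."""
    return [r - n for r, n in zip(rel, nv)]

# Values taken from the exemplification in Sect. 3.5 (x coordinates only):
vf1, vf2, tol = 60.8887965376461258, 60.9059632307777147, 0.5
rel, nv, d, S = to_relative([60.8887960492206872], vf1, vf2, tol)
print(rel[0], from_relative(rel, nv)[0])
```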

3.4 Binary Transform for Cover Data

The coordinates must be subjected to pre-processing, as the input required by the LCA transform has to take the form of binary data. The binary transform is defined below using the example of the $v_{x_i}$ coordinates and an embedding position P, representing the number of digits after the decimal point. First step: all x coordinates are acquired to produce a set $v_x = \{v_{x_1}, v_{x_2}, \ldots, v_{x_i}, \ldots, v_{x_n}\}$ (with n the overall number of vertices in M) by scanning the vertices of the 2D vector map M (apart from $V_{f1}$ and $V_{f2}$). Second step: each coordinate $v_{x_i}$ is transformed from double-precision floating point (IEEE 754) format to binary format. Figure 4 illustrates the extraction of the binaries.

Fig. 4. The process of binary transform for cover data
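A minimal sketch of this binary transform step, using Python's struct module to move between a double-precision coordinate and its 64-bit IEEE 754 bit string; the embedding position P described above is not modelled here.

```python
import struct

def double_to_bits(x: float) -> str:
    """64-bit IEEE 754 representation of a double, as a 0/1 string."""
    return format(struct.unpack('>Q', struct.pack('>d', x))[0], '064b')

def bits_to_double(bits: str) -> float:
    """Inverse transform: 64-character bit string back to a double."""
    return struct.unpack('>d', struct.pack('>Q', int(bits, 2)))[0]

x = 60.8887965376461258
bits = double_to_bits(x)
print(bits)                       # 64-character bit string, as in Fig. 4
print(bits_to_double(bits) == x)  # round-trips exactly
```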

3.5 The Degree of Watermarking

The vertex difference principle enables comparison of the suggested strategy under different sizes of the $M_n$ matrix. This principle is applied with the purpose of establishing how the initial vertices differ from the watermarked ones. The embedding process, Algorithm 4.2, should yield neighbouring vertices of suitable quality, with minimal distortion; at the same time, the distortion introduced by the watermarking algorithm must stay within the precision tolerance interval, i.e. be equal to or lower than 2s, by keeping each transferred $v_i$ within $(v_i - s, v_i + s)$, whilst achieving high diffusion and confusion. If these conditions are met, the degree of the watermarking algorithm can be obtained in the following way: (1) determination of the discrepancy between the initial and watermarked versions of each vertex in the map, and the inference between neighbouring vertices, apart from the reference vertices $v_{f1}$ and $v_{f2}$; (2) determination of the average vertex discrepancy between the initial and watermarked versions for the entire map, apart from the reference vertices $v_{f1}$ and $v_{f2}$; (3) use of the average vertex discrepancy between the initial and watermarked map to determine the watermarking degree. If the initial vertex value is denoted by $v_i$ and the coordinates after implementation of the embedding Algorithm 4.2, transferred within $(v_i - s, v_i + s)$ (see Fig. 5), are denoted by $v''_i$, then the vertex discrepancy for every vertex in the map, i.e. the discrepancy between the initial and watermarked vertices, can be determined via the following equation:

Fig. 5. Schematic representation of the algorithm of LCA transform

$VD(i) = \left| v_i - v''_i \right| \le 2s$   (10)

As demonstrated in Eqs. (11) and (12), once the vertex discrepancy related to every vertex in the map (apart from the two reference vertices $v_{f1}$ and $v_{f2}$) is determined, it is possible to compute the average vertex discrepancy of the entire map through summation of the vertex discrepancies and division by the number of vertices (apart from the two reference vertices $v_{f1}$ and $v_{f2}$):

$M(VD) = \dfrac{\sum_{i=1}^{N_V} VD(v_i)}{N - 2}$   (11)

$M'(VD) = \dfrac{\sum_{i=1}^{N_V} VD(v''_i)}{N - 2}$   (12)

To identify the watermarking degree, $M(VD)$ and $M'(VD)$ can be respectively considered to be the average vertex discrepancy for the initial map and the average vertex discrepancy for the watermarked map; subsequently, Eq. (13) can be applied to calculate the nth degree of watermarking for different sizes of the $M_n$ matrix:

$WD(VD) = \dfrac{M'(VD) - M(VD)}{M'(VD) + M(VD)}$   (13)
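Assuming the per-vertex discrepancies of Eq. (10) are already available as arrays, Eqs. (11)-(13) reduce to a few lines; the input values below are hypothetical and purely illustrative.

```python
import numpy as np

def average_vertex_discrepancy(vd, n_vertices):
    """Eqs. (11)/(12): average of the per-vertex discrepancies VD over the
    map, excluding the two reference vertices (hence N - 2)."""
    return np.sum(vd) / (n_vertices - 2)

def watermarking_degree(m_vd_original, m_vd_marked):
    """Eq. (13): nth degree of watermarking for a given Mn size."""
    return abs(m_vd_marked - m_vd_original) / abs(m_vd_marked + m_vd_original)

# Hypothetical per-vertex discrepancies (Eq. 10: |v_i - v_i''|) for two maps:
vd_original = np.array([1.1e-7, 0.9e-7, 1.3e-7])
vd_marked = np.array([1.2e-7, 1.0e-7, 1.2e-7])
n = 5  # three usable vertices plus the two reference vertices
wd = watermarking_degree(average_vertex_discrepancy(vd_original, n),
                         average_vertex_discrepancy(vd_marked, n))
print(wd)
```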

Exemplification: in order to add a bit, with $v_{f1}(x) = 60.8887965376461258$, $v_{f2}(x) = 60.9059632307777147$, $v_1(x) = 60.8887960492206872$, m = 60.8887977607941215, w = 1, $M_n$ size = 25, s = 0.5 and T = 1, the different parameters can be calculated as follows. By implementing the embedding Algorithm 4.2, the output $v''_{x_1} = 60.8887958732368730$ is obtained, while the original and retrieved coordinates have a discrepancy of 7.35 × 10−8, which is lower than 10−7 m. Steps:

1. $d = \dfrac{|v_{f1} - v_{f2}|}{2s} = \dfrac{|60.8677565376461258 - 60.9059632307777147|}{2 \times 0.5}$

2. $S = \dfrac{|V_{f1} - V_{f2}|}{d} = \dfrac{|60.8677565376461258 - 60.9059632307777147|}{|60.8677565376461258 - 60.9059632307777147| / (2 \times 0.5)}$

3. $nv_1(x) = \dfrac{|v_1(x) - V_{f1}(x)|}{S} = \dfrac{|60.8887960492206872 - 60.8887965376461258|}{1}$

4. $v'_1(x) = nv_1(x) + v_1(x) = 0.0000004884254386 + 60.8887960492206872 = 60.8887965376461258$

5. Convert $v'_1(x)$ to binary format: $v'_1(x)$ = 0100000001001110011100011100010000010001101001100001101111000010

6. Embed the encrypted watermark key into the LSB (if $v_i = 1$ then embed 0): 0100000001001110011100011100010000010000110010001100100001011110

7. Transform by LCA, size of the $M_n$ matrix = 25: 0100000001001110011100011100010000010000001011000010111110100010

8. Convert $v''_{x_1}$ to decimal format: $v''_{x_1} = 60.8887958732368730$

9. $VD(v''_i)$: $(v_x - s) \le v''_i \le (v_x + s)$, i.e. $60.3887955607952486 \le 60.8887958732368730 \le 61.3887960492206872$

10. $WD(VD) = \dfrac{|M'(VD) - M(VD)|}{|M'(VD) + M(VD)|} = 0.0000000000001034$

4 The Suggested Strategy of Reversible Watermarking

The procedures involved in the watermarking process are production of the watermark, embedding of the watermark in the host vector map in the least-significant-digit (LSD) planes of every relative vertex, and retrieval of the watermark from the LSD planes of every relative vertex.

4.1 Production of the Watermark

(1) Based on the premise that a binary bit sequence with n elements makes up the initial watermark data $L = \{l_1, l_2, \ldots, l_n\}$, the procedure for generating the watermark involves: production of a chaotic sequence $\{x_1, x_2, \ldots, x_n\}$ with initial value $x_0$ by a logistic map, acquisition of a binary bit sequence B through binarization, and application of an XOR operation to L and B to generate the watermark sequence $W^*$ (see Eq. 14):

$W^* = L \oplus B$   (14)

(2) The next step is conversion of the encrypted binary sequence $W^*$ into a ternary sequence $W^{**} = \{w_1, w_2, \ldots, w_i, \ldots, w_n\}$, with $w_i \in \{0, 1, 2\}$. The final watermark intended for insertion in the 2D vector map is denoted by $W^{**}$.
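A sketch of the watermark-production step follows: a logistic-map sequence is binarised and XORed with the payload as in Eq. (14). The seed x0, the parameter r and the 0.5 binarisation threshold are illustrative assumptions, and the ternary conversion of step (2) is not modelled.

```python
import numpy as np

def generate_watermark(payload_bits, x0=0.37, r=3.99):
    """Sketch of Sect. 4.1: a logistic-map chaotic sequence is binarised
    and XORed with the payload L to give the encrypted watermark W*.
    x0 and r are illustrative values, not taken from the paper."""
    n = len(payload_bits)
    x = x0
    chaos = []
    for _ in range(n):
        x = r * x * (1.0 - x)                 # logistic map iteration
        chaos.append(1 if x >= 0.5 else 0)    # binarisation by thresholding
    b = np.array(chaos, dtype=np.uint8)
    l = np.array(payload_bits, dtype=np.uint8)
    return l ^ b                              # Eq. (14): W* = L XOR B

w_star = generate_watermark([1, 0, 1, 1, 0, 0, 1, 0])
print(w_star)
```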

4.2 Embedding Algorithm

Employing the map file as input, the Embedding Algorithm inserts the watermark bit in the LSD planes in every relative vertex and yields an output in the form of a watermarked map through performance of the transform by the LCA. The transform indices are represented by the size of the Mn matrix (see Fig. 6a). The application of the algorithm consists of seven steps:

Fig. 6. Representation of watermark embedding (a) and watermark extraction (b) for 2D vector map


(1) Scanning of the entire map file M and counting of its vertices, of number N;
(2) Selection of two reference vertices $v_{f1}$ and $v_{f2}$ ($1 \le v_{f1}, v_{f2} \le n$) for the 2D vector map M, which should be controlled by the private key k for security purposes; $nv_i$ denotes the Euclidean distance between the two vertices and serves as an input parameter in the process of retrieving the watermark; the methodology from Sect. 3.3 is applied to obtain a relative vertex set $V' = \{v'_1, v'_2, \ldots, v'_i, \ldots, v'_n\}$ of size N − 2, with N denoting the overall count of vertices in M;
(3) Application of the method from Sect. 4.1 for the purpose of encryption, resulting in the encrypted data sequence $W^* = \{w_i \mid w_i \in \{0, 1\},\ i = 0, 1, \ldots, l - 1\}$;
(4) Embedding of a given watermark bit into the mantissa part of the i-th element of vertex $v_i$:

$w_i = \begin{cases} 1, & \text{if the last bit is } 0 \text{ (it is replaced with } 1\text{)} \\ 0, & \text{otherwise} \end{cases}$   (15)

(5) Embedding of the encrypted watermark sequence $W^*$ into the LSB planes of every relative vertex through $v'_{x_i} = v_{x_i}(1 + a w_i)$, $i = 1, \ldots, N_w$, with $v_{x_i}$ and $w_i$ respectively denoting the relative vertex data and the watermark data, a and $v'_{x_i}$ respectively denoting the embedding strength and the watermarked vertex after the encrypted watermark bit is added, and $v''_{x_i}$ denoting the watermarked vertex data after the LCA-based transform is performed;
(6) Production of the watermarked cover map based on the LCA transform: $v''_{x_1} = M_n \cdot v'_{x_1} \pmod 2$;
(7) Iteration of the fifth and sixth steps K times if high capacity is needed.
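The following sketch compresses steps (4) and (6) for a single relative coordinate: the watermark bit is written into the least significant mantissa bit and the low-order bits are then scrambled with the LCA transform. Restricting the 25 × 25 matrix to the 25 least significant bits, and omitting the embedding-strength formula of step (5), are assumptions made here for illustration only.

```python
import struct
import numpy as np

def double_to_bits(x):
    """64-bit IEEE 754 pattern of a double as a numpy array of 0/1 values."""
    packed = struct.unpack('>Q', struct.pack('>d', x))[0]
    return np.array([int(c) for c in format(packed, '064b')], dtype=np.uint8)

def bits_to_double(bits):
    """Inverse of double_to_bits."""
    word = int(''.join(str(int(b)) for b in bits), 2)
    return struct.unpack('>d', struct.pack('>Q', word))[0]

def pentadiagonal(n, radius=2):
    idx = np.arange(n)
    return (np.abs(idx[:, None] - idx[None, :]) <= radius).astype(np.uint8)

def embed_bit(rel_x, w_bit, Mn):
    """Write the watermark bit into the LSB of the relative coordinate,
    then scramble the k = Mn.shape[0] least significant bits with the
    LCA transform (Eq. 4)."""
    bits = double_to_bits(rel_x)
    bits[-1] = w_bit                        # step (4): LSB carries the bit
    k = Mn.shape[0]
    bits[-k:] = Mn.dot(bits[-k:]) % 2       # step (6): v'' = Mn . v' (mod 2)
    return bits_to_double(bits)

Mn = pentadiagonal(25)                      # invertible: 25 = 5k (Eq. 3)
marked = embed_bit(60.8887965376461258, 1, Mn)
print(marked)
```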

4.3 Extraction Algorithm

The extraction algorithm depends only on the watermarked vector map and the size of the LCA matrix ($M_n$); it does not employ the original vector map or any of its characteristics. The extraction algorithm also requires the fraction mentioned in the embedding Algorithm 4.2 in circumstances where not all watermark bits were embedded in the host map (Fig. 6b). Reconstruction of the lists of indices L and B can be achieved through application of an XOR function to the encrypted binary sequence $W^*$ contained in the watermark key. Watermark extraction is the embedding process in reverse, with retrieval instead of insertion (Fig. 6b). The reference vertices $v_{f1}$ and $v_{f2}$ are given by the secret key K, while the relative vertices are computed from the $nv_i$ values and the size of the LCA matrix ($M_n$). The suggested watermarking strategy is made more robust by the fact that, without the private key k, the locations of the initial reference vertices remain unknown to an attacker.


5 Results and Analysis

As discussed in the earlier parts, linear cellular automata (LCA) are the basis of the new reversible watermarking strategy put forth by the present study. A discussion of the empirical data and algorithm performance is presented in the current part to show that the proposed strategy is effective.

5.1 Results of Experiments

A personal computer (CPU 2.3 GHz, RAM 16 GB, Windows 10 Professional, QGIS version 3.0, Python language) was employed to conduct the experiments, which used 50 distinct 2D vector maps serving as covers, in the shapefile format of the Environmental Systems Research Institute, Inc. (ESRI) [55]. Four of the vector maps represented a spot height map of Taylor Rookery [56], a coastline map of Taylor Rookery [57], a road map [56] and a Windmill Islands map [58] (see Fig. 7). General features of these maps, including feature type, number of features/vertices, scale and precision tolerance s, are indicated in Table 1. In the data-concealment procedures, the number of secret bits carried by every relative coordinate was a = 1, the $M_n$ matrix size was 25, and the repetitive-embedding count was T = 1.

Table 1. The initial vector maps and their features

Vector maps                        | Feature type | Features/vertices | Scale   | s (m)
Spot height map of Taylor Rookery  | Point        | 355/355           | 1:5000  | 0.5
Coastline map of Taylor Rookery    | Polyline     | 18/4279           | 1:5000  | 0.5
Contours map of Taylor Rookery     | Polyline     | 286/48230         | 1:25000 | 2.5
Windmill hypsometric map           | Polygon      | 5757/1163689      | 1:50000 | 5

The invisibility of the suggested watermarking strategy is proven by the first test case. The technique from Sect. 4.2 was applied for watermarking the vector maps in Fig. 7, yielding watermarked versions (see Fig. 8) that indicate satisfactory perceived quality. Equations (16) and (17) were applied to determine the average distortion $d(M, M')$ and the maximum distortion $Maxd(M, M')$ to assess the quality of the embedded vector maps:

$d(M, M') = \dfrac{1}{N_V} \sum_{i=0}^{N_V} |v_i - v''_i|$   (16)

$Maxd(M, M') = \max |v_i - v''_i|, \quad i = 1, 2, \ldots, N_V$   (17)

In the above, the vertices in the original vector map M are denoted by $v_i$, the vertices in the retrieved vector map M' are denoted by $v''_i$, and the overall count of vertices in the vector map M is given by $N_V$. The method of watermark embedding from Sect. 4 was applied for embedding of the vector maps, and the corresponding methods of watermark extraction and data retrieval were subsequently employed for recovery. The watermarked versions of the maps from Fig. 7 are illustrated in Fig. 8, which also indicates that the perceived quality is acceptable.

Fig. 7. The 2D vector maps used in the experiment: (a) spot height map of Taylor Rookery, (b) coastline map of Taylor Rookery, (c) contours map of Taylor Rookery and (d) Windmill hypsometric map.

The results of the experiments are indicated in Table 2. The Maxd values and the $d(M, M')$ values of the vector maps do not exceed 10−6 m, while the 2D vector map coordinates have a storage accuracy of 0.1 mm, signifying that the initial and watermarked coordinates do not differ significantly and that the watermark strategy demonstrates reversibility. Hence, the accuracy requirements of most applications can be met.


Fig. 8. The watermarked versions of the original 2D vector maps

Table 2. The Maxd and d values of the initial and retrieved vector maps

Vector maps                        | Maxd (m)      | d (m)
Spot height map of Taylor Rookery  | 1.1317 × 10−7 | 7.2569 × 10−8
Coastline map of Taylor Rookery    | 3.1462 × 10−7 | 1.0306 × 10−7
Contours map of Taylor Rookery     | 6.5095 × 10−7 | 2.8731 × 10−7
Windmill hypsometric map           | 8.2319 × 10−7 | 4.7245 × 10−7

5.2 Robustness Assessment

The suggested watermarking is satisfactorily robust against translation and rotation because it is achieved through alteration of the relative coordinates and is therefore unaffected by those two processes. Furthermore, the Euclidean distance between the two reference vertices is recorded when the new system of coordinates is assembled, which means that it is possible to restore the watermarked 2D vector map according to the initial distance between the reference vertices in case it is scaled. Hence, the proposed strategy demonstrates robustness against scaling as well, and this robustness is measured by the normalised correlation (NC), calculated as follows:


$NC = \dfrac{\sum_{i=0}^{M_w} w_i \, w'_i}{\sqrt{\sum_{i=0}^{M_w} (w_i)^2} \; \sqrt{\sum_{i=0}^{M_w} (w'_i)^2}}$   (18)
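Eq. (18) is a standard normalised correlation and can be computed directly; the bit sequences below are hypothetical.

```python
import numpy as np

def normalised_correlation(w, w_extracted):
    """Eq. (18): NC between the embedded and the extracted watermark bits."""
    w = np.asarray(w, dtype=float)
    we = np.asarray(w_extracted, dtype=float)
    return np.sum(w * we) / (np.sqrt(np.sum(w ** 2)) * np.sqrt(np.sum(we ** 2)))

print(normalised_correlation([1, 0, 1, 1, 0, 1], [1, 0, 1, 1, 0, 1]))  # 1.0
```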

The NC values derived from translation with different translation vectors on the watermarked 2D vector maps are presented in Table 3. The suggested reversible watermarking strategy is acceptably robust against translation, rotation and scaling, as all the NC values between the initial and retrieved watermark are 1.0.

Table 3. An overview of the results obtained

Vector maps                        | Scaling (λ = 0.5) | Rotation (θ = 60°) | Translation (−1.2 m, −2.3 m) | RST combination
Spot height map of Taylor Rookery  | 1.0               | 1.0                | 1.0                          | 1.0
Coastline map of Taylor Rookery    | 1.0               | 1.0                | 1.0                          | 1.0
Contours map of Taylor Rookery     | 1.0               | 1.0                | 1.0                          | 1.0
Windmill hypsometric map           | 1.0               | 1.0                | 1.0                          | 1.0

5.3 Assessment of the Capability for Content Authentication

In the context of assessing the integrity protection of 2D vector maps, attacks can be detected and localised for two types of interference, namely vertex/feature modification and vertex/feature addition/removal:
1. Vertex/feature alteration: the suggested strategy embeds the watermark in every vertex coordinate, and therefore certain coordinates will be modified by changes to the 2D vector map. Because such changes alter the original coordinates, the LCA-based reversible transforms can no longer be performed precisely. Hence there is a difference between the extracted watermark and the initial version, and any alteration is detectable.
2. Vertex/feature addition/removal: the length of the watermark is extracted in cases where vertices are added to or removed from the watermarked vector map by an attacker. Interference is signalled by the discrepancy between that length and the length of the embedded watermark.
Thus, it is demonstrated that the suggested strategy enables precise identification of interference in the form of vertex/feature alteration or addition/removal.


6 Conclusion

The present study draws upon the notion of linear cellular automata to devise an approach that adequately represents real-number data with minimal correlation and without distortion, and that does not require the initial map in order to extract the watermark. The results of the experiments and the conducted assessments support the conclusion that the suggested watermarking is acceptably reversible, invisible and robust against translation, rotation and scaling. In addition, it is highly precise in identifying and localising unauthorised interference taking the form of vertex/feature alteration and addition/removal.

References 1. Lopez, C.: Watermarking of digital geospatial datasets: a review of technical, legal and copyright issues. Int. J. Geogr. Inf. Sci. 16(6), 589–607 (2002) 2. Niu, X., Shao, C., Wang, X.: A survey of digital vector map watermarking. Int. J. Innov. Comput. Inf. Control. 2(6), 1301–1316 (2006) 3. Abubahia, A., Cocea, M.: A clustering approach for protecting GIS vector data. In: International Conference on Advanced Information Systems Engineering, pp. 133–147 (2015) 4. Peng, Z., et al.: Blind watermarking scheme for polylines in vector geo-spatial data. Multimed. Tools Appl. 74(24), 11721–11739 (2015) 5. Lafaye, J., et al.: Blind and squaring-resistant watermarking of vectorial building layers. Geoinformatica 16(2), 245–279 (2012) 6. Lee, S., Kwon, K.: Vector watermarking scheme for GIS vector map management. Multimed. Tools Appl. 63(3), 757–790 (2013) 7. Muttoo, S.K., Kumar, V.: Watermarking digital vector map using graph theoretic approach. Ann. GIS 18(2), 135–146 (2012) 8. Neyman, S.N., Wijaya, Y.H., Sitohang, B.: A new scheme to hide the data integrity marker on vector maps using a feature-based fragile watermarking algorithm. In: 2014 International Conference Data and Software Engineering (ICODSE), pp. 1–6 (2014) 9. Shao, C., Wang, X., Xu, X.: Security issues of vector maps and a reversible authentication scheme. In: Doctoral Forum of China, pp. 326–331 (2005) 10. Wang, N.: Reversible fragile watermarking for locating tampered Polylines/Polygons in 2D vector maps. Int. J. Digit. Crime Forensics (IJDCF) 8(1), 1–25 (2016) 11. Wang, N., Men, C.: Reversible fragile watermarking for 2-D vector map authentication with localization. Comput. Aided Des. 44(4), 320–330 (2012) 12. Wang, N., Men, C.: Reversible fragile watermarking for locating tampered blocks in 2D vector maps. Multimed. Tools Appl. 67(3), 709–739 (2013) 13. Yue, M., Peng, Z., Peng, Y.: A fragile watermarking scheme for modification type characterization in 2D vector maps. In: Asia-Pacific Web Conference, pp. 129–140 (2014) 14. Zheng, L., You, F.: A fragile digital watermark used to verify the integrity of vector map. In: International Conference on E-Business and Information System Security, EBISS 2009, pp. 1–4 (2009) 15. Xia, Z., et al.: Steganalysis of LSB matching using differences between nonadjacent pixels. Multimed. Tools Appl. 75(4), 1947–1962 (2016)


16. Shao, C., et al.: Study on lossless data hiding algorithm for digital vector maps. J. Image Graph. 12(2), 206–211 (2007) 17. Tian, J.: Reversible watermarking by difference expansion. In: Proceedings of Workshop on Multimedia and Security (2002) 18. Celik, M.U., et al.: Lossless generalized-LSB data embedding. IEEE Trans. Image Process. 14(2), 253–266 (2005) 19. Cao, L., Men, C., Sun, J.: Space feature-based reversible watermarking theory for 2D-vector maps. Acta Geodaetica Cartogr. Sin. 39(4), 422–427 (2010) 20. Ni, Z., et al.: Reversible data hiding. IEEE Trans. Circuits Syst. Video Technol. 16(3), 354– 362 (2006) 21. Lin, S., et al.: Improving histogram-based reversible information hiding by an optimal weight-based prediction scheme. J. Inf. Hiding Multimed. Signal Process. 4(1), 19–33 (2013) 22. Weng, S., Pan, J., Gao, X.: Reversible watermark combining pre-processing operation and histogram shifting. J. Inf. Hiding Multimed. Signal Process. 3(4), 320–326 (2012) 23. Shiba, R., Kang, S., Aoki, Y.: An image watermarking technique using cellular automata transform. In TENCON 2004 IEEE Region 10 Conference, pp. 303–306 (2004) 24. Piao, Y., Kim, S.: Robust and secure inim-based 3D watermarking scheme using cellular automata transform. J. Korea Inst. Inf. Commun. Eng. 13(9), 1767–1778 (2009) 25. Li, X., et al.: Watermarking based on complemented MLCA and 2D CAT. J. Inf. Commun. Converg. Eng. 9(2), 212–216 (2011) 26. Li, X.W., Yun, J.S., Cho, S.J., Kim, S.T.: Watermarking using low and high bands based on CAT. In: IEEE ICCSIT (2011) 27. Das, T.S., Mankar, V.H., Sarkar, S.K.: Cellular automata based robust spread spectrum image watermarking. In: Indian Conference on Intelligent Systems ICIS 2007, 19–20 January 2007 (2007) 28. Lafe, O.E.: Method and apparatus for data encryption/decryption using Cellular Automata Transform (1997) 29. Lafe, O.: Data compression and encryption using cellular automata transforms. Eng. Appl. Artif. Intell. 10(6), 581–591 (1997) 30. Hwang, Y., Cho, S., Choi, U.: One Dimensional Cellular Automata based security scheme providing both authentication and confidentiality. J. Korea Inst. Inf. Commun. Eng. 14(7), 1597–1602 (2010) 31. Dalhoum, A., et al.: Digital image scrambling using 2D cellular automata. IEEE Multimed. 19, 28–36 (2012) 32. Madain, A., et al.: Audio scrambling technique based on cellular automata. Multimed. Tools Appl. 71(3), 1803–1822 (2014) 33. Rosin, P.L.: Training cellular automata for image processing. IEEE Trans. Image Process. 15 (7), 2076–2087 (2006) 34. Voigt, M., Yang, B., Busch, C.: Reversible watermarking of 2D-vector data. In: Proceedings of the 2004 Workshop on Multimedia and Security, pp. 160–165 (2004) 35. Wang, X., et al.: Reversible data-hiding scheme for 2-D vector maps based on difference expansion. IEEE Trans. Inf. Forensics Secur. 2(3), 311–320 (2007) 36. Zhou, L., Hu, Y., Zeng, H.: Reversible data hiding algorithm for vector digital maps. J. Comput. Appl. 29(4), 990–993 (2009) 37. Wu, D., Wang, G., Gao, X.: Reversible watermarking of SVG graphics. In: WRI International Conference on Communications and Mobile Computing, CMC 2009, pp. 385– 390 (2009)


38. Zhong, S., Liu, Z., Chen, Q.: Reversible watermarking algorithm for vector maps using the difference expansion method of a composite integer transform. J. Comput. Aided Des. Comput. Graph 21(12), 1840–1849 (2009) 39. Hua, Z., Shoujian, D., Daozhen, Z.: A reversible watermarking scheme for 2D vector drawings based on difference expansion. In: 2010 IEEE 11th International Conference on Computer-Aided Industrial Design & Conceptual Design (CAIDCD), pp. 1441–1446 (2010) 40. Chen, G., et al.: Reversible watermark algorithm for large-capacity vector map. Jisuanji Gongcheng/Comput. Eng. 36(21) (2010) 41. Men, C., et al.: Global characteristic-based lossless watermarking for 2D-vector maps. In: 2010 International Conference on Mechatronics and Automation (ICMA), pp. 276–281 (2010) 42. Men, C., Cao, L., Li, X.: Perception-based reversible watermarking for 2D vector maps. In: Visual Communications and Image Processing, pp. 77-44-34 (2010) 43. Cao, L., Men, C., Ji, R.: Nonlinear scrambling-based reversible watermarking for 2D-vector maps. Vis. Comput. 29(3), 231–237 (2013) 44. Cao, L., Men, C., Gao, Y.: A recursive embedding algorithm towards lossless 2D vector map watermarking. Digit. Signal Process. 23(3), 912–918 (2013) 46. Wang, N., Zhang, H., Men, C.: A high capacity reversible data hiding method for 2D vector maps based on virtual coordinates. Comput. Aided Des. 47, 108–117 (2014) 47. Wang, N., Zhao, X., Xie, C.: RST invariant reversible watermarking for 2D vector map. Int. J. Multimed. Ubiquit. Eng. 11(265), 276 (2016) 48. Sun, J., et al.: A reversible digital watermarking algorithm for vector maps. Coordinates 3(9), 16–18 (2014) 49. Wang, N.: Reversible watermarking for 2D vector maps based on normalized vertices. Multimed. Tools Appl. 10(4), 471–481 (2016) 50. Wolfram, S.: A New Kind of Science. Wolfram Media, Champaign (2002) 51. Chaudhuri, P.P.: Additive Cellular Automata: Theory and Applications. Wiley, Chichester (1997) 52. Encinas, L.H., del Rey, A.M.: Inverse rules of ECA with rule number 150. Appl. Math. Comput. 189(2), 1782–1786 (2007) 53. del Rey, A.M.: A note on the reversibility of elementary cellular automaton 150 with periodic boundary conditions. Rom. J. Inf. Sci. Technol. 16(4), 365–372 (2013) 54. Del Rey, A.M., Sánchez, G.R.: On the reversibility of 150 Wolfram cellular automata. Int. J. Mod. Phys. C 17(07), 975–983 (2006) 55. Martı, A., Rodrı, G.: Reversibility of linear cellular automata. Appl. Math. Comput. 217(21), 8360–8366 (2011) 56. E. ESRI: Shapefile technical description. An ESRI White Paper (1998) 57. Harris, U.: Windmill Islands 1: 50000 Topographic GIS Dataset. Australian Antarctic Data Centre-CAASM Metadata (1999). Accessed 09 June 12 58. http://www.ibge.gov.br/english/geociencias/default_prod.shtm (2012). Accessed 09 June 12 59. http://gcmd.nasa.gov/KeywordSearch/Metadata.do?Portal=amd_au&MetadataView= Full&MetadataType=0&Keyword

Adapting Treemaps to Student Academic Performance Visualization Samira Keivanpour(&) Department of Management, Information and Supply Chain, Thompson Rivers University, Kamloops, Canada [email protected]

Abstract. The treemap visualization method is applied to student academic performance via an empirical study. This approach is developed to facilitate educational decision-making. The case study provides analysis and classification of hierarchical academic data and allows decision-makers to modify features of the visual platform dynamically.

Keywords: Academic performance · Data mining · Multidimensional data · Visualization · Treemap · Educational decision support · Moodle



1 Introduction

Educational research relies increasingly on digitized student information, which must be managed and organized so that performance can be evaluated. Since many different student attributes can be analyzed, visualization of the overall picture becomes essential for decision makers. Discussions with several department heads, professors and faculty deans revealed that an integrated approach for analyzing and visualizing students' performance data has become a fundamental requirement for educational decision support systems. The integration of educational data along multiple dimensions is challenging, and instructors require access to a coherent view of the information that has been collected. This research paper contributes to the development of the data integration and visualization components of such an educational system. It provides instructors, department chairs and other decision makers with a tool to quickly browse student data, and it supports academic stakeholders in visualizing complex information, such as correlations, to establish new educational hypotheses. In this paper, the incorporation of educational knowledge for visualization and comparative analysis of student data is shown. The objective is to visualize multidimensional data in order to develop a useful performance measurement framework that aids decision makers through a fast and concrete platform. By mapping multi-dimensional student data attributes, a practical approach to structuring and classifying students' records is developed. The following problem is addressed in this study: student databases like Moodle contain student records with attributes such as assignment marks, attendance and final marks, and an induced hierarchical visualization for classification of students is performed. The rest of the paper is organised as follows: Sect. 2 discusses the literature review, Sect. 3 presents the method, Sect. 4 illustrates the application with real data, and Sect. 5 discusses conclusions and future research areas.

2 Literature Review

Data mining and visualization are widely used in the educational context. Here, we briefly review data mining applications and visualization for analyzing students' record data. Raji et al. (2017) applied graph theory and flow analysis to analyze patterns in student progression [1]; the authors emphasized the benefits of the developed model by presenting positive feedback from faculty members, a vice provost and department heads. Ginda et al. (2017) used a multi-level heat map to support learning management systems [2]; they used this visualization technique to aggregate students' record data such as weekly engagement and submission grades. Emmons et al. (2017) discussed the advantages of using data analysis and visualization as an empowering tool for multiple stakeholders such as students, instructors and scholars [3], and explained how visualization can help in analyzing large amounts of data. Romero et al. (2010) used fuzzy rule systems for analyzing data on Moodle and predicting students' final marks [4]. Graphical interfaces have also been developed for tracking and representing Moodle data for online courses [5, 6]; visualization of these data can provide instructors with active monitoring and analysis of student performance, particularly in the e-learning context. Lakshmi et al. (2013) used a genetic algorithm to analyze complex data and identify the factors that most influence performance [7]. Doctor and Iqbal (2012) applied a fuzzy linguistic summarisation method to find the relationship between engagement, activity and students' performance [8]. Ogor (2007) developed a data mining approach with neural networks for monitoring student performance [9]. Donnellan and Pahl (2002) applied data mining, including pattern recognition, statistical analysis and time series approaches, to the evaluation of students' performance [10]. In another study, El-Halees (2009) used classification and clustering to analyze student data from a course database [11]. Romero et al. (2009) studied the application of different data mining approaches, such as visualization, statistics, text mining, clustering and classification, for monitoring the educational performance of students. Romero and Ventura (2007) surveyed the applications of data mining approaches in the educational context from 1995 to 2005 [12]; based on the gaps in the literature, they recommended future research on easy-to-use data mining for educators, standardization of methods and data, and integration with e-learning and educational knowledge [12]. Based on this synthesis of the literature, the following conclusions can be highlighted:
• Data mining approaches are essential for educators to analyze the multidimensional data from course databases
• Visualization is particularly valuable as it can facilitate and expedite the interpretation of data and pedagogical decision-making
• The interfaces and the applied methods should be user-friendly for instructors and the other key stakeholders
• Data mining and data analysis approaches should be integrated with educational knowledge for effective outcomes
These key findings are considered in developing the visualization tool in this study.

3 Proposed Methodology

Treemaps are used widely in different applications such as business, the stock market and manufacturing [13–15]. Wattenberg (1999) used treemaps composed of rectangles to illustrate the size of, and changes in, stock market prices [16]. For portfolio analysis, this visualization provides a fast interpretation of a stock market with complex data including several attributes, and it reveals more insight into the context through classification of hierarchical databases. Figure 1 shows a sample treemap. As shown in the figure, each rectangle can be assigned to one element of the database, and the size and color of the rectangles can be used for different features of each record in the data set. For example, in the stock market context, the large rectangles represent a specific sector of the market and the small rectangles inside them represent companies' shares in that sector; the size can encode the share of the company, and the color the change in the stock price. This approach can also be used for visualization of academic data from course databases. In an academic database we face the same complexity regarding the number of students (like companies in the case of the stock market) and their educational performance (like the change in price over time). Instructors usually use different learning management systems for monitoring and evaluating students' performance. Open-source learning management systems such as Moodle are used widely in higher education. Gradebooks in Moodle can be used for managing and monitoring students' performance during an academic term: assignment marks, midterms, activities, quizzes, participation and attendance can be recorded in the Moodle database and downloaded as spreadsheets for further analysis and data mining. However, not all instructors have the skills to apply data mining tools or to use the results for pedagogical purposes. Furthermore, contextual knowledge and educational insights can be used when analyzing vast databases to help university decision makers and policy makers with strategic, tactical and operational decisions. With the help of treemaps, the performance of students during the semester can be visualized, and instructors gain more insight into the influence of different tasks, assignments, quizzes and class participation on students' final marks. This visualization can reveal which factor is most predictive of students' final grades. The process of visualization starts with the preparation of data; the Moodle gradebook can be used for this purpose. The records of different activities should be classified based on experts' opinions, their knowledge and the outcomes of previous analyses. The users upload the database into the program, and the interactive interface is shown after running it. The users can change the attributes to get different results. A detailed case study is illustrated in the next section.


Fig. 1. Treemap example, rectangles represent the elements in a database, and the size and color represent the features and attributes
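For readers who want to reproduce a comparable view outside the paper's MATLAB interface, the sketch below uses the third-party squarify and matplotlib libraries (an assumption; they are not used in the study) to lay out one rectangle per student, with size and colour both encoding a hypothetical final mark, as in the stock-market style treemap described above.

```python
import matplotlib.pyplot as plt
import squarify  # third-party treemap layout library (assumed available)

students = ['S1', 'S2', 'S3', 'S4', 'S5', 'S6']
final_marks = [82, 45, 91, 67, 73, 58]          # hypothetical final marks

# Size and colour of each rectangle both encode the final mark.
colours = plt.cm.viridis([m / 100 for m in final_marks])
squarify.plot(sizes=final_marks, label=students, color=colours, alpha=0.9)
plt.axis('off')
plt.title('Student final marks as a treemap')
plt.show()
```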

4 Application

4.1 Background and Dataset

For illustration purposes, the data of four sections of an undergraduate Management Information Systems course are used to show the application of the proposed visual platform in an academic context. The data set includes 166 students' records over two consecutive semesters. The fields include the students' country of origin, the average mark of the activities, the average grade of the assignments, participation marks, midterm marks, final exam marks and the final marks. The students are from Canada, China, India, Bangladesh, Nigeria, Russia, Colombia and Saudi Arabia. The records are classified based on four intervals of the marks in the related task (less than 40%, between 40% and 60%, between 60% and 80%, and greater than 80%), as sketched below.
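A minimal sketch of that classification step with pandas; the column names and the sample records are hypothetical, not the actual Moodle export fields used in the study.

```python
import pandas as pd

# Hypothetical gradebook export with illustrative column names.
grades = pd.DataFrame({
    'student': ['S1', 'S2', 'S3', 'S4'],
    'country': ['Canada', 'China', 'India', 'Nigeria'],
    'final_mark': [35.0, 55.0, 72.0, 88.0],
})

# The four intervals used for classification in the case study.
bins = [0, 40, 60, 80, 100]
labels = ['<40%', '40-60%', '60-80%', '>=80%']
grades['mark_band'] = pd.cut(grades['final_mark'], bins=bins,
                             labels=labels, right=False)
print(grades)
```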

4.2 Results

The visualization interface has been coded in MATLAB. The first results are based on classifying the data by country. Figures 2 and 3 show treemaps of the students' final marks, with the 166 students grouped into eight categories according to their country of origin. Each rectangle represents a student; the rectangles in Fig. 3 are named S1 to S166. The size and color of the rectangles show the students' final grades: light yellow indicates the highest marks and dark blue the lowest. Figure 2 shows the borders of the eight regions clearly. Figures 4, 5, 6, 7 and 8 represent final marks based on the students' performance in activities, assignments, the midterm, participation and the final exam, respectively. In these figures, each rectangle represents one student, and the size and color show the final marks. The results reveal that participation and the average of the students' assignment marks are the most predictive of the students' final exam marks. The user interface is shown in Fig. 9.


Fig. 2. Academic performance map of 166 students based on their original country-part A

4.3 Discussion

A visualization platform for integrating educational knowledge into large databases of students' performance should have the potential to monitor and track changes in the data. Although this is only the first step towards a systematic knowledge platform, the social process of using the data by educators, department heads and decision makers, particularly concerning the sophisticated features of multidimensional data, is also required. Therefore, integrating the visualization technique with contextual knowledge is needed to complete the process, and such a holistic approach and integration is challenging. The intervals for classification can be set based on previous studies, experts' opinions and experience. The visualization technique is a useful tool for surveying broad and spread-out data [15]. The students with smaller rectangles are those with lower performance, and the larger rectangles belong to students with high performance; students with small, dark blue rectangles are those who need more consideration in the pedagogical plan. This analysis could be used for prerequisite courses too. Monitoring the performance of students in the Management Information Systems (MIS) course could also be a good predictor for other courses such as Statistics II, supply chain courses and advanced MIS courses. Visualizing students' performance in the core courses that are prerequisites of other courses also helps instructors with better planning and advising guidelines.


Fig. 3. Academic performance map of 166 students based on their original country-Part B

Fig. 4. Academic performance map of 166 students based on their performance in activities

Fig. 5. Academic performance map of 166 students based on their performance in assignments


Fig. 6. Academic performance map of 166 students based on their performance in midterm

Fig. 7. Academic performance map of 166 students based on their performance in participation

Fig. 8. Academic performance map of 166 students based on their performance in final exam


Fig. 9. User interface

5 Conclusion

With the growing use of learning management systems and information systems in education, data mining and data analytics are increasingly essential for instructors. The academic crew in universities includes instructors, scholars, department heads, advisors and other decision-makers who need to communicate effectively to take appropriate actions. Considering the complexity of applying data mining techniques, visualization is a valuable tool for the key stakeholders in educational systems for monitoring and managing students' performance. This study proposed a user-friendly visualization platform for data mining and analysis of students' performance. Treemaps are used in different research areas to map and track complex data; to the best of our knowledge, the application of treemaps to academic evaluation is new. A case study including 166 data points is presented to show the application perspective. The data were classified according to the performance of students in activities, assignments, the midterm, class participation and the final exam. The treemap approach was then used to visualize the students' final marks based on the different elements of course evaluation. The visualized interface can aid educators in fast and easy tracking of students' performance. The joint application of fuzzy rule-based modeling and treemaps is recommended for future research: a fuzzy rule-based approach can be used for systematically integrating experts' opinions and contextual knowledge into the mapping process.


References 1. Raji, M., Duggan, J., DeCotes, B., Huang, J., Zanden, B.V.: Visual progression analysis of student records data. arXiv preprint arXiv:1710.06811 (2017) 2. Bueckle, M.G.N.S.A., Börner, K.: Empowering instructors in learning management systems: interactive heat map analytics dashboard (2017). Accessed 2 Nov 2017 3. Emmons, S.R., Light, R.P., Börner, K.: MOOC visual analytics: Empowering students, teachers, researchers, and platform developers of massively open online courses. J. Assoc. Inf. Sci. Technol. 68(10), 2350–2363 (2017) 4. Romero, C., Espejo, P.G., Zafra, A., Romero, J.R., Ventura, S.: Web usage mining for predicting final marks of students that use Moodle courses. Comput. Appl. Eng. Educ. 21(1), 135–146 (2013) 5. Mazza, R., Milani, C.: GISMO: a graphical interactive student monitoring tool for course management systems. In: International Conference on Technology Enhanced Learning, Milan, pp. 1–8, November 2004 6. Mazza, R., Dimitrova, V.: Visualising student tracking data to support instructors in webbased distance education. In: Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, pp. 154–161. ACM, May 2004 7. Lakshmi, T.M., Martin, A., Venkatesan, V.P.: An analysis of students performance using genetic algorithm. J. Comput. Sci. Appl. 1(4), 75–79 (2013) 8. Doctor, F., Iqbal, R.: An intelligent framework for monitoring student performance using fuzzy rule-based linguistic summarisation. In: 2012 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–8. IEEE, June 2012 9. Ogor, E.N.: Student academic performance monitoring and evaluation using data mining techniques. In: Electronics, Robotics and Automotive Mechanics Conference, 2007. CERMA 2007, pp. 354–359. IEEE, September 2007 10. Donnellan, D., Pahl, C.: Data mining technology for the evaluation of web-based teaching and learning systems. In: E-Learn: World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education, pp. 747–752. Association for the Advancement of Computing in Education (AACE) (2002) 11. El-Halees, A.: Mining Students Data to Analyze e-Learning Behavior: A Case Study (2009) 12. Romero, C., Ventura, S.: Educational data mining: a survey from 1995 to 2005. Expert Syst. Appl. 33(1), 135–146 (2007) 13. Vliegen, R., Van Wijk, J.J., van der Linden, E.J.: Visualizing business data with generalized treemaps. IEEE Trans. Visual Comput. Graphics 12(5), 789–796 (2006) 14. Jungmeister, W.A., Turo, D.: Adapting Treemaps to Stock Portfolio Visualization (1992) 15. Keivanpour, S., Ait Kadi, D.: Strategic eco-design map of the complex products: toward visualization of the design for environment. Int. J. Prod. Res. 56(24), 1–17 (2017) 16. Wattenberg, M.: Visualizing the stock market. In CHI 1999 Extended Abstracts on Human Factors in Computing Systems, pp. 188–189. ACM, May 1999

Systematic Mobile Device Usage Behavior and Successful Implementation of TPACK Based on University Students Need

Syed Far Abid Hossain(&), Yang Ying, and Swapan Kumar Saha
School of Management, Xi'an Jiaotong University, No. 28, Xianning West Road, Xi'an 710049, Shaanxi, People's Republic of China
[email protected]

Abstract. This paper analyzes the findings of a research study exploring the development of the Technological Pedagogical Content Knowledge (TPACK) model in accordance with diverse university students' needs for sustainable teaching and learning in different universities. Towards the aim of successful implementation of the TPACK model through systematic mobile device usage, the study included the participation of 201 university students in different undergraduate programs and 10 university lecturers. In order to collect their feedback on the TPACK model, the students and teachers were requested to complete a questionnaire. Furthermore, a focus group interview was also conducted to further investigate students' and tutors' scrutiny of TPACK development with regular mobile device usage in accordance with diverse learners' needs. Even though the majority of the findings support the development of the TPACK structure in accordance with diverse university students' needs, a few limitations are found regarding the implementation. This research will be helpful for determining post-secondary learning and teaching with technology.

Keywords: TPACK · Diverse · Mobile device · Teachers · Learners · Need · Teaching · Learning

1 Introduction

The TPACK concept originated with the noted scholar Lee S. Shulman and his development of PCK (Pedagogical Content Knowledge), as stated by [1]. The intersections of the three circles of the TPACK model (Fig. 1) yield four knowledge-based teaching criteria: Pedagogical Content Knowledge (PCK), Technological Content Knowledge (TCK), Technological Pedagogical Knowledge (TPK), and Technological Pedagogical Content Knowledge (TPACK). Researchers have found that university learners' needs differ markedly from those of other learners, so their learning needs should be considered specially and specifically [3]. In the context of education, the TPACK model is associated with the learners because it focuses on improving learning more effectively. Typically, instructors or teachers are responsible for designing a model or structure such as TPACK for the learners, for example through course outline design. The five most critical needs of university students were stated by [4]: comfortable schedules, a relevant course outline, clear expectations, feedback from the instructor, and, finally, teaching that acknowledges learners' previous learning experience. Giving priority to the students is another aspect of teaching which may result in better understanding and make the learners more responsible, so that they have their own control over learning [5]. Students may have a broader sense of ownership over learning outcomes if they are guided and accompanied by the required technology [4]. Although earlier studies revealed the potential of the TPACK model for education, the necessity of TPACK according to a university learners'-need approach is still undefined and under-explored. It is valuable to investigate this issue because it may help the concerned parties to improve education policy. In order to address this research gap, this study is an attempt to: (1) develop the TPACK model in accordance with diverse university learners' needs with a focus on mobile usage; and (2) identify the benefits of this new approach from the students' and instructors' perspectives. The outcome of this study offers insight into learners' and instructors' perceptions of the importance of developing the TPACK model in accordance with diverse learners' needs. In addition, the findings of this research are helpful for academicians, education researchers, policymakers, and scholars.

2 Literature Review

2.1 TPACK and Higher Education

Researchers, instructors, and education training providers have made frequent use of the TPACK model [6]. For teaching in a digital era, the TPACK model is an effective way to share knowledge with others [7], and it is best suited to technology-enriched classrooms for university learners. Higher education in the twentieth century emphasized content mastery and the transmission of knowledge from instructor to learners [8]. In the twenty-first century, digital technologies provide instant, globally available access to communication [9]. As a result, successful university-educated learners are expected to be more flexible in demonstrating information technology [10], which indicates the emerging need for the TPACK model in higher education. Therefore, universities and instructors are required to redesign teaching and learning procedures to suit students' development in a global world [11]. Apart from the necessity of cognitive skills for designing content [12], newer models of learning are advocated by [13]. Earlier, researchers found that most teachers have more knowledge of content than of technology [14], but a teacher should command all the TPACK dimensions for quality teaching [15]. Most importantly, higher educational institutions, which mainly consist of university learners, should continuously improve the TPACK model because the demand for higher education is growing continuously around the world. According to [16], a 30% increase in the enrolment of learners aged twenty-five and over was observed in the USA between 2000 and 2009. It is typically assumed that university learners are mature enough to select their learning content, and if this can be ensured they may apply the learning outcomes in the practical world [17].


Day by day, university learners are relying more on themselves, becoming individualized, and learning to manage many learning-related difficulties on their own [18] because of the unlimited access to learning opportunities around the world.

2.2 Challenges of TPACK for Higher Education

One major challenge observed by researchers is adapting to the changing environment of Information and Communication Technologies (ICT); instructors need to rethink the innovation and integration of the TPACK model [19]. The TPACK framework is relatively new, so pre-service teachers should be trained well [20]; otherwise it might be difficult for new teachers to adopt the TPACK model in class. Although pedagogical knowledge has the largest impact on TPACK [21], the integration of technological knowledge is considered the most challenging factor of TPACK for higher education [22]. Also, pedagogical techniques along with technology should be used in a constructive way based on university learners' needs [23], which is quite challenging to implement.

2.3 Changing Environment for University Learning

Group study or teamwork as a method of learning has been appreciated by researchers in recent decades because of the rapid change in the business world [24]. Group learning is also considered one of the seven principles for quality teaching of undergraduate university learners [25]. Many researchers have suggested that instructors have a crucial impact on university learners because instructors' attitudes toward advanced technology significantly influence university learning [26]. In addition, the collaborative approach to learning has also been discussed as beneficial for university teaching [27]. According to [28], many course coordinators have attempted to utilize technologies in their courses in order to pay considerably more attention to higher education as well as to university learners. Co-construction of learning is another recent suggestion for developing educators' teaching skills [29].

3 Study Context

The study pursued its aim by collecting data from a private university in Bangladesh, the International University of Business Agriculture and Technology (IUBAT). All instructors at IUBAT are instructed to prepare a course outline for the students at the beginning of each semester. The whole campus is provided with internet access through a powerful Wi-Fi connection, and the instructors use a multimedia projector and internet connection in class when necessary to ensure the best quality of education. This study focused on students' comments and ideas from IUBAT. The university mainly provides higher education in Business, Agriculture, Engineering, Nursing, and Social Science. In order to give readers a clear and authentic view, an exemplary class plan was prepared by the authors, who further divided the learners into two categories based on the learning needs and capabilities of the university learners.


Tables 1 and 2 present the two categories of students and the specific class plans following TPACK. Category "A" consists of students with a current average CGPA of 3.0 or above, and category "B" consists of students with a current average CGPA of 2.0–2.99. To teach the same topic, "Referencing methods for writing a dissertation", two different class plans are shown in Tables 1 and 2. The class plans are prepared in advance and take into account the understanding level of high and low CGPA earners. Although both classes have the same duration (sixty minutes), the time can be allocated differently based on learners' needs and capability.

3.1 Possible ICT Technologies Used to Carry Out the Proposed Activities

The use of the internet in Tables 1 and 2 has already been discussed. In addition, further ICT technologies can strengthen the learning procedure, because recent research has shown varying results with TPACK instruction [30]. A tool such as enterprise social media [31], used to communicate effectively within the classroom through ICT devices like a mobile phone or another portable device, can help to implement the proposed activities. As a technology, "communication" constructs the human [32]; hence any ICT technology on a portable device such as a laptop, notebook, iPad, tablet or smartphone can be used for communication among students and teachers in order to implement the TPACK model effectively. However, the mobile phone as an ICT technology is available, affordable and easy to use in class for TPACK implementation, as shown in Tables 1 and 2.

Table 1. A class plan according to TPACK (for category "A" learners)
Topic: Referencing methods for writing a dissertation. Duration: 1 h. Number of students: 40.
Learning objectives: by the end of the session students will be able to: (i) use suitable referencing styles in their dissertation; (ii) use suitable referencing techniques with a proper tool or software.

Part 1 (15 min)
LA1. Interpret the sample research paper provided in the class and discuss the following questions in a small group of 5 students: Q1. Define referencing and citation. Q2. Why should you learn and use referencing for your academic research or dissertation? Q3. When should you use a reference? (Activity type: collective work; group activity with 5 members in each group)
LA2. Watch a video tutorial and take notes on APA referencing style: https://www.youtube.com/watch?v=SOEmM5gmTJM (Activity type: independent, referred and practical learning)

Part 2 (15 min)
LA3. Read the sample journal paper and discuss: Q1. How does the writer refer to the sources in the body text? Q2. Why did the writer cite the sources in that way? Q3. How can the source details be found? (Activity type: collective work; group activity with 5 members in each group)
LA4. Study in groups of 5 students and search the internet to identify resources on the referencing approach according to the instructions below; build a table of diverse referencing styles. 1. Group 1: referencing one author's book, book chapter, journal paper, webpage, conference paper, website, e-book, newspaper, etc. 2. Group 2: two authors' book, book chapter, journal paper, webpage, conference paper, website, e-book, newspaper, etc. 3. Group 3: three to five authors' book, book chapter, journal paper, webpage, conference paper, website, e-book, newspaper, etc. 4. Group 4: six or more authors' book, book chapter, journal paper, webpage, conference paper, website, e-book, newspaper, etc. (Activity type: collective work; group activity with 5 members in each group)

Part 3 (15 min)
LA5. Group work to discover examples of quotes, paraphrasing, and summary used by the author in the provided paper; discuss the answers in the class. 1. Quote: how to use long and short quotes; how to modify a direct quote; word counting for a lengthy direct quote, line spacing, font size, use of bold or italics, etc. 2. Paraphrase: how to paraphrase effectively. 3. Summarize: how to summarize multiple paragraphs or articles with parallel ideas. (Activity type: collective work; group activity with 5 members in each group)

Part 4 (10 min)
LA6. Complete individually the exercise form given in the class on "How to refer your writing". LA7. Complete individually the exercise form given in the class on "How to make citation in your writing". (Activity type: independent activity to assess the learners)

Post class (5 min)
Feedback from the students, including their need for further learning in the next session. (Activity type: insightful activity)


Table 2. A class plan according to TPACK (for category "B" learners)
Topic: Referencing methods for writing a dissertation. Duration: 1 h. Number of students: 25.
Learning objectives: by the end of the session students will be able to: (i) use suitable referencing styles in their dissertation; (ii) use suitable referencing techniques with a proper tool or software.

Part 1 (10 min)
LA1. Interpret the sample research paper provided in the class; explanation of the following questions: Q1. Define referencing and citation. Q2. Why should you learn and use referencing for your academic research or dissertation? Q3. When should you use a reference? (Activity type: collective work; class lecture activity with answers to the questions)
LA2. Discuss in detail a video tutorial and take notes on APA referencing style: https://www.youtube.com/watch?v=SOEmM5gmTJM (Activity type: independent, referred and practical learning)

Part 2 (20 min)
LA3. Read the sample journal paper and discuss with your partners; you can ask any question if necessary. Q1. How does the writer refer to the sources in the body text? Q2. Why did the writer cite the sources in that way? Q3. How can the source details be found? At the end of the discussion session, answers will be given by the instructor. (Activity type: collective work; group activity with 2–3 members in each group)
LA4. Study in groups of 2–3 students and search the internet to identify resources on the referencing approach according to the instructions below. 1. Group 1: referencing one author's book, book chapter, journal paper, webpage, conference paper, website, e-book, newspaper, etc. 2. Group 2: two authors' book, book chapter, journal paper, webpage, conference paper, website, e-book, newspaper, etc. 3. Group 3: three to five authors' book, book chapter, journal paper, webpage, conference paper, website, e-book, newspaper, etc. 4. Group 4: six or more authors' book, book chapter, journal paper, webpage, conference paper, website, e-book, newspaper, etc. One member from each group will discuss the given topic. (Activity type: collective work; group activity with 2–3 members in each group)

Part 3 (10 min)
LA5. Individual work to discover examples of quotes, paraphrasing, and summary used by the author in the provided paper; find the answers in the class and ask a friend to explain if you are not clear. 1. Quote: how to use long and short quotes; how to modify a direct quote; word counting for a lengthy direct quote, line spacing, font size, use of bold or italics, etc. 2. Paraphrase: how to paraphrase effectively. 3. Summarize: how to summarize multiple paragraphs or articles with parallel ideas. Answers will be provided and discussed again by the instructor. (Activity type: independent activity to assess the learners with the help of comparatively better students)

Part 4 (10 min)
LA6. Explanation by the instructor of the exercise form given in the class on "How to refer your writing". LA7. Complete individually the exercise form given in the class on "How to refer your writing". (Activity type: independent activity to assess the learners)

Post class (10 min)
Feedback from the students, including their need for further learning in the next session; a one-to-one session is arranged if necessary. (Activity type: insightful activity)

4 Methodology

4.1 Participants

In this study, a total of 201 students and 10 university teachers participated eagerly in answering the research questions. The students were selected from five different disciplines: EEE, Mechanical Engineering, BBA, BA in Nursing and BA in Economics. The ages of the respondents range from 18 to 27 years old for the students and from 28 to 45 years old for the teachers. The 10 teachers were later interviewed for their ideas and thinking regarding TPACK and its future in accordance with individual university learners' need analysis. At the beginning of the study, a university officer was assigned to conduct the ethical approval procedure with the university, and a consent form containing full information about the study was signed by each participant.

4.2 Data Collection and Analysis

Primary data were collected in two different ways for the study. A questionnaire was designed around the TPACK structure in accordance with diverse university learners' needs, with the aim of identifying the students' preferences for TPACK design and implementation. The questionnaire also divided the student group into two categories based on their current CGPA in order to analyze the different needs of university learners. In addition, interviews were conducted with the teachers to obtain more accurate ideas about the research topic. The reason for choosing 3 students out of 201 for interviews was to obtain an overall view from student representatives; the selected students were chosen from among well-known student representatives.

4.3 User Acceptance Questionnaire for TPACK Structure

Being highly relevant to the circumstances of the present study, the four constructs employed by [33] form the main basis of the questionnaire used here. The questionnaire (shown in Table 3) includes 8 items that fall into four categories: two items on perceived usefulness (PUF), two items on perceived ease of use (PEUF), two items on attitudes towards TPACK (ATTPACK), and two items on behavioral intention towards TPACK (BITPACK). Student responses were collected on a 5-point Likert scale (1 = strongly disagree and 5 = strongly agree). The respondents were informed about the TPACK model with a short briefing in the class where the data were collected. After the brief explanation, respondents were randomly asked questions to make sure that they understood the model and could complete a questionnaire about it.

Table 3. Questionnaire for TPACK structure in accordance with diverse university learners' needs
Code | Item
PUF1 | Different TPACK model helps me to learn more effectively
PUF2 | Different TPACK model enables me to complete my assignments more quickly
PEUF1 | I find my supported TPACK model easy to learn
PEUF2 | I find my supported TPACK model easy to access
ATTPACK1 | The schedule of my supported TPACK model is comfortable
ATTPACK2 | I feel interested in my supported TPACK model
BITPACK1 | I intend to use my own supported TPACK model during the semester
BITPACK2 | I intend to use my own supported TPACK model as much as possible
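For analysis, each two-item construct in Table 3 can be scored as the mean of its two items. The following R sketch illustrates this under the assumption that the responses are loaded into a data frame with one column per item code; the object and column handling here are illustrative and are not taken from the authors' own scripts.

# Illustrative scoring of the four two-item constructs on the 1-5 Likert scale.
# 'responses' is assumed to be a data frame with numeric columns named after the
# item codes in Table 3 (PUF1, PUF2, PEUF1, PEUF2, ATTPACK1, ATTPACK2,
# BITPACK1, BITPACK2), each holding values from 1 to 5.
score_constructs <- function(responses) {
  data.frame(
    PUF     = rowMeans(responses[, c("PUF1", "PUF2")]),
    PEUF    = rowMeans(responses[, c("PEUF1", "PEUF2")]),
    ATTPACK = rowMeans(responses[, c("ATTPACK1", "ATTPACK2")]),
    BITPACK = rowMeans(responses[, c("BITPACK1", "BITPACK2")])
  )
}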


This research administered the above questionnaire to two categories of students: Category A (CGPA 3.0 and above) and Category B (CGPA 2.0–2.99). The eight items shown in Table 3 are the main focus of the questionnaire. Apart from the main body, the questionnaire also covered the demographic factors given in Table 4.

Table 4. Breakdown of the demographic status of the students and teachers
Demographics | Category | Students/Teachers (Local) | Students/Teachers (International) | Total (All)
Gender of the students | Male | 140 | 3 | 143
Gender of the students | Female | 51 | 7 | 58
Age (Students) | 18–22 | 142 | 6 | 148
Age (Students) | 23–27 | 49 | 4 | 53
Age (Teachers) | 28–32 | 5 | 0 | 5
Age (Teachers) | 33–37 | 3 | 0 | 3
Age (Teachers) | 38–42 | 2 | 0 | 2
Student type | Regular (Full time) | 181 | 10 | 191
Student type | Irregular (Part time) | 10 | 0 | 10
Employment status of the students | Employed | 92 | 0 | 92
Employment status of the students | Unemployed | 99 | 0 | 99

The questionnaire was collected from the students in class with the prior permission of the university authority. Although the institution's medium of instruction is strictly English, for the clarity and authenticity of the research the writers decided to translate the questionnaire into the respondents' local language so that they could provide clear responses. Altogether, 201 samples were used for a paired t-test to estimate differences in respondents' views of the TPACK structure in accordance with diverse university learners' needs.

4.4 Focus Group Interview

At this stage of the research, the writers generated qualitative data through focus group interviews. A total of 3 students who are student representatives and 10 university lecturers were selected; a convenience sampling approach was used to select these respondents. Each interview was conducted separately and lasted from 25 to 50 min. The interview sessions were recorded with an audio recorder and the WeChat application, and later summarized and categorized with the help of thematic analysis [34]. Emerging ideas were recognized from the audio
recording of the interview to cross-check and validate the quantitative data result obtained. For students the following interview questions were designed by the authors: Could you please describe a course which follows TPACK structure? How is that different from the other courses which do not follow TPACK structure properly? How do you think about TPACK structure in accordance with diverse university learners’ needs? What changes might happen in future if you have an opportunity to get the benefit of TPACK structure in accordance with diverse university learners’ need? What changes do you predict for your classmate learners if they have an opportunity to get the benefit of TPACK structure in accordance with diverse university learners’ need? The questions which were used for the teachers’ interview were not particularly similar with the students because instructors are the primary concern to bring changes in using TPACK according to the university learners need. As a result, the following interview questions were designed by the authors for teachers: Would you like to share us about a course in which you used TPACK approach? How is that different from the other courses which do not follow TPACK structure properly? How do you cope up with TPACK structure with the changing environment? Would you enjoy keep changing TPACK structure according to the need of your students? What type of change do you expect from your students if you develop TPACK model according to their need?

5 Results

5.1 Results About the Need for Technological Device Usage by University Learners

At this stage, the authors examined the technological readiness of the university learners in terms of the different technological devices used for education (Table 5), the locations where those devices are used (Table 6), and the duration of device use by the students (Fig. 2). This information was collected from the post-questionnaire. Table 5 shows that laptops and smartphones were the devices used most frequently by the learners; it was also identified that most of the university students own more than one technological device. To identify how frequently devices are used, the authors collected data on the places where students use their devices. Almost 32% of students use devices in all the listed places, and about 62% use devices at home, university and classroom, or at home and university; only about 6% use devices at home only. From the focus group interviews it was also observed that many students wish to use their devices even on the move, but they cannot do so due to the lack of an internet connection and poor-quality transportation facilities. This device usage can support different learning activities: "Learning Activity Types", known as "LATs" [35], is an initiative that can be implemented through the proper use of devices such as the mobile phone.


Table 5. Devices used by students, subdivided by male and female (N = 201, male = 143, female = 58)
Device type | Gender | Count | Percentage of usage by student participants
Laptop | Male | 121 | 84.61%
Laptop | Female | 42 | 72.41%
Smartphone | Male | 118 | 82.51%
Smartphone | Female | 48 | 82.75%
Desktop | Male | 27 | 18.88%
Desktop | Female | 14 | 24.13%
Tablet | Male | 28 | 19.58%
Tablet | Female | 8 | 13.79%
Electronic book reader | Male | 1 | 0.49%
Electronic book reader | Female | 0 | 0%

Table 6. Different places for using devices by students (N = 201)
Place | Percentage | Count
Home, university, classroom and library | 31.84% | 64
Home, university and classroom | 25.87% | 52
Home and university | 36.81% | 74
Home only | 5.47% | 11

Fig. 1. TPACK model [2]

According to Fig. 2 below, the use of technological devices is higher among the high CGPA achievers, except for smartphone usage. The usage of the desktop and the tablet is comparatively low. It is also noted that some respondents do not own a desktop but have the option to use a desktop in the university library. The tablet, although easy to carry, has a very low usage rate, as shown in the figure below.

[Fig. 2 comprises four bar charts, one per device (usage of laptop, smartphone, desktop and tablet), each broken down by CGPA band: 2.00–2.50, 2.51–2.99, 3.00–3.49, 3.50–3.80 and 3.81–4.00.]

Fig. 2. Duration of various devices usage by respondents (N = 201) for study and personal purpose.
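The grouping behind Fig. 2 can be reproduced by aggregating the reported usage durations over CGPA bands. A minimal R sketch follows; the data frame and column names are hypothetical placeholders for the study's survey data, not its actual variable names.

# Illustrative aggregation of reported usage hours by CGPA band, one column per
# device, mirroring the four panels of Fig. 2. 'usage' is a hypothetical data
# frame with a factor column cgpa_band (levels such as "2.00-2.50", ..., "3.81-4.00")
# and numeric columns laptop, smartphone, desktop and tablet holding usage hours.
summarise_usage <- function(usage) {
  aggregate(cbind(laptop, smartphone, desktop, tablet) ~ cgpa_band,
            data = usage, FUN = sum)
}
# The returned data frame can be passed to barplot() to draw one panel per device.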

5.2 User Acceptance Results of the TPACK Model in Accordance with University Learners' Need

Table 7 reports the reliability results for the questionnaire scales. Cronbach's alpha was calculated for each dimension of the questionnaire to examine the strength of its reliability. The reliability values range from 0.746 to 0.871 for high CGPA achievers and from 0.736 to 0.796 for low CGPA achievers. The reliability results for all dimensions (i.e. PUF, PEUF, ATTPACK, and BITPACK) in both groups were above 0.70, which indicates reliable measurement of the responses. The descriptive statistics and t-test results in Table 8 cover both high and low CGPA achievers. On average, respondents in both groups gave high ratings (M = 4.06 to 4.34 and SD = 0.24 to 0.59), which indicates that the respondents have a positive attitude toward using a TPACK structure based on university learners' needs.


Table 7. Reliability statistics for the questionnaire scales
Construct | Number of items | Cronbach's alpha (High CGPA) | Cronbach's alpha (Low CGPA)
PUF | 2 | 0.871 | 0.796
PEUF | 2 | 0.805 | 0.760
ATTPACK | 2 | 0.746 | 0.766
BITPACK | 2 | 0.789 | 0.736
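The alpha values in Table 7 can be reproduced from the raw item scores with the standard Cronbach's alpha formula. A minimal R sketch is given below; it assumes the two item columns of one construct are supplied as a data frame and is an illustration of the formula rather than the authors' own code.

# Cronbach's alpha for a set of items:
# alpha = k / (k - 1) * (1 - sum of item variances / variance of the summed scale)
cronbach_alpha <- function(items) {
  k         <- ncol(items)                # number of items (2 per construct here)
  item_var  <- sum(apply(items, 2, var))  # sum of the individual item variances
  total_var <- var(rowSums(items))        # variance of the total (summed) score
  (k / (k - 1)) * (1 - item_var / total_var)
}
# Example with hypothetical column names: cronbach_alpha(responses[, c("PUF1", "PUF2")])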

Table 8. Descriptive statistics and t-test results for the high- and low-CGPA achievers
Item | High CGPA (N = 100): Mean / SD | Low CGPA (N = 101): Mean / SD | High-Low paired differences (N = 100): Mean / SD / SE | t | df | p
PUF1 | 4.30 / 0.58 | 4.25 / 0.43 | 0.05 / 0.68 / 0.07 | –0.884 | 99 | 0.379
PUF2 | 4.29 / 0.57 | 4.31 / 0.46 | –0.02 / 0.69 / 0.07 | 0.15 | 99 | 0.89
PEUF1 | 4.33 / 0.53 | 4.11 / 0.34 | 0.22 / 0.63 / 0.07 | –3.63 | 99 | 0.000*
PEUF2 | 4.29 / 0.52 | 4.06 / 0.24 | 0.23 / 0.59 / 0.06 | –4.08 | 99 | 0.000*
ATTPACK1 | 4.34 / 0.59 | 4.30 / 0.54 | 0.04 / 0.83 / 0.09 | –0.48 | 99 | 0.63
ATTPACK2 | 4.24 / 0.57 | 4.31 / 0.56 | –0.07 / 0.87 / 0.09 | 0.81 | 99 | 0.42
BITPACK1 | 4.21 / 0.52 | 4.25 / 0.56 | –0.04 / 0.77 / 0.08 | 0.39 | 99 | 0.70
BITPACK2 | 4.20 / 0.53 | 4.32 / 0.53 | –0.12 / 0.70 / 0.07 | 1.71 | 99 | 0.09
*p < .05

The outcome of the paired t-test in Table 8 indicates no significant differences between high and low CGPA achievers in the ratings of most dimensions (p > 0.05), with the exception of PEUF. This difference is likely due to the different capability and understanding levels of the students. Such a difference may require further investigation by researchers in the future.
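The comparisons reported in Table 8 correspond to standard paired t-tests on matched item ratings. A minimal R sketch with simulated ratings is shown below, since the paper does not detail how the 100 high-CGPA and 101 low-CGPA responses were paired; the vectors here are placeholders for the real paired data.

# Paired t-test for one questionnaire item between high- and low-CGPA respondents.
set.seed(1)
high <- sample(3:5, 100, replace = TRUE)   # placeholder ratings for the high-CGPA group
low  <- sample(3:5, 100, replace = TRUE)   # placeholder ratings for the paired low-CGPA group
result <- t.test(high, low, paired = TRUE) # paired t-test with df = 99, as in Table 8
result$statistic                           # t value (compare with the t column of Table 8)
result$p.value                             # two-sided p value (compare with the p column)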

6 Discussion

6.1 User Acceptance and Future of the TPACK Approach

This study further explored students' and teachers' views on TPACK and its future. A total of ten university teachers (T1 to T10) and three students (S1 to S3) were asked to attend focus group interview sessions, as mentioned in the research methodology section.

Teachers' view on TPACK and its future. Teacher interviewees (T1 to T10) were selected randomly from the same institution, IUBAT. The first interviewee (T1) considers that TPACK structure in accordance
with diverse university learners’ need might be very helpful for university students’ engagement in the class. The instructor interviewee (T1) said: “TPACK structure in accordance with diverse university learners’ need will have an influential impact on the students. I think many of the university learners are not attentive in the class at all because of various issues. One of the issues is that the content may not interesting. This method will engage them more in the class so that they will eagerly concentrate and learn more.” The next instructor interviewee (T2) explained that TPACK structure in accordance with diverse university learners’ need will enhance the creativity of the students and they will be able to come up with something new. He said: Whenever I assign any presentation topic for the students, then the first problem I face is to distribute the topic among different groups. They really want to work on the topic of their own interest. So that I appreciate this TPACK structure and really believe that it will be very effective. Almost all instructor interviewees (T1 to T10) agreed that TPACK structure according to university learners need not only attract the students but also increase the desire for learning. They also explained that this structure could be more convenient and interesting for both the students and instructors. For example, an instructor interviewee T3 expressed: “Sometimes I really feel boring to teach the students a topic which they are not interested at all to learn. I can understand from their face and attitude in the class. Though the university I work has no strictness about teaching content, I think I can give the learners a chance to choose their topic of interest based on the course and then I may finalize it. This really will be convenient for university learners to learn new things as well as interesting too.” Another instructor interviewee (T4) express that TPACK structure according to university learners need is a good model for independent learning: “I think this model is a good example for learning something real which is also convenient for individual learning….if students have the opportunity to use this model then they can be more focused on their individual choice of learning and naturally they will learn more.” One of the most cited factors by instructor interviewees is the practical learning of the students. Some instructor interviewees (T5, T7, T8, and T10) believe that university learners may learn something practical from TPACK structure according to university learners need. According to (T7): “The main fact for students is to learn something which they can use in their future….practical learning is more important than theoretical learning which they may forget at any time. For example, I have seen many students struggle with making a good PowerPoint presentation even in the final year of their study because of the lack of practical learning.” In addition, an instructor interviewees (T6) believe that TPACK structure according to university learners need will enhance the view of the university learns. He expressed: “I believe that university learners have their own judgment for learning a specific topic. As a result, if they have a chance to take part in choosing content design activity then it will definitely enhance their view towards the subject.”


However, three instructor interviewees (T4, T9, and T10) argued that students' choice should not be limited only by this structure; the student's choice of a particular course is also crucial. "According to my observation, some students really hate business mathematics course. Although I tried to make it interesting some students have no interest at all on this subject. It may be because of their lack of previous knowledge but my concern is to arrange an alternative for this type of situation."

Students' view on TPACK and its future. Student interviewees (S1, S2, and S3) strongly recommended and appreciated the "TPACK structure according to university learners need" concept. A student interviewee (S1) considered that learning activities which are supported by the students are more meaningful. He added that: "I have a group in my "advertising" class. We want to learn how to make an advertisement, how to edit videos, how to build a meaningful concept within a few seconds etc. But most of the teachers don't want to listen to our thought. But I believe that it's a more meaningful way of learning something." Another student interviewee (S2) suggested that the "TPACK structure according to university learners need" concept will enhance students' ability to use technological devices for learning purposes. According to his observation, many students use technological devices for social communication only, but they can improve themselves when they are self-motivated by this type of strategy in higher education. In his words: "Many of my friends are very active in Facebook but they only use their laptop for study when the teachers assign homework. But I think we can learn every day from our technological devices according to our class lecture. Also, we can ask others about a topic on WhatsApp or WeChat. These should be enforced by the teachers in the class." Although all the students presented their views on the significance of the "TPACK structure according to university learners need", they did not mention it as the only key factor for ensuring higher education. A student interviewee (S3) shared: "Despite a lot of advantage, I don't think that this approach will be easy to implement because all students are not perfect for choosing the most valuable content for study. Even some students will find a very easy way just to pass in the exam. This will make the issue critical for implementation."

6.2 Concerns About the "TPACK Structure According to University Learners' Need" Approach

Despite the advantages of the TPACK model based on university learners' needs found by the authors, quite a few concerns, especially about the implementation of this model, emerge from this study. First of all, an ethical issue has been raised about the model, namely a serious question about how honestly university learners would select their needs. A few student interviewees (S2, S3) suggested getting feedback from students about the model and then verifying the quality of their learning interests against the international standard for the same discipline. All three student interviewees
demanded continuous lesson on technological usage so that they can be more comfortable with using new technology. The quotation below (S3) supports that. “Still, many students are using Windows XP but some teachers are using the latest version of windows and we sometimes find it difficult to prepare an assignment given by the teacher which requires an updated version of the software. I think continuous training in technology is needed for us to do better in the future.” Another concern about implementing the model is the mentality of the instructors for accepting such a change where students are focusing more rather than the instructors. An instructor interviewee (T4) stated that, although this model is good for independent learning, he is afraid that some instructors might not feel comfortable to change their teaching materials again and again. Also, some students might be upset when they find their chosen idea is rejected by the instructor. They might lose their interest in learning and get de-motivated desperately.

7 Implications

The study results have important implications for educators, researchers, and practitioners. Most importantly, this study highlighted the need for designing a TPACK structure according to the needs of university learners, with a focus on independent learning through mobile phone usage. Developing a "TPACK structure according to university learners need" approach can make students more responsible and make their study life more flexible, so that they can focus on further learning based on their individual needs. Furthermore, this study identifies the critical success factors of the "TPACK structure according to university learners need" approach, which may help the parties concerned with higher education to implement the TPACK model effectively in the future.

8 Conclusion

This paper set out to discover university learners' attitudes toward learning in the context of TPACK, with a special focus on the individual needs of university learners and on mobile phone usage. The results illustrate a positive attitude toward a differentiated TPACK model according to university learners' needs, and the respondents highlighted its usefulness. Learning could become more interesting and authentic at universities by implementing TPACK with learners' needs in the future. Evidence from this research clearly indicates the emerging need for TPACK according to university learners' needs. Although a few concerns are identified by the researchers, they can be handled. In order to implement the concept, institutional support along with the development of instructors and students is crucial. Further empirical evidence in a broader context may enhance existing knowledge of the TPACK structure according to university learners' needs.

A few limitations are observed in the study. The findings are analyzed from a limited amount of data, and additional data may be needed for an in-depth understanding of the TPACK structure according to university learners' needs. More significantly, pre- and post-semester surveys could provide a much clearer focus on the topic. The study depended on responses from students of five different disciplines, which may not be sufficient. Further research efforts should cover more disciplines, more devices, and more students and instructors, as well as conduct pre- and post-semester surveys, to achieve robust findings on a TPACK model based on different university learners' needs.

List of Acronyms
TPACK: Technological Pedagogical Content Knowledge
EEE: Electronics and Electrical Engineering
BBA: Bachelor of Business Administration
BA: Bachelor of Arts
CGPA: Cumulative Grade Points Average

References 1. Matthew, J.K.: What is TPACK? http://www.matt-koehler.com/tpack/what-is-tpack/. Accessed 24 Apr 2018 2. TPACK.ORG: Using the TPACK image. TPACK http://tpack.org/. Accessed 20 Mar 2018 3. Ota, C., DiCarlo, C.F., Burts, D.C.: Training and the needs of university learners. https://joe. org/joe/2006december/tt5.php. Accessed 23 May 2018 4. White, D.L.: Gatekeepers to millennial careers: teachers who adopt technology in education. In: Handbook of Mobile Teaching and Learning, pp. 1–10 (2015) 5. Looi, C., Wong, L., So, H., Seow, P., Toh, Y., Chen, W., Soloway, E.: Anatomy of a mobilized lesson: learning my way. Comput. Educ. 53(4), 1120–1132 (2009) 6. Mishra, P., Koehler, M.J.: Technological pedagogical content knowledge: a framework for teacher knowledge. Teach. Coll. Rec. 108(6), 1017–1054 (2006) 7. Koehler, M.J., Mishra, P., Kereluik, K., Shin, T.S., Graham, C.R.: The technological pedagogical content knowledge framework. In: Handbook of Research on Educational Communications and Technology, pp. 101–111 (2013) 8. Pacific Policy Research Center: 21st-century skills for students and teachers. Kamehameha Schools Research and evaluation. http://www.ksbe.edu/_assets/spi/pdfs/21_century_skills_ full.pdf. Accessed 20 May 2018 9. Gunn, T.M., Hollingsworth, M.: The implementation and assessment of a shared 21st century learning vision. J. Res. Technol. Educ. 45(3), 201–228 (2013) 10. Harvard, C.D.: Comparing frameworks for “21st century skills’’. http://watertown.k12.ma. us/dept/ed_tech/research/pdf/ChrisDede.pdf. Accessed 17 May 2018 11. Johnson, L., Adams, B.S., Estrada, V., Freeman, A.: The NMC horizon report: 2014 library edition. New Media Consortium, Austin, TX. http://cdn.nmc.org/media/2014-nmc-horizonreport-library-EN.pdf. Accessed 20 May 2018 12. Blyth, W.A., Bloom, B.S., Krathwohl, D.R.: Taxonomy of educational objectives. Handbook I: cognitive domain. Br. J. Educ. Stud. 14(3), 119 (1966) 13. Krathwohl, D.R.: A revision of bloom’s taxonomy: an overview. Theory Pract. 41(4), 212– 218 (2002) 14. Roig-Vila, R., Mengual-Andrés, S., Quinto-Medrano, P.: Primary teachers’ technological, pedagogical and content knowledge. Comunicar 23(45), 151–159 (2015) 15. Shulman, L.S.: Those who understand: knowledge growth in teaching. Educ. Res. 15(2), 4– 14 (1986)


16. Lambert, C.: Technology and adult students in higher education: a review of the literature. Issues Trends Educ. Technol. 2(1) (2014) 17. David, L.C.: School of education at Johns Hopkins University-the role of aging in university learning: Implications for instructors in higher education. education.jhu.edu/PD/ newhorizons/lifelonglearning/higher-education/implications/. Accessed 15 May 2018 18. Pronovost, P.J., Mathews, S.C., Chute, C.G., Rosen, A.: Creating a purpose-driven learning and improving health system: the Johns Hopkins medicine quality and safety experience. Learn. Health Syst. 1(1) (2016) 19. Burridge, P: Teacher pedagogical choice. New pedagogical challenges in the 21st century contributions of research in education (2018). https://doi.org/10.5772/intechopen.73201 20. Young, J.R., Young, J.L., Shaker, Z.: Technological pedagogical content knowledge (TPACK) literature using confidence intervals. TechTrends 56(5), 25–33 (2012) 21. Hofer, M., Grandgenett, N.: TPACK development in teacher education. J. Res. Technol. Educ. 45(1), 83–106 (2012) 22. Kimmons, R.: Examining TPACK’s theoretical future. J. Technol. Teach. Educ. 23(1), 53– 77 (2015) 23. Bulfin, S., Parr, G., Bellis, N.: Stepping back from TPACK. Engl. Aust. 48(1), 16–18. http:// newmediaresearch.educ.monash.edu.au/lnm/stepping-back-from-tpack/. Accessed 11 May 2018 24. Elgort, I., Smith, A.G., Toland, J.: Is wiki an effective platform for group course work? Australas. J. Educ. Technol. 24(2), 195–210 (2008) 25. Chickering, A.W., Gamson, Z.F.: Seven principles for good practice in undergraduate education. Biochem. Educ. 17(3), 140–141 (1989) 26. Webster, J., Hackley, P.: Teaching effectiveness in technology-mediated distance learning. Acad. Manag. J. 40(6), 1282–1309 (1997) 27. Stahl, G., Koschmann, T., Suthers, D.D.: Computer-supported collaborative learning. In: The Cambridge Handbook of the Learning Sciences, pp. 409–426 (2006) 28. Guo, Z., Zhang, Y., Stevens, K.J.: A “uses and gratifications” approach to understanding the role of wiki technology in enhancing teaching and learning outcomes recommended citation. http://aisel.aisnet.org/cgi/viewcontent.cgi?article=1021andcontext=ecis2009. Accessed 17 May 2018 29. Gu, X., Zha, C., Li, S., Laffey, J.M.: Design, sharing and co-construction of learning resources: a case of lifelong learning communities in Shanghai. Australas. J. Educ. Technol. 27(2), 204–220 (2011) 30. Krause, J.M., Lynch, B.M.: Faculty and student perspectives of and experiences with TPACK in PETE. Curriculum Stud. Health Phys. Educ. 9(1), 58–75 (2018) 31. Khajeheian, D.: Enterprise social media. Int. J. E-Services Mob. Appl. 10(1), 34–46 (2018) 32. Khajeheian, D.: Telecommunication policy: communication act update. Glob. Media J. Can. Ed. 9(1), 135–141 (2016). (Review of the call of the Energy and Commerce Committee for Communications Act Update) 33. Shroff, R.H., Deneen, C.C., Ng, E.M.: Analysis of the technology acceptance model in examining students’ behavioural intention to use an e-portfolio system. Australas. J. Educ. Technol. 27(4), 600–618 (2011) 34. Braun, V., Clarke, V.: Using thematic analysis in psychology. Qual. Res. Psychol. 3(2), 77– 101 (2006) 35. LATs website: learning activity types. http://activitytypes.wm.edu. Accessed 20 May 2018

Data Analysis of Tourists' Online Reviews on Restaurants in a Chinese Website

Meng Jiajia and Gee-Woo Bock
Sungkyunkwan University, 25-2 Sungkyunkwan-Ro, Seoul, South Korea
[email protected], [email protected]

Abstract. The proliferation of online consumer reviews has led to more people choosing where to eat based on these reviews, especially when they visit an unfamiliar place. While previous research has mainly focused on attributes specific to restaurant reviews and takes aspects such as food quality, service, ambience, and price into consideration, this study aims to identify new attributes by analyzing restaurant reviews and examining the influence of these attributes on star ratings of a restaurant to figure out the factors influencing travelers' preferences for a particular restaurant. In order to achieve this research goal, this study analyzed Chinese tourists' online reviews on Korean restaurants on dianping.com, the largest Chinese travel website. The text mining method, including the LDA topic model and R statistical software, will be used to analyze the review text in depth. This study will academically contribute to the existing literature on the field of the hospitality and tourism industry and practically provide ideas to restaurant owners on how to attract foreign customers by managing critical attributes in online reviews.

Keywords: Online reviews · Text mining · Latent Dirichlet Allocation · Regression analysis

1 Introduction

The rapid development and continuous popularization of e-commerce over the past decade have led to the proliferation of consumer online review websites, such as TripAdvisor and Yelp, where consumers can post their own experiences or evaluations of products they have bought and give specific suggestions to others. These reviews can provide consumers with extensive information about products they are interested in, ranging from restaurants to movies, requiring only that they enter a name and click a button. These reviews can also affect the purchase intention of users visiting such review sites in tourism and hospitality business settings [11]. For instance, if a person wants to eat steak, they may search online for information about steak restaurants and other consumers' comments, negative or positive, and then choose their preferred restaurant. Thus, online reviews are crucial for making purchase decisions in the tourism and hospitality sectors, especially if the decision maker is someone who travels abroad. Meanwhile, the large volume of comment data becomes a valuable and available resource for restaurants, enabling restaurant owners to examine users' experience in a more timely and detailed way, while also enhancing services and customer satisfaction [3].


Restaurant owners endeavor to provide delicious food and perfect service to customers, while it is difficult to satisfy customers. Customers might enjoy the food but dislike the decorations, and sometimes even dislike the servers’ uniform. Customers might leave a word online to express their dissatisfaction about the restaurant. Sometimes, the restaurant staff may provide service without a thorough understanding of foreign tourists’ culture or preferences, which is bound to influence customer satisfaction. In the case of Korea, which tourists from different cultural backgrounds prefer to visit, it is crucial for Korean restaurant owners to know about a variety of customers in advance. Reisinger, Yvette, and Lindsay W. Turner also noted that it is important for a tourist destination to educate its tourism industry employees about the cultural background of its international visitors [16]. Thus, analyzing customers’ reviews is the most efficient way to know about customers’ needs. Previous published studies are limited to taking attributes specific to a consumer’s dining experience - such as food quality, service, ambience, and price into consideration. Yan et al. observed that customers consider four aspects about whether to revisit a restaurant: food quality, price, service quality, and atmosphere [22]. In addition, the type of restaurant also affects customer satisfaction with these four aspects through the use of a text mining method. Mattila pointed out that food quality is the most important factor in the causal restaurant industry, followed by service and atmosphere [13]. Namkung and Jang also found that the key attribute affecting customer satisfaction is food, followed by environment and service [15]. Theoretically, research about Chinese tourists’ dining experience in Korean restaurants should be similar to previous research. Food would be the main attribute when Chinese tourists write about their dining experience in Korea, followed by service and price etc., which does not differ in comparison with previous findings. The objective of this study is the dining experience in Korea, which has a booming entertainment industry, and various aspects of Korean entertainment, including TV dramas, movies and popular music, have contributed a huge amount of financial revenue towards the national economy. This cultural phenomenon is known as ‘Hallyu’ (Korean Wave), which has greatly driven Korean tourism and is one of the main reasons why Chinese people visit Korea. This study intends to find whether the attributes of Chinese tourists would be identical to those mentioned in other research related to this cultural context. However, it is difficult to find research based on Chinese websites about the impact of Chinese tourists’ dining experience in Korea from their perspective of Korean culture. In addition, methodologically, empirical studies on online restaurant reviews, with a few exceptions, use only a few pieces of metainformation, such as consumer-assigned numerical and easily understood rating numbers. Moreover, it is rare to use text mining method on Chinese online review websites to study Chinese tourists’ dining experience in Korea. Therefore, the purpose of this study is to address these gaps by using text mining analysis based on the Chinese online review website dianping.com to identify the factors influencing Chinese tourists’ dining experiences in Korea. What requires special attention is whether the cultural factor of ‘Hallyu’ will be significant during the decision process. 
This study used Latent Dirichlet Allocation (LDA) to analyze
Chinese customer online reviews to identify the attributes that customers consider when writing comments about the restaurants they have visited, and then conducted a regression model to investigate the relationship between the new factors identified and their consumer star rating numbers. Specifically, this study seeks answers to the following questions: (1) What attributes do Chinese tourists consider when they reflect online on their dining experience in Korean restaurants? (2) How do the attributes from online reviews affect Chinese tourists’ online ratings of Korean restaurants?
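A minimal R sketch of such a pipeline is shown below. The package choices (jiebaR for Chinese word segmentation, tm for the document-term matrix, topicmodels for LDA), the column names of the review data frame, the stop-word file and the number of topics are all assumptions made for illustration; they are not taken from the study itself.

library(jiebaR)       # Chinese word segmentation
library(tm)           # corpus handling and document-term matrix
library(topicmodels)  # Latent Dirichlet Allocation

# 'reviews' is assumed to be a data frame with a column 'text' (Chinese review text)
# and a column 'stars' (the 1-5 rating given by the reviewer).
seg <- worker(stop_word = "stopwords_zh.txt")   # hypothetical stop-word list
tokens <- vapply(reviews$text,
                 function(x) paste(segment(x, seg), collapse = " "),
                 character(1))

dtm <- DocumentTermMatrix(VCorpus(VectorSource(tokens)))

lda <- LDA(dtm, k = 5, method = "Gibbs")        # 5 topics is an illustrative choice
terms(lda, 10)                                  # top 10 terms per topic = candidate attributes
theta <- posterior(lda)$topics                  # per-review topic proportions

# Regress the star rating on the topic proportions (drop one topic to avoid collinearity).
summary(lm(reviews$stars ~ theta[, -1]))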

2 Literature Review

2.1 Online Restaurant Reviews

In the past, professional critics or journalists mainly wrote reviews and these were published in newspapers or magazines. For instance, the Michelin Guide evaluates hotels and restaurants using evaluation criteria established by their professional appraisers. With the advent of the Web 2.0 era, the Internet environment has been gradually optimized, and ordinary people have the power to write their own opinions on company websites or third-party websites. Many of the opinions are these consumers’ evaluations of the consumption process and consumption content after purchasing goods or services. It is their most direct feedback related to products and services, positive or negative, which people often voluntarily share with others online. The information that consumers provide has great authenticity and often has great commercial value. Everyone has the right to write and understand reviews of products in which they are interested. Online reviews have become omnipresent, and Purcell found that 31% of adult American Internet users have had the experience of rating a person, product or service [10]. It is rare to blindly make a purchase decision without reading through several online reviews. A user review report from SEO’s Bright Local shows that 84% of consumers trust in online reviews and make decisions based on comments. The result of a survey that Piller conducted in 1999 showed that nearly 60% of 5500 respondents considered that consumer-generated comments were more valuable than expert comments. Almost two-thirds of shoppers think that online reviews are an essential part of the decision-making process. When wanting to know the best places to go or even which movie to watch, they are more likely to turn to Google, Yelp and TripAdvisor than they are to a newspaper or even their friends and family. Therefore, it is not surprising that people are increasingly relying on online reviews in their daily life to help them choose a restaurant to eat in [19]. As Mudambi and Schuff noted, “online customer reviews (also known as electronic word of mouth) can be defined as peer-generated product evaluations posted on company or third-party websites” [14]. Review websites (e.g. TripAdvisor, Yelp) allow users to post open-ended comments about product or personal judgments about consumption experiences, usually together with a numerical star rating on a scale of 1 to 5.


For example, users can enter text and rate any restaurant with 5 stars, which means the user is very satisfied with the restaurant, and other people can also see all these reviews and stars. On these platforms, customers can search information related to restaurants or hotels based on other customers' prior experiences. Meanwhile, service providers can obtain timely feedback from customers and make analyses based on the collected data to improve their products and services, while making them more suitable for users' real needs. These numerical star ratings are quantitative, can be easily obtained from customers, and are easily calculated using statistical computing methods. Although text review comments are relatively difficult to analyze and calculate because of their qualitative nature, text reviews can make sense of simple star rating numbers because they allow reviewers to explain the reasons why they rated a service the way they did. Therefore, research on online reviews is not only vital for reviewers, but also for product or service providers.

2.2 Dining Experience

As shown in Table 1, prior researchers generally believe that food, service, atmosphere and price are the four basic aspects that comprise the consumer’s dining experience, but the order of these affective factors is uncertain. Blank suggested that food, service and decor are typical parameters of restaurant reviews used by Zagat ratings and AAA diamond ratings [1]. In 2006, Andaleeb and Conway concluded that customers wanted contact personnel to “respond” to their presence with courteous, helpful, and knowledgeable attitudes and customer satisfaction was influenced mostly by employees’ service quality, followed by price and food quality [17]. Gupta, McLaughlin and Gomez suggested that the attributes comprising consumers’ dining experience is food quality, price, greetings and service [4]. According to Zhang et al., food quality is the main variable that influences customer satisfaction [23]. In the study of four themed restaurants in Singapore, MacLaurin et al. pointed out that food quality and menu were two important elements, along with concept, service quality, atmosphere, convenience, value, product merchandise, and pricing [12]. The restaurant’s location and convenient transportation greatly influence consumers’ consumption decisions in an unfamiliar environment. For instance, when traveling overseas, tourists will choose to dine at restaurants that are located around popular attractions or with convenient traffic arrangements. Hyun indicated that food quality, service quality, price, location, and environment are five dimensions that influence restaurant patrons’ behavior [6]. According to Sulek and Hensley, “food quality, the restaurant’s atmosphere and the fairness of the seating procedures had significant effects on customer satisfaction” [20]. Tzeng et al. claimed that successfully setting up a new restaurant in an advantageous location is the most indispensable factor [21]. Most researchers have used the survey method as the primary tool for collecting data to analyze customer satisfaction. For instance, James used the SERVQUAL survey to measure the service quality that customers perceived during their dining process [2], which typically requires respondents to answer questions that researchers have predetermined rather than freely describing their dining experience, which may deviate from expectations. Only a few researchers chose to use data mining method for studies on measuring customer satisfaction in the restaurant field. Moreover, almost no


Table 1. A summary of dining experience attributes
Dining experience attributes considered: Food, Service quality, Price, Environment, Location
Authors: MacLaurin, D.J., and MacLaurin, T.L. (2000); Mattila, A.S. (2001); Sulek, J.M., and Hensley, R.L. (2004); Saad Andaleeb, S., and Conway, C. (2006); Blank, G. (2006); Gupta, S., McLaughlin, E., and Gomez, M. (2007); Wall, E.A., and Berry, L.L. (2007); Namkung, Y., and Jang, S. (2008); Hyun, S.S. (2010); Ryu, K., and Han, H. (2010); Pantelidis, I.S. (2010); Parsa, H.G., Gregory, A., Self, J., and Dutta, K. (2012); Zhang, Z., Zhang, Z., and Law, R. (2014)
(The per-study attribute check marks are not recoverable from the extracted layout.)

Moreover, almost no researchers have used big data analysis methods to study Chinese tourists’ dining experiences in Korea based on Chinese websites. Korean pop culture, including Korean music, dramas and variety shows, became extremely popular in China by the early 2000s. Therefore, Chinese tourists tend to value the element of Hallyu when they visit and evaluate Korean restaurants. Whether Korean culture affects Chinese tourists’ dining experience in Korea has recently become a particularly important issue.

2.3 Hallyu

Korean pop culture, known as the ‘Korean Wave’ or ‘Hallyu’, encompasses television dramas, popular music, celebrities, movies, animation, and games, and its popularity has recently expanded to the Middle East, the U.S., Europe and even South America [8]. The term ‘Korean Wave’ first appeared through a broadcasting planning company operated by Koreans in Beijing, China; in mid-1999 the Chinese media also began to use the term, and the mass media of other countries soon followed. The term ‘Korean Wave’ was in official use by the end of the 20th century. Korean celebrities at the center of the ‘Korean Wave’ have become very popular all over the world.


These celebrities have extremely large fan bases, not just in Korea but also in other countries. Fans imitate the behavior of the idols that fascinate them and sometimes even travel to Korea to catch a glimpse of them. Shim stated that the popularity of Korean pop culture has made regional fans eager to learn the Korean language and travel to Korea [18]. With the wide acceptance of Korean pop culture, the popularity of Korean celebrities and television dramas has a strong effect on preferences for Korean restaurants [9]. According to Kim, Agrusa, Chon, and Cho, because of the successful broadcasting of the Dae Jang Geum drama series, the number of Korean restaurants in Hong Kong increased rapidly, as did interest in Korean food, and the number of Hong Kong tourists traveling to Korea also grew significantly [7]. In 2014, the Korean drama My Love from the Star achieved enormous popularity in China, with cumulative online views reported at more than 3 billion. Korean TV dramas affect the consumption of Chinese audiences, resulting in large sales in China of the cosmetics used by the protagonists of Korean dramas, the clothing they wear, and the foods consumed in these dramas. According to the Korea Cultural Tourism Research Institute, the number of Chinese visitors to South Korea increased sharply from 4.32 million in 2013 to 6.12 million in 2014. Therefore, Hallyu could be one of the crucial attributes affecting Chinese tourists’ dining experiences in Korea.

3 Methodology

This study will use data posted on the Chinese online review website dianping.com, which provides billions of cumulative reviews of all types of businesses. Data preprocessing, including word segmentation, stop-word removal and feature selection, will be carried out after the data collection stage. The next step will use the topic model method to perform an in-depth analysis of the preprocessed data. After the attributes are identified from the review text, this study will examine how significant these attributes are for customer satisfaction (consumers’ overall star ratings).
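The preprocessing described above (word segmentation and stop-word removal for Chinese review text) could look like the following minimal sketch. The jieba segmenter and the stop-word list file name are illustrative assumptions, since the paper does not name its tooling.

```python
# Minimal preprocessing sketch: Chinese word segmentation + stop-word removal.
# Assumes the `jieba` package and a plain-text stop-word list (one word per
# line); neither tool is named in the paper.
import jieba

def load_stopwords(path="stopwords_zh.txt"):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def preprocess(review_text, stopwords):
    # Segment the review into words, then drop stop words and 1-character tokens.
    tokens = jieba.lcut(review_text)
    return [t for t in tokens if t not in stopwords and len(t) > 1]

if __name__ == "__main__":
    stops = load_stopwords()
    sample = "食材都是比较新鲜的, 口感也不错, 味道也好"
    print(preprocess(sample, stops))
```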

3.1 Data Collection

The data used in this study are collected from dianping.com, one of the largest review websites in China. It is a third-party consumer guide website where merchants can create a webpage to introduce their stores, and users can freely visit these webpages to obtain information and post their own comments. Dianping.com is a leading review website in China, hosting more than 4.4 million merchants and 310 million monthly active users. As mentioned in Sect. 2, people currently rely heavily on online reviews before making decisions, especially Chinese tourists; this group prefers its own ‘review and advice tools’, and dianping.com is their most frequently used travel advice site. The first reason why this study chose dianping.com is its widespread use among Chinese online users: the platform covers over 2,800 cities across China and nearly 200 countries. The second reason is that the website allows users to publish reviews anonymously only after logging in with a valid email address or phone number, which helps ensure the authenticity of user comments.


The third reason is that the reviews customers leave consist of textual comments and numerical star ratings for the overall experience, taste, environment and service, which allows for additional related studies.

Fig. 1. Screenshot of a dianping.com restaurant page

The current study focuses on Chinese tourists’ online restaurant reviews written when they travel to Korea. Therefore, review data were collected only for restaurants located in Seoul, which has the most tourist attractions in Korea. Figure 1 provides a screenshot of a customer review of a restaurant. Data were collected for restaurants throughout Seoul, and each review contains the restaurant’s name, the reviewer’s username, the overall star rating, the ratings of taste, physical environment and service quality, and the text comment. Table 2 shows an example of a customer review. Representative Korean restaurants, such as those focusing on fried chicken or Korean BBQ, were included in the data analysis; cafes were excluded.

3.2 Analysis Method

Data collected from dianping.com will be preprocessed so that the Latent Dirichlet Allocation (LDA) method can be applied to investigate the first research question: what attributes do Chinese tourists consider when they dine in Korea? A regression model will then be estimated after the LDA process to explore the relationship between the attributes and customer satisfaction (overall star ratings).

Latent Dirichlet Allocation. Text mining technology can efficiently discover topic information in massive text data and concisely present the main content of document collections to users. In most text mining research, the most commonly used analysis method is to extract the words from the text and count their frequencies, but this approach loses much of the semantic information between texts. A topic model instead models the latent topics embedded in text and mines out potential semantic variables. A topic is a series of related words that represent a specific theme occurring in a text or a collection of documents. For instance, when users write a comment to evaluate a restaurant, they first determine which aspects of the restaurant to evaluate; once they decide to evaluate the restaurant’s service, a series of words related to the service, such as ‘server’ or ‘helpful’, may appear to express this topic. Thus, topic models help researchers extract valuable information. Latent Dirichlet Allocation is the representative topic model algorithm. Blei et al. originally proposed it in 2003, and it can be seen as a generalization of probabilistic latent semantic indexing (PLSI) [5]. LDA is essentially a bag-of-words model: it assumes that each document is a collection of baskets (topics) formed by sets of words, with no order or sequential relationship between the words.

Table 2. An example of user review data
Restaurant name: 三岔口肉铺
Username: 番茄和鱼
Stars: 4   Taste: 4   Environment: 5   Service: 5
Text: 去韩国肯定要尝一尝韩式烤肉, 做了些攻略, 在弘大的三岔口肉铺感觉不错, 而且又是YG老板开的, 说不定运气好还能偶遇到明星哦, 追星小伙伴很适合, 点了两块肉, 还有喝的等等, 食材都是比较新鲜的, 口感也不错, 小菜送的也挺多, 味道也好! 推荐尝试
(Approximate translation: "You definitely have to try Korean BBQ when visiting Korea. After some research, 三岔口肉铺 in Hongdae looked good, and it is run by the boss of YG, so with luck you might even bump into a celebrity, which is great for fans. We ordered two cuts of meat plus drinks; the ingredients were quite fresh, the texture was good, lots of complimentary side dishes, and it tasted great! Recommended.")

A basket refers to a topic, that is, a set of words chosen from a vocabulary. One document can contain multiple topics, and each word in the document is generated by one of the topics, following a topic-word distribution. Consumers write reviews with words picked from the limited vocabulary in their minds to express their ideas about a product, and each review is created by one or more topics in mixed proportions. Thus, the LDA method is an efficient way to find the topics embedded in massive document collections and their implied relationships. This study assumes a corpus D that consists of M reviews, each containing a set of N words, with K topics expressed in the corpus of all M reviews that customers have posted in a given time period. Customers choose words from the latent K topics to describe their dining experience in Korea, and each topic is characterized by a distribution over words. The LDA process is shown in Fig. 2. W = (w_1, w_2, …, w_M) represents the words and Z = (z_1, z_2, …, z_M) the topics in the corpus. W is the only observable variable in this model; Z and θ are latent variables. W_{m,n} refers to the n-th word in review m, and Z_{m,n} is the topic assigned to the n-th word in review m. Φ_k and θ_m are the word distribution for topic k and the topic distribution for review m, respectively.

Fig. 2. Process of the LDA model (plate notation with hyperparameters α and β, topic distributions θ_m, word distributions Φ_k, topic assignments Z_{m,n} and observed words W_{m,n})

The boxes are ‘plates’ representing replicated entities: the outer plate represents documents (in this case, reviews), while the inner plate represents the repeated word positions in a given document, each position being associated with a choice of topic and word. The parameters α and β are Dirichlet hyperparameters. Based on these definitions, the generative process for the reviews in the corpus can be divided into the following steps:
(1) Draw each per-corpus topic distribution Φ_k ~ Dir(β) for k ∈ {1, 2, …, K}.
(2) For each document, draw the per-document topic proportions θ_m ~ Dir(α).
(3) For each document and each word position, draw the per-word topic assignment Z_{m,n} ~ Multi(θ_m).
(4) For each document and each word position, draw the observed word W_{m,n} ~ Multi(Φ_{Z_{m,n}}).
The joint probability can then be written as

p(\Phi_{1:K}, \theta_{1:M}, Z_{1:M}, W_{1:M}) = \prod_{k=1}^{K} p(\Phi_k \mid \beta) \prod_{m=1}^{M} p(\theta_m \mid \alpha) \prod_{n=1}^{N} p(Z_{m,n} \mid \theta_m)\, p(W_{m,n} \mid \Phi_{1:K}, Z_{m,n})    (1)
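A minimal sketch of how the LDA step could be run on the preprocessed reviews is shown below, assuming the gensim library; the paper does not name an implementation, and the filtering thresholds and number of topics K are illustrative.

```python
# Minimal LDA sketch with gensim (an assumed implementation choice; the paper
# does not name a specific library). `tokenized_reviews` is a list of token
# lists produced by the preprocessing step shown earlier.
from gensim import corpora
from gensim.models import LdaModel

def fit_lda(tokenized_reviews, num_topics=10, passes=10, seed=42):
    dictionary = corpora.Dictionary(tokenized_reviews)
    dictionary.filter_extremes(no_below=5, no_above=0.5)   # illustrative thresholds
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_reviews]
    lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
                   num_topics=num_topics, passes=passes,
                   alpha="auto", eta="auto", random_state=seed)
    return lda, dictionary, bow_corpus

# Example usage: print the top words per topic and per-review topic proportions.
# lda, dictionary, bow = fit_lda(tokenized_reviews, num_topics=8)
# for k, words in lda.print_topics(num_words=8):
#     print(k, words)
# doc_topics = [lda.get_document_topics(d, minimum_probability=0.0) for d in bow]
```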

Regression Model. Consumers express their satisfaction with a restaurant using an overall star rating on a 5-point scale, where 1 means ‘the lowest level of satisfaction’ and 5 means ‘the highest level of satisfaction’. This study uses a linear regression model to examine how significant the identified attributes are for the customer’s overall star rating. The model is shown in Formula (2), where the dependent variable (stars) is the consumer’s overall star rating of the restaurant and the independent variables are the key attributes identified from the LDA results:

stars = \beta_0 + \beta_1 \cdot attribute_1 + \beta_2 \cdot attribute_2 + \beta_3 \cdot attribute_3 + \beta_4 \cdot attribute_4 + \cdots + \varepsilon    (2)
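A sketch of estimating Formula (2) follows, assuming the per-review topic proportions from LDA are used as the attribute variables; the paper does not prescribe an estimation package, so statsmodels OLS is used here only for illustration.

```python
# Sketch of the attribute-satisfaction regression in Formula (2), assuming the
# per-review topic proportions from LDA are the attribute variables (an
# assumption; statsmodels OLS is an illustrative choice).
import numpy as np
import statsmodels.api as sm

def fit_star_regression(doc_topic_matrix, star_ratings):
    """doc_topic_matrix: (n_reviews, K) array of topic proportions per review.
    star_ratings: length-n_reviews array of overall star ratings (1-5)."""
    X = sm.add_constant(np.asarray(doc_topic_matrix))   # adds the intercept b0
    y = np.asarray(star_ratings, dtype=float)
    return sm.OLS(y, X).fit()

# print(fit_star_regression(doc_topics, stars).summary())
```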

4 Expected Contributions

4.1 Academic Contributions

In previous research, price, atmosphere, taste and similar factors have been the main variables in restaurant reviews. This study will extend the literature in this field by investigating the influence of cultural factors, specifically in the context of Chinese tourists’ online reviews of Korean restaurants in the hospitality and tourism industry. This study intends to determine whether Hallyu is a new attribute that influences Chinese tourists’ dining experience in Korea. In fact, it is expected that Hallyu will become one of the most important variables when Chinese tourists evaluate Korean restaurants.

4.2 Practical Contributions

It is expected that the findings of this research will also have several practical implications for restaurant owners and platform operators.


The results of this study can be used to identify the attributes that Chinese tourists are most concerned with, and restaurant operators could use this knowledge to make improvements that attract more customers. Web developers could update webpage information to increase the number of users. Furthermore, restaurants can attract new customers by promoting the unique aspects of Korean culture.

References
1. Blank, G.: Critics, ratings, and society: the sociology of reviews. Rowman & Littlefield Publishers, Lanham (2006)
2. Carman, J.M.: Consumer perceptions of service quality: an assessment of the SERVQUAL dimensions. J. Retail. 66(1), 33 (1990)
3. Gan, Q., Ferns, B.H., Yu, Y., Jin, L.: A text mining and multidimensional sentiment analysis of online restaurant reviews. J. Qual. Assur. Hospitality Tourism 18(4), 465–492 (2017)
4. Gupta, S., McLaughlin, E., Gomez, M.: Guest satisfaction and restaurant performance. Cornell Hotel Restaurant Adm. Q. 48(3), 284–298 (2007)
5. Hofmann, T.: Probabilistic latent semantic analysis. In: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 289–296. Morgan Kaufmann Publishers, July 1999
6. Hyun, S.S.: Predictors of relationship quality and loyalty in the chain restaurant industry. Cornell Hospitality Q. 51(2), 251–267 (2010)
7. Kim, S.S., Agrusa, J., Chon, K., Cho, Y.: The effects of Korean pop culture on Hong Kong residents’ perceptions of Korea as a potential tourist destination. J. Travel Tourism Mark. 24(2–3), 163–183 (2008)
8. Kim, S.S., Lee, H., Chon, K.S.: Segmentation of different types of Hallyu tourists using a multinomial model and its marketing implications. J. Hospitality Tourism Res. 34(3), 341–363 (2010)
9. Lee, B., Ham, S., Kim, D.: The effects of likability of Korean celebrities, dramas, and music on preferences for Korean restaurants: a mediating effect of a country image of Korea. Int. J. Hospitality Manag. 46, 200–212 (2015)
10. Lenhart, A., Purcell, K., Smith, A., Zickuhr, K.: Social Media & Mobile Internet Use among Teens and Young Adults. Millennials. Pew Internet & American Life Project (2010)
11. Liu, Z., Park, S.: What makes a useful online review? Implication for travel product websites. Tour. Manag. 47, 140–151 (2015)
12. MacLaurin, D.J., MacLaurin, T.L.: Customer perceptions of Singapore’s theme restaurants. Cornell Hotel Restaurant Adm. Q. 41(3), 75–85 (2000)
13. Mattila, A.S., Wirtz, J.: Congruency of scent and music as a driver of in-store evaluations and behavior. J. Retail. 77(2), 273–289 (2001)
14. Mudambi, S.M., Schuff, D.: Research note: what makes a helpful online review? A study of customer reviews on Amazon.com. MIS Quarterly, pp. 185–200 (2010)
15. Namkung, Y., Jang, S.: Are highly satisfied restaurant customers really different? A quality perception perspective. Int. J. Contemp. Hospitality Manag. 20(2), 142–155 (2008)
16. Reisinger, Y., Turner, L.W.: Cultural differences between Asian tourist markets and Australian hosts, Part 1. J. Travel Res. 40(3), 295–315 (2002)
17. Saad Andaleeb, S., Conway, C.: Customer satisfaction in the restaurant industry: an examination of the transaction-specific model. J. Serv. Mark. 20(1), 3–11 (2006)
18. Shim, D.: Hybridity and the rise of Korean popular culture in Asia. Media Cult. Soc. 28(1), 25–44 (2006)


19. Shindell, D., Kuylenstierna, J.C., Vignati, E., van Dingenen, R., Amann, M., Klimont, Z., Anenberg, S.C., Muller, N., Janssens-Maenhout, G., Raes, F., Schwartz, J.: Simultaneously mitigating near-term climate change and improving human health and food security. Science 335(6065), 183–189 (2012)
20. Sulek, J.M., Hensley, R.L.: The relative importance of food, atmosphere, and fairness of wait: the case of a full-service restaurant. Cornell Hotel Restaurant Adm. Q. 45(3), 235–247 (2004)
21. Tzeng, G.H., Teng, M.H., Chen, J.J., Opricovic, S.: Multicriteria selection for a restaurant location in Taipei. Int. J. Hospitality Manag. 21(2), 171–187 (2002)
22. Yan, X., Wang, J., Chau, M.: Customer revisit intention to restaurants: evidence from online reviews. Inf. Syst. Front. 17(3), 645–657 (2015)
23. Zhang, Z., Zhang, Z., Law, R.: Relative importance and combined effects of attributes on customer satisfaction. Serv. Ind. J. 34(6), 550–566 (2014)

Body of Knowledge Model and Linked Data Applied in Development of Higher Education Curriculum

Pablo Alejandro Quezada-Sarmiento(1,2), Liliana Enciso(3), Lorena Conde(4), Monica Patricia Mayorga-Diaz(5), Martha Elizabeth Guaigua-Vizcaino(5), Wilmar Hernandez(6), and Hironori Washizaki(7)

1 Programa de Doctorado en Ciencias y Tecnologias de la Computación para Smart Cities, Universidad Politécnica de Madrid, Madrid, Spain ([email protected])
2 Dirección de Investigation y Posgrados, Universidad Internacional del Ecuador, Quito, Ecuador ([email protected])
3 Departamento de Ciencias de la Computación y Electrónica, Universidad Técnica Particular de Loja, Loja, Ecuador ([email protected])
4 Escuela de Informática y Multimedia, Universidad Internacional del Ecuador, Quito, Ecuador ([email protected])
5 Facultad de Sistemas Mercantiles, Universidad Regional Autónoma de Los Andes, Ambato, Ecuador ({ua.monicamayorga,ua.marthaguaigua}@uniandes.edu.ec)
6 Facultad de Ingenieria y Ciencias Aplicadas, Universidad de las Américas, Campus Queri, Quito, Ecuador ([email protected])
7 Department of Computer Science and Engineering, Waseda University, Tokyo, Japan ([email protected])

Abstract. The term corpus of knowledge, or Body of Knowledge (BOK), is used to describe a set of structures that codify all the concepts, terms, techniques, and sustainable educational activities that constitute the domain of the exercise of a profession or a specific area of knowledge. This is why it is important for both the scientific and the educational communities to carry out research on BOK. In short, BOK research proposes new strategies for describing, representing, and combining the above-mentioned set of structures with different computational techniques. One of these techniques is Linked Data, which emerged in the Semantic Web context. This paper describes the study and implementation of Linked Data technologies for the publication of data related to the academic offer of a higher education center, using a Linked Data publication methodology supported by the BOK description model. From this process, a web application was obtained in which the BOK model and Linked Data principles were combined for the visualization of academic data, as a contribution to curricular development. The next steps are to implement the BOK model in combination with other semantic web techniques and artificial intelligence principles.


Keywords: Bodies of Knowledge · Curriculum · Education · Linked Data · Semantic Web

1 Introduction

BOK describes the relevant knowledge for a discipline, and it is necessary to reach consensus between the Knowledge Areas (KA) and related disciplines (RD) [1]. The semantic web is considered an extension of the current web whose objective is that web resources can be consumed by both people and software agents [2]. Tim Berners-Lee, known as the “father of the web”, promotes, through the W3C, languages and methodologies that contribute to the development of the Semantic Web. Linked Data arises as an effort of the Semantic Web that proposes the publication of data following the Resource Description Framework (RDF) model, the creation of URIs for the connection and exchange of data, and SPARQL for querying them [2]. The number of organizations that publish their data as sets of linked data is steadily increasing, and the published data correspond to different areas: education, government, health, and legislation, among others [3]. In the field of education, [4] points out that the greatest needs in relation to the semantic web are the following: changes in social demands, changes in the teaching-learning processes, and organizational changes of Higher Education Institutions (HEIs). This paper presents a brief introduction to the concepts of BOK and the semantic web, followed by the study of Linked Data and its principles. It also addresses a Linked Data methodology, recommended tools, and vocabularies for the publication of linked data supported by the BOK model. In addition, some of the projects and communities created with the purpose of developing tools and technologies to support the semantic web and Linked Data are pointed out. The case study presented here was carried out in a higher education institution, in order to publish the data of the academic offer as a set of linked data supported by the principles of BOK. The outline of this paper is as follows: Sect. 2 presents the background of BOK, the semantic web, some principles of Linked Data, and the application of these concepts in the educational context. The main result is presented in Sect. 3. Section 4 is devoted to the conclusions of this paper.

2 Background

According to [5], one of the main concerns of the software industry is to develop the talent of its human resources, since the quality and innovation of its products and services depend to a large extent on the knowledge, capacity and talent of its software and system engineers.


The knowledge already exists, and the objective of a BOK is to establish a consensus on the subset of the knowledge core that characterizes a discipline. BOKs are used by those interested in expanding their capabilities and professional development [6]. Researchers can find them useful to identify the technology applicable to their research and to help define the competencies needed for research teams. The process of building a BOK should also help to highlight similarities between disciplines; for example, some of the techniques used in materials science are common to both chemistry and physics [7]. Regarding the levels of knowledge in a BOK, they define the amount of knowledge that will be offered within a specific level of an educational program [8]. A BOK has a specific structure according to the area of science or engineering [9].

2.1 Semantic Web

In [2], Tim Berners-Lee, through the W3C, promotes the development of the Semantic Web by means of two important Semantic Web technologies: the Extensible Markup Language (XML) and the Resource Description Framework (RDF). The main objective of the Semantic Web is to enable software agents to interpret the meaning of web content in order to help users carry out their tasks [10]. It also aims at improving existing systems to optimize the time required for advanced searches [11].

2.2 Principles of Linked Data

Tim Berners-Lee points to Linked Data as part of the Semantic Web. The Semantic Web is not just about putting data on the web; it is about making links between the data so that a person or a machine can explore the data network. Unlike the current web, Linked Data links data described in RDF through the use of URIs. The Data Web is based on four principles described at the W3C by Berners-Lee:
1. Use a URI to identify each resource published on the Web.
2. Publish the data at HTTP-based URIs, to ensure that any resource can be looked up and accessed on the Web.
3. Provide useful, detailed or extra information about the resource that is accessed through an HTTP-based URI.
4. Include links to other URIs related to the data contained in the resource [12].
Making the data available to society means that any person, company or organization can build on them a new idea that results in new data, knowledge, process improvements, and added value. In the same context, Tim Berners-Lee, as indicated in Table 1, proposes that the data be open and linked.


Table 1. Aspects for data to be opened and linked - Linking Open Data.

- Use URIs as unique names for resources.
- Accessibility through the HTTP protocol.
- Representation of RDF data and SPARQL queries.
- Include links to other URIs.
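To make the second and third principles concrete, the following minimal sketch dereferences an HTTP URI with content negotiation and parses the returned RDF; the DBpedia resource URI is only an illustrative example and is not part of this case study.

```python
# Minimal sketch of dereferencing a Linked Data URI: request the resource over
# HTTP with content negotiation and parse the returned RDF. The example URI is
# an illustrative assumption (any dereferenceable Linked Data resource works).
import requests
from rdflib import Graph

def dereference(resource_uri):
    resp = requests.get(resource_uri,
                        headers={"Accept": "text/turtle"},
                        timeout=30)
    resp.raise_for_status()
    g = Graph()
    g.parse(data=resp.text, format="turtle")
    return g

# g = dereference("http://dbpedia.org/resource/Linked_data")
# for s, p, o in list(g)[:10]:
#     print(s, p, o)
```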

2.3 Linked Data Publishing Methodology

Like software engineering, the data publishing process suggests a life cycle. Figure 1 shows each of the phases of the Linked Data methodology. The description of each phase is as follows:

Fig. 1. Linked Data methodology used.

Phase 1 - IDENTIFICATION OF DATA SOURCES: In this phase the context of the application and the nature of the data to be published were identified, considering the following questions:
• What data will be provided?
• How will the data be delivered?
• Who will consume the data?
Phase 2 - MODELING OF VOCABULARIES: In this phase vocabularies or ontologies were defined. A vocabulary that fits the needs of the project was developed, reusing existing vocabularies and supported by the principles of the bodies of knowledge and the proposed model.


Phase 3 - DATA CLEANING: To ensure that the information will be accessed correctly, it is important to carry out this phase before publishing, testing and debugging the data. In this research, the data cleaning process included validation and data correction in order to achieve quality data.
Phase 4 - GENERATION OF RDF DATA: This phase consisted of taking the data sources chosen in the specification activity and transforming them into RDF, according to the developed vocabulary and the reused vocabularies. In this phase, the HTTP-based URI scheme was also defined.
Phase 5 - PUBLICATION AND EXPLOITATION: After the conversion of the data to RDF format, it was necessary to store the data in an RDF store. This phase considered the implementation of a SPARQL endpoint, through which a query service is offered via the SPARQL language.
Phase 6 - DISPLAY OF RDF DATA: After data publication, Pubby was used to visualize the triples in HTML format.

2.4 Tools

For the publication of Linked Data, there are several tools that facilitate the execution of each phase of the life cycle. Table 2 shows some of the most commonly used tools.

Table 2. Linked Data tools
Name | Tool | Category
D2R Server | http://www4.wiwiss.fu-berlin.de/bizer/d2r-server/ | Storage
Open Virtuoso | http://virtuoso.Openlinksw.com/ | Storage
Fuseki | https://jena.apache.org/documentation/servingdata/ | Storage
4Store | http://4store.org/ | Storage
Sesame | http://www.openrdf.org/ | Storage
Apache Jena | http://jena.sourceforge.net | Generation, Storage
RDFLib | https://code.google.com/p/rdflib/ | Library
Swoogle | http://lov.okfn.org/dataset/lov/search/#s=people | Semantic seeker
Silk Server | https://www.assembla.com/spaces/silk/wiki/Silk_Server | Semantic seeker
Open Refine | http://code.google.com/p/google-refine/ | Generator/Validator
RDF Validator | http://www.w3.org/RDF/Validator/ | Validator
OOPS! | http://oeg-lia3.dia.fi.upm.es/oops/index-content.jsp | Validator

2.5 Vocabularies

Within the data publishing process, it was necessary to use vocabularies. In [13], it is mentioned that, as good practice, vocabularies should be reused whenever possible, so it is important to know which vocabularies are widely used and can be reused. For the development of this research, the following vocabularies have been reused:
1. FOAF (Friend of a Friend): a vocabulary for describing people and their relationships.
2. Dublin Core: an RDF vocabulary, with well-defined properties, used for the creation of document metadata.
3. AIISO: a vocabulary used to define classes and properties of the internal organizational structure of an academic institution.
4. ORG: an ontology used to represent data about organizations.
5. TEACH: a lightweight vocabulary that provides terms for teachers to relate things in their courses.

2.6 Projects and Communities of Linked Data

• DBpedia: a project aimed at extracting the structured content of the information created as part of the Wikipedia project.
• Open University Data: a platform that presents the data available in several institutional repositories of the Open University and makes them available for reuse. The data sets refer to the publications, courses, and material produced by the Open University, and are available in RDF and through SPARQL.
• SWEO: a working group of the Linking Open Data project, established for the discussion of already available themes and ideas, which considers low-level tools (mainly for developers) used in applications aimed at end users.
• Linking Open Data: the Open Data movement aims to make data freely available to everyone under open licenses, converting those data into RDF, complying with the principles of Linked Data, and then publishing them on the web.
• Red Linked Data: created to facilitate the exchange and transfer of knowledge in the area of the Data Web among national research groups associated with universities, technological centers, public administrations, and companies.
• LOD2: a European project that develops documents, technologies and tools that support the Linking Open Data initiative.
• LDOW (Linked Data on the Web): provides a forum for the presentation of the latest research on Linked Data and promotes the research agenda in this area, covering Linked Data deployment in different application domains.

2.7 Linked Data in Higher Education

In general, universities hold a large amount of administrative, academic, scientific and technical information, databases, and repositories, among others. This information is found in different representations, such as PDF, HTML, and Excel files.


In the context of higher education, the semantic web seeks to organize knowledge in such a way that useful information can be gathered much faster than with a standard format. Before implementing Linked Data at universities, it was necessary to adopt a publication methodology; some are proposed in [14, 15]. According to Linking Open Data standards, it is also important to identify the most relevant data and whether they can be made public. The information handled at universities is very broad, so it is necessary to define the context in which Linked Data technologies are to be implemented, a task that requires the help of a domain expert [16]. In this research, we have focused specifically on the academic offer of the National University of Loja. The academic offer is public-domain information that can be seen by everyone, but it is currently isolated and scattered, which makes it necessary to structure and organize the data so that users can access it in a simple and fast manner.

3 Results

3.1 Model of Bodies of Knowledge (BOK)

In the present paper, a BOK model based on [17–19] was developed, which serves as a support for the use of semantic technologies and knowledge retention in the educational context of a higher education organization. Figure 2 shows the aforementioned BOK model.

Fig. 2. Model of Bodies of Knowledge.


As mentioned above, this paper intends to implement Linked Data technologies and principles based on the BOK model in the educational field, specifically for the publication of data referring to the academic offer of a higher education institution. Likewise, the BOK model was implemented using circle-packing and software design patterns, obtaining a visualization of the data generated by the project. The institution of higher education analyzed in this paper (the National University of Loja, Ecuador) is a public-law institution that offers high-level academic education, with an academic offer distributed in areas, careers, modules or cycles, subjects, teachers, and other data offered in different modalities and periods of study. In order to publish these data under the principles of Linked Data and BOKs in their different areas, an analysis of the academic offer has been made.

3.2 Publication Architecture of the Academic Offer

Figure 3 shows the proposed architecture for the implementation of Linked Data and the BOK. It presents each of the phases of the methodology previously shown in Fig. 1, along with the tools needed for the execution of each phase.

Fig. 3. Publication architecture of the academic offer through Linked Data and BOK

3.3 Academic Offer Domain

The data needed for the domain definition were extracted from the web services of the university under study. This was done through the creation of a Python client to access the methods of the web services (a minimal sketch of such a client is given below).


Through the university portal, information was obtained in .pdf, .xls and .html formats. After obtaining the information related to the academic offer, a data set was established, together with the format in which it was intended to be delivered to the user. Examples of this information are the careers and/or programs of the different levels and modalities offered by each area of the higher education institution in a given academic offer, area directors, career coordinators, the description of an academic offer, enrollment in an academic offer, and teachers by career. Through Linked Open Vocabularies (LOV), a search was made for the classes and properties with the highest scores, that is, the most reused ones; the remaining new terms were then separated to define additional classes and attributes [17]. Once the classes and properties were chosen, the ontology was modeled using Protégé 4.2. Figure 4 shows the OntoViz view of the ontology, which we have named “Linked Data Academic University” (LDAU).
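A minimal sketch of the kind of Python client mentioned above follows; the base URL, endpoint name, parameters and response fields are hypothetical, since the university's web-service interface is not specified in the paper.

```python
# Hypothetical sketch of a Python client that pulls academic-offer data from
# the university's web services. The base URL, endpoint, parameters and
# response fields are illustrative assumptions, not the institution's real API.
import csv
import requests

BASE_URL = "https://ws.example-university.edu.ec/api"   # assumed endpoint

def fetch_academic_offer(period):
    resp = requests.get(f"{BASE_URL}/academic-offer",
                        params={"period": period}, timeout=30)
    resp.raise_for_status()
    return resp.json()   # assumed JSON payload: list of career/area records

def save_as_csv(records, path):
    # Flatten the records so they can be cleaned in Open Refine afterwards.
    if not records:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)

# save_as_csv(fetch_academic_offer("PREGRADO MARZO-JULIO 2014"), "offer.csv")
```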

Fig. 4. View of the ontology of the academic offer PROTÉGÉ - ONTOVIZ.


Before generating the RDF data, it is necessary to clean the data (data cleaning). Open Refine is the tool used to clean the data and to detect inconsistencies such as incomplete data, errors in the structure of a field, duplicate data, and the same name written in different ways [20]. Figure 5 shows the HTML file created by Protégé's OWLDoc export plug-in.

Fig. 5. HTML file of the ontology of the academic offer based on Linked Data and the BOK model.

3.4 Design of HTTP URIs

As established by the first two principles of Linked Data, URIs must be used to name resources and must be implemented over HTTP. This facilitates the creation of a web of data in which URIs can be displayed, exchanged and connected. Therefore, it is necessary to define patterns for the URIs of the data that will be accessed [21]. Table 3 shows the URI scheme, which was designed taking into account the considerations established in the table.

Table 3. Scheme of URIs

URI scheme for the vocabulary and resources of the academic offer:
SCHEMA (HTTP URI of the ontology):
  http://{uri-base}/{project-name}/schema#
  Example: http://data.unl.edu.ec/academic-offer/schema#
RESOURCE (HTTP URI of resources):
  http://{uri-base}/{project-name}/resource/{type-resource}/{id-resource}
  Example: http://data.unl.edu.ec/academic-offer/resource/Rol/Director
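A small helper that applies the URI scheme of Table 3 is sketched below; the URL-quoting of identifiers is an assumption about how identifiers are normalized, since the paper only specifies the URI pattern itself.

```python
# Helper that builds resource URIs following the scheme in Table 3.
# The URL-quoting of identifiers is an assumption; the paper only gives the
# URI pattern.
from urllib.parse import quote

URI_BASE = "http://data.unl.edu.ec"
PROJECT = "academic-offer"

def schema_uri():
    return f"{URI_BASE}/{PROJECT}/schema#"

def resource_uri(type_resource, id_resource):
    return (f"{URI_BASE}/{PROJECT}/resource/"
            f"{quote(type_resource)}/{quote(str(id_resource))}")

# resource_uri("Rol", "Director")
# -> 'http://data.unl.edu.ec/academic-offer/resource/Rol/Director'
```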

3.5 RDF Generation

As can be seen in Fig. 6, the conversion of the data to RDF triples was done using the established vocabularies, Open Refine, and its RDF extension.

Fig. 6. RDF triples

Open Refine allows exporting the data in different formats: XLS, CSV, RDF/XML, and Turtle. It also supports data reconciliation and thus helps comply with the fourth principle of Linked Data, "include links to other URIs". Table 4 shows the data generated in Turtle format for the areas of the university.

Table 4. RDF triples generated in Turtle format - areas of the higher education center, supported by the principles of BOK.

<http://data.unl.edu.ec/academic-offer/resource/AcademicUnit/AJSA>
    a "AJSA" ;
    foaf:name "AREA JURÍDICA, SOCIAL Y ADMINISTRATIVA" ;
    foaf:mbox "[email protected]" ;
    foaf:homepage "http://www.unl.edu.ec/juridica/" .

<http://data.unl.edu.ec/academic-offer/resource/AcademicUnit/AARNR>
    a "AARNR" ;
    foaf:name "AREA AGROPECUARIA Y DE RECURSOS NATURALES RENOVABLES" ;
    foaf:mbox "[email protected]" .


3.6 Publication and Exploitation of Data

After the conversion of the data to RDF format, it is necessary to store them in an RDF store; Virtuoso Open Source was selected because it allows RDF data management and provides a SPARQL endpoint for querying the resources contained in the server. Figure 7 shows the Virtuoso SPARQL endpoint, in which a query has been made about the data of the degrees that belong to the Energy Area.

Fig. 7. SPARQL query in Virtuoso.

SPARQL is the language recommended by the W3C for querying RDF triples, and its syntax is similar to SQL [19]. Table 5 shows the SPARQL query used to obtain the data of the degrees that belong to the Energy Area. After the publication of the data in Virtuoso, it is advisable to use a tool that allows the data to be visualized in a friendlier manner for users; Pubby is the tool configured here to display the triples in HTML format. Figure 8 shows the RDF data, in HTML format, corresponding to the information of the National University of Loja (UNL), Ecuador; the data displayed include acronym, category, authority, description, name, phone number, and address. Educational computer applications require that educational curricula have a high level of interoperability [20, 23]. The Semantic Web combined with the BOK model aims to enable machines to understand and use the web's contents, and thus to facilitate the location of resources and the communication between systems and programs [21].


Table 5. SPARQL query for the data of the degrees that belong to the Energy Area.

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ldau: <http://data.unl.edu.ec/academic-offer/schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?CARRERA ?MODALIDAD ?NIVEL_ACADEMICO ?AREA ?OFERTA
WHERE {
  ?carrera a ldau:AcademicCareer .
  ?carrera foaf:name ?CARRERA .
  ?carrera ldau:hasMode ?mod .
  ?mod foaf:name ?MODALIDAD .
  ?carrera ldau:hasAcademicLevel ?level .
  ?level foaf:name ?NIVEL_ACADEMICO .
  ?carrera ldau:isPartOfAcademicUnit ?are .
  ?are foaf:name ?AREA .
  ?carrera ldau:isOffered ?ofer .
  ?ofer dc:description ?OFERTA .
  FILTER regex(?NIVEL_ACADEMICO, "PREGRADO")
  FILTER regex(?OFERTA, "PREGRADO MARZO-JULIO 2014")
  FILTER regex(?AREA, "^AREA DE LA ENERGÍA", "i")
}
ORDER BY (?CARRERA)
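A minimal sketch of running the Table 5 query from Python against the Virtuoso SPARQL endpoint follows, using SPARQLWrapper; the endpoint URL is an assumption (Virtuoso exposes /sparql by default).

```python
# Sketch of executing the Table 5 query against the Virtuoso SPARQL endpoint
# with SPARQLWrapper. The endpoint URL is an assumption; `QUERY` holds the
# query text from Table 5.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://data.unl.edu.ec/sparql"   # assumed endpoint location

def run_query(query_text):
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(query_text)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [{k: v["value"] for k, v in row.items()}
            for row in results["results"]["bindings"]]

# for row in run_query(QUERY):
#     print(row["CARRERA"], "|", row["MODALIDAD"], "|", row["AREA"])
```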

Fig. 8. Visualization of the main page of the UNL by means of the Pubby configuration and the BOK model.


Despite the advances that have taken place within the Web, there is still a large amount of isolated data with little meaning. With regard to higher education, a topic to be resolved is the accessibility of academic and educational resources, which is why it is necessary to apply the principles provided by Linked Data to link the data based on their meaning, facilitate search, and offer better accessibility. In the generation and publication of linked data, it is important to carry out a validation process so that the published data are of good quality [22]. In this sense, the validation of the ontology was performed first, to detect possible faults in the modeling; next, the acceptance tests and the cURL validation were carried out. To validate the semantic application, the following aspects were considered [23]:

• Validation of ontologies.
• Acceptance tests.
• cURL validation.
• Through the RDF Validator service offered by the W3C, the conformance of the triples generated from the academic data of the UNL was validated.
• To verify that the published information is useful, understandable to the user and easy to navigate, an acceptance test was carried out through an online survey of different users.
• Through an HTTP cURL client, it was possible to verify that the data are understandable by users and computers.
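The cURL check described in the last item can equally be scripted; a minimal sketch with Python's requests is given below, assuming resource URIs that follow the scheme of Table 3.

```python
# Sketch of the dereferencing check described above (done with cURL in the
# paper), scripted with requests. The example URI follows the Table 3 scheme.
import requests

def check_resource(uri):
    # Ask for RDF first, then HTML, and report what the server returns.
    for accept in ("application/rdf+xml", "text/html"):
        resp = requests.get(uri, headers={"Accept": accept}, timeout=30)
        print(uri, accept, "->", resp.status_code,
              resp.headers.get("Content-Type"))

# check_resource("http://data.unl.edu.ec/academic-offer/resource/Rol/Director")
```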

4 Conclusions

Bodies of Knowledge provide the basis for the development of the curriculum, its maintenance, support for professional development, and current and future certification systems. A BOK also promotes integration and connections with other related disciplines. For this reason, the development of alternative forms of representation and applicability in different scientific communities is the basis for consensus across these disciplines. This is why, in this paper, the BOK model and semantic web techniques were combined for the visualization of the information of a higher education center, providing support for educational sustainability decisions. Through the development of a semantic application, Linked Data made it possible to give meaning to the information, since the reuse of vocabularies provides a standard language that makes the information understandable for users and computers. The use of different tools and semantic technologies, such as Open Refine, allowed the data to be extracted, transformed, cleaned and converted to RDF format. The storage of the data was done with Virtuoso Open Source, since it provides a SPARQL endpoint and SPARQL Update support for queries. Likewise, Protégé facilitated the creation of the vocabulary, since it allowed better semantic validation and the evaluation of inconsistencies in concepts and in the hierarchy of relations between classes and properties, supported by the BOK model. This paper describes the study and implementation of Linked Data technologies combined with the BOK model in order to visualize the curriculum of a higher education center, facilitating the location of resources, the communication between systems and programs, and the interoperability of information.


The next steps are to implement the BOK model in combination with other semantic web techniques and artificial intelligence principles to improve the curriculum in different knowledge areas.

References
1. Quezada, P., Enciso, L., Mayorga, M., Mengual, S., Hernandez, W., Vivanc, J., Carrion, P.: Promoting innovation and entrepreneurship skills in professionals in software engineering training: an approach to the academy and bodies of knowledge context. Paper presented at the IEEE global engineering education conference, EDUCON, April 2018, pp. 796–799 (2018). https://doi.org/10.1109/educon.2018.8363312
2. Hendler, J., Berners-Lee, T., Miller, E.: Integrating applications on the semantic web. J. Inst. Electr. Eng. Jpn. 122(10), 676–680 (2002)
3. Bizer, C., Cyganiak, R., Heath, T.: How to Publish Linked Data on the Web (2007). http://wifo5-03.informatik.uni-mannheim.de/bizer/pub/LinkedDataTutorial/
4. Koper, R.: Use of the semantic web to solve some basic problems in education: increase flexible, distributed lifelong learning, decrease teachers' workload. J. Interact. Media Educ. 6, 1–22 (2004). Special issue on "Educational Semantic Web"
5. Gamaliel, J., Rivera, I., Fernandez, J., Serrano, A.: Competency Framework for Software Engineers. IEEE, Tijuana (2010). https://doi.org/10.1109/cseet.2010.21
6. Quezada-Sarmiento, P.A., Enciso, L., Garbajosa, J.: Use of body knowledge and cloud computing tools to develop software projects based in innovation. Paper presented at the IEEE global engineering education conference, EDUCON, 10–13 April 2016, pp. 267–272 (2016). https://doi.org/10.1109/educon.2016.7474564
7. Quezada, P., Garbajosa, J., Enciso, L.: Use of standard and model based on BOK to evaluate professional and occupational profiles (2016). https://doi.org/10.1007/978-3-319-31232-3_27
8. Hamidi, H., Fazely, K.: Analysis of the essential factors for challenges of mobile learning adoption in education. Comput. Appl. Eng. Educ. (2018). https://onlinelibrary.wiley.com/doi/abs/10.1002/cae.21993
9. Quezada-Sarmiento, P.A., Enciso-Quispe, L.E., Garbajosa, J., Washizaki, H.: Curricular design based in bodies of knowledge: engineering education for the innovation and the industry. Paper presented at the proceedings of 2016 SAI computing conference, SAI 2016, pp. 843–849 (2016). https://doi.org/10.1109/sai.2016.7556077
10. Berners, L.T.: Linked Data: Reglas and Linked Open Data. W3C recommendation, W3C (2009). http://www.w3.org/DesignIssues/LinkedData.html
11. Berners, L.T., Chen, Y., Chylton, L., Connolly, D., Dhanaraj, R., Hollenbach, J., Lerer, A., Sheets, D.: Exploring and analyzing linked data on the semantic web (2006)
12. Hall, W., Berners, L.T., Shadbolt, N.: The Semantic Web Revisited, p. 6 (2006)
13. Martín, C.D., Ferreras, F.T., Ríos, H.A.: Linked Data y Linked Open Data: Su Implantación en una Biblioteca Digital. El caso Europeana, Salamanca (2012)
14. Alvarez, R.J., Cifuentes, S.F., Labra, J.E.: A proposal of architecture process of deployment for Linked Data Projects (2011)
15. Corcho, O., Gomez, P.A.: Linked Data Tutorial. Universidad Politécnica de Madrid, Florianópolis (2010)
16. W3C Consortium: Guía Breve Linked Data. W3C recommendation, W3C. http://www.w3c.es/Divulgacion/GuiasBreves/LinkedData


17. Quezada, P., Ramirez, R.: Develop, research and analysis of applications for optimal consumption and visualization of linked data. Paper presented at the Iberian conference on information systems and technologies, CISTI (2017). https://doi.org/10.23919/cisti.2017.7975964
18. Yang, H.Z., Chen, J.F., Ma, N., Wang, D.Y.: Implementation of knowledge-based engineering methodology in ship structural design. CAD Comput. Aided Des. 44(3), 196–202 (2012). https://doi.org/10.1016/j.cad.2011.06.012
19. Eras, A.G., Quezada, P.S., González, P.L., Gallardo, C.: Comparing competences on academia and occupational contexts based on similarity measures. Paper presented at WEBIST 2015 - 11th international conference on web information systems and technologies, Proceedings, pp. 540–546 (2015)
20. Necula, S.-C., Pavaloaia, V.D., Strîmbei, C., Dospinescu, O.: Enhancement of e-commerce websites with semantic web technologies. Sustainability (Switzerland), 10(6) (2018). https://doi.org/10.3390/su10061955
21. Rodić-Trmčić, B., Labus, A., Barać, D., Popović, S., Radenković, B.: Designing a course for smart healthcare engineering education. Comput. Appl. Eng. Educ. 26, 484–499 (2018). https://doi.org/10.1002/cae.21901
22. Quezada-Sarmiento, P.A., Enciso, L., Washizaki, H., Hernandez, W.: Body of knowledge on IoT education. In: Proceedings of the 14th International Conference on Web Information Systems and Technologies, ITSCO, vol. 1, pp. 449–453 (2018). https://doi.org/10.5220/0007232904490453. ISBN 978-989-758-324-7
23. Quezada-Sarmiento, P.A., Enciso-Quispe, L.E., Jumbo-Flores, L.A., Hernandez, W.: Knowledge representation model for bodies of knowledge based on design patterns and hierarchical graphs. Comput. Sci. Eng. (2018). https://doi.org/10.1109/mcse.2018.2875370

Building Adaptive Industry Cartridges Using a Semi-supervised Machine Learning Method

Lucia Larise Stavarache
IBM, Columbus, OH 43235, USA

Abstract. In the middle ground between research and industry applicability there is optionality: although the former comes with proven results, the latter is challenged by scalability, constraints and assumptions when applied in real-case scenarios. It is very common that promising research approaches or PoCs (proofs of concept) encounter difficulties when applied in industry solutions, due to specific industry requirements, bias or constraints. This paper shows how industry business knowledge can be incorporated into machine learning algorithms to help eliminate bias that might otherwise be overlooked, and to build industry domain cartridge models to be used in future solutions. Such industry models are currently explored by businesses that want to enhance their portfolios with cognitive and AI capabilities and learn from transaction-based insights. With this research we aim to show how machine learning models can best learn from industry expertise and business use cases to create re-usable domain cartridges that can serve as the core for bots, RPA (Robotic Process Automation), industry patterns, data-insight discovery, control and compliance.
Keywords: Classification · Clustering · Information retrieval · Topic mining · Industry models · Industry domain cartridges

1 Introduction

1.1 Context

Translating research work into applied industry use-case scenarios is not straightforward and is subject to specific constraints: business requirements, scalability, performance, corpora size, and industry- or client-specific data landscapes. In the absence of consolidated industry domain cartridge models, applying the research requires additional customization and training (domain and business rules, supervised annotation and human validation), resulting in a specific, non-reusable and non-scalable model for each similar use case. In a highly competitive business ecosystem, this triple link is the key to deciphering the meaning of transaction-based insights [1]. As the race for staying ahead becomes more agile, the traditional research approach that evolves and matures over a long period of time is not sustainable. Hence, building on existing research together with industry business use-case scenarios [2], this paper presents a method that combines information retrieval techniques, machine learning algorithms, transfer learning and user feedback to build scalable and re-usable industry cartridge models.

1.2 Chapters Structure

Chapter 1 – Gives a succinct background explanation of the research approach and the rationale for the selection of the study area, and explains the structure from both a research and an industry lens.
Chapter 2 – Highlights the existing state of the art and, accordingly, contains an analysis of existing methods, models and theoretical frameworks available in this research space. Additionally, a comprehensive analysis of current industry business needs, term definitions, language and glossary is presented, followed by the proposed method compared with existing solutions, in a logical manner that highlights the method's innovation.
Chapter 3 – Provides an in-depth walkthrough of the proposed method, experiments and results, emphasizing the research approach, algorithm design, industry business constraints and the implementation steps. Corpora and sampling details of the study, together with discussions of ethical considerations, are also included in this chapter.
Chapter 4 – Emphasizes an in-depth discussion of core problems that industry solutions face when boarding the cognitive and AI path, especially scalability, re-usability, training, performance and insight accuracy (i.e. the effort of re-training the model between similar solutions is higher than building a model from scratch).
Chapter 5 – Assembles the main conclusions of this study and summarizes the level of achievement of the research goals and objectives. Alongside, known research limitations are acknowledged and presented in the context of future work and experiments.

2 State of the Art

Information retrieval, big data, crawling, bots, information mining and their derivatives have been a common concern for both research and industry in the pursuit of data-insight discovery from large unknown or unseen content (i.e. documents, logs, conversations, incidents, phone calls). The most developed industry solutions, which come with pre-trained out-of-the-box models, often exposed in a black-box manner via APIs, are:
(a) IBM Watson Discovery [3] – "Powered by the latest innovations in machine learning, Watson lets you learn more with less data. You can integrate AI into your most important business processes, informed by IBM's rich industry expertise. You can build models from scratch, or leverage our APIs and pretrained business solutions…". IBM Watson Discovery comes with a collection of pre-trained machine and deep learning algorithms and computational linguistics processing that offers a quick ramp for building new models. Comparing the two approaches, although IBM Watson Discovery allows a quick start of new projects and performs well on known business use-cases, it requires a considerable amount of training data and semi-supervised validation and annotation when the analyzed content is not in the known sets. A second drawback, which applies to all cloud platforms that offer similar APIs (detailed below), is their black-box approach, which makes them hard to debug and difficult to tune or to monitor for model accuracy over time.


Out of the box, without custom alterations, the AI solution platform from IBM is highly preferred by non-technical users and performs with high accuracy in spaces such as health, human resources, news and Q&A (cognitive RPA bot) solutions.
(b) Google AI Products [4] – "Cloud AI provides modern machine learning services, with pre-trained models and a service to generate your own tailored models. Our neural-net-based ML service has better training performance and increased accuracy compared to other deep learning systems. Our services are fast, scalable, and easy to use. Major Google applications use Cloud machine learning, including Photos (image search), Translate, Inbox (Smart Reply), and the Google app (voice search)". The AI APIs master scalability and deep learning algorithms for use-cases revolving around images, voice or translation. Although Google AI Products come with insight-discovery services, these are very narrow in their analysis and do not support complex semantic understanding of data. Furthermore, the predefined trained model is focused on openly available data such as news, articles, books and movies, with a very thin understanding of business domains.
(c) Microsoft AI Platform [5] – "At Microsoft, we aim to help companies transform by bringing AI to every application, every business process, and every employee". The Microsoft AI platform comes with a suite of machine learning and deep learning algorithms sampled with a consistent set of use-cases in a small number of industries, such as healthcare, manufacturing, banking and retail. Compared with the other solutions, the platform offers a friendly user experience; nevertheless, the data analysis and discovery capabilities are not mature, offering limited capabilities: NER, bagging and information retrieval.
(d) Amazon SageMaker [6] – "Enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. It removes the complexity that gets in the way of successfully implementing machine learning across use cases and industries—from running models for real-time fraud detection, to virtually analyzing biological impacts of potential drugs, to predicting stolen-base success in baseball." Amazon's SageMaker offers one of the most cohesive data science platforms, bundling together machine and deep learning algorithms and a set of pre-trained APIs (i.e. voice, images, risk). Its targeted audience is open-platform scientists who want to build applications fast using the platform's capabilities and services. In major industry business scenarios the options are limited with SageMaker, which offers a very thin repository of industry models to leverage.
(e) Other players are represented by smaller companies: either startups that master one or two use cases, or niche companies that claim success in targeted domains.
(f) Last but not least, and highly disconnected from the first two categories, are the open research initiatives, summarized via papers or patents, that target specific use-cases and reach the industry ecosystems with major delays.


This paper acknowledges the capabilities and limitations of the existing solutions for the declared scope, building industry cartridges, and comes with the following differentiating claims:

(a) Industry Cartridge Models [7] – or accelerators, help define and accelerate the core structures, definitions and content required by industries to transform, automate and digitize their business via cognitive solutions (i.e. preventive maintenance solutions, helper bots, predictive models, risk analysis). The cartridge acts as the core component and the basis for all cognitive business use-case implementations, and is leveraged across them. Cartridges are industry specific and their viability is determined by industry evolution; they represent domain-trained models incorporating the current business requirements and challenges of the industry, as opposed to models trained for one specific industry implementation or problem.
(b) Curation and sanitization steps – are mandatory before adding content to the system. As different formats are permitted (docx, doc, ppt, pptx, pdf, txt), it is important when converting them to txt to extract information in the right logical sense and to concatenate it meaningfully (i.e. content may be associated with images, tables or charts). Bulking everything together reduces accuracy and precision and often results in the “garbage-in/garbage-out” paradigm. Off-the-shelf APIs, algorithms and solutions disregard this major constraint and analyze everything in bulk.
(c) Industry terminology [8] – industry and business terminology are domain specific; therefore, using generic models or available NER dictionaries can produce false positives, as a term’s meaning can be ambiguous depending on the context.
(d) Training corpus – is relatively small and contains a representative set of industry domain use-cases describing the cartridges, also known as an abstract definition of the domain.
(e) LDA topic model [9] – applied also in the current research, is regularly used to reduce the dimensionality of large content into decomposable categories, topics or building blocks, with the capability of observing the model evolution over time.
(f) Eliminating duplicate content [10] for the test, train and learn phases – industry cartridges have to respect the following principles: domain precision; avoiding deprecation (i.e. if Gartner modifies a term, the cartridge model has to understand the similarity between the new term and the previous one); and size, as it directly impacts performance and production scalability when working with large documents. With these considerations, the detection of similarities [11] and the measurement of duplication are part of the proposed method.
(g) Training new cartridges – is simple and requires a small amount of supervision and human validation. The models fall under one of two category types: enhanced cartridges, built with transfer learning derived from a core cartridge model, or new cartridges, which can come from existing industry fields or emerging technologies. For cartridges that have not reached industry maturity, the method allows ingesting the abstract domain definition as a training base; as content is added, the system learns and reinforces the cartridge domain model to the aimed accuracy policy.

Summarizing the key differentiators, the proposed research requires a relatively small human-validated training corpus compared with existing pre-trained models (i.e. Giganews corpus [12], WordNet [13]); a semi-supervised approach to “lift and shift” the models; and continuous learning through received feedback, ingested content and transferred learning.


3 Method

3.1 Corpora

For exemplification of the model, three industry domain cartridges – automation, blockchain and cognitive – have been built using the proposed research method from existing industry documentation (available in different formats: pptx, ppt, pdf, doc, docx or txt). The training corpora are part of a large 85K-document CMS [14] (Content Management System) corpus with content from business solutions, industry offerings, white papers, solution architecture documents, client proposal presentations and industry reports from the three areas of interest. The selected training corpora comprised 500 documents, reduced to 300 after applying the curation, sanitization and de-duplication steps described in the method. Referencing back to the state-of-the-art chapter, the proposed method includes several quality-gate steps: curation, sanitization, conversion to .txt format, and identification and removal of duplicate content. Duplication is often caused in business and industry documentation by ongoing copy-pasting (from one client proposal to another, from one solution to another), many times because the same content is reused without crediting the original author. Other sources of duplication are two different documents uploaded by different authors in the CMS, serving different purposes or as a common reference.

An asset, in industry terms, is a packaged method, software or solution that can be deployed with minimal customization in different industry use-cases. In this paper we refer to assets as technical software solutions, where we aim to analyze their content (formed of one or many constituent documents such as: asset solution architecture, asset business functionality, non-functional sizing requirements, common use-cases) in order to understand their applicability in industry domain contexts. Asset maturity can range from PoC (proof of concept) and MVP (minimum viable product) to Commercialized (a mature solution with at least one client). In the specific scenario of this research, building the industry cartridges for assets yields valuable insights: a clear understanding of the coverage between existing asset offerings and industry requirements and trends; identification of uncovered areas where asset offerings have to be built; and an understanding of the deviation between the current industry state and asset viability (i.e. if the industry migrates to cloud while the asset offerings are designed to be deployed on premises).

The training corpora have been manually validated and classified into three classes – cognitive, blockchain and automation – and an “ALL” class has been defined for future discovery, holding all assets that do not meet the prediction threshold. As an additional consideration, since the models learn from user feedback and ingested content (every time a new asset is reviewed by the system), the purpose is not to grow the content exponentially through duplication or low-relevance material; therefore duplicate documents are detected and deleted as part of the pre-processing steps. As a corpora observation, once an industry domain has evolved and reached maturity we notice a concept stagnation; in the initial stages, however, the domain is still considered emergent or “trending” (i.e. words like cognitive, digital, serverless, blockchain), and many solutions reference such terms merely to look innovative, although there is no direct linkage, introducing a persistent industry human bias.

3.2 Method and Algorithms

Method and algorithm steps for training:

1. Building new cartridges takes as input a set of representative documents and, optionally, an abstract industry cartridge definition (i.e. for emerging technical domains). The system can ingest data from different sources and formats: zip archives, Box/Dropbox, or crawled web sources.
2. Remove documents that have close to no content (i.e. a PowerPoint deck may have 10 slides, yet its raw-text content is fewer than 5 lines).
3. Convert documents to .txt format using a custom parser built over Apache POI [15] that understands pptx, ppt, doc, docx and pdf formats and connects information in a meaningful manner from elements such as smart art, tables, charts and bullets (an illustrative extraction sketch follows this list).
4. Identify similar content and remove duplicates using the following computed similarity distances [16]: Jaccard [17], Euclidean [18], Levenshtein [19], cosine [20], Jaro-Winkler [21], Jensen-Shannon [22] and KL [23]. Multiple distances [24] are computed because the compared content comes in different sizes, formats and author language styles, so no single measure can validate all the scenarios (see the de-duplication sketch after this list). This step is mandatory in both the training and learning phases of the method, to ensure models are neither over-fitted nor too large, which would prevent scalability.
5. Readability sanitization: the goal of understanding the content structure is to extract meaningful domain insights. To remove garbage, readability sanitization steps are applied as follows: content in formats other than the permitted ones, content in languages other than English, and grammatically incorrect English content (nonsense or leftover text) are removed.
6. The curated documents then serve as input for building the industry cartridge models: cognitive, automation and blockchain. Models are constructed using the LDA algorithm [9] and a custom implementation of Mallet [35] Text2Vectors that is capable of merging two models, the new and the old, thereby transferring the learning (a topic-model sketch follows this list). Additionally, pruning techniques are applied to the new model to adjust the alphabet feature weights based on user feedback and newly learned concepts captured from the usage of the system in existing implementations.
7. The next step is building the industry cartridge classifier, settling between Naïve Bayes [36] and MaxEnt [34]; both follow similar approaches, with the difference that MaxEnt maximizes the likelihood of the training data using search-based optimization under the assumption that our concepts are related (a classifier-comparison sketch follows this list).
8. System vitality is part of the learning steps; therefore, at each re-training step multiple machine learning classification algorithms are compared (Naïve Bayes, AdaBoost, Linear Regression) [34], which can help discover model anomalies and alert the system.
9. Industry terminology definitions are used to augment the models and eliminate bias: industry-specific business terms are retrieved by the system but flagged as “common business language” (i.e. IBM, business, governance), allowing the system to differentiate between them and actual document content.
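As a minimal illustration of the conversion in step 3, the sketch below flattens a pptx deck to plain text. It uses python-pptx purely as a stand-in for the paper's custom Apache POI based parser (which is not publicly described), and the file name is hypothetical; handling for smart art, charts, tables and the other formats (doc, docx, pdf) is omitted.

```python
# Illustrative stand-in for the custom Apache POI parser described in step 3.
# python-pptx is used here only to sketch the pptx case; the production system
# parses pptx, ppt, doc, docx and pdf through Apache POI, with custom handling
# for smart art, tables and charts that is omitted here.
from pptx import Presentation


def pptx_to_text(path: str) -> str:
    """Extract slide text in reading order and join it into one .txt payload."""
    prs = Presentation(path)
    chunks = []
    for slide in prs.slides:
        for shape in slide.shapes:
            # Keep only shapes that carry text (titles, bullets, text boxes).
            if shape.has_text_frame:
                text = shape.text_frame.text.strip()
                if text:
                    chunks.append(text)
    return "\n".join(chunks)


if __name__ == "__main__":
    # "sample_asset_deck.pptx" is a hypothetical file name.
    print(pptx_to_text("sample_asset_deck.pptx")[:500])
```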

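The de-duplication idea of step 4 can be sketched as follows, combining two of the listed measures (Jaccard over token sets and cosine over TF-IDF vectors) and flagging pairs above a threshold. The threshold value and the sample corpus are illustrative assumptions; the remaining distances (Levenshtein, Jaro-Winkler, Jensen-Shannon, KL) would be added in the same fashion.

```python
# Minimal sketch of the duplicate-detection idea in step 4: combine several
# similarity measures and flag near-duplicates. Only Jaccard and TF-IDF cosine
# are shown; the 0.85 threshold is illustrative, not taken from the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def near_duplicates(docs, threshold=0.85):
    """Return index pairs whose Jaccard OR cosine similarity exceeds threshold."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    cos = cosine_similarity(tfidf)
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if max(jaccard(docs[i], docs[j]), cos[i, j]) >= threshold:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    corpus = [
        "Cognitive asset for automated claims triage in insurance.",
        "Cognitive asset for automated claims triage in the insurance industry.",
        "Blockchain ledger for supply chain provenance tracking.",
    ]
    print(near_duplicates(corpus))  # expect the first two documents flagged
```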

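For step 6, a minimal topic-model sketch is shown below using scikit-learn's LatentDirichletAllocation as a stand-in for the paper's LDA plus Mallet Text2Vectors pipeline; the documents, the number of topics and the hyper-parameters are illustrative, and the model-merging and pruning logic is not reproduced here.

```python
# Sketch of the topic-modelling stage in step 6, with scikit-learn's LDA as a
# stand-in for the LDA + Mallet Text2Vectors pipeline. All hyper-parameters
# and documents below are illustrative assumptions.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

curated_docs = [  # hypothetical curated .txt content, one string per document
    "robotic process automation workflow orchestration bots",
    "smart contracts distributed ledger consensus blockchain",
    "natural language understanding conversational agents cognitive",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(curated_docs)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(X)

# Top terms per topic become candidate cartridge concepts (uni-grams here;
# n-grams would come from a CountVectorizer with ngram_range=(1, 2)).
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"topic {k}: {top}")
```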

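Steps 7 and 8 compare several classifiers over the same features; the sketch below illustrates that comparison with cross-validated accuracy, using LogisticRegression as a common MaxEnt stand-in. The toy corpus, labels and fold count are assumptions, not the paper's data.

```python
# Sketch of the classifier selection / vitality check in steps 7-8: candidate
# classifiers are trained on the same TF-IDF features and compared by
# cross-validated accuracy. LogisticRegression stands in for MaxEnt; the tiny
# corpus and the 2-fold split are illustrative only.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["rpa bots workflow", "ledger consensus chain", "chatbot nlu intents",
        "process automation scripts", "smart contract tokens", "speech intent model"]
labels = ["automation", "blockchain", "cognitive",
          "automation", "blockchain", "cognitive"]

candidates = {
    "naive_bayes": MultinomialNB(),
    "maxent": LogisticRegression(max_iter=1000),
    "adaboost": AdaBoostClassifier(),
}

for name, clf in candidates.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, docs, labels, cv=2)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```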
10. The industry cartridge knowledge graph is obtained from the trained models and represents a connected list of the most powerful concepts (uni-grams and n-grams) and their associated topics (see Table 1). In the concept graphs, the orange nodes represent concepts shared between topics, the topic nodes are marked in blue, and the size of a node represents its cartridge weight (a small illustrative graph-construction sketch follows the figure captions below).

Table 1. Corpora distribution and classification per the three selected domains, with the resulting industry cartridge concept graphs.

Domain        Industry cartridge concept graph
Cognitive     See Fig. 1
Automation    See Fig. 2
Blockchain    See Fig. 3

Fig. 1. Cognitive cartridge concept graph
Fig. 2. Automation cartridge concept graph
Fig. 3. Blockchain cartridge concept graph
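A minimal sketch of how such a concept graph could be assembled is shown below, assuming networkx and hypothetical topics and weights; it mirrors the structure described in step 10 (topic nodes, concept nodes, shared concepts, weight-driven node size) rather than the paper's actual implementation.

```python
# Illustrative sketch of the concept graph in step 10: topic nodes, concept
# nodes, and weighted edges when a concept belongs to a topic; the "weight"
# attribute mirrors the cartridge weight used to size nodes in Figs. 1-3.
# The topics and concepts below are hypothetical, not taken from the models.
import networkx as nx

topic_concepts = {
    "conversational_ai": {"intent": 0.9, "dialog": 0.7, "nlp": 0.8},
    "document_ai": {"ocr": 0.6, "nlp": 0.5, "summarization": 0.4},
}

G = nx.Graph()
for topic, concepts in topic_concepts.items():
    G.add_node(topic, kind="topic")
    for concept, weight in concepts.items():
        # A concept shared by several topics (e.g. "nlp") becomes a common node.
        G.add_node(concept, kind="concept")
        G.add_edge(topic, concept, weight=weight)

shared = [n for n, d in G.degree() if d > 1 and G.nodes[n]["kind"] == "concept"]
print("common concepts between topics:", shared)
```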

11. The research in this paper is currently implemented and running in a real-time production system that offers various capabilities: NLP [33, 34] searches, content summarization, content similarity analysis, semantic concept graphs, and tagging (uni-grams, n-grams).
12. When a new asset is uploaded into the system, new learning is available to the model only if its concepts are not duplicates of, or too similar to, the existing knowledge of the models; content that does not add value in augmenting the cartridge, or that does not reach the 70% threshold, is not considered for learning purposes (a sketch of this learning gate follows this list). For each uploaded asset the following information is received: description, zero or more attached documents, author, domain, title and architecture.
13. Feedback is collected from the users and captured back in the system as they choose to add or remove the obtained uni-/n-grams from tagging (augmenting or reducing the weight of the specific feature in the alphabet), accept discovered similarities, or enhance proposed summaries (obtained from the asset's attached documents using an extractive approach that groups the attachments by similarity and summarizes only the unique content).
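The learning gate described in step 12 can be sketched as a confidence check around the trained classifier. The 70% threshold comes from the text, while the classifier interface and the helper names in the usage comment are assumptions.

```python
# Sketch of the learning gate in step 12: a new asset is routed to a cartridge
# (and considered for incremental learning) only when the classifier's
# confidence reaches the 70% threshold; otherwise it falls into the "ALL"
# class kept for future discovery. `cartridge_classifier` is a placeholder
# for a fitted text-classification pipeline from the previous steps.
def route_asset(asset_text: str, cartridge_classifier, threshold: float = 0.70):
    probabilities = cartridge_classifier.predict_proba([asset_text])[0]
    best = probabilities.argmax()
    if probabilities[best] >= threshold:
        return cartridge_classifier.classes_[best], probabilities[best]
    return "ALL", probabilities[best]


# Hypothetical usage with a fitted scikit-learn pipeline:
# label, confidence = route_asset(new_asset_description, pipe)
# if label != "ALL":
#     queue_for_incremental_learning(new_asset_description, label)
```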

3.3 Implemented Method System Architecture

The research in this paper served as the core for the production system implementation. As highlighted above, the system uses both supervised and unsupervised techniques to analyze, map and understand content (in this example, asset-type content) in order to build and augment domain-specific industry cartridge models. Without overlooking human validation, the method starts with a set of automated steps that reduce the time spent on data preparation, sanitization and manual annotation (mandatory steps in traditional processing). Supervised continuous learning, feedback adaptability and the auto-recalibration of the system are the core features that validate the applicability of the method in real industry scenarios (going back to the claims of the introduction and state-of-the-art chapters). The system is designed to be adaptive, learning from ingested content and collected user feedback through several quality gates: user-claimed expertise, user community expertise validation (votes, likes, comments, contributions), and contribution consistency (the feedback's value to the system); additional dimensions such as the user profile are also used when triggering a new model learning cycle. The solution architecture (see Fig. 4) can serve several industry use-cases, is scalable, cloud compatible, and industry/domain agnostic, within the constraint that the method approach is respected. The system performs real-time analysis, with performances of: