Proceedings of the 4th International Conference on Computer Vision and Image Processing (CVIP 2019), Part I. ISBN 978-981-15-4014-1, 978-981-15-4015-8 (eBook)


English · 440 pages · 2020


Table of contents :
Preface......Page 6
Organization......Page 7
Contents – Part I......Page 9
Contents – Part II......Page 13
Biometrics......Page 18
1 Introduction......Page 19
2.1 Statistical Features......Page 21
2.2 Transform-Based Features......Page 22
3 Experimental Setup and Results......Page 23
3.1 Results of Statistical Feature Descriptor......Page 24
3.2 Results of Transform-Based Feature Descriptor......Page 26
References......Page 27
Computer Forensic......Page 29
1 Introduction......Page 30
2.2 Finding Keypoints and Extraction of Features Using SIFT......Page 32
2.4 Feature Matching......Page 33
3.1 Dataset Description......Page 34
3.2 Metrics for Performance Evaluation......Page 35
3.4 Qualitative Results......Page 36
References......Page 39
Computer Vision......Page 40
1 Introduction......Page 41
2.1 Use of Computer Vision in Fault Detection in the Pipe Industry......Page 42
3 Problem Definition......Page 43
4.1 Comparing the State-of-the-Art Circle Detection Algorithms......Page 44
4.2 Preprocessing......Page 45
4.3 Morphological Processing: Intensification of Foreground......Page 46
4.6 Region-Based Hough Transform......Page 47
5 Results......Page 48
6 Discussion and Future Work......Page 50
References......Page 51
1 Introduction......Page 53
2 Related Work......Page 54
3.1 Multi-camera Image Fusion......Page 55
3.2 Multi Camera Video Transition......Page 59
4 Experimental Results......Page 60
5 Conclusion......Page 63
References......Page 64
1 Introduction......Page 65
2 Proposed Algorithm......Page 67
3 Simulation Results......Page 68
3.2 Evaluation Parameters......Page 69
References......Page 73
1 Introduction......Page 75
2 Related Work......Page 76
3.1 Model Architecture......Page 77
3.2 Loss Function......Page 79
4 Experimental Analysis......Page 81
References......Page 85
1 Introduction......Page 87
1.1 Overview of Laplacian Pyramid Blending......Page 88
2.1 Proposed Spatially Variant Level-Based Blending......Page 90
3.1 Qualitative Comparisons......Page 93
3.2 Quantitative Comparisons......Page 94
References......Page 95
1 Introduction......Page 96
2 Related Work......Page 97
3 Proposed System......Page 98
3.2 Spatial Transformer Network......Page 99
3.3 Convolutional Neural Network......Page 100
3.4 GPU Embedded Development Board......Page 101
4 Implementation and Results......Page 102
References......Page 106
1 Introduction......Page 108
2 Related Works......Page 110
3 Proposed Methods for Depth Estimation......Page 112
3.3 Method 3: Models Using Weight Matrix Pruning......Page 113
4 Experiments and Results......Page 114
5 Conclusion......Page 117
References......Page 118
Dimension Reduction......Page 120
1 Introduction......Page 121
2 Problem Statement......Page 123
3.2 Redundancy......Page 124
4 Proposed Feature Selection Framework......Page 125
5.3 Experiment Architecture......Page 127
5.4 Comparative Study......Page 128
6 Conclusion......Page 129
References......Page 130
Healthcare Information Systems......Page 132
1 Introduction......Page 133
2.2 Feature Extraction......Page 135
2.3 Feature Selection......Page 136
3 Classification......Page 137
3.2 Convolutional Neural Network......Page 138
4 Experimental Results and Discussions......Page 139
References......Page 141
1 Introduction......Page 144
3 Proposed Method......Page 146
3.1 Problem Formulation......Page 147
4.1 Noiseless Case......Page 150
4.2 Noisy Case......Page 153
5 Conclusion......Page 154
References......Page 155
1 Introduction......Page 156
2 Related Works......Page 157
3.2 Damage Patterns......Page 158
3.3 DataSet......Page 159
4 Computational Model Architecture......Page 160
5.1 Phase 1......Page 161
5.2 Phase 2......Page 163
5.3 Phase 3......Page 164
5.4 Phase 4......Page 165
References......Page 166
1 Introduction......Page 168
2 Literature Survey......Page 169
3 Proposed Method......Page 170
3.3 Gamma Adjustment......Page 171
3.5 Feature Extraction......Page 172
3.6 Artificial Neural Network......Page 174
3.7 Support Vector Machine (SVM)......Page 175
References......Page 182
1 Introduction......Page 184
2.1 Encoder-Decoder Framework......Page 186
2.2 Context Level Visual Attention (CLVA)......Page 187
2.3 Textual Attention (TA)......Page 188
3.1 Training Algorithm......Page 189
3.2 Quantitative and Qualitative Results......Page 190
4 Conclusion......Page 191
References......Page 192
Image Processing......Page 193
1 Introduction......Page 194
2 Wavelet Shrinkage......Page 196
3 Proposed Methodology......Page 197
4.1 Qualitative Analysis......Page 199
4.2 Quantitative Analysis......Page 200
References......Page 202
1 Introduction......Page 204
2 Cellular Automaton Model......Page 206
3.1 Multithreshold Binary Conversion (MBC)......Page 207
3.2 Recombination of Binary Images (RBI)......Page 208
4 Experimental Result......Page 210
References......Page 212
1 Introduction......Page 214
2 Proposed Algorithm......Page 215
2.1 Entity Detection......Page 216
2.2 Entity Correlation......Page 219
3 Results......Page 221
4 Applications......Page 223
References......Page 224
1 Introduction......Page 225
2.1 Denoising......Page 227
2.4 Uniform Deblurring......Page 228
2.5 Merging......Page 230
3 Implementation and Results......Page 231
References......Page 234
1 Introduction......Page 236
1.1 Motivation......Page 237
2 State-of-the-Art......Page 238
3 Recapture Video Dataset......Page 239
4 Proposed Methodology......Page 241
5 Experimental Result......Page 244
References......Page 246
1 Introduction......Page 248
2 Related Work......Page 249
3 Methodology......Page 250
4 Results and Analysis......Page 252
References......Page 255
1 Introduction......Page 257
2 Defocus Model......Page 258
3 Proposed Depth Map Estimation......Page 259
4.1 Range......Page 261
4.2 Ground Truth......Page 264
5 Verification......Page 265
6 Conclusion......Page 266
References......Page 267
Image Segmentation......Page 268
1 Introduction......Page 269
2 Background......Page 270
3.1 Optical Flow......Page 271
4 Proposed Framework......Page 272
4.2 Lung Nodule Segmentation......Page 273
5 Experimental Results......Page 274
References......Page 276
1 Introduction......Page 278
2 Proposed Method......Page 279
2.1 Warping Position Parameters......Page 281
2.2 Warping Control Parameters......Page 282
2.3 Calculating the Warping Factors......Page 283
3 Experimental Results and Evaluation......Page 284
References......Page 286
1 Introduction......Page 289
2.1 Point Set Simplification......Page 291
3 Results and Discussions......Page 293
3.2 Reconstruction Broken Alphabet......Page 294
3.3 Reconstruction Dot-Matrix Alphabet......Page 295
4 Conclusion......Page 296
References......Page 297
1 Introduction......Page 299
2 Graph Cut in Image Segmentation......Page 300
3 Reduced Graph Cut with Flexible User Input......Page 301
4 Experimental Results and Comparison......Page 304
4.1 User Study......Page 305
4.2 Comparison with Other State-of-the-art Methods......Page 306
References......Page 308
1 Introduction......Page 309
2 Fuzzy Clustering Algorithm by Incorporating Constrained Class Uncertainty-Based Entropy......Page 311
3.1 Experiments on BrainWeb Data......Page 313
3.2 Experiments on Clinical Brain MR Image Data......Page 315
4 Conclusion......Page 316
References......Page 317
1 Motivation......Page 319
2 Related Work......Page 320
3 Proposed System Framework......Page 322
3.3 Cascade Encoder-Decoder Network......Page 323
4 Training of Proposed OBJECTNet......Page 324
5.1 Performance on Videos of CDnet-2014......Page 325
5.3 Computational Complexity Analysis......Page 327
References......Page 328
1 Introduction......Page 331
2.1 Images and Graphs......Page 332
2.2 Construction of Forest......Page 333
2.3 Proposed Segmentation Process......Page 334
3 Results and Discussions......Page 335
4 Conclusion......Page 339
References......Page 340
1 Introduction......Page 342
2 Related Work......Page 343
3 Feature Extraction......Page 344
3.2 Directional Distance Distribution......Page 345
3.3 Gabor......Page 346
3.4 DCT......Page 347
3.5 Zernike Moments......Page 348
4 Classification Methods......Page 349
4.2 k-Nearest Neighbor (k-NN)......Page 350
5 Performance Evaluation......Page 352
References......Page 353
1 Introduction......Page 356
2 Quaternion......Page 357
3 Modified Gray-Centered RGB Colour Cube......Page 359
4 Linear Quaternion Convolution (LQC)......Page 361
6 Results and Discussion......Page 362
7 Conclusion......Page 365
References......Page 366
Information Retrieval......Page 367
1 Introduction......Page 368
2 Retrieval of Documents......Page 369
2.1 Non-text Based Document Retrieval......Page 370
2.2 Text Based Document Retrieval......Page 372
2.3 Combined Query Formation and Retrieval......Page 374
3 Experimental Results......Page 375
References......Page 377
1 Introduction......Page 379
2.2 Local Ternary Co-occurrence Pattern (LTCoP)......Page 381
3.1 Directional Structure Transformed Pattern......Page 382
3.2 Proposed Transformed Directional Tri Concomitant Triplet Patterns (TdtCTp)......Page 383
3.3 Proposed System Framework......Page 384
4.1 Retrieval Accuracy on Corel Dataset......Page 385
4.2 Retrieval Accuracy on VIA/I-ELCAP Dataset......Page 386
5 Conclusion......Page 387
References......Page 388
1 Introduction......Page 390
2 Related Work......Page 391
4 Conventional Approach......Page 392
5 Proposed Method......Page 393
7 Training......Page 395
8 Results and Inferences......Page 397
References......Page 400
1.1 Motivation......Page 402
1.2 Related Work......Page 403
2.1 Residual Learning......Page 405
2.2 Index Matching and Image Retrieval......Page 406
3.1 Retrieval Accuracy on Corel-10K Dataset......Page 407
3.2 Retrieval Accuracy on GHIM-10K Dataset......Page 408
References......Page 409
Instance Based Learning......Page 412
1 Introduction......Page 413
2 Related Work......Page 415
3 Proposed Method......Page 416
3.1 BlobBag......Page 417
3.2 SpiderBag......Page 419
4 Evaluation......Page 420
References......Page 422
Machine Learning......Page 425
1 Introduction......Page 426
2 Related Work......Page 427
3.1 Dataset Description of Big Mart......Page 428
3.3 Data Cleaning......Page 429
3.5 Model Building......Page 430
4 Implementation and Results......Page 433
5 Conclusions......Page 435
References......Page 436
Author Index......Page 438


Neeta Nain · Santosh Kumar Vipparthi · Balasubramanian Raman (Eds.)

Communications in Computer and Information Science

1147

Computer Vision and Image Processing: 4th International Conference, CVIP 2019, Jaipur, India, September 27–29, 2019, Revised Selected Papers, Part I

Communications in Computer and Information Science

1147

Commenced Publication in 2007 Founding and Former Series Editors: Phoebe Chen, Alfredo Cuzzocrea, Xiaoyong Du, Orhun Kara, Ting Liu, Krishna M. Sivalingam, Dominik Ślęzak, Takashi Washio, Xiaokang Yang, and Junsong Yuan

Editorial Board Members
Simone Diniz Junqueira Barbosa – Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Rio de Janeiro, Brazil
Joaquim Filipe – Polytechnic Institute of Setúbal, Setúbal, Portugal
Ashish Ghosh – Indian Statistical Institute, Kolkata, India
Igor Kotenko – St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg, Russia
Lizhu Zhou – Tsinghua University, Beijing, China

More information about this series at http://www.springer.com/series/7899


Editors
Neeta Nain – Malaviya National Institute of Technology, Jaipur, Rajasthan, India
Santosh Kumar Vipparthi – Malaviya National Institute of Technology, Jaipur, Rajasthan, India
Balasubramanian Raman – Indian Institute of Technology Roorkee, Roorkee, Uttarakhand, India

ISSN 1865-0929; ISSN 1865-0937 (electronic)
Communications in Computer and Information Science
ISBN 978-981-15-4014-1; ISBN 978-981-15-4015-8 (eBook)
https://doi.org/10.1007/978-981-15-4015-8
© Springer Nature Singapore Pte Ltd. 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

This volume contains the papers from the 4th International Conference on Computer Vision and Image Processing (CVIP 2019). The event was endorsed by the International Association for Pattern Recognition (IAPR) and organized by Malaviya National Institute of Technology, Jaipur, during September 27–29, 2019. CVIP is a premier conference focused on image processing, video processing, and computer vision. The conference featured world-renowned speakers, technical workshops, and demonstrations.

CVIP 2019 acted as a major forum for the presentation of technological progress and research outcomes in the area of image processing and computer vision, serving as a platform for exchange between academia and industry. The selected papers come from around 202 original submissions by researchers based in several countries including South Korea, Norway, Malaysia, Iceland, Ethiopia, Canada, Bangladesh, India, and the USA. The highly diversified audience gave us the opportunity to achieve a good level of understanding of the mutual needs, requirements, and technical means available in this field of research.

The topics included in this edition of CVIP cover the following fields connected to computer vision and image processing: data acquisition and modeling, visualization and audio methods, sensors and actuators, data mining, image enhancement and restoration, segmentation, object detection and tracking, video analysis and summarization, biometrics and forensics, deep learning, document image analysis, remote sensing, multi-spectral and hyper-spectral image processing, etc.

All the accepted papers were double-blind peer reviewed by three qualified reviewers chosen from our Technical Committee based on their qualifications, areas of interest, and experience. The papers were evaluated on their relevance to CVIP 2019 tracks and topics, scientific correctness, and clarity of presentation. Selection was based on these reviews and on further recommendations by the Program Committee.

The editors of the current proceedings are very grateful and wish to thank the dedicated Technical Committee members and all the other reviewers for their valuable contributions, commitment, and enthusiastic support. We also thank CCIS at Springer for their trust and for publishing the proceedings of CVIP 2019.

September 2019

Neeta Nain Santosh Kumar Vipparthi Balasubramanian Raman

Organization

Organizing Committee
Neeta Nain – MNIT Jaipur, India
Santosh Kumar Vipparthi – MNIT Jaipur, India
Partha Pratim Roy – IIT Roorkee, India
Ananda Shankar Chowdhary – Jadavpur University, India

Program Committee
Balasubramanian Raman – IIT Roorkee, India
Sanjeev Kumar – IIT Roorkee, India
Arnav Bhaskar – IIT Mandi, India
Subramanyam Murala – IIT Ropar, India
Abhinav Dhall – IIT Ropar, India

International Advisory Committee
Uday Kumar R. Yaragatti – MNIT Jaipur, India
Anil K. Jain – Michigan State University, USA
Bidyut Baran Chaudhari – ISI Kolkata, India
Mohamed Abdel Mottaleb – University of Miami, USA
Mohan S. Kankanhalli – NUS, Singapore
Ajay Kumar – Hong Kong Poly University, Hong Kong
Ales Prochazka – Czech Technical University, Czech Republic
Andrea Kutics – ICU, Japan
Daniel P. Lopresti – Lehigh University, USA
Gian Luca Foresti – University of Udine, Italy
Jonathan Wu – University of Windsor, Canada
Josep Llados – University of Barcelona, Spain
Kokou Yetongnon – University of Burgundy, France
Koichi Kise – Osaka Prefecture University, Japan
Luigi Gallo – National Research Council, Italy
Slobodan Ribaric – University of Zagreb, Croatia
Umapada Pal – ISI Kolkata, India
Xiaoyi Jiang – University of Münster, Germany

Local Committee
Emmanuel S. Pilli – MNIT Jaipur, India
Dinesh Kumar Tyagi – MNIT Jaipur, India
Vijay Laxmi – MNIT Jaipur, India
Arka Prakash Mazumdar – MNIT Jaipur, India
Mushtaq Ahmed – MNIT Jaipur, India
Yogesh Kumar Meena – MNIT Jaipur, India
Satyendra Singh Chouhan – MNIT Jaipur, India
Mahipal Jadeja – MNIT Jaipur, India
Madhu Agarwal – MNIT Jaipur, India
Kuldeep Kumar – MNIT Jaipur, India
Prakash Choudhary – NIT Hamirpur, India
Maroti Deshmukh – NIT Uttarakhand, India
Subhash Panwar – GEC Bikaner, India
Tapas Badal – Bennett University, India
Sonu Lamba – MNIT Jaipur, India
Riti Kushwaha – MNIT Jaipur, India
Praveen Kumar Chandaliya – MNIT Jaipur, India
Rahul Palliwal – MNIT Jaipur, India
Kapil Mangal – MNIT Jaipur, India
Ravindra Kumar Soni – MNIT Jaipur, India
Gopal Behera – MNIT Jaipur, India
Sushil Kumar – MNIT Jaipur, India

Sponsors

Contents – Part I

Biometrics
Towards Ocular Recognition Through Local Image Descriptors – Ritesh Vyas, Tirupathiraju Kanumuri, Gyanendra Sheoran, and Pawan Dubey – 3

Computer Forensic
A Fast and Rigid Copy Move Forgery Detection Technique Using HDBSCAN – Shraddha Wankhade, Anuja Dixit, and Soumen Bag – 15

Computer Vision
Automated Industrial Quality Control of Pipe Stacks Using Computer Vision – Sayantan Chatterjee, Bidyut B. Chaudhuri, and Gora C. Nandi – 27
Asymmetric Wide Tele Camera Fusion for High Fidelity Digital Zoom – Sai Kumar Reddy Manne, B. H. Pawan Prasad, and K. S. Green Rosh – 39
Energy Based Convex Set Hyperspectral Endmember Extraction Algorithm – Dharambhai Shah and Tanish Zaveri – 51
Fast Semantic Feature Extraction Using Superpixels for Soft Segmentation – Shashikant Verma, Rajendra Nagar, and Shanmuganathan Raman – 61
Spatially Variant Laplacian Pyramids for Multi-frame Exposure Fusion – Anmol Biswas, K. S. Green Rosh, and Sachin Deepak Lomte – 73
Traffic Sign Recognition Using Color and Spatial Transformer Network on GPU Embedded Development Board – Bhaumik Vaidya and Chirag Paunwala – 82
Unsupervised Single-View Depth Estimation for Real Time Inference – Mohammed Arshad Siddiqui, Arpit Jain, Neha Gour, and Pritee Khanna – 94

Dimension Reduction
A Novel Information Theoretic Cost Measure for Filtering Based Feature Selection from Hyperspectral Images – Vikas Kookna, Ankit Kumar Singh, Agastya Raj, and Biplab Banerjee – 109

Healthcare Information Systems
CNN and RF Based Classification of Brain Tumors in MR Neurological Images – Vishlavath Saraswathi, Ankush D. Jamthikar, and Deep Gupta – 123
Tensor Based Dictionary Learning for Compressive Sensing MRI Reconstruction – Minha Mubarak, Thomas James Thomas, J. Sheeba Rani, and Deepak Mishra – 134
Nonparametric Vibration Based Damage Detection Technique for Structural Health Monitoring Using 1D CNN – Yash Sarawgi, Shivam Somani, Ayushmaan Chhabra, and Dhiraj – 146
Neural Network and SVM Based Kidney Stone Based Medical Image Classification – Priyanka Chak, Payal Navadiya, Bhavya Parikh, and Ketki C. Pathak – 158
Automatic Report Generation for Chest X-Ray Images: A Multilevel Multi-attention Approach – Gaurav O. Gajbhiye, Abhijeet V. Nandedkar, and Ibrahima Faye – 174

Image Processing
Medical Image Denoising Using Spline Based Fuzzy Wavelet Shrink Technique – Pranaba K. Mishro, Sanjay Agrawal, and Rutuparna Panda – 185
MBC-CA: Multithreshold Binary Conversion Based Salt-and-Pepper Noise Removal Using Cellular Automata – Parveen Kumar, Mohd Haroon Ansari, and Ambalika Sharma – 195
Image to CAD: Feature Extraction and Translation of Raster Image of CAD Drawing to DXF CAD Format – Aditya Intwala – 205
Non-uniform Deblurring from Blurry/Noisy Image Pairs – P. L. Deepa and C. V. Jiji – 216
An Effective Video Bootleg Detection Algorithm Based on Noise Analysis in Frequency Domain – Preeti Mehta, Sushila Maheshkar, and Vikas Maheshkar – 227
A Novel Approach for Non Uniformity Correction in IR Focal Plane Arrays – Nikhil Kumar, Meenakshi Massey, and Neeta Kandpal – 239
Calibration of Depth Map Using a Novel Target – Sandip Paul, Deepak Mishra, and M. Senthil – 248

Image Segmentation
Optical Flow Based Background Subtraction Method for Lung Nodule Segmentation – R. Jenkin Suji, Sarita Singh Bhadouria, Joydip Dhar, and W. Wilfred Godfrey – 261
A Method to Generate Synthetically Warped Document Image – Arpan Garai, Samit Biswas, Sekhar Mandal, and Bidyut B. Chaudhuri – 270
Delaunay Triangulation Based Thinning Algorithm for Alphabet Images – Philumon Joseph, Binsu C. Kovoor, and Job Thomas – 281
A Reduced Graph Cut Approach to Interactive Object Segmentation with Flexible User Input – Priyambada Subudhi, Bhanu Pratap Prabhakar, and Susanta Mukhopadhyay – 291
A New Fuzzy Clustering Algorithm by Incorporating Constrained Class Uncertainty-Based Entropy for Brain MR Image Segmentation – Nabanita Mahata and Jamuna Kanta Sing – 301
A Novel Saliency-Based Cascaded Approach for Moving Object Segmentation – Prashant W. Patil, Akshay Dudhane, Subrahmanyam Murala, and Anil B. Gonde – 311
A Novel Graph Theoretic Image Segmentation Technique – Sushmita Chandel and Gaurav Bhatnagar – 323
Extraction and Recognition of Numerals from Machine-Printed Urdu Documents – Harmohan Sharma, Dharam Veer Sharma, G. S. Lehal, and Ankur Rana – 334
Colour Sensitive Image Segmentation Using Quaternion Algebra – Sandip Kumar Maity and Prabir Biswas – 348

Information Retrieval
Multimodal Query Based Approach for Document Image Retrieval – Amit V. Nandedkar and Abhijeet V. Nandedkar – 361
Transformed Directional Tri Concomitant Triplet Patterns for Image Retrieval – Chesti Altaff Hussain, D. Venkata Rao, and S. Aruna Mastani – 372
Encoder Decoder Based Image Semantic Space Creation for Clothing Items Retrieval – Keshav Kumar Kedia, Gaurav Kumar Jain, and Vipul Grover – 383
Feature Learning for Effective Content-Based Image Retrieval – Snehal Marab and Meenakshi Pawar – 395

Instance Based Learning
Two Efficient Image Bag Generators for Multi-instance Multi-label Learning – P. K. Bhagat, Prakash Choudhary, and Kh Manglem Singh – 407

Machine Learning
A Comparative Study of Big Mart Sales Prediction – Gopal Behera and Neeta Nain – 421

Author Index – 433

Contents – Part II

Neural Network
Denoising Images with Varying Noises Using Autoencoders – Snigdha Agarwal, Ayushi Agarwal, and Maroti Deshmukh – 3
Image Aesthetics Assessment Using Multi Channel Convolutional Neural Networks – Nishi Doshi, Gitam Shikkenawis, and Suman K. Mitra – 15
Profession Identification Using Handwritten Text Images – Parveen Kumar, Manu Gupta, Mayank Gupta, and Ambalika Sharma – 25
A Study on Deep Learning for Breast Cancer Detection in Histopathological Images – Oinam Vivek Singh, Prakash Choudhary, and Khelchandra Thongam – 36
Face Presentation Attack Detection Using Multi-classifier Fusion of Off-the-Shelf Deep Features – Raghavendra Ramachandra, Jag Mohan Singh, Sushma Venkatesh, Kiran Raja, and Christoph Busch – 49
Vision-Based Malware Detection and Classification Using Lightweight Deep Learning Paradigm – S. Abijah Roseline, G. Hari, S. Geetha, and R. Krishnamurthy – 62
A Deep Neural Network Classifier Based on Belief Theory – Minny George and Praveen Sankaran – 74
Real-Time Driver Drowsiness Detection Using Deep Learning and Heterogeneous Computing on Embedded System – Shivam Khare, Sandeep Palakkal, T. V. Hari Krishnan, Chanwon Seo, Yehoon Kim, Sojung Yun, and Sankaranarayanan Parameswaran – 86
A Comparative Analysis for Various Stroke Prediction Techniques – M. Sheetal Singh, Prakash Choudhary, and Khelchandra Thongam – 98
A Convolutional Fuzzy Min-Max Neural Network for Image Classification – Trupti R. Chavan and Abhijeet V. Nandedkar – 107
Anomalous Event Detection and Localization Using Stacked Autoencoder – Suprit D. Bansod and Abhijeet V. Nandedkar – 117
Kernel Variants of Extended Locality Preserving Projection – Pranjal Bhatt, Sujata, and Suman K. Mitra – 130
DNN Based Adaptive Video Streaming Using Combination of Supervised Learning and Reinforcement Learning – Karan Rakesh, Luckraj Shrawan Kumar, Rishabh Mittar, Prasenjit Chakraborty, P. A. Ankush, and Sai Krishna Gairuboina – 143
A Deep Convolutional Neural Network Based Approach to Extract and Apply Photographic Transformations – Mrinmoy Sen and Prasenjit Chakraborty – 155
Video Based Deception Detection Using Deep Recurrent Convolutional Neural Network – Sushma Venkatesh, Raghavendra Ramachandra, and Patrick Bours – 163
Deep Demosaicing Using ResNet-Bottleneck Architecture – Divakar Verma, Manish Kumar, and Srinivas Eregala – 170
Psychological Stress Detection Using Deep Convolutional Neural Networks – Kaushik Sardeshpande and Vijaya R. Thool – 180
Video Colorization Using CNNs and Keyframes Extraction: An Application in Saving Bandwidth – Ankur Singh, Anurag Chanani, and Harish Karnick – 190
Image Compression for Constrained Aerial Platforms: A Unified Framework of Laplacian and cGAN – A. G. J. Faheema, A. Lakshmi, and Sreedevi Priyanka – 199
Multi-frame and Multi-scale Conditional Generative Adversarial Networks for Efficient Foreground Extraction – Himansu Didwania, Subhankar Ghatak, and Suvendu Rup – 211
Ink Analysis Using CNN-Based Transfer Learning to Detect Alteration in Handwritten Words – Prabhat Dansena, Rahul Pramanik, Soumen Bag, and Rajarshi Pal – 223
Ensemble Methods on Weak Classifiers for Improved Driver Distraction Detection – A. Swetha, Megha Sharma, Sai Venkatesh Sunkara, Varsha J. Kattampally, V. M. Muralikrishna, and Praveen Sankaran – 233
DeepRNNetSeg: Deep Residual Neural Network for Nuclei Segmentation on Breast Cancer Histopathological Images – Mahesh Gour, Sweta Jain, and Raghav Agrawal – 243
Classification of Breast Tissue Density – Kanchan Lata Kashyap, Manish Kumar Bajpai, and Pritee Khanna – 254
Extreme Weather Prediction Using 2-Phase Deep Learning Pipeline – Vidhey Oza, Yash Thesia, Dhananjay Rasalia, Priyank Thakkar, Nitant Dube, and Sanjay Garg – 266
Deep Hybrid Neural Networks for Facial Expression Classification – Aakash Babasaheb Jadhav, Sairaj Laxman Burewar, Ajay Ashokrao Waghumbare, and Anil Balaji Gonde – 283
SCDAE: Ethnicity and Gender Alteration on CLF and UTKFace Dataset – Praveen Kumar Chandaliya, Vardhman Kumar, Mayank Harjani, and Neeta Nain – 294
Manipuri Handwritten Character Recognition by Convolutional Neural Network – Sanasam Inunganbi, Prakash Choudhary, and Khumanthem Manglem – 307
Design and Implementation of Human Safeguard Measure Using Separable Convolutional Neural Network Approach – R. Vaitheeshwari, V. Sathiesh Kumar, and S. Anubha Pearline – 319
Tackling Multiple Visual Artifacts: Blind Image Restoration Using Conditional Adversarial Networks – M. Anand, A. Ashwin Natraj, V. Jeya Maria Jose, K. Subramanian, Priyanka Bhardwaj, R. Pandeeswari, and S. Deivalakshmi – 331
Two-Stream CNN Architecture for Anomalous Event Detection in Real World Scenarios – Snehashis Majhi, Ratnakar Dash, and Pankaj Kumar Sa – 343
3D CNN with Localized Residual Connections for Hyperspectral Image Classification – Shivangi Dwivedi, Murari Mandal, Shekhar Yadav, and Santosh Kumar Vipparthi – 354
A Novel Approach for False Positive Reduction in Breast Cancer Detection – Mayuresh Shingan, Meenakshi Pawar, and S. Talbar – 364
Classification of Effusion and Cartilage Erosion Affects in Osteoarthritis Knee MRI Images Using Deep Learning Model – Pankaj Pratap Singh, Shitala Prasad, Anil Kumar Chaudhary, Chandan Kumar Patel, and Manisha Debnath – 373

Object Detection
A High Precision and High Recall Face Detector for Equi-Rectangular Images – Ankit Dhiman and Praveen Agrawal – 387
Real-Time Ear Landmark Detection Using Ensemble of Regression Trees – Hitesh Gupta, Srishti Goel, Riya Sharma, and Raghavendra Kalose Mathsyendranath – 398

Object Recognition
A New Hybrid Architecture for Real-Time Detection of Emergency Vehicles – Eshwar Prithvi Jonnadula and Pabitra Mohan Khilar – 413
Speed Prediction of Fast Approaching Vehicle Using Moving Camera – Hutesh Kumar Gauttam and Ramesh Kumar Mohapatra – 423
Improved Performance of Visual Concept Detection in Images Using Bagging Approach with Support Vector Machines – Sanjay M. Patil and Kishor K. Bhoyar – 432
FaceID: Verification of Face in Selfie and ID Document – Rahul Paliwal, Shalini Yadav, and Neeta Nain – 443

Online Handwriting Recognition
A Benchmark Dataset of Online Handwritten Gurmukhi Script Words and Numerals – Harjeet Singh, R. K. Sharma, Rajesh Kumar, Karun Verma, Ravinder Kumar, and Munish Kumar – 457

Optical Character Recognition
Targeted Optical Character Recognition: Classification Using Capsule Network – Pratik Prajapati, Shaival Thakkar, and Ketul Shah – 469

Security and Privacy
An Edge-Based Image Steganography Method Using Modulus-3 Strategy and Comparative Analysis – Santosh Kumar Tripathy and Rajeev Srivastava – 485
Multi-level Threat Analysis in Anomalous Crowd Videos – Arindam Sikdar and Ananda S. Chowdhury – 495

Unsupervised Clustering
Discovering Cricket Stroke Classes in Trimmed Telecast Videos – Arpan Gupta, Ashish Karel, and M. Sakthi Balan – 509

Author Index – 521

Biometrics

Towards Ocular Recognition Through Local Image Descriptors

Ritesh Vyas (1), Tirupathiraju Kanumuri (2), Gyanendra Sheoran (2), and Pawan Dubey (3)

(1) Bennett University, Greater Noida, UP, India; [email protected]
(2) National Institute of Technology Delhi, Delhi, India
(3) Accendere Knowledge Management Services, Surampalem, AP, India

Abstract. Iris and periocular (collectively termed ocular) biometric modalities have been among the most sought-after modalities, owing to their strong discrimination ability. Moreover, both modalities can be captured through a single acquisition setup, leading to ease of user interaction. Owing to these advantages, this paper presents two local descriptors, namely a statistical and a transform-based descriptor, to investigate the worthiness of ocular recognition. The first descriptor uses mean and variance formulae after two levels of partitioning to extract the distinctive features, whereas the second descriptor comprises the implementation of the curvelet transform, followed by polynomial fitting, to extract the features. Individual iris and periocular features are combined through a simple concatenation operation. The experiments performed on the challenging Cross-Eyed database vindicate the efficacy of both the employed descriptors in same-spectral as well as cross-spectral matching scenarios.

Keywords: Ocular recognition · Iris recognition · Periocular recognition · Feature extraction

1

· Iris recognition · Periocular

Introduction

Ocular region has shown interesting results when employed in biometric recognition systems. This region comprises of the facial region in the vicinity of human eye and the eye itself [13]. The eye further contains a unique textural strip, called as iris [7], to aid the recognition accuracy. Therefore, the biometric research fraternity has shown great zest towards ocular recognition. Besides, ocular recognition can perform well in situations where iris image quality is less than acceptable, e.g. partial iris, occluded iris, and off-angle iris. The inclusion of periocular features along with iris features can increase the performance of recognition system manifold, at no extra cost. Reason being that both iris and periocular regions can be captured through the same acquisition setup [1,2]. Additionally, acquisition of periocular images require less user cooperation [10]. Periocular recognition can also outperform face recognition in c Springer Nature Singapore Pte Ltd. 2020  N. Nain et al. (Eds.): CVIP 2019, CCIS 1147, pp. 3–12, 2020. https://doi.org/10.1007/978-981-15-4015-8_1



situations where partial face region is masked [22], e.g. surgical masks, masks to protect from sun and heat etc. In the past, many authors have attempted to combine the iris and periocular information to obtain increased accuracies. Woodard et al. [19] employed IrisCode [6] and local binary pattern (LBP) [9] for characterizing iris and periocular features, respectively. They obtained significant improvement in performance through the score level fusion of iris and periocular features. Joshi et al. [8] utilized wavelets and LBP for extraction of iris and periocular features, respectively, followed by feature-level fusion of the normalized features and dimensionality reduction through direct linear discriminant analysis (DLDA). Raja et al. [12] demonstrated significant improvement in recognition performance through extraction of binary statistical image features (BSIF) features from iris and periocular, followed by their potential fusion. Raghavendra et al. [11] proposed extraction of iris and periocular features through sparse reconstruction classifier (SRC) and LBP, respectively. Finally, the scores from both iris and periocular were fused through weighted-sum approach.

[Fig. 1 block diagram: VW and NIR periocular and iris images are passed through the feature descriptors; the resulting periocular features (feat_p,v / feat_p,n) and iris features (feat_i,v / feat_i,n) are concatenated into ocular features (feat_o,v / feat_o,n), which are matched against the database for periocular, iris, and ocular recognition under same-spectral (VW-VW, NIR-NIR) and cross-spectral (VW-NIR) protocols.]

Fig. 1. Schematic diagram of the proposed work

In this paper, a comprehensive evaluation of ocular recognition framework is carried out through two local feature descriptors. Both the employed descriptors cover two different aspects of ocular features. The first descriptor utilizes statistical information of the Gabor filtered input images, whereas second descriptor



investigates the complex surfaces of the image in frequency transformed domain through the use of curvelet transform. The benchmark Cross-Eyed database is chosen for the experimentation, as this database provides separate iris and periocular images, that too with pixel-to-pixel correspondence between near infrared (NIR) and visible wavelength (VW) images. Moreover, the periocular images from this database do not possess any iris information, because the complete eye region is masked in these images. Therefore, this database becomes a natural choice for vast evaluation of individual iris/periocular images. Nevertheless, the ocular recognition framework is satisfied by combining the individual iris and periocular features through simple concatenation. Furthermore, the evaluation protocol is made more challenging by considering the same-spectral and crossspectral matching scenarios. It is also shown that the proposed descriptors when employed for ocular recognition can perform better even in the challenging crossspectral matching scenario [15], as compared to the individual iris/periocular recognition framework. The overall self-explanatory schematic diagram of the proposed work is illustrated in Fig. 1. The rest of the paper is organized as follows. Section 2 provides the fundamental details about the employed feature descriptors. While, Sect. 3 details about the used database and the results obtained from both the employed feature descriptors. Finally, the paper is concluded in Sect. 4.

2

Feature Extraction

This section describes the two local image descriptors employed in this work for the ocular recognition framework. The first descriptor employs the statistical information from the image partitions to store the variations at local levels, while the second descriptor employs the complex curvelet transform and two-dimensional (2D) polynomial fitting to extract the features. It is to be noted that all periocular images are resized to 64 × 72, for speeding up the feature extraction process. On the other hand, iris images are segmented through the method suggested by Zhao and Kumar [21], and normalized using the Daugman's rubber sheet model [5].
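The rubber-sheet normalization mentioned above maps the annular iris region onto a fixed-size rectangular strip. The sketch below is only an illustration of that remapping (not the authors' code); the circle parameters and the output resolution are placeholders, and nearest-neighbour sampling is assumed.

```python
import numpy as np

def rubber_sheet(iris_img, pupil, limbus, radial_res=64, angular_res=512):
    """Daugman-style rubber sheet unwrapping: sample the annulus between the
    pupil circle (xp, yp, rp) and the limbus circle (xl, yl, rl) onto a
    radial_res x angular_res rectangle."""
    xp, yp, rp = pupil
    xl, yl, rl = limbus
    theta = np.linspace(0, 2 * np.pi, angular_res, endpoint=False)
    r = np.linspace(0, 1, radial_res)
    # boundary points of the inner (pupil) and outer (limbus) circles for every angle
    x_in, y_in = xp + rp * np.cos(theta), yp + rp * np.sin(theta)
    x_out, y_out = xl + rl * np.cos(theta), yl + rl * np.sin(theta)
    # linear interpolation between the two boundaries (the rubber sheet model)
    xs = (1 - r)[:, None] * x_in[None, :] + r[:, None] * x_out[None, :]
    ys = (1 - r)[:, None] * y_in[None, :] + r[:, None] * y_out[None, :]
    h, w = iris_img.shape
    xs = np.clip(np.round(xs).astype(int), 0, w - 1)
    ys = np.clip(np.round(ys).astype(int), 0, h - 1)
    return iris_img[ys, xs]
```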

2.1 Statistical Features

This descriptor commences from the filtering of the input image through a multi-scale and multi-resolution 2D Gabor filter bank, Z(p, q), as depicted in Eq. (1) below, where p, q are the filter selection parameters. Each individual filter is represented by ζ(u, v, s_p, f_p, φ_q), where u, v indicate the kernel size and (s, f, φ) are the scale, frequency, and orientation, respectively. This filtering facilitates the revealing of wide-ranging texture present in the iris/periocular image [16]. Afterwards, the filtered image is partitioned at two distinct levels, leading to macro and micro image sub-blocks. This partitioning for a sample periocular image is illustrated in Fig. 2. Notably, this work employs uniform partitioning at both levels.



Z(p, q) = \zeta(u, v, s_p, f_p, \phi_q)
\zeta(u, v, s_p, f_p, \phi_q) = \frac{1}{\sqrt{2\pi s_p^2}} \exp\left(-\frac{u^2 + v^2}{2 s_p^2}\right) \exp\left(2\pi\iota f_p (u\cos\phi_q + v\sin\phi_q)\right)    (1)
s_p = 1.9863 \times (\sqrt{2})^p, \quad f_p = \frac{0.2592}{(\sqrt{2})^p}, \quad p = 0, 1, \ldots, 4; \qquad \phi_q = \frac{q\pi}{4}, \quad q = 0, 1, \ldots, 3
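As a rough illustration of Eq. (1), the following sketch builds the 5-scale by 4-orientation (20-filter) complex Gabor bank and computes magnitude responses; the kernel size (15 x 15) and the exact normalization constant are assumptions, not taken from the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(size, s, f, phi):
    """Complex 2D Gabor kernel of Eq. (1): Gaussian envelope times complex sinusoid."""
    half = size // 2
    v, u = np.meshgrid(np.arange(-half, half + 1), np.arange(-half, half + 1))
    envelope = np.exp(-(u ** 2 + v ** 2) / (2 * s ** 2)) / np.sqrt(2 * np.pi * s ** 2)
    carrier = np.exp(2j * np.pi * f * (u * np.cos(phi) + v * np.sin(phi)))
    return envelope * carrier

def gabor_bank(size=15):
    """Five scales x four orientations = 20 filters, with s_p, f_p, phi_q as in Eq. (1)."""
    bank = []
    for p in range(5):
        s_p = 1.9863 * (np.sqrt(2) ** p)
        f_p = 0.2592 / (np.sqrt(2) ** p)
        for q in range(4):
            bank.append(gabor_kernel(size, s_p, f_p, q * np.pi / 4))
    return bank

def filter_responses(img, bank):
    """Magnitude responses of the image to every filter in the bank."""
    return [np.abs(fftconvolve(img, k, mode='same')) for k in bank]
```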

First-level partitioning

Second-level partitioning

Fig. 2. Partitioning scheme for extracting statistical features

Subsequently, the standard deviation of each sub-block is calculated using the statistical formulae [3, 17] and the absolute difference of standard deviation from the residing sub-blocks is stored as the feature. This is repeated for all the filtered images, to generate the complete feature vector. Reader may refer to Vyas et al. [17] for more legibility. As observed from Fig. 2, there are 32 partitions at the second level of partitioning, leading to 32 absolute differences. And, when the process is repeated for all 20 filters in the filter bank, the dimension of feature vector becomes 1 × 640.
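A minimal sketch of this statistical feature, under two assumptions: the two-level uniform partitioning is taken as an 8-block first level further split into 4 sub-blocks each (giving the 32 second-level partitions mentioned above), and "absolute difference of standard deviation from the residing sub-blocks" is read as the difference between a second-level sub-block and the first-level block it resides in. Neither choice is confirmed by the paper.

```python
import numpy as np

def block_std_features(filtered, first_grid=(2, 4), second_grid=(2, 2)):
    """Two-level uniform partitioning of one Gabor-filtered image; for every
    second-level sub-block, store |std(sub-block) - std(parent block)|."""
    h, w = filtered.shape
    bh, bw = h // first_grid[0], w // first_grid[1]
    feats = []
    for i in range(first_grid[0]):
        for j in range(first_grid[1]):
            block = filtered[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            parent_std = block.std()
            sh, sw = bh // second_grid[0], bw // second_grid[1]
            for m in range(second_grid[0]):
                for n in range(second_grid[1]):
                    sub = block[m * sh:(m + 1) * sh, n * sw:(n + 1) * sw]
                    feats.append(abs(sub.std() - parent_std))
    return np.array(feats)  # 32 values per filtered image
```

Stacking the 32 values from each of the 20 filter responses gives the 1 x 640 descriptor described above.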

2.2 Transform-Based Features

This descriptor emerges from the spectrum coverage capability of the curvelet transform, which sometimes makes it a better choice than the Gabor filter [20]. Moreover, availability of fast discrete curvelet transform (FDCT) [4] implementation techniques makes it intriguing to apply the curvelet transform without facing major difficulties. Curvelet transform is a complex transform, which results in complex sub-bands. These complex sub-bands can be represented through the coefficients of fitted polynomials [18]. This work employs fitting of 2D polynomials of degree 3, in order to best represent the complex surfaces of curvelet sub-bands. Such polynomials can further be mathematically expressed as in Eq. (2).

p(x, y) = c_{00} + c_{10}x + c_{01}y + c_{20}x^2 + c_{11}xy + c_{02}y^2 + c_{30}x^3 + c_{21}x^2 y + c_{12}xy^2 + c_{03}y^3    (2)



It is important to note here that 2D polynomial of degree 3 yields 10 coefficients (i.e. c00 , c10 , · · · c03 ) for each fitted sub-band. Further, curvelet transform is applied upto four levels, leading to 50 sub-bands. Hence, the dimension of feature vector in this transform-based descriptor becomes 1 × 500. Performances of both the employed descriptors are discussed in the next section.
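The sketch below shows only the fitting step of Eq. (2) as a least-squares problem yielding the 10 coefficients per sub-band; the curvelet sub-bands themselves are assumed to come from an FDCT implementation such as the wrapping-based transform of [4], and reducing a complex sub-band to its magnitude before fitting is an assumption (real and imaginary parts could equally be fitted separately).

```python
import numpy as np

def poly3_coefficients(subband):
    """Fit the degree-3 2D polynomial of Eq. (2) to a curvelet sub-band surface
    by least squares and return its 10 coefficients."""
    z = np.abs(subband).ravel()
    h, w = subband.shape
    y, x = np.mgrid[0:h, 0:w]
    x, y = x.ravel().astype(float), y.ravel().astype(float)
    # design matrix with the monomials of Eq. (2):
    # 1, x, y, x^2, xy, y^2, x^3, x^2 y, x y^2, y^3
    A = np.column_stack([np.ones_like(x), x, y,
                         x**2, x*y, y**2,
                         x**3, x**2*y, x*y**2, y**3])
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
    return coeffs  # 10 values per sub-band; 50 sub-bands -> 1 x 500 descriptor
```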

3

Experimental Setup and Results

All experiments are carried out on Cross-Eyed database [14]. This database comprises of iris and periocular images from 120 subjects with diverse ethnicities. Each subject has contributed eight images from each of the eyes. The database provides registered VW and NIR iris/periocular images, which have pixel-topixel correspondence among themselves. The periocular images are provided with the eye region masked, so that the iris information would not come into picture while extracting the periocular features [10]. In total, the Cross-Eyed database provides 3840 images for each of iris and periocular, including 960 images each from VW and NIR-based left and right eye imaging frameworks. In this work, 500 iris/periocular images from 100 classes (or 50 subjects) have been selected for experimentation. The image resolutions of periocular and iris images are 900 × 800 and 400 × 300, respectively. Figure 3 shows some sample iris and periocular images from the Cross-Eyed database. Results of the employed statistical and transform-based feature descriptors are presented in terms of important performance metrics like equal error rate (EER), genuine acceptance rate (GAR) and discriminative index (DI). Throughout this discussion, GAR is observed at false acceptance rate (FAR) of 0.01.
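For reference, the verification metrics reported below can be computed from genuine and impostor score sets as in the hedged sketch that follows; it assumes higher scores mean better matches and that DI is the usual decidability index |mu_g - mu_i| / sqrt((sigma_g^2 + sigma_i^2)/2), which the paper does not spell out.

```python
import numpy as np

def verification_metrics(genuine, impostor):
    """EER, GAR at FAR = 0.01, and decidability/discriminative index (DI)
    from genuine and impostor similarity scores."""
    genuine, impostor = np.asarray(genuine, float), np.asarray(impostor, float)
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false accept rate
    gar = np.array([(genuine >= t).mean() for t in thresholds])   # genuine accept rate
    frr = 1.0 - gar
    i = np.argmin(np.abs(far - frr))
    eer = (far[i] + frr[i]) / 2.0
    gar_at_far = gar[far <= 0.01].max() if np.any(far <= 0.01) else 0.0
    di = abs(genuine.mean() - impostor.mean()) / np.sqrt((genuine.var() + impostor.var()) / 2.0)
    return eer, gar_at_far, di
```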

Fig. 3. Sample images, (a) VW Periocular, (b) NIR Periocular, (c) VW Iris, (d) NIR Iris

Furthermore, the performance metrics for ocular (iris + periocular) recognition are also indicated, in order to showcase a clear comparison between individual iris/periocular and combination of both. It is to be noted that ocular recognition is achieved through mere concatenation of individual iris and periocular features. This concatenation may lead to a slight increase in the feature dimension. However, the improvements achieved through concatenation of features are notably significant. The results are further explored for two different types of matching, namely same-spectral and cross-spectral matching. The same-spectral



matching further comprises VW-VW and NIR-NIR matching, whereas cross-spectral matching includes VW-NIR matching. An all-to-all matching protocol is adopted to showcase the more robust performance metrics.

3.1 Results of Statistical Feature Descriptor

The statistical feature descriptor extracts the features through two-level partitioning and simple mean and variance calculations from the image sub-blocks. The dimension of the feature vector using this descriptor turns out to be 640. The numerical results and the ROC curves for the statistical feature descriptor are illustrated in Table 1 and Fig. 4. The table clearly depicts that performance of the statistical feature descriptor is relatively inferior for cross-spectral matching, when compared with the same-spectral matching.

[Fig. 4 plots: three ROC curves (GAR versus FAR, with FAR on a logarithmic scale from 10^-4 to 10^0) comparing periocular, iris, and ocular matching for the statistical descriptor.]

Fig. 4. ROC curves for statistical features, (top-left) VW-VW matching, (top-right) NIR-NIR matching, (bottom) VW-NIR matching



Table 1. Performance metrics for statistical features (GAR reported at FAR = 0.01)

Trait      | VW-VW (same spectral)    | NIR-NIR (same spectral)  | VW-NIR (cross-spectral)
           | EER (%)  GAR (%)  DI     | EER (%)  GAR (%)  DI     | EER (%)  GAR (%)  DI
Periocular | 8.47     81.77    2.9196 | 7.87     81.18    3.0692 | 16.42    29.90    1.9123
Iris       | 10.46    82.40    2.5265 | 11.58    77.60    1.8628 | 15.81    44.08    1.5215
Ocular     | 4.45     92.87    3.6104 | 5.79     90.00    3.2708 | 10.88    54.16    2.2945

Table 2. Performance metrics for transform-based features Matching Same spectral

Cross-spectral

VW-VW Trait

Iris Ocular

VW-NIR

EER (%) GAR (%) DI (@FAR = 0.01)

EER (%) GAR (%) DI (@FAR=0.01)

Metrics EER (%) GAR (%) DI (@FAR = 0.01)

Periocular

NIR-NIR

8.08

76.30

2.7379

8.83

76.68

2.6893 18.72

39.92

16.31

53.73

1.9322 18.18

46.29

1.5323 36.36

08.80

0.6502

4.87

88.80

3.1914

84.93

3.0244 16.92

41.91

1.8690

6.36

1.7354

Regarding the same-spectral matching, the EERs for visible light based periocular and iris recognition are 8.47% and 10.46%, respectively. This relates with the challenges posed by iris images of the employed database. However, if the iris and periocular features are further combined (through concatenation), the EER turns out to be 4.45%. Similarly, the GAR is also increased from 81.77% and 82.40% for individual periocular and iris recognition, respectively, to 92.87% for ocular recognition. Same sort of improvement can also be apprehended from the corresponding ROC curves shown in Fig. 4(top-left), where the curve for ocular recognition is at par with that of iris and periocular recognition. On the other hand, for NIR-based same-spectral matching, the error rates (EERs) for periocular and iris recognition are 7.87% and 11.58%, respectively. Whereas, the corresponding recognition rates (GARs) are 81.18% and 77.60%. Nevertheless, combination of iris and periocular features (leading to ocular recognition) leads to considerable improvements, with EER falling to 5.79% and GAR raising to 90.00%. The same is reflected through the ROC curves drawn in Fig. 4(top-right). The discriminative index (DI) also possess similar pattern for the same-spectral matching, which can be followed from Table 1. Coming to cross-spectral (VW-NIR) matching, performance of all three recognition frameworks (iris, periocular and ocular) deteriorate notably. On one side, EERs for periocular and iris recognition are observed to be 16.42% and 15.81%, respectively. Whereas, on the other side, EER of ocular recognition falls to 10.88% in cross-spectral matching. The GARs of all three recognition frameworks also descend to 29.90%, 44.08% and 54.16%, respectively. However, ocular recognition is still a better choice when compared with iris/periocular recognition. This fact can be supported by the ROC curves illustrated in Fig. 4(bottom).



Overall, the statistical feature descriptor performs better for periocular recognition than the iris recognition. This pertains to the underlying partitioning of images, which reveals the local-level structure of the periocular images more clearly than that of iris texture. 3.2

Results of Transform-Based Feature Descriptor

1

1

0.9

0.9

0.8

0.8

0.7

0.7

0.6

0.6

GAR

GAR

The transform-based features are derived from curvelet transform and polynomial fitting functionality. Owing to the rotation invariant nature of polynomial fitting, these features are proved to be more suitable for periocular recognition than iris recognition. This fact can also be observed from the performance metrics presented in Table 2. This table shows the performance metrics for sameand cross-spectral matching scenarios for iris, periocular and ocular features.

0.5

0.5

0.4

0.4

0.3

0.3

0.2

0.2 Periocular Iris Ocular

0.1 0

10-4

10-3

10-2

10-1

Periocular Iris Ocular

0.1 0

100

10-4

10-3

FAR

10-2

10-1

100

FAR

1 0.9 0.8 0.7

GAR

0.6 0.5 0.4 0.3 0.2 Periocular Iris Ocular

0.1 0

10-4

10-3

10-2

10-1

100

FAR

Fig. 5. ROC curves for transform-based features, (top-left) VW-VW matching, (topright) NIR-NIR matching, (bottom) VW-NIR matching

Observing the same-spectral matching parameters first, EERs for periocular and iris recognition in VW based matching are 8.08% and 16.31%, respectively. Whereas, the GARs are found to be 76.30% and 53.73%, respectively. Hence, this feature descriptor performs poorly when dealing with iris images alone. However,



when the iris and periocular features are combined to test its performance in ocular recognition, the EER and GAR improve to 4.87% and 88.80%, respectively. The corresponding ROC curves are also shown in Fig. 5(top-left). Similarly, the genuine and imposter score distribution also become well-separated for ocular recognition, as is evident from the DI of 3.1914 in this case. There is almost a similar observation with matching of NIR images with these transform-based features. The EERs for all three types of recognition in NIR based matching are observed to be 8.83%, 18.18% and 6.36%, respectively. The performance can also be tracked through the ROC curves shown in Fig. 5(top-right). Regarding the cross-spectral matching scenario, the proposed feature descriptor performs poorly, with EERs of 18.72%, 36.36% and 16.92% for periocular, iris and ocular recognition, respectively. Therefore, it can be stated that proposed descriptor, when implemented for extracting combined ocular features, can perform better than that of individual periocular and iris modality. However, GAR for ocular recognition in cross-spectral matching scenario (41.91%) is sill lower than the acceptable level. The ROC curves corresponding to cross-spectral matching through transform-based descriptor are depicted in Fig. 5(bottom).

4

Conclusion

This paper presents a comprehensive evaluation of ocular biometric recognition. The ocular features are extracted from the individual iris and periocular images and combined into one feature vector. Both feature vectors are examined under same-spectral and cross-spectral matching scenarios, providing a broad analysis. Both descriptors yield good results for periocular images (EERs ranging between 7–9%), as compared to iris images (EERs ranging between 10–19%). This can be attributed to the large feature area and more topical features present in the periocular images, whereas the adverse effects of specular reflections and eyelid occlusion hinder the performance of iris recognition. Besides, the non-correlation between features in cross-spectral matching is also responsible for the deteriorating performance of both descriptors. However, when the features of the iris and periocular images are combined, the EER improves to 10–17%, which otherwise remains in the range of 16–37%, for both descriptors. Hence, ocular recognition can be used efficiently to overcome the limitations of the individual iris or periocular modality.

References
1. Alonso-Fernandez, F., Bigun, J.: A survey on periocular biometrics research. Pattern Recogn. Lett. 82, 92–105 (2016)
2. Alonso-Fernandez, F., Mikaelyan, A., Bigun, J.: Comparison and fusion of multiple iris and periocular matchers using near-infrared and visible images. In: 3rd International Workshop on Biometrics and Forensics (IWBF), pp. 1–6 (2015)
3. Arivazhagan, S., Ganesan, L., Priyal, S.P.: Texture classification using Gabor wavelets based rotation invariant features. Pattern Recogn. Lett. 27(16), 1976–1982 (2006)
4. Candès, E., Demanet, L., Donoho, D., Ying, L.: Fast discrete curvelet transforms. Multiscale Model. Simul. 5(3), 861–899 (2006)
5. Daugman, J.: The importance of being random: statistical principles of iris recognition. Pattern Recogn. 36(2), 279–291 (2003)
6. Daugman, J.: How iris recognition works. IEEE Trans. Circuits Syst. Video Technol. 14(1), 21–30 (2004)
7. Daugman, J.G.: High confidence visual recognition of persons by a test of statistical independence. IEEE Trans. Pattern Anal. Mach. Intell. 15(11), 1148–1161 (1993)
8. Joshi, A., Gangwar, A.K., Saquib, Z.: Person recognition based on fusion of iris and periocular biometrics. In: Proceedings of the 2012 12th International Conference on Hybrid Intelligent Systems (HIS), pp. 57–62 (2012)
9. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 971–987 (2002)
10. Park, U., Jillela, R.R., Ross, A., Jain, A.K.: Periocular biometrics in the visible spectrum. IEEE Trans. Inf. Forensics Secur. 6(1), 96–106 (2011)
11. Raghavendra, R., Raja, K.B., Yang, B., Busch, C.: Combining iris and periocular recognition using light field camera. In: 2nd IAPR Asian Conference on Pattern Recognition (ACPR), pp. 155–159 (2013)
12. Raja, K.B., Raghavendra, R., Busch, C.: Binarized statistical features for improved iris and periocular recognition in visible spectrum. In: 2nd International Workshop on Biometrics and Forensics (IWBF), pp. 1–6 (2014)
13. Ross, A., et al.: Matching highly non-ideal ocular images: an information fusion approach. In: 2012 5th IAPR International Conference on Biometrics (ICB), pp. 446–453 (2012)
14. Sequeira, A.F., et al.: Cross-Eyed - cross-spectral iris/periocular recognition database and competition. In: 5th International Conference of the Biometrics Special Interest Group (BIOSIG 2016), pp. 1–5 (2016). https://sites.google.com/site/crossspectrumcompetition/home
15. Sharma, A., Verma, S., Vatsa, M., Singh, R.: On cross spectral periocular recognition. In: 2014 IEEE International Conference on Image Processing (ICIP), pp. 5007–5011 (2014)
16. Vyas, R., Kanumuri, T., Sheoran, G.: Cross spectral iris recognition for surveillance based applications. Multimedia Tools Appl. 78(5), 5681–5699 (2018). https://doi.org/10.1007/s11042-018-5689-y
17. Vyas, R., Kanumuri, T., Sheoran, G., Dubey, P.: Efficient features for smartphone-based iris recognition. Turk. J. Electr. Eng. Comput. Sci. 27(3), 1589–1602 (2019)
18. Vyas, R., Kanumuri, T., Sheoran, G., Dubey, P.: Efficient iris recognition through curvelet transform and polynomial fitting. Optik 185, 859–867 (2019)
19. Woodard, D.L., Pundlik, S., Miller, P., Jillela, R., Ross, A.: On the fusion of periocular and iris biometrics in non-ideal imagery. In: Proceedings of the 20th International Conference on Pattern Recognition (ICPR), pp. 201–204 (2010)
20. Zand, M., Doraisamy, S., Halin, A.A., Mustaffa, M.R.: Texture classification and discrimination for region-based image retrieval. J. Vis. Commun. Image Represent. 26, 305–316 (2015)
21. Zhao, Z., Kumar, A.: An accurate iris segmentation framework under relaxed imaging constraints using total variation model. In: IEEE International Conference on Computer Vision, pp. 3828–3836 (2015)
22. Zhao, Z., Kumar, A.: Accurate periocular recognition under less constrained environment using semantics-assisted convolutional neural network. IEEE Trans. Inf. Forensics Secur. 12(5), 1017–1030 (2017)

Computer Forensic

A Fast and Rigid Copy Move Forgery Detection Technique Using HDBSCAN

Shraddha Wankhade(B), Anuja Dixit, and Soumen Bag

Department of Computer Science and Engineering, Indian Institute of Technology (Indian School of Mines) Dhanbad, Dhanbad, India
[email protected], [email protected], [email protected]

Abstract. Copy move forgery detection is a rapidly growing research area in the field of blind image forensics. In copy move forgery, a fragment of an image is copied and pasted within the same image to mislead the end users and distort the originality of the information. Methods introduced so far face problems in detecting forged regions in large images as well as in images with post-processing attacks such as noise, compression, rotation, scaling and translation. We introduce a new approach based on keypoint feature matching. In this approach, Scale Invariant Feature Transform (SIFT) keypoint feature descriptors are computed, followed by K-nearest neighbor searching, and finally the copy move forgery is detected with the help of Hierarchical Density Based Spatial Clustering of Applications with Noise (HDBSCAN). Several quantitative measures such as true positive rate (TPR), false positive rate (FPR), precision, F-score, accuracy, negative likelihood (NL) and positive likelihood (PL) are used to compare the performance of our method with existing methods. Our method detects forged images accurately and achieves the highest F-score as well as a low FPR compared with other existing methods.

Keywords: Copy move forgery · Digital image forensic · 2-NN matching · HDBSCAN · SIFT

1 Introduction

Tampering image data and misleading the receiver beyond the actual interpretation is a very old technique of the counterfeiter. Digital images are easy to manipulate and forge due to the availability of powerful editing software in the market, which leaves no significant mark or traces on the image. Embedding a duplicate region in an image is called image forgery. Embedding can be done in many ways, such as copying a fragment of an image and pasting it at a different location in the same image with several distortions like noise addition, compression, translation, scaling, etc. Forgery is meant to suppress the actual information from the end users. Detection of copy move forgery is a well known research topic in blind image forensics. Blind image forensics means forgery detection without using outside authentic image information.

Fig. 1. Example of copy move forgery: (a) Original image, (b) Tampered image. Tampered portion is marked by red box. (Color figure online)

Research in the copy move forgery detection domain has been broadly divided into two parts: one is the block-based approach and the other is the keypoint-based approach. Various research proposals exist for the detection of copy-move image forgery. In [1], the authors proposed a robust copy move forgery detection scheme using a block-based approach. In their method, overlapping blocks are used to calculate feature vectors. The feature vectors are sorted to bring similar feature vectors to adjacent positions, which makes detection of cloned regions less complex. The main disadvantage of this method is the high complexity involved in the forgery detection process when the forged images are large (Fig. 1).

In keypoint-based approaches, keypoints are the points of interest in an image. The Scale Invariant Feature Transform (SIFT) is the most popular keypoint detection algorithm. The five main steps of SIFT are detection of scale-space extrema, localization of keypoints, computation of the orientation histogram, keypoint descriptor calculation and keypoint matching. Amerini et al. [2] proposed a method using the SIFT technique to detect forged images. In [3], a stationary wavelet transform with a SIFT based feature extractor is utilized to detect forged regions.

The main objectives of our proposed method are as follows:

• In place of 2NN feature matching, we use generalised 2NN (g2NN). With the g2NN method, the keypoint matching technique is improved.
• We use Hierarchical Density Based Spatial Clustering of Applications with Noise (HDBSCAN), which has the ability to beat the existing methods. HDBSCAN can also be used for clusters of varying density. It helps to give a compact boundary for the cluster of features so that outliers are distinguished properly.
• From the quantitative aspect, we improve the TPR and accuracy measures and reduce the false matches.

The remaining paper is organized as follows. Section 2 describes the proposed work. Experimental results and analysis are presented quantitatively and qualitatively in Sect. 3. Finally, in Sect. 4 we conclude our proposed methodology.

2 Proposed Method

In the proposed methodology, features are extracted using the SIFT algorithm to find the copy-moved region in an image, and HDBSCAN is used for clustering, which gives good results for clusters of varying density. The fundamental framework followed for detection of forged images is shown in Fig. 2. The flow of the proposed methodology is illustrated in Fig. 3.

Fig. 2. Overview of the proposed method: (a) tampered input image, (b) detection of keypoints on an image, (c) showing matching keypoints, (d) detected result.

2.1 Preprocessing

The original RGB image I of size M × N is converted into a grayscale image, as it becomes easier to perform computation on a grayscale image. RGB is converted to grayscale as in Eq. (1):

I = 0.2989 × R + 0.5870 × G + 0.1140 × B    (1)

2.2 Finding Keypoints and Extraction of Features Using SIFT

We have used the SIFT method for keypoint extraction as well as for keypoint description. In the SIFT algorithm, keypoints are detected using scale-space local extrema. Consider a scale space image L(u, v, σ), where σ is the scaling factor on coordinates (u, v), formed by convolving the image with a Gaussian filter as in Eq. (2):

L(u, v, kσ) = G(u, v, kσ) ∗ I(u, v)    (2)

where I(u, v) is the original image and L(u, v, kσ) is the convolved image. The Difference of Gaussians at different scales can be computed as in Eq. (3):

D(u, v, σ) = L(u, v, k_i σ) − L(u, v, k_j σ)    (3)

Fig. 3. System architecture of the proposed method.

2.3 Keypoint Localization and Orientation Assignment

Interpolation of neighbouring information is used to accurately identify the position of each keypoint. A quadratic Taylor expansion of the Difference of Gaussian (DoG) function D(u, v, σ) is used at each keypoint as in Eq. (4):

D(z) = D + (∂D/∂z)^T z + (1/2) z^T (∂²D/∂z²) z    (4)

where z represents the tuple (u, v, σ). A canonically oriented structure around each keypoint ensures that SIFT descriptors are invariant to rotation and scaling. To obtain the descriptor orientation, a gradient based histogram is calculated using the orientation of the neighbourhood around a keypoint. Magnitude and orientation are computed as in [2]. A 16 × 16 window is taken around each keypoint, and that window is further subdivided into 4 × 4 sub-blocks. For each sub-block, an 8-bin orientation histogram is calculated, and a 128 × 1 dimensional feature vector is obtained.

2.4 Feature Matching

We have used the brute force method to find the best match for each keypoint. In this method, the distance of a particular keypoint to all other keypoints in the image is calculated to find the best match, with the Euclidean distance as the deciding factor. If we simply threshold the distance between two feature vectors to decide whether two keypoints match, the performance of the proposed algorithm degrades because the dimensionality of the feature space is high. So, to make the method more effective, we compute the ratio of the distances to the two nearest neighbours and compare it with a threshold Th. For each keypoint, a vector of Euclidean distances to the other keypoints is calculated and sorted. According to this concept, a match between keypoints is found if the condition in Eq. (5) is true:

dist1 / dist2 < Th,  where Th ∈ (0, 1)    (5)

The main drawback of the above mentioned 2NN test is that it fails to deal with multiple cloning, i.e., when the same region is copied over and over in the same image. To handle this drawback we use g2NN. After defining the similarity vector X = {dist1, dist2, dist3, ..., distn}, the observation starts from the high dimensional feature space obtained from the SIFT keypoints. 2NN is the ratio between the distance to the best matching feature and the distance to the second closest neighbour. This ratio is relatively low in the case of a correct match but relatively high for random features. The generalization of 2NN leads to an iterative process which conducts m iterations until the ratio distj/distj+1 becomes greater than the threshold, where 1 ≤ m ≤ n and n is the total number of similar vectors. Finally, after iterating m times over the image domain, all the matched keypoints are obtained, but isolated keypoints are not considered for further steps.
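A minimal sketch of the g2NN test on a set of SIFT descriptors from a single image; the ratio threshold of 0.6 follows the value indicated in Fig. 3, and variable names are illustrative.

```python
import numpy as np

def g2nn_matches(descriptors, th=0.6):
    """Generalized 2NN matching: keep neighbours while dist_j / dist_{j+1} < th."""
    d = descriptors.astype(np.float32)
    matches = []
    for i in range(len(d)):
        dist = np.linalg.norm(d - d[i], axis=1)   # Euclidean distances to all descriptors
        order = np.argsort(dist)[1:]              # sorted neighbours, skipping the point itself
        dist = dist[order]
        for j in range(len(dist) - 1):
            if dist[j + 1] == 0 or dist[j] / dist[j + 1] >= th:
                break                             # stop once the ratio test of Eq. (5) fails
            matches.append((i, int(order[j])))
    return matches
```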

2.5 Detection of Outliers and Clustering Using HDBSCAN

HDBSCAN [4] is a clustering technique and an extension of DBSCAN. DBSCAN is a remarkable technique for clustering input data of different shapes, but it does not work well for data of varying density and is sensitive to the selection of its parameters [5]. We use HDBSCAN because it can work on data of varying density, as it utilizes a varying density threshold, and it is robust with respect to parameter selection. A further advantage of HDBSCAN is its robustness to noise: a high density threshold is used to separate two dense clusters, so that low density points are treated as noise. DBSCAN uses two parameters, ε and Minpts, where ε represents the radius and Minpts represents the minimum number of points in a cluster used to decide core points, border points and noise points, respectively. HDBSCAN does not use any radial parameter to form a cluster. It uses two different parameters, the minimum cluster size and the minimum number of sample points, which are inversely related to each other. The minimum cluster size helps to find the correct density threshold value, whereas the minimum number of sample points helps to make the cluster boundary tighter. The performance of HDBSCAN depends on choosing these parameters properly. At the end of the process, clusters which do not hold any notable number of matched keypoints are ignored.

3 Experimental Results

3.1 Dataset Description

We have used the MICC-F220 dataset [6], which contains 110 original images and 110 tampered images. The resolution of the images varies from 722 × 480 to 800 × 600 pixels. The forged region covers 1.2% of the total image area. This dataset contains forged images with different post processing operations such as rotation, scaling, JPEG compression, noise addition, multiple cloning, etc.

3.2 Metrics for Performance Evaluation

The performance of the proposed copy move forgery detection method has been tested quantitatively. The quantitative results have been tested using True Positive Rate (TPR), False Positive Rate (FPR), Accuracy, Positive Likelihood (PL), Negative Likelihood (NL), Precision, and F-score, which can be computed as in [7]. TPR denotes the percentage of images which are correctly identified as forged. FPR denotes the percentage of images which are original but are falsely identified as forged. Accuracy denotes how accurately the forgery detection method can detect true positives and true negatives in the dataset. PL denotes how likely the true positive values are with respect to the dataset. NL denotes how likely the true negative values are with respect to the dataset. F-score indicates the harmonic mean of precision and recall. The quantitative results obtained using the proposed technique are shown in Table 1.

Table 1. Comparison of proposed method with other methods on the basis of certain quantitative parameters.

| Methods              | TPR (%) | FPR (%) | Accuracy (%) | PL    | NL    | Precision | F-Score |
|----------------------|---------|---------|--------------|-------|-------|-----------|---------|
| SIFT+HAC [2]         | 97.27   | 7.2     | 95           | 13.5  | 0.029 | 93.04     | 0.951   |
| SURF+HAC [8]         | 73.64   | 3.64    | 85.4         | 20.2  | 0.27  | 96.4      | 0.834   |
| SIFT+DyWT [9]        | 80      | 10      | 85           | 8     | 0.22  | 88        | 0.83    |
| SIFT+PCA [10]        | 97.20   | 7.27    | 95           | 13.37 | 0.03  | 89.3      | 0.9289  |
| SIFT+PCA+DBSCAN [11] | 96      | 3       | 97           | 32    | 0.04  | 97        | 0.96    |
| SIFT+FCM [12]        | 99.09   | 9.09    | 95           | 11    | 0.01  | 91.5      | 0.951   |
| Proposed Method      | 100     | 3.63    | 98.18        | 27.54 | 0     | 96        | 0.979   |
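For reference, the sketch below shows how the listed metrics follow from the confusion counts of an image-level forgery decision; the example counts are illustrative and not the exact numbers behind Table 1.

```python
def forgery_metrics(tp, fn, fp, tn):
    """TPR, FPR, accuracy, likelihood ratios, precision and F-score from confusion counts."""
    tpr = tp / (tp + fn)                          # recall / sensitivity
    fpr = fp / (fp + tn)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    pl = tpr / fpr if fpr > 0 else float("inf")   # positive likelihood
    nl = (1 - tpr) / (1 - fpr)                    # negative likelihood
    precision = tp / (tp + fp)
    f_score = 2 * precision * tpr / (precision + tpr)
    return tpr, fpr, accuracy, pl, nl, precision, f_score

# Illustrative counts for a dataset of 110 forged and 110 original images.
print(forgery_metrics(tp=110, fn=0, fp=4, tn=106))
```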

Table 1 presents the performance of our algorithm on the MICC-F220 dataset on the basis of the evaluation metrics TPR, FPR, accuracy, PL, NL, precision, and F-score. The quantitative results obtained using the proposed algorithm are compared with existing copy-move forgery detection techniques. According to the obtained results, it is observed that merging HDBSCAN with the SIFT algorithm gives better results than the other existing methods.

3.3 Results Analysis

The performance of the proposed methodology depends on the following parameters of HDBSCAN clustering:

Minimum cluster size: It provides better stability in flat region clustering. The more we increase this size, the more small homogeneous clusters are merged.

Minimum sample size: It expresses the stability of a cluster; the larger it is, the more stable the cluster will be. In our method we set this value to one.

Alpha (α): If the minimum cluster size and minimum sample size do not provide compact clustering, then α can help to make the clustering more restrictive in nature. We set α to 2 throughout the experiments.

Cluster selection method: HDBSCAN uses 'Excess of Mass (eom)' as the default cluster selection method. Irrespective of cluster size, our main focus is to combine all the original data points into one cluster and the forged points into another cluster. We achieved this by using 'leaf' as the cluster selection method.

In g2NN, we have taken the threshold Th in the range 0.3 ≤ Th ≤ 0.9. The effect of the ratio change on accuracy is shown graphically in Fig. 4. A minimal configuration of these parameters is sketched below.
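The sketch uses the hdbscan package [4] with the parameter choices discussed above; the matched keypoint coordinates are assumed to be available as an N × 2 array, and the minimum cluster size shown is illustrative since the paper does not fix it.

```python
import numpy as np
import hdbscan

# coords: (N, 2) array of matched keypoint locations (random data here for illustration).
coords = np.random.rand(200, 2) * 500

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=5,               # illustrative; larger values merge small homogeneous clusters
    min_samples=1,                    # minimum sample size set to one, as described above
    alpha=2.0,                        # makes the clustering more restrictive
    cluster_selection_method="leaf",  # 'leaf' instead of the default 'eom'
)
labels = clusterer.fit_predict(coords)    # label -1 marks outliers / noise points
```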

Fig. 4. Accuracy measures with varying threshold (Th).

3.4 Qualitative Results

HDBSCAN Vs DBSCAN: See Figs. 5, 6 and 7.


Fig. 5. Visual results from MICC-F220 dataset: (a)–(b) original images, (c)–(d) tampering detection using DBSCAN clustering, (e)–(f) tampering detection using HDBSCAN clustering.

Fig. 6. Examples of copy move forgery detection on different types of attacks: (a) plain, (b) scaling, (c) rotation, (d) detection result for plain attack, (e) detection result for scaling attack, (f) detection result for rotation attack.


Fig. 7. Examples of copy move forgery detection on different types of attacks: (a) rotation and scaling, (b) noise addition, (c) JPEG compression, (d) detection result for rotation and scaling, (e) detection result for noise addition, (f) detection result for JPEG compression.

Our method can accurately detect the forged images having multiple cloned regions as shown in Fig. 8.

Fig. 8. Examples of copy move forgery detection from MICC-F220 dataset: (a)–(b) multi-cloned images, (c)–(d) detection results.

4 Conclusion

In this paper, a clustering based methodology for detection of copy move forgery is proposed which can reliably detect the tampered region in a given image even when geometric transformations have been performed on it. First, the popular and reasonably accurate SIFT keypoint based method is applied to extract keypoints as well as to obtain feature descriptors. After obtaining the feature descriptors, HDBSCAN clustering is performed to detect the forged region and the outliers. A comparative examination of this method has been carried out against the DBSCAN clustering method. Several quantitative parameters such as TPR, FPR, and accuracy show the improvement in forgery detection results using the proposed scheme when forged images suffer from scaling, rotation, JPEG compression, noise addition, and multi-cloning attacks. Overall, the proposed methodology performs relatively better and adds a small contribution to the vast field of image forensics.

References

1. Luo, W., Huang, J., Qiu, G.: Robust detection of region-duplication forgery in digital image. In: Proceedings of the International Conference on Pattern Recognition, vol. 4, pp. 746–749 (2006)
2. Amerini, I., Ballan, L., Caldelli, R., Bimbo, A.D., Serra, G.: A SIFT-based forensic method for copy-move attack detection and transformation recovery. IEEE Trans. Inf. Forensics Secur. 6(3), 1099–1110 (2011)
3. Singh, R., Chaturvedi, R.P.: SWT-SIFT based copy-move forgery detection of digital images. In: Proceedings of the International Conference on Innovations in Control, Communication and Information Systems, pp. 1–4 (2017)
4. McInnes, L., Healy, J., Astels, S.: hdbscan: hierarchical density based clustering. J. Open Source Softw. 2(11), 205:1–205:2 (2017)
5. Braune, C., Besecke, S., Kruse, R.: Density based clustering: alternatives to DBSCAN. In: Celebi, M.E. (ed.) Partitional Clustering Algorithms, pp. 193–213. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-09259-1_6
6. MICC-F220 dataset. http://lci.micc.unifi.it/labd/2015/01/copy-move-forgerydetection-and-localization
7. Dixit, A., Bag, S.: Utilization of HOG-SVD based features with connected component labeling. In: Proceedings of the International Conference on Identity, Security, and Behaviour Analysis (2019)
8. Mishra, P., Mishra, N., Sharma, S., Patel, R.: Region duplication forgery detection technique based on SURF and HAC. Sci. World J. 2013, 1–8 (2013). Article Id 267691
9. Hashmi, M.F., Anand, V., Keskar, A.G.: Copy-move image forgery detection using an efficient and robust method combining un-decimated wavelet transform and scale invariant feature transform. AASRI Procedia 9, 84–91 (2014)
10. Kaur, H., Saxena, J., Singh, S.: Simulative comparison of copy-move forgery detection methods for digital images. Int. J. Electron. Electr. Comput. Syst. 4, 62–66 (2015)
11. Mursi, M.F.M., Salama, M.M., Habeb, M.H.: An improved SIFT-PCA-based copy-move image forgery detection method. Int. J. Adv. Res. Comput. Sci. Electron. Eng. 6, 23–28 (2017)
12. Alberry, H.A., Hegazy, A.A., Salama, G.I.: A fast SIFT based method for copy move forgery detection. Future Comput. Inform. J. 3(2), 159–165 (2018)

Computer Vision

Automated Industrial Quality Control of Pipe Stacks Using Computer Vision

Sayantan Chatterjee1(B), Bidyut B. Chaudhuri2, and Gora C. Nandi1

1 Department of Robotics and Artificial Intelligence, Indian Institute of Information Technology, Allahabad, Allahabad 211012, Uttar Pradesh, India
{IIT2015511,gcnandi}@iiita.ac.in
2 Department of Computer Vision and Pattern Recognition, Indian Statistical Institute, Kolkata, Kolkata 700108, West Bengal, India
[email protected]

Abstract. In this work, we describe an automated quality assurance system for pipes in warehouses and yards using simple handheld and mobile equipment like smartphone cameras. Currently, quality inspection for bent and crooked pipe ends is done manually, which entails additional labour costs and is relatively slower than a mechanised approach. We propose an efficient and robust method of detecting the perfectly circular cross-sections of the pipes in stacks using an adaptive variation of the Hough Transform algorithm. As a multistage approach, our proposed method first intensifies the foreground features relative to the background and then applies Canny edge detection to obtain the gradient details. The gradient directions are in turn fed to the modified Hough Transform algorithm using the Hough gradient to detect the cross-sections. The novel Hough Transform modification features region-based processing in a "coarse to fine" manner by dividing the image into smaller grids and detecting circular cross-sections per grid, which has the computational advantage of using a smaller accumulator and is less memory intensive. Experiments were performed on real industrial images to validate the efficiency of the proposed algorithm, and the results show that the proposed method can successfully and accurately highlight the perfectly circular cross-sections while leaving the faulty pipe-ends undetected.

Keywords: Quality control · Detecting cross-sections · Image processing · Hough Transform · Robust region-based algorithm · Pipe industry

1 Introduction

Recently, there has been significant interest within the computer vision, photogrammetry and signal processing communities in providing state-of-the-art fault detection and quality assurance technologies for large scale manufacturing, to avoid additional labour costs and to make the entire manufacturing workflow more time-efficient. This project derives its inspiration from the industrial need to employ fewer staff for manual quality checks and to reinforce the current system with a robust approach for automating the process. In this paper, a method for detecting perfectly circular cross-sections from a lateral view of a pipe stack has been developed in order to detect particular manufacturing errors: broken, crooked, misshapen and deformed pipe-ends. Considering that the stacks are often found in open yards and warehouses, the main requirement of the method was to make it portable and mobile. This saves the inspection crew the added hassle of carrying larger equipment in the field. The goal of this project, therefore, is to build an automated quality checking system for large-scale pipe manufacturing using mobile and portable equipment like smartphone cameras.

The metrics for judging an algorithm are not as straightforward as they may seem; several factors come into play other than accuracy, viz. performance over a wide range of radii (robustness), inference time, computational hardware constraints, quality of photographs available, etc. The metrics that will be focused on here are:

1. Accurate detection of cross-sections
2. Inference time
3. Hardware constraints

Microsoft's COCO data set [1] has become an industry standard for testing and evaluating object detection algorithms. The primary tests of our circle detection algorithm have been done on a part of this data set containing circular objects. One of the main areas of focus was to measure the trade-offs required to gain real-time inference capability with minimal sacrifice of accuracy. Since our goal is to perform quick inference by utilizing the input from a portable camera (a low computing device), the performance of the algorithm must be taken into account both in terms of latency and accuracy. A complete pipeline was developed which takes input from a smartphone camera, feeds it to the detection algorithm, takes the output of the algorithm and then displays the detected circles in the stack along with their estimated radii in a separate monitoring feed. The pipeline performs all the aforementioned steps in real time.

The rest of the paper is organized as follows. In Sect. 2, the state-of-the-art and previous related work are discussed. In Sect. 3, the problem is formalized and elaborated upon, and the proposed architecture is discussed in Sect. 4. Section 5 describes the results and compares the performance of different algorithms. Future work is discussed and proposed in the final section, i.e., Sect. 6.

2 Related Work

2.1 Use of Computer Vision in Fault Detection in the Pipe Industry

In recent years, there has been a growing interest in tackling the challenges of industrial fault detection in pipe manufacturing using computer vision algorithms. Alam et al. [2] proposed a method to detect cracks and holes in pipes by applying Sobel masks to an image of the pipe along its length. The defects are classified into holes and cracks according to the shape and size of the detected edges. Furthermore, crack detection with the help of the Fourier transform and the wavelet transform has been demonstrated in [3], where the effectiveness of using computer vision analysis and pattern recognition in quality control and fault detection is comparatively demonstrated. In a similar vein to [2], this work proposes an automated and robust approach for detecting the circular cross-sections of pipes by utilizing lateral view images of pipe stacks with the aid of a modified Hough Transform algorithm for circles, a quality control approach not found in the existing literature as per the search conducted to the best of our knowledge.

2.2 Hough Transform and Its Variants

The major developments and variations in Hough Transform and its adaptiveness have been comprehensively compared in Yuen et al. [4], which statistically illustrates the advantages and drawbacks of all the prevalent variants. The Standard Hough Transform (SHT) uses a 3-D accumulator array and limits voting to a section of the cone in the three dimensional parameter space by using the edge direction information. The Gerig and Klein Hough Transform (GKHT) solved the issue of requiring an exorbitant amount of storage space in case of a large range of radii by using 3 two-dimensional accumulator arrays, and then the gradient information was further introduced to make the increments of radius at each edge point more efficient. Based on the performance comparisons of several Hough Transform methods for circle detection in [4], we chose the Gerig and Klein Hough Transform with gradient (GHTG), which is shown to be computationally the most efficient in the comparative survey, and reduces complexity by converting the three-dimensional parameter space into 3 different two-dimensional accumulator arrays. Although there are other variants like the Fast Hough Transform (FHT) and modified Fast Hough Transform (MHFT) and the 2-1 Hough Transform (21HT), GHTG performs better than them and is computationally reliable and efficient, despite being restricted to non-concentric circles.

3 Problem Definition

The problem considered in this paper is related to the circle detection problem in pattern recognition and image processing. Given a photograph captured in an uncontrolled natural environment, the desirables of the quality control and fault-monitoring system are threefold:

1. The system only detects the cross-sections of the pipes that are perfectly circular, while bent or crooked ends must not be detected.
2. The monitoring system should be general enough to be able to detect pipes of any radii, while also keeping circular objects in the background or shadows undetected; adaptiveness and robustness are essential.
3. Since the stacks are spread across a large width, due to perspective shift, some pipe-ends might overlap their adjacent ones. An arc (partial circle) detection system must be involved.

3.1 Input

The input to the detector is a photograph or a capture stream from any camera. For better calibration of the estimated radii, the distance between the camera and the pipe stacks should be kept relatively consistent (variance of approx. 1–1.5 m) to achieve accurately scaled results.

3.2 Assumptions

The assumptions regarding the input images are as follows:

– The cross-sections visible in the image should be in focus and devoid of noise.
– The pipe stack should be approximately centered inside the image frame.
– The diameter of the largest cross-section should be at most a quarter of the image dimensions.

3.3 Output

The outputs of the detector are the coordinates of the centers of the detected circles, the estimated radii, the number of circles detected, etc. The radius estimates can even be converted to real world measures by scaling with respect to the distance between the camera and the pipe stack.

4 Methodology

4.1 Comparing the State-of-the-Art Circle Detection Algorithms

A comparative study of the above mentioned state-of-the-art circle detection algorithms has been performed on the following metrics:

– Accuracy of localizing circle centers
– Training time
– Inference time
– Accuracy of estimated radii

From conventional computer vision and deep learning algorithms, we tested the following state-of-the-art approaches:


1. A template-matching approach using a distance transform (Euclidean or Manhattan) [5], which does not support a wide range of radii due to the fixed size of the template mask.
2. A contour detection approach by convolution using kernel masks [5,6], which is difficult to scale to large images.
3. Deep neural nets for circle detection, e.g. using a Euclidean distance measure as the final layer metric, etc. [7,8], which are the most accurate but are highly intensive as far as computational constraints are concerned.
4. Adaptive Hough Transform with partial circle detection using chord and sagitta properties [9–11], which is scalable and supports a wide range of radii, but struggles with concentric circles and is situationally dependent on its parameters.

While deep neural nets with Euclidean distance as the final layer, along with the spiking neural model in [7] and the learning automata approach in [8], were very accurate in estimating the radii, the detection of circle centers needed ample training time, which made the process slower and too computationally intensive to be hosted on portable mobile hardware with memory and processing power constraints. Our choice of the Hough Transform with gradient details performs reasonably accurately over a multitude of naturally lit conditions, provided the parameters have been tuned properly. The advantages of the region-based approach proposed here are a smaller accumulator size and limited entries in the Hough space for voting, so the inference is less computationally intensive. The detailed workflow of the pipeline to obtain the detected cross-sections from the smartphone images is illustrated in Fig. 1 and is discussed in detail in the following subsections, Sects. 4.2 to 4.6.

4.2 Preprocessing

1. Compression: The images from modern smartphone cameras are multiple megapixels in size. In order to reduce the amount of computation required to process the pixels, an efficient lossy compression algorithm like JPEG encoding is applied to the input image frame to reduce the overall image size. The current compression method implemented reduces the image to ≈75% of its original size.

2. Grayscaling: The compressed images are then colour converted from an RGB colour map to grayscale, using the pixel intensity obtained by mixing the three components and quantizing to the given grayscale as specified in CCIR-601 (currently designated ITU-R BT.601) [12]:

Y = 0.299R + 0.587G + 0.114B    (1)

3. Inversion: The pixel intensities are inverted in the 0–255 range to increase the number of darker pixels relative to lighter pixels. This makes the process computationally economical due to the lower intensities of the darker pixels and prevents overflow (see the sketch after this list).
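A minimal sketch of the three preprocessing steps using OpenCV; the JPEG quality factor and file name are illustrative stand-ins, since the paper only states the target size reduction.

```python
import cv2

frame = cv2.imread("pipe_stack.jpg")                 # hypothetical input frame

# 1. Compression: re-encode the frame as JPEG to reduce the data volume.
ok, buf = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 75])
frame = cv2.imdecode(buf, cv2.IMREAD_COLOR)

# 2. Grayscaling: BGR -> gray with the 0.299/0.587/0.114 weights of Eq. (1).
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# 3. Inversion: flip the intensities in the 0-255 range.
inverted = 255 - gray
```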


Fig. 1. Detailed workflow for the detection pipeline

4.3 Morphological Processing: Intensification of Foreground

Erosion

Definition 1. Let E be a Euclidean space or an integer grid, and A a binary image in E. The erosion of the binary image A by the structuring element B is defined by:

A ⊖ B = {z ∈ E | B_z ⊆ A}

where B_z is the translation of B by the vector z, i.e., B_z = {b + z | b ∈ B}, ∀z ∈ E.

Dilation

Definition 2. Let E be a Euclidean space or an integer grid, and A a binary image in E. The dilation of the binary image A by the structuring element B is given by:

A ⊕ B = {z ∈ E | (B^s)_z ∩ A ≠ ∅}

where B^s is the symmetric of B, i.e., B^s = {x ∈ E | −x ∈ B}.

Using a 3 × 3 structuring element as the kernel, the image boundaries are eroded and the thickness of the foreground object edges is reduced. On this eroded image, the same 3 × 3 kernel is used to dilate the object areas to accentuate the foreground features.
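A minimal sketch of the erosion followed by dilation (a morphological opening) with the 3 × 3 structuring element described above; `inverted` is assumed to be the preprocessed image from Sect. 4.2.

```python
import cv2
import numpy as np

kernel = np.ones((3, 3), np.uint8)                      # 3 x 3 structuring element B

eroded = cv2.erode(inverted, kernel, iterations=1)      # A ⊖ B: thins edges, removes small specks
intensified = cv2.dilate(eroded, kernel, iterations=1)  # (A ⊖ B) ⊕ B: restores foreground area
```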


The combination of these two steps helps in finely separating touching objects, removing slender protrusions at the edges as well as small holes and distortions caused by white noise, and intensifying the foreground; the object areas increase and the broken parts are rejoined. This step is essential to improve fidelity in the Circular Hough Transform step.

4.4 Smoothing: Suppression of Background

A Gaussian blur (also known as a two-dimensional Weierstrass transform) is introduced as a smoothing filter for diffusing the background details even further, in order to minimize the occurrence of spurious circles in the inferences. A 13 × 13 Gaussian kernel on the boundary grid cells, followed by a 9 × 9 or 11 × 11 kernel on the interior cells, as per the demands of the situation, is used for smoothing the image. This is done to smoothen the background details more, as they are more probable to occur at the image boundaries, while the objects in focus are blurred less. The standard deviation and the coefficients of the kernel are calculated as follows:

σ = 0.3 · ((ksize − 1) · 0.5 − 1) + 0.8    (2)

G_i = α · exp(−(i − (ksize − 1)/2)² / (2σ²))    (3)

Note 1. The expression for the standard deviation in Eq. 2 is obtained from the documentation of the getGaussianKernel method in OpenCV, where α is a scale factor chosen so that the kernel coefficients sum to one.

4.5 Edge Detection

Since the Gerig-Klein Hough Transform can integrate gradient direction details for efficient increments of radii at each edge point, a Canny edge detection layer has been integrated due to its use of gradients. The gradient details returned by the Canny operator are in turn fed to the detector for increased efficiency.

4.6 Region-Based Hough Transform

The Circular Hough Transform works on the following parametric equation of a circle:

(x − x0)² + (y − y0)² = ρ²

The Cartesian space of x and y is transformed into a three dimensional parameter space of x0, y0 and ρ, also known as the Hough space. The algorithm iterates over angle values starting from a prespecified "minimum distance" from the detected edge points, and intersections in Hough space are chosen, based on a voting parameter, to identify circular structures in the Cartesian space, i.e., the original image. The modified algorithm divides the image into a 4 × 4 grid and each grid cell is processed individually for circles. The cells have a 20% overlap with adjacent cells (5% for each side) to avoid missing circles that lie on or very near the boundary of a cell. The processing of each region, i.e., an individual grid cell, features a downward iteration from an upper bound to a lower bound of radii in a "coarse to fine" manner. The upper bound of the radius, rmax, corresponds to the largest possible circle inside a rectangular area, and is defined as 0.5 · min{h, w}, where h and w are the height and width of the grid cell, respectively. The method then iterates from the upper bound of the radius down to an empirically chosen lower bound rmin to detect all possible circles within that area.
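A minimal sketch of the region-based detection loop described above: the image is split into a 4 × 4 grid with roughly 20% overlap, and OpenCV's gradient-based Hough circle transform is run per cell with rmax = 0.5 · min{h, w}. This is a simplified single pass rather than the paper's coarse-to-fine radius iteration, and the Canny and voting parameters are illustrative apart from the max{30, 0.6 · rmin} rule reported in Sect. 5.

```python
import cv2
import numpy as np

def detect_circles_by_region(gray, grid=4, overlap=0.2, r_min=10):
    """Run the circular Hough transform per grid cell and collect detections."""
    H, W = gray.shape
    cell_h, cell_w = H // grid, W // grid
    pad_h, pad_w = int(cell_h * overlap / 2), int(cell_w * overlap / 2)
    circles = []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = max(0, gy * cell_h - pad_h), min(H, (gy + 1) * cell_h + pad_h)
            x0, x1 = max(0, gx * cell_w - pad_w), min(W, (gx + 1) * cell_w + pad_w)
            cell = gray[y0:y1, x0:x1]
            r_max = int(0.5 * min(cell.shape))     # largest circle that fits inside the cell
            votes = max(30, int(0.6 * r_min))      # empirical voting threshold
            found = cv2.HoughCircles(
                cell, cv2.HOUGH_GRADIENT, dp=1, minDist=2 * r_min,
                param1=100, param2=votes, minRadius=r_min, maxRadius=r_max)
            if found is not None:
                for x, y, r in np.round(found[0]).astype(int):
                    circles.append((x + x0, y + y0, r))   # map back to image coordinates
    return circles
```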

Fig. 2. Illustration of the working inference model on privately collected images: (a) Original image from the smartphone camera divided into a 4 × 4 grid, (b) A particular grid cell, zoomed in, (c) Detected circles with centers lying within the region, (d) Output image with all detected circles

5 Results

The algorithm has been tested on privately collected images from warehouses and publicly available stock images, since no pipe-specific dataset was found in a search conducted to the best of our abilities. The results obtained on applying the detection algorithm to the collected images are illustrated here. As shown in Fig. 2, the circles in each region are separately detected, and the results of all the grid cells are then stitched into the original image and sent to a monitoring stream for user analysis. The results also depict the robustness of the region-based approach in effectively identifying all the overlapping circular cross-sections from the partial arcs visible.

Fig. 3. Detailed comparative accuracy scores with open-source implementations

The comparative results of the proposed algorithm against open-source implementations of the Circular Hough Transform, such as OpenCV, are illustrated in Fig. 3. According to the results obtained, it is clear that, given a Canny edge result, the region-based approach is consistently reliable and does not lose its accuracy as rapidly as the OpenCV implementation with stricter voting parameters. The accuracy metric used to evaluate the performance of the algorithms is the F1 score averaged over 50 trials.

The Hough voting parameter, also known as the detection "sensitivity", has been chosen empirically by testing the algorithm on more than 15000 pipes over 100 industrial and stock images. Since the voting parameter corresponds to the number of edge points that the algorithm needs to predict an edge as part of a circle, it is inferred that the parameter might be related to the circumference of the circles to be detected. From the experimental results demonstrated in Fig. 4, choosing a voting parameter of max{30, 0.6 · rmin} was found to produce consistent results over a large range of radii on an accumulator of size equal to the size of the grid cell. As per tests conducted on smartphones, the inference time for images ranged between 0.4 s (80 pipes) and 0.9 s (300 pipes).

Fig. 4. Experimental results on stock images: (a) Water pipes (Source: http://robertkaplinsky.com), (b) Metal pipes (Source: http://dreamstime.com)

Fig. 5. Demonstration of the detection algorithm in presence of deformed cross-sections

In order to illustrate and validate the performance of the proposed algorithm in the presence of deformed cross-sections, a side-by-side view of a clogged and a clean pipe has been used in Fig. 5, which shows that only the cleanly visible circular cross-section has been detected. Based on these results, it can therefore be concluded that the region-based cross-section detection algorithm is more accurate and robust for identifying perfectly circular cross-sections of pipe stacks from smartphone images.

6 Discussion and Future Work

The proposed region-based technique has been shown to accurately detect the perfectly circular cross-sections of pipes to provide an effective and robust quality assurance system. Experimental results and comparisons show the superiority of this method in terms of accurately identifying the circle centers and estimating the radii. In order to achieve sub-pixel perfect calibrations for further accurate estimation over a wide range of radii, the algorithm can be deployed on a camera feed from a conveyor belt situated at a fixed distance from the stack with the pipe ends facing forward.

However, there are a couple of limitations that can be overcome in future. Currently, the actions that are being taken are agnostic to the actual distance of the camera from the pipe stacks. Incorporating a depth sensor would help in accurately informing the user of the radii estimates using simple scaling calculations. The entire pipeline can be made even faster to improve the inference time of the circle detection system by integrating a faster approach to auto-tune the parameters. This is a tricky caveat to overcome since the Canny edge detection and Hough voting parameters are highly situation dependent due to the variation in lighting, sharpness and noise in natural environments. As discussed in Sect. 4.1, an open scope for further research lies in incorporating a neural network model that is lightweight enough to function on mobile hardware at minimal latency. Working on automating the process using such a model will ensure consistently accurate results on an even wider range of images.

References

1. Lin, T., et al.: Microsoft COCO: common objects in context. CoRR abs/1405.0312 (2014). http://arxiv.org/abs/1405.0312
2. Alam, M.A., Ali, M.M.N., Syed, M., Sorif, N., Abdur Rahaman, M.: An algorithm to detect and identify defects of industrial pipes using image processing. In: The 8th International Conference on Software, Knowledge, Information Management and Applications (SKIMA 2014), pp. 1–6 (2014). https://doi.org/10.1109/SKIMA.2014.7083567
3. Abdel-Qader, I., Abudayyeh, O., Kelly, M.E.: Analysis of edge-detection techniques for crack identification in bridges. J. Comput. Civil Eng. 17(4), 255–263 (2003). https://doi.org/10.1061/(ASCE)0887-3801(2003)17:4(255)
4. Yuen, H.K., Princen, J., Illingworth, J., Kittler, J.: Comparative study of Hough transform methods for circle finding. Image Vis. Comput. 8, 71–77 (1990)
5. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 3rd edn. Prentice-Hall Inc., Upper Saddle River (2006)
6. Beaini, D., Achiche, S., Cio, Y.L., Raison, M.: Novel convolution kernels for computer vision and shape analysis based on electromagnetism. CoRR abs/1806.07996 (2018). http://arxiv.org/abs/1806.07996
7. Huang, L., Wu, Q., Wang, X., Zhuo, Z., Zhang, Z.: Circle detection using a spiking neural network. In: 2013 6th International Congress on Image and Signal Processing (CISP), vol. 03, pp. 1442–1446 (2013)
8. Cuevas, E., Wario, F., Zaldivar, D., Cisneros, M.A.P.: Circle detection on images using learning automata. CoRR abs/1405.5406 (2014). http://arxiv.org/abs/1405.5406
9. Illingworth, J., Kittler, J.: The adaptive Hough transform. IEEE Trans. Pattern Anal. Mach. Intell. 9(5), 690–698 (1987). https://doi.org/10.1109/TPAMI.1987.4767964
10. Bera, S., Bhowmick, P., Bhattacharya, B.B.: Detection of circular arcs in a digital image using chord and sagitta properties. In: Ogier, J.-M., Liu, W., Lladós, J. (eds.) GREC 2009. LNCS, vol. 6020, pp. 69–80. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-13728-0_7
11. Yip, R.K.K., Tam, P.K.S., Leung, D.N.K.: Modification of Hough transform for circles and ellipses detection using a 2-dimensional array. Pattern Recogn. 25, 1007–1022 (1992)
12. Fischer, W.: Digital Video Signal According to ITU-BT.R.601 (CCIR 601). In: Fischer, W. (ed.) Digital Television. Signals and Communication Technology. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-662-05429-1_4

Asymmetric Wide Tele Camera Fusion for High Fidelity Digital Zoom

Sai Kumar Reddy Manne(B), B. H. Pawan Prasad, and K. S. Green Rosh

Samsung R&D Institute Bangalore, Bangalore 560037, India
[email protected], [email protected], [email protected]

Abstract. Asymmetric multi-camera systems are growing popular among smartphone manufacturers due to their ability to enhance image quality in applications such as low light imaging and camera zoom. One such multi-camera system is a tele-wide configuration with single or multiple tele and wide cameras. The images from these multiple cameras can be fused to generate an output image with improved image quality as opposed to the individual input images. Major challenges in this problem include handling the photometric and Field-of-View differences between frames from an asymmetric camera system. In this paper, novel techniques for both multi-camera image fusion and multi-camera transition, addressing the aforementioned challenges to create a seamless user experience, are presented. The proposed method is evaluated qualitatively using comparisons against the existing method in the Galaxy Note 8 and quantitatively against similar methods published in the literature, showing an improvement in PSNR of 20% for synthetic datasets and 3% for real datasets.

Keywords: Photometric alignment · Multi-camera fusion · Multi-camera transition

1 Introduction

Photography is one of the major unique selling points (USP) of modern smartphones. With recent advancements, smartphone cameras are able to match the image quality of DSLR cameras in several scenarios. However, smartphone cameras struggle to match the zoom capabilities of a DSLR camera due to their small form factor. Hence most smartphones today use methods such as multi-frame fusion and super resolution to achieve digital zoom instead of optical zoom. However, this method degrades the image quality considerably for higher zoom factors.

A solution to this problem is to leverage the multiple cameras available in contemporary smartphones. An example of a multi-camera system (asymmetric cameras) is shown in Fig. 1. The example shows a set of three cameras: tele, wide and super-wide, which have increasing order of Field-of-View (FOV) and decreasing order of focal length.

Fig. 1. Super wide, wide and tele sensor setup

The images obtained from the three cameras can then be fused together to obtain an optically zoomed output image depending on the zoom factor. However, the differences in FOVs and photometric characteristics of the different camera sensors in the system make the task of image blending challenging. The major contribution of this paper is a novel algorithm for fusing the images from multiple asymmetric cameras, which takes care of the alignment errors and produces an artifact-free final fused image. The proposed method enables two important user scenarios on a smartphone camera: multi-camera fusion for image capture and multi-camera video transition for high fidelity digital zoom, giving a DSLR-like zoom experience to the user.

2 Related Work

Limited scientific work has been proposed in this research area, such as [1–4]. In [1], high frequency components are estimated from a detail-rich tele image; super resolution of these high frequency components is then done with a registered wide image based on self-similarity. In [2], a method to correct the brightness and color of the input tele and wide images is given, but the tele image FOV is assumed by default to be around the center of the wide image, which is not the case in a multi-camera setup where the sensors are placed side by side. [3] tries to solve misalignment errors at the boundaries between objects and background during fusion: optical seams are estimated to generate a mask to detect the noticeable regions, and fusion is done using local warping based on motion estimation. For photometric alignment, [5] proposes a global tone mapping curve for each color channel. [6] proposes a method where the intensity histograms are matched, but it does not work well for images with significantly different fields of view. [7] uses a linear model to correct panoramic stitching using multiple images from a single camera. [8] proposes locally adaptive correction methods which are computationally very expensive. [9] proposes a surround monitoring system which does not provide accurate color distortion removal. [10] proposes spatial neighborhood filtering at each pixel, which is computationally expensive. These approaches are not directly suitable to solve the unique challenges posed by the asymmetric camera configuration of wide and tele cameras.

Fig. 2. Block diagram of the proposed Multi-camera Fusion Algorithm

3 Proposed Method

In the following section, the proposed methods for both multi-camera image fusion as well as multi-camera video transition are described.

3.1 Multi-camera Image Fusion

The proposed method for multi-camera image fusion can be divided into five stages, as shown in Fig. 2. For brevity, the fusion of wide and tele cameras is described here; it can be directly extended to the triple camera configuration by adding another branch for the super-wide and wide images in the blending stage.

Global Image Registration. The images captured from the wide and telephoto cameras have different FOVs, as demonstrated in Fig. 1. Hence the first step is to align the FOVs of these two images. To this end, an affine transformation matrix between the wide image (W) and the tele FOV image (T) is estimated using an ORB based feature descriptor [11] and FLANN based feature matching [12]. The affine transformation matrix is then used to warp the wide FOV image to the structure of the tele image using the bi-cubic interpolation method:

Hω : W → T    (1)
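A minimal sketch of this registration step: ORB features, FLANN (LSH) matching, a robust affine estimate and a warp of the wide image into the tele FOV; the feature count, ratio threshold and RANSAC choice are illustrative assumptions rather than the authors' exact settings.

```python
import cv2
import numpy as np

def register_wide_to_tele(wide, tele):
    """Estimate an affine H (wide -> tele FOV) and warp the wide image with it."""
    orb = cv2.ORB_create(4000)
    kw, dw = orb.detectAndCompute(wide, None)
    kt, dt = orb.detectAndCompute(tele, None)

    # FLANN with an LSH index for binary ORB descriptors.
    flann = cv2.FlannBasedMatcher(
        dict(algorithm=6, table_number=6, key_size=12, multi_probe_level=1),
        dict(checks=50))
    pairs = flann.knnMatch(dw, dt, k=2)
    matches = [p[0] for p in pairs if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]

    src = np.float32([kw[m.queryIdx].pt for m in matches])
    dst = np.float32([kt[m.trainIdx].pt for m in matches])
    H, _ = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC)   # 2x3 affine, Eq. (1)

    h, w = tele.shape[:2]
    warped = cv2.warpAffine(wide, H, (w, h), flags=cv2.INTER_CUBIC)  # bicubic warp to tele FOV
    return warped, H
```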

Photometric Alignment. The images captured from different cameras in a typical asymmetric camera system are pre-processed using different Image Signal Processors (ISPs), hence the photometric characteristics of these two images will be quite different. The method for photometric correction proposed is an improvement upon the algorithm from [14].

Global Brightness Correction. The brightness correction algorithm should ensure that alignment errors due to global image registration do not result in erroneous photometric alignment. Hence the outlier rejection algorithm proposed in [14] is used to get a subset of the pixels on which the photometric correction is performed. Three different modelling methods were experimented with for an accurate photometric correction, as discussed below.

Linear Modelling. This modelling method is used by [14] for brightness correction. A linear relationship is assumed between the source and target images, which is solved using the Levenberg-Marquardt algorithm for least squares estimation. However, a linear model is often susceptible to estimation errors due to several non-linear operations in the camera ISP. To rectify this issue, the following methods were developed and analyzed.

Non-linear Modelling. Since linear modelling is not enough to correctly adapt to the non-linear operations of the ISP, a polynomial fit is experimented with. To this end, the photometrically corrected tele image (T̂_Y) is modelled as a polynomial function (degree 3) of T_Y, where T_Y is the Y channel of YUV:

T̂_Y = a + b·T_Y + c·T_Y² + d·T_Y³    (2)

The coefficients {a, b, c, d} are obtained using QR (Householder) decomposition to solve the least squares optimization on the pixels remaining after the outlier rejection from [14].

Patch-Based Non-linear Modelling. Due to several differences between the wide and tele images in terms of exposure, white balance, ISP tuning parameters, etc., estimating a single model for the entire image may not provide the best results. Hence different non-linear models are estimated for different regions of the image. The image is split into multiple overlapping patches and separate polynomial functions are estimated for each of the patches. These patches are then blended using bilinear interpolation of the polynomial functions, similar to [15]. The outlier rejection from [14] is used in each patch separately to eliminate outliers while estimating the polynomial coefficients. A comparison of the patch based polynomial fit for several patch sizes is shown in Fig. 3. It can be seen that as the number of patches increases, the noise in the estimated image decreases (gray patch). However, in regions of structural mismatch, estimation with a higher number of patches tends to introduce artifacts (eye patch).
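A minimal sketch of the global degree-3 fit of Eq. (2) on the luma channel, assuming the inlier mask from the outlier rejection of [14] is already available as a boolean array; the patch-based variant would apply the same fit per overlapping patch and blend the results bilinearly.

```python
import numpy as np

def fit_luma_polynomial(tele_y, wide_y, inlier_mask, degree=3):
    """Fit wide_Y ≈ a + b*T_Y + c*T_Y^2 + d*T_Y^3 on inlier pixels (least squares)."""
    t = tele_y[inlier_mask].astype(np.float64).ravel()
    w = wide_y[inlier_mask].astype(np.float64).ravel()
    return np.polyfit(t, w, degree)          # coefficients, highest power first

def apply_luma_polynomial(tele_y, coeffs):
    """Photometrically correct the tele luma channel towards the wide reference."""
    corrected = np.polyval(coeffs, tele_y.astype(np.float64))
    return np.clip(corrected, 0, 255).astype(np.uint8)
```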

Fig. 3. Comparison of polynomial fit based photometric alignment techniques for various patches, showing that increasing the patch count decreases noise but introduces artifacts.

De-ghosting. Despite the global registration process, there can still be small shifts in object positions locally. This arises because of the positional differences between the lenses: as the lenses are stacked next to each other, each lens sees a slightly different version of the same scene. Local motion needs to be detected in order to have ghost-free blending. The optical flow between two frames provides flow vectors which give the distance and direction between corresponding pixels. The Dense Inverse Search method proposed in [13] is utilized to obtain the optical flow vectors. Using the calculated flow vectors (V), the tele image is remapped to give T′. A binary mask, called the flow map (F) and used for blending, is estimated from the optical flow vectors as follows:

F(i, j) = 1 if V(i, j) > φ, and 0 otherwise    (3)

where φ is the mean value of V.

Blending. Blending of the tele and wide frames is done using a weight map. The weight map (M) is obtained from the absolute difference between the Gaussian-blurred T_Y and Ŵ_Y. In the next step the weight mask is altered to prefer either tele or wide at a given pixel, to eliminate ghosting. The weight map pixels are divided into three subsets p̂, q̂ and r̂ using two thresholds μ1 and μ2. Using the thresholds, the weight map intensities are altered as below:

M(i, j) = { p̂ : M(i, j),  if M(i, j) < μ1
            q̂ : M(q̂),     if μ1 < M(i, j) < μ2
            r̂ : 255,       if M(i, j) > μ2 }    (4)

where

M(q̂) = ((255 − μ1) / (μ2 − μ1)) · M(q̂) + ((μ2 − 255) / (μ2 − μ1)) · μ1    (5)


Altering the weight map using (4) gives a reasonable mask for blending the input images into the final fused image without artifacts. Both μ1 and μ2 are tunable parameters. In the experiments carried out, μ1 is set as the mean of M, and μ2 is chosen adaptively based on the flow map F: for a given pixel (i, j), if F(i, j) is 1, then μ2 is given a lower value (2 × μ1 in our experiments), since there is a high chance of misalignment, and if F(i, j) is 0, which means an accurate alignment between Ŵ and T′, a higher value (6 × μ1) is assigned to μ2.

In the next step, the mask is decomposed into a Laplacian pyramid L. Although some sharp changes in intensities in the weight map are reduced by the Gaussian blur, morphological operations like erosion and dilation are applied at each level of the pyramid to further smoothen the mask and fill in small holes. The final mask is reconstructed from the pyramid and Gaussian-blurred. The mask contains dark regions close to pixel value 0, which represent close resemblance across the images, as well as white regions close to 255, where there is a lot of disparity between Ŵ and T′. At the regions of high disparity, the pixel values of the fused image should be closer to the reference image (Ŵ in this case). The mask is also modified gradually around the borders to give more weight to the reference frame, which helps in the next stage while putting the fused image back in the wide FOV. A weighted average approach is used to arrive at the final fused image F as follows:

F = (M × Ŵ + (255 − M) × T′) / 255    (6)
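A minimal sketch of the blending rule of Eqs. (4)–(6), with simplifications: the Laplacian-pyramid smoothing and the flow-map-driven choice of μ2 are omitted, and the inputs are assumed to be the photometrically aligned wide image Ŵ and the remapped tele image T′.

```python
import cv2
import numpy as np

def blend_wide_tele(wide_hat, tele_prime, mu1=None, mu2=None):
    # Weight map M: absolute difference of Gaussian-blurred luma images.
    wy = cv2.GaussianBlur(cv2.cvtColor(wide_hat, cv2.COLOR_BGR2GRAY), (5, 5), 0)
    ty = cv2.GaussianBlur(cv2.cvtColor(tele_prime, cv2.COLOR_BGR2GRAY), (5, 5), 0)
    M = cv2.absdiff(wy, ty).astype(np.float64)

    mu1 = M.mean() if mu1 is None else mu1
    mu2 = 6 * mu1 if mu2 is None else mu2          # fixed here; adaptive via the flow map in the paper

    # Eqs. (4)/(5): keep M below mu1, stretch [mu1, mu2] to [mu1, 255], saturate above mu2.
    mid = (255 - mu1) / (mu2 - mu1) * M + (mu2 - 255) / (mu2 - mu1) * mu1
    M = np.where(M > mu2, 255.0, np.where(M > mu1, mid, M))

    # Eq. (6): weighted average; high-disparity pixels lean towards the wide reference.
    M3 = cv2.merge([M, M, M])
    fused = (M3 * wide_hat.astype(np.float64) + (255 - M3) * tele_prime.astype(np.float64)) / 255
    return np.clip(fused, 0, 255).astype(np.uint8)
```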

FOV Correction. The final fused image has the tele FOV and should be brought back to the wide FOV. Since the affine matrix Hω is used to warp the wide image to the tele FOV, the inverse affine transformation is applied to the fused image to bring it back to the wide FOV:

Hω⁻¹ : F → W    (7)

After affine warping based on Hω⁻¹, the fused image is back in the wide FOV, but the pixels outside the tele FOV are black. As the fused image around the borders is gradually made closer to the wide (reference) image, the original wide pixels are copied in this region to obtain the final fused image with the wide FOV.

Fig. 4. Transition block diagram

3.2 Multi Camera Video Transition

In a multi camera system it is essential to switch between cameras based on the scene; while zooming in, for example, it is better to switch from a wide camera to a tele camera. Since the super wide, wide and tele cameras are inherently different, large geometric and photometric differences appear at the time of a camera switch, which is not aesthetically pleasing to the end user. Hence a smooth transition while switching between cameras is required, which ensures that the entire zoom experience is similar to an optical zoom in a DSLR camera, as if there were only one zoom lens covering the entire large FOV of the super wide camera all the way down to the FOV of the tele camera. A high level block diagram of the proposed multi camera video transition algorithm is shown in Fig. 4. For brevity, the transition method is presented for super wide to wide switching.

Source Selection. In a typical triple camera scenario, camera selection is dependent upon the system zoom factor. Assuming the supported system zoom factor ranges from 1x to 10x, the super wide camera is chosen for zoom factors from 1x to 2x, the wide camera between 2x and 4x, and the tele camera between 4x and 10x. The second aspect that is critical in camera selection is the lighting condition. The ISO information is also utilized to make the decision, where large ISOs will switch to the wide camera automatically to ensure the best low light performance.

Transition Zone Estimation. In order to provide a smooth transition while switching camera sources, it is critical to identify a transition window. A transition window comprises a set of video frames that need additional processing to ensure a smooth and seamless transition. The additional processing is in the form of geometric and photometric alignment followed by a smoothing filter, as described in Fig. 4. The transition window size, i.e., the number of frames over which the additional processing is done, is denoted by N.

Geometric Parameter Estimation. The same feature based approach as described in Sect. 3.1 is used to estimate an affine transformation between the two video frames. An affine matrix is composed of the scale, shear, rotation and translation between the two camera planes. An example affine transform is given in (8):

A = | s_x·cosθ   sh_x·sinθ   t_x |
    | s_y·sinθ   sh_y·cosθ   t_y |    (8)

Photometric Parameter Estimation. For photometric alignment between the two frames, parameters are estimated using the patch based polynomial model described in Sect. 3.1.


Smooth Transition Filtering. To ensure a smooth and seamless geometric transition, each frame needs to be transformed by only a fraction of the final transformation. Hence, once the window size N is estimated, the steps are derived as given below.

θ^i = θ × i / N    (9)
t_x^i = t_x × i / N    (10)
t_y^i = t_y × i / N    (11)

where i is the index of a frame in the transition window. The rotation angle is obtained from the rotation matrix, and incremental rotation matrices are constructed to ensure a smooth transformation in terms of both translation and rotation, as given below.

A^i = [ R^i | t^i ]    (12)

R^i = [  cos θ^i    sin θ^i ]
      [ −sin θ^i    cos θ^i ]    (13)

t^i = [ t_x^i ]
      [ t_y^i ]    (14)

The final smooth transition output frames are calculated from the following equations:

W_g^i = A^i W^i    (15)
W_p^i = W_g^i + ( P(W_g^i, T^i) − W_g^i ) × i / N    (16)

where P(.) is the photometric alignment function as described in the earlier section.
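A rough numpy/OpenCV sketch of this smooth transition filtering, assuming the full rotation angle theta and translation (tx, ty) have been estimated and that a photometric alignment function is available; all names are illustrative:

```python
import numpy as np
import cv2

def transition_frames(wide_frames, tele_frames, theta, tx, ty, photometric_align):
    """Eqs. (9)-(16): frame i of the N-frame transition window receives a
    fraction i/N of the rotation/translation and of the photometric change."""
    N = len(wide_frames)
    out = []
    for i, (W, T) in enumerate(zip(wide_frames, tele_frames), start=1):
        a = theta * i / N                                      # Eq. (9)
        t = np.array([[tx * i / N], [ty * i / N]], np.float32)  # Eqs. (10)-(11), (14)
        R = np.array([[np.cos(a),  np.sin(a)],
                      [-np.sin(a), np.cos(a)]], np.float32)     # Eq. (13)
        A = np.hstack([R, t])                                   # Eq. (12)
        h, w = W.shape[:2]
        Wg = cv2.warpAffine(W, A, (w, h)).astype(np.float32)    # Eq. (15)
        aligned = photometric_align(Wg, T).astype(np.float32)
        Wp = Wg + (aligned - Wg) * i / N                        # Eq. (16)
        out.append(np.clip(Wp, 0, 255).astype(np.uint8))
    return out
```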

4 Experimental Results

In this section, experimental results for the proposed photometric alignment, multi-camera image fusion and video transition are presented.

Photometric Alignment. The proposed methods for photometric alignment are compared against the existing solutions of Xiu et al. [3] and Han et al. [6]. For comparisons, a set of synthetic images (Is) was generated from real images Ir taken from a Galaxy Note 8 as follows:

I_s = (I_r)^γ    (17)

where γ is a parameter chosen at random from the interval [0.9, 1.1]. Each algorithm is evaluated on its ability to recover the image Ir from Is. The PSNR between P(Is) and Ir, where P(.) is a photometric alignment algorithm, is used to quantify this process. These comparisons are summarized in Table 1.
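A minimal sketch of this synthetic evaluation protocol (Eq. (17)); the [0, 1] scaling and the fixed random seed are assumptions for illustration:

```python
import numpy as np

def synthesize_pair(real_img, rng=np.random.default_rng(0)):
    """Eq. (17): Is = Ir ** gamma with gamma ~ U[0.9, 1.1] on a [0, 1] image."""
    gamma = rng.uniform(0.9, 1.1)
    ir = real_img.astype(np.float32) / 255.0
    return np.clip(ir ** gamma, 0.0, 1.0), ir, gamma

def psnr(a, b, peak=1.0):
    """PSNR used to score how well an alignment P(.) recovers Ir from Is."""
    mse = np.mean((a.astype(np.float32) - b.astype(np.float32)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```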


Table 1. Quantitative evaluation of photometric alignment for synthetic data.

PSNR  Xiu et al. [3]  Han et al. [6]  Ours linear  Ours polynomial  Ours patch based
RGB   34.46           44.06           27.84        50.08            40.23
LAB   29.38           26.38           24.26        47.28            33.98
YUV   33.29           34.18           25.11        51.88            39.99

Table 2. Quantitative evaluation of photometric alignment for tele and wide images.

PSNR  Xiu et al. [3]  Han et al. [6]  Ours linear  Ours polynomial  Ours patch based
RGB   22.88           23.22           23.04        23.05            23.57
LAB   27.08           27.70           27.50        27.52            27.96
YUV   27.20           27.82           27.61        27.62            28.08

The proposed methods are also tested on tele and wide images from a Galaxy Note 8, taken under different lighting conditions, to ensure the robustness of the algorithm. The results show that for synthetic data, where a global gain is applied throughout the image, the polynomial fit works best, since a global gain is estimated; for real data, which does not have a global intensity change, the patch-based polynomial fit works best. The PSNR values are lower in the real-data evaluation because the pixels in real data are not accurately aligned, and this mismatch leads to lower PSNR values (Liu et al. [5]).

Multi-frame Image Fusion. The comparison of the input wide and tele images with the fused output, demonstrating the improvements in both sharpness and noise characteristics, is presented in Fig. 5. The proposed method is able to

Fig. 5. Comparison of Fused Image (Left) Tele Image (Center) and Wide Image (Right), showing Fused Image with details from Tele Image and Wide Image.


Fig. 6. Comparison of Galaxy Note 8 output (Left) vs Proposed fusion output (Right), showing better details and noise characteristics in proposed method.

bring in good noise characteristics from the wide image and at the same time bring better details from the tele image into the fused output image, thereby bringing the best of both worlds from each of these cameras into the fused output. A comparison of the proposed fusion results with the results generated by the Galaxy Note 8 is provided in Fig. 6. In these results the output from the Note 8 can be seen to lack details which the proposed method is able to bring in from the tele image. The noise characteristics are also better in the output from the proposed method. This is made possible by the blending method, in which accurately aligned pixels are picked from the detail-rich tele image, while the reference wide image, with better noise characteristics, is chosen for any misaligned pixels, thus giving both the desired qualities without artifacts in the final output.

Multi-camera Video Transition. In Fig. 7, results of the proposed algorithm for image transition in digital zoom are compared against camera switching without transition. In the video frames in the top row, without transition, the frame at zoom 1.9 is from the wide camera and the frame at 2.0 is from the tele camera. The consecutive frames at 1.9 and 2.0 show apparent color differences and are also geometrically misaligned. In the bottom row, where the transition algorithm is applied to the video frames, the frames at 1.9 and 2.0 are almost alike. From frame 2.0 to frame 2.4 the differences between tele and wide are incrementally brought in, and at the end the tele and wide frames at 2.4 are again geometrically and photometrically aligned. Since the misalignments are gradually brought in over 5 frames, the user does not experience the camera switch happening.


Fig. 7. Video frames without transition (top row) and with transition (bottom row) from zoom ratio 1.9 to 2.4, with switching of cameras at zoom 2.0, showing apparent switch at zoom 2.0 which is not visible in case of transition.

5 Conclusion

In this paper, a complete framework for image fusion in asymmetric tele-wide multi-camera systems and a method to provide a smooth transition between cameras while switching are presented. The proposed method aids in producing a seamless high-fidelity digital zoom that leverages the multiple camera sensors in contemporary smartphones. The proposed fusion algorithm brings in a good amount of detail from the tele image into the wide image, without any blending artifacts. The proposed methods for image fusion and transition have been tested on a wide range of imaging scenarios comprising indoor/outdoor lighting and moving objects, and generate better-quality, artifact-free results compared to those produced by a Samsung Note 8. The proposed method also produces a seamless transition while zooming, owing to the proposed smooth transition filtering algorithm. The proposed photometric alignment shows an improvement of 20% in PSNR over existing methods on the synthetic dataset, and a 3% improvement on real datasets from the Note 8.


References

1. Moon, B., Yu, S., Ko, S., Park, S., Paik, J.: Local self-similarity based super-resolution for asymmetric dual-camera. In: IEEE International Conference on Consumer Electronics - Berlin (2016)
2. Park, S., Moon, B., Yu, S., Ko, S., Park, S., Paik, J.: Brightness and color correction for dual camera image registration. In: IEEE International Conference on Consumer Electronics - Asia (2016)
3. Kim, H., Jo, J., Jang, J., Park, S., Paik, J.: Seamless registration of dual camera images using optical mask-based fusion. In: 2016 IEEE International Conference on Consumer Electronics - Asia (ICCE-Asia) (2016)
4. Jo, J., Jang, J., Paik, J.: Image fusion using asymmetric dual camera for zooming. In: Imaging and Applied Optics 2016 (3D, AO, AIO, COSI, DH, IS, LACSEA, MATH) (2016)
5. Liu, Y., Zhang, B.: Photometric alignment for surround view camera system. In: 2014 IEEE International Conference on Image Processing (ICIP), pp. 1827–1831, October 2014
6. Han, S.R., Min, J., Park, T., Kim, Y.: Photometric and geometric rectification for stereoscopic images. In: Proceedings of SPIE, vol. 8290, pp. 829007–829079 (2012)
7. Brown, M., Lowe, D.G.: Automatic panoramic image stitching using invariant features. Int. J. Comput. Vis. 74(1), 59–73 (2007)
8. Suen, S.T.Y., Lam, E.Y., Wong, K.K.Y.: Digital photograph stitching with optimized matching of gradient and curvature. In: Proceedings of SPIE, vol. 6069, pp. 60690G60–60690G12 (2006)
9. Liu, Y.-C., Lin, K.-Y., Chen, Y.-S.: Bird's-eye view vision system for vehicle surrounding monitoring. In: Sommer, G., Klette, R. (eds.) RobVis 2008. LNCS, vol. 4931, pp. 207–218. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78157-8_16
10. Uyttendaele, M., Eden, A., Szeliski, R.: Eliminating ghosting and exposure artifacts in image mosaics. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, vol. 2, pp. II-509–II-516 (2001)
11. Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: an efficient alternative to SIFT or SURF. In: ICCV 2011, pp. 2564–2571 (2011)
12. Muja, M., Lowe, D.G.: Fast approximate nearest neighbors with automatic algorithm configuration. In: International Conference on Computer Vision Theory and Application, VISSAPP 2009, pp. 331–340. INSTICC Press (2009)
13. Kroeger, T., Timofte, R., Dai, D., Van Gool, L.: Fast optical flow using dense inverse search. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 471–488. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_29
14. Anirudth, N., Prasad, B.H.P., Jain, A., Peddigari, V.: Robust photometric alignment for asymmetric camera system. In: 2018 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, pp. 1–4 (2018)
15. Zuiderveld, K.: Contrast limited adaptive histogram equalization. In: Graphics Gems IV, pp. 474–485. Academic Press Professional, San Diego (1994)

Energy Based Convex Set Hyperspectral Endmember Extraction Algorithm

Dharambhai Shah(B) and Tanish Zaveri

Institute of Technology, Nirma University, Ahmedabad, Gujarat, India
[email protected], [email protected]

Abstract. Spectral Mixture Analysis (SMA) for a given hyperspectral mixed pixel vector estimates the number of endmembers in the image along with their spectral signatures and abundance fractions. A novel algorithm called Energy-based Convex Set (ECS) is presented in this paper for unsupervised endmember extraction from hyperspectral data. The algorithm uses the concepts of band energy and convex geometry for the extraction of endmembers. The advantage of the proposed algorithm is that it combines the spatial information from band energy with the spectral information from convexity for improvement. The performance of the proposed algorithm and prevailing algorithms is evaluated with spectral angle error, spectral information divergence, and normalized cross-correlation on synthetic and real datasets. It is observed from the simulation results that the proposed algorithm gives better performance than other prevailing algorithms.

Keywords: Convex · Endmember extraction · Energy · Hyperspectral image · Spectral unmixing

1 Introduction

Imaging spectroscopy, also known as hyperspectral sensing, is a remarkable sensor technology in remote sensing. Hyperspectral sensors capture wavelengths ranging from the visible to the mid-infrared. This capability of capturing an enormous number of wavelengths makes hyperspectral sensors a preferable choice for many remote sensing applications. Applications like mineral mapping [15], soil analysis [10], water quality estimation [13] and vegetation analysis [17] are greatly improved by this sensor technology. However, these applications are affected by low spatial resolution, intimate mixtures of materials, and multiple scattering. Due to these reasons, many pixels in hyperspectral images are mixed pixels. Mixed pixels are pixels which are mixtures of more than one material. The analysis of these mixed pixels using the spectral features of the data is called Spectral Mixture Analysis (SMA). SMA basically comprises three processes, as shown in Fig. 1. The first process is finding out the number of materials. The second one is extracting spectral signatures from the image itself. And the final one is finding fractions of each


Fig. 1. Spectral mixture analysis process

material in each pixel. Out of these three processes, the second process is very challenging. The focus of this paper is on endmember extraction. In the last decade, many researchers have tried to extract endmembers from the image itself. The most popular endmember extraction algorithms are N-point Finder (NFINDR) [19], Simplex Growing Algorithm (SGA) [4], Successive Volume Maximization (SVMAX) [3], Alternating Volume Maximization (AVMAX) [3], Vertex Component Analysis (VCA) [14], TRIple-P: P-norm based Pure Pixel Identification (TRIP) [1], and Pixel Purity Index (PPI) [7]. Winter's belief [18], based on a convexity criterion, is a remarkable concept for endmember extraction. Winter's belief is that the volume of the simplex formed by pure materials is always greater than the volume of the simplex formed by any other combination of pixels. NFINDR, SGA, SVMAX and AVMAX are endmember extraction algorithms developed based on Winter's belief. The VCA and TRIP algorithms are based on orthogonal projection. The PPI algorithm is developed based on convexity and orthogonal projection. The above-mentioned algorithms view the data from the spectroscopic point of view while neglecting the spatial content present in the data. The first attempt to use both spatial and spectral information was made by Plaza et al. [16]. This first algorithm, Automated Morphological Endmember Extraction (AMEE), is based on morphological operations. A few other popular algorithms using both spatial and spectral information are Region-Based Spatial Pre-Processing (RBSPP) [11] and Spatial-Spectral Pre-Processing (SSPP) [12]. A novel algorithm, Energy-based Convex Set (ECS), is proposed for endmember extraction based on the concepts of band energy and convex sets. The motivation behind using the energy of a band is that it can help to find optimum bands for finding the convex set. As discussed earlier, convexity based on Winter's belief is very useful in endmember extraction. Combining band energy with convexity leads to improved accuracy of the spectral mixture analysis. The rest of the paper is organized as follows. Section 2 discusses the mathematical formulation of SMA along with the proposed work. Section 3 contains the dataset description, the definition of evaluation parameters and simulation results. The final section concludes the paper.

2 Proposed Algorithm

In this section, the proposed algorithm is explained along with the mathematical formulation of SMA. Given the mixed pixel matrix Y, the spectral mixture analysis model is

Y = MA + E    (1)

where the mixed pixel matrix, or hyperspectral image, (Y) is of size L × N, the mixing matrix (M) is of size L × Q, the abundance matrix (A) is of size Q × N, and the error matrix (E) is of size L × N. The notations L, Q, and N denote the number of bands, the number of materials, and the number of pixels in the hyperspectral image, respectively. The values of L and N are known from the image, while the value of Q can be obtained from well-known algorithms like Hysime [2] or Virtual Dimensionality (VD) [6]. Each column of the abundance matrix is limited by the two following constraints.

– Abundance Non-negativity Constraint (ANC): a_ij ≥ 0, ∀ i of j    (2)

– Abundance Sum-to-one Constraint (ASC): Σ_{i=1}^{Q} a_ij = 1, ∀ i of j    (3)

The proposed algorithm is implemented in four steps. The first is pre-processing, which normalizes each band of the data. The second step is the band energy calculation. The third step is the convex set optimization. Finally, extra points are removed, if any.

1. Pre-processing. In the hyperspectral image Y, each pixel b_j of the ith band (b^i) is converted to b̄_j. Band normalization for the ith band (b^i) is as follows:

b̄_j = ( b_j − min(b^i) ) / ( max(b^i) − min(b^i) ), ∀ j ∈ b^i    (4)

Each pixel b_j is replaced by b̄_j in the hyperspectral data, and each new band is represented as b̄^i instead of b^i.

2. Band energy. For the normalized data, the band energy is calculated as

EN(b̄^i) = ‖b̄^i‖² = Σ_{j=1}^{N} (b̄^i_j)²    (5)

The above equation gives the energy of each band, which will be useful in the next step. A new set S is created based on the ascending order of band energy.
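A minimal numpy sketch of the pre-processing and band-energy steps (Eqs. (4) and (5)); the small epsilon guarding against division by zero is an implementation assumption:

```python
import numpy as np

def band_energy_order(Y):
    """Y: hyperspectral cube of shape (L, N) -- L bands, N pixels.
    Normalizes each band to [0, 1] (Eq. 4), computes its energy (Eq. 5)
    and returns the band indices sorted by ascending energy (the set S)."""
    Y = Y.astype(np.float64)
    mn = Y.min(axis=1, keepdims=True)
    mx = Y.max(axis=1, keepdims=True)
    Y_norm = (Y - mn) / (mx - mn + 1e-12)     # Eq. (4), eps avoids /0
    energy = np.sum(Y_norm ** 2, axis=1)      # Eq. (5)
    order = np.argsort(energy)                # set S: ascending band energy
    return Y_norm, order
```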


3. Convex set optimization. As mentioned in [5], convexity-based endmember extraction is very popular in spectral mixing analysis. In convex geometry, the affine hull of a set {x_1, x_2, ..., x_Q} is defined as

aff{x_1, x_2, ..., x_Q} = { r = Σ_{i=1}^{Q} θ_i x_i | θ ∈ ℝ, 1_Q^T θ = 1 }    (6)

The above equation can easily be related to Eq. (1), but the problem with the affine hull is that it considers numbers from ℝ, while in Eq. (1) SMA should follow the ANC, which involves only positive numbers. So the convex hull is best suited to Eq. (1). The convex hull of a set {x_1, x_2, ..., x_Q} is defined as

conv{x_1, x_2, ..., x_Q} = { r = Σ_{i=1}^{Q} θ_i x_i | θ ∈ ℝ_+, 1_Q^T θ = 1 }    (7)

The convex hull in Eq. (7) extracts a total of N_X convex set points. The convex set points are extracted using the following optimization problem:

minimize_{X ∈ T}  |N_X − Q|
subject to        N_X ≥ Q    (8)

where T = {t_1, t_2, ..., t_{L/2}} is a set of numbers which are the numbers of points involved in making the convex set for the two-band data b_l and b_h. Here b_l and b_h are low-energy and high-energy bands taken from the newly formed set S, respectively. The reason for putting the constraint in Eq. (8) is that it is necessary for an endmember extraction algorithm to extract a minimum of Q materials.

4. Removal of extra points. There are two possibilities (N_X = Q and N_X > Q) in Eq. (8). If N_X = Q, then none of the points is removed. But for N_X > Q, the extra (N_X − Q) points need to be removed. These extra points are removed using the Euclidean distance D(x) between a point x(i, j) and its neighbouring pixel x_1(i_1, j_1). The Euclidean distance is defined as

D(x) = sqrt( (i − i_1)² + (j − j_1)² )    (9)

The (N_X − Q) points having the smallest D(x) are removed. The algorithm returns the matrix M̂, which has Q L-dimensional vectors. Pseudocode for the proposed algorithm is described in Table 1, and a rough sketch of the convex set optimization follows below.
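A rough Python sketch of steps 3 and 4 for a single low-energy/high-energy band pair, using scipy's convex hull; the point-removal step here uses distances between consecutive hull points as a stand-in for the spatial-neighbour distance of Eq. (9), so this is an interpretation rather than the authors' exact procedure:

```python
import numpy as np
from scipy.spatial import ConvexHull

def ecs_candidates(Y_norm, order, Q):
    """Project all pixels onto a (low-energy, high-energy) band pair,
    take the 2-D convex hull (Eq. 7) and keep Q of its vertices."""
    bl, bh = order[0], order[-1]                 # lowest / highest energy bands
    pts = np.stack([Y_norm[bl], Y_norm[bh]], axis=1)   # (N, 2)
    hull = ConvexHull(pts)
    idx = list(hull.vertices)                    # N_X hull points (pixel indices)
    # Drop extra points: remove those closest to their neighbouring hull point.
    while len(idx) > Q:
        d = [np.linalg.norm(pts[idx[k]] - pts[idx[(k + 1) % len(idx)]])
             for k in range(len(idx))]
        idx.pop(int(np.argmin(d)))
    return idx    # columns of Y giving the endmember signature matrix M_hat
```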

3 Simulation Results

This section includes description of datasets (synthetic and real), definition of evaluation parameters and discussion on simulation results.


Table 1. Pseudocode of proposed algorithm

Inputs: Y, Q
For i = 1 to L
    Normalize the ith band            (Eq. 4)
    Calculate EN(b̄^i)                 (Eq. 5)
EndFor
Sort bands in ascending order of energy and create the new set S
Find the optimized N_X for the optimization problem:
    minimize_{N_X ∈ T} |N_X − Q|  subject to  N_X ≥ Q    (Eq. 8)
If N_X > Q
    Calculate the Euclidean distance D(x) for each point x    (Eq. 9)
    Remove (N_X − Q) points
EndIf
M̂ = spectral signatures of the convex set points
Output: M̂

3.1 Dataset

Synthetic Dataset: Five images (Exponential Gaussian (Fig. 2a), Rational Gaussian (Fig. 2b), Spheric Gaussian (Fig. 2c), Matern Gaussian (Fig. 2d), Legendre (Fig. 2e)) are used as the synthetic dataset. Each image is of size 128 × 128 × 431. These five synthetic images are generated using the Hyperspectral Imagery Synthesis toolbox (HIST) package [9]. Five endmembers (fiberglass-gds374, sheet metal-gds352, brick-gds350, vinyl-plastic-gds372, asphalt-gds367) from the USGS spectral library [8] are selected as pure endmembers in this tool.

Real Dataset: Two benchmark real datasets (Cuprite and Jasper), shown in Fig. 3, are used in this simulation.

– The Cuprite image, captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor, is the benchmark image in hyperspectral unmixing research. After removal of noisy and water-absorption bands, an image of size 250 × 190 × 188 is processed. The image consists of 12 endmembers [20].
– The Jasper image is an AVIRIS image collected over the Jasper bridge. A Jasper image of size 95 × 95 × 156 is taken after removal of low signal-to-noise-ratio bands. The image has 4 endmembers (water, soil, road, tree) [20].

3.2 Evaluation Parameters

Let the Ground Truth (GT) endmember be m = [m_1, m_2, ..., m_L]^T and the endmember extracted by the proposed algorithm be m̂ = [m̂_1, m̂_2, ..., m̂_L]^T. Three standard


Fig. 2. Synthetic dataset: (a) Exponential, (b) Rational, (c) Spheric, (d) Matern, (e) Legendre

Fig. 3. Real dataset: (a) Cuprite, (b) Jasper

evaluation parameters defined in this section are used to compare the performance of endmember extraction algorithms.

Spectral Angle Error (SAE): The SAE between two vectors m_i and m̂_i is defined as

θ_i = cos⁻¹( (m_i · m̂_i) / (|m_i| |m̂_i|) )    (10)

The SAE vector θ = [θ_1, θ_2, ..., θ_Q]^T of size Q × 1 is defined for all Q endmembers. The Root Mean Square (RMS) value of the SAE vector is RMS_SAE [14], defined as

RMS_SAE: θ = ( (1/Q) E[ ‖θ‖₂² ] )^{1/2}    (11)

where E[.] denotes the expectation operator.

Spectral Information Divergence (SID): SID is an information-theory-based measurement parameter to test the spectral variability between two spectra. Probability vectors p = [p_1, p_2, ..., p_L]^T and q = [q_1, q_2, ..., q_L]^T are defined for the two L-dimensional spectra m and m̂, respectively. Each component of the probability vectors is defined as


Fig. 4. Spectral angle error

p_i = m_i / Σ_{i=1}^{L} m_i ,    q_i = m̂_i / Σ_{i=1}^{L} m̂_i    (12)

The SID value between two spectra m_i and m̂_i is defined as

φ_i = Σ_{j=1}^{L} p_j log( p_j / q_j ) + Σ_{j=1}^{L} q_j log( q_j / p_j )    (13)

The first term in (13) is the relative entropy of m_i with respect to m̂_i. The Q-dimensional SID vector φ = [φ_1, φ_2, ..., φ_Q]^T is the SID error vector defined for all Q endmembers. RMS_SID [14] is defined as

RMS_SID: φ = ( (1/Q) E[ ‖φ‖₂² ] )^{1/2}    (14)

Normalized Cross Correlation (NXC): The NXC between two spectra m_i and m̂_i of size L × 1 is defined as

ν = ( 1 / (L − 1) ) × Σ_{i=1}^{L} (m_i − a_m)(m̂_i − a_m̂) / ( s_m × s_m̂ )    (15)


Here, a_m and a_m̂ are the average (mean) values of the spectra m and m̂, respectively, and s_m and s_m̂ are the standard deviations of the spectra m and m̂, respectively. The RMS value of the Normalized Cross Correlation vector (ν = [ν_1, ν_2, ..., ν_Q]^T) is defined as

RMS_NXC: ν = ( (1/Q) E[ ‖ν‖₂² ] )^{1/2}    (16)

Fig. 5. Spectral information divergence

Fig. 6. Normalized cross correlation
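A minimal numpy sketch of the three metrics in Eqs. (10)-(16), assuming ground-truth and extracted endmembers have already been matched; the epsilon terms are implementation assumptions:

```python
import numpy as np

def sae(m, m_hat):
    """Spectral angle error between two L-dimensional spectra (Eq. 10)."""
    c = np.dot(m, m_hat) / (np.linalg.norm(m) * np.linalg.norm(m_hat))
    return np.arccos(np.clip(c, -1.0, 1.0))

def sid(m, m_hat, eps=1e-12):
    """Spectral information divergence (Eqs. 12-13)."""
    p, q = m / (m.sum() + eps), m_hat / (m_hat.sum() + eps)
    return np.sum(p * np.log((p + eps) / (q + eps))) + \
           np.sum(q * np.log((q + eps) / (p + eps)))

def nxc(m, m_hat):
    """Normalized cross correlation (Eq. 15)."""
    L = len(m)
    return np.sum((m - m.mean()) * (m_hat - m_hat.mean())) / \
           ((L - 1) * m.std(ddof=1) * m_hat.std(ddof=1))

def rms(per_endmember_values):
    """RMS aggregation over the Q endmembers (Eqs. 11, 14, 16)."""
    v = np.asarray(per_endmember_values, dtype=np.float64)
    return np.sqrt(np.mean(v ** 2))
```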

3.3 Discussion

Synthetic Dataset Comparison: The endmembers extracted by all algorithms, including the proposed one, are similar to the GT endmembers. The evaluation parameter values computed for all algorithms are θ = 0, φ = 0, and ν = 1. A comparison for the synthetic dataset is not made, as the results of all algorithms are the same.

Real Dataset Comparison: The Root Mean Square (RMS) values of the spectral angle error, spectral information divergence and normalized cross-correlation for both real datasets (Cuprite and Jasper) are plotted in Figs. 4, 5 and 6, respectively. The proposed algorithm is compared with the prevailing algorithms (NFINDR, SGA, SVMAX, AVMAX, VCA, TRIP, PPI, RBSPP, SSPP, AMEE) in these three graphs. It is observed from these graphs that the SAE and SID errors are the least and the NXC value is the maximum for the proposed algorithm in comparison to the other algorithms. This shows that the proposed algorithm gives better endmembers than the prevailing algorithms.

4 Conclusion

The process of endmember extraction in spectral mixture analysis is addressed in this paper. A novel algorithm combining band energy with convexity is proposed with the aim of reducing error. Combining energy-based spatial information with convexity-based spectral information leads to an improvement in the simulation results. The performance of the proposed algorithm is tested on both synthetic and real datasets. It is observed from the simulation results that the proposed algorithm outperforms other prevailing algorithms.

Acknowledgement. This work has been carried out under the grant received from the Vishveshvarya Ph.D. scheme by the Government of India. The authors of this paper are also thankful to the management of Nirma University for providing the necessary infrastructure and support.

References 1. Ambikapathi, A., Chan, T., Chi, C., Keizer, K.: Hyperspectral datageometry-based estimation of number of endmembers using p-norm-based pure pixel identification algorithm. IEEE Trans. Geosci. Remote Sens. 51(5), 2753–2769 (2013). https:// doi.org/10.1109/TGRS.2012.2213261 2. Bioucas-Dias, J.M., Nascimento, J.M.: Hyperspectral subspace identification. IEEE Trans. Geosci. Remote Sens. 46(8), 2435–2445 (2008) 3. Chan, T., Ma, W., Ambikapathi, A., Chi, C.: A simplex volume maximization framework for hyperspectral endmember extraction. IEEE Trans. Geosci. Remote Sens. 49(11), 4177–4193 (2011). https://doi.org/10.1109/TGRS.2011.2141672 4. Chang, C.I., Wu, C.C., Liu, W., Ouyang, Y.: A new growing method for simplexbased endmember extraction algorithm. IEEE Trans. Geosci. Remote Sens. 44(10), 2804–2819 (2006). https://doi.org/10.1109/TGRS.2006.881803


5. Chang, C.I.: Real-Time Progressive Hyperspectral Image Processing. Springer, New York (2016). https://doi.org/10.1007/978-1-4419-6187-7 6. Chang, C.I., Du, Q.: Estimation of number of spectrally distinct signal sources in hyperspectral imagery. IEEE Trans. Geosci. Remote Sens. 42(3), 608–619 (2004) 7. Chang, C.I., Plaza, A.: A fast iterative algorithm for implementation of pixel purity index. IEEE Geosci. Remote Sens. Lett. 3(1), 63–67 (2006). https://doi.org/10. 1109/LGRS.2005.856701 8. Clark, R.N., et al.: USGS digital spectral library splib06a. US geological survey, digital data series 231, 2007 (2007) 9. Computational Intelligence Group, University of the Basque Country/Euskal Herriko Unibertsitatea (UPV/EHU), Spain: Hyperspectral imagery synthesis (EIAS) toolbox (2010) 10. Ghosh, G., Kumar, S., Saha, S.K.: Hyperspectral satellite data in mapping saltaffected soils using linear spectral unmixing analysis. J. Indian Soc. Remote Sens. 40(1), 129–136 (2012). https://doi.org/10.1007/s12524-011-0143-x 11. Martin, G., Plaza, A.: Region-based spatial preprocessing for endmember extraction and spectral unmixing. IEEE Geosci. Remote Sens. Lett. 8(4), 745–749 (2011) 12. Martin, G., Plaza, A.: Spatial-spectral preprocessing prior to endmember identification and unmixing of remotely sensed hyperspectral data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 5(2), 380–395 (2012) 13. Mishra, R., Shah, D., Zaveri, T., Ramakrishnan, R., Shah, P.: Separation of sewage water based on water quality parameters for South Karnataka coastal region, vol. 2017. Asian Association on Remote Sensing, October 2017 14. Nascimento, J.M., Dias, J.M.: Vertex component analysis: a fast algorithm to unmix hyperspectral data. IEEE Trans. Geosci. Remote Sens. 43(4), 898–910 (2005) 15. Oskouei, M.M., Babakan, S.: Detection of alteration minerals using Hyperion data analysis in Lahroud. J. Indian Soc. Remote Sens. 44(5), 713–721 (2016). https:// doi.org/10.1007/s12524-016-0549-6 16. Plaza, A., Martinez, P., Perez, R., Plaza, J.: Spatial/spectral endmember extraction by multidimensional morphological operations. IEEE Trans. Geosci. Remote Sens. 40(9), 2025–2041 (2002) 17. Wang, M., Niu, X., Yang, Q., Chen, S., Yang, G., Wang, F.: Inversion of vegetation components based on the spectral mixture analysis using hyperion data. J. Indian Soc. Remote Sens. 46(1), 1–8 (2017). https://doi.org/10.1007/s12524-017-0661-2 18. Winter, M.E.: N-FINDR: an algorithm for fast autonomous spectral end-member determination in hyperspectral data. In: Imaging Spectrometry V, vol. 3753, pp. 266–276. International Society for Optics and Photonics (1999) 19. Xiong, W., Chang, C., Wu, C., Kalpakis, K., Chen, H.M.: Fast algorithms to implement N-FINDR for hyperspectral endmember extraction. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 4(3), 545–564 (2011). https://doi.org/10.1109/ JSTARS.2011.2119466 20. Zhu, F.: Hyperspectral unmixing: ground truth labeling, datasets, benchmark performances and survey. arXiv preprint arXiv:1708.05125 (2017)

Fast Semantic Feature Extraction Using Superpixels for Soft Segmentation

Shashikant Verma(B), Rajendra Nagar, and Shanmuganathan Raman

Indian Institute of Technology Gandhinagar, Gandhinagar, Gujarat, India
{shashikant.verma,rajendra.nagar,shanmuga}@iitgn.ac.in

Abstract. In this work, we address the problem of extracting high-dimensional, soft semantic feature descriptors for every pixel in an image using a deep learning framework. Existing methods rely on a metric learning objective called multi-class N-pair loss, which requires pairwise comparison of positive examples (same-class pixels) to all negative examples (different-class pixels). Computing this loss for all possible pixel pairs in an image leads to a high computational bottleneck. We show that this huge computational overhead can be reduced by learning this metric based on superpixels. This also conserves the global semantic context of the image, which is lost in pixel-wise computation because of the sampling used to reduce comparisons. We design an end-to-end trainable network with a loss function and give a detailed comparison of two feature extraction methods: pixel-based and superpixel-based. We also investigate hard semantic labeling of these soft semantic feature descriptors.

Keywords: Feature extraction · Semantic representation · Image segmentation · Superpixels

1 Introduction

Automatic object detection and characterization is a difficult problem. Challenges arise due to the fact that the appearance of the same object can differ heavily in terms of orientation, texture, color, etc. Therefore, it is necessary that the feature description for pixels belonging to the same class be such that it can cope with all variations. There has been wide research in this field using traditional approaches. Features such as SIFT [1] and SURF [2] are crafted in such a way that they remain scale and orientation invariant. The problem of scale variance of features can be explained by scale-space theory [3]. Though various traditional approaches work well for a small set of images or a particular shape of an object, their efficiency in capturing characteristics for a large number of classes with their respective variations has not been explored much. The current state-of-the-art methods rely on deep learning for feature extraction, which outperforms traditional approaches. Deep learning based segmentation methods extract a feature map of the image and then assign a probability measure on it. Hence, we obtain a hard label for every pixel signifying its class.


In soft segmentation, a pixel can belong to more than one segment. Therefore, it represents soft transitions between the boundaries of objects. These soft transitions have a wide range of applications in image matting, deblurring, editing, and compositing [4,5]. Deep learning techniques use various metric learning methods to establish a relation between the input and the output. Deep networks can learn these complex non-linear metrics through loss functions such as the contrastive loss, triplet loss, etc. These losses are often used to obtain discriminative feature maps for applications in image retrieval and face recognition [6]. The loss incurred in these frameworks is computed considering only one negative example, and hence these methods lead to slow convergence. Addressing this issue, the authors of [7] proposed the multi-class N-pair loss, extending the triplet loss so that a positive example is compared with all possible negative examples. The metric learning technique in [7] can be used for extracting distinct semantic features such that objects similar in semantic context tend to have similar features. As there can be a large number of both pixels and class labels in an image, N-pair loss computation can be a computational bottleneck. To deal with this, [4] adopts a sampling approach: they iteratively sample a sizeable number of pixels from a subset of randomly chosen classes and compute the pairwise loss on these selections. A major drawback of this approach is that the global context of image diversity is not captured and not all negative class examples are employed for the loss calculation. To address this problem, we leverage the property of superpixels of representing a set of similar pixels. We employ a modified form of the N-pair loss on all superpixels instead of pixels. This also preserves the global context of semantic diversity in an image. In Sect. 3, we discuss the detailed implementation of the layer which determines the feature vector for superpixels and propose the architecture of an end-to-end trainable network. The model learns to extract similar features for superpixels if they represent the same semantic class and dissimilar features otherwise. In Sect. 4, we discuss the complexity of our methodology during the forward and backward passes, and assess the semantic hard labeling of the obtained feature descriptors and their application in obtaining soft segmentation using the matting method of [4].

2 Related Work

There are many widely used loss functions for deep metric learning, such as the Euclidean loss, softmax loss, contrastive loss, N-pair loss, etc. For extracting features, the contrastive and triplet losses impose a margin parameter such that features of different classes are distant from each other by at least the imposed margin. Both loss functions suffer from slow convergence as they employ only one negative example at a time for learning [7]. Moreover, to reduce computational complexity, they require data sampling for positive and negative samples to accelerate training [4,8]. We use a modified form of the N-pair loss as in [7] for our feature extraction, which addresses the problems of these losses while maximizing inter-class separability.


Image segmentation techniques have improved significantly with deep learning methods. Most of these methods incorporate a softmax loss for the classification of each pixel to its respective context [9–11]. Though softmax losses can efficiently predict the probability of a feature belonging to each class, they fail to obtain the most separable discriminative feature map for positive and negative examples. Inspired by the margin imposition in contrastive and triplet losses, [12] proposes a combination of a margin term with softmax losses. Since softmax loss alone tends to fail in extracting discriminative features specific to the problems addressed, [13,14] combine both of the losses for learning. For soft segmentation, per-pixel feature extraction is an important step. [4] adopts a discriminative feature learning metric by imposing the loss in a hard negative data mining style. They further use the semantic discrimination ability of these features to obtain soft segments using a spectral matting technique. Some other soft segmentation methods use matting with the color information of the pixels [15,16].

3 Method

Our method has the following steps. A feature extractor CNN extracts the feature map, with the kernel weights learned during training. We over-segment the input image to obtain superpixels and define a mapping function on the extracted feature map to determine a feature descriptor for every superpixel. We employ a modified form of the N-pair loss on these superpixels to update the weights of the network. We discuss the specific details of the architecture and feature extraction, the mapping function to obtain superpixel features, and the loss function in Sects. 3.1 and 3.2, respectively.

3.1 Model Architecture

We extract features for semantic soft segmentation with a neural network, as shown in Fig. 1. We use cascaded ResNet bottleneck blocks [17] as the baseline of the network for feature extraction and downsample the map to approximately one-third of the initial input size. Output feature maps at different layers contain different contextual information: lower-level features are object-contour and edge aware, while higher-level features are context aware. In the end, we concatenate all these features as discussed in [18]. Thus, the obtained map has information about the semantic context and the contours of objects and regions in an image. As a network grows deeper, convolutional neural networks face the problem of vanishing gradients, which can be addressed by making skip connections in the architecture; a basic ResNet bottleneck block serves this purpose [17]. We downsize the feature map output of the ResNet blocks by the max-pooling operation. Finally, we concatenate feature maps from various layers of the network by bilinearly upsampling them to match the original image shape. We perform the metric learning operation on these concatenated features using superpixels instead of the per-pixel approach used by [4]. This increases the computational performance of the network and makes the model aware of global semantic


information in an image. We generate superpixels with the Simple Linear Iterative Clustering (SLIC) algorithm [19], which uses LAB features and spatial information per pixel to over-segment the image.

Fig. 1. Architecture of feature extraction CNN.

Feature Descriptor for Superpixels. For every superpixel generated by the SLIC algorithm, we first need to associate a ground truth label for supervised learning using the CNN. As every superpixel is a collection of pixels which are similar to each other in the local context, we assign the ground truth label of a superpixel from the labels of the pixels constituting it. Due to inaccuracy along edges and other factors, a superpixel may contain differently labeled pixels. In such a case, the label having the majority among all pixels is assigned as the ground truth label; refer to Eq. (1).

Consider an image I containing N_p pixels with its ground truth labeled into N_c semantic classes. Let P = {p_i}_{i=1}^{N_p} be the set of all pixels, L = {l_i}_{i=1}^{N_p} be their corresponding labels, where l_i ∈ {1, 2, ..., N_c}, and S = {S_i}_{i=1}^{N_s} be the set of all N_s superpixels generated by SLIC. We define the sets P_i ⊂ P and L_i ⊂ L to be the set of pixels and the set of labels that a superpixel S_i contains, respectively. Assume S_i contains a total of k pixels, so that |P_i| = k = |L_i|, and P_i = {p_j}_{j=1}^{k}, L_i = {l_j}_{j=1}^{k} with pixel p_j having label l_j. The ground truth label l_Si for superpixel S_i is determined by Eq. (1).

l_Si = max( { C(l_i) | C(l_i) = Σ_{j=1}^{k} δ(l_j, l_i) }_{i=1}^{k} )    (1)

l_S = [l_S1, l_S2, ..., l_SNs]^T    (2)

where C(l_i) represents the number of occurrences of pixels labeled l_i in superpixel S_i, l_S is a vector representing the labels of all N_s superpixels, and δ(i, j) is the Kronecker delta function, which takes the value 1 if i = j and 0 otherwise. Let us represent the final concatenated features from the feature extractor by F with F ∈ R^{D×H×W} and the input image by I ∈ R^{3×H×W}, where D, H, W are the dimensions


Fig. 2. Flowchart

of the extracted feature map, and the height and width of the image, respectively. Note that for the N_p pixels, there exist N_p feature vectors of length D in the feature map F. For each pixel p ∈ P, we represent its feature vector as F_p such that F_p ∈ F. We find a single feature vector F_Si for superpixel S_i ∈ S as the average of the features of all pixels in P_i by Eq. (3). Note that P_i is the set of all pixels contained in superpixel S_i, with |P_i| = k and P_i ⊂ P.

F_Si = ( 1 / |P_i| ) Σ_j F_pj , ∀ p_j ∈ P_i ⊂ P , i ∈ {1, 2, ..., N_s}    (3)

F_S = [F_S1, F_S2, ..., F_SNs]^T    (4)

Using Eq. (3), we determine the averaged feature for all superpixels and obtain a superpixel feature map F_S with F_S ∈ R^{N_s×D} and the corresponding ground truth labels l_S with l_S ∈ R^{N_s} from Eqs. (4) and (2), respectively. A short sketch of this step is given below.
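A minimal numpy sketch of Eqs. (1)-(4), assuming the superpixel assignment map comes from an off-the-shelf SLIC implementation (e.g. skimage.segmentation.slic); variable names are illustrative:

```python
import numpy as np

def superpixel_targets(features, labels, sp_map):
    """features: (D, H, W) feature map F, labels: (H, W) ground truth,
    sp_map: (H, W) superpixel index per pixel (0..Ns-1, e.g. from SLIC).
    Returns F_S (Ns, D) averaged features and l_S (Ns,) majority labels."""
    D = features.shape[0]
    Ns = int(sp_map.max()) + 1
    flat_feat = features.reshape(D, -1).T             # (H*W, D)
    flat_lab = labels.reshape(-1)
    flat_sp = sp_map.reshape(-1)
    F_S = np.zeros((Ns, D), dtype=np.float32)
    l_S = np.zeros(Ns, dtype=np.int64)
    for s in range(Ns):
        idx = np.flatnonzero(flat_sp == s)
        F_S[s] = flat_feat[idx].mean(axis=0)          # Eq. (3)
        l_S[s] = np.bincount(flat_lab[idx]).argmax()  # majority vote, Eq. (1)
    return F_S, l_S
```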

3.2 Loss Function

For learning, we use a modified form of the N-pair loss [7], in which we use the L2 distance between superpixel feature vectors instead of the inner product, in a similar way


as in [4]. Consider two feature vectors of superpixels S_p and S_q represented by F_Sp, F_Sq ∈ F_S, with labels l_Sp, l_Sq ∈ l_S respectively, where p, q ∈ {1, 2, ..., N_s}. We define the loss function such that F_Sp and F_Sq are similar to each other if l_Sp = l_Sq and dissimilar otherwise, by Eq. (5):

L_pq = (1 / |S|) [ Σ_{p,q ∈ |S|} I[l_Sp = l_Sq] log( 1 + exp( ‖F_Sp − F_Sq‖ / 2 ) )
       + Σ_{p,q ∈ |S|} I[l_Sp ≠ l_Sq] log( 1 + exp( −‖F_Sp − F_Sq‖ / 2 ) ) ]    (5)

where |S| = N_s is the total number of superpixels generated using SLIC, I[.] is an indicator function which is 1 if the statement holds true and 0 otherwise, and ‖.‖ represents the L2 distance between the feature vectors. Note that if l_Sp = l_Sq and the two superpixels have dissimilar features, then the log(1 + exp(‖F_Sp − F_Sq‖/2)) term in Eq. (5) evaluates to a larger value and contributes to the total loss, which needs to be minimized by back-propagation through the network. Moreover, if the features of both superpixels are similar, this term stays small and little loss is induced. Figure 2 shows the flowchart of the adopted methodology. Note that superpixel supervision is only needed for the loss calculation and is no longer needed once training is over. Since from the ground truth information we only use the fact of whether two features under consideration belong to the same category or not, the learning is class agnostic. In Sect. 4, we assess semantic hard labelling of the learned feature descriptors.

Feature Extraction and Selection. We fix the size of the input to the CNN by resizing the images to 224 × 224. Then, we obtain a 128-dimensional feature map F. For loss computation, we over-segment the input image into N_s = 500 superpixels and compute the N-pair loss as defined in Eq. (5). Using Eqs. (3) and (4), we obtain a 128-dimensional feature vector for every superpixel, giving F_S ∈ R^{128×500}. This feature map for every superpixel has enough capacity to capture the diverse contextual data of objects. We perform guided filtering [20] on the obtained feature map with the guidance of the input image. This lets the features adhere more to the boundaries and contours present in the image. To reduce dimensionality and select the dominant features from F, we use principal component analysis (PCA) [21] and generate a 3D feature map corresponding to the three largest eigenvalues. We show a comparison of the selected dominant features and a few random 3D projections of the filtered 128-D map F in Figs. 3 and 4, respectively. To show the feature descriptors of superpixels, we define a mapping on F_S as M : F_S → F̃, where F̃ ∈ R^{128×224×224}, as in Eqs. (6) and (7).

M(F_Si) = F_Si    (6)

F̃_p = M(F_Si) | ∀ p ∈ P_i , i ∈ {1, 2, ..., N_s}    (7)

where F̃_p ∈ F̃ is the mapped 128-dimensional feature vector for pixel location p. Note that P_i is the set of all pixels that the superpixel S_i contains. We show a

is the mapped 128 dimensional feature vector for pixel location p.

p ∈ F where F Note that Pi is the set of all pixels that the superpixel Si contains. We show a

Semantic Soft Segmentation

67

vs feature map F in Fig. 5. Observe comparison of mapped superpixel features F in Fig. 5. Hence, we use that feature map F is smoother and continuous than F F for feature selection using PCA and further processing. This also reduces the complexity of CNN once training is over as superpixel supervision is no more needed. In Fig. 2, we show a flowchart of approach during training and testing of network.

4

Experimental Analysis

We trained our network on ADE20k dataset [22], which contains 150 labeled semantic classes. We used basic bottleneck block from ResNet as a building block of our feature extractor CNN. Weights of the network were initialized by Xavier initialization [24]. We start with a learning rate of 1 × 10−3 and use stochastic gradient descent optimizer with the momentum of 0.9, weight decay of 5 × 10−4 and poly learning rate of 0.9 as suggested in [25]. We generated Ns = 500 number of superpixels for input image by SLIC algorithm [19] to estimate loss during training session. Note that few pixels may remain disjoint after over segmentation using SLIC. To ensure connectivity, post-processing of superpixel is done so that every disjoint pixel is assigned to a nearby superpixel. Due to this operation, we may obtain a lesser amount of superpixels than Ns . We populate FS and lS with redundant data to match the dimensions of incoming batches and compute loss using Eq. 5. We train for 60K iterations with a batch size of four over ADE20k training split which has 20210 images. It takes about 12 h on NVIDIA Titan Xp 12 GB GPU. We show the semantic soft segmentation by matting method proposed in [4] on our features and comparison in Fig. 6. Loss Computational Complexity. It is easy to notice that over segmenting an image into superpixels reduces the number of data points from total pixels in image |P| = Np to |S| = Ns . As mentioned by authors of SLIC algorithm [19], the complexity to compute superpixels is O(Np ) and for computing L2 distance between feature descriptors for every possible pairs of superpixels, complexity is of order O(Ns2 ). Thus, our method achieves total complexity of O(Np ) + O(Ns2 ). In our experiment we use Ns = 500 and image of size 224×224 having total pixel count Np = 50176. We compare our method of loss estimation with [4], which employ sampling based approach to reduce complexity. They randomly select Ninst number of instances out of Nc labelled classes from image, sample Nsamp amount of pixels from every selected instance and compute L2 loss between each possible pixel pairs. They repeat this process Niter times. Let us define the total number of feature descriptors sampled as Ntotal = Niter × Ninst × Nsamp . The 2 ). Numerical values computational complexity for this method becomes O(Ntotal reported in [4] are as Niter = 10, Ninst = 3 and Nsamp = 1000 having complexity of O(300002 ) compared to O(5002 ) + O(50176) by our method. Note that in our method, superpixel computation does not cause any back-propagation overhead. During training time, gradients accumulation in the graph is only due to Ns2 data points while the method in [4] needs an accumulation of gradients for all Ntotal points. In Table 1, we summarize the comparison of both the methods.

68

S. Verma et al.

Fig. 3. For an input image (a), we show effect on features obtained after PCA with and without guided filtering in (b), (c) respectively. Notice the adherence of features towards edges in (c) due to guided filtering operation. Image taken from ADE20k dataset [22]

Fig. 4. For an input image in (a) we extract 128-dimensional feature map F and show randomly sampled 3d maps from it in (c, d, e, f, g, h). (b) shows 3-dimensional feature selection on F using PCA. Notice the semantic context embedded in different channels of map. Image taken from ADE20k dataset [22]

Fig. 5. For an input image, Row 1 shows randomly sampled 3D maps from F and Row 2 shows superpixel feature descriptors by mapping M defined in 6. Observe the  Zoom in caption is enhanced for better visualization. smoothness of F compared to F. Image taken from coco stuff dataset [23]

Semantic Soft Segmentation

69

Fig. 6. (a) Three different input images from coco stuff dataset [23] (b) Features extracted in [4] (c) Features extracted using our method (d) Semantic soft segmentation results reported in [4] (e) Semantic soft segmentation results produced by our features using matting method of [4]. Table 1. Computational complexity of two methods. Note that in our method O(Np ) don’t add any overhead to gradient computations during backpropagation

Method Our method

Iterations Instances Points Complexity NA

Aksoy et al. [4] Niter

NA

Ns

Ninst

Ntotal

O(Ns2 ) + O(Np ) 2 O(Ntotal )

Parameters 3.7 M 68.6 M

Semantic Hard Segmentation. To assess the hard labeling of obtained feature descriptors, we add layers to reduce the dimension of feature map to a total number of labeled classes Nc in ground truth for evaluating cross entropy loss. We employ architecture shown in Fig. 7 on CityScape dataset [26] which has a total of Nc = 19 labelled classes. We use feature extractor CNN in the test mode to obtain soft semantic feature map of 128 dimensions. We train the model over training split of the dataset and form a confusion matrix for every pixel with its predicted label and ground truth label to compute mean Intersection over Union (IoU), pixel accuracy, class accuracy and frequency weighted IoU. We report standard metrics of semantic segmentation and some results on validation split in Table 2 and Fig. 8, respectively. Note that, we use lower level features from first ResNet block inspired from [18] to acquire contours and edges of slim objects like poles, traffic lights which seem to be lost due to in-efficiency of SLIC. From Table 2, we observe that soft features extracted posses enough diversity to be used for semantic segmentation purposes by integrating it with various stateof-the-art methods. We compare the results with benchmarks on CityScapes dataset in Table 3.

70

S. Verma et al.

Fig. 7. Semantic hard labelling of soft feature descriptors by employing cross entropy loss. Note that we use feature extractor CNN in test mode and only learn weights for added convolution and batch-normalization layers. Color legend for layers is same as in Fig. 1.

Table 2. It shows the metric evaluation on CityScapes [26] dataset, where mIoU, fwIoU refers to mean and frequency weighted intersection Over union.

Pixel accuracy 88.12% Class accuracy 58.65% mIoU

40.37%

fwIoU

81.66%

Input image

Ground Truth

Table 3. It compares the results with state of the art methods on this dataset.

Method

mIoU

SegNet basic [11]

57.0%

FCN-8s [10]

65.3%

DeepLab [25]

63.1%

DeepLab-CRF [25] 70.4% Proposed approach 40.37%

Soft Features

Prediction Map

Fig. 8. Semantic segmentation by employing cross entropy loss on obtained 128D feature map. Soft features shown are selected 3D features using PCA.

Semantic Soft Segmentation

5

71

Conclusion

We have proposed a deep metric learning method to extract a feature descriptor per pixel in an image. We have designed a layer which can be back-propagated to determine features for SLIC superpixels and have proposed an end-to-end trainable model architecture. We have employed the multi-class N-pair loss on superpixels instead of pixels, thus reducing the complexity of loss computation and backpropagation overhead. We have shown that the feature map learned has diverse semantic context of an input image and can be used for various applications like semantic soft and hard segmentation.

References 1. Brown, M., Lowe, D.G.: Invariant features from interest point groups. In: BMVC, vol. 4 (2002) 2. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404– 417. Springer, Heidelberg (2006). https://doi.org/10.1007/11744023 32 3. Lindeberg, T.: Scale-Space Theory in Computer Vision, vol. 256. Springer, Boston (2013). https://doi.org/10.1007/978-1-4757-6465-9 4. Aksoy, Y., Oh, T.-H., Paris, S., Pollefeys, M., Matusik, W.: Semantic soft segmentation. ACM Trans. Graph. (TOG) 37(4), 72 (2018) 5. Pan, J., Hu, Z., Su, Z., Lee, H.-Y., Yang, M.-H.: Soft-segmentation guided object motion deblurring. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 459–468 (2016) 6. Chopra, S., Hadsell, R., LeCun, Y., et al.: Learning a similarity metric discriminatively, with application to face verification. In: CVPR, vol. 1, pp. 539–546 (2005) 7. Sohn, K.: Improved deep metric learning with multi-class N-pair loss objective. In: Advances in Neural Information Processing Systems, pp. 1857–1865 (2016) 8. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823 (2015) 9. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890 (2017) 10. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015) 11. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017) 12. Liu, W., Wen, Y., Yu, Z., Yang, M.: Large-margin softmax loss for convolutional neural networks. In: ICML, vol. 2, no. 3, p. 7 (2016) 13. Sun, Y., Chen, Y., Wang, X., Tang, X.: Deep learning face representation by joint identification-verification. In: Advances in Neural Information Processing Systems, pp. 1988–1996 (2014) 14. Zhang, X., Zhou, F., Lin, Y., Zhang, S.: Embedding label structures for finegrained feature representation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1114–1123 (2016)

72

S. Verma et al.

15. Aksoy, Y., Ozan Aydin, T., Pollefeys, M.: Designing effective inter-pixel information flow for natural image matting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 29–37 (2017) 16. Singaraju, D., Vidal, R.: Estimation of alpha mattes for multiple image layers. IEEE Trans. Pattern Anal. Mach. Intell. 33(7), 1295–1309 (2010) 17. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 18. Bertasius, G., Shi, J., Torresani, L.: High-for-low and low-for-high: efficient boundary detection from deep object features and its applications to high-level vision. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 504–512 (2015) 19. Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., S¨ usstrunk, S.: Slic superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 34(11), 2274–2282 (2012) 20. He, K., Sun, J., Tang, X.: Guided image filtering. IEEE Trans. Pattern Anal. Mach. Intell. 35(6), 1397–1409 (2012) 21. Bickel, P., Diggle, P., Fienberg, S., Gather, U., Olkin, I., Zeger, S.: Springer Series in Statistics. Springer, New York (2009). https://doi.org/10.1007/978-0-387-77501-2 22. Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20k dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017) 23. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1 48 24. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing humanlevel performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015) 25. Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834– 848 (2017) 26. Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

Spatially Variant Laplacian Pyramids for Multi-frame Exposure Fusion Anmol Biswas(B) , K. S. Green Rosh, and Sachin Deepak Lomte Samsung R&D Institute, Bangalore, India [email protected], {greerosh.ks,sachin.lomte}@samsung.com

Abstract. Laplacian Pyramid Blending is a commonly used method for several seamless image blending tasks. While the method works well for images with comparable intensity levels, it is often unable to produce artifact free images for applications which handle images with large intensity variation such as exposure fusion. This paper proposes a spatially varying Laplacian Pyramid Blending to blend images with large intensity differences. The proposed method dynamically alters the blending levels during the final stage of Pyramid Reconstruction based on the amount of local intensity variation. The proposed algorithm out performs state-of-the-art methods for image blending both qualitatively as well as quantitatively on publicly available High Dynamic Range (HDR) imaging dataset. Qualitative improvements are demonstrated in terms of details, halos and dark halos. For quantitative comparison, the noreference perceptual metric MEF-SSIM was used.

1

Introduction

Seamlessly blending images together is an important low-level image processing operation that is used in a variety of applications such as, panorama generation, High Dynamic Range (HDR) imaging and so on. Laplacian Pyramid Blending [1] is a popular method for this as it can generate natural looking images at relatively low computational effort. Of these applications, HDR imaging is a method for faithfully capturing the large dynamic range present in a natural scene by a camera with limited sensor capabilities - that can only produce Low Dynamic Range (LDR) images. Typically it is done by exposure fusion - that is, to take multiple LDRs with bracketed exposure times and blend those images. This necessitates blending images with large variations in intensity. There has been a lot of research done to work out the optimal weighting function to weigh the different exposures [2–5]. Once the weight maps are computed, HDR pipelines typically use Laplacian Pyramid Blending with a certain number of levels to blend these images [2]. From Fig. 1, it can be observed that keeping a constant number of pyramid levels for blending the entire image forces us to accept some tradeoffs. Lower number of levels produces higher dynamic range, but the blended image has sharp boundary halos and looks unnatural. Higher number of levels creates more natural images, but at the cost of dynamic range, details and spread-out halos. c Springer Nature Singapore Pte Ltd. 2020  N. Nain et al. (Eds.): CVIP 2019, CCIS 1147, pp. 73–81, 2020. https://doi.org/10.1007/978-981-15-4015-8_7


Fig. 1. Challenges in Laplacian Pyramid Blending. (a) Input exposure stack (b) Blending with 3 levels (c) Blending with 6 levels. With a small number of levels (b), the output image looks cartoonish and produces strong gradient reversal artifacts (red arrows). With a large number of levels (c), the image looks more natural, but produces wide halos (red arrows) and fewer details (yellow inset) (Color figure online)

Several works have tried to address the issue of halos in multi-exposure fusion. Shen et al. [6] develop a method where details extracted at different levels are boosted to reduce artifacts using a custom weight map. This method, however, is too slow to be implemented in real time. Li et al. [7] introduce weighted guided image filtering, which can be applied to multi-exposure fusion. None of these methods, however, takes into account the spatial structure of the images while deciding the amount of detail extraction required; i.e., they use a constant number of image pyramid levels throughout the image. This paper proposes a blending algorithm based on Pyramid Blending which can effectively alter the level of blending spatially in a patch-based manner. This dynamic behaviour is dictated by the spatial characteristics of the weight function and works to maximize dynamic range and minimize halos, while otherwise maintaining the naturalness of the blended image that comes with using a reasonably large number of pyramid levels for blending. The rest of the paper is organized as follows. Section 2 discusses the proposed method to blend two images taken with different exposures and the pipeline to blend multiple images into one HDR output. Section 3 provides qualitative and quantitative comparisons of the blending method with several state-of-the-art exposure fusion methods.

1.1 Overview of Laplacian Pyramid Blending

Given two images $I_s$ and $I_h$ and a weight map $W$, Laplacian pyramid blending [1] provides a methodology to smoothly blend the images. It consists of three major steps: pyramid decomposition, pyramid blending and pyramid reconstruction. In the pyramid decomposition step, Gaussian pyramids of the images are first generated as follows:

$I^{l}_{Gauss} = (G \ast I^{l-1}_{Gauss})\downarrow$   (1)


Fig. 2. Comparison of average Laplacian weights and patch variance weights. (a) Input exposure stack (b) Output of the proposed method with average Laplacian weighting (c) Output of the proposed method with patch variance weighting. Weighting by patch variance shows a slight improvement in terms of halos, visible around the tree.

Here, $I^{l}_{Gauss}$ refers to the $l$th level of the Gaussian pyramid for image $I \in \{I_s, I_h, W\}$, $G$ refers to a Gaussian filter, $\ast$ refers to the convolution operator and $\downarrow$ denotes a downsampling operator. Next, the Laplacian pyramids are constructed as follows:

$I^{l}_{Laplacian} = \begin{cases} I^{l}_{Gauss} - G \ast (I^{l+1}_{Gauss}\uparrow), & \text{if } l < M \\ I^{l}_{Gauss}, & \text{if } l = M \end{cases}$   (2)

Here $I^{l}_{Laplacian}$ refers to the $l$th level of the Laplacian pyramid for image $I \in \{I_s, I_h\}$, $\uparrow$ refers to an upsampling operator and $M$ refers to the total number of levels in the pyramid. This is followed by the pyramid blending stage, where the Laplacian pyramids are blended to form a blended pyramid $I_{LapBlend}$ as follows:

$I^{l}_{LapBlend} = I^{l}_{s\,Laplacian} \otimes W^{l}_{Gauss} + I^{l}_{h\,Laplacian} \otimes (1 - W^{l}_{Gauss})$   (3)

Here $\otimes$ denotes an element-wise multiplication operator. This is followed by the final pyramid reconstruction stage, where each level $I^{l}_{Recon}$ is reconstructed as follows:

$I^{l}_{Recon} = \begin{cases} G \ast (I^{l+1}_{Recon}\uparrow) + I^{l}_{LapBlend}, & \text{if } l < M \\ I^{l}_{LapBlend}, & \text{if } l = M \end{cases}$   (4)

The top-most level of the reconstructed images, i.e., $I^{1}_{Recon}$, gives the final blended output. While Laplacian pyramid blending works well for most image fusion tasks, it produces artifacts for multi-exposure fusion in regions of large intensity differences. When the total number of levels is large, the algorithm tends to produce wide halos around edges, and when the number of levels is small, strong gradient reversal and halo artifacts are observed (Fig. 1). To counter these issues, a modified Laplacian pyramid blending with spatially varying levels is proposed as described below. A minimal code sketch of the standard blending is given after this paragraph.
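The following is a minimal NumPy/OpenCV sketch of Eqs. (1)-(4) for two single-exposure inputs and a weight map. It is an illustration of the standard blending, not the authors' implementation; it assumes float images in [0, 1] whose dimensions are divisible by 2**levels, and a weight map with the same shape as the images.

import cv2
import numpy as np

def laplacian_blend(I_s, I_h, W, levels=6):
    # Eq. (1): Gaussian pyramids of the two inputs and the weight map
    gp_s, gp_h, gp_w = [I_s], [I_h], [W]
    for _ in range(levels - 1):
        gp_s.append(cv2.pyrDown(gp_s[-1]))
        gp_h.append(cv2.pyrDown(gp_h[-1]))
        gp_w.append(cv2.pyrDown(gp_w[-1]))
    # Eq. (2): Laplacian pyramids (top level keeps the Gaussian residual)
    lp_s = [gp_s[l] - cv2.pyrUp(gp_s[l + 1]) for l in range(levels - 1)] + [gp_s[-1]]
    lp_h = [gp_h[l] - cv2.pyrUp(gp_h[l + 1]) for l in range(levels - 1)] + [gp_h[-1]]
    # Eq. (3): blend the Laplacian levels using the Gaussian pyramid of W
    blended = [ls * w + lh * (1.0 - w) for ls, lh, w in zip(lp_s, lp_h, gp_w)]
    # Eq. (4): reconstruct from the coarsest level down to the finest
    recon = blended[-1]
    for l in range(levels - 2, -1, -1):
        recon = cv2.pyrUp(recon) + blended[l]
    return recon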


Fig. 3. Halos and Gradient Reversal (Dark Patches/Shadows). (a) Input image stack (b) Output of standard Laplacian Pyramid Blending (c) Output of Gu et al. [8] (d) Output of Boosting Laplacian Pyramid [6] (e) Proposed method with patch variance weighting function. Standard Pyramid Blending (b) has halos and gradient reversal (dark patches); Boosting Laplacian Pyramid (d) mainly has dark patches and some halos (black arrows); (c) loses some dynamic range due to saturation around the sun (black box); while (e) has neither halos nor dark patches

2 Proposed Method

Given a set of $n$ short exposure images $I_s^{1\ldots n}$, high exposure images $I_h^{1\ldots n}$, and $n-1$ weight maps $W^{1\ldots(n-1)}$, the aim is to produce a blended HDR image which looks natural and has fewer halos. This paper proposes a method to blend images using spatially varying Laplacian Pyramids to reduce halos and gradient reversal artifacts (dark patches around bright regions). First, a brief overview of the standard Laplacian Pyramid blending [1] is given, followed by the proposed spatially variant Laplacian Pyramid Blend. For brevity, the blending of two images is detailed in this section. The method can easily be extended to handle multiple images as shown in Sect. 2.2.

2.1 Proposed Spatially Variant Level-Based Blending

Conventional Laplacian pyramid blending assumes a uniform number of levels throughout the image. A higher number of levels is essential for a smooth blending. However, this causes halo and gradient reversal artifacts in regions of large intensity changes in the input images. Hence, a pyramid blending algorithm is developed which takes into account the intensity variations in the input image and dynamically changes the level of blending applied to each region. The proposed method also processes the inputs in three steps similar to Laplacian


Fig. 4. Halos and Gradient Reversal 2 (Dark Patches/Shadows). (a) Input image stack (b) Output of standard Laplacian Pyramid Blending (c) Output of Gu et al. [8] (d) Output of Boosting Laplacian Pyramid [6] (e) Proposed method with patch variance weighting function. Halos in (b) and strong gradient reversal in the form of dark shadows around the clouds in (d) are visible inside the black rectangle. (c) shows discoloration and darkening of the sky (black box)

pyramid blend [1]. The first stage of pyramid decomposition is identical to the pyramid decomposition step in [1]. In the second stage of pyramid blending, an additional Gaussian blend pyramid $I_{GaussBlend}$ is also constructed along with $I_{LapBlend}$ as follows:

$I^{l}_{GaussBlend} = I^{l}_{s\,Gauss} \otimes W^{l}_{Gauss} + I^{l}_{h\,Gauss} \otimes (1 - W^{l}_{Gauss})$   (5)

To bring in the intensity variation, the images are processed in overlapping patches of $K \times K$ at each level during the third stage of pyramid reconstruction. At each level of pyramid reconstruction, each $K \times K$ patch is reconstructed as follows:

$I^{l}_{p\,Recon} = \begin{cases} (1-\alpha)\,\big(G \ast (I^{l+1}_{p\,Recon}\uparrow) + I^{l}_{p\,LapBlend}\big) + \alpha\, I^{l}_{p\,GaussBlend}, & \text{if } l < M \\ I^{l}_{p\,LapBlend}, & \text{if } l = M \end{cases}$   (6)

Here $I_p$ refers to a $K \times K$ patch extracted from a given layer of the blended Laplacian, blended Gaussian or reconstructed pyramid, and $\alpha$ refers to a dynamically computed weighting factor which encompasses the intensity variations for a given patch. A higher value of $\alpha$ denotes a larger variation in intensity, and correspondingly a lower level of blending. To understand this, consider an $\alpha$ value of 1. In this case the reconstructed image at a level $l$ is the same as


Fig. 5. Dynamic Range and Preserving Details. (a) Input image stack (b) output of standard Laplacian Pyramid Blending (c) Output of Boosting Laplacian Pyramid [6] (d) Output of Gu et al. [8] (e) Proposed method with Patch Variance weighting function. Black Inset shows the region where details in saturated region are improved by Boosting Laplacian Pyramid and the proposed method

the Gaussian blended image. This effectively cancels out any contribution from the higher levels. The weighting factor $\alpha$ is modelled as a function of the input image, $\alpha = f(\mathcal{M}(I_h))$. Here $\mathcal{M}$ is a map derived from image $I_h$ as follows:

$\mathcal{M}(I) = \exp\!\left(-\frac{(I - 1.0)^2}{2\sigma^2}\right)$   (7)

Here $I$ is normalized to the range [0, 1] and the value of $\sigma$ is empirically chosen to be 0.3. The choice of the function $f(\cdot)$ should be representative of the intensity changes in the given patch. Two different functions for $f(\cdot)$ were experimented with, as described below; a short code sketch follows.

Laplacian of Patch. The Laplacian of any image patch gives the gradient information present in the patch, which is indicative of the amount of intensity variation in the patch. Hence, for the first choice of the weighting function, the mean intensity of the Laplacian of the patches extracted from $\mathcal{M}(I_h)$ was taken.

Variance of Patch. The variance of the image pixel intensities also gives a good representation of the spatial variation of intensity. Hence the variance of the patches extracted from $\mathcal{M}(I_h)$ was also experimented with as $f(\cdot)$. The outputs of the two proposed methods are compared in Sect. 3.
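A minimal sketch of Eq. (7) and the patch-variance weighting is given below. The patch size and the scaling used to map the variance into [0, 1] are assumptions for illustration; the paper only states that 50%-overlapping K x K patches are used and that sigma = 0.3.

import numpy as np

def intensity_map(I_h, sigma=0.3):
    # Eq. (7): emphasises regions of I_h that are close to saturation
    return np.exp(-((I_h - 1.0) ** 2) / (2.0 * sigma ** 2))

def patch_variance_alpha(M, K=32, scale=1.0):
    # Variance-of-patch weighting: one alpha per 50%-overlapping K x K patch.
    # 'scale' (assumed) maps the raw variance into the [0, 1] blending weight.
    H, W = M.shape[:2]
    stride = K // 2
    alphas = {}
    for y in range(0, H - K + 1, stride):
        for x in range(0, W - K + 1, stride):
            alphas[(y, x)] = float(np.clip(scale * M[y:y + K, x:x + K].var(), 0.0, 1.0))
    return alphas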

(Color bar in Fig. 6: number of blending levels, from low to high)

Fig. 6. Visualization of spatially varying levels. (a) Input image stack (b) Maps, showing the spatial variation of the equivalent number of levels in the proposed method. Top - Map for blending first two images, Bottom - Map for blending the previous output and third image of the set (c) Output of the proposed method

Blending of Patches. Once the patches at each level are reconstructed, they have to be blended in a seamless manner to generate the full image. The modified raised cosine filter proposed in [9] is used to blend the overlapping patches. It is to be noted that the patches have an overlap factor of 50%.

2.2 Extension to Multiple Frames

Given $N$ differently exposed images and $N-1$ weight maps, the proposed two-frame blending method can be sequentially executed $N-1$ times to generate the output as follows:

$I_o^i = B(I_o^{i-1}, I_i, W_i)$   (8)

Here $I_o^i$ represents the HDR output formed by blending $i$ frames and $I_i$ denotes the $i$th input image. $B$ refers to the proposed blending algorithm. A short sketch of this sequential scheme is given below.
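The sequential scheme of Eq. (8) reduces to a simple loop. The helper blend_two() stands for the two-frame spatially variant blend of Sect. 2.1 and is an assumed interface, not code from the paper.

def fuse_exposure_stack(images, weight_maps, blend_two):
    # images: list of N exposures, weight_maps: list of N - 1 maps
    output = images[0]
    for i in range(1, len(images)):
        # Eq. (8): I_o^i = B(I_o^{i-1}, I_i, W_i)
        output = blend_two(output, images[i], weight_maps[i - 1])
    return output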

3 Experimental Results

3.1 Qualitative Comparisons

The proposed exposure fusion method was evaluated on a publicly available HDR dataset [10]. Figure 2 compares the two variants of the proposed method. It is observed that patch variance as the weighting factor produces slightly superior outputs. Henceforth, patch variance is used for all comparisons. The output of the proposed method with patch variance as the weighting factor is compared with the output of standard Laplacian Pyramid Blending with the same number of levels and an identical weight map, and with the outputs of Boosting Laplacian Pyramid [6] and Gu et al. [8], in Figs. 3, 4 and 5. Halos, dark patches arising from gradient reversal and the level of detail are specifically compared. It is observed


that the proposed method is the most effective at eliminating halos and dark patches while maintaining a level of detail and dynamic range roughly similar to Boosting Laplacian Pyramid.

Table 1. Comparisons using MEF-SSIM

Image set        Gu et al. [8]  Raman et al. [11]  Shen et al. [6]  Proposed
Balloons         0.913          0.768              0.902            0.936
Belgium House    0.896          0.809              0.915            0.933
Chinese Garden   0.927          0.911              0.917            0.96
Kluki            0.921          0.901              0.861            0.915
Memorial         0.87           0.617              0.909            0.94
Cave             0.933          0.693              0.922            0.914
Farmhouse        0.932          0.877              0.942            0.971
House            0.876          0.77               0.876            0.829
Lamp             0.871          0.864              0.875            0.902
Landscape        0.94           0.953              0.872            0.954
LightHouse       0.934          0.938              0.873            0.962
Office           0.899          0.906              0.93             0.962
Tower            0.931          0.895              0.873            0.965
Venice           0.889          0.892              0.868            0.915
Mean             0.909          0.842              0.895            0.932

3.2 Quantitative Comparisons

Quantitative comparison results are provided against the boosted Laplacian Pyramid proposed by Shen et al. [6], the gradient field exposure fusion proposed by Gu et al. [8] and the bilateral filter based multi-exposure compositing proposed by Raman et al. [11]. The MEF-SSIM metric proposed by Ma et al. [10] is used for comparison. This metric is chosen for the objective evaluation of the proposed algorithm since it provides a way to analyse perceptual image quality without the need for any reference images. The metric is based on the multi-scale SSIM principle [12] and measures the local structure preservation of the output image with respect to the input images at fine scales and luminance consistency at coarser scales. For a more detailed discussion of the metric, the reader may refer to [10]. The algorithms are evaluated on the public dataset provided by [10]. The results are summarized in Table 1. From the table, it can be observed that the proposed method produces much better scores in most of the scenes, and produces a better overall score. This shows that the algorithm generalizes very well compared to the state-of-the-art methods.

3.3 Visualizations

Figure 6 shows the spatial variation of the equivalent number of blending levels in the proposed method. It clearly indicates how the level of blending is reduced in and around regions with high intensity variation and how the effect slowly dies out with distance. This variation allows the algorithm to control halos while also maintaining the naturalness of the final output image.

4 Conclusion

This paper proposes a novel method for blending images with large variation in intensity. It overcomes the limitations of Laplacian Pyramid Blending by dynamically altering the blending level on a spatially varying basis. This allows us to leverage a range of blending levels in a single image, as opposed to conventional methods. Extensive qualitative comparisons with state-of-the-art methods in exposure fusion show the ability of the proposed algorithm to generate artifact-free images with greater dynamic range and detail. A quantitative study using MEF-SSIM scores is also performed to show that the method produces perceptually superior results.

References

1. Burt, P., Adelson, E.: The Laplacian pyramid as a compact image code. IEEE Trans. Commun. 31, 532-540 (1983)
2. Mertens, T., Kautz, J., Van Reeth, F.: Exposure fusion: a simple and practical alternative to high dynamic range photography. Comput. Graph. Forum 28, 161-171 (2009)
3. Debevec, P.E., Malik, J.: Recovering high dynamic range radiance maps from photographs. In: ACM SIGGRAPH 2008 Classes, p. 31. ACM (2008)
4. Robertson, M.A., Borman, S., Stevenson, R.L.: Dynamic range improvement through multiple exposures. In: Proceedings 1999 International Conference on Image Processing (Cat. 99CH36348), vol. 3, pp. 159-163. IEEE (1999)
5. Tursun, O.T., Akyüz, A.O., Erdem, A., Erdem, E.: The state of the art in HDR deghosting: a survey and evaluation. Comput. Graph. Forum 34, 683-707 (2015)
6. Shen, J., Zhao, Y., Yan, S., Li, X., et al.: Exposure fusion using boosting Laplacian pyramid. IEEE Trans. Cybern. 44, 1579-1590 (2014)
7. Li, Z., Zheng, J., Zhu, Z., Yao, W., Wu, S.: Weighted guided image filtering. IEEE Trans. Image Process. 24, 120-129 (2014)
8. Gu, B., Li, W., Wong, J., Zhu, M., Wang, M.: Gradient field multi-exposure images fusion for high dynamic range image visualization. J. Vis. Commun. Image Represent. 23, 604-610 (2012)
9. Hasinoff, S.W., et al.: Burst photography for high dynamic range and low-light imaging on mobile cameras. ACM Trans. Graph. (TOG) 35, 192 (2016)
10. Ma, K., Zeng, K., Wang, Z.: Perceptual quality assessment for multi-exposure image fusion. IEEE Trans. Image Process. 24, 3345-3356 (2015)
11. Raman, S., Chaudhuri, S.: Bilateral filter based compositing for variable exposure photography. In: Eurographics (Short Papers), pp. 1-4 (2009)
12. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P., et al.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13, 600-612 (2004)

Traffic Sign Recognition Using Color and Spatial Transformer Network on GPU Embedded Development Board

Bhaumik Vaidya1(B) and Chirag Paunwala2

1 Gujarat Technological University, Ahmedabad, India
[email protected]
2 Electronics and Communication Department, Sarvajanik College of Engineering and Technology, Surat, India

Abstract. Traffic sign recognition is an integral part of any driver assistance system as it helps the driver in taking driving decisions by notifying about the traffic signs coming ahead. In this paper, a novel architecture based on a Convolutional Neural Network (CNN) is proposed for traffic sign classification. It incorporates a Color Transformer Network and a Spatial Transformer Network (STN) within the CNN to make the system invariant to color and affine transformations. The aim of this paper is to compare the performance of this novel architecture with existing architectures in constrained road scenarios. The performance of the algorithm is compared on two well-known traffic sign classification datasets: the German Traffic Sign dataset and the Belgium Traffic Sign dataset. The paper also covers the deployment of the trained CNN model on the Jetson Nano GPU embedded development platform. The performance of the model is also verified on the Jetson Nano development board.

Keywords: Traffic sign classification · Convolutional Neural Networks · Spatial Transformer Network · Color transformer network · Jetson Nano development board

1 Introduction

Traffic sign detection and classification is an essential part of any driver assistance or autonomous driving system. Traffic signs are used to maintain traffic discipline on roads and avoid road accidents. The use of traffic signs is prominent on highways during night time and in hilly regions to identify curves on the road. Traffic signs can aid driving in adverse visual conditions like foggy or rainy environments. The correct interpretation of traffic signs can help reduce road accidents drastically. Traffic signs are designed using a specific shape and color to indicate valuable information like traffic rules, speed limits, turns on the road, road conditions, ongoing construction etc. This shape and color information can be used to detect and classify traffic signs. Many systems have been built to detect traffic signs using color and shape


information. These systems have their limitations. Even otherwise, detecting and classifying traffic signs is a complex computer vision problem. Problems like color fading, partial occlusion, illumination variation, motion blur etc. are major hurdles in detecting and classifying traffic signs. The availability of large datasets and machines with high computational power has revolutionized research in the domain of traffic sign classification. The traditional method of using shape and color for identification is being replaced by various CNN architectures. Still, many CNN architectures need image preprocessing to make them color and transform invariant. This paper introduces an end-to-end CNN architecture which is invariant to color and affine transformation and helps in classifying traffic signs accurately. Some embedded platform is needed to deploy this model on hardware if the system is to be included in vehicles. The Jetson Nano development board is used in this paper to deploy the trained CNN model on hardware. The rest of the paper is organized as follows. Section 2 summarizes related work on traffic sign classification and detection. The details of the proposed system and the theoretical background of the algorithms used are explained in Sect. 3. Section 4 shows the experimental results of the proposed system on two datasets. The last sections contain the conclusion and the references used for the work.

2 Related Work

Traffic sign classification is the process of identifying the class of a sign from an image, while traffic sign detection is to localize the sign by adding a bounding box around the detected signs. There are two approaches for traffic sign detection and classification in the literature. The first approach uses traditional image processing techniques while the other approach uses different deep neural network architectures. Traffic signs have definite colors and shapes, so these were used traditionally to detect and classify traffic signs. Shadeed et al. [1] proposed a traffic sign classification system using color segmentation in the YUV color space. Li et al. [2] proposed a technique based on color segmentation and shape matching. De La Escalera et al. [3] proposed a system based on color segmentation and color thresholding. Most of the color based methods use the HSI, HSV and YCbCr color spaces for segmentation, as the RGB color space is very sensitive to changes in illumination. The advantage of color based techniques is that they are fast, but they are very sensitive to color information. These systems can fail after the color fades. Traffic signs are mostly circular or triangular. They have a definite shape, so this information can be used to classify traffic signs from an image. Barnes et al. [4] used the Hough transform to detect traffic signs which have the shape of polygons. The Hough transform is a very simple and fast algorithm to identify regular shapes in an image. Berkaya et al. [5] used the EDCircles algorithm to identify circular traffic signs. Keller et al. [6] used Histogram of Oriented Gradients (HOG) features to identify traffic signs from an image. The shape based methods are not as fast as color based methods but can still be used in real time. They fail when the traffic sign is deformed or occluded by other objects. The color and shape based methods used to detect traffic signs use some classification algorithm for classifying the different types of traffic signs. Maldonado-Bascon et al.


[7] used a Support Vector Machine (SVM) for classification. Cireşan et al. [8] used the Random Forest algorithm for classification and Ruta et al. [9] used the K-Nearest Neighbor algorithm for classification. The second approach for detection and classification is to use deep neural networks. Cireşan et al. [8] proposed a multi-column deep neural network for classifying traffic signs from a color image. Sermanet et al. [10] proposed a multi-scale CNN architecture for traffic sign classification. Aghdam et al. [11] proposed an end-to-end CNN architecture for traffic sign detection and classification. The use of CNN architectures has significantly increased the accuracy of traffic sign detection and classification. Most CNN based architectures need preprocessing steps like color space conversion, histogram equalization, data augmentation etc. to make them color, transformation and illumination invariant. The motivation of this paper is to eliminate this preprocessing step and build an end-to-end CNN architecture which is invariant to affine transformation, color and illumination. The proposed CNN architecture takes a raw image as input and gives the corresponding class as output without any extra steps. The details of the proposed system are explained in the next section.

3 Proposed System The simplified flow chart for the proposed system is shown in Fig. 1 below.

Fig. 1. Flow chart for proposed system

As can be seen from the figure, the input image is given to a single neural network which internally contains a Color Transformer Network [12] for color space conversion, a Spatial Transformer Network [13] for learning spatial transformations and a CNN for feature extraction and classification. The beauty of the system is that the entire network is trainable in one go. The color transformer network and the STN learn their parameters in the same way as a regular CNN, using gradient descent and back propagation. The entire system is trained using the German Traffic Sign Classification dataset [14] or the Belgium Traffic Sign dataset [15]. The trained model is deployed on the Jetson Nano development board [16] for real-time inference on an embedded platform. The individual constituents of the system are explained one by one below:


3.1 Color Transformer Network

Color images are stored in RGB format inside a computer. This format is very sensitive to changes in illumination, so to make a robust system researchers use the HSV or YCbCr format to train the CNN model. This requires deciding on a color channel format which is optimal for the problem and an explicit color space conversion step. It would be preferable if these color conversion parameters could also be learned by the network during training according to the dataset. Mishkin et al. proposed a technique for learned color space conversion using 1 × 1 convolutions [12]. This paper adopts the best technique from that paper for a color transformer network, which passes the RGB values through ten 1 × 1 convolutions followed by three 1 × 1 convolutions. The ReLU activation function is used. This network learns the color transformation parameters according to the dataset while training (a minimal sketch of this stack is given after Fig. 2).

3.2 Spatial Transformer Network

Jaderberg et al. proposed the STN [13], which is a learnable module that allows spatial transformation of the data within the network. This module can be integrated with an existing CNN architecture to give it the ability to learn spatial transformation parameters conditioned on the training dataset, without the need for any additional training. The architecture of the STN is shown in Fig. 2 below:

Fig. 2. Spatial Transformer Network [13]
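As referenced in Sect. 3.1, the following is a minimal Keras sketch of the color transformer: a stack of 1 × 1 convolutions (ten filters, then three) applied to the raw RGB input. Using ReLU after both layers is the interpretation assumed here; it is an illustration, not the authors' code.

import tensorflow as tf
from tensorflow.keras import layers

def color_transformer(rgb_input):
    # Learned color-space conversion after Mishkin et al. [12]:
    # ten 1x1 filters followed by three 1x1 filters, so the output is again 3-channel.
    x = layers.Conv2D(10, kernel_size=1, activation="relu")(rgb_input)
    return layers.Conv2D(3, kernel_size=1, activation="relu")(x)

# usage: inputs = tf.keras.Input(shape=(32, 32, 3)); x = color_transformer(inputs)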

The input feature map U is passed through a localization network, which is itself a CNN, to find the transformation parameters θ through regression. The sampling grid Tθ(G) is generated by transforming the regular spatial grid G over the output V. This sampling grid is applied to the feature map using image resampling to produce the output feature map V. The input to the localization network is a 32 × 32 × 3 image. The architecture of the localization network used in the paper is shown in Table 1 below.

Table 1. Localization network architecture

Layer name                       Filter size   Number of filters or neurons
Convolution                      3 × 3         32
Max-pooling with stride 2 × 2    –             –
Convolution                      3 × 3         64
Max-pooling with stride 2 × 2    –             –
Convolution                      3 × 3         64
Max-pooling with stride 2 × 2    –             –
Dense                            –             128
Dense                            –             64
Dense                            –             6
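A minimal Keras sketch of the localization head in Table 1 follows. Padding, the flattening step and the linear output layer are assumptions made to turn the table into runnable code; it is not the authors' implementation.

import tensorflow as tf
from tensorflow.keras import layers

def localization_network():
    # Regresses the six affine parameters of Eq. (1) from a 32x32x3 input (Table 1).
    return tf.keras.Sequential([
        tf.keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(6),  # theta_11 ... theta_23, no activation
    ])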

The localization network generates the six parameters of the affine transformation matrix shown in Eq. 1 as output.

$A_\theta = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}$   (1)

The grid generator then generates a grid of coordinates in the input image corresponding to each pixel of the output image using Eq. 2 below:

$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = T_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$   (2)

Here $(x_i^s, y_i^s)$ are the coordinates of the source image and $(x_i^t, y_i^t)$ are the coordinates of the affine transformed image. $A_\theta$ is the transformation matrix generated in Eq. 1. Bilinear interpolation is used to generate the final pixel value from the grid values.

3.3 Convolutional Neural Network

A CNN is used for feature extraction and classification in the system. It consists of a series of convolutional and pooling layers to learn hierarchical features from the image. The ReLU (Rectified Linear Unit) activation function is used throughout the architecture as it is easy to compute and makes learning faster by avoiding the vanishing gradient problem which is prominent with sigmoid or tanh activation functions. The spatially transformed feature map is given as input to the CNN. It has the same size as the input image. The layer-wise CNN architecture used in the paper is shown in Table 2.


Table 2. CNN layer-wise architecture

Layer name                       Filter size   Number of filters or neurons
Convolution                      3 × 3         16
Convolution                      3 × 3         32
Max-pooling with stride 2 × 2    –             –
Convolution                      3 × 3         64
Convolution                      3 × 3         96
Max-pooling with stride 2 × 2    –             –
Convolution                      3 × 3         128
Convolution                      3 × 3         64
Max-pooling with stride 2 × 2    –             –
Dense                            –             128
Dense                            –             43
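The Table 2 stack translates to the following Keras sketch. Batch normalization after each convolution follows the description in the text below; 'same' padding and the softmax output are assumptions, and the final Dense size is 43 for GTSRB (62 for BTSD).

import tensorflow as tf
from tensorflow.keras import layers

def classification_cnn(num_classes=43, input_shape=(32, 32, 3)):
    def conv_bn(filters):
        return [layers.Conv2D(filters, 3, padding="same", activation="relu"),
                layers.BatchNormalization()]
    feature_layers = (conv_bn(16) + conv_bn(32) + [layers.MaxPooling2D(2, 2)]
                      + conv_bn(64) + conv_bn(96) + [layers.MaxPooling2D(2, 2)]
                      + conv_bn(128) + conv_bn(64) + [layers.MaxPooling2D(2, 2)])
    return tf.keras.Sequential(
        [tf.keras.Input(shape=input_shape)] + feature_layers +
        [layers.Flatten(),
         layers.Dense(128, activation="relu"),
         layers.Dense(num_classes, activation="softmax")])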

Batch Normalization is used after every convolutional layer output. It helps in faster training and increases accuracy by normalizing the output of each layer [17]. The input image size is 32 × 32 × 3 for the German Traffic Sign dataset, and there are 43 output neurons as it has 43 classes. The input image size is 64 × 64 × 3 for the Belgium Traffic Sign dataset, and there are 62 output neurons as it has 62 classes. The training setup and results are explained in the next section.

3.4 GPU Embedded Development Board

A GPU is much more efficient at computing parallel operations like convolutions than a simple CPU. When the proposed system needs to be used in real-life scenarios, it has to be deployed on an embedded board which contains a GPU for faster inference. There are several GPU embedded boards available in the market, like the NVIDIA Jetson TX1, TX2 and Nano, and Google Coral. Jetson TX1 and TX2 are very costly for the given application and Google Coral only supports TensorFlow Lite at this point, so the Jetson Nano development board is used to deploy the system on hardware. Jetson Nano also consumes less power (less than 5 W) compared to the other boards [16]. The NVIDIA Jetson Nano board, a small powerful computer that allows faster computation for traffic sign classification, is used for deployment in this paper. It has a 128-core Maxwell GPU along with a quad-core ARM A57 CPU running at 1.43 GHz. It has 4 GB of RAM and runs from a MicroSD card. It has four USB 3.0 ports for connecting external peripherals and an HDMI connector for connecting displays. It can also be interfaced with a camera via a CSI connector or a USB port. It delivers a performance of 472 giga floating point operations per second (GFLOPS) [16]. The CNN model for traffic sign classification built using TensorFlow and Keras is deployed on the Jetson Nano for inference. The board runs JetPack installed over the Ubuntu operating system, which uses TensorRT for faster inference.


4 Implementation and Results

The proposed system is implemented in Python and OpenCV using the Anaconda Python distribution. The TensorFlow and Keras libraries are used to implement the CNN architecture. The system is tested on both CPU and GPU hardware platforms. A GeForce 940 GPU with 4 GB of dedicated RAM is used for training. The CPU is an i5 processor with 8 GB of RAM and a 2.2 GHz clock speed. The system is also deployed on the Jetson Nano development board. The datasets used are BTSD (Belgium traffic sign dataset) [15] and GTSRB (German traffic sign recognition benchmark) [14]. They both contain large amounts of image data collected in real-life scenarios and are ideal for training a CNN. The training setup for GTSRB is shown in Table 3; a minimal training sketch follows the table.

Table 3. Training setup for GTSRB

Parameter                 Value
Input image size          32 × 32 × 3
Number of classes         43
Batch size                256
Number of epochs          50
Learning rate             0.0005
No. of training images    34799
No. of validation images  4410
No. of testing images     12630
Training accuracy         99.94%
Validation accuracy       99.16%
Test accuracy             98.40%
Classification time       3.9 ms/image
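The Table 3 configuration corresponds to a training call like the one below. The variables model, x_train, y_train, x_val and y_val are assumed to hold the network of Sect. 3 and the GTSRB images with integer labels; the loss choice is an assumption.

import tensorflow as tf

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(x_train, y_train,
                    batch_size=256, epochs=50,
                    validation_data=(x_val, y_val))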

The increase in training and validation accuracy after every epoch is shown in Fig. 3 below. Adaptive moment estimation (Adam) [18] is preferred as the optimization function during training because of its faster convergence and lower fluctuations compared to the RMSProp and Nadam optimizers, as shown in Fig. 3. The equations for updating the weights using the Adam optimizer during training are given below. The exponentially decaying average of past gradients is calculated by:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$   (3)

The exponentially decaying average of past squared gradients is calculated by:

$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$   (4)

Bias correction is applied as $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$ and $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$, and then the parameters are updated using Adam's update rule:

$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$   (5)

$\beta_1 = 0.9$ and $\beta_2 = 0.999$ are used in this paper, as proposed by the authors.
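A minimal NumPy sketch of one Adam step implementing Eqs. (3)-(5); eps is the usual small constant, which the text leaves implicit.

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.0005, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * grad            # Eq. (3)
    v = beta2 * v + (1.0 - beta2) * grad ** 2       # Eq. (4)
    m_hat = m / (1.0 - beta1 ** t)                  # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # Eq. (5)
    return theta, m, v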

Fig. 3. Training and validation accuracy vs. number of epochs for different optimizers (panels, left to right: Adam, RMSprop and Nadam optimizers)

The training setup for the Belgium Traffic Sign dataset is shown in Table 4.

Table 4. Training setup for BTSD

Parameter                 Value
Input image size          64 × 64 × 3
Number of classes         62
Batch size                256
Number of epochs          50
Learning rate             0.0005
No. of training images    4575
No. of validation images  2520
Training accuracy         98.26%
Validation accuracy       96.31%
Classification time       3.9 ms/image


The increase in training and validation accuracy after every epoch using Adam optimizer is shown in Fig. 4 below.

Fig. 4. Training and validation accuracy vs Number of epochs

Initially, it was found that the training accuracy was very high compared to the validation accuracy, as shown in Fig. 4, so the model was overfitting the training data. To overcome that, a dropout rate of 0.3 was used in the dense layers. Dropout randomly removes neurons with a probability of 30% during training, so a different architecture is effectively trained every time. It removes the over-dependence on any single neuron and thereby avoids overfitting. A few correctly identified traffic signs from the test set of the German dataset using the proposed method are shown in Fig. 5.

Fig. 5. Correctly classified traffic signs

A few correctly identified traffic signs from random images downloaded from the Internet using the proposed method are shown in Fig. 6.


Fig. 6. Correctly classified images from Internet

It is important to look at images on which the algorithm fails to identify traffic signs.

Fig. 7. Incorrectly classified images

As can be seen from Fig. 7, these traffic signs are very difficult to classify even for a human, as they are hardly visible due to illumination and blurriness. The performance of the proposed algorithm on the German Traffic Sign dataset is compared with other algorithms in Table 5.

Table 5. Performance of the proposed system

Algorithm used                                    Classification accuracy (%)
Human (Average) [14]                              98.84
CNN with Spatial Transformer (Proposed Method)    98.40
Multi-scale CNN [10]                              98.31
CNN without STN                                   97.04
Random Forest [19]                                96.14
LDA on HOG2 [20]                                  95.68
LDA on HOG1 [20]                                  93.18
LDA on HOG3 [20]                                  92.34

As can be seen, the performance of the proposed algorithm is very close to human performance. Tweaking a few parameters and training for a longer time may help it reach human-level performance, as the training and validation accuracies already surpass the human benchmark.


The system is also deployed on the Jetson Nano development board. It takes around 11 ms to classify a traffic sign on the Jetson Nano, which is faster than the 40 ms taken by the CPU-only computation unit. This system can therefore easily be deployed in vehicles for real-time applications.

5 Conclusion

Traffic sign classification is very essential for building driver assistance or autonomous driving systems. In this paper, a novel architecture based on a Convolutional Neural Network (CNN), a Color Transformer Network and a Spatial Transformer Network (STN) is proposed. It removes the need for preprocessing to make the CNN invariant to color and affine transformation. The architecture is end-to-end, can be trained in one go and learns the parameters for color transformation and spatial transformation. The performance of the proposed architecture is compared with existing architectures, and it can be seen that it can reach human-level performance. The performance of the algorithm is compared on two well-known traffic sign classification datasets. The performance of the model is also verified on the Jetson Nano development board.

References

1. Shadeed, W.G., Abu-Al-Nadi, D.I., Mismar, M.J.: Road traffic sign detection in color images. In: Proceedings of the 2003 10th IEEE International Conference on Electronics, Circuits and Systems (ICECS). IEEE (2003)
2. Li, H., Sun, F., Liu, L., Wang, L.: A novel traffic sign detection method via color segmentation and robust shape matching. J. Neurocomput. 169, 77-88 (2015). ISSN 0925-2312
3. De La Escalera, A., Moreno, L.E., Salichs, M.A., Armingol, J.M.: Road traffic sign detection and classification. IEEE Trans. Ind. Electron. 44(6), 848-859 (1997)
4. Barnes, N., Loy, G., Shaw, D.: Regular polygon detection. In: Proceedings of the Tenth IEEE International Conference on Computer Vision, pp. 778-785 (2005)
5. Berkaya, S.K., Gunduz, H., Ozsen, O., Akinlar, C., Gunal, S.: On circular traffic sign detection and recognition. J. Expert Syst. Appl. 48, 67-75 (2016). ISSN 0957-4174
6. Keller, C.G., Sprunk, C., Bahlmann, C., Giebel, J., Baratoff, G.: Real-time recognition of U.S. speed signs. In: Proceedings of the Intelligent Vehicles Symposium, pp. 518-523 (2008)
7. Maldonado-Bascón, S., Lafuente-Arroyo, S., Gil-Jimenez, P., Gómez-Moreno, H., López-Ferreras, F.: Road-sign detection and recognition based on support vector machines. IEEE Trans. Intell. Transp. Syst. 8(2), 264-278 (2007)
8. Cireşan, D., Meier, U., Masci, J., Schmidhuber, J.: Multi-column deep neural network for traffic sign classification. Neural Netw. 32, 333-338 (2012)
9. Ruta, A., Li, Y., Liu, X.: Real-time traffic sign recognition from video by class-specific discriminative features. Pattern Recognit. 43(1), 416-430 (2010)
10. Sermanet, P., LeCun, Y.: Traffic sign recognition with multi-scale convolutional networks. In: The 2011 International Joint Conference on Neural Networks (IJCNN), San Jose, California, USA, pp. 2809-2813. IEEE (2011)
11. Aghdam, H.H., Heravi, E.J., Puig, D.: A practical approach for detection and classification of traffic signs using Convolutional Neural Networks. J. Robot. Auton. Syst. 84, 97-112 (2016). ISSN 0921-8890


12. Mishkin, D., Sergievskiy, N., Matas, J.: Systematic evaluation of CNN advances on the ImageNet. arXiv preprint arXiv:1606.02228 (2016)
13. Jaderberg, M., Simonyan, K., Zisserman, A.: Spatial transformer networks. In: Advances in Neural Information Processing Systems, pp. 2017-2025 (2015)
14. German Traffic Sign Recognition Dataset. http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset
15. Belgium Traffic Sign Dataset. https://btsd.ethz.ch/shareddata/
16. Jetson Nano Development Board Help Document. https://developer.nvidia.com/embedded/learn/get-started-jetson-nano-devkit
17. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
18. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations, pp. 1-13 (2015)
19. Zaklouta, F., Stanciulescu, B., Hamdoun, O.: Traffic sign classification using K-d trees and Random Forests. In: International Joint Conference on Neural Networks (IJCNN), pp. 2151-2155 (2011)
20. Stallkamp, J., Schlipsing, M., Salmen, J., Igel, C.: Man vs. computer: benchmarking machine learning algorithms for traffic sign recognition. Neural Netw. 32, 323-332 (2012)

Unsupervised Single-View Depth Estimation for Real Time Inference

Mohammed Arshad Siddiqui(B), Arpit Jain, Neha Gour, and Pritee Khanna

PDPM Indian Institute of Information Technology, Design and Manufacturing, Jabalpur, India
{mohdsiddiqui,arpit.jain,g.neha,pkhanna}@iiitdmj.ac.in

Abstract. Several approaches using unsupervised methods have been proposed recently to perform the task of depth prediction with higher accuracy. However, none of these approaches is flexible enough to be deployed in a real-time environment with limited computational capabilities. Inference latency is a major factor that limits the application of such methods to real-world scenarios where high-end GPUs cannot be deployed. Six models based on three approaches are proposed in this work to reduce the inference latency of depth prediction solutions without losing accuracy. The proposed solutions can be deployed in real-world applications with limited computational power and memory. The new models are also compared with models recently proposed in the literature to establish a state-of-the-art depth prediction model that can be used in real time.

Keywords: Monocular depth prediction · CNN · Unsupervised learning · Inference latency

1 Introduction



Depth estimation in images is one of the intriguing, complex, and computation-heavy problems in the domain of computer vision. Applications of depth estimation exist in the fields of 3D reconstruction, self-driving cars, robotic vision, and recently in mobile photography [15]. Successful approaches in this field have been dependent on the perception of structure from motion (consecutive image frames), binocular vision based methodology, stereo vision methodology, and environment cues such as lighting, colour, and geometry for learning depth. Existing approaches focus on the accuracy of depth prediction rather than the ability to process each frame in real time and hence result in large inference latency. For an image $I_{M \times N}$, the inference latency of a method is the time taken by this method to generate the corresponding output depth map $O_{M \times N}$. Depth estimation is broadly classified into stereo and monocular methods. Monocular (single-view) depth estimation is generally preferred in real-world scenarios as it eliminates the need of adjusting a dual camera setup for reducing calibration errors and synchronization problems. Single image depth prediction


is a complex and seemingly unrealistic solution because it is generally a poorly posed and geometrically very ambiguous problem which requires high-computation architectures. However, with the evolution of deep learning methods, monocular depth estimation has shown some amazing results [2, 11]. These monocular-vision based methods cannot actually be compared to the stereo-vision methods on the common grounds of metrics, as current deep learning based approaches completely depend upon learning high-level semantic information to relate it to the absolute depth values, which is extremely difficult to learn through generic network functions. Expecting real-time inference from such methods still seems unrealistic since they are very expensive in practical scenarios and require high-end computing systems for performing such large scale computations. An alternative approach is proposed in [6] to pose depth estimation as an image reconstruction problem during the training stage. This fully convolutional method does not require any ground truth depth data; with depth generated as an intermediate, the pixel-level correspondence between stereo pair images is predicted by learning a fully differentiable function. It is essentially a monocular depth estimation approach which employs a binocular color-image pair rather than ground truth depth data during training time. Although it is a state-of-the-art solution to the monocular depth prediction problem and has reduced the inference latency compared to previous methods, it still requires high computational power for depth maps to be generated in real time. This work proposes three novel approaches based on the base model given in [6] to perform depth prediction in real time. The proposed models reduce inference latency on realistic systems with low computational power. A depth estimation system is real-time in function if it produces an output depth map for an input frame (Fig. 1) at an acceptable rate. The methods proposed in this work perform depth prediction with accuracy similar to the state-of-the-art models but with much lower inference latency, as required for real-world applications. Existing methods of depth estimation are discussed in Sect. 2. Proposed methods to reduce inference latency are discussed in Sect. 3. Experimental analysis is performed in Sect. 4 and the work is concluded in Sect. 5.

Fig. 1. An input image (top) and its output depth map (bottom).

2 Related Works

A large number and classes of methodologies are proposed to provide competitive solutions to the problem of depth prediction, but none of these methods scale to the real-time environment. Depth estimation proposals aimed to achieve depth information through usage of paired images [18], considering a statically installed camera, fixed scene, and varying exposure/light or measuring other monocular cues [1], several overlapping/coincident images captured from multiple views or directions [3], temporal sequences [17], and even consecutive video frames [12,20,22]. Generally, these approaches are valid when more than one input images are available at the depth estimation scene. Solutions to the depth estimation problem can broadly be classified in three sections. Stereo-Based Depth Estimation: Stereo-based learning is trained on paired input set of image sources to predict depth at test time. A large number of proposed solutions in this category use a data-term that calculates a similarity score by comparing a single pixel in the first image against all the pixels in the second image. Generally, the underlying situation of disparity estimation can then be represented simply as a single-dimensional search problem for each and every pixel after rectification of the stereo pair. This is a compute-intensive operation in such methods. Some latest approaches indicate that rather than utilizing hand-defined similarity scores/measures, training a function using the application of supervised learning to predict the correspondence score results into far better observations [9]. Such approaches are naturally faster than the previous methods as these are dynamic in function. A fully convolutional deep network introduced by [13] directly computes the correspondence field between two images instead of learning the matching function. Regression training loss for the prediction of disparity is minimized to reduce the run time as it eliminated the use of brute-force approach through the entire image pixels. Major drawback of stereo-based approaches is the dependence of the methods on the availability of huge amounts of accurate disparity data in the form of ground truth along with the stereo image pairs during training stage. Obtaining this form of data is a challenge in itself when real world is considered. Therefore, many approaches typically rely on synthetic data during training time which is closer to real data, but there is still a requirement of manual new-content creation. In addition to this, these methods require dual camera setup for datacollection and camera synchronization is another problem altogether. Supervised Monocular Depth Estimation: ‘supervised monocular’ or ‘single-view depth estimation’ is the process of depth estimation when only one image and its depth map (i.e., ground truth) is given as input to the system at test time. Most of the earlier methods in this class were patch based local methods. Hand-crafted features used here lack global context which makes them oblivious to the existence of thin or fine grained structures in the input image. Per-pixel depth prediction is improved by inculcating semantics in the model given in [10]. Another method requires the entire training set at test time [8].


This imposes high memory requirements on the entire system, thereby moving it away from realistic situations. A convolutional neural network based approach suggested to eliminate the need of hand-tuning the unary and pairwise terms as these could be learned via network itself [11]. This helped in reducing the training cost. An approach to learn representation directly from the raw pixel values utilized a two scale deep network trained on images along with their depth values to produce dense pixel depth predictions [2]. Like stereo based methods, the approaches in this category heavily depend on the availability of high quality ground truth data. Capturing ground truth data employs expensive sensors with accurate calibration and timely synchronization. Moving as well as partially or fully transparent objects are comparatively hard to detect and produce inconsistencies when these sensors are utilized. Unsupervised Monocular Depth Estimation: Some deep-learning based approaches are proposed for depth estimation which do not require ground truth data during training stage. These methods use multiple views (generally two) of a single scene for training and single image for testing. These methods create a pair image corresponding to the input image and then compare them to predict depth. Deep3D network, in the binocular pair context, aims to produce a corresponding right view from the left view as the image input source [19]. Distribution of all possible disparity values for each and every pixel is generated using an image reconstruction loss. The major drawback of this approach is its high memory footprint for image synthesis and large number of candidate disparity values which makes it unsuitable for real-world deployment and limits the scalability of this system for higher resolution inputs. Network introduced by [4] is trained on image reconstruction loss, but its optimization was hard due to the non-differentiability. As an improvement to this, depth estimation is introduced through a new training loss that tries to force consistency between the produced disparities relative to both right and left [6]. In other recent approaches like [23,24], training loss is fully differentiable because bilinear sampling was used for the synthesis of images. This produces superior results when compared to [19] and also surpasses the performance abilities of the supervised approaches in terms of both accuracy and inference latency. In this approach, depth estimation is represented as an image reconstruction problem similar to [13]. Apart from this, other notable works like AsiANet [21] and AdaDepth [16] have also shown satisfactory results in this domain. In most of the practical scenarios, the process of depth estimation is supposed to be real-time in function. The inference latency of most of the algorithms does not cater to the real world applications where consecutive stream of image frames are input to the system. In such situations, stereo-based methods still function but monocular based approaches perform poorly. Also, the model size for all these approaches is large which hinders the performance abilities of these algorithms in small embedded systems. The work presented in this paper aims at reduction of model complexities and reduce the inference latency of the depth prediction without losing accuracy.

3 Proposed Methods for Depth Estimation

Three different approaches are proposed to reduce the time required for inference. These approaches are individually applied on the base model shown in Fig. 2 to showcase their efficiency. The model stated in [6] essentially feeds the left stereo image ($I^l$) to the network and produces left ($d^l$) as well as right ($d^r$) image disparities. These disparities, along with the original stereo pair ($I^l$, $I^r$), are again processed by the sampler to produce the image pair ($\hat{I}^r$, $\hat{I}^l$), which is compared against the target to compute the losses. Model compression is performed in the proposed work to reduce inference latency. This enables the weight matrix to be completely loaded into the RAM of low computational devices instead of having to fetch it from the main memory repeatedly. Other than this, pruning also changes a lot of weights to 0, which reduces the number of multiplication operations, so less prediction time is required. This smaller model results in a reduction of the overall inference time. The ReLU activation function in the encoder as proposed by [6] is replaced by leaky ReLU to counter the problem of dead neurons. A neuron is called 'dead' when its trained weights become negative: the output of ReLU for negative values is 0, and once that happens its output is always 0 (making no contribution to training). This problem can be avoided with leaky ReLU since it does not have a zero slope. Leaky ReLU also helps to speed up training. Dropout regularization is used to make the model robust and generalized. In dropout regularization, some randomly selected neurons are dropped during training and their weights are not updated during back propagation. This forces other neurons to step in and compensate for the missing neurons. This helps make the model more generalized, avoids over-fitting, and performs better on unseen test data. A minimal sketch of such an encoder block is given below.
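A minimal Keras sketch of an encoder convolution block with the two changes described above. The leaky-ReLU slope and the dropout rate used here are assumptions for illustration; the paper does not report the exact values.

from tensorflow.keras import layers

def encoder_block(filters):
    # Convolution followed by leaky ReLU (no zero-gradient region for negative
    # inputs) and dropout regularization, replacing the plain ReLU of the base model.
    return [
        layers.Conv2D(filters, 3, padding="same"),
        layers.LeakyReLU(alpha=0.1),
        layers.Dropout(0.2),
    ]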

Fig. 2. Baseline model used in this work (Proposed in [6]).

3.1 Method 1: Models Based on Parameter Reduction

Some parameters are removed for model compression and inference speed-up. The model is tested by removing 10%, 25%, and 35% of the parameters from the last two layers of the network. After this, binning is introduced to further reduce the size of the model and increase speed. Quantile binning is used in the proposed approach. If there are N input layers and M output layers, instead of N × M weights, these N × M values are classified into B bins (B < N × M), with each value assigned to the bin value closest to it. The value of each bin is selected such that each bin has an equal number of values assigned to it. This discretization allows the model size to be reduced further by reducing the number of weights required from N × M to B. The number of bins B is chosen as max(M, N). These methods (10%, 25%, and 35% parameter reduction) are referred to in Table 1 as Models 1.1, 1.2, and 1.3. A sketch of the binning step is given below.
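A NumPy sketch of quantile binning for a single weight matrix. Only the B bin values and a per-weight bin index need to be stored; using the bin mean as the stored value is an assumption, since the text does not state how the bin value is chosen.

import numpy as np

def quantile_bin(W, n_bins=None):
    n, m = W.shape
    B = n_bins if n_bins is not None else max(n, m)
    flat = W.ravel()
    # equal-count (quantile) bin edges over all N x M weights
    edges = np.quantile(flat, np.linspace(0.0, 1.0, B + 1))
    idx = np.clip(np.searchsorted(edges, flat, side="right") - 1, 0, B - 1)
    reps = np.array([flat[idx == b].mean() if (idx == b).any() else 0.0
                     for b in range(B)])
    # reconstructed (binned) weights, the B stored values and the index map
    return reps[idx].reshape(W.shape), reps, idx.reshape(W.shape)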

3.2 Method 2: Model Using Smaller Encoder

In [6], VGG-16 and ResNet50 architectures were initially used for encoding. In place of that, the proposed model is trained with SqueezeNet. The SqueezeNet architecture, as proposed by [7], describes three strategies to compress the model, which ultimately lead to a smaller inference latency as fewer weights are used: (1) 3 × 3 filters are replaced with 1 × 1 filters, (2) the number of input channels to the 3 × 3 filters is reduced, and (3) downsampling is performed late in the network. The model size is substantially reduced using these modifications. The compression by SqueezeNet and the resultant speed-up are discussed in the next section. This method is referred to as Model 2.0 in Table 1.

3.3 Method 3: Models Using Weight Matrix Pruning

Given a matrix A of size M × N, the sparsity score S of the matrix A is the ratio of elements with value 0 to the total number of elements in A. The sparsity score of the weight matrix resulting from the trained model is increased to speed up inference. In Fig. 3, the step 'change small (selected) weights to zero' can be done in two different ways, as discussed below, referred to as Models 3.1 and 3.2 in Table 1. The rest of the pruning algorithm remains the same for both methods.

Naive Approach to increase the sparsity score selects the nodes with small weight values and sets them to zero without considering any other factor. For this, a sparsity score S is selected and the smallest S% of the weights in the weight matrix are set to zero. Since this approach sets the smallest S% of values to zero without looking at the implication on the accuracy of the model, it is called the naive approach here. The naive approach does not always result in the ideal solution. The selection of the nodes which are to be pruned is a difficult problem. A greedy approach to find the nodes with the least negative effect on the model would be to remove a parameter p, retrain the model to see the change in accuracy, and then select those parameters which result in the least negative effect


Fig. 3. Steps for pruning to increase sparsity in weight matrix

on accuracy. Since the proposed model is still very large, this is extremely expensive to compute for each parameter. The method suggested by [14], referred to as Oracle pruning, is used as the second approach to increase the sparsity score. It was shown in [14] that the contribution of a parameter p can be calculated using the first-order Taylor expansion of the cost function. This simplifies the computations required to calculate the effect of each parameter p on the model. After ranking the parameters, the lowest-ranked S% of the parameters are removed. The model is retrained after pruning, with the remaining nodes initialized with the previously learned weights. It is also important to note that leaky ReLU is not used when retraining the model. This allows the information lost by pruning to be regained, as the non-pruned nodes compensate for the missing nodes. Further, the resulting sparse matrix is highly biased, with peaks of values forming clusters. This motivates the use of Huffman coding to further reduce the size of the model. Since a large number of run lengths of the value 0 is observed, substantial compression of the model size is achieved.
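A minimal sketch of the naive, magnitude-based variant: zero out the smallest weights until the requested sparsity score is reached (the pruned model is then retrained). The oracle ranking of [14] would replace the magnitude criterion with the Taylor-expansion score.

import numpy as np

def magnitude_prune(W, sparsity=0.5):
    # Set the smallest |w| to zero so that roughly `sparsity` of W becomes 0.
    k = int(sparsity * W.size)
    pruned = W.copy()
    if k == 0:
        return pruned
    threshold = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned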

4 Experiments and Results

Experiments are performed to obtain the inference latency of the proposed models for different image sizes from the KITTI dataset [5] in three different computational environments. The KITTI raw data is used for training and testing; the dataset consists of 175 GB of stereo image pairs and their depth values. The performance is compared with the base model as shown in Table 1. The desired speed-up in inference latency should not come at the cost of accuracy and, therefore, different error and accuracy metrics are used to compare the proposed models with the existing models, as given in Table 2, on images of resolution 256 × 512. This comparison highlights the strengths of the proposed methods. Comparing Inference Latency of the Models: The existing algorithms are all tested on high-end computational devices, which do not represent the real-world environment.


For a comprehensive analysis of the inference latency given in Table 1, processors with different computational powers are used to show the speed-up achieved with different models as compared to the base system. Along with this, the difference in inference latency is also compared for different image resolutions. Considering real-world scenarios, the following system configurations are used for comparison: Nvidia Tesla K80 with 12 GB RAM, referred to as C1; Intel i7 4710HQ (2.5 GHz) with 8 GB RAM, referred to as C2; and Raspberry Pi 3 (1.4 GHz) with 1 GB RAM, referred to as C3. The test on the Raspberry Pi 3 simulates a real-world environment with low-computation embedded systems, as similar systems are generally used in such applications. The image resolutions in the test are 128 × 256 and 256 × 512. Using parameter reduction (Models 1.1, 1.2, and 1.3), the resultant gain in inference latency is smaller than that of the other proposed models, but still significantly better than the base model. Table 1 shows that models based on Methods 2 and 3 outperform the base model by large margins when the computational power is low. It can also be seen that although Model 2.0 is slightly faster than Models 3.1 and 3.2, this speed-up comes at the cost of accuracy (Table 2). Either pruning approach gives similar results since the sparsity score in both is the same, which results in equally sparse matrices. Visual results comparing the proposed models with [6] are shown in Fig. 4. The first row is the input image, the second row is the ground-truth depth map, and the remaining rows are depth maps obtained with the indicated methods.

Table 1. Comparison of the proposed models with the base model on different computational powers and image resolutions from the KITTI dataset.

Model           Resolution   C1        C2       C3
Monodepth [6]   128 × 256    0.042 s   0.59 s   7.82 s
                256 × 512    0.048 s   0.65 s   12.22 s
Model 1.1       128 × 256    0.040 s   0.52 s   6.55 s
                256 × 512    0.046 s   0.58 s   10.10 s
Model 1.2       128 × 256    0.036 s   0.36 s   5.81 s
                256 × 512    0.042 s   0.47 s   9.80 s
Model 1.3       128 × 256    0.032 s   0.31 s   5.04 s
                256 × 512    0.038 s   0.44 s   7.78 s
Model 2.0       128 × 256    0.015 s   0.08 s   1.02 s
                256 × 512    0.018 s   0.10 s   1.04 s
Model 3.1       128 × 256    0.027 s   0.10 s   1.22 s
                256 × 512    0.034 s   0.11 s   1.46 s
Model 3.2       128 × 256    0.027 s   0.10 s   1.22 s
                256 × 512    0.034 s   0.11 s   1.46 s


Fig. 4. The ground truth of input images and solution by [6] compared with different approaches proposed in this work. Since the proposed models operate on top of [6] and preserve accuracy, the outputs are similar.

Comparing Accuracy of the Models: Table 2 compares the proposed models with other state-of-the-art single-view depth estimation models on error and accuracy metrics. Relative absolute error (Abs Rel), relative square error (Sq Rel), root mean square error (RMSE), and root mean square logarithmic error (RMSLE) are used as error metrics (lower is better). For accuracy (higher is better), a threshold comparison with δ (thresholds 1.25, 1.25² and 1.25³) is performed, where δ = max(y_GT / y_pred, y_pred / y_GT). Accuracy remains satisfactory and comparable using parameter reduction. Although the resultant gain in inference latency is not large, it is still better than the existing models. Parameter reduction is compared by removing 10% (Model 1.1), 25% (Model 1.2), and 35% (Model 1.3) of the parameters. The loss in accuracy with 10% size reduction was negligible and the speed-up was small. Removing 25% of the parameters, the speed-up was slightly higher, with the accuracy still comparable at a minor loss. The best speed-up from parameter reduction was given by removing 35% of the parameters, but Table 2 shows that this comes with a significant loss in accuracy. Using SqueezeNet as the encoder in Model 2.0 reduces the model size to a large extent. Even though there is a significant loss in accuracy, the speed-up in inference latency is good since it has far fewer parameters. Pruning to add sparsity to the weight matrix gives the best results when sparsity is imposed using Model 3.2. The speed-up is further increased by Huffman coding of the sparse weight matrix, which reduces the size of the model. The small increase in accuracy compared to the original model suggests the presence of over-fitting.


Table 2. A comparison of the proposed models with the existing models on 256 × 512 images of the KITTI dataset. Abs Rel, Sq Rel, RMSE and RMSLE are error metrics; δ < 1.25, δ < 1.25² and δ < 1.25³ are accuracy metrics.

Method               Abs Rel   Sq Rel   RMSE    RMSLE   δ < 1.25   δ < 1.25²   δ < 1.25³
Eigen et al. [2]     0.203     1.548    6.307   0.282   0.702      0.890       0.890
Liu et al. [11]      0.201     1.584    6.471   0.273   0.680      0.898       0.967
Zhou et al. [22]     0.201     1.391    5.181   0.264   0.696      0.900       0.966
Yang et al. [20]     0.182     1.481    6.501   0.267   0.725      0.906       0.963
Garg et al. [4]      0.169     1.080    5.104   0.273   0.740      0.904       0.962
AdaDepth [16]        0.167     1.257    5.578   0.237   0.771      0.922       0.971
AsiANet [21]         0.145     1.349    5.919   0.230   0.805      0.929       0.967
Godard et al. [6]    0.146     1.341    5.921   0.244   0.809      0.927       0.973
Proposed Model 1.1   0.151     1.344    5.924   0.229   0.801      0.922       0.969
Proposed Model 1.2   0.158     1.349    5.938   0.235   0.794      0.911       0.968
Proposed Model 1.3   0.169     1.407    5.998   0.258   0.755      0.907       0.966
Proposed Model 2.0   0.181     1.473    6.221   0.271   0.729      0.899       0.965
Proposed Model 3.1   0.153     1.357    5.966   0.248   0.785      0.922       0.969
Proposed Model 3.2   0.148     1.344    5.918   0.221   0.803      0.926       0.973

Retraining the model recovers the loss in accuracy to a large extent and also induces regularization, which helps the model generalize. Table 2 shows that increasing sparsity by pruning, as described in Sect. 3.3, leads to some loss of detail in the disparity map with a similar speed-up, since the rest of the process is the same and only the neuron selection for pruning differs between the two proposed methods. For methods where the code and model were provided ([2,4,6,11,22]), we used the respective authors' scripts to obtain the accuracy statistics; for the others, we used the accuracy quoted by the authors in their papers.
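For reference, the error and accuracy metrics used in Table 2 can be computed as in the following sketch. It assumes `gt` and `pred` are NumPy arrays of valid (positive) ground-truth and predicted depths; masking of invalid pixels and any depth capping used in the actual evaluation scripts is omitted.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Abs Rel, Sq Rel, RMSE, RMSLE and the three delta accuracies."""
    delta = np.maximum(gt / pred, pred / gt)
    d1 = (delta < 1.25).mean()
    d2 = (delta < 1.25 ** 2).mean()
    d3 = (delta < 1.25 ** 3).mean()
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmsle = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return abs_rel, sq_rel, rmse, rmsle, d1, d2, d3
```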

5 Conclusion

In an attempt to speed up depth estimation, this work optimizes the approach used by [6] to obtain acceptable inference latency for real-time applications, even on systems with low computational power. Out of the six models obtained with three approaches, Model 3.2 not only provides sufficient speed-up but also preserves the accuracy of the original algorithm. This method reduces the inference latency for an image of resolution 256 × 512 from 48 ms using [6] to 34 ms when tested on a system with high computational power. The proposed method outperforms existing methods when they are run on systems with low computational power like the Raspberry Pi 3, which is an example of a real-world embedded system. The inference latency on such a low-powered system was lowered from 12.22 s, as obtained with [6], to 1.46 s by the proposed Models 3.1 and 3.2, which is roughly an 8.3× speed-up. With such a small inference latency, the proposed method will help the deployment of monocular depth estimation methods in real-world applications.


Pruning to increase sparsity, combined with retraining the model and Huffman coding of the resultant weight matrix, gave a much smaller weight matrix that is easier to load and use on embedded systems where the computational capability and memory are low.

References

1. Abrams, A., Hawley, C., Pless, R.: Heliometric stereo: shape from sun position. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7573, pp. 357–370. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33709-3_26
2. Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in Neural Information Processing Systems, pp. 2366–2374 (2014)
3. Furukawa, Y., Hernández, C., et al.: Multi-view stereo: a tutorial. Found. Trends Comput. Graph. Vis. 9(1–2), 1–148 (2015)
4. Garg, R., Vijay Kumar, B.G., Carneiro, G., Reid, I.: Unsupervised CNN for single view depth estimation: geometry to the rescue. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 740–756. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_45
5. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. (IJRR) 32(11), 1231–1237 (2013)
6. Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270–279 (2017)
7. Iandola, F.N., Moskewicz, M.W., Ashraf, K., Han, S., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016)

... token while the first word of each remaining sentence is initiated by all preceding sentence knowledge along with the visual information. The log probability of each word in the report can be modeled as a composite function of the MLMA mechanism as follows:

log p(w_i^{S_n} | w_1^{S_n}, ..., w_{i−1}^{S_n}, V) = f(α_i^{S_n}, β_i^{S_n})   (4)

where the non-linear function f(·) defines the maximum probability of w_i^{S_n} by computing the attentive output from the context-level visual attention (α_i^{S_n}) and the textual attention (β_i^{S_n}) at the i-th word position in the n-th sentence.

2.2 Context Level Visual Attention (CLVA)

The context-level visual attention (CLVA) is proposed to focus on the spatial part of the image and the medical semantic part of the report for inter-word dependency within a sentence. CLVA synchronizes the fused convolutional and embedding features by attending to context-level semantic information in the visual feature maps to correctly predict the next word in the report. The fused feature at any word sequence t is expressed as:

FF_t = [V : WE_t]   (5)

where V is the flattened form of the convolutional feature maps computed from the block5_pool layer (7 × 7 × 512) of VGG16 [10] and WE_t is the representation of the word embedding at sequence t in a report. Here, V = [v_1, v_2, ..., v_k] with v_i ∈ R^d.


The scalar d represents the flattened feature map size (49) and k represents the number of feature maps (512), composing the dimensionality of the visual feature map as V ∈ R^{d×k}. WE_t = w_t ∗ WE, where ∗ denotes element-wise multiplication, w_t is the one-hot encoded word vector standing at the t-th position in a report, and WE is a trainable embedding matrix. WE ∈ R^{l×k}, w_t ∈ R^{1×l} and hence WE_t ∈ R^{l×k}, where k is the embedding dimension (512) and l is the maximum number of words (149) in a report. Finally, the fused feature can be represented as FF_t ∈ R^{(l+d)×k}. The hidden state of the long-range LSTM is updated by providing the fused features of the current time step and the previous hidden state as follows:

h_t = LSTM(FF_t, h_{t−1})   (6)

Let H = (h_1, h_2, ..., h_T) be the hidden state matrix of size ((l + d) × T) constructed by accumulating T hidden states of the LSTM. The attention mechanism accepts H and provides significant knowledge to predict the word at each time step. The overall attention mechanism consists of three stages: computing the attention vector, framing the attention matrix, and finally calculating the context vector with significant information. The attention vector Att_V1 in CLVA is computed as:

Att_V1 = softmax(tanh((H · W_a1) + W_b1))   (7)

where Att_V1 ∈ R^{(l+d)×1} and · represents the dot product. W_a1 ∈ R^{T×1} is a learnable parameter and W_b1 ∈ R^{(l+d)×1} is a bias weight parameter. The attention vector is expanded and element-wise multiplied with the hidden state matrix to yield the attention matrix Att_M1 ∈ R^{(l+d)×T} as follows:

Att_M1 = (Exp(Att_V1) ∗ H)   (8)

Finally, a sum across the T columns is executed over the attention matrix to provide the context vector α ∈ R^{1×T} as:

α = column_sum(Att_M1) = (α_1, α_2, ..., α_T)   (9)
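The three attention stages of Eqs. (7)–(9) can be written compactly in NumPy as below. Shapes follow the text: H is (l + d) × T, W_a1 is T × 1 and W_b1 is (l + d) × 1; the weights would be learned in practice, so random placeholders and illustrative sizes are used here purely for demonstration.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def clva_context(H, Wa1, Wb1):
    att_v1 = softmax(np.tanh(H @ Wa1 + Wb1), axis=0)  # Eq. (7): (l+d) x 1
    att_m1 = att_v1 * H                               # Eq. (8): expand and multiply, (l+d) x T
    return att_m1.sum(axis=0, keepdims=True)          # Eq. (9): column sum -> 1 x T

l_plus_d, T = 149 + 49, 32                            # illustrative sizes
H = np.random.randn(l_plus_d, T)
alpha = clva_context(H, np.random.randn(T, 1), np.random.randn(l_plus_d, 1))
print(alpha.shape)                                    # (1, T)
```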

2.3 Textual Attention (TA)

The textual attention (TA) is introduced to learn the inter-dependency and contextual coherence of the diverse sentences within a report. A BD-LSTM is used to access the context in the forward as well as the backward direction, enhancing the model capacity [15]. A separate attention over the BD-LSTM word embedding features is utilized to learn the heterogeneous structure of the report. Let G = (g_1, g_2, ..., g_M) be the hidden state matrix of size (l × M) produced by the BD-LSTM. Here, M is the total number of hidden steps (forward + backward). The attention vector Att_V2 ∈ R^{l×1} over the hidden state matrix G is given by:

Att_V2 = softmax(tanh((G · W_a2) + W_b2))   (10)

where W_a2 ∈ R^{M×1} is a trainable weight parameter and W_b2 ∈ R^{l×1} is a bias parameter. The attention matrix Att_M2 ∈ R^{l×M} can be computed as:

Att_M2 = (Exp(Att_V2) ∗ G)   (11)


The context vector for TA with dimensionality β ∈ R^{1×M} is given by:

β = column_sum(Att_M2) = (β_1, β_2, ..., β_M)   (12)

2.4 Word Level Soft-Max Prediction

The outcomes of CLVA (α) and TA (β) are concatenated to compose the joint attention vector γ = [α : β] with γ ∈ R^{1×(T+M)}. The combined attentive output is linked to a fully connected layer, with as many nodes as the vocabulary size and softmax activation, to predict the next word in the sequence, which ultimately generates the report. The probability of the predicted word over the vocabulary, conditioned on the joint attention vector and expressed in terms of the non-linear function f, is represented as:

f(α, β) = softmax(W_fc([α : β])) = softmax(W_fc(γ))   (13)

where W_fc is the weight parameter of the fully connected layer with the same number of nodes as the vocabulary size.
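A sketch of the prediction step in Eq. (13): the two context vectors are concatenated and passed through a fully connected softmax layer over the vocabulary. The bias term of the layer is omitted for brevity, and the weights here are random placeholders rather than trained values.

```python
import numpy as np

def predict_word(alpha, beta, W_fc):
    gamma = np.concatenate([alpha, beta], axis=1)  # 1 x (T + M), joint attention vector
    logits = gamma @ W_fc                          # 1 x vocab_size
    e = np.exp(logits - logits.max())
    return e / e.sum()                             # softmax over the vocabulary

T, M, vocab = 32, 64, 2013
probs = predict_word(np.random.randn(1, T), np.random.randn(1, M), np.random.randn(T + M, vocab))
next_word_id = int(probs.argmax())
```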

3 Experiments and Results

To achieve the objective of generating a medical report for a chest X-ray, the IU chest X-ray dataset [3] is utilized. It consists of 3955 radiological reports in the diverse form of comparison, indication, findings and impression, associated with 7470 multiple views (frontal, lateral, side-lateral) of chest X-rays. The dataset provides information regarding pneumothorax, pleural effusion, cardio-pulmonary disease, heart size, airspace around the lungs and other chest-related diseases. The impression describes the major substantial knowledge, while the findings provide overall conclusive information about the chest X-ray. The combination of impression and findings is adopted to prepare the final medical report. The dataset has an inconsistent and improper distribution, since some images do not have a report, some reports contain erroneous words (e.g. xxxx, x-xxxx, etc.), and the concatenation of impression and findings causes sentences with similar meaning to be repeated within a report. For an appropriate distribution, the images without a report and the erroneous words are omitted, while repeated sentences are replaced by a single sentence, to finalize a dataset of 7429 image-report pairs. To evaluate the performance of the proposed method, 500 samples are randomly selected for validation and 500 samples are kept for testing. Various experiments are performed to compute statistical caption evaluation metrics and matching facts from the generated report for an X-ray image. The training specifications with experimental results are furnished in this section.

3.1 Training Algorithm

All images are encoded by the block5_pool layer of VGG16 [10]. All reports are pre-processed by converting to lower case and removing irrelevant words, numeric representations and punctuation. Reports are modified by including <start> and <end> tokens at the beginning and at the end. The maximum length of a report is limited to 149 words, with a vocabulary size of 2013 unique words.


The special case of the full stop is also considered as a word element in the vocabulary. All the proposed models are trained to minimize the cross-entropy loss with the Nadam (Nesterov Adam) [4] optimizer and an initial learning rate (LR) of 0.001. A reduced-LR criterion with a factor of 0.1, early stopping and dropout mechanisms are adopted to mitigate overfitting. The MLMA model is initialized by providing the visual features and the <start> token as input to predict the first word of the report. A teacher-forcing strategy (the ground-truth word of the previous time step is applied as input to the model for the current time step) with a batch size of 128 is exploited to train the end-to-end MLMA model. In testing, a heuristic beam search technique is adopted to generate the best next word in a sequential manner. The MLMA model generates words in order to construct sentences until the prediction of the special full-stop character, and sentences are generated until the prediction of the <end> token, to prepare a plausible medical report for the X-ray. All the discussed methods are implemented in Python with the Keras library and TensorFlow backend on a GTX 1080Ti GPU.
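The generation procedure described above can be sketched as a simple greedy decoder (beam width 1); the actual system uses beam search, and the `model.predict` interface, the token names and `max_len` are assumptions made here for illustration only.

```python
import numpy as np

def generate_report(model, image_features, word_index, max_len=149):
    """Greedy decoding: feed <start>, append the most probable word until <end>."""
    index_word = {i: w for w, i in word_index.items()}
    sequence = [word_index['<start>']]
    words = []
    for _ in range(max_len):
        probs = model.predict([image_features, np.array([sequence])], verbose=0)[0]
        next_id = int(np.argmax(probs))
        if index_word[next_id] == '<end>':
            break
        words.append(index_word[next_id])
        sequence.append(next_id)
    return ' '.join(words)
```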

3.2 Quantitative and Qualitative Results

Initially, CLVA alone is applied with LSTM and BD-LSTM networks. Then an extra attentive layer is added to emphasize the structural and syntactical pattern of the textual input and improve the overall generated report quality. To evaluate the quality of the generated reports, the BLEU, Meteor, Rouge-L and CIDER metrics are computed with the COCO evaluation API [1] and compared with state-of-the-art methods.

Table 1. Comparison of proposed results with current approaches.

Methods                  Bleu 4   Bleu 3   Bleu 2   Bleu 1   Meteor   Rouge-L   CIDER
Tie-Net [12]             0.073    0.103    0.159    0.286    0.107    0.226     –
HRGR [9]                 0.151    0.208    0.298    0.438    –        0.322     0.343
MRMA [14]                0.195    0.270    0.358    0.464    0.274    0.366     –
MHMC [6]                 0.247    0.306    0.386    0.517    0.217    0.447     0.327
CLVA(L)                  0.164    0.206    0.276    0.420    0.233    0.351     0.555
CLVA(Bd-L)               0.206    0.245    0.309    0.443    0.248    0.384     0.685
CLVA(L)+TA(L)            0.145    0.191    0.263    0.410    0.227    0.346     0.439
CLVA(Bd-L)+TA(Bd-L)      0.160    0.204    0.277    0.424    0.233    0.356     0.438
CLVA(Bd-L)+TA(L)         0.227    0.269    0.334    0.464    0.262    0.403     0.952
CLVA(L)+TA(Bd-L)         0.278    0.317    0.380    0.500    0.281    0.440     1.067

(Tie-Net: Text Image embedding Network, HRGR: Hybrid Retrieval-Generation Reinforcement, MRMA: Multimodal Recurrent Model with Attention, MHMC: Multi-task Hierarchical LSTM Model with Co-attention)

Table 1 compares the existing (top four) and proposed (remaining) methods using caption evaluation metrics. For the proposed MLMA approach, combinations of LSTM (L) and BD-LSTM (Bd-L) networks and their metric outcomes are presented, out of which the combination of an LSTM with 1024 hidden units in CLVA and a BD-LSTM with 512 hidden units in TA shows the best result.


Higher values of the CIDER, Meteor, Bleu 3 and Bleu 4 metrics for the proposed method indicate high contextual coherence within words and sentences. Among the existing approaches, HRGR [9] and MRMA [14] require both frontal and lateral X-ray views to generate a radiological report. Figure 2 presents qualitative results for report generation for different views, described by the following four cases; the semantically matching facts from the original report are highlighted in the generated report. Case 1 shows that the model is able to generate a matching report but fails to predict some of the prominent facts about the disease. Case 2 indicates that the model describes more comprehensive facts than the original for the lateral view. In Case 3, the model generates an overall plausible report but with a different structural arrangement of sentences. Case 4 shows the ability of the model to generate a medical report as correct as the original. All cases with generated outcomes are from the test set.

Fig. 2. Generated reports for frontal and lateral views.

4 Conclusion

The proposed MLMA method is designed for the direct mapping of a chest X-ray to its radiological report and shows favorable results for unseen X-ray images. It is an extensive effort in computer-aided diagnosis to assist medical professionals with an automatic description of chest abnormalities at their finger-tips. The MLMA method is trained to learn the syntactical and structural pattern of a report by attending to context-level visual and textual patterns for the generation of a plausible medical report. The Context Level Visual Attention is introduced to focus on pertinent visual and textual patterns simultaneously, whereas the Textual Attention is proposed to capture the valuable semantic knowledge in a sentence. The higher values of the Bleu 4 (0.278), Bleu 3 (0.317), Meteor (0.281) and CIDER (1.067) metrics represent the high-level coherence within words and sentences for medical report generation. Quantitative and qualitative results indicate that the proposed MLMA approach outperforms state-of-the-art methods in generating highly coherent, multifarious medical reports for any chest X-ray view.


Acknowledgment. This research work was supported by the Center for Intelligent Signal and Imaging Research (CISIR), Universiti Teknologi PETRONAS (UTP), Malaysia and Shri Guru Gobind Singhji Institute of Engineering and Technology (SGGSIE&T), Nanded, India with the international grant (015ME0-018).

References

1. Chen, X., et al.: Microsoft COCO captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015)
2. Coche, E.E., Ghaye, B., de Mey, J., Duyck, P.: Comparative Interpretation of CT and Standard Radiography of the Chest. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-540-79942-9
3. Demner-Fushman, D., et al.: Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inform. Assoc. 23(2), 304–310 (2015)
4. Dozat, T.: Incorporating Nesterov momentum into Adam (2016)
5. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
6. Jing, B., Xie, P., Xing, E.: On the automatic generation of medical imaging reports. arXiv preprint arXiv:1711.08195 (2017)
7. Johnson, J., Karpathy, A., Fei-Fei, L.: DenseCap: fully convolutional localization networks for dense captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4565–4574 (2016)
8. Krause, J., Johnson, J., Krishna, R., Fei-Fei, L.: A hierarchical approach for generating descriptive image paragraphs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 317–325 (2017)
9. Li, Y., Liang, X., Hu, Z., Xing, E.P.: Hybrid retrieval-generation reinforced agent for medical image report generation. In: Advances in Neural Information Processing Systems, pp. 1537–1547 (2018)
10. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
11. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164 (2015)
12. Wang, X., Peng, Y., Lu, L., Lu, Z., Summers, R.M.: TieNet: text-image embedding network for common thorax disease classification and reporting in chest X-rays. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9049–9058 (2018)
13. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)
14. Xue, Y., et al.: Multimodal recurrent model with attention for automated radiology report generation. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11070, pp. 457–466. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00928-1_52
15. Zhou, P., et al.: Attention-based bidirectional long short-term memory networks for relation classification. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), vol. 2, pp. 207–212 (2016)

Image Processing

Medical Image Denoising Using Spline Based Fuzzy Wavelet Shrink Technique

Pranaba K. Mishro, Sanjay Agrawal(B), and Rutuparna Panda

VSS University of Technology, Burla 768018, Odisha, India
[email protected], [email protected], [email protected]

Abstract. Denoising is a fundamental requirement in the field of medical image processing. It is the process of reducing the additive noise from a noisy medical image, while preserving the information from the clinical data and the physiological signals. Therefore, it is essential to recover an estimated image that conveys almost the same information present in the original image. Wavelet shrink is a standard method of denoising medical images due to its spectral decomposition and energy compression mechanism. The orthogonal decomposition and energy-based compression of the image preserve its spectral components. However, break points are observed due to the presence of purely harmonic patches in the image. To solve this problem, we suggest a spline based fuzzy wavelet shrink (SFWShrink) model for denoising medical images. The fuzzy wavelet shrink model assigns the wavelet coefficients a fuzzy rule based membership value. The fuzzy membership values signify the relative importance of a data point with respect to its neighboring data points. This eliminates the noise points in the data space. Next, the spline estimation removes the break points occurring due to the wavelet transform. The benefits of the proposed method are: (i) it is a simple and effective method of denoising; (ii) it preserves even the subtle image details. The proposed model is evaluated with different modalities of synthetic medical images. It is compared with the standard wavelet shrink models, such as VisuShrink, BayesShrink and NeighShrink. The evaluation parameters show the superiority of the suggested technique in comparison to the other approaches. Keywords: Image denoising · Wavelet transform · Spline based estimation · Fuzzy based modeling

1 Introduction

Noise in clinical images is a limiting factor in the diagnosis of a disease and its treatment planning. Although imaging techniques are highly advanced, noise is still added during processing and transmission. Medical images comprise clinical and physiological information, and preserving this subtle information is the main challenge in denoising them [1]. The conventional approaches for medical image denoising use spatial and transform domain information. The spatial domain approaches use low-pass filters for denoising the image.


The use of a low-pass filter limits the performance due to its dependency on the critical frequencies and the filter behavior, and it may also produce artificial frequencies in the resulting image. On the contrary, the transform domain approaches use spectral or wavelet decomposition for denoising. The energy compression mechanism in the wavelet transform preserves the spatial coefficients, which enriches the image details. The wavelet shrink approach to image denoising is widely accepted due to its high convergence rate, edge-preserving nature and simplicity [2]. A number of approaches for medical image denoising based on the wavelet transform are found in the literature [3–10]. Donoho [3] proposed an adaptive wavelet transform method for removing Gaussian noise, where denoising is obtained with an adaptive estimation of the wavelet coefficients. Yoon and Vaidyanathan [4] proposed a custom threshold function based on the wavelet transform for denoising MRI images; the custom threshold function is used for characterizing the given signal with a small estimation error. Luisier and Blu [5] suggested an orthonormal wavelet transform for denoising medical images, obtained using a multi-scale wavelet thresholding mechanism. It is observed that the elementary non-linear approaches used in the above methods are characterized by an unknown weight factor, and the noise estimation using these approaches may be inaccurate due to the improper selection of weights. Olawuyi et al. [6] presented a comparative analysis of different wavelet shrinkages for denoising cardiac MR images, concluding that the Haar wavelet performs well and the dmey wavelet performs poorly on cardiac MR images. Mukhopadhyay and Mandal [7] proposed an extension of the Bayesian thresholding approach to medical image denoising, in which the stochastic approach of a genetic algorithm is used for optimizing the wavelet coefficients. The statistical dependence of the model makes it computationally intensive; further, the stochastic approaches are suitable for the estimation of the marginal variance of noise. Naimi et al. [8] proposed a modified wavelet transform using dual-tree complex thresholding with Wiener filtering for denoising medical images; the wavelet coefficients of the dual-tree complex wavelet transform are derived using soft and hard thresholding methods. Sujitha et al. [9] proposed a wavelet-based transform for denoising MR images, in which a global thresholding mechanism is used for discretizing the wavelets and decomposing them for denoising. However, it may not perform well for other medical imaging modalities. Das et al. [10] proposed a hybrid approach using the Haar wavelet transform for denoising brain positron emission tomography (PET) images, but their method is limited to PET images only. From the above discussion, it is observed that the decomposition of the wavelet coefficients depends on the relative weight of the data points and their neighborhood for denoising. Further, the break points observed due to the presence of harmonic patches need to be eliminated. So, assigning suitable weights to the data points and eliminating the break points are the main challenges in denoising medical images. This has motivated us to propose a model that assigns fuzzy membership values to the wavelet coefficients and smooths them using spline estimation. The proposed SFWShrink model is evaluated on different imaging modalities, such as brain magnetic resonance (MR), ultrasound, computed tomography (CT) and mammogram images.


Further, the performance of our model is compared with the standard wavelet shrink models, such as VisuShrink [11], BayesShrink [12] and NeighShrink [13]. The proposed model is validated with five standard performance evaluation indices: mean squared error (MSE), peak signal-to-noise ratio (PSNR), mutual information (MI), structural similarity index measure (SSIM) and universal image quality index (UIQI) [14]. The method is simple and can be applied as a pre-processing stage for any image processing application. The rest of the paper is organized as follows: Sect. 2 illustrates the wavelet transform. The suggested spline based fuzzy wavelet transform is explained in Sect. 3. Section 4 explains the results, evaluation and validation of the suggested model. Finally, Sect. 5 presents the conclusion and the future scope.

2 Wavelet Shrinkage

Let X be a noise-free image in the spatial domain. Let x_{s,o}(i, j) and y_{s,o}(i, j) represent the noise-free and noisy wavelet coefficients of X with scale s and orientation o, respectively. Because of the linear characteristic of the wavelet transform, the additive noise can be incorporated as:

y_{s,o}(i, j) = x_{s,o}(i, j) + ε_{s,o}(i, j),   i, j = 1, 2, ..., N   (1)

where ε_{s,o}(i, j) represents the additive noise component in the wavelet domain, distributed with mean = 0 and variance = σ². The aim is to denoise y_{s,o}(i, j) by obtaining an estimate (X̂) that minimizes the MSE [2].
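A small sketch of this setup, assuming the PyWavelets package: Gaussian noise of variance σ² is added to the image as in Eq. (1), and the result is decomposed into the sub-bands shown in Fig. 1. The wavelet family and decomposition level are illustrative choices, not values prescribed by the text.

```python
import numpy as np
import pywt

def noisy_decomposition(image, sigma, wavelet='db2', level=3):
    noisy = image + np.random.normal(0.0, sigma, image.shape)   # Eq. (1)
    # coeffs = [cA_J, (cH_J, cV_J, cD_J), ..., (cH_1, cV_1, cD_1)]
    coeffs = pywt.wavedec2(noisy, wavelet, level=level)
    return noisy, coeffs

def reconstruct(coeffs, wavelet='db2'):
    return pywt.waverec2(coeffs, wavelet)                       # inverse wavelet transform
```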

Fig. 1. Sub-bands in two-dimensional orthogonal wavelet transform.

A conventional way to represent the wavelet coefficients in sub-bands is shown in Fig. 1. It shows the sub-band details LL_k, LH_k, HL_k and HH_k, for k = 1, 2, ..., J, with J as the coarsest scale in the decomposition, each sub-band having size (N/2^k × N/2^k). The estimated denoised wavelet coefficients (x̂_{s,o}(i, j)) are obtained by filtering the sub-bands of each wavelet coefficient y_{s,o}(i, j) using a thresholding function, which can be realized as soft-thresholding or hard-thresholding. Finally, the denoised image is reformed by the inverse wavelet transform of the estimated wavelet coefficients (x̂_{s,o}(i, j)).
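For reference, the two standard thresholding functions mentioned above can be written as follows, where t denotes the threshold applied to a sub-band of coefficients c.

```python
import numpy as np

def hard_threshold(c, t):
    # keep coefficients whose magnitude exceeds the threshold, zero the rest
    return np.where(np.abs(c) > t, c, 0.0)

def soft_threshold(c, t):
    # shrink every magnitude towards zero by t
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)
```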


3 Proposed Methodology

The proposed SFWShrink model is an approach of assigning membership values to the wavelet coefficients. It is observed that neighboring coefficients with comparable magnitudes are assigned larger weights than coefficients with dissimilar magnitudes. The wavelet coefficients produced by the noise components are larger, isolated and uncorrelated, whereas the edge coefficients are persistent and more similar in magnitude. To represent this, a magnitude similarity fuzzy function m_u(m, n) and a spatial similarity fuzzy function s_u(m, n) are computed in a neighboring space of dimension (m, n). They are defined as:

m_u(m, n) = exp(−((y_{s,o}(i, j) − y_{s,o}(i + m, j + n)) / T_fw)²)   (2)

s_u(m, n) = exp(−(√(m² + n²) / N)²)   (3)

where y_{s,o}(i, j) and y_{s,o}(i + m, j + n) denote the central wavelet coefficient and the neighboring wavelet coefficient, respectively, and T_fw represents the fuzzy thresholding function. From the above two fuzzy functions, we obtain the adaptive weights w_u(m, n) for finding the fuzzy features of each wavelet coefficient as:

w_u(m, n) = m_u(m, n) × s_u(m, n)   (4)

The wavelet coefficients belonging to noise and image discontinuities are characterized based on their discrimination in the feature space. The fuzzy features for these wavelet coefficients are computed using the adaptive weights as:

f_u(i, j) = Σ_m Σ_n w_u(m, n) × |y_{s,o}(i + m, j + n)| / Σ_m Σ_n w_u(m, n)   (5)
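The following sketch evaluates Eqs. (2)–(5) for a single wavelet coefficient over a (2r + 1) × (2r + 1) neighborhood. Reading N in Eq. (3) as the sub-band side length and using r = 1 are assumptions made for illustration.

```python
import numpy as np

def fuzzy_feature(y, i, j, t_fw, r=1):
    """Adaptive fuzzy feature f_u(i, j) of the coefficient y[i, j]."""
    rows, cols = y.shape
    num = den = 0.0
    for m in range(-r, r + 1):
        for n in range(-r, r + 1):
            ii, jj = i + m, j + n
            if 0 <= ii < rows and 0 <= jj < cols:
                m_u = np.exp(-((y[i, j] - y[ii, jj]) / t_fw) ** 2)     # Eq. (2)
                s_u = np.exp(-(np.sqrt(m ** 2 + n ** 2) / rows) ** 2)  # Eq. (3), N ~ sub-band size
                w_u = m_u * s_u                                         # Eq. (4)
                num += w_u * abs(y[ii, jj])
                den += w_u
    return num / den                                                    # Eq. (5)
```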

The wavelet coefficients with fuzzy features f_u(i, j), computed as in (5), are then shrunk using the fuzzy rule. The shrinking of the fuzzy wavelet coefficients in the two-dimensional orthogonal sub-bands is characterized by the fuzzy membership values μ(x). The fuzzy membership values indicate the inclusion of an input data point for a particular noise variance using the thresholding values Th_1 and Th_2, as given below:

μ(x) = 0                                      for x ≤ Th_1
     = 2 × ((x − Th_1)/(Th_2 − Th_1))²        for Th_1 ≤ x ≤ (Th_1 + Th_2)/2
     = 1 − 2 × ((Th_2 − x)/(Th_2 − Th_1))²    for (Th_1 + Th_2)/2 ≤ x ≤ Th_2
     = 1                                      for x ≥ Th_2             (6)

The thresholding values Th_1 and Th_2 are computed using the estimated variance σ̂_n of the noise in an image and are used for expressing the fuzzy membership values.


The thresholding values (Th_1, Th_2) and the noise variance σ̂_n are non-linearly related. The optimum values of the thresholds Th_1 and Th_2 are computed using the relative constants K_1 and K_2 at different noise variance levels as:

Th_1 = K_1 × σ̂_n   (7)

Th_2 = K_2 × σ̂_n   (8)

Fig. 2. Spline estimation curve for values of the relative constants K_1 and K_2 at noise variance level = 40.

The spline estimation curve is a representation of the fuzzy membership values of each data point in the input space or universe of discourse. The thresholding values are marked on the spline estimation curve for each data point vector X. This relationship is presented graphically in Fig. 2 and is used to determine the best values for the relative constants K_1 and K_2. The estimated wavelet coefficients are computed from the membership values of each fuzzy feature of the noisy wavelet coefficients and are expressed as:

x̂_{s,o}(i, j) = μ(f_u(i, j)) × y_{s,o}(i, j)   (9)

where f_u(i, j) is the fuzzy feature computed using (5). Image reformation is the process of converting the wavelet domain transformed image into its spatial representation. The denoised image X̂ is obtained from the estimate of the wavelet coefficients x̂_{s,o}(i, j). This is represented as follows:

X̂ = Σ_m Σ_n w_u(m, n) × x̂_{s,o}(i + m, j + n) / Σ_m Σ_n w_u(m, n)   (10)
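A sketch of the shrinkage itself: the S-shaped membership of Eq. (6) is evaluated on the fuzzy features and used to scale the noisy coefficients as in Eq. (9), with Th_1 = K_1 σ̂_n and Th_2 = K_2 σ̂_n from Eqs. (7)–(8). The values of K_1 and K_2 come from the spline estimation and are left as inputs here.

```python
import numpy as np

def s_membership(x, th1, th2):
    """Eq. (6): piecewise S-shaped membership between Th_1 and Th_2."""
    x = np.asarray(x, dtype=float)
    mid = (th1 + th2) / 2.0
    mu = np.zeros_like(x)
    rise = (x > th1) & (x <= mid)
    fall = (x > mid) & (x < th2)
    mu[rise] = 2.0 * ((x[rise] - th1) / (th2 - th1)) ** 2
    mu[fall] = 1.0 - 2.0 * ((th2 - x[fall]) / (th2 - th1)) ** 2
    mu[x >= th2] = 1.0
    return mu

def shrink_coefficients(y, f_u, sigma_hat, k1, k2):
    th1, th2 = k1 * sigma_hat, k2 * sigma_hat   # Eqs. (7)-(8)
    return s_membership(f_u, th1, th2) * y      # Eq. (9)
```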

Figure 3 shows the schematic block diagram of the proposed SFWShrink model. The noisy medical image is formed by adding white Gaussian noise using (1). The medical image with additive Gaussian noise is first transformed into the wavelet domain. The wavelet transform decomposes the image into sub-bands based on the energy compression of the spatial coefficients, which enriches the detailed information and separates the noise elements from the actual image content.


Fig. 3. Schematic block diagram of the proposed SFWShrink model.

The proposed SFWShrink method assigns fuzzy membership values to the wavelet coefficients. Here, the fuzzy modeling for characterizing and shrinking the wavelet coefficients is based on their similarity and continuity features. The magnitude similarity fuzzy function m_u(m, n) and the spatial similarity fuzzy function s_u(m, n) are computed using (2) and (3). The fuzzy wavelet thresholding value is computed using (5). The shrinking of the fuzzy wavelet coefficients in the two-dimensional orthogonal sub-bands is characterized by the fuzzy membership values (6). The thresholding values Th_1 and Th_2 are computed using the variance of the noise in an image, and their best values at different noise variance levels are determined using spline estimation of the fuzzy wavelet coefficients. Reformation of the spatial image from the wavelet domain is obtained using (10). The resulting image is the denoised estimate of the input medical image.

4 Results and Discussions

This section presents the evaluation of the proposed model for denoising medical images. The proposed model is implemented in MATLAB and compared with the standard wavelet shrink models VisuShrink, BayesShrink and NeighShrink. The proposed SFWShrink model is evaluated on different modalities of medical images: a brain MR image and a CT image from the NITRC human imaging database [15], an ultrasonic image from the SPLab database [16], and a mammogram image from the MIAS database [17]. Further, the validation is supported using different evaluation indices, such as MSE, PSNR, MI, SSIM and UIQI [14]. The qualitative analysis of the suggested approach is presented in Figs. 4, 5, 6 and 7, and the quantitative analysis is presented in Tables 1, 2, 3 and 4.

4.1 Qualitative Analysis

Visual quality of the proposed model is presented in Figs. 4, 5, 6 and 7 in comparison with the standard wavelet shrink approaches.


Fig. 4. Results of denoising the brain MR image with MSE = 120 and PSNR = 22. (a) Original image, (b) Noisy image, (c) denoised image using VisuShrink, (d) Bayesshrink, (e) NeighShrink and (f) SFWShrink.


Fig. 5. Results of denoising the ultra-sonic image with MSE = 120 and PSNR = 22. (a) Original image, (b) Noisy image, (c) denoised image using VisuShrink, (d) Bayesshrink, (e) NeighShrink and (f) SFWShrink.


Fig. 6. Results of denoising the CT image with MSE = 120 and PSNR = 22. (a) Original image, (b) Noisy image, (c) denoised image using VisuShrink, (d) Bayesshrink, (e) NeighShrink and (f) SFWShrink.


Fig. 7. Results of denoising the mammogram image with MSE = 120 and PSNR = 22. (a) Original image, (b) Noisy image, (c) denoised image using VisuShrink, (d) Bayesshrink, (e) NeighShrink and (f) SFWShrink.

Figures 4, 5, 6 and 7 show the visual performance of the proposed model for denoising the brain MR image, ultrasonic image, CT image and mammogram image, respectively. It is found that the images resulting from the proposed model are clearer in comparison to the standard wavelet shrink approaches, indicating that the noise is reduced significantly. Further, the use of smoothing spline estimation provides more accurate thresholding values with respect to the variance of noise in an image. The accuracy of the thresholding values makes subtle information prominently visible in all the resulting images and also eliminates the break-points observed due to the presence of harmonic patches. The proposed method outperforms the other approaches on the MR, ultrasonic and CT images. However, some noise marks are still observed in the case of the mammogram images.

4.2 Quantitative Analysis

The data in Tables 1, 2, 3 and 4 describe the quantitative evaluation of the proposed model in comparison with VisuShrink, BayesShrink and NeighShrink for denoising the medical images.


The quantitative analysis is accomplished using different evaluation parameters, such as MSE, PSNR, MI, SSIM and UIQI, for different modalities of medical images.

Table 1. Comparative analysis of the proposed model for denoising the brain MR image with additive Gaussian noise with MSE = 120 and PSNR = 22.

Method of denoising   MSE       PSNR      MI       SSIM     UIQI
VisuShrink            27.9613   33.6624   2.5916   0.4572   0.5429
BayesShrink           20.1057   36.3394   2.8663   0.4754   0.5585
NeighShrink           15.2541   38.0324   1.9355   0.5963   0.6280
SFWShrink             10.2121   38.6946   3.1058   0.6113   0.6426

Table 2. Comparative analysis of the proposed model for denoising the ultrasonic image with additive Gaussian noise with MSE = 120 and PSNR = 22.

Method of denoising   MSE       PSNR      MI       SSIM     UIQI
VisuShrink            38.2879   32.3132   2.7024   0.4561   0.5124
BayesShrink           22.3219   34.6435   2.8709   0.3918   0.4722
NeighShrink           20.1546   36.0876   2.4561   0.4483   0.5126
SFWShrink             16.0074   38.6476   3.0345   0.5869   0.6370

Table 3. Comparative analysis of the proposed model for denoising the CT image with additive Gaussian noise with MSE = 120 and PSNR = 22.

Method of denoising   MSE       PSNR      MI       SSIM     UIQI
VisuShrink            36.2494   32.5431   2.5371   0.4018   0.4678
BayesShrink           28.1540   35.7920   2.6113   0.3720   0.4445
NeighShrink           20.1632   36.1975   2.1607   0.3763   0.4401
SFWShrink             15.6075   38.7958   2.8454   0.5644   0.6002


Table 4. Comparative analysis of the proposed model for denoising the mammogram image with additive Gaussian noise with MSE = 120 and PSNR = 22.

Method of denoising   MSE       PSNR      MI       SSIM     UIQI
VisuShrink            27.6564   39.8491   3.1433   0.4518   0.5745
BayesShrink           26.7189   39.2962   2.9774   0.1413   0.1930
NeighShrink           24.9032   41.2260   3.1953   0.2197   0.2746
SFWShrink             28.5243   39.3820   3.4404   0.5291   0.5997

The best values in the tables are highlighted with bold faces. A lower value of MSE and higher values of PSNR indicate that a method is more suitable for denoising an image. It is found that all the evaluation parameters show better values for the proposed model on all the images, except the mammogram image. For instance, the proposed model outperforms the others on the MR, ultrasonic and CT images, as shown in Tables 1, 2 and 3, whereas NeighShrink provides better MSE and PSNR values for the mammogram image in Table 4. Further, the MI, SSIM and UIQI values obtained are the best for the proposed model for all the images, excepting the MSE and PSNR values for the mammogram image, where the proposed model is still close to the best values, as seen in Table 4.

5 Conclusions

In this paper, we proposed a spline based fuzzy wavelet shrink model for denoising medical images. The fuzzy membership values of the wavelet coefficients signify the relative importance of a data point with respect to its neighboring data points. Further, the orthogonal decomposition and energy-based compression of the image preserve its spectral components. The spline estimation of the data points improves the denoising capability and eliminates the break-points observed due to the presence of harmonic patches. This makes the process of denoising simple and effective in preserving even the subtle image details. The proposed model is implemented and compared with the standard wavelet based shrink approaches. The denoising and edge-preservation quality of the proposed model is found to be better in comparison to the other models, which may be due to the spline estimation of the data points for computing accurate thresholding values. However, some noise marks are still observed in the case of the mammogram images.

Acknowledgement. This work is supported by a PhD scholarship grant under TEQIP-III, VSS University of Technology, Burla.

References

1. Abdalla, S.H., Osman, S.E.F.: Challenges of medical image processing. Int. J. Adv. Eng. Technol. 9(6), 563–570 (2016). https://doi.org/10.1007/s00450-010-0146-9
2. Saeedi, J., Moradi, M.H., Faez, K.: A new wavelet-based fuzzy single and multi-channel image denoising. Image Vis. Comput. 28(12), 1611–1623 (2010)
3. Donoho, D.L.: De-noising by soft-thresholding. IEEE Trans. Inf. Theory 41(3), 613–627 (1995)
4. Yoon, B.J., Vaidyanathan, P.P.: Wavelet-based denoising by customized thresholding. In: 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 925–928. IEEE (2004)
5. Luisier, F., Blu, T., Unser, M.: SURE-LET for orthonormal wavelet-domain video denoising. IEEE Trans. Circuits Syst. Video Technol. 20(6), 913–919 (2010)
6. Olawuyi, N.J.: Comparative analysis of wavelet-based denoising algorithms on cardiac magnetic resonance images. Afr. J. Comput. ICT 4(1), 11–15 (2011)
7. Mukhopadhyay, S., Mandal, J.K.: Wavelet based denoising of medical images using sub-band adaptive thresholding through genetic algorithm. Procedia Technol. 10, 680–689 (2013)
8. Naimi, H., Adamou-Mitiche, A.B.H., Mitiche, L.: Medical image denoising using dual tree complex thresholding wavelet transform and Wiener filter. J. King Saud Univ. Comput. Inf. Sci. 27(1), 40–45 (2015)
9. Sujitha, R., De Pearlin, C.C., Murugesan, R., Sivakumar, S.: Wavelet based thresholding for image denoising in MRI image. Int. J. Comput. Appl. Math. 12(1), 569–578 (2017)
10. Das, K., Maitra, M., Sharma, P., Banerjee, M.: Early started hybrid denoising technique for medical images. In: Bhattacharyya, S., Mukherjee, A., Bhaumik, H., Das, S., Yoshida, K. (eds.) Recent Trends in Signal and Image Processing. AISC, vol. 727, pp. 131–140. Springer, Singapore (2019). https://doi.org/10.1007/978-981-10-8863-6_14
11. Fourati, W., Kammoun, F., Bouhlel, M.S.: Medical image denoising using wavelet thresholding. J. Test. Eval. 33(5), 364–369 (2005)
12. Karthikeyan, K., Chandrasekar, C.: Speckle noise reduction of medical ultrasound images using Bayesshrink wavelet threshold. Int. J. Comput. Appl. 22(9), 8–14 (2011)
13. Chen, G.Y., Bui, T.D., Krzyzak, A.: Image denoising using neighbouring wavelet coefficients. Integr. Comput. Aided Eng. 12(1), 99–107 (2005)
14. Al-Najjar, Y.A., Soong, D.C.: Comparison of image quality assessment: PSNR, HVS, SSIM, UIQI. Int. J. Sci. Eng. Res. 3(8), 1–5 (2012)
15. NITRC Human Imaging Database. https://www.nitrc.org/frs/group_id=82. Accessed Mar 2019
16. SPLab Ultrasound Imaging Database. http://splab.cz/en/download/databaze/ultrasound. Accessed Mar 2019
17. MIAS Mammogram Image Database. http://www.mammoimage.org/databases/. Accessed Mar 2019

MBC-CA: Multithreshold Binary Conversion Based Salt-and-Pepper Noise Removal Using Cellular Automata

Parveen Kumar1,2(B), Mohd Haroon Ansari3, and Ambalika Sharma1

1 Indian Institute of Technology, Roorkee, India
[email protected]
2 National Institute of Technology Uttarakhand, Srinagar Garhwal, India
[email protected]
3 Indian Institute of Science, Bengaluru, Karnataka, India
[email protected]

Abstract. Salt-and-pepper noise, one of the forms of impulse noise, is an important problem that needs to be taken care of. Salt-and-pepper noise is introduced into images during their acquisition, recording and transmission. Cellular Automata (CA) is an emerging concept in the field of image processing due to its neighborhood dependence. Various methods have been proposed using CA for noise removal, but due to the high complexity of CA, most of them have proven to be inefficient. However, with some modifications that reduce its complexity, CA can be used efficiently for a large number of image processing techniques. In this paper, we overcome the complexity problem of CA by Multithreshold Binary Conversion (MBC), in which we convert the grayscale image to binary images based upon a chosen set of threshold values, reducing the number of states from 256 to 2 for every pixel. The resulting images are then fed to the CA. The result obtained is a set of binary images, and these binary images need to be recombined to obtain a noise-free grayscale image. We use a method similar to binary search that reduces the complexity of recombining the images from N²K to N²log K, making our recombination algorithm efficient, in terms of complexity, for recombining binary images into a single grayscale image. This reduction in the complexity of noise removal has no effect on the quality of the grayscale image.

Keywords: Cellular automata · Salt-and-pepper noise · Moore model · Thresholding · Complexity

1 Introduction

Emerging as one of the most important areas of research, image processing has widespread applications in medical diagnosis, weather forecasting, forensic labs and many different industries. One of the major challenges faced in digital image processing is the corruption of the image by various factors.


Images are transferred from one place to another in a variety of ways. During their acquisition, recording and transmission [1], images are often corrupted by impulse noise, one form of which is salt-and-pepper noise. However, this problem cannot be eliminated even after several attempts, so we need a different approach to overcome it: instead of trying to obtain the original image in several attempts, we try to correct the image received in a single attempt. The salt-and-pepper noise needs to be removed from the images without losing the texture and while preserving the integrity of the images. The basic idea of CA, which is now widely used, was conceptualized by J. von Neumann and Stanislaw Ulam [2,3]. The functioning of a CA depends entirely on its neighborhood cells and their states. CA is one of the emerging tools in the field of image processing. The computation required in image processing techniques is very high, so we need a technique that can overcome this obstacle, and cellular automata has proved to be one of the best tools for the purpose. In our work, we use CA for the removal of salt-and-pepper noise from images. Salt-and-pepper noise removal has been an active area of research in the field of image processing and has attracted many researchers due to its importance in a wide variety of industries. A large number of methods have been proposed in the literature to remove salt-and-pepper noise. Most of the methods use grayscale images as the test subject, and each has its own limitations and advantages. Some of the commonly used methods are the median filter and other classes of modified median filters [4,5], where correction of the image is done on the basis of the median value of the neighborhood pixels or a function that uses the median value of the neighborhood pixels. Despite the many methods developed over time, the results are not up to the mark and the resulting image is blurred, especially in cases of high noise. Some different techniques and classification methods are also discussed in [27–30]. A greedy method was proposed by Rosin et al. [6] that uses CA for the removal of noise, including salt-and-pepper noise. Liu et al. [7] provide a CA algorithm for higher noise levels, where the amount of data left in the image is very low and it is relatively harder to get a good result. A local transition function was designed by Popvici et al. [8], indicating that neighborhood cells define the current cell's state, an approach similar to the approach used in CA. It may seem that there is interaction among only a few neighborhood cells, but interaction with a few neighborhood cells can result in non-linear global behavior, an example of which is Conway's Game of Life [9]. A new feature extraction approach is given in [24,25]. A robust CNN model was given by Babu et al. [26] for writer identification. The rest of the paper is organized as follows: Sect. 2 describes the design of the CA model. Section 3 describes the proposed approach for salt-and-pepper noise removal. Section 4 compares the results of various standard approaches with the results obtained with MBC-CA. Section 5 concludes the paper.

2 Cellular Automaton Model

A cellular automata is defined as a collection of cells that form a n-dimensional structure [10]. Each cell of automata belongs to a particular state among predefined S states. State of a cell is a function of time and its neighborhood cells are usually governed by some set of rules. Cellular automata for operation on images is defined as a two dimensional grid like structure where each cell corresponds to the pixel of the images. State of the cell in case of images is defined as the value assumed by the pixel. For example, in case of a grayscale image, a pixel can assume any value from 0 to 255, these values are the states of the CA and hence CA model in case of grayscale image consists of 256 states. In case of binary images, there can be only two values 0 and 1. So, by converting an image to the binary images, a number of states in the CA are reduced to 2. New state of the pixel we obtain after applying CA on it is a function of time and the old numerical value held by the pixel as well as its neighborhood pixels. For the sake of simplicity, the rules governing the state of cell in automaton are taken to be independent of time. Our model for CA is simple and deterministic. It can be defined by three tuple (S, Br , δ), where S is the set of states, Br defines the neighborhood of the cell and δ is a function that has argument as the current state of the given cell as well as state of the neighborhood cells defined by the Br and value returned by δ is the new state of the cell. A standard way to define the Br is given in Eq. 1 which uses a ball defined by Br1 [8], also known as Von Neumann Neighborhood [11]. In Eq. 1, xc is the location of the current pixel and x is the location of neighborhood pixel. Another way to define neighborhood is by using the Norm [12] in the Eq. 2, also known as Moore Neighborhood [13]. Br1 (xc ) = {x : |x − xc |1 ≤ r}; x = < xi , yi>

(1)

Br∞ (xc ) = {x : |x − xc |∞ ≤ r}; x = < xi , yi>

(2)

In Eq. 2, $x_c$ is the location of the current pixel and $x$ is the location of a neighborhood pixel. The superscript 1 in Eq. 1 indicates that neighborhood cells are taken along a single line in each direction within the ball of radius r, while the superscript $\infty$ in Eq. 2 indicates that cells are taken along all possible lines in each direction within the ball of radius r. We apply the CA to binary images, and define our model as a two-dimensional, binary-state CA with a Moore neighborhood of r = 2. The $\delta$ function of our CA therefore takes the state of the current cell together with the states of its 8 neighbor cells and outputs the new state of the current cell. Let $S_x^i$ be the initial state of the current pixel, $S_1, S_2, \ldots, S_8$ the states of the 8 neighborhood pixels and $S_x^f$ the state after applying the CA to the pixel; $\delta$ can then be written as in Eq. 3. The choice of r in the Moore neighborhood depends on the requirement, but it is also limited by resources, because the complexity of the CA grows with the number of neighborhood cells. In the proposed model we take r = 2, which gives us the 8 neighbors: for our image these are the eight pixels surrounding each pixel, except for pixels on the image border, which have fewer neighbors.

$\delta(S_x^i, S_1, S_2, S_3, S_4, S_5, S_6, S_7, S_8) = S_x^f \qquad (3)$
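As a concrete illustration of this model (a minimal sketch, not the authors' code), the following Python fragment applies a generic δ over the 3 × 3 window of every pixel of a binary image; the specific dominating-state rule introduced in Sect. 3.1 can be passed in as the delta argument. Zero-padding at the border is an assumption made for simplicity.

    import numpy as np

    def ca_step(image, delta):
        """One synchronous CA update on a 2-D binary image.

        `delta` receives the current cell state and the 8 Moore neighbours
        and returns the new state.  The image is zero-padded, so border
        pixels simply see fewer foreground neighbours.
        """
        padded = np.pad(image, 1, mode='constant', constant_values=0)
        out = np.empty_like(image)
        rows, cols = image.shape
        for i in range(rows):
            for j in range(cols):
                window = padded[i:i + 3, j:j + 3]
                centre = window[1, 1]
                neighbours = np.delete(window.flatten(), 4)  # the 8 surrounding cells
                out[i, j] = delta(centre, neighbours)
        return out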

3 MBC-CA Model for Noise Removal

The proposed model for salt-and-pepper noise removal consists of two major steps: first, Multithreshold Binary Conversion (MBC) followed by the CA, and second, Recombination of Binary Images (RBI). The proposed model is depicted in Fig. 1.

Fig. 1. MBC-CA model for salt-and-pepper noise removal.

3.1 Multithreshold Binary Conversion (MBC)

In general, finding the rule set for a CA is a computationally demanding and tedious task [14]. Consider a cell with S possible states whose state after applying the CA is governed by N neighborhood cells. Each neighborhood cell can be in one of S states, so there are $S^N$ possible configurations of the N neighborhood cells, and the current cell can be assigned one of S states for each configuration; the total number of possible rules is therefore $S^{S^N}$. If the proposed model were applied directly to a grayscale image, with a Moore neighborhood of only eight cells and 256 possible states (the grayscale pixel values), the number of possible rules would be $256^{256^8}$, which is of the order of $10^{10^{19}}$. This exponential growth makes a direct rule search computationally infeasible. So, rather than applying the CA to grayscale images, we split the grayscale image into N binary images by choosing N threshold points; each threshold point generates one binary image. We use an approach similar to multithreshold analog-to-digital conversion [15]. The first task is to choose N threshold values in the range 0 to 255, i.e. the range of grayscale pixel intensities. Let T be the array of the N chosen threshold points, I the grayscale image, I(i, j) the pixel at the i-th row and j-th column, F the set of binary images obtained and $F_k$ the binary image for the k-th threshold point. Mathematically, the transformation can be represented as in Eq. 4.

$\{I(i, j) \in [0, 255]\} \Longrightarrow \{F_k(i, j) \in \{0, 1\}\}, \quad \forall k \in T \qquad (4)$


For every pixel of the image, the rule in Eq. 5 is applied to convert the grayscale image into a binary image. There are many ways to choose the thresholds for the binary conversion. One way is to choose N numbers randomly from the values 0 to 255, for example {0, 10, 25, 75, 80, 150, 201, 203, 250, 255}. Another is exponential increments; for example, exponential increments with base 2 give the threshold points {1, 2, 4, 8, 16, 32, 64, 128}. Although there is no hard and fast rule for choosing the threshold points, choosing them uniformly over the interval gives better results. In the proposed model we use uniform sampling, in which a value is chosen after a regular interval; the interval chosen in our model is four, giving the 64 values {0, 4, 8, ..., 252}, and hence 64 binary images.

$F_k(i, j) = \begin{cases} 1, & \text{if } I(i, j) \ge k \\ 0, & \text{otherwise} \end{cases} \qquad (5)$

We have already seen the complexity of choosing the rules for a CA. Due to the exponential increase in complexity with increasing r in the Moore neighborhood, we limit the value of r to 2. In the proposed model we use the Dominating State Rule: the state of the current cell after applying the CA model is the state that dominates in number among the neighborhood cells, including the current cell. Since the model is applied to binary images, there are only two possible states, 0 and 1. If n(S) denotes the number of cells with state S in the input sequence of $\delta$ (Eq. 3), the rule can be written as in Eq. 6.

$S_x^f = \begin{cases} 1, & \text{if } n(1) \ge n(0) \\ 0, & \text{otherwise} \end{cases} \qquad (6)$
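The two steps above can be sketched in a few lines of Python (an illustration under the stated choices, not the authors' implementation): the grayscale image is split at the uniform thresholds {0, 4, ..., 252} and the dominating-state rule of Eq. 6 is applied to each binary slice as a 3 × 3 majority vote; zero-padding at the image border is a simplification of the rule.

    import numpy as np

    def mbc(gray, step=4):
        """Multithreshold Binary Conversion (Eq. 5) with uniform thresholds."""
        thresholds = np.arange(0, 256, step)            # {0, 4, ..., 252}: 64 thresholds
        stack = np.stack([(gray >= k).astype(np.uint8) for k in thresholds])
        return thresholds, stack

    def dominating_state_step(binary):
        """One CA step with the Dominating State Rule (Eq. 6).

        The new state of a cell is the majority state over the 3 x 3 window
        (the cell plus its 8 Moore neighbours); the image is zero-padded, so
        border cells simply see fewer foreground neighbours.
        """
        p = np.pad(binary, 1, mode='constant', constant_values=0).astype(np.int32)
        h, w = binary.shape
        # Sum of the 3 x 3 window around every pixel.
        window_sum = sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3))
        return (window_sum >= 5).astype(np.uint8)       # n(1) >= n(0) over 9 cells

    def mbc_ca(gray):
        thresholds, stack = mbc(gray)
        return thresholds, np.stack([dominating_state_step(b) for b in stack])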

3.2 Recombination of Binary Images (RBI)

The final step of the proposed model is to recombine the binary images obtained in the previous step into a final noise-free grayscale image. The basic idea of our recombination procedure (Algorithm 1) is to build the new image pixel by pixel, with the value of each pixel derived by observing the corresponding pixel of every binary image. A very important property of every pixel in our model first needs to be stated: if the value of a pixel is 0 in the binary image corresponding to the k-th threshold value, its value is 0 in the binary images corresponding to every j-th threshold with j ≥ k. The value of a pixel in a binary image is determined by Eq. 5, so a pixel getting the value 0 in the binary image for the k-th threshold means its value in the grayscale image is less than the k-th threshold. In the proposed model the threshold values are arranged in increasing order, so if the value of a pixel is less than one threshold value it is smaller than all the threshold values after it; hence every pixel possesses this property.


When we apply the CA to the binary images, every pixel gets a value according to Eq. 6. If a pixel is assigned the value 0, its neighborhood has more zeros than ones, and by the property above a pixel with value 0 remains 0 in all binary images for the subsequent thresholds. Such a pixel therefore also has more 0-valued than 1-valued neighbors in the binary images corresponding to the subsequent thresholds, so if a pixel receives the value 0 after applying the CA, it receives the value 0 in all binary images corresponding to thresholds after the current one. Hence the property is still satisfied by every pixel even after applying the CA. We exploit this property to reduce the time complexity of the recombination algorithm considerably, using an approach similar to binary search [16].

Algorithm 1. Recombination Algorithm of MBC-CA

for i ← 1 to m do
    for j ← 1 to n do
        start ← 1, end ← N                        ▷ N = number of binary images
        Result(i, j) ← Threshold(N) + 2           ▷ default: pixel is 1 in every binary image
        while start < end do
            mid ← ⌊(start + end)/2⌋
            if A(i, j, mid) ≠ A(i, j, mid + 1) then
                Result(i, j) ← Threshold(mid) + 2
                break
            else if A(i, j, mid) = 0 then
                end ← mid                         ▷ transition lies in the first half
            else
                start ← mid + 1                   ▷ transition lies in the second half
            end if
        end while
    end for
end for

Let us first describe the idea behind the recombination. While converting the grayscale image to binary images, for each threshold, pixels with a value below the threshold were assigned 0 and pixels with values greater than or equal to the threshold were assigned 1. Consequently, a pixel is assigned 1 in all binary images whose threshold does not exceed its value, and 0 in all binary images whose threshold is greater than its value. Hence, the grayscale value of a pixel can be recovered by finding the point at which it changes state from 1 to 0. One approach is simply to traverse the corresponding pixel of every binary image, but that is very time consuming. In our recombination algorithm we use a better approach to reduce the complexity of obtaining the grayscale image from the stack of binary images produced by the CA. Instead of traversing the corresponding pixel of every image, we use an approach similar to binary search. First, we select the corresponding pixel of the middle binary image and check its value. If its value is 0, the threshold point we are looking for lies in the first half of the images, so we repeat the process on the first half of the stack. If the value is 1, either this is the required point or the required point lies in the second half: we check the corresponding pixel in the next binary image; if it is 0 we have found the required point, whereas if it is 1 we repeat the procedure on the second half of the images. Once we find the point, we look up the threshold value corresponding to the binary image in which it lies and assign the value threshold + 2 to the pixel in the resulting grayscale image. This procedure is repeated for every pixel of the binary images.
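The following Python sketch mirrors Algorithm 1 (again an illustration, not the authors' code); the stack is assumed to hold one binary image per threshold, ordered by increasing threshold.

    import numpy as np

    def recombine(stack, thresholds):
        """Rebuild a grayscale image from the CA-filtered binary stack.

        `stack` has shape (N, rows, cols); slice k is the binary image for
        thresholds[k].  For each pixel, the 1 -> 0 transition along the stack
        is located by binary search, exactly as in Algorithm 1.
        """
        n, rows, cols = stack.shape
        result = np.empty((rows, cols), dtype=np.uint8)
        for i in range(rows):
            for j in range(cols):
                column = stack[:, i, j]
                value = int(thresholds[-1]) + 2     # default: 1 in every slice
                lo, hi = 0, n - 1
                while lo < hi:
                    mid = (lo + hi) // 2
                    if column[mid] != column[mid + 1]:
                        value = int(thresholds[mid]) + 2
                        break
                    elif column[mid] == 0:
                        hi = mid                    # transition is in the lower half
                    else:
                        lo = mid + 1                # transition is in the upper half
                result[i, j] = min(value, 255)
        return result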

4 Experimental Results

In this section, the proposed model is evaluated on standard images used in image processing, such as Lena, Baboon and Peppers, shown in Fig. 2 as (a, b), (c, d) and (e, f), respectively, with varying percentages of salt-and-pepper noise in the grayscale image. The model was evaluated rigorously on a large set of images, with the level of salt-and-pepper noise increased from 10% to 90% in steps of 10%. To evaluate the performance of the proposed CA model and compare it with existing image restoration techniques, we use the Structural Similarity Index Metric (SSIM) [17], a widely used metric in image processing adopted by most popular techniques and algorithms for performance evaluation. The results obtained by the proposed model are compared against several standard and popular methods for removing salt-and-pepper noise from images: AMF [18], SMF [19], FBDA [20], IDBA [21], EDBA [22] and BDND [23]. We used the SSIM values reported in the literature and computed the same metric for the proposed model; the values are tabulated in Table 1, and the highest SSIM value at each noise intensity indicates the best-performing method. Although the proposed model does not give the best SSIM value at higher noise levels, its complexity is very low in comparison with the other models. The variation of the SSIM value with increasing noise intensity can also be observed. Figure 3 shows the results of the various methods on the standard Lena image with a salt-and-pepper noise intensity of 60%; the result produced by our model is better than those of most of the methods compared, and our model retains a fairly good SSIM value even for images with a high intensity of salt-and-pepper noise. Figure 4 plots the SSIM values against the noise intensity; the curve produced by our model has a gentler slope, meaning that the decrease in SSIM with increasing noise intensity is smaller for our model.


Fig. 2. Lenna, Baboon, Peppers images with 60% noise tested with MBC-CA. (a, c, e) Noisy Image (b, d, f) Output Image.

Table 1. SSIM value comparisons of different methods.

Noise (%)   AMF [18]   SMF [19]   FBDA [20]   IDBA [21]   EDBA [22]   BDND [23]   Proposed model (MBC-CA)
10          0.9974     0.9931     0.9979      0.9978      0.9951      0.9989      0.9995
20          0.9939     0.9812     0.9971      0.9963      0.9914      0.9962      0.9983
30          0.9886     0.9718     0.9963      0.9941      0.9879      0.9962      0.9976
40          0.9825     0.9614     0.9948      0.9901      0.9825      0.9933      0.9864
50          0.9738     0.9381     0.9899      0.9843      0.9755      0.9893      0.9756
60          0.9636     0.9155     0.9842      0.9749      0.9655      0.9831      0.9711
70          0.9471     0.8646     0.9974      0.9638      0.9483      0.9766      0.9642
80          0.9209     0.7939     0.9593      0.9491      0.9154      0.9697      0.9456
90          0.8637     0.7388     0.9325      0.9152      0.8132      0.9546      0.9342

Fig. 3. Comparisons of results by various models on an image having salt-and-pepper noise intensity of 60%.

Fig. 4. SSIM value vs Noise intensity plot for various standard models and MBC-CA.

5 Conclusion

In this paper, we have introduced the application of CA to noise removal from images by introducing the concept of MBC. The biggest problem researchers face when using CA is its time complexity. In MBC-CA, converting the image into several binary images reduces the complexity of the CA to a very low order compared with applying it to the grayscale image directly. However, converting the grayscale image into binary images introduces the overhead of combining them back into a single image; the recombination procedure (Algorithm 1) used in MBC-CA greatly reduces the complexity of this step. The overall complexity of our approach is much lower than that of many existing approaches in the literature.

References

1. Luo, W.: Efficient removal of impulse noise from digital images. IEEE Trans. Consum. Electron. 52(2), 523–527 (2006)
2. von Neumann, J.: Theory of Self-Reproducing Automata (edited and completed by Arthur Burks). University of Illinois Press, Urbana (1966)
3. Phipps, M.J.: From local to global: the lesson of cellular automata. In: DeAngelis, D.L., Gross, L.J. (eds.) Individual-Based Models and Approaches in Ecology, pp. 165–187. Chapman and Hall/CRC, London (2018)
4. Zhang, S., Karim, M.A.: A new impulse detector for switching median filter. IEEE Signal Process. Lett. 9(11), 360–363 (2002)
5. Gupta, V., Chaurasia, V., Shandilya, M.: Random-valued impulse noise removal using adaptive dual threshold median filter. J. Vis. Commun. Image Represent. 26, 296–304 (2015)
6. Rosin, P.L.: Training cellular automata for image processing. IEEE Trans. Image Process. 15(7), 2076–2087 (2006)
7. Liu, S., Chan, H., Yang, S.: An effective filtering algorithm for salt-pepper noises based on cellular automata. In: IEEE Congress on Image and Signal Processing (2008)
8. Popovici, A., Popovici, D.: Cellular automata in image processing. In: Gilliam, D.S., Rosenthal, J. (eds.) Proceedings of the 15th International Symposium on the Mathematical Theory of Networks and Systems, Electronic Proceedings (2002)
9. Paranj, B.: Conway's game of life. In: Test Driven Development in Ruby, pp. 171–220. Apress, Berkeley (2017)
10. Hadeler, K.-P., Müller, J.: Cellular automata: basic definitions. In: Cellular Automata: Analysis and Applications. SMM, pp. 19–35. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-53043-7_2
11. Weisstein, E.W.: von Neumann neighborhood. From MathWorld-A Wolfram Web Resource (2013)
12. Gray, L.: A mathematician looks at Wolfram's New Kind of Science. Not. Amer. Math. Soc. 50, 200–211 (2003)
13. Weisstein, E.W.: Moore neighborhood. From MathWorld-A Wolfram Web Resource (2005). http://mathworld.wolfram.com/MooreNeighborhood.html
14. Wolfram, S.: Cellular Automata and Complexity: Collected Papers. CRC Press, Boca Raton (2018)
15. Krishnan, P.M., Mustaffa, M.T.: A low power comparator design for analog-to-digital converter using MTSCStack and DTTS techniques. In: Ibrahim, H., Iqbal, S., Teoh, S.S., Mustaffa, M.T. (eds.) 9th International Conference on Robotic, Vision, Signal Processing and Power Applications. LNEE, vol. 398, pp. 37–45. Springer, Singapore (2017). https://doi.org/10.1007/978-981-10-1721-6_5
16. Davis, C.H.: The binary search algorithm. J. Assoc. Inf. Sci. Technol. 167–167 (1969)
17. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
18. Hwang, H., Haddad, R.A.: Adaptive median filter: new algorithms and results. IEEE Trans. Image Process. 4(4), 499–502 (1995)
19. Bovik, A.: Handbook of Image and Video Processing. Academic Press, San Diego (2000)
20. Nair, M.S., Raju, G.: A new fuzzy-based decision algorithm for high-density impulse noise removal. Sig. Image Video Process. 6, 579–595 (2010)
21. Nair, M.S., Revathy, K., Tatavarti, R.: An improved decision-based algorithm for impulse noise removal. In: Congress on Image and Signal Processing, CISP 2008, vol. 1, pp. 426–431 (2008)
22. Srinivasan, K.S., Ebenezer, D.: A new fast and efficient decision-based algorithm for removal of high-density impulsive noises. IEEE Signal Process. Lett. 14(3), 189–192 (2007)
23. Ng, P.E., Ma, K.K.: A switching median filter with boundary discriminative noise detection for extremely corrupted images. IEEE Trans. Image Process. 15(6), 1506–1516 (2006)
24. Kumar, P., Sharma, A.: DCWI: distribution descriptive curve and cellular automata based writer identification. Expert Syst. Appl. 128, 187–200 (2019)
25. Meena, Y., Kumar, P., Sharma, A.: Product recommendation system using distance measure of product image features. In: 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS). IEEE (2018)
26. Kumar, B., Kumar, P., Sharma, A.: RWIL: robust writer identification for Indic language. In: 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS). IEEE (2018)
27. Kumar, V., Monika, Kumar, P., Sharma, A.: Spam email detection using ID3 algorithm and hidden Markov model. In: 2nd Conference on Information and Communication Technology (CICT 2018), Jabalpur, India (2018)
28. Panwar, P., Monika, Kumar, P., Sharma, A.: CHGR: captcha generation using hand gesture recognition. In: 2nd Conference on Information and Communication Technology (CICT 2018), Jabalpur, India (2018)
29. Bhatt, M., Monika, Kumar, P., Sharma, A.: Facial expression detection and recognition using geometry maps. In: 2nd Conference on Information and Communication Technology (CICT 2018), Jabalpur, India (2018)
30. Katiyar, H., Monika, Kumar, P., Sharma, A.: Twitter sentiment analysis using dynamic vocabulary. In: 2nd Conference on Information and Communication Technology (CICT 2018), Jabalpur, India (2018)

Image to CAD: Feature Extraction and Translation of Raster Image of CAD Drawing to DXF CAD Format

Aditya Intwala(B)

Symbiosis Institute of Technology, Pune, India
[email protected]

Abstract. A CAD drawing has various drawing features like entity lines, dimensional lines, dimensional arrows, dimensional text, support lines, reference lines, circles, GD&T symbols and drawing information metadata. The problem of automated or semi-automated recognition of feature entities from 2D CAD drawings in the form of raster images has multiple usages in various scenarios. The present research work explores ways to extract this information about the entities from raster images of 2D CAD drawings and to set up a workflow to do it in an automated or semi-automated way. The algorithms and workflow have been tested and refined using a set of test CAD images which are fairly representative of the CAD drawings encountered in practice. The overall success rate of the proposed process is 90% in fully automated mode for the given sample of test images. The proposed algorithms are evaluated based on F1 scores. The proposed algorithm is used to generate a user-editable DXF CAD file from raster images of CAD drawings, which can then be used to update/edit the CAD model when required using CAD packages. The research work also provides use cases of this workflow for other applications.

Keywords: Feature extraction · CAD · Reverse engineering · Machine learning · OCR · Feature recognition

1 Introduction

Even though most of the manufacturing industry uses 3D CAD models to represent parts, 2D CAD drawings are still highly prevalent in a large section of the industry, and a huge amount of legacy CAD data is locked in the form of 2D drawings, either as raster images or hard copies. Moreover, in many situations, such as manufacturing subcontracting or even at a manufacturing station, using CAD software to visualize and interpret the part is not possible. In such cases, 2D CAD drawing images prove to be the default mode of representation of the part. Such images could be camera snapshots of a hard-copy drawing, or raster images or PDF files exported from the CAD system.


In many cases it is essential to process the information contained in a 2D CAD drawing in an automated way to increase productivity [5], e.g. generating an inspection or manufacturing path plan, performing incremental modifications to existing drawings for different versions of a part, identifying and planning manufacturing processes automatically for a given part, or generating a 3D CAD model automatically or semi-automatically from multiple-view images of a 2D CAD drawing. The current work is divided into three stages, as shown in Fig. 1: the first stage is the identification of individual geometric and text entities through image processing; the second stage processes the extracted information to find the correlation between the identified entities and the dimensional information; and the third stage generates a user-editable DXF file from the processed information. Thus, if one can get a blueprint of the product, it can be used to extract features and generate a user-editable model. This can be used to modify features of a competitor's or the company's own model and to generate a new, updated and improved model with less effort.

Fig. 1. Stages of proposed workflow

The implementation and experimentation have been performed using the OpenCV 3.0 library through Python 3. Test data for candidate CAD drawing images are taken from the book Machine Drawing [10].

2 Proposed Algorithm

The input to our algorithm may be a multi-channel colored or single-channel grayscale image of the CAD drawing. If the image is multi-channel, it is first converted to a single-channel grayscale image and then to binary form by thresholding.


Let t be the threshold value, $q_i$ the probabilities of the two histogram bins separated by the threshold t, $\mu_i$ the mean of histogram bin i, $\sigma_i^2$ the variance of bin i, and $x(i)$ the center intensity of histogram bin i. Then

$q_i(t) = \sum_{i=0}^{t} f_g(i) \qquad (1)$

$\mu_i(t) = \Big[ \sum_{i=0}^{t} f_g(i)\, x(i) \Big] / q_i(t) \qquad (2)$

$\sigma_b^2(t) = q_1(t)\, q_2(t)\, [\mu_1(t) - \mu_2(t)]^2 \qquad (3)$

$f_t = T(f_g) = \begin{cases} 255, & f_g(i, j) \ge \mathrm{ThresholdValue} \\ 0, & f_g(i, j) < \mathrm{ThresholdValue} \end{cases} \qquad (4)$

where i, j represent the row and column of a pixel of the image.
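A minimal OpenCV sketch of this pre-processing step is shown below; it relies on OpenCV's built-in Otsu thresholding, which maximizes the same between-class variance criterion as Eq. (3), rather than re-implementing Eqs. (1)–(3) by hand.

    import cv2

    def to_binary(path):
        """Load a CAD drawing image, force grayscale, and threshold it.

        cv2.THRESH_OTSU picks the threshold that maximizes the between-class
        variance sigma_b^2(t) of Eq. (3); THRESH_BINARY then maps pixels to
        0/255 as in Eq. (4).
        """
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)      # single channel
        _, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        return binary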

2.1 Entity Detection

After pre-processing the input image into a thresholded image, we detect the various entities (dimensional arrow heads, dimensional text, lines, circles, GD&T symbols, etc.) at this stage of the algorithm. Dimensional arrow heads are detected using morphological operations [7] such as black hat and white hat, which filter out non-arrow-like objects from the image [9]. The approach is able to detect both solid and line-type arrow heads, as shown in Fig. 2.

$T_{bh}(f_t) = f_t \bullet s(x) - f_t \qquad (5)$

where $T_{bh}$ is the black-hat function, $s(x)$ is a 2 × 2 kernel and $\bullet$ is the closing operator.

$T_{wh}(f_t) = f_t - f_t \circ s(x) \qquad (6)$

where $T_{wh}$ is the white-hat function, $s(x)$ is a 5 × 5 kernel and $\circ$ is the opening operator.

Fig. 2. Dimensional Arrow heads detection

CAD drawings contain multiple textual entities in the form of dimensions, GD&T symbols [11], bill of materials, etc. This information is required to process the drawing features. Our approach is a systematic combination of region-based techniques and morphological separation techniques. We first search for regions where textual entities are present: using the region-based approach we filter out most non-textual entities from the image, and in a second step we apply a morphological separation. A two-pass connected-component labelling is applied. In the first pass, each pixel of the binary image is scanned iteratively and checked for being a foreground pixel. If it is, the neighboring pixels are checked for connected foreground pixels; if no neighboring foreground pixel is found, a new label is assigned to the current pixel, otherwise the label of the neighboring foreground pixel is assigned to it. In the second pass, the labelled image is scanned again in a similar fashion and new labels are assigned based on the smallest equivalent label among the neighboring pixels. The algorithm uses the union-find data structure, which provides excellent performance for keeping track of equivalence relationships; union-find stores labels that correspond to the same blob in a disjoint-set data structure. $CC_{labels}$ is the set of connected components, represented by the two-dimensional array $f_{CC}(r, c)$, where

$f_{CC}(r, c) = \begin{cases} 1, & \text{if } (r, c) \in CC_{labels} \\ 0, & \text{if } (r, c) \notin CC_{labels} \end{cases} \qquad (7)$

(8)

Where fCC is connected components ◦ is Opening operator,  is Erosion operator, ⊕ is Dilation operator and B is a rectangular structuring element. In the next step, a bounding box for each of the candidate region is found. A set of rules is applied on these bounding boxes to differentiate between textual

Image to CAD

209

and non-textual regions. The rules are based on the bounding box aspect ratio, critical width and bounding box area. These conditions filter out a majority of non-textual regions from the list. After this each of the candidate region remaining is treated as a Region of Interest (ROI)-that is fROI as shown in Fig. 3. We process the separated textual region and run Optical Character Recognition (OCR) using a Convolution Neural Network (CNN) [12]. The CNN is also able to recognize GD&T symbols and store the data in the data-structure for later use.

Fig. 3. Textual region detection

Major part of CAD drawing contains lines in various forms, feature lines, support lines, center lines, etc. Thus it is inevitable to detect lines precisely. Here we have used hybrid Hough based line detection where the Hough transform [2] is only used for initial guess. Firstly, corners are detected using Harris corner function [13]. This is used in combination with Hough line function to get the definition of candidate lines. Using this guess, we start scanning the pixels based on the definition of line and counting the number of nonzero pixels on the lines and based on some threshold determine precise unique lines from the candidate lines. To determine the corner candidate using Harris function which requires determinant and trace of matrix M .    I I I I  (9) w(x, y)  u u u v  M= Iu Iv Iv Iv x,y

Where Iu and Iv are the derivative of an image in x and y direction. Corner = λ1 λ2 − k(λ1 + λ2 )2 where k is some small constant.

(10)

210

A. Intwala

A line in image space transforms to point in Hough space (ρ, θ). we could determine such points to get the candidate lines in image space (x, y) as shown in Fig. 4. r = x cos θ + y sin θ (11) y = −(cos θ/ sin θ)x + r/ sin θ

(12)

Fig. 4. Line detection.

Nonlinear entities like circles, splines, ellipse, etc. are used to represent surface, holes, circular dimensions etc. in CAD drawing image. Here we use the parametric equations of some of 2D conic shapes like circle, ellipse, parabola and hyperbola to generate the data for training a 4 layer CNN, which is used to predict these shapes. For simple circle detection the approach of using optimized curve fitting using Random sample consensus (RANSAC) gives satisfactory results, but for more robust results CNN can be preferred. The output of this stage is as shown in Fig. 5. x = h + r cos(t), y = k + r sin(t)f orcircle

(13)

Where (h, k) is the center of the circle, t is parameter, r is radius of circle. x = h + a cos(t), y = k + b sin(t)f orellipse

(14)

Where (h, k) is center of ellipse t is parameter, a is radius in x − axis and b is radius in y − axis. 2.2

Entity Correlation

In this stage we correlate the individual extracted information to make sense of the data.

Image to CAD

211

Fig. 5. Circle detection.

We first find the direction of the extracted arrowhead in which it is facing. Here the corners of the arrowhead are extracted and Euclidean distance of all the corners to the center are calculated. The maximum distance is the extreme most points of the arrow head. The direction vector from center to the extreme point gives the direction of arrow head as shown in Fig. 6.

Fig. 6. Arrow head direction vector

For dimensional lines, we use the bounding box of the arrow head and its center to locate lines from the list of extracted lines on which it lies. Thus associating the arrow heads with its dimensional segments. We correlate the individual text extracted with its dimensional line using the Axis Aligned Bounding Box (AABB) collision approach as shown in Fig. 7. This correlated information is stored in the single associated data structure as dimensions. The final step would be to associate the dimensions to the individual extracted entity. This is done with the help of arrow head direction and associated support lines which traces back to the entity.

212

A. Intwala

Fig. 7. Axis aligned bounding box association

2.3

Data Exporter

The extracted entity along with the associated dimensions are used to export the data into DXF file format [1]. A DXF file is divided into sections. The data comes in blocks, which is made from pairs of lines. The first line is a number code which represents what to expect next and the second line is the data itself. DXF files starts and ends with special blocks. The sample of line and arc entity is shown in Fig. 8. It could also be exported into DWG or equivalent file format if the structure of the format is known. This exported DXF file is user editable, which can be modified/edited in any CAD packages available.

Fig. 8. DXF sample block for line and arc

3

Results

The individual algorithms for entity detection and correlation were implemented in Python3 and the set of 15 images was processed by the application generating the output DXF file of the raster images. The generated DXF file could be visualized/modified/edited in any commercial CAD packages. The Fig. 9 shows 2 input raster images and its corresponding exported DXF file as output by the mentioned approach and visualized in open source CAD QCAD package. Table 1 gives the summary of success rate with F1 scores of individual entity detection algorithms. ROC curve gives a good representation of the performance of individual algorithms is given in Fig. 10. Almost 90% of the data was extracted successfully from raster CAD image and was translated to user editable CAD file for further application.

Fig. 9. Outputs of raster images exported to DXF file

Image to CAD 213

214

A. Intwala Table 1. Precision - recall summary

Algorithm

Precision P = A/(A + B)

Recall F1 Score R = A/N F1 = 2(P*R)/(P+R)

Area Under Curve (AUC)

Arrow head Detection

0.95952

0.96834

0.95961

0.80

Text Detection

0.926174

0.8625

0.893203

0.83

Line segment Detection 0.87175

0.91542

0.89153

0.79

Circle Detection

1

0.8748

0.84

0.7944

Fig. 10. ROC graphs of entity detection

4

Applications

The extracted information from these raster CAD drawing images can be used for multiple applications – Digitization of legacy CAD drawing data which is still locked as hard copy drawings (as presented in this paper). – Generation of inspection plans for Coordinate Measuring Machines (CMM) from raster image of CAD models for inspecting a part. • Here using the extracted dimensions and GD&T information we could generate measurement strategies for calculation of actual linear measurement and tolerance information, Execute the plan on actual CMM and check if it is within provided tolerance values. – Generation of CAM tool paths for Computer Numerical Control (CNC) from raster image of CAD models for manufacturing a part. • Here using the extracted shape information we could generate CAM tool path, Execute the tool path on actual CNC machine. – Generation of 3D model by correlating multiple views of the part from raster images of multiple views of the same part [8].

Image to CAD

5

215

Conclusion

The CAD drawing, image data used in this study is a fairly good representation of the practically used data. For such a test data, using the proposed approach we were successfully able to translate 90% of the image to DXF data. Comparison of this approach with some of the similar existing vectorized approaches [3,4,6] is what I target for the future. Currently this approach suffers if the quality of the image is below average or the image captured is skewed. In these cases the accuracy of entity detection reduces due to skew-ness factor and image quality. The solution to this is trying out entire CNN based approach also might improve the translation rate of the application.

References 1. AutoCAD (2012) DXF Reference autocad 2012. https://images.autodesk.com/ adsk/files/autocad 2012 pdf dxf-reference enu.pdf 2. Ballard, D.H.: Generalizing the hough transform to detect arbitrary shapes. Pattern Recognit. 13(2), 111–122 (1981) 3. Bessmeltsev, M., Solomon, J.: Vectorization of line drawings via polyvector fields. ACM Trans. Graphics (TOG) 38(1), 9 (2019) 4. Chen, J., Du, M., Qin, X., Miao, Y.: An improved topology extraction approach for vectorization of sketchy line drawings. Vis. Comput. 34(12), 1633–1644 (2018). https://doi.org/10.1007/s00371-018-1549-z 5. Chhabra, A.K., Phillips, I.T.: Performance evaluation of line drawing recognition systems. In: 2000 15th International Conference on Pattern Recognition, Proceedings, vol. 4, pp. 864–869. IEEE (2000) 6. Donati, L., Cesano, S., Prati, A.: A complete hand-drawn sketch vectorization framework. Multimedia Tools Appl. 78(14), 19083–19113 (2019) 7. Goutsias, J., Heijmans, H.J., Sivakumar, K.: Morphological operators for image sequences. Comput. Vis. Image Underst. 62(3), 326–346 (1995) 8. Intwala, A.M., Magikar, A.: A review on process of 3D model reconstruction. In: International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), pp 2851–2855. IEEE (2016) 9. Intwala, A.M., Kharade, K., Chaugule, R., Magikar, A.: Dimensional arrow detection from CAD drawings. Indian J. Sci. Technol. 9(21), 1–7 (2016) 10. Narayan, K., Kannaiah, P., Venkata Reddy, K.: Machine Drawing. New Age International Publishers, New Delhi (2006) 11. Shen, Z., Shah, J.J., Davidson, J.K.: Analysis neutral data structure for GD&T. J. Intell. Manuf. 19(4), 455–472 (2008) 12. Suzuki, K.: Pixel-based artificial neural networks in computer-aided diagnosis. In: Artificial Neural Networks-Methodological Advances and Biomedical Applications, pp 71–92. InTech (2011) 13. Trajkovi´c, M., Hedley, M.: Fast corner detection. Image Vis. Comput. 16(2), 75–87 (1998)

Non-uniform Deblurring from Blurry/Noisy Image Pairs P. L. Deepa1(B) and C. V. Jiji2 1

Mar Baselios College of Engineering and Technology, Thiruvananthapuram, India [email protected] 2 College of Engineering, Thiruvananthapuram, India [email protected]

Abstract. In this paper, we address the problem of recovering a sharp image from its non-uniformly blurred version making use of a known but noisy version of the same scene. The recovery process includes three main steps - motion estimation, segmentation and uniform deblurring. The noisy image is first denoised and then used as a reference image for estimating the motion occurred in the non-uniformly blurred image. From the obtained motion vectors, the blurred image is segmented into image blocks encountered with uniform motion. To deblur these uniformly blurred segments, we use a two step process where we first generate an unnatural representation under an l0 minimization frame work followed by a hyper-Laplacian prior based non-blind deconvolution. The resulting deblurred segments are finally concatenated to form the output image. The proposed method gives better results in comparison with other state of the art methods.

Keywords: Non-uniform deblurring Hyper-Laplacian prior

1

· 3D block matching algorithm ·

Introduction

Image blur occurs due to either the motion of camera or the object under consideration, atmospheric turbulance, defocus of the camera, etc. Motion blur occurs when the scene being captured changes during recording of a single exposure, either due to sudden movement or long exposure. If the exposure duration is long, then the image captured is blurred due to camera shake and if it is very short, the image captured is noisy. If the blur kernel is uniform over the entire image, then the blurring is uniform, but spatially varying arbitrarily shaped blur kernel result in non-uniformly blurred image. Non-uniformity is mainly due to 3D rotation of the camera or non-uniform camera shake. Image deblurring means extracting the original image from the blurred image and estimating the kernel that causes the blur. This is a challenging problem since the original image and the blur kernel are both unknown. If we denote the c Springer Nature Singapore Pte Ltd. 2020  N. Nain et al. (Eds.): CVIP 2019, CCIS 1147, pp. 216–226, 2020. https://doi.org/10.1007/978-981-15-4015-8_19

Non-uniform Deblurring from Blurry/Noisy Image Pairs

217

sharp image or the latent image by x, the blur kernel by k and the blurred image by y, then the blurring process can be defined as y =x∗k

(1)

If k is a single kernel, then the blur is uniform and if it is a combination of several kernels, then the blur is non-uniform. So in the case of non-uniform blur, we have n different kernels k1 , k2 , ...kn . If xi is that part of an image degraded by the kernel ki , then that degraded portion can be represented as yi = xi ∗ ki . Therefore, non-uniformly blurred image can be modelled as a concatenated version of all yi ’s, where i = 1, 2, ..., n. Estimating these spatially varying blur kernels and the resulting deblurred image is further an ill-posed problem. Obtaining the sharp image and the blur estimate from a single input image is again a hard problem. There are several single image deblurring techniques in the literature, termed as blind deconvolution techniques, where the sharp image and the blur kernel are estimated simultaneously [1–4]. There are other single image deblurring methods that estimate the blur kernel first and then the sharp image [5]. Use of additional images can make the deblurring problem relatively simple, as it may reveal some interesting information in solving the problem. Chen et al. [6] used two blurred images with different blur kernels to solve the dual blurred problem. Yuan et al. [7] used a blurry/noisy image pair for estimating blur kernel and the corresponding sharp image without ringing artifacts. Here, the authors used a wavelet based denoising algorithm to denoise the noisy image and finally the Richardson-Lucy (RL) algorithm is used to deblur the blurred image. Lee et al. [8] used bilatreal filtering for denoising and its gradient is used for estimating the blur kernel. Finally hyper-Laplacian prior is used along with the denoised image for restoring the sharp image. Li et al. [9] proposed a robust algorithm to obtain the sharp image by fusing the blurry and noisy images. In [7,8] and [9], the authors deal with the uniform deblurring problem using blurry/noisy pairs. Whyte et al. [10] used blurry/noisy pairs for non-uniform deblurring. Here, the authors adopted the method proposed by Yuan et al. [7], but used only the l1 norm and positive constraints alone as they lead naturally to a sparse representation of the blur kernel. But as the non-uniformity increases, the l1 norm alone will not be sufficient for estimating the kernel. In [11], Gu et al. considered the same problem with the help of Gaussian Mixture Model (GMM) and Expectation Maximization (EM) algorithms. They have included the bilateral term in the objective function and deblurred the image without estimating the blur kernel. In [13], Xu et al., the authors deal with an unnatural representation using l0 norm. In this paper, we address the non-uniform deblurring problem, making use of a noisy version of the scene, by modelling the non-uniformity as a combination of several uniformly blurred sections. We have incorporated the l0 norm along with the bilateral term to form the objective function as the l0 norm along can’t preserve all the sharp features. Here, the noisy image acts as a reference for separating the different uniformly blurred segments in the non-uniformly blurred image.

218

P. L. Deepa and C. V. Jiji

The main contributions of this work are: 1. We use the 3D block matching algorithm that is normally used for multiple image motion estimation for effectively estimating the motion between different non-uniformly blurred segments. 2. We use a two stage approach for deblurring the generated uniformly blurred image segments where we first estimate the kernel utilizing an unnatural representation of the image under an l0 minimization framework incorporated with the bilateral term followed by a hyper Laplacian prior based non-blind deconvolution.

2

Proposed Deblurring Algorithm

The proposed method consists of the following steps. First, the noisy image is denoised and motion causing blur is estimated by comparing the blurred image with the denoised one. Then different motion vectors are identified and the blurred image is segmented into several homogeneous images having uniformly blurred sections. Each segmented image is deblurred using a uniform deblurring algorithm and finally concatenated to form the final deblurred image. Figure 1 shows the block diagram of the proposed method.

Fig. 1. Block schematic of the proposed approach

2.1

Denoising

The noisy image is first denoised using directional filter and is used as a reference image for the motion estimation step. Let In be the noisy image, then the directionally low pass filtered image is given by  1 ∞ w(t)I(p + tuθ )dt (2) Id (p) = c −∞

Non-uniform Deblurring from Blurry/Noisy Image Pairs

219

where, p is the pixel location, tthe distance from each pixel to p, c the nor∞ malization factor given by c = −∞ w(t)dt and uθ = (cosθ, sinθ), a unit vector in the direction of θ. Here, w(t) is the Gaussian window function given by w(t) = exp(−t2 /2σ 2 ), where σ is the standard deviation. According to the value of the angle θ we can implement different directional filters. The value of θ should be as small as possible if the intensity of noise is very high and vice versa. The directional filter has the property that it only denoises the image without affecting its statistical properties. 2.2

Motion Estimation

The non-uniformly blurred image Ib is divided into non-overlapping blocks of size N × N. This can be a user defined variable depending on the nature of the image. Each of the N × N block in the blurred image is compared with the corresponding block as well as its adjacent neighbours in the reference denoised image and the best match is identified. Usually the blocks are displaced to its immediate neighbours and so we can find a motion vector, representing the motion occurred, which has an x-coordinate and a y-coordinate for the new location and therefore the motion vectors are of dimension 1 × 2. This vector models the movement of that block from one location to another. This process is repeated for each and every blocks in the test image and the motion vectors are calculated. 2.3

Segmentation

From the above step, we get motion vectors corresponding to each block in the blurred image which indicates the direction in which the motion has occured. Different distinct motion vectors are generated according to the non-uniformity in the motion blur in the image. Image blocks having the same motion vectors are concatenated and the remaining block positions are filled with the average image intensity so as to form a unique image in such a way that the block positions are unchanged. Similar images are formed corresponding to all unique motion vectors and the corresponding image blocks. Thus we get as many images as the number of unique motion vectors. The process is illustrated in Fig. 2 where we assume that there are only 3 unique motion vectors corresponding to the blurred image segment shown in Fig. 2(a). Here, all the image blocks A are assumed to have the motion vectors M1 , all blocks B have the motion vectors M2 and all blocks C have the motion vectors M3 . Here, the segmentation process will result in three different images as shown in Fig. 2(b), (c) and (d) respectively; where G represents the average intensity value. 2.4

Uniform Deblurring

The images generated above are corresponding to unique motion vectors and hence they are uniformly blurred images. Here, we use a two stage optimization

220

P. L. Deepa and C. V. Jiji

framework for deblurring the segmented images where first we find an unnatural representation of the latent image using l0 approximation. This is followed by a non-blind deconvolution using hyper-Laplacian prior. Unnatural Representation. Here we obtain an intermediate representation of the image called unnatural representation, and the blur kernel and the sharp image are estimated from this unnatural representation as shown in Fig. 3. For this, we use the framework proposed by Xu et al. [13] which is based on a loss function that approximates l0 cost into the objective function along with the bilateral regularization term, for preserving the fine details, given by   λ 2 2 U (X) = min (y − k ∗ x ˜ + v x ˜) + γ k ˜) + ψ0 (h x (3) 2 (˜ x,k)

Fig. 2. Segmentation process

where x ˜ is the unnatural representation of the image, ψ0 (.) is a very high pursuit function that regularizes the high frequency components by manipulating the gradient vectors by approximating the sparse function very closely, λ and γ are regularization weights and h and v are the gradient vectors in the horizontal and vertical respectively. The first term of the above objective function represents the data fidelity term, the second term approximates the l0 cost and the third term reduces the effect of noise in the kernel. In order to preserve the fine details we used the bilateral regularization term as mentioned in [11] given by ⎛ ⎞ d2 ‘ l2 ‘  λ − p2 − p 2 ⎝(xp − xp‘ )e 2σd e 2σl ⎠ (4) B(X) = 2 ‘ p N (p)

Non-uniform Deblurring from Blurry/Noisy Image Pairs

221

where p‘ N (p) is the neighbouring pixel with its intensity equals to xm‘ . dp‘ and lp‘ are the spatial distance and the difference of intensity value between the neighbour pixel m‘ and the center pixel m, respectively. σd and σl are constants to control the degree of smoothness. Therefore, our final loss function is given by L(X) = U (X) + B(X))

(5)

Non-blind Deconvolution. The image obtained after the kernel estimation process as above contains only the high contrast structures with less details. Thus this will not be the actual latent image. So for obtaining the final sharp image a non-blind deconvolution has to be performed with the estimated kernel. Here we use a hyper-Laplacian regularization [15] as given by

Fig. 3. Unnatural representation

2

min{wxm (x ∗ k − y) + x

λ ((h x)α + (v x)α )} 2

(6)

where, wxm = E[mx ]/2σx2 and α is the regularization weight. Here, wxm is the factor that takes care of the outliers present in the deblurred image, σx is the standard deviation and mx = 1 if the pixel is an outlier and mx = 0 if the pixel is an inlier. 2.5

Merging

After the above step, we get M sharp images corresponding to the M uniformly blurred images. These images are concatenated to form the final sharp image by selecting the relevant blocks from the M images.

222

P. L. Deepa and C. V. Jiji

Algorithm of the proposed method Inputs: Non-uniformly blurred image Ib and noisy image In . Denoising: Denoise In using directional filter to obtain Ip . Motion estimation: Divide In and Ip into non-overlapping blocks of size N xN . Compare the corresponding blocks and estimate the motion vectors. Segmentation: Form uniformly blurred images by the following steps. Concatenate blocks having same motion vectors in corresponding positions. Fill the remaining positions with the average values. We will get M uniformly blurred images if no: of motion vectors equals M . Uniform deblurring: Obtain the unnatural representation using the loss function. Perform for each of the sub-images. Parameter settings: λ[0, 1], σ 2 [100, 500] and γ ≤ 50. Non-blind deconvolution: Perform non-blind deconvolution. Remove outliers and blocking artifacts from the deblurred images using hyper-Laplacian regularization. Parameter settings: λ[0, 1], α[0.5, 0.8]. Merging: Form final deblurred image by merging all the deblurred images.

3

Implementation and Results

We have implemented our algorithm using the images from the database provided by the authors in [10] where blurred-noisy pairs are supplied, without any ground truth. The other dataset used for generating the synthetic images is from [14], where the authors provide a total of 15 images and 8 kernels. Here we show the results from experiments performed on selected 3 images. We generated the noisy version of each of them by adding gaussian noise and non-uniform version by using 3 different kernels out of the 8 given kernels. To denoise the noisy images, a series of directional filters which are 5◦ apart are applied along different directions. For estimating the motion as described in Sect. 2.2, we used N = 20. We obtained 3 to 5 unique motion vectors from these experiments. Table 1. Table showing the PSNR values of the proposed method when compared with Whyte et al. and Gu et al. for the 3 pairs of images Images PSNR (Whyte et al.) PSNR (Gu et al.) PSNR (Proposed method) 1

38.58 dB

40.84 dB

41.3 dB

2

35.24 dB

36.93 dB

37.27 dB

3

35.58 dB

36.97 dB

37.10 dB

Non-uniform Deblurring from Blurry/Noisy Image Pairs

223

Fig. 4. (a) Blurry image, (b) Noisy image (c) Denoised image, (d) Whyte et al. and (e) Proposed method (f)–(j) Segmented images corresponding to the first, second, third, fourth and fifth motion vectors respectively, (k)–(o) Corresponding deblurred images respectively and (p)–(t) Corresponding blur kernels respectively

224

P. L. Deepa and C. V. Jiji

Fig. 5. (a) Blurry image, (b) Whyte et al. (c) Gu et al. and (d) Proposed method

Fig. 6. (a) Blurry image, (b) Whyte et al. (c) Gu et al. and (d) Proposed method

Non-uniform Deblurring from Blurry/Noisy Image Pairs

225

The results are shown in Figs. 4 and 5. Figure 4(a) shows the input blurry image, Fig. 4(b) the noisy image, Fig. 4(c) the denoised output, Fig. 4(d) the output of Whyte et al. and Fig. 4(e) the output of the proposed method. Using the obtained motion vectors, we generated 5 different segmented images as explained in Sect. 2.3. One of the segmented images is shown in Fig. 4(f), the corresponding deblurred version is shown in Fig. 4(k) and the obtained kernel is shown in Fig. 4(p). Figures 4(h) to 4(j) show the second, third, fourth and fifth segmented images respectively, Fig. 4(l) to Fig. 4(o) their uniformly deblurred versions and Fig. 4(q) to Fig. 4(t) the corresponding blur kernels respectively. Figure 5(a) shows the blurred image, Fig. 5(b) shows the result of Whyte et al., Fig. 5(c) shows the result of Gu et al. and Fig. 5(d) the output of the proposed method. Figure 6(a) shows another blurred image, Fig. 6(b) shows the result of Whyte et al., Fig. 6(c) shows the result of Gu et al. and Fig. 6(d) the output of the proposed method. Table 1 shows the PSNR values obtained for the experiments on three synthetic images. From the table, it is evident that the proposed method outperforms the method proposed by Whyte et al. and Gu et al.

4

Conclusion

In this paper, we proposed an effective method for deblurring non-uniformly blurred images utilizing a noisy version of the same scene. In the three step beblurring process, we first estimated the motion vectors using a 3D block matching algorithm using the denoised version of the noisy image as reference. Using the unique motion vectors obtained, we generated uniformly blurred images which are further deblurred using an l0 based optimization framework incorporated with bilateral regularization. Finally these deblurred images are merged to form the final latent image. Experiments performed on natural and synthetic images show that the proposed method outperforms the existing state of the art deblurring method making use of blurred-noisy image pairs.

References 1. Levin, A., Weiss, Y., Durand, F., Freeman, W.T.: Understanding and evaluating blind deconvolution algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 33(12), 2354–2367 (2011) 2. Levin, A., Weiss, Y., Durand, F., Freeman, W.T.: Efficient marginal likelihood optimization in blind deconvolution. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2657–2664 (2011) 3. Xu, L., Jia, J.: Two-phase kernel estimation for robust motion deblurring. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6311, pp. 157–170. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-155499 12 4. Fergus, R., Singh, B., Hertzmann, A., Roweis, S.T., Freeman, W.: Removing camera shake from a single photograph. In: ACM Transactions on Graphics, SIGGRAPH 2006 Conference Proceedings, Boston, MA, vol. 25, no. 4, pp. 787–794 (2006)

226

P. L. Deepa and C. V. Jiji

5. Krishnan, D., Tay, T., Fergus, R.: Blind deconvolution using a normalized sparsity measure. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 233–240, June 2011 6. Chen, J., Yuan, L., Tang, C.K., Quan, L.: Robust dual motion deblurring. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 4, pp. 1–8 (2008) 7. Yuan, L., Sun, J., Quan, L., Shum, H.-Y.: Image deblurring with blurred/noisy image pairs. In: ACM Transactions on Graphics (Proceedings of the SIGGRAPH), vol. 26, no. 3, July 2007 8. Lee, S.H., Park, H.M., Hwang, S.Y.: Motion deblurring using edge map with blurry/noisy image pairs. Opt. Commun. 285(7), 1777–1786 (2012) 9. Li, H., Zhang, Y., Sun, J., Gong, D.: Joint motion deblurring with blurred/noisy image pair. In: International Conference on Pattern Recognition (ICPR), pp. 1020– 1024, August 2014 10. Whyte, O., Sivic, J., Zisserman, A., Ponce, J.: Non-uniform deblurring for shaken images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2010) 11. Gu, C., Lu, X., He, Y., Zhang, C.: Kernel-free image deblurring with a pair of blurred/noisy images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019) 12. Zhong, L., Cho, S., Metaxas, D., Paris, S., Wang, J.: Handling noise in single image deblurring using directional filters. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2013) 13. Xu, L., Zheng, S., Jia, J.: Unnatural L0 sparse representation for natural image deblurring. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1107–1114 (2013) 14. Hu, Z., Yang, M.H., Pan, J., Su, Z.: Deblurring text images via L0 regularized intensity and gradient prior. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2901–2908 (2014) 15. Krishnan, D., Fergus, R.: Fast image deconvolution using hyper-Laplacian priors. In: NIPS, pp. 1033–1041 (2009) 16. Pan, J., Lin, Z., Su, Z., Yang, M.-H.: Robust Kernel estimation with outliers handling for image deblurring. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

An Effective Video Bootleg Detection Algorithm Based on Noise Analysis in Frequency Domain Preeti Mehta1(B) , Sushila Maheshkar1 , and Vikas Maheshkar2 1

2

National Institute of Technology, New Delhi, India {preetimehta,sushila}@nitdelhi.ac.in Netaji Subhas Institute of Technology, New Delhi, India [email protected]

Abstract. Nowadays everybody has mobile phone, tablet and other video capturing devices containing high quality cameras. This enable them to recapture the videos from other imitating media such as projectors, the LCD screens etc. Recently, video piracy has become a major criminal enterprise. So, in order to combat this uprising threat of video piracy, content owners and the enforcement agencies such as Motion Pictures Association, have to continuously work hard on video copyright protection laws. This is one of the major reasons why digital forensics has considered recaptured video detection an important problem. This paper presents a simple and an effective mechanism for recaptured video detection which is based upon the noise analysis in the frequency domain. The features adopted are mean, variance, kurtosis and mean square error. These features are calculated on the mean strip extracted from logarithmic magnitude Fourier plot on complete video length. Keywords: Video piracy · Video recapture detection Frequency domain · SVM classifier

1

· Noise analysis ·

Introduction

With global internet availability and accessibility than ever before, a substantial amount of data and information are shared on internet. This sharing of data creates issues when some unauthorized person accesses the authorized data and leaks it before actual data is available on internet. In addition, people nowadays communicate with each other without time and space limitations over internet. Thus, the leaked data spreads in the network very quickly. This leaked digital content spreads rapidly through illegal pathways over internet. Multimedia, characteristically videos have become a part and parcel of human life. In particular, movie has been the most popular source of leisure activity in today’s life of every individual human being. There is no boundation on age and gender type who prefer movies as a means of entertainment. Leaving aside c Springer Nature Singapore Pte Ltd. 2020  N. Nain et al. (Eds.): CVIP 2019, CCIS 1147, pp. 227–238, 2020. https://doi.org/10.1007/978-981-15-4015-8_20

228

P. Mehta et al.

the feature of recreation activity, movies/videos plays a crucial role in transmitting information more effectively in comparison with still digital images. Video, in any form can be of huge profitable value. In film industry worldwide, pirated movies have resulted in significant problems. Also, sometimes the original video is edited to perform a spoof attack. Spoofing attack mainly occurs when an attacker tries to masquerade as someone else in order to falsify the data that are captured by the sensors in an attempt to bypass the security system. Thus, in order to prevent the attacker from attacking we need to detect whether the video is original or recaptured video. Rights holders, technology and security partners, content owners and law enforcement agencies are always working hard to combat the threat of piracy. Movie piracy is increasing with the technology growth and has developed into a complete criminal enterprise. This situation has been uncontrollably rising from the last decade resulting in a huge revenue loss in terms of economy. According to the survey conducted by MPA and Lek [1] in 2005, 90% of the piracy results from the camcorder bootleg videos. In camcorder piracy, the pirate person brings a camcorder device into a theatre and document a movie from the projection screen. There have several cases in which the film producer has filed various cases to court to block various websites and URLs to prevent illegal access of films. For example, in July 2016, the Mumbai High Court blocked “134 web links and URLs” on the plea of film producers of movie named ‘Dishoom’, seeking orders against the illegal distribution of their films through online piracy. 1.1

Motivation

The popularity of internet activities has promoted to an unparallel growth of sharing of multimedia information. Applications such as shareit, whatsap, facebook etc. give cost free and smooth file sharing facilities. On-site videos and pictures of fatal epidemic, sports event, warfare, political rallies, and accidents can reach billions of people instantly online. The movie release in theatre can be found in a short duration of time on pirate sites. In addition to digital image forensics, the detection of near-duplication or photographic-copying for images or videos is an emerging and an important forensic problem. The practical problems arising from recapturing can be of biometric spoofing, video piracy or for fraudulent purpose. The statistical report showing the piracy revenue loss and lack of research in that area to counterclaim the loss is infinitesimal. Our research will benefit the entertainment industry as well as its subpart can be used as solution for face spoofing problem. The rest of this paper is organised as follows. In Sect. 2, related literature survey is summarized. Section 3 describe video recapture dataset in detailed. Section 4 described the proposed methodology for recapture video detection followed by experimental results evaluated on our dataset in Sect. 5. We conclude this paper in Sect. 6 followed by references.

Video Bootleg

2

229

State-of-the-Art

The following papers focus on the work of face spoofing problem or detection of recapturing still images from the LCD screen. In [2], Cao and Kot proposed an algorithm based on fine texture patterns introduced by the LCD screen on recaptured images. The texture patterns difference was detected by using multiscale local binary pattern operator and multiscale wavelet statistical parameters. They were used to model the loss of details while recapturing. They have also used colour moments for colour distortion artifacts. Wang [3] in his paper has proposed a simple technique based on pixel-wise correlation coefficients calculation in an image differential domain for recapture classification. Thongkamwitoon, Muammar and Dragotti [4], have proposed an algorithm based on blurriness artifact. They have used two parameters namely, an average line spread width and a sparse representation error for training dictionaries for the classification. Many papers were available based on deep learning approach for recapture from the LCD screen detection [5,6]. Yang, Ni and Zhao [5] have given a CNN architecture for feature extraction and classification. Also, authors have concluded with addition of pre-processing layer and use of other neural networking technique in addition with CNN can improve the classification accuracy. Li, Wang and Kot [6], have used CNN with RNN to extract both inter and intra dependenices for classification of recaptured still images. In all the papers mentioned above either the ICL database, NTU-ROSE or Astar database is used. The work done in video piracy area is basically divided into three categories. Firstly, some research approaches represent the methodology for classifying the recaptured video from the original videos. Second category of work is based on watermark embedment for copyright verification of original work and third are the research done for finding the pirate position in the cinema hall. Kumar, Manikanta and Srinivas [7], have used face features to embed coded bits in the movie for encryption. Their work is divided into two parts; first to extract face features which includes number of faces that appear in movie, the time of occurrence of each face and the frequency at which it appears and phase two is to embed bits. For that they have used LSB algorithm. Zhang and Zhang in their paper [8], have used DCT, DWT and neural network techniques to watermark the video. They have used neural network to strengthen the embedded watermark. In paper [9] by Nakashima, Tachibana and Babaguchi, research algorithm based on spread-spectrum audio watermarking for the multi-channel movie sound tracker is proposed. They have also presented a position estimation system for finding the pirate position estimation relative to speakers. The algorithm works significantly with the mean estimation error of 0.44 m without remarkably impairing the sound quality. Wang and Farid [10], used the fact that internal parameters of camera; basically, camera skewness is non-zero in recaptured video. They have calculated the camera’s intrinsic parameter from two or more video frames to differentiate between recaptured or non-recaptured videos.

230

P. Mehta et al.

Instead of embedding a watermark, Zhai and Wu [11] has used a temporal psychovisual modulation technique to visually deteriorate the recorded movie content while achieving visual transparency of added interfering signals to the audience. Roopalakshmi [12], proposed a novel approach for pirate position estimation using only content-based visual and audio fingerprints without watermarking encryption. The results estimate the pirate position with a mean absolute error of (38.25, 22.45, 11.11) cm. In this section we have provided a literature survey of recapture multimedia detection techniques. Given the focus of research paper on recapture video detection problem specifically video piracy issue, more emphasis is given to the analysis of fingerprint left on videos during the recapture process.

3

Recapture Video Dataset

In the context of our research and the non-availability of video recaptured dataset, we present in this section the description of the acquisition of large and diverse set of recapture videos by the LCD and the projection screen of varying resolution. To create the variation among the dataset, videos were captured in both indoor and outdoor environmental conditions with the cameras in auto mode. To create variation in recaptured dataset we tried to vary basically three parameters (camera parameters, environmental parameters and screen parameters). The basic environment setup for recapturing is shown in Fig. 1.

Fig. 1. Block diagram of experiment setup

The dataset consisted of both original videos (single captured videos) and recaptured videos. Original dataset consisted of 80 videos; sets of 10 videos captured by 8 different mobile cameras.

Video Bootleg

231

Table 1. Specification of cameras used for capturing original videos S.No. Model

Screen resolution (pixels)

Frame rate (fps)

Camera resolution (MP)

C1

Lenovo P2

1080 × 1920

30

13

C2

Redmi Note 4

1080 × 1920

30

13

C3

Redmi Note 5

1080 × 2160

30

13

C4

Iphone 7

1334 × 750

30

12

C5

Samsung S6 Edge

1440 × 2560

30

16

C6

Oppo Realme One 1080 × 2160

30

12

C7

Moto G 2015

1280 × 720

30

13

C8

Xiaomi Mi Max 2

1280 × 720

30

12

The specification of cameras used for capturing the original video is shown in Table 1. The camera resolution varies from 12MP to 16MP with varying screen resolution. In order for creation of recaptured dataset two different cameras were used and three different imitating mediums (screens on which original videos were projected) were used which consisted of two the LCD screens and one the projector screen. Total of 640 videos (480 + 160) were recaptured from the LCD screens and the projected screen. The specification details for the cameras and imitating materials are shown in Tables 2 and 3 respectively. Table 2. Specification of cameras used for recaptured videos S.No. Model

Screen resolution Frame Camera (pixels) rate (fps) resolution (MP)

1

Nikon D3200

1920 × 1080

30

24.2

2

Sony Lens G (DSC-HX 20V) 4896 × 3672

30

18

3

Xiaomi Mi Max 2

30

12

1280 × 720

Table 3. Specification of imitating mediums used for recaptured videos S.No. Model

Screen resolution (pixels)

Frame rate (fps)

1

Acer Predator Helio 300 1920 × 1080

60

2

IPAD Pro 17 4

2224 × 1668

120

3

EPSON EB-X31

1024 × 768

30

The results on this dataset is explained in Sect. 5. The obtained dataset is divided into two parts i.e. testing and training. Training dataset comprises of 70% of total dataset, while testing dataset comprises of 30% of total dataset.

232

4

P. Mehta et al.

Proposed Methodology

In this section, the proposed methodology is explained in detail with the help of block diagram and algorithm. The block diagram is shown in Fig. 2. Video processing is explained in further steps. Training

70%

Frame Extraction

Noise Residual Image

Fourier Transform

Visual Rhythm

Mean Strip

Feature Extraction

Video Processing

Dataset

Testing Feature Extraction

Video Processing 30%

SVM Classifier

Detection Results

Dataset

Fig. 2. Block diagram of proposed methodology

Let us define V as a video containing sequence of the frames f such that every frame in the sequence can be defined as f (x, y) ∈ N 2 which depict the intensity value of pixel positioned at (x, y) of the frame. Now, in order to separate the noise from the frames, we take another copy of the frame Vcopy and pass it through the low pass filter. To obtain the new frame containing only noises (Vnoise ), we perform subtraction between the original frame and (Vcopy ) as shown in Eq. 1. Vnoise = V − f ilter(Vcopy )

(1)

From these new Vnoise frames, a new residual video is obtained. Frome this video, to analyse the noise patterns, Discrete Fourier Transform (DFT) on each frame is applied using Eqs. 2–3. Also, logarithmic magnitude plot is calculated in such a way that origin lies at the centre of frame. This is obtained by using Eq. 4. N M   Vnoise (m, n)e−j2π[(um/M )+(vn/N )] (2) V(u, v) = m=1 n=1

|V(u, v)| =

 (u, v)2 + (u, v)2

P(u, v) = log(1 + |V(u, v)|)

(3) (4)

The difference between original and recaptured frame logarithmic magnitude plot is shown in Fig. 3. It can be noted that the highest response difference between original and recapture frequency response mainly centred on the abscissa and ordinated axes. Taling into account the problem, we extract the set

Video Bootleg

(a): Original

233

(b): Recapture

Fig. 3. Logarithmic magnitude plot of frame sample

of horizontal and vertical strip from centre of width 21 pixels from each Vnoise frame and concatenated into single image. This will simplify the video into a single image with temporal noise features. The single image obtained is converted into Mean Strip (M S). Mean strip formation is as follows: 1. Divide the obtained image into Sk number of non-overlapping strips vertically of W pixel width, where k indicates the index of the strip. 2. The position of the pixel at coordinates (i, j) in the k th strip is represented by (i, j, k). 3. The mean strip is calculated by averaging each pixel at position (i, j) in each strip Sk .

(a): Input Video

(b): Concatenation of Ordinate Region

Fig. 4. (a) Video frames (b) DFT plot and concatenation of ordinate (vertical) axis

234

P. Mehta et al.

The logarithm magnitude fourier plot of Vnoise frames extracted from the original video V shown in Fig. 4(a), is represented in Fig. 4(b). The abscissa and ordinate axis of width 21 units is extracted from each frame and concatenated together to form a single image. The concatenation of abscissa is performed by rotating the strip by right angle. Total of four features are extracted from the Mean Strip (M S) which are Mean (M ), Variance (V ), Kurtosis (K) and Mean Square Error (M SE) given by the Eqs. 5–8. R C 1  P(r,c) R × C r=1 c=1

(5)

V =

R C 1  (P(r,c) − M )2 R × C r=1 c=1

(6)

K=

C R   1 (P(r,c) − M )4 R × C × V 2 r=1 c=1

(7)

R C 1  (P(r,c) − PC/2 ) R × C r=1 c=1

(8)

M=

M SE =

Where, R × C = Dimension of Mean Strip PC/2 = Center Strip in Mean Strip Algorithm 1: Temporal Residual Noise Features Extraction Input: V is a set of original and recaptured videos, F are the extracted frames from each video Output: M is the mean value, K is the kurtosis value, M SE is the mean square value and V is the variance value parameter for each Input Video V do extract F frames; foreach F frames do Fg = rgb2gray(F); Ff ilter = GaussianFilter(Fg ); Vnoise = Fg − Ff ilter ; magPlot = log(1 + abs(df t(Vnoise ))); compute verticalStrip(magPlot); compute horizontalStrip(magPlot); VR = cat(verticalStrip); HR = cat(horizontalStrip); end M Sv = MeanStrip(VR); M Sh = MeanStrip(HR); compute M(M Sv ), V(M Sv ), MSE(M Sv ), K(M Sv ), M(M Sh ), V(M Sh ), MSE(M Sh ), K(M Sh ) values; end

The acquire features selected should be differentitative enough to considered the difference between the two kinds of videos under examination, along with the robustness characteristics. In order to assure both constraints, we choose to

Video Bootleg

235

compute uncomplicated statistical parameters for the resultant Mean Strip (MS) generated. Also, considering the necessity of being discriminative, preferably the features should be independent of the content present, but should be responsive to the two classes i.e. original and recaptured videos. To make the framework content independent, low-frequency components were eliminated and DFT of noisy video frames (Vnoisy ) is performed. The proposed methodology is given in Algorithm 1.

5

Experimental Result

In this section, based on the proposed methodology and simulation, results are presented. Simulations has been carried out with MATLAB 2015a on Windows R Xeon(R) CPU E5-2620 v2 @ 8 platform using GPU with configuration Intel 2.10 GHz × 12 on the dataset already explained in Sect. 3. The results obtained are shown with the help of confusion matrix and accuracy. Terms used in confusion matrix is: TP: True Positive, TN: True Negative, FP: False Positive, and FN: False Negative The formula used for calculating accuracy is shown in Eq. 9. Acc =

TP + TN TP + FP + TN + FN

(9)

C8, C9, C10 indicates the camera devices which are used for the capturing and recapturing purposes as mentioned in the equipment used Table 2. P, S1, S2 indicates the imitating medium used for the recapturing purposes as mentioned in the Table 3. We have considered starting 100 frames from the video and width of each strip is 21 pixels from the centre of each frame for extracting the Vertical and horizontal Strips. The step by step processing for vertical information extraction is shown in Fig. 5. Similar steps are followed for horizontal information extraction. For horizontal information, instead of extracting ordinates we extract abscissa of 21 pixel width strip from the centre. The results are obtained with SVM classifier with radial basis function kernel on partial and full dataset for vertical and horizontal mean strip as shown in Tables 4 and 5 respectively. The accuracy varies from 49% to 100% mainly due to the diversity in content detail. In addition, we notice that the accuracy results obtained from horizontal mean strip is more accurate than the vertical mean strip. This is due to the presence of aliasing frequencies on abscissa axis. The results for the four features extracted from the mean strip of concatenated frequency strips are shown in Table 6. The highest accuracy is obtained by MSE when used individually, second highest is obtained by mean and then by variance and lowest was obtained by kurtosis. The accuracy of horizontal textual information using all features combined is much reliable.

236

P. Mehta et al.

(a): Input Frame

(e): Logarithm Magnitude Plot

(b): Grayscale Frame (c): Filtered Frame (d): Residual Noisy Frame

(f): Vertical Strip

(g): Concatenated Strips

(h): Mean Strip

Fig. 5. The illustration steps of the method for vertical strips concatenation Table 4. Results obtained on partial and full dataset for vertical mean strip Dataset Confusion matrix Accuracy % TP TN FP FN C8P

54

33

4

9

95

C9P

56

33

2

4

97

C9S1

56

44

2

12

91

C9S2

57

22

1

17

90

C10S1

46

40 12

15

49

C10S2

58

0

0

100

Total

6

288 52

0

85

59

Table 5. Results obtained on partial and full dataset for horizontal mean strip Dataset Confusion matrix Accuracy % TP TN FP FN C8P

88

46 0

11

92.4

C9P

88

71 0

0

100

C9S1

88

77 0

7

98.0

C9S2

88

85 0

3

98.3

C10S1

88

74 0

5

97.0

C10S2

87

74 1

10

93.6

Total

87

105 1

8

95.5

Video Bootleg

237

Table 6. Result for different features Accuracy % Mean Variance Kurtosis MSE Overall

6

Vertical

86.3

76.6

67.7

90.0

90.2

Horizontal

81.1

92.5

77.6

98.0

95.5

Conclusion

In this paper we examined the video bootleg problem and presented a methodology for detecting video near-duplication problem by analysing temporal noise artifacts generated by the video recapturing process through analysis of frequency response. The experiments carried out exhibit that the logarithmic magnitude fourier spectrum of video noise can distinguish between the original and pirated videos. As the fourier transform is not scale invariant, the proposed methodology will not performed well if any rotation, scaling or translation on the recaptured video is done. In addition, the extracted mean strip and feature descriptors compact the dimensionality of the video. The reduced representation achieved will help us in real-time computation framework.

References 1. Motion Picture Association of America: Us piracy fact sheet (2005). https://www. mpaa.org/USPiracyFactSheet.pdf 2. Cao, H., Kot, A.C.: Identification of recaptured photographs on LCD screens. In: 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), pp. 1790–1793. IEEE (2010) 3. Wang, K.: A simple and effective image-statistics-based approach to detecting recaptured images from LCD screens. Digital Invest. 23, 75–87 (2017) 4. Thongkamwitoon, T., Muammar, H., Dragotti, P.-L.: An image recapture detection algorithm based on learning dictionaries of edge profiles. IEEE Trans. Inf. Forensics Secur. 10(5), 953–968 (2015) 5. Yang, P., Ni, R., Zhao, Y.: Recapture image forensics based on Laplacian convolutional neural networks. In: Shi, Y.Q., Kim, H.J., Perez-Gonzalez, F., Liu, F. (eds.) IWDW 2016. LNCS, vol. 10082, pp. 119–128. Springer, Cham (2017). https://doi. org/10.1007/978-3-319-53465-7 9 6. Li, H., Wang, S., Kot, A.C.: Image recapture detection with convolutional and recurrent neural networks. Electron. Imaging 2017(7), 87–91 (2017) 7. Kumar, G.S., Manikanta, G., Srinivas, B.: A novel framework for video content infringement detection and prevention. In: 2013 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 424–429. IEEE (2013) 8. Zhang, Y., Zhang, Y.: Research on video copyright protection system. In: 2012 2nd International Conference on Consumer Electronics, Communications and Networks (CECNet), pp. 1277–1281. IEEE (2012)

238

P. Mehta et al.

9. Nakashima, Y., Tachibana, R., Babaguchi, N.: Watermarked movie soundtrack finds the position of the camcorder in a theater. IEEE Trans. Multimedia 11(3), 443–454 (2009) 10. Wang, W., Farid, H.: Detecting re-projected video. In: Solanki, K., Sullivan, K., Madhow, U. (eds.) IH 2008. LNCS, vol. 5284, pp. 72–86. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-88961-8 6 11. Zhai, G., Wu, X.: Defeating camcorder piracy by temporal psychovisual modulation. J. Disp. Technol. 10(9), 754–757 (2014) 12. Roopalakshmi, R.: A brand new application of visual-audio fingerprints: estimating the position of the pirate in a theater-a case study. Image Vis. Comput. 76, 48–63 (2018)

A Novel Approach for Non Uniformity Correction in IR Focal Plane Arrays Nikhil Kumar(B) , Meenakshi Massey, and Neeta Kandpal Instruments Research and Development Establishment, Defence Research and Development Organization, Dehradun 248008, India [email protected]

Abstract. High exigency of sophisticated state-of-the-art Infra Red (IR) cameras has witnessed a proliferation of larger format IR Focal Plane Arrays (FPAs) arranged in a 2D grid of photo detectors placed at focal plane of imaging system. Current IR FPAs are performancerestricted by variations in responses of individual detector elements resulting in spatial non-uniformity even for a uniform radiation source as scene. Generally, two-point Non Uniformity Correction (NUC) is the preferred technique to cater for this problem. This technique is limited by variations in performance over the entire operating range causing the residual non uniformity (RNU) to vary more on moving farther from the calibrated points. In the present approach, an integration time based NUC method is proposed using least square regression; wherein photo responses of individual detector elements at different integration times are measured when the IR camera is subjected to any uniform radiation source. A close linear approximation of these responses at various integration times is examined in the form of the best fit line by minimizing sum of square of errors. Consequently, a gain and offset value for each detector element is generated and stored in memory, which is utilized during real time to display the NUC corrected image. The results of present INT-LSR-NUC exhibit a considerable gain in performance when evaluated against conventional two-point NUC. Keywords: Non Uniformity Correction (NUC) · Infra Red (IR) · Focal Plane Array (FPA) · Residual Non Uniformity (RNU) · Least Square Regression (LSR) · Integration time (INT)

1

Introduction

Contemporary thermal imaging systems [11] employ highly sensitive, larger format Infra Red (IR) focal plane arrays (FPA) which consist of a number of photo detectors [13] in a particular geometry at the focal plane [12] of an optical system. Mismatches in the photo response of the individual photo detectors and parameter variations cause unavoidable non-uniformities resulting in the superposition of a fixed pattern noise on the image. Spatial non-uniformities in the photo response of the individual detecting element can lead to unusable images c Springer Nature Singapore Pte Ltd. 2020  N. Nain et al. (Eds.): CVIP 2019, CCIS 1147, pp. 239–247, 2020. https://doi.org/10.1007/978-981-15-4015-8_21

240

N. Kumar et al.

in their raw state. Contemporary NUC techniques can be broadly divided in two primary categories [7]: 1. Reference based correction techniques [2,5,7] 2. Scene based techniques [1,3,4] Two-point NUC is one of the most popular reference based correction technique, where in radiometric response of system is captured at two distinct temperatures/integration times keeping a uniform source as scene. However, two point NUC is limited by variations in performance over the entire operating range causing the residual non uniformity (RNU) to vary more on moving farther from the calibrated points. In the present approach, the radiometric responses of the thermal imaging system are captured at various integration times points within the entire operating range, instead of the usual two integration times. In place of linear interpolation [9] between these two points a linear approximation for complete range is explored with error minimization [8]. Rest of the paper is organized as follows. Related work model is presented in Sect. 2. In Sect. 3, methodology of proposed approach is elaborated. Section 4, includes results and analysis. A comparison of the proposed approach with the existing and very popular two-point NUC is also given in this section. Authors have made an attempt to conclude the work in Sect. 5.

2

Related Work

Hardie et al. [1] have proposed a simple, relatively less complex, scene based NUC algorithm that deals with relatively low levels of non-uniformities. This method exploits global motion between frames in a sequence to trace the true scene value along a motion of trajectory of pixels. Assuming the gain and biases of the detector elements to be uncorrelated along the trajectory, an average of these pixel values is calculated for different scene values for each detector. The observed pixel values and corresponding estimates of true scene values form the points used in line fitting. Scribner et al. [3] have suggested a scene based NUC method, using neural network approach that has the ability of adapting sensor parameters over time on a frame by frame basis. Torres et al. [4] have developed an enhanced adaptive scene based NUC method based on Scribner’s adaptive NUC technique. This technique is improved by the addition of optimization techniques like momentum, regularization and adaptive learning rate. Kumar [2] has proposed an Infrared staring sensor model and mathematically analyzed output of any pixel in terms its unique non-uniformities. Apart from this an overview of processing algorithms for correcting the sensor nonuniformities based upon calibration as well as scene based methods to correct gain and offset parameters has been presented. Hardware implementation architecture of both types of algorithms has also been discussed in a comprehensive manner.

A Novel Approach for Non Uniformity Correction in IR Focal Plane Arrays

241

Khare et al. [5] have discussed NUC correction technique for additive and multiplicative parameters and their implementation in reconfigurable hardware for Long Wave Infrared (LWIR) imaging systems. Kumar et al. [7] have discussed calibration based two point NUC correction technique and its implementation in reconfigurable hardware for Mid Wave Infrared (MWIR) imaging systems based upon 320X256 detecting element based IRFPA. Kay [8] describes least square approaches as an attempt to minimize squared difference between given data and assumed model or noiseless data and mentions that as no probabilistic assumptions have been made about data hence method is equally valid for gaussian and non gaussian noises.

3

Methodology

If y1ij , y2ij , y3ij , .......ymij ......ynij are n radiometric responses of any (i, j)th detecting element of FPA of MWIR imaging system with M × N InSb detecting elements, collected by capturing a uniform scene at n distinct integration times t1 , t2 , t3 , .....tm ......tn respectively, then: 1. Let the response of any (i, j)th detecting element of MWIR FPA captured at any particular integration time tm , may be expressed in the form of a linear equation as following: ymij = aij xmij + bij (1) where aij and bij are the gain and offset non-uniformities associated with the (i, j)th detecting element of FPA, respectively. Assuming uniform distribution of the incoming photon flux over complete FPA, value of xmij at any particular integration time tm may be defined as mean value of radiometric response of all detecting elements at that integration time. ⎛ ⎞ M  N  1 ⎝ (2) xm ⎠ xm = xmij = M × N i=1 j=1 ij Using Eq. 2, Eq. 1 may have an alternate representation: ymij = aij xm + bij

(3)

2. Following is the matrix representation of radiometric responses of (i, j)th detecting element of FPA at n distinct integration times: ⎡ ⎤ ⎡ ⎤ x1 1 y1ij ⎢ y2ij ⎥ ⎢ x2 1⎥ ⎢ ⎥ ⎢ ⎥ ⎢ y3ij ⎥ ⎢ x3 1⎥ ⎢ ⎥ ⎢ ⎥  ⎢ .. ⎥ ⎢ .. .. ⎥ aij ⎢ . ⎥ = ⎢ . .⎥ ∗ (4) ⎢ ⎥ ⎢ ⎥ bij ⎢ymij ⎥ ⎢xm 1⎥ ⎢ ⎥ ⎢ ⎥ ⎢ . ⎥ ⎢ . .⎥ ⎣ .. ⎦ ⎣ .. .. ⎦ ynij xn 1

242

N. Kumar et al.

3. Let

⎡ ⎤ x1 y1ij ⎢ x2 ⎢ y2ij ⎥ ⎢ ⎥ ⎢ ⎢ x3 ⎢ y3ij ⎥ ⎢ ⎥ ⎢ ⎢ . ⎢ .. ⎥ ⎥ , Φ = ⎢ .. . Υij = ⎢ ⎢ ⎥ ⎢ ⎢xm ⎢ymij ⎥ ⎢ ⎥ ⎢ ⎢ . ⎢ . ⎥ ⎣ .. ⎣ .. ⎦ ynij xn ⎡

⎤ 1 1⎥ ⎥ 1⎥ ⎥

 .. ⎥ aij .⎥ , Θ = ⎥ ij bij 1⎥ ⎥ .. ⎥ .⎦ 1

(5)

then Eq. 4 may be represented as following: Υij = Φ ∗ Θij

(6)

4. Least Square estimate [8] Θˆij of vector parameter Θij of Eq. 6 is following: −1 Θˆij = (ΦT Φ) ΦT Υij

(7)

In this way the gain and the offset values for all other detecting elements of FPA may be estimated. 5. After NUC correction Eq. 1 can be expressed as following: xmij = aij ymij + bij where aij = and

(8)

1 ; aij

(9)

aij bij

(10)

bij = −

Estimated gain and offset values are utilized for calculation of aij and bij . These values are stored in non volatile memory in the form of a table. During the image formation process Eq. 8 is implemented in real time to generate non-uniformity corrected video sequences. 6. Apart from doing non-uniformity correction present algorithm also suggests a mechanism for identification of defective pixels [7,10]. Suppose any set Im = {ym11 , ym12 .......ymij ...........ymM N } represents radiometric responses captured from FPA with M × N detecting elements at any fixed integration time tm with spatial mean μ and standard deviation σ then let ∀i ∈ {1, 2, 3, 4....M } and ∀j ∈ {1, 2, 3, 4....N }; if ∃ any ymij : μ − 3σ > ymij > μ + 3σ

(11)

A set of bad pixels Bm can be formulated with all such ymij s identified above at that integration time tm .

A Novel Approach for Non Uniformity Correction in IR Focal Plane Arrays

4

243

Results and Analysis

Responses of a 640X512 InSb detecting elements based MWIR imaging system are captured in the range of 1 ms to 19 ms at a fixed interval of 1 ms. A minimum error linear approximation to these responses is explored with help of least square estimation for each detecting element. Residual Non Uniformity (RNU)  defined as following is considered as performance measure: =

σ μ

(12)

where σ is the standard deviation of any considered frame of size M × N defined as following:  ⎛ ⎞  M  N    1 2 ⎝ (13) σ= (xi,j − μ) ⎠ M × N i=1 j=1 and μ is spatial mean of considered frame. Two point NUC [5] is performed by considering responses of imaging system at 4 ms and 16 ms integration times. Values of  as in Eq. 12 are calculated at different integration times and plotted as in Fig. 1(a). INT-LSR-NUC can be declared as conqueror for most of the considered integration time ranges in the plot except few areas in vicinity to points chosen for two-point NUC. As two-point NUC is curve fitting based approach, hence this type of behavior of  is trivial. Similarly in Fig. 1(b) plots for spread of gray levels with respect to integration time variation are shown. It has been assumed that most of the signal of relevance lies within μ ± 3σ band and spread of this has been shown for both type of NUCs. One can easily conclude that spread of μ ± 3σ band is lesser in case of INT-LSR-NUC as these frames are more uniform than frames after two point-NUC. Figure 2 is representing response of FPA with an uniform source at 10 ms integration time as input scene; one can observe that there are a-lot of peaks in response at various places. By virtue of Eq. 11, when response at any location of FPA crosses μ ± 3σ band, detecting element of that location can be declared as defective element. A proper defective pixel replacement mechanism is invoked to get rid of this defect. Authors have detected 37 defective detecting elements in present work using above criterion. In Fig. 2(b) and (c) response is becoming more uniform as two point and INT-LSR-NUC has been performed respectively.

244

N. Kumar et al.

Fig. 1. (a) *Comparison of RNUs after two-point NUC and INT-LSR-NUC at various temperatures (b) *Comparison of spread of graylevels (μ ± 3σ band) of the scene after two-point NUC and INT-LSR-NUC at integration times from 1 ms–19 ms (*for a uniform radiation source as scene)

A Novel Approach for Non Uniformity Correction in IR Focal Plane Arrays

245

Fig. 2. (a) Uncorrected response of FPA (standard deviation σraw = 751.83 and RNU raw = 0.0467) (b) Response of same FPA after two-point NUC (standard deviation σtp = 427.09 and RNU tp = 0.026) (c) Response of same FPA after INT-LSR-NUC (standard deviation σlsr = 154.8 and RNU lsr = 0.0096) Peaks are representing locations of defective detecting elements.

246

5

N. Kumar et al.

Conclusion

In this approach rather than simple two point line fitting, a linear approximation in responses is explored and a line with minimum error values is chosen. Twopoint NUC only considers two points and makes error values zero there without worrying about RNU minimization at other points. In the present approach there is no guarantee that error will be zero at any point but it will assign an optimal value at each considered point. Fortunately least square estimation is a very old and well established algorithm. This makes implementation aspect of present model very simple. The results are promising and since the model is extremely simple, it can be applied for real time system realization. Acknowledgment. The authors express their sincere gratitude to Mr. Benjamin Lionel, Director, IRDE for his constant motivation and support as well as permission to publish this work. He has always inspired the authors towards innovation and adopting creative and simple approaches for solving difficult problems.

References 1. Hardie, R.C., et al.: Scene-based non-uniformity correction with video sequences and registration. Appl. Opt. 39(8), 1241–1250 (2000) 2. Kumar, A.: Sensor non uniformity correction algorithms and its real time implementation for infrared focal plane array-based thermal imaging system. Defence Sci. J., 63(6) (2013) 3. Scribner, D.A., et al.: Adaptive non-uniformity correction for IR focal-plane arrays using neural networks. In: Infrared Sensors: Detectors, Electronics, and Signal Processing, vol. 1541. International Society for Optics and Photonics (1991) 4. Torres, S.N., et al.: Adaptive scene-based non-uniformity correction method for infra-red focal plane arrays. In: Infrared Imaging Systems: Design, Analysis, Modeling, and Testing XIV, vol. 5076. International Society for Optics and Photonics (2003) 5. Khare, S., Kaushik, B.K., Singh, M., Purohit, M., Singh, H.: Reconfigurable architecture-based implementation of non-uniformity correction for long wave IR sensors. In: Raman, B., Kumar, S., Roy, P.P., Sen, D. (eds.) Proceedings of International Conference on Computer Vision and Image Processing. AISC, vol. 459, pp. 23–34. Springer, Singapore (2017). https://doi.org/10.1007/978-981-10-210463 6. Wang, Q., et al.: A new scene-based non-uniformity correction algorithm for infrared focal plane array. J. Phys. Conf. Ser., 48(1) (2006) 7. Kumar, A., Sarkar, S., Agarwal, R.P.: A novel algorithm and hardware implementation for correcting sensor non-uniformities in infra-red focal plane array based staring system. Infrared Phys. Technol. 50(1), 9–13 (2007) 8. Kay, S.M.: Fundamentals of Statistical Signal Processing, Estimation Theory, vol. 1. PTR Prentice-Hall, Englewood Cliffs (1993) 9. Norton, P.R., et al.: Third-generation infra-red imagers. In: Infrared Technology and Applications XXVI, vol. 4130. International Society for Optics and Photonics (2000)

A Novel Approach for Non Uniformity Correction in IR Focal Plane Arrays

247

10. Lopez-Alonso, J.M.: Bad pixel identification by means of principal components analysis. Opt. Eng. 41(9), 2152–2158 (2002) 11. Hudson, R.D.: Infrared System Engineering, vol. 1. Wiley-Interscience, New York (1969) 12. Singh, R.N.: Thermal Imaging Technology: Design and Applications. Universities Press, Hyderabad (2009) 13. Kruse, P.W., Skatrud, D.D.: Uncooled Infrared Imaging Arrays and Systems. Academic Press, San Diego (1997)

Calibration of Depth Map Using a Novel Target Sandip Paul1(B) , Deepak Mishra1 , and M. Senthil2 1

Indian Institute of Space Science and Technology, Trivandrum, Kerela, India [email protected], [email protected] 2 Space Applications Centre, Ahmedabad, Gujrat, India [email protected]

Abstract. Depth recovery from an image uses popular methods such as stereo-vision, defocus variation, focus or defocus variation, aperture variation, etc. Most of these methods demands extensive computational efforts. Further, the depth can also be obtained from a single image. Depth from single image requires less computational power and is simple to implement. In this work we propose an efficient method that relies on defocus or blur variation in an image to indicate the depth map for a given camera focus. The depth is derived by applying Gaussian filters on the contrast changes or edges in the image. The actual blur amount at the edges is then derived from the ratio of gradient magnitudes of these filtered images. This blur represents a depth map of the actual image. Our method differs from all other methods available in the literature and is based on the fact that we use real data, simple calibration method to extract the depth maps. This is unlike other authors who have mostly relied upon simulated datasets. A unique target is used to calibrate the derived depths by finding coefficients k between theoretical and actual blur. Additionally, the target also characterizes the blur range. Keywords: Thin lens model · Defocus map · Gaussian gradient · Sparse depth map · Full depth map · Matting Laplacian · Calibration

1

Introduction

Passive optical cameras provide information that is compatible to humans and require very low power. The 3D information can be estimated from cameras using methods like stereo-vision [5] using two or more cameras, or monocular-vision using multiple images. Stereo with multiple cameras have issues of matching image scale, translation and gain calling for precision calibration and mechanical structure [6]. Literature survey of monocular-vision shows techniques for depth estimation mainly by varying focus and aperture [1–4,14,15]. Here, most methods use multiple images with variation of camera focus or aperture. These methods need to address errors due to contrast change from aperture variation or magnification variation from focus variation, mis-registration of multiple images c Springer Nature Singapore Pte Ltd. 2020  N. Nain et al. (Eds.): CVIP 2019, CCIS 1147, pp. 248–258, 2020. https://doi.org/10.1007/978-981-15-4015-8_22

Calibration of Depth Map Using a Novel Target

249

and dynamic scenes. Defocus blur based methods have also been proposed with multiple images [2,3]. Coded aperture is an alternative improvement where two or more images are simultaneously obtained with two or more shaped apertures [3,7]. There are also a few variations like color coded apertures in a single camera [8–10]. Depth can also be estimated from the blur present in an image [11,12]. Defocus blur is caused by light rays originating from the same point, pass through lens, and converge before or after but not at the optics focal plane. The effect is reduced sharpness of the image. Assuming Lambertian surfaces, blur is mathematically modeled as blur kernel which convolves with the otherwise sharp in-focus image. Blur varies spatially and with object distance. Depth map is useful to reconstruct three-dimensional images. From our results we observe that the above proposed method is immune to contrast variations with-in the image, magnification artifacts and spectral sensitivity issues. The first and second derivatives of image function are used to localize the blur religions and to estimate its magnitude. To obtain the depth map, ratio of absolute value of Gaussian blur magnitude and gradient magnitude along with interpolation is used in the proposed scheme. The details of method and results will be discussed in the subsequent sections. Our main contribution is the use of specialized calibration target to relate the theoretical blur with actual blur and the actual depth.

2

Defocus Model

A camera can be modeled as a thin lens, paraxial system which follows lens law. The optical schematic is shown in Fig. 1. Here, fo is the focal length, d, df is the distance from objects to the lens and N is the stop number of the camera (optics aperture/f0 ). A scene object is defined as a collection of points having primitive geometric shape(s). When an object is focused in the image plane other objects ‘O’ in front or behind this object are defocused and create a blur circle or circle of confusion (COC) with diameter c on the image plane. The blur diameter is proportional to the object distance and is given as: c=

f02 |d − df | d N (df − f0 )

(1)

Thus, in an image, various focused and defocused points are created corresponding to scene object distances. Knowing c, a 2D depth map, related to spatial blur variations of an image and corresponding object distances can be derived. The model assumes that (1) Objects in the image is larger than the blur radius to provide definite contrast for edges (2) Blurred edge is originally a sharp step edge, camera is focused for some object in the image (3) Blur is locally modeled using a single disc PSF. These may lead to issues at occlusion boundaries, translucent objects (4) Defocus blur varies spatially (5) Image is not over/under-exposed which lead to information loss.

250

S. Paul et al.

Fig. 1. Overview of blur estimation approach

3

Proposed Depth Map Estimation

The depth map is two-dimensional spatial information corresponding to distances of scene objects within the image. Our method of depth estimation is based on [12]. Initially an edge map is extracted from an image with edge detector. Here objects at focus provide finer and sharper edges compared to objects away from focus. A one dimensional edge located at x = 0, can be represented as a step function u(x) f (x) = Au(x) + B, (2) where, A is the amplitude of the edge and B is the offset of the edge from origin. Blur filter is then applied on the edges. The blur is modeled as a Gaussian point spread function (PSF) acting on a sharp image f. The PSF kernel g(x, σ), has standard deviation as σ = kc to relate with Eq. 1. I(x) = f (x) ⊗ g(x, σ)

(3)

This filter blurs the sharp edges more compared to blurred edges. Gradient of the re-blurred edge is then the change in intensity over a series of pixels. These intensity changes are not linear and are similar to one side of a Gaussian curve. The gradient of convolution is in x direction is given as: I1 (x) = (I(x) ⊗ g(x, σ0 )) = ((Au(x) + B) ⊗ g(x, σ0 ))

(4)

A x2 ), = exp(− 2 2 2 2(σ + σ02 ) 2π(σ + σ0 )

(5)

where σ0 is the standard deviation for the re-blur Gaussian kernel. The edge is dependent on amplitude A, σ0 and σ. Amplitude dependency is removed by taking ratio of the gradient magnitude of actual edge to the re-blurred edge as:  |I(x)| σ 2 + σ02 x2 x2 = )) (6) exp(−( − |I1 (x)| σ2 2σ 2 2(σ 2 + σ02 ) Taking the derivative dR/dx and equating to 0, gives the edge location, where the ratio is maximum. At x = 0 the maximum value is  |I(0)| σ 2 + σ02 R= = (7) |I1 (0)| σ2

Calibration of Depth Map Using a Novel Target

251

From R, we compute the value of σ which represents the actual amount of blur in the original image and is independent of edge amplitude. Rewriting Eq. 6 in terms of σ 1 (8) σ0 . σ=√ 2 R −1 Similarly, we can blur the edge with two different values of σ, namely, σ1 and σ2 and derive the actual blur in the input edge as  σ12 − R2 σ22 (9) σ= R2 − 1 Extending this to images, the edge is found for both the axes. Canny edge detector is usually used for edge map as it is robust. Blurring is applied on edge map with 2D isotropic Gaussian kernel. Figure 2 summarizes the method.

Fig. 2. Overview of blur estimation approach

The gradient magnitude is computed as  ||I(x, y)|| = Ix2 + Iy2 ,

(10)

where Ix and Iy are the gradients along x and y directions respectively. The σ values from gradient ratio are related to Eq. 1 as σ = kc. The depth d can be computed by knowing the value of k. Other camera parameters like df , N and fo are fixed for a camera and lens setting. N and fo are usually available with camera and lens data sheet. The values of df , N and fo can also be computed by the camera calibration application in Matlab. The depth map obtained with calibrated k provides information only at feature edges and lead to sparse depth map. A full depth map requires feature based segmentation of the image and interpolating the depth information of the sparse map into the segmented features. Alpha-matte method by [13] is used here to interpolate full depth map.

252

S. Paul et al.

Ideally the unknown term k should be unity. However, in reality the assumption of thin lens is not true. Further, camera fabrication imperfections, lens distortions lead to finite blurring of otherwise sharp image. In such case, k then can be found only by calibration. The main efforts of this paper are to find the best value of k which reliably gives the depth of the objects in the image.

4

Calibration

4.1

Range

The camera has finite blur due to fabrication imperfections, lens abrasions, aperture size and finite size of pixels. In optical systems these are characterized as PSF. A perfect camera PSF will be a 2D impulse function. However, the best focused image will have blurred edges due to imperfections and the corresponding PSF will have finite thick diameter. This blur is usually modeled as ‘pill box’. Many researchers also have approximated this PSF as 2D Gaussian function. Both PSF were used for experimentation. This paper will discuss more on ‘pill box’ PSF experimentation while summary of Gaussian PSF results will be discussed. The 1D pill box PSF is represented as hp (x) =

1 [u(x + r) − u(x − r)] 2r

(11)

Corresponding 1D Gaussian PSF is given as 1 1 ¯)2 ] exp[− (x − x 2 2πσ

(12)

Ik (x, y) = f (x, y) ⊗ hp (x, y, rk )

(13)

Ik (x, y) = f (x, y) ⊗ hg (x, y, σk )

(14)

hg (x) = √ The image I is modeled as

and The present method of calibration is by measuring the blur in the image at different known depths. Ideally this is achieved by focusing the camera to a near object. The far away objects will then be defocused and have blurred edges which will increase for more distant objects. First we find the blur range that can be reliably used. This blurring effect is synthesized by generating a unique target and applying known blurred functions. We used a special checkerboard pattern as the target. The pattern size is 400 × 3300 pixels (other sizes are also possible) and was divided into ten equal blocks. Each block is blurred with a filter having different known r (1 to 10). Blur (σ) map is generated from this pattern. In the map dark blue has the lowest magnitude. Figure 4 show the results after convolving PSF (Fig. 3a) and pattern.

Calibration of Depth Map Using a Novel Target

(a) One of ’pill-box’ PSF kernal

253

(b) One of ’Gaussian’ kernal (9 x 9)

Fig. 3. The blur is modeled as PSF. Two types of PSF kernels are used. (Color figure online)

Fig. 4. This unique checkerboard pattern of 400×3300 pixels has 10 blocks of 400×330 pixels each. Each block is blurred with PSF filter in increments of r radius. Above figure shows two (400 × 660 pixels) of the ten blocks.

The synthetic blur map is shown in Fig. 5 for the range of PSF radius of 1 to 10. Dark blue has the lowest value. The values are not monotonic as shown by repeating colors. It is observed that above r = 7, the blur is very high and the blocks become uniform gray. The edge detector now cannot detect any edge setting the upper range limit. Further, the pattern reverses after r = 4 (near pixel 2000 (Fig. 7a)). This happens as the edges become thicker than the pattern (Fig. 6). The slope upto r = 1 is slow and has poor sensitivity in distinguishing depth (Fig. 7b). The useful range of r is then >1 to k, then we will say there is hole between y + k and y + l or x + k and x + l. We have filled such holes by labelling them as object i.e. B(x, y + k : y + (l − 1)) = 1 or B(x + k : x + (l − 1), y) = 1. Subsequently, the updated labelling will act as the new labelling and the updated object will act as the new object seed for the application of the refined graph cut without changing the background seeds or the search space. Figure 4(b) shows the updated object after hole-filling and Fig. 4(c) shows the final result after the refined graph cut is applied. The refinement step will iteratively be carried out till the user is satisfied. Nevertheless, in most of the cases, the object of interest is extracted within 1–2 iterations.

4

Experimental Results and Comparison

In this section, experiments have been carried out to establish the effectiveness of the proposed approach. At first, a case study is conducted to find the

Fig. 4. Refinement of the segmentation result (a) segmentation after initial graph cut (b) generation of new object seeds by hole-filling (c) Final segmentation

A Reduced Graph Cut Approach

297

Fig. 5. Results with different inputs (a) using scribble based input (b) using rough contour as input (c) using a mouse click as input

most probable way an user chooses the object of his/her interest. Subsequently, the efficiency of the proposed approach in handling all such inputs is evaluated followed by the comparative analysis against other state-of-the-art graph cut approaches. Different parameters like precision, recall and F1-measure values are calculated for quantitative evaluation of the segmentation accuracy whereas execution time is considered to compare the computational complexity. 4.1

User Study

An user study is conducted considering 20 participants with no particular expertise in image processing. For all the trials, same image is considered for an unbiased evaluation of the interaction methods. The participants were shown the original image and asked to select the object of their interest without any constraint on the type of input. It is observed that the users mostly interact in three possible ways such as (i) a mouse click on the object of interest (ii) a scribble on the object or (iii) a rough contour around the object. Among them, the most frequent choice is just a mouse click which has not been considered by any of the interactive segmentation approaches so far. Figure 5 illustrates one of the Table 1. Comparison of average F1-measure and Time (sec) in case different input types as obtained from the user study Interaction type No. of users Average F1-measure Average time (sec.) Scribble

7/20

0.9509

66.65

Rough contour

4/20

0.9495

66.79

Mouse click

9/20

0.9491

65.64

298

P. Subudhi et al.

instances randomly selected from each type of interaction which shows that the results in all cases are visually similar. Table 1 presents an overall statistics with average time and accuracy in all the three methods of interaction. It can be observed that, all the interaction methods incur nearly the same time to produce the segmentation. Also, there is no significant difference in segmentation accuracy which corroborates our observation in the qualitative results. 4.2

Comparison with Other State-of-the-art Methods

To evaluate the performance of the proposed approach, we have compared it with other state-of-the-art graph cut approaches like standard graph cut [2], grab cut [8] and the layer based approach [3] having close resemblance with the proposed approach. Figure 6 shows the results on few images from MSRC dataset [1] using the above mentioned approaches along with the results of the proposed approach. As graph cut works on the whole image rather than in the vicinity of the object, so it segments other similar parts in the image as can be observed from the second row of Fig. 6(a), (c) and (e). All other approaches segment specifically the object of interest, however, the layer based approach does not employ any image properties to find the most significant layers which sometimes results in inaccurate segmentation of the object of interest as in the case of Fig. 6(a) and (d). An overall visual analysis of the results shows the

Fig. 6. Comparison with state-of-the-art approaches. First row: Original images. Second row: Results of standard graph cut. Third row: Results of grab cut. Forth row: Results of layer-based approach. Fifth row: Results of proposed approach

A Reduced Graph Cut Approach

299

Table 2. Comparison of precision, recall, F1-measure and computational time (sec) for the results in Fig. 6 Image

Method

Precision Recall F1-measure Time

Figure 6(a) Graph cut Grab cut Layer based Proposed

0.9514 0.9503 0.9684 0.9706

0.9797 0.9872 0.9632 0.9885

0.9653 0.9684 0.9658 0.9795

132.26 293.21 50.81 36.41

Figure 6(b) Graph cut Grab cut Layer based Proposed

0.8416 0.9015 0.8501 0.9127

0.9915 0.9744 0.9912 0.9813

0.9104 0.9365 0.9153 0.9458

116.78 307.94 45.99 27.55

Figure 6(c) Graph cut Grab cut Layer based Proposed

0.8281 0.9956 0.9882 0.9945

0.9862 0.9751 0.9614 0.9793

0.9002 0.9852 0.9746 0.9869

105.32 267.28 48.63 29.46

Figure 6(d) Graph cut Grab cut Layer based Proposed

0.9718 0.9996 0.9992 0.9985

0.9621 0.9819 0.9331 0.9875

0.9669 0.9907 0.9653 0.9928

108.66 294.83 63.53 41.64

Figure 6(e) Graph cut Grab cut Layer based Proposed

0.9808 0.9472 0.8128 0.9811

0.9792 0.9810 0.9932 0.9392

0.9545 0.9638 0.8904 0.9597

79.10 250.10 63.53 23.34

superiority of the proposed approach to its other variants. Further, quantitative evaluation of the results in Fig. 6 is carried out in Table 2 where the calculated values of precision, recall and F1-measure values are given along with the total time incurred. It can be observed from the table that, the proposed approach has a higher F1-measure than all other approaches except in one case where grab cut has a higher value. However, grab cut requires significantly higher computation time due to the iterative process involved in it. To summarize the key advantages of the proposed approach with respect to its other variants are – Graph cut requires scribble based input, grab cut needs a bounding box and the layer based approach needs a rough contour close to the actual boundary whereas the proposed approach can work on any of these inputs. – It automatically generates the best possible background and foreground seeds which are further refined according to the need of the user. – It works on comparatively lesser number of pixels than graph cut and grab cut. Although, the layer based approach sometimes require lesser number of nodes than the proposed approach, it is highly dependant on the initial placement of contour and may not work in case of improper selection of the initial rough contour.

300

5

P. Subudhi et al.

Conclusion and Future Work

In this paper, we propose a novel reduced graph based segmentation approach having two salient components. First, it allows the users to give the inputs in their own way providing flexibility in interaction. Second, by mapping all types of input to a single pixel it finds two boxes and graph cut is applied only in the area between these two boxes where boundary of the object lies, thereby reducing the overall size of the graph and computational effort of the graph cut. Experimental evaluations verify the efficacy of the proposed approach both in terms of accuracy and computational complexity. Despite its good performance, the proposed approach has one limitation i.e. the construction of the inner and outer box is highly dependant on the selection of threshold values which are chosen by trial and error. In future, efforts can be given to incorporate image features to select the threshold values which will make our approach more robust and readily adoptable for image editing applications.

References 1. Microsoft foreground extraction benchmark dataset. http://research.microsoft. com/en-us/um/cambridge/projects/visionimagevideoediting/segmentation/ grabcut.htm 2. Boykov, Y.Y., Jolly, M.P.: Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In: 2001 Eighth IEEE International Conference on Computer Vision, ICCV 2001. Proceedings, vol. 1, pp. 105–112. IEEE (2001) 3. Gueziri, H.E., McGuffin, M.J., Laporte, C.: A generalized graph reduction framework for interactive segmentation of large images. Comput. Vis. Image Underst. 150, 44–57 (2016) 4. Lerm´e, N., Malgouyres, F.: A reduction method for graph cut optimization. Pattern Anal. Appl. 17(2), 361–378 (2013). https://doi.org/10.1007/s10044-013-0337-7 5. Li, Y., Sun, J., Tang, C.K., Shum, H.Y.: Lazy snapping. ACM Trans. Graph. (ToG) 23, 303–308 (2004) 6. Lombaert, H., Sun, Y., Grady, L., Xu, C.: A multilevel banded graph cuts method for fast image segmentation. In: 2005 Tenth IEEE International Conference on Computer Vision, ICCV 2005, vol. 1, pp. 259–265. IEEE (2005) 7. Price, B.L., Morse, B., Cohen, S.: Geodesic graph cut for interactive image segmentation. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3161–3168. IEEE (2010) 8. Rother, C., Kolmogorov, V., Blake, A.: GrabCut: interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. (TOG). 23, 309–314 (2004) 9. Sinop, A.K., Grady, L.: Accurate banded graph cut segmentation of thin structures using Laplacian pyramids. In: Larsen, R., Nielsen, M., Sporring, J. (eds.) MICCAI 2006. LNCS, vol. 4191, pp. 896–903. Springer, Heidelberg (2006). https://doi.org/ 10.1007/11866763 110 10. Yang, W., Cai, J., Zheng, J., Luo, J.: User-friendly interactive image segmentation through unified combinatorial user inputs. IEEE Trans. Image Process. 19(9), 2470–2479 (2010) 11. Zhou, H., Zheng, J., Wei, L.: Texture aware image segmentation using graph cuts and active contours. Pattern Recogn. 46(6), 1719–1733 (2013)

A New Fuzzy Clustering Algorithm by Incorporating Constrained Class Uncertainty-Based Entropy for Brain MR Image Segmentation

Nabanita Mahata and Jamuna Kanta Sing
Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
[email protected], [email protected]

Abstract. We propose a new fuzzy clustering algorithm by incorporating constrained class uncertainty-based entropy for brain MR image segmentation. Due to deficiencies of MRI machines, brain MR images are affected by noise and intensity inhomogeneity (IIH), resulting in unsharp tissue boundaries with low resolution. As a result, standard fuzzy clustering algorithms fail to classify pixels properly, especially when using only pixel intensity values. We mitigate this difficulty by introducing an entropy that measures constrained class uncertainty for each pixel. The value of this entropy is higher for the pixels on the unsharp tissue boundaries. Apart from using the fuzzy membership function, we also define the similarity as the complement of a measure characterized by a Gaussian density function in non-Euclidean space to reduce the effect of noise and IIH. By introducing a regularization parameter, the trade-off between the fuzzy membership function and the class uncertainty-based measure is resolved. The proposed algorithm is assessed both qualitatively and quantitatively on several brain MR images of a benchmark database and two clinical data volumes. The simulation results show that the proposed algorithm outperforms some of the fuzzy-based state-of-the-art methods devised in the recent past when evaluated in terms of cluster validity functions, segmentation accuracy and Dice coefficient.

Keywords: Brain MR image segmentation · Fuzzy clustering algorithm · Entropy

1 Introduction

Human brain magnetic resonance (MR) image segmentation is an important and essential task for many clinical analyses, such as therapy planning for image


guided surgery, post-surgery monitoring of patients, evaluation of treatments, etc. However, the task is even more challenging and sensitive as it involves human life. The segmentation is usually done after acquiring brain image through a magnetic resonance imaging (MRI) scanner. The acquired brain MR images are normally affected by noise and intensity inhomogeneity (IIH) that causes nonuniform distribution of pixel (voxel) intensities of a particular soft tissue across the image domain. This phenomenon occurs due to patient movement, improper buildup of magnetic field, radio-frequency coil non-uniformity and other related factors [1]. Consequently, we get low resolution brain MR images having blurred tissue boundaries. Review on different segmentation methods can be found in [2] and it suggests that fuzzy clustering algorithms are studied most in the last decade. Qiu et al. [3] proposed a fuzzy clustering algorithm for MR image segmentation using type 2 fuzzy logic, where two fuzzifiers and a spatial constraint are introduced into the membership function. Authors in [4] presented a conditional spatial fuzzy c-means (FCM) algorithm for brain MR image segmentation by introducing conditional variables, local and global membership functions into the objective function. In [5], authors presented a two-stage modified FCM algorithm for 3D brain MR image segmentation using local spatial information along with local and global membership functions. Chetih et al. [6] proposed a fuzzy clustering algorithm for noisey brain MR image segmentation using a non-parametric Bayesian estimation in the wavelet domain. Namburu et al. [7] proposed a generalized rough intutionistic fuzzy c-means algorithm for brain MR image segmentation. The rough fuzzy regions corresponding to the different brain tissues are determined by analyzing the image histogram. Another rough set based intuitionistic type-2 FCM algorithm is proposed by Chen et al. [8]. It incorporates intuitionistic type-2 fuzzy logic and local spatial information to mitigate the shortcomings of the FCM algorithm. Rough set-based FCM algorithm for brain MR image segmentation is presented by [9]. The above methods do not consider entropy in the objective function. However, very few methods have used entropy to model the objective function [10–12]. These methods are usually suitable for handling noisy data. Yao et al. [10] proposed a fast entropy-based fuzzy clustering method, which automatically identifies the number and locations of initial cluster centers by minimizing entropy, corresponding to a data point. The method was tested on some machine learning data sets. Kannan et al. [11] devised an effective quadratic entropy fuzzy c-means algorithm by combining regularization function, quadratic terms, mean distance functions, and kernel distance functions. The algorithm also proposed a prototype initialization technique, which is validated by silhouette index using some time series data. Zarinbal et al. [12] proposed a relative entropy fuzzy cmeans clustering algorithm especially for the noisy data. The entropy term is added to the objective function of the FCM algorithm to act as regularization function while maximizing the dissimilarity between the clusters. Recently, some authors [19,20] proposed convolutional neural networks (CNNs) based algorithms for brain MR image segmentation. Pereira et al. [19] proposed an algorithm using deep convolutional neural networks that exploits small convolutional kernels. 
Its architecture stacks more convolutional layers


having same receptive fields as of bigger kernels. Further, Moeskops et al. [20] proposed an automatic method based on multi-scale CNN. It combines multiple patch and kernel sizes to learn multi-scale features that estimate both the intensity and spatial characteristics. However, the main disadvantage of CNN is that it requires high computational time during its training phase. In this paper, we propose a new fuzzy clustering algorithm by incorporating constrained class uncertainty-based entropy for brain MR image segmentation. We define entropy characterizing class uncertainty based on likelihood measures for each pixel. Furthermore, we also define the similarity as the complement of a measure characterized by a Gaussian density function, which in turn establishes some correlation among the neighboring pixels. The trade-off between fuzzy membership function and class uncertainty measure is resolved by introducing a regularization parameter.

2 Fuzzy Clustering Algorithm by Incorporating Constrained Class Uncertainty-Based Entropy

Due to deficiencies in MRI machines, the acquired brain MR images are contaminated by noise and intensity inhomogeneity (IIH), resulting in an irregular pixel intensity distribution with respect to a tissue (region). It gets worse at the blurred tissue boundaries. Due to these inherent artifacts, uncertainty arises while classifying these pixels into a specific class. We call this a class uncertainty measure. Additionally, we assume that, with respect to a pixel, the sum of these class uncertainty measures over all the classes is unity, making it constrained in nature. Furthermore, this class uncertainty attains its maximum value at the tissue boundaries. To exploit this uncertainty, an entropy is introduced for each pixel by using these constrained class uncertainty measures. It may be noted that a higher entropy value indicates more uncertainty and vice versa. Additionally, to reduce the influence of outliers and noise, the similarity measure is characterized by a Gaussian density function and controlled by its receptive field. Apart from using the fuzzy membership function, we judiciously integrate this entropy and similarity measure into the fuzzy objective function for brain MR image segmentation. The trade-off between the fuzzy membership function and the class uncertainty measure is resolved by introducing a regularization parameter. We call this algorithm the fuzzy clustering algorithm by incorporating constrained class uncertainty-based entropy (FCICCuE). Let a brain MR image consist of N pixels and have C different soft tissue regions. As the MR images are contaminated by noise and IIH, for each pixel xk (k = 1, 2, . . . , N), we construct a feature vector Xk having five features by considering its immediate neighbors. In particular, this feature vector comprises the pixel intensity value and the mean pixel values along the horizontal, vertical, forward diagonal and backward diagonal directions, respectively, within a square neighborhood. The objective function of the FCICCuE algorithm is defined as follows:


J_{FCICCuE} = \sum_{k=1}^{N} \sum_{i=1}^{C} \left[ \alpha\, \mu_{ik}^{m} (1 - G_{ik}) + (1-\alpha)\, p_{ik}^{m}\, G_{ik} \right] - \sum_{k=1}^{N} \sum_{i=1}^{C} p_{ik} \ln(p_{ik})   (1)

subject to the following two constraints:

\sum_{i=1}^{C} \mu_{ik} = 1, \qquad \sum_{i=1}^{C} p_{ik} = 1   (2)

where \mu_{ik} is the membership value of the pixel x_k for cluster i, m (> 1.0) is the fuzzifier value, p_{ik} represents the constrained class uncertainty measure for the pixel x_k with respect to class (cluster) i, \alpha (0 < \alpha < 1.0) acts as a regularizing parameter, and G_{ik} defines the similarity between the cluster center V_i and the pixel vector X_k, characterized as follows:

G_{ik} = e^{-\frac{\|X_k - V_i\|^2}{2\sigma_i^2}}   (3)

where \sigma_i is the variance representing the receptive field of the Gaussian density function. The first term of (1) holds two aspects: the first aims to minimize the product of the fuzzifier-induced membership function and the complement of the Gaussian pdf-based similarity measure, which is inversely proportional to the fuzzy membership function, whereas the second aims to minimize the product of the fuzzifier-induced constrained class uncertainty measure and the similarity measure. The trade-off between these two aspects is resolved by using a regularizing parameter. The second term of (1) represents the constrained class uncertainty-based Shannon entropy. The objective of the algorithm is to minimize these two terms across the image domain. The iterative equations for \mu_{ik}, p_{ik} and V_i can be obtained by minimizing (1) under the constraints defined in (2). This is realized by rewriting (1) with the help of Lagrange multipliers and then setting its partial derivatives with respect to these parameters to zero. After simplification, we get the following iterative equations:

\mu_{ik} = \frac{1}{\sum_{l=1}^{C} \left( \frac{1 - G_{ik}}{1 - G_{lk}} \right)^{\frac{1}{m-1}}}   (4)

p_{ik} = \frac{\exp\left( m(1-\alpha)\, p_{ik}^{m-1} G_{ik} \right)}{\sum_{l=1}^{C} \exp\left( m(1-\alpha)\, p_{lk}^{m-1} G_{lk} \right)}   (5)

V_i = \frac{\sum_{k=1}^{N} \left( \alpha\, \mu_{ik}^{m} - (1-\alpha)\, p_{ik}^{m} \right) G_{ik}\, X_k}{\sum_{k=1}^{N} \left( \alpha\, \mu_{ik}^{m} - (1-\alpha)\, p_{ik}^{m} \right) G_{ik}}   (6)

Once the final values for \mu_{ik} and V_i (i = 1, 2, . . . , C; k = 1, 2, . . . , N) are obtained, the class of a pixel X_k is determined as follows:

class(X_k) = \arg\max_{i} \{ \mu_{ik} \}, \quad i = 1, 2, . . . , C   (7)
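To make the update scheme concrete, the following NumPy sketch implements one pass of Eqs. (3)-(6) and the labelling rule (7). It is an illustrative prototype only: the feature construction follows our reading of the five-feature description above (intensity plus directional neighbour means), and all function and variable names are ours rather than the authors'.

```python
import numpy as np

def pixel_features(img):
    """Five features per pixel: intensity and the mean of the two immediate
    neighbours along the horizontal, vertical and the two diagonal directions
    (our interpretation of the feature vector described in the text)."""
    p = np.pad(img.astype(float), 1, mode="edge")
    c = p[1:-1, 1:-1]
    horiz = (p[1:-1, :-2] + p[1:-1, 2:]) / 2.0
    vert = (p[:-2, 1:-1] + p[2:, 1:-1]) / 2.0
    bdiag = (p[:-2, :-2] + p[2:, 2:]) / 2.0
    fdiag = (p[:-2, 2:] + p[2:, :-2]) / 2.0
    return np.stack([c, horiz, vert, bdiag, fdiag], axis=-1).reshape(-1, 5)

def fciccue_iteration(X, V, p, sigma, m=1.75, alpha=0.8, eps=1e-12):
    """One FCICCuE update. X: (N, 5) features, V: (C, 5) cluster centers,
    p: (N, C) constrained class uncertainty, sigma: (C,) receptive fields."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1)           # squared distances (N, C)
    G = np.exp(-d2 / (2.0 * sigma[None, :] ** 2 + eps))           # Eq. (3)

    ratio = (1.0 - G[:, :, None]) / (1.0 - G[:, None, :] + eps)   # Eq. (4)
    mu = 1.0 / (ratio ** (1.0 / (m - 1.0))).sum(-1)

    e = np.exp(m * (1.0 - alpha) * p ** (m - 1.0) * G)            # Eq. (5)
    p_new = e / e.sum(axis=1, keepdims=True)

    w = (alpha * mu ** m - (1.0 - alpha) * p_new ** m) * G        # Eq. (6)
    V_new = (w.T @ X) / (w.sum(axis=0)[:, None] + eps)

    labels = mu.argmax(axis=1)                                    # Eq. (7)
    return mu, p_new, V_new, labels
```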

3 Experimental Results and Discussion

We have validated the FCICCuE algorithm with m = 1.75 and α = 0.8 by using several volumes of a benchmark brain MR database, the BrainWeb phantom data [13], and two volumes of clinical (both male and female) brain MR image data. The BrainWeb database provides the segmentation ground truth for quantitative evaluation. We have segmented each image into three main tissue objects: (i) cerebrospinal fluid (CSF), (ii) gray matter (GM) and (iii) white matter (WM). The validation study involves both qualitative and quantitative analysis. Furthermore, where ground truths are available, the analysis is done in terms of (i) segmentation accuracy (SA) and (ii) Dice coefficient. Additionally, the analysis is also performed in terms of indices for cluster validity functions, such as (i) partition coefficient (Vpc) and (ii) partition entropy (Vpe). To demonstrate the efficiency, the results of the proposed FCICCuE algorithm are compared with some of the fuzzy-based state-of-the-art methods like FCM [14], sFCM [15], FGFCM [16], ASIFC [17] and PFCM [18]. These validation parameters are defined in the following paragraphs. Segmentation Accuracy: Segmentation accuracy (SA) of a cluster is defined as the ratio of the number of correctly classified pixels to the number of corresponding pixels in the ground truth [5,17]. The value of SA lies in the range [0, 1.0] and becomes 1.0 for perfect segmentation, with a higher value considered better. Dice Similarity Coefficient or Dice Coefficient: The Dice similarity coefficient or Dice coefficient is another very efficient validity measurement as it also compares the segmentation results with the ground truth [3,4]. Its value is in the range [0, 1.0] and the optimal clustering result is achieved when its value is 1.0, with higher values considered better. Partition Coefficient: It is an important indicator of a fuzzy partition and is considered the first cluster validity function. It describes the confidence of a fuzzy algorithm in partitioning the patterns into the possible clusters [4,5]. The ideal clustering is achieved when the partition coefficient Vpc is 1.0. Partition Entropy: Another important indicator of the fuzzy partition of an algorithm is the partition entropy Vpe, also known as an index of cluster validity functions [4,5]. The best clustering performance is achieved when Vpe is 0.
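The four indices can be computed as in the sketch below, which follows the standard definitions of these cluster validity measures (the paper states their ranges but not their formulas, so the exact expressions for Vpc and Vpe are assumed from the usual fuzzy-clustering literature). `mu` is the N × C membership matrix produced by the clustering.

```python
import numpy as np

def segmentation_accuracy(pred_labels, gt_labels, cls):
    """SA of one cluster: correctly classified pixels of class `cls`
    divided by the number of pixels of that class in the ground truth."""
    gt_mask = (gt_labels == cls)
    correct = np.logical_and(pred_labels == cls, gt_mask).sum()
    return correct / (gt_mask.sum() + 1e-12)

def dice_coefficient(pred_mask, gt_mask):
    """Dice coefficient between the binary masks of one tissue class."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    return 2.0 * inter / (pred_mask.sum() + gt_mask.sum() + 1e-12)

def partition_coefficient(mu):
    """Vpc = (1/N) sum_k sum_i mu_ik^2; equals 1.0 for a crisp partition."""
    return float((mu ** 2).sum() / mu.shape[0])

def partition_entropy(mu, eps=1e-12):
    """Vpe = -(1/N) sum_k sum_i mu_ik * log(mu_ik); equals 0 for a crisp partition."""
    return float(-(mu * np.log(mu + eps)).sum() / mu.shape[0])
```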

3.1 Experiments on BrainWeb Data

This study includes all the simulated brain MR phantom data volumes having a resolution (height × width × depth) of 181 × 217 × 181 with slice thickness 1 mm. All the six volumes are downloaded by varying noise (5%–9%) and IIH (20%– 40%). However, for validation we have considered image volumes consisting of 51 images (slices 50–100), where a fair amount of CSF, GM and WM regions are present.


To visualize the segmentation results of the proposed method, we have presented the segmented CSF, GM and WM regions of a T1-weighted brain MR image having 9% noise and 40% IIH in Fig. 1. From the figure, it is clear that the proposed method can achieve better results even in the presence of high levels of noise and IIH.

Fig. 1. Segmentation results on a T1-weighted (9% noise, 40% IIH) brain MR image from BrainWeb database. (a): Input image, (b): CSF, (c): GM, (d): WM and (e): Segmented image.

Fig. 2. Segmentation results of different algorithms in terms of segmentation accuracy (SA) on 6 brain MR image volumes with different percentage of noise and IIH. (a): SA of CSF, (b): SA of GM and (c): SA of WM.

A comparative performance analysis among the different methods in terms of segmentation accuracy on different tissue regions is presented in Fig. 2 using


the images with high levels of noise and IIH. The results show that the proposed FCICCuE algorithm is superior to the FCM, FGFCM, sFCM, ASIFC and PFCM algorithms. In addition, it yields far superior results in the case of higher noise and IIH. Similarly, Fig. 3 shows the results of the different methods in terms of Dice coefficient, partition coefficient and partition entropy. The simulation results again demonstrate the superiority of the FCICCuE algorithm over the other competitive methods. In particular, the results reveal that the proposed FCICCuE algorithm performs the segmentation task more confidently than the other methods, as its Dice coefficient is higher and its Vpc and Vpe values in all cases are closer to 1.0 and nearer to 0.0, respectively. Again, the results reveal that, compared to the other methods, it works well with more noise and IIH.

Fig. 3. Segmentation results of different algorithms in terms of (a) Dice coefficient, (b) partition coefficient and (c) partition entropy on 6 brain MR image volumes with different percentage of noise and IIH.

3.2 Experiments on Clinical Brain MR Image Data

The present study also involves two clinical brain MR image volumes, which we obtained from the Advanced Medical Research Institute (AMRI) Hospital, Dhakuria, Kolkata, India and EKO X-Ray & Imaging Institute, Jawaharlal


Nehru Road, Kolkata, India. The images were acquired through a 1.5T MRI machine. The resolutions (height × width × depth) of these image volumes are 181 × 181 × 25 and 256 × 150 × 20. Table 1 presents a comparative tabulation among the different methods in terms of Vpc and Vpe on these clinical image volumes. The results show that the Vpc values of the proposed FCICCuE algorithm in all cases are closer to 1.0 and also higher than those of its competitive methods, whereas its Vpe values are closer to 0.0 and lower than those of the other methods. Therefore, on clinical brain MR images the proposed algorithm also performs the segmentation task better than its competitive methods.

Table 1. Comparison between different methods in terms of Vpc and Vpe on two volumes of clinical brain MR data

Image volume        Method    Vpc    Vpe
Subject 1 (Male)    FCM       0.705  0.558
                    FGFCM     0.815  0.329
                    sFCM      0.886  0.294
                    ASIFC     0.911  0.206
                    PFCM      0.924  0.137
                    FCICCuE   0.976  0.041
Subject 2 (Female)  FCM       0.791  0.253
                    FGFCM     0.811  0.159
                    sFCM      0.835  0.074
                    ASIFC     0.897  0.056
                    PFCM      0.905  0.053
                    FCICCuE   0.984  0.031

4 Conclusion

In this paper, we have presented a new fuzzy clustering algorithm based on constrained class uncertainty entropy for the segmentation of brain MR images. For each pixel, we have introduced an uncertainty-based possibilistic measure that specifies its association with a particular class. This measure is proportional to the class uncertainty. Moreover, by enforcing the sum of these class uncertainty measures over all the classes to be unity, we make it constrained in nature. Entropy is calculated based on these measures. Additionally, we have also defined a similarity as the complement of a measure characterized by a Gaussian density function to establish some correlation among the neighboring pixels. The algorithm is evaluated on several brain MR data volumes of a benchmark database and two clinical brain MR image volumes. The simulation results in terms of several performance indices show that the proposed method can effectively segment the images in all cases and that it is also superior to some of the state-of-the-art


methods devised recently. The study also suggests that the proposed method yields even better results than the other methods when the brain MR images are corrupted by more noise and intensity inhomogeneity, making it more robust to these artifacts. Acknowledgment. This work is supported by the SERB, Govt. of India (File No: EEQ/2016/000145). We are also grateful to radiologists Dr. S. K. Sharma and Dr. Sumita Kundu of the EKO X-Ray & Imaging Institute, Jawaharlal Nehru Road, Kolkata, for their support and for providing clinical brain MR data. The authors are also thankful to Mr. Banshadhar Nandi and Mr. Niloy Halder, Sr. Technologist (Imaging), AMRI Hospital, Dhakuria, for their support and for providing the clinical data. Moreover, the authors convey their profound indebtedness to radiologist Dr. Amitabha Bhattacharyya for his invaluable suggestions and support.

References 1. Simmons, A., Tofts, P.S., Barker, G.J., Arridge, S.R.: Sources of intensity nonuniformity in spin echo images at 1.5T. Magn. Reson. Med. 32(1), 121–128 (1994) 2. Dora, L., Agrawal, S., Panda, R., Abraham, A.: State-of-the-art methods for brain tissue segmentation: a review. IEEE Trans. Biomed. Eng. 10, 235–249 (2017) 3. Qiu, C., Xiao, J., Yu, L., Han, L., Iqbal, M.N.: A modified interval type-2 fuzzy C-means algorithm with application in MR image segmentation. Pattern Recogn. Lett. 34(12), 1329–1338 (2013) 4. Adhikari, S.K., Sing, J.K., Basu, D.K., Nasipuri, M.: Conditional spatial fuzzy Cmeans clustering algorithm for segmentation of MRI images. Appl. Soft Comput. 34, 758–769 (2015) 5. Kahali, S., Adhikari, S.K., Sing, J.K.: A two-stage fuzzy multi-objective framework for segmentation of 3D MRI brain image data. Appl. Soft Comput. 60, 312–327 (2017) 6. Chetih, N., Messali, Z., Serir, A., Ramou, N.: Robust fuzzy c-means clustering algorithm using non-parametric Bayesian estimation in wevelet transform domain for noisy MR brain image segmentation. IET Image Process. 12(5), 652–660 (2018) 7. Namburu, A., Samayamantula, S.K., Edara, S.R.: Generalized rough intuitionistic fuzzy c-means for magnetic resonance brain image segmentation. IET Image Process. 11(9), 777–785 (2017) 8. Chen, X., Li, D., Wang, X., Yang, X., Li, H.: Rough intuitionistic type-2 c-means clustering algorithm for MR image segmentation. IET Image Process. 13(4), 607– 614 (2019) 9. Huang, H., Meng, F., Zhou, S., Jiang, F., Manogaran, G.: Brain image segmentation based on FCM clustering algorithm and rough set. IEEE Access 7, 12386– 12396 (2019) 10. Yao, J., Dash, M., Tan, S.T., Liu, H.: Entropy-based fuzzy clustering and fuzzy modeling. Fuzzy Sets Syst. 113(3), 381–388 (2000) 11. Kannan, S.R., Ramathilagam, S., Chung, P.C.: Effective fuzzy c-means clustering algorithms for data clustering problems. Expert Syst. Appl. 39(7), 6292–6300 (2012) 12. Zarinbal, M., Fazel Zarandi, M.H., Turksen, I.B.: Relative entropy fuzzy c-means clustering. Inf. Sci. 260, 74–97 (2014)


13. Cocosco, C.A., Kollokian, V., Kwan, K.R.S., Evans, A.C.: BrainWeb: online interface to a 3D MRI simulated brain database. NeuroImage 5(4), 425 (1997) 14. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, Norwell (1981) 15. Chuang, K.S., Tzeng, H.L., Chen, S., Wu, J., Chen, T.J.: Fuzzy c-means clustering with spatial information for image segmentation. Comput. Med. Imaging Graph. 30(1), 9–15 (2006) 16. Cai, W., Chen, S., Zhang, D.: Fast and robust fuzzy c-means clustering algorithms incorporating local information for image segmentation. Pattern Recogn. 40(3), 835–838 (2007) 17. Wang, Z., Song, Q., Soh, Y.C., Sim, K.: An adaptive spatial information-theoretic fuzzy clustering algorithm for image segmentation. Comput. Vis. Image Underst. 117(10), 1412–1420 (2013) 18. Pal, N.R., Pal, K., Keller, J.M., Bezdek, J.C.: A possibilistic fuzzy c-means clustering algorithm. IEEE Trans. Fuzzy Syst. 13(4), 517–530 (2005) 19. Pereira, S., Pinto, V.A., Silva, C.A.: Brain tumour segmentation using convolutional neural networks in MRI images. IEEE Trans. Med. Imaging 35, 1240–1251 (2016) 20. Moeskops, P., Viergever, M.A., Mendrik, A.M., Vries, L.S., Benders, M.J.N.L., Isgum, I.: Automatic segmentation of MR brain images with a convolutional neural network. IEEE Trans. Med. Imaging 35, 1252–1262 (2016)

A Novel Saliency-Based Cascaded Approach for Moving Object Segmentation

Prashant W. Patil, Akshay Dudhane and Subrahmanyam Murala (CVPR Lab, Indian Institute of Technology Ropar, Rupnagar, India) and Anil B. Gonde (SGGSIET, Nanded, MS, India)
[email protected]

Abstract. Existing approaches have achieved remarkable performance in many computer vision applications like moving object segmentation (MOS), classification, etc. However, in the presence of infrequent motion of foreground objects, bad weather and dynamic background, accurate foreground-background segmentation is a tedious task. In addition, computational complexity is a major concern, as the data to be processed is large in the case of video analysis. Considering the above mentioned problems, a novel compact motion saliency based cascaded encoder-decoder network is proposed for MOS. To estimate the motion saliency of the current frame, a background image is estimated using a few neighbourhood frames and subtracted from the current frame. Further, a compact encoder-decoder network is proposed to estimate prior foreground probability maps. The estimated foreground probability maps suffer from limited spatial coherence, where the visibility of foreground objects is not clear. To enhance the spatial coherence of the obtained foreground probability map, a cascaded encoder-decoder network is incorporated. Intensive experimentation is carried out to investigate the efficiency of the proposed network with different challenging videos from the CDnet-2014 and PTIS databases. The segmentation accuracy is verified and compared with existing methods in terms of average F-measure. In addition, the compactness of the proposed method is analysed in terms of computational complexity and compared with the existing methods. The performance of the proposed method is significantly improved as compared to existing methods in terms of accuracy and computational complexity for the MOS task.

Keywords: Motion saliency · Weather degraded videos · CNN · Frame segmentation

1 Motivation

The moving object (foreground or background) segmentation (MOS) is a crucial step for many intelligent video processing applications, such as video surveillance,


pedestrian detection, traffic monitoring, re-identification and human tracking, vehicle navigation, anomaly detection, self-driving cars, etc. In general, any MOS technique is a pixel-level segmentation task, i.e. each pixel of the input video frame is classified into a moving or stationary object as foreground or background, respectively. Generally, video recorded using a static camera shows the foreground object in motion while the background appears motionless. In practice, however, local motion may appear in background objects (waving trees, low contrast, shelters, irregular motion of objects and snow fall), while the foreground may show some stillness. Thus, clean background estimation is crucial for foreground segmentation during unusual motion, changes in illumination, camouflage, noise, etc. Different types of approaches used for the MOS task are [8,16,18,31,32,34] and [9]. The most widely used method for the MOS task is background subtraction. Initially, clean background estimation is performed by considering the information of a few video frames. Further, the estimated background is used for the foreground estimation task by considering the current video frames. However, the moving object may not show continuous motion and could be static over several frames. The background subtraction based techniques degrade in performance in the presence of different practical scenarios. Recently, deep learning based approaches have achieved remarkable progress in MOS and saliency estimation tasks. Also, any video processing application is required to process a large amount of data; therefore, for any computer vision algorithm, the computational complexity is a major concern. In this context, the concept of motion saliency with background estimation and deep learning methodologies is adopted for MOS to overcome the above mentioned problems. The detailed literature on the existing methodologies for MOS is given in the next section.

2 Related Work

Various techniques are proposed for single image segmentation task. But, the technique used for image segmentation is not able to give significant improvement for moving object detection. Because, in any video processing applications, most prominent information is motion related to moving objects. Also, the moving objects are partially or completely blurred during recording due to some internal or external parameters. Along with this, moving objects may show irregular motion and background may show motion. The traditional MOS approaches divides every pixel of the video frame into foreground or background. The spatial and temporal information with frame difference is considered in [32]. Each pixel process independently for background-foreground separation in pixellevel methods and the inter-pixel and neighbourhood relationship is considered in region level methods. The combination of pixel and region level approach is proposed in [17] to estimate the background. The general assumption is that any static camera video having motionless background and only moving foreground objects. However, the moving object may not show continuous motion and could be static over several frames. Due to local motion like snow fall, water fountain, waving tree, etc. in background, estimation of clean background is a


very challenging task for MOS. The unsupervised and supervised is the broad classification of any MOS algorithm. The unsupervised approach is proposed with region of difference technique in [18] to overcome the limitation of irregular motion. The existing approaches used fixed noise distribution throughout all video frames using Gaussian or Laplacian distribution for MOS. Recently, Yong et al. [33] incorporated the different mixture of Gaussian (MoG) distribution technique to separate a foreground object. In several computer vision applications like automatic video surveillance, automated video analysis, etc., accuracy affect significantly on the overall performance. The saliency detection techniques gained more attention because of improvement in overall accuracy for salient object detection in images. Also, the saliency detection task extract salient objects from images/video frames and suppress the unwanted information. Further, some of algorithm based on saliency detection are used for moving object detection. Tao et al. [31] introduced the saliency-based background (spatio-temporal) prior for visual object detection. Recently, Wang et al. [28] proposed a technique with considering the intra-frame information for spatial feature learning and inter-frame information for motion feature learning for video saliency detection. In recent years, deep learning methodologies gained more attention in various tasks like background estimation, saliency estimation, moving object segmentation tasks, because of significant improvement in overall accuracy. Some of the recent approaches [3,6,7,10–12,19–21,25,27] are used for various applications like moving object detection, single image haze removal, action recognition, facial expression recognition. Babaee et al. [1] integrated the concept of unsupervised technique for background estimation by considering the initial conscutive video frames and supervised approach is used for foreground/moving objects detection. The different combinations like convolutional neural network (CNN), multi-scale CNN and cascade CNN is proposed in [30] for background-foreground segmentation task. Various researchers used learning based techniques for motion saliency detection or salient object detection. Recently, motion saliency (static and dynamic) based approach [27] used salient object detection from videos. The pixel-level and frames-level learning of semantic and temporal feature learning based mechanism with pre-trained model is proposed in [9] with concept of encoder-decoder network for moving object detection. Here, VGG-16 architecture is used as encoder to extract low resolution features with 13-convolution layers and decoder network is implemented using 13-deconvolution (one for each convolution) layer. The existing spatio-temporal feature learning based saliency models gives insufficient performance to handle a unconstrained videos. The prime task of any moving object detection or video salient object detection method is to suppress background information and highlight moving/salient objects. From the above literature, the MOS task broadly involves two major steps that are background generation and saliency estimation followed by moving objects detection. The learning based methods produces fruitful results in different application of computer vision. But, the computational complexity is major concern for these methods. Also, one more limitation of these methods is not


Fig. 1. Overview of the proposed OBJECTNet for MOS. (Note: ConvEd: EncoderConvolutionalLayer, ConvDe: DecoderDe-convolutionalLayer, NCh: Number of Channels)

able to achieve significant improvement in accuracy with different scenario. These limitations motivate us to propose a compact network for MOS task in different weather degraded video. In this context, the compact encoder-decoder network is proposed with cascaded technique and motion saliency learning for MOS task. The experimental analysis is carried out on videos having challenging situation like bad weather, dynamic background, etc. The major contributions of the work are: 1. The learning based motion saliency is estimated using input video frames and estimated background. 2. The compact cascaded encoder-decoder network is proposed to improve the foreground detection probability with the help of estimated motion saliency and video frames. 3. The proposed approach gives remarkable improvement in segmentation accuracy and also shows significant decrement in network complexity as compared to the existing methods for MOS on benchmark datasets. The proposed approach is named as OBJECTNet i.e. moving OBJect detection using Cascaded Encoder-decoder Network. The performance of the proposed OBJECTNet is tested on four categories of ChangeDetection.net (CDnet)-2014 [29] and 4 videos of perception test images sequences (PTIS ) [26] benchmark datasets for MOS in videos.

3 Proposed System Framework

Inspired from literature [9] and [28], the novel compact OBJECTNet is proposed for MOS task. There are three major steps involved in the proposed OBJECTNet: (a) background generation with the help of initial consecutive video frames, (b) learning based motion saliency estimation, (c) foreground segmentation using compact cascaded encoder-decoder network for MOS. The detailed architectural analysis is given in the subsections.
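As a purely illustrative data-flow summary of these three steps (all names below are ours, and the components are placeholders for the modules detailed in the following subsections):

```python
def objectnet_pipeline(frames, estimate_background, difference_layer, stage1, stage2):
    """Sketch of the overall OBJECTNet flow: (a) background generation,
    (b) motion-saliency estimation, (c) cascaded foreground segmentation."""
    background = estimate_background(frames)            # Sect. 3.1
    masks = []
    for frame in frames:
        saliency = difference_layer(background, frame)  # Sect. 3.2
        prior = stage1(saliency)                        # first encoder-decoder
        masks.append(stage2(prior, frame))              # second, cascaded encoder-decoder
    return masks
```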

3.1 Background Estimation

As background estimation is a key step in many real-time video processing applications, various researchers have tried to overcome the problems of non-static (waving tree) backgrounds, irregular motion of objects, bad weather (snow fall) and sudden illumination changes. Recently, Roy et al. [22] introduced the concept of an adaptive histogram combining local and global information for background modelling. As motion is mainly related to temporal information, single channel (gray scale) information is enough for background modelling. In this work, a simple approach is used for background estimation from the N initial single-channel video frames. Using Eq. (1), the background image is estimated for a particular video by considering its N initial frames:

BG = \frac{1}{N} \sum_{i=1}^{N} I(x, y, t_i)   (1)
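Equation (1) is simply a per-pixel temporal mean; a minimal NumPy sketch (function name ours):

```python
import numpy as np

def estimate_background(frames):
    """Eq. (1): temporal mean of the first N single-channel frames.
    `frames` is an array of shape (N, H, W)."""
    return np.asarray(frames, dtype=np.float32).mean(axis=0)
```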

The estimated background is able to handle the partial effects of dynamic background, snow fall and shadow conditions. The saliency map retains only the relevant frame area related to the motion of objects and suppresses the background information that is not used for foreground segmentation. Thus, motion saliency estimation plays an important role in MOS, and is elaborated below.

3.2 Motion Saliency Estimation

To estimate the motion saliency for a particular video frame, the estimated background and the input video frame are given as input to a CNN. The detailed discussion about motion saliency estimation is given below. Difference Layer: The estimated background is concatenated temporally with the input video frame and used as input to the difference layer. The difference operation is implemented using a convolution layer of the CNN with fixed weights of +1 and −1 for the estimated background and the input video frame, respectively, and a zero learning rate. The output of the difference layer is an approximate motion saliency map, which gives prominent edge information related to the foreground object. Also, strong edges related to a non-static background may be present in the motion saliency. From the estimated motion saliency, foreground segmentation is performed using a compact cascaded encoder-decoder network, discussed in the next subsection.
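A minimal PyTorch sketch of such a fixed-weight difference layer, assuming the background and frame are supplied as single-channel tensors (the class name is ours):

```python
import torch
import torch.nn as nn

class DifferenceLayer(nn.Module):
    """Fixed (non-trainable) convolution realising the difference operation:
    +1 weight for the estimated-background channel, -1 for the current frame."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(in_channels=2, out_channels=1, kernel_size=1, bias=False)
        self.conv.weight.data = torch.tensor([[[[1.0]], [[-1.0]]]])  # shape (1, 2, 1, 1)
        self.conv.weight.requires_grad = False                       # zero learning rate

    def forward(self, background, frame):
        # background, frame: (B, 1, H, W); output: approximate motion saliency map
        return self.conv(torch.cat([background, frame], dim=1))
```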

3.3 Cascade Encoder-Decoder Network

The existing networks used for MOS achieved promising results with pre-trained networks like VGG-16 [9]. As the existing networks directly use video frames for MOS, their computational complexity is a major drawback for real-time applications. This drawback inspires us to reduce the computational complexity while achieving a significant improvement in segmentation accuracy. To do this, a compact cascaded encoder-decoder network is proposed with the estimated motion saliency as input. The encoder network is proposed with two convolution layers followed by


Fig. 2. Qualitative comparison of existing state-of-the-art methods with proposed OBJECTNet for bad weather (BW), dynamic background (DB), intermittent object motion (IOM) and shadow (SD) video categories from CDnet-2014 dataset.

max pooling with a stride of two. Corresponding to each convolution layer of the encoder, a deconvolution layer with an upsampling factor of two is used in the decoder network. The non-linear BiReLU activation function [5] is used instead of ReLU. Each convolution layer of the encoder network performs convolution followed by a max pooling operation. The spatial coherence between adjacent pixels of the first encoder-decoder output is low in some regions where properties appear to be the same. To enhance the spatial coherence among the pixels of the foreground object, the estimated output of the first encoder-decoder is cascaded with a second encoder-decoder network. The input to the second encoder-decoder network has two channels: the first is the estimated foreground probability map and the second is the original input video frame. The overall flow of the proposed network for MOS is shown in Fig. 1.
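The sketch below illustrates one way such a compact cascaded encoder-decoder could be wired up in PyTorch. It is not the authors' exact architecture: plain ReLU stands in for the BiReLU of [5], and the channel widths, kernel sizes and sigmoid output are illustrative guesses.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Compact two-level encoder-decoder: two conv + max-pool stages and two
    mirrored deconvolution (upsampling-by-two) stages."""
    def __init__(self, in_ch, feat=12):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.enc2 = nn.Sequential(nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(feat, feat, 2, stride=2), nn.ReLU())
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(feat, 1, 2, stride=2), nn.Sigmoid())

    def forward(self, x):
        return self.dec1(self.dec2(self.enc2(self.enc1(x))))

class CascadedOBJECTNet(nn.Module):
    """Stage 1 maps the motion-saliency map to a foreground-probability map;
    stage 2 refines it from a two-channel input (probability map + original frame)."""
    def __init__(self):
        super().__init__()
        self.stage1 = EncoderDecoder(in_ch=1)
        self.stage2 = EncoderDecoder(in_ch=2)

    def forward(self, saliency, frame):
        prior = self.stage1(saliency)
        return self.stage2(torch.cat([prior, frame], dim=1))
```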

4 Training of Proposed OBJECTNet

The training data is prepared using four categories of the CDnet-2014 [29] database. In total, 11877 video frames are collected with their respective ground truth. Out of these frames, training is performed using 7125 video frames and validation of the proposed approach is carried out with the help of the remaining video frames. Random weights are used to initialize the proposed network. For training the proposed OBJECTNet, a learning rate of 0.001 and a batch size of 16 are used with the stochastic gradient descent (SGD) back-propagation algorithm. During training, the loss is minimized with the help of the mean square error. The training is carried out on a PC with a 4.20 GHz Intel Core i7 processor and an NVIDIA GTX 1080 11 GB GPU.
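A hedged sketch of this training configuration (SGD, learning rate 0.001, batch size 16, MSE loss), reusing the CascadedOBJECTNet sketch above; the random data loader is only a stand-in for batches of (saliency, frame, ground truth):

```python
import torch
import torch.nn as nn

model = CascadedOBJECTNet()                       # from the sketch above
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

# Placeholder loader: one random batch of 16 samples of size 64x64.
train_loader = [(torch.rand(16, 1, 64, 64), torch.rand(16, 1, 64, 64), torch.rand(16, 1, 64, 64))]

for epoch in range(1):
    for saliency, frame, gt in train_loader:
        optimizer.zero_grad()
        pred = model(saliency, frame)             # predicted foreground probability map
        loss = criterion(pred, gt)                # mean-square-error loss
        loss.backward()
        optimizer.step()
```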

5 Results Analysis

The success of the proposed OBJECTNet is validated on various challenging visual scenarios from standard databases used for MOS. In this work, 12 videos from four categories of the ChangeDetection.net (CDnet)-2014 [29] database and 4 videos from the perception test images sequences (PTIS) [26] database are used to examine the effectiveness of the proposed OBJECTNet in terms of average F-measure as the accuracy parameter.

5.1 Performance on Videos of CDnet-2014

The CDnet-2014 database is one of the standard databases used for the moving object segmentation task. Videos with dynamic background (boat, canoe and overpass), bad weather (blizzard, skating and snowfall), the effect of shadow (bungalows, copymachine and peopleInShade) and irregular motion of objects (abandonedBox, parking and streetLight) are used to examine the effectiveness of the proposed OBJECTNet.

Table 1. Video-wise accuracy comparison analysis in terms of average F-measure on the CDnet-2014 database (PM: Proposed Method)

Video      DBS [1]  MCN [30]  IUT [2]  SMG [4]  WSE [15]  SCD [14]  PAS [24]  PM
parking    0.5971   0.8348    0.6482   0.6867   0.8216    0.8035    0.8192    0.9522
abd        0.5567   0.8519    0.9019   0.8259   0.8427    0.731     0.8135    0.9596
STL        0.9161   0.9313    0.9892   0.9627   0.8911    0.8856    0.9464    0.9685
CM         0.9434   0.9483    0.926    0.964    0.9217    0.8899    0.9143    0.9691
bunglows   0.8492   0.9523    0.8392   0.9259   0.8374    0.8161    0.8986    0.9649
PPS        0.9197   0.959     0.9103   0.9174   0.8948    0.885     0.8387    0.9853
canoe      0.9794   0.9869    0.9462   0.9515   0.6131    0.9177    0.9379    0.9493
boats      0.8121   0.9624    0.7532   0.9795   0.6401    0.8503    0.8416    0.9862
blizzard   0.6115   0.9079    0.8542   0.8454   0.8584    0.8175    0.7737    0.7642
skating    0.9669   0.9653    0.9156   0.8997   0.8732    0.8604    0.8984    0.9028
snowFall   0.8648   0.9474    0.8453   0.8896   0.8979    0.8315    0.8393    0.9789
overpass   0.9416   0.925     0.9272   0.9763   0.7209    0.8489    0.959     0.9365
Average    0.851    0.9398    0.8917   0.9216   0.8174    0.8485    0.8783    0.9427


Table 2. Comparison of average F-measure with different state-of-the-art deep networks used for MOS (IOM: intermittent object motion, PM: Proposed Method)

Category   DBS [1]  MSCNN [30]  GoogLeNet [9]  VGG-16 [9]  ResNet [9]  PM
BadWeath   0.9244   0.9459      0.7961         0.8949      0.9461      0.9461
Shadow     0.9041   0.9532      0.8049         0.9084      0.9647      0.9745
DyanBG     0.8012   0.9524      0.6588         0.7356      0.8225      0.9452
IOM        0.6914   0.8722      0.6488         0.7538      0.8453      0.9675
Average    0.8298   0.9309      0.7271         0.8231      0.8945      0.9662

318

P. W. Patil et al.

Table 3. Video-wise average F-measure comparison of various methods for MOS on PTIS dataset. Methods

Curtains Campus WaterSurf Fountains Average

DECOLOR [35] 0.8956

0.7718

0.6403

0.8675

0.7936

MODSM [13]

0.9094

0.7876

0.9402

0.8203

0.8647

OMoGMF [33]

0.9257

0.6589

0.9314

0.8258

0.8353

SGSM-BS [23]

0.9358

0.8312

0.9285

0.8717

0.8919

OBJECTNet

0.9556

0.7994

0.9417

0.9543

0.9128

The quantitative comparison of proposed OBJECTNet with other existing methods is illustrated in Table 1. Also, the proposed method segmentation accuracy in terms of average F-measure is compared with existing state-of-the-art deep learning approaches like GoogLeNet, VGG-16, ResNet, etc. Table 2 gives segmentation accuracy comparison of various deep learning methods in terms of average F-measure. Along with quantitative results, the proposed OBJECTNet is also compared with other existing methods qualitatively as shown in Fig. 2. The proposed OBJECTNet gives significant improvement in segmentation accuracy as compared to existing methods used for MOS. From Tables 1, 2 and Fig. 3, it is evident that the proposed OBJECTNet outperforms (qualitatively as well as quantitatively) the existing methods. In above subsection of result analysis, the segmentation accuracy is evaluated similar to traditional deep learning based methods [9] and [21] for MOS. Further, to prove the effectiveness of the proposed OBJECTNet, results are tested on

Fig. 3. Qualitative comparison of curtain and water surface video of PTIS database for MOS.

Saliency-Based Cascaded Approach for Moving Object Segmentation

319

cross data (the measurement of segmentation accuracy with different data which is not used for training) of PTIS dataset for MOS. 5.2

Performance on Videos of PTIS Database

The PTIS database comprises nine videos with variety of practical circumstances like illumination changes, dynamic and static background with little duration persistent objects. From PTIS database, four videos having non static background like water surface (WS ), fountain (FT ), curtain (CR) and campus (CP ) are considered for accuracy measurement. Table 3 and Fig. 3 gives the quantitative and qualitative comparison of the existing methods in terms of average F-measure for MOS respectively on PTIS database. The experimental results on PTIS database is significantly improved from existing state-of-the-art methods for MOS is observed. Table 4. Computational complexity comparisons of existing methods for MOS(PR: Parameter, PM: Proposed Method ) PR

PM

VGG-16 [9]

GoogLeNet [9]

ResNet [9]

Conv

4

13

22

17

Deconv

4

13

22

17

Filters

12 × 2

2688 × 2

256 × 2

3376 × 2

3×3

7×7

3 × 3, 7 × 7

FilterSize 3 × 3 FP

3 × 3 × 12 × 2 × 2 3 × 3 × 2688 × 2 7 × 7 × 256 × 2 (7 × 7 × 192 + 3 × 3 × 3584) × 2

CO 1 112 58 FP: Filter parameter, CO: Complexity over proposed OBJECTNet

5.3

192

Computational Complexity Analysis

The computational complexity is prime concern for any MOS method. Here, existing learning based approach [9] and [30] are compared with proposed OBJECTNet in terms of number of filter parameters, convolution and deconvolution layer used for implementation purpose. The computational complexity of proposed OBJECTNet shows marginal decrement as compared to existing deep network like VGG-16, ResNet, etc. used for MOS illustrated in Table 4. From above three subsections, it is evident that the proposed OBJECTNet gives the significant improvement on videos like non-static background, weather degraded or irregular motion of object videos from CDnet-2014 and PTIS database in terms of F-measure both quantitatively (see Tables 1, 2 and 3) and qualitatively (see Figs. 2 and 3). Also, the proposed OBJECTNet shows significant decrement in computational complexity as compared to other existing state-of-the-art methods for MOS.

320

6

P. W. Patil et al.

Conclusion

In this work, the problems associated with environmental condition like nonstatic background, bad weather, irregular motion of object, effect of shadow and computational complexity of existing network for MOS are addressed. The proposed OBJECTNet overcome these problems with the help of estimated motion saliency and compact cascaded encoder-decoder network. The spatial coherence of obtained foreground pixels using single encoder-decoder network is further enhanced with cascaded architecture. The effectiveness of proposed OBJECTNet is evaluated on the challenging videos like non-static background, bad weather scenarios from CDnet-2014 and PTIS database. The qualitative analysis of proposed OBJECTNet is observed by comparing existing methods for MOS. From considerable improvement in segmentation accuracy and decrement in computational complexity, the proposed OBJECTNet is satisfactory for real-time applications.

References 1. Babaee, M., Dinh, D.T., Rigoll, G.: A deep convolutional neural network for video sequence background subtraction. Pattern Recogn. 76, 635–649 (2018) 2. Bianco, S., Ciocca, G., Schettini, R.: Combination of video change detection algorithms by genetic programming. IEEE Trans. Evol. Comput. 21(6), 914–928 (2017) 3. Biradar, K.M., Gupta, A., Mandal, M., Vipparthi, S.K.: Challenges in time-stamp aware anomaly detection in traffic videos. arXiv preprint arXiv:1906.04574 (2019) 4. Braham, M., Pi´erard, S., Van Droogenbroeck, M.: Semantic background subtraction. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 4552–4556. IEEE (2017) 5. Cai, B., Xu, X., Jia, K., Qing, C., Tao, D.: DehazeNet: an end-to-end system for single image haze removal. IEEE Trans. Image Process. 25(11), 5187–5198 (2016) 6. Chaudhary, S., Murala, S.: Depth-based end-to-end deep network for human action recognition. IET Comput. Vision 13(1), 15–22 (2018) 7. Chaudhary, S., Murala, S.: TSNet: deep network for human action recognition in hazy videos. In: 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 3981–3986. IEEE (2018) 8. Chen, X., Shen, Y., Yang, Y.H.: Background estimation using graph cuts and inpainting. In: Proceedings of Graphics Interface 2010, Canadian Information Processing Society, pp. 97–103 (2010) 9. Chen, Y., Wang, J., Zhu, B., Tang, M., Lu, H.: Pixel-wise deep sequence learning for moving object detection. IEEE Trans. Circuits Syst. Video Technol. 29, 2567– 2579 (2017) 10. Dudhane, A., Murala, S.: C∧ 2MSNet: a novel approach for single image haze removal. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1397–1404. IEEE (2018) 11. Dudhane, A., Murala, S.: Cardinal color fusion network for single image haze removal. Mach. Vis. Appl. 30(2), 231–242 (2019). https://doi.org/10.1007/s00138019-01014-y 12. Dudhane, A., Murala, S.: CDNet: single image de-hazing using unpaired adversarial training. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1147–1155. IEEE (2019)

Saliency-Based Cascaded Approach for Moving Object Segmentation

321

13. Guo, X., Wang, X., Yang, L., Cao, X., Ma, Y.: Robust foreground detection using smoothness and arbitrariness constraints. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 535–550. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10584-0 35 ¨ ¨ 14. I¸sık, S ¸ ., Ozkan, K., G¨ unal, S., Gerek, O.N.: SWCD: a sliding window and selfregulated learning-based background updating method for change detection in videos. J. Electron. Imaging 27(2), 023002 (2018) 15. Jiang, S., Lu, X.: WeSamBE: a weight-sample-based method for background subtraction. IEEE Trans. Circuits Syst. Video Technol. 28, 2105–2115 (2017) 16. Liang, C.W., Juang, C.F.: Moving object classification using a combination of static appearance features and spatial and temporal entropy values of optical flows. IEEE Trans. Intell. Transp. Syst. 16(6), 3453–3464 (2015) 17. Lin, H.H., Liu, T.L., Chuang, J.H.: Learning a scene background model via classification. IEEE Trans. Signal Process. 57(5), 1641–1654 (2009) 18. Lin, Y., Tong, Y., Cao, Y., Zhou, Y., Wang, S.: Visual-attention-based background modeling for detecting infrequently moving objects. IEEE Trans. Circuits Syst. Video Technol. 27(6), 1208–1221 (2017) 19. Patil, P., Murala, S.: FgGAN: a cascaded unpaired learning for background estimation and foreground segmentation. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1770–1778. IEEE (2019) 20. Patil, P., Murala, S., Dhall, A., Chaudhary, S.: MsEDNet: multi-scale deep saliency learning for moving object detection. In: 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 1670–1675. IEEE (2018) 21. Patil, P.W., Murala, S.: MSFgNET: a novel compact end-to-end deep network for moving object detection. IEEE Trans. Intell. Transp. Syst. 20, 4066–4077 (2018) 22. Roy, S.M., Ghosh, A.: Real-time adaptive histogram min-max bucket (HMMB) model for background subtraction. IEEE Trans. Circuits Syst. Video Technol. 28(7), 1513–1525 (2018) 23. Shi, G., Huang, T., Dong, W., Wu, J., Xie, X.: Robust foreground estimation via structured gaussian scale mixture modeling. IEEE Trans. Image Process. 27(10), 4810–4824 (2018) 24. St-Charles, P.L., Bilodeau, G.A., Bergevin, R.: A self-adjusting approach to change detection based on background word consensus. In: 2015 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 990–997. IEEE (2015) 25. Thengane, V.G., Gawande, M.B., Dudhane, A.A., Gonde, A.B.: Cycle face aging generative adversarial networks. In: 2018 IEEE 13th International Conference on Industrial and Information Systems (ICIIS), pp. 125–129. IEEE (2018) 26. Wang, N., Yao, T., Wang, J., Yeung, D.-Y.: A probabilistic approach to robust matrix factorization. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7578, pp. 126–139. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33786-4 10 27. Wang, W., Shen, J., Shao, L.: Video salient object detection via fully convolutional networks. IEEE Trans. Image Process. 27(1), 38–49 (2018) 28. Wang, W., Shen, J., Yang, R., Porikli, F.: Saliency-aware video object segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 40(1), 20–33 (2018) 29. Wang, Y., Jodoin, P.M., Porikli, F., Konrad, J., Benezeth, Y., Ishwar, P.: CDnet 2014: an expanded change detection benchmark dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 387–394 (2014) 30. 
Wang, Y., Luo, Z., Jodoin, P.M.: Interactive deep learning method for segmenting moving objects. Pattern Recogn. Lett. 96, 66–75 (2017)

322

P. W. Patil et al.

31. Xi, T., Zhao, W., Wang, H., Lin, W.: Salient object detection with spatiotemporal background priors for video. IEEE Trans. Image Process. 26(7), 3425–3436 (2017) 32. Yeh, C.H., Lin, C.Y., Muchtar, K., Lai, H.E., Sun, M.T.: Three-pronged compensation and hysteresis thresholding for moving object detection in real-time video surveillance. IEEE Trans. Industr. Electron. 64(6), 4945–4955 (2017) 33. Yong, H., Meng, D., Zuo, W., Zhang, L.: Robust online matrix factorization for dynamic background subtraction. IEEE Trans. Pattern Anal. Mach. Intell. 40(7), 1726–1740 (2018) 34. Zheng, J., Wang, Y., Nihan, N., Hallenbeck, M.: Extracting roadway background image: mode-based approach. Transp. Res. Rec. J. Transp. Res. Board 1944, 82–88 (2006) 35. Zhou, X., Yang, C., Yu, W.: Moving object detection by detecting contiguous outliers in the low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell. 35(3), 597–610 (2013)

A Novel Graph Theoretic Image Segmentation Technique Sushmita Chandel and Gaurav Bhatnagar(B) Indian Institute of Technology Jodhpur, Karwar, India {chandel.1,goravb}@iitj.ac.in

Abstract. In this paper, a novel graph theoretic image segmentation technique is proposed, which utilizes forest concept for clustering. The core idea is to obtain a forest from the image followed by construction of average value super pixels. Thereafter, a merging criterion is proposed to merge these super pixels into two big classes thereby binarizing and thresholding the image separating background from foreground. Extensive experimentation and comparative analysis are finally performed on a diverse set of images to validate the technique and have noted the significant improvements. Keywords: Image segmentation · Graphs · Weighted graphs · Forest

1 Introduction Image segmentation is defined as partitioning an image into its constituent objects/regions. It finds various applications like in industrial inspection of assembled products to detect any defect without opening it, tracking objects in sequence of images, classification of terrains visible in satellite images, detection and measurement of bones, tissues in medical images, biological analysis and motion analysis. A good segmentation technique is the one that is apt for a specific purpose and thus there are various segmentation techniques to deal with different kind of images and for different purposes [1, 2]. In general a region/object in an image can be identified by either grouping similar pixels as one region/object on the basis of similarity of gray levels or by evidence of a boundary separating one region/object from another on the basis of discontinuities or abrupt changes in gray levels. Thus image segmentation techniques can be broadly divided into three major categories, which are based on: (1) Discontinuities, (2) Similarity, (3) Clustering Techniques [3]. Techniques based on discontinuities are also known as Edge based Image Segmentation techniques. They find the edges of the object thereby separating one object from another or separating foreground from background. There are many edge detection methods like Sobel operator, Prewitt technique [4], Canny detector [5], Roberts operator [6], Laplacian detector and Marr-Hildreth detector [7]. Generally, these techniques are easy to implement but are having a major limitation of not handling texture information. The second category of techniques works on finding an optimal threshold value based on © Springer Nature Singapore Pte Ltd. 2020 N. Nain et al. (Eds.): CVIP 2019, CCIS 1147, pp. 323–333, 2020. https://doi.org/10.1007/978-981-15-4015-8_29


some attribute quality or similarity measure between the original image and its binarized counterpart. This category generally includes techniques based on the histogram [10, 11] and on regions [9, 12, 13] of the image, and such techniques generally suffer from misclassification errors. The final category of techniques works on grouping data points into homogeneous regions that naturally belong together based on some criterion of human visual perception [15, 16]. These techniques provide a strong theoretical analysis of the image, casting image segmentation as a compact mathematical structure [16]. Recently, graph theoretic image segmentation has emerged as a branch of image segmentation that uses the underlying principles of graph theory to first model an image as a graph, then segment it into regions based on the application considered, and finally display the segmented graph as an image. Among these, minimum-spanning-tree-based and graph-cut-based image segmentation techniques, which are basically clustering techniques, are popular [17-19]. The remainder of this paper is organized as follows. Section 2 provides a detailed description of the proposed segmentation technique, followed by the results and discussions in Sect. 3. Finally, concluding remarks are given in Sect. 4.

2 Proposed Segmentation Technique

In this section, some motivating factors in the design of the proposed approach to segmentation are discussed. The proposed technique takes an image and gives a binary image as the thresholded output. The detailed description of the proposed technique can be summarized as follows.

2.1 Images and Graphs

This sub-section essentially gives an overview of how to convert an image into a graph and vice versa. Generally, a graph is represented as G = (V, E), where V is a set of vertices and E is a set of edges between vertices. Owing to this representation of graphs, the conversion between graph and image can be summarized as follows. Without loss of generality, let I be a grayscale image of dimension M × N. In the first step, map the pixel I_{i,j} to a vertex v_{(M×j)+i}, where i = 0, 1, 2, ..., M−1 and j = 0, 1, 2, ..., N−1. In the next step, each pixel is connected to its 8 nearest neighbors, thereby placing an undirected weighted edge E_{i,j} = (v_i, v_j) joining vertices v_i and v_j of the graph. The weight associated with an edge E_{i,j} is w_{ij} = |τ_i − τ_j|, where τ_i and τ_j are the intensity values associated with vertices v_i and v_j, respectively. It may be noted that the boundary pixels do not have all the neighbors. In contrast, if all the neighbors are considered for an internal pixel, multiple edges will be obtained, which in turn produces loops in the graph. Therefore, in order to delete redundancies (multiple edges/loops), only four of the eight neighbor directions are considered. Thus, the following are the directions for a current pixel I_{i,j} (a code sketch follows the list):

(1) East neighbor: I_{i,j+1}
(2) South neighbor: I_{i+1,j}
(3) South-East neighbor: I_{i+1,j+1}
(4) North-East neighbor: I_{i−1,j+1}
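As an illustration of the mapping just described, the following is a minimal Python/NumPy sketch; the function name and the edge-list representation are ours, not the paper's.

```python
import numpy as np

def image_to_grid_graph(img):
    """Map a grayscale image to an undirected weighted grid graph.

    Vertices are indexed as v = M*j + i for pixel (i, j), following the
    paper's convention, and each pixel is joined to its E, S, SE and NE
    neighbours with weight |tau_i - tau_j| (absolute intensity difference).
    """
    M, N = img.shape                 # M rows, N columns
    edges = []                       # list of (weight, vertex_u, vertex_v)
    # Offsets (di, dj) for East, South, South-East and North-East neighbours.
    for i in range(M):
        for j in range(N):
            u = M * j + i
            for di, dj in ((0, 1), (1, 0), (1, 1), (-1, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < M and 0 <= nj < N:
                    v = M * nj + ni
                    w = abs(int(img[i, j]) - int(img[ni, nj]))
                    edges.append((w, u, v))
    return M * N, edges

# Example: a tiny 3x3 image yields 9 vertices and the corresponding edge list.
num_vertices, edges = image_to_grid_graph(np.arange(9).reshape(3, 3))
```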


Therefore, an undirected weighted graph with M × N vertices is obtained from an image of size M × N. This graph is usually known as a grid graph, and it is then used for segmentation. Once segmentation has been performed, the resultant graph can be viewed as a forest comprising k trees, where each tree represents a region in the segmented image. For instance, k = 2 signifies that there are only two regions in the segmented image. Let us assign a distinct label ℓ_i to each tree from a set of labels Σ = {ℓ_1, ℓ_2, ..., ℓ_k} ⊆ Z+. Clearly, all the vertices in a tree share the common label assigned to that tree from the set Σ. These labels can be any random values or the average of all the vertices of the tree. A vertex v_x is then mapped back to pixel I_{i,j}, where i = x mod M and j = ⌊x/M⌋, thus forming k distinct segments {S_1, S_2, ..., S_k} in the segmented image such that the label (ℓ_k) associated with each segment (S_k) is a valid intensity of that segment and all pixels lying in the segment have the same intensity value.

2.2 Construction of Forest

In this sub-section, the main steps to construct the forest from the grid graph obtained in Subsect. 2.1 are illustrated, following [19]. Mathematically, an undirected graph with no cycles is called a forest. In order to understand the algorithm, let us first define a few terms.

Definition 1 (Internal Difference of a connected component). The internal difference of a connected component C ⊆ V is defined as

INT(C) = max_{(v_i, v_j) ∈ MST(C, E)} W_{ij}     (1)

where MST(C, E) is the minimum spanning tree of the component C and W_{ij} is the weight of the edge (v_i, v_j) joining vertices v_i and v_j. Because the edges are taken in ascending order of their weights, INT(C) is the weight of the edge that causes the merging of two components.

Definition 2 (Difference between two connected components). The difference between two connected components C_1, C_2 ⊆ V is given as

DIFF(C_1, C_2) = min_{(v_i, v_j) ∈ E, v_i ∈ C_1, v_j ∈ C_2} W_{ij}     (2)

Practically, DIFF(C_1, C_2) is the weight W_{ij} of the current edge (v_i, v_j) under consideration.

Definition 3 (Minimum Internal Difference). The minimum internal difference between two connected components C_1, C_2 ⊆ V is given as

MINT(C_1, C_2) = min_{i=1,2} ( INT(C_i) + λ / |C_i| )     (3)

where λ is a parameter that defines how large the segments should be. In principle, a larger value of λ results in larger connected components.


Definition 4 (Pairwise Comparison Predicate). The pairwise comparison predicate determines the existence of a boundary between two connected components C_1, C_2 ⊆ V. Mathematically, it is defined as

P(C_1, C_2) = 1 if DIFF(C_1, C_2) > MINT(C_1, C_2), and 0 otherwise.     (4)

Based on Definitions 1-4, an algorithm is discussed below which essentially provides a forest of the graph such that every tree inside it is a minimum spanning tree. The complete algorithm can be summarized as follows (a code sketch is given after the list):

(1) Initialize the forest F = (V, E') with only vertices and no edges (E' = ∅).
(2) Arrange E in order of increasing edge weight.
(3) Add edge (v_i, v_j) ∈ E to F = (V, E') if it does not form a cycle and if P(C_iset, C_jset) = 0; otherwise discard the edge (v_i, v_j). Here C_iset and C_jset are the connected components containing vertices v_i and v_j, respectively.
(4) If the edge (v_i, v_j) is added to F = (V, E'), update INT(C_iset) = W_{ij}, where W_{ij} is the weight associated with edge (v_i, v_j).
(5) Repeat Step (3) until all the edges are exhausted.
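The sketch below follows these steps with a union-find structure; the value of λ, the class name and the function name are illustrative assumptions, not taken from the paper.

```python
class DisjointSetForest:
    """Union-find structure tracking, per component, its size and INT value."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.internal = [0.0] * n   # INT(C): weight of the edge that last merged C

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b, w):
        ra, rb = self.find(a), self.find(b)
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        self.internal[ra] = w        # step (4): update INT of the merged component
        return ra

def build_forest(num_vertices, edges, lam=300.0):
    """Steps (1)-(5): add each edge unless the pairwise predicate signals a boundary."""
    dsf = DisjointSetForest(num_vertices)
    for w, u, v in sorted(edges):                    # step (2): ascending weight
        ru, rv = dsf.find(u), dsf.find(v)
        if ru == rv:                                 # adding it would form a cycle
            continue
        mint = min(dsf.internal[ru] + lam / dsf.size[ru],
                   dsf.internal[rv] + lam / dsf.size[rv])
        if w <= mint:                                # P(C1, C2) = 0: no boundary
            dsf.union(ru, rv, w)
    return dsf
```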

2.3 Proposed Segmentation Process

In this sub-section, a novel graph theoretic image segmentation technique is proposed. The process is initiated by mapping the image into a weighted graph, followed by construction of the forest. The forest is then mapped back to the image to obtain the average super-pixel image. In the transitional step, large and small regions are identified in the average super-pixel image, and small regions are finally merged into larger regions to generate the final segmented image. A detailed description of the proposed segmentation technique can be summarized as follows:

(1) Map image I to an undirected graph G = (V, E) as mentioned in Sect. 2.1.
(2) Perform the algorithm of Sect. 2.2 on G = (V, E) to obtain a forest F = (V, E'). Let the forest comprise k trees, represented by T_p, p = 1, 2, ..., k.
(3) Evaluate the label of each tree as

ℓ_p = ( Σ_{v_i ∈ T_p} τ_i ) / |T_p|     (5)

where |T_p| represents the total number of vertices in tree T_p; it is possible that distinct trees have the same label.
(4) Map the forest F = (V, E') back to an image Ĩ as mentioned in Sect. 2.1. Let us term this image the average super-pixel image. It is worth mentioning that every tree T_p is reflected as a region (R_p) in Ĩ.


(5) Fix the number α (α ≤ k), which represents the total number of clusters in the segmented image.
(6) Obtain the total number of pixels in the α-th largest region. Let it be β.
(7) Identify big and small regions (R_p^θ) as follows:

θ = 1 if |R_p| ≥ β, and 0 otherwise     (6)

where θ = 1 and θ = 0 indicate big and small regions, respectively.
(8) Obtain the final segmented image (I_s) by replacing the label associated with each region R_p^0 by the label of the region R_p^1 for which the label difference is minimum. A code sketch of steps (5)-(8) is given below.
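A minimal Python sketch of steps (5)-(8), assuming the average super-pixel image is given as an array in which each region carries its average-intensity label; the function name and the default α = 2 are our assumptions.

```python
import numpy as np

def merge_small_regions(label_img, alpha=2):
    """Steps (5)-(8): keep the alpha largest regions and fold every small
    region into the big region whose (average-intensity) label is closest."""
    labels, counts = np.unique(label_img, return_counts=True)
    # beta = pixel count of the alpha-th largest region
    beta = np.sort(counts)[::-1][min(alpha, len(counts)) - 1]
    big = labels[counts >= beta]
    small = labels[counts < beta]
    out = label_img.copy()
    for s in small:
        nearest_big = big[np.argmin(np.abs(big.astype(float) - float(s)))]
        out[label_img == s] = nearest_big
    return out
```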

3 Results and Discussions

In this section, the performance of the proposed technique is evaluated on a wide variety of images. All the experimental images are real images, and each image has a manually generated ground-truth image. These images contain small and large objects and objects with clear or fuzzy boundaries, and they are either noisy or smooth. All the test images and their ground-truth images are depicted in Fig. 1. The size of the images varies from 200 × 200 to 500 × 500. Further, the algorithm is implemented in

Fig. 1. (a, c, e, g, i) are Test images and (b, d, f, h, j) are their corresponding Ground truths.


Fig. 2. (a1, b1, c1, d1, e1) Thresholded images obtained by Otsu’s method; (a2, b2, c2, d2, e2) are the thresholded images obtained by [19]; (a3, b3, c3, d3, e3) are the average super pixel images obtained by applying proposed technique; (a4, b4, c4, d4, e4) are the thresholded images obtained by proposed technique.

C++ on a personal computer with a 2.4 GHz CPU and 2 GB RAM running the Windows 7 operating system. The performance of the proposed method is compared, both qualitatively and quantitatively, with the state-of-the-art techniques: Otsu's method and [19]. Otsu's method is selected because it is a benchmark thresholding technique widely used by the research community. The quantitative analysis is done using a measure of performance (η_p), which is based on the misclassification error between the individual ground-truth


images and the binary result delivered by thresholding. Mathematically, this measure is given by [23].

Table 1. Misclassification error measure (η_p) comparison

Image                  Otsu method   Ref. [19]   Proposed technique
Figure 1(a)            0.8101        0.8594      0.9125
Figure 1(c)            0.9662        0.9753      0.9551
Figure 1(e)            0.9026        0.3509      0.9392
Figure 1(g)            0.7768        0.7533      0.9448
Figure 1(i)            0.9913        0.9924      0.9829
Average performance    0.8894        0.7863      0.9469
Standard deviation     0.0842        0.2342      0.0228

η_p = ( |B_G ∩ B_T| + |F_G ∩ F_T| ) / ( |B_G| + |F_G| )     (7)

where B_G (B_T) and F_G (F_T) denote the background and foreground of the ground-truth image (resulting segmented image), and |·| represents the cardinality of the background and foreground classes. In principle, a higher value of η_p indicates a better segmentation result. The visual results of the proposed technique for qualitative analysis are presented in Fig. 2, wherein the respective average super-pixel images and the thresholded results are depicted. The quantitative performance measure η_p is listed in Table 1. As is apparent from Table 1, the proposed technique has the highest average performance of 94.69% with the lowest standard deviation of 2.28%. The higher average performance and lower standard deviation essentially show the superior performance of the proposed technique and further confirm that it can be used to obtain better segmentation results. Furthermore, consider the results for the images in Fig. 1(a) and Fig. 1(g). These are highly textured images on which Otsu's method fails to produce the expected results, which is reflected in the lower values η_p = 0.8101 and η_p = 0.7768, respectively. In contrast, the higher values obtained by the proposed technique, η_p = 0.9125 and η_p = 0.9448, confirm that it can efficiently handle texture information in the image. The performance on textured images can be further enhanced if the experimental image is pre-processed with a Gaussian filter. This pre-processing essentially suppresses the texture information and hence improves the quality of the thresholded image. This fact can also be observed from Fig. 3, where the visual comparison between the thresholded images before and after pre-processing is illustrated for the image


given in Fig. 1(a). It is clear from the figure that the quality is improved notably and misclassification error is reduced.
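For reference, Eq. (7) can be computed as in the following sketch (a hypothetical helper, assuming boolean foreground masks for the ground truth and the segmented result).

```python
import numpy as np

def eta_p(ground_truth, segmented):
    """Misclassification-error-based measure of Eq. (7).

    Both inputs are boolean arrays where True marks foreground (F) and the
    complement is background (B); the denominator is the total pixel count.
    """
    FG, FT = ground_truth.astype(bool), segmented.astype(bool)
    BG, BT = ~FG, ~FT
    return (np.logical_and(BG, BT).sum() + np.logical_and(FG, FT).sum()) / (BG.sum() + FG.sum())
```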

Fig. 3. Result before pre-processing (a) Super pixel image (b) thresholded image; Result after pre-processing (c) Super pixel image (d) thresholded image.

Table 2. Misclassification error measure comparison in a noisy environment

Image         Standard deviation of Gaussian noise
              2         5         10        15
Figure 1(a)   0.9183    0.8535    0.8963    0.8781
Figure 1(c)   0.9556    0.9518    0.9520    0.9448
Figure 1(e)   0.9086    0.8973    0.8613    0.8736
Figure 1(g)   0.7624    0.7648    0.9340    0.7757
Figure 1(i)   0.9849    0.9823    0.7306    0.8626

To test the robustness of the proposed method, it is also evaluated in a noisy environment. Here, additive white Gaussian noise (AWGN) with different standard deviations is added to the experimental images. Under these noisy conditions, the performance measure (η_p) is calculated and reported in Table 2. The qualitative results of the proposed method can be visualized in Fig. 4. From Table 2 and Fig. 4, it can be deduced that the proposed method achieves good segmentation in the presence of AWGN. Therefore, it can be concluded that the proposed technique is robust against noisy environments.


Fig. 4. Results of proposed technique in noisy environment: Results for Gaussian noise with standard deviation (a1,b1,c1,d1,e1) 2; (a2,b2,c2,d2,e2) 5; (a3,b3,c3,d3,e3) 10; (a4,b4,c4,d4,e4) 15.

4 Conclusion

An image segmentation technique based on graph-theoretic properties is proposed in this paper. The core idea is to map an image to a graph and then obtain an average-value super-pixel image. The super-pixel image is then merged into two classes based on a merging criterion. The efficiency of the proposed technique is demonstrated through extensive experiments on various images. For quantitative analysis, a misclassification error measure has been computed, which shows the robustness of the proposed technique. It is also


evident from experiments that the proposed technique produced much better results even for highly textured images.

References 1. Panchasara, C., Joglekar, A.: Application of image segmentation techniques on medial reports. Int. J. Adv. Res. Comput. Eng. Technol. (IJARCET) 6(3), 2931–2933 (2015) 2. Chavan, H.L., Shinde, S.A.: A review on application of image processing for automatic inspection. Int. J. Adv. Res. Comput. Eng. Technol. (IJARCET) 4(11), 4073–4075 (2015) 3. Sivakumar, P., Meenakshi, S.: A review on image segmentation techniques. Int. J. Adv. Res. Comput. Eng. Technol. (IJARCET) 5(3), 641–647 (2016) 4. Adlakha, D., Adlakha, D., Tanwar, R.: Analytic comparison between Sobel and Prewitt edge detection techniques. Int. J. Sci. Eng. Res. 7(1), 1482–1484 (2016) 5. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 8, 639–643 (1986) 6. Amer, G.M.H., Abushaala, A.M.: Edge detection methods. In: The Proceedings of the 2015 2nd World Symposium on Web Applications and Networking, WSWAN 2015, Tunisia, March 2015 7. Marrand, D., Hildreth, E.: Theory of edge detection. Proc. R. Soc. Lond. B Biol. Sci. 207(1167), 187–217 (1980) 8. Chaubey, A.K.: Comparison of the local and global thresholding methods in image segmentation. World J. Res. Rev. (WJRR) 2(1), 01–04 (2016) 9. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Pearson Hall, Upper Saddle River (2017) 10. Otsu, N.: A threshold selection method from gray-level histogram. IEEE Trans. Syst. Man Cybern. 9(1), 62–66 (1979) 11. Roy, P.: Adaptive thresholding: a comparative study. In: The Proceedings of International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), 10–11 July 2014 (2014) 12. Salih, Q.A., Ramli, A.R.: Region based Segmentation technique and algorithms for 3D images. In: The Proceedings of Signal Processing and its Applications Sixth International Symposium, 13–16 August 2001 (2001) 13. Lu, Y., Miao, J., Duan, L., Qiao, Y., Jia, R.: A new approach to image segmentation based on simplified region growing PCNN. Appl. Math. Comput. 205(2), 807–814 (2008) 14. Patin, T.: The Gestalt theory of perception and some of the implications for arts, submitted in partial fulfillment of the requirements for the degree of Master of Fine Arts Colorado State University Fort Collins, Colorado Fall (1984) 15. Antonio, M.H.J., Montero, J., Yáñez, J.: A divisive hierarchical k-means based algorithm for image segmentation. In: The Proceeding of IEEE International conference on Intelligent Systems and Knowledge Engineering, 15–16 November 2010 (2010) 16. Rao, P.S.: Image segmentation using clustering algorithms. Int. J. Comput. Appl. 120(14), 36–38 (2015) 17. Peng, B., Zhang, L., Zhang, D.: A survey of graph theoretical approaches to image segmentation. Pattern Recogn. 46(3), 1020–1038 (2013) 18. Morris, O.J., Lee, M.D.J., Constantinides, A.G.: Graph theory for image analysis: an approach based on shortest spanning tree. IEEE Proc. F (Commun. Radar Sign. Process.) 133(2), 146–152 (1968) 19. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. Int. J. Comput. Vision 59(2), 167–181 (2004)


20. West, D.B.: Introduction to Graph Theory. Prentice Hall, Upper Saddle River (1996) 21. Kruskal, J.B.: On the shortest spanning subtree of a graph and the travelling salesman problem. Proc. Am. Math. Soc. 7(1), 48–50 (1956) 22. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT Press, Cambridge (2009) 23. Sezgin, M., Sankur, B.: Survey over image thresholding techniques and quantitative performance evaluation. J. Electron. Imaging 13(1), 146–165 (2004)

Extraction and Recognition of Numerals from Machine-Printed Urdu Documents

Harmohan Sharma1(B), Dharam Veer Sharma2, G. S. Lehal2, and Ankur Rana2

1 Multani Mal Modi College, Patiala, Punjab, India
[email protected]
2 Punjabi University, Patiala, Punjab, India
[email protected], [email protected], [email protected]

Abstract. The work presented in this paper proposes the extraction and recognition of Urdu numerals from machine-printed Urdu documents. The feature extraction approaches considered for a numeral include Zernike moments, the Discrete Cosine Transform (DCT), zoning, Gabor filters, Directional Distance Distribution and gradient features. In addition, performance evaluation has also been done by varying the feature vector length of the Zernike and DCT features. For classification, k-Nearest Neighbor (k-NN) with different values of k and Support Vector Machines (SVM) with Linear, Polynomial and Radial Basis Function (RBF) kernel functions have been employed, with varying parameters such as the degree of the polynomial kernel and γ for the RBF kernel. 1470 samples have been used for training and 734 samples for testing the classifiers for Urdu numerals. A maximum recognition accuracy of 99.1826% has been achieved.

Keywords: Urdu numerals · Zernike · DCT · Zoning · Gabor · Directional Distance Distribution · Gradient · k-NN and SVM

1 Introduction

Urdu is a Central Indo-Aryan language of the Indo-Iranian branch, belonging to the Indo-European family of languages, spoken by more than 250 million people in India, Pakistan and other neighbouring countries. Being one of the 23 scheduled languages of India, Urdu also enjoys the status of being one of the official languages of five Indian states, and it is the national language of Pakistan. Perhaps because of the intricate amalgam of various races and their rich cultural background, the sweetness and decency of the Urdu language is truly unmatched and unparalleled. Urdu is written in Persian calligraphy, in the Nastaleeq style, which is intrinsically cursive in nature. This script follows the pattern of being written diagonally from top right to bottom left, with a clear stacking of characters. This feature creates impediments in the procedures concerning character and word segmentation. Text is written right to left in both printed and handwritten forms, whereas numbers follow the left-to-right direction. For the convenience of readers of the present generation, paging of the literature is done in the familiar Roman and/or Arabic numerals.


Urdu script in books, journals, and other published media can be classified into two broad generations: handwritten and computerized. Urdu literature dating from before 1995 is found to be handwritten, whereas literature generated after that year follows a computerized approach using the popular Urdu fonts Nastaleeq and Noori Nastaleeq. Thus, for experimentation, a corpus consisting of two sets of images, covering both generations, was considered and collected. To initiate and accumulate the images of numerals, a database of numerals was prepared by extracting numerals from the scanned pages of Urdu books using the connected component analysis technique proposed in [1] (Fig. 2).

Fig. 1. Urdu numerals

Fig. 2. (a) Document containing Urdu numerals (b) Samples of Training Data for Urdu numerals.

2 Related Work Naz et al. [2] presented a complete review on recognition of offline as well as online numerals for Urdu, Arabic and Farsi, with the more detailed approach for Urdu digit recognition. Razzak et al. [3] presented a structural feature-based approach to identify the online handwritten numerals written in both Urdu and Arabic forms in an unconstrained environment. The reported accuracies were 97.4%, 96.2%, 97.8% on applying fuzzy


logic, HMM and hybrid approaches, respectively, in experiments on 900 samples. Ansari et al. [4] presented a system for handwritten Urdu digits. From a total of 2150 samples, 2000 were utilised for training and 150 were reserved for testing. Different Daubechies wavelet transforms along with zonal densities of different zones of an image were used for feature extraction, and a back-propagation neural network was used for classification; an average recognition accuracy of 92.07% was reported. Uddin et al. [5] proposed a novel approach based on Non-negative Matrix Factorization (NMF) for offline handwritten Urdu numeral recognition, achieving around 86% accuracy on 500 images. Yusuf and Haider [6] presented a new approach to recognize handwritten Urdu numerals using shape context; 40 samples were used for training and 28 for testing, and a zero percent error rate was reported on the 28 test digits. Further, for quicker processing, Haider and Yusuf [7] also presented a gradual-pruning-based approach for accelerated recognition of handwritten Urdu digits. Husnain et al. [8] developed a novel dataset containing 800 images of each of the 10 handwritten numerals, originating from writers belonging to various social strata; 6000 images of Urdu numerals were selected as training data and the remaining 2000 as test data. The proposed model used a CNN to classify the handwritten Urdu numerals and considered the learning rate, the number of hidden neurons, and the batch size as parameters. An average accuracy of 98.03% was reported.

3 Feature Extraction

The performance of a recognition system directly depends upon the feature extraction method(s) used. There are many feature extraction methods available, and the performance of a recognition scheme may differ from script to script. In order to know the performance of various feature extraction methods on a script, it is essential to conduct experiments with these methods using various classifiers on large data pertaining to that script; it is, however, very difficult to conduct experiments with all the methods available in the literature. The feature extraction approaches considered for a numeral in the proposed work include Zernike moments, the Discrete Cosine Transform (DCT), zoning, Gabor filters, Directional Distance Distribution and gradient features. In addition, performance evaluation has also been done by varying the feature vector length of the Zernike and DCT features (Table 1).

Table 1. Features and their vector length

Features                            Size
Zoning                              25/49/58
Directional Distance Distribution   144
Gabor                               189
Discrete Cosine Transform (DCT)     32/64/100
Zernike                             36/49
Gradient                            200


3.1 Zoning

Zoning can be implemented on various forms of a character image such as the original (solid character), the character contour and the character skeleton. In addition, this method can be used with both grey-level and binary images. The use of zoning in pattern recognition serves to reduce the dimension of the feature vector. As defined by Trier et al. [9], zoning is the process in which an n × m grid is superimposed on the character image; for each of the n × m zones, the average grey level is used as a feature in the case of grey-level character images, while for binary images the percentage of black pixels in each zone is computed [9]. It is used to capture local properties of a character. Cao et al. [10] used zoning on numeral contours, where the images are divided into 4 × 4 zones. The number of pixels in the horizontal, vertical and diagonal orientations on the outer as well as the inner contour is counted in each zone and used as a feature. Overlapping zones have also been taken into consideration using a fuzzy border between two zones.

Density Value = (Number of foreground pixels) / (Total number of pixels)

For extracting these features, zones are created both horizontally and vertically by segmenting the extracted character image into windows of equal size. Density values are calculated for each window, and all such density values form the input feature vector. The images are divided into (5 × 5), (7 × 7) and (7 × 7 + 3 × 3) zones, which yields feature vectors of size 25, 49 and 58, respectively.

3.2 Directional Distance Distribution

Oh and Suen [11] proposed a distance-based feature, the Directional Distance Distribution (DDD), which has achieved the status of being the best feature on the CENPARMI database of English numerals, the NIST database of English capital letters and the PE92 database of Hangul initial sounds. In DDD, the distance of a white pixel from a black pixel, or of a black pixel from a white pixel, in all 8 possible directions is considered as the feature extraction criterion, with all 8 distances contributing to the feature vector. For every pixel in the input binary array, two sets of 8 bytes, the White set and the Black set, are allocated. For a white pixel, the White set is used to encode the distances to the nearest black pixels in the 8 directions (0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°), whereas the Black set is simply filled with the value zero. Similarly, for a black pixel, the Black set is used to encode the distances to the nearest white pixels in the 8 directions. For this feature, the input image is scaled to the size of 36 × 36 pixels. After computing the directional distance distribution of all the pixels in an image, we get 16 sub-images, one per direction and set. To down-sample the 16 × 36 × 36 (= 20736) directional feature values, the input image array is divided into 3 equal parts both horizontally and vertically, thus forming 9 zones. From each zone, 16 feature values are obtained by adding the corresponding elements of all the sets corresponding to the pixels in that zone. Therefore, the 16 features from each zone make a directional feature vector of length 16 × 9 (= 144) (Fig. 3).
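As a concrete illustration of the zoning density feature of Sect. 3.1, the following is a minimal NumPy sketch; the function name and the grid handling are ours.

```python
import numpy as np

def zoning_density(binary_char, grid=(7, 7)):
    """Density (fraction of foreground pixels) in each zone of an n x m grid;
    grid=(5, 5) gives 25 features and grid=(7, 7) gives 49 features."""
    img = binary_char.astype(bool)
    rows = np.array_split(np.arange(img.shape[0]), grid[0])
    cols = np.array_split(np.arange(img.shape[1]), grid[1])
    feats = [img[np.ix_(r, c)].mean() for r in rows for c in cols]
    return np.asarray(feats)

# The 58-dimensional variant can be built by concatenating the 7x7 and 3x3 grids:
# np.concatenate([zoning_density(img, (7, 7)), zoning_density(img, (3, 3))])
```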


Fig. 3. The 8 directions used to compute directional distribution

3.3 Gabor

A Gabor filter is a local narrow band-pass filter that is selective to both orientation and spatial frequency. The ability of Gabor filters to extract orientation-dependent frequency contents, i.e., edge-like features, from as small an area as possible suggests that Gabor-filter-based features may prove effective in classifying characters. The filter is suitable for extracting joint information in the two-dimensional spatial and frequency domains and has been widely applied to texture analysis [12], computer vision [13] and face recognition. Hamamoto et al. [14] came up with a Gabor-filter-based feature extraction method for the recognition of hand-printed English numerals. A Gabor filter is a linear filter whose impulse response is defined by a harmonic function multiplied by a Gaussian function: G(x, y) = g(x, y) × s(x, y), where G(x, y) is a Gabor function, g(x, y) is a Gaussian-shaped function known as the envelope, and s(x, y) is a complex sinusoidal harmonic function known as the carrier. The two-dimensional Gabor filter is given by

f(x, y, θ, λ, σ_x, σ_y) = exp[ −(1/2)( R_1²/σ_x² + R_2²/σ_y² ) ] × exp( i 2π R_1 / λ )

where R_1 = x cos(θ) + y sin(θ) and R_2 = −x sin(θ) + y cos(θ); λ and θ are the wavelength and orientation of the sinusoidal plane wave, respectively, and σ_x and σ_y are the standard deviations of the Gaussian envelope along the x-axis and y-axis. A rotation of the x-y plane by an angle θ results in a Gabor filter at orientation θ. The value of θ is given by θ = π(k − 1)/m, k = 1, 2, ..., m, where m, whose value is 9, indicates the number of orientations. In the proposed work, to compute the feature vector, the input image is normalized to dimension 32 × 32 and first-level zoning is applied by partitioning the normalized image into four equal non-overlapping sub-regions of size 16 × 16. In the second-level zoning, these sub-regions are further partitioned into four equal non-overlapping sub-sub-regions of size 8 × 8, and thus 16 small regions are obtained in different parts of the image. These 21 (1 + 4 + 16) images are thereafter convolved with even-symmetric and odd-symmetric Gabor filters at nine different angles of orientation, namely 0, π/9, 2π/9, 3π/9, 4π/9, 5π/9, 6π/9, 7π/9 and 8π/9 (i.e., steps of 20°), to obtain a feature vector of size 189 (1 × 9 + 4 × 9 + 16 × 9) (Fig. 4).

Fig. 4. Gabor feature extraction based on zoning

3.4 DCT

DCT is the most widely used transform in image processing applications for feature extraction. The approach involves separating the relevant coefficients after taking the transform of the image as a whole. The DCT of an image basically consists of three frequency components, namely low, middle and high, each containing some detail and information of the image. Mathematically, the 2D-DCT of an image is given by

C(u, v) = α(u) α(v) Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} cos[ π(2x + 1)u / 2M ] cos[ π(2y + 1)v / 2N ] f(x, y)

with

α(u) = √(1/M) for u = 0 and √(2/M) for u ≠ 0,  α(v) = √(1/N) for v = 0 and √(2/N) for v ≠ 0,

where C(u, v) is the DCT coefficient corresponding to f(x, y), f(x, y) is the intensity of the pixel at coordinates (x, y), u = 0, 1, 2, ..., M − 1, v = 0, 1, 2, ..., N − 1, and M × N is the size of the image. In the proposed work, the image is scaled to the size 32 × 32, therefore for this equation M = N. The DCT concentrates most of the image energy in very few coefficients. The first transform coefficient, at [0, 0], is called the DC component and the rest are called AC components. As the image is scaled to 32 × 32,


therefore a total of 1024 features (transform coefficients) can be obtained from it. But we have picked only 64 features in zigzag manner, as shown in Fig. 5.

Fig. 5. Selecting DCT coefficients of image in zigzag direction
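A sketch of the 2D-DCT defined above and of a zigzag selection of the leading coefficients follows; the implementation evaluates the formula directly with NumPy, and the function names and the exact within-diagonal scan order are our assumptions.

```python
import numpy as np

def dct2(f):
    """2D-DCT of an M x N image following the definition of C(u, v) above."""
    M, N = f.shape
    x = np.arange(M); u = np.arange(M)[:, None]
    y = np.arange(N); v = np.arange(N)[:, None]
    Cu = np.cos(np.pi * (2 * x + 1) * u / (2 * M))      # [u, x] cosine basis
    Cv = np.cos(np.pi * (2 * y + 1) * v / (2 * N))      # [v, y] cosine basis
    alpha_u = np.where(np.arange(M) == 0, np.sqrt(1 / M), np.sqrt(2 / M))[:, None]
    alpha_v = np.where(np.arange(N) == 0, np.sqrt(1 / N), np.sqrt(2 / N))[None, :]
    return alpha_u * alpha_v * (Cu @ f @ Cv.T)

def zigzag_select(C, n_features=64):
    """Scan the coefficient matrix along anti-diagonals in alternating
    directions (a zigzag, as in Fig. 5) and keep the first n_features values."""
    M, N = C.shape
    order = sorted(((u, v) for u in range(M) for v in range(N)),
                   key=lambda t: (t[0] + t[1],
                                  t[1] if (t[0] + t[1]) % 2 else t[0]))
    return np.array([C[u, v] for u, v in order[:n_features]])
```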

We have also evaluated the feature-classifier performance by decreasing and increasing the feature size to 32 and 100, respectively, simply by selecting the first 32 or 100 features in the zigzag order.

3.5 Zernike Moments

Zernike moments are used to characterize a function and to capture its significant features, and they are scalar in nature. Digital image processing applications use the global Zernike moments (ZMs) as effective image descriptors. Since Teague [15] introduced them to image analysis, ZMs have found wide application in image processing, pattern recognition and computer vision, for example in edge detection [16], image reconstruction [17-19], content-based image retrieval [20-22], image watermarking [23, 24], image recognition [25], pattern classification [26] and OCR [27]. As they possess the orthogonality property, ZMs have minimum information redundancy and therefore distinctively characterize an image. The separation of the individual contribution of moments of each order to the image reconstruction process is also feasible because of this inherent orthogonality. Zernike moments have also proved to be robust to noise, and with some geometric transformations they can be made scale and translation invariant, while the magnitudes of ZMs are rotation invariant.


A set of complex Zernike polynomials defined over the polar coordinate space inside a unit circle (i.e., x² + y² = 1) forms the core of the Zernike moments. For an image function f(r, θ), the two-dimensional ZMs of order p with repetition q are defined as [15]

Z_pq = (p + 1)/π ∫_0^{2π} ∫_0^1 f(r, θ) V*_pq(r, θ) r dr dθ,

where V*_pq(r, θ) is the complex conjugate of the Zernike polynomial V_pq(r, θ), defined as

V_pq(r, θ) = R_pq(r) e^{jqθ},

where p is a non-negative integer, 0 ≤ |q| ≤ p, p − |q| is even, j = √−1, θ = tan⁻¹(y/x) and r = √(x² + y²). The radial polynomials R_pq(r) are defined by

R_pq(r) = Σ_{s=0}^{(p−|q|)/2} [ (−1)^s (p − s)! r^{p−2s} ] / [ s! ((p + |q|)/2 − s)! ((p − |q|)/2 − s)! ]

The radial polynomials satisfy the orthogonality relation

∫_0^1 R_pq(r) R_{p'q}(r) r dr = δ_{pp'} / (2(p + 1)),

where δ_{ij} is the Kronecker delta. The set of Zernike polynomials V_pq(r, θ) forms a complete orthogonal set within the unit disk:

∫_0^{2π} ∫_0^1 V_pq(r, θ) V*_{p'q'}(r, θ) r dr dθ = π/(p + 1) δ_{pp'} δ_{qq'}

For p = p_max, the total number of ZMs is (1/2)(1 + p_max)(2 + p_max).
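A short sketch of the radial polynomial R_pq(r) and of the moment count given above (function names are illustrative):

```python
from math import factorial

def radial_poly(p, q, r):
    """Zernike radial polynomial R_pq(r) from the series definition above."""
    q = abs(q)
    assert q <= p and (p - q) % 2 == 0
    value = 0.0
    for s in range((p - q) // 2 + 1):
        num = (-1) ** s * factorial(p - s) * r ** (p - 2 * s)
        den = (factorial(s) * factorial((p + q) // 2 - s)
               * factorial((p - q) // 2 - s))
        value += num / den
    return value

def num_zernike_moments(p_max):
    """Total number of ZMs up to order p_max: (1 + p_max)(2 + p_max) / 2."""
    return (1 + p_max) * (2 + p_max) // 2

# Example: num_zernike_moments(12) -> 91
```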

4 Classification Methods

In the recognition phase, an unknown sample is submitted to the system, which decides to which class the sample belongs. For performance evaluation, the classifiers Support Vector Machine (SVM) and k-Nearest Neighbors (k-NN) are implemented. The SVM, trained with the training samples, is used to classify the test samples. For classification, Linear, Polynomial and Radial Basis Function (RBF) kernel functions have been employed. Experiments have been done with these kernels by changing parameters such as the degree of the polynomial kernel and γ for the RBF kernel.


4.1 Support Vector Machine (SVM)

One concept that is fast catching up in the field of OCR is the Support Vector Machine (SVM), owing to its exceptional convergence performance and generalization ability. It finds ample use in real-world applications such as bioinformatics, face detection in images, data mining, image processing, text classification, isolated handwritten character recognition, statistics and pattern recognition, to name a few. Detailed literature on SVM can be found in [28]. Classification tasks involve segregating data into training and testing sets. Each instance in the training set contains one "target value" (the class label) and several "attributes" (the features or observed variables). The objective of the SVM is to create a model, based on the training data, which predicts the target values of the test data given only the test data attributes. Given a training set of attribute-label pairs (x_i, y_i), i = 1, 2, ..., l, where x_i ∈ R^n and y_i ∈ {1, −1}, the SVM requires the solution of the following optimization problem:

min_{w, b, ξ}  (1/2) w^T w + C Σ_{i=1}^{l} ξ_i

subject to  y_i ( w^T φ(x_i) + b ) ≥ 1 − ξ_i,  ξ_i ≥ 0.

Here, the training vectors x_i are mapped into a higher (possibly infinite) dimensional space by the function φ. The SVM finds a linear separating hyperplane with maximal margin in this higher-dimensional space, and C > 0 is the penalty parameter of the error term. Furthermore, K(x_i, x_j) ≡ φ(x_i)^T φ(x_j) is called the kernel function. The different kernel functions considered in the proposed work are given in Table 2.

Table 2. Different kernel functions of the SVM

Kernel        K(x_i, x_j)
Linear        x_i^T x_j
Polynomial    (γ x_i^T x_j + r)^d, γ > 0
RBF           exp(−γ ||x_i − x_j||²), γ > 0

Here γ, r and d are kernel parameters.

4.2 k-Nearest Neighbor (k-NN)

k-Nearest Neighbor (k-NN) is a machine learning algorithm that is non-parametric and conceptually simple, but at the same time a flexible classifier. It has created a niche for itself among classification methods, since it has proved to be efficient both theoretically and practically. Though it is a simple classifier, it delivers competitive results. It possesses the capability to predict unlabeled samples based on their striking


similarity with samples in the training data set. This method compares an unknown pattern or feature set to a set of patterns or feature sets that were associated with class identities in the training stage. The performance of the classifier further depends on the value of k, the size of the training data set, the distance metric used to measure the distance between a test sample and the training samples, and the mode of decision. We consider the similarity of two points to be the distance between them in feature space under an appropriate metric. The way in which the algorithm decides which of the points from the training set are similar enough to be considered when choosing the class for a new observation is to pick the k data points closest to the new observation and to take the most common class among them; this is why it is called the k-Nearest Neighbours algorithm. Consider labeled training samples (v_i, d_i), i = 1, 2, ..., m, with v_i ∈ R^n and d_i ∈ {1, 2, ..., q}, where v_i represents a training sample and the label d_i represents the class, out of q classes, to which the sample belongs. The goal is to predict the correct class of a new unlabeled sample v. To do so, we find the k samples from the training set that are closest to v and assign to v the label that appears most frequently among these k samples; this is also called the majority rule. There may be cases where the value of k is even, or where k > 2 and all k samples belong to different classes, leading to ambiguity. In such cases it is essential to break the tie among the neighbors; a nearest tie-breaker is used, which takes the nearest neighbor among the tied groups to resolve the conflict. To put the k-NN rule into operation, one needs a data set of labeled training samples, a distance metric to compute the distance between a training sample and a test sample, and the number of nearest neighbors to be considered, i.e., the value of k.

Fig. 6. (a) Example with 1 nearest neighbor pattern and (b) Example with 3 nearest neighbor pattern

In Fig. 6(a), the training sample sets v1 , v2 and v3 are a part of three different classes i.e. 1, 2 and 3, respectively. The initial value assigned to k is 1. The unlabeled sample v bears nearest affinity to a corresponding sample of training sample set v1 and thereby


the class of v is 1. Observing Fig. 6(b), we can infer that here the value of k is 3. The unlabeled sample v is closely mapped to two samples of the training set v1 and one sample of the training set v3. Yet again the class of v works out to be 1, as out of the k = 3 instances, two are derived from class 1 and one from class 3. The choice of k greatly impacts the performance of k-NN. The performance of k-NN is satisfactory when k = 1, but with overlapping classes it deteriorates. On the other hand, if k is too large, the neighborhood may include samples from other classes, leading to misclassification. Various distance metrics can be used to find the distance between the training samples and a test sample.
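The paper does not state which software was used for classification; as a hedged sketch, the comparison of Sect. 5 could be set up with scikit-learn as follows, using hyperparameter values that appear in Tables 3 and 4 (function name and structure are ours).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def evaluate(train_X, train_y, test_X, test_y):
    """Train the classifiers compared in Sect. 5 and report accuracy (%)."""
    classifiers = {
        "SVM linear": SVC(kernel="linear"),
        "SVM poly (deg 3)": SVC(kernel="poly", degree=3),
        "SVM RBF (gamma=0.003)": SVC(kernel="rbf", gamma=0.003),
        "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
    }
    results = {}
    for name, clf in classifiers.items():
        clf.fit(train_X, train_y)
        results[name] = 100.0 * np.mean(clf.predict(test_X) == test_y)
    return results
```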

5 Performance Evaluation

The SVM, trained with the training samples, has been used to classify the test samples. For classification, Linear, Polynomial and Radial Basis Function (RBF) kernel functions have been employed, as seen in Table 3.

Table 3. Recognition (%) of Urdu numerals using the SVM classifier

Linear and polynomial kernels:

Features               Vector length   Linear    Polynomial (deg 3)   Polynomial (deg 4)
Zernike of order 11    36              98.9101   98.7738              98.6376
Zernike of order 12    49              98.6376   98.7738              98.6376
DCT                    32              99.1826   99.1826              99.1826
DCT                    64              99.1826   99.1826              99.0463
DCT                    100             99.0463   99.1826              98.7738
Zoning                 25              98.9101   98.2289              97.8201
Zoning                 49              99.1826   98.5014              98.0926
Zoning                 58              99.1826   98.3651              98.0926
Gabor                  189             98.7738   98.9101              98.7738
Directional            144             98.7738   98.9101              99.0463
Gradient               200             99.0463   98.7738              98.7738

RBF kernel (various γ):

Features               Vector length   γ=.0001   γ=.00001   γ=.003    γ=.00003   γ=.0005   γ=.00005   γ=.0007
Zernike of order 11    36              97.1390   82.2888    98.6376   93.7330    98.0926   96.4578    98.5014
Zernike of order 12    49              97.2752   82.0163    98.3651   94.5504    98.3651   96.4578    98.6376
DCT                    32              97.2752   83.2425    99.0463   87.4659    97.1390   95.5041    97.2752
DCT                    64              97.4114   83.2425    99.0463   93.5967    97.1390   95.7766    97.2752
DCT                    100             87.8747   83.3787    97.2752   83.2425    97.1390   83.2425    97.0027
Zoning                 25              98.7738   97.8202    63.3515   98.5014    93.5967   98.7738    90.1907
Zoning                 49              96.3215   98.6376    48.3651   98.9101    76.1580   98.6376    70.9809
Zoning                 58              96.3215   98.7738    47.8202   98.9101    74.6594   98.7738    68.9373
Gabor                  189             97.2752   94.9591    97.2752   96.4578    98.3651   96.7302    98.5014
Directional            144             98.5014   98.2289    72.6158   98.7738    94.6866   98.6376    92.3706
Gradient               200             88.5559   84.8773    98.2289   84.8773    97.5477   84.8773    97.5477


Table 4. Recognition (%) of Urdu numerals using the k-NN classifier

Features               Vector length   k = 1     k = 3     k = 5
Zernike of order 11    36              88.2834   90.4632   93.0518
Zernike of order 12    49              87.7384   88.8283   92.0981
DCT                    32              93.0518   94.0054   94.0054
DCT                    64              93.0518   94.0054   94.0054
DCT                    100             93.0518   94.0054   94.0054
Zoning                 25              82.5613   91.2806   92.2343
Zoning                 49              88.1471   90.0545   91.8256
Zoning                 58              88.1471   90.1907   91.8256
Gabor                  189             88.0109   92.5068   98.0926
Directional            144             90.5995   92.6431   92.9155
Gradient               200             89.9183   92.7793   93.1880

Table 4 depicts the results for three different values of k considered in k-NN for the different feature extraction methods.

6 Conclusion

For performance evaluation, the classifiers Support Vector Machine (SVM) and k-Nearest Neighbors (k-NN) have been used. The feature extraction approaches considered for Urdu numerals in the proposed work include Zernike, DCT, zoning, Gabor filter, DDD and gradient features. In addition, performance evaluation has also been done by varying the feature vector length of the Zernike and DCT features. From the above discussion, it is observed that the DCT (32/64) and zoning (49/58) features with SVM (linear), DCT (32/64/100) with SVM (degree 3) and DCT (32) with SVM (degree 4) provide the maximum classification accuracy of 99.1826% compared with the others. To arrive at a large-scale dataset, it is imperative to create a massive repository of scanned Urdu documents on which a range of feature extraction methods and classifiers can be evaluated.

References 1. Lehal, G.S.: Ligature Segmentation for Urdu OCR. In: Proceeding of the International Conference on Document Analysis and Recognition (ICDAR), pp. 1130–1134 (2013) 2. Naz, S., Ahmed, S.B., Ahmad, R., Razzak, M.I.: Arabic script based digit recognition systems. In: Proceedings of the International Conference on Recent Advances in Computer Systems, RACS 2015, pp. 67–72 (2015) 3. Razzak, M.I., Hussain, S.A., Belaid, A., Sher, M.: Multi-font numerals recognition for Urdu script based languages. Int. J. Recent Trends Eng. 2(3), 70–72 (2009)


4. Ansari, I.A., Borse, R.Y.: Automatic recognition of offline handwritten Urdu digits in unconstrained environment using Daubechies Wavelet transforms. IOSR J. Eng. (IOSRJEN) 3(9), 50–56 (2013) 5. Uddin, S., Sarim, M., Shaikh, A.B., Raffat, S.K.: Offline Urdu numeral recognition using non-negative matrix factorization. Res. J. Recent Sci. 3(11), 98–102 (2014) 6. Yusuf, M., Haider, T.: Recognition of handwritten Urdu digits using shape context. In: Proceedings of the 8th International Multitopic IEEE Conference, INMIC 2004), pp. 569–572 (2004) 7. Haider, T., Yusuf, M.: Accelerated recognition of handwritten Urdu digits using shape context based gradual pruning. In: Proceedings of the International Conference on Intelligent and Advanced Systems, ICIAS 2007), pp. 601–604 (2007) 8. Husnain, M., et al.: Recognition of Urdu handwritten characters using convolutional neural network. Appl. Sci., 9(13), 1–21 (2019) 9. Trier, O.D., Jain, A.K., Taxt, T.: Feature extraction methods for character recognition-a survey. Pattern Recogn. 29(4), 641–662 (1996) 10. Cao, J., Ahmadi, M., Shridhar, M.: Recognition of handwritten numerals with multiple feature and multistage classifier. Pattern Recogn. 28(2), 153–160 (1995) 11. Oh, I.S., Suen, C.Y.: Distance features for neural network-based recognition of handwritten characters. Int. J. Doc. Anal. Recogn. 1(2), 73–88 (1998) 12. Jain, A.K., Farrokhnia, E.: Unsupervised Texture Segmentation using Gabor Filters. Pattern Recogn. 24(12), 1167–1186 (1991) 13. Porat, M., Zeevi, Y.Y.: The generalized Gabor scheme of image representation in biological and machine vision. IEEE Trans. Pattern Anal. Machine Intell. 10, 452–468 (1988) 14. Hamamoto, Y., Uchimura, S., Watanabe, M., Yasuda, T., Mitani, Y., Tomita, S.: A Gabor filter based method for recognizing handwritten numerals. Pattern Recogn. 31(4), 395–400 (1998) 15. Teague, M.R.: Image analysis via the general theory of moments. J. Opt. Soc. Am. 70(8), 920–930 (1980) 16. Qu, Y.D., Cui, C.S., Chen, S.B., Li, J.Q.: A fast subpixel edge detection method using SobelZernike moments operator. Image Vis. Comput. 23(1), 11–17 (2005) 17. Pawlak, M.: On the reconstruction aspect of moment descriptors. IEEE Trans. Inf. Theory 38(6), 1698–1708 (1992) 18. Singh, C.: Improved quality of reconstructed images using floating point arithmetic for moment calculation. Pattern Recogn. 39, 2047–2064 (2006) 19. Singh, C., Pooja, S., Upneja, R.: On image reconstruction, numerical stability, and invariance of orthogonal radial moments and radial Harmonic transforms. Pattern Recognit. Image Anal. 21(4), 663–676 (2011) 20. Kim, Y.S., Kim, W.Y.: Content-based trademark retrieval system using visually salient features. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 307–312 (1997) 21. Li, S., Lee, M.C., Pun, C.M.: Complex Zernike moments features for shape based image retrieval. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 39(1), 227–237 (2009) 22. Singh, C., Sharma, P.: Improving image retrieval using combined features of Hough transform and Zernike moments. Opt. Lasers Eng. 49(12), 1384–1396 (2011) 23. Kim, H.S., Lee, H.K.: Invariant image watermarking using Zernike moments. IEEE Trans. Circuits Syst. Video Technol. 13(8), 766–775 (2003) 24. Xin, Y., Liao, S., Pawlak, M.: Circularly orthogonal moments for geometrically robust image watermarking. Pattern Recogn. 40(12), 3740–3752 (2007) 25. Khotanzad, A., Hong, Y.H.: Invariant image recognition by Zernike moments. IEEE Trans. Pattern Anal. 
Mach. Intell. 12(5), 489–497 (1990) 26. Papakostas, G.A., Boutalis, Y.S., Karras, D.A., Mertzios, B.G.: A new class of Zernike moments for computer vision applications. Inf. Sci. 177(13), 2802–2819 (2007)


27. Broumandnia, A., Shanbehzadeh, J.: Fast Zernike wavelet moments for Farsi character recognition. Image Vis. Comput. 25(5), 717–729 (2007) 28. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Disc. 2(2), 121–167 (1998). https://doi.org/10.1023/A:1009715923555

Colour Sensitive Image Segmentation Using Quaternion Algebra

Sandip Kumar Maity1 and Prabir Biswas2(B)

1 ARC Document Solution, Kolkata, India
[email protected]
2 Indian Institute of Technology, Kharagpur, India
[email protected]

Abstract. In colour image segmentation, only a few vector filters are available, and they are not widely applicable; the segmentation tools that are available are also not colour sensitive. In this paper, we present an approach to colour-sensitive segmentation that uses the colour sensitivity property of hypercomplex, or quaternion, algebra. The filter is based on quaternion convolution, and the impulse response is determined using a modified gray-centered RGB colour cube. With this method, any colour can be segmented by choosing a suitable impulse response.

Keywords: Gray-Centered RGB colour cube · Linear quaternion convolution · Colour sensitive image segmentation

1 Introduction

The history of color image processing extends back to the 1970s; over 40 years there have been a few developments of filters that are applied to the separate channels of a color image, but such filters are not well suited to color image processing. Vector filters for color image processing have been studied for a few years. In 1998, Sangwine [3] introduced an edge-detecting filter based on convolution with a pair of quaternion-valued (hypercomplex) left and right masks. The new filter converts areas of smoothly varying color to shades of grey and generates colors in regions where edges occur. The same author proposed [6] a new class of filters based on convolution with hypercomplex masks and presented three color edge-detecting filters; these are the first examples of filters based on quaternion or hypercomplex convolution. In [4], Evans et al. proposed a color-sensitive low-pass smoothing operator, and in [5] the method was used to find edges between homogeneous regions of color. In [7], Sangwine proposed linear color-dependent filters based on decomposition of an image into components parallel and perpendicular to a chosen direction in color space. The components were separately filtered with linear filters and added to produce an overall result; the paper demonstrates this approach with a color-selective smoothing filter using two separate filters for the parallel and perpendicular components. In [8], the authors presented a concept of vector amplification which increases the magnitude of pixel vectors with directions close to that of the color of interest (COI). For amplification purposes, they utilize the gray-centered RGB


color-space [9]. In this color space, the unit RGB cube is translated so that the coordinate (0, 0, 0) represents mid-gray (half-way between black and white). The translation is achieved by subtracting (1/2, 1/2, 1/2) from each pixel value in the unit RGB space, and lastly the translation is reversed by adding (1/2, 1/2, 1/2).

Fig. 1. (a) Original image; (b) segmented output

In this paper, we present an approach to colour-sensitive segmentation that uses the colour sensitivity property of hypercomplex, or quaternion, algebra. The paper is organized as follows. In Sect. 2, we review quaternions and discuss their application in image processing. In Sect. 3, we review the gray-centered RGB color cube and its modification. In Sect. 4, linear quaternion convolution is discussed. In Sect. 5, the proposed algorithm is discussed in detail.

2 Quaternion

Color image processing is still a relatively undeveloped field, and there is always scope to develop new algorithms such as vector signal processing. The idea of the quaternion was first introduced by the famous mathematician Hamilton [1]; it is an expansion and generalization of the complex number with three imaginary components. A quaternion is represented as Q = a + ib + jc + dk, where one part is real and the other part is imaginary. The three operators i, j and k are all 'square roots of minus one' and are mutually orthogonal. When multiplied in pairs they produce the third, with the sign dependent on the ordering:

i² = j² = k² = ijk = −1
ij = k,  jk = i,  ki = j
ji = −k,  kj = −i,  ik = −j

These hypercomplex numbers, or quaternions, obey the usual rules of algebra with the exception of commutativity of multiplication, which is due to the vector nature of the product. The quaternion Q can also be written as [a b c d]. A quaternion with zero real or scalar part, S[Q] = a = 0, is called a pure quaternion. The multiplication of two pure quaternions P and Q is given as

PQ = −P·Q + P×Q = S[PQ] + V[PQ]


It may be noted that i, j and k may also be represented by matrices. A separate type of quaternion can be expressed using 2 × 2 complex matrices:

U = [[1, 0], [0, 1]],  I = [[i, 0], [0, −i]],  J = [[0, 1], [−1, 0]],  K = [[0, i], [i, 0]]

This quaternion family also satisfies the fundamental relations of the quaternion:

I² = J² = K² = IJK = −U,  IJ = K,  KI = J,  JK = I

We may note that in this 2 × 2 matrix representation U is the identity matrix, and each of I, J, K is a square root of the negative of the identity matrix. It may also be shown that K is the wedge product of I and J (Table 1):

K = I ∧ J = (1/2)(IJ − JI)

A mathematical table can represent the whole thing.

Table 1. Relational table between i, j and k

      1    i    j    k
1     1    i    j    k
i     i   −1    k   −j
j     j   −k   −1    i
k     k    j   −i   −1

The conjugate of the quaternion a = a1 + ia2 + ja3 + ka4 is given by

ā = a1 − ia2 − ja3 − ka4     (1)

The sum of the two quaternions a = a1 + ia2 + ja3 + ka4 and b = b1 + ib2 + jb3 + kb4 is given by

a + b = (a1 + b1) + (a2 + b2)i + (a3 + b3)j + (a4 + b4)k     (2)

The product of the above two quaternions a and b is given as

a·b = (a1b1 − a2b2 − a3b3 − a4b4) + i(a1b2 + a2b1 + a3b4 − a4b3) + j(a1b3 + a3b1 − a2b4 + a4b2) + k(a1b4 + a4b1 + a2b3 − a3b2)     (3)


351

(4)

The inverse of a quaternion is given as

a⁻¹ = ā / (a ā)     (5)

Any quaternion may be represented in polar form as a = |a| e^{μΦ}, where μ is a unit pure quaternion and 0 ≤ Φ ≤ π. The two values μ and Φ are the eigenaxis and eigenangle of the quaternion. A color image has pixels with three components. We can think of an RGB pixel value either as three separate components or as three elements of a vector. In this paper we consider RGB images, which can be represented in quaternion form [11] as pure quaternions f(m, n) = r(m, n) i + g(m, n) j + b(m, n) k. The choice of the imaginary part to represent the pixel values is supported by the coincidence between the three-space imaginary part of the quaternion and the three-space of the RGB color values.
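A minimal sketch of the quaternion operations in Eqs. (1)-(5) and of the pure-quaternion pixel representation; the class and function names are ours, not the paper's.

```python
import numpy as np

class Quaternion:
    """Minimal quaternion a + ib + jc + kd implementing Eqs. (1)-(5)."""
    def __init__(self, a, b, c, d):
        self.q = np.array([a, b, c, d], dtype=float)

    def conj(self):                                    # Eq. (1)
        a, b, c, d = self.q
        return Quaternion(a, -b, -c, -d)

    def __add__(self, o):                              # Eq. (2)
        return Quaternion(*(self.q + o.q))

    def __mul__(self, o):                              # Eq. (3), Hamilton product
        a1, a2, a3, a4 = self.q
        b1, b2, b3, b4 = o.q
        return Quaternion(a1*b1 - a2*b2 - a3*b3 - a4*b4,
                          a1*b2 + a2*b1 + a3*b4 - a4*b3,
                          a1*b3 + a3*b1 - a2*b4 + a4*b2,
                          a1*b4 + a4*b1 + a2*b3 - a3*b2)

    def norm(self):                                    # Eq. (4)
        return float(np.sqrt(np.sum(self.q ** 2)))

    def inverse(self):                                 # Eq. (5)
        return Quaternion(*(self.conj().q / np.sum(self.q ** 2)))

def pixel_to_quaternion(r, g, b):
    """An RGB pixel as the pure quaternion f = r i + g j + b k."""
    return Quaternion(0.0, r, g, b)
```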

3 Modified Gray-Centered RGB Colour Cube

The gray-centered RGB color cube was proposed in [9] by shifting the center to mid-gray (1/2, 1/2, 1/2), as shown in Fig. 2. In the gray-centered RGB color cube, the magnitude of the pure quaternion used to represent a color can be used as a measure of the distance between the color and mid-grey, located at the origin of the hypercomplex space. This distance is related to both the saturation and the intensity components of a color in terms of the Hue-Saturation-Intensity coordinate system. Here we modify the cube and scale it by a factor of two, i.e., to the range (−1, 1). This modified RGB color cube is skew symmetric, and in this scaled cube black is represented by (−1, −1, −1). We have chosen to work in an offset RGB color space in which the origin is at the center of the RGB cube, which means the origin is at mid-grey (0, 0, 0). This means that all pixel values are vectors directed away from mid-grey. To convert an image from standard normalized RGB values to this 'offset' RGB space, we must change the RGB cube accordingly. A full description and discussion of this grey-centered RGB color space can be found in [9].
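A small sketch of the conversion between unit-range RGB and the modified gray-centered cube described above (function names are assumptions):

```python
import numpy as np

def to_offset_rgb(img):
    """Map unit-range RGB values [0, 1] to the modified gray-centered cube
    [-1, 1], so that mid-gray sits at the origin and black is (-1, -1, -1)."""
    return 2.0 * np.asarray(img, dtype=float) - 1.0

def from_offset_rgb(offset_img):
    """Reverse the mapping back to the unit RGB cube."""
    return (np.asarray(offset_img, dtype=float) + 1.0) / 2.0
```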


Fig. 2. (a) RGB colour cube; (b) Gray centred RGB colour cube; (c) Modified Gray centred RGB cube. (Color figure online)

Figure 3 shows various color ramp images. All the pixels in each color ramp image are aligned along a specific direction in color space. The magnitude of the pixel vector varies uniformly from left to right along these ramp images, passing through zero at the center of the image. Thus the colors of each ramp image are symmetric about its center, and the coordinates of every point in this cube are skew symmetric. The intensity response also varies monotonically from left to right.


Fig. 3. Color ramp images from vertex to vertex of the Gray centered RGB cube. Top to bottom: grayscale, red/cyan, green/magenta, blue/yellow. Vector directions: (1,1,1), (1, −1, −1), (−1,1, −1), (−1, −1,1), respectively. (Color figure online)

4 Linear Quaternion Convolution (LQC)

In this section we explain how linear filters can be constructed using linear quaternion functions [12] as the point operators (i.e. the operators that operate on the individual signal or image samples being processed). We assume that linear shift-invariant filters are characterized by a finite impulse response, or a finite coefficient 'mask':

g(n, m) = Σ_{x=−X}^{X} Σ_{y=−Y}^{Y} h(x, y) f(n − x, m − y) = h ∗ f    (6)

Here, h(x, y) is the impulse response of the filter. In the case of vector signals and quaternion coefficients, we need four products in the convolution:

g(n, m) = Σ_{x=−X}^{X} Σ_{y=−Y}^{Y} {A(x, y) + iB(x, y) + jC(x, y) + kD(x, y)} f(n − x, m − y)
        = Σ_{x=−X}^{X} Σ_{y=−Y}^{Y} {A(x, y) f(n − x, m − y) + iB(x, y) f(n − x, m − y) + jC(x, y) f(n − x, m − y) + kD(x, y) f(n − x, m − y)}
        = A ∗ f + (B ∗ f) i + (C ∗ f) j + (D ∗ f) k
        = scalar part (constant term) + vector part (in (r, g, b) form)    (7)

where A(x, y) is the mask coefficient at location (x, y); A denotes a finite quaternion-valued function with (2X + 1)(2Y + 1) samples, and similarly for B, C and D. In this way, we think of the filter as consisting of the sum of four quaternion-valued convolutions, three of which are multiplied on the right by the constant values i, j and k. Now, at each sample point in f, we have a linear quaternion function which implements


some geometric operation, and we regard the filter as the convolution of these geometric operations with the vector signal. We can now think of defining the frequency response of the filter. There are two aspects to designing a vector filter which make the process nontrivial. The problem is to define a suitable impulse response for the filter in terms of these geometric operations: a filter is partially defined by the shape of its impulse response (the 'pattern' of values in the samples of h(x, y)). We must choose a suitable impulse response h(x, y) for the color of interest (COI). Every pixel in the image is then resolved into a component parallel to the COI and a component perpendicular to it. Obviously, a pixel whose value is exactly that of the COI has no component perpendicular to the COI, and a pixel whose value is exactly perpendicular to the COI has no component in the direction of the COI; other pixels fall somewhere between these extremes. If the COI is yellow, for example, we must preserve yellow pixels, and h(x, y) = (1, 1, −1) is the appropriate impulse response; it detects the sharp transition between yellow and non-yellow. Similarly, the mask for red is h(x, y) = (1, −1, −1), and the masks for green, white, blue and cyan are (−1, 1, −1), (1, 1, 1), (−1, −1, 1) and (−1, 1, 1), respectively. These masks correspond to exact, fully saturated colors. The output of the quaternion convolution consists of a scalar part and a vector part, where the vector part of the convolution result is in (r, g, b) form.

5 Color Sensitive Image Segmentation

The color sensitive image segmentation task is carried out as follows (a sketch of these steps is given below).
Step 1. Select the COI and the appropriate impulse response according to the proposed RGB color cube.
Step 2. Perform the linear quaternion convolution using this impulse response.
Step 3. Normalize the scalar part of the LQC output to 255 and scale the vector part accordingly.
Step 4. Change the pixel values according to the resulting RGB values.
Step 5. Suppress the specific color with a new false color and recognize it as the COI.
The method produces accurate results for a wide range of images, and its performance clearly depends on the correct choice of the impulse response.
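The following is a much-simplified, per-pixel sketch of Steps 1-5 (our own illustration, not the authors' implementation): the impulse response is collapsed to a single-pixel support, so the quaternion convolution reduces to the product of the COI mask with each offset-RGB pixel vector, and the scalar part of that product measures alignment with the COI. The mask, threshold and false color below are illustrative assumptions.

import numpy as np

def coi_segment(img_rgb01, coi_mask=(1.0, 1.0, -1.0), false_color=(0.0, 0.0, 0.0), thr=0.5):
    off = 2.0 * img_rgb01 - 1.0                  # offset RGB cube, mid-gray at the origin
    h = np.asarray(coi_mask, dtype=float)
    h = h / np.linalg.norm(h)                    # unit COI direction
    # Scalar part of the pure-quaternion product h * pixel is -(h . pixel);
    # its magnitude is large when the pixel lies along the COI axis.
    response = -(off @ h)                        # shape (H, W)
    out = img_rgb01.copy()
    out[np.abs(response) > thr] = false_color    # suppress/mark COI pixels
    return out, response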

6 Results and Discussion

We followed the above-mentioned steps and tested the method on various types of natural images. In Fig. 1, the red flower is correctly detected and extracted by the proposed method. In Fig. 4(d), the yellow color is detected, and the red color of Fig. 4(b) is detected in Fig. 4(e). In Fig. 4(c) the COI is red, so the corresponding impulse response is used and the result is shown in Fig. 4(f). After quaternion convolution we obtain a false color image, where the main challenge is to separate out the opponent colour [10]. The false colors follow Table 2, so all components other than the COI component can be eliminated.


Fig. 4. (a), (b) & (c) Original image. (d), (e) Detected yellow and red colour component. (f) extracted red component only. (Color figure online)

Table 2. Color response of the bar image for various impulse responses (after the linear quaternion convolution).

  Impulse response    | Colours in bar image
                      | White    Black    Red      Green    Blue     Yellow    Cyan      Magenta
  White (1, 1, 1)     | Black    Black    White    White    White    Black     Black     Black
  Black (-1, -1, -1)  | Black    Black    White    White    White    Black     Black     Black
  Red (1, -1, -1)     | White    Black    White    White    White    Cyan      Black     Yellow
  Green (-1, 1, -1)   | White    Black    White    White    White    Cyan      Magenta   Black
  Blue (-1, -1, 1)    | White    Black    White    White    White    Black     Magenta   Yellow
  Yellow (1, 1, -1)   | Black    Black    White    White    White    Black     Yellow    Cyan
  Cyan (-1, 1, 1)     | Black    Black    White    White    White    Magenta   Black     Cyan
  Magenta (1, -1, 1)  | Black    Black    White    White    White    Magenta   Yellow    Black

Table 3. Yellow and blue color sensitive examples.

(Table 3 columns, left to right: original image; yellow colour assigned black with mask (1, 1, -1); blue colour assigned white with mask (-1, -1, 1). The table entries are images.)

Table 2 shows the relation between colors and impulse responses with the help of the color bar image. In Table 3, two natural images and the corresponding results for yellow and blue color segmentation are shown. Choosing the appropriate impulse response is key to the success of this method: if the COI is not exactly white (255, 255, 255), red (255, 0, 0), etc., then impulse responses such as (1, 1, 1) or (1, −1, −1) cannot be used directly. In Fig. 5 we tested the impulse responses (1, 1, 1) and (−1, −1, −1), but these masks do not give satisfactory results because the white color in the image is not perfect white; we therefore used the impulse response masks (0, 1, 1) and (0, −1, −1), which are two skew-symmetric points. The proposed method is applicable to all types of images. It has also been applied to tongue images to extract the thick yellow coat that appears in cases of acidity. Figure 6(c) corresponds to a normal patient, since very little yellow coating is present, whereas the tongues in Figs. 6(a) and (b) are strongly affected by acidity. This is verified with clinically investigated data.


Fig. 5. (a) & (b) Segmentation result of the original image of Table 3 for white color by using (1, 1, 1) and (−1, −1, −1); (c) & (d) segmentation result for white color by using (0, 1, 1) and (0, −1, −1); (e) whitish color is marked by the white component; (f) non-whitish color is marked by the black color component. (Color figure online)

Fig. 6. (a) and (b) are yellow-coated cropped tongue images; (c) is a very lightly coated cropped tongue image; (b) and (d) are the extracted yellow-coat images, while (f) is a normal tongue image. (Color figure online)

7 Conclusion

The proposed method achieves accurate and appropriate color sensitive segmentation, which obviously depends on the appropriate choice of the impulse response mask. It is a robust and simple approach, and the proposed algorithm is applicable to all types of real-time data.


References 1. Hamilton, W. R.: Lectures on Quaternions. Hodges and Smith, Dublin, Cornell university Library (1853). http://historical.library.cornell.edu/math/ 2. Pervin, E.: Quaternions in computer vision and robotics. In: IEEE International Conference Computer Vision Pattern Recognition, pp. 382–383 (1983) 3. Sangwine, S.J.: Colour image edge detector based on quaternion convolution. Electron. Lett. 34(10), 969–971 (1998) 4. Evans, C.J., Ell, T.A., Sangwine, S.J.: Hypercomplex color-sensitive smoothing filters. In: IEEE International Conference on Image Processing, vol. 1, no. 7, pp. 541–544 (2000) 5. Evans, C.J., Ell, T.A., Sangwine, S.J.: Colour-sensitive edge detection using hypercomplex filters. In: European Association for Signal Processing Conference (2000) 6. Sangwine, S.J., Ell, T.A.: Colour image filters based on hypercomplex convolution. IEE Proc. Vis. Image Signal Process. 147(2), 89–93 (2000) 7. Sangwine, S.J., Gatsheni, B.N., Ell, T.A.: Linear colour-dependent image filtering based on vector decomposition. In: European Signal Processing Conference, France, vol. 11, pp. 214– 211 (2002) 8. Sangwine, S.J., Gatsheni, B.N., Ell, T.A.: Vector amplification for color-dependent image filtering. In: IEEE International Conference on Image Processing, Barcelona, Spain, 14–17, vol. 2, pp. 129–132 (2003) 9. Sangwine, S.J., Ell, T.A.: Gray-centered RGB color space. In: Second European Conference on Color in Graphics, Imaging and Vision, pp. 183–186, Technology Center AGIT, Aachen (2004) 10. Sangwine, S.J., Gatsheni, B.N., Ell, T.A.: Colour dependent linear vector image filtering. In: European Signal Processing Conference, pp. 585–588 (2004) 11. Ell, T.A., Sangwine, S.J.: Projective-space colour filters using quaternion algebra. In: European Signal Processing Conference (2008) 12. Ell, T.A., Sangwine, S.J.: Theory of vector filters based on linear quaternion function. In: European Signal Processing Conference (2008)

Information Retrieval

Multimodal Query Based Approach for Document Image Retrieval Amit V. Nandedkar(B) and Abhijeet V. Nandedkar SGGS Institute of Engineering and Technology, Nanded, India {amitvnandedkar,avnandedkar}@sggs.ac.in

Abstract. Scanning and storage of documents are regular practices, and retrieval of such documents is necessary to support office work. In this paper, a novel multimodal query based approach for retrieving documents using text and non-text contents is presented. This work focuses on logos, stamps and signatures for non-text queries, and on dates and keywords for text queries. The proposed algorithm, called the multimodal document image retrieval algorithm (MDIRA), uses separation of text and non-text to simplify document indexing using both textual and non-textual contents. A single feature space is proposed for indexing non-text contents. Various date formats can be recognized using regular expressions and mapped to a uniform representation useful for indexing and retrieval. The proposed algorithm supports the formation of multimodal queries using multiple attributes of documents. Results on a publicly available color document dataset show the effectiveness of the proposed technique for document retrieval.

Keywords: Document retrieval · Multimodal queries · Document indexing

1 Introduction

In recent times, offices and organizations need to handle a large number of scanned documents. There is a general need for classification and retrieval of such large-scale digitized documents. The standard approach of document image retrieval (DIR) uses only text. In reality for official records, an integrated approach for DIR using visual and textual contents is required. Therefore, document image retrieval is an important area of research in document image processing. Official documents may be categorized depending on their role in communication and transactions. These documents may have visual attributes (e.g., logos, stamps, signatures, printed, and handwritten textual contents). Such document attributes depict the identity and legitimacy of the document origin to a reader. Hence, separation of text and non-text elements (e.g., logo, stamp, and signature) is a critical task for document indexing and retrieval. In the past,



several approaches have been reported for DIR using logos, signatures, seals, etc. (refer to [3,6,7,12–14,16]). These techniques targeted one of the visual attributes of documents in an isolated manner. In [15], a sketch-based retrieval of pictorial parts of documents such as magazine pages is presented. A mobile document image retrieval framework for line drawing retrieval is reported in [5]. In this work, a Multimodal Document Image Retrieval Algorithm (MDIRA) is presented. The primary task of the MDIRA is to assist a user in finding the most relevant documents from a corpus of document images. This is an integrated approach that deals with both non-textual and textual queries: the queries can be of visual form (e.g., logos, stamps, and signatures) as well as of textual form, such as document dates and keywords. The MDIRA uses text/non-text separation [10] as an initial processing block to simplify document indexing. It performs retrieval using a simple representation in feature space for all types of non-textual queries. The proposed algorithm also recognizes textual patterns of dates and keywords for text-based retrieval. This technique considers text and different non-textual elements simultaneously, supporting complex queries to retrieve document images. Local part-based features are found to be useful in the field of document image processing [1,2]. Speeded-up robust features (SURF) [4] is a part-based technique, as it represents an image as a set of key points and their respective local descriptors. SURF is invariant to common image transformations such as image rotation, scale changes, small changes in viewpoint, and illumination changes. The SURF technique achieves scale invariance by examining the image at different scales (scale space) using Gaussian kernels (σ). For indexing of the different non-textual elements present in document images, SURF features are modified and used. Experimentation and testing of the proposed algorithm are performed on the SPODS dataset [10].

2 Retrieval of Documents

The MDIRA has two streams of information processing, as shown in Fig. 1. The first considers the processing of visual features, and the other involves textual contents for indexing of document images. Here, the separation of text and non-text elements (refer to [10]) is an important operation. The document images to be indexed are first processed to separate textual and non-textual contents. The separated text is fed to an optical character recognition (OCR) system for conversion into an electronic form of the text data. The query can be in the form of images of logos, stamps, signatures, etc., or in the form of electronic text such as dates and keywords. The separated non-textual elements are indexed using the suggested features and a kD-tree indexing scheme; Sect. 2.1 discusses the non-text based document retrieval. One of the challenges involved is to index documents using dates. Since there are various ways to represent a date in a document, the different date formats have to be handled and converted into a unique representation, which simplifies date based document annotation. Examples of the different date formats are shown in Table 1. Keywords are important for fast


tracking of documents present in a large corpus of documents. Here, a predefined set of keywords is used. However, the list of keywords can be modified. Section 2.2 presents the text based document retrieval.

Fig. 1. Block diagram of MDIRA.

Table 1. Different date formats supported by the proposed MDIRA.

  Example 1: 1st January, 2015; January 1, 2015; 1 Jan., 2015; 1st Jan, 2015; 01 Jan, 2015; 1 Jan. 2015; 01/01/2015; 1/1/2015; 01-01-2015; 1-01-2015; 1-1-2015; 01.01.2015; 1.1.2015; 01.1.2015
  Example 2: 2nd February, 2016; February 2, 2016; 2 Feb., 2016; 2nd Feb, 2016; 02 Feb, 2016; 2 Feb. 2016; 02/02/2016; 2/2/2016; 02-02-2016; 2-02-2016; 2-2-2016; 02.02.2016; 2.2.2016; 02.2.2016

2.1 Non-text Based Document Retrieval

The MDIRA provides access to document images based on visual attributes (e.g., logo, signature, and stamp). The first preprocessing step is to separate text/non-text elements of the document image (as shown in Fig. 2). After performing separation of text and non-text elements, the non-textual output consists of visual elements such as logo, stamp, and signature. This operation removes all irrelevant textual information and simplifies the feature extraction process for supporting the non-text based retrieval of documents. Hence, these visual attributes are represented with a minimal set of key points and their respective SURF descriptors [4]. The MDIRA uses a single feature space for finding similarity between the query image and the available set of documents in database. However, for stamps and signatures, SURF features are found to be less effective. They suffer from a high degree of intra-class similarity for stamps and signatures. On the other hand, logo matching can efficiently be done using them. Hence for increasing the discriminating capability of SURF features, more global information needs to be encoded. To support the findings, the window size


used to compute the descriptor at a key point is varied from 20s to 140s, where s indicates the scale at which the Hessian key point is detected [4]. The window size used for computing the descriptors is expressed as Window Size = μ·s, where μ ∈ {20, 60, 80, 100, 120, 140}; the window size 20s corresponds to the standard SURF descriptor of [4], so μ = 20 follows the standard implementation of SURF. A larger μ value results in a larger zone surrounding a key point for computing the SURF descriptor. It is observed that increasing μ improves the retrieval performance up to a point, beyond which the performance starts degrading. Empirically, μ = 80 provides the best retrieval performance for stamp and signature queries on the given dataset, which improves the overall retrieval performance for the stamp and signature classes. The details of the experimental results are given in Sect. 3. Figure 2 shows the feature extraction and indexing process carried out for each of the document images present in the database. The extracted features for a document image are labeled with the respective document identification number DOC ID. These extracted features are indexed using a kD-tree indexing technique (refer to the Fast Approximate Nearest Neighbor library (FLANN) [9]).

Fig. 2. Indexing and retrieval process for visual attributes of documents.

During retrieval, features are extracted from the gray scale version of the query image. For each extracted feature the MDIRA finds K approximate nearest neighbors, and each of these nearest neighbors carries a DOC ID. If S SURF features are obtained from the query image, there are in total K × S nearest neighbors for the given query (refer to Fig. 2). The MDIRA finds the distinct DOC IDs and counts their respective instances; this list is sorted in descending order of the count values, and the top N documents are returned as the most promising list of documents containing the desired query element (a small sketch of this voting scheme is given below). Retrieval of at least N documents is ensured by keeping K > N. Such a representation of non-textual elements using the modified SURF descriptors enables proper retrieval of document images (refer to Sect. 3 for the retrieval results); here, SURF features are modified by redefining the window size surrounding the key points for computing the descriptors. Hence, the MDIRA supports a single feature space capable of discriminating intra-class and inter-class non-textual elements. The subsequent section discusses the text based document retrieval technique.
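The following is a condensed sketch of this indexing and voting scheme (our own helper, not the authors' implementation): every descriptor of a database document is stored together with its DOC ID, each query descriptor votes for the DOC IDs of its K nearest neighbours, and the top-N documents by vote count are returned. SciPy's cKDTree stands in here for the FLANN index used in the paper.

from collections import Counter
import numpy as np
from scipy.spatial import cKDTree

def build_index(doc_descriptors):
    """doc_descriptors: dict {doc_id: (num_feats, dim) array of descriptors}."""
    feats, labels = [], []
    for doc_id, desc in doc_descriptors.items():
        feats.append(desc)
        labels.extend([doc_id] * len(desc))
    return cKDTree(np.vstack(feats)), np.array(labels)

def retrieve(tree, labels, query_desc, K=10, N=5):
    """query_desc: (S, dim) array of descriptors from the query image."""
    _, idx = tree.query(query_desc, k=K)             # K neighbours per query feature
    votes = Counter(labels[idx].ravel())             # count DOC_ID occurrences
    return [doc for doc, _ in votes.most_common(N)]  # top-N candidate documents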

2.2 Text Based Document Retrieval

Official documents consist of text that can be analyzed for locating dates and keywords. The string matching is performed to find keywords in textual output of OCR. A regular expression (RE) is a sequence of characters that defines a search pattern, mainly for use in pattern matching with strings, or string matching. Regular Expressions (REs) are used for date identification in OCRed textual string. The MDIRA utilizes the regular expression pattern that matches the defined date patterns. Figure 3 shows the overall process of insertion/updating of inverted indexes for dates and keywords.

Fig. 3. Document indexing using date and keyword information present in document image.

During implementation of MDIRA, it is assumed that each document has a unique identification number (DOC ID). Here, DOC i indicates DOC ID, 1 ≤ i ≤ D, and D indicates the total number of documents. In date based inverted index, a uniform representation of date information is utilized. As discussed above, regular expressions are applied to identify date patterns in input OCRed stream of textual contents. The following regular expressions are defined and used: R1 = (\d{1, 2})/(\d{1, 2})/([1|2]\d{3})

(1)

R2 = (\d{1, 2}).(\d{1, 2}).([1|2]\d{3}) R3 = (\d{1, 2}) − (\d{1, 2}) − ([1|2]\d{3}) RE1 = [R1 |R2 |R3 ]

(2) (3) (4)

RE2 = (\d{1, 2})(MONTHS)([1|2] \d{3}) RE3 = (MONTHS)(\d{1, 2})([1|2] \d{3})

(5) (6)

MONTHS = [jan|jan.|january|f eb|f eb.|f ebruary |mar|mar.|march|apr|apr.|april|may |jun|jun.|june|jul|jul.|july|aug|aug. |august|sep|sep.|sept|sept.|september |oct|oct.|october|nov|nov.|november |dec|dec.|december]

(7)


To understand how these REs work, the general operators used are described in Table 2. If any of the REs (refer to Eqs. 4, 5, and 6) finds a match, it captures tokens representing the day, month, and year information. For capturing tokens, the method uses the (expr) grouping operation (refer to Table 2). It is assumed that the day can be represented by one or two numeric digits, written as (\d{1,2}). The year information in most official documents is represented using four numeric digits and is considered to always start with 1xxx or 2xxx, which covers all years such as 1995 or 2999; this is represented as ([1|2]\d{3}) in the REs. There are still almost 10 centuries remaining before year information starts with the numeric digit 3 (e.g., 3001). The regular expression MONTHS covers all commonly used representations of months in words.

Table 2. Regular expression: general operators.

  Operator   Description
  [x|y]      String representing either x or y
  \d{a}      Any numeric digit sequence with 'a' consecutive digits
  \d{a,b}    Any numeric digit sequence with a minimum of 'a' and a maximum of 'b' consecutive digits
  (expr)     Group elements of the expression and capture tokens

As depicted in Fig. 3, the MDIRA preprocesses the OCRed output text stream to remove unnecessary symbols, and the whole OCRed stream is mapped to a lowercase representation. This operation removes ambiguities during matching of the REs. A person can write a date with a lot of variation: sometimes extra spaces are used between two units of information, and the day can be written as 1st, 2nd, 3rd, 24th, 21st, etc. Another possible cause of ambiguity is the uncertain use of the ',' symbol (e.g., 1 January, 2016 or 1st January 2016). It is observed that for the superscript text associated with the day information (e.g., st, nd, rd, th), the Matlab OCR produces output made up of symbols (e.g., ', ∗, ", ´, `). The implementation is adapted to these characteristics of the Matlab OCR, and all such symbols present in the input text stream are removed; the technique also removes ',' and spaces from the input text stream. After performing all these operations, the technique produces a preprocessed text stream that is analyzed for date patterns using the REs. During the indexing process, the tokens captured after matching an RE are mapped to a uniform representation: depending on the RE type, the captured tokens are associated with the day, month and year. If a token represents the month in words, it is mapped to a unique numeric representation (e.g., 'July', 'Jul.', and 'Jul' are mapped to '07').
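A hedged sketch of this date-normalisation step follows (our own simplified patterns, not the exact REs of Eqs. (1)-(7); the month-first format of RE3 is omitted): numeric and day-month-name formats are matched in the lower-cased, symbol-stripped OCR stream and mapped to a uniform DD-MM-YYYY key for the inverted index.

import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
     'jul', 'aug', 'sep', 'oct', 'nov', 'dec'])}

NUMERIC = re.compile(r'(\d{1,2})[/.\-](\d{1,2})[/.\-]([12]\d{3})')   # 01/01/2015, 1-1-2015, ...
WORDED  = re.compile(r'(\d{1,2})([a-z]{3,9})([12]\d{3})')            # day month-name year

def extract_dates(ocr_text):
    low = ocr_text.lower()
    keys = set()
    for d, m, y in NUMERIC.findall(low):
        keys.add('{:02d}-{:02d}-{}'.format(int(d), int(m), y))
    # drop spaces, commas and OCR symbols, then the ordinal suffixes st/nd/rd/th
    packed = re.sub(r'[^0-9a-z]', '', low)
    packed = re.sub(r'(\d)(st|nd|rd|th)', r'\1', packed)
    for d, name, y in WORDED.findall(packed):
        month = MONTHS.get(name[:3])
        if month:
            keys.add('{:02d}-{:02d}-{}'.format(int(d), month, y))
    return keys

print(extract_dates('Dated: 1st January, 2015 and 02.02.2016'))
# {'01-01-2015', '02-02-2016'}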


mapped to ‘07’). Apart from this, if any day and month information represented by the token is a single numeric digit, it is mapped to two numeric digits (e.g., ‘5’ mapped to ‘05’). Such mapping to uniform representation is useful during date based document image retrieval. For each identified date information, the MDIRA creates an entry or update the inverted indexing structure. A user can fire queries using such uniform representation of date information. The feature of keyword matching for DIR in MDIRA is implemented using R OCR text string locating functionality. This work assumes a predeMatlab fined knowledge base of keywords (refer to Table 3) and equivalence of keywords (refer to Table 4). The keyword searching operation is made case insensitive to guide proper search of keywords during indexing and retrieval. If the text stream of document contains any of the keywords belonging to the knowledge base, it makes a respective entry in inverted index structure. Whenever a user fires a keyword based query, the MDIRA retrieves a set of relevant documents by referring to inverted index. It retrieves the list of documents LIST1 relevant to given query Q. Also, the MDIRA searches for all relevant documents (LIST2) as per equivalence information. The final list of all relevant documents is the union of LIST1 and LIST2. (e.g, if Q= memo, resulting relevant documents will be union of LIST1 ∈ memo and LIST2 ∈ memorandum by referring to Table 4 and keyword based inverted index). Table 3. Example list of keywords. ID

Table 3. Example list of keywords.

  ID    Keyword
  1     'memorandum'
  2     'memo'
  3     'dy registrar'
  4     'dy. registrar'
  5     'deputy registrar'
  :     :
  n-1   'advertisement'
  n     'advt'

Table 4. Example list of equivalence of keywords.

  Keyword              Equivalence ID
  'memorandum'         1
  'memo'               1
  'dy registrar'       2
  'dy. registrar'      2
  'deputy registrar'   2
  :                    :
  'advertisement'      m
  'advt'               m

2.3 Combined Query Formation and Retrieval

The MDIRA supports multiple queries of interest being fired simultaneously to retrieve document images satisfying the given constraints. For example, let Query 1 = Stamp Query and Query 2 = Logo Query. The combined query with AND logic can be represented as Query = Query 1 AND Query 2. Execution of Query results in two lists, L1 and L2, of documents relevant to Query 1 and


Query 2, respectively. The final list L of relevant documents is then prepared as the intersection of L1 and L2: L = L1 ∩ L2. Similarly, for OR logic (e.g., Query = Query 1 OR Query 2), L = L1 ∪ L2. The proposed technique can combine text and non-text based queries in the same manner. A combined query (e.g., Query 1 AND Query 2) example for non-textual elements is illustrated in Fig. 4, which shows one retrieved document image satisfying the presence of both query elements.
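A minimal illustration of this combination step (our own hedged helper; retrieve_stamp and retrieve_logo are hypothetical single-modality retrieval calls): each single query returns a list of candidate DOC IDs, and AND/OR combinations reduce to set intersection and union.

def combine(results_1, results_2, logic='AND'):
    l1, l2 = set(results_1), set(results_2)
    return l1 & l2 if logic == 'AND' else l1 | l2

# e.g. documents containing both the query stamp and the query logo:
# combine(retrieve_stamp(stamp_img), retrieve_logo(logo_img), 'AND')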

3 Experimental Results

The MDIRA is tested on the SPODS dataset [10], which consists of 1,088 color documents containing logos, stamps, signatures, and printed text. The performance of non-text based retrieval is measured using different measures such as mean average precision (MAP) and mean R-precision (MRP). The MAP is defined by

MAP = (1/Q) Σ_{q=1}^{Q} [ Σ_{t=1}^{n} (P(t) × rel(t)) / (No. of relevant documents) ]    (8)

where n is the number of retrieved document images, P(t) is the precision at rank t (here, rank indicates the position of a document in the list of retrieved documents), rel(t) is 1 if the result at rank t is a relevant document image and 0 otherwise, and Q is the number of queries. The MRP is defined as

MRP = (1/Q) Σ_{q=1}^{Q} RP(q)    (9)

where R-precision (RP) is the precision at the Rth position in the ranking table for a query that has R relevant document images (i.e., RP(q) is the R-precision of query q), and Q is the number of queries.
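A short sketch of how Eqs. (8) and (9) can be evaluated (our own helper, assuming binary relevance judgements per ranked result list):

import numpy as np

def average_precision(rel, n_relevant):
    """rel: binary list, rel[t] = 1 if the document at rank t+1 is relevant."""
    rel = np.asarray(rel, dtype=float)
    precision_at_t = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float(np.sum(precision_at_t * rel) / n_relevant)

def mean_average_precision(rel_lists, n_relevant_list):
    return float(np.mean([average_precision(r, n)
                          for r, n in zip(rel_lists, n_relevant_list)]))

def mean_r_precision(rel_lists, n_relevant_list):
    # R-precision: precision at rank R, where R = number of relevant documents.
    return float(np.mean([np.sum(r[:n]) / n for r, n in zip(rel_lists, n_relevant_list)]))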

Fig. 4. Visual result for combined query containing logo and stamp.


Here, the number of top retrieved documents is varied over 1, 5, 10, 15, 20, 25, 30, and 35. On average, most of the non-textual elements have 33 instances in the dataset. Table 5 gives the overall performance for query images of logos, stamps, and signatures; the performance measures are computed at different values of μ (refer to Sect. 2.1). The overall performance of the MDIRA with μ = 80 is found to be the most appropriate for retrieval purposes.

Table 5. Retrieval performance of MDIRA for queries based on logos, stamps, and signatures.

  Parameter μ for computing SURF descriptors   MAP    MRP
  20                                           74.9   78.7
  60                                           85.3   87.2
  80                                           85.6   87.6
  100                                          85.1   87.3
  120                                          84.6   86.4
  140                                          83.2   85.6

To compare the retrieval performance of the proposed algorithm under variations in feature space representation and indexing mechanism, the performance of the modified SURF descriptor (refer to Sect. 2.1) is compared with the standard SURF [4] and SIFT [8] descriptors. Here, the retrieval performance is verified for FLANN based indexing as well as for sparse product quantization (SPQ) [11] based indexing. Like the FLANN technique, SPQ is a fast approximate nearest neighbor (ANN) search method for high-dimensional feature indexing and retrieval in large-scale image databases. Retrieval performances of logo, stamp, and signature based queries for the different descriptors in combination with FLANN and SPQ based indexing are presented in Table 6. This experiment shows that the proposed document descriptor with SURF window size μ = 80 performs well for both FLANN and SPQ based document indexing.

Table 6. Retrieval performance for queries based on logos, stamps, and signatures.

  Descriptor used for retrieval   Indexing technique   MAP    MRP
  Standard SURF [4]               FLANN [9]            74.9   78.7
  Standard SURF [4]               SPQ [11]             75.3   78.9
  Proposed SURF with μ=80         FLANN [9]            85.6   87.6
  Proposed SURF with μ=80         SPQ [11]             85.9   88.0
  SIFT [8]                        FLANN [9]            77.8   82.3
  SIFT [8]                        SPQ [11]             78.1   82.6


For text based document retrieval using date and keywords, the performance is measured in terms of recall. In this case, the inverted indexing scheme is used which retrieves only the relevant documents. Here, the results do not have any false positive cases. The recall for date based document retrieval is 97.17%, whereas the recall for keyword based document retrieval is 99.45%.

4 Conclusion

With the increasing use of scanned official documents, there is a need to facilitate retrieval of such documents using both textual and non-textual elements. The MDIRA supports the formulation of multimodal queries that can be represented as combinations of different non-text and text queries. The text based retrieval functionality provides date based and traditional keyword based access to documents, and a uniform representation of date information in the indexing structure supports retrieval of documents regardless of the date formats used in them. A possible future direction is to investigate such a DIR approach with improved document image descriptors for indexing.

References 1. Ahmed, S., Malik, M.I., Liwicki, M., Dengel, A.: Signature segmentation from document images. In: International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 425–429. IEEE (2012) 2. Ahmed, S., Shafait, F., Liwicki, M., Dengel, A.: A generic method for stamp segmentation using part-based features. In: Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR), pp. 708–712. IEEE (2013) 3. Alaei, A., Roy, P.P., Pal, U.: Logo and seal based administrative document image retrieval: a survey. Comput. Sci. Rev. 22, 47–63 (2016) 4. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (surf). Comput. Vis. Image Underst. 110(3), 346–359 (2008) 5. Duan, L.Y., Ji, R., Chen, Z., Huang, T., Gao, W.: Towards mobile document image retrieval for digital library. IEEE Trans. Multimedia 16(2), 346–359 (2014) 6. Jain, R., Doermann, D.: Logo retrieval in document images. In: Proceedings of the 10th IAPR International Workshop on Document Analysis Systems, pp. 135–139. IEEE (2012) 7. Le, V.P., Nayef, N., Visani, M., Ogier, J.M., De Tran, C.: Document retrieval based on logo spotting using key-point matching. In: Proceedings of the 22nd International Conference on Pattern Recognition (ICPR), pp. 3056–3061. IEEE (2014) 8. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004) 9. Muja, M., Lowe, D.G.: Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans. Pattern Anal. Mach. Intell. 36(11), 2227–2240 (2014) 10. Nandedkar, A.V., Mukherjee, J., Sural, S.: Text and non-text separation in scanned color-official documents. In: Mukherjee, S., et al. (eds.) ICVGIP 2016. LNCS, vol. 10481, pp. 231–242. Springer, Cham (2017). https://doi.org/10.1007/978-3-31968124-5 20


11. Ning, Q., Zhu, J., Zhong, Z., Hoi, S.C., Chen, C.: Scalable image retrieval by sparse product quantization. IEEE Trans. Multimedia 19(3), 586–597 (2016) 12. Roy, P.P., Pal, U., Llad´ os, J.: Document seal detection using GHT and character proximity graphs. Pattern Recogn. 44(6), 1282–1295 (2011) 13. Rusi˜ nol, M., Llad´ os, J.: Efficient logo retrieval through hashing shape context descriptors. In: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pp. 215–222. ACM (2010) 14. Srihari, S.N., et al.: Document image retrieval using signatures as queries. In: Proceedings of the 2nd International Conference on Document Image Analysis for Libraries (DIAL), pp. 198–203. IEEE (2006) 15. Tencer, L., Ren´ akov´ a, M., Cheriet, M.: Sketch-based retrieval of document illustrations and regions of interest. In: Proceedings of the 12th International Conference on Document Analysis and Recognition, pp. 728–732. IEEE (2013) 16. Zhu, G., Zheng, Y., Doermann, D., Jaeger, S.: Signature detection and matching for document image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 31(11), 2015–2031 (2009)

Transformed Directional Tri Concomitant Triplet Patterns for Image Retrieval

Chesti Altaff Hussain1(B), D. Venkata Rao2, and S. Aruna Mastani1

1 Department of ECE, JNTU College of Engineering, Anatapuramu, India
[email protected]
2 QIS Institute of Engineering and Technology, Ongole, India

Abstract. Content-based image retrieval (CBIR) relies on an accurate characterization of visual information. In this paper, we propose a new technique, entitled Transformed Directional Tri Concomitant Triplet Patterns (TdtCTp), for CBIR. TdtCTp consists of three stages that capture detailed directional information about pixel progression. In the first stage, a structural rule based approach is proposed to extract directional information in various directions. In the second stage, microscopic information and the correlation between sub-structural elements are extracted using concomitant conditions. Finally, minute directional intensity variation information and the correlation between the sub-structural elements are extracted by integrating the first two stages. Retrieval accuracy is estimated using traditional distance measures in terms of average retrieval precision and average retrieval rate on publicly available natural and medical image databases. Performance analysis shows that the TdtCTp descriptor outperforms existing methods in terms of retrieval accuracy.

Keywords: CBIR · Local ternary pattern · Corel-10K · Image retrieval

1 Introduction

In the field of biomedical imaging, there has been a large increase in biomedical data due to advanced techniques like X-ray, magnetic resonance imaging and computed tomography. Manual access, search, indexing and retrieval of this huge amount of data is a difficult task, so there is a need to develop a well-structured technique to overcome this limitation. Content based image retrieval (CBIR) is a smart solution to this problem, which uses the content of the input medical scan and retrieves the scans having similar content. CBIR has wide applications in academia, biomedical imaging, and industry. Existing methods for image retrieval and texture classification are presented in [3,4,24]. Feature extraction plays a vital role in the accuracy of any image retrieval system. In the existing literature, both local and global feature descriptors have been proposed for feature extraction. A local feature descriptor divides the input image


into a number of local parts, followed by spatial and transform domain feature extraction, whereas a global feature descriptor makes use of the entire image for visual feature extraction. Color is one of the prominent features for CBIR. Researchers have proposed various approaches such as the color histogram [21,22], the color correlogram [17], and the central moments of each color channel (mean, standard deviation and skewness) [9] to extract color features from a given input. A single color feature fails to achieve good accuracy on large-scale databases. Along with color, texture is an important characteristic of an image, and various approaches [5,6] have been proposed for texture feature extraction. Ojala et al. [15] proposed the simple and efficient local binary pattern (LBP) approach to extract textural features from an image. A disadvantage of LBP is that it fails to extract directional information (0◦, 45◦, −15◦, etc.). To overcome this disadvantage, various researchers proposed local feature descriptors [7,8,16] for textural feature extraction. Zhang et al. [27] proposed the local derivative pattern to extract higher-order local derivative information for face recognition. Further, Murala et al. [20] proposed a new technique using edge information, taking the magnitude difference between the center pixel and its neighborhood pixels, named the local maximum edge binary pattern. Several image retrieval algorithms for CBIR using local features are proposed by Murala et al. [12,13,18]. Lin et al. [11] combined color, texture and MCM features to propose an algorithm for content based image retrieval. The modified color MCM uses the red, green and blue color planes to collect inter-correlations for CBIR, which overcomes the limitations of the color MCM proposed by Murala et al. [19]. Vipparthi et al. [26] proposed the directional local ternary pattern for CBIR, which gives directional edge information with respect to reference pixels in a particular direction. Murala et al. [14] proposed the local ternary co-occurrence pattern (LTCoP), which uses the concepts of the local ternary pattern and the co-occurrence matrix for medical image retrieval and encodes the co-occurrence relationship between ternary images. Vipparthi et al. [25] used a logical XoR operation with the motif representation of [10] for CBIR, named the directional local motif XoR pattern (DLMXoRP). It is observed from the above discussion that existing local feature descriptors mainly focus on textural feature extraction considering only two directions, and that they lack the extraction of microscopic information. The retrieval accuracy of these approaches can be further improved by incorporating microscopic information. This observation inspired us to propose a novel local feature descriptor which extracts microscopic directional information for image retrieval. In this paper, we propose a new feature descriptor for the image retrieval task. The major contributions of this work are given below: 1. A structural rule based approach is used to encode the relationship between the center pixel and its surrounding pixels in each transformed image. 2. Concomitant conditions are used to extract microscopic directional information and find the correlation between the pixels in each substructure.


3. Further, the Transformed Directional Tri Concomitant Triplet Pattern (TdtCTp) descriptor is implemented by integrating the first two steps. 4. A learning based CBIR technique is incorporated with the help of an artificial neural network alongside the traditional CBIR technique. The rest of the manuscript is organized as follows: Sect. 1 gives the introduction and a literature review on CBIR. Section 2 describes the most widely used existing feature descriptors for face recognition and (natural and medical) image retrieval. The proposed system framework for CBIR based on the TdtCTp feature descriptor is discussed in Sect. 3. Result analysis is discussed in Sect. 4. Finally, Sect. 5 concludes the proposed approach.

2 Local Patterns

2.1 Local Ternary Pattern (LTP)

Tan et al. [23] introduced a new texture operator, the Local Ternary Pattern (LTP), which takes three different values (−1, 0, +1) and is an extension of LBP. In LTP, a constant threshold (th) is used to divide the pixel values into three quantization levels: gray values that lie within the range (Ic ± th) are replaced by zero, gray values above this range are replaced by +1, and the remaining gray values are replaced by −1. The three-valued image obtained from the input image using the LTP descriptor is further divided into two sub-images (upper and lower ternary patterns) to reduce the feature dimension.

f1(Ii, Ic, th) = +1 if Ii ≥ Ic + th;  0 if |Ii − Ic| < th;  −1 if Ii ≤ Ic − th    (1)

where Ic is the intensity at the reference pixel and Ii are the surrounding pixel values within radius q (q > 0, q ∈ N).
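A compact NumPy sketch of Eq. (1) and the usual upper/lower split follows (our own vectorised version; the operator is described in the paper per neighbourhood):

import numpy as np

def ltp_code(neigh, center, th):
    """neigh: array of neighbour intensities, center: reference intensity."""
    code = np.zeros_like(neigh, dtype=int)
    code[neigh >= center + th] = 1
    code[neigh <= center - th] = -1
    return code                              # values in {-1, 0, +1}

def split_upper_lower(code):
    upper = (code == 1).astype(int)          # upper ternary pattern
    lower = (code == -1).astype(int)         # lower ternary pattern
    return upper, lower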

2.2 Local Ternary Co-occurrence Pattern (LTCoP)

Murala et al. [14] proposed the Local Ternary Co-occurrence Pattern (LTCoP), which combines the concepts of LDP and LTP for biomedical image retrieval. Using the gray value of the reference pixel and its surrounding neighbourhood pixels, the ternary edges present in the image are calculated. In LTCoP, the first order derivatives with respect to the center pixel are calculated using Eqs. (2) and (3):

Rp,q(gi) = Rp,q(gi) − Rp,q(gc)    (2)
Rp,q+1(gi) = Rp,q+1(gi) − Rp,q(gc)    (3)

where R is the input image, gc is the gray value of the reference pixel, gi represents the intensity values of the neighbourhood pixels, and p is the number of pixels considered at radius q


for the gray scale relationship (q > 0, q ∈ N). After the first order derivative calculation, the signs of the derivatives are encoded as follows:

R¹p,q(gi) = f1(Rp,q(gi))    (4)
R¹p,q+1(gi) = f1(Rp,q+1(gi))    (5)

Equations (4) and (5) give three-value quantized images, which are then used to find the co-occurrence values between the elements. Detailed information on LTCoP is available in [14].

Fig. 1. 1 × 3 sub-structure construction procedure from given 3 × 3 grid, (a) horizontal, (b) vertical, (c) diagonal and (d) anti-diagonal direction

3 Proposed Feature Descriptor

3.1 Directional Structure Transformed Pattern

The directional structure transformed pattern captures detailed textural information using a structural rule based approach. The input image (I) is divided into overlapping structures of size 3 × 3, which are further processed using four different structural elements (sub-structures) as shown in Fig. 1. These structural elements give detailed directional information in the horizontal (0◦), vertical (90◦), diagonal (45◦) and anti-diagonal (135◦) directions. The transformed image is calculated from each 1 × 3 sub-structure using pixel intensity comparisons as follows. The overlapping reference structure is extracted from the original image (I) of size M × N using Eq. (6):

Iref(m, n) = I(m + t, n + t)    (6)

where t = −1:1, m = 2, 3, ..., M−1 and n = 2, 3, ..., N−1. The 1 × 3 sub-structures (γθ◦) in the directions θ◦ ∈ [0◦, 45◦, 90◦, 135◦] are extracted from each 3 × 3 sub-block using Eqs. (7) and (8):

γ0◦(1, i) = Iref(2, i),   γ90◦(1, i) = Iref(i, 2),   γ45◦(1, i) = Iref(i, i)    (7)
γ135◦(1, i) = Iref(4 − i, i)    (8)


for all i = 1, 2, 3. The transformed image for a particular direction is calculated using Eq. (9):

Tθ◦(m, n) = f(γθ◦)    (9)

where

f(γθ◦) = 1 if γθ◦(1,3) > γθ◦(1,2) and γθ◦(1,2) > γθ◦(1,1)
         2 if γθ◦(1,3) < γθ◦(1,2) and γθ◦(1,2) < γθ◦(1,1)
         3 if γθ◦(1,3) > γθ◦(1,1) and γθ◦(1,2) < γθ◦(1,3)
         4 if γθ◦(1,3) < γθ◦(1,1) and γθ◦(1,2) < γθ◦(1,1)
         5 if γθ◦(1,3) > γθ◦(1,1) and γθ◦(1,2) > γθ◦(1,3)
         6 if γθ◦(1,3) < γθ◦(1,1) and γθ◦(1,2) > γθ◦(1,3)
         7 otherwise

for θ◦ ∈ [0◦, 45◦, 90◦, 135◦], m = 2, 3, ..., M−1 and n = 2, 3, ..., N−1. Using Eq. (9), the input image is transformed into the four directions [0◦, 45◦, 90◦, 135◦].
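A direct transcription of Eq. (9) for a single 1×3 sub-structure follows (our own sketch; the listed cases are checked in order, with the first matching case winning):

def directional_code(gamma):
    """gamma: triple of intensities picked along one of the four directions."""
    g1, g2, g3 = gamma
    if g3 > g2 and g2 > g1:
        return 1
    if g3 < g2 and g2 < g1:
        return 2
    if g3 > g1 and g2 < g3:
        return 3
    if g3 < g1 and g2 < g1:
        return 4
    if g3 > g1 and g2 > g3:
        return 5
    if g3 < g1 and g2 > g3:
        return 6
    return 7

# e.g. horizontal sub-structure of a 3x3 block `blk` (a NumPy array): directional_code(blk[1, :])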

3.2 Proposed Transformed Directional Tri Concomitant Triplet Patterns (TdtCTp)

To derive the Transformed Directional Tri Concomitant Triplet Patterns (TdtCTp), the tri concomitant triplet pattern is applied on each transformed image to capture similar structures present in it by comparing the center pixel with its neighbours. The mathematical procedure to derive the TdtCTp value is given below. First, the first order derivatives of each directional transformed image are calculated with respect to the center pixel of a 5 × 5 pattern using Eqs. (10), (11) and (12):

T¹θ◦(m, n) = Tθ◦(m + t, n + t)    (10)
T¹θ◦p,q(gi) = T¹θ◦p,q(gi) − T¹θ◦p,q(gc)    (11)
T¹θ◦p,q+1(gi) = T¹θ◦p,q+1(gi) − T¹θ◦p,q(gc)    (12)









(14)

where θ◦ ∈ [0◦, 45◦, 90◦, 135◦]. Equations (13) and (14) give three-valued images, which are further used to find the concomitant value between the elements using Eq. (15):

TCTP(m, n) = [ f2(T¹θ◦p,q(g1), T¹θ◦p,q+1(g1)), f2(T¹θ◦p,q(g2), T¹θ◦p,q+1(g2)), ..., f2(T¹θ◦p,q(gp), T¹θ◦p,q+1(gp)) ]    (15)

where θ◦ ∈ [0◦, 45◦, 90◦, 135◦], m = 3, 4, ..., M−2, n = 3, 4, ..., N−2, and

f2(a, b) = 1 if a = b = 1;  −1 if a = b = −1;  0 otherwise    (16)

The procedure used to obtain two LBP images from an LTP image is then applied to the transformed concomitant triplet pattern (TCTP) image in each direction. The unique TdtCTp decimal value for a particular pixel is calculated using binomial weights as

TdtCTpθ◦, LTP/UTP = Σ_{i=1}^{8} TCTPθ◦, LTP/UTP × 2^i    (17)











where θ◦ ∈ [0◦, 45◦, 90◦, 135◦]. The TdtCTp values obtained for all pixels are used to form the feature vector by creating histograms using Eqs. (18) and (19):

Histθ◦, TdtCTp LTP/UTP(p) = Σ_{m=1}^{M} Σ_{n=1}^{N} f2(TdtCTpθ◦, LTP/UTP(m, n), p)    (18)

f2(x, y) = 1 if x = y;  0 otherwise    (19)

where θ◦ ∈ [0◦, 45◦, 90◦, 135◦] and p ∈ [0, 255]. Both histograms (upper and lower) obtained from TdtCTp in all considered directions are concatenated to form the final TdtCTp feature using Eq. (20):

HistTdtCTp = [Histθ◦, TdtCTp UTP, Histθ◦, TdtCTp LTP]    (20)

Fig. 2. Illustration of proposed system algorithm

3.3 Proposed System Framework

The overall flow for calculating the proposed TdtCTp feature is illustrated in Fig. 2, and the step-by-step procedure is given in Algorithm 1.


Algorithm 1. Input: Image, Output: Retrieval Results
1: Load the input image (convert it to gray scale if the input image is RGB).
2: Divide the input image into 3×3 overlapping structures using Eq. (6).
3: From the 3×3 structures, collect four 1×3 sub-structures in the horizontal (0◦), vertical (90◦), diagonal (45◦), and anti-diagonal (135◦) directions using Eqs. (7) and (8).
4: Using Eq. (9), calculate the transformed images in all four directions.
5: Construct the four transformed images.
6: Apply the triplet pattern on the four transformed images.
7: Apply the concomitant condition on the four triplet-transformed images using Eqs. (15) and (16).
8: Divide each directional TCTP image into two images using the values obtained in the previous step, to reduce the feature vector length.
9: Obtain two histograms (upper and lower) from the TdtCTp values of Eq. (17).
10: Construct the final feature vector by concatenating all upper and lower histograms obtained for each direction.
11: Retrieve the best matching images from the database using the distance measure.
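The following is a condensed, hedged sketch of Algorithm 1 for one direction (horizontal) only. It is our own simplification: the 8 neighbours at radii 1 and 2 are taken on the square grid rather than an interpolated circle, the derivative sign coding uses a zero threshold, and only the concatenated upper/lower histogram pair for this direction is returned.

import numpy as np

OFFS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def horizontal_transform(img):
    """7-level directional code of Eq. (9) applied along rows (0 degrees)."""
    g1, g2, g3 = img[1:-1, :-2], img[1:-1, 1:-1], img[1:-1, 2:]
    t = np.full(g2.shape, 7, dtype=np.int32)
    rules = [(1, (g3 > g2) & (g2 > g1)), (2, (g3 < g2) & (g2 < g1)),
             (3, (g3 > g1) & (g2 < g3)), (4, (g3 < g1) & (g2 < g1)),
             (5, (g3 > g1) & (g2 > g3)), (6, (g3 < g1) & (g2 > g3))]
    for code, mask in reversed(rules):   # apply in reverse so earlier rules take precedence
        t[mask] = code
    return t

def tdtctp_histograms(img):
    t = horizontal_transform(img.astype(np.int32))
    H, W = t.shape
    hist_u = np.zeros(256, dtype=np.int64)
    hist_l = np.zeros(256, dtype=np.int64)
    for m in range(2, H - 2):
        for n in range(2, W - 2):
            c = t[m, n]
            code_u, code_l = 0, 0
            for bit, (dy, dx) in enumerate(OFFS):
                s1 = np.sign(t[m + dy, n + dx] - c)          # derivative sign, radius 1
                s2 = np.sign(t[m + 2 * dy, n + 2 * dx] - c)  # derivative sign, radius 2
                if s1 == s2 == 1:        # concomitant condition f2 -> +1 (upper pattern)
                    code_u |= 1 << bit
                elif s1 == s2 == -1:     # concomitant condition f2 -> -1 (lower pattern)
                    code_l |= 1 << bit
            hist_u[code_u] += 1
            hist_l[code_l] += 1
    return np.concatenate([hist_u, hist_l])   # one direction's share of the feature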

4 Results and Discussion

To analyze the effectiveness of the proposed method, experiments are carried out on three different datasets, namely Corel-10K [1], Corel-5K [1] and VIA/I-ELCAP [2]. The retrieval accuracy is measured using self-similarity matching with four different distance measures. The parameters used to measure the retrieval accuracy obtained with the proposed TdtCTp feature descriptor are precision (P), recall (R), average retrieval precision (ARP) and average retrieval rate (ARR), as given in [14]. To select the best n images for a given query image, the similarity between the feature extracted from the query image and each database image feature is calculated. Four distance measures are used for similarity matching: Manhattan (L1, or city-block), Euclidean (L2), d1 and Canberra (sketched below). To examine the effectiveness of the proposed TdtCTp feature descriptor, the retrieval accuracy in terms of ARP and ARR is compared with existing state-of-the-art feature descriptors.
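For reference, the four similarity measures named above can be written as follows (our own helpers; the exact d1 variant used by the authors is not spelled out here, so the form below, common in the CBIR literature and assuming non-negative histogram features, is an assumption):

import numpy as np

def l1(q, t):        return float(np.sum(np.abs(q - t)))                 # Manhattan / city-block
def l2(q, t):        return float(np.sqrt(np.sum((q - t) ** 2)))         # Euclidean
def canberra(q, t):  return float(np.sum(np.abs(q - t) / (np.abs(q) + np.abs(t) + 1e-12)))
def d1(q, t):        return float(np.sum(np.abs(q - t) / (1.0 + q + t)))  # assumed d1 form

# Database images are ranked by increasing distance to the query feature vector.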

4.1 Retrieval Accuracy on Corel Dataset

This experiment is carried out on the Corel-5K and Corel-10K databases [1], which contain 50 and 100 different categories of natural images, respectively, with 100 images per category. The retrieval accuracy comparison of the proposed technique with other state-of-the-art methods on the Corel-5K and Corel-10K datasets in terms of average precision and recall is given in Table 1. The retrieval accuracy measured using the different similarity measures is given in Table 2, and it is observed that the d1 distance measure outperforms the other distance measures. From Tables 1 and 2, we observe that the best overall retrieval accuracy achieved by the existing state-of-the-art methods is 62.06% ARP and 28.37% ARR on the Corel-5K database and 52.46% ARP and 21.89% ARR


Table 1. Precision (P) and Recall (R) comparison of the proposed TdtCTp with other existing feature descriptors on Corel-5K and Corel-10K.

  Method        Corel-5K                      Corel-10K
                Precision (%)   Recall (%)    Precision (%)   Recall (%)
  CS LBP        32.92           14.02         26.41           10.15
  LEPINV        35.19           14.84         28.92           11.28
  LEPSEG        41.57           18.38         34.11           13.81
  LBP           43.62           19.28         37.64           14.96
  BLK LBP       45.77           20.32         38.18           15.33
  DLEP          48.85           21.61         40.09           15.71
  LTP           50.93           21.63         42.95           16.62
  SS-3D-LTPu2   54.16           25.07         44.97           19.09
  SS-3D-LTP     55.31           25.68         46.25           19.63
  3D-LTrP2      60.96           27.61         51.33           21.21
  3D-LTrP       62.06           28.37         52.45           21.89
  TdtCTP        67.77           33.43         59.66           26.04

Table 2. Comparison of Precision and Recall using different distance measures of the proposed method for the Corel-10K, Corel-5K and VIA/I-ELCAP datasets.

  Distance measure   Corel-10K        Corel-5K         VIA/I-ELCAP
                     P (%)   R (%)    P (%)   R (%)    P (%)   R (%)
  d1                 59.66   26.04    67.77   33.43    92.56   51.91
  L1                 56.45   25.36    66.45   33.32    92.41   54.48
  L2                 54.85   24.21    65.51   33.06    91.62   52.85
  Canberra           43.32   18.08    54.96   26.16    80.78   51.87

on the Corel-10K database, whereas the proposed approach achieves 67.77% ARP and 33.43% ARR on the Corel-5K database and 59.66% ARP and 26.04% ARR on the Corel-10K database.

4.2 Retrieval Accuracy on VIA/I-ELCAP Dataset

In experiment #3, the VIA/I-ELCAP dataset is used to measure the retrieval accuracy of the proposed algorithm (TdtCTp) in terms of ARP and ARR. The VIA/I-ELCAP dataset contains 1000 images in 10 main categories, with 100 different images per category. The comparison in terms of precision and recall of the proposed TdtCTp with existing state-of-the-art feature descriptors on the VIA/I-ELCAP dataset is given in Tables 3 and 4, respectively. The retrieval accuracy measured using the different similarity measures is given in Table 2, and it is observed that the d1 distance measure outperforms the other distance measures.


Table 3. Retrieval accuracy comparison in terms of ARP on the VIA/I-ELCAP database.

  Method    Top images considered for retrieval
            10      20      30      40      50      60      70      80      90      100
  LDP       85.21   77.90   72.19   67.45   63.56   60.09   57.19   54.50   51.93   49.32
  LTP       71.10   65.16   61.08   58.04   55.70   53.58   51.59   49.95   48.44   46.93
  LBP       79.21   73.77   70.26   66.90   64.06   61.35   58.93   56.72   54.31   51.92
  LTCoP     86.77   79.03   73.63   69.29   65.58   62.37   59.72   57.18   54.83   52.24
  GLDP      85.96   77.49   71.30   66.33   62.20   58.71   55.59   52.71   50.19   47.79
  GLTP      75.45   65.50   58.63   53.83   50.03   46.87   44.29   42.15   39.95   37.90
  GLBP      84.85   78.34   73.34   69.30   65.64   62.42   59.44   56.64   53.78   51.05
  GLTCoP    89.41   82.07   76.44   72.02   68.23   64.90   61.93   59.17   56.62   53.85
  LTCoP     86.77   79.03   73.63   69.29   65.58   62.37   59.72   57.18   54.83   52.24
  TdtCTP    92.41   86.08   80.79   76.01   71.77   67.77   64.23   61.02   57.75   54.86

Table 4. Retrieval accuracy comparison in terms of ARR on the VIA/I-ELCAP database.

  Method    Top images considered for retrieval
            10      20      30      40      50      60      70      80      90      100
  LDP       8.52    15.58   21.66   26.98   31.78   36.06   40.03   43.60   46.74   49.32
  LTP       7.11    13.03   18.32   23.22   27.85   32.15   36.11   39.96   43.60   46.93
  LBP       7.92    14.75   21.08   26.76   32.03   36.81   41.25   45.37   48.88   51.92
  LTCoP     8.68    15.81   22.09   27.71   32.79   37.42   41.80   45.74   49.34   52.24
  GLDP      8.60    15.50   21.39   26.53   31.10   35.23   38.91   42.17   45.17   47.79
  GLTP      7.55    13.10   17.59   21.53   25.02   28.12   31.00   33.72   35.96   37.90
  GLBP      8.49    15.67   22.00   27.72   32.82   37.45   41.61   45.31   48.40   51.05
  GLTCoP    8.94    16.41   22.93   28.81   34.11   38.94   43.35   47.33   50.96   53.85
  LTCoP     8.84    16.23   22.71   28.53   33.71   38.46   42.83   46.72   50.17   52.90
  TdtCTP    9.24    17.22   24.24   30.40   35.89   40.67   44.96   48.81   52.42   54.86

From Tables 2, 3 and 4, we observe that the best overall retrieval accuracy achieved by the existing state-of-the-art methods is 88.48% ARP and 54.56% ARR, whereas the proposed technique achieves 92.41% ARP and 54.86% ARR using the traditional CBIR technique.

5 Conclusion

We have proposed a novel algorithm (TdtCTp) for biomedical and natural image retrieval, which is tested on three publicly available standard biomedical and natural image datasets. TdtCTp extracts local structures from the input image, which are later converted into four sub-local structures, i.e. the pattern encodes


information in sub-local structures. The minute details about pixel progression in the sub-local structures are encoded using concomitant conditions, so that the detailed intensity changes in four different directions are recorded. This makes the proposed method (TdtCTp) strong compared with existing popular local patterns and gives a significant improvement in the experimental results. We have performed experiments on three different databases, and the proposed method gives the best retrieval accuracy using the d1 distance similarity measure. From the experimental analysis, we conclude that the proposed local feature descriptor (TdtCTp) performs well and is computationally efficient. The significant improvement in the retrieval accuracy of the proposed TdtCTp feature descriptor in terms of ARP and ARR shows its effectiveness for image indexing and retrieval.


Encoder Decoder Based Image Semantic Space Creation for Clothing Items Retrieval Keshav Kumar Kedia, Gaurav Kumar Jain(B) , and Vipul Grover Samsung R&D Institute India Bangalore, Bangalore 560037, Karnataka, India {keshav.kedia,gaurav.kjain,vipul.grover}@samsung.com

Abstract. Suggesting clothing items from query images is very convenient in online shopping; however, feature extraction for the intended query is not easy because of the other objects and background variations present in these images. We need a better feature extractor that extracts features only from the intended clothing item in the query image and rejects strong features from the remaining parts. To accomplish this, we propose a single-step encoder-decoder based image translation and feature extraction approach with a reverse weight transfer method for creating a common space between the catalog of shopping images and the query images. We present generated and retrieved images from the test set and compute the exact image retrieval accuracy. We achieved 30.6%, 46.8% and 56.3% exact image retrieval accuracy for top-1, top-3 and top-5 suggestions, respectively. We also conducted a subjective survey on similar-item retrieval: on average, 3.33 clothing items are relevant out of 5 items retrieved over ~500 responses, which shows the effectiveness of this method.

Keywords: Cross domain learning · Image translation · Encoder-decoder · Image extraction · Image generation · Deep learning

1 Introduction

Online shopping using query images is becoming popular and convenient among shoppers. However, it poses a challenging retrieval problem, because only weak features can be extracted from the query images. Suppose a user wishes to shop for clothing items similar to those shown in the query images of Fig. 1 and expects relevant clothing items to be retrieved as suggestions. The suggestions should be as relevant as possible with respect to features like shape, pattern, color, clothing type and so on. The query images of Fig. 1 have different backgrounds and poses, which makes the retrieval problem challenging. In the past, image processing has been used to achieve such tasks using image retrieval techniques. Approaches using human pose estimation followed by segmentation and clustering have been suggested and used in the past. Moreover, recently there has been a surge in the computer vision community to use adversarial training based methods for cross domain image translation tasks. Many of these methods claim to learn a common semantic space between the two domains while training the generator to generate realistic images in the target domain conditioned on the input domain. Some approaches
© Springer Nature Singapore Pte Ltd. 2020
N. Nain et al. (Eds.): CVIP 2019, CCIS 1147, pp. 383–394, 2020. https://doi.org/10.1007/978-981-15-4015-8_34


Fig. 1. Input Image Poses: Different poses in input images make the retrieval problem challenging and difficult. (Images taken from LookBook dataset. https://drive.google.com/file/d/0By_ p0y157GxQU1dCRUU4SFNqaTQ)

recently used generative adversarial methods for cross domain image translation using unpaired data by using cyclic reconstruction loss. While these adversarial learning based unsupervised methods have been tested for tasks involving image translation between similar domains, they have not been tested in tasks where the two domains share only a portion of semantic information. Moreover most of these generative methods focus more on realistic image generation than learning a strong common semantic space. Generally traditional approaches using segmentation and feature extraction fail because same model does not work very well for different poses. We overcome this challenge effectively in the proposed method. The proposed approach is a single step encoder-decoder based image translation and cross domain feature extraction method which learns to generate feature vectors in the common semantic space.

2 Related Work There have been some work in the past in finding similar clothing from query images using image processing techniques. Some deep learning based approaches have also been suggested in the field of cross domain learning. A clothing retrieval technique from video content by performing frame retrieval with temporal aggregation and fast indexing technique is presented in [2]. A cross scenario clothing retrieval using parts alignment and auxiliary sets is presented in [3]. Clothing retrieval using similarity learning is proposed by [4] which tries to learn a similarity measure between query and shop items using a three layer fully connected network. A dual attribute-aware ranking network for retrieval feature learning using semantic attribute learning is proposed in [5]. A local similarity based method for coding global features by merging local descriptors used for clothing retrieval is proposed by [6]. A clothing image recommendation system using image processing techniques is suggested by [7]. Our work is based on image generation which is used as base model to find the semantic space for retrieving similar clothes. There have been many cross domain learning methods based on adversarial learning. The adversarial generative approach was first proposed by [16] which has an adversarial generation framework called Generative


Adversarial Nets (GAN) which simultaneously trains two models: a generative model also called the generator and a discriminative model also called the discriminator where generator tries to generate data close to the input dataset to fool the discriminator and discriminator takes input data and tries to discriminate between the generated dataset and the real dataset. In [17] GAN is expanded to conditional version by feeding both the generator and discriminator with some extra conditional information as additional input layer and a generative model capable of producing high quality images is introduced in [18]. In [19] a deep convolutional GAN architecture capable of producing images with very less noise is introduced which is relatively stable than other GANs that produced noisy images. In cross domain learning using generative methods there have been supervised approaches [1, 13] as well as unsupervised approaches [8–10, 12, 15]. These methods use adversarial training to learn common semantic space between two image domains. They have been applied to tasks like style transfer and image translation. These unsupervised adversarial models have been tested for image translation tasks between similar domains but not between vastly different domains very successfully. An image-conditional image generation model using real/fake discriminator and a domain-discriminator is suggested in [1] to generate pixel images in clothes domain from model images. This method focuses more on generating realistic looking images and not on learning strong feature representations in the common semantic space. Further, this method only learns to generate images from model domain to clothes domain and the reverse is not done. Our method is very inspired by [1] but it does not use adversarial methods and extends the domain transfer and image translation task to cross domain mapping between different domains.

3 Objective The objective is to learn the common semantic space between query images and clothing catalog images domain such that given a query image of people wearing clothes, similar clothing items can be suggested with high relevance.

4 Conventional Approach Conventional approaches generally involve segmentation, feature extraction and then nearest neighbor algorithms are used to retrieve similar clothing items as depicted in Fig. 2.

Fig. 2. Conventional method of retrieval task

These approaches have the drawback that the same model for segmentation and feature extraction may not work well for different poses, such as side poses. Further, feature extraction generally uses some predetermined features, which limits the model to giving suggestions based only on those features.


5 Proposed Method

The proposed method contains an encoder-decoder based image translation model that learns the shared semantic space between the two domains. It uses an encoder-decoder based image translator trained to generate clothing domain images (target domain images) from the corresponding query domain images (source domain images). In the process of training the image translator, the source domain encoder is trained and a semantic space is generated. Feature vectors from this semantic space, which are strong enough to generate clothing images from query images, are then extracted. This trained source domain encoder is then used to generate the ground truth for training the target domain encoder. Before training the target domain encoder, it is initialized with the weights of the source domain encoder. It is then trained to output feature vectors in the same semantic space as that of the query images, using the generated feature vectors as ground truth.

Reverse Weight Transfer: Figure 3 shows a high-level diagram of the proposed method. Figure 3a shows the encoder-decoder based image translator and Fig. 3b shows the training of the target encoder using ground truth generated by the source encoder. After training the first part, the encoder part of the model, w, is used to find the feature vectors of all source images. The weights of w are then transferred to w′, and w′ is trained using target images as input and the feature vectors of the corresponding source images as ground truth, as explained earlier. This weight transfer between the two encoders helps the model converge easily and allows the two encoders to encode their respective input images in the same semantic space. The training time improved almost two-fold when the transfer of weights was done. Intuitively, this transfer method works in the reverse direction while still supporting fast and better convergence, which differs from conventional weight transfer that generally happens in the same direction. After training, there are two trained encoders, w and w′. Figure 4 shows the process to retrieve clothing based on a query image. For all the clothing catalog images, w′ is used to find the feature vectors, while the source encoder w is used to generate the feature vector from the query image. These feature vectors are part of the same semantic common feature

Fig. 3. Proposed method steps (a) Image translation training (b) target encoder training


space and hence distance based similarity metrics can be used to extract similar items and relevant suggestions can be retrieved.

Fig. 4. Retrieval process where clothes are suggested based on query image

The model once trained can suggest relevant clothing images from the clothing domain conditioned on the input query image by automatically generating the feature vectors and extracting nearest clothing. This is accomplished without any segmentation and without using image processing techniques for identifying different features like texture, color and pattern. The results however show that the model is able to retrieve relevant clothing items with respect to all these features. Although in this paper experimentation and results are focused in the query to clothing domain, this method is not limited to this particular domain and can be easily extended to other domains since this method learns the common semantic space. The two domains here share a lot of features as both domains have the same clothing with transformation, deformation and some other source domain specific features like background, model, pose etc. The performance of proposed method in cross domain tasks where two domains share partial features will have to be investigated in further research. Proposed method uses image translation to train the model but does not uses adversarial methods which have become popular for all image generation related tasks in recent times. This is because adversarial methods tend to focus more on generating realistic images which is not actually the aim of this work. Here the focus is on learning a strong feature representation for the images and hence using adversarial methods is avoided, so instead of using adversarial losses while training image translator, mean square loss is used. Further, cross domain learning using adversarial methods use multiple losses like adversarial loss, cyclic reconstruction loss and so on. The performance of such models are very sensitive to the weights of the loss terms. Also since most of these methods depend on cyclic reconstruction for learning the common semantic space, these methods tend to not work very well for domains where the two domains share only a portion of semantic information and hence reconstruction of images from this common semantic space is not possible in both directions. Proposed method overcomes this problem by training two encoders separately and avoiding cyclic reconstruction.
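As a concrete illustration of the reverse weight transfer described above, the following minimal Keras sketch copies the trained source encoder's weights into the target encoder before regressing the target encoder onto the source-side feature vectors. It assumes two encoder models with identical architectures (such as the one sketched in the next section); the function names, optimizer settings and batch size here are illustrative assumptions, not the authors' implementation.

```python
import tensorflow as tf

def reverse_weight_transfer(source_encoder, target_encoder,
                            source_images, target_images, epochs=10):
    """Train the target (clothing) encoder to map catalog images into the
    semantic space learned by the source (query) encoder."""
    # Ground-truth feature vectors come from the already-trained source encoder.
    source_features = source_encoder.predict(source_images)

    # Reverse weight transfer: initialise the target encoder with the
    # weights of the trained source encoder.
    target_encoder.set_weights(source_encoder.get_weights())

    # Regress target-domain images onto the source-domain feature vectors.
    target_encoder.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    target_encoder.fit(target_images, source_features, epochs=epochs, batch_size=64)
    return target_encoder
```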


6 Architecture

The model architecture for the encoder-decoder is shown in Fig. 5. Both the source and target domain encoders have the same architecture, hence the target encoder is not shown separately. In encoder-decoder based image translation models, skip connections are generally used, which makes training faster and better; but then the decoder and encoder would not be independent of each other, and the feature representation of the encoder may not be very strong even though the generated images may be very good. Thus we decided not to use any skip connections. The architecture has a series of convolution layers in the encoder, each followed by a batch normalization layer, and a series of transposed convolution layers in the decoder network, again each followed by a batch normalization layer. The last layer in both the encoder and the decoder is not followed by a batch normalization layer. A kernel size of [3 × 3] was used in every convolution as well as transposed convolution layer in the architecture. Mean squared error loss was used for training both the image translator and the clothing domain encoder.

Fig. 5. Encoder and decoder model architecture
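The exact numbers of layers and filters are given in Fig. 5 and are not restated in the text, so the filter counts and latent size in the sketch below are assumptions; only the stated constraints (3 × 3 kernels, Conv/BatchNorm pairs, no skip connections, no batch norm after the last layer, 64 × 64 inputs) come from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_encoder(input_shape=(64, 64, 3), filters=(32, 64, 128, 256)):
    """Encoder: stacked 3x3 strided convolutions, each (except the last)
    followed by batch normalization; no skip connections are exposed."""
    inp = layers.Input(shape=input_shape)
    x = inp
    for i, f in enumerate(filters):
        x = layers.Conv2D(f, 3, strides=2, padding="same", activation="relu")(x)
        if i < len(filters) - 1:          # last layer has no batch norm
            x = layers.BatchNormalization()(x)
    return models.Model(inp, x, name="encoder")

def build_decoder(latent_shape=(4, 4, 256), filters=(128, 64, 32)):
    """Decoder: stacked 3x3 transposed convolutions with batch normalization,
    ending in a 3-channel image layer without batch norm."""
    inp = layers.Input(shape=latent_shape)
    x = inp
    for f in filters:
        x = layers.Conv2DTranspose(f, 3, strides=2, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
    out = layers.Conv2DTranspose(3, 3, strides=2, padding="same", activation="sigmoid")(x)
    return models.Model(inp, out, name="decoder")

encoder = build_encoder()
decoder = build_decoder(latent_shape=encoder.output_shape[1:])
translator = models.Model(encoder.input, decoder(encoder.output), name="translator")
```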

7 Training The dataset used for training is LookBook dataset which was presented by [1] in their work where they use a generative adversarial method for clothes image generation from model images.


The dataset has 75000 images consisting of different poses of models wearing clothes and also their corresponding product images. The model images are used as the input query images domain and the product images as the catalog clothing images domain. Figure 6 presents some sample model images and some corresponding product images from LookBook dataset. We resized the images to [64 * 64] resolution and then randomly picked 5% images as our test set, 5% as validation set and rest for training. We used the model images from our training set as the input to image translation model and their corresponding product images as ground truth. After training the encoder-decoder based image translation model, encoder is used as source domain encoder then feature vectors are extracted out of the model images. The corresponding product images become the input for training the target domain encoder and the generated feature vectors become the ground truth. Then target encoder is initialized with source encoder weights before training. This training method using transfer of weights helped in early convergence and also made the training stable as depicted in Fig. 7. Once target encoder is trained we use it to generate the feature vector space for all catalog clothing elements and then for any input query image similar items can be retrieved.

Fig. 6. Example ground truth data from LookBook dataset

Keras is used to implement the models. Adam optimizer was used with learning rate set to 0.001, β1 = 0.9 and β2 = 0.999.
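The training setup described above (MSE loss, Adam with learning rate 0.001, β1 = 0.9, β2 = 0.999, 64 × 64 inputs, 90/5/5 split) could be wired up roughly as follows. The array names, epoch count and batch size are assumptions for illustration; `translator` and `encoder` refer to the architecture sketch above.

```python
import tensorflow as tf

# train_model_imgs   : query-domain (model) images, shape (N, 64, 64, 3), scaled to [0, 1]
# train_product_imgs : clothing-domain (product) images, same shape, paired index-wise
# val_*              : the 5% validation split; a further 5% is held out as the test set
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

translator.compile(optimizer=optimizer, loss="mse")
translator.fit(train_model_imgs, train_product_imgs,
               validation_data=(val_model_imgs, val_product_imgs),
               epochs=50, batch_size=64)

# The 'encoder' model from the previous sketch now holds the trained
# source-encoder weights, since it shares its layers with the translator.
```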


Fig. 7. Loss graph for target encoder training using weight transfer and without reverse weight transfer. Clearly reverse transfer helps in fast convergence in opposite direction as well.

8 Results and Inferences

To evaluate the performance, the top 5 similar clothing images are retrieved for the test query images (the 5% of images held out as the test set). Table 1 shows the precision of exact product retrieval in the Top-1, Top-3 and Top-5 results. It should be noted that this result measures retrieval of the exact corresponding clothing image, not retrieval of similar images.

Table 1. Top-N precision percentage results

Top N nearest clothing   Precision percentage (%)
Top 1                    30.6
Top 3                    46.8
Top 5                    56.3
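The exact-retrieval precision in Table 1 can be computed as sketched below, assuming flattened feature vectors produced by the two trained encoders and Euclidean distance as the similarity metric (the paper only says "distance based similarity metrics", so the metric choice here is an assumption).

```python
import numpy as np

def top_n_exact_precision(query_feats, catalog_feats, true_idx, n=5):
    """Fraction of queries whose exact catalog item appears among the
    n nearest catalog feature vectors."""
    hits = 0
    for q, t in zip(query_feats, true_idx):
        d = np.linalg.norm(catalog_feats - q, axis=1)   # distance to every catalog item
        if t in np.argsort(d)[:n]:
            hits += 1
    return hits / len(query_feats)

# Example usage with hypothetical arrays qf (queries), cf (catalog), gt (ground-truth indices):
# p1, p3, p5 = (top_n_exact_precision(qf, cf, gt, n=k) for k in (1, 3, 5))
```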

Since there is only one perfect clothing image corresponding to every input query image, the other four images retrieved should also have similar semantic features to that of the input query image but we have no quantitative metric to validate this. Hence to validate this, the top 5 results are retrieved for sample query images from test set as shown in Fig. 8. It is observed from the results that retrieved results have common semantic features like color, pattern and shape with the query input image. Further, in


Fig. 8. Input and Top 5 output shown for some of the test images data. The model is able to learn and retrieve clothing items similar to input query in color, texture and type of clothing

Fig. 9 we show that for different poses of the input query image, the model extracts relevant images and remains invariant of the differences. This shows effectiveness of this method beyond traditional methods.


Fig. 9. Same query image in different poses: our model is able to extract the exact clothing image from the test set ground truth clothing images which shows the robustness of the model

Further subjective validation via a survey was organized to quantify the relevance of the proposed method. In this survey, 20 sample images from the test data were randomly selected and participants were asked to mark the relevant images among the 5 retrieved images for each sample. Table 2 shows the number of relevant images and the corresponding percentage of responses.

Table 2. Subjective evaluation survey results

No. of relevant images [among 5]   Percentage (%) response
5                                  21.84
4                                  23.42
3                                  28.94
2                                  18.15
1                                   7.36
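As a quick sanity check on the average reported below, the expected number of relevant images implied by Table 2 can be computed directly; the rows sum to roughly 99.7%, and the small residual presumably corresponds to responses with no relevant image.

```python
counts = [5, 4, 3, 2, 1]
percentages = [21.84, 23.42, 28.94, 18.15, 7.36]

# Weighted average number of relevant images per query.
expected = sum(c * p for c, p in zip(counts, percentages)) / 100.0
print(round(expected, 2))   # ~3.33, matching the average reported in the text
```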

Results indicate that for around 75% of the time 3 or more images are relevant out of 5, which is very effective. Average number of relevant images in Top-5 are 3.33 which is quite significant in shopping retrieval wherein 3 relevant items among Top-10 presented to user is considered a good recommendation system. Figure 10 shows some test input images and their corresponding intermediate generated images by the encoder-decoder based image translator. This shows how well the clothing images are being generated and hence strong features are extracted out from the generated images. It is observed that intermediate generated images capture semantic features like shape, color and pattern very well.


Fig. 10. Some sample input (a) and their corresponding generated outputs (b)

9 Conclusion

The work presented here learns the common semantic space between two partially matching domains very effectively, and the features extracted from this space retrieve clothing items with high relevance. Training is also more stable and faster when the reverse transfer of weights from the source to the target encoder is performed. Thus, this work presents a novel method for cross domain learning between the input query image and catalog clothing image domains. The approach can be easily extended to other cross domain learning tasks, which can be investigated in future research.

References 1. Yoo, D., Kim, N., Park, S., Paek, Anthony S., Kweon, I.S.: Pixel-level domain transfer. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 517–532. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_31 2. Garcia, N., Vogiatzis, G.: Dress like a star: retrieving fashion products from videos. In: Proceedings of the IEEE International Conference on Computer Vision (2017) 3. Liu, S., et al.: Street-to-shop: cross-scenario clothing retrieval via parts alignment and auxiliary set. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE (2012) 4. Hadi Kiapour, M., et al.: Where to buy it: matching street clothing photos in online shops. In: Proceedings of the IEEE International Conference on Computer Vision (2015) 5. Huang, J., et al.: Cross-domain image retrieval with a dual attribute-aware ranking network. In: Proceedings of the IEEE International Conference on Computer Vision (2015) 6. Mizuochi, M., Kanezaki, A., Harada, T.: Clothing retrieval based on local similarity with multiple images. In: Proceedings of the 22nd ACM International Conference on Multimedia. ACM (2014) 7. Hsu, E., Paz, C., Shen, S.: Clothing image retrieval for smarter shopping. EE368. Department of Electrical and Engineering, Stanford University (2011) 8. Zhu, J.-Y., et al.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision (2017) 9. Kim, T., et al.: Learning to discover cross-domain relations with generative adversarial networks. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70 (2017). JMLR.org 10. Li, J.: Twin-GAN–unpaired cross-domain image translation with weight-sharing GANs. arXiv preprint arXiv:1809.00946 (2018) 11. Lu, Y., Tai, Y.-W., Tang, C.-K.: Conditional cyclegan for attribute guided face image generation. arXiv preprint arXiv:1705.09966 (2017) 12. Royer, A., et al.: XGAN: unsupervised image-to-image translation for many-to-many mappings. arXiv preprint arXiv:1711.05139 (2017)


13. Isola, P., et al.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017) 14. Yi, Z., et al.: DualGAN: unsupervised dual learning for image-to-image translation. In: Proceedings of the IEEE International Conference on Computer Vision (2017) 15. Liu, M.-Y., Breuel, T., Kautz, J.: Unsupervised image-to-image translation networks. In: Advances in Neural Information Processing Systems (2017) 16. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems (2014) 17. Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411. 1784(2014) 18. Denton, E.L., Chintala, S., Fergus, R.: Deep generative image models using a Laplacian pyramid of adversarial networks. In: Advances in Neural Information Processing Systems (2015) 19. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)

Feature Learning for Effective Content-Based Image Retrieval Snehal Marab(B) and Meenakshi Pawar SVERI, Pandharpur, Maharashtra, India [email protected]

Abstract. The increasing use of mobile phones for scene capture has led to an exponential increase in the size of digital libraries. Content-Based Image Retrieval (CBIR) is an effective solution to handle this enormous data. Most of the existing approaches rely on hand-crafted feature extraction for CBIR. These approaches can fail on large-scale databases. Thus, in this paper, we propose an end-to-end deep network for CBIR. The proposed network is characterized by residual learning and thus does not suffer from the vanishing gradient problem. The performance of the proposed method has been evaluated on two benchmark databases, namely Corel-10K and GHIM-10K, with the help of two evaluation parameters, i.e., recall and precision. Comparison with the state-of-the-art methods reveals that the proposed method outperforms existing methods by a large margin on all of the databases.

Keywords: Content-based image retrieval · Deep network · Corel-10K · GHIM-10K

1 Introduction

1.1 Motivation

Nowadays, there is exponential growth in the use of high-resolution camera phones. This increase in the use of mobile devices is the main reason for the expansion of the number of digital images. Manual labelling of these images is a tedious task, as one has to do this for each image in a huge database. Thus, there is a need for an intelligent system that can perform this task automatically and efficiently, by assigning labels based on the image content. Such intelligent, automatic labeling makes it easy to search, access, and retrieve images from a huge database. Content-based image retrieval has been an area of interest in computer vision for the last two decades because it identifies images by their content rather than by manually assigned labels. The general pipeline of a CBIR system comprises feature extraction followed by similarity measurement and the retrieval task. Robust feature extraction is a key step in any computer vision task, and thus in CBIR as well. Hand-crafted
© Springer Nature Singapore Pte Ltd. 2020
N. Nain et al. (Eds.): CVIP 2019, CCIS 1147, pp. 395–404, 2020. https://doi.org/10.1007/978-981-15-4015-8_35


Fig. 1. Proposed approach for Content-Based Image Retrieval. (a) Represents the training procedure while (b) represents the testing counterpart of the proposed approach.

features are designed to extract one type of image content, such as color, texture or shape. However, it is a difficult task to obtain a single best abstract representation for an input image. Thus, feature learning is essential for improving the performance of a CBIR system. Therefore, in this paper, we propose the use of a convolution neural network for robust feature learning, followed by similarity measurement and the retrieval task. In the next sub-section, we discuss the available image retrieval approaches.

1.2 Related Work

Feature extraction is broadly divided into global and local features. Global feature extractors access the entire image to extract features, whereas local feature extractors access parts of an image. Researchers make use of image content such as color, texture, and shape for feature extraction. Among these, textural feature extraction has been found effective in content-based image retrieval. Initially, textural feature extraction algorithms [14,18,20,23,29,31,32] were proposed for texture classification. The discrete wavelet transform (DWT) was the first choice of researchers for extracting directional information from a given image. Ahmadian et al. [1] proposed a wavelet-based feature extraction for texture classification. DWT extracts three directions of information, i.e., 0°, 90° and 45°, from a given image. However, more directions could improve the system's performance for a specific task. To surpass this directional limitation, the Gabor transform [18] and the rotated wavelet transform (RWT) [14] were proposed for feature extraction. Manjunath et al. [18] make use of the Gabor wavelet for texture classification. They extracted the 1st order moments of the Gabor wavelet response. As first


order moments are scale varient, their approach fails for variation in scale. Han et al. [12] proposed a novel Gabor transform which is invarient to scale as well as image rotation for CBIR. In last decade, local feature extraction took place of initial statistical feature extraction approaches. Local feature extraction comprises of encoding of input image with the help of respective local operator followed by histogram computation of encoded image. Computed histogram represents the feature vector for the given input image. Initially, Ojhala et al. [23] proposed a novel approach local binary pattern (LBP) for local feature extraction. They proposed use of LBP for face recognition. Further, variants of the LBP [2,4,7–10,19,21,26,29,31,32] were proposed for various tasks. Hambarde et al. [11] proposed local feature extraction based on the factorization of local matrix. Murala et al. have proposed several local feature discriptor [19–22,29,30] for natural and medical image retrieval. Murala et al. [20] have proposed Local Tetra Pattern which encodes input local region into tetra codes and extracts directional information. Also, they proposed spherical symmetric local ternary pattern [30] to extract multi-scale information from given medical scan. Several other researchers proposed other local feature discriptor for different tasks. Even tough local feature discriptor extracts the edge and local contrast information they fails to extract robust or discriminative features in case of complex scene. Because, in complex scene the first essential thing need to be understand is the semantics of the scene. In most of the cases natural outdoor scenes are occupied by number of objects and thus it is difficult to give a single label to these kind of images. A strong feature learning algorithm is required to extract features from complex scenes such that it could learn the salient part of the scene and discards the redundant and less important features. In last decade, convolution neural network (CNN) shown promising results for almost all computer vision applications. Various researchers make use of CNN for different applications like image recognition [13,15,27,28], image enhancement [5,6], moving object segmentation [24,25], anomaly detection [3]. The reason behind this is the robust feature learning which was the missing in previous hand-crafted feature extraction algorithms. Initially Krizhevsky et al. [15] proposed AlexNet for object recognition. Further, Simonyan et al. [28] proposed VGGNet for the image classification. They showed the use of small convolution filters allows to increase the network depth without decreasing the network accuracy. Still, beyond certain limit of network depth, performance of deep network starts dicreasing. The reason behind this is vanishing gradient. By introducing the identity mapping, He et al. [13] resolved the major thread of vanishing gradient. As discussed earlier, robust feature extraction is a key to design a reliable computer vision system. Thus, in this paper, we propose use of convolution neural network for content-based image retrieval. We utilized modified ResNet architecture for the feature extraction followed by the similarity measurement for the index matching and retrieval task. We also, compare the proposed approach with existing learning-based (deep networks) as well as non-learning approaches (hand-crafted features). Next section describes the proposed approach for the content-based image retrieval.
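The local-pattern pipeline summarized above (encode each pixel's neighbourhood with a local operator, then histogram the encoded image) can be illustrated with the basic 3 × 3 LBP operator of Ojala et al. [23]. This is a generic sketch of that operator, not code from any of the cited works.

```python
import numpy as np

def lbp_histogram(gray):
    """Basic 8-neighbour LBP: threshold each 3x3 neighbourhood at its centre,
    pack the 8 comparison bits into a code, and histogram the codes."""
    g = np.asarray(gray, dtype=float)
    centre = g[1:-1, 1:-1]
    # Offsets of the 8 neighbours, enumerated clockwise from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(centre, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        codes |= ((neighbour >= centre).astype(np.uint8) << bit)
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / hist.sum()   # normalised 256-bin feature vector
```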


The rest of the manuscript is organized as follows: Sect. 1 illustrates the motivation and a literature review on CBIR. Section 2 depicts the proposed approach for CBIR. Result analysis using the proposed and existing methods is carried out in Sect. 3. Finally, Sect. 4 concludes the proposed method for CBIR.

2

Proposed Method

In this section, the proposed approach for content-based image retrieval based on a convolution neural network is discussed. An overview of the proposed network is shown in Fig. 1. We utilized ResNet for feature extraction, followed by similarity measurement for the index matching and retrieval task. The next sub-section gives the architectural details of the proposed network.

2.1 Residual Learning

Vanishing gradient is a core problem in training deep networks. As the network depth increases, this problem becomes more intense, which is reflected in a decrease in system accuracy. He et al. [13] proposed the residual learning approach (ResNet) for object recognition. According to [13], convolution networks can be substantially deeper, more accurate, and efficient to train if they contain shortcut connections between the network layers, which they call identity mappings. The concept of identity mappings is coarsely analogous to the connections in the visual cortex (as it comprises feed-forward connections). Thus, it performs well on different popular vision benchmarks such as object detection, image classification, and image reconstruction. We consider the tiny ResNet-18 architecture to extract the features from the input query image. Figure 1 shows the residual block used in the proposed network. The mathematical formulation of the residual block is given by Eqs. 1 and 2:

F_1^k = \left\{ W_1^k \mid k \in [1, N] \right\} * F + \left\{ B_1^k \mid k \in [1, N] \right\}   (1)

F_2^k(x) = \begin{cases} F_1^k(x) & \text{if } F_1^k(x) > 0 \\ 0 & \text{otherwise} \end{cases}   (2)

where * represents the convolution operation, W_1^k and B_1^k are the filters and biases at conv1 respectively, N is the number of filters used at conv1, F is the input feature map to the residual block, and F_1^k represents the output of the first convolution layer. F_2^k is the output of the rectified linear unit (ReLU) layer. Similar to Eq. 1, F_3^k is the output of the second convolution layer. Equation 3 shows the functioning of the identity mapping, given that F_3 and F have the same number of feature maps:

F_4^k(x) = F_3^k(x) + F^k(x)   (3)
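A minimal Keras sketch of the basic residual block formalized in Eqs. 1–3 is given below (convolution → batch norm → ReLU → convolution → batch norm, plus the identity shortcut). The filter count and the placement of batch normalization follow the standard ResNet-18 basic block and are assumptions rather than details stated in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, n_filters):
    """Basic residual block: two 3x3 convolutions with an identity shortcut,
    i.e. F4 = F3 + F (Eq. 3), assuming matching numbers of feature maps."""
    shortcut = x
    y = layers.Conv2D(n_filters, 3, padding="same")(x)     # Eq. 1: W * F + B
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)                        # Eq. 2: ReLU
    y = layers.Conv2D(n_filters, 3, padding="same")(y)      # second convolution (F3)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                         # Eq. 3: identity mapping
    return layers.Activation("relu")(y)
```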


Fig. 2. Sample images from COREL-10K database.

2.2

Index Matching and Image Retrieval

To this end, we utilized a minimum distance classifier for the index matching and image retrieval task. We make use of the Canberra distance to measure the similarity between the query image features and the database features, given by Eq. 4:

D(Q, DB) = \sum_{q=1}^{N} \frac{\left| f_{DB_{p,q}} - f_{Q_q} \right|}{\left| f_{DB_{p,q}} \right| + \left| f_{Q_q} \right|}   (4)

where Q is a given query image, N is the feature vector length, DB is a database image, f_{DB_{p,q}} is the q-th feature of the p-th image in the database, and f_{Q_q} is the q-th feature of the query image. Algorithm 1 gives the steps of the proposed method for content-based image retrieval.
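A minimal NumPy sketch of this Canberra-distance index matching is shown below; it assumes the ResNet-18 features have already been flattened into fixed-length vectors, and the small epsilon guarding against division by zero is an addition for numerical safety.

```python
import numpy as np

def canberra_distance(query_feat, db_feat, eps=1e-12):
    """Canberra distance of Eq. 4 between two feature vectors."""
    num = np.abs(db_feat - query_feat)
    den = np.abs(db_feat) + np.abs(query_feat) + eps   # eps avoids division by zero
    return np.sum(num / den)

def rank_database(query_feat, db_feats):
    """Return database indices sorted from most to least similar."""
    dists = np.array([canberra_distance(query_feat, f) for f in db_feats])
    return np.argsort(dists)
```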


Algorithm 1. Input: Image, Output: Retrieval Results
1: Extract features from all database images using ResNet-18.
2: Take an input query image.
3: Extract features from the query image using ResNet-18.
4: Use a similarity measure to find the images closest to the query image.
5: Retrieve the top N images from the database.

3 Results and Discussion

To analyze the effectiveness of the proposed method, experimentation is carried out using two different techniques on two publicly available benchmark datasets, namely Corel-10K [17] and GHIM-10K [17]. The parameters used for measuring the retrieval accuracy obtained with the proposed network are Precision (P), Recall (R), Average Retrieval Precision (ARP) and Average Retrieval Rate (ARR), computed using Eqs. (5) and (6). The precision (P) and recall (R) for a query image are defined as follows:

\text{Precision: } P(I_q) = \frac{N_R \cap N_{RT}}{n_{RT}}; \qquad ARP = \frac{1}{|DB|} \sum_{n=1}^{|DB|} P(I_n) \Big|_{n \le 10}   (5)

\text{Recall: } R(I_q) = \frac{N_R \cap N_{RT}}{n_R}; \qquad ARR = \frac{1}{|DB|} \sum_{n=1}^{|DB|} R(I_n) \Big|_{n \ge 10}   (6)

where N_R is the total number of relevant images present in the database, N_RT is the total number of retrieved images similar to the query image, N_R ∩ N_RT gives the retrieved images that belong to the query image category, n_R is the total number of relevant images present in the database for the query image, n_RT is the total number of images retrieved using the similarity measure, I_n is the n-th query image, and the total number of images in the database is denoted by |DB|. To examine the effectiveness of the proposed feature learning approach, the retrieval accuracy in terms of ARP and ARR is compared with existing state-of-the-art feature descriptors [15,20,23,28,30,31].
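The sketch below shows how these precision/recall and ARP/ARR quantities could be computed from ranked retrieval lists; the use of category labels and the number of relevant images per class (100 for Corel-10K) are illustrative assumptions.

```python
import numpy as np

def precision_recall(ranked_labels, query_label, n_retrieved, n_relevant):
    """P and R for one query given the labels of the ranked database images."""
    hits = np.sum(np.asarray(ranked_labels[:n_retrieved]) == query_label)
    return hits / n_retrieved, hits / n_relevant

def arp_arr(all_ranked_labels, query_labels, n_retrieved, n_relevant=100):
    """Average Retrieval Precision and Average Retrieval Rate over all queries
    for a fixed number of retrieved images (Eqs. 5 and 6)."""
    ps, rs = [], []
    for ranked, q in zip(all_ranked_labels, query_labels):
        p, r = precision_recall(ranked, q, n_retrieved, n_relevant)
        ps.append(p)
        rs.append(r)
    return 100.0 * np.mean(ps), 100.0 * np.mean(rs)
```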

3.1 Retrieval Accuracy on Corel-10K Dataset

In experiment #1, the Corel-10K dataset is used to show the effectiveness of the proposed network for image retrieval. The Corel-10K dataset contains 100 different categories of images, such as animals, sports, bikes, etc., and each category contains 100 different images. Sample images are shown in Fig. 2. We compare the retrieval accuracy of the proposed approach with existing deep networks as well as existing non-learning approaches [15,20,23,28,30,31] in terms of ARP and ARR in Tables 1 and 2, respectively. The comparison of the retrieval accuracy of the proposed network in terms of ARP and ARR with the other existing methods on the Corel-10K dataset witnesses the effectiveness of the proposed approach for content-based image retrieval. From Tables 1 and 2 we can observe that the proposed network outperforms the other existing methods by a large margin.

Table 1. Retrieval accuracy comparison in terms of ARP on COREL-10K database (top images considered for retrieval: 10–100). Note: PM: Proposed Method

Method      10     20     30     40     50     60     70     80     90     100
LBP         37.62  29.30  25.15  22.43  20.45  18.91  17.68  16.64  15.75  14.97
LTP         42.96  33.58  28.70  25.51  23.13  21.32  19.82  18.58  17.53  16.63
SS-3D-LTP   46.25  37.14  32.44  29.26  26.84  24.90  23.29  21.90  20.70  19.64
3D-LTrP     52.46  42.11  36.59  32.88  30.08  27.90  26.07  24.49  23.12  21.89
AlexNet     73.65  65.92  60.78  56.79  53.39  50.38  47.65  45.15  42.84  40.64
VGG16       78.76  71.69  66.92  62.93  59.47  56.36  53.49  50.77  48.19  45.69
PM          81.13  74.60  69.94  66.13  62.68  59.53  56.57  53.76  51.07  48.45

Table 2. Retrieval accuracy comparison in terms of ARR on COREL-10K database (top images considered for retrieval: 10–100). Note: PM: Proposed Method

Method      10    20     30     40     50     60     70     80     90     100
LBP         3.76  5.86   7.54   8.97   10.23  11.35  12.37  13.31  14.17  14.97
LTP         4.30  6.72   8.61   10.20  11.56  12.79  13.87  14.86  15.78  16.63
SS-3D-LTP   4.63  7.43   9.73   11.70  13.42  14.94  16.30  17.52  18.63  19.64
3D-LTrP     5.25  8.42   10.98  13.15  15.04  16.74  18.25  19.59  20.81  21.90
AlexNet     7.36  13.18  18.23  22.72  26.70  30.23  33.35  36.12  38.56  40.64
VGG16       7.88  14.34  20.08  25.17  29.73  33.82  37.44  40.62  43.37  45.69
PM          8.11  14.92  20.98  26.45  31.34  35.72  39.60  43.01  45.96  48.45

3.2 Retrieval Accuracy on GHIM-10K Dataset

This experiment has been carried out on the GHIM-10K database [16]. This database comprises 20 classes with 500 images per class. Performance analysis of the proposed approach is done by measuring the retrieval accuracy in terms of ARP and ARR. Tables 3 and 4 give the comparison of the retrieval accuracy in terms of ARP and ARR of the proposed approach with other existing state-of-the-art methods, and they show that the proposed approach outperforms existing learning-based CBIR approaches.


Table 3. Retrieval accuracy comparison in terms of ARP on GHIM-10K database (top images considered for retrieval: 10–100). Note: PM: Proposed Method

Method    10     20     30     40     50     60     70     80     90     100
AlexNet   93.16  90.82  89.28  88.10  87.03  86.07  85.20  84.41  83.64  82.89
VGG16     95.15  93.42  92.22  91.22  90.34  89.55  88.83  88.18  87.54  86.91
PM        96.48  95.05  94.06  93.24  92.49  91.83  91.24  90.68  90.13  89.58

Table 4. Retrieval accuracy comparison in terms of ARR on GHIM-10K database (top images considered for retrieval: 10–100). Note: PM: Proposed Method

Method    10    20     30     40     50     60     70     80     90     100
AlexNet   9.32  18.16  26.78  35.24  43.52  51.64  59.64  67.53  75.28  82.89
VGG16     9.51  18.68  27.66  36.49  45.17  53.73  62.18  70.54  78.78  86.91
PM        9.65  19.01  28.22  37.30  46.25  55.10  63.87  72.54  81.12  89.58

4

Conclusion

In this paper, we have discussed the usefulness of robust feature extraction for CBIR. We propose the use of residual learning to extract features from the input query image, followed by similarity measurement for the index matching and retrieval task. A performance increase of the minimum distance classifier for CBIR is noted when the proposed feature learning approach is used. The performance of the proposed approach is compared with other existing state-of-the-art methods for CBIR on two benchmark databases used by existing approaches, with precision and recall as the performance evaluation parameters. The comparison shows that the proposed method outperforms the other existing approaches for content-based image retrieval. In future, a similar approach can be extended to content-based medical image retrieval.

References 1. Ahmadian, A., Mostafa, A.: An efficient texture classification algorithm using Gabor wavelet. In: Proceedings of the 25th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2003, vol. 1, pp. 930–933. IEEE (2003) 2. Biradar, K., Kesana, V., Rakhonde, K., Sahu, A., Gonde, A., Murala, S.: Local Gaussian difference extrema pattern: a new feature extractor for face recognition. In: 2017 Fourth International Conference on Image Information Processing (ICIIP), pp. 1–5. IEEE (2017)


3. Biradar, K., Gupta, A., Mandal, M., Kumar Vipparthi, S.: Challenges in timestamp aware anomaly detection in traffic videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 13–20 (2019) 4. Dudhane, A., Shingadkar, G., Sanghavi, P., Jankharia, B., Talbar, S.: Interstitial lung disease classification using feed forward neural networks. In: International Conference on Communication and Signal Processing 2016 (ICCASP 2016), Atlantis Press (2016) 5. Dudhane, A., Murala, S.: C2 MSNet: a novel approach for single image haze removal. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1397–1404. IEEE (2018) 6. Dudhane, A., Singh Aulakh, H., Murala, S.: RI-GAN: an end-to-end network for single image haze removal. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2019) 7. Dudhane, A.A., Talbar, S.N.: Multi-scale directional mask pattern for medical image classification and retrieval. In: Chaudhuri, B.B., Kankanhalli, M.S., Raman, B. (eds.) Proceedings of 2nd International Conference on Computer Vision & Image Processing. AISC, vol. 703, pp. 345–357. Springer, Singapore (2018). https://doi. org/10.1007/978-981-10-7895-8 27 8. Galshetwar, G.M., Patil, P.W., Gonde, A.B., Waghmare, L.M., Maheshwari, R.: Local directional gradient based feature learning for image retrieval. In: 2018 IEEE 13th International Conference on Industrial and Information Systems (ICIIS), pp. 113–118. IEEE (2018) 9. Ghadage, S., Pawar, M.: Integration of local features for brain tumour segmentation. In: 2018 IEEE 13th International Conference on Industrial and Information Systems (ICIIS), pp. 173–178. IEEE (2018) 10. Gonde, A.B., Patil, P.W., Galshetwar, G.M., Waghmare, L.M.: Volumetric local directional triplet patterns for biomedical image retrieval. In: 2017 Fourth International Conference on Image Information Processing (ICIIP), pp. 1–6. IEEE (2017) 11. Hambarde, P., Talbar, S.N., Sable, N., Mahajan, A., Chavan, S.S., Thakur, M.: Radiomics for peripheral zone and intra-prostatic urethra segmentation in MR imaging. Biomed. Signal Process. Control 51, 19–29 (2019) 12. Han, J., Ma, K.K.: Rotation-invariant and scale-invariant Gabor features for texture image retrieval. Image Vis. Comput. 25(9), 1474–1481 (2007) 13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 14. Kokare, M., Biswas, P.K., Chatterji, B.N.: Texture image retrieval using rotated wavelet filters. Pattern Recogn. Lett. 28(10), 1240–1249 (2007) 15. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012) 16. Li, J., Wang, J.Z.: Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Trans. Pattern Anal. Mach. Intell. 25(9), 1075–1088 (2003) 17. Liu, G.H., Yang, J.Y., Li, Z.: Content-based image retrieval using computational visual attention model. Pattern Recogn. 48(8), 2554–2566 (2015) 18. Manjunath, B.S., Ma, W.Y.: Texture features for browsing and retrieval of image data. IEEE Trans. Pattern Anal. Mach. Intell. 18(8), 837–842 (1996) 19. Murala, S., Maheshwari, R., Balasubramanian, R.: Directional local extrema patterns: a new descriptor for content based image retrieval. Int. J. Multimedia Inf. Retrieval 1(3), 191–203 (2012)


20. Murala, S., Maheshwari, R., Balasubramanian, R.: Local tetra patterns: a new feature descriptor for content-based image retrieval. IEEE Trans. Image Process. 21(5), 2874–2886 (2012) 21. Murala, S., Wu, Q.: Peak valley edge patterns: a new descriptor for biomedical image indexing and retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 444–449 (2013) 22. Murala, S., Wu, Q.J.: Local mesh patterns versus local binary patterns: biomedical image indexing and retrieval. IEEE J. Biomed. Health Inform. 18(3), 929–938 (2014) 23. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 971–987 (2002) 24. Patil, P., Murala, S., Dhall, A., Chaudhary, S.: MsEDNet: multi-scale deep saliency learning for moving object detection. In: 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 1670–1675. IEEE (2018) 25. Patil, P.W., Murala, S.: MSFgNet: a novel compact end-to-end deep network for moving object detection. IEEE Trans. Intell. Transp. Syst. 20(11), 4066–4077 (2018) 26. Pawar, M.M., Talbar, S.N., Dudhane, A.: Local binary patterns descriptor based on sparse curvelet coefficients for false-positive reduction in mammograms. J. Healthc. Eng. 2018, 1–16 (2018) 27. Shaha, M., Pawar, M.: Transfer learning for image classification. In: 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), pp. 656–660. IEEE (2018) 28. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014) 29. Subrahmanyam, M., Maheshwari, R., Balasubramanian, R.: Local maximum edge binary patterns: a new descriptor for image retrieval and object tracking. Signal Process. 92(6), 1467–1479 (2012) 30. Subrahmanyam, M., Wu, Q.J.: Spherical symmetric 3D local ternary patterns for natural, texture and biomedical image indexing and retrieval. Neurocomputing 149, 1502–1514 (2015) 31. Tan, X., Triggs, B.: Enhanced local texture feature sets for face recognition under difficult lighting conditions. IEEE Trans. Image Process. 19(6), 1635–1650 (2010) 32. Zhang, B., Gao, Y., Zhao, S., Liu, J.: Local derivative pattern versus local binary pattern: face recognition with high-order local pattern descriptor. IEEE Trans. Image Process. 19(2), 533–544 (2010)

Instance Based Learning

Two Efficient Image Bag Generators for Multi-instance Multi-label Learning P. K. Bhagat1(B) , Prakash Choudhary2 , and Kh Manglem Singh1 1

Department of Computer Science and Engineering, National Institute of Technology Manipur, Imphal, India [email protected] 2 Department of Computer Science and Engineering, National Institute of Technology Hamirpur, Hamirpur, HP, India [email protected]

Abstract. Image annotation plays a vital role in the effective organization and retrieval of large numbers of digital images. Multi-instance multi-label (MIML) learning can deal with complicated objects by resolving the ambiguity in both the input and output space. The image bag generator is a key component of MIML algorithms. A bag generator takes an image as its input and generates a set of instances for that image. These instances are the various subparts of the original image and collectively describe the image in totality. This paper proposes two new bag generators which can generate an instance for every possible object present in the image. The proposed bag generators effectively utilize the correlations among pixels to generate instances. We demonstrate that the proposed bag generators outperform the state-of-the-art bag generator methods.

Keywords: Bag generator · Image bag generator · MIML · MIL · MLL · Automatic image annotation

1 Introduction

In this age of the digital world, the number of digital images is increasing very fast, in fact, faster than the expectation. This poses a challenge for the coherent organization of images to provide dynamic retrieval of images. Image annotation is a process of assigning labels (keywords or tags) to the images which represent contents of the image. Image annotation helps in a coherent organization, effective retrieval, categorization, object detection, object recognition, auto-illustration, etc. of the image. The assignment of keywords can be performed automatically, called automatic image annotation (AIA), or manually. The manual assignment of keywords is subjective (may vary from person to person), time-consuming, etc., hence we strive for automatic keyword assignment [1]. Multi-instance multi-label (MIML) learning is a supervised annotation process, where each image is represented as a bag of instances, and each image c Springer Nature Singapore Pte Ltd. 2020  N. Nain et al. (Eds.): CVIP 2019, CCIS 1147, pp. 407–418, 2020. https://doi.org/10.1007/978-981-15-4015-8_36


contains multiple labels simultaneously. A bag is labeled as positive if any of the instances in the bag is positive and it is labeled negative only if all the instances in the bag are negative. MIML learning is a combination of multi-instanced learning (MIL) and multiple label learning (MLL) [1]. In MIL, originally proposed in [6] for drug activity prediction, each image is represented as a bag of instances. Each bag contains a set of instances, essentially the various subparts of the image, where each instance describes the contents of the image. In MIL, a label is associated with the bag rather than the individual instance. The idea is to label a bag negative if and only if none of the instances in the bag is positive otherwise mark the bag as positive. A diverse density (DD) based method for MIL problem [13,14] considers instances of different bags are independent and looks for the high diverse density point. The high diverse density is the point where many different positive bags have instances, and negative instances are farthest. A lazy learning approach using KNN with Hausdorff distance is proposed in [20] to deal with MIL problem. The author proposed two algorithms for MIL (Bayesian-knn and Citation-knn) and claimed to have achieved competitive performance over musk dataset. A boosting based MIL algorithm [23] considers all instances in the bag contributes equally and independently to a bag’s label. The proposed MIBoosting algorithm inspired many MIML algorithms such as MIMLBoost in [26]. The MLL [2,15,19] is all about studying the ambiguity in the output space. MLL can be considered as the extension of the multiclass problem wherein the former; an image may belong to multiple classes simultaneously, i.e., the labels are not mutually exclusive whereas in the latter case, one image belongs to one class only, i.e., labels are mutually exclusive. In [19], various boosting based MLL algorithms are proposed which can be seen as the basis of various MIML learning algorithms [26]. Later, MLL was extended to deal with missing labels [22] by analyzing the labels consistency and label smoothness. The MIML is essentially a supervised learning problem with ambiguous training set where each object in the training set has multiple instances and belongs to multiple classes (Fig. 1). Inspired from [19,23], two multi-instance multi-label algorithms are presented in [26,28]. The authors proposed MIMLBoost using MIL as a bridge between traditional supervised learning and MIML framework. The authors also proposed MIMLSVM using MLL as a bridge between traditional supervised learning and MIML framework. Later various MIL algorithms are extended for MIML such as Gaussian process based MIML [8] MIMLfast [9], multi-modal MIML LDA [16], etc. A key part of the MIML system is the bag generator. A bag generator generates a set of instances from the input image. Each instance of the image is a possible description of what the image is about [14]. In an ideal condition, the bag generator should generate one instance for each object which is a difficult task. If the bag generator is very effective, then even a simple learning algorithm would produce a very good result. Otherwise, we need a sophisticated learning algorithm to obtained good results. Hence, the sophistication of bag generator

Two Efficient Image Bag Generators for Multi-instance Multi-label Learning

409

Fig. 1. The different types of learning framework. (Source [26]).

plays a very important role in the overall output of the MIML learning based annotation system. In this paper, we are proposing two new subregion based bag generator methods, which do not require any sophisticated learning algorithm. The proposed methods generate instances as independent instances as well as based on the correlation among instances. The first bag generator is called BlobBag and is based on the block wise division of the image whereas second bag generator is called SpiderBag as it is inspired by spider net. We have applied the proposed bag generators with a simple and basic MIML learning method and obtained excellent result for AIA.

2

Related Work

To solve the drug activity prediction problem, [6] considered each low-energy shape of the molecule as an instance. This was the first example of instance generator. Later, a subregion based bag generator proposed in [14] subsamples the image in the 8 × 8 matrix to generate instance. The authors proposed five bag generators and three of them Row, single blob without neighbor (SB) and single blob with neighbor (SBN) produced excellent results [21,26]. A segmentation based bag generator proposed in [11] utilizes the JSEG segmentation [5] to generate instances from an image. The authors also proposed a human biologically motivated bottom-up saliency-based visual attention computation model based image bag generator called attention bag. The authors concluded that the performance of JSEG-bag is better than the attention bag. In JSEG-bag, first an image is segmented, and then features are extracted from segmented regions

410

P. K. Bhagat et al.

to generate instances. In k-meansSeg [25], Blobworld [3] and WevSeg [24], first features are extracted and those features are segmented to generate instance. The k-meansSeg [25] is based on k-means segmentation. First, an image is divided into 4 × 4 blobs, and from each blob, six features (three texture and three wavelet features) are extracted. Then, k-means segmentation is applied on feature vector to segment the image and obtain the instances. The Blobworld [3] uses mixture of Gaussian with expectation maximization (EM) to obtain the regions from the image based on the six features ( three texture and three color features). WavSeg is a wavelet feature based image bag generator which uses simultaneous partition and class parameter estimation (SPCPE) [24] segmentation algorithm to partition the image. The local binary pattern (LBP) [18] and scale invariant feature transform (SIFT) [12] can also be used as an image bag generator. Extensive experiments with a different set of image bag generators and MIL algorithms can be seen in [21]. The authors categorized the image bag generator according to Fig. 2 and concluded that the non-segmentation based bag generator always performs better than segmentation based bag generator. The SB, SBN, and local binary pattern (LBP) achieved the best classification accuracy. The performance of Row, SB, and SBN are better than LBP for scene classification while LBP performs better than Row for object classification.

3

Proposed Method

The instances of a bag can be considered in one of the two ways: (a) all the instances are independent and contribute independently for labeling the bag as positive or negative. (b) All the instances in the bag are interdependent and contribute collectively for labeling the bag as positive or negative. For scene type images, the performance of bag generators having color features as their instance features value (Row, SB, SBN) is better than the bag generators having texture features as their instance values (LBP) [21]. However, for object type images, texture based bag generator performed better [21]. The sampling of dense regions is better than the sparse interest points for image classification [17] hence, the performance of SB, SBN and LPB are satisfying [21] as they sample dense regions to generate instances. However, for scene type image, color features are dominant and provide robust discriminative information while texture features are dominant in object type images and may not be effectively useful for scene type images [21]. The SB does not take correlations with neighbors into consideration hence each instance can be considered independent. The problem with SBN is that it usually produces much more instances than SB which will cause a much larger computational cost. Also, SBN does not consider corners of the image for instance generation. Hence, when the patch size is small, the blind zones in SBN account for a large proportion of the image and when the patch size increases objects located in the corners of the image is not detected [21].

Two Efficient Image Bag Generators for Multi-instance Multi-label Learning

411

Row [14]

Non-segmentation

SB [14]

SBN [14]

k-meansSeg [25]

Blobworld [3] Bag generator

Segmentation WavSeg [24]

JSEG-bag [11]

LBP [18]

Miscellaneous

SIFT [12]

ImaBag [27]

Fig. 2. Categorization of the image bag generator approach.

3.1

BlobBag

Considering above-mentioned points in mind, we are proposing a new image bag generator which addresses all the problems mentioned above. Our proposed method divides the original image into 3 × 3 equal size blobs as shown in Fig. 3a. Before the division of the image, each image is sampled into 192 × 192 which results in each blob containing 4096 pixels. We assume, most of the pixels in a blob share similar property and focusing on small patch with correlations among pixels produce strong discriminative features. For the objects exceeding one blob, a strong relationship among the blobs can be obtained by making the correlation with the neighboring blobs. For each blob, we find the correlation with its three immediate neighbors. Thus, four instances are generated for four corner blobs each correlating with their three immediate neighbors as shown in Fig. 3b. For each noncorner blobs except for central blobs, we generate four instances correlating with their three non-diagonal blobs as shown in Fig. 3c. Likewise, we extract four instances for central pixels according to Fig. 3d. In this way, we are able to cover every part of the image with only 12 instances. This will greatly reduce the computational complexity of the MIML algorithms.

412

P. K. Bhagat et al.

Fig. 3. The proposed BlobBag bag generator. (a) sub division in image into 3 × 3 blob. (b) four correlated corner instances. (c) four noncorner correlated instances without certral blob. (d) four correlated instance for central pixel.

Texture plays an important role in the human visual perception system. The proposed method uses GLCM [7] to extract texture features from each blob separately. Before feature extraction, each blob is quantized in 16 grey levels and a cooccurrence matrix is calculated for each of the four orientations (0◦ , 45◦ , 90◦ and 135◦ ) with a distance one (d = 1). Then, the five features, i.e., energy, homogeneity, contrast, correlation, dissimilarity, are extracted from each cooccurrence matrix resulting in a total of 20 features for each blob [10]. Thus, the proposed method generates a bag of 12 instances for each image where an 80-dimensional feature vector describes each instance. The first 20 attributes of the vector represent energy, homogeneity, contrast, correlation, dissimilarity values for the four cooccurrence matrices of the blob. The next 60 attributes correspond to the difference in the energy, homogeneity, contrast, correlation, dissimilarity values for the four cooccurrence matrices between the blob and its three immediate neighboring blobs. This is done to establish the correlation among the blobs. If any object extends beyond one blob, it is spread in the neighboring blobs. Hence there should be a method to detect the presence of the object spreading beyond one blob. The difference in the feature values of a blob with neighboring blobs will be near to zero if a similar pattern exists in the neighboring blobs and in this way a correlation among the blobs can be established. The four corner instances (inst1 to inst4 ) are given Eqs. (1–4). Likewise the four noncorner instances except for central blob instance (inst5 to inst8 ) are given Eqs. (5–8). Finally, the four instances for central blob (inst9 to inst12 ) are given Eqs. (9–12). The extracted features from a blob, say m1, is represented by f(m1). inst1 = {f (m1), f (m1) − f (m4), f (m1) − f (m5), f (m1) − f (m2)}

(1)

inst2 = {f (m3), f (m3) − f (m2), f (m3) − f (m5), f (m3) − f (m6)}

(2)

Two Efficient Image Bag Generators for Multi-instance Multi-label Learning

3.2

413

inst3 = {f (m9), f (m9) − f (m6), f (m9) − f (m5), f (m9) − f (m8)}

(3)

inst4 = {f (m7), f (m7) − f (m8), f (m7) − f (m5), f (m7) − f (m4)}

(4)

inst5 = {f (m2), f (m2) − f (m1), f (m2) − f (m5), f (m2) − f (m3)}

(5)

inst6 = {f (m6), f (m6) − f (m3), f (m6) − f (m5), f (m6) − f (m9)}

(6)

inst7 = {f (m8), f (m8) − f (m9), f (m8) − f (m5), f (m8) − f (m7)}

(7)

inst8 = {f (m4), f (m4) − f (m7), f (m4) − f (m5), f (m4) − f (m1)}

(8)

inst9 = {f (m5), f (m5) − f (m1), f (m5) − f (m2), f (m5) − f (m3)}

(9)

inst10 = {f (m5), f (m5) − f (m3), f (m5) − f (m6), f (m5) − f (m9)}

(10)

inst11 = {f (m5), f (m5) − f (m9), f (m5) − f (m8), f (m5) − f (m7)}

(11)

inst12 = {f (m5), f (m5) − f (m7), f (m5) − f (m4), f (m5) − f (m1)}

(12)

SpiderBag

Our second bag generator method is inspired by the idea of spider net. The central pixel of the image is considered as the spider seed, and two rounds of the net is constructed to cover the whole image. If many rounds of the net are constructed, then there will be various sub-parts of the image, hence a large number of instances, i.e., every pixel may be considered as an instance. This will ultimately increase the computational complexity of the MIML algorithm. On the other hand, if only a few rounds of the net is constructed, there will be fewer instances, i.e., a large part of the image may be considered a single instance. This may result in the generation of an incompetent bag and may require a very sophisticated MIML algorithm to obtain a satisfactory result. Considering the above point in mind, the proposed method generates 16 instances per bag which is a moderate number of instances per bag. Due to the structure of the spider net, the corners of the image is not covered hence we have slightly changed the structure of the net. Thus, the proposed method is spider net inspired not the actual spider net. Once the net is constructed, one instance is generated for each blob of the net. The proposed method represents each image as a bag of 16 instances. The correlation among blobs is taken into consideration while generating instances. The construction of spider net and instance generation is shown in Fig. 4. The correlation among instances are shown in Figs. 4b and 4c. For example, instance one (Inst.1) of the outer block has correlation with two outer blobs (Inst.3 and Inst.15) and one inner blob (Inst.2) as shown in Fig. 4b. Likewise, Fig. 4c shows instance two (Inst.2) of the inner block is correlated with two inner blobs (Inst.4 and Inst.16) and one outer blob (Inst.1) . Color-based features of the image are one of the most powerful representations of the contents of the image. An image is composed of RGB colors, hence, extracting RGB based features has been a very active area of image content representation [1]. Before the construction of the net, each image is subsampled to a 24 × 24. We have experimented with various sizes of subsampled images,

414

P. K. Bhagat et al.

Fig. 4. The proposed SpiderBag bag generator. (a) sub division in image by constructing two rounds of net. (b) Eight correlated instances in outer block of the net. (c) Eight correlated instances in inner block of the net.

and 24 × 24 subsampled images produced the best result. The proposed method generates 16 instances for each image where each instance is described by a 48-dimensional vector. The first 12 attributes of the vector represent mean, variance, standard deviation, and median of the red, green and blue color channel of the blob. The next 36 attributes correspond to the difference in the mean, variance, standard deviation and median color value between the blob and its three immediate neighboring blobs.

4

Evaluation

The dataset consists of 2273 images belonging to six classes and is a subset of NUS-WIDE dataset [4]. All the images belong to two to six classes simultaneously. Out of 2273 images, 1500 images are used for training, and 723 is used for testing. The purposed bag generators are used with MIMLSVM [26] algorithm which is a simple MIML algorithm. For the performance evaluation, five multi-label evaluation metrics are used. A detailed description of the evaluation metrics can be found in [19]. The results of the experiments are shown in Tables 1, 2 and 3, where ‘↑’ indicates ‘the bigger the better’ while ‘↓’ indicates ‘the smaller the better’. Tables 1, 2, and 3 show average performance of the methods over five runs. The best obtained results are bolded. According to [21], the performance of SBN is best among other bag generators. Hence, SBN is selected as a base method for comparative performance analysis with the two proposed bag generators. Also, Zhou et al. [26] used SBN over 2000 image dataset having five classes with MIMLSVM method. But in that dataset, only 22% images belongs to multiple class simultaneously. However, in our dataset, all the images belong to multiple class simultaneously.From the Tables 1, 2 and 3, it can be comprehended that SpiderBag achieved better performance than the SBN and BlobBag while BlobBag achieved comparable results with SBN. The proposed bag generators possess the quality of both SB and SBN bag generators without generating a large number of instances, but unlike SBN they do not overlap pixels to generate instance. Unlike SB, the proposed bag generators take neighbors into account as in SBN. Hence, it has all the benefits of SB and SBN while removing their demerits. Also, the dominance of color features

Two Efficient Image Bag Generators for Multi-instance Multi-label Learning

415

Table 1. The performance of SBN with different Gamma values used in Gaussian kernel. Gamma value

Evaluation metric Hamming loss↓

One-error↓ Coverage↓ Ranking loss↓

2−5

0.26

0.31

1.91

0.26

0.80

2−3

0.21

0.28

1.21

0.19

0.82

2−1

0.25

0.34

1.9

0.25

0.81

21

0.32

0.38

2.3

0.21

0.80

0.32

0.39

2.5

0.28

0.80

2

2

Average precision↑

Table 2. The performance of BlobBag with different Gamma values used in Gaussian kernel. Gamma Evaluation metric value Hamming One-error ↓ Coverage ↓ Ranking loss↓ loss↓ 2−5

Average precision ↑

0.25

0.31

1.42

0.26

0.81

2

0.21

0.26

1.18

0.16

0.83

2−1

0.24

0.30

1.26

0.25

0.82

21

0.30

0.32

1.68

0.22

0.81

0.32

0.38

2.22

0.24

0.80

−3

2

2

Table 3. The performance of SpiderBag with different Gamma values used in Gaussian kernel. Gamma value

Evaluation metric Hamming loss↓

One-error↓ Coverage↓ Ranking loss↓

2−5

0.21

0.25

1.68

0.16

0.82

2−3

0.16

0.21

1.32

0.13

0.85

2−1

0.20

0.29

1.45

0.14

0.83

21

0.26

0.29

2.96

0.19

0.82

0.29

0.31

2.2

0.22

0.81

2

2

Average precision↑

over texture features or vice versa cannot be ascertained as texture based BlobBag achieved competitive performance compared to color based SBN. However, color based SpiderBag achieved better performance than BlobBag and SBN.

416

5

P. K. Bhagat et al.

Conclusion

In this paper, we have proposed two new bag generators that generate a fixed number of instances for each image. The proposed bag generators remove the demerits of SB and SBN bag generators while effectively utilizing the correlation of pixels. The proposed bag generator considers every pixel of the image without overlapping to generate instances. It will remove the burden of MIML algorithms as it generates very effective instances of each image. As both color and texture features can be extracted from both BlobBag and SpiderBag, an investigation on the effectiveness of color features over texture features and vice versa will pave the way for the preference of feature type.

References 1. Bhagat, P.K., Choudhary, P.: Image annotation: then and now. Image Vis. Comput. 80, 1–23 (2018). https://doi.org/10.1016/j.imavis.2018.09.017. http://www. sciencedirect.com/science/article/pii/S0262885618301628 2. Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification. Pattern Recognit. 37(9), 1757–1771 (2004). https://doi.org/10.1016/j. patcog.2004.03.009 3. Carson, C., Belongie, S., Greenspan, H., Malik, J.: Blobworld: image segmentation using expectation-maximization and its application to image querying. IEEE Trans. Pattern Anal. Mach. Intell. 24(8), 1026–1038 (2002). https://doi.org/10. 1109/TPAMI.2002.1023800 4. Chua, T.S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y.T.: NUS-WIDE: a realworld web image database from national university of Singapore. In: Proceedings of ACM Conference on Image and Video Retrieval (CIVR 2009), Santorini, Greece, 8–10 July 2009 (2009) 5. Deng, Y., Manjunath, B.S.: Unsupervised segmentation of color-texture regions in images and video. IEEE Trans. Pattern Anal. Mach. Intell. 23(8), 800–810 (2001). https://doi.org/10.1109/34.946985 6. Dietterich, T.G., Lathrop, R.H., Lozano-P´erez, T.: Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 89(1–2), 31–71 (1997). https:// doi.org/10.1016/S0004-3702(96)00034-3. http://dx.doi.org/10.1016/S0004-3702(9 6)00034-3 7. Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural features for image classification. IEEE Trans. Syst. Man Cybern. SMC 3(6), 610–621 (1973). https://doi. org/10.1109/TSMC.1973.4309314 8. He, J., Gu, H., Wang, Z.: Bayesian multi-instance multi-label learning using Gaussian process prior. Mach. Learn. 88(1), 273–295 (2012). https://doi.org/10.1007/ s10994-012-5283-x 9. Huang, S., Zhou, Z.: Fast multi-instance multi-label learning. CoRR abs/1310.2049 (2013) 10. Lerski, R., Straughan, K., Schad, L., Boyce, D., Bluml, S., Zuna, I.: VIII. MR image texture analysis-an approach to tissue characterization. J. Magn. Reson. Imaging 11(6), 873–887 (1993). https://doi.org/10.1016/0730-725X(93)90205-R 11. Liu, W., Xu, W., Li, L., Li, G.: Two new bag generators with multi-instance learning for image retrieval. In: 3rd IEEE Conference on Industrial Electronics and Applications, June 2008, pp. 255–259 (2008). https://doi.org/10.1109/ICIEA. 2008.4582518

Two Efficient Image Bag Generators for Multi-instance Multi-label Learning

417

12. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91–110 (2004). https://doi.org/10.1023/B:VISI.0000029664. 99615.94 13. Maron, O., Lozano-P´erez, T.: A framework for multiple-instance learning. In: Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10, NIPS 1997, pp. 570–576. MIT Press, Cambridge (1998). http://dl. acm.org/citation.cfm?id=302528.302753 14. Maron, O., Ratan, A.L.: Multiple-instance learning for natural scene classification. In: Proceedings of the Fifteenth International Conference on Machine Learning, ICML 1998, pp. 341–349. Morgan Kaufmann Publishers Inc., San Francisco (1998) 15. Nasierding, G., Tsoumakas, G., Kouzani, A.Z.: Clustering based multi-label classification for image annotation and retrieval. In: IEEE International Conference on Systems, Man and Cybernetics, pp. 4514–4519, October 2009 16. Nguyen, C.T., Zhan, D.C., Zhou, Z.H.: Multi-modal image annotation with multiinstance multi-label LDA. In: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI 2013, pp. 1558–1564. AAAI Press (2013) 17. Nowak, E., Jurie, F., Triggs, B.: Sampling strategies for bag-of-features image classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 490–503. Springer, Heidelberg (2006). https://doi.org/10.1007/ 11744085 38 18. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 971–987 (2002). https://doi.org/10.1109/TPAMI.2002. 1017623 19. Schapire, R.E., Singer, Y.: BoosTexter: a boosting-based system for text categorization. Mach. Learn. 39(2), 135–168 (2000). https://doi.org/10.1023/A: 1007649029923 20. Wang, J., Zucker, J.D.: Solving the multiple-instance problem: a lazy learning approach. In: Proceedings of the Seventeenth International Conference on Machine Learning, ICML 2000, pp. 1119–1126. Morgan Kaufmann Publishers Inc., San Francisco (2000) 21. Wei, X.-S., Zhou, Z.-H.: An empirical study on image bag generators for multiinstance learning. Mach. Learn. 105(2), 155–198 (2016). https://doi.org/10.1007/ s10994-016-5560-1 22. Wu, B., Lyu, S., Hu, B.G., Ji, Q.: Multi-label learning with missing labels for image annotation and facial action unit recognition. Pattern Recognit. 48(7), 2279–2289 (2015). https://doi.org/10.1016/j.patcog.2015.01.022 23. Xu, X., Frank, E.: Logistic regression and boosting for labeled bags of instances. In: Dai, H., Srikant, R., Zhang, C. (eds.) PAKDD 2004. LNCS (LNAI), vol. 3056, pp. 272–281. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-247753 35 24. Zhang, C., Chen, S.C., Shyu, M.L.: Multiple object retrieval for image databases using multiple instance learning and relevance feedback. In: IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No. 04TH8763), vol. 2, pp. 775–778, June 2004. https://doi.org/10.1109/ICME.2004.1394315 25. Zhang, Q., Goldman, S.A., Yu, W., Fritts, J.: Content-based image retrieval using multiple-instance learning. In: Proceedings of the Nineteenth International Conference on Machine Learning, ICML 2002, pp. 682–689. Morgan Kaufmann Publishers Inc., San Francisco (2002). http://dl.acm.org/citation.cfm?id=645531.656002

418

P. K. Bhagat et al.

26. Zhou, Z.H, Zhang, M.L.: Multi-instance multi-label learning with application to scene classification. In: Sch¨ olkopf, B., Platt, J.C., Hoffman, T. (eds.) Advances in Neural Information Processing Systems, vol. 19, pp. 1609–1616. MIT Press (2007) 27. Zhou, Z.H., Zhang, M.L., Chen, K.J.: A novel bag generator for image database retrieval with multi-instance learning techniques. In: Proceedings of the 15th IEEE International Conference on Tools with Artificial Intelligence, pp. 565–569, November 2003. https://doi.org/10.1109/TAI.2003.1250242 28. Zhou, Z.H., Zhang, M.L., Huang, S.J., Li, Y.F.: Multi-instance multi-label learning. Artif. Intell. 176(1), 2291–2320 (2012). https://doi.org/10.1016/j.artint. 2011.10.002

Machine Learning

A Comparative Study of Big Mart Sales Prediction Gopal Behera(B) and Neeta Nain Malaviya National Institute of Technology Jaipur, Jaipur, India {2019rcp9002,nnain.cse}@mnit.ac.in

Abstract. Nowadays shopping malls and Big Marts keep the track of their sales data of each and every individual item for predicting future demand of the customer and update the inventory management as well. These data stores basically contain a large number of customer data and individual item attributes in a data warehouse. Further, anomalies and frequent patterns are detected by mining the data store from the data warehouse. The resultant data can be used for predicting future sales volume with the help of different machine learning techniques for the retailers like Big Mart. In this paper, we propose a predictive model using Xgboost technique for predicting the sales of a company like Big Mart and found that the model produces better performance as compared to existing models. A comparative analysis of the model with others in terms performance metrics is also explained in details.

Keywords: Machine learning Regression · Xgboost

1

· Sales forecasting · Random forest ·

Introduction

Day by day competition among different shopping malls as well as big marts is getting more serious and aggressive only due to the rapid growth of the global malls and on-line shopping. Every mall or mart is trying to provide personalized and short-time offers for attracting more customers depending upon the day, such that the volume of sales for each item can be predicted for inventory management of the organization, logistics and transport service, etc. Present machine learning algorithm are very sophisticated and provide techniques to predict or forecast the future demand of sales for an organization, which also helps in overcoming the cheap availability of computing and storage systems. In this paper, we are addressing the problem of big mart sales prediction or forecasting of an item on customer’s future demand in different big mart stores across various locations and products based on the previous record. Different machine learning algorithms like linear regression analysis, random forest, etc are used for prediction or forecasting of sales volume. As good sales are the life of every organization so the forecasting of sales plays an important role in any shopping complex. Always a better prediction is helpful, to develop as well as to c Springer Nature Singapore Pte Ltd. 2020  N. Nain et al. (Eds.): CVIP 2019, CCIS 1147, pp. 421–432, 2020. https://doi.org/10.1007/978-981-15-4015-8_37

422

G. Behera and N. Nain

enhance the strategies of business about the marketplace which is also helpful to improve the knowledge of marketplace. A standard sales prediction study can help in deeply analyzing the situations or the conditions previously occurred and then, the inference can be applied about customer acquisition, funds inadequacy and strengths before setting a budget and marketing plans for the upcoming year. In other words, sales prediction is based on the available resources from the past. In depth knowledge of past is required for enhancing and improving the likelihood of marketplace irrespective of any circumstances especially the external circumstance, which allows to prepare the upcoming needs for the business. Extensive research is going on in retailers domain for forecasting the future sales demand. The basic and foremost technique used in predicting sale is the statistical methods, which is also known as the traditional method, but these methods take much more time for predicting a sales also these methods could not handle non linear data so to over these problems in traditional methods machine learning techniques are deployed. Machine learning techniques can not only handle non-linear data but also huge data-set efficiently. To measure the performance of the models, Root Mean Square Error (RMSE) [15] and Mean Absolute Error (MAE) [4] are used as an evaluation metric as mentioned in the Eqs. 1 and 2 respectively. Here Both metrics are used as the parameter for accuracy measure of a continuous variable. n

M AE =

1 |xpredict − xactual | n i=1

  n 1  2 RM SE =  (|xpredict − xactual |) n i=1

(1)

(2)

where n: total number of error and |xpredict − xactual |: Absolute error. The remaining part of this article is arranged as following: Sect. 1 briefly describes introduction of sales prediction of Big Mart and also elaborate about the evaluation metric used in the model. Previous related work has been pointed in Sect. 2. The detailed description and analysis of proposed model is given in Sect. 3. Where as implementations and results are demonstrated in Sect. 4 and the paper concludes with a conclusion in the last section.

2

Related Work

Sales forecasting as well as analysis of sale forecasting has been conducted by many authors as summarized: The statistical and computational methods are studied in [2] also this paper elaborates the automated process of knowledge acquisition. Machine learning [6] is the process where a machine will learn from data in the form of statistically or computationally method and process knowledge acquisition from experiences. Various machine learning (ML) techniques with their applications in different sectors has been presented in [2]. Langley and Simon [7] pointed out most widely used data mining technique in the field

A Comparative Study of Big Mart Sales Prediction

423

of business is the Rule Induction (RI) technique as compared to other data mining techniques. Where as sale prediction of a pharmaceutical distribution company has been described in [10,12]. Also this paper focuses on two issues: (i) stock state should not undergo out of stock, and (ii) it avoids the customer dissatisfaction by predicting the sales that manages the stock level of medicines. Handling of footwear sale fluctuation in a period of time has been addressed in [5]. Also this paper focuses on using neural network for predicting of weekly retail sales, which decrease the uncertainty present in the short term planning of sales. Linear and non-linear [3] a comparative analysis model for sales forecasting is proposed for the retailing sector. Beheshti-Kashi and Samaneh [1] performed sales prediction in fashion market. A two level statistical method [11] is elaborated for forecasting the big mart sales prediction. Xia and Wong [17] proposed the differences between classical methods (based on mathematical and statistical models) and modern heuristic methods and also named exponential smoothing, regression, auto regressive integrated moving average (ARIMA), generalized auto regressive conditionally heteroskedastic (GARCH) methods. Most of these models are linear and are not able to deal with the asymmetric behavior in most real-world sales data [9]. Some of the challenging factors like lack of historical data, consumer-oriented markets face uncertain demands, and short life cycles of prediction methods results in inaccurate forecast.

3

Proposed System

For building a model to predict accurate results the dataset of Big Mart sales undergoes several sequence of steps as mentioned in Fig. 1 and in this work we propose a model using Xgboost technique. Every step plays a vital role for building the proposed model. In our model we have used 2013 Big mart dataset [13]. After preprocessing and filling missing values, we used ensemble classifier using Decision trees, Linear regression, Ridge regression, Random forest and Xgboost. Both MAE and RSME are used as accuracy metrics for predicting the sales in Big Mart. From the accuracy metrics it was found that the model will predict best using minimum MAE and RSME. The details of the proposed method is explained in the following section. 3.1

Dataset Description of Big Mart

In our work we have used 2013 Sales data of Big Mart as the dataset. Where the dataset consists of 12 attributes like Item Fat, Item Type, Item MRP, Outlet Type, Item Visibility, Item Weight, Outlet Identifier, Outlet Size, Outlet Establishment Year, Outlet Location Type, Item Identifier and Item Outlet Sales. Out of these attributes response variable is the Item Outlet Sales attribute and remaining attributes are used as the predictor variables. The data-set consists of 8523 products across different cities and locations. The data-set is also based on hypotheses of store level and product level. Where store level involves attributes like: city, population density, store

424

G. Behera and N. Nain

Fig. 1. Working procedure of proposed model.

capacity, location, etc and the product level hypotheses involves attributes like: brand, advertisement, promotional offer, etc. After considering all, a dataset is formed and finally the data-set was divided into two parts, training set and test set in the ratio 80:20. 3.2

Data Exploration

In this phase useful information about the data has been extracted from the dataset. That is trying to identify the information from hypotheses vs available data. Which shows that the attributes Outlet size and Item weight face the problem of missing values, also the minimum value of Item Visibility is zero which is not actually practically possible. Establishment year of Outlet varies from 1985 to 2009. These values may not be appropriate in this form. So, we need to convert them into how old a particular outlet is. There are 1559 unique products, as well as 10 unique outlets, present in the dataset. The attribute Item type contains 16 unique values. Where as two types of Item Fat Content are there but some of them are misspelled as regular instead of ‘Regular’ and low fat’, ‘LF’ instead of ‘Low Fat’. From Fig. 2. It was found that the response variable i.e. Item Outlet Sales was positively skewed. So, to remove the skewness of response variable a log operation was performed on Item Outlet Sales. 3.3

Data Cleaning

It was observed from the previous section that the attributes Outlet Size and Item Weight has missing values. In our work in case of Outlet Size missing value we replace it by the mode of that attribute and for the Item Weight missing values we replace by mean of that particular attribute. The missing

A Comparative Study of Big Mart Sales Prediction

425

Fig. 2. Univariate distribution of target variable Item outlet sales. The Target variable is positively skewed towards the higher sales.

attributes are numerical where the replacement by mean and mode diminishes the correlation among imputed attributes. For our model we are assuming that there is no relationship between the measured attribute and imputed attribute. 3.4

Feature Engineering

Some nuances were observed in the data-set during data exploration phase. So this phase is used in resolving all nuances found from the dataset and make them ready for building the appropriate model. During this phase it was noticed that the Item visibility attribute had a zero value, practically which has no sense. So the mean value item visibility of that product will be used for zero values attribute. This makes all products likely to sell. All categorical attributes discrepancies are resolved by modifying all categorical attributes into appropriate ones. In some cases, it was noticed that non-consumables and fat content property are not specified. To avoid this we create a third category of Item fat content i.e. “none”. In the Item Identifier attribute, it was found that the unique ID starts with either DR or FD or NC. So, we create a new attribute ‘Item Type New’ with three categories like Foods, Drinks and Non-consumables. Finally, for determining how old a particular outlet is, we add an additional attribute ‘Year’ to the dataset. 3.5

Model Building

After completing the previous phases, the dataset is now ready to build proposed model. Once the model is build it is used as predictive model to forecast sales of Big Mart. In our work, we propose a model using Xgboost algorithm and compare it with other machine learning techniques like Linear regression, Ridge regression [14], Decision tree [8, 16] etc. Decision Tree: A decision tree classification is used in binary classification problem and it uses entropy [8] and information gain [16] as metric and is defined in Eqs. 3 and 4 respectively for classifying an attribute which picks the highest information gain attribute to split the data set.

426

G. Behera and N. Nain

H(S) = −



p(c) log p(c)

(3)

c∈C

where H(S): Entropy, C: Class Label, P: Probability of class c.  Infromation Gain(S, A) = H(S) − p(t)H(t)

(4)

t∈T

where S: Set of attribute or dataset, H(S): Entropy of set S, T : Subset created from splitting of S by attribute A. p(t): Proportion of the number of elements in t to number of element in the set S. H(t): Entropy of subset t. The decision tree algorithm is depicted in Algorithm 1.

Require: Set of features d and set of training instances D 1: if all the instances in D have the same target label C then 2: Return a decision tree consisting of leaf node with label level C end else if d is empty then 4: Return a decision tree of leaf node with label of the majority target level in D end 5: else if D is empty then 6: Return a decision tree of leaf node with label of the majority target level of the immediate parent node end 7: else 8: d[best] ← arg max IG(d, D) where d ∈ D 9: make a new node, Noded[best] 10: partition D using d[best] 11: remove d[best] from d 12: for each partition Di of D do 13: grow a branch from Noded[best] to the decision tree created by rerunning ID3 with D=Di end end Algorithm 1: ID3 algorithm Linear Regression: A model which create a linear relationship between the dependent variable and one or more independent variable, mathematically linear regression is defined in Eq. 5  y= wT x (5) where y is dependent variable and x are independent variables or attributes. In linear regression we find the value of optimal hyperplane w which corresponds to the best fitting line (trend) with minimum error. The loss function for linear regression is estimated in terms of RMSE and MAE as mentioned in the Eqs. 1 and 2.

A Comparative Study of Big Mart Sales Prediction

427

Ridge Regression: The cost function for ridge regression is defined in Eq. 6.   2 2 min |(Y − X(θ)|) + λ θ (6) here λ known as the penalty term as denoted by α parameter in the ridge function. So the penalty term is controlled by changing the values of α, higher the values of α bigger is the penalty. Figure 3 shows Linear Regression, Ridge Regression, Decision Tree and proposed model i.e. Xgboost. Xgboost (Extreme Gradient Boosting) is a modified version of Gradient Boosting Machines (GBM) which improves the performance upon the GBM framework by optimizing the system using a differentiable loss function as defined in Eq. 7. n   l(yi , yˆi ) + kΩ(fk ), fk ∈ F (7) i=1

k

where yˆi : is the predicted value, yi : is the actual value and F is the set of function containing the tree, l(yi , yˆi ) is the loss function. This enhances the GBM algorithm so that it can work with any differentiable loss function. The GBM algorithm is illustrated in Algorithm 2. Step 1: Initialize model with a constant value: F0 = arg min

n 

L(yi , γ)

i=0

Step 2: for m= 1 to M : do a. Compute pseudo residuals:  ∂L(yi F (xi )) rim = − ∂F (xi ) F (x)=Fm−1 (x) f or all i = 1, 2...n b. Fit a Base learner hm (x) to pseudo residuals that is train the learner using training set. c. Compute γm γm = arg min γ

n 

(L(yi , Fm−1 (xi ) + γh(xi )))

i=0

d. Update the model: Fm (x) = Fm−1 (x) + γm hm (x) end Step 3: Output FM Algorithm 2: Gradient boosting machine(GBM) algorithm

428

G. Behera and N. Nain

The Xgboost has following exclusive features: 1. Sparse Aware - that is the missing data values are automatic handled. 2. Supports parallelism of tree construction. 3. Continued training - so that the fitted model can further boost with new data. All models received features as input, which are then segregated into training and test set. The test dataset is used for sales prediction.

Fig. 3. Framework of proposed model. Model received the input features and split it into training and test set. The trained model is used to predict the future sales.

4

Implementation and Results

In our work we set cross-validation as 20 fold cross-validation to test accuracy of different models. Where in the cross-validation stage the dataset is divided randomly into 20 subsets with roughly equal sizes. Out of the 20 subsets, 19 subsets are used as training data and the remaining subset forms the test data also called leave-one-out cross validation. Every models is first trained by using the training data and then used to predict accuracy by using test data and this continues until each subset is tested once. From data visualization, it was observed that lowest sales were produced in smallest locations. However, in some cases it was found that medium size location produced highest sales though it was type-3 (there are three type of super market e.g. super market type-1, type-2 and type-3) super market instead of largest size location as shown in Fig. 4. To increase the product sales of Big mart in a particular outlet, more locations should be switched to Type 3 Supermarkets. However, the proposed model gives better predictions among other models for future sales at all locations. For example, how item MRP is correlated with outlet

A Comparative Study of Big Mart Sales Prediction

429

sales is shown in Fig. 5. Also Fig. 5 shows that Item Outlet Sales is strongly correlated with Item MRP, where the correlation is defined in Eq. 8.



n (xy) − ( x)( y)

(8) PCorr =



n[ x2 ) − ( x)2 ] n[ y 2 − ( y)2 ] From Fig. 8 it is also observed that target attribute Item Outlet Sales is affected by sales of the Item Type. Similarly, from Fig. 6 it is also observed that highest sales is made by OUT027 which is actually a medium size outlet in the super market type-3. Figure 7 describes that the less visible products are sold more compared to the higher visibility products which is not possible practically. Thus, we should reject the one of the product level hypothesis that is the visibility does not effect the sales.

Fig. 4. Impact of outlet location type on target variable item outlet sale. Displayed the sales volume of different outlet locations.

Fig. 5. Correlation among features of a dataset. Brown squares are highly correlated whereas black square represents bad correlation among attributes. (Color figure online)

Fig. 6. Impact of outlet identifier on target variable item outlet sale.

Fig. 7. Impact of item visibility on target variable item outlet sale. Less visible items are sold more compared to more visibility items as outlet contains daily used items which contradicts the null hypothesis.

430

G. Behera and N. Nain

Fig. 8. Impact of item type on target variable item outlet sale.

Fig. 9. Distribution of outlet size. The number of outlet size are available in the dataset.

From Fig. 9 it is observed that less number of high outlet size stores exist in comparison to the medium and small outlet size in terms of count. The crossvalidation score along with MAE and RMSE of the proposed model and existing models is shown in Tables 1 and 2 respectively. Similarly the root mean squared error for existing model and proposed model is presented in Table 2. From the results we observe that and found that the proposed model is significantly improved over the other model. Table 1. Comparison of cross validation score of different model Model

Cross validation score (Mean) Cross validation score (Std)

Linear regression 1129

43.24

Decision tree

1091

45.42

Ridge regression

1097

43.41

Table 2. Comparison of MAE and RMSE of proposed model with other model Model

5

MAE

RMSE

Linear regression 836.1

1127

Decision tree

751.6

1068

Ridge regression

836

1129

Xgboost

749.03 1066

Conclusions

In present era of digitally connected world every shopping mall desires to know the customer demands beforehand to avoid the shortfall of sale items in all seasons. Day to day the companies or the malls are predicting more accurately the

A Comparative Study of Big Mart Sales Prediction

431

demand of product sales or user demands. Extensive research in this area at enterprise level is happening for accurate sales prediction. As the profit made by a company is directly proportional to the accurate predictions of sales, the Big marts are desiring more accurate prediction algorithm so that the company will not suffer any losses. In this research work, we have designed a predictive model by modifying Gradient boosting machines as Xgboost technique and experimented it on the 2013 Big Mart dataset for predicting sales of the product from a particular outlet. Experiments support that our technique produce more accurate prediction compared to than other available techniques like decision trees, ridge regression etc. Finally a comparison of different models is summarized in Table 2. From Table 2 it is also concluded that our model with lowest MAE and RMSE performs better compared to existing models.

References 1. Beheshti-Kashi, S., Karimi, H.R., Thoben, K.D., L¨ utjen, M., Teucke, M.: A survey on retail sales forecasting and prediction in fashion markets. Syst. Sci. Control Eng. 3(1), 154–161 (2015) 2. Bose, I., Mahapatra, R.K.: Business data mining-a machine learning perspective. Inf. Manage. 39(3), 211–225 (2001) 3. Chu, C.W., Zhang, G.P.: A comparative study of linear and nonlinear models for aggregate retail sales forecasting. Int. J. Prod. Econ. 86(3), 217–231 (2003) 4. Claypool, M., Gokhale, A., Miranda, T., Murnikov, P., Netes, D., Sartin, M.: Combing content-based and collaborative filters in an online newspaper (1999) 5. Das, P., Chaudhury, S.: Prediction of retail sales of footwear using feedforward and recurrent neural networks. Neural Comput. Appl. 16(4–5), 491–502 (2007). https://doi.org/10.1007/s00521-006-0077-3 6. Domingos, P.M.: A few useful things to know about machine learning. Commun. ACM 55(10), 78–87 (2012) 7. Langley, P., Simon, H.A.: Applications of machine learning and rule induction. Commun. ACM 38(11), 54–64 (1995) 8. Loh, W.Y.: Classification and regression trees. Wiley Interdisc. Rev. Data Min. Knowl. Disc. 1(1), 14–23 (2011) 9. Makridakis, S., Wheelwright, S.C., Hyndman, R.J.: Forecasting Methods and Applications. Wiley, New York (2008) 10. Ni, Y., Fan, F.: A two-stage dynamic sales forecasting model for the fashion retail. Expert Syst. Appl. 38(3), 1529–1536 (2011) 11. Punam, K., Pamula, R., Jain, P.K.: A two-level statistical model for big mart sales prediction. In: International Conference on Computing, Power and Communication Technologies (GUCON), pp. 617–620. IEEE (2018) 12. Ribeiro, A., Seruca, I., Dur˜ ao, N.: Improving organizational decision support: detection of outliers and sales prediction for a pharmaceutical distribution company. Procedia Comput. Sci. 121, 282–290 (2017) 13. Shrivas, T.: Big mart dataset@ONLINE, June 2013. https://datahack. analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/ 14. Smola, A.J., Sch¨ olkopf, B.: A tutorial on support vector regression. Stat. Comput. 14(3), 199–222 (2004). https://doi.org/10.1023/B:STCO.0000035301.49549.88 15. Smyth, B., Cotter, P.: Personalized electronic program guides for digital TV. AI Mag. 22(2), 89 (2001)

432

G. Behera and N. Nain

16. Wang, Y., Witten, I.H.: Induction of model trees for predicting continuous classes (1996) 17. Xia, M., Wong, W.K.: A seasonal discrete grey forecasting model for fashion retailing. Knowl.-Based Syst. 57, 119–126 (2014)

Author Index

Abijah Roseline, S. II-62 Agarwal, Ayushi II-3 Agarwal, Snigdha II-3 Agrawal, Praveen II-387 Agrawal, Raghav II-243 Agrawal, Sanjay I-185 Anand, M. II-331 Ankush, P. A. II-143 Ansari, Mohd Haroon I-195 Anubha Pearline, S. II-319 Bag, Soumen I-15, II-223 Bajpai, Manish Kumar II-254 Banerjee, Biplab I-109 Bansod, Suprit D. II-117 Behera, Gopal I-421 Bhadouria, Sarita Singh I-261 Bhagat, P. K. I-407 Bhardwaj, Priyanka II-331 Bhatnagar, Gaurav I-323 Bhatt, Pranjal II-130 Bhoyar, Kishor K. II-432 Biswas, Anmol I-73 Biswas, Prabir I-348 Biswas, Samit I-270 Bours, Patrick II-163 Burewar, Sairaj Laxman II-283 Busch, Christoph II-49 Chak, Priyanka I-158 Chakraborty, Prasenjit II-143, II-155 Chanani, Anurag II-190 Chandaliya, Praveen Kumar II-294 Chandel, Sushmita I-323 Chatterjee, Sayantan I-27 Chaudhary, Anil Kumar II-373 Chaudhuri, Bidyut B. I-27, I-270 Chavan, Trupti R. II-107 Chhabra, Ayushmaan I-146 Choudhary, Prakash I-407, II-36, II-98, II-307 Chowdhury, Ananda S. II-495

Dansena, Prabhat II-223 Dash, Ratnakar II-343 Debnath, Manisha II-373 Deepa, P. L. I-216 Deivalakshmi, S. II-331 Deshmukh, Maroti II-3 Dhar, Joydip I-261 Dhiman, Ankit II-387 Dhiraj I-146 Didwania, Himansu II-211 Dixit, Anuja I-15 Doshi, Nishi II-15 Dube, Nitant II-266 Dubey, Pawan I-3 Dudhane, Akshay I-311 Dwivedi, Shivangi II-354 Eregala, Srinivas

II-170

Faheema, A. G. J. II-199 Faye, Ibrahima I-174 Gairuboina, Sai Krishna II-143 Gajbhiye, Gaurav O. I-174 Garai, Arpan I-270 Garg, Sanjay II-266 Gauttam, Hutesh Kumar II-423 Geetha, S. II-62 George, Minny II-74 Ghatak, Subhankar II-211 Godfrey, W. Wilfred I-261 Goel, Srishti II-398 Gonde, Anil Balaji I-311, II-283 Gour, Mahesh II-243 Gour, Neha I-94 Green Rosh, K. S. I-73 Grover, Vipul I-383 Gupta, Arpan II-509 Gupta, Deep I-123 Gupta, Hitesh II-398 Gupta, Manu II-25 Gupta, Mayank II-25

434

Author Index

Hari, G. II-62 Hari Krishnan, T. V. II-86 Harjani, Mayank II-294 Hussain, Chesti Altaff I-372 Intwala, Aditya I-205 Inunganbi, Sanasam II-307 Jadhav, Aakash Babasaheb II-283 Jain, Arpit I-94 Jain, Gaurav Kumar I-383 Jain, Sweta II-243 Jamthikar, Ankush D. I-123 Jiji, C. V. I-216 Jonnadula, Eshwar Prithvi II-413 Jose, V. Jeya Maria II-331 Joseph, Philumon I-281 Kalose Mathsyendranath, Raghavendra II-398 Kandpal, Neeta I-239 Kanumuri, Tirupathiraju I-3 Karel, Ashish II-509 Karnick, Harish II-190 Kashyap, Kanchan Lata II-254 Kattampally, Varsha J. II-233 Kedia, Keshav Kumar I-383 Khanna, Pritee I-94, II-254 Khare, Shivam II-86 Khilar, Pabitra Mohan II-413 Kim, Yehoon II-86 Kookna, Vikas I-109 Kovoor, Binsu C. I-281 Krishnamurthy, R. II-62 Kumar, Luckraj Shrawan II-143 Kumar, Manish II-170 Kumar, Munish II-457 Kumar, Nikhil I-239 Kumar, Parveen I-195, II-25 Kumar, Rajesh II-457 Kumar, Ravinder II-457 Kumar, Vardhman II-294 Lakshmi, A. II-199 Lehal, G. S. I-334 Lomte, Sachin Deepak

I-73

Mahata, Nabanita I-301 Maheshkar, Sushila I-227

Maheshkar, Vikas I-227 Maity, Sandip Kumar I-348 Majhi, Snehashis II-343 Mandal, Murari II-354 Mandal, Sekhar I-270 Manglem, Khumanthem II-307 Manne, Sai Kumar Reddy I-39 Marab, Snehal I-395 Massey, Meenakshi I-239 Mastani, S. Aruna I-372 Mehta, Preeti I-227 Mishra, Deepak I-134, I-248 Mishro, Pranaba K. I-185 Mitra, Suman K. II-15, II-130 Mittar, Rishabh II-143 Mohapatra, Ramesh Kumar II-423 Mubarak, Minha I-134 Mukhopadhyay, Susanta I-291 Murala, Subrahmanyam I-311 Muralikrishna, V. M. II-233 Nagar, Rajendra I-61 Nain, Neeta I-421, II-294, II-443 Nandedkar, Abhijeet V. I-174, I-361, II-107, II-117 Nandedkar, Amit V. I-361 Nandi, Gora C. I-27 Natraj, A. Ashwin II-331 Navadiya, Payal I-158 Oza, Vidhey

II-266

Pal, Rajarshi II-223 Palakkal, Sandeep II-86 Paliwal, Rahul II-443 Panda, Rutuparna I-185 Pandeeswari, R. II-331 Parameswaran, Sankaranarayanan Parikh, Bhavya I-158 Patel, Chandan Kumar II-373 Pathak, Ketki C. I-158 Patil, Prashant W. I-311 Patil, Sanjay M. II-432 Paul, Sandip I-248 Paunwala, Chirag I-82 Pawar, Meenakshi I-395, II-364 Prabhakar, Bhanu Pratap I-291 Prajapati, Pratik II-469

II-86

Author Index

Pramanik, Rahul II-223 Prasad, B. H. Pawan I-39 Prasad, Shitala II-373 Priyanka, Sreedevi II-199 Raj, Agastya I-109 Raja, Kiran II-49 Rakesh, Karan II-143 Ramachandra, Raghavendra II-49, II-163 Raman, Shanmuganathan I-61 Rana, Ankur I-334 Rao, D. Venkata I-372 Rasalia, Dhananjay II-266 Rosh, K. S. Green I-39 Rup, Suvendu II-211 Sa, Pankaj Kumar II-343 Sakthi Balan, M. II-509 Sankaran, Praveen II-74, II-233 Saraswathi, Vishlavath I-123 Sarawgi, Yash I-146 Sardeshpande, Kaushik II-180 Sathiesh Kumar, V. II-319 Sen, Mrinmoy II-155 Senthil, M. I-248 Seo, Chanwon II-86 Shah, Dharambhai I-51 Shah, Ketul II-469 Sharma, Ambalika I-195, II-25 Sharma, Dharam Veer I-334 Sharma, Harmohan I-334 Sharma, Megha II-233 Sharma, R. K. II-457 Sharma, Riya II-398 Sheeba Rani, J. I-134 Sheoran, Gyanendra I-3 Shikkenawis, Gitam II-15 Shingan, Mayuresh II-364 Siddiqui, Mohammed Arshad I-94 Sikdar, Arindam II-495 Sing, Jamuna Kanta I-301 Singh, Ankit Kumar I-109 Singh, Ankur II-190

Singh, Harjeet II-457 Singh, Jag Mohan II-49 Singh, Kh Manglem I-407 Singh, M. Sheetal II-98 Singh, Oinam Vivek II-36 Singh, Pankaj Pratap II-373 Somani, Shivam I-146 Srivastava, Rajeev II-485 Subramanian, K. II-331 Subudhi, Priyambada I-291 Sujata II-130 Suji, R. Jenkin I-261 Sunkara, Sai Venkatesh II-233 Swetha, A. II-233 Talbar, S. II-364 Thakkar, Priyank II-266 Thakkar, Shaival II-469 Thesia, Yash II-266 Thomas, Job I-281 Thomas, Thomas James I-134 Thongam, Khelchandra II-36, II-98 Thool, Vijaya R. II-180 Tripathy, Santosh Kumar II-485 Vaidya, Bhaumik I-82 Vaitheeshwari, R. II-319 Venkatesh, Sushma II-49, II-163 Verma, Divakar II-170 Verma, Karun II-457 Verma, Shashikant I-61 Vipparthi, Santosh Kumar II-354 Vyas, Ritesh I-3 Waghumbare, Ajay Ashokrao Wankhade, Shraddha I-15 Yadav, Shalini II-443 Yadav, Shekhar II-354 Yun, Sojung II-86 Zaveri, Tanish I-51

II-283

435