Smart Computer Vision 3031205405, 9783031205408

This book addresses and disseminates research and development in the applications of intelligent techniques for computer vision.


English Pages 358 [359] Year 2023


Table of contents :
Preface
Contents
A Systematic Review on Machine Learning-Based Sports Video Summarization Techniques
1 Introduction
2 Two Decades of Research in Sports Video Summarization
2.1 Feature-Based Approaches
2.2 Cluster-Based Approaches
2.3 Excitement-Based Approaches
2.4 Key Event-Based Approaches
2.5 Object Detection
2.6 Performance Metrics
2.6.1 Objective Metrics
2.6.2 Subjective Metrics Based on User Experience
3 Evolution of Ideas, Algorithms, and Methods for Sports Video Summarization
4 Scope for Future Research in Video Summarization
4.1 Common Weaknesses of Existing Methods
4.1.1 Audio-Based Methods
4.1.2 Shot and Boundary Detection
4.1.3 Resolution and Samples
4.1.4 Events Detection
4.2 Scope for Further Research
5 Conclusion
References
Shot Boundary Detection from Lecture Video Sequences Using Histogram of Oriented Gradients and Radiometric Correlation
1 Introduction
2 Shot Boundary Detection and Key Frame Extraction
2.1 Feature Extraction
2.2 Radiometric Correlation for Interframe Similarity Measure
2.3 Entropic Measure for Distinguishing Shot Transitions
2.4 Key Frame Extraction
3 Results and Discussions
3.1 Analysis of Results
3.2 Discussions and Future Works
4 Conclusions
References
Detection of Road Potholes Using Computer Vision and Machine Learning Approaches to Assist the Visually Challenged
1 Introduction
2 Related Works
3 Methodologies
3.1 Pothole Detection Using Machine Learning and Computer Vision
3.2 Pothole Detection Using Deep Learning Model
4 Implementation
5 Result Analysis
6 Conclusion
References
Shape Feature Extraction Techniques for Computer Vision Applications
1 Introduction
2 Feature Extraction
3 Various Techniques in Feature Extraction
3.1 Histograms of Edge Directions
3.2 The Harris Corner
3.3 Scale-Invariant Feature Transform
3.4 Eigenvector Approaches
3.5 Angular Radial Partitioning
3.6 Edge Pixel Neighborhood Information
3.7 Color Histograms
3.8 Edge Histogram Descriptor
3.9 Shape Descriptor
4 Shape Signature
4.1 Centroid Distance Function
4.2 Chord Length Function
4.3 Area Function
5 Real-Time Applications of Shape Feature Extraction and Object Recognition
5.1 Fruit Recognition
5.2 Leaf Recognition
5.3 Object Recognition
6 Recent Works
7 Summary and Conclusion
References
GLCM Feature-Based Texture Image Classification Using Machine Learning Algorithms
1 Introduction
2 GLCM
2.1 Computation of GLCM Matrix
2.2 GLCM Features
2.2.1 Energy
2.2.2 Entropy
2.2.3 Sum Entropy
2.2.4 Difference Entropy
2.2.5 Contrast
2.2.6 Variance
2.2.7 Sum Variance
2.2.8 Difference Variance
2.2.9 Local Homogeneity or Inverse Difference Moment (IDM)
2.2.11 RMS Contrast
2.2.12 Cluster Shade
2.2.13 Cluster Prominence
3 Machine Learning Algorithms
4 Dataset Description
5 Experiment Results
5.1 Performance Metrics
5.1.1 Sensitivity
5.1.2 Specificity
5.1.3 False Positive Rate (FPR)
5.1.4 False Negative Ratio (FNR)
6 Conclusion
References
Progress in Multimodal Affective Computing: From Machine Learning to Deep Learning
1 Introduction
2 Available Datasets
2.1 DEAP Dataset
2.2 AMIGOS Dataset
2.3 CHEAVD 2.0 Dataset
2.4 RECOLA Dataset
2.5 IEMOCAP Dataset
2.6 CMU-MOSEI Dataset
2.7 SEED IV Dataset
2.8 AVEC 2014 Dataset
2.9 SEWA Dataset
2.10 AVEC 2018 Dataset
2.11 DAIC-WOZ Dataset
2.12 UVA Toddler Dataset
2.13 MET Dataset
3 Features for Affect Recognition
3.1 Audio Modality
3.2 Visual Modality
3.3 Textual Modality
3.4 Facial Expression
3.5 Biological Signals
4 Various Fusion Techniques
4.1 Decision-Level or Late Fusion
4.2 Hierarchical Fusion
4.3 Score-Level Fusion
4.4 Model-Level Fusion
5 Multimodal Affective Computing Techniques
5.1 Machine Learning-Based Techniques
5.2 Deep Learning-Based Techniques
6 Discussion
7 Conclusion
References
Content-Based Image Retrieval Using Deep Features and Hamming Distance
1 Introduction
1.1 Content-Based Image Retrieval: Review
2 Background: Basics of CNN
3 Proposed Model
3.1 Transfer Learning Using Pretrained Weights
3.2 Feature Vector Extraction
3.3 Clustering
3.4 Retrieval Using Distance Metrics
3.4.1 Euclidean Distance
3.4.2 Hamming Distance
4 Dataset Used
5 Results and Discussions
5.1 Retrieval Using Euclidean Distance
5.1.1 Retrieving 40 Images
5.1.2 Retrieving 50 Images
5.1.3 Retrieving 60 Images
5.1.4 Retrieving 70 Images
5.2 Retrieval Using Hamming Distance
5.2.1 Retrieving 40 Images
5.2.2 Retrieving 50 Images
5.2.3 Retrieving 60 Images
5.2.4 Retrieving 70 Images
5.3 Retrieval Analysis Between Euclidean Distance and Hamming Distance
5.4 Comparison with State-of-the-Art Models
6 Conclusion
7 Future Works
References
Bioinspired CNN Approach for Diagnosing COVID-19 Using Images of Chest X-Ray
1 Introduction
2 Related Work
3 Approaches and Tools
3.1 CIFAR Dataset of Chest X-Ray Image
3.2 Image Scaling in Preprocessing
3.3 Training and Validation Steps
3.4 Deep Learning Model
4 Cuckoo-Based Hash Function
5 Research Data and Model Settings
5.1 Estimates of the Proposed Model's Accuracy
6 Conclusion
References
Initial Stage Identification of COVID-19 Using Capsule Networks
1 Introduction
2 Literature Review
3 Dataset Description
4 Methodology
4.1 Overview of Layers Present in Convolutional Neural Networks
4.1.1 Convolutional Layer
4.1.2 Stride (S)
4.1.3 Pooling Layer
4.1.4 ReLU Activation Functions
4.1.5 Generalized Supervised Deep Learning Flowchart
5 Proposed Work
5.1 Capsule Networks
5.2 Proposed Architecture
5.3 Metrics for Evaluation
5.3.1 Accuracy
5.3.2 Precision
5.3.3 Recall
5.3.4 F1-Score
5.3.5 False Positive Rate (FPR)
6 Conclusion
References
Deep Learning in Autoencoder Framework and Shape Prior for Hand Gesture Recognition
1 Introduction
2 State-of-the-Art Techniques
3 Proposed Gesture Recognition Scheme
3.1 Preprocessing
3.1.1 Color Space Conversion
3.1.2 Background Removal
3.1.3 Bounding Box and Resizing
3.2 Feature Extraction
3.3 Classification
4 Simulation Results and Discussions
5 Conclusions and Future Works
References
Hierarchical-Based Semantic Segmentation of 3D Point Cloud Using Deep Learning
1 Introduction
2 Related Work
3 NN-Based Point Cloud Segmentation Using Octrees
3.1 Box Search by Octrees
3.2 Feature Hierarchy
3.3 Permutation Invariance
3.4 Size Invariance
3.5 Architecture Details
4 Experiments
4.1 Implementation and Dataset Details
4.2 List of Experiments
4.3 Learning Curves
4.4 Qualitative Results
4.4.1 Shapenet Dataset
5 Conclusions and Future Work
References
Convolution Neural Network and Auto-encoder Hybrid Scheme for Automatic Colorization of Grayscale Images
1 Introduction
2 Basics of Convolution Neural Network
2.1 Convolutional Layer
2.2 Pooling Layer
2.3 Fully Connected Layer
2.4 Overfitting or Dropout
2.5 Activation Functions
3 Auto-encoder and Decoder Model
4 Proposed Research Methodology
4.1 Data Description and Design Approaches
5 Experimental Analysis and Result
5.1 Classification and Validation Process
5.2 Prediction
6 Conclusion
References
Deep Learning-Based Open Set Domain Hyperspectral Image Classification Using Dimension-Reduced Spectral Features
1 Introduction
2 Methodology
2.1 Dataset
2.2 Salinas
2.3 Salinas A
2.4 Pavia U
3 Experiment Results
3.1 Dimensionality Reduction Based on Dynamic Mode Decomposition
3.1.1 Salinas Dataset
3.1.2 Salinas A Dataset
3.1.3 Pavia University Dataset
3.2 Dimension Reduction Using Chebyshev Polynomial Approximation
3.2.1 Salinas Dataset
3.2.2 Salinas A Dataset
3.2.3 Pavia U Dataset
4 Conclusion
References
An Effective Diabetic Retinopathy Detection Using Hybrid Convolutional Neural Network Models
1 Introduction
2 Related Work
3 Methodology
3.1 Research Objectives
3.2 Feature Selection
3.3 Proposed Models
3.3.1 CNN Model
3.3.2 CNN with SVM Classifier
3.3.3 CNN with RF Classifier
4 Experimental Results and Analysis
5 Conclusion and Future Work
References
Modified Discrete Differential Evolution with Neighborhood Approach for Grayscale Image Enhancement
1 Introduction
2 Related Works
3 Differential Evolution
3.1 Classical Differential Evolution
4 Proposed Approach
4.1 Best Neighborhood Differential Evolution (BNDE) Mapping
5 Phase I – Performance Comparison
5.1 Design of Experiments – Phase I
5.2 Results and Discussions – Phase I
6 Phase II – Image Processing Application
6.1 Design of Experiments – Phase II
6.2 Results and Discussions – Phase II
7 Conclusions
References
Swarm-Based Methods Applied to Computer Vision
Abbreviations
1 Introduction
2 Brief Description of Swarm-Based Methods
3 Some Advantages of Swarm-Based Methods
4 Swarm-Based Methods and Computer Vision
4.1 Feature Extraction
4.2 Image Segmentation
4.3 Image Classification
4.4 Object Detection
4.5 Face Recognition
4.6 Gesture Recognition
4.7 Medical Image Processing
References
Index


EAI/Springer Innovations in Communication and Computing

B. Vinoth Kumar P. Sivakumar B. Surendiran Junhua Ding   Editors

Smart Computer Vision

EAI/Springer Innovations in Communication and Computing Series Editor Imrich Chlamtac, European Alliance for Innovation, Ghent, Belgium

The impact of information technologies is creating a new world yet not fully understood. The extent and speed of economic, lifestyle and social changes already perceived in everyday life is hard to estimate without understanding the technological driving forces behind them. This series presents contributed volumes featuring the latest research and development in the various information engineering technologies that play a key role in this process. The range of topics, focusing primarily on communications and computing engineering, includes, but is not limited to, wireless networks; mobile communication; design and learning; gaming; interaction; e-health and pervasive healthcare; energy management; smart grids; internet of things; cognitive radio networks; computation; cloud computing; ubiquitous connectivity; and in more general smart living, smart cities, Internet of Things and more. The series publishes a combination of expanded papers selected from hosted and sponsored European Alliance for Innovation (EAI) conferences that present cutting-edge, global research as well as provide new perspectives on traditional related engineering fields. This content, complemented with open calls for contribution of book titles and individual chapters, together maintains Springer's and EAI's high standards of academic excellence. The audience for the books consists of researchers, industry professionals, advanced-level students as well as practitioners in related fields of activity, including information and communication specialists, security experts, economists, urban planners, doctors, and in general representatives of all those walks of life affected by and contributing to the information revolution. Indexing: This series is indexed in Scopus, Ei Compendex, and zbMATH.
About EAI - EAI is a grassroots member organization initiated through cooperation between businesses, public, private and government organizations to address the global challenges of Europe's future competitiveness and link the European Research community with its counterparts around the globe. EAI reaches out to hundreds of thousands of individual subscribers on all continents and collaborates with an institutional member base including Fortune 500 companies, government organizations, and educational institutions, providing a free research and innovation platform. Through its open free membership model, EAI promotes a new research and innovation culture based on collaboration, connectivity and recognition of excellence by the community.

B. Vinoth Kumar • P. Sivakumar • B. Surendiran • Junhua Ding Editors

Smart Computer Vision

Editors

B. Vinoth Kumar, PSG College of Technology, Coimbatore, Tamil Nadu, India
P. Sivakumar, PSG College of Technology, Coimbatore, Tamil Nadu, India
B. Surendiran, National Institute of Technology Puducherry, Thiruvettakudy, Karaikal, India
Junhua Ding, University of North Texas, Denton, TX, USA

ISSN 2522-8595 ISSN 2522-8609 (electronic) EAI/Springer Innovations in Communication and Computing ISBN 978-3-031-20540-8 ISBN 978-3-031-20541-5 (eBook) https://doi.org/10.1007/978-3-031-20541-5 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Computer vision is a field of computer science that works on enabling computers to see, identify, and process images in the same way that human vision does, and then provide appropriate output. It is like imparting human intelligence and instincts to a computer. It is an interdisciplinary field that trains computers to interpret and understand the visual world from digital images and videos. The main objective of this edited book is to address and disseminate state-of-the-art research and development in the applications of intelligent techniques for computer vision. The book provides contributions that include theory, case studies, and intelligent techniques pertaining to computer vision applications, helping readers grasp a broad perspective and the essence of recent advances in this field. The prospective audience comprises researchers, professionals, practitioners, and students from academia and industry who work in this field. We hope the chapters presented will inspire future research, from both theoretical and practical viewpoints, to spur further advances in the field. A brief introduction to each chapter is as follows. Chapter 1 discusses the machine learning approaches applied to automatic sports video summarization. Chapter 2 proposes a new technique for lecture video segmentation and key frame extraction; the results are compared against six existing state-of-the-art techniques based on computational time and shot transitions. Chapter 3 presents a system to detect potholes in pathways/roadways using machine learning and deep learning approaches; it uses HOG (histogram of oriented gradients) and LBP (local binary pattern) features to enhance the performance of the classification algorithms. Chapter 4 explores various feature extraction techniques and shape detection approaches required for image retrieval.
It also discusses the real-time applications of shape feature extraction and object recognition techniques with examples. Chapter 5 describes an approach for texture image classification based on Gray Level Co-occurrence Matrix (GLCM) features and machine learning algorithms. Chapter 6 presents an overview of unimodal and multimodal affective computing and discusses the various machine learning and deep learning techniques for affect recognition. Chapter 7 proposes a deep learning model for content-based image


retrieval. It uses the K-Means clustering algorithm and Hamming distance for faster retrieval of images. Chapter 8 provides a bio-inspired convolutional neural network (CNN)-based model for COVID-19 diagnosis; a cuckoo search algorithm is used to improve the performance of the CNN model. Chapter 9 presents a convolutional CapsNet for detecting COVID-19 disease using chest X-ray images; the model obtains fast and accurate diagnostic results with fewer trainable parameters. Chapter 10 proposes a deep learning framework for an automated hand gesture recognition system. The proposed framework classifies the input hand gestures, each represented by a set of histogram of oriented gradients feature vectors, into a predefined number of gesture classes. Chapter 11 presents a new hierarchical deep learning-based approach for semantic segmentation of 3D point clouds, involving nearest neighbor search for local feature extraction followed by an auxiliary pretrained network for classification. Chapter 12 presents a hybrid model for automatic colorization of grayscale images without human intervention; the model predicts colors for new images with good accuracy, close to the real images. In the future, such automatic colorization techniques could help restore the details of vintage grayscale images or movies very clearly. Chapter 13 proposes a generative adversarial network (GAN) for hyperspectral image classification, using dynamic mode decomposition (DMD) to reduce redundant features in order to attain better classification. Chapter 14 presents a brief introduction to the methodologies used for identifying diabetic retinopathy and uses convolutional neural network models to achieve effective classification of retinal fundus images for diabetic detection.
Chapter 15 proposes a modified differential evolution (DE), best neighborhood DE (BNDE), to solve discrete-valued benchmark and real-world optimization problems. The proposed algorithm increases the exploitation and exploration capabilities of DE to reach the optimal solution faster; in addition, it is applied to grayscale image enhancement. Chapter 16 presents an overview of the main swarm-based solutions proposed to solve problems related to computer vision, with a brief description of the principles behind swarm algorithms as well as the basic operations of the swarm methods that have been applied in computer vision. We are grateful to the authors and reviewers for their excellent contributions in making this book possible. Our special thanks go to Mary James (EAI/Springer Innovations in Communication and Computing) for the opportunity to organize this edited volume.


We are grateful to Ms. Eliška Vlčková (Managing Editor at EAI – European Alliance for Innovation) for the excellent collaboration. We hope the chapters presented will inspire researchers and practitioners from academia and industry to spur further advances in the field.

B. Vinoth Kumar, Coimbatore, Tamil Nadu, India
P. Sivakumar, Coimbatore, Tamil Nadu, India
B. Surendiran, Puducherry, Karaikal, India
Junhua Ding, Denton, TX, USA

January 2023

Contents

A Systematic Review on Machine Learning-Based Sports Video Summarization Techniques (page 1)
Vani Vasudevan and Mohan S. Gounder

Shot Boundary Detection from Lecture Video Sequences Using Histogram of Oriented Gradients and Radiometric Correlation (page 35)
T. Veerakumar, Badri Narayan Subudhi, K. Sandeep Kumar, Nikhil O. F. Da Rocha, and S. Esakkirajan

Detection of Road Potholes Using Computer Vision and Machine Learning Approaches to Assist the Visually Challenged (page 61)
U. Akshaya Devi and N. Arulanand

Shape Feature Extraction Techniques for Computer Vision Applications (page 81)
E. Fantin Irudaya Raj and M. Balaji

GLCM Feature-Based Texture Image Classification Using Machine Learning Algorithms (page 103)
R. Anand, T. Shanthi, R. S. Sabeenian, and S. Veni

Progress in Multimodal Affective Computing: From Machine Learning to Deep Learning (page 127)
M. Chanchal and B. Vinoth Kumar

Content-Based Image Retrieval Using Deep Features and Hamming Distance (page 151)
R. T. Akash Guna and O. K. Sikha

Bioinspired CNN Approach for Diagnosing COVID-19 Using Images of Chest X-Ray (page 181)
P. Manju Bala, S. Usharani, R. Rajmohan, T. Ananth Kumar, and A. Balachandar

Initial Stage Identification of COVID-19 Using Capsule Networks (page 203)
Shamika Ganesan, R. Anand, V. Sowmya, and K. P. Soman

Deep Learning in Autoencoder Framework and Shape Prior for Hand Gesture Recognition (page 223)
Badri Narayan Subudhi, T. Veerakumar, Sai Rakshit Harathas, Rohan Prabhudesai, Venkatanareshbabu Kuppili, and Vinit Jakhetiya

Hierarchical-Based Semantic Segmentation of 3D Point Cloud Using Deep Learning (page 243)
J. Narasimhamurthy, Karthikeyan Vaiapury, Ramanathan Muthuganapathy, and Balamuralidhar Purushothaman

Convolution Neural Network and Auto-encoder Hybrid Scheme for Automatic Colorization of Grayscale Images (page 253)
A. Anitha, P. Shivakumara, Shreyansh Jain, and Vidhi Agarwal

Deep Learning-Based Open Set Domain Hyperspectral Image Classification Using Dimension-Reduced Spectral Features (page 273)
C. S. Krishnendu, V. Sowmya, and K. P. Soman

An Effective Diabetic Retinopathy Detection Using Hybrid Convolutional Neural Network Models (page 295)
Niteesh Kumar, Rashad Ahmed, B. H. Venkatesh, and M. Anand Kumar

Modified Discrete Differential Evolution with Neighborhood Approach for Grayscale Image Enhancement (page 307)
Anisha Radhakrishnan and G. Jeyakumar

Swarm-Based Methods Applied to Computer Vision (page 331)
María-Luisa Pérez-Delgado

Index (page 357)

A Systematic Review on Machine Learning-Based Sports Video Summarization Techniques

Vani Vasudevan and Mohan S. Gounder

1 Introduction

Sports video summarization is an interesting field of research, as it aims to generate a highlight of a broadcast video. Broadcast sports videos are usually long, and audiences may not have enough time to watch the entire duration of the game. Sports like soccer (football), basketball, baseball, tennis, golf, cricket, and rugby are played for 90–180 minutes per match. Hence, creating a summary that contains only the events and excitement of interest for an individual sport is a labor-intensive human task. There are several learning- and non-learning-based techniques in the literature that attempt to automate the process of creating such highlight or summary videos. In addition, in recent years, advances in deep learning techniques have contributed to remarkable results in sports video summarization. Figure 1 shows the growing number of publications associated with "sports video summarization" over the past two decades. Video summarization techniques [2] have been widely used in many types of sports. The sports/games chosen for this systematic review are based on the following criteria: (1) sports with a high audience base (https://www.topendsports.com/world/lists/popular-sport/fans.html), (2) sports with more sponsorship, (3) sports with more watch views/hours, (4) sports where the research potential is high with large datasets, (5) sports where the need for technological advancement is very high, (6) frequency of the occurrence of the game/sport in a year, (7) number of countries participating in the sports, and (8) number of countries hosting the

V. Vasudevan Department of CSE, Nitte Meenakshi Institute of Technology, Bengaluru, India M. S. Gounder () Department of ISE, Nitte Meenakshi Institute of Technology, Bengaluru, India © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_1


[Figure 1: bar chart of yearly publication counts, 2000–2020]
Fig. 1 Number of publications in sports video summarization from 2000 to 2020. (Data from Google Scholar advanced search with "sports video summarization" OR "sports highlights" anywhere in the article)

[Figure 2: horizontal bar chart of publication counts by sport: soccer, basketball, baseball, tennis, golf, cricket, rugby, and handball]
Fig. 2 Number of publications based on types of popular sports videos used to generate video highlights from 2000 to 2020. (Data from Google Scholar advanced search with "sports video summarization" OR "sports highlights" <type of sport> anywhere in the article)

event. Figure 2 shows the publications based on the types of sports videos used to generate highlights, where "type of sport" is substituted with soccer/football, basketball, baseball, etc. Based on the various criteria considered, along with the number of publications in the literature, we have confined the scope of this review to soccer, tennis, and cricket. The rest of this paper is organized as follows. In Sect. 2, we review the techniques established for sports video summarization since 2000. Section 3 reviews in depth some important ideas, algorithms, and methods that have evolved for video highlight generation in two popular sports, soccer and cricket, with a quick review of other sports. In Sect. 4, the scope for future research, weaknesses of the existing methods, and possible solutions are discussed. We conclude the paper in Sect. 5.


2 Two Decades of Research in Sports Video Summarization

In this section, we review the history of sports video summarization in multiple aspects, including the techniques established for sports video summarization, the learning and non-learning techniques applied in sports video highlight generation, and the evaluation metrics used. A generic architecture of the video summarization process is shown in Fig. 3: a long-duration sports video is processed by various summarization techniques, which consider different influencing factors, to finally generate the sports video highlights or summary. Figure 4 shows the techniques established over the last two decades in the field of sports video summarization.

Fig. 3 Generic architecture of sports video summarization

Fig. 4 Techniques established for sports video summarization


According to the literature we reviewed, these techniques are broadly classified into feature-based, cluster-based, excitement-based, and key event-based approaches.

2.1 Feature-Based Approaches

Most sports events can be summarized based on features like color, motion, gestures of players or umpires/referees, combinations of audio and visual cues, texts that display the scores, and objects. For example, soccer and the short versions of cricket are played with different color jerseys for each team, and interpreting the signals of referees or umpires involves identifying some key gestures. The method in [42] proposed a dominant color-based video summarization to extract highlights from sports video. The key frames are extracted based on color histogram analysis. Such color features can provide additional confidence when combined with other visual features for key frame extraction; this method, however, did not adopt any such visual features to identify the key frames. Nevertheless, the color factor clearly plays an important role in influencing sports video summarization. Figure 5 shows all the factors that influence sports video summarization.
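The color-histogram key frame idea described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation from [42]: the function names, bin count, and L1-distance threshold are our own choices, and frames are assumed to be RGB arrays.

```python
import numpy as np

def color_histogram(frame, bins=16):
    """Per-channel color histogram of an RGB frame, normalized to sum to 1."""
    hist = np.concatenate(
        [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
         for c in range(frame.shape[-1])]
    ).astype(float)
    return hist / hist.sum()

def key_frames_by_histogram(frames, threshold=0.4):
    """Keep frames whose histogram differs strongly from the previous key frame.

    Returns the indices of the selected key frames; frame 0 is always kept.
    The L1 distance between two normalized histograms lies in [0, 2].
    """
    keys = [0]
    ref = color_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        h = color_histogram(frame)
        if np.abs(h - ref).sum() > threshold:  # large color change => new shot
            keys.append(i)
            ref = h
    return keys
```

In a real pipeline, the selected frames would be drawn from decoded video and the threshold tuned per sport, since jersey and field colors dominate the histograms.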
The reaction of players or the crowd is considered for emotional moments. This also counts what happened in the scene, who were involved in the scene, and how the players and audience reacted to the scene. The video clips are converted into segments of sematic shots and then each of them into clips based on camera motion. The interest level of each of these clips is measured based on cinematography and motion features. This work also classifies the shots as long view, close view, and medium view. Interestingly, a segment is created based on semantics in the scene. Thus, it forms a semantic structure of soccer video. The factors influencing the summarization in this method as per Fig. 4 are events, movement, object/event detection, and camera motion. In [67], the authors introduced a dataset called SNOW, which is used to identify the pose of umpires in cricket. They identified four important events in cricket,


Fig. 5 Factors influencing sports video summarization

namely, Six, No Ball, Out, and Wide Ball, based on umpire pose or gesture recognition. Pretrained convolutional networks such as Inception V3 and VGG19 have been used to extract the features, and the poses are classified with an SVM. The authors have also created a database for public use and made it available online for download. Some of the factors influencing video summarization in this method are visual cues and object detection. Another interesting feature used in many video summarization methods is audio [9, 35, 44, 45, 66, 92]. As Fig. 5 shows, audio features such as the commentator's voice and crowd cheering are key factors influencing video summarization. In [9], an audiovisual approach has been presented: the audio signal's instantaneous energy and local periodicity are measured to extract highlight moments in a soccer game. In the work proposed by [35], the commentator's voice, the referee's whistle, and crowd noise are considered to find exciting events. The events related to soccer games are goals, penalty shootouts, red cards, etc. The authors also considered the audio noise accompanying such events, such as musical instruments, and applied Empirical Mode Decomposition to filter it. Another method proposed by [45] also applies audiovisual features to extract key frames from sports video. The audio features considered here are excitement events, identified by spikes in the signal due to crowd cheering. In addition, visual features such as scorecard detection have been proposed using deep learning methods. As in [9], the influencing factors are visual cues and crowd audio. The authors of [44] proposed an interesting method to detect highlights based on referee whistle detection. A band-pass filter is designed to accentuate the referee's whistle and suppress other audio events.
A decision-rules-based [1, 39, 61, 84] time-dependent threshold method has been applied to detect the regions where the whistle sound occurs. The authors used an extensive 12 hours of test signals from various soccer, football, and basketball games. As in [9, 35, 44, 45, 83], the method proposed in [66] employs audio features such as the spectrum of


V. Vasudevan and M. S. Gounder

signal during key events such as goals. This is applied on top of key event detection using visual and color features. Some of the factors used in this method are audio, color, visual cues, replays, excitement, batting and bowling shots [3], and player detection. The method is strongly dependent on the video production style, and its detection accuracy is only 70%. A time-frequency feature extraction is used to calculate local autocorrelations on complex Fourier values in [92]; the extracted features are then used to detect exciting scenes. The authors considered environmental noise and showed that the method is robust and performs well; however, the commentator's voice is not considered. Looking across all the methods that used audio as a feature, audio clearly offers additional confidence for extracting or identifying key frames or key events in a sports video. The majority of them focused on identifying events based on crowd or spectator cheering. Though such audio contains additional noise, methods like [44, 66, 92] have applied techniques to deal with it.
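The whistle detector of [44] combines a band-pass filter with a time-dependent threshold. The following is only an illustrative stand-in, not the authors' implementation: it uses FFT band energy in place of a designed filter, a multiple-of-median threshold in place of the paper's time-dependent rule, and entirely hypothetical parameters (16 kHz sampling, a 2-4 kHz band, 1024-sample frames, synthetic audio).

```python
import numpy as np

def band_energy_per_frame(signal, fs, lo_hz, hi_hz, frame_len=1024):
    """Energy inside [lo_hz, hi_hz] for consecutive non-overlapping frames."""
    n_frames = len(signal) // frame_len
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    band = (freqs >= lo_hz) & (freqs <= hi_hz)
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        energies[i] = spectrum[band].sum()
    return energies

def loud_band_frames(signal, fs, lo_hz=2000, hi_hz=4000, ratio=10.0):
    """Flag frames whose in-band energy exceeds a multiple of the median."""
    e = band_energy_per_frame(signal, fs, lo_hz, hi_hz)
    return np.where(e > ratio * np.median(e))[0]

# Synthetic demo: low-level noise with a 1 s whistle-like 3 kHz tone inserted.
fs = 16000
rng = np.random.default_rng(0)
audio = 0.05 * rng.standard_normal(4 * fs)
t = np.arange(fs) / fs
audio[fs:2 * fs] += 0.8 * np.sin(2 * np.pi * 3000.0 * t)
frames = loud_band_frames(audio, fs)
```

The median-based threshold is a robustness choice: because exciting events occupy a minority of frames, the median tracks the background level even when loud segments are present.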

2.2 Cluster-Based Approaches

Clustering-based methods group similar frames or shots into clusters and then process these clusters as required. In [13], Fuzzy C-Means clustering is applied to cluster video frames based on color features, and a shot detection algorithm is used to determine the number of clusters. The authors attempted to improve both computation speed and accuracy through this method of video summarization. Another method [54] builds a hierarchical summary based on a state transition model; the authors also used other cues such as text, audio, and expert's choice to improve the accuracy of the proposed algorithm. This method uses visual, text, and pitch cues as the factors of influence (Fig. 5). In [50], a neuro-fuzzy approach has been proposed to segment the shots. The content of the shots is identified by slots or windows that are more semantic. Hierarchical clustering of the identified windows provides a textual summary to the user based on the video content of the shot. The method claims to generate textual annotations of video automatically, and it can also be used to compare the similarity between any two video shots. In [79], a statistical classifier based on Gaussian mixture models with unsupervised model adaptation is proposed. This method mainly uses audio features to find the mismatch between test and pretrained models, which is also discussed in Sect. 2.3.
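As a rough sketch of this family of methods, frames can be described by color histograms and grouped with a clustering algorithm. The code below substitutes plain k-means for the Fuzzy C-Means of [13] and runs on tiny synthetic "frames"; the feature choice, cluster count, and all parameters are illustrative assumptions, not the reviewed method.

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Normalized intensity histogram as a simple per-frame color feature."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def kmeans(features, k, iters=20):
    """Plain k-means with deterministic, evenly spaced initial centroids."""
    idx = np.linspace(0, len(features) - 1, k).astype(int)
    centroids = features[idx].copy()
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

def key_frame_indices(frames, k):
    """Pick, per cluster, the frame whose histogram is closest to the centroid."""
    feats = np.array([color_histogram(f) for f in frames])
    labels, centroids = kmeans(feats, k)
    keys = []
    for j in range(k):
        members = np.where(labels == j)[0]
        if len(members):
            d = np.linalg.norm(feats[members] - centroids[j], axis=1)
            keys.append(int(members[d.argmin()]))
    return sorted(keys)

# Tiny synthetic demo: ten dark frames followed by ten bright frames.
dark = [np.full((32, 32), 40 + i) for i in range(10)]
bright = [np.full((32, 32), 200 + i) for i in range(10)]
keys = key_frame_indices(dark + bright, k=2)
```

One representative per cluster then forms the key-frame summary; a shot detection step, as in [13], would supply the number of clusters instead of a fixed k.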

2.3 Excitement-Based Approaches

All sports and games have moments that can be identified as moments of excitement. These moments may come from the players' reactions, crowd reactions,


referee's actions, or even the commentator's reactions. Player reactions and expressions include high-fives, fist pumps, aggressive or tense gestures, and smiles. Crowd and commentator excitement can be identified from the energy of the audio signal or the commentators' tone. Some of the works already reported as audio-based [9, 35, 44, 45, 66, 92] use these features to extract key events. In addition, works like [59, 75, 79] exploit such excitement-based features to identify key events that eventually contribute to summarizing sports video. In [58], the authors used multiple features to identify key frames: information from player reactions and expressions, spectator cheers, and the commentator's tone is used to identify key events. These methods have been applied to summarize sports like tennis and golf. The excitement-based highlight generation reported in [75] considers secondary events in a cricket match, such as dropped catches and pressure moments, based on a strategy that includes the loudness of the video, the category associated with the primary event, and replays. Player celebration detection, crowd excitement, appeals, and intense commentary are considered as excitement features. The method has been extensively tested on cricket videos. Another method that exploits commentator speech is proposed in [79]. It uses a statistical classifier based on Gaussian mixture models with unsupervised model adaptation; the acoustic mismatch between training and testing data is compensated using a maximum a posteriori adaptation method.
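The likelihood-ratio idea behind such commentator-excitement detectors can be sketched in a heavily simplified form. The code below replaces the Gaussian mixture models and MAP adaptation of [79] with single one-dimensional Gaussians over a hypothetical per-frame loudness feature; all training values are invented for illustration.

```python
import math

def fit_gaussian(samples):
    """Mean and (floored) variance of a 1-D feature, e.g., short-time energy."""
    m = sum(samples) / len(samples)
    v = sum((s - m) ** 2 for s in samples) / len(samples)
    return m, max(v, 1e-6)

def log_likelihood(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def excitement_scores(energies, neutral_train, excited_train):
    """Log-likelihood ratio of an 'excited' vs a 'neutral' speech model."""
    nm, nv = fit_gaussian(neutral_train)
    em, ev = fit_gaussian(excited_train)
    return [log_likelihood(e, em, ev) - log_likelihood(e, nm, nv)
            for e in energies]

# Hypothetical per-frame loudness values standing in for acoustic features.
neutral_train = [0.9, 1.0, 1.1, 1.0, 0.95]   # calm commentary
excited_train = [4.8, 5.2, 5.0, 4.9, 5.1]    # excited commentary
scores = excitement_scores([1.0, 1.05, 5.0, 4.9, 1.1],
                           neutral_train, excited_train)
flags = [s > 0 for s in scores]               # positive ratio => excited frame
```

A positive log-likelihood ratio marks a frame as excited; the average of such ratios over a segment corresponds loosely to the average log-likelihood-ratio scoring mentioned for [80] in Table 2.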

2.4 Key Event-Based Approaches

Every sport has its own list of key events, and summarization can be carried out based on them. This obviously lets viewers see the most exciting events in the sport of their choice. For example, soccer has key events such as goals, fouls, and shots. A substantial number of publications [4, 7, 34, 38, 39, 41, 48, 65, 75, 76, 81, 86, 91] address video summarization based on key events. In [4], an unsupervised framework for soccer goal event detection using an external textual source, typically reports from sports websites, has been proposed. Instead of segmenting the actual video based on visual and aural content [73], this method claims to be more efficient since noneventful segments are discarded. The method seems very promising and can be applied to any sport with live text coverage on websites. A language-independent, multistage classification approach is employed for the detection of key acoustic events in [7]. The method has been applied to rugby. Though it is similar to most approaches using audio features, it differs in treating audio events independently of language. A hybrid approach based on learning and non-learning methods has been proposed in [34] to automatically summarize sports video. The key events are goals, fouls, shots, etc. An SVM-based method is applied for shot boundary detection, and a view classification algorithm is used for identifying game-field intensities and


player bounding box sizes. In [39], automatic sports video summarization has been proposed based on key events identified from replays. As shown in Fig. 5, the factor that influences this work is the replay. The frames corresponding to replays are enclosed between gradual transitions, and a thresholding-based approach is employed to detect these transitions. For each key event, a motion history image is generated by applying a Gaussian mixture model. A trained extreme learning machine (ELM) classifier is used to learn the various events for labeling key events, detecting replays, and generating the game summary. The authors applied this method to four different sports covering 20 videos. In contrast to in-game event detection, a method has been proposed by [41] to classify crowd events, labeling video content as marriage, cricket, shopping mall, or Jallikattu. The method applies a deep CNN that learns features from the training data; however, it is limited to classifying events into labeled outcomes. Interestingly, a more customizable highlight generation method is proposed in [48]: videos are divided into semantic slots, and importance-based event selection is employed to include the important events in the highlights. The authors considered cricket videos for highlight generation. Again, the work proposed in [75] exploits audio intensity in addition to replays, player celebrations, and playfield scenarios as key events. Further, player stroke segmentation [26-29] and compilation in cricket can be used for highlight generation specific to a player in a match. A more general analysis of various computer vision systems from a soccer video semantics point of view is given in [81]; the interpretation of the scene was based on the complexity of the semantics. That work investigates and analyzes various approaches.
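The dual-threshold idea used to find the gradual transitions that bracket replays can be illustrated with a twin-comparison sketch: a high threshold catches hard cuts, while runs of moderate frame-to-frame change are accumulated and declared a gradual transition if their total change is cut-sized. The difference measure (mean absolute pixel difference) and both thresholds below are assumed for the demo, not taken from [39].

```python
import numpy as np

def frame_diff(a, b):
    """Mean absolute pixel difference, normalized to [0, 1]."""
    return np.abs(a - b).mean() / 255.0

def detect_transitions(frames, t_high=0.5, t_low=0.1):
    """Dual (twin-comparison) thresholds: cuts and accumulated gradual changes."""
    diffs = [frame_diff(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
    cuts, graduals = [], []
    i = 0
    while i < len(diffs):
        if diffs[i] > t_high:                 # abrupt cut between frames i, i+1
            cuts.append(i + 1)
            i += 1
        elif diffs[i] > t_low:                # candidate gradual transition
            start, acc = i, 0.0
            while i < len(diffs) and t_low < diffs[i] <= t_high:
                acc += diffs[i]
                i += 1
            if acc > t_high:                  # total change comparable to a cut
                graduals.append((start + 1, i))
        else:
            i += 1
    return cuts, graduals

# Synthetic demo: static shot, hard cut to white, then a slow fade to dark.
f_black = np.zeros((16, 16))
f_white = np.full((16, 16), 255.0)
fade = [np.full((16, 16), float(v)) for v in (215, 175, 135, 95, 55, 15)]
frames = [f_black, f_black, f_black, f_white, f_white] + fade
cuts, graduals = detect_transitions(frames)
```

On the demo sequence this reports one hard cut and one gradual-transition interval spanning the fade.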

2.5 Object Detection

Object detection is one of the most important computer vision tasks applied in video summarization. Techniques for detecting the objects in an image or a frame have gone through remarkable breakthroughs, so it is important to understand both their evolution and the state of the art. This covers a wide range of techniques, from simple histogram-based approaches to computationally intensive deep learning. The techniques that evolved over two decades address challenges [96] in object detection that include, but are not limited to, objects under different viewpoints, illuminations, and intraclass variations; object rotation and scale changes; accurate object localization; dense and occluded object detection; and detection speed-up. Figure 6 shows the predominant object detection techniques (object detectors) that evolved over two decades, including the latest developments in 2021. Between 2000 and 2011, that is, before the rebirth of deep convolutional neural networks, more subtle and robust techniques were applied to detect the objects present in a frame or an image; these are referred to as traditional object detectors. With limited computing resources, researchers made remarkable contributions to detecting objects


Fig. 6 Evolution of object detectors in two decades

based on handcrafted features. Between 2001 and 2004, Viola and Jones achieved real-time detection of human faces on a Pentium III CPU [87]. This detector, named the VJ detector, works with the sliding-window concept. The VJ detector improved detection performance and reduced computation overhead through the integral image (which accelerates Haar wavelet features), feature selection with the AdaBoost algorithm, and detection cascades, a multistage detection paradigm that spends more computation on face targets than on background windows [88, 96]. This approach can certainly contribute to player detection in any sports video. In 2005, the histogram of oriented gradients (HOG) detector was proposed by Dalal and Triggs [11]. It was another important milestone as it balances feature invariance. To detect objects of various sizes, the HOG detector rescales the input frame or image multiple times while keeping the detection window size the same. The HOG detector became one of the important object detectors used in various computer vision applications, including sports video processing. Between 2008 and 2012, the Deformable Part-based Model (DPM) and its variants were the peak of the traditional object detector era. DPM was proposed by Felzenszwalb in 2008 [18] as an extension of HOG, and its variants were later proposed by Girshick. DPM uses a divide-and-conquer approach: the model is learned by decomposing an object, and detection ensembles the decomposed object parts to form the complete object. The model comprises a root filter and many part filters. It has been further enriched [16, 17, 21, 22] to deal with real-world objects with significant variations. A weakly supervised learning method was developed in DPM to learn all the configurations of part filters as latent variables.
This has been further formulated as multi-instance learning, and important techniques such as hard negative mining, bounding box regression, and context priming were also applied to improve detection performance [16, 21].
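To make the integral-image trick behind the VJ detector concrete, here is a minimal generic sketch (a summed-area table, not the authors' implementation): after one pass of cumulative sums, the sum of any rectangle, and hence any Haar-like feature, costs only four table lookups. The window coordinates in the demo are hypothetical.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row/column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of any rectangle in O(1) using four table lookups."""
    b, r = top + height, left + width
    return ii[b, r] - ii[top, r] - ii[b, left] + ii[top, left]

img = np.arange(25).reshape(5, 5)
ii = integral_image(img)
# Haar-like two-rectangle feature: left half minus right half of a window.
left_half = box_sum(ii, 1, 1, 3, 2)    # rows 1-3, cols 1-2
right_half = box_sum(ii, 1, 3, 3, 2)   # rows 1-3, cols 3-4
feature = left_half - right_half
```

Because every rectangle sum is constant-time, thousands of Haar features can be evaluated per sliding window, which is what made real-time cascades feasible on 2001-era hardware.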


From 2012 onwards, the deep learning era began with the ability to learn high-level feature representations of an image [50], enabled by the availability of the necessary computational resources. The region-based CNN (RCNN) [24] was then proposed and became breakthrough research in object detection with deep learning models. In this era, there are two genres of object detection: two-stage detection, with a coarse-to-fine process, and one-stage detection, which completes the process in a single step [96]. RCNN extracts a set of object proposals by selective search; each proposal is rescaled to a fixed-size image and fed to a CNN model trained on AlexNet [50] to extract features; finally, linear SVM classifiers predict the presence of an object within each region. RCNN achieved a significant performance improvement over DPM. Despite this, it suffered from redundant feature computations on many overlapping proposals, which led to slow detection even with a GPU. In the same year, the Spatial Pyramid Pooling Network (SPPNet) [33] was proposed to overcome this drawback. Earlier CNN models required a fixed-size input, for example, a 224 x 224 image for AlexNet [51]. In SPPNet, the spatial pyramid pooling layer generates a fixed-length representation regardless of the size of the image or region of interest, without rescaling it. SPPNet proved to be more than twenty times faster than RCNN by avoiding redundant computation of the convolutional features. Though SPPNet improved detection speed, training was still multistage, and it fine-tuned only its fully connected (FC) layers. In 2015, Fast RCNN [23] was proposed to overcome the drawbacks of SPPNet. Fast RCNN trains a detector and a bounding box regressor simultaneously under the same network configuration, making detection over 200 times faster than RCNN.
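Proposal-based detectors of this family rank candidate regions by score and prune heavy overlaps. The following is a generic sketch of intersection-over-union (IoU) and greedy non-maximum suppression, the standard post-processing shared by these detectors; it is not tied to any particular paper reviewed here, and the demo boxes are invented.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0]); iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2]); iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression: keep the best box, drop heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, scores=[0.9, 0.8, 0.7])   # two overlapping boxes collapse to one
```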
Though detection speed improved, it was still limited by proposal detection. This led to Faster RCNN [71], where the object proposals themselves are generated by a CNN. Faster RCNN was the first end-to-end, near real-time object detector. Even though Faster RCNN overcame the drawback of Fast RCNN, there was still computational redundancy, which led to further developments, namely RFCN [10] and Light-Head RCNN [56]. In 2017, the Feature Pyramid Network (FPN) [57] was proposed. It uses a top-down architecture to build high-level semantics at all scales and has become a basic building block of the latest detectors. Meanwhile, in 2015, You Only Look Once (YOLO) was proposed, the first one-stage detector of the deep learning era. It follows an entirely different approach from the previous models: a single neural network is applied to the full image, which is divided into regions, and bounding boxes and class probabilities are predicted for each region simultaneously. Despite its speed, YOLO suffers a drop in localization accuracy compared with two-stage detectors, especially for small objects. Based on the initial model, a series of improvements [8, 68-70] further raised both detection accuracy and speed. Almost at the same time, the Single Shot MultiBox Detector (SSD) [89], a second one-stage detector, evolved. The main contribution of SSD was to introduce multi-reference and multi-resolution detection


techniques that significantly improved the detection accuracy for small objects. Despite their high speed and simplicity, one-stage detectors lagged behind the accuracy of two-stage detectors, and hence RetinaNet [58] was proposed. RetinaNet addresses the foreground-background class imbalance by introducing a new loss function called focal loss, which reshapes the standard cross-entropy loss to put more focus on hard, misclassified samples during training. With focal loss, RetinaNet achieved accuracy comparable to two-stage detectors while maintaining very high detection speed. Deep learning models continue to evolve [55, 77, 94, 95], considering both detection accuracy and speed. From these object detectors developed over the last two decades, and especially since the rebirth of deep learning models, it is now possible to choose and apply appropriate detectors to solve most computer vision problems, including sports video summarization.
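The focal loss reshaping can be written out in a few lines. This is the standard binary form of the loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); the example probabilities are illustrative.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a predicted probability p of the label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p          # probability of the true class
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9, 1)   # well-classified positive: loss heavily down-weighted
hard = focal_loss(0.1, 1)   # misclassified positive: loss stays large
```

The modulating factor (1 - p_t)^gamma is what suppresses the flood of easy background examples: with gamma = 0 and alpha = 1 the expression reduces exactly to the ordinary cross-entropy -log(p_t).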

2.6 Performance Metrics

Most of the research works in sports video summarization have used the following objective metrics to evaluate the constructed models' performance.

2.6.1 Objective Metrics

1. Accuracy: the ratio of correctly labeled replay/non-replay frames, or key-event/non-key-event frames, to the total number of frames or events [36, 47, 50, 60].

   Accuracy = (TP + TN) / (P + N)

   where TP = true positives, TN = true negatives, P = total positives, and N = total negatives.

2. Error: the ratio of mislabeled replay frames (both FP and FN) to the total number of frames [36, 47].

   Error = (FP + FN) / (P + N)

   where FP = false positives and FN = false negatives.

3. Precision: the ratio of correctly labeled frames to the total number of detected frames [4, 32, 34, 60, 90, 93].

   Precision = TP / (TP + FP)


4. Recall: the ratio of truly detected frames to the actual number of relevant frames [4, 32, 34, 59, 90, 93].

   Recall = TP / (TP + FN)

5. F1-Score: the harmonic mean of precision and recall. It is computed because some methods have higher precision and lower recall, or vice versa [35, 46, 54].

   F1-Score = 2 * (Precision * Recall) / (Precision + Recall)

6. Confusion Matrix (CM): tabulates predicted positives and negatives against the actual positives and negatives in the chosen dataset. This is a highly recommended model evaluation metric in the literature. In [32], a CM with goal, foul, shoot, and non-highlight events was used to compute precision and recall percentages.

7. Receiver Operating Characteristic (ROC) curve: plots the True Positive Rate (TPR) against the False Positive Rate (FPR). ROC curves are desirable when evaluating binary decision problems, as they show how the number of correctly classified positive examples varies with the number of incorrectly classified negative examples [12]. As the FPR increases (i.e., more non-highlight plays are allowed to be classified incorrectly), it is desirable that the TPR increases as quickly as possible (i.e., the slope of the ROC curve is high) [19].

Other than the above objective metrics, which were predominantly used in the last two decades, the following user-experience (subjective) metrics were also used as alternative performance evaluation metrics in many sports video summarization works.

2.6.2 Subjective Metrics Based on User Experience

1. The quality of each summary is evaluated on seven levels: extremely good, good, upper average, average, below average, bad, and extremely bad [64].
2. Mean Opinion Score (MOS), considering the following user experience ratings: (i) the overall highlights viewing experience is enjoyable, entertaining, and pleasant, and not marred by unexciting scenes; (ii) the generated scenes do not begin or end abruptly; and (iii) the scenes are acoustically and/or visually exciting.
3. Human- vs. system-detected shots (closeup, crowd, replay, sixer) [5].
4. Normalized discounted cumulative gain (nDCG), a standard retrieval measure computed as follows:

   nDCG(k) = (1/Z) * sum_{i=1..k} (2^rel_i - 1) / log2(i + 1)

   where rel_i is the relevance score assigned by the users to clip_i and Z is a normalization factor ensuring that the perfect ranking produces an nDCG score of 1.
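The nDCG computation can be sketched as follows, taking Z as the DCG of the ideal (relevance-sorted) ranking; the relevance scores in the example are invented.

```python
import math

def ndcg(relevances, k):
    """nDCG(k) for user-assigned relevance scores of the ranked clips."""
    def dcg(rels):
        # enumerate() is 0-based, so position i maps to log2(i + 2)
        return sum((2 ** r - 1) / math.log2(i + 2)
                   for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

score = ndcg([3, 2, 0, 1], k=4)     # clip 3 and clip 4 are ranked in the wrong order
perfect = ndcg([3, 2, 1, 0], k=4)   # relevance-sorted ranking scores exactly 1.0
```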

3 Evolution of Ideas, Algorithms, and Methods for Sports Video Summarization

In this section, the evolution of video summarization ideas, algorithms, and methods (learning and non-learning) over two decades is reviewed for two popular sports, cricket and soccer, and summarized in Tables 1, 2, and 3, respectively. In addition, a quick glimpse of the ideas, algorithms, and methods proposed for other sports (Table 4), including rugby, tennis, baseball, basketball, volleyball, football, golf, snooker, handball, hockey, and ice hockey, is also provided. In a nutshell, most of the algorithms are based on feature extraction and key event detection, and most of the methods can be classified as learning or non-learning. Non-learning methods are mostly used in preprocessing, key event detection, and highlight generation, whereas learning methods are used in the shot boundary classification, shot view classification, and feature extraction stages. Notably, in the last decade (2012 to present), that is, since the rebirth of the convolutional neural network (CNN), almost all stages of sports video summarization or highlight generation are handled efficiently by deep learning models and their variants.

4 Scope for Future Research in Video Summarization

Some of the works [34, 63, 74] specific to cricket and soccer video summarization have several weaknesses or leave issues unaddressed. In this section, the weaknesses and the scope for future research are discussed based on the outcomes of the selected papers. Section 4.1 groups the weaknesses into categories, and Sect. 4.2 highlights the scope for further research in sports video summarization.


Table 1 Ideas that evolved over a period in sports video summarization

Study | Year | Major idea | Type of sports
[14] | 2003 | Framework based on cinematic and object-based features | Cricket
[48] | 2006 | Extracted events and semantic concepts | Cricket
[49] | 2008 | Caption analysis and audio energy level based | Cricket
[62] | 2009 | HMM-based approach | Cricket
[53] | 2010 | Priority curve based | Cricket
[92] | 2010 | Time-frequency feature extraction to detect excitement scenes in audio cues | Cricket
[9] | 2010 | Personalized summarization of soccer sport using both audio and visual cues | Cricket
[80] | 2010 | Excited commentator speech detection with unsupervised model adaptation | Cricket
[54] | 2011 | Automated highlight generation | Cricket
[80] | 2011 | Unsupervised event detection framework | Cricket
[93] | 2011 | Machine learning based | Soccer
[90] | 2011 | Logo detection and replay sequence | Soccer
[6] | 2012 | SVM-based shot classification | Soccer
[46] | 2013 | Framework in encoded MPEG video | Soccer
[42] | 2013 | Dominant color-based extraction of key frames | Cricket
[4] | 2013 | Framework for goal event detection through collaborative multimodal (textual, visual, and aural) analysis | Cricket, soccer, rugby, football, and tennis
[63] | 2014 | Cinematography and motion analysis | Soccer
[72] | 2015 | Annotation of cricket videos | Cricket
[7] | 2015 | Key acoustic events detection | Rugby
[30] | 2015 | Automatic summarization of hockey video | Soccer
[5] | 2015 | Multilevel hierarchical framework to generate highlights by interactive user input | Soccer
[47] | 2015 | Bayesian network based | Soccer
[66] | 2015 | Audiovisual descriptor based | Soccer
[39] | 2016 | Learning and non-learning-based approach | Cricket
[43] | 2017 | Real-time classification | Soccer
[78] | 2017 | Parameterized approach with end-to-end human-assisted pipeline | Soccer
[15] | 2017 | Hidden-to-observable transferring Markov model | Soccer
[37] | 2017 | Court aware CNN-based classification | Volleyball
[19] | 2018 | CNN-based classification of cricket shots | Cricket
[31] | 2018 | CNN-based approach to detect no balls | Cricket
[75] | 2018 | Event-driven and excitement based | Cricket
[82] | 2018 | Deep players' action recognition features | Soccer
(continued)


Table 1 (continued)

Study | Year | Major idea | Type of sports
[27] | 2019 | Cricket stroke dataset creation | Cricket
[52] | 2019 | Outcome classification in cricket using deep learning | Cricket
[36] | 2019 | Classify bowlers | Cricket
[60] | 2019 | AlexNet CNN-based approach | Soccer
[41] | 2019 | CNN-based crowd event detection | Cricket
[35] | 2019 | Decomposed audio information | Soccer
[40] | 2019 | Confined elliptical local ternary patterns and extreme learning machine | Cricket, tennis, baseball, and basketball
[59] | 2019 | Multimodal excitement features | Golf and tennis
[26] | 2020 | Cricket stroke localization | Cricket
[64] | 2020 | Transfer learning for scene classification | Soccer
[34] | 2020 | Hybrid approach | Soccer
[45] | 2020 | Content-aware summarization (audiovisual approach) | Cricket, soccer, rugby, basketball, baseball, football, tennis, snooker, handball, hockey, ice hockey, and volleyball
[32] | 2020 | Multimodal multi-labeled extraction | Soccer

4.1 Common Weaknesses of Existing Methods

In this section, the existing methods and their weaknesses are highlighted under the categories below.

4.1.1 Audio-Based Methods

– Audience or spectator excitement may add noise to the commentator's speech. The audience noise is sometimes mixed with instrument sounds, or its spectrum may stay high for longer than a specific interval; the commentator's voice may also be masked by audience cheering. This is a considerable challenge to be addressed.
– Some methods depend on speech-to-text, whose accuracy is directly tied to Google or similar APIs. Other services such as MS Cognitive Services or IBM Watson are not attempted in the reviewed works, and natural language processing to extract the players' conversations is also not addressed.
– Many of the methods do not state the exact number of training and testing samples used when audio features are considered.


Table 2 Notable research work in cricket sport video summarization

Study | Algorithms | Methods | Output
[36] | 1. Transfer learning | 1. Pretrained VGG16 to build the classifier | 1. Classify the bowlers based on action 2. Created a dataset containing bowlers
[13] | 1. Dominant color region detection 2. Robust shot boundary detection 3. Shot classification 4. Goal detection 5. Referee detection 6. Penalty box detection | Naïve Bayes classifier (learning method) for shot classification; non-learning methods for the other algorithms | 1. All slow-motion segments in a game based on cinematic features 2. All goals in a game based on cinematic features 3. Slow-motion segments classified according to object-based features
[31] | 1. Transfer learning 2. Video resizing | 1. Inception V3 CNN for transfer learning 2. SoftMax activation function and SVM used for high-level reasoning | Classified results as "no ball" or not
[47] | 1. Hierarchical feature-based classifier 2. Finding concept and event rank | 1. Semantic base rule 2. Importance ordering and video pruning of concepts 3. Importance ordering and video pruning of events 4. Temporal ordering of events and concepts | Summarized video based on the events and concepts
[27] | 1. Temporal localized stroke segmentation 2. Shot boundary detection 3. Cut predictions | 1. Grayscale histogram difference feature to detect shot boundaries 2. Cut predictions using random forest 3. SVM for finding the first frames 4. Machine learning algorithm to extract video shots from first-frame HOG features | A dataset of videos containing strokes played by batsmen
[48] | 1. Sum of absolute difference model for caption recognition 2. Short-time zero crossing for estimating spectral properties of audio | 1. Caption recognition 2. Event detection for excitement clips 3. Performance measure of event detection and caption recognition | Summarized video using the events and captions
[62] | 1. Hidden Markov model for state transition | 1. Shot boundary detection and key frame extraction based on color changes 2. Shot classification using view classification probabilities | Summarized video based on the features of color and excitement
(continued)


Table 2 (continued)

Study | Algorithms | Methods | Output
[93] | 1. Acoustic feature extraction 2. Highlight scene detection | Mel bank filtering and local autocorrelation on complex Fourier values (non-learning methods); complex subspace method (unsupervised learning) | Highlight scene generation
[75] | 1. Event detection 2. Video shot segmentation 3. Replay detection 4. Scoreboard detection 5. Playfield scenario detection | 1. Frame difference for video shots 2. CNN + SVM framework for replay detection 3. Pretrained AlexNet for OCR 4. CNN + SVM for classifying frames 5. Audio cues for excitement detection 6. AlexNet for player celebration | Video summary with important events
[9] | 1. Highlighted moment detection through audio cues 2. Shot (or clip) boundary detection/video segmentation 3. Sub-summaries detection | 1. Hot spot/special moment detection based on two acoustic features 2. View type subsequence matching 3. Lagrangian optimization and convex-hull approximation (non-learning methods) | Resource-constrained summarization based on the user's narrative preference
[80] | 1. Excited speech segmentation through pretrained pitched speech segments 2. Excited speech detection | 1. Gaussian mixture models 2. Unsupervised model adaptation (average log-likelihood ratio score) 3. Maximum a posteriori adaptation (learning methods) | Event highlights generated based on excited speech score
[81] | 1. Event detection 2. Highlight clip detection | 1. Unsupervised event discovery based on color histograms of oriented gradients 2. Supervised phase trains an SVM from clips labeled as highlight or non-highlight | Video highlights of cricket
[10] | 1. Shot boundary detection 2. Video segmentation 3. Candidate sub-summaries preparation 4. Metadata extraction | 1. Computation of the convex hull of the benefit/cost curve of each segment 2. Lagrangian relaxation (non-learning methods) | Collection of nonoverlapping sub-summaries under the given user preferences and duration constraint
(continued)


Table 2 (continued)

Study [54]
Algorithms: 1. Pitch segmentation using K-means. 2. SVM classifier to recognize digits. 3. Finite state automaton model based on semantic rules.
Methods: 1. Temporal segmentation to detect boundaries and wickets. 2. Replay detection using Hough transform-based tracking. 3. Ad detection using transitions. 4. Camera motion using the KLT method. 5. Scene change using hard-cut detection. 6. Crowd view detection using textures. 7. Boundary view detection using field segmentation.
Output: Match summarization with semantic results such as batting, bowling, boundary, etc.

Study [39]
Algorithms: 1. Excitement detection. 2. Key events detection. 3. Decision tree for video summarization.
Methods: 1. Rule-based induction to find excited clips. 2. Score caption region using temporal image averaging. 3. OCR to recognize the characters. 4. Gradual transition detection by a dual-threshold-based method.
Output: Summarized video with key events.

Study [41]
Algorithms: Event recognition.
Methods: CNN (baseline and VGG16) to detect predefined events.
Output: Classification of crowd video into four classes: marriage, cricket, Jallikattu, and shopping mall.

Study [42]
Algorithms: Playfield and non-playfield detection.
Methods: 1. Color histogram analysis. 2. Extraction of dominant color frames by thresholding hue values (non-learning methods).
Output: Extracted key frames.

Study [26]
Algorithms: 1. Construction of two learning-based localization pipelines. 2. Boundary detection.
Methods: 1. Pretrained C3D model with GRU training. 2. Boundary detection with first-frame classification. 3. Modified weighted mean TIoU for the single-category temporal localization problem.
Output: Two cricket stroke datasets.

Study [72]
Algorithms: 1. Video shot recognition. 2. Shot classification. 3. Text classification.
Methods: 1. K-means clustering to build a visual vocabulary. 2. Shot representation by bag of words. 3. Classification using a multiclass kernel SVM. 4. Linear SVM for the bowler and batsman categories.
Output: Annotated video clips containing events of interest.
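Several of the entries above rest on the same primitive: thresholding hue values to isolate the dominant playfield color ([42], and the grass pixel ratio used elsewhere in this survey). A minimal sketch of that idea, assuming an RGB frame as a NumPy array; the green hue band and its limits are illustrative choices, not values from the surveyed papers:

```python
import numpy as np

def playfield_ratio(frame_rgb, hue_lo=60, hue_hi=170):
    """Fraction of pixels whose hue falls in a 'grass green' band.

    frame_rgb: H x W x 3 uint8 array. Hue is computed in degrees [0, 360).
    The band (hue_lo, hue_hi) is an illustrative assumption.
    """
    rgb = frame_rgb.astype(np.float32) / 255.0
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    mx = rgb.max(axis=-1)
    mn = rgb.min(axis=-1)
    delta = mx - mn
    hue = np.zeros_like(mx)
    nz = delta > 1e-6  # achromatic pixels have no meaningful hue
    # Standard RGB->HSV hue formula, split by dominant channel.
    rmax = nz & (mx == r)
    gmax = nz & (mx == g) & ~rmax
    bmax = nz & ~rmax & ~gmax
    hue[rmax] = (60 * ((g - b)[rmax] / delta[rmax])) % 360
    hue[gmax] = 60 * ((b - r)[gmax] / delta[gmax]) + 120
    hue[bmax] = 60 * ((r - g)[bmax] / delta[bmax]) + 240
    mask = (hue > hue_lo) & (hue < hue_hi) & nz
    return mask.mean()

# A frame that is entirely green scores 1.0.
green = np.zeros((10, 10, 3), dtype=np.uint8)
green[..., 1] = 200
print(playfield_ratio(green))  # -> 1.0
```

A frame-level ratio like this is typically thresholded to label a frame "playfield" versus "non-playfield", or tracked over time as a cue for boundary and long-shot views.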


Table 2 (continued)

Study [52]
Algorithms: 1. Jittered sampling. 2. Temporal augmentation. 3. Training and hyperparameter tuning.
Methods: 1. Pretrained VGGNet on the ImageNet dataset for transfer learning. 2. LRCN to classify the ball-by-ball activities.
Output: Automatically generated commentary; classifies the outcome of each ball as run, dot, boundary, or wicket.

Study [19]
Algorithms: Video classification of cricket shots.
Methods: 1. Adam optimizer for training the model. 2. CNN model with 13 layers for classification.
Output: Classified shots of cricket video belonging to cut shot, cover drive, straight drive, pull shot, leg glance shot, and scoop shot.

Study [53]
Algorithms: 1. Play-break detection. 2. Event detection through visual, audio, and text cues. 3. Peak detection (finding similar events).
Methods: 1. Block creation. 2. Thresholding the duration between continuous long shots. 3. Detecting low-level features (occurrence of replay scenes, excited audience or commentator speech, certain camera motions or highlight-related sounds, and crowd excitement). 4. Grass pixel ratio to detect the boundaries in cricket. 5. Audio feature extraction: root mean square volume, zero crossing rate, pitch period, frequency centroid, frequency bandwidth, and energy ratio. 6. Priority assignment (non-learning methods). 7. SVM to identify text lines in frames. 8. Optical Character Recognition (OCR) for text recognition (learning methods).
Output: Summarized video based on priority-driven block merging.

4.1.2 Shot and Boundary Detection

– The motivation for choosing some of the core classifiers, such as the HRFDBN in [74] for labeling each shot and the RF classifier for dividing the shots, is not given clearly.
– Umpire jerseys and their colors are key elements in detecting the umpire frames. Though this appears to be a straightforward approach, there is no discussion of the challenges faced while segmenting the frames, for example, the color variation of jerseys under different light intensities [74].
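The jersey-color concern above is partly why these pipelines threshold hue rather than raw RGB: hue is invariant to uniform changes in illumination intensity, while RGB values (and HSV value) are not. A small stdlib sketch of this property, using a hypothetical umpire-jersey blue as the example color:

```python
import colorsys

def hue_deg(r, g, b):
    """Hue in degrees for 8-bit RGB, via the stdlib colorsys conversion."""
    h, _s, _v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return h * 360

# A hypothetical umpire-jersey blue, seen in bright and in dim light.
bright = (30, 60, 200)
dim = (15, 30, 100)  # the same surface at half the illumination
print(round(hue_deg(*bright), 2), round(hue_deg(*dim), 2))  # -> 229.41 229.41
```

The hue channel is stable under this brightness scaling even though every RGB component halved; it still drifts under *colored* lighting, which is the unaddressed failure mode the bullet points at.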


Table 3 Notable research work in soccer sport video summarization

Study [5]
Algorithms: For the video clip: 1. Key frame detection. 2. Frame extraction. 3. Replay frame detection. 4. Event frame detection. 5. Crowd frame detection. 6. Close-up frame detection. 7. Sixer detection.
Methods: 1. RGB color histogram for frame boundary detection. 2. RGB-to-HSV conversion for key frame detection. 3. Visual features (grass pixel ratio, edge pixel ratio, skin color ratio) for shot classification. 4. Haar wavelets for close-up detection. 5. Edge detection on the YCbCr-converted frame for crowd detection. 6. Black pixel percentage measure to detect sixers. 7. Sliding window for event detection (non-learning methods).
Output: Detected features and events (close-up, replay, crowd, and sixer) on 292 frames from a 4 min 52 s video.

Study [6]
Algorithms: 1. Dominant color extraction. 2. Feature extraction from connected components (players), a middle rectangle with vertical strips, and two horizontal strips.
Methods: 1. Features extracted using non-learning methods. 2. Hierarchy of SVM classifiers (learning method).
Output: Classified long, medium, and infield and outfield close-up shots.

Study [46]
Algorithms: 1. Low-level visual information extraction. 2. Grass modeling and detection. 3. Camera motion estimation. 4. Shot boundary detection: (a) abrupt transition detection, (b) dissolve transition detection, (c) logo transition modeling and detection. 5. Shot-type classification. 6. Playfield zone classification. 7. Replay detection. 8. Audio analysis. 9. Ranking.
Methods: SVM classifier (learning method).
Output: Summarized MPEG-1 video.
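The RGB color histogram comparison used for frame boundary detection in [5] is the classic hard-cut detector: when the histogram distance between adjacent frames spikes, a shot boundary is declared. A minimal sketch over grey-level frames; the bin count and threshold are illustrative assumptions, not values from the paper:

```python
import numpy as np

def hist_diff(f1, f2, bins=16):
    """L1 distance between normalized grey-level histograms of two frames."""
    h1, _ = np.histogram(f1, bins=bins, range=(0, 256))
    h2, _ = np.histogram(f2, bins=bins, range=(0, 256))
    return np.abs(h1 / h1.sum() - h2 / h2.sum()).sum()

def detect_cuts(frames, threshold=0.5):
    """Indices i where frames[i-1] -> frames[i] looks like a hard cut."""
    return [i for i in range(1, len(frames))
            if hist_diff(frames[i - 1], frames[i]) > threshold]

# Five dark frames followed by five bright frames: one cut at index 5.
dark = np.full((8, 8), 10, dtype=np.uint8)
bright = np.full((8, 8), 240, dtype=np.uint8)
frames = [dark] * 5 + [bright] * 5
print(detect_cuts(frames))  # -> [5]
```

Histogram differencing is cheap and rotation/motion tolerant, which is why it keeps reappearing across the surveyed pipelines, but a fixed global threshold is exactly what the gradual-transition and dual-threshold methods in Table 2 try to improve on.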


Table 3 (continued)

Study [47]
Algorithms: 1. Extraction of exciting clips from audio cues using a short-time audio energy algorithm. 2. Event detection and classification (annotation) using a hierarchical tree. 3. Exciting clip selection. 4. Temporal ordering of the selected exciting clips.
Methods: 1. Non-learning methods to detect different views. 2. Bayes belief network to assign semantic concept labels (goals, saves, yellow cards, red cards, and kicks) to the exciting clips in the video sequence (learning method).
Output: Generated highlights from the selected labeled clips based on their degree of importance.

Study [43]
Algorithms: Scene classification.
Methods: 1. Radial basis decompositions of a color address space followed by Gabor wavelets in frequency space. 2. The resulting features are used to train an SVM classifier.
Output: Real-time video indexing and a dataset.

Study [35]
Algorithms: 1. Split audio and video. 2. Intrinsic Mode Function (IMF) extraction from the audio signal. 3. Feature extraction from the energy matrix of the signal: (a) energy level of the frame in a shot, (b) audio power increment, (c) average audio energy increment in continuous shots, (d) whistle detector.
Methods: 1. Empirical Mode Decomposition (EMD) to filter the noise and extract audio. 2. Non-learning methods to extract features, compute shot scores, and generate the summary.
Output: Highlights generated based on user input.

Study [66]
Algorithms: 1. Video shot segmentation. 2. MPEG-7-based audio descriptor. 3. Whistle detector. 4. MPEG-7 motion descriptor. 5. MPEG-7 color descriptor. 6. Replay detector. 7. Persons detector. 8. Long-shot detector. 9. Zooms detector.
Methods: 1. Viola-Jones AdaBoost method with a skin filter for face detection (learning method). 2. The other algorithms use non-learning methods, such as the Discrete Fourier Transform for whistle detection.
Output: Detected events (goals, shots on goal, shots off goal, red card, yellow card, penalty decision).
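The short-time audio energy idea behind [47] (and the energy features of [35]) reduces to windowing the waveform and flagging windows whose energy stands well above the rest of the clip. A sketch under assumed window, hop, and threshold parameters (the papers' actual values are not given here):

```python
import numpy as np

def short_time_energy(signal, win=1024, hop=512):
    """Mean-square energy of each hop-spaced analysis window."""
    return np.array([np.mean(signal[s:s + win] ** 2)
                     for s in range(0, len(signal) - win + 1, hop)])

def exciting_windows(signal, win=1024, hop=512, k=2.0):
    """Indices of windows whose energy exceeds k times the clip median;
    k is an illustrative excitement threshold."""
    e = short_time_energy(signal, win, hop)
    return np.where(e > k * np.median(e))[0]

# Quiet commentary with one loud 'cheer' burst at samples 4096-5119.
audio = np.full(8192, 0.01)
audio[4096:5120] = 1.0
print(exciting_windows(audio))  # -> [7 8 9]
```

The flagged window indices map back to timestamps, which is what drives exciting-clip selection before the semantic labeling stage.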


Table 3 (continued)

Study [93]
Algorithms: 1. Shot boundary detection. 2. Shot-type and play-break classification. 3. Replay detection. 4. Scoreboard detection. 5. Excitement event detection. 6. Logo-based event detection. 7. Audio loudness detection.
Methods: 1. SVM and NN (replay and scoreboard). 2. K-means. 3. Hough transform (vertical goal post detection). 4. Gabor filter (goal net). 5. Volume of each audio frame, subharmonic-to-harmonic ratio-based pitch determination, and dynamic thresholds (learning and non-learning methods).
Output: Highlights of the most important events, including goals and goal attempts.

Study [78]
Algorithms: 1. Define segmentation points. 2. Replay detection. 3. Player detection and interpolation. 4. Soccer event segmentation. 5. Bin-packing to select a subset of plays based on utility from eight bins.
Methods: 1. Background subtraction using GMM for replay detection. 2. YOLO for player detection. 3. Histogram of optical flow to capture player motion (learning and non-learning methods).
Output: An output video parameterized by events over time and the user's priority list.

Study [18]
Algorithms: Video classification of cricket shots.
Methods: 1. Adam optimizer for training the model. 2. CNN model with 13 layers for classification.
Output: Classified shots of cricket video belonging to cut shot, cover drive, straight drive, pull shot, leg glance shot, and scoop shot.

Study [60]
Algorithms: Shot classification.
Methods: AlexNet CNN for shot classification.
Output: Classified shots of sports video into classes such as close, crowd, long, and medium shots.

Study [64]
Algorithms: Scene classification.
Methods: Pretrained AlexNet CNN.
Output: Classified shots into batting, bowling, boundary, crowd, and close-up.

Study [90]
Algorithms: 1. Detect a candidate set for the logo template. 2. Find the logo template from the candidate set. 3. Match the logo (pair logos for replay detection).
Methods: 1. Difference and accumulated difference in a window of 20 frames. 2. K-means clustering to find the exact logo template. 3. Adaptive criterion: frame difference and mean intensity of the current frame against those of the logo template (learning and non-learning methods).
Output: Detected replays from the given video sequence.
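The first stage of [90] — accumulating inter-frame differences over a 20-frame window to nominate logo-transition candidates — can be sketched as follows. The 20-frame window mirrors the paper's reported window; the mean-based threshold `k` is an illustrative assumption, and the later K-means template step is not shown:

```python
import numpy as np

def logo_candidates(frames, win=20, k=3.0):
    """Window positions whose accumulated inter-frame difference is well
    above the clip average: candidate logo-transition segments to cluster
    (e.g. with K-means) in a later stage."""
    diffs = np.array([np.mean(np.abs(frames[i].astype(np.int16)
                                     - frames[i - 1].astype(np.int16)))
                      for i in range(1, len(frames))])
    acc = np.convolve(diffs, np.ones(win), mode="valid")  # windowed sums
    return np.where(acc > k * acc.mean())[0]

# 200 static frames with a brief logo-wipe-like burst around frame 100.
frames = [np.full((8, 8), 50, dtype=np.uint8) for _ in range(200)]
for i in range(100, 104):
    frames[i] = np.full((8, 8), 200, dtype=np.uint8)
cands = logo_candidates(frames)
```

Candidate windows then get matched pairwise, since broadcast replays are bracketed by two occurrences of the same logo wipe.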


Table 3 (continued)

Study [82]
Algorithms: 1. Video segmentation. 2. Highlight classification.
Methods: 1. Two-stream deep neural network (a holistic feature stream using a 2D CNN and a body-joint stream using a 3D CNN), trained from the lower layers to the top layers on a UGSV summarization dataset. 2. LSTM for highlight classification.
Output: Summary of user-generated sports video (UGSV).

Study [45]
Algorithms: 1. Scorebox detection (binary map hole-filling algorithm). 2. OCR to recognize text in the scorebox. 3. Parse-and-clean algorithm to recognize text clearly from the text region. 4. Audio feature extraction. 5. Key frame detection (start- and end-frame estimation algorithm).
Methods: 1. Nonoverlapping sliding-window operation on frame pairs for scorebox detection. 2. OCR using a deep CNN with 25 layers. 3. Butterworth band-pass filter and Savitzky-Golay smoothing filter for audio feature extraction. 4. Speech-to-text using the Google API (both learning and non-learning methods are used).
Output: Highlights generated based on user preferences.

Study [32]
Algorithms: 1. Unimodal learning. 2. Multimodal learning. 3. Multimodal and multi-label learning.
Methods: 1. (a) Multibranch convolutional networks (the convolutional features from the input frames are merged, then a regression value is obtained); (b) 3D CNN to capture more temporal and motion information; (c) long-term recurrent convolutional networks using a pretrained CNN model to extract features. 2. Pretrained CNN features with an NN (latent features fusion, LFF) and pretrained CNN features with a deep NN (early features fusion, EFF). 3. (a) Construct a network trained for each label separately; (b) jointly train a multi-label network and extract the joint features from the last dense layer.
Output: 1. Highlights generated based on unimodal learning. 2. Highlights generated based on multimodal learning. 3. Highlights generated based on multimodal, multi-label learning.
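The EFF/LFF distinction in [32] is about *where* the modalities meet: early fusion concatenates per-modality feature vectors before any classifier, while late-style fusion combines per-modality outputs afterwards. A toy sketch with random stand-ins for pretrained-CNN visual and audio features (the dimensions are illustrative, and the score averaging shown is a common simplification — [32]'s LFF actually fuses latent features with an NN):

```python
import numpy as np

rng = np.random.default_rng(0)

def early_fusion(visual_feat, audio_feat):
    """EFF: concatenate per-modality features into one joint vector
    before the classifier sees anything."""
    return np.concatenate([visual_feat, audio_feat], axis=-1)

def late_fusion(visual_scores, audio_scores):
    """Late-style alternative: run a model per modality and merge the
    per-modality scores afterwards (here, a simple average)."""
    return (np.asarray(visual_scores) + np.asarray(audio_scores)) / 2

visual = rng.normal(size=(4, 512))  # 4 clips x 512-D visual descriptor
audio = rng.normal(size=(4, 128))   # 4 clips x 128-D audio descriptor
print(early_fusion(visual, audio).shape)  # -> (4, 640)
```

Early fusion lets the downstream network learn cross-modal interactions, at the cost of a wider input; late fusion keeps the modalities independent, which is easier when one modality is missing.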


Table 3 (continued)

Study [63]
Algorithms: 1. Shot detection: (a) shot classification, (b) replay detection. 2. Video segmentation (collection of successive shots). 3. Clip boundary calculation. 4. Interest level measure.
Methods: Non-learning methods used to identify important events from long, medium, close, and replay views.
Output: Generated video summary based on user input on the length (N clips) of the summary.

Study [34]
Algorithms: 1. Shot boundary detection. 2. Shot view classification (global view, medium view, and close-up view). 3. Replay detection. 4. Play-break detection. 5. Penalty box detection. 6. Key event detection.
Methods: 1. Linear SVM classifier. 2. Green color dominance and threshold frequencies over the player bounding box. 3. Histogram difference between logo frames. 4. Statistical measures for key event detection.
Output: Summary with and without replay segments.

– The shots are classified into only specific categories, as in [6, 9, 34, 45, 60], so the shot detection in these works is not generalized. For example, in [34], the size of the bounding box is the major parameter for deciding between medium and long shots; this may go wrong if shadows are detected as boundaries. The authors have not addressed such issues.
– The authors of [34] use an algorithm that finds three parallel lines in a frame to compute the near-goal ratio. Due to the camera angle, these lines may not appear parallel because of perspective distortion; this issue has not been addressed.
– The clip boundaries are detected using camera motion [34]. This may not be applicable to other sports where the camera keeps moving.

4.1.3 Resolution and Samples

– In most of the works, the video samples and their resolutions are assumed to be much lower than broadcast quality; the frame resolution is reported as 640 × 480 in [74]. In practice, the video resolution will not always be the same: either every video must be downsampled to a standard processing size, or the method must be flexible enough to process videos of any resolution.
– The number of samples used in methods such as [74] is significantly small, and the performance of the models is justified only on such a low number of samples. The impact on the algorithm's performance when the number of samples is increased is not specified.
– Only a few of the methods mention which sports series was used for the video samples, and the reason for choosing a specific series is not given clearly.


Table 4 Notable research work in other sports video summarization

Study [4]
Algorithms: 1. Shot boundary detection (rank tracing algorithm). 2. Shot view classification (dominant view detection, playfield region segmentation, object-size determination, and shot classification). 3. Minute-by-minute textual cues (event keyword matching, time stamp extraction, text-video synchronization, and event search localization). 4. Candidate shot list generation. 5. Candidate ranking.
Methods: Rule-based approach to classify shots (far view, close-up view) as the learning method; non-learning methods for the other algorithms.
Output: Detected goal events.

Study [7]
Algorithms: 1. Feature extraction (referee's whistle, commentators' exciting speech). 2. Multistage classification. 3. Highlight generation.
Methods: 1. Mel Frequency Cepstral Coefficients (MFCC) and their first-order delta-MFCC to represent whistle sounds and exciting speech. 2. Multistage Gaussian mixture models (GMM) (learning method) to learn and classify — Stage 1: speech versus non-speech; Stage 2: excited (from speech) or whistle (from non-speech) — using five GMM models: (a) a speech model, (b) an excited speech model, (c) an unexcited speech model, (d) a whistle model, and (e) a model to classify all other acoustic events. 3. Decision window and onset/offset determination for scenes.
Output: Generated video highlights.

Study [30]
Algorithms: 1. Shot detection. 2. Penalty corner and penalty stroke detection. 3. Umpire gesture detection. 4. Foul detection. 5. Replay and logo shot detection. 6. Goal detection.
Methods: 1. Structural Similarity Index Measure (SSIM). 2. Color segmentation and morphological operations (long shot/umpire shot). 3. Field color detection and skin color detection (close-up shot). 4. Hough transformation and morphological operations (goal post shot). Non-learning methods.
Output: Events tagged with the above event names and stitched in order of appearance, based on user preferences, to generate customized highlights.

Study [37]
Algorithms: Rally scene detection.
Methods: 1. Unsupervised shot clustering based on the HSV histogram. 2. Correlation analysis between court position and ball position. 3. Rally rank evaluation (adjusted R-squared) (learning methods).
Output: Extracted highlights from unimodal analysis integrated with multimodal analysis.

Study [40]
Algorithms: 1. Replay segment extraction. 2. Key events detection: (a) motion pattern detection, (b) feature extraction.
Methods: 1. Thresholding-based approach (fade-in and fade-out transitions at the start and end of a replay). 2. Gaussian mixture model (GMM) to extract silhouettes and generate a motion history image (MHI) for each key event. 3. Confined elliptical local ternary patterns (CE-LTPs) for feature extraction. 4. Extreme learning machine (ELM) classifier for key event detection (learning methods).
Output: Detected replay events.

Study [59]
Algorithms: 1. Audio analysis (crowd cheer and commentator excitement detection). 2. Visual analysis (player action recognition, shot boundary detection). 3. Text analysis (dictionary of 60 words/phrases).
Methods: 1. SoundNet classifier (crowd cheer and commentator speech). 2. Linear SVM classifier (tone-based commentator excitement detection). 3. Speech-to-text conversion (commentator excitement). 4. VGG-16 model (player celebration actions). 5. Optical Character Recognition (OCR) (learning methods).
Output: Automatically extracted highlights.
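Several of the audio pipelines above ([7], and the DFT-based whistle detector noted for [66]) exploit the fact that a referee's whistle is narrowband while crowd noise is broadband. A crude frequency-domain sketch; the band edges and energy ratio are illustrative assumptions, not values from the surveyed papers:

```python
import numpy as np

def looks_like_whistle(samples, sr, band=(2000, 4500), ratio=0.6):
    """True if one narrow frequency band holds most of the signal's energy,
    a crude stand-in for DFT-based whistle detection."""
    spec = np.abs(np.fft.rfft(samples)) ** 2           # power spectrum
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sr)  # bin frequencies (Hz)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return spec[in_band].sum() / spec.sum() > ratio

sr = 16000
t = np.arange(8192) / sr
whistle = np.sin(2 * np.pi * 3000 * t)              # pure 3 kHz tone
noise = np.random.default_rng(1).normal(size=8192)  # broadband crowd-like noise
print(looks_like_whistle(whistle, sr), looks_like_whistle(noise, sr))  # -> True False
```

Real detectors (e.g. the MFCC/GMM stages of [7]) go well beyond this, but the band-energy ratio illustrates why a plain DFT already separates whistles from crowd roar.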


– The methods that use an AlexNet CNN and a decision tree classifier do not apply a substantial number of samples to evaluate the robustness of the model; some of them used as few as 50 samples.
– The datasets have a limited number of samples, as in [25, 64, 67]. The videos were chosen based on a predefined type of view, and no preprocessing appears to have been done to classify the videos into different types of views.

4.1.4 Events Detection

– Only four key events are considered in [34]. Likewise, most of the event-based methods consider only the standard events relevant to each sport (boundaries in cricket, goals in soccer, etc.). Any attempt to increase the number of key events, or to detect sub-events of the main key events, would improve these works.
– The audio events in most of the works [7, 45, 66, 92] use cheer energy or the spectrum of the commentators. Attempts to use in-field audio, such as stump microphones or umpires' microphones, have not been reported in any of the works.
– The replay is the key to identifying the semantic segments of the video in [20, 40, 90]. If a replay is not identified, the segments will carry misleading semantics.
– Human subjective evaluation may not always reveal the true results, as in [74]; the samples used for evaluation are too few in number to support any proper conclusion.

4.2 Scope for Further Research

The major objective of sports video summarization is to reduce the length of a broadcast video so that the shortened video shows only the interesting events. Sports videos are inherently long; some sports, such as cricket, produce days-long footage. An automated highlight generation system must therefore deal with a huge volume of video data, and with current broadcasting standards every video to be processed will be at high resolution, sometimes up to 4K or 8K. Any algorithm developed to automate video summarization should address these requirements, so prospective research must handle such high-resolution, high-volume data.

In addition to resolution, the lack of a standardized dataset is a major obstacle for researchers wishing to benchmark their results. Only a handful of researchers [25, 67] have attempted to create datasets for sports summarization; most of the proposed methods employ custom-created datasets or commonly available data from sources such as YouTube. Creating and standardizing a dataset with a large collection of samples for each sport is therefore a high-priority research project.

In the literature, most of the methods [6, 21, 34, 36, 38, 43, 50, 54, 59, 67, 72, 85] applied two levels of model building. Since video summarization mainly involves detecting or segmenting objects and classifying them, advanced object detection and classification methods such as YOLO [68, 69] and R-CNN [71] can be used to reduce the computational complexity. Because these methods are standardized in terms of detection and classification, sports video summarization models can also employ them to reduce model-building time. Along the same lines, just as YOLO [68, 69] provides common-object classification, an attempt could be made to build a pretrained object detection model specific to sports. Many events and scenes are common across the games under consideration: a green playfield, player jerseys with printed names and numbers, umpires, and scoreboards appear in almost all sports. Pretrained models could classify or detect such common labels in sports video and then be extended into sport-specific models.

As stated earlier, video summarization is a compute-intensive process: thousands of frames and hundreds of video shots must be processed before proper results are classified or identified. The potential of any video processing algorithm lies in whether it can be implemented on a parallel architecture, and the most practical route to parallelism at the consumer level is to exploit the multicore processing capabilities of CPUs and GPUs. Only selected works in the literature [60] have addressed the use of GPUs, yet there is huge potential for exploiting them with the support of hardware architectures and software development tools such as CUDA. Furthermore, methods that process videos in real time to detect key events for summarization would produce very attractive results, and the output could be customized to the duration of summary a viewer wants to watch. Embedding such methods in real-time television streaming boxes would bring considerable commercial value to video summarization, not only for sports but also for other events such as coverage of musical events, festivals, and other functions.
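Because most per-frame features in these pipelines are computed independently, frame analysis is trivially parallel. A minimal sketch of fanning per-frame work across workers; `frame_score` is a hypothetical stand-in for real shot-classification or event-scoring work:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def frame_score(frame):
    """Stand-in per-frame analysis (here just mean intensity); in a real
    summarizer this would be shot classification or event scoring."""
    return float(frame.mean())

def score_frames_parallel(frames, workers=4):
    """Fan independent per-frame work out across workers; pool.map
    preserves input order, so scores align with frame indices."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(frame_score, frames))

frames = [np.full((4, 4), i, dtype=np.uint8) for i in range(10)]
print(score_frames_parallel(frames))
```

For genuinely CPU-bound models one would swap in a process pool or batch frames onto a GPU (e.g. via CUDA, as the text suggests); the structure — independent per-frame tasks mapped onto workers — stays the same.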

5 Conclusion

In this chapter, a systematic review of the latest developments in sports video summarization has been presented. The chapter summarized some of the key methods in detail by analyzing the methods and algorithms used for various sports and events. It is believed that the weaknesses identified in each of the papers can lead to further research avenues for prospective researchers in this domain. Though most of the papers focus on the problem of video summary generation, each method has its own merits and demerits, and understanding these methods will certainly help to benchmark the results against any further developments in this area. Another major contribution of this work is identifying the methods that exploit the latest machine learning techniques and high-performance computing. It is also shown that very few methods have deployed GPUs or multicore computing, which opens room for budding researchers to experiment with the potential of GPUs for processing high-volume data such as sports video. The results of such summarization can be instantly compared with the highlights generated by broadcasting channels at the end of every match. Going further, the generated highlights should also include additional key events and drama, not just the key events of the games. For instance, in soccer, if a player is given a red card, manual editing will show all the events involving that player that led to the red card, and sometimes the player's activities from previous matches as well; an automated system should be capable of identifying such key events and including them in the highlights. Other elements, such as pre- and post-match ceremonies, players' entries, and injuries to players, should also be captured. Eventually, machine learning-based methods should learn to account for the style of commentary, team jerseys, noise removal from common cheering, series-specific scene transitions, and smooth commentary or video cuts, which would substantially reduce the human editor's work. It is anticipated that this chapter will serve as a standard reference for researchers actively developing video summarization algorithms using learning or non-learning approaches.

References

1. Rahman, A. A., Saleem, W., & Iyer, V. V. Driving behavior profiling and prediction in KSA using smart phone sensors and MLAs. In 2019 IEEE Jordan international joint conference on Electrical Engineering and Information Technology (JEEIT) (pp. 34–39).
2. Ajmal, M., Ashraf, M. H., Shakir, M., Abbas, Y., & Shah, F. A. (2012). Video summarization: Techniques and classification. In Computer vision and graphics (Vol. 7594). ISBN: 978-3-642-33563-1.
3. Sen, A., Deb, K., Dhar, P. K., & Koshiba, T. (2021). CricShotClassify: An approach to classifying batting shots from cricket videos using a convolutional neural network and gated recurrent unit. Sensors, 21, 2846. https://doi.org/10.3390/s21082846
4. Halin, A. A., & Mandava, R. (2013, January). Goal event detection in soccer videos via collaborative multimodal analysis. Pertanika Journal of Science and Technology, 21(2), 423–442.
5. Amruta, A. D., & Kamde, P. M. (2015, March). Sports highlight generation system based on video feature extraction. IJRSI (2321–2705), II(III).
6. Bagheri-Khaligh, A., Raziperchikolaei, R., & Moghaddam, M. (2012). A new method for shot classification in soccer sports video based on SVM classifier. In Proceedings of the 2012 IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI). Santa Fe, NM.
7. Baijal, A., Jaeyoun, C., Woojung, L., & Byeong-Seob, K. (2015). Sports highlights generation based on acoustic events detection: A rugby case study. In 2015 IEEE International Conference on Consumer Electronics (ICCE) (pp. 20–23). https://doi.org/10.1109/ICCE.2015.7066303
8. Bochkovskiy, A., Wang, C.-Y., & Liao, H.-Y. M. (2020). YOLOv4: Optimal speed and accuracy of object detection. arXiv:2004.10934 [cs.CV].
9. Chen, F., De Vleeschouwer, C., Barrobés, H. D., Escalada, J. G., & Conejero, D. (2010). Automatic summarization of audio-visual soccer feeds. In 2010 IEEE international conference on Multimedia and Expo (pp. 837–842). https://doi.org/10.1109/ICME.2010.5582561
10. Dai, J., Li, Y., He, K., & Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems (pp. 379–387).
11. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society conference on Computer Vision and Pattern Recognition (CVPR '05) (Vol. 1, pp. 886–893). https://doi.org/10.1109/CVPR.2005.177


12. Davis, J., & Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06) (pp. 233–240). ACM, New York, NY, USA. https://doi.org/10.1145/1143844.1143874
13. Asadi, E., & Charkari, N. M. (2012). Video summarization using fuzzy c-means clustering. In 20th Iranian conference on Electrical Engineering (ICEE2012) (pp. 690–694). https://doi.org/10.1109/IranianCEE.2012.6292442
14. Ekin, A., Tekalp, A., & Mehrotra, R. (2003). Automatic soccer video analysis and summarization. IEEE Transactions on Image Processing, 12(7), 796–807.
15. Fani, M., Yazdi, M., Clausi, D., & Wong, A. (2017). Soccer video structure analysis by parallel feature fusion network and hidden-to-observable transferring Markov model. IEEE Access, 5, 27322–27336.
16. Felzenszwalb, P. F., Girshick, R. B., & McAllester, D. (2010). Cascade object detection with deformable part models. In 2010 IEEE computer society conference on Computer Vision and Pattern Recognition (pp. 2241–2248). https://doi.org/10.1109/CVPR.2010.5539906
17. Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010, September). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645. https://doi.org/10.1109/TPAMI.2009.167
18. Felzenszwalb, P., McAllester, D., & Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In 2008 IEEE conference on Computer Vision and Pattern Recognition (pp. 1–8). https://doi.org/10.1109/CVPR.2008.4587597
19. Foysal, M. F., Islam, M., Karim, A., & Neehal, N. (2018). Shot-Net: A convolutional neural network for classifying different cricket shots. In Recent trends in image processing and pattern recognition. Springer Singapore.
20. Ghanem, B., Kreidieh, M., Farra, M., & Zhang, T. (2012). Context-aware learning for automatic sports highlight recognition. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012) (pp. 1977–1980).
21. Girshick, R. B. (2012). From rigid templates to grammars: Object detection with structured models (Ph.D. dissertation). University of Chicago, USA. Advisor: Pedro F. Felzenszwalb. Order Number: AAI3513455.
22. Girshick, R. B., Felzenszwalb, P. F., & Mcallester, D. A. (2011). Object detection with grammar models. In Proceedings of the 24th international conference on Neural Information Processing Systems (NIPS'11) (pp. 442–450). Curran Associates Inc., Red Hook, NY, USA.
23. Girshick, R. (2015). Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV) (pp. 1440–1448). https://doi.org/10.1109/ICCV.2015.169
24. Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2016, January 1). Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1), 142–158. https://doi.org/10.1109/TPAMI.2015.2437384
25. Gonzalez, A., Bergasa, L., Yebes, J., & Bronte, S. (2012). Text location in complex images. In IEEE ICPR.
26. Gupta, A., & Muthaiah, S. (2020). Viewpoint constrained and unconstrained cricket stroke localization from untrimmed videos. Image and Vision Computing, 100.
27. Gupta, A., & Muthaiah, S. (2019). Cricket stroke extraction: Towards creation of a large-scale cricket actions dataset. arXiv:1901.03107 [cs.CV].
28. Gupta, A., Karel, A., & Sakthi Balan, M. (2020). Discovering cricket stroke classes in trimmed telecast videos. In N. Nain, S. Vipparthi, & B. Raman (Eds.), Computer vision and image processing. CVIP 2019. Communications in computer and information science (Vol. 1148). Springer Singapore.
29. Gupta, A., Karel, A., & Sakthi Balan, M. (2021). Cricket stroke recognition using hard and soft assignment based bag of visual words. In Communications in computer and information science (pp. 231–242). Springer Singapore. https://doi.org/10.1007/2F978-981-16-1092-2021
30. Hari, R. (2015, November). Automatic summarization of hockey videos. IJARET (0976–6480), 6(11).


31. Harun-Ur-Rashid, M., Khatun, S., Trisha, Z., Neehal, N., & Hasan, M. (2018). Crick-net: A convolutional neural network based classification approach for detecting waist high no balls in cricket. arXiv preprint arXiv:1805.05974.
32. He, J., & Pao, H.-K. (2020). Multi-modal, multi-labeled sports highlight extraction. In 2020 international conference on Technologies and Applications of Artificial Intelligence (TAAI) (pp. 181–186). https://doi.org/10.1109/TAAI51410.2020.00041
33. He, K., Zhang, X., Ren, S., & Sun, J. (2014). Spatial pyramid pooling in deep convolutional networks for visual recognition. In European conference on Computer Vision (pp. 346–361). Springer.
34. Khurram, I. M., Aun, I., & Nudrat, N. (2020). Automatic soccer video key event detection and summarization based on hybrid approach. Proceedings of the Pakistan Academy of Sciences, A Physical and Computational Sciences (2518–4245), 57(3), 19–30.
35. Islam, M. R., Paul, M., Antolovich, M., & Kabir, A. (2019). Sports highlights generation using decomposed audio information. In IEEE International Conference on Multimedia & Expo Workshops (ICMEW) (pp. 579–584). https://doi.org/10.1109/ICMEW.2019.00105
36. Islam, M., Hassan, T., & Khan, S. (2019). A CNN-based approach to classify cricket bowlers based on their bowling actions. In 2019 IEEE international conference on Signal Processing, Information, Communication & Systems (SPICSCON) (pp. 130–134). https://doi.org/10.1109/SPICSCON48833.2019.9065090
37. Itazuri, T., Fukusato, T., Yamaguchi, S., & Morishima, S. (2017). Court-aware volleyball video summarization. In ACM SIGGRAPH 2017 posters (SIGGRAPH '17) (pp. 1–2). Association for Computing Machinery, New York, NY, USA, Article 74. https://doi.org/10.1145/3102163.3102204
38. Javed, A., Malik, K. M., Irtaza, A., et al. (2020). A decision tree framework for shot classification of field sports videos. The Journal of Supercomputing, 76, 7242–7267. https://doi.org/10.1007/s11227-020-03155-8
39. Javed, A., Bajwa, K., Malik, H., Irtaza, A., & Mahmood, M. (2016). A hybrid approach for summarization of cricket videos. In IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia). Seoul.
40. Javed, A., Irtaza, A., Khaliq, Y., & Malik, H. (2019). Replay and key-events detection for sports video summarization using confined elliptical local ternary patterns and extreme learning machine. Applied Intelligence, 49, 2899–2917. https://doi.org/10.1007/s10489-019-01410-x
41. Jothi Shri, S., & Jothilakshmi, S. (2019). Crowd video event classification using convolutional neural network. Computer Communications, 147, 35–39.
42. Kanade, S. S., & Patil, P. M. (2013, March). Dominant color based extraction of key frames for sports video summarization. International Journal of Advances in Engineering & Technology, 6(1), 504–512. ISSN: 2231-1963.
43. Kapela, R., McGuinness, K., & O'Connor, N. (2017). Real-time field sports scene classification using colour and frequency space decompositions. Journal of Real-Time Image Processing, 13, 725–737.
44. Kathirvel, P., Manikandan, S. M., & Soman, K. P. (2011, January). Automated referee whistle sound detection for extraction of highlights from sports video. International Journal of Computer Applications (0975–8887), 12(11), 16–21.
45. Khan, A., Shao, J., Ali, W., & Tumrani, S. (2020). Content-aware summarization of broadcast sports videos: An audio–visual feature extraction approach. Neural Processing Letters, 1945–1968.
46. Kiani, V., & Pourreza, H. R. (2013). Flexible soccer video summarization in compressed domain. In ICCKE 2013 (pp. 213–218). https://doi.org/10.1109/ICCKE.2013.6682798
47. Kolekar, M. H., & Sengupta, S. (2015). Bayesian network-based customized highlight generation for broadcast soccer videos. IEEE Transactions on Broadcasting, (2), 195–209.
48. Kolekar, M. H., & Sengupta, S. (2006). Event-importance based customized and automatic cricket highlight generation. In IEEE international conference on Multimedia and Expo. Toronto, ON.

32

V. Vasudevan and M. S. Gounder

49. Kolekar, M. H., & Sengupta, S. (2008). Caption content analysis based automated cricket highlight generation. In National Communications Conference (NCC). Mumbai. 50. Bhattacharya, K., Chaudhury, S., & Basak, J. (2004, December 16–18). Video summarization: A machine learning based approach. In ICVGIP 2004, Proceedings of the fourth Indian conference on Computer Vision, Graphics & Image Processing (pp. 429–434). Allied Publishers Private Limited, Kolkata, India. 51. Alex, K., Ilya, S., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th international conference on Neural Information Processing Systems, Volume 1 (NIPS’12) (pp. 1097–1105). Curran Associates Inc., Red Hook, NY, USA. 52. Kumar, R., Santhadevi, D., & Janet, B. (2019). Outcome classification in cricket using deep learning. In IEEE international conference on Cloud Computing in Emerging Markets CCEM. Bengaluru. 53. Kumar Susheel, K., Shitala, P., Santosh, B., & Bhaskar, S. V. (2010). Sports video summarization using priority curve algorithm. International Journal on Computer Science and Engineering (0975–3397), 02(09), 2996–3002. 54. Kumar, Y., Gupta, S., Kiran, B., Ramakrishnan, K., & Bhattacharyya, C. (2011). Automatic summarization of broadcast cricket videos. In IEEE 15th International Symposium on Consumer Electronics (ISCE). Singapore. 55. Li, Y., Chen, Y., Wang, N., & Zhang, Z. (2019). Scale-aware trident networks for object detection. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 6053– 6062). https://doi.org/10.1109/ICCV.2019.00615 56. Li, Z., Peng, C., Yu, G., Zhang, X., Deng, Y., & Sun, J. (2017). Light-head r-cnn: In defense of two-stage object detector. arXiv preprint arXiv:1711.07264. 57. Lin, T., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In IEEE conference on Computer Vision and Pattern Recognition (CVPR) (pp. 936–944). 
https://doi.org/10.1109/CVPR.2017.106 58. Lin, T., Goyal, P., Girshick, R., He, K., & Dollár, P. (2018, July). Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2), 318–327. https://doi.org/10.1109/TPAMI.2018.2858826 59. Merler, M., Mac, K. N. C., Joshi, D., Nguyen, Q. B., Hammer, S., Kent, J., Xiong, J., Do, M. N., Smith, J. R., & Feris, R. S. (2019, May). Cricket automatic curation of sports highlights using multimodal excitement features. IEEE Transactions on Multimedia, 21(5), 1147–1160. https://doi.org/10.1109/TMM.2018.2876046 60. Minhas, R., Javed, A., Irtaza, A., Mahmood, M., & Joo, Y. (2019). Shot classification of field sports videos using AlexNet Convolutional Neural Network. Applied Sciences, 9(3), 483. 61. Mohan, S., & Vani, V. (2016). Predictive 3D content streaming based on decision tree classifier approach. In S. Satapathy, J. Mandal, S. Udgata, & V. Bhateja (Eds.), Information systems design and intelligent applications. Advances in intelligent systems and computing (Vol. 433). Springer. https://doi.org/10.1007/978-81-322-2755-7_16 62. Namuduri, K. (2009). Automatic extraction of highlights from a cricket video using MPEG7 descriptors. In First international communication systems and networks and workshops. Bangalore. 63. Nguyen, N., & Yoshitaka, A. (2014). Soccer video summarization based on cinematography and motion analysis. In 2014 IEEE 16th international workshop on Multimedia Signal Processing (MMSP) (pp. 1–6). https://doi.org/10.1109/MMSP.2014.6958804 64. Rafiq, M., Rafiq, G., Agyeman, R., Choi, G., & Jin, S.-I. (2020). Scene classification for sports video summarization using transfer learning. Sensors, 20, 1702. 65. Raj, R., Bhatnagar, V., Singh, A. K., Mane, S., & Walde, N. (2019, May). Video summarization: Study of various techniques. In Proceedings of IRAJ international conference, arXiv:2101.08434. 66. Raventos, A., Quijada, R., Torres, L., & Tarrés, F. (2015). 
Automatic summarization of soccer highlights using audio-visual descriptors. Springer Plus, 4, 1–13.

A Systematic Review on Machine Learning-Based Sports Video. . .

33

67. Ravi, A., Venugopal, H., Paul, S., & Tizhoosh, H. R. (2018). A dataset and preliminary results for umpire pose detection using SVM classification of deep features. In 2018 IEEE Symposium Series on Computational Intelligence (SSCI) (pp. 1396–1402). https://doi.org/ 10.1109/SSCI.2018.8628877 68. Redmon, J., & Farhadi, A. (2017). YOLO9000: Better, faster, stronger. In 2017 IEEE conference on Computer Vision and Pattern Recognition (CVPR) (pp. 6517–6525). https:// doi.org/10.1109/CVPR.2017.690 69. Redmon, J., & Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767. 70. Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, realtime object detection. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 779–788). 71. Ren, S., He, K., Girshick, R., & Sun, J. (2016). Faster R-CNN: Towards real-time object detection with region proposal. arXiv:1506.01497 [cs.CV]. 72. Sharma, R., Sankar, K., & Jawahar, C. (2015). Fine-grain annotation of cricket videos. In Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition (ACPR). Kuala Lumpur, Malaysia. 73. Shih, H. (2018). A survey of content-aware video analysis for sports. IEEE Transactions on Circuits and Systems for Video Technology, 28(5), 1212–1231. 74. Shingrakhia, H., & Patel, H. (2021). SGRNN-AM and HRF-DBN: A hybrid machine learning model for cricket video summarization. The Visual Computer, 38, 2285. https://doi.org/ 10.1007/s00371-021-02111-8 75. Shukla, P., Sadana, H., Verma, D., Elmadjian, C., Ramana, B., & Turk, M. (2018). Automatic cricket highlight generation using event-driven and excitement-based features. In IEEE/CVF conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Salt Lake City, UT. 76. Sreeja, M. U., & KovoorBinsu, C. (2019). Towards genre-specific frameworks for video summarisation: A survey. 
Journal of Visual Communication and Image Representation (1047– 3203), 62, 340–358. https://doi.org/10.1016/j.jvcir.2019.06.004 77. Su Yuting., Wang Weikang., Liu Jing., Jing Peiguang., and Yang Xiaokang., DS-Net: Dynamic spatiotemporal network for video salient object detection, arXiv:2012.04886 [cs.CV], 2020. 78. Sukhwani, M., & Kothari, R. A parameterized approach to personalized variable length summarization of soccer matches. arXiv preprint arXiv:1706.09193. 79. Sun, Y., Ou, Z., Hu, W., & Zhang, Y. (2010). Excited commentator speech detection with unsupervised model adaptation for soccer highlight extraction. In 2010 international conference on Audio, Language, and Image Processing (pp. 747–751). https://doi.org/10.1109/ ICALIP.2010.5685077 80. Tang, H., Kwatra, V., Sargin, M., & Gargi, U. (2011). Detecting highlights in sports videos: Cricket as a test case. In IEEE international conference on Multimedia and Expo. Barcelona. 81. Saba, T., & Altameem, A. (2013, August). Analysis of vision based systems to detect real time goal events in soccer videos. International Journal of Applied Artificial Intelligence, 27(7), 656–667. https://doi.org/10.1080/08839514.2013.787779 82. Antonio, T.-d.-P., Yuta, N., Tomokazu, S., Naokazu, Y., Marko, L., & Esa, R. (2018, August). Summarization of user-generated sports video by using deep action recognition features. IEEE Transactions on Multimedia, 20(8), 2000–2010. 83. Tien, M.-C., Chen, H.-T., Hsiao, C. Y.-W. M.-H., & Lee, S.-Y. (2007). Shot classification of basketball videos and its application in shooting position extraction. In Proceedings of the IEEE international conference on Acoustics, Speech and Signal Processing (ICASSP 2007). 84. Vadhanam, B. R. J., Mohan, S., Ramalingam, V., & Sugumaran, V. (2016). Performance comparison of various decision tree algorithms for classification of advertisement and nonadvertisement videos. Indian Journal of Science and Technology, 9(1), 48–65. 85. Vani, V., Kumar, R. P., & Mohan, S. 
Profiling user interactions of 3D complex meshes for predictive streaming and rendering. In Proceedings of the fourth international conference on Signal and Image Processing 2012 (ICSIP 2012) (pp. 457–467). Springer, India.

34

V. Vasudevan and M. S. Gounder

86. Vani, V., & Mohan, S. (2021). Advances in sports video summarization – a review based on cricket video. In The 34th international conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems, Special Session on Big Data and Intelligence Fusion Analytics (BDIFA 2021). Accepted for publication in Springer LNCS. 87. Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society conference on Computer Vision and Pattern Recognition. CVPR 2001 (p. I-I). https://doi.org/10.1109/CVPR.2001.990517 88. Viola, P., & Jones, M. (2004). Robust real-time face detection. International Journal of Computer Vision, 57(2), 137–154. 89. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In European conference on computer vision (pp. 21–37). Springer. 90. Xu, W., & Yi, Y. (2011, September). A robust replay detection algorithm for soccer video. IEEE Signal Processing Letters, 18(9), 509–512. https://doi.org/10.1109/LSP.2011.2161287 91. Khan, Y. S., & Pawar, S. (2015). Video summarization: Survey on event detection and summarization in soccer videos. International Journal of Advanced Computer Science and Applications (IJACSA), 6(11). https://doi.org/10.14569/IJACSA.2015.061133 92. Ye, J., Kobayashi, T., & Higuchi, T. Audio-based sports highlight detection by Fourier local auto-correlations. In Proceedings of the 11th annual conference of the International Speech Communication Association, INTERSPEECH 2010 (pp. 2198–2201). 93. Hossam, Z. M., Nashwa, E.-B., Ella, H. A., & Tai-hoon, K. (2011). Machine learning-based soccer video summarization system, multimedia, computer graphics and broadcasting (Vol. 263). ISBN: 978-3-642-27185-4. 94. Zhang, S., Wen, L., Bian, X., Lei, Z., & Li, S. Z. (2018). Singleshot refinement neural network for object detection. In IEEE CVPR. 95. 
Zhang, S., Wen, L., Lei, Z., & Li, S. Z. (2021, February). RefineDet++: Single-shot refinement neural network for object detection. IEEE Transactions on Circuits and Systems for Video Technology, 31(2), 674–687. https://doi.org/10.1109/TCSVT.2020.2986402 96. Zou, Z., Shi, Z., Guo, Y., & Ye, J. (2019). Object detection in 20 years: A survey. arXiv preprint arXiv:1905.05055.

Shot Boundary Detection from Lecture Video Sequences Using Histogram of Oriented Gradients and Radiometric Correlation

T. Veerakumar, Badri Narayan Subudhi, K. Sandeep Kumar, Nikhil O. F. Da Rocha, and S. Esakkirajan

1 Introduction

Due to the rapid growth and development of multimedia techniques, e-learning is gaining popularity. In the last few years, many online courses have been uploaded to the Internet for basic study use. Users perform basic tasks such as browsing the content of a particular video or analyzing specific parts of a video lecture. A video may contain lectures on several topics or subtopics, and the main difficulty lies in finding specific pieces of knowledge in a video because it is unstructured. For example, if someone wants to analyze specific content or attend to a specific part of a particular lecture, he or she has to watch the entire two- or three-hour lecture. Hence, browsing and analyzing these videos is a challenging and tiresome job, and smooth browsing and indexing of lecture videos is considered a primary task of computer vision. One likely solution is to segment a video into different shots so as to facilitate learning and minimize learning time. Recent years have seen a rapid increase in the storage of visual information, which has driven scientists to find ways to index visual data and retrieve it efficiently. Content-Based Video Retrieval (CBVR) is an area of research that has catered

T. Veerakumar () · K. S. Kumar · N. O. F. Da Rocha
Department of Electronics and Communication Engineering, National Institute of Technology Goa, Farmagudi, Ponda, Goa, India
e-mail: [email protected]

B. N. Subudhi
Department of Electrical Engineering, Indian Institute of Technology Jammu, Nagrota, Jammu, India

S. Esakkirajan
Department of Instrumentation and Control Engineering, PSG College of Technology, Coimbatore, Tamil Nadu, India

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_2


to such demands of users [1]. In this process, a video is first segmented into successive shots. To automate shot segmentation, the subsequent frames must be analyzed for changes in visual content, which can be abrupt or gradual. After detecting the shot boundaries, key frames are extracted from each shot. Key frames provide a suitable abstraction and framework for video indexing, browsing, and retrieval; their usage significantly reduces the amount of data required in video indexing and provides an organizational framework for dealing with video content. Users searching for a video of interest browse videos randomly and view only the key frames that match the content of the search query. CBVR has various stages, such as shot segmentation, key frame extraction, feature extraction, feature indexing, retrieval, and result ranking [1]. These key frames are used for image-based video retrieval, where an image is given as a query to retrieve a video from a collection of lecture videos. A variety of approaches has been reported in the literature. The simplest method is the pixel-wise difference between consecutive frames [2], but it is very sensitive to camera motion. An approach based on local statistical differences is proposed in [3], obtained by dividing the image into a few regions and comparing statistical measures such as the mean and standard deviation of the gray levels within the regions; however, this approach is computationally burdensome. The most common and popular methods for shot boundary detection are based on histograms [4–6]. The simplest of these computes the gray-level or color histogram of the two images; if the sum of the bin-wise differences between the two histograms is above a threshold, a shot boundary is assumed.
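As a concrete illustration, the histogram-difference baseline just described can be sketched as follows (a minimal sketch, not code from any of the cited works; the bin count and threshold are arbitrary choices):

```python
import numpy as np

def histogram_cut_detector(frames, bins=32, thresh=0.5):
    """Baseline shot-cut detector: declare a boundary wherever the
    bin-wise absolute difference between the normalised gray-level
    histograms of consecutive frames exceeds a threshold."""
    cuts, prev = [], None
    for idx, frame in enumerate(frames):
        h, _ = np.histogram(frame, bins=bins, range=(0, 256))
        h = h / h.sum()                       # normalise to a distribution
        if prev is not None and np.abs(h - prev).sum() > thresh:
            cuts.append(idx)                  # first frame of the new shot
        prev = h
    return cuts

# Synthetic example: five dark frames followed by five bright frames.
frames = [np.zeros((16, 16))] * 5 + [np.full((16, 16), 200.0)] * 5
print(histogram_cut_detector(frames))  # [5]
```

As the surrounding text notes, such a detector ignores spatial layout entirely, which is exactly the weakness that motivates the HOG-based approach of this chapter.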
It may be noted that these approaches are relatively stable, but the absence of spatial information may produce substantial dissimilarities between frames and hence reduces accuracy. Mutual information computed from the joint histogram of consecutive frames is also used to solve this task [7]. Well-known machine learning and pattern recognition methods such as neural networks [8], KNN [9], fuzzy clustering [10, 11], and support vector machines [12] have also been used for shot boundary detection. Shot boundary detection based on an orthogonal polynomial method is proposed in [13], where an orthogonal polynomial function is used to identify the shot boundaries in the video sequence. In essence, previous works show that researchers have proposed numerous types of features and dissimilarity measures. Many state-of-the-art techniques suffer from the difficulty of selecting thresholds and window sizes, and such methods limit the accuracy of shot boundary detection by generating false positives under illumination change. The phase after shot detection is key frame extraction, where a key frame is a representative of an individual shot. One popular approach to key frame extraction uses singular value decomposition (SVD) and correlation minimization [14, 15]. Another method is KS-SIFT [16], which extracts local visual features using SIFT, represented as feature vectors, from a selected group of frames of a video shot; it analyzes those feature vectors to eliminate near-duplicate key frames, helping to keep the key frame set compact. But it takes more computation time, and the approach


is found to be complex. The Robust Principal Component Analysis (RPCA) method has also been introduced to extract the key frames of a video. RPCA provides a stable tool for data analysis and dimensionality reduction: under the RPCA framework, the input data set is decomposed into a sum of low-rank and sparse components. This approach is based on an l1-norm optimization technique [17]; however, it is more complicated and computationally expensive. The problem of moving object segmentation using background subtraction is introduced in [18]. Moving object segmentation is important for many applications: visual surveillance in both outdoor and indoor environments, traffic control, behavior detection during sports activities, and so on. A new approach to the detection and classification of scene breaks in video sequences is discussed in [19]; it can detect and classify a variety of scene breaks, including cuts, fades, dissolves, and wipes, even in sequences involving significant motion. A novel dual-stage approach to abrupt transition detection is introduced in [20], which withstands certain illumination and motion effects. A hybrid shot boundary detection method is developed by integrating a high-level fuzzy Petri net (HLFPN) model with keypoint matching [21]: the HLFPN model with histogram differences performs a pre-detection, and then the speeded-up robust features (SURF) algorithm, which is reliably robust to image affine transformation and illumination variation, is used to identify possible false shots and gradual transitions based on the assumptions of the HLFPN model. The top-down design effectively lowers the computational complexity of the SURF algorithm. From the above discussion, it may be concluded that shot boundary detection and key frame extraction are important tasks in image and video analysis and need attention.
It is also to be noted that works on lecture video analysis for shot boundary detection are very few [22]. This article focuses on shot boundary detection and key frame extraction for lecture video sequences. Here, the combined advantages of Histogram of Oriented Gradients (HOG) [23] features and radiometric correlation with an entropic measure are used to perform shot boundary detection. The key frames of the video are obtained by analyzing the peaks and valleys of the radiometric correlation plotted against the frames of the lecture video. In the proposed scheme, HOG features are first extracted from each frame, and the similarity between consecutive image frames is obtained by computing the radiometric correlation between their HOG features. To analyze shot transitions, the radiometric correlation between consecutive frames is plotted. Over a complete lecture video, the radiometric correlation exhibits a significant amount of uncertainty due to variations in color, illumination, or object motion between consecutive frames of a lecture video scene; hence, the concept of an entropic measure is used. In the proposed scheme, a centered sliding window is considered on the radiometric correlation plot to compute the entropy at each frame. Similarly, the analysis of the peaks and valleys of the radiometric correlation plot is used to find the key frames of each shot. The proposed scheme is tested on several lecture sequences, and seven results are reported in this article. The results obtained by the proposed scheme are compared with six existing state-of-the-art techniques in terms of computational time and shot detection.


This article is organized as follows. Section 2 describes the proposed algorithm. The simulation results with discussions and future works are given in Sect. 3. Finally, conclusions are drawn in Sect. 4.

2 Shot Boundary Detection and Key Frame Extraction

The block diagram of the proposed shot boundary detection scheme is shown in Fig. 1. The proposed scheme follows three steps: feature extraction, shot boundary detection, and key frame extraction. Initially, HOG features are extracted from all the frames of the sequence. The extracted HOG feature vector of each frame is compared with that of the subsequent frame using the radiometric correlation measure [23]. Then the local entropy of the radiometric correlation is computed to identify the shot boundaries in the lecture video. In the final step, the key frames of each shot are extracted by analyzing the peaks and valleys of the radiometric correlation.

Fig. 1 Flowchart of the proposed technique: frames are extracted from the input video; HOG features of the ith and (i+1)th frames are compared via radiometric correlation; the entropy over a sliding window is compared against a threshold (entropy < threshold) to detect shot boundaries, followed by key frame extraction


2.1 Feature Extraction

In the proposed scheme, we use the HOG feature for our analysis. The HOG feature was originally proposed for object detection [23] in computer vision and image processing. The method is based on evaluating well-normalized histograms of image gradient orientations. The basic idea is that object appearance and shape can often be characterized well by the distribution of intensity gradients or edge directions, without precise knowledge of the corresponding gradient or edge positions. HOG captures the edge or gradient structure that describes the shape, in a representation with an easily controllable degree of invariance to geometric and photometric transformations: translations or rotations make little difference if they are much smaller than the spatial or orientation bin size. Since it is gradient based, it captures object shape information very well. The histograms are contrast-normalized by computing a measure of intensity across the image and normalizing all the values accordingly. As comparing consecutive frames of the video is the key to detecting a shot boundary, HOG is a good choice because it is computationally fast. The proposed shot boundary detection algorithm uses radiometric correlation and an entropic measure for shot transition identification, as discussed in detail in the next section.
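For illustration, a simplified version of the HOG computation can be sketched as follows: per-cell orientation histograms of gradient magnitudes with a single global L2 normalization. This is a sketch only; the full descriptor of [23] additionally performs block-level contrast normalization, and the cell size and bin count below are illustrative defaults, not values from this chapter.

```python
import numpy as np

def hog_features(img, cell=8, bins=9):
    """Simplified HOG sketch: unsigned gradient orientations (0-180 deg),
    binned per cell and weighted by gradient magnitude."""
    img = img.astype(np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]        # central differences
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    h, w = img.shape
    ch, cw = h // cell, w // cell
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    desc = np.zeros((ch, cw, bins))
    for i in range(ch):
        for j in range(cw):
            m = mag[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            b = bin_idx[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            for k in range(bins):
                desc[i, j, k] = m[b == k].sum()   # magnitude-weighted histogram
    v = desc.ravel()
    return v / (np.linalg.norm(v) + 1e-12)        # global L2 normalisation

frame = np.random.default_rng(0).random((64, 64))
feat = hog_features(frame)
print(feat.shape)  # (576,) = 8x8 cells x 9 bins
```

The resulting vector plays the role of HOG_t in the radiometric correlation of the next section.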

2.2 Radiometric Correlation for Interframe Similarity Measure

The basic idea behind shot boundary detection in a lecture sequence is to find the similarity (correlation) between consecutive frames of the video and to locate discontinuities in it. In this regard, we use a radiometric correlation-based similarity measure between frames: the extracted HOG features of consecutive frames are compared to estimate the radiometric correlation. Here, the time instant is assumed to coincide with the frame instant. Let the successive frames of a sequence be represented by $I_t(x, y)$ and $I_{t-1}(x, y)$, and the extracted HOG feature vectors by $\overrightarrow{HOG}_t$ and $\overrightarrow{HOG}_{t-1}$, respectively. Then the radiometric correlation is given by [23]

$$R(I_t(x, y), I_{t-1}(x, y)) = \frac{m\left(\overrightarrow{HOG}_t \cdot \overrightarrow{HOG}_{t-1}\right) - m\left(\overrightarrow{HOG}_t\right) m\left(\overrightarrow{HOG}_{t-1}\right)}{\sqrt{v\left(\overrightarrow{HOG}_t\right) v\left(\overrightarrow{HOG}_{t-1}\right)}}, \quad (1)$$

where $m\left(\overrightarrow{HOG}_t \cdot \overrightarrow{HOG}_{t-1}\right)$ is the mean of the product of the extracted feature vectors and can be obtained as

$$m\left(\overrightarrow{HOG}_t \cdot \overrightarrow{HOG}_{t-1}\right) = \frac{1}{n}\, \overrightarrow{HOG}_{t-1}\, \overrightarrow{HOG}_t^{\,T}, \quad (2)$$

where $\overrightarrow{HOG}_{t-1}$ and $\overrightarrow{HOG}_t$ are the extracted HOG feature vectors (of size $1 \times n$) of the $(t-1)$th and $t$th frames, respectively, and $n$ is the dimension of the HOG feature computed from each frame. $m\left(\overrightarrow{HOG}_t\right)$ and $v\left(\overrightarrow{HOG}_t\right)$ denote the mean and variance of the HOG feature vector of the $t$th frame, respectively. The HOG feature vector can be written as $\overrightarrow{HOG}_t = \left[HOG_{(t,1)}, HOG_{(t,2)}, HOG_{(t,3)}, \ldots, HOG_{(t,n)}\right]$. Hence, the mean is computed as

$$m\left(\overrightarrow{HOG}_t\right) = \frac{1}{n} \sum_{i=1}^{n} HOG_{(t,i)}, \quad (3)$$

and the variance as

$$v\left(\overrightarrow{HOG}_t\right) = \frac{1}{n} \sum_{i=1}^{n} \left( HOG_{(t,i)} - m\left(\overrightarrow{HOG}_t\right) \right)^2. \quad (4)$$

The radiometric correlation varies in the range [0, 1]. From the radiometric correlation values obtained, a threshold is required to detect the shot boundary. The radiometric correlation values for consecutive frames are calculated. So, for N frames, (N−1) radiometric correlation values can be obtained. Figure 2a shows the plot of radiometric correlation vs. frames of lecture video 1 sequence.
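Assuming the HOG feature vectors of two consecutive frames are available as NumPy arrays, Eq. (1) can be evaluated as follows (a minimal sketch; the function name is ours):

```python
import numpy as np

def radiometric_correlation(h_t, h_prev):
    """Eq. (1): normalised correlation between the HOG vectors of two
    consecutive frames; values near 1 indicate visually similar frames."""
    m_prod = np.mean(h_t * h_prev)               # Eq. (2), elementwise form
    num = m_prod - h_t.mean() * h_prev.mean()    # numerator of Eq. (1)
    den = np.sqrt(h_t.var() * h_prev.var())      # sqrt of product of Eq. (4) variances
    return num / den if den > 0 else 1.0         # guard against constant vectors

rng = np.random.default_rng(1)
a = rng.random(576)   # stand-in HOG vectors (n = 576)
b = rng.random(576)
```

For a sequence of N frames, applying this to each consecutive pair yields the (N−1) correlation values whose plot is analyzed in the following sections.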

2.3 Entropic Measure for Distinguishing Shot Transitions

After obtaining the radiometric correlation, the next step is shot boundary detection. The aim now is to identify the discontinuity points in this radiometric distribution over consecutive frames. In Fig. 2a, it can be seen that the radiometric correlation values differ significantly from one frame to another. However, locating the discontinuities that correspond to shot transitions directly on these values is difficult, and placing a threshold directly on the similarity values is not a good idea because they vary widely. We therefore use a moving window-based entropy measure on the radiometric correlation: rather than thresholding the radiometric correlation values, a one-dimensional overlapping moving window is considered over these values to compute an entropic measure, which improves the performance. A moving window is slid over the radiometric correlation plot, and the entropy is calculated at each location of the window. In information theory, entropy is used as a measure of uncertainty, and this


Fig. 2 Lecture video 1: (a) radiometric correlation for different frames, (b) corresponding entropy values

gives the average information. Hence, the use of entropy in our work reduces the randomness or vagueness of the local radiometric correlation. We calculate the entropy $E_m$ at each point (frame) $m$ of the radiometric correlation values using the formula

$$E_m = \sum_{i \in \eta_m} p_i \log\left(\frac{1}{p_i}\right), \quad (5)$$

where $\eta_m$ represents the neighborhood considered at location $m$, $i$ represents the frame instant, and $p_i$ is the local radiometric plot. As the frame contents change at a shot boundary, a lower entropy value of the radiometric correlation is expected there; this marks the transition at the shot boundary and also makes it easier to choose a threshold. The entropy values obtained for the lecture video are plotted in Fig. 2b, where a white dot marks the detected shot boundary. To identify the threshold for a shot transition, we consider a variance-based selection strategy on the entropic plot $E_m$. Each location on the x-axis of $E_m$ is taken as a candidate threshold, and we search for the frame position at which the sum of the variances on the left and right sides of that point is highest. The total variance is computed as

$$\sigma_m = \sigma_l + \sigma_r, \quad (6)$$

where $\sigma_m$ is the total variance at frame position $m$, and $\sigma_l$ and $\sigma_r$ are the variances computed from the entropic information on the left and right sides of $m$. The threshold value is then obtained by finding a point $m$ such that

$$Th = \arg\max_j \left( \sigma_j \right), \quad (7)$$

where $j$ represents the threshold for shot transition. For lecture video 1, applying Eq. (7) detects the shot boundary at the 247th frame. The sequence whose results are explained here has two shots, and hence one shot boundary is detected. However, videos with more than two shots can also be handled: for a video with $P$ shots, the total number of thresholds $(Th)$ is $(P-1)$. For automatic shot boundary detection, $j$ is assumed to be a vector $\vec{j} = \{j_1, j_2, \ldots, j_{P-1}\}$ with at most $(P-1)$ components, and the threshold is correspondingly represented by a vector $\overrightarrow{Th} = \{Th_1, Th_2, \ldots, Th_{P-1}\}$.
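The sliding-window entropy of Eq. (5) can be sketched as follows. This is an illustrative implementation only: the window size is an arbitrary choice, and normalizing each window of correlation values into a probability distribution is our own assumption about how $p_i$ is obtained, which the text does not fully specify.

```python
import numpy as np

def sliding_entropy(corr, win=15):
    """Entropy of Eq. (5) over a centred sliding window on the
    radiometric-correlation curve; low values flag shot transitions.
    Assumes corr contains positive similarity values."""
    half = win // 2
    ent = np.full(len(corr), np.nan)          # undefined near the borders
    for m in range(half, len(corr) - half):
        p = np.asarray(corr[m - half:m + half + 1], dtype=float)
        p = p / p.sum()                       # window -> probability distribution
        p = p[p > 0]
        ent[m] = -np.sum(p * np.log(p))       # sum_i p_i * log(1 / p_i)
    return ent

corr = np.ones(100)                           # perfectly stable (no transition)
ent = sliding_entropy(corr)
```

For a constant correlation curve the window is uniform, so the entropy takes its maximum value log(win) everywhere it is defined; a dip below this level is what the variance-based threshold of Eqs. (6)–(7) looks for.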

2.4 Key Frame Extraction

Once the shot boundaries of a given sequence are extracted, key frames must be extracted to represent each shot. It can be seen from the graph in Fig. 3 that the similarity measure varies within a particular shot. The maxima of this variation correspond to frames with high similarity to their neighboring frames, so the idea is to pick the frames at the maxima of the similarity distribution as key frames. If these maxima (the peaks of the distribution) can be properly isolated, the key frames can be found. However, there is temporal redundancy between consecutive frames of a video; hence, it is not a good idea to take two maxima that are close to each other. It should also be noted that most shots contain significant variation in the radiometric similarity measure due to noise or illumination change. Hence, before the maxima are picked for shot representation, the similarity distribution is smoothened by a one-dimensional smoothing filter. Using this scheme, the key frames of the different shots are detected. Figure 3 shows the key frames for the different shots (a total of three) of the lecture video 1 sequence: shot 1 [41, 187, 323], shot 2 [434, 772, 1013, 1249], and shot 3 [1291, 1345, 1394, 1464]. Once the key frames are extracted, it is checked whether the visual contents of two consecutive key frames are the same: the radiometric correlation is computed between consecutive key frames, and only significant key frames are retained as the final key frames of a particular shot. Applying this to lecture video 1, we obtained the final key frames [187, 1013, 1291, 1394].
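The smoothing-and-peak-picking procedure described above can be sketched as follows. This is an illustrative implementation: the smoothing width, minimum peak separation, and significance floor are our own choices, and the final correlation-based refinement between consecutive key frames is omitted.

```python
import numpy as np

def key_frames(corr, smooth=9, min_gap=30):
    """Smooth the per-frame similarity curve with a 1-D moving-average
    filter, then pick well-separated significant local maxima as
    candidate key frames (returned as frame indices)."""
    kernel = np.ones(smooth) / smooth
    s = np.convolve(corr, kernel, mode="same")   # 1-D smoothing filter
    floor = 0.05 * s.max()                       # ignore insignificant bumps
    peaks = [i for i in range(1, len(s) - 1)
             if s[i] > floor and s[i - 1] < s[i] >= s[i + 1]]
    picked = []
    for p in sorted(peaks, key=lambda i: -s[i]): # strongest peaks first
        if all(abs(p - q) >= min_gap for q in picked):
            picked.append(p)                     # enforce temporal separation
    return sorted(picked)

# Synthetic similarity curve with two well-separated bumps.
x = np.arange(300)
corr = np.exp(-((x - 50) ** 2) / 50.0) + np.exp(-((x - 200) ** 2) / 50.0)
print(key_frames(corr))  # two key frames, near frames 50 and 200
```

The minimum-gap rule is a simple way to honor the observation above that two maxima close to each other are temporally redundant.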


Fig. 3 Location of the shots and key frames of lecture video 1. (a) Location of the shots. (b) Shot 1 key frames: [41, 187, 323]. (c) Shot 2 key frames [434, 772, 1013, 1249]. (d) Shot 3 key frames [1291, 1345, 1394, 1464]

3 Results and Discussions

To assess the effectiveness of the proposed algorithm, the results obtained using the proposed methodology are compared with those obtained using six different state-of-the-art techniques. The proposed algorithm is implemented in MATLAB and run on a Pentium D 2.8 GHz PC with 2 GB RAM and the Windows 8 operating system. Experiments are carried out on several lecture sequences; for illustration, we provide results on seven test sequences. This section is divided into two parts: (i) analysis of results and (ii) discussions and future works. The former presents a detailed discussion of the visual results for the different sequences; the latter presents a quantitative analysis of the results, together with a discussion of the proposed scheme and future works.


Fig. 4 Key frames for lecture video 1 [187, 1013, 1291, 1394] out of 1497 frames and three shots

3.1 Analysis of Results

Four key frames, given by the frame numbers [187, 1013, 1291, 1394], are extracted from lecture video 1 and shown in Fig. 4; the corresponding radiometric correlation and extracted key frames are illustrated in Figs. 2 and 3. Similarly, the radiometric correlation values for lecture video 2 are plotted in Fig. 5. Here, one shot boundary (i.e., two shots) is detected. Peak and valley analysis reveals four major peaks in shot 1 and three in shot 2. The red marks in Fig. 5 indicate the maxima (peaks) selected as key frames, given by the frame numbers [18, 78, 143, 228, 277, 389, 467]. However, many of the key frames selected at this stage have large mutual correlation; hence, after refinement (as discussed in Sect. 2.4), we obtain the two key frames shown in Fig. 6. The third example considered in our experiments is the lecture video 3 sequence, whose radiometric correlation plot and corresponding entropy values are shown in Fig. 7. The automated thresholding scheme on the entropic plot produces two shots for this sequence, and the key frame extraction process yields 11 key frames. After pruning, six key frames remain, shown in Fig. 8. Similar experiments are conducted on other sequences to validate our results. The fourth example considered in our experiments is the lecture video 4 sequence; its radiometric correlation plot and corresponding entropy values are shown in Fig. 9, and the extracted key frames in Fig. 10. This sequence contains a total of four shots, and after the proposed pruning mechanism, a total of four key frames are detected.
It may be noted that this video contains camera movements/jitter. However, the proposed scheme overcomes this without false detections. A detailed discussion with an example of camera jitter is provided in Sect. 3.2. The next example considered in our experiments is the lecture video 5 sequence, whose radiometric correlation plot and selected key frames are shown in Figs. 11 and 12. This sequence contains several instances of fade-in and fade-out. Nevertheless, the proposed scheme is still able to identify the correct key frames. A detailed analysis with examples is also provided in Sect. 3.2.
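The peak-and-valley key frame selection with correlation-based pruning described above can be sketched as follows. This is a simplified illustration, not the authors' exact implementation: `frame_sim` is a hypothetical helper that returns the radiometric correlation between two frames, and the 0.75 similarity threshold is illustrative.

```python
import numpy as np
from scipy.signal import find_peaks

def select_key_frames(correlation, frame_sim, sim_thresh=0.75):
    """Candidate key frames are the peaks of the radiometric-correlation
    signal; a candidate too similar to an already kept key frame is
    pruned (the refinement step of Sect. 2.4)."""
    peaks, _ = find_peaks(correlation)
    kept = []
    for p in peaks:
        # keep p only if it is sufficiently dissimilar to all kept frames
        if all(frame_sim(p, k) < sim_thresh for k in kept):
            kept.append(int(p))
    return kept
```

A usage example: with a toy correlation signal and a toy similarity function, nearby peaks collapse into a single key frame while distant peaks survive.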

Shot Boundary Detection from Lecture Video Sequences Using Histogram. . .


Fig. 5 Radiometric correlation, corresponding entropy plots, and extracted key frames from each shot for lecture video 2 sequence. (a) Radiometric correlation. (b) Entropy values. (c) Location of the shots. (d) Key frames of shot 1 [18, 78, 143, 228]. (e) Key frames of shot 2 [277, 389, 467]

The next examples considered are the lecture video 6 and lecture video 7 sequences. Their entropy plots with shot categorization and the selected key frames are provided in Figs. 13, 14, 15, and 16. In these two sequences, the scene undergoes


T. Veerakumar et al.

Fig. 6 Key frames for lecture video 2 [143, 389] out of 505 frames and two shots

zoom-in and zoom-out conditions. However, this does not affect the results of the proposed scheme. The sequences considered in our experiments validate the proposed scheme in different challenging scenarios: camera motion, scenes with different subtopics, and camera zoom-in and zoom-out. A detailed analysis of these scenarios is provided in the next section.

3.2 Discussions and Future Works

In this section, we provide a quantitative analysis of the results together with brief discussions of the advantages/disadvantages and other issues related to the proposed work. The efficiency of the algorithm is evaluated in terms of key frame extraction and computational complexity. The computational times of the proposed and existing algorithms are given in Tables 1 and 2. From these tables, it can be observed that the proposed algorithm takes more computational time than PWPD, CHBA, and ECR, but these algorithms are found to be inferior in key frame extraction compared to the proposed algorithm. The other existing algorithms, such as LTD, KS-SIFT, and RPCA, produce key frame extraction results similar to those of the proposed algorithm, but the proposed algorithm requires much less computational time than those techniques. From this, we can conclude that the proposed algorithm outperforms the others in key frame extraction with less computational complexity. It should be mentioned here that shot boundary identification from lecture sequences is a challenging task. The similarity among the frames of a video contains a large amount of uncertainty due to variation in color, artificial effects such as fade-in and fade-out, illumination changes, object motion, camera jitter, and zooming and shrinking. The proposed scheme is found to provide better results for all of these considered scenarios. Its performance in each scenario is discussed with examples as follows. Figure 17 shows two examples of the motion blur condition, depicting two frames from two different shots of the lecture video 2 and lecture video


Fig. 7 Radiometric correlation, corresponding entropy plots, and extracted key frames from each shot for lecture video 3 sequence. (a) Radiometric correlation. (b) Entropy values. (c) Location of the shots. (d) Key frames of the shot 1 [7, 90, 150, 231]. (e) Key frames of the shot 2 [241, 344, 429]. (f) Key frames of the shot 3 [481, 562, 634, 706]

5 sequences, respectively. Due to motion blur, it is hard to recognize them as parts of the same shots; hence, different existing schemes identify them as parts of two distinct shots. However, the proposed scheme can distinguish them as a part of the


Fig. 8 Key frames for lecture videos 3 [7, 150, 241, 429, 481, 706]

single shot. This is due to the capability of the HOG features used in the proposed scheme. Figure 18 shows an example of shots with fade-in and fade-out conditions. Other schemes are unable to represent them as two different shots; instead, they detect three different shots: texts on the board, the professor, and the fade-in/fade-out frames. The proposed scheme, however, correctly represents them as two shots for each sequence. This is due to the capability of the entropic measure, which diminishes the effects of variation in the radiometric correlation measure. Figure 19 shows an example from the lecture video 4 sequence, where the sequence undergoes camera jitter or movements. The combination of the HOG feature with the radiometric similarity measure detects these frames as parts of a single shot. A similar analysis is made on the lecture video 4 and lecture video 6 sequences under zoom-in and zoom-out conditions (shown in Fig. 20). A view variation caused by the camera zooming in and out is also shown in Fig. 20 and is detected as a single shot by the proposed scheme. This is thanks to the integration of radiometric similarity with entropic measures, which deals with real-life uncertainty and efficiently detects the shot transitions in the considered challenging scenarios. Figure 21 shows another example, with noise. The proposed scheme does not split these frames into different shots, whereas the existing techniques do. From the above analysis, we find that the proposed scheme provides better results under variation in color, artificial effects such as fade-in and fade-out, illumination changes, object motion, camera jitter, zooming and shrinking, and noisy video scenes. It should be noted that most of the false detections of key frames by the other schemes considered for comparison in Tables 1 and 2 are due to the abovesaid effects.
The effectiveness of the proposed scheme can be summarized in two phases. In the first phase, the HOG features preserve the shape information of a given lecture video, including details of the texts on the board, drawings, slides, pictures, the teaching professor, etc. It is also to be noted that, as reported in the literature, the HOG feature


Fig. 9 Radiometric correlation, corresponding entropy plots, and extracted key frames from each shot for lecture video 4 sequence. (a) Radiometric correlation. (b) Entropy values. (c) Key frames of the shot 1 and 2 [5, 127, 351, 494]. (d) Key frames of the shot 3 and 4 [502, 647, 808, 922, 1018]

is found to provide good results against illumination changes, motion blur, and noisy video scenes, as is quite evident from Figs. 17, 18, 19, 20, and 21. In the second phase, the radiometric similarity between the frames is computed, and its variation is reduced by mapping it to an entropic scale. This minimizes false detections of key frames and is effective against fade-in, fade-out, zoom-in and zoom-out, and other irrelevant effects in the video.
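The two-phase idea, a HOG description of each frame followed by a correlation-based similarity between consecutive descriptors, can be sketched as below. This is a simplified stand-in (a single global orientation histogram rather than the full cell/block HOG used in the chapter), with the similarity computed as a Pearson-style correlation.

```python
import numpy as np

def hog_descriptor(frame, n_bins=9):
    """Small HOG-style sketch: one gradient-orientation histogram over
    the whole frame, L2-normalised (the chapter uses the full
    cell/block-normalised HOG)."""
    gy, gx = np.gradient(frame.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, 180), weights=mag)
    return hist / (np.linalg.norm(hist) + 1e-12)

def radiometric_correlation(f, g):
    """Pearson-style correlation between two descriptor vectors; values
    near 1 indicate frames belonging to the same shot."""
    f, g = f - f.mean(), g - g.mean()
    denom = np.linalg.norm(f) * np.linalg.norm(g) + 1e-12
    return float(np.dot(f, g) / denom)
```

Computing `radiometric_correlation` between the descriptors of consecutive frames yields the one-dimensional signal whose entropy is analysed for shot boundaries.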


Fig. 10 Key frames for lecture video 4 [127, 494, 647, 1018]

Fig. 11 Radiometric correlation, corresponding entropy plots, and extracted key frames from each shot for lecture video 5 sequence. (a) Radiometric correlation. (b) Entropy values. (c) Key frames of the shots 1, 2, and 3 [2, 245, 487, 721, 776, 909]

Fig. 12 Key frames for lecture video 5 [2, 487, 776]

There are a few parameters used in the proposed scheme that need further discussion. One of the important parameters used in this article is the one-dimensional window size, i.e., the neighborhood used for computing the entropy from the radiometric similarity plot. In the proposed scheme, we have used a fixed window


Fig. 13 Radiometric correlation, corresponding entropy plots, and extracted key frames from each shot for lecture video 6 sequence. (a) Radiometric correlation. (b) Entropy values. (c) Key frames of the shots 1 and 2 [128, 481, 1422]

Fig. 14 Key frames for lecture video 6 [128, 481, 1422]

of size (7 × 1) for all the considered video sequences. However, a variable-sized window may also be considered. In all the considered sequences, the choice of window size may affect the performance of the proposed scheme. If the number of frames in a particular video is high and a small window is chosen, there will be many false shot transitions. If the number of frames is low and a larger window is chosen, a few shot transitions may be missed. A tabular representation of the performance of the proposed scheme on all the considered sequences with different window sizes is provided in Table 3. The proposed scheme is tested with window sizes (11 × 1), (9 × 1), (7 × 1), (5 × 1), and (3 × 1), and the number of key frames detected for each window size is presented in Table 3. It is also observed from this


Fig. 15 Radiometric correlation, corresponding entropy plots, and extracted key frames from each shot for lecture video 7 sequence. (a) Radiometric correlation. (b) Entropy values. (c) Key frames of the shots 1, 2, and 3 [243, 824, 1236, 1618, 2015, 4143]

Fig. 16 Key frames for lecture video 7 [243, 1236, 2015, 4143]

table that the window sizes (5 × 1), (7 × 1), and (9 × 1) give almost the same results for most of the sequences in terms of the number of key frames detected, and the results obtained with (7 × 1) and (9 × 1) are identical. Hence, averaging all the results obtained by manual trial and error, we infer that a (7 × 1) window provides an acceptable result, and we therefore fixed the window size to (7 × 1). It is to be noted that all experiments are performed on a frame size of 320 × 240. In this article, all the results reported in Figs. 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21 and Tables 1, 2, 3 for comparison with the proposed scheme were produced by the authors in their working lab using MATLAB software. The codes for all the considered techniques were implemented in an optimized manner so as to validate the proposed scheme on the same scale.
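The windowed entropy computation discussed above can be illustrated as follows. The sketch assumes nonnegative correlation values and renormalises each (7 × 1) window into a probability distribution before taking the Shannon entropy, which is one plausible reading of the chapter's entropic measure; the exact normalisation used by the authors may differ.

```python
import numpy as np

def windowed_entropy(signal, win=7):
    """Shannon entropy of the radiometric-correlation values inside a
    sliding one-dimensional (win x 1) window. Each window is renormalised
    to a pdf; edge padding keeps the output the same length as the input."""
    half = win // 2
    padded = np.pad(np.asarray(signal, dtype=float), half, mode="edge")
    out = np.empty(len(signal))
    for i in range(len(signal)):
        w = padded[i:i + win]
        p = w / w.sum()                      # assumes nonnegative values
        out[i] = -np.sum(p * np.log2(p + 1e-12))
    return out
```

A flat correlation plot gives the maximum entropy log2(win) everywhere, while a dip (a candidate shot transition) lowers the local entropy, which is what the thresholding exploits; the window-size trade-off in Table 3 corresponds to varying `win`.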

Table 1 Comparison of different lecture videos with existing algorithms

Lecture video 1 (# frames = 1497)
Method         # Key frame(s)  Key frame #                                        CT in sec.
PWPD [2]       4               [246, 771, 1024, 1295]                             182.84
CHBA [4]       4               [246, 771, 1024, 1295]                             456.93
ECR [14]       7               [246, 771, 1024, 1295, 1314, 1321, 1326]           985.66
LTD [5]        6               [186, 558, 838, 943, 1189, 1266]                   1208.36
KS-SIFT [16]   6               [186, 558, 751, 943, 1076, 1189]                   1319.57
RPCA [17]      7               [186, 558, 838, 943, 1076, 1189, 1266]             1328.66
Proposed       4               [186, 772, 1289, 1392]                             845.36

Lecture video 2 (# frames = 505)
Method         # Key frame(s)  Key frame #                                        CT in sec.
PWPD [2]       1               [240]                                              13.45
CHBA [4]       5               [90, 134, 180, 240, 270]                           34.99
ECR [14]       2               [12, 240]                                          31.29
LTD [5]        5               [79, 143, 277, 389, 467]                           68.94
KS-SIFT [16]   5               [9, 143, 223, 389, 467]                            72.37
RPCA [17]      6               [9, 143, 223, 277, 389, 467]                       75.22
Proposed       2               [143, 389]                                         45.96

Lecture video 3 (# frames = 737)
Method         # Key frame(s)  Key frame #                                        CT in sec.
PWPD [2]       2               [240, 480]                                         66.02
CHBA [4]       10              [66, 180, 240, 273, 303, 420, 480, 531, 600, 681]  524.19
ECR [14]       9               [130, 240, 480, 601, 605, 606, 609, 613, 642]      560.33
LTD [5]        7               [66, 150, 240, 429, 480, 642, 706]                 665.87
KS-SIFT [16]   7               [66, 180, 273, 420, 480, 681, 706]                 705.69
RPCA [17]      7               [10, 150, 240, 429, 481, 600, 706]                 719.02
Proposed       6               [7, 150, 241, 429, 481, 706]                       410.73

Lecture video 4 (# frames = 1025)
Method         # Key frame(s)  Key frame #                                        CT in sec.
PWPD [2]       5               [127, 394, 847, 981, 1018]                         255.22
CHBA [4]       5               [112, 506, 647, 901, 1001]                         502.81
ECR [14]       8               [102, 409, 647, 709, 811, 905, 992, 1013]          1027.92
LTD [5]        7               [27, 323, 419, 647, 709, 899, 1001]                1278.45
KS-SIFT [16]   5               [127, 480, 617, 712, 1022]                         1899.14
RPCA [17]      8               [102, 399, 619, 700, 833, 909, 999, 1020]          1928.83
Proposed       4               [127, 494, 647, 1018]                              1021.22

Table 2 Comparison of different lecture videos with existing algorithms

Lecture video 5 (# frames = 963)
Method         # Key frame(s)  Key frame #                                        CT in sec.
PWPD [2]       6               [2, 144, 685, 902, 1012, 1219]                     193.95
CHBA [4]       6               [2, 255, 685, 912, 1012, 1219]                     418.64
ECR [14]       7               [2, 245, 681, 915, 1022, 1219, 1408]               848.25
LTD [5]        6               [25, 145, 802, 1005, 1219, 1408]                   1064.55
KS-SIFT [16]   4               [2, 144, 951, 1219]                                1406.69
RPCA [17]      7               [2, 144, 778, 951, 1077, 1219, 1425]               1481.73
Proposed       3               [2, 144, 1219]                                     894.52

Lecture video 6 (# frames = 1507)
Method         # Key frame(s)  Key frame #                                        CT in sec.
PWPD [2]       3               [297, 965, 1501]                                   230.35
CHBA [4]       8               [105, 303, 481, 551, 719, 845, 909, 1378]          476.27
ECR [14]       8               [82, 125, 398, 592, 704, 899, 1004, 1365]          914.96
LTD [5]        6               [82, 762, 998, 1092, 1304, 1405]                   1108.12
KS-SIFT [16]   4               [97, 709, 1065, 1385]                              1724.55
RPCA [17]      6               [127, 762, 827, 1065, 1284, 1495]                  1781.95
Proposed       3               [128, 481, 1422]                                   952.86

Lecture video 7 (# frames = 4327)
Method         # Key frame(s)  Key frame #                                        CT in sec.
PWPD [2]       5               [228, 456, 921, 2547, 4129]                        239.88
CHBA [4]       8               [54, 228, 456, 756, 921, 1221, 2547, 4529]         504.39
ECR [14]       8               [84, 218, 456, 756, 921, 1221, 2547, 4019]         958.27
LTD [5]        7               [32, 218, 460, 756, 921, 2221, 4071]               1169.51
KS-SIFT [16]   6               [84, 241, 456, 756, 2224, 4071]                    1804.28
RPCA [17]      7               [84, 218, 456, 756, 1221, 2224, 4050]              1881.37
Proposed       4               [243, 1236, 2015, 4143]                            994.24


Fig. 17 Detected as part of single shot with motion blur

Fig. 18 Detected as part of single shot with fade-in and fade-out

The proposed scheme is mainly designed for lecture video segmentation, i.e., shot boundary detection in lecture video sequences. It is to be noted that a lecture sequence mostly has two or three different kinds of frames or shots: the face of the professor, the written texts/slides, and the hand of the professor. The transitions between these shots typically occur in the order face of the professor to hand, hand to texts on the board, board to hand, and then again hand to the professor's face. In a few cases, the view may change from text to the face of the professor and back to the text. Hence, before the start of each new topic or subtopic, most videos undergo a transition such as old topic/subtopic to face of the professor to new topic/subtopic, and the proposed scheme detects them as three different shots. However, in rare cases, the transition may go directly from an old topic/subtopic to a new topic/subtopic, in which case it is difficult to identify the shot transition. Segments of the radiometric correlation plot and the corresponding entropy plot of a shot containing a combination of two subtopics are shown in Fig. 22. The proposed scheme fails to separate the two contents into two different shots, because the significant change in scene view is not reflected in the radiometric similarity; hence, the entropy plot fails to distinguish them. One way to solve this issue is to split the radiometric correlation plot into different parts and then compute the entropy values locally for each part. Figure 23 shows such an example, where the entropy plot is easily separable at the topic/subtopic change region of the video. This is a preliminary result, and the choice of where to split the radiometric plot is manual. In the future, we would like to work more on this issue. The proposed scheme mainly identifies the gradual shot transition.
In the future, we would like to develop techniques that can also determine the soft transitions.
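The split-and-compute-locally idea can be sketched as below, with the entropy evaluated independently on each segment of the correlation plot. As in the text, the split points are supplied manually; the per-segment normalisation is an assumption about how the local entropy would be computed.

```python
import numpy as np

def split_entropy(correlation, split_points):
    """Entropy computed locally on each manually chosen segment of the
    radiometric correlation plot; returns one entropy value per segment."""
    entropies = []
    for seg in np.split(np.asarray(correlation, dtype=float), split_points):
        p = seg / seg.sum()                  # renormalise segment to a pdf
        entropies.append(float(-np.sum(p * np.log2(p + 1e-12))))
    return entropies
```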

Table 3 Performance comparison with different window sizes (# key frames detected)

               Lecture video:
Window size    1 (1497)  2 (505)  3 (737)  4 (1927)  5 (1484)  6 (1807)  7 (1890)
(11 × 1)       3         2        4        3         2         3         3
(9 × 1)        4         2        6        4         3         3         4
(7 × 1)        4         2        6        4         3         3         4
(5 × 1)        4         2        7        4         4         3         4
(3 × 1)        5         3        7        5         4         5         5



Fig. 19 Detected as part of single shot with camera movements and jitter

Fig. 20 Detected as single shot for zoomed in and out condition with view variation for different video

Fig. 21 Detected single shot in the presence of noise

Fig. 22 Radiometric correlation plot and entropic values for a shot with a combination of two subtopics

4 Conclusions

In this article, a shot boundary detection and key frame extraction technique for lecture video sequences is proposed, using an integration of HOG and radiometric correlation with an entropy-based thresholding scheme. In the proposed approach,


Fig. 23 Radiometric correlation plot and obtained split entropic values

the advantages of the HOG feature are exploited to describe each frame effectively. The similarities between the n-dimensional HOG features extracted from consecutive image frames are computed using the radiometric correlation measure. The radiometric correlation over the complete video is found to contain a significant amount of uncertainty due to variation in color, illumination, and camera and object motion. To deal with these uncertainties, entropic thresholding is applied to find the shot boundaries. After the shot boundaries are detected, the key frames of each shot are obtained by analyzing the peaks and valleys of the entropy-associated pdf of the radiometric correlation measures. The proposed scheme is tested on several lecture sequences, and results for seven lecture video sequences are reported here. The results obtained by the proposed scheme are compared against six existing state-of-the-art techniques in terms of computational time and shot detection, and the proposed scheme is found to perform better.

References

1. Hu, W., Xie, N., Li, L., Zeng, X., & Maybank, S. (2011). A survey on visual content-based video indexing and retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 41, 797–819.
2. Zhang, H. J., Kankanhalli, A., & Smoliar, S. W. (1993). Automatic partitioning of full-motion video. ACM/Springer Multimedia Systems, 1, 10–28.
3. Huang, C. L., & Liao, B. Y. (2001). A robust scene-change detection method for video segmentation. IEEE Transactions on Circuits and Systems for Video Technology, 11, 1281–1288.
4. Boreczky, J. S., & Rowe, L. A. (1996). Comparison of video shot boundary detection techniques. Proceedings of SPIE, 2670, 170–179.
5. Grana, C., & Cucchiara, R. (2007). Linear transition detection as a unified shot detection approach. IEEE Transactions on Circuits and Systems for Video Technology, 17, 483–489.
6. Patel, N. V., & Sethi, I. K. (1997). Video shot detection and characterization for video databases. Pattern Recognition, 30, 583–592.
7. Cernekova, Z., Pitas, I., & Nikou, C. (2006). Information theory-based shot cut/fade detection and video summarization. IEEE Transactions on Circuits and Systems for Video Technology, 16, 82–91.
8. Lee, M. H., Yoo, H. W., & Jang, D. S. (2006). Video scene change detection using neural network: Improved ART2. Expert Systems with Applications, 31, 13–25.
9. Cooper, M., & Foote, J. (2005). Discriminative techniques for keyframe selection. In Proceedings of ICME (pp. 502–505). Amsterdam, The Netherlands.
10. Haoran, Y., Rajan, D., & Chia, L. T. (2006). A motion-based scene tree for browsing and retrieval of compressed video. Information Systems, 31, 638–658.


11. Cooper, M., Liu, T., & Rieffel, E. (2007). Video segmentation via temporal pattern classification. IEEE Transactions on Multimedia, 9, 610–618.
12. Duan, F. F., & Meng, F. (2020). Video shot boundary detection based on feature fusion and clustering technique. IEEE Access, 8, 214633–214645.
13. Abdulhussain, S. H., Ramli, A. R., Mahmmod, B. M., Saripan, M. I., Al-Haddad, S. A. R., & Jassim, W. A. (2019). Shot boundary detection based on orthogonal polynomial. Multimedia Tools and Applications, 78(14), 20361–20382.
14. Lei, S., Xie, G., & Yan, G. (2014). A novel key-frame extraction approach for both video summary and video index. The Scientific World Journal, 1–9.
15. Bendraou, Y., Essannouni, F., Aboutajdine, D., & Salam, A. (2017). Shot boundary detection via adaptive low rank and SVD-updating. Computer Vision and Image Understanding, 161, 20–28.
16. Barbieri, T. T. S., & Goularte, R. (2014). KS-SIFT: A keyframe extraction method based on local features. In IEEE International Symposium on Multimedia (pp. 13–17). Taichung.
17. Dang, C., & Radha, H. (2015). RPCA-KFE: Key frame extraction for video using robust principal component analysis. IEEE Transactions on Image Processing, 24, 3742–3753.
18. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. Proceedings of CVPR, 1, 886–893.
19. Spagnolo, P., Orazio, T. D., Leo, M., & Distante, A. (2006). Moving object segmentation by background subtraction and temporal analysis. Image and Vision Computing, 24, 411–423.
20. Zabih, R., Miller, J., & Mai, K. (1995). A feature-based algorithm for detecting and classifying scene breaks. In Proceedings of ACM Multimedia (pp. 189–200). San Francisco, CA.
21. Singh, A., Thounaojam, D. M., & Chakraborty, S. (2020). A novel automatic shot boundary detection algorithm: Robust to illumination and motion effect. Signal, Image and Video Processing, 14(4), 645–653.
22. Subudhi, B. N., Veerakumar, T., Esakkirajan, S., & Chaudhury, S. (2020). Automatic lecture video skimming using shot categorization and contrast based features. Expert Systems with Applications, 149, 113341.
23. Shen, R. K., Lin, Y. N., Juang, T. T. Y., Shen, V. R. L., & Lim, S. Y. (2018). Automatic detection of video shot boundary in social media using a hybrid approach of HLFPN and keypoint matching. IEEE Transactions on Computational Social Systems, 5(1), 210–219.

Detection of Road Potholes Using Computer Vision and Machine Learning Approaches to Assist the Visually Challenged

U. Akshaya Devi and N. Arulanand

1 Introduction

It can be challenging for blind people to move around different places independently. The presence of potholes, curbs, and staircases hinders blind people from traveling to various places freely without having to rely on others. The need to identify potholes, curbs, and other obstacles on the pathway has led many researchers to build smart systems to assist blind people. Various smart systems incorporated in walking sticks, wearable systems, etc. have been proposed to achieve the aim of pothole detection for blind users. The proposed system is a vision-based experimental study that employs machine learning classification with computer vision techniques and a deep learning object detection model to detect potholes with improved precision and speed. In the machine learning classification with computer vision approach, the images are preprocessed, and feature extraction methods such as HOG (Histogram of Oriented Gradients) and LBP (Local Binary Pattern) are applied, under the hypothesis that fusing the HOG and LBP feature vectors will improve classification performance. Various classification models are implemented and compared using performance evaluation metrics and methodologies. The process is extended to pothole localization for the images classified as pothole images, and proof of the hypothesis, i.e., that the fusion of feature extraction methods improves the performance of the classification model, is derived. The second approach is pothole detection using a deep learning model. Through the years, deep learning has proven to provide reliable solutions to real-world problems involving computer vision and image analysis. The convolutional neural network in deep learning plays

U. Akshaya Devi () · N. Arulanand Department of Computer Science and Engineering, PSG College of Technology, Coimbatore, India © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_3


a vital role in extracting features and classifying the data precisely. In this approach, the YOLO v3 model is implemented for the pothole detection system. The results of pothole detection are analyzed, and the efficiency of the proposed system for outdoor real-time navigation by visually challenged people is studied.

2 Related Works

Mae M. Garcillanosa et al. [1] implemented a system to detect and report the presence of potholes using image processing techniques. The system was installed in a vehicle with a camera and a Raspberry Pi to monitor the pavements. The processing was performed on real-time video at a rate of 8 frames per second. Canny edge detection, contour detection, and final filtering were carried out on each video frame. The location and image of the pothole are captured when a pothole is detected, which can later be viewed. The system achieved an accuracy of 93.72% in pothole detection, but improvement was required in recognizing normal road conditions. The total processing time was 0.9967 seconds for video frames containing potholes and 0.8994 seconds for video frames with normal road conditions. Aravinda S. Rao et al. [2] proposed a system to detect potholes, staircases, and curbs using a systematic computer vision algorithm. An Electronic Travel Aid (ETA) equipped with a camera and a laser was employed to capture the pathway. The camera was mounted on the ETA at an angle of 30°–45° between the camera and the vertical axis and a distance of 0.5 meters between the camera and the pathway. The Canny edge detection algorithm and the Hough transform were used to process each frame in the video to detect the laser lines. The output of the Hough transform, which depicts the number of intersecting lines, was transformed into the Histogram of Intersections (HoI) feature. A Gaussian Mixture Model (GMM) learning model was utilized to decide whether the pathway is safe or unsafe. The system gave an accuracy of over 90% in detecting potholes. Since the system uses laser patterns to identify potholes, it can only be used during the nighttime. Kanza Azhar et al. [3] proposed a system to detect the presence of potholes for proper maintenance of roadways.
For classifying pothole/non-pothole images, the HOG (Histogram of Oriented Gradients) representation of the input image was generated. The HOG feature vector was provided to a Naïve Bayes classifier, owing to its high scalability and strong independence assumption. For the images classified as containing pothole(s), localization of the pothole(s) was carried out using graph-based segmentation with normalized cuts. The system attained an accuracy of 90%, a precision of 86.5%, a recall of 94.1%, and a processing time of 0.673 seconds. The core idea of the research work by Muhammad Haroon Yousaf et al. [4] is to detect and localize the potholes in an input image. The input image was converted from the RGB color space to grayscale and resized to 300 × 300 pixels. The system was implemented using the following steps: feature extraction, visual


vocabulary construction, histogram-of-words generation, and classification using a Support Vector Machine (SVM). The Scale-Invariant Feature Transform (SIFT) was used to represent the pavement images as a visual vocabulary of words. To train and test on the histograms of words, the support vector algorithm was applied. The system gave an accuracy of 95.7%. Yashon O. Ouma et al. [5] developed a system to detect potholes on asphalt road pavements and estimate the areal extent of the detected potholes. The Fuzzy c-means (FCM) clustering algorithm was used to partition the pixels of the image into a collection of M fuzzy cluster centers. Since FCM is prone to noise and outliers, a small weight was assigned to noisy data points and a large weight to clean data points to estimate accurate cluster centers, followed by morphological reconstruction. The clusters output by the FCM algorithm were used to characterize each region as linear cracks, non-distress areas, or no-data regions. The mean CPU runtime of the system was 95 seconds. The dice coefficient of similarity, Jaccard index, and sensitivity metric for pothole detection were 87.5%, 77.7%, and 97.6%, respectively. Byeong-ho Kang et al. [6] introduced a system that combines 2D LiDAR laser sensor-based pothole detection and vision-based pothole detection. In the 2D LiDAR-based method, the steps include filtering, clustering, line extraction, and a gradient-of-data function. In the vision-based method, the steps include noise filtering, brightness control, binarization, additive noise filtering, edge extraction, object extraction, noise filtering, and detection of potholes. The system exhibited a low error rate in the detection of potholes. Emir Buza et al. [7] proposed an unsupervised vision-based method utilizing image processing and spectral clustering for the identification and estimation of potholes.
An accuracy of 81% was obtained for the estimation of pothole regions on images with varied sizes and shapes of potholes. Ping Ping et al. [8] proposed a pothole detection system using deep learning algorithms. The models include YOLO v3 (You Only Look Once), SSD (Single Shot Detector), HOG (Histogram of Oriented Gradients) with SVM (Support Vector Machine), and Faster R-CNN (Region-based Convolutional Neural Network). The data preparation involved labeling the images by creating bounding boxes around the objects using an image labeling tool. The resulting XML data of the images were appended to a CSV file, which was used as the input file for the deep learning models. The performance comparison of the four models indicated that the YOLO v3 model performed best, with an accuracy of 82% in detecting potholes. The system proposed by Aritra Ray et al. [9] constitutes a low-power, portable embedded device that serves as a visual aid for visually impaired people. The system uses a distance sensor and a pressure sensor to detect potholes or speed breakers while the user is walking along the roadside. The device containing the sensors was attached to the walking stick, and communication takes place through voice messages. A simple mobile application was also developed that can be launched by pressing the volume-up button of the mobile phone. Using speech communication, the user can provide his/her destination and will be guided to the location using the Google Maps navigation facility. The smart


portable device attached to the foldable walking stick was assembled with the following: an ATmega328 8-bit microcontroller, an HC-SR04 ultrasonic distance sensor, a signal conditioner, a pressure sensor, a speaker, an Android device, a walking stick, a piezoelectric buzzer, and a power supply (Li-ion rechargeable batteries, 2500 mAh, AA 1.2 V × 4). The pressure sensor was attached to the bottom end of the walking stick. When the user strikes the walking stick on the ground, readings are taken from the pressure sensor as well as the ultrasonic sensor. Using the Pythagorean theorem, a predefined value is set for the distance that the ultrasonic sensor should sense on level ground. If the currently sensed value exceeds that predefined value, the system informs the user that there is an obstacle such as a pothole; if the currently sensed value is less than the predefined value, the system informs the user that there is an obstacle such as a speed breaker. The sensitivity of object detection exceeded 96%. It can be noted from the previous works that there is scope for improvement in detection accuracy as well as processing speed, and that the false-negative outcomes in the detection results can be reduced. Most of the related works target periodic assessment and maintenance of roadways, where the system can afford a high runtime, whereas pothole detection for the visually challenged requires the system to perform with high speed and accuracy and to swiftly alert the user whenever a pothole is detected. Thus, the main idea behind the proposed approach is to develop a precise, fast pothole detection system that is effective and beneficial for the visually challenged. Two approaches (a machine learning algorithm with computer vision techniques and a deep learning model) were implemented using suitable machine learning and deep learning models for real-time pothole detection.
The system is trained with pothole images of various shapes and textures to provide a broad solution. In the case of machine learning algorithm and computer vision approach, the system performs localization of pothole region only if the image is classified as a pothole image. This step helps in improving the computational efficiency of the system as it reduces the number of false-negative outcomes by the system.

3 Methodologies

3.1 Pothole Detection Using Machine Learning and Computer Vision

The first approach of pothole detection comprises image processing techniques to preprocess the input data, computer vision algorithms such as HOG (Histogram of Oriented Gradients) and LBP (Local Binary Patterns) feature descriptors to extract the feature set, and machine learning classifiers to classify pothole/non-pothole images. The system architecture is shown in Fig. 1. The various feature extraction methods, machine learning algorithms, and evaluation methodologies are described briefly in the following subsections.

Detection of Road Potholes Using Computer Vision and Machine Learning. . .

Fig. 1 System architecture


HOG Feature Descriptor

The HOG (Histogram of Oriented Gradients) feature descriptor is a well-known algorithm for extracting the important features used to build image detection and recognition systems. The input image of the HOG feature descriptor must be of standard size and color scale (grayscale). The horizontal (gx) and vertical (gy) gradients are computed by filtering the image with the kernels Mx and My as in Eqs. 1 and 2:

Mx = [−1 0 1]                                            (1)

My = [−1 0 1]^T                                          (2)

Equations 3 and 4 are used to determine the gradient magnitude "g" and gradient angle "θ":

g = sqrt(gx^2 + gy^2)                                    (3)

θ = tan^−1(gy/gx)                                        (4)
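Equations 1–4 can be sketched in NumPy. The zero-padded borders and the sample ramp image below are illustrative choices:

```python
import numpy as np

def gradient_magnitude_angle(img):
    """Per-pixel gradient magnitude g and orientation theta (Eqs. 1-4),
    using the 1-D kernels Mx = [-1 0 1] and My = Mx^T; borders are zero-padded."""
    img = img.astype(float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # horizontal gradient (Eq. 1)
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # vertical gradient (Eq. 2)
    g = np.sqrt(gx ** 2 + gy ** 2)           # Eq. 3
    theta = np.degrees(np.arctan2(gy, gx))   # Eq. 4 (arctan2 avoids division by zero)
    return g, theta

ramp = np.tile(np.arange(5.0), (5, 1))       # intensity increases left to right
g, theta = gradient_magnitude_angle(ramp)
print(g[2, 2], theta[2, 2])                  # 2.0 0.0 (purely horizontal gradient)
```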

Assume that the images are resized to a standard size of 200 × 152 pixels and the parameters pixels per cell, cells per block, and number of orientations are set to (8, 8), (2, 2), and 9, respectively. Thereby, each image is divided into 475 (25 × 19) nonoverlapping cells of 8 × 8 pixels. In each cell, the magnitude values of the 64 pixels are binned and cumulatively added into nine buckets of the gradient orientation histogram (Fig. 2).

Fig. 2 Gradient orientation histogram with nine bins (orientations); gradient magnitude plotted against orientation (0–180°)


Fig. 3 LBP feature extraction

A block of 2 × 2 cells is slid across the image. In every block, the corresponding histograms of the four cells are normalized into a 36 × 1 element vector. This process is repeated until the feature vector of the entire image is computed. The prime benefit of the HOG feature descriptor is its capability of extracting basic yet meaningful information about an object, such as its shape and outline. It is simpler and faster to compute, though less powerful, than deep learning object detection models.

LBP Feature Descriptor

The Local Binary Patterns (LBP) feature descriptor is mainly used for texture classification. To compute the LBP feature vector, neighborhood thresholding is performed for each pixel in the image, and the existing pixel value is replaced with the threshold result. For example, the image is divided into 3 × 3 pixel cells as shown in Fig. 3. The pixel value of each of the eight neighbors is compared with the value of the center pixel (value = 160). If the value of the center pixel is greater than the pixel value of the neighbor, the neighboring pixel takes the value "0"; otherwise, it is "1." The resultant eight-bit binary code is converted into a decimal number and stored as the center pixel value. This procedure is applied to all the pixels in the input image. A histogram is then computed for the image with 256 bins (values 0 to 255), where each bin denotes the frequency of that value. The histogram is normalized to obtain a one-dimensional feature vector. The main advantages of the LBP feature descriptor are its computational simplicity and discriminative nature. Such properties in a feature descriptor are highly useful in real-time settings.
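The neighborhood-thresholding rule above can be sketched for a single 3 × 3 patch. The clockwise, MSB-first bit-packing order below is one common convention (implementations vary), and the sample pixel values are illustrative:

```python
import numpy as np

def lbp_code(patch):
    """LBP code for the centre pixel of a 3x3 patch: each neighbour, read
    clockwise from the top-left, contributes 1 if its value is >= the centre
    value, and the 8 bits are packed MSB-first into one byte."""
    center = patch[1, 1]
    # Clockwise order: TL, T, TR, R, BR, B, BL, L
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    code = 0
    for v in neighbors:
        code = (code << 1) | (1 if v >= center else 0)
    return code

patch = np.array([[6, 5, 2],
                  [7, 6, 1],
                  [9, 8, 7]])
print(lbp_code(patch))  # bits 1,0,0,0,1,1,1,1 -> 143
```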


Machine Learning Models

The machine learning models employed in the system are Adaboost, Gaussian Naïve Bayes, Random Forest, and Support Vector Machine.

The Adaboost classifier is an iterative boosting algorithm that combines multiple weak classifiers to obtain an accurate strong classifier. It is trained iteratively by selecting the training set based on the accuracy of the previous round's predictions. The weights are set randomly during the first iteration and are updated in each successive iteration based on the classification accuracy of the previous iteration. This process continues until the maximum number of estimators is reached. Combining a set of weak learners in this way generates a strong learner with better classification accuracy and lower generalization error.

The Naïve Bayes classifier is based on Bayes' theorem with an assumption of strong independence between the features. It is well suited to real-time applications due to its simplicity and fast predictions.

The Support Vector Machine (SVM) algorithm is a supervised learning algorithm used for both classification and regression problems. The SVM algorithm aims to create a decision boundary, called a hyperplane, that segregates the n-dimensional feature space so as to distinctly classify the data points. The SVM classifier chooses the extreme points/vectors, called support vectors, to create the hyperplane; hence, the algorithm is termed Support Vector Machine.

The Random Forest algorithm is an ensemble learning method used for classification and regression problems. A Random Forest model applies several decision trees to random subsets of the dataset, and enhanced prediction accuracy is obtained by combining the results of the individual decision trees. The model provides high prediction accuracy and limits overfitting to some extent.
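As a concrete illustration, the four classifiers can be instantiated with scikit-learn. The synthetic feature matrix below stands in for the extracted HOG/LBP vectors, and the hyperparameter values are only indicative, not the authors' tuned configuration:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Synthetic stand-in for HOG(+LBP) feature vectors: 200 samples, 64 features
rng = np.random.RandomState(0)
X = rng.randn(200, 64)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # 1 = pothole, 0 = non-pothole

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "Adaboost": AdaBoostClassifier(n_estimators=200, learning_rate=0.2),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=5),
    "SVM (RBF)": SVC(kernel="rbf", C=100, gamma=1),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, round(model.score(X_te, y_te), 3))
```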
Performance Evaluation

The evaluation of a machine learning model is an important step in measuring the effectiveness of the model for a given problem. Accuracy, precision, recall, and F1 score are the performance metrics used in this work to determine the performance of a model. They are computed as shown in Eqs. 5–8:

Accuracy = (TP + TN) / (TP + TN + FP + FN)                       (5)

Precision = TP / (TP + FP)                                       (6)

Recall = TP / (TP + FN)                                          (7)

F1 score = 2 × (precision × recall) / (precision + recall)       (8)
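The four metrics can be computed directly from the confusion-matrix counts; the counts in the example below are illustrative:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 score (Eqs. 5-8)
    from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Example: 90 TP, 85 TN, 10 FP, 15 FN
acc, prec, rec, f1 = classification_metrics(90, 85, 10, 15)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# -> 0.875 0.9 0.857 0.878
```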

Here TP, TN, FP, and FN denote the true positive, true negative, false positive, and false negative instances, respectively. Accuracy is the fraction of correctly predicted instances out of all instances. Precision quantifies how many of the positively predicted instances actually belong to the positive class. Recall quantifies how many of the positive instances in the dataset are correctly predicted. The F1 score, also called F-score or F-measure, provides a single score that balances precision and recall; it can be described as a weighted average of the two.

In addition to these metrics, the AUC-ROC curve is plotted for the binary classifiers. The ROC curve (Receiver Operating Characteristic curve) is a probability curve that plots the TPR (true-positive rate, or recall) against the FPR (false-positive rate). The AUC (Area Under the Curve) score quantifies the capability of the model to distinguish between the positive and negative classes. The score ranges from 0.0 to 1.0, where 0.0 denotes a model unable to distinguish between the classes and 1.0 denotes a model that separates them perfectly.

Localization of Potholes

The three steps in the localization of potholes are pre-segmentation using k-means clustering, construction of a Region Adjacency Graph (RAG), and a normalized graph cut. In the pre-segmentation stage, the image is segmented using k-means clustering; this step yields the centroid of each segmented cluster. In the second step, the Region Adjacency Graph is constructed using mean colors. The obtained clusters are represented as nodes, where any two adjacent nodes are connected by an edge in the RAG. Nodes that are similar in color are merged, and each edge weight is set to the difference between the mean RGB colors of the adjacent nodes.
On the Region Adjacency Graph, a two-way normalized cut is performed recursively in step 3. Thereby, the result contains a set of nodes where any two points in the same node have a high degree of similarity and any two points in different nodes have a high degree of dissimilarity. As a result, the pothole region can be clearly differentiated from the other regions of the image.
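A minimal sketch of the RAG-construction step, assuming a precomputed label map from the pre-segmentation stage; the full method additionally merges similar nodes and applies a recursive normalized cut, both omitted here for brevity:

```python
import numpy as np

def build_rag(labels, image):
    """Region Adjacency Graph: nodes are segment labels, and each edge weight
    is the Euclidean distance between the two regions' mean RGB colours."""
    means = {l: image[labels == l].mean(axis=0) for l in np.unique(labels)}
    edges = {}
    # 4-connected adjacency: compare each pixel with its right/bottom neighbour
    for dr, dc in ((0, 1), (1, 0)):
        a = labels[: labels.shape[0] - dr, : labels.shape[1] - dc]
        b = labels[dr:, dc:]
        for la, lb in zip(a.ravel(), b.ravel()):
            if la != lb:
                key = (min(la, lb), max(la, lb))
                edges[key] = np.linalg.norm(means[la] - means[lb])
    return means, edges

labels = np.array([[0, 0, 1],
                   [0, 0, 1],
                   [2, 2, 2]])
image = np.zeros((3, 3, 3))
image[labels == 1] = [255, 0, 0]  # region 1 is red, the others are black
means, edges = build_rag(labels, image)
print(sorted(edges))  # adjacent region pairs: [(0, 1), (0, 2), (1, 2)]
```

In the full pipeline, edges whose weight falls below a similarity threshold would be contracted (merging the nodes) before the normalized cut separates the remaining regions.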

3.2 Pothole Detection Using Deep Learning Model

YOLO V3 Model

The YOLO (You Only Look Once) object detection algorithm is based on a single deep convolutional neural network, the Darknet-53 architecture. The YOLO v3 model frames object detection as a single regression problem in which one neural network is trained on the entire image. An input image is split into an S × S grid as shown in Fig. 4. For every grid cell, B bounding boxes and a confidence score for each bounding box are predicted. The confidence score of a bounding box, as in Eq. 9, is the product of the probability that an object is present in the box and the Intersection over Union (IOU) between the predicted box and the ground truth:

Confidence Score = P(Object) × IOU                               (9)

Fig. 4 YOLO v3 model (reproduced from Joseph Redmon et al. 2016) [12]

For each bounding box, the values x, y, width, height, and confidence score are predicted. The x and y values represent the center coordinates of the bounding box with respect to the grid cell. The product of the conditional class probabilities (P(Class_i | Object)) and the individual bounding box confidence scores gives the confidence score of each class in the bounding box. This score indicates the probability of the presence of a class in the box and how well the predicted box fits the object. The main advantages of the YOLO object detection algorithm are fast processing of images in real time and few false detections.
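Eq. 9 can be sketched with a standard corner-format IoU helper; the box coordinates and probability in the example are illustrative:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def confidence_score(p_object, predicted, truth):
    """Eq. 9: confidence = P(Object) * IOU(predicted, truth)."""
    return p_object * iou(predicted, truth)

pred = (10, 10, 50, 50)
gt = (30, 10, 70, 50)                 # half of each box overlaps the other
print(iou(pred, gt))                  # -> 0.3333...
print(confidence_score(0.9, pred, gt))  # -> 0.3
```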

4 Implementation

The proposed work was implemented on an Intel Core i5 1.60 GHz CPU with 8 GB RAM. To implement the machine learning models with computer vision techniques, the Jupyter notebook Web application was used to write and execute the Python code. A pothole detection dataset from Kaggle was used as dataset 1 [10]. The size of this dataset was 197 MB, containing 320 pothole images and 320 non-pothole images (640 images in total). Dataset 2 was created manually (using Google image search) with 504 images consisting of 252 pothole images and 252 non-pothole images. The size of this dataset was 14.5 MB.

OpenCV (Open Source Computer Vision Library) is an open-source library that is mainly used for programming real-time applications involving image processing and computer vision models. In this work, the OpenCV library was used to read an image from the source directory, convert it from RGB to grayscale, resize the image, and filter the image. To ensure that all the images have a standard size, the images were resized to 128 × 96 pixels; since the images are divided into 8 × 8 patches during the feature extraction stage, this size is preferable. The RGB images in the dataset contain three layers of pixel values ranging from 0 to 255 and are hence computationally expensive to process. Thus, RGB-to-grayscale conversion was performed to reduce the computational complexity.

The Gaussian filter (Gaussian blur) is a widely used image filtering technique to reduce noise and intricate details. This low-pass blurring filter, which smooths edges and removes noise from an image, is considered efficient for thresholding, edge detection, and finding contours in an image. It therefore improves the efficiency of the pothole localization procedure during region clustering and construction of the Region Adjacency Graph (RAG). A Gaussian filter of kernel size 5 was applied to each image.

The pathway/road in the input image is the only portion required to determine the presence of potholes (Fig. 5a). The remaining portion of the image was selected as a polygonal region, and its pixel values were set to 0 (Fig. 5b). Therefore, the portion of the road/pathway was selected as the region of interest. The HOG features and a fusion of HOG and LBP features were extracted. These features were applied to various machine learning classifiers to classify the pothole and non-pothole images.

Fig. 5 (a) Selected region of interest (ROI). (b) After setting the pixels of the region external to the ROI to 0
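The grayscale conversion and ROI masking steps can be sketched in plain NumPy (the chapter itself uses OpenCV). A rectangular ROI stands in for the polygonal road region, and the BT.601 luminance weights are one common convention:

```python
import numpy as np

def to_grayscale(rgb):
    """Luminosity-weighted RGB -> grayscale (ITU-R BT.601 weights)."""
    return rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114

def mask_roi(gray, top, bottom, left, right):
    """Zero out every pixel outside a rectangular region of interest."""
    out = np.zeros_like(gray)
    out[top:bottom, left:right] = gray[top:bottom, left:right]
    return out

rgb = np.full((96, 128, 3), 100.0)        # dummy 128x96 frame
gray = to_grayscale(rgb)                  # every pixel -> 100.0
roi = mask_roi(gray, 48, 96, 0, 128)      # keep only the lower half (the road)
print(gray[0, 0], roi[0, 0], roi[60, 0])  # 100.0 0.0 100.0
```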
The Adaboost, Gaussian Naïve Bayes, Random Forest, and Support Vector Machine algorithms were selected for the proposed work. The train/test set was split in the ratio of 70:30. To find optimum parameters for the classifiers, the grid search algorithm was used. The grid search algorithm chooses the hyperparameters by performing an exhaustive search over the set of parameters given for the classification model. It estimates the performance of every combination of the given parameters and chooses the best-performing combination of hyperparameters. The RBF (radial basis function) kernel SVM was selected using the grid search method, with the hyperparameters C and gamma set to 100 and 1, respectively. For the Random Forest classifier, the hyperparameters n_estimators (total number of trees), criterion, max_depth (maximum depth of the tree), min_samples_leaf (minimum number of instances needed at a leaf node), and min_samples_split (minimum number of instances needed to split an internal node) were set to 100, "gini" (Gini impurity), 5, 5, and 5, respectively. For the Adaboost classifier, the hyperparameters n_estimators (maximum number of estimators) and learning rate were set to 200 and 0.2, respectively. The machine learning classifiers were evaluated using a cross-validation method. Subsequently, pothole localization was performed for the positively predicted images.

To implement the deep learning model, the Google Colaboratory notebook with a single GPU was utilized. The size of the initial dataset was 270 MB with 1106 labeled pothole images [11]. Image data augmentation techniques, which process and modify the original image to create variations of that image, were employed on the images of the initial dataset. Techniques such as horizontal flip, change of image contrast, and incorporation of Gaussian noise were adopted to synthetically expand the size of the dataset. The resultant images of the various data augmentation operations are shown in Fig. 6. Data augmentation benefits the deep learning model because larger training data leads to enhanced generalization of the neural network, reduced overfitting, and improved real-time detection. The dataset obtained after data augmentation was 773 MB in size with 2500 pothole images and 2500 non-pothole images. The size of the input images was 416 × 416 pixels.
The object labels in each image were represented using a text file containing five parameters: object class, x-center, y-center, width, and height. The object class is an integer given to each object, with values from 0 to (number of classes − 1). The x-center, y-center, width, and height are float values relative to the width and height of the image. The dataset was split into train/test sets in the ratio of 70:30. The number of iterations was set to 6000, and the batch sizes for training and testing were set to 64 and 1, respectively. Performance metrics such as precision, recall, F1 score, mean Average Precision (mAP), and prediction time were measured on the test set.
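The label format described above can be sketched as a small conversion helper; the pixel-space box in the example is illustrative:

```python
def to_yolo_label(cls, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel-space box (x1, y1, x2, y2) to the YOLO text format:
    'class x-center y-center width height', all normalised to [0, 1]."""
    xc = (x1 + x2) / 2 / img_w
    yc = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{cls} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# A pothole box occupying the centre quarter of a 416x416 image
print(to_yolo_label(0, 104, 104, 312, 312, 416, 416))
# -> "0 0.500000 0.500000 0.500000 0.500000"
```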

5 Result Analysis

In the machine learning and computer vision approach, the classification report comprising accuracy, precision, recall, and F1 score was generated and tabulated (Table 1) for all the models. To estimate the classification models accurately, the k-fold cross-validation method was utilized. In k-fold cross-validation, the dataset is divided into k equal-sized partitions. The classifier is trained on k − 1 partitions, and the remaining partition is used for testing; the score over the k runs is averaged and used for performance estimation. In this work, the machine learning models were evaluated using the 10-fold cross-validation method.
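The k-fold procedure can be sketched as an index-splitting helper (libraries such as scikit-learn provide equivalent utilities):

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Split shuffled sample indices into k near-equal folds; each fold serves
    once as the test set while the remaining k-1 folds train the model."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

splits = list(kfold_indices(100, 10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 90 10
```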


Fig. 6 (a) Original image and the resultant images of (b) horizontal flip, (c) contrast change, and (d) Gaussian noise addition

Table 1 Classification performance report for different feature sets as input (dataset 1)

                 HOG features                              Combination of HOG and LBP features
ML classifiers   Accuracy  Precision  Recall  F1 score     Accuracy  Precision  Recall  F1 score
Adaboost         86.66%    87%        87%     87%          96.66%    97%        97%     97%
Naïve-Bayes      85.33%    87%        85%     85%          90.66%    91%        91%     91%
Random Forest    87.33%    88%        87%     87%          95.33%    96%        95%     95%
SVM              90.66%    91%        91%     91%          92.66%    93%        93%     93%

The average accuracy scores obtained using 10-fold cross-validation for all the classifiers are shown in Table 2. The ROC curves (Receiver Operating Characteristic curves) are plotted for the models that use only the HOG feature set and the models that use the fusion of HOG and LBP feature sets (Fig. 7). The AUC scores computed from the ROC curves are tabulated in Table 3. It can be noted that the AUC score for every classification model that uses the fused HOG and LBP feature set is above 90%. This performance improvement shows that adopting the fused HOG and LBP features for classification helps achieve better results.

Table 2 Average accuracy computed with the 10-fold cross-validation method

                              Dataset 1                           Dataset 2
Machine learning classifiers  HOG       Combination of HOG        HOG       Combination of HOG
                              features  and LBP features          features  and LBP features
Adaboost                      85.16%    94.38%                    89.18%    87.33%
Naïve-Bayes                   82.83%    89.97%                    88.58%    83.5%
Random Forest                 85%       93.58%                    88.77%    82%
SVM                           88.57%    87.57%                    84.16%    85.16%


Table 3 Comparison of AUC scores obtained using different feature sets

                              Dataset 1                           Dataset 2
Machine learning classifiers  HOG       Combination of HOG        HOG       Combination of HOG
                              features  and LBP features          features  and LBP features
Adaboost                      93.28%    99.24%                    86.03%    95.78%
Naïve-Bayes                   88.58%    95.06%                    87.01%    92.17%
Random Forest                 95.51%    98.81%                    90.86%    95.24%
SVM                           95.44%    97.54%                    90.09%    94.22%

Fig. 7 ROC curves for the classification models that use (a) the HOG feature set extracted from images of dataset 1, (b) the fusion of HOG and LBP feature sets extracted from images of dataset 1, (c) the HOG feature set extracted from images of dataset 2, and (d) the fusion of HOG and LBP feature sets extracted from images of dataset 2

The Adaboost classification algorithm shows the best performance among all the classifiers. Further, the exact location of the pothole region must be determined and highlighted. Therefore, normalized graph cut segmentation using the RAG (Region Adjacency Graph) was employed for pothole localization in positively classified images. Figures 8 and 9 depict the process and result of pothole detection using classification and localization.

In the deep learning approach, the detection results of the YOLO v3 model on the test data are shown in Table 4. The prediction time for YOLO v3 was 26.90 milliseconds. The sample output of pothole detection by the YOLO v3 model is shown in Fig. 10.

Based on the outcome of classification using HOG features and the fusion of HOG and LBP features, it is evident that the fused HOG and LBP features improve the classification performance of the machine learning models. The classification results convey that the Adaboost classifier with the fused HOG and LBP feature set outperforms all the other classifiers. For creating a bounding box around the pothole region, localization of potholes was performed using a normalized graph cut on the RAG (Region Adjacency Graph). The overall detection time for this approach is approximately 0.35 seconds.


Fig. 8 A step-by-step illustration of normalized graph cut segmentation using Region Adjacency Graph (RAG)

Fig. 9 Sample output of Adaboost classification and normalized graph cut segmentation using RAG for detection of potholes

Table 4 Results of YOLO v3 model detection

Metric     Resultant value
Precision  83%
Recall     87%
F1 score   85%
mAP        88.01%

This approach of pothole detection does not require a high-performance processor such as a GPU to run smoothly. However, the results contain a few false positives during the localization of potholes. The YOLO v3 model achieved a mean Average Precision (mAP) of 88.01% and a fast inference time for detecting the pothole(s) in an image. With a prediction time of 26.90 milliseconds, the model can process up to 37 frames per second. However, the requirement for higher processing power and disk space makes the model unsuitable for low-power edge devices.


Fig. 10 Sample output of pothole detection performed using YOLO v3 model for pothole and non-pothole images

6 Conclusion

A system to detect potholes in pathways/roadways can be highly useful and convenient for visually challenged people. This work presented two different approaches for pothole detection. In the machine learning and computer vision approach, we introduced a fusion of features by combining HOG and LBP features to enhance the classification performance. The results of the various classifiers showed improved performance with the use of the fused HOG and LBP features. Among all classifiers, the Adaboost algorithm with the fusion of HOG and LBP features attained the highest accuracy of 96.6% in classifying pothole and non-pothole images. For detecting the exact location of potholes, normalized graph cut segmentation with a Region Adjacency Graph was implemented. The inference time to detect potholes using the Adaboost classifier with the segmentation algorithm was approximately 0.35 seconds, although a few false-positive outcomes were present during the localization of potholes. In the deep learning approach, the YOLO v3 model exhibited a favorable outcome with an mAP of 88.01% and a rapid prediction time of 26.90 milliseconds. However, this model requires processors with high computation power. Therefore, under cost and computation power constraints, the Adaboost classifier and normalized graph cut using RAG can be realized in a real-time pothole detection system on a low-power and economical microprocessor. It can also be integrated with stereo camera/dual camera technology that generates a depth map, thereby detecting the presence of potholes more accurately and reducing false-positive predictions. In the absence of cost and power limitations, the YOLO v3 model can be employed in GPU-based embedded devices.


References

1. Garcillanosa, M. M., Pacheco, J. M. L., Reyes, R. E., & San Juan, J. J. P. (2018). Smart detection and reporting of potholes via image-processing using Raspberry-Pi microcontroller. In 10th international conference on knowledge and smart technology (KST), Chiang Mai, Thailand, 31 Jan–3 Feb 2018.
2. Rao, A. S., Gubbi, J., Palaniswami, M., & Wong, E. (2016). A vision-based system to detect potholes and uneven surfaces for assisting blind people. In IEEE international conference on communications (ICC), Kuala Lumpur, Malaysia, 22–27 May 2016.
3. Azhar, K., Murtaza, F., Yousaf, M. H., & Habib, H. A. (2016). Computer vision based detection and localization of potholes in asphalt pavement images. In IEEE Canadian conference on electrical and computer engineering (CCECE), Vancouver, BC, Canada, 15–18 May 2016.
4. Yousaf, M. H., Azhar, K., Murtaza, F., & Hussain, F. (2018). Visual analysis of asphalt pavement for detection and localization of potholes. Advanced Engineering Informatics, 38, 527–537.
5. Ouma, Y. O., & Hahn, M. (2017). Pothole detection on asphalt pavements from 2D-colour pothole images using fuzzy c-means clustering and morphological reconstruction. Automation in Construction, 83, 196–211.
6. Kang, B.-H., & Choi, S.-I. (2017). Pothole detection system using 2D LiDAR and camera. In Ninth international conference on ubiquitous and future networks (ICUFN), Milan, Italy, 4–7 July 2017.
7. Buza, E., Omanovic, S., & Huseinovic, A. (2013). Pothole detection with image processing and spectral clustering. In Recent advances in computer science and networking, 2013.
8. Ping, P., Yang, X., & Gao, Z. (2020). A deep learning approach for street pothole detection. In IEEE sixth international conference on big data computing service and applications, Oxford, UK, 3–6 Aug 2020.
9. Ray, A., & Ray, H. (2019). Smart portable assisted device for visually impaired people. In International conference on intelligent sustainable systems (ICISS), Palladam, India, 21–22 Feb 2019.
10. Atulya Kumar. (2020). Kaggle pothole detection dataset. https://www.kaggle.com/atulyakumar98/pothole-detection-dataset
11. Atikur Rahman Chitholian. (2020). YOLO v3 pothole detection dataset. https://public.roboflow.com/object-detection/pothole
12. Redmon, J. (2016). You only look once: Unified, real-time object detection. In IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.

Shape Feature Extraction Techniques for Computer Vision Applications

E. Fantin Irudaya Raj and M. Balaji

E. F. I. Raj (): Department of Electrical and Electronics Engineering, Dr. Sivanthi Aditanar College of Engineering, Tiruchendur, Tamil Nadu, India
M. Balaji: Department of Electrical and Electronics Engineering, SSN College of Engineering, Chennai, Tamil Nadu, India

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_4

1 Introduction

The act of identifying similar objects in a digital image is defined as object recognition in computer vision [1]. Many adversaries, such as rotations, variations in pose, poor illumination, scaling, occlusion, and so on, make shape-based object recognition difficult [2]. Several methods have been developed to improve the accuracy and ease of recognition of shape-based objects. Matching is a key aspect of the digital object recognition system and one of the major concerns in object recognition [3]. A primary objective of matching is to measure, compare, and validate image data in order to perform accurate recognition. In object recognition, the matching process entails some form of search, during which the set of extracted features is compared to the set of features stored in a database for detection. Appropriate and equivalent features must be extracted from the input image data to complete this task [4].

Numerous methods and approaches for object recognition in computer vision applications are discussed in the literature [5–8]. Translation- and rotation-invariant qualities are critical in most image classification tasks and must be addressed in the image retrieval features of any strategy for object recognition. The shape-based object recognition procedure is divided into three steps: (a) data preprocessing, (b) feature extraction, and (c) classification of digital images. Image data is preprocessed in the preprocessing stage to make it clearer or noise-free for the feature extraction procedure. There are numerous sorts of filtering techniques used


in this stage to improve image quality by reducing noise and making images clearer so that the relevant features can be measured [9]. The feature extraction stage extracts features from the preprocessed images to make the recognition task easier and more accurate. Many feature extraction techniques are available to extract the important features of the object present in the image. The retrieved features are then saved in a database. The classifier then utilizes this database to search for and identify a comparable image based on the input image attributes. Among all of these procedures, feature extraction is one of the most important for making object detection simpler and more precise [10].

Shape feature extraction is important in various applications, including (1) shape retrieval, (2) shape recognition and classification, (3) shape alignment and registration, and (4) shape estimation and simplification. Shape retrieval is the process of looking for full shapes that are identical to a query shape in a large database of shapes [11]. In general, all shapes that are within a specific distance of the query, or the first few shapes with the shortest distance, are retrieved. Shape recognition and classification is the process of determining whether a given shape matches a model well, or which database class is the most similar. The process of converting or interpreting one shape to match other shapes completely or partially is known as shape alignment and registration [12]. Estimation and simplification of shapes reduce the number of elements (points, segments, etc.) while maintaining similarity to the original.

2 Feature Extraction

The layout, texture, color, and shape of an object are used by the majority of image retrieval systems. The shape of an object defines its physical structure and can be depicted by moments, regions, borders, and so on. These depictions can be used to recognize objects, match shapes, and calculate shape dimensions. The structural patterns of surfaces of cloth, grass, grain, and wood are examples of texture. Normally, texture refers to the repetition of basic texture elements known as texels. A texel is made up of many pixels placed in a random or periodic pattern. Artificial textures are usually periodic or deterministic, whereas natural textures are often random. Linear, uneven, smooth, fine, and coarse textures are all possible.

Texture can be divided into two types in image analysis: statistical and structural. In the statistical approach, the textures are random. In the structural approach, textures are entirely structural and predictable, repeating according to certain deterministic or random placement principles. Another approach proposed in the literature combines statistical and structural analysis; these are called mosaic models, and they represent a random geometric process. Although texture, color, and shape are key aspects of image retrieval, they are rendered useless when the image in the database or the input image lacks such qualities; an example is a query image that is a light sketch containing only white and black lines.

Shape Feature Extraction Techniques for Computer Vision Applications


Translation and rotation invariance must be taken into consideration when selecting features for image retrieval in most object recognition tasks. Two primary methodologies are used to achieve invariance: (a) invariant features and (b) image alignment. The invariant feature methodology uses image qualities that do not change when the object is rotated or translated. Although this method is more commonly employed, it is still reliant on geometric features. In the image alignment approach, the object recognition algorithm transforms the image so that the object in the image is positioned in a defined standard position. The extraction of geometric information such as boundary curvature is the mainstay of this method. Segmentation is required when an image contains numerous objects. In general, the term "feature" has a highly application-dependent definition; a feature is the result of applying certain projection operations to the input data stream. To execute the object recognition task effectively, the extracted feature is compared to the feature data stored in the image database. Numerous feature extraction techniques have been developed and explained in the literature to make shape-based object identification faster and more efficient. The following sections cover some of the most important ones.

3 Various Techniques in Feature Extraction

3.1 Histograms of Edge Directions

Kim et al. [13] propose a novel watermarking algorithm for grayscale text document images. Edge image matching is a common comparison technique in computer vision and image retrieval. The edge direction histogram is an important tool for object detection in images where color information is absent or nearly identical [14]. For this feature extraction, edges are extracted using the Canny edge operator, and the corresponding edge directions are then quantized into 72 bins of 5° each [15]. Histograms of edge directions (HED) can also be used to represent shapes.
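As an illustration of this idea (not the implementation from [13–15]), the following NumPy sketch bins gradient orientations at strong-edge pixels into 72 bins of 5° each; a simple magnitude threshold stands in for the Canny operator, and the function name and threshold value are assumptions:

```python
import numpy as np

def edge_direction_histogram(gray, mag_thresh=0.1, n_bins=72):
    """Histogram of edge directions (HED): quantize the gradient
    orientation at strong-edge pixels into n_bins bins (5 deg each
    for n_bins=72), normalized for image-to-image comparison."""
    gray = gray.astype(float)
    gy, gx = np.gradient(gray)                # central-difference gradients
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 360.0
    strong = mag > mag_thresh * mag.max()     # crude stand-in for Canny edges
    hist, _ = np.histogram(ang[strong], bins=n_bins, range=(0, 360))
    return hist / max(hist.sum(), 1)

# toy image with a single vertical step edge: all edge energy falls
# into the 0-degree bin
img = np.zeros((32, 32))
img[:, 16:] = 255.0
h = edge_direction_histogram(img)
```

Because the histogram is normalized, two such vectors can be compared directly with a bin-wise distance during retrieval.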

3.2 The Harris Corner Detector

This detector has been utilized in a wide range of image matching applications, demonstrating its effectiveness for efficient motion tracking [16]. Although such feature detectors are commonly referred to as corner detectors, they are capable of detecting any image region with significant gradients in all directions at a predetermined scale. This approach is ineffective for matching images of varying sizes because it is sensitive to variations in image scale.
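The underlying corner measure can be sketched as R = det(M) − k·trace(M)², where M is the structure tensor accumulated over a local window. The snippet below is a simplified, pure-NumPy illustration (a box window instead of the usual Gaussian; the function name and k = 0.04 are conventional assumptions, not details from [16]):

```python
import numpy as np

def harris_response(gray, k=0.04, r=1):
    """Harris corner response R = det(M) - k * trace(M)^2 per pixel,
    where M is the structure tensor summed over a (2r+1)^2 window."""
    gray = gray.astype(float)
    gy, gx = np.gradient(gray)

    def box(a):                                # unnormalized box filter
        p = np.pad(a, r, mode='edge')
        return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
                   for i in range(2 * r + 1) for j in range(2 * r + 1))

    Ixx, Iyy, Ixy = box(gx * gx), box(gy * gy), box(gx * gy)
    det = Ixx * Iyy - Ixy ** 2
    return det - k * (Ixx + Iyy) ** 2

# L-shaped corner: large gradients in both directions near (10, 10)
img = np.zeros((21, 21))
img[10:, 10:] = 1.0
R = harris_response(img)
```

R is strongly positive at the corner, negative along straight edges, and near zero in flat regions, which is exactly the behavior exploited for corner selection.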


E. F. I. Raj and M. Balaji

3.3 Scale-Invariant Feature Transform

In [17, 18], the authors created the scale-invariant feature transform (SIFT) by combining the concepts of feature-based and histogram-based image descriptors. Image data is transformed into scale-invariant coordinates relative to local features using this transformation. This method provides many features that cover the image at all scales and locations, which is an essential trait. A typical image with a resolution of 500 × 500 pixels will yield around 2000 stable features (although this depends on the image's content). For object recognition, the selection of features is especially important. Before being utilized for image recognition and matching, SIFT features are extracted from a set of reference images and stored in a database. A new image is matched by comparing each feature in the new image to the database and finding matching features based on the Euclidean distance between their feature vectors.
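The Euclidean nearest-neighbour matching step can be sketched as below. This is a generic illustration over toy 2-D descriptors (real SIFT descriptors are 128-dimensional), and the 0.8 ratio-test threshold is a commonly used value that is assumed here rather than taken from the chapter:

```python
import numpy as np

def match_descriptors(query, database, ratio=0.8):
    """Match each query descriptor to its Euclidean nearest neighbour
    in the database, keeping only matches whose nearest distance is
    clearly smaller than the second-nearest (Lowe's ratio test)."""
    matches = []
    for i, q in enumerate(query):
        d = np.linalg.norm(database - q, axis=1)   # distance to every stored feature
        j1, j2 = np.argsort(d)[:2]
        if d[j1] < ratio * d[j2]:
            matches.append((i, int(j1)))
    return matches

db = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])  # stored reference features
q = np.array([[0.1, 0.0],    # close to db[0] -> unambiguous match
              [5.0, 0.0]])   # equidistant from two entries -> rejected
print(match_descriptors(q, db))   # [(0, 0)]
```

Rejecting ambiguous matches this way discards features whose nearest and second-nearest database entries are almost equally close, which greatly reduces false matches.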

3.4 Eigenvector Approaches

In the eigenvector technique, every image is described by a minimal number of coefficients, kept in a database, and processed efficiently for object recognition [19, 20]. Although it is an effective strategy, it does have certain disadvantages. The primary disadvantage is that any change in pixel values induced by manipulations such as translation, rotation, or scaling will modify the image's eigenvector representation. To address this issue, the eigenspace is calculated while taking all possible alterations into account [21].

3.5 Angular Radial Partitioning

In the angular radial partitioning (ARP) approach, edge detection is conducted after the images stored in the database are converted to grayscale [22]. To achieve scale invariance, the edge image is partitioned using surrounding circles: the surrounding circle is determined from the intersection points of the edge, and angles are measured for the feature extraction step employed in the image retrieval comparison procedure. The approach takes advantage of the surrounding circle of an object's edge image to generate a number of radial divisions for that edge image; after the surrounding circle is created, equidistant circles are drawn to extract the features required for scale invariance.


3.6 Edge Pixel Neighborhood Information

The edge pixel neighborhood information (EPNI) method uses the structure of neighboring edge pixels to create an extended feature vector [23, 24]. This feature vector is used in the image retrieval matching procedure. The method is invariant to translation and scale but not to rotation.

3.7 Color Histograms

In [25, 26], the authors introduced histogram-based image retrieval methodologies. For a given color image, a color histogram is constructed by extracting the colors of the image and counting the number of instances of each distinct color in the image array. Histograms are translation and rotation invariant and change only slowly with occlusion, scale changes, and changes in viewing angle. Because histograms change gradually with perspective, a three-dimensional object can also be described by a limited number of histograms [27, 28]. This approach uses the color histogram of a test object to retrieve an image of a similar object from the database. Its significant disadvantage is that it is sensitive to the color of the input object image and to the intensity of the light source.
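The construction described above can be sketched as a quantized 3-D RGB histogram together with histogram intersection, a common similarity measure. The 4-levels-per-channel quantization and the function names are illustrative assumptions:

```python
import numpy as np

def color_histogram(img, bins=4):
    """Quantize each RGB channel to `bins` levels and count how often
    each of the bins**3 distinct colours occurs, normalized by the
    pixel count so images of different sizes are comparable."""
    q = (img.astype(int) * bins) // 256        # per-channel level in 0..bins-1
    idx = q[..., 0] * bins * bins + q[..., 1] * bins + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins ** 3)
    return hist / idx.size

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical colour distributions."""
    return np.minimum(h1, h2).sum()

red = np.full((8, 8, 3), [200, 10, 10], dtype=np.uint8)
blue = np.full((8, 8, 3), [10, 10, 200], dtype=np.uint8)
h_red, h_blue = color_histogram(red), color_histogram(blue)
```

Note that the index depends only on which quantization cell each pixel falls into, not on where the pixel is, which is why the histogram is translation and rotation invariant.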

3.8 Edge Histogram Descriptor

The edge histogram descriptor (EHD) is a histogram made up of edge pixels [29]. It is an excellent texture signature approach that can also be used to match images, and the MPEG-7 standard defines the EHD in its texture section [30]. The technique is beneficial for image-to-image matching; its main drawback, however, is that it is rotation-variant and therefore cannot describe shapes in a rotation-invariant way.

3.9 Shape Descriptor

Shape is a critical fundamental feature used to describe the content of an image. However, as a result of occlusion, noise, and arbitrary distortion, the shape of an object is frequently corrupted, complicating the object recognition problem. Shapes are represented using shape characteristics that are based either on the boundary together with the interior content or on the shape boundary information alone. Object identification uses a variety of shape features, which are evaluated based on how well they allow users to retrieve comparable forms from the database.


Shape descriptors can be used to quickly find similar shapes in a database, even if they have been affinely transformed in some way, such as scaled, flipped, rotated, or translated [31]. They can also be used to quickly find noise-affected shapes, defective shapes, and human-tolerated shapes for shape recovery and comparison. A shape descriptor should be capable of retrieving images for the widest possible variety of shapes, not just specific ones; as a result, it should be application-generic. Low computational complexity is one of the shape descriptor's most important characteristics. The calculation can be accelerated and the computational complexity reduced by using fewer image attributes in the calculation technique, making it more resilient. The following crucial characteristics must be present in effective shape features [32]: (a) identifiability, (b) rotation invariance, (c) scale invariance, (d) translation invariance, (e) occlusion invariance, (f) affine invariance, (g) noise resistance, (h) reliability, (i) a well-defined range, and (j) statistical independence.

A shape descriptor is a collection of variables used to describe a specific characteristic of a shape. It tries to measure a shape in a way that is compatible with how humans see it. A shape characteristic that can efficiently find comparable shapes in a database is required for a high retrieval recognition rate [33]. The features are usually represented as a vector. The shape feature should meet the following requirements: (1) It should be simple to calculate the distance between descriptors; otherwise, implementation will take a long time. (2) It should be efficiently denoted and stored so that the descriptor vector does not grow too large. (3) It should be comprehensive enough to accurately describe the shape.

Types of Shape Features For shape retrieval applications, several shape description and depiction approaches have been developed [34].
Depending on whether shape features are extracted from the entire shape region or just the contour, the methodologies for describing and depicting shapes are classified into two categories: (a) the contour-based approach and (b) the region-based approach. Each category is then broken down into two approaches, global and structural, according to whether the shape is described as a whole or by segments.

Contour-Based Approach Boundary or contour information is exploited by contour-based algorithms [35]. The shape representation is further classified into structural and global approaches. The structural approach is also known as the discrete approach because it breaks the shape boundary information into segments or subparts called primitives. The structural representation is usually a string or a tree (or graph) that is utilized for image retrieval matching. The global approaches do not partition the shape into subparts; they build the feature vector and perform the matching procedure using the complete boundary information, and as a result they are often referred to as continuous approaches. In the global approach, a multidimensional numeric feature vector is constructed from the shape contour information and used in the matching phase.


The matching procedure is completed by computing the Euclidean distance or by point-to-point matching. In the structural method, the contour information is broken into segments, so shapes are divided into primitives, which are boundary segments. The result is stored as a string S = s1, s2, ..., sn, where each si is an element of a shape feature containing a characteristic such as orientation, length, and so on. The string can be used to portray the shape directly or as an input parameter to an image retrieval system. The structural method's primary limitation is the generation of features and primitives: there is no appropriate definition of an object or shape, since the number of primitives required for each shape is unknown. Another constraint is computational efficiency, and the method does not ensure the best possible match. It is also less robust than global techniques, because changes in the object contour cause changes in the primitives.

Region-Based Approach Region-based techniques can tackle problems that contour-based techniques cannot [36]. They are more durable and can be used in a variety of situations, and they are capable of dealing with shape defects. The region-based method considers all pixels in the shape region, so the complete region is used to represent and describe the shape. In the same way that contour-based approaches are separated into global and structural methods, region-based approaches are separated into global and structural approaches depending on whether or not they divide the shape into subparts. Global methodologies consider the complete shape region for shape description and representation, whereas region-based structural approaches divide the shape region into subparts. The challenges faced by region-based structural approaches are similar to those faced by contour-based structural techniques.
Contour-based techniques are more prominent than region-based techniques for the following reasons: (a) the contour of a shape, rather than its interior content, is what matters in various applications, and (b) humans can easily recognize shapes from their contours. But contour-based methods also have some shortcomings: (a) contours are not available in a number of applications; (b) because only small sections of shapes are used, contour-based shape descriptors are susceptible to noise and variations; and (c) in some applications, the interior content is more important than the contour. Figure 1 shows the classification and some examples of shape descriptors in detail.

4 Shape Signature

The shape signature is a one-dimensional shape feature function obtained from the shape's edge coordinates. The shape signature generally captures a perceptual shape property of the object. Shape signatures can define the entire shape; they are also commonly used as a preprocessing step before other feature


Fig. 1 Shape descriptor – classification and example

extraction procedures. The important one-dimensional feature functions are (a) centroid distance function, (b) chord length function, and (c) area function.

4.1 Centroid Distance Function

The centroid distance function (CDF) is defined as the distance of the contour points from the shape's centroid (x0, y0), as given by Eq. (1) [37]:

r(n) = √[(x(n) − x0)² + (y(n) − y0)²]   (1)

The centroid (x0, y0) is located at the average of the x and y coordinates of all contour points. A shape's boundary is made up of a series of contour or boundary points. A radius is a straight line that connects the centroid to a point on the boundary. The Euclidean distance is used in the CDF model to capture a shape's radii lengths from its centroid at regular intervals as the shape's descriptor [38]. Let θ be the regular interval (in degrees) between two radii (Fig. 2). The number of intervals is then given by K = 360/θ. All radii lengths are normalized by dividing by the longest of the extracted radii lengths. Moreover, without sacrificing generality, assume that the intervals are taken clockwise from the x-axis. The shape descriptor can then be represented as a vector, as illustrated in Eq. (2). Figure 3 illustrates the centroid distance function plot of a shape boundary.


Fig. 2 Centroid distance function (CDF) approach

Fig. 3 Centroid distance plot of shape boundary

S = {r0, rθ, r2θ, . . ., r(K−1)θ}   (2)

This method has some advantages and disadvantages. Its main advantage is translation independence: subtracting the centroid, which designates the shape's position, from the edge coordinates removes positional information. The main drawback is that the method fails to depict the shape properly if there are multiple boundary points at the same angular interval.
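Equation (1) with max-radius normalization can be sketched as follows (the helper name is an assumption, and for simplicity it samples the given contour points directly rather than at fixed angular intervals θ):

```python
import numpy as np

def centroid_distance_signature(contour):
    """Centroid distance function: distance of each contour point
    from the shape centroid (x0, y0), divided by the longest radius
    so the signature is scale invariant."""
    contour = np.asarray(contour, dtype=float)
    centroid = contour.mean(axis=0)            # (x0, y0)
    r = np.linalg.norm(contour - centroid, axis=1)
    return r / r.max()

# square contour: corner radii are sqrt(2) times the edge-midpoint radii
square = [(2, 0), (4, 0), (4, 2), (4, 4), (2, 4), (0, 4), (0, 2), (0, 0)]
sig = centroid_distance_signature(square)
```

Shifting every contour point by the same offset leaves the signature unchanged, which is exactly the translation independence of the CDF.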

4.2 Chord Length Function

The chord length function (CLF) is calculated from the shape contour without using a reference point [39]. As shown in Fig. 4, the CLF of each contour point C is the shortest distance between C and another contour point C′ such that the line CC′


Fig. 4 Chord length function (CLF) approach

is orthogonal to the tangent vector at C. This method also has merits and demerits. Its important merit is that it is translation invariant and addresses the issue of biased reference points (the centroid is frequently biased by contour defects or noise). Its demerit is that the chord length function is extremely sensitive to noise; even smoothed shape boundaries can produce sharp bursts in the signature.

4.3 Area Function

In the area function (AF) approach, as the contour points move along the shape edge, the area of the triangle formed by two consecutive contour points and the centroid changes as well [40]. This yields an area function that can be regarded as a shape representation, as illustrated in Fig. 5. Let An denote the area of the triangle formed by the consecutive edge points Pn, Pn+1 and the centroid C. The area function approach and its plot for a shape boundary are shown in Figs. 5 and 6.
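The triangle areas can be computed with a 2-D cross product; the sketch below (assumed helper name) treats the contour as a closed polygon, so the last point pairs with the first:

```python
import numpy as np

def area_function(contour):
    """Area function (AF): area of the triangle formed by each pair of
    consecutive contour points P_n, P_{n+1} and the centroid C,
    computed via the 2-D cross product."""
    contour = np.asarray(contour, dtype=float)
    p = contour - contour.mean(axis=0)         # points relative to centroid C
    q = np.roll(p, -1, axis=0)                 # next contour point (wraps around)
    return 0.5 * np.abs(p[:, 0] * q[:, 1] - p[:, 1] * q[:, 0])

# unit square: each of the four triangles has area 1/4, and for a convex
# shape the triangle areas sum to the total shape area
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
areas = area_function(square)
```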

5 Real-Time Applications of Shape Feature Extraction and Object Recognition

Shape is an important visual feature for describing image content. One of the most difficult problems in developing effective content-based image retrieval is the use of object shape [41]. Because determining the similarity between shapes is difficult, a precise description of shape content is hard to obtain. Thus, two steps are critical in shape-based image retrieval: shape feature extraction and similarity calculation among the extracted features. Some of the real-time


Fig. 5 Area function (AF) approach

Fig. 6 Area function plot of shape boundary

applications of shape feature extraction and object recognition are explained further.

5.1 Fruit Recognition

Fruit recognition can be accomplished in a variety of ways by utilizing the shape feature [42]. One fruit recognition algorithm that uses the shape feature is as follows:

Step 1: First, gather images of various types of fruits with varying shapes. Figure 7 depicts an orange image.
Step 2: The images are divided into two sets: training and testing.


Fig. 7 Image of an orange

Fig. 8 Binarized image of an orange

Step 3: Convert all images to binary so that the fruit pixels are 1s and the remaining pixels are 0s [43], as shown in Fig. 8.
Step 4: The Canny edge detector is an edge detection operator that detects a wide range of edges in images using a multistage approach. The Canny edge detection algorithm [44] is used to extract the fruit contour, as shown in Fig. 9.
Step 5: For each image, compute the centroid distance [45]. Figure 10 depicts the centroid distance plot of Fig. 9.
Step 6: Euclidean distance measurement is used to compare the centroid distances between training and testing images [46].
Step 7: The test fruit image is identified as the training image with the smallest difference [47].
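Steps 5–7 amount to a nearest-neighbour comparison over precomputed signatures. The sketch below uses toy centroid-distance signatures and made-up labels purely for illustration:

```python
import numpy as np

def classify_by_signature(test_sig, train_sigs, labels):
    """Return the label of the training signature with the smallest
    Euclidean distance to the test signature (Steps 5-7 above)."""
    dists = [np.linalg.norm(test_sig - s) for s in train_sigs]
    return labels[int(np.argmin(dists))]

# toy signatures: a round fruit has a nearly constant centroid distance,
# an elongated one oscillates (both labels are hypothetical)
train_sigs = [np.ones(8),
              np.array([1.0, 0.7, 1.0, 0.7, 1.0, 0.7, 1.0, 0.7])]
labels = ["round fruit", "elongated fruit"]
test_sig = np.full(8, 0.97)                    # nearly constant radius
print(classify_by_signature(test_sig, train_sigs, labels))
```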


Fig. 9 Contour of the orange fruit

Fig. 10 The centroid distance plot of Fig. 9

5.2 Leaf Recognition

The shape feature can be used in a variety of ways to recognize leaves [48]. The following is an example of a leaf recognition algorithm that uses the shape feature:

Step 1: First, gather images of various sorts of leaves with varying shapes. A leaf is depicted in Fig. 11.
Step 2: The images are divided into two parts: training and testing.
Step 3: Convert all images to binary, with the leaf pixels being 1s and the remaining pixels being 0s (Fig. 12) [43].
Step 4: Next, the leaf contour is extracted using the Canny edge detection algorithm (Fig. 13) [44]. Edge detection is an image processing approach that identifies points in a digital image with discontinuities or sharp changes in brightness.
Step 5: Calculate the seven Hu moments [49] associated with each image. Figure 14 depicts the plot of the seven Hu moment values of Fig. 13.
Step 6: Euclidean distance measurement is used to compare the moments between training and testing images.
Step 7: The test leaf image is identified as the training image with the smallest difference [47].
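As a sketch of Step 5 under stated assumptions, the snippet below computes only the first two of the seven Hu invariants, directly from a binary mask rather than from the extracted contour; the function name is illustrative:

```python
import numpy as np

def hu_first_two(mask):
    """First two Hu moment invariants of a binary shape mask:
        phi1 = eta20 + eta02
        phi2 = (eta20 - eta02)^2 + 4 * eta11^2
    where eta_pq are normalized central moments; both values are
    invariant to translation, scale, and rotation."""
    ys, xs = np.nonzero(mask)
    m00 = len(xs)                               # zeroth moment = area in pixels
    xbar, ybar = xs.mean(), ys.mean()

    def eta(p, q):                              # normalized central moment
        mu = ((xs - xbar) ** p * (ys - ybar) ** q).sum()
        return mu / m00 ** (1 + (p + q) / 2)

    e20, e02, e11 = eta(2, 0), eta(0, 2), eta(1, 1)
    return e20 + e02, (e20 - e02) ** 2 + 4 * e11 ** 2

# translation invariance: the same 6x6 square at two different positions
m1 = np.zeros((20, 20), dtype=bool)
m1[2:8, 2:8] = True
m2 = np.zeros((20, 20), dtype=bool)
m2[10:16, 12:18] = True
```

Centering the moments on (x̄, ȳ) removes position, and dividing by the appropriate power of m00 removes scale, which is why the two shifted squares yield identical values.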

Fig. 11 Image of a leaf

Fig. 12 Binarized image of the leaf depicted in Fig. 11

Fig. 13 Contour of the leaf


Fig. 14 The Hu moment plot of Fig. 13

Fig. 15 (a) Target image, (b) test image

5.3 Object Recognition

There are two images in this case: the test image and the target image. The test image depicts a scene of flowers in front of a window, and the target image (a flower) is to be found using the scale-invariant feature transform (SIFT).

Step 1: Input the target image (flower) (Fig. 15a).
Step 2: Input the test image (scene with cluttered objects) (Fig. 15b).
Step 3: Using SIFT, find the 100 strongest points in the target image (Fig. 16a).
Step 4: Using SIFT, find the 200 strongest points in the test image (Fig. 16b).
Step 5: Calculate putatively matched points by comparing the two images (Fig. 17).
Step 6: Determine the points that are exactly matched (Fig. 18).
Step 7: Draw a polygon around the region of exactly matching points (Fig. 19).


Fig. 16 (a) 100 strongest points in the target image, (b) 200 strongest points in the test image

Fig. 17 Putatively matching points in target and test images

Fig. 18 Exactly matching points in the test and target images


Fig. 19 Detected target in the test image

6 Recent Works

Many recent works on shape feature extraction in computer vision have been reported in the literature. The same approach can be applied in many recent areas such as robotics, fault detection, autonomous vehicle management systems, the Industry 4.0 framework, and medical applications; a few of them are listed here for reference. Yang et al. [50] explained fish detection and behavior analysis using various computer vision models in intelligent aquaculture, and Foysal et al. [51] presented an application for detecting garment fit on a smartphone using a computer vision approach. In [52–54], the authors detailed various autonomous vehicle management systems, including a comprehensive review of vehicle detection, traffic light recognition by autonomous vehicles in city environments, and pothole detection in roadways for such vehicles. In [55], Das et al. derived parking area patterns from autonomous vehicle positions in aerial images using a Mask R-CNN computer vision approach. Devaraja et al. [56] explained computer vision-based grasping for robotic hands used in industry and also discussed shape recognition by autonomous robots in an industrial environment. In [57], the authors detailed computer vision-based robotic equipment used in the medical field and its importance in surgeries. In [58], the author describes robotic underwater vehicles that use computer vision to monitor deepwater animals; high system efficiency can be attained by employing machine learning techniques along with computer vision. In [59], the authors detailed computer vision-enabled, support vector machine-assisted fault detection in industrial textures. Cho et al. [60] explained fault analysis and fault detection in a wind turbine system using an artificial neural network together with a Kalman filter and computer vision approaches.
In [61, 62], the authors detailed fault detection in aircraft wings and sustainable


fault detection of electrical facilities using computer vision methodologies. In [63–65], the authors detailed neural network and deep learning configurations and a computer vision approach for identifying and classifying faults in switched reluctance motor drives used in automobile and power generation applications. Esteva et al. [57] explained deep learning-enabled computer vision applications in their work; combining deep learning with a computer vision-based approach yields higher accuracy and classification performance. Naresh et al. [66] detailed computer vision-based health-care management through mobile communication, focusing on telemedicine-based applications. Pillai et al. [67] discussed COVID-19 detection using computer vision and deep convolutional neural networks; they provided a detailed analysis and compared the results with existing conventional methodologies. In the real world, the use of computer vision techniques in health care improves disease prognosis and patient care. Recent advancements in object detection and image classification can significantly assist medical imaging. Medical imaging, also known as medical image analysis, is a technique for visualizing specific organs and tissues in order to provide a more precise diagnosis. Several studies in pathology, radiology, and dermatology [68–70] have shown encouraging results in complicated medical diagnostic tasks. Computer vision has been utilized in a variety of health-care applications to help doctors make better treatment decisions for their patients. In the medical field, computer vision applications have proven to be quite useful, particularly in the detection of brain tumors [71]. Furthermore, researchers have discovered numerous benefits of employing computer vision and deep learning algorithms to diagnose breast cancer [72].
Trained with a large database of images containing both healthy and malignant tissue, such a system can help automate the detection process and reduce the risk of human error [73]. Other important works are also available in the literature; only a few important recent works are listed here for reference.

7 Summary and Conclusion

The present work focuses on the shape feature extraction techniques used in computer vision applications, and the various techniques are explained in detail. Histogram-based image retrieval approaches used in computer vision include the edge histogram descriptor and histograms of edge directions. The eigenvector approach is particularly sensitive to changes in individual pixel values caused by manipulations such as scaling, rotation, or translation. ARP is invariant to scale and rotation. The EPNI method is invariant to scale and translation but not to rotation. The color histogram is affected by noise but is insensitive to rotation and translation. Shape description and representation approaches are divided into two categories: contour-based approaches and region-based approaches. Both sorts of approaches are further subdivided into global and structural techniques. Although contour-based


techniques are more popular than region-based techniques, they still have significant drawbacks, which region-based approaches can circumvent. Shape signatures are frequently utilized as a preprocessing step before the extraction of other features, and the most significant one-dimensional feature functions are also presented in the current work. Some real-time feature extraction and object recognition applications used in computer vision are explained in detail. In addition, recent works related to shape feature extraction in computer vision are also listed.

References

1. Bhargava, A., & Bansal, A. (2021). Fruits and vegetables quality evaluation using computer vision: A review. Journal of King Saud University-Computer and Information Sciences, 33(3), 243–257.
2. Zhang, L., Pan, Y., Wu, X., & Skibniewski, M. J. (2021). Computer vision. In Artificial intelligence in construction engineering and management (pp. 231–256). Springer.
3. Dong, C. Z., & Catbas, F. N. (2021). A review of computer vision–based structural health monitoring at local and global levels. Structural Health Monitoring, 20(2), 692–743.
4. Iqbal, U., Perez, P., Li, W., & Barthelemy, J. (2021). How computer vision can facilitate flood management: A systematic review. International Journal of Disaster Risk Reduction, 53, 102030.
5. Torralba, A., Murphy, K. P., Freeman, W. T., & Rubin, M. A. (2003, October). Context-based vision system for place and object recognition. In Computer vision, IEEE international conference on (Vol. 2, pp. 273–273). IEEE Computer Society.
6. Liang, M., & Hu, X. (2015). Recurrent convolutional neural network for object recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3367–3375).
7. Kortylewski, A., Liu, Q., Wang, A., Sun, Y., & Yuille, A. (2021). Compositional convolutional neural networks: A robust and interpretable model for object recognition under occlusion. International Journal of Computer Vision, 129(3), 736–760.
8. Alom, M. Z., Hasan, M., Yakopcic, C., Taha, T. M., & Asari, V. K. (2021). Inception recurrent convolutional neural network for object recognition. Machine Vision and Applications, 32(1), 1–14.
9. Cisar, P., Bekkozhayeva, D., Movchan, O., Saberioon, M., & Schraml, R. (2021). Computer vision based individual fish identification using skin dot pattern. Scientific Reports, 11(1), 1–12.
10. Saba, T. (2021). Computer vision for microscopic skin cancer diagnosis using handcrafted and non-handcrafted features. Microscopy Research and Technique, 84(6), 1272–1283.
11. Li, Y., Ma, J., & Zhang, Y. (2021). Image retrieval from remote sensing big data: A survey. Information Fusion, 67, 94–115.
12. Lucny, A., Dillinger, V., Kacurova, G., & Racev, M. (2021). Shape-based alignment of the scanned objects concerning their asymmetric aspects. Sensors, 21(4), 1529.
13. Kim, Y. W., & Oh, I. S. (2004). Watermarking text document images using edge direction histograms. Pattern Recognition Letters, 25(11), 1243–1251.
14. Bakheet, S., & Al-Hamadi, A. (2021). A framework for instantaneous driver drowsiness detection based on improved HOG features and Naïve Bayesian classification. Brain Sciences, 11(2), 240.
15. Heidari, H., & Chalechale, A. (2021). New weighted mean-based patterns for texture analysis and classification. Applied Artificial Intelligence, 35(4), 304–325.


16. Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
17. Linde, O., & Lindeberg, T. (2012). Composed complex-cue histograms: An investigation of the information content in receptive field based image descriptors for object recognition. Computer Vision and Image Understanding, 116(4), 538–560.
18. Hazgui, M., Ghazouani, H., & Barhoumi, W. (2021). Evolutionary-based generation of rotation and scale invariant texture descriptors from SIFT keypoints. Evolving Systems, 12, 1–13.
19. Shapiro, L. S., & Brady, J. M. (1992). Feature-based correspondence: An eigenvector approach. Image and Vision Computing, 10(5), 283–288.
20. Park, S. H., Lee, K. M., & Lee, S. U. (2000). A line feature matching technique based on an eigenvector approach. Computer Vision and Image Understanding, 77(3), 263–283.
21. Schiele, B., & Crowley, J. L. (2000). Recognition without correspondence using multidimensional receptive field histograms. International Journal of Computer Vision, 36(1), 31–50.
22. Chalechale, A., Mertins, A., & Naghdy, G. (2004). Edge image description using angular radial partitioning. IEE Proceedings-Vision, Image and Signal Processing, 151(2), 93–101.
23. Chalechale, A., & Mertins, A. (2002, October). An abstract image representation based on edge pixel neighborhood information (EPNI). In EurAsian conference on information and communication technology (pp. 67–74). Springer.
24. Wang, Z., & Zhang, H. (2008, July). Edge linking using geodesic distance and neighborhood information. In 2008 IEEE/ASME international conference on advanced intelligent mechatronics (pp. 151–155). IEEE.
25. Chakravarti, R., & Meng, X. (2009, April). A study of color histogram based image retrieval. In 2009 sixth international conference on information technology: New generations (pp. 1323–1328). IEEE.
26. Liu, G. H., & Wei, Z. (2020). Image retrieval using the fused perceptual color histogram. Computational Intelligence and Neuroscience, 2020, 8876480.
27. Mohseni, S. A., Wu, H. R., Thom, J. A., & Bab-Hadiashar, A. (2020). Recognizing induced emotions with only one feature: A novel color histogram-based system. IEEE Access, 8, 37173–37190.
28. Chaki, J., & Dey, N. (2021). Histogram-based image color features. In Image color feature extraction techniques (pp. 29–41). Springer.
29. Park, D. K., Jeon, Y. S., & Won, C. S. (2000, November). Efficient use of local edge histogram descriptor. In Proceedings of the 2000 ACM workshops on multimedia (pp. 51–54).
30. Alreshidi, E., Ramadan, R. A., Sharif, M., Ince, O. F., & Ince, I. F. (2021). A comparative study of image descriptors in recognizing human faces supported by distributed platforms. Electronics, 10(8), 915.
31. Virmani, J., Dey, N., & Kumar, V. (2016). PCA-PNN and PCA-SVM based CAD systems for breast density classification. In Applications of intelligent optimization in biology and medicine (pp. 159–180). Springer.
32. Chaki, J., Parekh, R., & Bhattacharya, S. (2016, January). Plant leaf recognition using a layered approach. In 2016 international conference on microelectronics, computing and communications (MicroCom) (pp. 1–6). IEEE.
33. Tian, Z., Dey, N., Ashour, A. S., McCauley, P., & Shi, F. (2018). Morphological segmenting and neighborhood pixel-based locality preserving projection on brain fMRI dataset for semantic feature extraction: An affective computing study. Neural Computing and Applications, 30(12), 3733–3748.
34. Chaki, J., Parekh, R., & Bhattacharya, S. (2018). Plant leaf classification using multiple descriptors: A hierarchical approach. Journal of King Saud University-Computer and Information Sciences, 32, 1158.
35. AlShahrani, A. M., Al-Abadi, M. A., Al-Malki, A. S., Ashour, A. S., & Dey, N. (2018). Automated system for crops recognition and classification. In Computer vision: Concepts, methodologies, tools, and applications (pp. 1208–1223). IGI Global.
36. Chaki, J., & Parekh, R. (2012). Designing an automated system for plant leaf recognition. International Journal of Advances in Engineering & Technology, 2(1), 149.

Shape Feature Extraction Techniques for Computer Vision Applications

101

37. Dey, N., Roy, A. B., Pal, M., & Das, A. (2012). FCM based blood vessel segmentation method for retinal images. arXiv preprint arXiv:1209.1181. 38. Chaki, J., & Parekh, R. (2011). Plant leaf recognition using shape based features and neural network classifiers. International Journal of Advanced Computer Science and Applications, 2(10), 41. 39. Kulfan, B. M. (2008). Universal parametric geometry representation method. Journal of Aircraft, 45(1), 142–158. 40. Dey, N., Das, P., Roy, A. B., Das, A., & Chaudhuri, S. S. (2012, Oct). DWT-DCT-SVD based intravascular ultrasound video watermarking. In 2012 world congress on information and communication technologies (pp. 224–229). IEEE. 41. Zhang, D., & Lu, G. (2001, Aug). Content-based shape retrieval using different shape descriptors: A comparative study. In IEEE international conference on multimedia and expo, 2001. ICME 2001 (pp. 289–289). IEEE Computer Society. 42. Patel, H. N., Jain, R. K., & Joshi, M. V. (2012). Automatic segmentation and yield measurement of fruit using shape analysis. International Journal of Computer Applications, 45(7), 19–24. 43. Gampala, V., Kumar, M. S., Sushama, C., & Raj, E. F. I. (2020). Deep learning based image processing approaches for image deblurring. Materials Today: Proceedings. 44. Deivakani, M., Kumar, S. S., Kumar, N. U., Raj, E. F. I., & Ramakrishna, V. (2021). VLSI implementation of discrete cosine transform approximation recursive algorithm. Journal of Physics: Conference Series, 1817(1), 012017 IOP Publishing. 45. Priyadarsini, K., Raj, E. F. I., Begum, A. Y., &Shanmugasundaram, V. (2020). Comparing DevOps procedures from the context of a systems engineer. Materials Today: Proceedings. 46. Chaki, J., Dey, N., Moraru, L., & Shi, F. (2019). Fragmented plant leaf recognition: Bagof-features, fuzzy-color and edge-texture histogram descriptors with multi-layer perceptron. Optik, 181, 639–650. 47. Chouhan, A. S., Purohit, N., Annaiah, H., Saravanan, D., Raj, E. F. I., & David, D. 
S. (2021). A real-time gesture based image classification system with FPGA and convolutional neural network. International Journal of Modern Agriculture, 10(2), 2565–2576. 48. Lee, K. B., & Hong, K. S. (2013). An implementation of leaf recognition system using leaf vein and shape. International Journal of Bio-Science and Bio-Technology, 5(2), 57–66. 49. Chaki, J., & Parekh, R. (2017, Dec). Texture based coin recognition using multiple descriptors. In 2017 international conference on computer, electrical & communication engineering (ICCECE) (pp. 1–8). IEEE. 50. Yang, L., Liu, Y., Yu, H., Fang, X., Song, L., Li, D., & Chen, Y. (2021). Computer vision models in intelligent aquaculture with emphasis on fish detection and behavior analysis: A review. Archives of Computational Methods in Engineering, 28(4), 2785–2816. 51. Foysal, K. H., Chang, H. J., Bruess, F., & Chong, J. W. (2021). SmartFit: Smartphone application for garment fit detection. Electronics, 10(1), 97. 52. Abbas, A. F., Sheikh, U. U., AL-Dhief, F. T., & Haji Mohd, M. N. (2021). A comprehensive review of vehicle detection using computer vision. Telkomnika, 19(3), 838. 53. Liu, X., & Yan, W. Q. (2021). Traffic-light sign recognition using capsule network. Multimedia Tools and Applications, 80(10), 15161–15171. 54. Dewangan, D. K., & Sahu, S. P. (2021). PotNet: Pothole detection for autonomous vehicle system using convolutional neural network. Electronics Letters, 57(2), 53–56. 55. Das, M. J., Boruah, A., Malakar, J., & Bora, P. (2021). Generating parking area patterns from vehicle positions in an aerial image using mask R-CNN. In Proceedings of international conference on computational intelligence and data engineering (pp. 201–209). Springer. 56. Devaraja, R. R., Maskeli¯unas, R., & Damaševiˇcius, R. (2021). Design and evaluation of anthropomorphic robotic hand for object grasping and shape recognition. Computers, 10(1), 1. 57. Esteva, A., Chou, K., Yeung, S., Naik, N., Madani, A., Mottaghi, A., et al. 
(2021). Deep learning-enabled medical computer vision. NPJ Digital Medicine, 4(1), 1–9. 58. Katija, K., Roberts, P. L., Daniels, J., Lapides, A., Barnard, K., Risi, M., et al. (2021). Visual tracking of deepwater animals using machine learning-controlled robotic underwater vehicles.

102

E. F. I. Raj and M. Balaji

In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 860–869). 59. Tellaeche Iglesias, A., Campos Anaya, M. Á., Pajares Martinsanz, G., & Pastor-López, I. (2021). On combining convolutional autoencoders and support vector machines for fault detection in industrial textures. Sensors, 21(10), 3339. 60. Cho, S., Choi, M., Gao, Z., & Moan, T. (2021). Fault detection and diagnosis of a blade pitch system in a floating wind turbine based on Kalman filters and artificial neural networks. Renewable Energy, 169, 1–13. 61. Almansoori, N. N., Malik, S., & Awwad, F. (2021). A novel approach for fault detection in the aircraft body using image processing. In AIAA Scitech 2021 Forum (p. 0520). 62. Kim, J. S., Choi, K. N., & Kang, S. W. (2021). Infrared thermal image-based sustainable fault detection for electrical facilities. Sustainability, 13(2), 557. 63. Raj, E. F. I., & Balaji, M. (2021). Analysis and classification of faults in switched reluctance motors using deep learning neural networks. Arabian Journal for Science and Engineering, 46(2), 1313–1332. 64. Sijini, A. C., Fantin, E., & Ranjit, L. P. (2016). Switched reluctance Motor for Hybrid Electric Vehicle. Middle-East Journal of Scientific Research, 24(3), 734–739. 65. Raj, E. F. I., & Kamaraj, V. (2013, March). Neural network based control for switched reluctance motor drive. In 2013 IEEE international conference ON emerging trends in computing, communication and nanotechnology (ICECCN) (pp. 678–682). IEEE. 66. Naresh, E., Sureshkumar, K. R., & Sahana, P. S. (2021). Computer vision in healthcare management system through mobile communication. Elementary Education Online, 20(5), 2105–2117. 67. Pillai, V. G., & Chandran, L. R. (2021). COVID-19 detection using computer vision and deep convolution neural network. Cybernetics, cognition and machine learning applications: Proceedings of ICCCMLA 2020, 323. 68. Razzak, M. I., Naz, S., & Zaib, A. (2018). 
Deep learning for medical image processing: Overview, challenges and the future. Classification in BioApps (pp. 323–350). 69. Neri, E., Caramella, D., & Bartolozzi, C. (2008). Image processing in radiology. Medical radiology. Diagnostic imaging. Springer. 70. Fourcade, A., & Khonsari, R. H. (2019). Deep learning in medical image analysis: A third eye for doctors. Journal of Stomatology, Oral and Maxillofacial Surgery, 120(4), 279–288. 71. Mohan, G., & Subashini, M. M. (2018). MRI based medical image analysis: Survey on brain tumor grade classification. Biomedical Signal Processing and Control, 39, 139–161. 72. Tariq, M., Iqbal, S., Ayesha, H., Abbas, I., Ahmad, K. T., & Niazi, M. F. K. (2021). Medical image based breast cancer diagnosis: State of the art and future directions. Expert Systems with Applications, 167, 114095. 73. Selvathi, D., & Poornila, A. A. (2018). Deep learning techniques for breast cancer detection using medical image analysis. In Biologically rationalized computing techniques for image processing applications (pp. 159–186). Springer.

GLCM Feature-Based Texture Image Classification Using Machine Learning Algorithms

R. Anand, T. Shanthi, R. S. Sabeenian, and S. Veni

1 Introduction

A picture describes a scene efficiently and conveys information in a compact way. Human visual perception helps the viewer interpret fine detail from an image. Almost 90% of the data processed by the human brain is visual, which helps the brain respond to and process visual data as much as 60,000 times faster than any other form of data. Image processing systems need the image represented in digital form. A digital image is a two-dimensional array of numbers, where the numbers represent the intensity values of the image at various spatial locations. These pixels possess spatial coherence that can be exploited by performing arithmetic operations such as addition, subtraction, etc. Statistical manipulation of the pixel values supports image processing techniques for a variety of applications, most of which employ feature extraction as one of their steps. A variety of features such as colour, shape, and texture can be extracted from digital images. Among these, texture features such as fine, coarse, smooth, and grained play an important role.

R. Anand (✉)
Department of ECE, Sri Eshwar College of Engineering, Coimbatore, India

T. Shanthi · R. S. Sabeenian
Research Member in Sona SIPRO, Department of ECE, Sona College of Technology, Salem, India
e-mail: [email protected]; [email protected]

S. Veni
Department of Electronics and Communication Engineering, Amrita School of Engineering, Coimbatore, India
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_5


The texture of an image describes the distribution of intensity values in the image; the spatial distribution of intensities provides the texture information. The texture characterises the image or a portion of it, and this information can be used to extract several valuable features that help with segmentation and classification. Texture feature calculation [1] uses the content of the GLCM to give a measure of the variation in intensity at the pixel of interest. Images with varying textures have certain characteristics that can be extracted statistically. Statistical approaches commonly comprise four methods: GLCM, the histogram method, the autocorrelation method, and morphological operations. Each has advantages and disadvantages. Of the four, this chapter adopts the GLCM, from which 10 different features are extracted. These features are elaborated in the next section, and the literature on statistical approaches is summarised in Table 1.

In the work [2], Elli and Yi-Fan extracted sentiment from reviews and analysed the results to build a business model. They claimed that the demonstrated implementations were robust enough to give high precision, and the use of business analytics made their decisions more consistent. They additionally worked on detecting emotions in reviews, predicting gender from names, and detecting fake reviews. The programming language used was Python, and their main classifiers were multinomial naive Bayes (MNB) and support vector machine (SVM). In [3], the author applied existing supervised machine learning algorithms to predict a review rating on a given numerical scale, using hold-out cross-validation with 70% of the data for training and 30% for testing, and used different classifiers to determine precision and recall values.

The author of [4] applied and extended current work in natural language processing and sentiment analysis to Amazon review datasets. Naive Bayes and decision-list classifiers were used to tag a given review as positive or negative; book and Kindle reviews were selected from Amazon. The author of [5] aimed to build a system that visualises the sentiment of a review by scraping data from an Amazon URL and preprocessing it, applying naive Bayes, SVM, and maximum entropy. That paper summarises each product review to its main points, so no precision is reported; results are shown in statistical charts. In [6], the authors built a model for predicting product ratings from review text using a bag of words, testing unigram and bigram models on a subset of Amazon video-game user reviews from UCSD. Time-based models did not work well, as the variance in average rating between years, months, and days was relatively small. Between unigrams and bigrams, unigrams produced the most precise results, and popular unigrams were highly serviceable predictors of ratings because of their larger variance; unigram results performed 15.89% better than bigrams. In paper [7], various feature extraction and selection techniques for sentiment analysis


are performed. The Amazon dataset was collected first and then preprocessed to remove stop words and special characters. Phrase-level, single-word, and multiword feature selection and extraction techniques were applied, with naive Bayes as the classifier. The authors concluded that naive Bayes gives better results at the phrase level than for single words or multiwords; the main weakness of that work is that only a naive Bayes classifier was used, which cannot give a complete picture. Paper [8] used simpler algorithms that are easy to understand; the system gives high precision with SVM but does not work properly on very large datasets. They used support vector machine (SVM), logistic regression, and decision trees. In paper [9], tf-idf is used as a supplemental experiment; ratings can be predicted from a bag of words, but only a few classifiers are used, namely a linear regression model evaluated with root mean square error. Those are the related works mentioned above; we endeavoured to make our work more efficient by selecting the best ideas from them and applying them together. In our system, we used a large amount of data to give efficient results and make better decisions. Moreover, we used an active learning approach to label datasets, which can dramatically speed up many machine learning tasks. Our system also includes several types of feature extraction methods. To the best of our knowledge, our proposed approach gave higher precision than the existing research works. The strengths and weaknesses of statistical approach methods for texture image classification are shown in Table 1.

Table 1 Strengths and weaknesses of statistical approach methods for texture image classification

Morphological operation [5]
  Strengths: Efficient for aperiodic image textures.
  Weaknesses: Morphological operations are not applicable to periodic images.

Autocorrelation method [6]
  Strengths: Overcomes illumination distortion and is robust to noise; low computational complexity.
  Weaknesses: Real-time use on large images needs heavy computation; not suitable for all kinds of textures.

Grey-level co-occurrence matrix [7]
  Strengths: Captures the spatial relationship of pixels through 10 statistical measures (contrast, energy, homogeneity, mean, standard deviation, entropy, RMS, variance, smoothness, IDM); high accuracy rate.
  Weaknesses: High computational time; choosing the optimum offset vector is problematic; requires a feature selection procedure; accuracy depends on the offset rotation.

Histogram method [8]
  Strengths: Few computations; invariant to translation and rotation; mathematically tractable.
  Weaknesses: Sensitive to noise; low recognition rate.


2 GLCM

The Grey-Level Co-occurrence Matrix (GLCM) is a square matrix obtained from the input image. Its dimension equals the number of grey levels in the input image. For example, an 8-bit image has 256 grey levels in the range [0, 255]; for such an image, the GLCM has 256 rows and 256 columns, with each row/column representing one of the intensity values. Second-order statistics are obtained by considering pairs of pixels related to each other by a given direction and distance, so the grey-level co-occurrence matrices provide second-order statistical information on the texture. The GLCM of an image depends on the direction and offset values. The direction can be any one of the eight possible directions shown in Fig. 1, and the offset represents the distance between pixels. If the distance between the pixels is 1, the immediate neighbouring pixel in that direction is considered. In this way, several GLCM matrices can be obtained from a single image, as shown in Fig. 1.

2.1 Computation of GLCM Matrix

The GLCM is a square matrix with the same number of rows and columns, containing non-negative counts only. It is an N × N matrix, where N denotes the number of possible grey levels in the image. For example, a 2-bit image has four grey levels (0–3) and results in a GLCM of size 4 × 4, whose rows and columns correspond to the grey values 0–3. Consider the following image f(x, y) of size 5 × 5 with its grey-level representation given in Figs. 2 and 3. The matrix G_{θ,d} = G_{0°,1} represents the GLCM. The first row corresponds to the grey value 0, and the next rows to the grey values 1, 2, and 3.

Fig. 1 Co-occurrence matrix

Directions and offsets [row, column]: 0° [0, D], 45° [−D, D], 90° [−D, 0], 135° [−D, −D], 180° [0, −D], 225° [D, −D], 270° [D, 0], 315° [D, D]


f(x, y) =
  1 0 2 2 1
  1 0 0 1 2
  1 3 1 1 3
  0 1 1 1 3
  0 2 2 1 2

Fig. 2 The intensity values and their corresponding grey levels for an image segment f(x, y) with four grey levels

Similarly, the first column corresponds to the grey value 0, and the next columns to the grey values 1, 2, and 3. The first element in the first row of the GLCM gives the count of occurrences of the grey value 0 in the neighbourhood of the 0° direction. Looking at the input matrix, the pair (0, 0) occurs at only one point; hence, the first cell of the GLCM equals 1. The second element in the first row gives the count of occurrences of the grey value 1 in the neighbourhood of the 0° direction; the pair (0, 1) occurs at two points, so the second element equals 2. Similarly, the third and fourth elements are calculated from the occurrences of the pairs (0, 2) and (0, 3). The second row of the GLCM is computed from the occurrences of the pairs (1, 0), (1, 1), (1, 2), and (1, 3); the third row from the pairs (2, 0), (2, 1), (2, 2), and (2, 3); and the fourth row from the pairs (3, 0), (3, 1), (3, 2), and (3, 3). The resulting GLCM G_{0,1} is given in Fig. 4. A GLCM with a single offset is not sufficient for image analysis; for example, the GLCM with the 0° offset is not adequate to extract information from an image with vertical details. The input image may contain details in any direction; hence, GLCMs with different offset directions and distances are computed from a single image, and the average of all these matrices is used for further analysis. Each value of this matrix is then divided by the total number of pixel pairs counted, giving a normalised GLCM; for this 5 × 5 image with a horizontal offset of 1, there are 5 × 4 = 20 pairs. The normalised GLCM g(m, n) can be used to extract several features from the image, elaborated in the upcoming section.


Fig. 3 Computation of the first row of the co-occurrence matrix: the image f(x, y) is scanned once for each of the pairs (0, 0), (0, 1), (0, 2), and (0, 3)

Fig. 4 GLCM matrix G_{0,1}

G_{0,1} =
  1 2 2 0
  2 3 2 3
  0 2 2 0
  0 1 0 0
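The counting procedure above is straightforward to implement. The following sketch (plain NumPy; the function name is ours, not from the chapter) builds the 0°, distance-1 GLCM of the worked example and reproduces the matrix of Fig. 4:

```python
import numpy as np

def glcm(image, levels, dr=0, dc=1):
    """Count co-occurrences of grey-level pairs at row/column offset (dr, dc)."""
    g = np.zeros((levels, levels), dtype=int)
    rows, cols = image.shape
    for m in range(rows):
        for n in range(cols):
            r, c = m + dr, n + dc
            if 0 <= r < rows and 0 <= c < cols:
                g[image[m, n], image[r, c]] += 1
    return g

f = np.array([[1, 0, 2, 2, 1],
              [1, 0, 0, 1, 2],
              [1, 3, 1, 1, 3],
              [0, 1, 1, 1, 3],
              [0, 2, 2, 1, 2]])

G = glcm(f, levels=4)   # 0 degrees, distance 1: the matrix of Fig. 4
g = G / G.sum()         # normalised GLCM g(m, n); G.sum() == 20 pairs
```

Averaging `glcm` over the eight (dr, dc) offsets of Fig. 1 before normalising yields the direction-averaged matrix the chapter describes.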


2.2 GLCM Features

Let g(m, n) represent the normalised matrix with N grey levels, and let μ_x, σ_x and μ_y, σ_y be the means and standard deviations of the marginal probability matrices P_x(m) and P_y(n), respectively:

P_x(m) = \sum_{n=0}^{N-1} g(m, n)    (1)

P_y(n) = \sum_{m=0}^{N-1} g(m, n)    (2)

The mean values of the marginal probability matrices P_x(m) and P_y(n) are given as

\mu_x = \sum_{m=0}^{N-1} m \sum_{n=0}^{N-1} g(m, n)    (3)

\mu_x = \sum_{m=0}^{N-1} m P_x(m)    (4)

\mu_y = \sum_{n=0}^{N-1} n \sum_{m=0}^{N-1} g(m, n)    (5)

\mu_y = \sum_{n=0}^{N-1} n P_y(n)    (6)

The standard deviations of the marginal probability matrices P_x(m) and P_y(n) are given as

\sigma_x^2 = \sum_{m=0}^{N-1} (m - \mu_x)^2 \sum_{n=0}^{N-1} g(m, n)    (7)

\sigma_y^2 = \sum_{n=0}^{N-1} (n - \mu_y)^2 \sum_{m=0}^{N-1} g(m, n)    (8)

The sum and difference distributions are

P_{x+y}(l) = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} g(m, n), \quad m + n = l,    (9)

where l = 0 to 2(N − 1), and

P_{x-y}(l) = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} g(m, n), \quad |m - n| = l,    (10)

where l = 0 to N − 1.
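Under the same notation, Eqs. (1)–(10) reduce to a few array operations on the normalised GLCM. A NumPy sketch (the helper name is ours, not from the chapter):

```python
import numpy as np

def glcm_marginals(g):
    """Marginal statistics of a normalised GLCM g(m, n), per Eqs. (1)-(10)."""
    N = g.shape[0]
    idx = np.arange(N)
    Px = g.sum(axis=1)                        # Eq. (1): sum over n
    Py = g.sum(axis=0)                        # Eq. (2): sum over m
    mu_x = (idx * Px).sum()                   # Eqs. (3)-(4)
    mu_y = (idx * Py).sum()                   # Eqs. (5)-(6)
    var_x = ((idx - mu_x) ** 2 * Px).sum()    # Eq. (7)
    var_y = ((idx - mu_y) ** 2 * Py).sum()    # Eq. (8)
    P_sum = np.zeros(2 * N - 1)               # P_{x+y}(l), l = 0..2(N-1)
    P_diff = np.zeros(N)                      # P_{x-y}(l), l = 0..N-1
    for m in range(N):
        for n in range(N):
            P_sum[m + n] += g[m, n]           # Eq. (9)
            P_diff[abs(m - n)] += g[m, n]     # Eq. (10)
    return Px, Py, mu_x, mu_y, var_x, var_y, P_sum, P_diff
```

Because g is a probability distribution, Px, Py, P_sum, and P_diff each sum to 1.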

2.2.1 Energy

Energy (E) is computed as the sum of squares of the elements in the normalised GLCM. It lies in the range [0, 1]; an energy value of 1 indicates a constant image. It also reflects the uniformity of the image, as shown in Eq. (11):

E = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} (g(m, n))^2    (11)

2.2.2 Entropy

Entropy measures the disorder or complexity of an image, i.e., the amount of randomness in it. If the entropy is large, the image is not texturally uniform; complex textures yield high entropy. It is computed from g(m, n) using Eq. (12):

En = -\sum_{m=0}^{N-1} \sum_{n=0}^{N-1} g(m, n) \log(g(m, n))    (12)

2.2.3 Sum Entropy

SEn = -\sum_{l=0}^{2(N-1)} P_{x+y}(l) \log(P_{x+y}(l))    (13)

2.2.4 Difference Entropy

DEn = -\sum_{l=0}^{N-1} P_{x-y}(l) \log(P_{x-y}(l))    (14)

2.2.5 Contrast

Contrast measures the spatial frequency of an image and is the difference moment of the GLCM. It reflects the difference between the highest and lowest values of an adjacent set of pixels and is 0 for a constant image. It is also referred to as inertia. Contrast is calculated from g(m, n) using the following equation:

C = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} (m - n)^2 g(m, n)    (15)

2.2.6 Variance

This statistic measures heterogeneity and is strongly correlated with first-order statistics such as the standard deviation. It returns a high value for elements that differ greatly from the average value of g(m, n). It is also referred to as the sum of squares. Variance is calculated for the image g(m, n) using the following equation, where μ indicates the mean of the input image:

V = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} (m - \mu)^2 g(m, n)    (16)

2.2.7 Sum Variance

SV = \sum_{l=0}^{2(N-1)} (l - SEn)^2 P_{x+y}(l)    (17)

2.2.8 Difference Variance

DV = \sum_{l=0}^{N-1} l^2 P_{x-y}(l)    (18)

2.2.9 Local Homogeneity or Inverse Difference Moment (IDM)

Homogeneity is the consistency in the arrangement of the input image g(m, n). If the arrangement follows a regular pattern, the image is said to be homogeneous; a homogeneity value of 1 indicates a constant image. Mathematically, it can be expressed by the following equation:

H = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} \frac{g(m, n)}{1 + |m - n|^2}    (19)

2.2.10 Correlation

If the adjacent pixels of the input image g(m, n) are highly correlated, the image is said to be auto-correlated (the autocorrelation of the input data with itself after a one-pixel shift). Correlation measures the linear dependency between pixels at the respective locations and is calculated by the following equation:

Corr = \frac{\sum_{m=0}^{N-1} \sum_{n=0}^{N-1} (m \times n)\, g(m, n) - \mu_x \mu_y}{\sigma_x \sigma_y}    (20)

2.2.11 RMS Contrast

Root mean square (RMS) contrast measures the standard deviation of the pixel intensities. It does not depend on the angular frequency or the spatial distribution of contrast in the input image. Mathematically, it can be expressed as

RC = \sqrt{\frac{1}{N^2} \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} (I_{mn} - \bar{I})^2},    (21)

where I is the pixel intensity normalised to values between 0 and 1.

2.2.12 Cluster Shade

Cluster shade measures the unevenness of the input matrix and conveys information about the uniformity of the image; disproportionate images yield higher cluster shade values. It is computed using the following equation:

CS = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} (m + n - \mu_x - \mu_y)^3 g(m, n)    (22)

2.2.13 Cluster Prominence

Cluster prominence also measures the asymmetry of the image: higher cluster prominence indicates a less symmetric image, while smaller variance in the grey levels of the image results in a lower cluster prominence value.

CP = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} (m + n - \mu_x - \mu_y)^4 g(m, n)    (23)
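The features of Eqs. (11)–(23) follow directly from the normalised GLCM. A compact NumPy sketch of the per-matrix features (our own helper, not the chapter's code; a small epsilon guards log 0 and division by zero, and μ in Eq. (16) is taken here as the GLCM mean μ_x):

```python
import numpy as np

def glcm_features(g, eps=1e-12):
    """Texture features of a normalised GLCM g(m, n), following Eqs. (11)-(23)."""
    N = g.shape[0]
    m, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    mu_x = (m * g).sum()
    mu_y = (n * g).sum()
    sd_x = np.sqrt(((m - mu_x) ** 2 * g).sum())
    sd_y = np.sqrt(((n - mu_y) ** 2 * g).sum())
    return {
        "energy":      (g ** 2).sum(),                     # Eq. (11)
        "entropy":     -(g * np.log(g + eps)).sum(),       # Eq. (12)
        "contrast":    ((m - n) ** 2 * g).sum(),           # Eq. (15)
        "variance":    ((m - mu_x) ** 2 * g).sum(),        # Eq. (16), mu = mu_x
        "homogeneity": (g / (1 + (m - n) ** 2)).sum(),     # Eq. (19)
        "correlation": ((m * n * g).sum() - mu_x * mu_y)
                       / (sd_x * sd_y + eps),              # Eq. (20)
        "cluster_shade":      ((m + n - mu_x - mu_y) ** 3 * g).sum(),  # Eq. (22)
        "cluster_prominence": ((m + n - mu_x - mu_y) ** 4 * g).sum(),  # Eq. (23)
    }
```

For a constant image, whose normalised GLCM has all its mass in one cell, energy and homogeneity evaluate to 1 and contrast and entropy to 0, matching the limiting cases stated above.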

3 Machine Learning Algorithms

Texture features of an image are calculated on one band at a time; channel information can be consolidated using PCA before the texture features are computed. Texture features of an image can be used for both supervised and unsupervised image classification. Random Forest [10] is a classification method that builds multiple models from bootstrapped feature sets. To construct each tree in the ensemble, the algorithm bootstraps the training set and grows a tree on the fresh sample; every time a node of the tree is split, a random selection of features is drawn from the training set to identify the optimal split variable. The random forest takes extra time for the validation procedure but has acceptable performance; to address this, it is compared with KNN and SVM.

In the technique suggested by Shanthi et al. [1], the K-nearest-neighbours (KNN) classifier is used next. Given input vectors of classes "a" and "o" in a 2-dimensional feature space, the method must label another feature vector "c". In this scenario, it identifies the K nearest neighbours without regard to their labels. Figure 3 shows the classes "a" and "o" in the image; for our purposes, imagine the number 3 next to them. The objective of the algorithm is to discover which class "c" belongs to. Since k is 3, "c" needs its three nearest neighbours identified. Of the three adjacent points, one is an "a" while the other two are "o": "o" has two votes and "a" has one, so the class "o" is attributed to the vector "c". When K equals 1, the class is defined by the single nearest neighbour of the element. KNN prediction is computationally expensive; however, training for KNN is quicker than that of random forest.
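The majority-vote step just described can be sketched in a few lines of plain Python (the feature vectors and labels below are illustrative, not taken from the chapter):

```python
from collections import Counter
import math

def knn_predict(train, labels, c, k=3):
    """Assign c the majority label among its k nearest training vectors."""
    nearest = sorted(range(len(train)), key=lambda i: math.dist(train[i], c))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Illustrative 2-D feature vectors for classes "a" and "o"
train = [(0.0, 0.0), (1.0, 0.2), (0.9, 0.1), (5.0, 5.0)]
labels = ["a", "o", "o", "a"]
print(knn_predict(train, labels, (1.0, 0.0)))  # the two "o" neighbours outvote one "a"
```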
Despite the improved training times, more processing resources are needed to handle data in higher dimensions. Finally, this chapter examines how well these algorithms perform when compared with SVM. KNN identifies the observations most comparable to the one being predicted; those observations serve as a reasonable proxy for the answer, since averaging the values around them indicates the most likely response. Finding the answer requires the algorithm to locate the neighbours for the chosen integer k. Smaller values of k force the algorithm to adapt closely to the data at hand, putting it at risk of overfitting while allowing it to fit complicated borders between classes; bigger values of k distance themselves from the ups and downs of the actual data and result in smoother class separators. KNN prediction takes a lot of time to compute, yet it can train in a fraction of the time of


Fig. 5 Separating hyperplane in SVM, showing the margin, the support vectors, and the +ve and −ve classes

random forest. The HSI method for training may run faster but is more demanding on memory. In this chapter, we have used support vector machine (SVM) [11, 12] for texture image classification. This method falls under the category of supervised machine learning [9]. Support vector machine was introduced in 1992 as a supervised machine leaning algorithm. This algorithm gained its popularity because of its higher accuracy rate and minimum error rate. SVM is one of the best examples for “Kernel Method” that is the key areas of machine learning that is shown in Fig. 5. The idea behind SVM is to make use of nonlinear mapping function .φ that transforms data in input space to data in feature space in such a way that it becomes a linearly separable that is shown in Fig. 5 [2]. The SVMs then automatically discover the optimal separating hyperplane, which is nothing but a complex decision surface. The equation of hyperplane is derived from a line equation .y = ax + b ever; even though hyperplane is a line, its equation is shown in below [13, 14], where “w & x” are the vectors, and it can be computed by dot matrix of these two vectors that is shown in Eq. 24. w T x = 0.

.

(24)

Any hyperplane can be framed as the set of points x satisfying w · x + b = 0. Two such hyperplanes are chosen, and based on the values obtained, points are classified as class 1 or class 2 as given in Eqs. 25 and 26:

w · x + b ≥ 1 for x_i in class 1    (25)

w · x + b ≤ −1 for x_i in class 2    (26)

This leads to an optimization problem, because the goal is to maximize the margin among all the hyperplanes meeting the constraints. The hyperplane with

A demonstration of the LATEX 2ε class file for EAI Endorsed Transactions


the smallest ||w|| is chosen, because it provides the biggest margin. The optimization problem is stated in Eq. 27:

minimize_{w,b}  (1/2) ||w||²
subject to  y^(i) (w^T x_i + b) ≥ 1    (27)

Solving the above problem yields the values of (w, b) with the smallest possible ||w||, and hence the largest margin. The (w, b) pair that satisfies the constraints defines the equation of the optimal hyperplane.

4 Dataset Description

The dataset from the Centre for Image Analysis, Swedish University of Agricultural Sciences and Uppsala University [3], has been used in this chapter. In total, 4480 images of 28 different texture classes were taken using a Canon EOS 550D DSLR camera, as shown in Fig. 6. Each texture class has 160 images, of which 112 are used for training and 48 for testing, as shown in Table 2. Figure 7 shows the complete flowchart of the proposed method for texture image classification using GLCM features. In the first step, the segmented image is resized to 576 × 576.
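As a rough sketch of the GLCM feature-extraction step, a grey-level co-occurrence matrix and two Haralick-style statistics (contrast and energy) can be computed in plain Python. The 4 × 4 image, the grey-level count, and the single horizontal offset below are illustrative choices, not the chapter's exact GLCM configuration:

```python
def glcm(image, levels):
    """Normalised grey-level co-occurrence matrix for horizontally
    adjacent pixels. `image` is a list of rows of integer grey levels
    in [0, levels)."""
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for row in image:
        for a, b in zip(row, row[1:]):   # offset (0, 1): right neighbour
            counts[a][b] += 1
            total += 1
    return [[c / total for c in row] for row in counts]

def contrast(p):
    """Haralick contrast: sum of P[i][j] * (i - j)**2."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))

def energy(p):
    """Haralick energy (angular second moment): sum of P[i][j]**2."""
    return sum(v * v for row in p for v in row)

img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 2, 2, 2],
       [2, 2, 3, 3]]
P = glcm(img, levels=4)
print(round(contrast(P), 3), round(energy(P), 3))  # 0.583 0.167
```

A full pipeline would compute such statistics over several offsets and angles on each 576 × 576 image and feed the resulting feature vectors to the classifier.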

5 Experiment Results

The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space that distinctly classifies the data points. To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find the plane that has the maximum margin, i.e., the maximum distance between data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence. Hyperplanes are decision boundaries that help classify the data points; points falling on either side of the hyperplane can be attributed to different classes. The dimension of the hyperplane depends on the number of features: if the number of input features is 2, the hyperplane is just a line; if it is 3, the hyperplane becomes a two-dimensional plane; and it becomes difficult to visualize when the number of features exceeds 3. Support vectors are the data points closest to the hyperplane, and they influence its position and orientation. Using these support vectors, we maximize the margin of the classifier; deleting the support vectors will change the position of the hyperplane.
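These geometric quantities can be checked numerically on a toy example. The 2-D points and the weight vector below are hypothetical (a real SVM solver would learn w and b from data); the sketch only verifies the margin constraints of Eq. 27 and picks out the support vectors:

```python
import math

def functional_margins(points, w, b):
    """y_i * (w . x_i + b) for every labelled point (x_i, y_i)."""
    return [y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            for x, y in points]

# Hypothetical 2-D toy set; the separating line x1 = 0 gives w = (2, 0),
# b = 0, scaled so the closest points satisfy y (w . x + b) = 1 exactly.
points = [((-1.0, 0.0), -1), ((-0.5, 1.0), -1),
          ((0.5, -1.0), +1), ((1.5, 0.5), +1)]
w, b = (2.0, 0.0), 0.0

margins = functional_margins(points, w, b)
assert all(m >= 1 for m in margins)   # constraints of Eq. 27 hold

# Geometric margin = 1 / ||w||; support vectors attain the minimum margin.
geometric_margin = 1.0 / math.sqrt(sum(wi * wi for wi in w))
support_vectors = [x for (x, y), m in zip(points, margins)
                   if abs(m - 1.0) < 1e-9]
print(geometric_margin, support_vectors)  # 0.5 [(-0.5, 1.0), (0.5, -1.0)]
```

Removing either support vector would allow a wider margin, moving the optimal hyperplane, which is exactly the sensitivity described above.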


Fig. 6 Sample images from the 28 different texture classes: Blanket1, Blanket2, Canvas, Ceiling1, Ceiling2, Cushion1, Floor1, Floor2, Grass1, Lentils1, Linseeds1, Oatmeal, Pearl sugar, Rice1, Rice2, Rug1, Sand1, Scarf1, Scarf2, Screen1, Seat1, Seat2, Sesame seeds, Stone1, Stone2, Stone3, Stonelab, Wall

1. True positive (TP): total number of faulty images accurately tagged
2. True negative (TN): non-defective images that were properly identified as such
3. False positive (FP): number of incorrect classifications
4. False negative (FN): number of erroneous non-defective image identifications

The abbreviations FP and FN indicate misclassifications. FP is referred to as a type-1 error, while FN is referred to as a type-2 error, as shown in Table 3. A type-1 error is somewhat tolerable in medical diagnosis compared to a type-2


Table 2 Dataset description: each of the 28 texture classes (Blanket1, Blanket2, Canvas, Ceiling1, Ceiling2, Cushion1, Floor1, Floor2, Grass1, Lentils1, Linseeds1, Oatmeal, Pearl sugar, Rice1, Rice2, Rug1, Sand1, Scarf1, Scarf2, Screen1, Seat1, Seat2, Sesame seeds, Stone1, Stone2, Stone3, Stonelab, Wall) has 112 training samples, 48 testing samples, and 160 samples in total

Fig. 7 Flowchart of the proposed texture image classification method

error. When the type-2 error rate is high, a greater proportion of individuals with the illness are classified as healthy, which may have serious consequences [15–17]. Table 4 illustrates the confusion matrix for a multiclass problem. For each class, the entities TP, TN, FP, and FN may be assessed using the following equations:


Table 3 Confusion matrix for a binary classification problem

                              Predicted defective     Predicted non-defective
Actual defective image        True positive           False negative
Actual non-defective image    False positive          True negative

Table 4 Confusion matrix for a multiclass classification problem (rows: actual class; columns: predicted class)

                Class 1   Class 2   Class 3   Class 4
Class 1         X11       X12       X13       X14
Class 2         X21       X22       X23       X24
Class 3         X31       X32       X33       X34
Class 4         X41       X42       X43       X44

1. True positive (TP) of class A = X_AA
2. True negative (TN) of class A = Σ_{i=1}^{4} X_ii − X_AA
3. False positive (FP) of class A = Σ_{i=1}^{4} X_iA − X_AA
4. False negative (FN) of class A = Σ_{i=1}^{4} X_Ai − X_AA

For example, the TP, TN, FP, and FN values of Class 1 are computed as:

1. TP of Class 1 = X11

2. TN of Class 1 = Σ_{i=1}^{4} X_ii − X11 = X11 + X22 + X33 + X44 − X11 = X22 + X33 + X44    (28)

3. FP of Class 1 = Σ_{i=1}^{4} X_i1 − X11 = X11 + X21 + X31 + X41 − X11 = X21 + X31 + X41    (29)

4. FN of Class 1 = Σ_{i=1}^{4} X_1i − X11 = X11 + X12 + X13 + X14 − X11 = X12 + X13 + X14    (30)
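The per-class counts above can be turned into a small helper function. The 4 × 4 confusion matrix below is made up purely for illustration, and the TN line follows the chapter's diagonal-based definition in Eq. 28:

```python
def per_class_counts(cm, a):
    """TP, TN, FP and FN for class index `a` of confusion matrix `cm`
    (rows = actual class, columns = predicted class, as in Table 4)."""
    n = len(cm)
    tp = cm[a][a]
    tn = sum(cm[i][i] for i in range(n)) - tp        # Eq. 28 (chapter's definition)
    fp = sum(cm[i][a] for i in range(n)) - tp        # column sum minus diagonal (Eq. 29)
    fn = sum(cm[a][i] for i in range(n)) - tp        # row sum minus diagonal (Eq. 30)
    return tp, tn, fp, fn

# Hypothetical 4-class confusion matrix, used only to exercise the helper.
cm = [[50, 2, 1, 0],
      [3, 45, 0, 2],
      [0, 1, 47, 2],
      [1, 0, 2, 48]]
print(per_class_counts(cm, 0))  # (50, 140, 4, 3)
```

For class 1 (index 0): FP = 3 + 0 + 1 from the first column and FN = 2 + 1 + 0 from the first row, matching Eqs. 29 and 30.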

Table 5 Comparison of different machine learning algorithms for texture data

Metric         Random forest   KNN     SVM
Accuracy       94.45           95.42   99.35
Precision      90.41           81.21   92.09
F1 score       0.86            0.84    0.92
Sensitivity    88.48           89.69   91.65
Specificity    90.41           94.35   99.66

These are the points that help us build our SVM. The performance of the proposed system is measured in terms of sensitivity, specificity, accuracy, precision, false positive rate, and false negative rate. Sensitivity and specificity are important measures in classification, and the accuracy of the system represents the exactness of its classifications; for a system to be precise, repeated measurements of the same object must be close to one another. The overall classification accuracy of the proposed system is around 99.4% with a precision of 92.4%. The false negative rate and false positive rate are very low, around 0.003 and 0.085, respectively. The sensitivity and specificity of the system are around 91.5% and 99.7%. Texture classes 2, 4, 5, 7, 9, 12, and 19 have been classified with better accuracy and precision than the other classes, as shown in Table 5.

5.1 Performance Metrics

5.1.1 Sensitivity

Sensitivity is also known as the true positive rate (TPR), recall, or probability of detection. It is a metric for true positives and provides a precise measure of the test's completeness.

5.1.2 Specificity

Specificity is also referred to as the true negative rate (TNR). It quantifies the true negatives; improving specificity reduces the type-1 error.

5.1.3 False Positive Rate (FPR)

The false positive rate (FPR) is also known as the false alarm rate. It is the ratio of misclassified negative samples to total negative samples (Table 5).


Performance measure          Formula                                    Best score   Worst score
Sensitivity/TPR              TP / (TP + FN) × 100                       100          0
Specificity/TNR              TN / (TN + FP) × 100                       100          0
False positive rate          FP / (TN + FP)                             0            1
False negative rate          FN / (TP + FN)                             0            1
Accuracy                     (TP + TN) / (TP + FP + TN + FN) × 100      100          0
Precision                    TP / (TP + FP)                             1            0
Negative predictive value    TN / (TN + FN)                             1            0
F1 score                     2·TP / (2·TP + FP + FN)                    1            0

Fig. 8 Performance measures
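The formulas in Fig. 8 translate directly into code; the counts passed in below are hypothetical, chosen only to exercise the function:

```python
def metrics(tp, tn, fp, fn):
    """Classification measures from the confusion-matrix counts (Fig. 8)."""
    return {
        "sensitivity": 100 * tp / (tp + fn),              # TPR / recall
        "specificity": 100 * tn / (tn + fp),              # TNR
        "fpr": fp / (tn + fp),                            # false alarm rate
        "fnr": fn / (tp + fn),                            # miss rate
        "accuracy": 100 * (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "npv": tn / (tn + fn),                            # negative predictive value
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Hypothetical counts for one texture class.
m = metrics(tp=45, tn=1290, fp=5, fn=3)
print(round(m["accuracy"], 2), round(m["f1"], 3))
```

Note that sensitivity, specificity, and accuracy are reported on a 0–100 scale here, matching Fig. 8, while the remaining measures lie in [0, 1].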

5.1.4 False Negative Rate (FNR)

The false negative rate (FNR) is also called the miss rate. It is the ratio of misclassified positive samples to total positive samples. A system's performance is measured in terms of its efficiency; the efficiency with which a system solves a classification problem is quantified using metrics such as sensitivity, specificity, false positive rate, false negative rate, accuracy, precision, and F1 score [18, 19]. The formulas used to compute these parameters, along with their reference values, are shown in Figs. 8 and 9; the comparison of our method is shown in Table 6, and individual class performance metrics are shown in Table 7.


Fig. 9 Texture prediction image using support vector machine

6 Conclusion

The texture of an image describes the spatial arrangement of colours or intensities in the image, and it can be used to categorize the image into several classes. A large number of texture features can be computed mathematically and used for image analysis. The method proposed in this chapter combines texture features computed from the GLCM matrix with a standard machine learning algorithm for image classification. The overall classification accuracy of the proposed system is around 99.4% with a precision of 92.4%. The classification accuracy of the system can be further improved by increasing the size of the dataset.


Table 6 Comparison of different SVM-based performance measures for texture data

Class     Accuracy   Precision   F1 score   Sensitivity   Specificity
1         99.76      97.87       0.97       95.83         99.92
2         100        100         1          100           100
3         99.43      91.84       0.93       93.75         99.66
4         99.6       100         0.95       89.58         100
5         99.68      100         0.96       91.67         100
6         99.27      97.56       0.9        83.33         99.92
7         99.92      100         0.99       97.92         100
8         99.35      93.48       0.91       89.58         99.75
9         99.43      91.84       0.93       93.75         99.66
10        99.11      86.27       0.89       91.67         99.41
11        99.19      89.58       0.9        89.58         99.58
12        99.43      100         0.92       85.42         100
13        98.95      88.64       0.86       82.98         99.58
14        99.11      86.27       0.89       91.67         99.41
15        99.03      83.33       0.88       93.75         99.25
16        99.19      91.3        0.89       87.5          99.66
17        99.51      95.65       0.94       91.67         99.83
18        99.6       97.78       0.95       91.67         99.92
19        99.76      100         0.97       93.75         100
20        99.11      84.91       0.89       93.75         99.33
21        98.87      81.48       0.86       91.67         99.16
22        99.51      92          0.94       95.83         99.66
23        98.8       78.95       0.86       93.75         99
24        99.27      93.33       0.9        87.5          99.75
25        98.95      82.69       0.87       91.49         99.25
26        99.43      90.2        0.93       95.83         99.58
27        99.19      88          0.9        91.67         99.5
28        99.43      95.56       0.92       89.58         99.83
Average   99.35      92.09       0.92       91.65         99.66

Table 7 Confusion matrix for 28 classes [28 × 28 matrix of actual vs. predicted class counts, strongly diagonal-dominant; not reproduced here]



References

1. Shanthi, T., Sabeenian, R. S., Manju, K., Paramasivam, M. E., Dinesh, P. M., & Anand, R. (2021). Fundus image classification using hybridized GLCM features and wavelet features. ICTACT Journal of Image and Video Processing, 11(03), 2345–2348.
2. Veni, S., Anand, R., & Vivek, D. (2020). Driver assistance through geo-fencing, sign board detection and reporting using android smartphone. In K. Das, J. Bansal, K. Deep, A. Nagar, P. Pathipooranam, & R. Naidu (Eds.), Soft computing for problem solving. Advances in Intelligent Systems and Computing (Vol. 1057). Singapore: Springer.
3. Kylberg, G. The Kylberg Texture Dataset v. 1.0, Centre for Image Analysis, Swedish University of Agricultural Sciences and Uppsala University, External report (Blue series) No. 35. Available online at: http://www.cb.uu.se/gustaf/texture/
4. Anand, R., Veni, S., & Aravinth, J. (2016). An application of image processing techniques for detection of diseases on brinjal leaves using k-means clustering method. In 2016 International Conference on Recent Trends in Information Technology (ICRTIT). IEEE.
5. Sabeenian, R. S., & Palanisamy, V. (2009). Texture-based medical image classification of computed tomography images using MRCSF. International Journal of Medical Engineering and Informatics, 1(4), 459.
6. Sabeenian, R. S., & Palanisamy, V. (2008). Comparison of efficiency for texture image classification using MRMRF and GLCM techniques. International Journal of Computers Information Technology and Engineering (IJCITAE), 2(2), 87–93.
7. Haralick, R. M., Shanmugam, K., & Dinstein, I. H. (1973). Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, 6, 610–621.
8. Varma, M., & Zisserman, A. (2005). A statistical approach to texture classification from single images. International Journal of Computer Vision, 62(1–2), 61–81.
9. Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9(3), 293–300.
10. Shanthi, T., & Sabeenian, R. S. (2019). Modified AlexNet architecture for classification of diabetic retinopathy images. Computers and Electrical Engineering, 76, 56–64.
11. Sabeenian, R. S., Paramasivam, M. E., Selvan, P., Paul, E., Dinesh, P. M., Shanthi, T., Manju, K., & Anand, R. (2021). Gold tree sorting and classification using support vector machine classifier. In Advances in Machine Learning and Computational Intelligence (pp. 413–422). Singapore: Springer.
12. Shobana, R. A., & Shanthi, D. T. (2018). GLCM based plant leaf disease detection using multiclass SVM. International Journal for Research & Development in Technology, 10(2), 47–51.
13. Scholkopf, B., & Smola, A. J. (2001). Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press.
14. Bennett, K. P., & Demiriz, A. (1999). Semi-supervised support vector machines. In Advances in Neural Information Processing Systems (pp. 368–374).
15. Anand, R., Shanthi, T., Nithish, M. S., & Lakshman, S. (2020). Face recognition and classification using GoogLeNET architecture. In K. Das, J. Bansal, K. Deep, A. Nagar, P. Pathipooranam, & R. Naidu (Eds.), Soft computing for problem solving. Advances in Intelligent Systems and Computing (Vol. 1048). Singapore: Springer.
16. Shanthi, T., Sabeenian, R. S., & Anand, R. (2020). Automatic diagnosis of skin diseases using convolution neural network. Microprocessors and Microsystems, 76, 103074.
17. Hall-Beyer, M. (2000). GLCM texture: A tutorial. In National Council on Geographic Information and Analysis Remote Sensing Core Curriculum 3.


18. Shanthi, T., Anand, R., Annapoorani, S., & Birundha, N. (2023). Analysis of phonocardiogram signal using deep learning. In D. Gupta, A. Khanna, S. Bhattacharyya, A. E. Hassanien, S. Anand, & A. Jaiswal (Eds.), International conference on innovative computing and communications (Lecture Notes in Networks and Systems) (Vol. 471). Springer. https://doi.org/10.1007/978-981-19-2535-1_48
19. Kandasamy, S. K., Maheswaran, S., Karuppusamy, S. A., Indra, J., Anand, R., Rega, P., & Kathiresan, K. (2022). Design and fabrication of flexible nanoantenna-based sensor using graphene-coated carbon cloth. Advances in Materials Science & Engineering.

Progress in Multimodal Affective Computing: From Machine Learning to Deep Learning

M. Chanchal and B. Vinoth Kumar

1 Introduction

Emotions and sentiments play a significant role in our day-to-day lives. They help in decision-making, learning, communication, and handling situations. Affective computing is a technology that aims to detect, perceive, interpret, process, and replicate emotions from given data sources using different types of techniques. The word "affect" is a synonym for "emotions." Affective computing technology is a human-computer interaction system that processes data captured through cameras, microphones, and sensors and determines the user's emotional state. Advances in signal processing and AI have led to the use of affective computing in medicine, industry, and academia alike for detecting and processing affective information from data sources [5]. Emotions can be recognized either from one type of data or from more than one type. Hence, affective computing can be classified broadly into two types: unimodal affective computing and multimodal affective computing. Figure 1 depicts an overview of affective computing. Unimodal systems are those in which emotions are recognized from one type of data. Generally, human beings rely on multimodal information more than unimodal information, because one can understand a person's intention by looking at his or her facial expression while he or she is speaking. In this case, the audio and video data together provide more information than either type of data alone. For example, during an online class, the teacher can interpret

M. Chanchal: Department of Computer Science and Engineering, Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Coimbatore, India
B. Vinoth Kumar: Department of Information Technology, PSG College of Technology, Coimbatore, India; e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_6



Fig. 1 Overview of affective computing (sensing the human affect response, recognizing the affect response, understanding and modelling affect, and affect expression, in a loop with the emotive human)

Fig. 2 Multimodal affective computing (data sources such as image, audio, video, and physiological signals pass through pre-processing, feature extraction and selection, and model selection and training before affect computing)

more accurately whether the students have understood the class by both looking at the students' expressions and asking for their feedback, rather than by asking only for their feedback. The way people express their opinions varies from person to person: one person may express an opinion more verbally, while another may express it through facial expression [12]. Thus, a model that can interpret emotion for any type of person is required, and this is where multimodal affective computing plays a major role. Unimodal systems are the building blocks of multimodal systems, and a multimodal system outperforms a unimodal system since more than one type of data is used for interpretation. The multimodal affective computing structure is presented in Fig. 2. To date, only very limited survey analysis has been done on multimodal affective computing, and previous studies do not concentrate specifically on machine learning and deep learning approaches. With the advancement of AI techniques, a number of machine learning and deep learning algorithms can be applied to multimodal affective computing. The objective of this chapter is to provide a clear idea of the various machine learning and deep learning methods used for multimodal affect computing. In addition, details about the various datasets, modalities, and fusion techniques are elaborated. The remaining part


of this chapter is organized into multiple sections. Section 2 presents the available datasets, Sect. 3 elaborates on the various features used for affect recognition, Sect. 4 explains the various fusion techniques, Sect. 5 describes the various machine learning and deep learning techniques for multimodal affect recognition, Sect. 6 provides a discussion, and finally, Sect. 7 concludes the chapter.

2 Available Datasets

Two types of datasets are found in the literature: publicly available datasets and datasets collected from subjects based on a predecided concept. In the latter, subjects are selected based on the tasks to be performed, and the respective data are collected for further processing. This section describes the publicly available datasets for multimodal affective computing; they are summarized in Table 1.

2.1 DEAP Dataset

The DEAP dataset [1] was collected from 32 subjects who watched 40 emotion-stimulating video clips, each 1 min long. Based on these, EEG signals and Peripheral Physiological Signals (PPS) were captured; the PPS include both electromyographic (EMG) and EOG data. Four emotion dimensions (valence, arousal, liking, and dominance) are rated on a scale of 1–9.

2.2 AMIGOS Dataset

The AMIGOS database [4] was collected from subjects using two different experimental settings, mainly for mood, personality, and affect research purposes. In the first setting, 40 subjects watched 16 short videos, each lasting between 51 and 150 s. In the second setting, some subjects watched four long videos in different scenarios, both individually and in groups. Wearable sensors were used to capture the EEG, ECG, and GSR signals. The dataset also contains face and depth data that were collected using separate equipment.

2.3 CHEAVAD 2.0 Dataset

This is an extension of the CHEAVAD dataset [14], adding 4178 samples to it. The CHEAVAD 2.0 data were collected from Chinese movies, soap operas, and


Table 1 Datasets for multimodal affective computing

DEAP [1]: modality A+V+T+B; 32 subjects; facial, text, EEG, and PPS signals; emotions: arousal, valence, dominance, and liking.
AMIGOS [4]: modality A+V+B; 40 subjects; facial, EEG, ECG, and GSR signals; emotions: valence, arousal, dominance, liking, familiarity, and basic emotions.
CHEAVAD 2.0 [14]: modality A+V; 527 subjects; audio and video data; emotions: neutral, happiness, sadness, anger, surprise, fear, disgust, frustration, and excitement.
RECOLA [26]: modality A+V+B; 46 subjects; audio, video, ECG, and EDA signals; emotions: arousal and valence.
IEMOCAP [28]: modality A+V+T; 10 subjects (5 male, 5 female); audio, video, and lexical data; emotions: angry, sad, happy, and neutral.
CMU-MOSEI [17]: modality A+V; 1000 subjects; audio and video data; emotions: angry, disgust, fear, happy, sad, and surprise.
SEED IV [30]: modality A+V+B; 44 subjects; audio, video, and EEG signals; emotions: angry, sad, happy, and neutral.
AVEC 2014 [9]: modality A+V; subjects not specified; audio, video, and lexical data; BDI-II depression scale range.
SEWA [26]: modality A+V+T; 64 subjects; audio, video, and textual data; emotions: arousal and valence.
AVEC 2018 [7]: modality A+V+T; 64 subjects; audio, video, and lexical data; emotions: arousal, valence, and preference for commercial products.
DAIC-WOZ [22]: modality A+V+T; subjects not specified; audio, facial features, voice, facial action, and eye features; BDI-II depression scale range.
UVA toddler [23]: modality A+V; 61 subjects; audio, video, and head and body pose; Classroom Assessment Scoring System (CLASS) dimension, positive or negative.
MET [23]: modality A+V; 3000 subjects; audio and video features; Classroom Assessment Scoring System (CLASS) dimension, positive or negative.

A audio, V video, T text, B biological; ECG electrocardiography, EEG electroencephalography, GSR galvanic skin response, EDA electrodermal activity, PPS Peripheral Physiological Signals

TV shows that contain background noise so as to simulate real-world conditions. The dataset has 474 min of emotional segments and contains 527 speakers, ranging from children to the elderly. The subjects were distributed as 58.4%


male and 41.6% female. The duration of the video clips ranges from 1 to 19 s, with an average of 3.3 s.

2.4 RECOLA Dataset

The Remote COLlaborative and Affective (RECOLA) dataset [26] is a multimodal database of spontaneous collaborative and affective interactions in French. Forty-six French-speaking subjects were recorded during video conferences in dyadic interaction while completing a task that required collaboration, totalling 9.5 h. Six annotators measured the emotions on two dimensions: arousal and valence. The dataset contains audio, video, and physiological signals, namely electrocardiogram (ECG) and electrodermal activity (EDA).

2.5 IEMOCAP Dataset

The Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset [28] contains speech, text, and face modalities collected from ten actors during dyadic interactions using a motion-capture camera. The conversations include both spontaneous and scripted sessions. There are four labeled annotations: angry, sad, happy, and neutral. The dataset has five sessions, each between one female and one male speaker.

2.6 CMU-MOSEI Dataset

The CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset [17] contains 23,453 annotated video segments. It includes 1000 different speakers and 250 topics taken from social media. The dataset has six labeled annotations: angry, disgust, fear, happy, sad, and surprise.

2.7 SEED IV Dataset

The SEED IV dataset [30] contains four annotated emotions: happy, sad, fear, and neutral. Forty-four subjects took part, of whom 22 were female college students. They were asked to assess their emotions while watching film clips as sad, happy, fear, or neutral, with ratings from −5 to 5 on two dimensions: arousal and valence. The valence scale ranges from sad to happy, and the


arousal scale ranges from calm to excited. In the end, 72 film clips with the highest agreement among the subjects were selected; each clip lasted 2 min.

2.8 AVEC 2014 Dataset

This is a subset of the audiovisual depression language corpus [9]. The dataset has 300 video clips, recorded using Web cameras and microphones while people were interacting with a computer. One to four recordings were taken of each subject, with a gap of 2 weeks between recordings. The length of the video clips is between 6 s and 4 min. The subjects are aged 18–63, with an average age of 31.5 years. The BDI-II depression scale ranges from 0 to 63, where 0–10 is normal, 11–16 is mild depression, 17–20 is borderline, 21–30 is moderate depression, 31–40 is severe depression, and above 40 is extreme depression. The highest score recorded was 45.

2.9 SEWA Dataset

The Sentiment Analysis in the Wild (SEWA) dataset [26] contains audio and video recordings collected from Web cameras and microphones, annotated for the natural emotions arousal and valence. The dataset includes 64 subjects aged 18 to 60 years, with 36 subjects in the training set, 14 in the validation set, and 16 in the testing set. The subjects were paired (32 pairs in total) and made to watch commercial videos, then asked to discuss the content of the video with their partner for up to 3 min. The dataset includes text, audio, and video data. Six German-speaking annotators (three male and three female) annotated the dataset for arousal and valence.

2.10 AVEC 2018 Dataset

This is an extension of the AVEC 2017 database [7]. AVEC 2017 is like the SEWA dataset of German culture, with 64 subjects: 36 for training, 14 for validation, and 16 for testing. In the AVEC 2018 dataset, the testing set adds new subjects of Hungarian culture in the same age range as the German subjects. The dataset includes both audio and video recordings and annotates three emotions: arousal, valence, and preference for the commercial products, each on a scale from −1 to +1. The duration of the recordings is 40 s to 3 min, and the emotions were annotated every 100 ms.


2.11 DAIC-WOZ Dataset

The Distress Analysis Interview Corpus depression dataset [22] includes clinical interviews used for the diagnosis of psychological conditions such as anxiety, depression, and posttraumatic stress disorder. It includes audio and video recordings and questionnaire responses from interviews conducted by a virtual interviewer called Ellie, controlled by a human interviewer in another room. It contains 189 interview sessions. Each session contains the audio file of the interview, 68 facial points of the subject, HOG (Histogram of Oriented Gradients) facial features, head pose, eye features, a file of continuous facial actions, a file containing the subject's voice, and a transcript file of the interview. All features except the transcript file are time-series data.

2.12 UVA Toddler Dataset

The University of Virginia (UVA) Toddler dataset [23] has 192 videos, each 45–60 min long, collected from 61 child-care centers with toddlers 2–3 years old. The videos were recorded using a digital camera with an integrated microphone. Each video covers a day of preschool, including individual and group activities, outdoor play, and shared meals, with activities such as singing, reading, and playing with blocks and toys. Each session includes an average of 1.7 teachers and 7.59 students. The dataset includes video, audio along with background noise, and head and body pose.

2.13 MET Dataset

The Measures of Effective Teaching (MET) dataset [23] is a Classroom Assessment Scoring System (CLASS)-coded video dataset. It includes 16,000 videos in which 3000 teachers teach language, mathematics, arts, and science in both middle and elementary schools across six districts of the USA. The data were collected using 360° cameras with integrated microphones, placed in the center of the classroom to capture both the teachers and the students properly.

3 Features for Affect Recognition

Affective computing requires the extraction of meaningful information from the gathered data, which can be done using various techniques. This section describes the various modalities, along with their techniques, used in multimodal affective


Fig. 3 Categories of modality for affect recognition (physiological signals such as EEG, ECG, GSR, and PPS are captured through sensors; behavioral signals such as audio, video, text, and facial expression are naturally observed or obtained through computer interaction)

computing. The modalities acquired from a subject fall into two broad categories: physiological and behavioral. In this section, the primary focus is on audio, visual, textual, facial expression, and biological signals, along with their techniques. Of these modalities, audio, visual, textual, and facial expression fall under the behavioral category, and the biological signals are the physiological signals. Figure 3 shows the modality categories.

3.1 Audio Modality Audio is one medium for capturing the emotions of a user. The openSMILE toolkit is a popular tool for extracting audio features such as pitch, utterance intensity, bandwidth, pause duration, and perceptual linear predictive (PLP) coefficients [19]. Mel-frequency cepstral coefficients (MFCC) [28] are the most popular audio features. Nowadays, deep neural networks are increasingly used to extract better audio features.
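As an illustration of the kind of low-level descriptors such toolkits compute, the following is a minimal numpy sketch of two classic frame-level audio features, short-time energy (a crude intensity measure) and zero-crossing rate. The frame and hop sizes are illustrative assumptions; this is not the openSMILE or MFCC pipeline itself.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def short_time_energy(frames):
    """Mean squared amplitude per frame (a crude intensity measure)."""
    return np.mean(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of sign changes per frame (correlates with noisiness)."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

# Toy example: 1 s of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
frames = frame_signal(x)
energy = short_time_energy(frames)
zcr = zero_crossing_rate(frames)
```

A real pipeline would compute many such descriptors per frame and then summarize them over the utterance (means, ranges, slopes) before classification.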

Progress in Multimodal Affective Computing: From Machine Learning to Deep. . .

3.2 Visual Modality Similar to audio, the visual modality is an important source for affect computing. The openSMILE toolkit is one common tool for extracting visual features [19]. Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) are also sometimes used as a baseline visual feature set [14]. Deep neural networks such as ResNet-50, DenseNet, VGG Face, MobileNet, and HRNet can also be used to extract richer visual features [25].
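LBP-TOP builds on plain Local Binary Patterns. As a minimal numpy sketch, the basic 2-D operator is shown below on a single grayscale image; LBP-TOP itself applies the same operator on the XY, XT, and YT planes of a video volume and concatenates the histograms.

```python
import numpy as np

def lbp_8neigh(img):
    """Basic 8-neighbour Local Binary Pattern for a 2-D grayscale image.

    Each interior pixel is replaced by an 8-bit code: one bit per
    neighbour, set when that neighbour >= the centre pixel.
    """
    c = img[1:-1, 1:-1]
    # Neighbour offsets, clockwise from the top-left corner
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neigh = img[1 + dy : img.shape[0] - 1 + dy,
                    1 + dx : img.shape[1] - 1 + dx]
        code |= ((neigh >= c).astype(np.uint8) << bit)
    return code

def lbp_histogram(img, bins=256):
    """Normalised histogram of LBP codes, usable as a texture feature."""
    codes = lbp_8neigh(img)
    hist, _ = np.histogram(codes, bins=bins, range=(0, bins))
    return hist / hist.sum()
```

On a perfectly flat patch every neighbour equals the centre, so all codes are 255; textured regions spread mass across the histogram, which is what makes the descriptor discriminative.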

3.3 Textual Modality Text features play a vital role in affect computing. Textual features are of two types: Crowd-Sourced Annotation (CSA) features and DISfluency and Non-verbal Vocalization (DIS-NV) features. DIS-NV features are produced by manual annotation. CSA features are extracted by removing stop words like "a," "and," "the," etc. and then lemmatizing the remaining words using the Natural Language Toolkit [19]. Part-of-speech (PoS) tags, n-gram features, and TF-IDF (term frequency-inverse document frequency) weights are useful features for emotion recognition. Google's English word embeddings (Word2Vec) [12] and Global Vectors (GloVe) are also used to extract textual features.
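The stop-word removal and TF-IDF weighting steps can be sketched from scratch in a few lines. The stop-word list and example documents below are illustrative only, and the NLTK lemmatization step is omitted.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, split on whitespace, drop stop words (no lemmatization here)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf(docs):
    """Return one {term: tf-idf weight} dict per document.

    tf  = raw count of the term in the document
    idf = log(N / df), where df = number of documents containing the term
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(toks).items()}
            for toks in tokenized]

docs = ["the movie was happy and fun", "the movie was sad", "sad and angry words"]
weights = tf_idf(docs)
```

Terms appearing in every document receive an idf of zero, so only distinctive words carry weight into the downstream emotion classifier.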

3.4 Facial Expression Faces can be detected using the AdaBoost algorithm with Haar-like features [8]. The Chehra algorithm can be used to locate facial landmark points in an image frame [11]. From these facial points, further features can be extracted using a facial Action Unit (AU) recognition algorithm [17]. LBP-TOP can be used to extract facial texture features [14]. Facial landmarks can also be detected using OpenFace, an open-source tool [22].

3.5 Biological Signals The biological signals include ECG (electrocardiography), EEG (electroencephalography), GSR (galvanic skin response), and PPS (peripheral physiological signals). All these signals are captured using appropriate electrodes attached to the human body. NeuroScan is one system that can be used for recording and analyzing EEG signals [30], and Shimmer3 sensors can be used for measuring ECG [2].
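As an example of turning such raw signals into features, EEG band power (used, for instance, by Marín-Morales et al. [16]) can be estimated from a simple periodogram. This numpy sketch uses a synthetic 10 Hz "alpha" tone plus noise; the 128 Hz sampling rate and the band edges are illustrative assumptions.

```python
import numpy as np

def band_power(signal, sr, band):
    """Average spectral power of `signal` within the frequency `band` (Hz),
    estimated from the magnitude-squared FFT (a crude periodogram)."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    psd = np.abs(np.fft.rfft(signal)) ** 2 / len(signal)
    mask = (freqs >= band[0]) & (freqs < band[1])
    return psd[mask].mean()

# Synthetic 2 s "EEG" at 128 Hz: a 10 Hz alpha component plus noise
sr = 128
t = np.arange(2 * sr) / sr
eeg = np.sin(2 * np.pi * 10 * t) + 0.1 * np.random.default_rng(1).normal(size=t.size)
alpha = band_power(eeg, sr, (8, 13))   # alpha band dominates here
beta = band_power(eeg, sr, (13, 30))   # beta band contains only noise
```

In practice, Welch-style averaging over windows is preferred to a single periodogram, but the band-power feature fed to the classifier is the same idea.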

4 Various Fusion Techniques Multimodal affect computing involves the fusion of the various captured modalities. To perform the analysis, the modalities are combined using various fusion techniques. Fusing several data sources provides richer information and thus achieves


more accurate results. There are two main levels of fusion: feature-level fusion (early fusion) and decision-level fusion (late fusion). There are also other fusion techniques, such as hierarchical fusion, model-level fusion, and score-level fusion. This section describes these various fusion techniques.
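As a minimal sketch of the feature-level (early) variant, per-modality feature vectors can be normalised and concatenated into one joint vector before a single classifier sees them. The feature values below are hypothetical.

```python
import numpy as np

def early_fusion(*modalities):
    """Feature-level fusion: z-normalise each modality, then concatenate.

    Per-modality normalisation keeps one modality's numeric scale from
    dominating the joint feature vector fed to a single classifier.
    """
    normed = []
    for m in modalities:
        m = np.asarray(m, dtype=float)
        std = m.std()
        normed.append((m - m.mean()) / std if std > 0 else m - m.mean())
    return np.concatenate(normed)

audio_feats = [0.1, 0.5, 0.9]   # hypothetical audio descriptors
video_feats = [120.0, 80.0]     # hypothetical video descriptors
fused = early_fusion(audio_feats, video_feats)
```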

4.1 Decision-Level or Late Fusion Decision-level fusion [20] combines the emotion recognition results of several unimodal classifiers using an algebraic combination method. Each modality is fed individually to its own unimodal classifier, and the results from these emotion classifiers are combined with algebraic rules like "Sum," "Min," "Max," etc.; hence, decision-level fusion is called late fusion. In decision-level fusion, a unimodal emotion recognizer is built for each of the multimodal feature sets. The main advantage of decision-level fusion is that complete knowledge about each individual modality can be applied separately.
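A minimal numpy sketch of these algebraic combination rules follows; the emotion classes and probability values are hypothetical.

```python
import numpy as np

def late_fusion(prob_list, rule="sum"):
    """Decision-level fusion: combine per-modality class probabilities.

    prob_list: list of 1-D arrays, one per unimodal classifier, each
    giving probabilities over the same set of emotion classes.
    Returns the index of the winning class under the chosen rule.
    """
    probs = np.stack([np.asarray(p, dtype=float) for p in prob_list])
    combined = {"sum": probs.sum(axis=0),
                "max": probs.max(axis=0),
                "min": probs.min(axis=0),
                "product": probs.prod(axis=0)}[rule]
    return int(np.argmax(combined))

# Hypothetical unimodal outputs over classes [happy, sad, angry]
audio_probs = [0.2, 0.5, 0.3]
video_probs = [0.7, 0.2, 0.1]
pred = late_fusion([audio_probs, video_probs], rule="sum")
```

Note how the rules can disagree: "sum" rewards consistent moderate support, while "max" lets a single confident modality dominate.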

4.2 Hierarchical Fusion Hierarchical fusion techniques use different multimodal feature sets at different levels of the hierarchy [19]. For example, perceived emotion annotation features may be used in the lower layers of a model, whereas abstract features like text, audio, or video features are used in the higher layers. This method fuses two-stream networks at different levels of the hierarchy to improve emotion recognition performance.

4.3 Score-Level Fusion Score-level fusion is another variant of decision-level fusion [29]. It is mainly used in audiovisual emotion recognition systems. Class score values are obtained using a technique such as an equally weighted summation, and the final prediction is the emotion category with the maximum value in the fused score vector. Score-level fusion combines the individual classification scores, which indicate the possibility that a sample belongs to each class, whereas decision-level fusion combines the predicted class labels themselves.
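A minimal numpy sketch of (equally or unequally) weighted score summation; the score vectors stand in for hypothetical SVM decision values.

```python
import numpy as np

def score_fusion(score_list, weights=None):
    """Score-level fusion: weighted sum of raw class-score vectors.

    The predicted emotion is the class with the maximum fused score.
    With weights=None this reduces to the equally weighted summation
    mentioned in the text.
    """
    scores = np.stack([np.asarray(s, dtype=float) for s in score_list])
    if weights is None:
        weights = np.full(len(score_list), 1.0 / len(score_list))
    fused = np.average(scores, axis=0, weights=weights)
    return fused, int(np.argmax(fused))

audio_scores = [1.2, 3.4, 0.8]   # hypothetical per-class decision values
video_scores = [2.0, 1.1, 0.5]
fused, label = score_fusion([audio_scores, video_scores])
```

Unlike pure decision-level fusion, the continuous scores preserve how strongly each classifier favoured every class, not just its single predicted label.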


4.4 Model-Level Fusion Model-level fusion is a compromise between feature-level fusion and decision-level fusion [29]. It exploits the correlation between the data observed from the different modalities, fusing the data in a relaxed manner, and mainly uses a probabilistic approach. This approach is used mostly for audio and video modalities. In neural networks, for example, model-level fusion is performed by fusing features at the hidden layers, where the multiple modalities are given as input; an additional hidden layer is then added to learn the joint features from the fused feature vector.
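A minimal numpy sketch of this idea, with untrained random weights and hypothetical layer sizes, purely to show where the joint hidden layer sits between the modality branches and the output.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def model_level_fusion(x_audio, x_video, params):
    """Model-level fusion sketch: each modality has its own hidden layer;
    the hidden activations are concatenated and a joint hidden layer
    learns a shared representation before the output layer."""
    h_a = relu(x_audio @ params["Wa"])                          # audio branch
    h_v = relu(x_video @ params["Wv"])                          # video branch
    h_joint = relu(np.concatenate([h_a, h_v]) @ params["Wj"])   # fused layer
    logits = h_joint @ params["Wo"]
    e = np.exp(logits - logits.max())
    return e / e.sum()                                          # softmax over classes

# Hypothetical dimensions: 10-d audio, 8-d video, 4 hidden units per branch,
# 6 joint units, 3 emotion classes; weights are random, not trained.
params = {"Wa": rng.normal(size=(10, 4)), "Wv": rng.normal(size=(8, 4)),
          "Wj": rng.normal(size=(8, 6)), "Wo": rng.normal(size=(6, 3))}
probs = model_level_fusion(rng.normal(size=10), rng.normal(size=8), params)
```

In a real system the branch, joint, and output weights are all trained end to end, which is what lets the joint layer capture cross-modal correlations.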

5 Multimodal Affective Computing Techniques Multimodal affective computing is a method of emotion recognition from more than one modality. The features extracted from the data are used to train a model for emotion recognition, and this model can use any type of technique. In recent years, a growing number of machine learning and deep learning techniques have been applied to multimodal affective computing. This section surveys these machine learning and deep learning techniques for emotion recognition; they are summarized in Tables 2 and 3, respectively.

5.1 Machine Learning-Based Techniques Eun-Hye Jang et al. [10] proposed a method for fear-level detection using physiological measures such as skin conductance level and response (SCL, SCR), heart rate (HR), pulse transit time (PTT), fingertip temperature (FT), and respiratory rate (RR). The task was performed using data collected from 230 subjects who were asked to watch fear-inducing video clips. Correlation and linear regression among the physiological measures were performed to check fear intensity, using the nonparametric Spearman's rank correlation coefficient. SCR and HR were positively correlated with intensity, whereas SCL, RR, and FT were negatively correlated. The method showed an accuracy of 92.5% on fear-inducing clips. Oana Balan et al. [2] proposed an automated fear-level detection and acrophobia virtual therapy system. It used galvanic skin response (GSR), heart rate (HR), and electroencephalography (EEG) values from subjects who played an acrophobia game while undergoing in vivo therapy and virtual reality therapy. Two classifiers were used: one to determine the present fear level and another to determine the game level to be played next. ML techniques like Support Vector Machine, Random Forest, k-Nearest Neighbors,


Table 2 Machine learning-based techniques for multimodal affect computing

Eun-Hye Jang et al. [10]: data collected from 230 subjects; features: SCL, SCR, HR, PTT, FT, and RR; model: Spearman's rank correlation; result: 92.5% accuracy.
Oana Balan et al. [2]: data collected from subjects playing an acrophobia game; features: GSR, HR, and EEG; models: SVM, RF, KNN, Linear Discriminant Analysis, and four deep neural network models; result: two scales, SVM 89.5% and DNN 79.12%; four scales, KNN 52.75% and SVM 42.5%.
Seul-Kee Kim et al. [13]: data collected directly from subjects (shown clips of real pedestrian environments); features: EEG, ECG, and GSR; models: t-tests or Mann-Whitney U tests, ANOVAs or Kruskal-Wallis tests; result: comparison done using significance values.
Cheng-Hung Wang et al. [27]: data collected from 136 subjects; features: textual features and facial expression; models: t-test and Cohen's d standard; result: highest effect value (0.71).
Jose Maria Garcia-Garcia et al. [5]: data collected from six children; features: facial expression, key strokes, and speech features; model: t-test; result: better SUS score with fewer attempts (60% less).
Shahla Nemati et al. [20]: DEAP dataset; features: video, audio, and text features; models: SVM and Naive Bayes; result: SVM had better accuracy (92%).
Javier Marín-Morales et al. [16]: data collected from 60 subjects; features: EEG and heart rate variability (HRV); models: SVM-RFE and LOSO cross-validation; result: arousal 75%, valence 71.21%.
Li Ya et al. [14]: CHEAVD 2.0 dataset; features: audio and video features; model: SVM with decision-level and feature-level fusion; result: decision-level fusion 35.7% MAP, feature-level fusion 21.7% MAP.
Deger Ayata et al. [1]: DEAP emotion dataset; features: GSR and PPG (photoplethysmography); models: KNN, RF, and decision tree methods; result: arousal 72.06%, valence 71.05%.
Asim Jan et al. [9]: AVEC 2014 dataset; features: audio and visual features; models: Feature Dynamic History Histogram (FDHH) algorithm, motion history histogram (MHH), PLS regression, and LR techniques; result: FDHH had better RMSE and MAE.
Sandeep Nallan Chakravarthula et al. [3]: data collected from 62 couples; features: acoustic, lexical, and behavioral features; model: SVM; result: recall 13–20% better than chance.
Nathan L. Henderson et al. [6]: data collected from 119 subjects; features: posture data and electrodermal activity data; models: SVM and neural network; result: Kappa score of the multimodal model was better.
Papakostas M et al. [21]: data collected from 45 subjects; features: visual and physiological information; models: RF, Gradient Boosting classifier, and SVM classifier; result: F1 score of the multimodal model was better.
Anand Ramakrishnan et al. [23]: UVA Toddler dataset and MET dataset; features: audio features and face images of both the teacher and students; models: ResNet, Pearson correlation, and Spearman correlation; result: ResNet correlations, positive 0.55 and negative 0.63; Pearson correlations, positive 0.36 and negative 0.41; Spearman correlations, positive 0.48 and negative 0.53.
Dongmin Shin et al. [24]: data collected from 30 subjects; features: EEG and ECG signals; models: Bayesian network (BN), SVM, and MLP; result: BN had the highest accuracy (98.56%).

Abbreviations: SCL/SCR, skin conductance level/response; HR, heart rate; PTT, pulse transit time; FT, fingertip temperature; RR, respiratory rate; GSR, galvanic skin response; EEG, electroencephalography; ECG, electrocardiography; SVM, Support Vector Machine; RF, Random Forest; KNN, k-Nearest Neighbors.

Linear Discriminant Analysis, and four deep neural network models were used. The models were compared on accuracy: the DNN model had the highest accuracy of 79.12% for the player-independent setting, and SVM had an accuracy of 89.5% for the player-dependent setting on two scales, whereas on four scales the best accuracies were obtained by KNN (52.75%) and SVM (42.5%). Seul-Kee Kim et al. [13] proposed a method to determine the fear of crime using multimodal data collected from subjects who were shown clips of real pedestrian environments. Features like electroencephalographic (EEG), electrocardiographic (ECG), and galvanic skin response (GSR) signals were used. To compare the difference in fear of crime between the two groups (the Low Fear of crime Group (LFG) and the High Fear of crime Group (HFG)), independent t-tests or Mann-Whitney U tests were used. To compare fear of crime across the video clips shown to the subjects, ANOVAs or Kruskal-Wallis tests were used. The values were compared at a significance level of p < 0.05. Cheng-Hung Wang et al. [27] proposed a method for multimodal


Table 3 Deep learning-based techniques for multimodal affect computing

Michal Muszynski et al. [19]: LIRIS-ACCEDE dataset; features: audiovisual features, lexical features, physiological reactions like GSR, and acceleration (ACC) signals; models: LSTM, DBN, and SVR; result: LSTM had the best result (A, arousal; V, valence): MSE (A 0.260, V 0.070), CC (A 0.251, V 0.266), CCC (A 0.111, V 0.143).
Joaquim Comas et al. [4]: AMIGOS dataset; features: facial and physiological signals like ECG, EEG, and GSR; model: BMMN (bio multimodal network); result: accuracy of 87.53% for arousal and 65.05% for valence.
Jiaxin Ma et al. [15]: DEAP dataset; features: EEG signals and physiological signals; models: deep LSTM network, residual LSTM network, and Multimodal Residual LSTM (MM-ResLSTM) network; result: MM-ResLSTM had the best result, arousal 92.87% and valence 92.30%.
Panagiotis Tzirakis et al. [26]: RECOLA dataset; features: audio and video features; model: LSTM network; result: raw audio and video signals, arousal 78.9% and valence 69.1%; raw audio with raw and geometric video signals, arousal 78.8% and valence 73.2%.
Seunghyun Yoon et al. [28]: IEMOCAP dataset; features: text and audio features; model: Multimodal Dual Recurrent Encoder (MDRE); result: WAP value 0.718, accuracy 68.8% to 71.8%.
Trisha Mittal et al. [17]: IEMOCAP and CMU-MOSEI datasets; features: facial, text, and speech features; model: LSTM; result: increase of 2–7% in F1 and 5–7% in MA.
Siddharth et al. [11]: AMIGOS dataset; features: EEG, ECG, GSR, and frontal videos; model: Extreme Learning Machine (ELM) with 10-fold cross-validation; result: EEG and frontal videos, accuracy 52.51%; GSR and ECG, accuracy 38.28%.
Wei-Long Zheng et al. [30]: data collected from 44 subjects; features: EEG and eye movements; models: bimodal deep auto-encoder (BDAE) and SVM; result: accuracy 85.11%.
Shiqing Zhang et al. [29]: RML, eNTERFACE05, and BAUM-1s databases; features: audio and video features; model: 3D-CNN with DBN; result: accuracy of 80.36% (RML), 85.97% (eNTERFACE05), and 54.57% (BAUM-1s).
Eesung Kim et al. [12]: IEMOCAP dataset; features: acoustic and lexical features; model: deep neural network (DNN); result: WAR 66.6, UAR 68.7.
Huang Jian et al. [7]: AVEC 2018 dataset; features: visual, acoustic, and textual features; model: LSTM-RNN; result: arousal 0.599–0.524, valence 0.721–0.577, liking 0.314–0.060.
Qureshi et al. [22]: DAIC-WOZ depression dataset; features: acoustic, visual, and textual features; model: attention-based fusion network with deep neural network (DNN); result: accuracy 60.61%.
Luntian Mou et al. [18]: data collected from 22 subjects; features: eye features and vehicle and environmental data; model: attention-based CNN-LSTM network; result: accuracy 95.5%.
Panagiotis Tzirakis et al. [26]: SEWA dataset; features: text, audio, and video features; model: attention-based fusion strategies; result: arousal 69%, valence 78.3%.

Abbreviations: GSR, galvanic skin response; EEG, electroencephalography; ECG, electrocardiography; SVM, Support Vector Machine; LSTM, Long Short-Term Memory; DBN, Deep Belief Network; RNN, Recurrent Neural Network; SVR, Support Vector Regression.

emotion computing for a tutoring system. It used textual features and facial expressions collected from 136 subjects, analyzed with t-tests; Cohen's d standard was used to determine the significance level. The model compared the test results of a normal Internet teaching group and an affective teaching group. Pretests and posttests were conducted for both groups, and the posttest of the emotional teaching group produced a moderate-to-high effect value (0.71) and a closer significance value. Jose Maria Garcia-Garcia et al. [5] proposed a multimodal affect computing method to improve the user experience of an educational software application. Facial expression, key strokes, and speech features were used. A t-test was applied to compare the means of the datasets and to test the null


hypothesis. The test was done using two versions of a system: one with an emotion recognition application and one without. The System Usability Scale (SUS) score was used to determine which system performed better; the one with emotion recognition had a better SUS score, with fewer attempts required (60% less) and less help used. Shahla Nemati et al. [20] proposed a hybrid latent space data fusion technique for emotion recognition. Video, audio, and text features from the DEAP dataset were used, with SVM and Naive Bayes as classifiers. Feature-level fusion and decision-level fusion are employed in this model, using Marginal Fisher Analysis (MFA), cross-modal factor analysis (CFA), and canonical correlation analysis (CCA). In feature-level fusion, the SVM classifier outperforms the Naive Bayes classifier, but in decision-level fusion the outcome depends mainly on the type of classifier. Javier Marín-Morales et al. [16] proposed a method for emotion recognition using brain and heartbeat dynamics, namely electroencephalography (EEG) and heart rate variability (HRV). The data were collected from a total of 60 subjects, and SVM-RFE with LOSO cross-validation was used for emotion recognition. Two predictions were made, one for arousal and one for valence, with features extracted from HRV, EEG band power, and EEG MPS. The arousal dimension attained an accuracy of 75%, and valence an accuracy of 71.21%. Li Ya et al. [14] proposed a multimodal emotion recognition challenge using the audio and video features of the CHEAVD 2.0 dataset, with an SVM classifier for emotion recognition. Two fusion techniques were compared: decision-level fusion, at 35.7% MAP, outperformed feature-level fusion, which reached only 21.7% MAP. The results were also compared with the individual feature predictions.
Audio alone and video alone achieved 39.2% and 21.7%, respectively. Deger Ayata et al. [1] proposed an emotion-based music recommendation system using the GSR (galvanic skin response) and PPG (photoplethysmography) signals of the 32 subjects in the DEAP emotion dataset. Features are extracted from these signals and fused using feature-level fusion, and the resulting feature vector is fed to classifiers to obtain arousal and valence values. KNN, Random Forest, and decision tree methods are used for emotion identification. The arousal and valence accuracies were compared against using only GSR signals, only PPG signals, and the multimodal features: the fused method had better accuracy for both arousal (72.06%) and valence (71.05%). Asim Jan et al. [9] proposed a method for automatic depression-level analysis using audio and visual features. Two methods were compared: the Feature Dynamic History Histogram (FDHH) algorithm, a fusion technique that produces a dynamic feature vector, and the motion history histogram (MHH), which captures visual features that are then fused with the audio data. PLS regression and LR techniques were used to model the correlation between the feature space and the depression scale. In the comparison, FDHH performed better, with lower MAE and RMSE values.


Sandeep Nallan Chakravarthula et al. [3] proposed suicidal risk prediction for military couples based on the couples' conversations. It used acoustic, lexical, and behavioral features of conversations collected from a total of 62 couples (124 people). The model was used to distinguish three scenarios: none, ideation, and attempt. Principal Component Analysis (PCA) was applied to retain only the important features, and a Support Vector Machine was used as the classifier for risk prediction; the recall of the proposed system was 13–20% better than chance. Nathan L. Henderson et al. [6] proposed a method for affect detection in a game-based learning environment. Posture data and Q-sensor electrodermal activity (EDA) data were collected from a total of 119 subjects involved in TC3Sim training. Two fusion techniques were tested: feature-level and decision-level fusion. Based on these data, classifiers like Support Vector Machine and neural networks were used to determine the students' affective states. The results were compared using the Kappa score for EDA data alone, posture data alone, and the multimodal data; classifier performance improved when the EDA data was combined with posture data. Papakostas M et al. [21] proposed a method for understanding and categorizing driving distraction using visual and physiological information. The data was collected from 45 subjects who were exposed to four different distractions (three cognitive and one physical). Both early fusion and late fusion were tested, in a two-class setting (cognitive vs. physical distraction) and a four-class setting (texting, cognitive task, listening to the radio, GPS interaction). The two-class and four-class results were compared for visual features alone, physiological features alone, early fusion, and late fusion.
Classifiers like Random Forest (RF) with about 100 decision trees, a Gradient Boosting classifier, and SVM classifiers with linear and RBF kernels were used to determine driver distraction. In both the two-class and four-class settings, visual features alone performed at only about 15% in F1 score comparatively and thus cannot be used in stand-alone mode. Anand Ramakrishnan et al. [23] proposed a method for automatic classroom observation using audio features and face images of both the teacher and students from the UVA Toddler and MET datasets, in order to determine the Classroom Assessment Scoring System's (CLASS) positive and negative aspects. A ResNet model was used, evaluated with Pearson and Spearman correlations. Using ResNet, the correlation values were 0.55 and 0.63 for the positive and negative aspects; Pearson correlation yielded 0.36 and 0.41, respectively, on the UVA dataset. On the MET dataset, Spearman correlation was compared with Pearson correlation, and the Spearman values were better for both positive and negative aspects (0.48 and 0.53). Dongmin Shin et al. [24] developed an emotion recognition system using EEG and ECG signals that recognizes six feelings: amusement, fear, sadness, joy, anger, and disgust. Noise was removed from the signals to create the data table. A Bayesian network (BN) classifier was used and compared with MLP and SVM. The accuracy of all three classifiers was evaluated for EEG signals alone and for EEG combined with ECG; the multimodal BN achieved the highest accuracy of 98.56%, a 35.78% increase.


5.2 Deep Learning-Based Techniques Michal Muszynski et al. [19] proposed a method for recognizing emotions induced while watching movies. Audiovisual features, lexical features, physiological reactions such as galvanic skin response (GSR), and acceleration (ACC) signals from the LIRIS-ACCEDE dataset were used. To determine emotion from the multimodal signals, LSTM, DBN, and SVR models were compared for arousal and valence on the basis of MSE, Pearson correlation coefficient (CC), and concordance correlation coefficient (CCC) for both unimodal and multimodal settings. LSTM outperformed SVR and DBN with MSE (A 0.260, V 0.070), CC (A 0.251, V 0.266), and CCC (A 0.111, V 0.143), where A stands for arousal and V for valence. Joaquim Comas et al. [4] proposed a method for emotion recognition using facial and physiological signals (ECG, EEG, and GSR) from the AMIGOS dataset. A Convolutional Neural Network (CNN) is used for emotion recognition, and a BMMN (bio multimodal network) estimates the affective state from features extracted with a Bio Auto-encoder (BAE). Three networks were tested: BMMN, which uses the features directly; BMMN-BAE1, which uses only the latent features extracted by the BAE; and BMMN-BAE2, which uses the latent features along with the essential features. The BMMN-BAE2 model outperformed the others with an accuracy of 87.53% for arousal and 65.05% for valence. Jiaxin Ma et al. [15] proposed an emotion recognition system using the EEG and physiological signals of the DEAP dataset. The dataset was evaluated with a deep LSTM network, a residual LSTM network, and a Multimodal Residual LSTM (MM-ResLSTM) network for both arousal and valence. MM-ResLSTM outperformed the other two methods with an accuracy of 92.87% for arousal and 92.30% for valence. The proposed method was also tested against state-of-the-art methods such as SVM, MESAE, KNN, LSTM, BDAE, and DCCA.
Among all these methods, MM-ResLSTM had the best accuracy. Panagiotis Tzirakis et al. [26] proposed a method for emotion recognition using the audio and video features of the RECOLA dataset. The audio and video features were extracted using ResNet and used for emotion recognition with an LSTM network. The proposed model was compared against other state-of-the-art methods, such as the Output-Associative Relevance Vector Machine Staircase Regression (OA RVM-SR) and the strength modeling system proposed by Han et al., for both arousal and valence prediction. The proposed method outperformed all other methods, with an accuracy of 78.9% for arousal and 69.1% for valence using raw audio and video signals, and 78.8% for arousal and 73.2% for valence using raw audio signals together with raw and geometric video signals. Seunghyun Yoon et al. [28] proposed a multimodal speech emotion recognition system using text and audio features from the IEMOCAP dataset to identify four emotions: happy, sad, angry, and neutral. The Multimodal Dual Recurrent Encoder (MDRE), containing two RNNs, is used for the prediction of speech emotions. The proposed model was compared against the Audio Recurrent


Encoder (ARE) and Text Recurrent Encoder (TRE) using the weighted average precision (WAP) score. The MDRE model had a better WAP value of 0.718, with accuracy ranging from 68.8% to 71.8%. Trisha Mittal et al. [17] proposed a multiplicative multimodal emotion recognition (M3ER) system that uses facial, text, and speech features, evaluated on the IEMOCAP and CMU-MOSEI datasets. Deep learning models were used for feature extraction and to remove ineffective signals, and finally an LSTM is used for emotion classification. The results of M3ER were compared with existing SOTA methods using the F1 score and Mean Accuracy (MA). A modality check on ineffective modalities of the dataset yields an increase of 2–5% in F1 and 4–5% in MA, and a proxy feature regeneration step leads to a further increase of 2–7% in F1 and 5–7% in MA for the M3ER model, which was better than the SOTA models. Siddharth et al. [11] proposed multimodal affective computing using the EEG, ECG, GSR, and frontal videos of the subjects in the AMIGOS dataset. The features are extracted using the CNN-VGG network. An Extreme Learning Machine (ELM) with 10-fold cross-validation and a sigmoid function was used to predict arousal, valence, liking, and dominance on a scale of 1–9. The features were tested for emotion classification individually and as a multimodal combination. Combining EEG and frontal videos gave an accuracy of 52.51%, better than the accuracy obtained for the features individually; combining GSR and ECG gave an accuracy of 38.28%. Wei-Long Zheng et al. [30] proposed a model for emotion recognition using EEG signals and eye movements, with data collected from 44 subjects. A bimodal deep auto-encoder (BDAE) was used to extract the shared features of EEG and eye movements, with two Restricted Boltzmann machines (RBMs), one for EEG and another for eye movements, extracting the features.
Finally, an SVM was used as the classifier for emotion classification. The model was tested for individual features and for the multimodal combination: the multimodal setting had an accuracy of 85.11%, better than EEG signals alone (70.33%) and eye movements alone (67.82%). Shiqing Zhang et al. [29] proposed a method for emotion recognition using audio and visual features, tested on the RML database, the acted eNTERFACE05 database, and the spontaneous BAUM-1s database. A CNN and a 3D-CNN capture the audio and video features, respectively, and their outputs are fed to a DBN with a fusion network to produce the fused features; a linear SVM then performs emotion classification. The model was compared with unimodal features and with different fusion methods (feature level, score level, and FC) on all three datasets. The proposed method with the DBN achieved the highest accuracy on all three datasets: RML (80.36%), eNTERFACE05 (85.97%), and BAUM-1s (54.57%). Eesung Kim et al. [12] proposed a method for emotion recognition using acoustic and lexical features of the IEMOCAP dataset, evaluated using the weighted average recall (WAR) and UAR. A deep neural network (DNN) is used for feature extraction and also as a classifier.


The proposed model was compared with results obtained using only lexical features and with several state-of-the-art methods, such as LLD+MMFCC+BOWLexicon, LLD+BOWCepstral+GSVmean+BOW+eVector, LLD+mLRF, and the Hierarchical Attention Fusion Model. Of all the models, the proposed model had the highest WAR and UAR values of 66.6 and 68.7, respectively. Huang Jian et al. [7] proposed a model for emotion recognition using visual, acoustic, and textual features from the AVEC 2018 dataset. The features are extracted and fused using both feature-level and decision-level fusion, which are compared. An LSTM-RNN is trained with the extracted features to perform emotion classification. The comparison covers the unimodal settings (visual only, audio only, and textual only), but the multimodal features gave better predictions of arousal, valence, and liking. On the German part of the dataset, the proposed multimodal approach performed well, with values of 0.599–0.524 for arousal, 0.721–0.577 for valence, and 0.314–0.060 for liking; on the Hungarian part, textual features performed well. Qureshi et al. [22] proposed a method for estimating the depression level of an individual using acoustic, visual, and textual modalities extracted from the DAIC-WOZ depression dataset. An attention-based fusion network is used, and a deep neural network (DNN) classifies depression on the PHQ-8 score scale. RMSE, MAE, and accuracy were used to evaluate Depression Level Regression (DLR) and Depression Level Classification (DLC). To test multimodality, two models based on single-task representation learning (ST-DLR-CombAtt and ST-DLC-CombAtt) and two based on multitask representation learning (MT-DLR-CombAtt and MT-DLC-CombAtt) were used. The multimodal approach had a better classification accuracy of 60.61%. Luntian Mou et al.
[18] proposed a model for determining the driver stress level using the eye features and vehicle and environmental data. The data were collected from a total of 22 subjects. The stress level was classified into three classes: low, medium, and high. An attention-based CNN-LSTM network is used as a classifier. The proposed model is compared with other state-of-the-art multimodal methods where the handcrafted features are used and also some unimodal method. It was seen that the attention-based CNN-LSTM network outperformed all the other state-of-the-art methods with an accuracy of 95.5%. Panagiotis Tzirakis et al. [26] proposed affect computing using the text, audio, and video features from the SEWA dataset. ML techniques like Concordance Correlation Coefficient (ρc) was used to determine the agreement level between the prediction and also in determination of correlation coefficient with their mean square difference. The SEWA dataset was compared with proposed model for single feature alone, also with different fusion strategies like concatenation, hierarchical attention, self-attention, residual selfattention, and cross-modal self-attention and cross-modal hierarchical self-attention. The two emotions that are tested for are arousal and valence. It was also compared with few state-of-the-art methods. It was seen that the model outperformed for text, visual, and multimodality.

Progress in Multimodal Affective Computing: From Machine Learning to Deep. . .

147

6 Discussion

The advancement of human-computer interaction has led affective computing to move from unimodal to multimodal analysis, since using multiple modalities for affect detection is generally more appropriate than relying on a single feature. Earlier, only still images were used for affective computing; advances in technology have since enabled the use of audio and video for affect detection. From the above study, it can be seen that a number of datasets are available for audio, video, and textual data, but very few datasets are available for biological signals. Most biological signals are obtained by collecting data directly from subjects, based on the experiment that needs to be performed. In multimodal affective computing, fusion techniques play a major role, and a number of fusion techniques have been discussed in the above section. Since more than one modality is used, fusion techniques are applied to combine them; the choice of fusion technique depends on the dataset and the model selected. The most commonly used fusion technique is feature-level fusion. One problem with the publicly available datasets is that they contain only posed or acted expressions, so choosing an appropriate dataset is a challenging task, and in many cases more naturalistic data are used. The features extracted from these data may be numerous; hence, feature reduction is essential. Only nonredundant and relevant data are required for further processing, and reducing features increases the speed of the affect computation algorithm. Also, an appropriate classification algorithm needs to be selected based on the dataset. Initially, a number of machine learning techniques were applied for affective computing, but advances in AI have led to the use of deep learning techniques.
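The two fusion families discussed above can be sketched in a few lines: feature-level fusion concatenates per-modality feature vectors before a single classifier, while decision-level fusion combines per-modality predictions. This is only an illustrative sketch; the vector sizes and class probabilities are made up, not taken from any surveyed study.

```python
import numpy as np

def feature_level_fusion(audio_feat, video_feat, text_feat):
    # Concatenate modality feature vectors into one joint representation,
    # which would then be fed to a single classifier.
    return np.concatenate([audio_feat, video_feat, text_feat])

def decision_level_fusion(scores, weights=None):
    # Average (optionally weighted) per-modality class-probability vectors.
    scores = np.asarray(scores, dtype=float)
    if weights is None:
        weights = np.ones(len(scores)) / len(scores)
    return np.average(scores, axis=0, weights=weights)

audio = np.random.rand(64)   # e.g., acoustic low-level descriptors
video = np.random.rand(128)  # e.g., CNN visual features
text = np.random.rand(32)    # e.g., word-embedding features

fused = feature_level_fusion(audio, video, text)
print(fused.shape)  # (224,)

# Hypothetical per-modality softmax outputs over 3 emotion classes
probs = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.3, 0.1]]
print(decision_level_fusion(probs))
```

Feature-level fusion lets the classifier learn cross-modal interactions, while decision-level fusion keeps each modality's model independent, which is why the choice depends on the dataset and model, as noted above.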
From our literature survey, it is clear that most studies on emotion recognition used modalities such as audio, video, or textual data. Studies that were application oriented, such as stress-level detection, fear-level detection, or work in the education sector, made more use of biological signals. This is because the physiological response of the human body helps determine these kinds of states much better than audio, video, and textual data alone. The physiological responses were collected using sensors. For affect computation, some methods used manually extracted features, whereas other studies used deep learning techniques for feature extraction. As the survey demonstrates, there are various research challenges in multimodal affective computing. One important direction is to focus on application-oriented studies that could be helpful in real-world settings. Manually extracted features and features extracted by deep learning can be compared to determine which gives better results in affect computation. Another aspect of future work is the use of biological signals for affect computation. These biological signals reveal more about a person and can therefore be particularly helpful in the medical field. If biological signals are combined with other modalities, it would be a major advance in many areas of medical research.


In many cases, machine learning- and deep learning-based multimodal affective computing has been used in a general sense for emotion recognition. Very few studies have focused on specific areas such as detecting the level of depression in humans, detecting fear levels, or, more specifically, a particular kind of fear (e.g., aquaphobia). Multimodal affective computing can be used in applications such as the education sector, to determine a student's level of understanding during online learning. It can also be used in the medical field to determine fear and depression levels, which in turn can indicate whether a person is prone to certain medical ailments. It can be used to aid people with autism through technologies that support communication development. Multimodal affective computing also finds application in music players that play songs based on mood, and in determining the emotions of a person watching an advertisement on TV.

7 Conclusion

This chapter presents a brief overview of affective computing and how emotions are recognized. Unimodal and multimodal affective computing are introduced, and the available datasets, their modalities, and the emotions covered by each dataset are examined. In addition, the various features used for affect recognition and the fusion techniques are elaborated. The machine learning and deep learning techniques for affect recognition are explained, along with a discussion of which features were used for affect recognition and against which other techniques each proposed methodology was compared. A few challenges in this research field have also been identified: using real-time datasets for study, investigating more thoroughly before capturing data, better understanding model selection, and extending the research to application-oriented studies.

References

1. Ayata, D., Yaslan, Y., & Kamasak, M. E. (2018). Emotion based music recommendation system using wearable physiological sensors. IEEE Transactions on Consumer Electronics, 64(2), 196–203.
2. Bălan, O., Moise, G., Moldoveanu, A., Leordeanu, M., & Moldoveanu, F. (2020). An investigation of various machine and deep learning techniques applied in automatic fear level detection and acrophobia virtual therapy. Sensors, 20(2), 496.
3. Chakravarthula, S. N., Nasir, M., Tseng, S. Y., Li, H., Park, T. J., Baucom, B., et al. (2020, May). Automatic prediction of suicidal risk in military couples using multimodal interaction cues from couples conversations. In ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6539–6543). IEEE.
4. Comas, J., Aspandi, D., & Binefa, X. (2020, November). End-to-end facial and physiological model for affective computing and applications. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020) (pp. 93–100). IEEE.


5. Garcia-Garcia, J. M., Penichet, V. M., Lozano, M. D., Garrido, J. E., & Law, E. L. C. (2018). Multimodal affective computing to enhance the user experience of educational software applications. Mobile Information Systems, 2018.
6. Henderson, N. L., Rowe, J. P., Mott, B. W., & Lester, J. C. (2019). Sensor-based data fusion for multimodal affect detection in game-based learning environments. In EDM (workshops) (pp. 44–50).
7. Huang, J., Li, Y., Tao, J., Lian, Z., Niu, M., & Yang, M. (2018, October). Multimodal continuous emotion recognition with data augmentation using recurrent neural networks. In Proceedings of the 2018 Audio/Visual Emotion Challenge and Workshop (pp. 57–64).
8. Huang, Y., Yang, J., Liao, P., & Pan, J. (2017). Fusion of facial expressions and EEG for multimodal emotion recognition. Computational Intelligence and Neuroscience, 2017, 1.
9. Jan, A., Meng, H., Gaus, Y. F. B. A., & Zhang, F. (2017). Artificial intelligent system for automatic depression level analysis through visual and vocal expressions. IEEE Transactions on Cognitive and Developmental Systems, 10(3), 668–680.
10. Jang, E. H., Byun, S., Park, M. S., & Sohn, J. H. (2020). Predicting individuals' experienced fear from multimodal physiological responses to a fear-inducing stimulus. Advances in Cognitive Psychology, 16(4), 291.
11. Jung, T. P., & Sejnowski, T. J. (2018, July). Multi-modal approach for affective computing. In 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (pp. 291–294). IEEE.
12. Kim, E., & Shin, J. W. (2019, May). DNN-based emotion recognition based on bottleneck acoustic features and lexical features. In ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6720–6724). IEEE.
13. Kim, S. K., & Kang, H. B. (2018). An analysis of fear of crime using multimodal measurement. Biomedical Signal Processing and Control, 41, 186–197.
14. Li, Y., Tao, J., Schuller, B., Shan, S., Jiang, D., & Jia, J. (2018, May). MEC 2017: Multimodal emotion recognition challenge. In 2018 First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia) (pp. 1–5). IEEE.
15. Ma, J., Tang, H., Zheng, W. L., & Lu, B. L. (2019, October). Emotion recognition using multimodal residual LSTM network. In Proceedings of the 27th ACM International Conference on Multimedia (pp. 176–183).
16. Marín-Morales, J., Higuera-Trujillo, J. L., Greco, A., Guixeres, J., Llinares, C., Scilingo, E. P., et al. (2018). Affective computing in virtual reality: Emotion recognition from brain and heartbeat dynamics using wearable sensors. Scientific Reports, 8(1), 1–15.
17. Mittal, T., Bhattacharya, U., Chandra, R., Bera, A., & Manocha, D. (2020, April). M3er: Multiplicative multimodal emotion recognition using facial, textual, and speech cues. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 02, pp. 1359–1367).
18. Mou, L., Zhou, C., Zhao, P., Nakisa, B., Rastgoo, M. N., Jain, R., & Gao, W. (2021). Driver stress detection via multimodal fusion using attention-based CNN-LSTM. Expert Systems with Applications, 173, 114693.
19. Muszynski, M., Tian, L., Lai, C., Moore, J., Kostoulas, T., Lombardo, P., et al. (2019). Recognizing induced emotions of movie audiences from multimodal information. IEEE Transactions on Affective Computing, 12, 36–52.
20. Nemati, S., Rohani, R., Basiri, M. E., Abdar, M., Yen, N. Y., & Makarenkov, V. (2019). A hybrid latent space data fusion method for multimodal emotion recognition. IEEE Access, 7, 172948–172964.
21. Papakostas, M., Riani, K., Gasiorowski, A. B., Sun, Y., Abouelenien, M., Mihalcea, R., & Burzo, M. (2021, April). Understanding driving distractions: A multimodal analysis on distraction characterization. In 26th International Conference on Intelligent User Interfaces (pp. 377–386).
22. Qureshi, S. A., Saha, S., Hasanuzzaman, M., & Dias, G. (2019). Multitask representation learning for multimodal estimation of depression level. IEEE Intelligent Systems, 34(5), 45–52.


23. Ramakrishnan, A., Zylich, B., Ottmar, E., LoCasale-Crouch, J., & Whitehill, J. (2021). Toward automated classroom observation: Multimodal machine learning to estimate class positive climate and negative climate. IEEE Transactions on Affective Computing.
24. Shin, D., Shin, D., & Shin, D. (2017). Development of emotion recognition interface using complex EEG/ECG bio-signal for interactive contents. Multimedia Tools and Applications, 76(9), 11449–11470.
25. Tzirakis, P., Chen, J., Zafeiriou, S., & Schuller, B. (2021). End-to-end multimodal affect recognition in real-world environments. Information Fusion, 68, 46–53.
26. Tzirakis, P., Trigeorgis, G., Nicolaou, M. A., Schuller, B. W., & Zafeiriou, S. (2017). End-to-end multimodal emotion recognition using deep neural networks. IEEE Journal of Selected Topics in Signal Processing, 11(8), 1301–1309.
27. Wang, C. H., & Lin, H. C. K. (2018). Emotional design tutoring system based on multimodal affective computing techniques. International Journal of Distance Education Technologies (IJDET), 16(1), 103–117.
28. Yoon, S., Byun, S., & Jung, K. (2018, December). Multimodal speech emotion recognition using audio and text. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 112–118). IEEE.
29. Zhang, S., Zhang, S., Huang, T., Gao, W., & Tian, Q. (2017). Learning affective features with a hybrid deep model for audio–visual emotion recognition. IEEE Transactions on Circuits and Systems for Video Technology, 28(10), 3030–3043.
30. Zheng, W. L., Liu, W., Lu, Y., Lu, B. L., & Cichocki, A. (2018). Emotionmeter: A multimodal framework for recognizing human emotions. IEEE Transactions on Cybernetics, 49(3), 1110–1122.

Content-Based Image Retrieval Using Deep Features and Hamming Distance

R. T. Akash Guna and O. K. Sikha

1 Introduction

The rapid evolution of smart devices and social media applications has resulted in large volumes of visual data. This exponential growth of visual content demands models that can effectively index and retrieve relevant information according to the user's requirements. Image retrieval is broadly used in Web services to search for similar images. Text-based queries were used to retrieve images in the early days [6], which required large-scale manual annotation. In this context, content-based image retrieval (CBIR) systems have been a popular research topic since the 1990s. Content-based image retrieval systems use the visual features of an image to retrieve similar images from a database. Most state-of-the-art CBIR models extract low-level feature representations such as color descriptors [1, 2], shape descriptors [3, 4], and texture descriptors [5, 8] for image retrieval. The use of low-level image features makes CBIR a heuristic technique. The major drawback of classical CBIR models is the semantic gap between the feature representation and the user's retrieval concept. Low-level semantics fail to close this gap as the image database grows large; similarities among different classes also increase with the size of the database. Figure 1 further illustrates the shortcomings of classical CBIR models that use low-level image features. In Fig. 1, (a) and (b) share similar texture and color although they belong to two different classes (Africa and beach). Figure 1c, d have different color and texture even though they belong to the same class (mountains). Figure 1e, f show images from the same class (Africa) with different shape, texture, and color.

R. T. Akash Guna · O. K. Sikha () Department of Computer Science and Engineering, Amrita School of Engineering, Coimbatore, India Amrita Vishwa Vidyapeetham, Coimbatore, India © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_7



Fig. 1 Illustration of failure cases of classical features on CBIR

Fig. 2 Illustration of a general CBIR architecture [21]

1.1 Content-Based Image Retrieval: Review

Content-based image retrieval systems use the visual semantics of an image to retrieve similar images from large databases. The basic architecture of a CBIR system is shown in Fig. 2. In general, feature descriptors are used to extract significant features from the image data. Whenever a query image arrives, the same set of features is extracted and compared with the feature vectors stored in the database. Similar images are retrieved based on similarity measures such as Euclidean distance, as depicted in Fig. 2. Deep learning has been widely researched for constructing CBIR systems. Wan Ji et al. [11] in 2014 introduced Convolutional Neural Networks (CNNs) to form feature representations for content-based image retrieval; using CNNs for CBIR achieved greater accuracy than classical CBIR systems. Babenko et al. [23] in 2014 transfer-learned state-of-the-art CNN models trained for ImageNet classification for feature representation. Transfer learning significantly improved the accuracy of state-of-the-art CBIR systems, since the base model is heavily trained on a large volume of image data. Lin Kevin et al. [22] generated binary hash codes using CNNs for fast image retrieval; this method was highly scalable to increases in dataset size. Putzu [25] in 2020 introduced a CBIR system using a relevance feedback mechanism in which users give feedback on misclassifications in the retrieval results. Based on this user-level feedback, the CBIR model alters its parameters or similarity measures to improve accuracy. The major drawback of relevance feedback-based CBIR models is that their accuracy depends purely on the feedback provided by the user; if the user fails to give proper feedback, the system may fail. Some common applications of CBIR are discussed in [32–36]. The primary objective of this chapter is to investigate the effectiveness of high-level semantic features computed by deep learning models for image retrieval. The major contributions of this chapter are as follows:
• Transfer-learned deep features are used as high-level image representations for CBIR.
• The applicability of Hamming distance as a distance metric for deep feature vectors is explored.
• Clustering the dataset prior to retrieval, for faster search, is investigated.
The organization of the chapter is as follows. Section 2 describes the background of CNNs. The proposed model is detailed in Sect. 3. Sections 4 and 5 detail the dataset used for experimentation and the results obtained, respectively. Finally, the chapter concludes with Sect. 6.
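The basic CBIR loop of Sect. 1.1 (offline feature extraction and indexing, then query-time comparison against the stored vectors) can be sketched as follows. The grayscale-histogram descriptor here is only an illustrative stand-in for the color, texture, or deep descriptors discussed above, and the random images are synthetic:

```python
import numpy as np

def extract_features(image):
    # Stand-in descriptor: a normalized 16-bin grayscale histogram.
    # A real CBIR system would use color/shape/texture or CNN features.
    hist, _ = np.histogram(image, bins=16, range=(0, 256))
    return hist / max(hist.sum(), 1)

def retrieve(query_image, database, k=3):
    # Compare the query's feature vector against the stored vectors and
    # return the indices of the k most similar images (Euclidean distance).
    q = extract_features(query_image)
    dists = [np.linalg.norm(q - feat) for feat in database]
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(32, 32)) for _ in range(10)]
database = [extract_features(img) for img in images]  # offline indexing

top = retrieve(images[4], database, k=3)
print(top)  # the query image itself (index 4) ranks first, at distance 0
```

Swapping `extract_features` for a deep-network forward pass turns this sketch into the deep-feature pipeline the chapter proposes.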

2 Background: Basics of CNN

Convolutional Neural Networks (CNNs) are deep learning networks that detect visual patterns present in input images. Fukushima introduced the concept in 1980 under the name "Neocognitron" [7], since it resembled the working of cells in the visual cortex. Figure 3 shows the basic architecture of a simple CNN. Deep neural networks are capable of computing high-level features that can distinguish objects more precisely than classical features. Since a CNN "doesn't need a teacher," it automatically finds features suitable for distinguishing and retrieving images. A basic CNN can have six layers, as shown in Fig. 3: input layer, convolutional layer, ReLU layer, pooling layer, dense layer, and output classification layer.
1. Input layer: Holds the raw input image data for processing.
2. Convolution layer: Extracts features from the input image by convolving it with filters of various sizes. Hyperparameters such as stride (the number of pixels a kernel/filter skips) can be tuned for better accuracy.


Fig. 3 Simple Convolutional Neural Network [10] used for classification

3. ReLU layer: The Rectified Linear Unit layer acts as a thresholding layer that sets any value less than zero to 0.
4. Pooling layer: Reduces the feature map dimensions to lessen the possibility of overfitting.
5. Dense layer: The feature map obtained from the pooling layer is flattened into a vector, which is then fed into the final classification layer.
6. Output/classification layer: Predicts the final class of the input image. Here, the number of neurons equals the number of classes.
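The ReLU thresholding and max pooling operations (layers 3 and 4 above) can be shown directly in NumPy; the 4x4 feature map below is an arbitrary illustration:

```python
import numpy as np

def relu(x):
    # Thresholds every value below zero to 0.
    return np.maximum(x, 0)

def max_pool_2x2(fmap):
    # Downsamples a feature map by taking the max of each 2x2 block,
    # halving both spatial dimensions.
    h, w = fmap.shape
    return fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[ 1., -2.,  3.,  0.],
                 [-1.,  5., -3.,  2.],
                 [ 0., -4.,  1.,  1.],
                 [ 2.,  3., -1., -2.]])

activated = relu(fmap)          # negatives become 0
pooled = max_pool_2x2(activated)
print(pooled)  # 2x2 map: [[5., 3.], [3., 1.]]
```

Chaining convolution, ReLU, and pooling in this way is exactly what shrinks the spatial size while keeping the strongest activations.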

3 Proposed Model

This section describes the proposed CNN-based model for high-level feature extraction and image retrieval in detail. The CNN model is described first, followed by the methodology for extracting feature vectors from it, and then the techniques used to retrieve similar images. In this work, a reduced InceptionV3 network [9] is chosen as the feature extractor. The generalized architecture of the proposed CBIR framework is shown in Fig. 4, and the steps followed are described below:
1. The high-level feature representations of the database images are computed by passing them through the pretrained CNN (Inception V3) model.
2. The features are then clustered into N clusters using an adaptive K-means clustering algorithm.
3. For a query image, the feature vector is computed by passing it through the pretrained CNN model.
4. The cluster least distant from the query feature vector is found.
5. The least distant images are retrieved from that cluster using similarity measures.


Fig. 4 Architecture of the proposed model

3.1 Transfer Learning Using Pretrained Weights

The proposed model uses a pretrained inception network to extract high-level feature representations of the candidate images. The model is transfer-learned with weights pretrained for ImageNet [24] classification. Transfer learning [14] is a well-known technique through which a model pretrained on a similar task can be used to train another model. The transferred weights improve the quality of the features captured in a short span of time. The model is initially trained like a classification model; thus, adding the transferred weights gave a surge in the results produced. The proposed model consists of a chain of ten inception blocks. The input tensor to an inception block is convolved along four different paths, and the outputs of those paths are concatenated together. This model is chosen for its capability to derive different features from a single input tensor. Within an inception block, path 1 consists of three convolutional blocks, path 2 of two convolutional blocks, path 3 of an average pooling layer followed by a convolutional block, and path 4 of a single convolutional block. A convolutional block stacks a convolutional layer, a batch normalization layer [12], and a ReLU [13] activation layer, in that order. The architecture of an inception block is visualized in Fig. 5. Following the inception blocks are a global max pooling layer and three deep layers. The dimension of the final deep layer equals the number of classes present in the image database used for retrieval. The final layer uses a SoftMax activation function that normalizes its values to the range 0–1.

3.2 Feature Vector Extraction

The deep layers of the inception network were explored as feature descriptors. The feature vectors obtained from deep layers 1–3 are denoted DF1, DF2, and DF3, respectively. Figure 6 depicts the feature extraction from the deep layers of the inception network. Ji Wan compared the effectiveness of feature vectors extracted from the deep layers (1–3) of an inception network in [11]; that study concludes that the penultimate layer (DF2) produces better results than DF1 and DF3. Inspired by this work, the feature vector extracted from DF2 is used here. Figure 8 shows the intermediate results obtained when calculating deep features from the intermediate layers of the network for all classes of Wang's dataset. The extracted feature vectors are then stored as CSV files.

3.3 Clustering

The feature vectors obtained from the inception model are then fed into a clustering module to perform an initial grouping. The K-means [15] clustering algorithm is used to cluster the extracted features; the objective of introducing clustering is to reduce the search time for a query image. K-means is an iterative algorithm that tries to separate the feature vectors into K nonoverlapping groups using an expectation-maximization approach. It uses Euclidean distance to measure the distance between a feature vector and the centroid of a cluster, assigns the feature vector to the least distant cluster, and then updates that cluster's centroid to account for the newly added vector. The objective of K-means clustering is given as:

$$J = \sum_{i=1}^{m} \sum_{j=1}^{K} w_{ij} \left\| x_i - \mu_j \right\|^2 \qquad (1)$$

where J is the objective function, w_{ij} = 1 if feature vector x_i belongs to cluster j and w_{ij} = 0 otherwise, and μ_j is the centroid of cluster j.
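A minimal NumPy sketch of the expectation-maximization style K-means procedure and the objective J of Eq. (1); the synthetic feature vectors and the random initialization are illustrative, not part of the chapter's actual pipeline:

```python
import numpy as np

def kmeans(X, K, iters=20, seed=0):
    # Alternate assigning each vector to its nearest centroid (E-step)
    # and recomputing centroids as cluster means (M-step).
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # E-step: nearest centroid per feature vector (Euclidean distance)
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        # M-step: move each centroid to the mean of its assigned vectors
        for j in range(K):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

def objective_J(X, labels, centroids):
    # Eq. (1): sum of squared distances of each vector to its cluster centroid.
    return sum(np.sum((X[labels == j] - centroids[j]) ** 2) for j in range(len(centroids)))

# Two well-separated groups of synthetic 4-D "feature vectors"
X = np.vstack([np.random.default_rng(1).normal(0, 0.1, (20, 4)),
               np.random.default_rng(2).normal(5, 0.1, (20, 4))])
labels, centroids = kmeans(X, K=2)
print(objective_J(X, labels, centroids))
```

At query time, only the images in the cluster whose centroid is nearest to the query feature need to be compared, which is what reduces the search time.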

Fig. 5 Architecture of an inception block present in an Inception V3 Network [9]


Fig. 6 A deep neural network that could be used as feature extractor by using intermediate deep layers


3.4 Retrieval Using Distance Metrics

Relevant images are retrieved from the database by calculating the distance between the query image's feature vector and the feature vectors stored in the database. This work compares two distance metrics, Euclidean distance and Hamming distance, for calculating similarity.

3.4.1 Euclidean Distance

Euclidean distance [16] represents the shortest distance between two points. It is given as the square root of the sum of squared element-wise differences:

$$E.D = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (2)$$

where n is the dimension of the feature vectors, and xi and yi are elements of the feature vectors x and y, respectively.
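Eq. (2) translates directly to code; the sample vectors are illustrative:

```python
import numpy as np

def euclidean_distance(x, y):
    # Eq. (2): square root of the sum of squared element-wise differences.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

print(euclidean_distance([0, 3], [4, 0]))  # 5.0 (classic 3-4-5 triangle)
```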

3.4.2 Hamming Distance

Hamming distance [17] measures the similarity between two feature vectors as the number of positions at which the corresponding elements differ:

$$H.D = \sum_{i=1}^{n} \mathbb{1}(x_i \neq y_i) \qquad (3)$$

where n is the dimension of the feature vectors, x_i and y_i are elements of the feature vectors x and y, respectively, and the indicator $\mathbb{1}(x_i \neq y_i)$ is 1 when the elements differ and 0 when they are equal.
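Eq. (3) as code. Because ReLU activations leave deep feature vectors sparse (many exact zeros), counting unequal positions remains a meaningful comparison even on real-valued features; the sample vectors below are illustrative:

```python
import numpy as np

def hamming_distance(x, y):
    # Eq. (3): number of positions at which corresponding elements differ.
    return int(np.sum(np.asarray(x) != np.asarray(y)))

# Sparse ReLU-style feature vectors (hypothetical values)
a = [0.0, 3.1, 0.0, 0.0, 1.1, 0.0]
b = [0.0, 3.1, 0.0, 2.0, 0.0, 0.0]
print(hamming_distance(a, b))  # 2 (positions 3 and 4 differ)
```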

4 Dataset Used

This section describes the datasets used for experimentation: Wang's dataset, a tailor-made dataset for content-based image analysis, and its larger version, the COREL-10000 dataset. Wang's dataset consists of 1000 images divided into 100 images per class; its classes are African tribe, beach, bus, dinosaur, elephant, flower, food, mountain, Rome, and horses. The COREL dataset comprises 10,000 images downloaded from the COREL photo gallery and is widely used for CBIR applications [18–20]; it contains 100 classes with 100 images in each class. Figure 7 shows sample images from Wang's dataset and the COREL dataset.


Fig. 7 Sample images from Corel dataset and Wang’s dataset

5 Results and Discussions

The model is experimented with Euclidean distance and Hamming distance as similarity metrics on both datasets, retrieving 40, 50, 60, and 70 images from Wang's dataset and the COREL dataset, respectively. Average precision is used as the performance metric for evaluating the proposed CBIR model.

Average Precision: Precision is one of the commonly used measures for evaluating image retrieval algorithms and is defined as:

$$\text{Precision} = \frac{\text{Similar images retrieved}}{\text{Total images retrieved}}$$

Since there is more than one category of images, average precision over all n categories is used:

$$\text{Average precision} = \frac{1}{n} \sum_{k=1}^{n} \text{Precision}[k]$$
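The two measures above can be computed in a few lines; the per-class hit counts here are hypothetical, not results from the chapter:

```python
def precision(similar_retrieved, total_retrieved):
    # Fraction of retrieved images that belong to the query's class.
    return similar_retrieved / total_retrieved

def average_precision(per_class_precisions):
    # Mean of per-category precisions over all n categories.
    return sum(per_class_precisions) / len(per_class_precisions)

# Hypothetical per-class hits when retrieving 40 images for each of 10 classes
hits = [40, 38, 40, 36, 40, 39, 40, 37, 40, 40]
per_class = [precision(h, 40) for h in hits]
print(round(average_precision(per_class), 3))  # 0.975
```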


Fig. 8 The features that were computed by the intermediate convolutional layers of the feature extractor CNN



Number of classes with precision >0.95: Since the proposed model uses deep features that are sparse in nature, it is necessary to know the number of categories with a high precision for final retrieval. Hence, a threshold of 0.95 is set for the retrieval task.

5.1 Retrieval Using Euclidean Distance

5.1.1 Retrieving 40 Images

On retrieving 40 images per class from Wang's dataset using Euclidean distance, an average precision of 0.946 is obtained, with seven classes (out of 10) retrieved at a precision greater than 0.95. When retrieving 40 images from the COREL dataset, the average precision is 0.961, with 92 classes above a precision of 0.95.

5.1.2 Retrieving 50 Images

An average precision of 0.944 is achieved when retrieving 50 images per class from Wang's dataset using Euclidean distance, with seven of the 10 classes retrieved at a precision above 0.95. On the COREL dataset, the average precision is 0.955, with 91 classes above the 0.95 threshold.

5.1.3 Retrieving 60 Images

Retrieving 60 images per class from Wang's dataset yields an average precision of 0.94, with 5 of the 10 classes above a precision of 0.95. Retrieving 60 images from the COREL dataset gives an average precision of 0.952, with 87 of 100 classes above 0.95.

5.1.4 Retrieving 70 Images

When retrieving 70 images per class from Wang's dataset, Euclidean distance achieves an average precision of 0.932, with 5 classes above a precision of 0.95. On the COREL dataset, the average precision is 0.95, with 84 of 100 classes above 0.95.


5.2 Retrieval Using Hamming Distance

5.2.1 Retrieving 40 Images

Retrieving 40 images per class from Wang's dataset yields an average precision of 0.957, with eight classes retrieved at a precision above 0.95. Retrieving the same number of images from the COREL dataset gives an average precision of 0.946, with 90 classes above 0.95.

5.2.2 Retrieving 50 Images

When 50 images per class are retrieved from Wang's dataset using Hamming distance, the average precision is 0.956, with eight classes above a precision of 0.95. On the COREL dataset, the average precision is 0.943, with 90 classes above 0.95.

5.2.3 Retrieving 60 Images

Retrieving 60 images per class from Wang's dataset yields an average precision of 0.942, with eight classes above a precision of 0.95. Retrieving the same number from the COREL dataset gives an average precision of 0.941, with 86 of 100 classes above 0.95.

5.2.4 Retrieving 70 Images

When retrieving 70 images per class from Wang's dataset, Hamming distance achieves an average precision of 0.926, with 7 classes retrieved at a precision greater than 0.95. On the COREL dataset, Hamming distance yields an average precision of 0.94, with 85 classes above the 0.95 threshold. Table 1 summarizes the average precision obtained using Euclidean and Hamming distance on Wang's dataset and the COREL dataset. From the table, it is evident that the deep features obtained from the proposed model are effective for image retrieval. Moving from retrieving 40 images to 70 images from Wang's dataset using Euclidean distance reduces the average precision by 1.4 percentage points, while the number of classes with precision above 0.95 drops from 7 to 5. For the COREL dataset, the same transition reduces the precision by 1.1 points, while the number of classes above the threshold drops from 92 to 84. Hamming distance shows a change of 3.1 points on Wang's dataset, but the number of classes above the threshold only drops from 8 to 7.
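The quoted drops are differences in average precision between retrieving 40 and 70 images, expressed in percentage points. A quick check from the Table 1 values:

```python
# Average precision at 40 and 70 retrieved images (from Table 1)
results = {
    "Wang / Euclidean":  (0.946, 0.932),
    "COREL / Euclidean": (0.961, 0.950),
    "Wang / Hamming":    (0.957, 0.926),
    "COREL / Hamming":   (0.946, 0.940),
}

for setting, (p40, p70) in results.items():
    drop = round((p40 - p70) * 100, 1)  # drop in percentage points
    print(f"{setting}: {drop} point drop")  # 1.4, 1.1, 3.1, 0.6
```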

Content-Based Image Retrieval Using Deep Features and Hamming Distance

Table 1 Average precision obtained for image retrieval using Euclidean distance and Hamming distance for images from Wang's dataset and COREL dataset

                              Number of images retrieved
                              40      50      60      70
Euclidean  Wang's dataset     0.946   0.944   0.944   0.932
           COREL dataset      0.961   0.95    0.95    0.95
Hamming    Wang's dataset     0.957   0.956   0.942   0.926
           COREL dataset      0.946   0.943   0.941   0.95

Fig. 9 Graphical representation of the decrease in average precision from retrieval of 40–70 images (X axis: number of images retrieved; Y axis: percentage of change of average precision; series: COREL/Euclidean, COREL/Hamming, Wang's/Euclidean, Wang's/Hamming)

On the COREL dataset, the change in precision on moving from 40 to 70 images was only 0.6%, while the number of classes above the threshold fell from 90 to 85. The precision for each class is plotted in Fig. 10. From the figure, it is evident that although the precision of Euclidean distance is slightly higher than that of Hamming distance for small retrieval sizes, it falls off faster, which in turn increases the error rate. Figure 9 shows that Hamming distance maintains a higher precision than Euclidean distance as the number of images to be retrieved increases. To further illustrate the effectiveness of the proposed retrieval model, retrieval results obtained from Wang's dataset for 10 images using Euclidean distance and Hamming distance are shown in Fig. 11, and the statistics are tabulated in Table 2.

5.3 Retrieval Analysis Between Euclidean Distance and Hamming Distance

R. T. Akash Guna and O. K. Sikha

Fig. 10 Graphical representation of precision obtained for the COREL dataset (A, B) and for Wang's dataset (C, D). (a) Precision on retrieval of N images using (A) Euclidean distance and (B) Hamming distance from the COREL dataset; the X axis represents the COREL classes and the Y axis the precision. (b) Precision on retrieval of N images using (C) Euclidean distance and (D) Hamming distance from Wang's dataset; the X axis represents the Wang's dataset classes and the Y axis the precision

On the retrieval of 10 images from each class, as shown in Fig. 11A, B, both distance metrics showed a retrieval precision of 100 percent. Among the retrieved images, a few were returned by both metrics in common; some classes had a higher number of such shared images, while other classes showed minimal overlap apart from the input image. The number of identical images retrieved by the two metrics from each class is visualized in Fig. 12. These similarities show that some classes, such as beaches and mountains, have certain internal clusters with unique features that make retrieval more efficient. Although classes like Rome, flowers, dinosaurs, and horses were retrieved with a precision of 100%, the number of identical images retrieved showed that these classes have inseparable images within the


classes. The primary goal of content-based image retrieval is to retrieve the images most similar in content. Looking at the images retrieved using Hamming distance and Euclidean distance, we found certain subtle differences between them, and Hamming distance gave higher precision than Euclidean distance as the number of retrieved images increased, as shown in Fig. 8.
Difference in Horse Class: Consider the retrieval of the horse class as in Fig. 11A–H and Fig. 11B–H. The query image has two horses, a white horse with a brown foal and a brown horse. All of the images retrieved by Hamming distance contained the same horse and foal (refer to Fig. 11A–H), but when retrieved using Euclidean distance, images containing multiple horses and images containing horses of different colors were

Table 2 Retrieval of 10 images using Euclidean distance and Hamming distance

Classes of dataset   Correctly retrieved
                     Euclidean   Hamming
Africa               10          10
Beach                10          10
Mountain             10          10
Bus                  10          10
Horses               10          10
Flowers              10          10
Elephants            10          10
Dinosaurs            10          10
Rome                 10          10
Food                 10          10

also retrieved (refer to Fig. 11B–H). Figure 13 shows the distribution of subclasses of the retrieved horse images.
Difference in Flower Class: The input image provided for retrieval from the flowers class had red petals, as in Fig. 11A–G, with leaves visible in the background. The images retrieved using Hamming distance all had red or reddish-pink petals with visible leaves (refer to Fig. 11A–H). The images retrieved by Euclidean distance showed a wider variety of colors, such as red, reddish-pink, pink, orange, and yellow; leaves were visible in all retrieved images, but not as prominently as in the input image and the images retrieved using Hamming distance (refer to Fig. 11B–H). Figure 14 shows the number of retrieved images belonging to each subcategory of the flower class.
Difference in Dinosaurs Class: The dinosaur class has two major subclusters that differ only by orientation: the dinosaurs in the first cluster face to the left, while the others face to the right. The input image provided to the retrieval system had a dinosaur oriented toward the right; one major characteristic of this dinosaur is its long neck. All the images retrieved by Hamming distance were oriented toward the right, and all the dinosaurs retrieved had long necks (refer to Fig. 11A–G). A handful of images retrieved by Euclidean distance were either oriented toward the left or had a shorter neck (refer to Fig. 11B–G). Figure 15 shows the statistics of the number of images from each subclass of the dinosaurs category.
Difference in Rome Class: The input image for the Rome class contained an image of the Colosseum. Hamming distance was able to retrieve only one image of the Colosseum out of the 10 images retrieved from the Rome category (refer to Fig. 11A–I), whereas Euclidean distance retrieved more images of the Colosseum from the Rome class (refer to Fig. 11B–I). The statistics of the number of images containing the Colosseum retrieved by Euclidean and Hamming distance are shown in Fig. 16.
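The per-class overlap counts plotted in Fig. 12 amount to a set intersection of the two metrics' top-k result lists. A minimal sketch, with toy random features standing in for the deep features and the sign binarization for Hamming again assumed for illustration:

```python
import numpy as np

def top_k_ids(features, query_idx, k, metric):
    """Ids of the k database images closest to the query under the given metric."""
    if metric == "euclidean":
        d = np.linalg.norm(features - features[query_idx], axis=1)
    else:  # Hamming on sign-binarised features (illustrative assumption)
        bits = (features > 0).astype(np.uint8)
        d = np.count_nonzero(bits != bits[query_idx], axis=1)
    d = d.astype(float)
    d[query_idx] = np.inf                     # exclude the query itself
    return set(np.argsort(d)[:k].tolist())

rng = np.random.default_rng(1)
feats = rng.normal(size=(100, 32))            # toy stand-in for deep features

euc = top_k_ids(feats, 0, 10, "euclidean")
ham = top_k_ids(feats, 0, 10, "hamming")
common = euc & ham                            # images retrieved by both metrics
print(len(common))
```

Running this per query image and per class yields exactly the bar heights of Fig. 12: a high overlap signals a tight internal cluster around the query, while a low overlap signals that the two metrics wander into different subclusters.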


Fig. 11 Results on retrieving 10 images using a sample image from each category of Wang's dataset. Results for each class contain the input image and 10 retrieved images. The classes are represented in the order: (A) Africa, (B) beach, (C) mountains, (D) bus, (E) dinosaurs, (F) elephants, (G) flowers, (H) horses, (I) Rome, (J) food



Fig. 12 Same images retrieved by both distance metrics (chart title: Number of Same Images in Different Class; X axis: classes of Wang's dataset (Africa, Beach, Mountains, Bus, Dinosaurs, Elephant, Flowers, Horses, Rome, Food); Y axis: number of same images)


Fig. 13 Comparison of interclass clusters of horses (chart title: Number of images per Sub Group of Horses; subgroups: Similar, Multiple Horse, Wrong Colour; Y axis: number of images; series: Hamming, Euclidean)

Fig. 14 Comparison of interclass clusters of flowers (chart title: Number of images per Sub Group of Flowers; subgroups: Red, Reddish-Pink, Pink, Orange, Yellow; Y axis: number of images; series: Hamming, Euclidean)

5.4 Comparison with State-of-the-Art Models

To further evaluate the effectiveness of deep features for image retrieval, the obtained results are compared against state-of-the-art CBIR models with classical features reported in the literature. The CBIR models proposed by Lin et al. [27], Irtaza et al. [26], Wang et al. [28], and Walia et al. [29, 30] are considered for comparison. The CNN-based CBIR system proposed by Hamreras et al. [31] is also compared with the proposed model. Table 3 compares the proposed deep feature-based CBIR model with state-of-the-art classical feature-based models in terms of precision. From the table, it can be inferred that deep feature-based image retrieval produced good results compared to the other models under consideration across all the classes. Table 4 compares the average precision achieved by the proposed model against the six SOA models under consideration. The dinosaurs class was retrieved with an average precision of 99.1, which was the maximum average precision achieved among the

Fig. 15 Comparison of interclass clusters of dinosaurs (chart title: Number of images per Sub Group of Dinosaurs; subgroups: Right+Tall, Left+Tall, Right+Short, Left+Short; Y axis: number of images; series: Hamming, Euclidean)

Fig. 16 Clusters in Rome class (chart title: Number of images per Sub Group of Rome; subgroups: With Colosseum, Without Colosseum; Y axis: number of images; series: Hamming, Euclidean)

SOA models. While the horses, flowers, and bus classes have an average precision of 81.0, 84.95, and 73.85, the Rome, food, elephants, beach, and mountain classes have an average precision of less than 60%. When compared to the SOA models, the proposed model produces a greater average precision for all 10 classes of Wang's dataset (Africa, beach, bus, dinosaurs, elephants, horses, food, mountain, flowers, Rome). Table 5 compares the recall results of the state-of-the-art CBIR models with the proposed model. From the table, it is evident that the proposed model outperforms all of the state-of-the-art models, giving perfect results on retrieving 20 images from each class of Wang's dataset. Table 6 presents the average recall achieved for each class by the SOA models. The dinosaurs class was retrieved with an average recall of 19.82, which was the maximum average recall achieved among the SOA models. While the horses, flowers, and bus classes have an average recall of

Wang’s database African Beach Rome Bus Dinosaurs Elephants Flowers Horse Mountain Food Average

Lin et al. [27] 55.5 66 53.5 84 98.25 63.75 88.5 87.25 48.75 68.75 71.425

Irtaza et al. [26] 53 46 59 73 99.75 51 76.75 70.25 62.5 70.75 66.2

Wang et al. [28] 80.5 56 48 70.5 100 53.75 93 89 52 62.25 67.2

Walia et al. [29] 41.25 71 46.75 59.25 99.5 62 80.5 68.75 69 29.25 60.91

Walia et al. [30] 73 39.25 46.25 82.5 98 59.25 86 89.75 41.75 53.45 66.92

Hamreras et al. [31] 93.33 90 96.67 100 100 100 96.67 100 83.83 96.83 95.73

Table 3 Precision comparison of the proposed CBIR model with state-of-the-art models on retrieving 20 images Proposed model 100 100 100 100 100 100 100 100 100 100 100

174 R. T. Akash Guna and O. K. Sikha

Table 4 Comparison of average precision of the proposed model against six state-of-the-art models across 10 classes of Wang's dataset

Classes of dataset  Average precision (Proposed)  Average precision (SOA)  Difference
Africa              100                           65.97                    34.03
Beach               100                           61.37                    38.03
Mountain            100                           59.64                    40.36
Bus                 100                           78.20                    21.8
Horses              100                           84.1                     15.9
Flowers             100                           86.90                    13.1
Elephants           100                           64.95                    35.05
Dinosaurs           100                           99.25                    0.75
Rome                100                           58.36                    41.64
Food                100                           63.50                    36.5

16.2, 16.99, and 14.77, the Rome, food, elephants, beach, and mountain classes have an average recall of less than 12. When compared to the SOA models, the proposed model produces a greater average recall for all 10 classes of Wang's dataset (Africa, beach, bus, dinosaurs, elephants, horses, food, mountain, flowers, Rome).

6 Conclusion

This chapter proposed a deep learning-based content-based image retrieval model using Hamming distance as the similarity metric. When comparing Euclidean distance and Hamming distance for image retrieval, it was found that Hamming distance produced less change in precision as the number of retrieved images changed, allowing more classes to keep a precision above 0.95. On retrieving 10 images using Euclidean and Hamming distance, it was noticed that some of the retrieved images were identical. Among the differing images retrieved by Euclidean and Hamming distance, we found the presence of interclass clusters in classes such as horses, dinosaurs, Rome, and flowers of Wang's dataset. Hamming distance does a better job of identifying these interclass clusters, since it retrieved more images related to the interclass cluster of the given input image. Compared to the SOA models, the proposed model that uses Hamming distance for retrieval produced a significant increase in average precision and average recall; the classwise precision and recall were also significantly higher, and some classes that received low precision and recall with the SOA models received higher values with the proposed model. We conclude the chapter on the note that Hamming distance performs better than Euclidean distance when the dataset becomes very large and the number of images to be retrieved is high. Hamming distance is also capable of identifying interclass clusters within classes, which is relevant when the number of images in the database is

Wang’s database African Beach Rome Bus Dinosaurs Elephants Flowers Horse Mountain Food Average

Lin et al. [27] 11.1 13.2 10.7 16.8 19.65 12.75 17.7 17.45 9.75 13.75 14.285

Irtaza et al. [26] 10.6 9.2 11.8 14.6 19.95 10.2 15.35 14.05 12.5 14.15 13.24

Wang et al. [28] 16.1 11.2 9.6 14.1 20 10.75 18.6 17.8 10.4 12.45 14.1

Walia et al. [29] 8.25 14.2 9.35 11.85 19.9 12.4 16.1 13.75 13.8 5.85 12.545

Walia et al. [30] 14.6 7.85 9.25 16.5 19.6 11.85 17.2 17.95 8.35 10.69 13.384

Table 5 Recall comparison of the proposed CBIR model with state-of-the-art models on retrieving 20 images Hamreras et al. [31] 18.6 18.0 19.33 20 20 20 19.33 20 16.6 19.33 19.125

Proposed model 20 20 20 20 20 20 20 20 20 20 20

176 R. T. Akash Guna and O. K. Sikha

Table 6 Comparison of average recall of the proposed model against six state-of-the-art models across 10 classes of Wang's dataset

Classes of dataset  Average recall (Proposed)  Average recall (SOA)  Difference
Africa              20                         13.20                 6.8
Beach               20                         12.275                7.725
Mountain            20                         11.90                 8.1
Bus                 20                         15.64                 4.37
Horses              20                         16.83                 3.62
Flowers             20                         17.38                 2.62
Elephants           20                         12.99                 7.01
Dinosaurs           20                         19.85                 0.15
Rome                20                         11.67                 8.33
Food                20                         12.66                 7.34

large and diverse within classes. This leads to the retrieval of images that are more similar in content.
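The scalability argument above rests on Hamming distance over binary codes being computable with bitwise XOR and popcount, which is far cheaper than floating-point Euclidean distance on large databases. A sketch, again assuming sign binarization of the deep features (an illustrative choice, not the chapter's stated scheme):

```python
import numpy as np

def pack_signs(features):
    """Binarise deep features by sign and pack 8 bits per byte."""
    return np.packbits((features > 0).astype(np.uint8), axis=1)

# Per-byte popcount lookup table (bytes 0..255).
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint16)

def hamming_to_query(codes, query_code):
    """Hamming distance of every packed code to the query via XOR + popcount."""
    return POPCOUNT[np.bitwise_xor(codes, query_code)].sum(axis=1)

rng = np.random.default_rng(2)
feats = rng.normal(size=(1000, 64))   # toy database of 64-d deep features
codes = pack_signs(feats)             # 1000 codes of 8 bytes each
d = hamming_to_query(codes, codes[0])
print(int(d[0]), int(d.max()))
```

Each 64-dimensional float vector collapses to 8 bytes, so both the memory footprint and the per-comparison cost shrink by an order of magnitude, which is why the Hamming variant scales to large, diverse databases.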

7 Future Works

Future work will evaluate the effectiveness of the proposed model on microscopic images and explore preprocessing techniques to enhance the proposed model when applied to medical datasets.

References

1. Pass, G., & Zabih, R. (1996). Histogram refinement for content-based image retrieval. In Proceedings third IEEE workshop on applications of computer vision (WACV'96) (pp. 96–102). https://doi.org/10.1109/ACV.1996.572008
2. Konstantinidis, K., Gasteratos, A., & Andreadis, I. (2005). Image retrieval based on fuzzy color histogram processing. Optics Communications, 248(4–6), 375–386.
3. Jain, A. K., & Vailaya, A. (1996). Image retrieval using color and shape. Pattern Recognition, 29(8), 1233–1244.
4. Folkers, A., & Samet, H. (2002). Content-based image retrieval using Fourier descriptors on a logo database. In Object recognition supported by user interaction for service robots (Vol. 3). IEEE.
5. Manjunath, B. S., & Ma, W. Y. (1996). Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), 837–842. https://doi.org/10.1109/34.531803
6. Hörster, E., Lienhart, R., & Slaney, M. (2007). Image retrieval on large-scale image databases. In Proceedings of the 6th ACM international conference on image and video retrieval.
7. Fukushima, K., & Miyake, S. (1982). Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and cooperation in neural nets (pp. 267–285). Springer.


8. Haralick, R. M., Shanmugam, K., & Dinstein, I. H. (1973). Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, 6, 610–621.
9. Szegedy, C., et al. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition.
10. LeCun, Y., et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
11. Wan, J., et al. (2014). Deep learning for content-based image retrieval: A comprehensive study. In Proceedings of the 22nd ACM international conference on multimedia.
12. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
13. Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In ICML.
14. Pan, S. J., & Yang, Q. (2009). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359.
15. Likas, A., Vlassis, N., & Verbeek, J. J. (2003). The global k-means clustering algorithm. Pattern Recognition, 36(2), 451–461.
16. Danielsson, P.-E. (1980). Euclidean distance mapping. Computer Graphics and Image Processing, 14(3), 227–248.
17. Norouzi, M., Fleet, D. J., & Salakhutdinov, R. R. (2012). Hamming distance metric learning. In Advances in neural information processing systems.
18. Wang, J. Z., Li, J., & Wiederhold, G. (2001). SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(9), 947–963.
19. Tao, D., et al. (2006). Direct kernel biased discriminant analysis: A new content-based image retrieval relevance feedback algorithm. IEEE Transactions on Multimedia, 8(4), 716–727.
20. Bian, W., & Tao, D. (2009). Biased discriminant Euclidean embedding for content-based image retrieval. IEEE Transactions on Image Processing, 19(2), 545–554.
21. Haldurai, L., & Vinodhini, V. (2015). Parallel indexing on color and texture feature extraction using R-tree for content based image retrieval. International Journal of Computer Sciences and Engineering, 3, 11–15.
22. Lin, K., et al. (2015). Deep learning of binary hash codes for fast image retrieval. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops.
23. Babenko, A., et al. (2014). Neural codes for image retrieval. In European conference on computer vision. Springer.
24. Chollet, F., et al. (2015). Keras. https://keras.io.
25. Putzu, L., Piras, L., & Giacinto, G. (2020). Convolutional neural networks for relevance feedback in content based image retrieval. Multimedia Tools and Applications, 79(37), 26995–27021.
26. Irtaza, A., Jaffar, M. A., Aleisa, E., & Choi, T.-S. (2014). Embedding neural networks for semantic association in content based image retrieval. Multimedia Tools and Applications, 72(2), 1911–1931.
27. Lin, C.-H., Chen, R.-T., & Chan, Y.-K. (2009). A smart content-based image retrieval system based on color and texture feature. Image and Vision Computing, 27(6), 658–665.
28. Wang, X.-Y., Yu, Y.-J., & Yang, H.-Y. (2011). An effective image retrieval scheme using color, texture and shape features. Computer Standards & Interfaces, 33(1), 59–68.
29. Walia, E., & Pal, A. (2014). Fusion framework for effective color image retrieval. Journal of Visual Communication and Image Representation, 25(6), 1335–1348.
30. Walia, E., Vesal, S., & Pal, A. (2014). An effective and fast hybrid framework for color image retrieval. Sensing and Imaging, 15(1), 93.
31. Hamreras, S., et al. (2019). Content based image retrieval by convolutional neural networks. In International work-conference on the interplay between natural and artificial computation. Springer.
32. Sikha, O. K., & Soman, K. P. (2021). Dynamic mode decomposition based salient edge/region features for content based image retrieval. Multimedia Tools and Applications, 80, 15937.


33. Akshaya, B., Sri, S., Sathish, A., Shobika, K., Karthika, R., & Parameswaran, L. (2019). Content-based image retrieval using hybrid feature extraction techniques. In Lecture notes in computational vision and biomechanics (pp. 583–593).
34. Karthika, R., Alias, B., & Parameswaran, L. (2018). Content based image retrieval of remote sensing images using deep learning with distance measures. Journal of Advanced Research in Dynamical and Control Systems, 10(3), 664–674.
35. Divya, M. O., & Vimina, E. R. (2019). Performance analysis of distance metric for content based image retrieval. International Journal of Engineering and Advanced Technology (IJEAT), 8(6), 2249.
36. Byju, A. P., Demir, B., & Bruzzone, L. (2020). A progressive content-based image retrieval in JPEG 2000 compressed remote sensing archives. IEEE Transactions on Geoscience and Remote Sensing, 58, 5739–5751.

Bioinspired CNN Approach for Diagnosing COVID-19 Using Images of Chest X-Ray

P. Manju Bala, S. Usharani, R. Rajmohan, T. Ananth Kumar, and A. Balachandar

1 Introduction

COVID-19, a novel virus, was revealed in December 2019 in Wuhan, China [1]. It is a member of the coronavirus family; however, it is more virulent and hazardous than the other coronaviruses [2]. Due to limited diagnostic facilities, many nations could administer COVID-19 tests to only a small group of participants. Although there have been significant attempts to develop a feasible method for diagnosing COVID-19, a key stumbling block continues to be the health care available in many nations. There is also a pressing need to develop a simple and easy way to identify and diagnose COVID-19. As the number of patients afflicted with this virus grows by the day, physicians are finding it increasingly difficult to complete the clinical diagnosis in the limited time available [3]. One of the most significant areas of study is clinical image processing, which provides identification and prediction models for a number of diseases, including the MERS coronavirus and COVID-19, among many others. Imaging techniques have increasingly gained prominence and effort; as a result, interpreting these images requires knowledge and numerous methods to improve, simplify, and provide proper treatment [4]. Numerous efforts have been made to use computer vision and artificial intelligence techniques to establish an efficient and quick way to detect infected patients early on. For example, digital image processing with a supervised learning technique has been developed for COVID-19 identification from fundamental genetic fingerprints used for quick virus categorization [5]. Using a deep learning method, a fully automatic framework was created to diagnose COVID-19 from chest X-rays [6]. The information was acquired from clinical sites in order to effectively diagnose COVID-19 and

P. M. Bala · S. Usharani · R. Rajmohan · T. A. Kumar () · A. Balachandar Department of Computer Science and Engineering, IFET College of Engineering, Villupuram, Tamilnadu, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_8



distinguish it from influenza as well as other respiratory illnesses. For diagnosing COVID-19 in chest X-ray images, a combined deep neural network architecture is suggested in [7]. First, the contrast of the chest X-ray image was improved and the background noise was minimized; the training parameters from two distinct pretrained deep neural networks were then combined and utilized to identify and distinguish between normal and COVID-19-affected individuals. A combined heterogeneous deep learning system trained on chest X-ray images was developed for pulmonary COVID-19 analysis [8]. A comprehensive analysis of multiple methods for automatic COVID-19 detection from CXR, employing CNN, SVM, Naive Bayes, KNN, and Decision Tree classifiers as well as different deep learning architectures, has been presented [9]. In an innovative method to aid in the detection of COVID-19, the multicriteria judgment (MCJ) technique was combined with an optimization technique for comparison and rating purposes, variance was employed to generate the factor values, and an SVM classifier was used for COVID-19 detection [10]. Artificial intelligence and the IoT [11] have been used to create a model for diagnosing COVID-19 cases in intelligent health care. Deep image segmentation, fine-tuning of pretrained deep neural networks (DNN), and edge deployment of a custom DNN-based COVID-19 classifier for chest X-ray images were suggested in [12]. Using chest X-ray images, [13] offered alternative designs of autonomous deep neural networks for classifying COVID-19 cases against ordinary patients; ResNet showed the best performance, with a precision of 98.35 percent. The Stacked RNN model [14] was suggested for the identification of COVID-19 patients using chest X-ray images; to mitigate the lack of training samples, this approach employs a variety of pretrained methods.
Based on the literature related to this research, we may conclude that precision and optimal timing continue to be a significant issue for physicians in minimizing human suffering. Traditional artificial intelligence (AI) algorithms have encountered various issues when utilized for X-ray-based pulmonary COVID-19 detection, including getting trapped in the state space, noise susceptibility, and ambiguity. The dimensionality constraint is the most essential and challenging. When it comes to feature selection, there are generally two methods: (1) the filtering technique, which assigns ratings to each feature based on statistical parameters, and (2) the induction approach, which is based on a metaheuristic search over all potential groups of features [15]. Bioinspired Particle Swarm Algorithm (PSA) methods are important optimization techniques to increase and enhance the efficiency of feature selection [16]. The Chinese ministry declared that COVID-19 identification, as a critical indicator for reverse transcription testing or hospitalization, should be validated by genetic analysis of lung or blood samples. Because of the present public health crisis, the primary reliance on real-time polymerase chain reaction makes it difficult to identify and treat many COVID-19 cases. Furthermore, the disease is extremely infectious, so a larger population is at danger of sickness. Rather than waiting for positive viral testing, the diagnosis now encompasses all patients who demonstrate the common COVID-19 lung infection characteristics. This method allows officials to isolate and treat patients more quickly. Even when death does not occur with COVID-19, many patients recovered with lifelong


lung loss. As per the WHO (World Health Organization), COVID-19 also causes pores in the chest, similar to MERS, giving them a "hexagonal appearance." One of the ways of monitoring pneumonia is digital chest imaging. Machine learning (ML)-based image analysis techniques for the recognition, measurement, and monitoring of MERS-CoV (Middle East respiratory syndrome coronavirus) were created to discriminate among individuals with coronavirus and those without. A deep learning method to autonomously segment all lung and disease locations using chest radiography was developed. To create an early model for detecting COVID-19, influenza, and bacterial pneumonia in healthy cases, image data and in-depth training methodologies are used. In one investigation, the authors created a deep neural approach based on the radiographic alterations that COVID-19 produces in X-ray images, which can bring out the visual features of COVID-19 prior to pathological tests, thus saving critical time for illness detection. MERS features like pneumonia can be seen on chest X-ray images and computed tomography scans, according to another study. Data mining approaches to discriminate between MERS and common influenza based on X-ray pictures were used in the research. The clinical features of 40 COVID-19 participants were evaluated, indicating that coughing, severe chronic fatigue, and weariness were common initial symptoms; all 40 patients were determined to have influenza-like illness, and their chest X-ray examinations were abnormal. One author team identified the first signs of actual COVID-19 infection at the Hong Kong University [17]. The authors proposed a statistical methodology to predict the actual number of COVID-19 instances discovered during January 2020; they concluded that there were 469 unregistered instances between January 1 and January 14, 2020, and stated that the number of instances had increased.
Using information from 555 Chinese people relocated from Wuhan on the 29th and 31st of January 2020, the authors suggested a COVID-19 disease rate predictive model for China. According to their calculations, the anticipated infection rate is 9.6 percent, with a death rate of 0.2 percent to 0.5 percent. Unfortunately, the number of citizens moved from Wuhan is insufficient to assess illness and death. A mathematical method to identify the probability of infection with COVID-19 has been proposed; furthermore, the authors estimated that the peak would be reached after 2 weeks. The prediction of persistent human transmission of COVID-19 from 48 patients was based on information from Thompson's (2020) research [18]. The researchers created a prototype COVID-19 risk-of-dying calculation; for two different situations, the percentages are 5.0 percent and 8.3 percent, and the reproduction number was calculated to be 2.0 and 3.3, respectively. COVID-19 could cause an outbreak, according to the projections. In national health care, X-ray imaging is utilized to check for fractures, bone dislocations, respiratory problems, influenza, and malignancies. Computed tomography is a type of advanced X-ray that evaluates the fine structure of the functional regions of the body and provides sharper images of soft internal organs and tissues. CT is faster, better, more dependable, and less harmful than X-rays. Deaths can increase if COVID-19 infection is not detected and treated immediately.


To summarize, this article makes the following contributions:
• The CIFAR dataset is used for the normalization procedure.
• A cuckoo-based hash function (CHF) is realized to determine the regions of interest of the COVID-19 X-ray images. In CHF, we represent the intention to move to a destination with a probability less than 1 in order to ensure that the total number of regions to assess remains constant. Additionally, we take an arbitrary number from the image and assign it to a position.
• The training accuracy and validation accuracy are incorporated and tested.
The rest of the paper is organized as follows: Section 2 introduces the background work related to the CNN approach for diagnosing COVID-19 using chest X-ray images. Section 3 outlines the approaches and tools used for diagnosing COVID-19. Section 4 describes the cuckoo-based hash function used to determine the regions of X-ray images. Section 5 discusses the implementation and settings of the model. Finally, Section 6 concludes with the accuracy of the proposed COVID-19 disease classification.
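The chapter does not spell out the CHF update rule at this point, but the cuckoo-search idea it builds on can be sketched generically: candidate region centres take Lévy-flight moves, improvements are kept, and a fraction pa of the worst candidates is abandoned and re-seeded so the population size stays constant. The fitness below is a placeholder, not the chapter's actual region score, and all parameter values are illustrative:

```python
import numpy as np
from math import gamma, sin, pi

rng = np.random.default_rng(3)

def levy_step(shape, beta=1.5):
    """Levy-flight step lengths via Mantegna's algorithm."""
    sigma = (gamma(1 + beta) * sin(pi * beta / 2) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, shape)
    v = rng.normal(0.0, 1.0, shape)
    return u / np.abs(v) ** (1 / beta)

def fitness(pos):
    # Placeholder: peak "interestingness" at (0.7, 0.7). A real CHF would
    # score image evidence (e.g. local contrast of the X-ray) at this position.
    return -np.sum((pos - 0.7) ** 2)

n_nests, dim, pa = 15, 2, 0.25        # candidate regions, (x, y) in [0, 1], abandon rate
nests = rng.random((n_nests, dim))
for _ in range(50):
    # Levy-flight move, accepted only on improvement, so the nest count stays constant.
    trial = np.clip(nests + 0.01 * levy_step((n_nests, dim)), 0.0, 1.0)
    better = np.array([fitness(t) > fitness(s) for t, s in zip(trial, nests)])
    nests[better] = trial[better]
    # Abandon the worst fraction pa and re-seed them at random positions.
    worst = np.argsort([fitness(s) for s in nests])[: int(pa * n_nests)]
    nests[worst] = rng.random((len(worst), dim))

best = nests[np.argmax([fitness(s) for s in nests])]
print(best)
```

Because a move is only accepted when it improves fitness and abandoned nests are immediately re-seeded, the number of regions under assessment never changes, which mirrors the constant-region-count property stated in the contribution above.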

2 Related Work

The use of chest X-ray images has grown commonplace in recent years. A chest X-ray is used to evaluate a patient's respiratory status, including the evolution of the infection and any accident-related wounds. In comparison to CT scan images, chest X-rays have shown encouraging outcomes in the period of COVID-19. Moreover, due to the domain's rapid growth, academics have become less aware of advances across many techniques, and as a result, knowledge of different algorithms is waning. As a consequence, artificial neural networks, particle swarm optimization, the firefly algorithm, and evolutionary computing dominate the research on bioinspired technology. The researchers then investigated and discussed several techniques relevant to the bioinspired field, making it easier to select the best-matching algorithm for each research problem [17]. Big data can be found in practically every industry. Furthermore, the researchers of this work emphasize the significance of using an information-technology approach rather than existing data processing techniques such as textual data and neural networks [19]. A fuzzy logic learning tree approach is utilized in this study to improve image storage and retrieval performance [20]. The researchers' purpose is to provide the concept of image recommendations from friends (IRFF) and a comprehensive methodology for it. The significance of generative adversarial networks and their numerous applications in the domain of background subtraction has been noted by the author of this research; health care, outbreaks, face recognition, traffic management, image translation, image analysis, and 3D image production are some of the uses of GANs that have been discovered [21]. In dealing with radiographic images, state-of-the-art computing and machine learning have investigated a number of choices to make diagnoses. The rapid rise of deep neural networks and their benefits to the

Bioinspired CNN Approach for Diagnosing COVID-19 Using Images of Chest X-Ray


health-care industry has been unstoppable since 1985. Class activation transfer techniques have been paired with DNNs to improve the identification of COVID-19, and deep learning techniques have been operating interactively to assist in the analysis of COVID-19. Aside from the time restrictions, deep neural networks (DNNs) are providing confidence in the analysis of COVID-19 utilizing chest X-ray data, with no negative cases. A DNN's main advantage is that it detects vital properties without the need for human contact. Regardless of the present condition and of COVID-19 confirmation, it is critical to diagnose COVID-19 in a timely manner so that diagnosed COVID-19 patients can be kept free of additional respiratory infection. Image categorization and information extraction play a significant role in the nature of chest X-ray and diagnostic imaging procedures. For autonomous pulmonary classification, a convolutional deep neural network is needed to retrieve significant information and partition the pulmonary region more accurately. In one study, an SRGAN+VGG framework is designed, in which a deep neural network, the Visual Geometry Group network (VGG16), is used to recognize COVID-19 positive and negative results from chest X-ray images, and a deep learning model is used to rebuild those chest X-ray images to excellent quality. A convolutional neural network known as Decompose, Transfer, and Compose (DeTraC) was employed for the categorization of chest X-ray pictures with COVID-19 illness; it investigates the image dataset's class boundaries via a class decomposition mechanism to handle any irregularities. Multiple pretrained models, such as VGG16 and ResNet, have been deployed for the categorization of COVID-19 chest X-ray pictures, distinguishing a normal chest X-ray image from one impacted with influenza using a supervised learning process.
A dense convolutional network was employed in this study to improve COVID-19 classification outcomes using the suggested bioinspired CNN model.

3 Approaches and Tools 3.1 CIFAR Dataset of Chest X-Ray Images COVID-19 was diagnosed based on two sets of chest X-ray images drawn from different sources. Joseph Paul Cohen and colleagues [22] were able to enlarge the COVID-19 chest X-ray data collection by using images from various publicly available sources. Four hundred ninety-five of these images have been identified as COVID-19 positive. Figure 1 shows the distribution of the CIFAR dataset, which contained 950 samples at the time of writing: 53.3 percent of the images show COVID-19 findings, while 46.7 percent show normal, healthy X-ray findings. The standard X-ray images of a healthy chest were contributed by Paul Mooney, who independently assembled them after reading an article in the same journal

P. M. Bala et al.

Fig. 1 Data distribution of images (COVID-19: 54%; Normal/Regular: 46%)
written by Thomas Kermany and his colleagues [23]. This dataset contains a total of 1341 regular, healthy photos, of which roughly one-third were chosen at random. It is essential to keep the number of normal-healthy chest radiographs in proportion, because this prevents learning from unequal datasets: if one class has many samples, it disadvantages classes with few images, limiting the images that can be used. The data fall into two classes: normal or healthy X-rays and COVID-19 X-rays. Considering gender, among the patients who have COVID-19, 346 are male and 175 are female. The results show that 88 of the COVID-19-positive patients are between the ages of 20 and 40, most of them 20- to 30-year-olds. The largest group, 175 patients, were between 41 and 61 years old, and COVID-19 was detected in 172 patients between the ages of 62 and 82. Even medical specialists may find the X-ray images challenging to interpret, and we propose that our approach might be able to help them. PyTorch is utilized for the model's development and implementation. PyTorch is a deep learning library with tensor computation functionality; tensor computations are an essential element in developing deep learning algorithms. It is developed and used by Facebook's AI Research Lab. The explosion of interest in this field among researchers has resulted in the advance of several leading-edge applications, including NLP and computer vision, throughout the entire field of deep learning. One of the most important goals of using PyTorch here is to develop an image classification model that diagnoses COVID-19 from chest X-rays. Image classification models of this kind could be of considerable interest to clinicians, particularly those who utilize X-ray imaging.

3.2 Image Scaling in Preprocessing Two types of views have been chosen to distinguish the lung image scenario and the infection affecting it: posteroanterior (PA) and anteroposterior (AP). In the posteroanterior (PA) view, the X-ray of the patient's chest is taken from the posterior to the anterior of the patient's upper body. For chest X-rays, the term "anteroposterior" refers to an X-ray taken from the patient's anterior


Fig. 2 Normal image

to posterior coverage. Before labeling can begin, the images must first be scaled to accommodate data augmentation, which takes place before the images are labeled. It is necessary to scale the image before preprocessing (Fig. 2) because it will be subjected to several transformations. Compared to the chest X-ray images of regular patients, which are readily available, the COVID-19 data is limited. K-fold cross-validation is one of the methods that can be used to handle skewed datasets, defined as those in which the classes differ significantly in the amount of data they contain. While staging the data throughout the study, images were drawn in proportion to the available data in order to avoid overfitting to the datasets. Two kinds of data transformation are applied: data augmentation and preprocessing. The next step involves loading the dataset with images that are positive for COVID-19 and images that are normal in appearance. Data augmentation is the process of creating novel data from available data using a few simple image manipulation methods; it is also referred to as data synthesis. By including augmentation, the model's generalization is improved, and the risk of overfitting to the training data is reduced. Using image augmentation, additional information can be added to the dataset without overwhelming manual effort. PyTorch's torchvision library can be used to accomplish all of these tasks. In addition to data transformation and handling, torchvision offers predefined, state-of-the-art deep learning models. Image augmentation techniques include image rotation, image shifting, image flipping, and image noising, among other things. The training and validation image transforms are performed with only a small amount of data, resulting in the creation of additional data.
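The augmentation operations just listed (rotation, shifting, flipping, noising) can be made concrete with a small array-level sketch. In practice the chapter's pipeline would use torchvision transforms; the NumPy version below is an illustrative stand-in, and the function name and parameter values are our own, not from the book.

```python
import numpy as np

def augment(image, rng):
    """Return simple augmented variants of a (H, W) grayscale image:
    horizontal flip, 90-degree rotation, a small shift, and additive noise."""
    flipped = np.fliplr(image)                 # image flipping
    rotated = np.rot90(image)                  # image rotation (square image stays same shape)
    shifted = np.roll(image, shift=5, axis=1)  # image shifting (wraps around at the border)
    noised = np.clip(image + rng.normal(0, 10, image.shape), 0, 255)  # image noising
    return [flipped, rotated, shifted, noised]

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(224, 224)).astype(float)
variants = augment(img, rng)  # four new samples derived from one original
```

Each call thus turns one labeled image into several, which is exactly how augmentation enlarges a small COVID-19 set without new annotation effort.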
Preprocessed input images (Fig. 3) are always required for pretrained models. The datasets are first read into a PIL image (Python imaging format), which is then passed through a sequence of transformations. ToTensor converts a PIL image with values in the range [0, 255] and shape (H, W, C) into a floating-point FloatTensor of shape (C, H, W) with values in


Fig. 3 Preprocessed image

the range [0, 1]. The images are then normalized using a mean of 0.5 and a standard deviation of 0.5.

$$\text{Input} = \frac{\text{Input} - \mu}{\text{Standard deviation}} = \frac{\text{Input} - 0.5}{0.5} \tag{1}$$

In this case, μ is equivalent to the mean, 0.5. The number of channels is denoted by C, the height by H, and the width by W; H and W must be at least 224. Normalization values were calculated using the per-channel mean and standard deviation, with the mean being [0.484, 0.455, 0.405] and the standard deviation being [0.228, 0.225, 0.226], respectively. In this case, the CIFAR dataset is used for the normalization procedure. CIFAR is a group of images commonly used to train deep learning algorithms; it is a well-known dataset for image classification and is used by scientists with different machine learning and computer vision algorithms, such as image recognition techniques. In total, the collection has more than 1.2 million images in 10,000 different categories, which can be searched by keyword. Data of this size and complexity must be loaded onto top-of-the-line hardware, since a CPU alone cannot handle such datasets.
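Equation (1) with μ = σ = 0.5, applied after ToTensor-style scaling to [0, 1], maps pixel values into [−1, 1]. A NumPy sketch of the two steps (torchvision's `ToTensor` followed by `Normalize` performs the same arithmetic; the function names here are our own):

```python
import numpy as np

def to_tensor_scale(image_uint8):
    """ToTensor-style scaling: [0, 255] integers -> [0, 1] floats."""
    return image_uint8.astype(np.float32) / 255.0

def normalize(x, mean=0.5, std=0.5):
    """Eq. (1): Input = (Input - mu) / standard deviation."""
    return (x - mean) / std

pixels = np.array([0, 128, 255], dtype=np.uint8)
scaled = to_tensor_scale(pixels)   # black -> 0.0, white -> 1.0
normed = normalize(scaled)         # black -> -1.0, white -> 1.0, mid-gray near 0
```

The mid-gray value landing near zero shows why this choice of mean and standard deviation centers the inputs for training.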


3.3 Training and Validation Steps While training and validating the model, the dataset is divided in an 80/20 ratio to avoid utilizing skewed datasets. For each folder, the images are tagged using the class name of the folder in which they are located, and the DataLoader then loads the labeled images (class names) into memory. This divides the dataset into two distinct classes: one for regular, healthy X-rays and the other for COVID-19 X-rays. In either case, the data is loaded onto CUDA (the graphics processing unit) or the CPU before moving on to model definition. Torchvision is a sub-package that supports deep learning image classification, object detection, image segmentation, etc. For image processing, Torchvision offers pretrained, built-in deep learning models.
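The folder-name labeling and 80/20 split described above can be sketched in plain Python. A real pipeline would use torchvision's `ImageFolder` and `torch.utils.data.random_split`; the file paths and class counts below are hypothetical stand-ins mirroring the 950-sample dataset described earlier.

```python
import random

# Hypothetical file lists standing in for the two class folders:
# the label comes from the folder name, as in torchvision's ImageFolder.
dataset = [(f"covid/img_{i}.png", "COVID-19") for i in range(495)] + \
          [(f"normal/img_{i}.png", "Normal") for i in range(455)]

def train_val_split(samples, train_frac=0.8, seed=42):
    """Shuffle labeled samples and split them 80/20 into train/validation sets."""
    samples = samples[:]                      # copy so the original order is kept
    random.Random(seed).shuffle(samples)
    cut = int(len(samples) * train_frac)
    return samples[:cut], samples[cut:]

train_set, val_set = train_val_split(dataset)  # 760 train, 190 validation
```

Shuffling before the cut keeps both classes represented on each side of the split, which is the point of avoiding a skewed partition.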

3.4 Deep Learning Model The CNN model is a form of neural network that allows us to obtain higher-level representations of image input. Unlike traditional image processing, which requires the user to specify the feature representations, a CNN takes the original captured image, trains the system, and then extracts the characteristics for improved categorization. Deep learning is deeply inspired by the structure of the brain. Signal and image processing technologies such as MRI, CT, and X-ray are widely used when applying deep learning to images, as described in Fig. 4. The CNN model parameters are configured using deep feature extraction techniques. The visual system of the human brain inspired CNNs; the goal of CNNs is to enable computers to see the world in the same way that humans do. Image identification and interpretation, image segmentation, and natural language processing can all benefit from CNNs in this fashion [24]. CNNs feature convolutional, max pooling, and nonlinear activation layers and are a type of deep neural network. The convolutional layer, which is considered a CNN's core layer, performs the "convolution" operation that gives the network its name. In this respect, convolutional neural networks are similar to classic machine learning pipelines. As Fig. 5 describes, convolution layers occupy the odd-numbered layers, while sharing and subsampling layers occupy the even-numbered layers, excluding the input and output layers. As Fig. 6 describes, the CNN has eight different layers linked to sharing layers with a kernel; the batch size is 100, and the model is trained for 100 epochs.
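The stack in Fig. 6 — three 3×3 convolutions with 8, 6, and 4 feature maps, each followed by a stride-2 sharing/pooling layer — can be sketched in PyTorch as follows. The padding, the choice of max pooling for the "sharing" layers, and the final linear classifier are assumptions filled in to make the sketch runnable; the book does not specify them.

```python
import torch
import torch.nn as nn

class SmallCovidCNN(nn.Module):
    """Sketch of the Fig. 6 stack: each conv keeps spatial size (padding=1),
    each stride-2 pool halves it, so 224 -> 112 -> 56 -> 28."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),  # kernel 3, 8 feature maps
            nn.ReLU(),
            nn.MaxPool2d(2, stride=2),                  # sharing, stride = 2
            nn.Conv2d(8, 6, kernel_size=3, padding=1),  # kernel 3, 6 feature maps
            nn.ReLU(),
            nn.MaxPool2d(2, stride=2),
            nn.Conv2d(6, 4, kernel_size=3, padding=1),  # kernel 3, 4 feature maps
            nn.ReLU(),
            nn.MaxPool2d(2, stride=2),
        )
        self.classifier = nn.Linear(4 * 28 * 28, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = SmallCovidCNN()
out = model(torch.randn(1, 3, 224, 224))  # one 224x224 RGB chest X-ray -> 2 class scores
```

The 224-pixel minimum mentioned in Sect. 3.2 is what makes the hard-coded `4 * 28 * 28` flatten size work after three halvings.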


Fig. 4 Classification approach for COVID-19 (preprocessing → applying classifiers and fine-tuning CNN features → training and testing on COVID-19 lung scans → classification and detection of COVID-19)

Fig. 5 Convolution layers of neural network (input layer x1 … xp, hidden layers, output layer y0 … yq)


Fig. 6 CNN architecture with kernel (image as input → convolution, kernel size 3, 8 feature maps, sharing stride = 2 → convolution, kernel size 3, 6 feature maps, sharing stride = 2 → convolution, kernel size 3, 4 feature maps, sharing stride = 2 → COVID-19 image as output)

4 Cuckoo-Based Hash Function The cuckoo algorithm is realized to determine the regions of interest of the COVID-19 X-ray images. Here, the cuckoo-based hash function (CHF) is a contour metaheuristic process that utilizes constant variance as a method of search. Modeled on cuckoos as they look for the strongest nest in which to lay eggs, a method of this kind is investigated to obtain pixel locations. Almost every pixel Pi is a destination


that might be good for applying the information gain and can potentially be selected from the pixel locations that meet the function's criteria. It is assumed that, for the purposes of medical imaging, X-ray pixel intensities Pi are probable locations. We can greatly increase efficiency by placing the initial egg-nesting cuckoos throughout the entire X-ray spatial domain. In CHF, we represent the intention to move to a destination with a probability less than 1 in order to ensure that the total number of regions to assess remains constant. Additionally, we take an arbitrary number from the image and assign it to a position. The pixel selection over the X-ray image based on cuckoo hash functions is modeled as:

$$P_i^{T_i+1} = P_i^{T_i} + \phi \cdot \mathrm{Loc}(\rho, \sigma, \tau) \tag{2}$$

$$\rho = \mathrm{Least}\left(\frac{\phi_{Hu_1}(P_i)+\phi_{Hu_2}(P_i)}{2},\; \frac{\phi_{Sa_1}(P_i)+\phi_{Sa_2}(P_i)}{2},\; \frac{\phi_{Br_1}(P_i)+\phi_{Br_2}(P_i)}{2}\right) \tag{3}$$

$$\sigma = \mathrm{Least}\left(\frac{\omega_{Hu_1}(P_i)+\omega_{Hu_2}(P_i)}{2},\; \frac{\omega_{Sa_1}(P_i)+\omega_{Sa_2}(P_i)}{2},\; \frac{\omega_{Br_1}(P_i)+\omega_{Br_2}(P_i)}{2}\right) \tag{4}$$

$$\tau = \mathrm{Least}\left(\frac{\vartheta_{Hu_1}(P_i)+\vartheta_{Hu_2}(P_i)}{2},\; \frac{\vartheta_{Sa_1}(P_i)+\vartheta_{Sa_2}(P_i)}{2},\; \frac{\vartheta_{Br_1}(P_i)+\vartheta_{Br_2}(P_i)}{2}\right) \tag{5}$$

where $T_i$ stands for the event time period, $P_i^{T_i}$ represents the chosen pixel location, $\phi$ defines the measured normal variance distance, $\mathrm{Loc}(\rho, \sigma, \tau)$ expresses the location of the current pixel in terms of rows and columns, $Hu$ denotes the hue value of the pixel location, $Sa$ denotes the saturation value of the pixel location, and $Br$ denotes the brightness value of the pixel location.
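The update in Eq. (2) moves a candidate pixel by a variance-scaled step and keeps the move only if it improves the objective. The deliberately simplified sketch below illustrates that cuckoo-style pixel selection loop; it is not the authors' exact CHF — the brightness-based acceptance test, the Gaussian step, and all parameter values are our own stand-ins for the hue/saturation/brightness terms defined in Eqs. (3)–(23).

```python
import numpy as np

def chf_select_regions(image, n_nests=5, iters=20, step=2.0, p_move=0.25, seed=0):
    """Toy cuckoo-style search: nests are pixel coordinates, perturbed by small
    random steps with probability p_move (< 1, so the number of regions stays
    constant); a nest is replaced only if the new pixel is brighter, a stand-in
    for the paper's HSB-based objective."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    nests = np.column_stack([rng.integers(0, h, n_nests),
                             rng.integers(0, w, n_nests)])
    for _ in range(iters):
        for i in range(n_nests):
            if rng.random() < p_move:
                cand = nests[i] + rng.normal(0, step, 2).round().astype(int)
                cand = np.clip(cand, [0, 0], [h - 1, w - 1])  # stay inside the image
                if image[tuple(cand)] > image[tuple(nests[i])]:
                    nests[i] = cand                            # greedy acceptance
    return nests

gradient = np.tile(np.arange(64.0), (64, 1))  # brightness increases left to right
nests = chf_select_regions(gradient)
```

On the gradient image the surviving nests drift toward brighter columns, mirroring how the CHF concentrates on high-information regions of the X-ray.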

$$\phi_{Hu_1} = \frac{\mathrm{Mean}(Hu_1(P_i)) - Hu(P_i)}{\mathrm{width}} \tag{6}$$

$$\phi_{Hu_2} = \frac{\mathrm{Mean}(Hu_2(P_i)) - Hu(P_i)}{\mathrm{height}} \tag{7}$$

$$\omega_{Hu_1} = \frac{\mathrm{Mean}(Hu_1(P_i))\,\big(\phi_{Hu_1}(P_i) - Hu(P_i)\big)}{\mathrm{width}} \tag{8}$$

$$\omega_{Hu_2} = \frac{\mathrm{Mean}(Hu_2(P_i))\,\big(\phi_{Hu_2}(P_i) - Hu(P_i)\big)}{\mathrm{height}} \tag{9}$$

$$\vartheta_{Hu_1} = \frac{\mathrm{Mean}(Hu_1(P_i))\,\big(\omega_{Hu_1}(P_i) - Hu(P_i)\big)}{\mathrm{width}} \tag{10}$$

$$\vartheta_{Hu_2} = \frac{\mathrm{Mean}(Hu_2(P_i))\,\big(\omega_{Hu_2}(P_i) - Hu(P_i)\big)}{\mathrm{height}} \tag{11}$$

$$\phi_{Sa_1} = \frac{\mathrm{Mean}(Sa_1(P_i)) - Sa(P_i)}{\mathrm{width}} \tag{12}$$

$$\phi_{Sa_2} = \frac{\mathrm{Mean}(Sa_2(P_i)) - Sa(P_i)}{\mathrm{height}} \tag{13}$$

$$\omega_{Sa_1} = \frac{\mathrm{Mean}(Sa_1(P_i))\,\big(\phi_{Sa_1}(P_i) - Sa(P_i)\big)}{\mathrm{width}} \tag{14}$$

$$\omega_{Sa_2} = \frac{\mathrm{Mean}(Sa_2(P_i))\,\big(\phi_{Sa_2}(P_i) - Sa(P_i)\big)}{\mathrm{height}} \tag{15}$$

$$\vartheta_{Sa_1} = \frac{\mathrm{Mean}(Sa_1(P_i))\,\big(\omega_{Sa_1}(P_i) - Sa(P_i)\big)}{\mathrm{width}} \tag{16}$$

$$\vartheta_{Sa_2} = \frac{\mathrm{Mean}(Sa_2(P_i))\,\big(\omega_{Sa_2}(P_i) - Sa(P_i)\big)}{\mathrm{height}} \tag{17}$$

$$\phi_{Br_1} = \frac{\mathrm{Mean}(Br_1(P_i)) - Br(P_i)}{\mathrm{width}} \tag{18}$$

$$\phi_{Br_2} = \frac{\mathrm{Mean}(Br_2(P_i)) - Br(P_i)}{\mathrm{height}} \tag{19}$$

$$\omega_{Br_1} = \frac{\mathrm{Mean}(Br_1(P_i))\,\big(\phi_{Br_1}(P_i) - Br(P_i)\big)}{\mathrm{width}} \tag{20}$$

$$\omega_{Br_2} = \frac{\mathrm{Mean}(Br_2(P_i))\,\big(\phi_{Br_2}(P_i) - Br(P_i)\big)}{\mathrm{height}} \tag{21}$$

$$\vartheta_{Br_1} = \frac{\mathrm{Mean}(Br_1(P_i))\,\big(\omega_{Br_1}(P_i) - Br(P_i)\big)}{\mathrm{width}} \tag{22}$$

$$\vartheta_{Br_2} = \frac{\mathrm{Mean}(Br_2(P_i))\,\big(\omega_{Br_2}(P_i) - Br(P_i)\big)}{\mathrm{height}} \tag{23}$$

We obtain the X-ray image's vital points using the feature vector for pixels $P_i$, which uses integer positions. The notion of conditional variance is used to enhance the stochastic search, as in the case of the recommended CHF. At each phase, the distance of the normal distribution is set using a unique distribution that is calculated as

$$\mathrm{Loc}(\rho, \sigma, \tau) = \begin{cases} \dfrac{\sigma}{2}\,\tau \displaystyle\lim_{\rho \to \infty} \dfrac{1 + \frac{1}{\rho}}{(\rho - \sigma)^{5/2}}, & \rho > \sigma \\ 0, & \text{otherwise} \end{cases}$$

For multiclass problems (N > 2), N binary SVM classifiers are built [7].

3.3.3 CNN with RF Classifier

Figure 3 represents the hybrid CNN with RF classifier model. The proposed model consists of a CNN for feature extraction from the images, and the extracted features are used by a random forest for classification.

An Effective Diabetic Retinopathy Detection Using Hybrid Convolutional. . .


Fig. 3 Coupled architecture of CNN and random forest classifier

The CNN feature extraction is the same as that explained for the CNN with SVM classifier. The main difference is the coupled classifier, which in this case is the random forest (RF) classifier. A random forest consists of many individual decision trees that operate as an ensemble [12]. Each decision tree predicts the DR class, and the class with the maximum votes becomes the prediction of the model. The CNN model is trained to extract features from the input image, and the fully connected layer of the CNN is replaced by a random forest classifier to classify the image pattern. The output of the dense layer of the CNN produces a feature vector representing the image pattern, consisting of 645 values. The random forest classifier is trained using the image features produced by the CNN model; the trained random forest then performs the classification task and makes decisions on testing images using features extracted by the CNN. In the experiment, the random forest contains 50 individual decision trees, with the other hyperparameters left at their default values.
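The coupling can be sketched with scikit-learn. The synthetic, well-separated Gaussian clusters below are a stand-in for real CNN dense-layer outputs (the chapter's extractor yields a 645-value vector per image); the class means and sample counts are our own assumptions, while the 50-tree forest with default hyperparameters mirrors the configuration stated above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-in for the CNN dense layer: 645-dimensional feature vectors for two
# synthetic, well-separated classes (real features would come from the CNN).
n_per_class, n_features = 100, 645
feats_class0 = rng.normal(0.0, 1.0, (n_per_class, n_features))
feats_class1 = rng.normal(3.0, 1.0, (n_per_class, n_features))
X = np.vstack([feats_class0, feats_class1])
y = np.array([0] * n_per_class + [1] * n_per_class)

# 50 individual decision trees, other hyperparameters left at their defaults.
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X, y)
train_acc = rf.score(X, y)  # majority vote of the 50 trees
```

Swapping the dense layer for `rf` in this way is what lets the ensemble vote replace a softmax output, as described above.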

4 Experimental Results and Analysis The experiment is performed on a high-end server with an NVIDIA Tesla V100 32 GB passive GPU, on a chassis that supports up to 3 double-wide and 6 single-wide GPU cards. The GPU interfaces over 12 Gbps PCIe 3.0 with a 2 GB NV cache supporting RAID levels 1, 5, 6, 10, and 50. The server CPU configuration is 2 × Intel Xeon Gold 6240R (2.4 GHz, 24C/48T, 10.4 GT/s, 35.75 MB cache, Turbo, HT, 165 W) with DDR4-2933 memory.


N. Kumar et al.

Table 2 Training and testing dataset distribution

Class labels   Number of samples in train set   Number of samples in test set
0              1834                             230
1              1222                             154
2              1222                             154
3              917                              115
4              917                              115

The original dataset repository published by Kaggle consisted of over 35,000 fundoscopic images. Because these images were not collected in a controlled laboratory environment, they are of relatively heterogeneous nature: image resolutions range from 2594 × 1944 pixels to 4752 × 3168 pixels, and due to sub-optimal lighting circumstances, the images contain some amount of noise. From the Kaggle dataset, 8407 representative and high-quality images constituting about 8 GB of data were selected to build the dataset used for training and testing the proposed models reported in this chapter. Out of these 8407 images, 6112 images are used for training the model (Table 2). Finally, for testing purposes, almost 10% of the images, i.e., 768 images, are employed. The images are chosen in such a way that each stage in the current reorganized dataset has a reasonably balanced collection. From Table 3, we can observe that the models using the LeakyReLU activation function show a greater improvement when preprocessing is applied. This is mostly due to the loss of information in the ReLU activation function when the output value becomes negative; in the case of LeakyReLU, this negative output value is not discarded and a parametric measure is applied instead. The accuracy metric is adopted to measure and compare the performance of the different CNN-based classifiers. Equation (1) gives the formal definition of accuracy, where $\chi_i = 1$ if the predicted class is the true class for image $i$, and $\chi_i = 0$ otherwise:

$$\text{Accuracy} = \frac{1}{m}\sum_{i=1}^{m} \chi_i \tag{1}$$
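Equation (1) is standard top-1 accuracy; a minimal implementation makes the indicator variable explicit:

```python
def accuracy(predicted, actual):
    """Eq. (1): fraction of images whose predicted class equals the true class."""
    assert len(predicted) == len(actual)
    chi = [1 if p == a else 0 for p, a in zip(predicted, actual)]  # chi_i indicator
    return sum(chi) / len(chi)

# 4 of 5 predictions correct -> accuracy 0.8
acc = accuracy([0, 1, 2, 2, 4], [0, 1, 2, 3, 4])
```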

5 Conclusion and Future Work The test accuracy obtained by the models is in the range of 70–75%, using just 24% of the images available in the DR dataset. Our experimental results indicate the importance of CNN and machine learning techniques for the detection of the different diabetic retinopathy stages. Even on such small-sized training data, the accuracy of the models is reasonable, indicating that there is room for further improvement of the models in the future. The models can be employed with a user-friendly user


Table 3 Accuracy of the proposed classifiers

Models    Activation function   Test accuracy without feature selection   Test accuracy with feature selection
CNN       LeakyReLU             74.48                                     75.01
CNN       ReLU                  74.12                                     74.11
CNN+RF    LeakyReLU             73.74                                     74.01
CNN+RF    ReLU                  73.99                                     73.97
CNN+SVM   LeakyReLU             74.12                                     75
CNN+SVM   ReLU                  72.48                                     73.13
SVM       None                  55.29                                     73.14

Table 4 Comparison of average accuracy metric over recent DR classifiers

Research          Methodology                          Average test accuracy
[11]              Linear kernel SVM                    69.04
[2]               InceptionV3, Multiclass SVM          76.1
[14]              Gaussian mixture model               74
[5]               MLP                                  66.04
[9]               Gradient-weighted class activation   77.11
[8]               ImageNet                             71.06
[15]              VGG16                                51
Proposed method   CNN+SVM                              75

interface for specialists, especially ophthalmologists, to evaluate the severity level of DM by recognizing different DR stages, further supporting proper management and prognosis of diabetic retinopathy. Table 4 describes the average testing accuracy of the various studies performed over the years in the field of DR classification. Clearly, [9] boasts the best performance in terms of average test accuracy, with 77.11%. Since the dataset distributions used in [9] and the proposed method do not overlap completely, it is difficult to make a direct comparison. Nonetheless, the proposed method has performed quite well considering the performance metrics of the recently developed methods. Since the domain of deep learning is constantly evolving, further efforts can be made to improve the performance of the models. The accuracy of the proposed model depends on the quality of the dataset and training, as it employs only machine learning techniques. Computer vision techniques can be employed to detect the different important parts of the retina, such as cotton wool spots, which will help the CNN models select the most important features. Real-life fundoscopic images contain noise, and the proposed model does not have a layer to deal with different kinds of noise. Image preprocessing techniques can also be used to remove noise from the images so that the model can work more efficiently. Last but not least, different machine learning techniques can be combined to build hybrid models as a novel state-of-the-art DR classifier that can provide better performance.


Acknowledgments We would like to express our gratitude toward the Information Technology Department of NITK, Surathkal for its kind cooperation and encouragement that helped us in the completion of this project entitled “An Effective Diabetic Retinopathy Detection using Hybrid Convolutional Neural Network Models.” We would like to thank the department for providing the necessary cluster and GPU technology to implement the project in a preferable environment. We are grateful for the guidance and constant supervision as well as for providing necessary information regarding the project and also for its support in completing the project.

References

1. Bhatia, K., Arora, S., & Tomar, R. (2016). Diagnosis of diabetic retinopathy using machine learning classification algorithm. In 2016 2nd International Conference on Next Generation Computing Technologies (NGCT) (pp. 347–351). https://doi.org/10.1109/NGCT.2016.7877439
2. Boral, Y. S., & Thorat, S. S. (2021). Classification of diabetic retinopathy based on hybrid neural network. In 2021 5th International Conference on Computing Methodologies and Communication (ICCMC) (pp. 1354–1358). https://doi.org/10.1109/ICCMC51019.2021.9418224
3. Carrera, E. V., González, A., & Carrera, R. (2017). Automated detection of diabetic retinopathy using SVM. In 2017 IEEE XXIV International Conference on Electronics, Electrical Engineering and Computing (INTERCON) (pp. 1–4). https://doi.org/10.1109/INTERCON.2017.8079692
4. Cuadros, J., & Bresnick, G. (2009). EyePACS: An adaptable telemedicine system for diabetic retinopathy screening. Journal of Diabetes Science and Technology, 3, 509–516.
5. Harun, N. H., Yusof, Y., Hassan, F., & Embong, Z. (2019). Classification of fundus images for diabetic retinopathy using artificial neural network. In 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT) (pp. 498–501). https://doi.org/10.1109/JEEIT.2019.8717479
6. Herliana, A., Arifin, T., Susanti, S., & Hikmah, A. B. (2018). Feature selection of diabetic retinopathy disease using particle swarm optimization and neural network. In 2018 6th International Conference on Cyber and IT Service Management (CITSM) (pp. 1–4). https://doi.org/10.1109/CITSM.2018.8674295
7. Hsu, C.-W., & Lin, C.-J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2), 415–425. https://doi.org/10.1109/72.991427
8. Jayakumari, C., Lavanya, V., & Sumesh, E. P. (2020). Automated diabetic retinopathy detection and classification using ImageNet convolution neural network using fundus images. In 2020 International Conference on Smart Electronics and Communication (ICOSEC) (pp. 577–582). https://doi.org/10.1109/ICOSEC49089.2020.9215270
9. Jiang, H., Xu, J., Shi, R., Yang, K., Zhang, D., Gao, M., Ma, H., & Qian, W. (2020). A multi-label deep learning model with interpretable Grad-CAM for diabetic retinopathy classification. In 2020 42nd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (pp. 1560–1563). https://doi.org/10.1109/EMBC44109.2020.9175884
10. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems (Vol. 1, pp. 1097–1105). NIPS'12, Red Hook, NY, USA: Curran Associates.
11. Kumar, S., & Kumar, B. (2018). Diabetic retinopathy detection by extracting area and number of microaneurysm from colour fundus image. In 2018 5th International Conference on Signal Processing and Integrated Networks (SPIN) (pp. 359–364). https://doi.org/10.1109/SPIN.2018.8474264


12. Ramani, R. G., Shanthamalar, J. J., & Lakshmi, B. (2017). Automatic diabetic retinopathy detection through ensemble classification techniques automated diabetic retinopathy classification. In 2017 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC) (pp. 1–4). https://doi.org/10.1109/ICCIC.2017.8524342
13. Roy, A., Dutta, D., Bhattacharya, P., & Choudhury, S. (2017). Filter and fuzzy C means based feature extraction and classification of diabetic retinopathy using support vector machines. In 2017 International Conference on Communication and Signal Processing (ICCSP) (pp. 1844–1848). https://doi.org/10.1109/ICCSP.2017.8286715
14. Roychowdhury, S., Koozekanani, D. D., & Parhi, K. K. (2016). Automated detection of neovascularization for proliferative diabetic retinopathy screening. In 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (pp. 1300–1303). https://doi.org/10.1109/EMBC.2016.7590945
15. Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.
16. Sodhu, P. S., & Khatkar, K. (2014). A hybrid approach for diabetic retinopathy analysis. International Journal of Computer Application and Technology, 1(7), 41–48.
17. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1–9). https://doi.org/10.1109/CVPR.2015.7298594

Modified Discrete Differential Evolution with Neighborhood Approach for Grayscale Image Enhancement

Anisha Radhakrishnan and G. Jeyakumar

1 Introduction Evolutionary Algorithms (EAs) are potential optimization tools for a wide range of benchmarking and real-world optimization problems. The most prominent algorithms under EAs are Differential Evolution (DE), Genetic Algorithm (GA), Genetic Programming (GP), Evolutionary Programming (EP), and Evolutionary Strategies (ES). Though the algorithmic structure of these algorithms is similar, their performance varies based on different factors, viz., population representation, variation operations, selection operations, and the nature of the problem to be solved. Among all these algorithms, DE is simpler and is applicable to complex real-valued parameter optimization problems. The differential mutation operation of DE makes it not directly applicable to discrete parameter optimization problems. Considering the unique advantages of DE, extending its applicability to discrete optimization problems is an active area of research. In computer vision, good-contrast images have a vital role in many applications of image processing. Over the past few decades, extensive research has been carried out on metaheuristic approaches for automatic image enhancement. The objective of the study presented in this chapter is to propose an algorithmic change to DE, by adding a new mapping technique, to make it suitable for discrete optimization problems. The performance of DE with the proposed mapping technique was tested with benchmarking travelling salesperson problem (TSP) instances and an image enhancement problem. The algorithmic structure, design of experiments, results, and discussion are presented in this chapter.

A. Radhakrishnan · G. Jeyakumar () Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India e-mail: [email protected]; [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_15



The remainder of this chapter is sectioned as follows: Sect. 2 summarizes the related works, Sect. 3 describes the theory of DE, Sect. 4 describes the proposed approach, Sect. 5 explains Phase I of the experiment, Sect. 6 presents the Phase II of experiment, and finally, Sect. 7 concludes the chapter.

2 Related Works Though DE is a native algorithm for real-valued parameter optimization, it has been extended to discrete-valued parameter optimization with relevant changes in its algorithmic structure. These changes are made either at the population level or at the operator level. Similar such works are highlighted below. A mapping mechanism for converting continuous variables to discrete ones is proposed in [1]; the authors also suggest a way to move the solution faster toward optimality. The MADEB algorithm with a binary mutation operator was proposed in [2] to discretize the DE mutation operation. Interestingly, in [3], an application-specific discrete mapping technique was added to DE; the application attempted in this work was multi-target assignment for multiple unmanned aerial vehicles. Similar to [3], many application-specific changes have been suggested for DE in the literature. For solving an antenna design problem, a binary DE named NBDE was presented in [4]. For a computer vision and image processing problem, a DE-based algorithm to detect circles was introduced in [5]. A new operator named "position swapping" was added to DE in [6] and used for an index tracking mechanism. The forward and backward transformation of converting real values to integer values and vice versa was discussed in [7]. There are research ideas in the literature discussing appropriate population representations for making DE apt for discrete-valued problems. In [8], a discrete representation for the candidates in the population was proposed and tested on a flow shop scheduling problem. A modified DE with a binary mutation rule was introduced in [9] and tested on a few discrete problems. Taking TSP as the benchmarking problem, a set of discrete DE algorithms was introduced in [10]. A novel mutation operation for spatial data analysis, named the linear ordering-based mutation operator, was introduced in [11].
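The forward/backward idea mentioned above for [7] can be illustrated generically: the forward step maps an integer decision variable (e.g., a gray level) into DE's continuous search space, and the backward step rounds a mutated real value back to a feasible integer. The exact transformation in [7] may differ; this is an illustrative sketch with our own function names.

```python
def forward(x_int, lo, hi):
    """Map an integer in [lo, hi] to a real value in [0, 1] for DE's mutation."""
    return (x_int - lo) / (hi - lo)

def backward(x_real, lo, hi):
    """Map a (possibly mutated) real value back to a feasible integer in [lo, hi]."""
    x_real = min(max(x_real, 0.0), 1.0)   # clamp values pushed outside the unit interval
    return lo + round(x_real * (hi - lo))

# Round trip: every integer gray level in [0, 255] survives forward + backward.
ok = all(backward(forward(g, 0, 255), 0, 255) == g for g in range(256))
```

DE's differential mutation then operates freely on the real-valued encoding, with `backward` restoring feasibility before fitness evaluation.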
A set of changes to DE for solving vehicle routing problems was proposed in [12]. An attempt to solve the discrete RFID reader placement problem was made in [13], with an application-specific mapping. The keyframe extraction problem was experimented with in [14], solved by a modified DE algorithm. The work presented in [13] was extended by the authors in [15] to experiment with DE on a discrete multi-objective optimization problem. Image enhancement is a technique by which the information in images becomes more expressible. It transforms an original image into an enhanced image that is visually good and in which an object can be distinguished from the background. The purpose of enhancement is to improve image quality, emphasize certain features, and strengthen the interpretation of an image. It plays a vital role in computer vision and image processing, where it serves as a preprocessing phase in applications such as image analysis and remote

Modified Discrete Differential Evolution with Neighborhood Approach for. . .


sensing. Regions with low contrast appear dark, and high-contrast regions appear nonnaturally illuminated; the outcome of both is a loss of pertinent information. Thus, the optimal enhancement of image contrast that represents the relevant information of the input image is a difficult task [16, 17]. There is no generic approach for image enhancement; approaches are image dependent. Histogram Equalization (HE) and its variants are effectively applied to enhance image contrast and are widely used in several image processing applications [18–20]. The major drawback of this approach is that, for darker and lighter images, it does not produce a quality enhanced image, due to noise and loss of relevant information. In recent years, several bioinspired algorithmic approaches have been used in image contrast enhancement [21]. These algorithms help in searching for the optimal mapping from the gray levels of the input image to new gray levels with enhanced contrast. Automatic image enhancement requires a well-defined evaluation criterion that is valid across a wide range of datasets. An approach of tuning the parameters of a transformation function can be adopted, where the transformation function is evaluated by an objective function. Bioinspired algorithms search for the optimal combination of transformation parameters stochastically. Embedding a population-based approach in image enhancement has gained wide popularity in recent years. This approach helps to explore and exploit such complex problems and to search the solution space for the optimal parameter setting [22, 23]. Plenty of literature indicates the application of metaheuristic algorithms to image contrast enhancement. Pal et al. used a Genetic Algorithm (GA) for automating operator selection for image enhancement [24]. Saitoh used a Genetic Algorithm for modeling the intensity mapping of the transformation function; this approach generated better results with respect to execution time [25]. Braik et al. examined Particle Swarm Optimization (PSO) by increasing the entropy and edge details [26]. Dos Santos Coelho et al. [27] modeled three DE variants by adopting an objective function similar to that proposed in [26]. The advantage of this approach was faster convergence, but it could not provide suitable statistical evidence for the quality of the enhanced image. Shanmugavadivu and Balasubramanian proposed a new method that avoids the mean shift that occurs in the equalization process [28]; this method could preserve the brightness of enhanced images. Hybridization approaches are also found to improve the quality of the enhanced image. Mahapatra et al. proposed a hybridization approach where PSO is combined with the Negative Selection Algorithm (NSA) [29]; this method could preserve the number of edge pixels. Shilpa and Shyam [30] investigated a Modified Differential Evolution (MDE), which could avoid premature convergence. It also enhanced the exploitation capability of DE by adapting the Levy flight from Cuckoo search. In MDE, the mean intensity of the enhanced image is preserved. A comparative study of five traditional image enhancement algorithms (Histogram Equalization, Local Histogram Equalization, Contrast Limited Adaptive Histogram Equalization, Gamma Correction, and Linear Contrast Stretch) was presented in [31]. A study on the effect of image enhancement, using weighted thresholding enhancement techniques, was carried out in [32]. Considering the interesting research attempts at making DE suitable for solving discrete-valued parameter optimization problems, and the importance of


A. Radhakrishnan and G. Jeyakumar

the image enhancement process in computer vision applications, this chapter investigates a modified DE for solving the discrete TSP and an image enhancement problem.

3 Differential Evolution

Differential Evolution (DE) is a probabilistic population-based approach that has gained recognition among other Evolutionary Algorithms (EAs) because of its simplicity and robustness. The algorithm was formulated to solve problems in the continuous domain and was modeled by Storn and Price in 1997 [33]. The self-organizing capability of this algorithm has allowed researchers to extend DE to the discrete domain. The research work presented in this chapter is an effort to improve the exploration and exploitation nature of DE by mapping the genes of the mutant vector appropriately.

3.1 Classical Differential Evolution

The classical DE has two phases: population initialization (the first phase), followed by the evolution phase (the second phase). Mutation and crossover are performed during the evolution phase. Selection of the candidate happens thereafter, replacing candidates in the population and thereby generating the population for the next generation. This is iterated until the termination criterion is met.

(a) Population Initialization – In this phase, the candidate set is generated in a uniformly distributed fashion. The set of candidate solutions is C^g = {C_k^g : k = 1, 2, 3, ..., n}, where g denotes the generation and n denotes the size of the population. Each C_k^g is a d-dimensional vector C_k^g = (c_{1,k}^g, c_{2,k}^g, c_{3,k}^g, ..., c_{d,k}^g) and is generated using a random uniform distribution, as mentioned in Eq. (1):

C_k^g = C_L + (C_U − C_L) · rand(0, 1)    (1)

where C_L and C_U represent the lower and upper bounds of the search space S.

(b) Evolution – In this phase, the mutation operation, which is a crucial step in DE, is performed. Three random vectors are selected to generate the mutant vector: the weighted difference of two of them is added to the base vector. The mutant vector v_k^g for every target vector C_k^g is generated using Eq. (2):

v_k^g = c_{r1}^g + F · (c_{r2}^g − c_{r3}^g)    (2)

where r1, r2, and r3 are random indices into the population with r1 ≠ r2 ≠ r3, and F is the mutation factor, with a value in the range [0, 1]. Once the mutant vector is generated, the crossover operation is performed between the mutant vector U_k^g = (u_{1,k}^g, u_{2,k}^g, u_{3,k}^g, ..., u_{D,k}^g) and the target vector C_k^g = (c_{1,k}^g, c_{2,k}^g, c_{3,k}^g, ..., c_{D,k}^g), with a crossover probability Cr ∈ [0, 1], and a trial vector is generated. Each gene of the trial vector is produced as u_{i,k}^g if rand_k ≤ Cr, and c_{i,k}^g otherwise, where i ∈ {1, 2, 3, ..., D}. Selection is performed after crossover, and the individual with the better fitness value (the trial vector or the target vector) moves to the next generation. These operations (mutation, crossover, and selection) repeat until the termination criterion is met.
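The two phases above can be sketched in Python (the language used later in this chapter's experiments). This is a minimal, generic illustration of DE/rand/1/bin on an assumed sphere objective, not the authors' implementation:

```python
import random

def de_rand_1_bin(objective, d, n=30, F=0.8, Cr=0.9, generations=200,
                  lower=-5.0, upper=5.0):
    """Minimal classical DE (DE/rand/1/bin) minimizing a continuous objective."""
    # Population initialization, Eq. (1): C_k = C_L + (C_U - C_L) * rand(0, 1)
    pop = [[lower + (upper - lower) * random.random() for _ in range(d)]
           for _ in range(n)]
    fitness = [objective(c) for c in pop]
    for _ in range(generations):
        for k in range(n):
            # Mutation, Eq. (2): three mutually distinct random vectors
            r1, r2, r3 = random.sample([i for i in range(n) if i != k], 3)
            mutant = [pop[r1][i] + F * (pop[r2][i] - pop[r3][i])
                      for i in range(d)]
            # Binomial crossover with probability Cr
            jrand = random.randrange(d)   # force at least one mutant gene
            trial = [mutant[i] if (random.random() <= Cr or i == jrand)
                     else pop[k][i] for i in range(d)]
            # Selection: the better of trial and target survives
            f_trial = objective(trial)
            if f_trial <= fitness[k]:
                pop[k], fitness[k] = trial, f_trial
    return min(zip(fitness, pop))

# Usage on an assumed 5-dimensional sphere function
best_f, best_x = de_rand_1_bin(lambda x: sum(v * v for v in x), d=5)
```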

4 Proposed Approach

This section proposes a new mapping approach, Best Neighborhood Differential Evolution (BNDE). DE cannot be directly applied to discrete-valued parameter problems (also called combinatorial optimization problems); an appropriate mapping approach for the mutant vector is required to enhance the exploration and exploitation of the algorithm. In the proposed method, initialization of the population is performed in the same way as in classical DE. In the evolution phase, a few genes of the mutant vector are selected by a probability. The neighbors of those selected genes are replaced in the mutant vector with the gene that has the best optimal value. The genes can come from the best, average, or worst candidates in the population. The proposed approach was investigated on the classical travelling salesman problem. The algorithmic components of the BNDE algorithm used for the experiment are described below. The structure of the algorithm is depicted in Fig. 1.

• Initializing the Population – The population was initialized by random positioning of the city nodes. The Euclidean distance was considered to obtain the fitness of the candidates in the population.
• Fitness Evaluation – The fitness of each candidate is evaluated using the objective function. The candidate with the shortest path has the best fitness, and the one with the longest path has the worst fitness. Based on the fitness, the selection of individuals is performed. For the mapping approach, the best, average, and worst candidates are selected.
• Mutation – The DE variant DE/BEST/RAND2/BIN is considered for the mutation, as given in Eq. (3):

v_k^g = c_{BEST}^g + F · (c_{r4}^g − c_{r3}^g) + F · (c_{r1}^g − c_{r2}^g)    (3)


Fig. 1 Algorithmic structure of enhanced DE

The BNDE mapping approach is then performed.
• Crossover – Crossover is performed and the trial vector is obtained.
• Selection Scheme – Fitness-based selection was considered. The fitness of the trial vector is compared with that of the target vector, and the vector with the better fitness is carried to the next generation.

The enhanced DE in Fig. 1 is described for the travelling salesman problem. Initialization of the population was performed by random positioning of the city nodes. The objective function evaluates the fitness of each candidate in the population. The TSPLIB dataset was considered for the experiment. The candidate with the minimum distance has the better fitness value; the Euclidean distance was used for calculating the distance. The candidates in the population were ranked based on fitness value, and the candidates with the best, average, and worst fitness were considered to perform the BNDE mapping. DE/BEST/RAND2/BIN was considered for finding the mutant vector [33]. The quality of the best gene was considered: the best vector was chosen as the base vector to improve the exploitation of the mutant vector, and the weighted scale of random vectors was considered to improve its exploration. The approach replaces the neighbors of the selected genes in the mutant vector with the best gene from the best, average, or worst candidate. This can improve the quality of the candidate, since the neighbors are mapped to a potential gene, and can converge to a better optimal solution. Crossover was applied to normalize the search. The candidate with the better fitness was selected for the next iteration.


4.1 Best Neighborhood Differential Evolution (BNDE) Mapping

Best Neighborhood Differential Evolution is a mapping approach proposed to enhance the exploration and exploitation of differential evolution in the discrete domain. A probability is generated for every gene in the mutant vector. For a gene (say n) whose probability is less than 0.3, its adjacent neighbors n + 1 and n − 1 are considered. Neighbors are replaced with the best gene for only 30% of the genes, to preserve randomness. For datasets with higher dimensions, mapping 50% of the genes with a gene from a good candidate improves the result, but for datasets with lower dimensions it leads to stagnation and more computational time. Hence, 30% gene replacement is adopted, giving better results in both lower and higher dimensions, and the exploration and exploitation processes remain balanced in searching for better candidates. This method can prevent premature convergence and yields better optimal solutions than other mapping approaches. The neighbors of the selected genes are replaced with a better gene taken from the selected best, average, or worst candidate. This approach yields stability in exploring and exploiting the enhanced differential evolution. The algorithm for BNDE is presented in Fig. 2.

This chapter presents the proposed algorithm in two phases. In Phase I, the proposed algorithm is compared with other existing algorithms, and in Phase II, the performance of the proposed algorithm is validated on an image processing application.

Fig. 2 Algorithm for best neighborhood differential evolution

The design of experiments, results, and discussion for the Phase I and Phase II experiments are presented next.
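Read literally, the neighborhood replacement step can be sketched in Python as follows. This is an illustrative interpretation under assumptions made here (fitness is minimized, and the donor is the fittest among the best, average, and worst candidates, which is the best candidate itself); the authors' exact rule may differ:

```python
import random

def bnde_map(mutant, population, fitness, replace_prob=0.3):
    """Sketch of the BNDE neighborhood mapping (minimization assumed).

    Genes of the mutant vector are selected with probability `replace_prob`;
    the adjacent neighbors (positions n - 1 and n + 1) of a selected gene
    are overwritten with the corresponding genes of a donor candidate.
    Among the best, average, and worst candidates, the fittest is the best
    candidate, so it serves as the donor here (an assumption).
    """
    order = sorted(range(len(population)), key=fitness.__getitem__)
    donor = population[order[0]]              # best-fitness candidate
    mapped = list(mutant)
    d = len(mapped)
    for n in range(d):
        if random.random() < replace_prob:    # ~30% of genes selected
            for j in (n - 1, n + 1):          # adjacent neighbors
                if 0 <= j < d:
                    mapped[j] = donor[j]
    return mapped
```

Every gene of the returned vector is therefore either an untouched mutant gene or a gene copied from the donor at the same position.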

5 Phase I – Performance Comparison

In this phase, the performance of BNDE is compared with that of existing mapping algorithms for solving the travelling salesman problem.

5.1 Design of Experiments – Phase I

The parameter setting of DE was carried out with appropriate values after trial and error. The crucial parameters of DE were set as follows. The population size n was set to 100. The mutation scale factor F was considered in [0.6, 1.5], following [1, 34]: F > 1 solves many problems, and F < 1 shows good convergence. The optimal value of F lies between Fmin = 0.6 and Fmax = 1.5, and it is calculated using Eq. (4) below (as given in [1]):

F = ((Fmin − Fmax) / MaxFit) · cfe + Fmax    (4)

where cfe is the number of times the objective function has been evaluated, and MaxFit is the maximum number of fitness evaluations. The crossover rate (Cr) was set to 0.9, the number of generations g = 2000, and the number of runs r = 30. All the DE mapping approaches were implemented on a computer system with 8 GB RAM and an i7 processor running the Windows 7 operating system, using the Python 3.6 programming language. The performance analysis of BNDE was carried out with the travelling salesman problem (TSP) as the benchmarking problem. TSP is an NP-hard problem. The solution for the TSP is the shortest path of the salesman visiting all the cities (nodes) in the city map, with the constraint of visiting each city only once. Six different instances of TSP from the TSPLIB dataset were considered for the experiment. Each candidate in the population is a possible path for the salesman, and the Euclidean distance was calculated to evaluate the fitness of the path. The objective function defined to measure the distance (D) is given in Eq. (5):

D = D_{1,k} + Σ_{j=1}^{k−1} D_{j,j+1}    (5)

where D_{j,j+1} denotes the distance between node j and its neighbor node j + 1, D_{1,k} closes the tour between the last node k and the first node, and k denotes the total number of nodes in the city map.
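The closed-tour objective of Eq. (5) can be written compactly as:

```python
import math

def tour_length(cities, tour):
    """Total Euclidean length of a closed TSP tour.

    `cities` maps node -> (x, y); `tour` is a permutation of the nodes.
    The distance from the last node back to the first closes the loop.
    """
    dist = lambda a, b: math.hypot(cities[a][0] - cities[b][0],
                                   cities[a][1] - cities[b][1])
    k = len(tour)
    return sum(dist(tour[j], tour[(j + 1) % k]) for j in range(k))

# Toy example: visiting the unit square in order gives a closed tour of 4.0
square = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (0, 1)}
```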


The performance metrics used for the comparative study were the best of the optimal values (Aov) and the error rate (Er). Aov is the best of the optimal values attained by a variant over all its runs, calculated using Eq. (6):

Aov = Best of (ovi)    (6)

where ovi is the optimal value obtained at run i. The error rate (Er) measures how much Aov deviates from the expected optimal solution of the problem and was calculated using Eq. (7):

Er = ((Aov − AS) / AS) · 100    (7)

where Aov is the obtained solution, and AS is the actual solution.
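For a minimization problem such as the TSP, Eqs. (6) and (7) reduce to a minimum and a percentage deviation:

```python
def best_optimal_value(run_results):
    """Eq. (6): best optimal value over all independent runs (minimization)."""
    return min(run_results)

def error_rate(a_ov, actual):
    """Eq. (7): percentage deviation of the attained value from the known optimum."""
    return (a_ov - actual) / actual * 100.0

# e.g. eil51: attained value 863 vs. known optimum 429 gives roughly 101.16 %,
# matching the BNDE entry reported in Table 4
```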

5.2 Results and Discussions – Phase I

The performance of BNDE was compared with existing mapping approaches of DE on the TSP, measured by Aov and Er. Six TSP datasets were considered for the experiment: att48, eil51, berlin52, st70, pr76, and eil76. Six existing mapping techniques were implemented, and their performance was analyzed empirically and statistically. The mapping approaches referred to in this experiment are the Truncation Procedure (TP1) [35], TP rounding to the closest integer number (TP2) [1], TP with only the integer part (TP3) [35], rank based (RB) [36], largest ranked value (LRV) [37], and Best Match Value (BMV) [1]. The algorithms with all the mapping techniques considered in this experiment were applied to solve the TSP datasets with the same DE parameters. The empirical results for the best optimal values are shown in Table 1. Table 2 presents the results obtained for the average optimal values. Table 3 shows the comparison of the results obtained for the worst optimal values. Table 1 presents the comparison of the BNDE approach with the state-of-the-art mapping approaches, with the best optimal value obtained from 30 independent runs. BNDE could outperform TP1, TP2, TP3, RB, and LRV, but not BMV. Similarly,

Table 1 Comparison of BNDE with existing mapping approaches, with best value obtained

Dataset   | Optimal solution | TP1      | TP2     | TP3     | RB      | LRV     | BMV     | BNDE
att48     | 33,523           | 95,309   | 98,565  | 97,298  | 97,309  | 90,293  | 58,679  | 75,031
eil51     | 429              | 1112     | 1118    | 1106    | 1124    | 1078    | 766     | 863
berlin52  | 7544.37          | 20585.53 | 18,768  | 19,670  | 20,166  | 19,977  | 13,292  | 15,758
st70      | 675              | 2550     | 2506    | 2625    | 2559    | 2574    | 1597    | 1902
pr76      | 108,159          | 230,691  | 236,060 | 416,316 | 318,696 | 397,748 | 235,442 | 311,432
eil76     | 545.39           | 1825     | 1809    | 1860    | 1890    | 1865    | 1184    | 1394

Table 2 Comparison of BNDE with existing mapping approaches, with average value obtained

Dataset   | Optimal solution | TP1       | TP2       | TP3       | RB        | LRV       | BMV       | BNDE
att48     | 33,523           | 102364.83 | 104100.56 | 102361.56 | 101795.83 | 102043.30 | 70946.70  | 83238.40
eil51     | 429              | 1157.23   | 1159.233  | 1156      | 1174.20   | 1180.46   | 832.93    | 959.83
berlin52  | 7544.37          | 21,431    | 20308.90  | 21107.33  | 21080.80  | 20925.66  | 14814.80  | 17185.26
st70      | 675              | 2677.47   | 2654.07   | 2727.33   | 2711.43   | 2716.83   | 1876.03   | 2116.43
pr76      | 108,159          | 254327.63 | 252948.50 | 431907.76 | 344088.63 | 415970.40 | 269644.20 | 333313.46
eil76     | 545.39           | 1922.86   | 1920.07   | 1932.63   | 1937.03   | 1935.17   | 1364.03   | 1516.40



Table 3 Comparison of BNDE with existing mapping approaches, with worst value obtained

Dataset   | Optimal solution | TP1     | TP2     | TP3     | RB      | LRV     | BMV     | BNDE
att48     | 33,523           | 106,700 | 98,565  | 107,536 | 106,878 | 105,805 | 82,000  | 91,649
eil51     | 429              | 1216    | 1118    | 1206    | 1218    | 1233    | 898     | 1046
berlin52  | 7544.37          | 19,559  | 21,266  | 21,914  | 21,660  | 21,911  | 16,579  | 18,519
st70      | 675              | 2807    | 2779    | 2810    | 2798    | 2779    | 2155    | 2270
pr76      | 108,159          | 276,195 | 267,409 | 440,280 | 426,851 | 437,337 | 302,095 | 351,145
eil76     | 545.39           | 1998    | 1973    | 1985    | 1982    | 1994    | 1577    | 1613

Table 4 Comparison of error rate with existing mapping approaches, with best value obtained

Dataset   | Optimal solution | TP1 Er | TP2 Er | TP3 Er | RB Er  | LRV Er | BMV Er | BNDE Er
att48     | 33,523           | 184.30 | 194.02 | 190.24 | 190.27 | 169.34 | 75.04  | 123.81
eil51     | 429              | 159.20 | 160.60 | 157.80 | 162.00 | 151.28 | 78.55  | 101.16
berlin52  | 7544.37          | 172.85 | 148.76 | 160.72 | 167.29 | 164.79 | 76.18  | 108.87
st70      | 675              | 277.77 | 271.25 | 288.88 | 279.11 | 281.33 | 136.59 | 181.77
pr76      | 108,159          | 113.28 | 118.25 | 284.91 | 194.65 | 267.74 | 117.68 | 187.93
eil76     | 545.39           | 234.62 | 231.68 | 241.04 | 246.54 | 241.95 | 117.09 | 155.59

for the other two cases, comparing the approaches by average values and worst values, BNDE also failed to outperform BMV. Table 4 presents the error rate (Er) calculated with the best values. The comparison shows that BNDE could outperform TP1, TP2, TP3, RB, and LRV (except for pr76 with TP1 and TP2). Overall, it is observed that BNDE could outperform the existing mapping approaches, except BMV; however, the performance of BNDE was comparable with that of BMV. Further tuning of BNDE to make it better than BMV is taken as a future study of this work. To validate these findings, statistical significance analysis was performed using the two-tailed Wilcoxon Signed Ranks Test for paired samples. The optimal value and error rate were measured for the independent runs of the algorithm. The parameters used for this test were the number of samples n, the test statistic (T), the critical value of T (T-Crit), the level of significance (α), the z-score, the p-value, and the significance. The observations are summarized in Table 5. A '+' indicates that the performance difference between BNDE and the counterpart approach is statistically significant, and a '−' indicates that the performance difference is not statistically significant. For the BNDE–TP1, BNDE–TP2, and BNDE–TP3 pairs, the outperformance of BNDE was statistically significant for all the datasets except att48. For the BNDE–RB and BNDE–LRV pairs, the outperformance of BNDE was statistically significant for all the datasets. Interestingly, for the BNDE–BMV pair, though BMV empirically outperformed BNDE, the performance differences were not statistically significant.

318

A. Radhakrishnan and G. Jeyakumar

Table 5 Statistical analysis of BNDE using the Wilcoxon Signed Ranks Test (T-Crit = 137 for all tests; unrecoverable p-values marked "–")

BNDE vs. TP1
Dataset   | T     | p-value | Significance
att48     | 151   | 0.0545  | No (−)
eil51     | 117.5 | 0.0082  | Yes (+)
berlin52  | 66    | 0.0002  | Yes (+)
st70      | 48    | 0.0002  | Yes (+)
pr76      | 17    | 0.0000  | Yes (+)
eil76     | 11    | 0.0000  | Yes (+)

BNDE vs. TP2
Dataset   | T   | p-value | Significance
att48     | 153 | 0.1020  | No (−)
eil51     | 105 | 0.0087  | Yes (+)
berlin52  | 72  | 0.0010  | Yes (+)
st70      | 56  | 0.0003  | Yes (+)
pr76      | 1   | 0.0000  | Yes (+)
eil76     | 11  | 0.0000  | Yes (+)

BNDE vs. TP3
Dataset   | T   | p-value | Significance
att48     | 139 | 0.09367 | No (−)
eil51     | 104 | 0.018   | Yes (+)
berlin52  | 51  | 0.00061 | Yes (+)
st70      | 51  | –       | Yes (+)
pr76      | 3   | –       | Yes (+)
eil76     | 11  | –       | Yes (+)

BNDE vs. RB
Dataset   | T   | p-value | Significance
att48     | 126 | 0.02848 | Yes (+)
eil51     | 103 | 0.00773 | Yes (+)
berlin52  | 74  | 0.00111 | Yes (+)
st70      | 36  | 0.0000  | Yes (+)
pr76      | 29  | 0.0000  | Yes (+)
eil76     | 11  | 0.0000  | Yes (+)

BNDE vs. LRV
Dataset   | T   | p-value | Significance
att48     | 122 | 0.023   | Yes (+)
eil51     | 91  | 0.0036  | Yes (+)
berlin52  | 64  | 0.00052 | Yes (+)
st70      | 55  | 0.00026 | Yes (+)
pr76      | 18  | 0.0000  | Yes (+)
eil76     | 11  | 0.0000  | Yes (+)

BNDE vs. BMV
Dataset   | T   | p-value | Significance
att48     | 107 | 0.0098  | No (−)
eil51     | 107 | 0.0098  | No (−)
berlin52  | 50  | 0.00017 | No (−)
st70      | 10  | 0.0000  | No (−)
pr76      | 1   | 0.0000  | No (−)
eil76     | 11  | 0.0000  | No (−)

From the statistical analysis in Table 5, we can observe that BNDE significantly outperformed the existing algorithms (except BMV). For the few exceptional cases of TP1, TP2, and TP3 on the att48 dataset, the improved performance of BNDE was not statistically significant.

6 Phase II – Image Processing Application

In the second phase of the experiment, the BNDE mapping approach was applied to find the optimal parameter combination of the transformation function for image contrast enhancement. Since tuning the parameters for image contrast enhancement is a combinatorial optimization problem, this mapping approach can explore and exploit the optimal parameter combinations.

Transformation Function The local enhancement method [38] was applied using the contrast improvement given in Eq. (8):

h(x, y) = A(x, y) [f(x, y) − m(x, y)] + m(x, y)    (8)

where A(x, y) = k·M / σ(x, y) and k ∈ [0, 1]. h(x, y) represents the new pixel intensity value at (x, y), f(x, y) denotes the intensity value of the original pixel, m(x, y) represents the local mean, σ(x, y) is the local standard deviation, and M denotes the mean of the pixel intensities of the grayscale image. The local enhancement is modified to incorporate the parameters a, b, and c as

h(x, y) = (k·M / (σ(x, y) + b)) [c·f(x, y) − m(x, y)] + m(x, y)^a    (9)

where a ∈ [1, 2], b ∈ [0, 2], c ∈ [0, 0.5], and k ∈ [0, 1].

The fitness of the transformed images was calculated to examine the better candidates:

F(Ti) = (log2(log2(S(Ti))) · E(Ti) / (h·w)) · H(Ti) · P(Ti) · L(Ti)    (10)

where F(Ti) indicates the transformed image fitness, and S(Ti) denotes the sum of the intensity values of the edge pixels. The Canny edge detector is used: the positions of the white (edge) pixels are taken in the original image, and the sum of those pixel intensities gives S(Ti). E(Ti) denotes the count of pixels that form the image edges, i.e., the number of white pixels reported by the edge detector. The Shannon entropy of an image is calculated as

H(Ti) = − Σ_{i=0}^{255} pi · log2(pi)    (11)

where pi denotes the probability of the ith intensity value of the image, with i ranging over [0, 255].

P(Ti) = 20·log10(255) − 10·log10(MSE(Ti))    (12)

where

MSE(Ti) = (1/(h·w)) Σ_{i=0}^{h−1} Σ_{j=0}^{w−1} [Ti(i, j) − T0(i, j)]²    (13)

and the original image is denoted T0.

L(Ti) = ((2·μTi·μT0 + c1)(2·σTiT0 + c2)) / ((μTi² + μT0² + c1)(σTi² + σT0² + c2))    (14)

where μTi denotes the mean of the transformed image pixel values, μT0 denotes the mean of the original image pixel values, σTi² is the transformed image variance, σT0² is the original image variance, σTiT0 is the covariance of the two images, h denotes the height, and w denotes the width of the image.

Fig. 3 BNDE for image enhancement
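As an illustration, the modified local enhancement of Eq. (9) can be implemented over a sliding window with NumPy. The 3×3 neighborhood, the edge replication, the clipping to [0, 255], and the default parameter values (chosen inside the stated ranges) are assumptions made for this sketch, not details from the chapter:

```python
import numpy as np

def local_enhance(img, a=1.5, b=0.5, c=0.4, k=0.5, win=3):
    """Sketch of Eq. (9): h = k*M/(sigma + b) * (c*f - m) + m**a,
    with local mean m and local std sigma over a win x win window."""
    f = img.astype(np.float64)
    M = f.mean()                          # global mean intensity
    pad = win // 2
    fp = np.pad(f, pad, mode="edge")      # replicate borders (assumption)
    h, w = f.shape
    # Local statistics via stacked shifted views of the padded image
    windows = np.stack([fp[i:i + h, j:j + w]
                        for i in range(win) for j in range(win)])
    m = windows.mean(axis=0)
    sigma = windows.std(axis=0)
    out = k * M / (sigma + b) * (c * f - m) + m ** a
    return np.clip(out, 0, 255)           # keep the result displayable
```

In the chapter's setting, BNDE would search over (a, b, c, k) and score each transformed image with the fitness of Eq. (10).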

6.1 Design of Experiments – Phase II

The proposed approach was implemented on a computer system with 8 GB RAM and an i5 processor running Mac OS, using the Python 3.6 programming language. The performance analysis of BNDE was carried out with a mouse cell embryo dataset. The performance of BNDE was compared with Histogram Equalization and Contrast Limited Adaptive Histogram Equalization (CLAHE). The parameter setting of the BNDE approach was: population size = 20, number of iterations = 20, and number of runs = 10. For the mutation factor F and Cr, the values were the same as those applied for the TSP problem. The algorithmic approach of BNDE for image contrast enhancement is shown in Fig. 3. A random population of size 20 was generated, with the parameters a, b, c, and k generated randomly within their boundary ranges. Local image


enhancement was applied, and the fitness of each transformed image was evaluated. The DE/rand/1/bin variant was used: three random vectors from the population were selected for the rand/1 mutation. The genes of the mutant vector were replaced with the best gene after comparison with the best, average, and worst candidates. Since the proposed mapping approach was applied to an image processing application, the BNDE approach was modified accordingly: replacement with the best gene was carried out for the selected genes with probability greater than 0.25.

6.2 Results and Discussions – Phase II

The ideal combination of the parameters a, b, c, and k was selected based on the finest fitness value, and the transformed image obtained with these parameters was considered for the evaluation. The original image was compared with the enhanced image, and its histogram was analyzed. The results are shown in Table 6 (Tables 6.a and 6.b). The first column from the left shows the original image and its edges detected using the Canny edge detector; the second column shows the original image histogram; the third column shows the enhanced image and its detected edges; the fourth column shows the histogram of the enhanced image. Based on the analysis of the edges detected in the enhanced image, it is observed that more edge pixels are detected. By comparing the histograms of the original and enhanced images, it is observed that the contrast is enhanced in the enhanced images. A comparison of the BNDE approach with existing algorithms was also performed. Two existing algorithms, Histogram Equalization and CLAHE, were considered for the analysis. Table 7 (Tables 7.a, 7.b, and 7.c) shows the results obtained. Though the BNDE approach could outperform Histogram Equalization, it was observed that the image enhanced by the CLAHE algorithm allowed more edges to be detected, while Histogram Equalization generated noisy enhanced images. The performance of the state-of-the-art algorithms was assessed using the metrics Peak Signal-to-Noise Ratio (PSNR), Mean Squared Error (MSE), entropy, and Structural Similarity Index (SSIM). Table 8 summarizes the results obtained for the comparison of BNDE with Histogram Equalization and CLAHE. From the analysis of the values obtained for the metrics, the PSNR value obtained by BNDE is better than those of the other algorithms. MSE is inversely related to PSNR, and the MSE obtained by BNDE is lower than those of Histogram Equalization and CLAHE. Entropy is another metric used for measuring image quality; it is a measure of randomness in the image. The BNDE approach obtained lower entropy than the state-of-the-art algorithms; the reduced entropy value shows that the enhanced image is more homogeneous than the input image. The SSIM obtained by BNDE outperforms the other methods. SSIM measures the similarity of the enhanced image to the input image: the BNDE approach could enhance the image while still ensuring similarity to the input image.
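The MSE, PSNR, and entropy figures of Table 8 correspond to Eqs. (11)–(13); a NumPy sketch of these three metrics for 8-bit grayscale images (SSIM omitted for brevity):

```python
import numpy as np

def mse(ti, t0):
    """Eq. (13): mean squared error between transformed and original image."""
    d = ti.astype(np.float64) - t0.astype(np.float64)
    return float(np.mean(d * d))

def psnr(ti, t0):
    """Eq. (12): peak signal-to-noise ratio for 8-bit images, in dB."""
    e = mse(ti, t0)
    return float("inf") if e == 0 else 20 * np.log10(255) - 10 * np.log10(e)

def entropy(img):
    """Eq. (11): Shannon entropy of the gray-level histogram, in bits."""
    hist = np.bincount(img.ravel(), minlength=256)
    p = hist / hist.sum()
    p = p[p > 0]                     # 0 * log(0) terms contribute nothing
    return float(-(p * np.log2(p)).sum())
```

Lower MSE means higher PSNR, which is the inverse relation noted above; lower entropy indicates a more homogeneous image.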

Table 6.a Comparison of original image and the enhanced images (1–6). Columns (left to right): original image with its detected edges, original image histogram, enhanced image (BNDE approach) with its detected edges, and BNDE approach histogram.

Table 6.b Comparison of original image and the enhanced images (7–12). Columns (left to right): original image with its detected edges, original image histogram, enhanced image (BNDE approach) with its detected edges, and BNDE approach histogram.


Table 7.a Comparison of enhanced images (1–4). Columns: original image, Hist-equalization, CLAHE, and BNDE.

7 Conclusions

This chapter presented a study in two phases. Phase I of the study proposed an improved Differential Evolution algorithm, named BNDE. BNDE incorporates a proposed mapping approach to improve the exploration and exploitation nature of DE; the proposed mapping technique makes DE suitable for solving discrete optimization problems. The performance of the proposed algorithm was evaluated on benchmark TSPs and compared with six different similar state-of-the-art


Table 7.b Comparison of enhanced images (5–8). Columns: original image, Hist-equalization, CLAHE, and BNDE.

algorithms. The empirical studies revealed that the proposed algorithm works better than all the approaches except one (the BMV approach, to which the performance difference was not statistically significant). Though it could not outperform the BMV approach, their performance was comparable. The statistical studies highlighted that in the cases where BNDE outperformed, the improvement was statistically significant. In Phase II, the proposed algorithm was tested on an image enhancement application and


Table 7.c Comparison of enhanced images (9–12). Columns: original image, Hist-equalization, CLAHE, and BNDE.

compared with two classical image enhancement algorithms. The results obtained for the performance metrics used in the experiment reiterated the quality of the proposed algorithm: it could outperform the classical algorithms on the PSNR, MSE, and SSIM values. This approach was validated on grayscale images; for validation on RGB images, the mapping approach can be applied on the red, green, and blue

Table 8 Comparison of BNDE with existing algorithms

Histogram Equalization (HISTO)
Dataset      | PSNR   | Entropy | MSE     | SSIM
7_19_M16     | 11.21  | 7.91    | 105.47  | 0.1547
7_19_2ME2    | 11.343 | 7.90    | 105.62  | 0.1549
7_19_2ME4    | 11.28  | 7.90    | 107.06  | 0.1471
7_19_2ME5    | 11.29  | 7.89    | 103.63  | 0.1399
7_19_2ME6    | 11.29  | 7.93    | 102.05  | 0.1775
7_19_2ME8    | 11.39  | 7.91    | 104.19  | 0.1848
7_19_2ME10   | 11.185 | 7.913   | 103.002 | 0.17365
7_19_2ME13   | 11.294 | 7.92    | 102.04  | 0.17923
7_19_2ME15   | 11.576 | 9.97    | 104.567 | 0.2567
7_19_1ME1    | 11.23  | 7.95    | 105.30  | 0.1299
7_19_1ME9    | 11.292 | 7.94    | 105.650 | 0.129
7_19_1ME10   | 11.269 | 7.94    | 105.25  | 0.1255

CLAHE
Dataset      | PSNR  | Entropy | MSE    | SSIM
7_19_M16     | 28.06 | 5.45    | 31.37  | 0.8711
7_19_2ME2    | 26.07 | 5.72    | 38.72  | 0.8700
7_19_2ME4    | 26.92 | 5.51    | 31.96  | 0.8738
7_19_2ME5    | 27.01 | 5.49    | 32.86  | 0.8728
7_19_2ME6    | 27.43 | 5.50    | 31.97  | 0.8803
7_19_2ME8    | 27.30 | 5.53    | 30.465 | 0.8808
7_19_2ME10   | 27.98 | 5.35    | 27.11  | 0.8859
7_19_2ME13   | 27.55 | 5.55    | 33.14  | 0.8794
7_19_2ME15   | 28.44 | 5.80    | 33.04  | 0.8356
7_19_1ME1    | 27.44 | 5.66    | 38.56  | 0.8176
7_19_1ME9    | 27.45 | 5.67    | 37.43  | 0.8223
7_19_1ME10   | 27.50 | 5.66    | 37.96  | 0.8210

BNDE
Dataset      | PSNR   | Entropy | MSE        | SSIM
7_19_M16     | 69.74  | 4.311   | 0.00022135 | 0.9999985
7_19_2ME2    | 66.10  | 4.40    | 0.00011719 | 0.99999817
7_19_2ME4    | 65.27  | 4.241   | 0.00013346 | 0.99999771
7_19_2ME5    | 64.115 | 4.20    | 0.00020833 | 0.99999754
7_19_2ME6    | 64.11  | 4.44    | 0.00020833 | 0.99999754
7_19_2ME8    | 62.17  | 4.49    | 0.00022135 | 0.99999736
7_19_2ME10   | 62.17  | 4.37    | 0.00022135 | 0.99999734
7_19_2ME13   | 68.85  | 4.41    | 0.00013346 | 0.99999845
7_19_2ME15   | 63.82  | 4.93    | 0.00028971 | 0.99999757
7_19_1ME1    | 63.45  | 4.43    | 0.00021159 | 0.99999744
7_19_1ME9    | 62.66  | 4.447   | 0.00020833 | 0.99999748
7_19_1ME10   | 64.41  | 4.38    | 0.00018555 | 0.99999768

328

A. Radhakrishnan and G. Jeyakumar

channel. The future scope of this work is to extend the mapping approach in color images. This study also will be enhanced further to hybridize other optimization techniques with DE to make BNDE outstanding among all the state-of-the-art mapping techniques.


Swarm-Based Methods Applied to Computer Vision

María-Luisa Pérez-Delgado

Abbreviations

Below are the abbreviations used in the chapter:

AA: Artificial ants
ABC: Artificial bee colony
ALO: Ant lion optimizer
BA: Bat algorithm
BFO: Bacterial foraging optimization
CRS: Crow search
CSO: Cat swarm optimization
CT: Computed tomography
CUS: Cuckoo search
FA: Firefly algorithm
FPA: Flower pollination algorithm
FSA: Fish swarm algorithm
GWO: Gray wolf optimization
MR: Magnetic resonance
PSO: Particle swarm optimization
RGB: Red, green, blue
WO: Whale optimization

M.-L. Pérez-Delgado () University of Salamanca, Escuela Politécnica Superior de Zamora, Zamora, Spain e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_16


1 Introduction

Nowadays, computer vision has become a very important element in many sectors, such as the development of autonomous vehicles, surveillance and supervision systems, the manufacturing industry, and the health care sector [1]. It involves applying different image processing operations to analyze the data and extract relevant information. The dimensionality of the data gives many of these operations a high computational cost, which calls for methods that can generate solutions in reasonable execution time. Among such methods, swarm-based algorithms have been successfully applied to various image processing operations. This chapter shows the application of this type of solution to various image processing tasks related to computer vision. The objective is not to include an exhaustive list of articles, since the length of the chapter does not allow it. Rather, the chapter focuses on the most recent and interesting proposals where swarm-based solutions have been successfully applied.

2 Brief Description of Swarm-Based Methods

Swarm-based algorithms define a metaheuristic approach to solving complex problems [2, 3]. These algorithms mimic the behavior observed in natural systems in which all the individuals of a swarm or population contribute to solving a problem. This collective behavior was simulated in order to apply it to optimization problems, and it has been shown that swarm-based methods can perform well on complex problems [4–6].
Various swarm algorithms have been proposed in recent years [7]. Although each algorithm has its peculiarities, they all share the same basic structure. The first operation initializes the population. This operation generally associates each individual with a random solution of the search space. Then, an iterative process is applied to improve the current solutions associated with the individuals in the population. At each iteration, the quality or fitness of the solutions is determined. This value is computed by applying the objective function of the problem (or a modification of said function) to the solution represented by each individual. The solution with the best fitness is considered the solution to the problem in the current iteration. Then, the population shares information to try to move the individuals to better areas of the search space. This operation is different for each swarm-based method, but it always moves some individuals (all or some of them) to new, generally more promising, positions in the search space. The iterative process continues until the stopping criterion of the algorithm is met. This occurs when the algorithm has performed a specific number of iterations or when the solution converges. At the end of the iterations, the solution to the problem is the best solution found by the swarm throughout the iterations.


Algorithm 1 PSO algorithm

1: Set initial values for x_i(0) and v_i(0), for i = 1, ..., P
2: Set b_i(0) = x_i(0), for i = 1, ..., P
3: Compute g(0) according to Eq. 1

   g(t) = { b_i(t) | fitness(b_i(t)) = max_j fitness(b_j(t)) }                        (1)

4: for t = 1 to TMAX do
5:   Compute v_i(t), x_i(t), and b_i(t), for i = 1, ..., P, according to Eqs. 2, 3, and 4, respectively

   v_i(t) = ω v_i(t − 1) + φ1 ε1 [b_i(t − 1) − x_i(t − 1)] + φ2 ε2 [g(t − 1) − x_i(t − 1)]   (2)

   x_i(t) = x_i(t − 1) + v_i(t)                                                       (3)

   b_i(t) = x_i(t)       if fitness(x_i(t)) > fitness(b_i(t − 1))
            b_i(t − 1)   otherwise                                                    (4)

6:   Compute g(t) according to Eq. 1
7: end for

The preceding paragraph gives an overview of the operations of swarm-based algorithms. This information is completed by describing the specific operations of the particle swarm optimization (PSO) algorithm, which is one of the most widely used swarm algorithms (Algorithm 1). The variables used in the description of this algorithm are defined as follows. A swarm of P particles is used to solve a problem defined in an r-dimensional space. We consider that the algorithm will conclude after performing TMAX iterations. At iteration t of the algorithm, particle i has a position x_i(t) and a velocity v_i(t) and remembers the best position it has found so far, b_i(t), where x_i(t) = (x_i1(t), ..., x_ir(t)), v_i(t) = (v_i1(t), ..., v_ir(t)), and b_i(t) = (b_i1(t), ..., b_ir(t)), with i = 1, ..., P. g(t) denotes the best position found by the swarm up to iteration t (the solution to the problem). fitness(a) represents the function applied to compute the quality of a solution a. Finally, ω, φ1, and φ2 are predefined weights, while ε1 and ε2 are random vectors. Equations 1 and 4 are defined considering that the problem to be solved is a maximization problem. Figure 1 graphically shows the elements that condition the movement of a particle within the solution space. Table 1 lists the swarm-based solutions mentioned in this chapter, along with a reference that provides the reader with detailed information on each method.
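The update rules of Algorithm 1 can be sketched in a few lines of Python. This is only a minimal illustration for a maximization problem, not an implementation from the cited works; the swarm size, the weight values, and the sphere test function are arbitrary choices made for the example.

```python
import random

def pso(fitness, dim, n_particles=30, t_max=200,
        omega=0.7, phi1=1.5, phi2=1.5, bounds=(-5.0, 5.0)):
    """Minimal PSO for a maximization problem, following Algorithm 1."""
    lo, hi = bounds
    # Step 1: random initial positions, zero initial velocities
    x = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    # Step 2: each personal best starts at the initial position
    b = [xi[:] for xi in x]
    # Step 3 / Eq. 1: the swarm best is the personal best with the highest fitness
    g = max(b, key=fitness)[:]
    for _ in range(t_max):
        for i in range(n_particles):
            for d in range(dim):
                eps1, eps2 = random.random(), random.random()
                # Eq. 2: inertia + cognitive pull + social pull
                v[i][d] = (omega * v[i][d]
                           + phi1 * eps1 * (b[i][d] - x[i][d])
                           + phi2 * eps2 * (g[d] - x[i][d]))
                # Eq. 3: move the particle
                x[i][d] += v[i][d]
            # Eq. 4: update the personal best
            if fitness(x[i]) > fitness(b[i]):
                b[i] = x[i][:]
        # Eq. 1: update the swarm best
        g = max(b, key=fitness)[:]
    return g

# Example: maximize f(x) = -sum(x_d^2), whose optimum is the origin
random.seed(1)
best = pso(lambda p: -sum(xd * xd for xd in p), dim=3)
```

With these settings the swarm concentrates near the origin after a few hundred iterations; a practical implementation would also clamp the velocities and positions to the search bounds.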


Fig. 1 PSO determines the new position of particle i, x_i(t), taking into account its previous position x_i(t − 1), the best position found by the particle b_i(t − 1), the best position found by the swarm g(t − 1), and the current velocity of the particle v_i(t)

Table 1 Basic references for the swarm-based algorithms cited in this article

Swarm-based method                        Reference
Artificial bee colony (ABC)               [8]
Artificial ants (AA)                      [9]
Ant lion optimizer (ALO)                  [10]
Bat algorithm (BA)                        [11]
Bacterial foraging optimization (BFO)     [12]
Cuckoo search (CUS)                       [13]
Cat swarm optimization (CSO)              [14]
Crow search (CRS)                         [15]
Flower pollination algorithm (FPA)        [16]
Firefly algorithm (FA)                    [17]
Fish swarm algorithm (FSA)                [18]
Gray wolf optimization (GWO)              [19]
Particle swarm optimization (PSO)         [20]
Whale optimization (WO)                   [21]

3 Some Advantages of Swarm-Based Methods

Computer vision systems must handle noisy, complex, and dynamic images. For the system to be useful, it must interpret the image data accurately and quickly. Many operations related to computer vision can be formulated as optimization problems (segmentation, classification, tracking, etc.). Many of these problems are difficult to solve for different reasons: the high dimensionality of the data, the large volume of data to be processed, the noise in the data, the size and characteristics of the solution space, etc. Therefore, the resulting problems are often high-dimensional optimization problems with complex search spaces that can include complex constraints. Finding the optimal solution to these problems is very difficult, and the operations require a lot of execution time. The characteristics of these problems make classical optimization techniques unsuitable for their resolution. For this


reason, various optimization techniques have been proposed in recent years to avoid the problems of classical techniques. These methods have been applied to solve optimization problems for which classical techniques do not work satisfactorily. Swarm-based methods have been successfully applied to solve several computer vision tasks, providing good solutions since they avoid getting stuck in local optima.
Swarm-based algorithms were developed to solve optimization problems, and they have been successfully applied to many problems in different areas [2, 3]. These algorithms have been applied to complex non-linear optimization problems. They are also useful for solving high-dimensional and multimodal problems. Furthermore, these methods require little a priori knowledge of the problem and have low computational cost. The characteristics of swarm-based methods give them several advantages over classical optimization algorithms:

• Individuals are very simple, which facilitates their implementation.
• Individuals do their work independently, so swarm-based algorithms are highly parallelizable.
• The system is flexible, as the swarm can respond to internal disturbances and changes in the environment. In addition, the swarm can adapt to both predetermined stimuli and new stimuli.
• The system is scalable, because it can include from very few individuals to a large number of them.
• The control of the swarm is distributed among its members, which allows the swarm to give a rapid local response to a change. This response is quick because it is not necessary to communicate with a central control or with all the individuals in the swarm.
• The system is robust. Since there is no central control in the swarm, the system can obtain a solution even if several individuals fail.
• Individuals interact locally with other individuals and also with the environment. This behavior is useful for problems where there is no global knowledge of the environment. In this case, the individuals exchange locally available information, which allows a good global solution to the problem to be obtained. In addition, the system can adapt to changes in its environment.
• To apply some classical methods, it is necessary to make assumptions about the characteristics of the problem to be solved or the data to be processed. In contrast, swarm-based methods make no such assumptions and can be applied to a wide range of problems.
• These methods can explore a larger region of the search space and can avoid premature convergence. Since the swarm evaluates several feasible solutions in parallel, the system is prevented from being trapped in local minima. Even if some individual falls into a local optimum, other individuals may find a promising solution.
• These algorithms include few parameters, and they do not require fine-tuning for the algorithm to work.


4 Swarm-Based Methods and Computer Vision

Many of the image processing operations discussed below are closely related and are often applied sequentially to an image. However, the operations have been separated into several sections, each citing swarm-based solutions that focus on the specific operation.

4.1 Feature Extraction

Feature extraction is a preliminary task for other image processing operations, since it reduces the dimensionality of the data to be handled in that processing. This operation obtains the most relevant information from the image and represents it in a lower-dimensional space. The set of features obtained by this operation can be used as input information for other processing applied to the image.
When a feature set has been extracted from an image, feature selection allows selecting a subset of features from the entire set of candidate features. This is a complex task, and swarm-based solutions have been proposed to reduce its computational cost. In general, the interesting feature subset is conditioned by the image processing that will be applied to those features. For this reason, the feature selection operation is usually a preliminary step to another more general operation that conditions the features to be selected (Fig. 2). For example, this occurs when selecting features for image classification. Several swarm-based methods have been used for feature selection to classify images, such as AA [22], PSO [23, 24], or ABC [25].
Feature selection is an important aspect of hyperspectral image processing, as it allows selecting the relevant bands of the image in order to reduce the dimensionality. PSO was used in [26] to select features and then apply a convolutional neural network to classify hyperspectral images. PSO was also applied in [27], but combining two swarms: one of them estimates the optimal number of bands and the other selects the bands. The proposal of [28] combines PSO with genetic algorithms for feature selection. PSO operations are applied to update the particles, and then a

Fig. 2 Feature extraction and feature selection are two initial steps for other image processing operations


new population is generated by applying the operators of the genetic algorithm. The method automatically determines the number of features to select. Other researchers have applied various swarm algorithms to address the same problem, including GWO [29], ALO [30], CUS [31], ABC [32], or FA [33].
Feature selection is also important for image steganalysis, which is the process of detecting hidden messages in an image. It has been performed with ABC [34], PSO [35, 36], or GWO [37]. In addition, other articles that apply swarms to this problem are described in [38].
Detecting an object or region of interest within an image depends heavily on the image features being analyzed. The objective of feature detection is to identify features such as edges, shapes, or specific points, and swarm-based solutions reduce the time required to perform this operation. Several articles describe the use of artificial ants for edge detection. The proposal presented in [39] uses the algorithm called ant system, while the methods described in [40] and [41] use the ant colony system algorithm. In all cases, the ants are used to obtain the edge information. On the other hand, the method described in [42] applies artificial ants as a second operation to improve the edge information obtained by conventional edge detection algorithms (the Sobel and Canny approaches). Other proposals for the application of artificial ants to edge detection are described in [43] and [44]. Other swarm-based methods that have been applied for edge detection are PSO [45] and ABC [46].
There are also articles that describe the application of swarms for shape detection. PSO and genetic algorithms were combined in [47] to define a method that detects circles. ABC was applied in [48] to detect circular shapes, while BFO was applied in [49].
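The wrapper idea behind swarm-based feature selection can be illustrated with a toy fitness function. The sketch below scores a candidate subset, encoded as the kind of binary mask a binary PSO or ABC individual would carry, by leave-one-out nearest-neighbour accuracy minus a small penalty per selected feature. The fitness definition, the penalty weight alpha, and the tiny data set are illustrative assumptions, not taken from the cited works.

```python
def subset_fitness(mask, samples, labels, alpha=0.01):
    """Score a binary feature mask: LOO 1-NN accuracy minus a size penalty."""
    if not any(mask):
        return 0.0
    # keep only the feature columns where mask[j] == 1
    data = [[x for x, m in zip(row, mask) if m] for row in samples]
    correct = 0
    for i, row in enumerate(data):
        # leave-one-out: nearest neighbour among the remaining samples
        j = min((k for k in range(len(data)) if k != i),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(row, data[k])))
        correct += labels[j] == labels[i]
    return correct / len(data) - alpha * sum(mask)

# Feature 0 separates the classes; feature 1 is noise that misleads the classifier
samples = [[0.0, 5.0], [0.1, -3.0], [1.0, 4.0], [1.1, -2.0]]
labels = [0, 0, 1, 1]
good = subset_fitness([1, 0], samples, labels)
bad = subset_fitness([0, 1], samples, labels)
```

A swarm would search over such masks, each individual proposing a subset and the fitness above deciding which subsets survive.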

4.2 Image Segmentation

Image segmentation consists of decomposing an image into non-overlapping regions. Interesting parts can then be extracted from the image for further processing. For example, this makes it possible to separate different objects and also to separate an object from the background (Fig. 3). Image segmentation is

Fig. 3 Example of segmentation process applied to extract the objects from the background


very important in computer vision applications, as it is a preliminary step for other operations such as image understanding or image recognition. Several techniques are commonly used for image segmentation, such as clustering, thresholding, edge detection, or region identification. To analyze swarm-based solutions, we will focus on the first two approaches.
Clustering algorithms are one of the simplest segmentation techniques. The pixels of the image are divided into clusters or groups of similar pixels, and each cluster is represented by a single color. PSO was used in [50] to define the initial centroids for the well-known K-means clustering method. The centroid of a cluster is the value used to represent that cluster. The same methods were combined in [51], but in this case PSO not only defines the initial centroids for K-means but also determines the number of centroids. The proposal of [52] combines FSA with the fuzzy c-means clustering method. The first method is used to determine the number of clusters for the second method and also to optimize the selection of the initial centroids. Artificial ants were applied in [53]. In this case, an ant is assigned to each pixel and moves around the image looking for low grayscale regions. When the algorithm concludes, the pheromone accumulated by the ants allows the pixels to be classified as black or white. The proposal of [54] also uses artificial ants, but the information used to define the clusters is the gray value, the gradient, and the neighborhood of the pixels. The method described in [55] applies the ant-tree algorithm, an ant-based method in which the ants represent items that are connected in a tree structure to define clusters.
Thresholding methods are popular techniques for image segmentation due to their simplicity. They divide the pixels of the image based on their intensity and determine the boundaries between classes.
Bi-level thresholding is applied to divide an image into two classes (e.g., the background and the object of interest), while multi-level thresholding is used to divide it into more than two classes. The methods used to compute the thresholds can be divided into non-parametric and parametric. Non-parametric methods determine the thresholds by optimizing some criterion, and they have been proven to be more accurate than parametric methods. Several thresholding criteria have been proposed. The Otsu criterion is a very popular method that selects optimal thresholds by maximizing the between-class variance [84]. Entropy-based criteria maximize the sum of the entropies of the classes and are also widely used. Among the criteria of this type, we can mention the Kapur entropy [85], the Tsallis entropy [86], the minimum cross entropy [87], or the fuzzy entropy. Many swarm-based methods have been applied to determine the thresholds in multi-level thresholding (Table 2). In general, they define the fitness function of the swarm using some of the thresholding criteria described above.

Table 2 Swarm-based methods applied to multi-level thresholding for image segmentation

Swarm   References   Criterion
ABC     [56]         Otsu
ABC     [57]         Kapur
ABC     [58]         Tsallis
ABC     [59]         Kapur, Otsu
ABC     [60]         Kapur, Otsu, Tsallis
FA      [61]         Otsu
FA      [62]         Kapur, Otsu
FA      [63]         Tsallis and Kapur
FA      [64]         Otsu, Kapur, minimum cross entropy
FA      [65]         Fuzzy entropy
FA      [66]         Minimum cross entropy
CUS     [67]         Kapur
CUS     [68]         Tsallis
CUS     [62]         Kapur and Otsu
CUS     [69]         Minimum cross entropy
CUS     [70]         Kapur, Otsu, Tsallis
PSO     [71–73]      Otsu
PSO     [74]         Kapur
PSO     [59, 75]     Kapur, Otsu
PSO     [76]         Minimum cross entropy
GWO     [77]         Kapur
GWO     [78]         Kapur, Otsu
BA      [79]         Otsu
BA      [80]         Kapur, Otsu
AA      [81]         Otsu
WO      [82]         Otsu, Kapur
CRS     [83]         Kapur

4.3 Image Classification

Image classification is the process of identifying groups of similar image primitives. These image primitives can be pixels, regions, line elements, etc., depending on the problem at hand. Many basic image processing techniques, such as quantization or segmentation, can be viewed as different instances of the classification problem. A classification method can be applied to associate an image with a specific class (Fig. 4). Another possibility is to classify parts of the image as belonging to certain classes (river, road, forest, etc.).
Several swarm-based approaches have been proposed to associate images with specific classes. In general, swarm methods are combined with other methods to define a classification system. For example, the system described in [88] uses PSO to update the weights of a neural network that classifies color images. PSO was also used


Fig. 4 Example of a classification system that can distinguish ripe and unripe tomatoes from an image

in [89, 90], and [91] to define the optimal architecture of a convolutional neural network applied to classify images. A system to classify fruit images was proposed in [92]; it applies a variant of ABC to train the neural network that performs the classification. The solution proposed in [93] to classify remote-sensing images uses a Naïve Bayes classifier and applies CUS to define the classifier weights. The system for identifying and classifying plant leaf diseases described in [94] uses BFO to define the weights of a radial basis function neural network.
Swarm algorithms have also been applied to define methods that classify parts of an image. Omran et al. described two applications of PSO for this type of image classification, using each particle to represent the means of all the clusters. In the first case, the fitness function tries to minimize the intra-cluster distance and to maximize the inter-cluster distance [95]. In the second case, the function includes a third element to minimize the quantization error [96]. The method described in [97] uses artificial ants to classify remote-sensing images, so that different land uses are identified in the image. The same problem was solved in [98], but applying PSO. The method described in [99] classifies a high-resolution urban satellite image to identify five land cover classes: vegetation, water, urban, shadow, and road. The article proposes two classification methods, which apply artificial ants and PSO, respectively. Another crop classification system was defined in [100]. This system uses PSO to train a neural network that can differentiate 13 types of crops in a radar image.
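The two objectives used by Omran et al. can be illustrated with a toy fitness function. The sketch below assumes one-dimensional grayscale values and combines the largest average intra-cluster distance (to be minimized) with the smallest inter-centroid distance (to be maximized) by simple subtraction; the actual formulation in [95] weights and combines the terms differently, so this is only an approximation of the idea.

```python
def clustering_fitness(values, centroids):
    """Toy PSO fitness for pixel clustering over grayscale values.
    Lower is better: compact clusters and well-separated centroids."""
    # assign each value to its nearest centroid
    clusters = [[] for _ in centroids]
    for v in values:
        k = min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
        clusters[k].append(v)
    # largest average intra-cluster distance (compactness, to minimize)
    d_max = max(sum(abs(v - c) for v in vs) / len(vs)
                for c, vs in zip(centroids, clusters) if vs)
    # smallest distance between centroids (separation, to maximize)
    d_min = min(abs(a - b) for i, a in enumerate(centroids)
                for b in centroids[i + 1:])
    return d_max - d_min

# Two well-separated groups of gray values
values = [10, 12, 11, 200, 198, 202]
good = clustering_fitness(values, [11, 200])   # centroids on the groups
bad = clustering_fitness(values, [100, 105])   # centroids far from both groups
```

In the PSO formulation, each particle encodes one full set of centroids, and this score is the value the swarm tries to minimize.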

4.4 Object Detection

Object detection consists of finding the image of a specific object within another, more general image or in a video sequence (Fig. 5). Automatic object detection is a very important operation in computer vision, but it is difficult due to many factors such as rotation, scale, occlusion, or viewpoint changes. The practical applications of this


Fig. 5 Example of object detection trying to find tomatoes in an image. The figure shows the object to be detected (left) and the instances of that object identified in a general image (right)

operation include surveillance, medical image analysis, and image retrieval, among others. A method that uses feature-based object classification together with PSO is proposed in [101]. This method allows finding multiple instances of objects of a class. Each particle of the swarm is a local region classifier. The objective function measures the confidence that the image distribution in the region currently analyzed by a particle belongs to the object class. PSO was also used in [102] to define a feature-based method to distinguish a salient object from the background of the image.

Model-based methods use a mathematical model to describe the object to be recognized. This model must preserve the key visual properties of the object category, so that it can be used to identify objects of that category despite variations due to deformations, occlusions, illumination, etc. A model-based system that uses PSO was described in [103]. In this case, the object detection operation is treated as an optimization problem, and the objective function to be maximized represents the similarity between the model and a region of the image under investigation. PSO is used to optimize the parameters of the deformable template that represents the object to be found. The method presented in [104] applies PSO to detect traffic signs in real time. In this case, the signs are defined as sets of three-dimensional points that define the contour. The fitness function of PSO detects a sign belonging to a certain category and, at the same time, estimates its position relative to the camera's reference frame. The proposal of [105] uses PSO to optimize the parameters of a support vector machine that is used to identify traffic signs.

Active contour models, also called snakes, are deformable models applied to detect the contour of an object in an image. Control points are defined near the object of interest and are moved to conform to the shape of the object. Tseng et al. used PSO to define an active contour model that uses several swarms, each associated with a control point [106]. The method proposed in [107] uses ABC to apply an active contour model.
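The detection-as-optimization idea behind methods such as [103] can be illustrated with a minimal PSO loop. This is only a sketch, not the cited algorithm: the toy similarity function below stands in for a real model-to-image matching score, and all parameter values are illustrative.

```python
import random

def pso_maximize(fitness, bounds, n_particles=20, iters=60, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal PSO: maximize `fitness` over the box given by `bounds`."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # each particle's best position so far
    pbest_val = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position found by the swarm
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(max(pos[i][d] + vel[i][d], bounds[d][0]), bounds[d][1])
            val = fitness(pos[i])
            if val > pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val > gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Toy "similarity map": a single peak at (30, 40) plays the role of the
# location where the deformable model best matches the image.
sim = lambda p: -((p[0] - 30) ** 2 + (p[1] - 40) ** 2)
best, _ = pso_maximize(sim, [(0, 100), (0, 100)])
```

In a real detector, the fitness evaluation would extract the image region addressed by the particle and score it against the model, which is usually the expensive step that the swarm is trying to sample sparingly.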


M.-L. Pérez-Delgado

Model-based methods include graph models, which break the object into parts and represent each one by a graph vertex. This approach is considered in [108], which applies artificial ants for road extraction from very high-resolution satellite images. First, the image is segmented to generate image objects. These objects are then used to define the nodes of the graph that the ants traverse to produce a binary roadmap. At the end of the process, the binary roadmap is vectorized to obtain the center lines of the road network. A cuckoo search-based method was applied in [109] to detect vessels in a synthetic aperture radar image.

The proposals described in [110] and [111] define two template-matching methods that apply ABC. Template-matching methods try to find a sub-image, called the template, within another image. The objective function proposed in [110] computes the difference between the RGB level histograms of the target object and the template object. The absolute sum of differences between pixels of the target image and the template image was used in [111] to define the fitness function. A method for visual target recognition for low-altitude aircraft was described in [112]. It is a shape-matching method that uses ABC to optimize the matching parameters.

Before concluding this section, it should be noted that object recognition is a necessary operation for object tracking. Several applications of PSO for object tracking appear in [113–115]. Other swarm-based solutions use CUS [116, 117], BA [118], and FA [119].
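The sum-of-absolute-differences objective used for template matching in [111] can be sketched as follows. This is an illustration of the fitness function only: the exhaustive scan stands in for the swarm search, and the tiny arrays are made-up examples.

```python
def sad(image, template, x, y):
    """Sum of absolute differences between a template and the image patch at (x, y)."""
    return sum(abs(image[y + i][x + j] - template[i][j])
               for i in range(len(template))
               for j in range(len(template[0])))

def best_match(image, template):
    """Exhaustive search over all valid positions; a swarm-based matcher would
    sample this same objective instead of scanning every (x, y)."""
    h, w = len(template), len(template[0])
    positions = [(x, y) for y in range(len(image) - h + 1)
                        for x in range(len(image[0]) - w + 1)]
    return min(positions, key=lambda p: sad(image, template, p[0], p[1]))

img = [[0, 0, 0, 0],
       [0, 5, 6, 0],
       [0, 7, 8, 0],
       [0, 0, 0, 0]]
tpl = [[5, 6],
       [7, 8]]
print(best_match(img, tpl))  # exact match at (x=1, y=1)
```

The histogram-based objective of [110] would replace `sad` with a distance between the RGB histograms of the patch and the template, leaving the search unchanged.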

4.5 Face Recognition

Face recognition is an interesting area of image analysis. It has a wide range of applications, including human–computer interaction and security systems (Fig. 6). Face recognition is a difficult operation due to the variability of many parameters, such as scale, size, pose, expression, hair, and environmental conditions. The quality of a face recognition system is highly influenced by the set of features selected to complete the operation. The most discriminant features should be selected, especially those that are not affected by variations in scale, facial

Fig. 6 Blocks of a security system that includes face recognition

Swarm-Based Methods Applied to Computer Vision


expressions, pose, or illumination. Several swarm-based methods have been used to improve feature selection for face recognition, including artificial ants [120], FA [121], PSO [122, 123], CUS [124], BFO [125], and BA [126].

The authors of [127] addressed the face recognition problem when the illumination conditions are not suitable. They used a sensor that simultaneously takes two face images (visible and infrared) and applied PSO to merge both images. BFO was used in [128] to recognize faces with age variations. Since aging affects each facial region differently, the authors defined specific weights for the information extracted from each area and applied the swarm-based algorithm to combine the features of global and local facial regions.

In addition to feature selection, swarm algorithms are also combined with other methods to define complete face recognition systems. The face recognition system defined in [129] combines support vector machines with PSO, where PSO optimizes the support vector machine parameters. The proposal of [130] defines a system based on linear discriminant analysis in which BFO was used to define the optimal principal components. The method described in [131] combines ABC with Volterra kernels. The solution described in [132] combines PSO with a neural network to optimize the parameters of the network. CUS was combined with principal component analysis and intrinsic discriminant analysis in [133]. The proposal of [134] applies PSO and ABC to define a classifier for face recognition. The system defined in [135] combines a neural network with FA and uses the fireflies to define the parameters of the network.
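Swarm-based feature selection of the kind cited above typically encodes a feature subset as a 0/1 mask and lets the swarm search over masks. The sketch below is a minimal binary PSO, not any specific cited method; the scoring function is a toy stand-in for a real classifier-accuracy wrapper, and the "useful" feature indices are invented for the example.

```python
import math
import random

def binary_pso_select(evaluate, n_features, n_particles=20, iters=50, seed=1):
    """Minimal binary PSO: each particle is a 0/1 mask over the features and
    `evaluate(mask)` scores the subset (higher is better)."""
    rng = random.Random(seed)
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))   # velocity -> bit probability
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [evaluate(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = 1 if rng.random() < sig(vel[i][d]) else 0
            val = evaluate(pos[i])
            if val > pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val > gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest

# Toy score: features 0 and 2 are "discriminant"; every extra feature costs a little.
useful = {0, 2}
score = lambda mask: sum(3 if d in useful else -1 for d in range(len(mask)) if mask[d])
mask = binary_pso_select(score, n_features=6)
```

In a real face recognition pipeline, `evaluate` would train or query a classifier restricted to the selected features, which is why these wrapper searches are expensive and why compact masks are rewarded.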

4.6 Gesture Recognition

Humans show many emotions through facial expressions (happiness, sadness, anger, etc.). The recognition of these expressions is useful for the analysis of customer satisfaction, video games, or virtual reality, among other applications. Several swarm-based methods have been proposed for the automatic recognition of facial expressions, using FA [136], PSO [137], CSO [138], or GWO [139]. In addition, the method described in [140] proposes a three-dimensional facial expression recognition model that uses artificial ants and PSO.

Face recognition and facial expression recognition are related to head pose estimation, a difficult problem in computer vision. This problem was addressed in [141] by a method that uses images from a depth camera and applies the PSO algorithm to solve the problem as an optimization problem. The method presented in [142] is a PSO-based solution for three-dimensional head pose estimation that uses a commodity depth camera. A variant of ABC was used in [143] for face pose estimation from a single image.

Human motion recognition requires detecting changes in the position of a human posture or gesture in a video sequence. The starting point of this process is the identification of the initial position. Tracking human motion from


video sequences has applications in fields such as human–computer interaction and surveillance.

Several articles propose methods for hand pose estimation based on image analysis. The method described in [144] uses PSO to estimate hand pose from two-dimensional silhouettes of a hand extracted from a multi-camera system. The proposal presented in [145] uses PSO to estimate the three-dimensional pose of the hand. Another PSO-based method is proposed in [146] to estimate the pose of a hand that interacts with an object. The problem of tracking hand articulations was solved in [147] by a model that uses PSO. In this case, the input information was obtained by a Kinect sensor, which provides an image and a depth map.

Human body pose estimation from images is an interesting starting point for more complex operations, such as tracking human body pose in a video sequence. The proposal of [148] uses PSO to estimate the human pose from still images. The input data used by this method is a set of multi-view images of a person sitting at a table. On the other hand, BA was used in [149] to estimate the pose of a human body in video sequences. PSO was applied in [150] to estimate upper-body posture from multi-view markerless sequences. A system to detect a volleyball player in a video sequence was proposed in [151]; the authors analyzed the application of several swarm methods, concluding that CUS generates the best results. PSO was applied in [152] for markerless full-body articulated human motion tracking, using multi-view video sequences acquired in a studio environment. The same swarm-based method was used in [153] to define a model for three-dimensional tracking of the human body. A PSO-based solution was described in [154] to track multiple pedestrians in a crowded scene.

4.7 Medical Image Processing

Many techniques can be applied to obtain medical images, such as magnetic resonance (MR) imaging, computed tomography (CT), or X-ray. The images obtained by these methods provide very useful information for making medical decisions. To this end, various image processing techniques are often applied to medical images. Many articles describe swarm-based solutions that apply to medical images some of the operations already discussed in previous sections, such as segmentation (Table 3), classification (Table 4), or feature selection (Table 5).

Image registration is another interesting operation applied to medical images. In general, the images obtained by different techniques must be compared or combined by experts to make decisions. To combine these images properly, they must first be geometrically and temporally aligned. This alignment process is called registration. Table 6 shows several articles that apply swarm-based methods to medical image registration.
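Registration can be cast as optimizing transform parameters against an image-similarity measure. The sketch below recovers a simple integer translation by minimizing a sum-of-squared-differences objective; it is an illustration only — swarm-based registration methods optimize this kind of objective over richer transform models (rotation, scale, deformation), and the tiny synthetic images are invented for the example.

```python
def ssd(fixed, moving, dx, dy):
    """Mean squared difference between the fixed image and the moving image
    shifted by (dx, dy); only overlapping pixels are compared."""
    h, w = len(fixed), len(fixed[0])
    total, n = 0, 0
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                total += (fixed[y][x] - moving[sy][sx]) ** 2
                n += 1
    return total / n

def register(fixed, moving, max_shift=3):
    """Search the translation minimizing SSD; a swarm would sample this same
    objective instead of enumerating every candidate shift."""
    shifts = [(dx, dy) for dx in range(-max_shift, max_shift + 1)
                       for dy in range(-max_shift, max_shift + 1)]
    return min(shifts, key=lambda s: ssd(fixed, moving, s[0], s[1]))

# Synthetic pair: `moving` is `fixed` shifted by (dx=2, dy=1), padded with -1.
fixed = [[6 * y + x for x in range(6)] for y in range(6)]
moving = [[fixed[a + 1][b + 2] if a + 1 < 6 and b + 2 < 6 else -1
           for b in range(6)] for a in range(6)]
print(register(fixed, moving))  # prints (2, 1)
```

Real medical registration replaces SSD with measures such as mutual information when the two images come from different modalities (e.g., CT and MR), but the optimization structure is the same.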

Table 3 Swarm-based methods applied to medical image segmentation

Swarm | References | Image type
ABC   | [155]      | MR brain image
ABC   | [156, 157] | MR brain image (Combines ABC with fuzzy c-means)
ABC   | [158]      | CT images to segment the liver area (Clustering method to segment the liver area)
AA    | [159]      | Fundus photographs for exudate segmentation
AA    | [160, 161] | MR brain image
AA    | [162]      | MR brain image (Combines artificial ants with fuzzy segmentation)
CUS   | [163]      | Microscopic image
CUS   | [164]      | MR brain image to detect brain tumors
CUS   | [165]      | Stomach images (PSO optimizes the parameters for Otsu criterion)
PSO   | [166]      | MR brain image (PSO selects the optimal cluster centers for the fuzzy c-means method that performs segmentation)
PSO   | [167]      | CT images to detect lung tumor (PSO selects the optimal cluster centers for the fuzzy c-means method that performs segmentation)
PSO   | [168]      | Several types of medical images (Active contour-based image segmentation)
PSO   | [169]      | MR angiography (PSO estimates the parameters of a finite mixture model that fits the intensity histogram of the image)
GWO   | [170]      | Skin images to detect melanoma (GWO optimizes a multilayer perceptron neural network designed to detect melanoma)
FPA   | [171]      | CT and MR imaging
BA    | [172]      | MR brain image to detect brain tumors (BA selects the optimal cluster centers for the fuzzy c-means method that performs segmentation)
FA    | [173]      | MR brain image to detect brain tumors (The fitness function of FA uses Tsallis entropy)


Table 4 Swarm-based methods applied to medical image classification

Swarm | References | Objective
ABC   | [174]      | Cervical cancer detection in CT images
GWO   | [175]      | Classification of MR brain images as normal or abnormal (Combines GWO with neural networks)
PSO   | [176]      | Classification of MR brain images as normal or abnormal (The classification is performed by a support vector machine whose parameters are optimized by PSO)
PSO   | [177]      | Detection of breast abnormalities in mammograms (Combines PSO with a neural network)
FA    | [178]      | Breast tumor classification (FA updates the weights of the neural network that performs the classification)

Table 5 Swarm-based methods applied to feature selection in medical images

Swarm | References | Feature selection for. . .
PSO   | [179]      | Skin cancer diagnosis
FA    | [180]      | Detection of brain tumors on MR brain image
ABC   | [181]      | Classification of breast lesion on mammogram images
GWO   | [182]      | Classification of brain images for Alzheimer detection
BA    | [183]      | Classification of cervical lesions as benign and malignant
CUS   | [184]      | Classification of brain tumor by a support vector machine
CUS   | [185]      | Breast tumor identification on mammogram images

Table 6 Swarm-based methods applied to medical image registration

Swarm | References | Applied to. . .
AA    | [186]      | Brain images (The result of the ant-based algorithm is provided to a neural network)
PSO   | [187]      | Several types of images
PSO   | [188]      | Several types of images (Combines PSO and differential evolution)
PSO   | [189]      | Several types of images (Describes several PSO-based methods published for this issue)
GWO   | [190]      | Brain images
CRS   | [191]      | CT and MR images


References

1. Szeliski, R. (2010). Computer vision: Algorithms and applications. Springer Science & Business Media.
2. Panigrahi, B. K., Shi, Y., & Lim, M. H. (2011). Handbook of swarm intelligence: Concepts, principles and applications (Vol. 8). Springer Science & Business Media.
3. Yang, X. S., Cui, Z., Xiao, R., Gandomi, A. H., & Karamanoglu, M. (2013). Swarm intelligence and bio-inspired computation: Theory and applications. Newnes.
4. Abraham, A., Guo, H., & Liu, H. (2006). Swarm intelligence: Foundations, perspectives and applications. In Swarm intelligent systems (pp. 3–25). Springer.
5. Abdulrahman, S. M. (2017). Using swarm intelligence for solving NP-hard problems. Academic Journal of Nawroz University, 6(3), 46–50.
6. Hassanien, A. E., & Emary, E. (2018). Swarm intelligence: Principles, advances, and applications. CRC Press.
7. Slowik, A. (2021). Swarm intelligence algorithms: Modifications and applications. CRC Press.
8. Karaboga, D., & Basturk, B. (2007). A powerful and efficient algorithm for numerical function optimization: Artificial bee colony (ABC) algorithm. Journal of Global Optimization, 39(3), 459–471.
9. Dorigo, M., & Stützle, T. (2019). Ant colony optimization: Overview and recent advances. In Handbook of metaheuristics (pp. 311–351).
10. Mirjalili, S. (2015). The ant lion optimizer. Advances in Engineering Software, 83, 80–98.
11. Yang, X. S. (2010). A new metaheuristic bat-inspired algorithm. In González, J., Pelta, D., Cruz, C., Terrazas, G., & Krasnogor, N. (Eds.), Nature inspired cooperative strategies for optimization (NICSO 2010) (pp. 65–74). Springer. 10.1007/978-3-642-12538-6_6
12. Passino, K. M. (2002). Biomimicry of bacterial foraging for distributed optimization and control. IEEE Control Systems Magazine, 22(3), 52–67.
13. Yang, X. S., & Deb, S. (2009). Cuckoo search via Lévy flights. In 2009 World Congress on Nature & Biologically Inspired Computing (NaBIC) (pp. 210–214). IEEE. 10.1109/NABIC.2009.5393690
14. Chu, S. C., & Tsai, P. W. (2007). Computational intelligence based on the behavior of cats. International Journal of Innovative Computing, Information and Control, 3(1), 163–173.
15. Askarzadeh, A. (2016). A novel metaheuristic method for solving constrained engineering optimization problems: Crow search algorithm. Computers & Structures, 169, 1–12.
16. Yang, X. S., Karamanoglu, M., & He, X. (2014). Flower pollination algorithm: A novel approach for multiobjective optimization. Engineering Optimization, 46(9), 1222–1237.
17. Yang, X. S., & He, X. (2013). Firefly algorithm: Recent advances and applications. International Journal of Swarm Intelligence, 1(1), 36–50.
18. Li, X. L., Shao, Z. J., & Qian, J. X. (2002). An optimizing method based on autonomous animats: Fish-swarm algorithm. Systems Engineering - Theory and Practice, 22(11), 32–38.
19. Mirjalili, S., Mirjalili, S. M., & Lewis, A. (2014). Grey wolf optimizer. Advances in Engineering Software, 69, 46–61.
20. Kennedy, J., & Eberhart, R. (1995). Particle swarm optimization. In Proceedings of ICNN'95 - International Conference on Neural Networks (Vol. 4, pp. 1942–1948). IEEE. 10.1109/ICNN.1995.488968
21. Mirjalili, S., & Lewis, A. (2016). The whale optimization algorithm. Advances in Engineering Software, 95, 51–67.
22. Chen, B., Chen, L., & Chen, Y. (2013). Efficient ant colony optimization for image feature selection. Signal Processing, 93(6), 1566–1576.
23. Kumar, A., Patidar, V., Khazanchi, D., & Saini, P. (2016). Optimizing feature selection using particle swarm optimization and utilizing ventral sides of leaves for plant leaf classification. Procedia Computer Science, 89, 324–332.


24. Naeini, A. A., Babadi, M., Mirzadeh, S. M. J., & Amini, S. (2018). Particle swarm optimization for object-based feature selection of VHSR satellite images. IEEE Geoscience and Remote Sensing Letters, 15(3), 379–383.
25. Andrushia, A. D., & Patricia, A. T. (2020). Artificial bee colony optimization (ABC) for grape leaves disease detection. Evolving Systems, 11(1), 105–117.
26. Ghamisi, P., Chen, Y., & Zhu, X. X. (2016). A self-improving convolution neural network for the classification of hyperspectral data. IEEE Geoscience and Remote Sensing Letters, 13(10), 1537–1541.
27. Su, H., Du, Q., Chen, G., & Du, P. (2014). Optimized hyperspectral band selection using particle swarm optimization. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 7(6), 2659–2670.
28. Ghamisi, P., & Benediktsson, J. A. (2014). Feature selection based on hybridization of genetic algorithm and particle swarm optimization. IEEE Geoscience and Remote Sensing Letters, 12(2), 309–313.
29. Medjahed, S. A., Saadi, T. A., Benyettou, A., & Ouali, M. (2016). Gray wolf optimizer for hyperspectral band selection. Applied Soft Computing, 40, 178–186.
30. Wang, M., Wu, C., Wang, L., Xiang, D., & Huang, X. (2019). A feature selection approach for hyperspectral image based on modified ant lion optimizer. Knowledge-Based Systems, 168, 39–48.
31. Medjahed, S. A., Saadi, T. A., Benyettou, A., & Ouali, M. (2015). Binary cuckoo search algorithm for band selection in hyperspectral image classification. IAENG International Journal of Computer Science, 42(3), 183–191.
32. Xie, F., Li, F., Lei, C., Yang, J., & Zhang, Y. (2019). Unsupervised band selection based on artificial bee colony algorithm for hyperspectral image classification. Applied Soft Computing, 75, 428–440.
33. Su, H., Cai, Y., & Du, Q. (2016). Firefly-algorithm-inspired framework with band selection and extreme learning machine for hyperspectral image classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10(1), 309–320.
34. Mohammadi, F. G., & Abadeh, M. S. (2014). Image steganalysis using a bee colony based feature selection algorithm. Engineering Applications of Artificial Intelligence, 31, 35–43.
35. Chhikara, R. R., Sharma, P., & Singh, L. (2016). A hybrid feature selection approach based on improved PSO and filter approaches for image steganalysis. International Journal of Machine Learning and Cybernetics, 7(6), 1195–1206.
36. Adeli, A., & Broumandnia, A. (2018). Image steganalysis using improved particle swarm optimization based feature selection. Applied Intelligence, 48(6), 1609–1622.
37. Pathak, Y., Arya, K., & Tiwari, S. (2019). Feature selection for image steganalysis using Levy flight-based grey wolf optimization. Multimedia Tools and Applications, 78(2), 1473–1494.
38. Zebari, D. A., Zeebaree, D. Q., Saeed, J. N., Zebari, N. A., & Adel, A. Z. (2020). Image steganography based on swarm intelligence algorithms: A survey. Test Engineering and Management, 7(8), 22257–22269.
39. Nezamabadi-Pour, H., Saryazdi, S., & Rashedi, E. (2006). Edge detection using ant algorithms. Soft Computing, 10(7), 623–628.
40. Tian, J., Yu, W., & Xie, S. (2008). An ant colony optimization algorithm for image edge detection. In 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence) (pp. 751–756). IEEE. 10.1109/CEC.2008.4630880
41. Baterina, A. V., & Oppus, C. (2010). Image edge detection using ant colony optimization. WSEAS Transactions on Signal Processing, 6(2), 58–67.
42. Lu, D. S., & Chen, C. C. (2008). Edge detection improvement by ant colony optimization. Pattern Recognition Letters, 29(4), 416–425.
43. Verma, O. P., Hanmandlu, M., & Sultania, A. K. (2010). A novel fuzzy ant system for edge detection. In 2010 IEEE/ACIS 9th International Conference on Computer and Information Science (pp. 228–233). IEEE. 10.1109/ICIS.2010.145
44. Etemad, S. A., & White, T. (2011). An ant-inspired algorithm for detection of image edge features. Applied Soft Computing, 11(8), 4883–4893.


45. Setayesh, M., Zhang, M., & Johnston, M. (2009). A new homogeneity-based approach to edge detection using PSO. In 2009 24th International Conference Image and Vision Computing New Zealand (pp. 231–236). IEEE. 10.1109/IVCNZ.2009.5378404
46. Yigitbasi, E. D., & Baykan, N. A. (2013). Edge detection using artificial bee colony algorithm (ABC). International Journal of Information and Electronics Engineering, 3(6), 634–638.
47. Dong, N., Wu, C. H., Ip, W. H., Chen, Z. Q., Chan, C. Y., & Yung, K. L. (2012). An opposition-based chaotic GA/PSO hybrid algorithm and its application in circle detection. Computers & Mathematics with Applications, 64(6), 1886–1902.
48. Cuevas, E., Sención-Echauri, F., Zaldivar, D., & Pérez-Cisneros, M. (2012). Multi-circle detection on images using artificial bee colony (ABC) optimization. Soft Computing, 16(2), 281–296.
49. Dasgupta, S., Das, S., Biswas, A., & Abraham, A. (2010). Automatic circle detection on digital images with an adaptive bacterial foraging algorithm. Soft Computing, 14(11), 1151–1164.
50. Li, H., He, H., & Wen, Y. (2015). Dynamic particle swarm optimization and k-means clustering algorithm for image segmentation. Optik, 126(24), 4817–4822.
51. Omran, M. G., Salman, A., & Engelbrecht, A. P. (2006). Dynamic clustering using particle swarm optimization with application in image segmentation. Pattern Analysis and Applications, 8(4), 332–344.
52. Chu, X., Zhu, Y., Shi, J., & Song, J. (2010). Method of image segmentation based on fuzzy c-means clustering algorithm and artificial fish swarm algorithm. In 2010 International Conference on Intelligent Computing and Integrated Systems (pp. 254–257). IEEE.
53. Malisia, A. R., & Tizhoosh, H. R. (2006). Image thresholding using ant colony optimization. In The 3rd Canadian Conference on Computer and Robot Vision (CRV'06) (pp. 26–26). IEEE. 10.1109/CRV.2006.42
54. Han, Y., & Shi, P. (2007). An improved ant colony algorithm for fuzzy clustering in image segmentation. Neurocomputing, 70(4–6), 665–671.
55. Yang, X., Zhao, W., Chen, Y., & Fang, X. (2008). Image segmentation with a fuzzy clustering algorithm based on ant-tree. Signal Processing, 88(10), 2453–2462.
56. Ye, Z., Hu, Z., Wang, H., & Chen, H. (2011). Automatic threshold selection based on artificial bee colony algorithm. In 2011 3rd International Workshop on Intelligent Systems and Applications (pp. 1–4). IEEE. 10.1109/ISA.2011.5873357
57. Horng, M. H. (2010). A multilevel image thresholding using the honey bee mating optimization. Applied Mathematics and Computation, 215(9), 3302–3310.
58. Zhang, Y., & Wu, L. (2011). Optimal multi-level thresholding based on maximum Tsallis entropy via an artificial bee colony approach. Entropy, 13(4), 841–859.
59. Akay, B. (2013). A study on particle swarm optimization and artificial bee colony algorithms for multilevel thresholding. Applied Soft Computing, 13(6), 3066–3091.
60. Bhandari, A. K., Kumar, A., & Singh, G. K. (2015). Modified artificial bee colony based computationally efficient multilevel thresholding for satellite image segmentation using Kapur's, Otsu and Tsallis functions. Expert Systems with Applications, 42(3), 1573–1601.
61. Sri Madhava Raja, N., Rajinikanth, V., & Latha, K. (2014). Otsu based optimal multilevel image thresholding using firefly algorithm. Modelling and Simulation in Engineering, 2014. 10.1155/2014/794574
62. Brajevic, I., & Tuba, M. (2014). Cuckoo search and firefly algorithm applied to multilevel image thresholding. In Yang, X. (Ed.), Cuckoo search and firefly algorithm. Studies in Computational Intelligence (pp. 115–139). Springer.
63. Manic, K. S., Priya, R. K., & Rajinikanth, V. (2016). Image multithresholding based on Kapur/Tsallis entropy and firefly algorithm. Indian Journal of Science and Technology, 9(12), 1–6. 10.17485/ijst/2016/v9i12/89949
64. He, L., & Huang, S. (2017). Modified firefly algorithm based multilevel thresholding for color image segmentation. Neurocomputing, 240, 152–174.


65. Pare, S., Bhandari, A. K., Kumar, A., & Singh, G. K. (2018). A new technique for multilevel color image thresholding based on modified fuzzy entropy and Lévy flight firefly algorithm. Computers & Electrical Engineering, 70, 476–495.
66. Horng, M. H., & Liou, R. J. (2011). Multilevel minimum cross entropy threshold selection based on the firefly algorithm. Expert Systems with Applications, 38(12), 14805–14811.
67. Bhandari, A. K., Singh, V. K., Kumar, A., & Singh, G. K. (2014). Cuckoo search algorithm and wind driven optimization based study of satellite image segmentation for multilevel thresholding using Kapur's entropy. Expert Systems with Applications, 41(7), 3538–3560.
68. Agrawal, S., Panda, R., Bhuyan, S., & Panigrahi, B. K. (2013). Tsallis entropy based optimal multilevel thresholding using cuckoo search algorithm. Swarm and Evolutionary Computation, 11, 16–30.
69. Pare, S., Kumar, A., Bajaj, V., & Singh, G. K. (2017). An efficient method for multilevel color image thresholding using cuckoo search algorithm based on minimum cross entropy. Applied Soft Computing, 61, 570–592.
70. Suresh, S., & Lal, S. (2016). An efficient cuckoo search algorithm based multilevel thresholding for segmentation of satellite images using different objective functions. Expert Systems with Applications, 58, 184–209.
71. Gao, H., Xu, W., Sun, J., & Tang, Y. (2009). Multilevel thresholding for image segmentation through an improved quantum-behaved particle swarm algorithm. IEEE Transactions on Instrumentation and Measurement, 59(4), 934–946.
72. Liu, Y., Mu, C., Kou, W., & Liu, J. (2015). Modified particle swarm optimization-based multilevel thresholding for image segmentation. Soft Computing, 19(5), 1311–1327.
73. Ghamisi, P., Couceiro, M. S., Martins, F. M., & Benediktsson, J. A. (2013). Multilevel image segmentation based on fractional-order Darwinian particle swarm optimization. IEEE Transactions on Geoscience and Remote Sensing, 52(5), 2382–2394.
74. Maitra, M., & Chatterjee, A. (2008). A hybrid cooperative–comprehensive learning based PSO algorithm for image segmentation using multilevel thresholding. Expert Systems with Applications, 34(2), 1341–1350.
75. Duraisamy, S. P., & Kayalvizhi, R. (2010). A new multilevel thresholding method using swarm intelligence algorithm for image segmentation. Journal of Intelligent Learning Systems and Applications, 2(03), 126–138.
76. Yin, P. Y. (2007). Multilevel minimum cross entropy threshold selection based on particle swarm optimization. Applied Mathematics and Computation, 184(2), 503–513.
77. Li, L., Sun, L., Guo, J., Qi, J., Xu, B., & Li, S. (2017). Modified discrete grey wolf optimizer algorithm for multilevel image thresholding. Computational Intelligence and Neuroscience, 2017. 10.1155/2017/3295769
78. Khairuzzaman, A. K. M., & Chaudhury, S. (2017). Multilevel thresholding using grey wolf optimizer for image segmentation. Expert Systems with Applications, 86, 64–76.
79. Satapathy, S. C., Raja, N. S. M., Rajinikanth, V., Ashour, A. S., & Dey, N. (2018). Multilevel image thresholding using Otsu and chaotic bat algorithm. Neural Computing and Applications, 29(12), 1285–1307.
80. Alihodzic, A., & Tuba, M. (2014). Improved bat algorithm applied to multilevel image thresholding. The Scientific World Journal, 2014. 10.1155/2014/176718
81. Liang, Y. C., Chen, A. H. L., & Chyu, C. C. (2006). Application of a hybrid ant colony optimization for the multilevel thresholding in image processing. In King, I., Wang, J., Chan, L., & Wang, D. (Eds.), International Conference on Neural Information Processing. Lecture Notes in Computer Science (Vol. 4233, pp. 1183–1192). Springer.
82. Abd El Aziz, M., Ewees, A. A., Hassanien, A. E., Mudhsh, M., & Xiong, S. (2018). Multiobjective whale optimization algorithm for multilevel thresholding segmentation. In Advances in soft computing and machine learning in image processing (pp. 23–39). Springer.
83. Upadhyay, P., & Chhabra, J. K. (2020). Kapur's entropy based optimal multilevel image segmentation using crow search algorithm. Applied Soft Computing, 97. 10.1016/j.asoc.2019.105522


84. Otsu, N. (1979). A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1), 62–66.
85. Kapur, J. N., Sahoo, P. K., & Wong, A. K. (1985). A new method for gray-level picture thresholding using the entropy of the histogram. Computer Vision, Graphics, and Image Processing, 29(3), 273–285.
86. Tsallis, C. (1988). Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52(1), 479–487.
87. Li, C. H., & Lee, C. (1993). Minimum cross entropy thresholding. Pattern Recognition, 26(4), 617–625.
88. Chandramouli, K., & Izquierdo, E. (2006). Image classification using chaotic particle swarm optimization. In 2006 International Conference on Image Processing (pp. 3001–3004). IEEE. 10.1109/ICIP.2006.312968
89. Wang, B., Sun, Y., Xue, B., & Zhang, M. (2018). Evolving deep convolutional neural networks by variable-length particle swarm optimization for image classification. In 2018 IEEE Congress on Evolutionary Computation (CEC) (pp. 1–8). IEEE. 10.1109/CEC.2018.8477735
90. Fielding, B., & Zhang, L. (2018). Evolving image classification architectures with enhanced particle swarm optimisation. IEEE Access, 6, 68560–68575.
91. Junior, F. E. F., & Yen, G. G. (2019). Particle swarm optimization of deep neural networks architectures for image classification. Swarm and Evolutionary Computation, 49, 62–74.
92. Wang, S., Zhang, Y., Ji, G., Yang, J., Wu, J., & Wei, L. (2015). Fruit classification by wavelet-entropy and feedforward neural network trained by fitness-scaled chaotic ABC and biogeography-based optimization. Entropy, 17(8), 5711–5728.
93. Yang, J., Ye, Z., Zhang, X., Liu, W., & Jin, H. (2017). Attribute weighted Naive Bayes for remote sensing image classification based on cuckoo search algorithm. In 2017 International Conference on Security, Pattern Analysis, and Cybernetics (SPAC) (pp. 169–174). IEEE. 10.1109/SPAC.2017.8304270
94. Chouhan, S. S., Kaul, A., Singh, U. P., & Jain, S. (2018). Bacterial foraging optimization based radial basis function neural network (BRBFNN) for identification and classification of plant leaf diseases: An automatic approach towards plant pathology. IEEE Access, 6, 8852–8863.
95. Omran, M. G., Engelbrecht, A. P., & Salman, A. (2004). Image classification using particle swarm optimization. In K. Tan, M. Lim, X. Yao, & L. Wang (Eds.), Recent advances in simulated evolution and learning (pp. 347–365). World Scientific. 10.1142/9789812561794_0019
96. Omran, M., Engelbrecht, A. P., & Salman, A. (2005). Particle swarm optimization method for image clustering. International Journal of Pattern Recognition and Artificial Intelligence, 19(03), 297–321.
97. Liu, X., Li, X., Liu, L., He, J., & Ai, B. (2008). An innovative method to classify remote-sensing images using ant colony optimization. IEEE Transactions on Geoscience and Remote Sensing, 46(12), 4198–4208.
98. Liu, X., Li, X., Peng, X., Li, H., & He, J. (2008). Swarm intelligence for classification of remote sensing data. Science in China Series D: Earth Sciences, 51(1), 79–87.
99. Omkar, S., Kumar, M. M., Mudigere, D., & Muley, D. (2007). Urban satellite image classification using biologically inspired techniques. In 2007 IEEE International Symposium on Industrial Electronics (pp. 1767–1772). IEEE. 10.1109/ISIE.2007.4374873
100. Zhang, Y., & Wu, L. (2011). Crop classification by forward neural network with adaptive chaotic particle swarm optimization. Sensors, 11(5), 4721–4743.
101. Owechko, Y., & Medasani, S. (2005). Cognitive swarms for rapid detection of objects and associations in visual imagery. In Proceedings of the 2005 IEEE Swarm Intelligence Symposium (SIS 2005) (pp. 420–423). IEEE.
102. Singh, N., Arya, R., & Agrawal, R. (2014). A novel approach to combine features for salient object detection using constrained particle swarm optimization. Pattern Recognition, 47(4), 1731–1739.


103. Ugolotti, R., Nashed, Y. S., Mesejo, P., Iveković, Š., Mussi, L., & Cagnoni, S. (2013). Particle swarm optimization and differential evolution for model-based object detection. Applied Soft Computing, 13(6), 3092–3105.
104. Mussi, L., Cagnoni, S., & Daolio, F. (2009). GPU-based road sign detection using particle swarm optimization. In 2009 Ninth International Conference on Intelligent Systems Design and Applications (pp. 152–157). IEEE.
105. Maldonado, S., Acevedo, J., Lafuente, S., Fernández, A., & López-Ferreras, F. (2010). An optimization on pictogram identification for the road-sign recognition task using SVMs. Computer Vision and Image Understanding, 114(3), 373–383.
106. Tseng, C. C., Hsieh, J. G., & Jeng, J. H. (2009). Active contour model via multi-population particle swarm optimization. Expert Systems with Applications, 36(3), 5348–5352.
107. Horng, M. H., Liou, R. J., & Wu, J. (2010). Parametric active contour model by using the honey bee mating optimization. Expert Systems with Applications, 37(10), 7015–7025.
108. Maboudi, M., Amini, J., Hahn, M., & Saati, M. (2017). Object-based road extraction from satellite images using ant colony optimization. International Journal of Remote Sensing, 38(1), 179–198.
109. Iwin, S., Sasikala, J., & Juliet, D. S. (2019). Optimized vessel detection in marine environment using hybrid adaptive cuckoo search algorithm. Computers & Electrical Engineering, 78, 482–492.
110. Banharnsakun, A., & Tanathong, S. (2014). Object detection based on template matching through use of best-so-far ABC. Computational Intelligence and Neuroscience, 2014. 10.1155/2014/919406
111. Chidambaram, C., & Lopes, H. S. (2009). A new approach for template matching in digital images using an artificial bee colony algorithm. In 2009 World Congress on Nature & Biologically Inspired Computing (NaBIC) (pp. 146–151). IEEE. 10.1109/NABIC.2009.5393631
112. Xu, C., & Duan, H. (2010). Artificial bee colony (ABC) optimized edge potential function (EPF) approach to target recognition for low-altitude aircraft. Pattern Recognition Letters, 31(13), 1759–1772.
113. Zhang, X., Hu, W., Qu, W., & Maybank, S. (2010). Multiple object tracking via species-based particle swarm optimization. IEEE Transactions on Circuits and Systems for Video Technology, 20(11), 1590–1602.
114. Kobayashi, T., Nakagawa, K., Imae, J., & Zhai, G. (2007). Real time object tracking on video image sequence using particle swarm optimization. In 2007 International Conference on Control, Automation and Systems (pp. 1773–1778). IEEE. 10.1109/ICCAS.2007.4406632
115. Ramakoti, N., Vinay, A., & Jatoth, R. K. (2009). Particle swarm optimization aided Kalman filter for object tracking. In 2009 International Conference on Advances in Computing, Control, and Telecommunication Technologies (pp. 531–533). IEEE. 10.1109/ACT.2009.135
116. Walia, G. S., & Kapoor, R. (2014). Intelligent video target tracking using an evolutionary particle filter based upon improved cuckoo search. Expert Systems with Applications, 41(14), 6315–6326.
117. Ljouad, T., Amine, A., & Rziza, M. (2014). A hybrid mobile object tracker based on the modified cuckoo search algorithm and the Kalman filter. Pattern Recognition, 47(11), 3597–3613.
118. Gao, M. L., Shen, J., Yin, L. J., Liu, W., Zou, G. F., Li, H. T., & Fu, G. X. (2016). A novel visual tracking method using bat algorithm. Neurocomputing, 177, 612–619.
119. Gao, M. L., He, X. H., Luo, D. S., Jiang, J., & Teng, Q. Z. (2013). Object tracking using firefly algorithm. IET Computer Vision, 7(4), 227–237. 10.1049/iet-cvi.2012.0207
120. Kanan, H. R., & Faez, K. (2008). An improved feature selection method based on ant colony optimization (ACO) evaluated on face recognition system. Applied Mathematics and Computation, 205(2), 716–725.
121. Kotia, J., Bharti, R., Kotwal, A., & Mangrulkar, R. (2020). Application of firefly algorithm for face recognition. In N. Dey (Ed.), Applications of firefly algorithm and its variants (pp. 147–171). Springer.

Swarm-Based Methods Applied to Computer Vision

122. Ramadan, R. M., & Abdel-Kader, R. F. (2009). Face recognition using particle swarm optimization-based selected features. International Journal of Signal Processing, Image Processing and Pattern Recognition, 2(2), 51–65.
123. Krisshna, N. A., Deepak, V. K., Manikantan, K., & Ramachandran, S. (2014). Face recognition using transform domain feature extraction and PSO-based feature selection. Applied Soft Computing, 22, 141–161.
124. Tiwari, V. (2012). Face recognition based on cuckoo search algorithm. Indian Journal of Computer Science and Engineering, 3(3), 401–405.
125. Jakhar, R., Kaur, N., & Singh, R. (2011). Face recognition using bacteria foraging optimization-based selected features. International Journal of Advanced Computer Science and Applications, 1(3), 106–111.
126. Kumar, D. (2017). Feature selection for face recognition using DCT-PCA and bat algorithm. International Journal of Information Technology, 9(4), 411–423.
127. Raghavendra, R., Dorizzi, B., Rao, A., & Kumar, G. H. (2011). Particle swarm optimization based fusion of near infrared and visible images for improved face verification. Pattern Recognition, 44(2), 401–411.
128. Yadav, D., Vatsa, M., Singh, R., & Tistarelli, M. (2013). Bacteria foraging fusion for face recognition across age progression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 173–179). 10.1109/CVPRW.2013.33
129. Wei, J., Jian-Qi, Z., & Xiang, Z. (2011). Face recognition method based on support vector machine and particle swarm optimization. Expert Systems with Applications, 38(4), 4390–4393.
130. Panda, R., Naik, M. K., & Panigrahi, B. K. (2011). Face recognition using bacterial foraging strategy. Swarm and Evolutionary Computation, 1(3), 138–146.
131. Chakrabarty, A., Jain, H., & Chatterjee, A. (2013). Volterra kernel based face recognition using artificial bee colony optimization. Engineering Applications of Artificial Intelligence, 26(3), 1107–1114.
132. Lu, Y., Zeng, N., Liu, Y., & Zhang, N. (2015). A hybrid wavelet neural network and switching particle swarm optimization algorithm for face direction recognition. Neurocomputing, 155, 219–224.
133. Naik, M. K., & Panda, R. (2016). A novel adaptive cuckoo search algorithm for intrinsic discriminant analysis based face recognition. Applied Soft Computing, 38, 661–675.
134. Nebti, S., & Boukerram, A. (2017). Swarm intelligence inspired classifiers for facial recognition. Swarm and Evolutionary Computation, 32, 150–166.
135. Sánchez, D., Melin, P., & Castillo, O. (2017). Optimization of modular granular neural networks using a firefly algorithm for human recognition. Engineering Applications of Artificial Intelligence, 64, 172–186.
136. Zhang, L., Mistry, K., Neoh, S. C., & Lim, C. P. (2016). Intelligent facial emotion recognition using moth-firefly optimization. Knowledge-Based Systems, 111, 248–267.
137. Mistry, K., Zhang, L., Neoh, S. C., Lim, C. P., & Fielding, B. (2016). A micro-GA embedded PSO feature selection approach to intelligent facial emotion recognition. IEEE Transactions on Cybernetics, 47(6), 1496–1509.
138. Sikkandar, H., & Thiyagarajan, R. (2021). Deep learning based facial expression recognition using improved cat swarm optimization. Journal of Ambient Intelligence and Humanized Computing, 12(2), 3037–3053.
139. Sreedharan, N. P. N., Ganesan, B., Raveendran, R., Sarala, P., & Dennis, B. (2018). Grey wolf optimisation-based feature selection and classification for facial emotion recognition. IET Biometrics, 7(5), 490–499.
140. Mpiperis, I., Malassiotis, S., Petridis, V., & Strintzis, M. G. (2008). 3D facial expression recognition using swarm intelligence. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 2133–2136). IEEE. 10.1109/ICASSP.2008.4518064
141. Padeleris, P., Zabulis, X., & Argyros, A. A. (2012). Head pose estimation on depth data based on particle swarm optimization. In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (pp. 42–49). IEEE.

142. Meyer, G. P., Gupta, S., Frosio, I., Reddy, D., & Kautz, J. (2015). Robust model-based 3D head pose estimation. In Proceedings of the IEEE International Conference on Computer Vision (pp. 3649–3657).
143. Zhang, Y., & Wu, L. (2011). Face pose estimation by chaotic artificial bee colony. International Journal of Digital Content Technology and its Applications, 5(2), 55–63.
144. Oikonomidis, I., Kyriazis, N., & Argyros, A. A. (2010). Markerless and efficient 26-DOF hand pose recovery. In Asian Conference on Computer Vision (pp. 744–757). Springer.
145. Ye, Q., Yuan, S., & Kim, T. K. (2016). Spatial attention deep net with partial PSO for hierarchical hybrid hand pose estimation. In B. Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), European Conference on Computer Vision (pp. 346–361). Springer. 10.1007/978-3-319-46484-8_21
146. Oikonomidis, I., Kyriazis, N., & Argyros, A. A. (2011). Full DOF tracking of a hand interacting with an object by modeling occlusions and physical constraints. In 2011 International Conference on Computer Vision (pp. 2088–2095). IEEE. 10.1109/ICCV.2011.6126483
147. Oikonomidis, I., Kyriazis, N., & Argyros, A. A. (2011). Efficient model-based 3D tracking of hand articulations using Kinect. In J. Hoey, S. McKenna, & E. Trucco (Eds.), British Machine Vision Conference (Vol. 1, pp. 2088–2095). 10.5244/C.25.101
148. Iveković, Š., Trucco, E., & Petillot, Y. R. (2008). Human body pose estimation with particle swarm optimisation. Evolutionary Computation, 16(4), 509–528.
149. Akhtar, S., Ahmad, A., & Abdel-Rahman, E. M. (2012). A metaheuristic bat-inspired algorithm for full body human pose estimation. In 2012 Ninth Conference on Computer and Robot Vision (pp. 369–375). IEEE. 10.1109/CRV.2012.55
150. Robertson, C., & Trucco, E. (2006). Human body posture via hierarchical evolutionary optimization. In British Machine Vision Conference (Vol. 6, pp. 111–118). 10.5244/C.20.102
151. Balaji, S., Karthikeyan, S., & Manikandan, R. (2021). Object detection using metaheuristic algorithm for volley ball sports application. Journal of Ambient Intelligence and Humanized Computing, 12(1), 375–385.
152. John, V., Trucco, E., & Ivekovic, S. (2010). Markerless human articulated tracking using hierarchical particle swarm optimisation. Image and Vision Computing, 28(11), 1530–1547.
153. Zhang, X., Hu, W., Wang, X., Kong, Y., Xie, N., Wang, H., Ling, H., & Maybank, S. (2010). A swarm intelligence based searching strategy for articulated 3D human body tracking. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops (pp. 45–50). IEEE.
154. Thida, M., Eng, H. L., Monekosso, D. N., & Remagnino, P. (2013). A particle swarm optimisation algorithm with interactive swarms for tracking multiple targets. Applied Soft Computing, 13(6), 3106–3117.
155. Hancer, E., Ozturk, C., & Karaboga, D. (2013). Extraction of brain tumors from MRI images with artificial bee colony based segmentation methodology. In 2013 8th International Conference on Electrical and Electronics Engineering (ELECO) (pp. 516–520). IEEE. 10.1109/ELECO.2013.6713896
156. Taherdangkoo, M., Yazdi, M., & Rezvani, M. (2010). Segmentation of MR brain images using FCM improved by artificial bee colony (ABC) algorithm. In Proceedings of the 10th IEEE International Conference on Information Technology and Applications in Biomedicine (pp. 1–5). IEEE. 10.1109/ITAB.2010.5687803
157. Menon, N., & Ramakrishnan, R. (2015). Brain tumor segmentation in MRI images using unsupervised artificial bee colony algorithm and FCM clustering. In 2015 International Conference on Communications and Signal Processing (ICCSP) (pp. 6–9). IEEE. 10.1109/ICCSP.2015.7322635
158. Mostafa, A., Fouad, A., Abd Elfattah, M., Hassanien, A. E., Hefny, H., Zhu, S. Y., & Schaefer, G. (2015). CT liver segmentation using artificial bee colony optimisation. Procedia Computer Science, 60, 1622–1630.
159. Pereira, C., Gonçalves, L., & Ferreira, M. (2015). Exudate segmentation in fundus images using an ant colony optimization approach. Information Sciences, 296, 14–24.
160. Huang, P., Cao, H., & Luo, S. (2008). An artificial ant colonies approach to medical image segmentation. Computer Methods and Programs in Biomedicine, 92(3), 267–273.

161. Lee, M. E., Kim, S. H., Cho, W. H., Park, S. Y., & Lim, J. S. (2009). Segmentation of brain MR images using an ant colony optimization algorithm. In 2009 Ninth IEEE International Conference on Bioinformatics and Bioengineering (pp. 366–369). IEEE. 10.1109/BIBE.2009.58
162. Karnan, M., & Logheshwari, T. (2010). Improved implementation of brain MRI image segmentation using ant colony system. In 2010 IEEE International Conference on Computational Intelligence and Computing Research (pp. 1–4). IEEE. 10.1109/ICCIC.2010.5705897
163. Chakraborty, S., Chatterjee, S., Dey, N., Ashour, A. S., Ashour, A. S., Shi, F., & Mali, K. (2017). Modified cuckoo search algorithm in microscopic image segmentation of hippocampus. Microscopy Research and Technique, 80(10), 1051–1072.
164. Ilunga-Mbuyamba, E., Cruz-Duarte, J. M., Avina-Cervantes, J. G., Correa-Cely, C. R., Lindner, D., & Chalopin, C. (2016). Active contours driven by cuckoo search strategy for brain tumour images segmentation. Expert Systems with Applications, 56, 59–68.
165. Li, Y., Jiao, L., Shang, R., & Stolkin, R. (2015). Dynamic-context cooperative quantum-behaved particle swarm optimization based on multilevel thresholding applied to medical image segmentation. Information Sciences, 294, 408–422.
166. Mekhmoukh, A., & Mokrani, K. (2015). Improved fuzzy C-means based particle swarm optimization (PSO) initialization and outlier rejection with level set methods for MR brain image segmentation. Computer Methods and Programs in Biomedicine, 122(2), 266–281.
167. Kavitha, P., & Prabakaran, S. (2019). A novel hybrid segmentation method with particle swarm optimization and fuzzy c-mean based on partitioning the image for detecting lung cancer. International Journal of Engineering and Advanced Technology, 8(5), 1223–1227.
168. Mandal, D., Chatterjee, A., & Maitra, M. (2014). Robust medical image segmentation using particle swarm optimization aided level set based global fitting energy active contour approach. Engineering Applications of Artificial Intelligence, 35, 199–214.
169. Wen, L., Wang, X., Wu, Z., Zhou, M., & Jin, J. S. (2015). A novel statistical cerebrovascular segmentation algorithm with particle swarm optimization. Neurocomputing, 148, 569–577.
170. Parsian, A., Ramezani, M., & Ghadimi, N. (2017). A hybrid neural network-gray wolf optimization algorithm for melanoma detection. Biomedical Research, 28(8), 3408–3411.
171. Wang, R., Zhou, Y., Zhao, C., & Wu, H. (2015). A hybrid flower pollination algorithm based modified randomized location for multi-threshold medical image segmentation. Bio-medical Materials and Engineering, 26(s1), S1345–S1351. 10.3233/BME-151432
172. Alagarsamy, S., Kamatchi, K., Govindaraj, V., Zhang, Y. D., & Thiyagarajan, A. (2019). Multi-channeled MR brain image segmentation: A new automated approach combining bat and clustering technique for better identification of heterogeneous tumors. Biocybernetics and Biomedical Engineering, 39(4), 1005–1035.
173. Rajinikanth, V., Raja, N. S. M., & Kamalanand, K. (2017). Firefly algorithm assisted segmentation of tumor from brain MRI using Tsallis function and Markov random field. Journal of Control Engineering and Applied Informatics, 19(3), 97–106.
174. Agrawal, V., & Chandra, S. (2015). Feature selection using artificial bee colony algorithm for medical image classification. In 2015 Eighth International Conference on Contemporary Computing (IC3) (pp. 171–176). IEEE. 10.1109/IC3.2015.7346674
175. Ahmed, H. M., Youssef, B. A., Elkorany, A. S., Saleeb, A. A., & Abd El-Samie, F. (2018). Hybrid gray wolf optimizer–artificial neural network classification approach for magnetic resonance brain images. Applied Optics, 57(7), B25–B31.
176. Zhang, Y., Wang, S., Ji, G., & Dong, Z. (2013). An MR brain images classifier system via particle swarm optimization and kernel support vector machine. The Scientific World Journal, 2013. 10.1155/2013/130134
177. Dheeba, J., Singh, N. A., & Selvi, S. T. (2014). Computer-aided detection of breast cancer on mammograms: A swarm intelligence optimized wavelet neural network approach. Journal of Biomedical Informatics, 49, 45–52.
178. Senapati, M. R., & Dash, P. K. (2013). Local linear wavelet neural network based breast tumor classification using firefly algorithm. Neural Computing and Applications, 22(7), 1591–1598.
179. Tan, T. Y., Zhang, L., Neoh, S. C., & Lim, C. P. (2018). Intelligent skin cancer detection using enhanced particle swarm optimization. Knowledge-Based Systems, 158, 118–135.

180. Jothi, G., & Hannah Inbarani, H. (2016). Hybrid tolerance rough set–firefly based supervised feature selection for MRI brain tumor image classification. Applied Soft Computing, 46, 639–651.
181. Santhi, S., & Bhaskaran, V. (2014). Modified artificial bee colony based feature selection: A new method in the application of mammogram image classification. International Journal of Scientific and Technology Research, 3(6), 1664–1667.
182. Shankar, K., Lakshmanaprabu, S., Khanna, A., Tanwar, S., Rodrigues, J. J., & Roy, N. R. (2019). Alzheimer detection using group grey wolf optimization based features with convolutional classifier. Computers & Electrical Engineering, 77, 230–243.
183. Sahoo, A., & Chandra, S. (2017). Multi-objective grey wolf optimizer for improved cervix lesion classification. Applied Soft Computing, 52, 64–80.
184. Kaur, T., Saini, B. S., & Gupta, S. (2018). A novel feature selection method for brain tumor MR image classification based on the Fisher criterion and parameter-free bat optimization. Neural Computing and Applications, 29(8), 193–206.
185. Sudha, M., & Selvarajan, S. (2016). Feature selection based on enhanced cuckoo search for breast cancer classification in mammogram image. Circuits and Systems, 7(04), 327–338.
186. Kavitha, C., & Chellamuthu, C. (2014). Medical image fusion based on hybrid intelligence. Applied Soft Computing, 20, 83–94.
187. Wachowiak, M. P., Smolíková, R., Zheng, Y., Zurada, J. M., & Elmaghraby, A. S. (2004). An approach to multimodal biomedical image registration utilizing particle swarm optimization. IEEE Transactions on Evolutionary Computation, 8(3), 289–301.
188. Talbi, H., & Batouche, M. (2004). Hybrid particle swarm with differential evolution for multimodal image registration. In 2004 IEEE International Conference on Industrial Technology (IEEE ICIT '04) (Vol. 3, pp. 1567–1572). IEEE. 10.1109/ICIT.2004.1490800
189. Rundo, L., Tangherloni, A., Militello, C., Gilardi, M. C., & Mauri, G. (2016). Multimodal medical image registration using particle swarm optimization: A review. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI) (pp. 1–8). IEEE. 10.1109/SSCI.2016.7850261
190. Daniel, E., Anitha, J., Kamaleshwaran, K., & Rani, I. (2017). Optimum spectrum mask based medical image fusion using gray wolf optimization. Biomedical Signal Processing and Control, 34, 36–43.
191. Parvathy, V. S., & Pothiraj, S. (2020). Multi-modality medical image fusion using hybridization of binary crow search optimization. Health Care Management Science, 23(4), 661–669.

Index

A
Affective computing, v, 127–148
Audio, 4–8, 14–23, 26, 27, 127–134, 136–147
Auto-encoder, 253–271
Automatic colorization, vi, 253–271

B
Bio inspired CNN, vi

C
Capsule network, 203–230
Chebyshev polynomial approximation, 282–285, 289, 290, 292
Chest X-ray images, vi, 182–187, 195, 199, 205, 211, 219
Classification, 5, 37, 81, 103, 136, 153, 182, 207, 226, 243, 273, 295, 334
Combinatorial optimization, 311, 318
Computer vision, 8, 9, 29, 35, 61–78, 81–99, 186, 188, 223, 230, 237, 243, 253, 303, 307, 308, 310, 331–346
Confusion matrix (CM), 12, 118, 123, 213, 239
Content based image retrieval, 151–177
Convolutional neural network (CNN), vi, 8, 10, 13–19, 22, 23, 27, 28, 61, 63, 69, 97, 98, 141, 144–146, 152–154, 161, 181–199, 204, 209–211, 214–217, 243–257, 259, 261, 263, 295–304
COVID-19, vi, 98, 181–199, 203–220
Cricket video summarization, 7, 14, 19, 22
Cuckoo search approach, 334, 342

D
Dataset, 1, 68, 104, 128, 153, 184, 204, 231, 243, 254, 274, 296, 309
Deep features for CBIR, 151–177
Deep learning, v, vi, 1, 5, 10, 11, 13, 61, 63, 64, 67, 69–70, 72, 76, 78, 98, 127–148, 152, 153, 175, 181–183, 185–191, 203–207, 210, 214, 223–240, 243–250, 255, 259, 263, 273–292, 296, 297, 303
Deep neural network (DNN), 23, 134, 138, 139, 141, 146, 158, 182, 184, 185, 189, 199, 224, 226, 231, 232, 234, 237, 239, 245, 254
Diabetic, vi, 295–304
Differential evolution, vi, 307–328
Digital image processing, 181
Dimensionality reduction, 37, 274, 278–282, 292, 299
3D point cloud processing, vi, 243–250
Dynamic mode decomposition (DMD), vi, 274, 275, 279, 280, 282–287, 289–292

E
Emotions, 4, 104, 127–132, 134–138, 141–148, 343
Entropy, 11, 37–42, 44, 45, 47–52, 55, 57, 58, 104, 105, 110, 215, 217, 247, 309, 319, 321, 327, 338, 339, 345

F
Feature, 4, 36, 61, 81, 103, 128, 151, 182, 204, 224, 244, 254, 273, 296, 308, 336

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5

Feature descriptor, 61, 64, 66, 67, 152, 156
Fundus, vi, 296, 345

G
Gray level co-occurrence matrix, 36, 309

H
Hamming distance, vi, 151–177
Heuristics, 151, 182, 191, 307, 309, 332
Histogram of oriented gradients (HOG), v, 9, 16, 17, 35–58, 61–64, 66, 67, 71, 73–76, 78, 133, 224–226, 230, 231, 234, 237, 239
Human machine interface (HMI), 223, 237
Hyper spectral image classification, 273–292

I
Image enhancement, vi, 307–328
Image processing, 39, 62–64, 71, 93, 101, 181, 189, 205, 257, 274, 297, 307–309, 313, 318–324, 332, 336, 339, 344–346
Image retrieval, v, 81–83, 85–87, 90, 98, 151–177, 341

K
Key frames, 1, 4, 7, 16, 18, 20, 23, 36–38, 42–54, 56–58
K-means clustering, vi, 18, 22, 69, 156, 338

M
Machine learning (ML), v, 1–29, 36, 61–78, 97, 103–123, 127–148, 183, 184, 188, 189, 205, 210, 231, 253, 271, 274, 297, 302
Multimodal, v, 14, 15, 23, 26, 127–148, 335

N
Nearest neighborhood search, vi, 226, 246

O
Object recognition, v, 81, 83–85, 90, 95, 342
Octree, 245–246, 248
OpenSet Domain adaptation, 274, 291

P
Physiological, 128–131, 134, 135, 137, 139, 140, 143, 144, 147
Population-based methods, 309, 310, 312
Pothole detection, 61–64, 69, 71, 76–78, 97
Prediction, vi, 16, 68, 70, 72, 76–78, 113, 121, 142–146, 181, 183, 196, 205, 211, 213, 214, 217, 218, 254, 263, 268–271, 301

R
Radiometric correlation, 35–58
Retinopathy, vi, 295–304

S
Semantic segmentation, 27, 243–250
Shape descriptor, 85–88, 98
Shape feature extraction, v, 81–99
Shot boundary, 7, 13, 16, 17, 20, 22, 24–26, 35–58
Sports video classification, 19, 22
Sports video highlights, 3
Support vector machine (SVM), 5, 7, 10, 14, 16–22, 24, 26, 36, 63, 68, 72–76, 104, 105, 113–115, 119–122, 138, 139, 141–145, 182, 207, 296, 299–301, 303, 346
Swarm-based methods, 331–346

T
Text, 6, 7, 15, 18, 19, 23, 25, 26, 55, 72, 83, 104, 130–132, 135, 136, 138, 140–143, 145–147, 151
Texture image, v, 103–123
Travelling salesman (TSP) problem, 307, 308, 310–312, 314, 315, 320

V
Video, 1, 35, 62, 104, 127, 230, 340
Video summarization, v, 1–29

X
X Ray, vi, 181–199, 203–208, 211, 213, 215, 218, 219, 344