Big Data Analytics for Intelligent Healthcare Management [1 ed.] 012818146X, 9780128181461

Big Data Analytics for Intelligent Healthcare Management covers both the theory and application of hardware platforms and architectures, the development of software methods, techniques, and tools, and applications, governance, and adoption strategies for the use of big data in healthcare and clinical research.


English Pages 312 [298] Year 2019

Table of contents :
Cover
Big Data Analytics for Intelligent Healthcare Management
Copyright
Contributors
Preface
Acknowledgments
1
Bio-Inspired Algorithms for Big Data Analytics: A Survey, Taxonomy, and Open Challenges
Introduction
Dimensions of Data Management
Big Data Analytical Model
Bio-Inspired Algorithms for Big Data Analytics: A Taxonomy
Evolutionary Algorithms
Swarm-Based Algorithms
Ecological Algorithms
Discussions
Future Research Directions and Open Challenges
Resource Scheduling and Usability
Data Processing and Elasticity
Resilience and Heterogeneity in Interconnected Clouds
Sustainability and Energy-Efficiency
Data Security and Privacy Protection
IoT-Based Edge Computing and Networking
Emerging Research Areas in Bio-Inspired Algorithm-Based Big Data Analytics
Container as a Service (CaaS)
Serverless Computing as a Service (SCaaS)
Blockchain as a Service (BaaS)
Software-Defined Cloud as a Service (SDCaaS)
Deep Learning as a Service (DLaaS)
Bitcoin as a Service (BiaaS)
Quantum Computing as a Service (QCaaS)
Summary and Conclusions
Acknowledgments
References
Further Reading
2
Big Data Analytics Challenges and Solutions
Introduction
Consumable Massive Facts Analytics
Allotted Records Mining Algorithms
Gadget Failure
Facts Aggregation Challenges
Statistics Preservation-Demanding Situations
Information Integration Challenges
Records Analysis Challenges
Scale of the Statistics
Pattern Interpretation Challenges
Arrangements of Challenges
User Intervention Method
Probabilistic Method
Defining and Detecting Anomalies in Human Ecosystems
Demanding Situations in Managing Huge Records
Massive Facts Equal Large Possibilities
Present Answers to Challenges for the Quantity Mission
Hadoop
Hadoop-distributed file system
Hadoop MapReduce
Apache spark
Grid computing
Spark structures
Capacity solutions for records-variety trouble
Image Mining and Processing With Big Data
Potential Answers for Velocity Trouble
Transactional databases
Statistics representation
Massive actualities calculations
Ability solutions for privateers and safety undertaking
Ability Solutions for Scalability Assignments
Big data and cloud computing
Cloud computing service models
Answers
Use record encryption
Imposing access controls
Logging
Discussion
Conclusion
Glossary
References
Further Reading
3
Big Data Analytics in Healthcare: A Critical Analysis
Introduction
Big Data
Healthcare Data
Structured Data
Unstructured Data
Semistructured Data
Genomic Data
Patient Behavior and Sentiment Data
Clinical Data and Clinical Notes
Clinical Reference and Health Publication Data
Administrative and External Data
Medical Image Processing and its Role in Healthcare Data Analysis
Recent Works in Big Data Analytics in Healthcare Data
Architectural Framework and Different Tools for Big Data Analytics in Healthcare Big Data
Architectural Framework
Different Tools Used in Big Data Analytics in Healthcare Data
Challenges Faced During Big Data Analytics in Healthcare
Conclusion and Future Research
References
Further Reading
4
Transfer Learning and Supervised Classifier Based Prediction Model for Breast Cancer
Introduction
Related Work
Dataset and Methodologies
Convolution Neural Networks (CNNs/ConvNets)
Transfer learning and convolution networks
Convolution networks as fixed feature extractors
Dimensionality reduction and principle component analysis (PCA)
Supervised machine learning
Proposed Model
Implementation
Feature Extraction
Dimensionality Reduction
Classification
Tuning Hyperparameters of the Classifiers
Result and Analysis
10-fold Cross Validation Result
Magnification Factor Wise Analysis on Validation Accuracy
Validation accuracy of 40x
Validation accuracy of 100x
Validation accuracy of 200x
Validation accuracy of 400x
Best validation accuracy
Performance on the test set
Result and Analysis of Test Performance
Test performance on 40x
Overall performance on 40x
Test performance on 100x
Overall performance on 100x
Test performance on 200x
Test performance on 400x
Overall performance on 400x
Discussion
Conclusion
References
Further Reading
5
Chronic TTH Analysis by EMG and GSR Biofeedback on Various Modes and Various Medical Symptoms Using IoT
Introduction and Background
Biofeedback
Mental Health Introduction
Importance of Mental Health, Stress, and Emotional Needs and Significance of Study
Meaning of Mental Health
Definitions
Factors Affecting Mental Health
Models of Stress: Three Models in Practice
Types of stress
Causes of stress
Symptoms of stress
Big Data and IoT
Previous Studies (Literature Review)
Tension Type Headache and Stress
Independent Variable: Emotional Need Fulfillment
Meditation-Effective Spiritual Tool With Approach of Biofeedback EEG
Mind-Body and Consciousness
Sensor Modalities and Our Approach
Biofeedback Based Sensor Modalities
Electromyograph
Electrodermograph
Proposed Framework
Experiments and Results-Study Plot
Study Design and Source of Data
Study Duration and Consent From Subjects
Sampling Design and Allocation Process
Sample Size
Study Population
Inclusion criteria
Exclusion criteria
Intervention
Outcome Parameters
Primary variables
Secondary variables
Analgesic Consumption
Assessment of Outcome Variables
Pain Diary
Data Collection
Statistical Analysis
Hypothesis
Data Collection Procedure-Guided Meditation as per Fig. 5.7G
Results, Interpretation and Discussion
The Trend of Average of Frequency
The Trend of Average of Duration
The Trend of Average of Intensity
The Trend of Duration per Cycle With Time
Trend on Correlation of TTH Duration and Intensity
Trend on Correlation of TTH Duration With Occurrence
The Trend of Average of Frequency
The Trend of Average of Duration
The Trend of Average of Intensity
The Trend of Duration per Cycle With Time
Trend on Correlation of TTH Duration and Intensity
Trend on Correlation of TTH Duration With Occurrence
The Trend of Average of Frequency
The Trend of Average Duration
The Trend of Average Intensity
The Trend of Duration per Cycle With Time
Trend on Correlation of TTH Duration and Intensity
Trend on Correlation of TTH Duration With Occurrence
The Trend of Average of Frequency
The Trend of Average of Duration
The Trend of Average Intensity
The Trend of Duration per Cycle With Time
Trend on Correlation of TTH Duration and Intensity
Trend on Correlation of TTH Duration With Occurrence
Findings in This Chapter
Future Scope, Limitations, and Possible Applications
Conclusion
Comprehensive Conclusion
Acknowledgment
References
Further Reading
6
Multilevel Classification Framework of fMRI Data: A Big Data Approach
Introduction
Related Work
Our Approach
Dataset
Methodology
Result Evaluation
Experimental Results
Subject-Dependent Experiments on PS+SP
All features
ROI-based feature
Average ROI-based feature
N-most active-based feature
N-most active ROI-based feature
Subject-Dependent Experiment on PS/SP
ROI-based feature
Average ROI-based feature
N-most active-based feature
Most active ROI-based feature
Result Analysis
Summary of the Subject-Dependent Results
Subject-Independent Experiment
Conclusion and Future Work
References
Further Reading
7
Smart Healthcare: An Approach for Ubiquitous Healthcare Management Using IoT
Introduction
Literature Survey
Proposed Model
Fetch Module
Ingest Module
Retrieve Module
Act/Notify Module
Prototype Model of the Proposed Work
Implementation of the Proposed System
Simulation and Result Discussion
Conclusion
References
8
Blockchain in Healthcare: Challenges and Solutions
Introduction
Roadmap
Healthcare Big Data and Blockchain Overview
Healthcare Big Data
Blockchain
How Blockchain Works
Privacy of Healthcare Big Data
Privacy Right by Country and Organization
How Blockchain Is Applicable for Healthcare Big Data
Digital Trust
Intelligent Data Management
Smart Ecosystem
Digital Supply Chain
Cybersecurity
Interoperability and Data Sharing
Improving Research and Development (R&D)
Fighting Counterfeit Drugs
Collaborative Patient Engagement
Online Access to Longitudinal Data by Patient
Off-Chain Data Storage due to Privacy and Data Size
Blockchain Challenges and Solutions for Healthcare Big Data
GDPR versus Blockchain
Problem statement and key factors of GDPR
Solutions
Off-chain blockchain advantages
Off-chain blockchain disadvantages
Conclusion and Discussion
References
Further Reading
9
Intelligence-Based Health Recommendation System Using Big Data Analytics
Introduction
Background
Recommendation System and Its Basic Concepts
Phases of Recommendation System
Methodology
Filtering techniques
Collaborative-based filtering recommendation system
Evaluation of recommendation system
Health Recommendation System
Designing the Health Recommendation System
Framework for HRS
Methods to Design HRS
Evaluation of HRS
Proposed Intelligent-Based HRS
Dataset Description
Experimental Result Analysis
Advantages and Disadvantages of the Proposed Health Recommendation System Using Big Data Analytics
Conclusion and Future Work
References
Further Reading
10
Computational Biology Approach in Management of Big Data of Healthcare Sector
Introduction
Application of Big Data Analysis
Database Management System and Next Generation Sequencing (NGS)
De novo Assembly, Re-Sequencing, Transcriptomics Sequencing and Epigenetics
Data Collection, Extraction of Genes, and Screening of Drugs
Different Algorithms Related to Docking
Molecular Interactions, Scoring Functions, and Discussion of Some Docking Examples
Conclusions
Acknowledgments
References
11
Kidney-Inspired Algorithm and Fuzzy Clustering for Biomedical Data Analysis
Introduction
Biological Structure of the Kidney
Kidney-Inspired Algorithm
Literature Survey
Proposed Model
Fuzzy C-Means Algorithm
Proposed KA-Based Approach for Biomedical Data Analysis
Obtaining optimal cluster centers using KA
Cluster analysis using optimal cluster centers
Results Analysis
Evaluation Metrics
Experimental Results
Statistical Validity
Conclusion
Acknowledgment
References
Index
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
Y
Z
Back Cover


Big Data Analytics for Intelligent Healthcare Management

Advances in Ubiquitous Sensing Applications for Healthcare

Big Data Analytics for Intelligent Healthcare Management
Volume Three

Series Editors: Nilanjan Dey, Amira S. Ashour, Simon James Fong

Volume Editors:
Nilanjan Dey, Techno India College of Technology, Rajarhat, India

Himansu Das KIIT, Bhubaneswar, India

Bighnaraj Naik VSSUT, Burla, India

Himansu Sekhar Behera VSSUT, Burla, India

Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

© 2019 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-12-818146-1

For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Mara Conner
Acquisition Editor: Chris Katsaropoulos
Editorial Project Manager: Ana Claudia A. Garcia
Production Project Manager: Punithavathy Govindaradjane
Cover Designer: Christian Bilbow
Typeset by SPi Global, India

Contributors

Satyabrata Aich Department of Computer Engineering, Inje University, Gimhae, South Korea
Navneet Arora Indian Institute of Technology, Roorkee, India
Rabindra Kumar Barik KIIT, Bhubaneswar, India
Akalabya Bissoyi Department of Biomedical Engineering, National Institute of Technology, Raipur, India
Dibya Jyoti Bora School of Computing Sciences, Kaziranga University, Jorhat, India
Rajkumar Buyya Cloud Computing and Distributed Systems (CLOUDS) Laboratory, School of Computing and Information Systems, The University of Melbourne, Parkville, VIC, Australia
D.K. Chaturvedi Dayalbagh Educational Institute, Agra, India
Sumit Chauhan ABES Engineering College, Ghaziabad, India
Himansu Das KIIT, Bhubaneswar, India
Satya Ranjan Dash School of Computer Applications, Kalinga Institute of Industrial Technology, Bhubaneswar, India
Pandit Byomakesha Dash Department of Computer Application, Veer Surendra Sai University of Technology, Burla, India
Sukhpal Singh Gill Cloud Computing and Distributed Systems (CLOUDS) Laboratory, School of Computing and Information Systems, The University of Melbourne, Parkville, VIC, Australia
Mayank Gupta Tata Consultancy Services, Noida, India
Somnath Karmakar Government College of Engineering and Leather Technology, Kolkata, India
Ramgopal Kashyap Amity School of Engineering & Technology, Amity University Chhattisgarh, Raipur, India
Fuad Khan Department of Computer Science and Engineering, East West University, Dhaka, Bangladesh
Chul-Soo Kim Department of Computer Engineering, Inje University, Gimhae, South Korea


Hee-Cheol Kim Institute of Digital Anti-Aging Healthcare, Inje University, Gimhae, South Korea
Pradeep Kumar Maharana Department of Physics, Silicon Institute of Technology, Bhubaneswar, India
Sitikantha Mallik KIIT, Bhubaneswar, India
Sushma Rani Martha Orissa University of Agriculture and Technology, Bhubaneswar, India
Bhabani Shankar Prasad Mishra KIIT, Bhubaneswar, India
Chinmaya Misra School of Computer Applications, Kalinga Institute of Industrial Technology, Bhubaneswar, India
Suchismita Mohanty Department of Computer Science and Engineering, College of Engineering and Technology, Bhubaneswar, India
Subhadarshini Mohanty Department of Computer Science and Engineering, College of Engineering and Technology, Bhubaneswar, India
Subasish Mohapatra Department of Computer Science and Engineering, College of Engineering and Technology, Bhubaneswar, India
Maheswata Moharana Department of Chemistry, Utkal University; Department of Hydrometallurgy, CSIR-Institute of Minerals and Material Technology, Bhubaneswar, India
Bighnaraj Naik Department of Computer Application, Veer Surendra Sai University of Technology, Burla, India
Janmenjoy Nayak Department of Computer Science and Engineering, Sri Sivani College of Engineering, Srikakulam, India
Md. Nuruddin Qaisar Bhuiyan Department of Computer Science and Engineering, East West University, Dhaka, Bangladesh
Md. Mehedi Hassan Onik Department of Computer Engineering, Inje University, Gimhae, South Korea
Luina Pani School of Computer Engineering, Kalinga Institute of Industrial Technology, Bhubaneswar, India
Subrat Kumar Pattanayak Department of Chemistry, National Institute of Technology, Raipur, India
Chittaranjan Pradhan KIIT, Bhubaneswar, India


Farhin Haque Proma Department of Computer Science and Engineering, East West University, Dhaka, Bangladesh
Rohit Rastogi ABES Engineering College, Ghaziabad, India
Shamim H. Ripon Department of Computer Science and Engineering, East West University, Dhaka, Bangladesh
Abhaya Kumar Sahoo KIIT, Bhubaneswar, India
Satya Narayan Sahu Orissa University of Agriculture and Technology, Bhubaneswar, India
Santosh Satya Indian Institute of Technology, Delhi, India
Md. Shamsujjoha Department of Computer Science and Engineering, East West University, Dhaka, Bangladesh
Pallavi Sharma ABES Engineering College, Ghaziabad, India
Kanithi Vakula Department of Computer Science and Engineering, Sri Sivani College of Engineering, Srikakulam, India
Vishwas Yadav ABES Engineering College, Ghaziabad, India
Jinhong Yang Department of Healthcare IT, Inje University, Gimhae, South Korea


Preface

Nowadays, the biggest technological challenge in big data is to provide a mechanism for the storage, manipulation, and retrieval of large amounts of data. In this context, the healthcare industry is also challenged with difficulties in capturing, storing, analyzing, and visualizing data. Due to the rapid growth in the volume of information generated on a daily basis, existing infrastructure has become impracticable for handling this issue. So, it is essential to develop better intelligent techniques, skills, and tools to automatically deal with patient data and its inherent insights. Intelligent healthcare management technologies can play an effective role in tackling this challenge and change the future for improving our lives. Therefore, there is increasing interest in exploring and unlocking the value of the massively available data within the healthcare domain. Healthcare organizations also need to continuously discover useful and actionable knowledge and gain insight from raw data for various purposes such as saving lives, reducing medical errors, increasing efficiency, reducing costs, and improving patient outcomes. Thus, data analytics brings a great challenge and also plays an important role in intelligent healthcare management systems. In the last decade, the huge growth of data produced by smart devices has led to the development of various intelligent technologies. These smart devices continuously produce very large amounts of structured and unstructured healthcare data, which is difficult to manage in real-life scenarios. Big data analytics generally uses statistical and machine learning techniques to analyze huge amounts of data. Handling such high-dimensional data with multiobjective problems in healthcare is an open issue in big data. Healthcare data is rapidly growing in volume and dimensionality. Heterogeneous healthcare data in various forms, such as text, images, and video, must be effectively stored, processed, and analyzed to avoid the increasing cost of healthcare and medical errors. This rapid expansion of data makes the development of intelligent healthcare management systems for analysis urgent. The main objective of this edited book is to cover both the theory and applications of hardware platforms and architectures, the development of software methods, techniques, and tools, applications and governance, and adoption strategies for the use of big data in healthcare and clinical research. It aims to provide an intellectual forum for researchers in academia, scientists, and engineers from a wide range of applications to present their latest research findings in this area and to identify future challenges in this fledgling research area. To achieve these objectives, this book includes eleven chapters contributed by promising authors. In Chapter 1, Gill et al. present a broad, methodical literature analysis of bio-inspired algorithms for big data analytics. This chapter will also help in choosing the most appropriate bio-inspired algorithm for big data analytics for a specific type of data, along with promising directions for future research. In Chapter 2, the authors' objective is to examine the potential impact of immense data challenges, open research issues, and distinctive instrument identification in big data analytics.
Chapter 3 covers the terminology related to big data and healthcare data, and discusses in detail the architectural context for big data analytics along with different tools and platforms.


Chapter 4 addresses a machine learning model to automate the classification of benign and malignant tissue images. In Chapter 5, the author describes the use of multimedia and IoT to detect tension-type headache (TTH) and to analyze its chronicity. It also includes the concept of big data for storing and processing the data generated while analyzing TTH stress through the Internet of Things (IoT). Chapter 6 discusses how to train an fMRI dataset with different machine learning algorithms, such as logistic regression and support vector machines, to enhance the precision of classification. In Chapter 7, the authors develop a prototype model for healthcare monitoring systems using the IoT and cloud computing. These technologies allow for monitoring and analyzing various health parameters in real time. In Chapter 8, Onik et al. include an overview, architecture, existing issues, and future scope of blockchain technology for successfully handling the privacy and management of current and future medical records. In Chapter 9, Sahoo et al. describe an intelligent health recommendation system (HRS) that provides an insight into the use of big data analytics for implementing an effective health recommendation engine, and show how to transform the healthcare industry from the traditional scenario to a more personalized paradigm in a tele-health environment. Chapter 10 discusses the interactions between drugs and proteins, studied by means of the molecular docking process. Chapter 11 integrates kidney-inspired optimization and the fuzzy c-means algorithm to solve nonlinear problems of data mining. Topics presented in each chapter are unique to this book and are based on the unpublished work of the contributing authors. In editing this book, we attempted to bring into the discussion all the new trends and experiments that have been performed in intelligent healthcare management systems using big data analytics. We believe this book is ready to serve as a reference for a larger audience such as system architects, practitioners, developers, and researchers.

Nilanjan Dey, Techno India College of Technology, Rajarhat, India
Himansu Das, KIIT, Bhubaneswar, India
Bighnaraj Naik, VSSUT, Burla, India
Himansu Sekhar Behera, VSSUT, Burla, India

Acknowledgments

Completing this edited book successfully was a journey that we undertook over several months. We would like to take the opportunity to express our gratitude to the following people. First of all, we wish to express our heartfelt gratitude to our families, friends, colleagues, and well-wishers for their constant support and cooperation throughout this journey. We also express our gratitude to all the chapter contributors who allowed us to quote their work in this book. In particular, we would like to acknowledge the hard work of the authors and their cooperation during the revisions of their chapters. We are indebted to and grateful for the valuable comments of the reviewers, which enabled us to select these chapters from among the many submitted and to improve their quality. We are grateful to the Elsevier publishing team for their continuous support throughout the entire process of publication.


CHAPTER 1

BIO-INSPIRED ALGORITHMS FOR BIG DATA ANALYTICS: A SURVEY, TAXONOMY, AND OPEN CHALLENGES

Sukhpal Singh Gill, Rajkumar Buyya
Cloud Computing and Distributed Systems (CLOUDS) Laboratory, School of Computing and Information Systems, The University of Melbourne, Parkville, VIC, Australia

1.1 INTRODUCTION

Cloud computing is now the backbone of the modern economy, offering on-demand services to cloud customers through the Internet. To improve the performance and effectiveness of cloud computing systems, new technologies such as internet of things (IoT) applications (healthcare services, smart cities, etc.) and big data are emerging, which further require effective data processing [1]. However, there are two problems in existing big data processing approaches that degrade the performance of computing systems, namely large response time and delay, because data is transferred twice [2]: (1) from computing systems to the cloud and (2) from the cloud to IoT applications. Presently, IoT devices collect data with a huge volume (big data) and variety, and these systems are growing with a velocity of 500 MB/second or more [3]. For IoT-based smart cities, the transferred data is used to make effective decisions through big data analytics. Data is stored and processed on cloud servers after collection and aggregation from smart devices on IoT networks. Further, to process the large volume of data, there is a need for automatic, highly scalable cloud technology, which can further improve the performance of the systems [4]. The literature reports that existing cloud-based data processing systems are not able to satisfy the performance requirements of IoT applications when low response time and latency are needed. Moreover, other reasons for large response time and latency are the geographical distribution of data and communication failures during the transfer of data [5]. Cloud computing systems become bottlenecked due to continually receiving raw data from IoT devices [6]. Therefore, bio-inspired algorithm based big data analytics is an alternative paradigm that provides a platform between computing systems and IoT devices to process user data in an efficient manner [7].

Big Data Analytics for Intelligent Healthcare Management. https://doi.org/10.1016/B978-0-12-818146-1.00001-5 © 2019 Elsevier Inc. All rights reserved.


FIG. 1.1 Dimensions of data management: volume, variety, velocity, veracity, and variability.

FIG. 1.2 Variety of data: text, audio, video, social, transactional, operational, cloud service, and machine-to-machine (M2M) data.

1.1.1 DIMENSIONS OF DATA MANAGEMENT

As identified from the existing literature [1–6], there are five dimensions of data that are required for effective management. Fig. 1.1 shows the dimensions of data management for big data analytics: (1) volume, (2) variety, (3) velocity, (4) veracity, and (5) variability. Volume represents the magnitude of data in terms of data size (terabytes or petabytes). For example, Facebook processes a large amount of data such as millions of photographs and videos. Variety refers to heterogeneity in a dataset, which can contain different types of data. Fig. 1.2 shows the variety of data, which can be text, audio, video, social, transactional, operational, cloud service, or machine-to-machine (M2M) data. Velocity refers to the rate of production of data and analysis for processing a huge amount of data. For example, velocity can be 250 MB/minute or more [3]. Veracity refers to abnormality, noise, and biases in data, while variability refers to the change in the rate of flow of data for generation and analysis. The rest of the chapter is organized as follows. In Section 1.2, we present the big data analytical model. In Section 1.3, we propose the taxonomy of bio-inspired algorithms for big data analytics. In Section 1.4, we analyze research gaps and present some promising directions toward future research in this area. Finally, we summarize the findings and conclude the chapter in Section 1.5.

1.2 BIG DATA ANALYTICAL MODEL

Big data analytics is a term that combines "big data" and "deep analysis," as shown in Fig. 1.3. Every minute, a large amount of user data is transferred from one device to another, which needs high processing power to perform data mining for the extraction of useful information from the database. Fig. 1.3 shows the model for big data analytics, in which an OLTP (on-line transaction processing) system creates transaction data (txn data).


FIG. 1.3 Big data analytical model.

FIG. 1.4 Big data process: data management (acquisition and recording, extraction and cleaning, integration and aggregation) and analytics (modeling and analysis, data interpretation).

A data cube represents big data, out of which the required information can be extracted using data mining. Initially, different types of data come from different users or devices, and the process of data cleansing is performed to remove the irrelevant data and store the clean data in the database [8]. Further, data aggregation is performed to store the data in an efficient manner, because the incoming data contains a variety of data, and a report is generated for easy use in the future. The aggregated data is further stored in data cubes using large storage devices. For deep analysis, feature extraction is performed using data sampling, which generates the required type of data. The deep analysis includes data visualization, model learning (e.g., k-nearest-neighbor, linear regression), and model evaluation [9]. Fig. 1.4 shows the process of big data, which has two main components: data management and analytics. There are five different stages in processing big data: (1) acquisition and recording (to store data), (2) extraction and cleaning (cleansing of data), (3) integration and aggregation (compiling of required data), (4) modeling and analysis (study of data), and (5) data interpretation (representing data in the required form).
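To make the five stages concrete, the following minimal sketch runs such a pipeline end to end on a small tabular dataset. It is an illustrative example only, assuming pandas and scikit-learn are available and using hypothetical column names ("age", "bp", "glucose", "outcome") and a hypothetical file "patients.csv"; it is not the implementation of any approach surveyed in this chapter.

```python
# A minimal sketch of the five-stage big data process of Fig. 1.4, assuming
# tabular patient records with hypothetical columns "age", "bp", "glucose",
# and a binary "outcome" label.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def acquire(path):
    # (1) acquisition and recording: load the raw transaction data
    return pd.read_csv(path)

def clean(df):
    # (2) extraction and cleaning: drop empty and duplicated records
    return df.dropna().drop_duplicates()

def aggregate(df):
    # (3) integration and aggregation: compile only the required columns
    return df[["age", "bp", "glucose", "outcome"]]

def model(df):
    # (4) modeling and analysis: sample, train, and validate a draft model
    X, y = df.drop(columns="outcome"), df["outcome"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf, clf.score(X_test, y_test)

def interpret(score):
    # (5) data interpretation: report the result in the required form
    return "validation accuracy: {:.2%}".format(score)

if __name__ == "__main__":
    records = aggregate(clean(acquire("patients.csv")))  # hypothetical file
    clf, score = model(records)
    print(interpret(score))
```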

1.3 BIO-INSPIRED ALGORITHMS FOR BIG DATA ANALYTICS: A TAXONOMY

This section presents the existing literature on bio-inspired algorithms for big data analytics. The bio-inspired algorithms for big data analytics are categorized into three categories: ecological, swarm-based, and evolutionary. Fig. 1.5 shows the taxonomy of bio-inspired algorithms for big data analytics along with the focus of study (FoS).

1.3.1 EVOLUTIONARY ALGORITHMS

Kune et al. [10] proposed a genetic algorithm (GA) based data-aware group scheduling approach for analytics of big data, which focuses on bandwidth utilization, computational resources, and data dependencies. Moreover, in the GA approach, decoupled data and computational services are provided as cloud services. The results demonstrate that the GA algorithm gives effective results in terms of turnaround time because it processes data using parallel processing. Gandomi et al. [11] proposed a multiobjective genetic programming (GP) algorithm-based approach for big data mining, which is used to develop a concrete creep model that provides unbiased and accurate predictions. The GP model works with both high- and normal-strength concrete. Elsayed and Sarker [12] proposed a differential evolution (DE) algorithm-based big data analytics approach, which uses local search to increase the exploitation capability of the DE algorithm. This approach optimizes the Big Data 2015 benchmark problems with both multi- and single-objective formulations, but it exhibits large computational time. Kashan et al. [13] proposed an evolutionary strategy (ES) algorithm-based big data analytics technique, which processes data efficiently and accurately using parallel scheduling of cloud resources. Further, the ES algorithm minimizes the execution time by partitioning a group of jobs into disjointed sets, in which the same resources execute all the jobs in the same set. Mafarja and Mirjalili [14] proposed a simulated annealing (SA) algorithm-based big data optimization technique, which uses the whale optimization algorithm (WOA) to design various feature selection approaches and improve exploitation by probing the most promising regions. The proposed approach helps to improve the classification accuracy and selects the most useful features for categorization tasks. Further, Barbu et al. [15] proposed an SA algorithm-based feature selection (SAFS) technique for big data learning and computer vision. Based on a criterion, the SAFS algorithm removes variables and tightens a sparsity constraint, which reduces the problem size gradually during the iterations and makes it particularly fit for big data learning. Tayal and Singh [16] proposed big data analytics based on a firefly swarm optimization (FSO) and SA-based hybrid (FSOSAH) technique for a stochastic dynamic facility layout-based multiobjective problem to manage data effectively. Saida et al. [17] proposed the cuckoo search optimization (CO) algorithm-based big data analytics approach for clustering data. Further, different datasets from the UCI Machine Learning Repository are considered to validate the CO algorithm through experimental results, and the algorithm performs better on these datasets in terms of computational efficiency and convergence stability.
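The evolutionary approaches above share a common loop of selection, crossover, and mutation over a population of candidate solutions. The sketch below shows that generic loop applied to feature-subset selection; it is a simplified illustration under assumed settings (a synthetic fitness function that rewards a hypothetical set of informative features and penalizes subset size), not the formulation of any of the cited papers, where the fitness would typically be a model's validation accuracy.

```python
# Generic evolutionary (GA-style) feature-subset selection sketch.
# The fitness function is a stand-in: it rewards a hypothetical set of
# "informative" features and penalizes large subsets.
import random

N_FEATURES, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 50, 0.05
INFORMATIVE = set(range(5))  # assumed ground-truth informative features

def fitness(mask):
    selected = {i for i, g in enumerate(mask) if g}
    return len(selected & INFORMATIVE) - 0.1 * len(selected)

def crossover(a, b):
    # single-point crossover of two parent bit-strings
    point = random.randint(1, N_FEATURES - 1)
    return a[:point] + b[point:]

def mutate(mask):
    # flip each bit with a small probability
    return [1 - g if random.random() < MUTATION_RATE else g for g in mask]

def evolve():
    population = [[random.randint(0, 1) for _ in range(N_FEATURES)]
                  for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        population.sort(key=fitness, reverse=True)
        parents = population[:POP_SIZE // 2]                 # selection
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(POP_SIZE - len(parents))]
        population = parents + children                      # next generation
    return max(population, key=fitness)

if __name__ == "__main__":
    best = evolve()
    print("selected features:", [i for i, g in enumerate(best) if g])
```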


FIG. 1.5 Taxonomy of bio-inspired algorithms for big data analytics: evolutionary, swarm-based, and ecological algorithms, each annotated with its year and focus of study (FoS).


1.3.2 SWARM-BASED ALGORITHMS

Ilango et al. [9] proposed an artificial bee colony (ABC) algorithm-based clustering technique for the management of big data, which identifies the best cluster and performs the optimization for different dataset sizes. The ABC algorithm minimizes the execution time and improves the accuracy. A MapReduce-based Hadoop environment is used for implementation, and results demonstrate that the ABC algorithm delivers a more effective outcome than differential evolution and particle swarm optimization (PSO) in terms of execution time. Raj and Babu [18] proposed a firefly swarm optimization (FSO) algorithm for big data analytics for establishing novel connections in social networks to calculate the possibility of sustaining a social network. In this technique, a mathematical model is introduced to test the stability of the social network, and this reduces the cost of big data management. Wang et al. [19] proposed an FSO algorithm-based hybrid (FSOH) approach for big data optimization to focus on six multiobjective problems. It reduces execution costs but has high computational time complexity. Wang et al. [20] proposed a PSO algorithm-based big data optimization approach to improve online dictionary learning and introduced a dictionary-learning model using the atom-updating stage. The PSO algorithm reduces the heavy computational burdens and improves the accuracy. Hossain et al. [21] proposed a parallel clustered PSO algorithm (PCPSO)-based approach for big data-driven service composition. The PCPSO algorithm handles huge amounts of heterogeneous data and processes data using parallel processing with MapReduce in the Hadoop platform. Lin et al. [22] proposed a cat swarm optimization (CSO) algorithm-based approach for big data classification to choose characteristics during the classification of text for big data analytics. The CSO algorithm uses term frequency-inverse document frequency to improve the accuracy of feature selection. Cheng et al. [23] proposed a swarm intelligence (SI) algorithm-based big data analytics approach for the economic load dispatch problem, and the SI algorithm handles the high-dimensional data, which improves the accuracy of the data processing. Banerjee and Badr [24] proposed an ant colony optimization (ACO) algorithm-based approach for mobile big data using rough sets. The ACO algorithm helps to select an optimal feature for resolved decisions, which aids in effectively managing big data from social networks (tweets and posts). Pan [25] proposed an improved ACO algorithm (IACO)-based big data analytical approach for the management of medical data, such as patient data and operation data, which helps doctors retrieve the required data quickly. Hu et al. [26] proposed a shuffled frog leaping (SFL) approach to perform feature selection for high-dimensional biomedical data. The SFL algorithm maximizes the predictive accuracy by exploring the space of probable subsets to obtain the best group of characteristics and reduce irrelevant features. Manikandan and Kalpana [27] proposed a fish swarm optimization (FSW) algorithm for feature selection in big data. The FSW algorithm reduces the combinatorial problems by employing fish swarming behavior, and this is effective for diverse applications. Social interactions among big data have been modeled using the movement of fish in their search for food. This algorithm provides effective output in terms of fault tolerance and data accuracy.
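As a concrete reference point for the PSO-style methods above, the sketch below implements the canonical particle swarm update (inertia plus cognitive and social terms) on a toy sphere-minimization objective. The objective, swarm size, and coefficients are illustrative assumptions; the cited works adapt this scheme to their own objectives, such as dictionary learning or service composition.

```python
# Canonical PSO sketch on a toy objective (sphere function minimization).
import random

DIM, SWARM, ITERATIONS = 5, 20, 100
W, C1, C2 = 0.7, 1.5, 1.5            # inertia and acceleration coefficients

def objective(x):
    return sum(v * v for v in x)     # toy objective to minimize

particles = [[random.uniform(-5, 5) for _ in range(DIM)] for _ in range(SWARM)]
velocities = [[0.0] * DIM for _ in range(SWARM)]
pbest = [p[:] for p in particles]                # personal best positions
gbest = min(pbest, key=objective)                # global best position

for _ in range(ITERATIONS):
    for i, p in enumerate(particles):
        for d in range(DIM):
            r1, r2 = random.random(), random.random()
            # v = w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x)
            velocities[i][d] = (W * velocities[i][d]
                                + C1 * r1 * (pbest[i][d] - p[d])
                                + C2 * r2 * (gbest[d] - p[d]))
            p[d] += velocities[i][d]
        if objective(p) < objective(pbest[i]):
            pbest[i] = p[:]
    gbest = min(pbest, key=objective)

print("best solution:", [round(v, 4) for v in gbest],
      "cost:", round(objective(gbest), 6))
```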
Elsherbiny et al. [28] proposed the intelligent water drops (IWD) algorithm for workflow scheduling to effectively manage big data. The workflow simulation toolkit is used to test the effectiveness of the IWD-based approach, and results show that the IWD-based approach performs effectively in terms of cost and makespan when compared to the FCFS, Round Robin, and PSO algorithms.


Neeba and Koteeswaran [29] proposed a bacterial foraging optimization (BFO) algorithm to classify the informative and affective content from medical weblogs. Mayo Clinic data is used as a medical data source to evaluate the accuracy of retrieving the relevant information. Ahmad et al. [30] proposed a BFO algorithm for network traffic (BFON) to detect and prevent intrusions during the transfer of big data. Further, it controls the intrusions using a resistance mechanism. Schmidt et al. [31] proposed an artificial immune system (AIS) algorithm-based big data optimization technique to manage and classify flow-based Internet traffic data. To improve the classification performance, the AIS algorithm uses Euclidean distance, and the results demonstrate that this technique produces more accurate results when compared to the Naïve Bayes classifier. George and Parthiban [32] proposed the group search optimization (GSO) algorithm-based big data analytics technique using FSO to perform data clustering for high-dimensional datasets. This technique replaces the worst fitness values in every iteration of the GSO with improved values from FSO to test the performance of clustering data.

1.3.3 ECOLOGICAL ALGORITHMS

Pouya et al. [33] proposed the invasive weed optimization (IWO) algorithm-based big data optimization technique to resolve the multiobjective portfolio optimization task. Further, the uniform design and fuzzy normalization method are used to transform the multiobjective portfolio selection model into a single-objective programming model. The IWO algorithm manages big data more quickly than PSO. Pu et al. [34] proposed a hybrid biogeography-based optimization (BBO) algorithm for multilayer perceptron training under the challenge of analysis and processing of big data. Experimental results show that BBO is effective in training multilayer perceptrons and performs better in terms of convergence when compared to the GA and PSO algorithms. Fong et al. [35] proposed the multispecies optimizer (PS2O) algorithm-based approach for mining big data streams to select features. An incremental classification algorithm is used in the PS2O algorithm to classify the collected data streams pertaining to big data, which enhances the analytical accuracy within a reasonable processing time. Fig. 1.6 shows the evolution of bio-inspired algorithms for big data analytics based on the existing literature discussed above. Fig. 1.7 shows the number of papers published for each category of bio-inspired algorithm per year. This helps to recognize the important types of bio-inspired algorithms [11–23, 25–29, 35] that were highlighted from 2014 to 2018.

FIG. 1.6 Evolution of bio-inspired algorithms for big data analytics: 2014 (CO, GA, BFON), 2015 (FSO, PSO, GSO), 2016 (PCPSO, CSO, SI, IACO, DE, GP, ES, SFL, IWO), 2017 (SA, SAFS, PS2O, FSW, IWD, BFO, AIS, FSOH), and 2018 (ABC, ACO, BBO).

FIG. 1.7 Time count of bio-inspired algorithms for big data analytics: number of papers per year (2014–2018) for the evolutionary, swarm-based, and ecological categories.

FIG. 1.8 Types of analytics for bio-inspired algorithms: text analytics (information extraction, text summarization, question answering, sentiment analysis), audio analytics (LVCSR, phonetic-based systems), video analytics (server-based and edge-based architectures), social media analytics (content-based and structure-based analytics), and predictive analytics (heterogeneity, noise accumulation, spurious correlation, incidental endogeneity).

The literature reports that there are five types of analytics for big data management using bio-inspired algorithms: predictive analytics, social media analytics, video analytics, audio analytics, and text analytics, as shown in Fig. 1.8. Text analytics is a method to perform text mining for the extraction of required data from sources such as news, corporate documents, survey responses, online forums, blogs, emails, and social network feeds. There are four methods for text analytics: (1) sentiment analysis, (2) question answering, (3) text summarization, and (4) information extraction. The information extraction technique extracts structured data from unstructured data, for example, the extraction of a tablet's name, type, and expiry date from a patient's medical data. The text summarization method extracts a concise summary of various documents related to a specific topic. The question answering method uses natural language processing to find answers to questions. The sentiment analysis method examines the viewpoint of people regarding events or products. Audio analytics, or speech analytics, is a process of extracting structured data from unstructured audio data; examples of audio analytics are healthcare or call center data.


Audio analytics has two types: large-vocabulary continuous speech recognition (LVCSR) and the phonetic-based technique. LVCSR performs indexing (to transliterate the speech content of audio) followed by searching (to find an index term). The phonetic-based technique deals with phonemes or sounds and performs phonetic indexing and searching. Video analytics visualizes, examines, and extracts meaningful information from video streams such as CCTV footage, live streaming of sports matches, etc. Video analytics can be performed at end devices (edge) or centralized systems (server). Social media analytics examines the unstructured or structured data of social media websites (platforms that enable an exchange of information among users) such as Facebook, Twitter, etc. There are two kinds of social media analytics: (1) content-based (data posted by users) and (2) structure-based (synthesizing the structural attributes). Predictive analytics is a method that uses historical and current data to predict future outcomes, which can be affected by: heterogeneity (data coming from different sources), noise accumulation (an estimation error during interpretation of data), spurious correlation (uncorrelated variables appearing correlated due to the huge size of the dataset), or incidental endogeneity (predictors or explanatory variables that are not independent of the residual term). Fig. 1.9 shows the different parameters that are considered in different bio-inspired algorithms for big data analytics.

FIG. 1.9 Parameters of different bio-inspired algorithms for big data analytics: analytical technique, NoSQL server, scalability, storage, fault tolerance, agility, virtualization, cost, ease of use, mechanism, and data management.

There are four types of data mining techniques, as studied from the literature: classification, prediction, clustering, and association. In classification, model attributes are used to arrange the data into different categories. The prediction technique is used to find unknown values.


Clustering is an unsupervised technique, which clusters the data based on related attributes. The association technique is used to establish a relationship among different datasets. There are five types of NoSQL database management systems (DBMS) that are used in existing techniques: HBase, Cassandra, MongoDB, Couchbase, and Neo4J. The two different types of mechanism for making decisions in bio-inspired algorithms for big data analytics are: proactive (working on forward-looking decisions, which requires forecasting or text mining) and reactive (decisions based on the requirement for data analytics). Scalability refers to the mechanism of a computing system to scale up or scale down its nodes based on the amount of data transferred for analytics. Big data analytics techniques use a large amount of storage space to store the information needed to perform the different types of analytics and extract the required information. Fault tolerance of a system is the ability to process the user data within the required time frame. The type of data that requires analytics is continually changing, so there is a need for agility-based big data analytical models to process user data in the required format. Virtualization is a technique that is required for cloud-based systems to create virtual machines for processing user data in a distributed manner. Execution cost is the amount of effort that is required to perform big data analytics. Ease of use is defined as the mechanism that explains how easily the system can be used to perform big data analytics. Data management is discussed in Section 1.2. Table 1.1 shows the comparison of bio-inspired algorithms for big data analytics based on different parameters.
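To illustrate the clustering technique described above (grouping records by related attributes), the sketch below runs a minimal k-means loop on synthetic two-dimensional points. It is a generic illustration only, not the specific clustering procedure of any algorithm compared in Table 1.1.

```python
# Minimal k-means sketch: assign points to the nearest center, then move each
# center to the mean of its assigned points, and repeat.
import random

def kmeans(points, k=2, iterations=20):
    centers = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        for j, cluster in enumerate(clusters):
            if cluster:
                centers[j] = tuple(sum(xs) / len(cluster) for xs in zip(*cluster))
    return centers, clusters

if __name__ == "__main__":
    # Synthetic data: two well-separated groups of related attribute values.
    data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)] + \
           [(random.gauss(5, 1), random.gauss(5, 1)) for _ in range(50)]
    centers, clusters = kmeans(data, k=2)
    print("centers:", centers, "cluster sizes:", [len(c) for c in clusters])
```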

1.3.4 DISCUSSIONS

Table 1.1 shows the comparison of bio-inspired algorithms for big data analytics based on different parameters, which helps the reader to choose the most appropriate bio-inspired algorithm. In the current scenario, cloud computing has emerged as the fifth utility of computing and has captured the significant attention of industry and academia for big data analytics. Virtualization technology is progressing continuously, and new models, mechanisms, and approaches are emerging for effective management of big data using cloud infrastructure. Fog computing uses network switches and routers, gateways, and mobile base stations to provide cloud service with the minimal possible network latency and response time. Therefore, fog or edge devices can also perform big data analytics at the edge device instead of at a centralized database or server.

1.4 FUTURE RESEARCH DIRECTIONS AND OPEN CHALLENGES

Bio-inspired algorithm-based big data analytics has several challenges that need to be addressed, such as resource management, usability, data processing, elasticity, resilience, heterogeneity in interconnected clouds, sustainability, energy efficiency, data security, privacy protection, edge computing, and networking.

1.4.1 RESOURCE SCHEDULING AND USABILITY

Cloud resource management is the ability of a computing system to schedule available resources to process user data over the Internet. The cloud uses virtual resources for big data analytics to process user data quickly and cheaply. Virtualization technology provides effective management of cloud resources using bio-inspired algorithms to improve user satisfaction and resource utilization.

Table 1.1 Comparison of Bio-Inspired Algorithms for Big Data Analytics

Technique | Type of Analytics | NoSQL DBMS | Mechanism | Type of Data | Dimensions of Data Management | Data Mining Technique
ABC [9] | Audio | Cassandra | Reactive | Audio | Volume, variety, velocity | Prediction
FSO [18] | Social | MongoDB | Reactive | Social | Volume, variability | Classification
FSOH [19] | Video | HBase | Proactive | Video | Variety, velocity, veracity | Clustering
SA [14] | Text | Neo4J | Proactive | Text | Volume, velocity | Clustering
SAFS [15] | Predictive | HBase | Reactive | Operational | Volume, velocity | Prediction
FSOSAH [16] | Social | Cassandra | Proactive | Social | Variability, veracity | Classification
PSO [20] | Audio | Neo4J | Proactive | Audio | Variability, velocity | Prediction
PCPSO [21] | Predictive | Cassandra | Proactive | Cloud service | Variety, veracity | Association
PS2O [35] | Predictive | HBase | Reactive | M2M data | Volume, velocity, variability | Clustering
CSO [22] | Text | Cassandra | Proactive | Text | Volume, velocity, variability | Prediction
SI [23] | Video | MongoDB | Proactive | Video | Variety, veracity | Prediction
CO [17] | Predictive | Cassandra | Proactive | Operational | Variability, variety | Association
GA [10] | Predictive | Couchbase | Reactive | M2M data | Veracity, variability | Prediction
ACO [24] | Predictive | MongoDB | Proactive | Transactional | Volume, variability, velocity | Clustering
IACO [25] | Social | Cassandra | Proactive | Social | Veracity, variability, velocity | Association
DE [12] | Video | HBase | Reactive | Video | Volume, velocity, variety | Clustering
GP [11] | Audio | Neo4J | Reactive | Audio | Volume, veracity, velocity | Association
ES [13] | Predictive | Cassandra | Proactive | Cloud service | Velocity, variability | Classification
SFL [26] | Predictive | MongoDB | Proactive | Operational | Velocity, veracity, variability | Classification
FSW [27] | Audio | Couchbase | Proactive | Audio | Variety, veracity | Prediction
IWD [28] | Text | HBase | Reactive | Text | Volume, velocity, variety | Clustering
BFO [29] | Predictive | Cassandra | Reactive | Transactional | Volume, variability, velocity | Clustering
BFON [30] | Predictive | HBase | Proactive | Operational | Velocity, veracity, variability | Prediction
AIS [31] | Predictive | Cassandra | Proactive | Transactional | Volume, velocity | Association
IWC [33] | Text | Couchbase | Proactive | Text | Volume, variability, velocity | Classification
BBO [34] | Predictive | Cassandra | Reactive | Cloud service | Volume, variety, veracity, variability | Association
GSO [32] | Predictive | MongoDB | Reactive | Transactional | Volume, velocity, veracity | Clustering


There is a need to optimize the provisioning of cloud resources in existing bio-inspired algorithms for big data analytics. To solve this challenge, a quality of service (QoS)-aware bio-inspired algorithm-based resource management approach is required for the efficient management of big data to optimize the QoS parameters.
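As an illustration of what such a QoS-aware formulation might look like, the sketch below scores a candidate task-to-VM mapping with a weighted sum of normalized response time, cost, and utilization; a bio-inspired scheduler would maximize this score as its fitness. The metrics, weights, and weighted-sum form are assumptions made for illustration, not a published formulation.

```python
# Illustrative QoS-aware fitness function for a bio-inspired scheduler.
# All metrics are assumed to be normalized to [0, 1].
def qos_fitness(schedule, weights=(0.4, 0.3, 0.3)):
    """Score a candidate mapping of tasks to virtual machines.

    `schedule` is assumed to expose three normalized QoS metrics:
    response_time and cost (lower is better), utilization (higher is better).
    """
    w_time, w_cost, w_util = weights
    return (w_time * (1.0 - schedule["response_time"])
            + w_cost * (1.0 - schedule["cost"])
            + w_util * schedule["utilization"])

if __name__ == "__main__":
    candidate = {"response_time": 0.2, "cost": 0.35, "utilization": 0.8}
    print("fitness:", round(qos_fitness(candidate), 3))
```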

1.4.2 DATA PROCESSING AND ELASTICITY

There is a challenge of data synchronization in bio-inspired algorithms because data processing takes place at geographically distributed sites, which increases the overprovisioning and underprovisioning of cloud resources. There is a need to identify the overloaded resources using rapid elasticity, which can handle the data received from different IoT devices. To improve the recoverability of data, there is a need for a data backup technique for big data analytics, which can provide the service during server downtime.

1.4.3 RESILIENCE AND HETEROGENEITY IN INTERCONNECTED CLOUDS

Cloud providers such as Microsoft, Amazon, Facebook, and Google are delivering reliable and efficient cloud services by utilizing various cloud resources such as disk drives, storage devices, network cards, and processors for big data analytics. The complexity of computing systems is increasing with the increasing size of cloud data centers (CDCs), which increases resource failures during big data analytics. Resource failures can include premature termination of execution, data corruption, and service level agreement (SLA) violations. There is a need to gather more information about these failures to make the system more reliable, and there is a need for replication of cloud services to analyze big data in an efficient and reliable manner.

1.4.4 SUSTAINABILITY AND ENERGY-EFFICIENCY

To reduce energy consumption, there is a need to migrate user data to more reliable servers for efficient execution of cloud resources. Moreover, introducing the concept of resource consolidation can increase the sustainability and energy efficiency of a cloud service by consolidating the multiple independent instances of IoT applications.

1.4.5 DATA SECURITY AND PRIVACY PROTECTION

To improve the reliability of distributed cloud services, there is a need to integrate security protocols in the process of big data analytics. Further, there is a need to incorporate authentication modules at different levels of data management.

1.4.6 IoT-BASED EDGE COMPUTING AND NETWORKING

There are a large number of edge devices participating in the IoT-based fog environment to improve computation and reduce latency and response time, which can further increase energy consumption. Fog devices are not able to offer large resource capacity in spite of their additional computation and storage power. There is a need to process the user data at an edge device instead of at the server, which can reduce execution time and cost.



1.5 EMERGING RESEARCH AREAS IN BIO-INSPIRED ALGORITHM-BASED BIG DATA ANALYTICS In addition to the future research directions above, there are various hotspot research areas in bio-inspired algorithm-based big data analytics that need to be addressed in the future, such as containers, serverless computing, blockchain, software-defined clouds, bitcoin, deep learning, and quantum computing. In this section, we discuss these hotspot research areas in the context of bio-inspired algorithm-based big data analytics.

1.5.1 CONTAINER AS A SERVICE (CaaS) Docker is a container-based virtualization technology that can be used for bio-inspired algorithm-based big data analytics across multiple clouds through a lightweight web interface, that is, HUE (Hadoop User Experience). A HUE-based Docker container provides a robust and lightweight container as a service (CaaS) data processing facility in a virtual multicloud environment.
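As a minimal sketch, assuming a local Docker Engine is running and that the publicly available gethue/hue image with its default port 8888 is used (both are assumptions, not prescribed by this chapter), the Docker SDK for Python can launch such a container programmatically:

```python
import docker  # Docker SDK for Python (pip install docker)

# Connect to the local Docker Engine via environment defaults.
client = docker.from_env()

# Launch a HUE container in the background and expose its web UI.
# Image name and port are illustrative assumptions.
hue = client.containers.run(
    "gethue/hue:latest",
    detach=True,
    ports={"8888/tcp": 8888},
    name="hue-caas-demo",
)

print(f"HUE container {hue.short_id} is running at http://localhost:8888")
```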

1.5.2 SERVERLESS COMPUTING AS A SERVICE (SCaaS) Serverless computing can be used for bio-inspired algorithm-based big data analytics without managing the cloud infrastructure and it is effective in processing user data without configuration of the network and resource provisioning. Serverless computing as a service (SCaaS) has two different services: backend-as-a-service (BaaS) and function-as-a-service (FaaS), which can improve the efficiency, robustness, and scalability of big data processing systems and analyze the data quickly.

1.5.3 BLOCKCHAIN AS A SERVICE (BaaS) Blockchain is a distributed database system that can manage a huge amount of data at low cost and provides instant, risk-free transactions. Blockchain as a service (BaaS) dramatically decreases the time needed to process a transaction and increases the security, quality, and integrity of data in bio-inspired algorithm-based big data analytics.
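To make the integrity claim concrete, the following is a purely illustrative, minimal sketch of hash-linked records (not the API of any particular blockchain platform); the health-care readings used as block data are hypothetical:

```python
import hashlib
import json
import time

def make_block(index, data, previous_hash):
    """Create a block whose hash covers its content and its predecessor."""
    block = {
        "index": index,
        "timestamp": time.time(),
        "data": data,
        "previous_hash": previous_hash,
    }
    payload = json.dumps(block, sort_keys=True).encode()
    block["hash"] = hashlib.sha256(payload).hexdigest()
    return block

def verify(chain):
    """Recompute each block's hash and check the links to detect tampering."""
    for prev, block in zip(chain, chain[1:]):
        payload = {k: prev[k] for k in
                   ("index", "timestamp", "data", "previous_hash")}
        digest = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()).hexdigest()
        if block["previous_hash"] != digest:
            return False
    return True

# Build a tiny chain of (hypothetical) health-care transactions.
chain = [make_block(0, {"record": "genesis"}, "0" * 64)]
chain.append(make_block(1, {"patient": "A", "reading": 120}, chain[-1]["hash"]))
chain.append(make_block(2, {"patient": "B", "reading": 95}, chain[-1]["hash"]))

print(verify(chain))                 # True: chain is intact
chain[1]["data"]["reading"] = 999    # tamper with an earlier block
print(verify(chain))                 # False: the hash link to block 2 breaks
```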

1.5.4 SOFTWARE-DEFINED CLOUD AS A SERVICE (SCaaS) A huge amount of data originates from different IoT devices and it is necessary to transfer data from source to destination without any data loss. Software-defined cloud as a service (SCaaS) is a new paradigm, which provides the effective network architecture to move data from IoT devices to a cloud datacenter in a reliable and efficient manner by making intelligent infrastructure decisions. Further, SCaaS offers other advantages for bio-inspired algorithm-based big data analytics such as failure recovery, optimization of network resources, and fast computing power.

1.5.5 DEEP LEARNING AS A SERVICE (DLaaS) Deep learning is a new paradigm for bio-inspired algorithm-based big data analytics to process the user data with high accuracy and efficiency in a real time manner using hybrid learning and training mechanisms. Deep learning as a service (DLaaS) uses a hierarchical learning process to get high-level, complex abstractions as representations of data for analysis and learning of huge chunks of unsupervised data.


1.5.6 BITCOIN AS A SERVICE (BIaaS) Cryptocurrencies are a very popular technology used to provide secure and reliable service for a huge number of financial transactions. Bitcoin as a service (BiaaS) performs real-time data extraction from the blockchain ledger and stores the big data in an efficient manner for bio-inspired algorithm-based big data analytics. BiaaS-based big data analytics provides interesting benefits such as trend prediction, theft prevention, and identification of malicious users.

1.5.7 QUANTUM COMPUTING AS A SERVICE (QCaaS) The new trend of quantum computing helps bio-inspired algorithm-based big data analytics to solve complex problems by handling massive digital datasets in an efficient and quick manner. Quantum computing as a service (QCaaS) allows for quick detection, analysis, integration, and diagnosis from large scattered datasets. Further, QCaaS can search extensive, unsorted datasets to quickly uncover patterns.

1.6 SUMMARY AND CONCLUSIONS This chapter presents a review of bio-inspired algorithms for big data analytics. A comparison of bio-inspired algorithms has been presented based on taxonomy, focus of study, and identified demerits. Bio-inspired algorithms were categorized into three different categories, and the existing literature on big data analytics was investigated to identify open issues. Further, promising research directions are proposed for future research.

GLOSSARY
Big data: the set of huge datasets, which contains different types of data such as video, audio, text, social, etc.
Data management: there are five dimensions of data management for big data analytics: volume, variety, velocity, veracity, and variability.
Big data analytics: the process of extraction of required data from unstructured data. There are five types of analytics for big data management using bio-inspired algorithms: text analytics, audio analytics, video analytics, social media analytics, and predictive analytics.
Bio-inspired optimization: the bio-inspired algorithms used for big data analytics, which can be ecological, swarm-based, and evolutionary algorithms.
Cloud computing: cloud computing offers three main service models: software, platform, and infrastructure. At the software level, the cloud user can utilize, in a flexible manner, an application running on cloud datacenters. At the platform level, the cloud user can access infrastructure to develop and deploy cloud applications. Infrastructure as a service offers access to computing resources such as processors, networking, and storage, and enables virtualization-based computing.

ACKNOWLEDGMENTS One of the authors, Dr. Sukhpal Singh Gill [Postdoctoral Research Fellow], gratefully acknowledges the Cloud Computing and Distributed Systems (CLOUDS) Laboratory, School of Computing and Information Systems, The University of Melbourne, Australia, for awarding him the Fellowship to carry out this research work. This research work is supported by Discovery Project of Australian Research Council (ARC), Grant/Award Number: DP160102414.



REFERENCES [1] S. Khan, X. Liu, K.A. Shakil, M. Alam, A survey on scholarly data: from big data perspective, Inf. Process. Manag. 53 (4) (2017) 923–944. [2] A. Gandomi, M. Haider, Beyond the hype: big data concepts, methods, and analytics, Int. J. Inf. Manag. 35 (2) (2015) 137–144. [3] S. Singh, I. Chana, A survey on resource scheduling in cloud computing: issues and challenges, J. Grid Comput. 14 (2) (2016) 217–264. [4] S.S. Gill, R. Buyya, A taxonomy and future directions for sustainable cloud computing: 360 degree view, ACM Comput. Surv. 51 (6) (2019) 1–37. [5] C.P. Chen, C.Y. Zhang, Data-intensive applications, challenges, techniques and technologies: a survey on big data, Inf. Sci. 275 (2014) 314–347. [6] J. Wang, Y. Wu, N. Yen, S. Guo, Z. Cheng, Big data analytics for emergency communication networks: a survey, IEEE Commun. Surv. Tutorials 18 (3) (2016) 1758–1778. [7] S.S. Gill, I. Chana, M. Singh, R. Buyya, RADAR: self-configuring and self-healing in resource management for enhancing quality of cloud services, in: Concurrency and Computation: Practice and Experience (CCPE), vol. 31, No. 1, Wiley Press, New York, 2019, pp. 1–29, ISSN: 1532-0626. [8] I. Singh, K.V. Singh, S. Singh, Big data analytics based recommender system for value added services (VAS), in: Proceedings of Sixth International Conference on Soft Computing for Problem Solving, Springer, Singapore, 2017, pp. 142–150. [9] S.S. Ilango, S. Vimal, M. Kaliappan, P. Subbulakshmi, Optimization using artificial bee colony based clustering approach for big data, Clust. Comput. (2018) 1–9, https://doi.org/10.1007/s10586-017-1571-3. [10] R. Kune, P.K. Konugurthi, A. Agarwal, R.R. Chillarige, R. Buyya, Genetic algorithm based data-aware group scheduling for big data clouds, in: Big Data Computing (BDC), 2014 IEEE/ACM International Symposium, IEEE, 2014, pp. 96–104. [11] A.H. Gandomi, S. Sajedi, B. Kiani, Q. Huang, Genetic programming for experimental big data mining: a case study on concrete creep formulation, Autom. Constr. 70 (2016) 89–97. [12] S. Elsayed, R. Sarker, Differential evolution framework for big data optimization, Memetic Comput. 8 (1) (2016) 17–33. [13] A.H. Kashan, M. Keshmiry, J.H. Dahooie, A. Abbasi-Pooya, A simple yet effective grouping evolutionary strategy (GES) algorithm for scheduling parallel machines, Neural Comput. & Applic. 30 (6) (2018) 1925–1938. [14] M.M. Mafarja, S. Mirjalili, Hybrid whale optimization algorithm with simulated annealing for feature selection, Neurocomputing 260 (2017) 302–312. [15] A. Barbu, Y. She, L. Ding, G. Gramajo, Feature selection with annealing for computer vision and big data learning, IEEE Trans. Pattern Anal. Mach. Intell. 39 (2) (2017) 272–286. [16] A. Tayal, S.P. Singh, Integrating big data analytic and hybrid firefly-chaotic simulated annealing approach for facility layout problem, Ann. Oper. Res. 270 (1–2) (2018) 489–514. [17] I.B. Saida, K. Nadjet, B. Omar, A new algorithm for data clustering based on cuckoo search optimization, in: Genetic and Evolutionary Computing, Springer, Cham, 2014, pp. 55–64. [18] E.D. Raj, L.D. Babu, A firefly swarm approach for establishing new connections in social networks based on big data analytics, Int. J. Commun. Netw. Distrib. Syst. 15 (2-3) (2015) 130–148. [19] H. Wang, W. Wang, L. Cui, H. Sun, J. Zhao, Y. Wang, Y. Xue, A hybrid multi-objective firefly algorithm for big data optimization, Appl. Soft Comput. 69 (2018) 806–815. [20] L. Wang, H. Geng, P. Liu, K. Lu, J. Kolodziej, R. Ranjan, A.Y. 
Zomaya, Particle swarm optimization based dictionary learning for remote sensing big data, Knowl.-Based Syst. 79 (2015) 43–50.


[21] M.S. Hossain, M. Moniruzzaman, G. Muhammad, A. Ghoneim, A. Alamri, Big data-driven service composition using parallel clustered particle swarm optimization in mobile environment, IEEE Trans. Serv. Comput. 9 (5) (2016) 806–817. [22] K.C. Lin, K.Y. Zhang, Y.H. Huang, J.C. Hung, N. Yen, Feature selection based on an improved cat swarm optimization algorithm for big data classification, J. Supercomput. 72 (8) (2016) 3210–3221. [23] S. Cheng, Q. Zhang, Q. Qin, Big data analytics with swarm intelligence, Ind. Manag. Data Syst. 116 (4) (2016) 646–666. [24] S. Banerjee, Y. Badr, Evaluating decision analytics from mobile big data using rough set based ant colony, in: Mobile Big Data, Springer, Cham, 2018, pp. 217–231. [25] X. Pan, Application of improved ant colony algorithm in intelligent medical system: from the perspective of big data, Chem. Eng. 51 (2016) 523–528. [26] B. Hu, Y. Dai, Y. Su, P. Moore, X. Zhang, C. Mao, J. Chen, L. Xu, Feature selection for optimized highdimensional biomedical data using the improved shuffled frog leaping algorithm, IEEE/ACM Trans. Comput. Biol. Bioinform. 15 (2016) 1765–1773. [27] R.P.S. Manikandan, A.M. Kalpana, Feature selection using fish swarm optimization in big data. Clust. Comput. (2017) 1–13, https://doi.org/10.1007/s10586-017-1182-z. [28] S. Elsherbiny, E. Eldaydamony, M. Alrahmawy, A.E. Reyad, An extended intelligent water drops algorithm for workflow scheduling in cloud computing environment, Egypt Inform. J. 19 (1) (2018) 33–55. [29] E.A. Neeba, S. Koteeswaran, Bacterial foraging information swarm optimizer for detecting affective and informative content in medical blogs, Clust. Comput. (2017) 1–14, https://doi.org/10.1007/s10586-017-1169-9. [30] K. Ahmad, G. Kumar, A. Wahid, M.M. Kirmani, Intrusion detection and prevention on flow of Big Data using bacterial foraging, in: Handbook of Research on Securing Cloud-Based Databases With Biometric Applications, IGI Global, 2014, p. 386. [31] B. Schmidt, A. Al-Fuqaha, A. Gupta, D. Kountanis, Optimizing an artificial immune system algorithm in support of flow-based internet traffic classification, Appl. Soft Comput. 54 (2017) 1–22. [32] G. George, L. Parthiban, Multi objective hybridized firefly algorithm with group search optimization for data clustering, in: Research in Computational Intelligence and Communication Networks (ICRCICN), 2015 IEEE International Conference, IEEE, 2015, pp. 125–130. [33] A.R. Pouya, M. Solimanpur, M.J. Rezaee, Solving multi-objective portfolio optimization problem using invasive weed optimization, Swarm Evol. Comput. 28 (2016) 42–57. [34] X. Pu, S. Chen, X. Yu, L. Zhang, Developing a novel hybrid biogeography-based optimization algorithm for multilayer perceptron training under big data challenge, Sci. Program. 2018 (2018) 1–7. [35] S. Fong, R. Wong, A.V. Vasilakos, Accelerated PSO swarm search feature selection for data stream mining big data, IEEE Trans. Serv. Comput. 9 (1) (2016) 33–45.

FURTHER READING S.S. Gill, I. Chana, R. Buyya, IoT based agriculture as a cloud and big data service: the beginning of digital India, JOEUC 29 (4) (2017) 1–23.

CHAPTER 2
BIG DATA ANALYTICS CHALLENGES AND SOLUTIONS
Ramgopal Kashyap
Amity School of Engineering & Technology, Amity University Chhattisgarh, Raipur, India

2.1 INTRODUCTION There is a rising form of analytics called prescriptive analytics, which recommends one or more courses of action and shows the probable outcome of each choice; businesses need to overcome several challenges to achieve the benefits of such analytics. This chapter aims to address the challenges of empowering predictive big data analytics. The first challenge is to run analytics on heterogeneous data to meet the business need of analyzing records from many sources, such as relational databases, Excel files, Twitter, and Facebook. The data arrive in different formats, structured, semistructured, and unstructured, and are distributed across various data stores: relational databases, NoSQL databases, and file systems; the need is to place them in a format that can be processed by the data-mining algorithms [1]. Most of the existing libraries use an extract-transform-load (ETL) operation to extract the records from the original stores and to transform their layout into an appropriate schema [2]. This approach is time-consuming and requires that all data be acquired in advance. Fig. 2.1 shows how blood pressure monitoring, intelligent pillbox, and blood sugar monitoring services use the Internet of Things and servers in hospitals, shopping malls, bus stops, and restaurants for ad hoc service in cases of emergency.
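As an illustrative sketch of the ETL step described above (the file names, table, and column names below are hypothetical, not taken from this chapter), heterogeneous records can be pulled into one tabular layout before mining:

```python
import json
import sqlite3
import pandas as pd

# Extract: read records from heterogeneous, hypothetical sources.
relational = pd.read_sql("SELECT patient_id, bp_systolic FROM vitals",
                         sqlite3.connect("hospital.db"))
spreadsheet = pd.read_excel("clinic_readings.xlsx")      # Excel file
with open("tweets.json") as fh:                           # social media dump
    social = pd.json_normalize(json.load(fh))

# Transform: rename columns into one agreed-upon schema.
spreadsheet = spreadsheet.rename(columns={"id": "patient_id",
                                          "systolic": "bp_systolic"})
social = social.rename(columns={"user.id": "patient_id"})

# Load: a single table that downstream mining algorithms can consume.
unified = pd.concat([relational, spreadsheet], ignore_index=True)
unified.to_csv("unified_vitals.csv", index=False)
```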

2.1.1 CONSUMABLE MASSIVE FACTS ANALYTICS The fact that such investigation is multidisciplinary makes it difficult for organizations to find the specialized skills required to embrace big data analytics. Consumable analytics provides the essential capabilities for managing this task and for overcoming the unavailability of analytical skills [3] by making analytics simpler to apply [4]. Consumable analytics refers to developing capabilities that are effective and current in a business enterprise by creating tools that make analytics easier to build, manage, and consume [5]. A consumable interface may be a public interface or a programming language for monitoring health-care information such as blood pressure, weight, and sugar level, using the gateway shown in Fig. 2.2.




FIG. 2.1 Service through the Internet of Things for patients via servers.

FIG. 2.2 Gateway providing health care data.


2.1.2 ALLOTTED RECORDS MINING ALGORITHMS Most of the existing data mining libraries, such as R, WEKA, and RapidMiner, only support sequential, single-machine execution of data mining algorithms. This makes these libraries unsuitable for coping with the massive volumes of big data [6]. Scalable distributed data mining libraries, such as Apache Mahout, Cloudera Oryx, Oxdata H2O, MLlib, and Deeplearning4j, rewrite the data mining algorithms to run in a distributed fashion on Hadoop and Spark. Those libraries are developed by inspecting the algorithms for components that can be executed in parallel and rewriting them. This procedure is complicated and time-consuming, and the quality of the modified algorithm depends entirely on the developers' expertise [7]. It makes these libraries difficult to develop, maintain, and extend, and such know-how in big data is especially important: it is essential that the data one is relying on are well analyzed. The additional need for IT experts is a challenge for big data, according to McKinsey's study "Big Data: The Next Frontier for Innovation." This is evidence that for a business enterprise to take on a big data initiative, it has either to hire experts or to educate existing personnel in the new discipline [8].
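As an illustrative sketch of running a distributed mining algorithm on Spark (a hypothetical CSV of patient vitals on HDFS is assumed; this is not code from the chapter):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("distributed-clustering").getOrCreate()

# Hypothetical dataset: numeric vitals for a large patient cohort.
df = spark.read.csv("hdfs:///data/patient_vitals.csv",
                    header=True, inferSchema=True)

# Assemble feature columns into the single vector column MLlib expects.
features = VectorAssembler(inputCols=["bp_systolic", "glucose", "weight"],
                           outputCol="features").transform(df)

# K-means runs in parallel across the cluster's executors.
model = KMeans(k=3, seed=42, featuresCol="features").fit(features)
print(model.clusterCenters())

spark.stop()
```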

2.1.3 GADGET FAILURE Failure affects the process of storing data and makes it harder to work with; one remedy is to create a persistent connection between the devices that are sending records and the system receiving them. The "sender" makes certain that the "receiver" has no gaps in the data that should be saved: the loop continues until the receiving system tells the sending machine that everything dispatched has been stored. Such a simple acknowledgment scheme can prevent data loss, but it can also slow down the whole process [9]. To avoid this, for any content that is transmitted, the sender should generate a "key." To aid understanding, this solution is comparable to an MD5 hash generated over compressed content; in this case, however, the keys are compared automatically. Losing records is not only a hardware problem. Software can just as easily malfunction and cause irreparable, and riskier, information loss. If one hard drive fails, there is usually another one to back it up, so no data are lost; however, when software fails because of a programming "bug" or a flaw in the design, the data are lost for good. To overcome this, programmers have developed series of tools to lessen the impact of a software breakdown [10]. A simple example is Microsoft Word, which periodically saves the work that a user creates to protect against loss in case of hardware or software failure; autosaving prevents complete data loss.
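A minimal sketch of the checksum idea mentioned above, using Python's standard hashlib and zlib (the file name is hypothetical):

```python
import hashlib
import zlib

def transfer_key(payload: bytes) -> str:
    """Compress the content and derive an MD5 digest the receiver can verify."""
    return hashlib.md5(zlib.compress(payload)).hexdigest()

# Sender side: compute the key before dispatching the content.
content = open("vitals_batch_001.csv", "rb").read()
sent_key = transfer_key(content)

# Receiver side: recompute the key over what actually arrived.
received = content          # in reality, read from the network or disk
ok = transfer_key(received) == sent_key
print("transfer verified" if ok else "retransmission required")
```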

2.1.4 FACTS AGGREGATION CHALLENGES Currently, the approach most generally used for aggregating huge portions of data is to copy the data onto a massive storage drive and then ship the drive to the destination. Even so, big data study initiatives usually involve more than one organization, distinct geographic locations, and large numbers of researchers. This is inefficient and presents a barrier to data exchange among groups using such techniques [11]. Other means involve the use of networks to transfer the files. However, shifting vast amounts of information into or out of a data repository (e.g., a data warehouse) is a large networking undertaking.



2.1.5 STATISTICS PRESERVATION-DEMANDING SITUATIONS Because big health data comprise extensive collections of datasets, it is difficult to effectively store and preserve the records on a single drive using traditional information management systems such as relational databases [12]. Additionally, doing so imposes a heavy cost and time burden on IT within a small organization or lab.

2.1.6 INFORMATION INTEGRATION CHALLENGES This stage entails integrating and transforming information into the right layout for subsequent data analysis. Combining unstructured data is the primary challenge for big data analytics (BDA). Even for structured electronic health record (EHR) data integration, there are numerous troubles [13], as shown in Fig. 2.3. A problem arises when health information stored in an Oracle database on machine X is transferred to a MySQL database on machine Y: the Oracle database and the MySQL database use different data structures to keep the records [14]. Also, machine X might use a "NUMBER" data type to store patients' sex, whereas system Y may use the "CHAR" data type. Metadata describe the characteristics of a resource [15]. Within the relational database model, the column names are used as metadata to explain the characteristics of the stored data.

FIG. 2.3 EHR data incorporation challenges.


There are two significant troubles in metadata integration. First, different database systems use unique metadata to describe content. For example, one system might use "sex" while another might use "gender" when describing a patient. A computer does not know that "sex" and "gender" can be semantically comparable. Second, there are issues when mapping simple metadata to composite metadata. For example, a computer cannot automatically map the metadata "PatientName" in one system into the composite metadata "FirstName" + "LastName" in the other system. Also, different structures using different coding schemes for recording information can cause code-mapping problems. For instance, the SNOMED CT and ICD-10 codes for the disease "abscess of foot" are different, and a cross-mapping between the coding systems has not been developed [16].
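A minimal sketch of the kind of explicit mapping such integration requires (the dictionaries, record values, and column names below are hypothetical, not taken from the chapter):

```python
# Hypothetical metadata crosswalk between two EHR systems.
COLUMN_SYNONYMS = {"sex": "gender"}          # simple metadata mapping

def normalize_record(record: dict) -> dict:
    """Rename synonymous columns and split composite fields."""
    out = {}
    for key, value in record.items():
        out[COLUMN_SYNONYMS.get(key, key)] = value
    # Composite mapping: PatientName -> FirstName + LastName.
    if "PatientName" in out:
        first, _, last = out.pop("PatientName").partition(" ")
        out["FirstName"], out["LastName"] = first, last
    return out

system_x = {"PatientName": "Asha Verma", "sex": 1}
print(normalize_record(system_x))
# {'gender': 1, 'FirstName': 'Asha', 'LastName': 'Verma'}
```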

2.2 RECORDS ANALYSIS CHALLENGES The essential expectation for health BDA is to utilize computational models to predict complex human phenomena from diverse and enormously scaled datasets. The difficulty of choosing or building predictive models grows with the complexity of the analysis problem. If the analysis problem is essentially "what is the average age of diabetic patients worldwide?", a straightforward aggregation algorithm can obtain the answer in time linear in the size of the records; information sources such as medical data, social data, video, and survey data are shown in Fig. 2.4.

FIG. 2.4 Big data analysis data sources.



If the posed question is NP-hard, however, the computing time may be superexponential [17]. In some cases, the Bayesian network is the preferred family of models for representing knowledge in computational science and bioinformatics; owing to the complexity of Bayesian networks, the computing time for finding a useful network increases exponentially as the amount of data increases.

2.2.1 SCALE OF THE STATISTICS For a complex query such as "list every diabetic patient with congestive heart failure who is younger than the average diabetic patient worldwide," it is difficult to answer rapidly when the table contains seven billion rows without indexing. It would take at least 15 days to obtain the result using a single PC.

2.2.2 PATTERN INTERPRETATION CHALLENGES Many people assume that more data always provide better information for making decisions. However, the tools of big data technology and knowledge do not shield us from skews, gaps, and faulty assumptions [18]. Experience also shows that, with large datasets, sizeable costs are incurred when the goal is making information as straightforward as it appears in everyday world data (Fig. 2.5); as social-networking data keep expanding, their analysis remains a challenge [19, 20].

FIG. 2.5 Data archive and analysis: world day-to-day data and social media and health data flow into the data archive and data analysis.


2.3 ARRANGEMENTS OF CHALLENGES Unstructured data are extremely difficult to integrate and systematize adequately while in a raw format. Consequently, record extraction techniques are executed to extract essential, logically dependent information from the raw data. Furthermore, several answers have been proposed for unstructured data reconciliation [21]. The issue with these methodologies is that a large portion of them are domain-specific, that is, the technique is most effectively applied to one particular kind of dataset. Few generic systems exist for integrating unstructured records. Answers for structured-data integration are classified into the following basic strategies.

2.3.1 USER INTERVENTION METHOD Automated schema-mapping (a portion of the metadata) or instance-mapping algorithms always produce errors. Those errors can be corrected by a few domain experts; however, this method cannot work for big data integration, as there is far too much metadata to check for errors manually. Many researchers therefore advocate utilizing crowd feedback for improving the integration. An example of this is a decision-based methodology that addresses complex information-integration issues [22]. A computer framework is first supplied with hundreds of manual health-care mappings and then recognizes the most plausible match for tables and the related fields within the schema by using matching rules. Those rules deal with the semantic complexity found in health-care databases. The main gain of this model is that user interventions drastically boost schema-mapping accuracy.

2.3.2 PROBABILISTIC METHOD The probabilistic integration system assigns probabilities to candidate correspondences among sets of schema elements. After the probabilities are assessed, a threshold is used to pick matching and nonmatching items; in this manner, the uncertainty produced throughout the merge procedure is eliminated. The probabilistic technique attempts to create a mediated schema automatically from the source data and the apparent semantic mappings between the sources and the mediated schema [23]. It avoids the issues of human intervention. Likewise, which models ought to be tested before deploying them? There are various methodologies for validating the models: (1) use statistical validity to determine whether there are issues in the records or in the model, (2) separate the data into training and testing sets to test the accuracy of the models, (3) ask domain specialists to check whether the observed patterns are meaningful in the central situation. To manage the security-demanding circumstances, we can utilize privacy-preserving data mining algorithms for knowledge discovery while ensuring confidentiality. Governments can likewise build complete policies to protect document security [24]. Difficult circumstances have been identified with providing assurances to the public. Currently, big data are expected to produce potential benefits for society, even though their utilization can also create challenges, for example, governance issues, data quality, and privacy issues, all of which can arise from the use of big data by the general public within a territory.
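A minimal sketch of the threshold idea, using string similarity from Python's standard difflib as a stand-in for whatever matching probability an integration tool would estimate (the column names and the 0.7 cutoff are illustrative assumptions):

```python
from difflib import SequenceMatcher
from itertools import product

source_columns = ["PatientName", "sex", "date_of_birth"]
target_columns = ["patient_name", "gender", "dob"]

THRESHOLD = 0.7  # illustrative cutoff between "matching" and "nonmatching"

for src, tgt in product(source_columns, target_columns):
    score = SequenceMatcher(None, src.lower(), tgt.lower()).ratio()
    if score >= THRESHOLD:
        print(f"match: {src} -> {tgt} (p={score:.2f})")
```

Note that a purely lexical score misses semantically equivalent pairs such as "sex"/"gender," which is exactly why the mediated-schema probabilities still need validation.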



FIG. 2.6 Human body as a source of data: frequency of visits (daily, twice a week, weekly, monthly).

2.3.3 DEFINING AND DETECTING ANOMALIES IN HUMAN ECOSYSTEMS An overarching challenge when attempting to measure, or to discover anomalies in, human ecosystems is the characterization of abnormality. The nature and detection of what may constitute socioeconomic "anomalies," and how they differ, may be much less clear cut than anomalies in the realm of detecting sickness outbreaks or monitoring malfunctions in other types of dynamic structures, such as modern car engines [25]. Data size is increasing exponentially every day and is creating a large amount of health data, as shown in Fig. 2.6.

2.4 DEMANDING SITUATIONS IN MANAGING HUGE RECORDS The data required for analytical and computational purposes are strongly heterogeneous, which brings typical integration problems of both data and schema, and the data are additionally affected by the advent of new architectures for analytics [26]. A massive challenge in managing big data is the transformation of unstructured data into a suitable, structured layout that supports meaningful analytics [27]. Data scaling is a problematic issue, as information volume is growing faster than computing resources, while CPU speeds are static [28]. The design of a system that correctly handles this growth should also yield systems that can deliver data within a given period more quickly [29]. The integration of big data is multidimensional and multidisciplinary and requires a combined approach, which poses a broad challenge.

2.5 MASSIVE FACTS EQUAL LARGE POSSIBILITIES Big data have many implications for patients, providers, researchers, payers, and other health-care constituents. They will change how those players interact with the health-care ecosystem, specifically where external information, regionalization, globalization, mobility, and social networking are concerned. In the older model, health-care centers and other providers were incentivized to keep patients in treatment; that is, more inpatient days translated into extra revenue [30]. The trend with new models, and with today's accountable care organizations, is to change incentives and to compensate providers for keeping patients healthy. Equally, patients are increasingly demanding information about their health-care options so that they can understand their choices and participate in decisions about their care. Patients also play a vital part in lowering health-care costs and improving results: when patients are supplied with accurate and current information and guidance, these facts help them make better decisions and improve adherence to treatment programs [31]. Routine records are convenient for gathering demographics and clinical data; another data source is the information that patients disclose about themselves. When combined with outcomes, information provided by patients can become a valuable source of records for researchers and others seeking to reduce costs, boost positive consequences, and enhance treatment. Several demanding situations exist with self-reported records:
• Accuracy: People tend to understate their weight and their involvement in bad behaviors such as smoking; meanwhile, they tend to overstate healthy behaviors such as exercise [32].
• Privacy worries: People are usually reluctant to reveal information about themselves because of privacy and other issues. Creative approaches are needed to discover information and to inspire patients to disclose it without adversely impacting their records [19, 20].
• Consistency: Benchmarks require definition and linkage to offer consistency in self-reported records, using health-care means to eliminate errors and to increase the usefulness of the data in rules and guidelines [33].
• Facility: Mechanisms such as e-health and m-health, which are up-and-coming, versatile, and interpersonal, will in the future need to be used imaginatively to facilitate patients' capacity for accurate self-reporting. Supplying anonymized statistics can concurrently enhance levels of self-reporting as a community develops among members [34].



2.5.1 PRESENT ANSWERS TO CHALLENGES FOR THE QUANTITY MISSION 2.5.1.1 Hadoop Hadoop tools are well suited to dealing with vast volumes of structured, semistructured, and unstructured records. As a relatively new technology, Hadoop impresses many experts, but a lot has to be learned, and at some point attention is diverted from the primary objective toward becoming acquainted with Hadoop itself. Apache Hadoop is an open-source implementation of the MapReduce framework proposed by Google. It allows the distributed processing of datasets on the order of petabytes across hundreds or thousands of commodity PCs connected within a network. It is routinely used to run parallel applications that examine large amounts of data. The following two sections present Hadoop's two essential components: HDFS and MapReduce.

2.5.1.2 Hadoop-distributed file system The Hadoop Distributed File System (HDFS) is the storage component of Hadoop; it is designed to store very large datasets reliably on clusters and to stream that data at high throughput to client applications. HDFS stores file-system metadata and application data separately. By default, it stores three independent copies of each data block (replication) to ensure reliability, availability, and performance.

2.5.1.3 Hadoop MapReduce Hadoop MapReduce is a parallel programming framework for distributed processing, implemented on top of HDFS. The Hadoop MapReduce engine contains a JobTracker and a number of TaskTrackers. When a MapReduce job is executed, the JobTracker splits it into smaller tasks (map and reduce) handled by the TaskTrackers. In the Map step, the master node takes the input, partitions it into smaller subproblems, and distributes them to worker nodes. Each worker node processes a subproblem and emits its results as key/value pairs. In the Reduce step, the values with a common key are gathered and processed by the same machine to form the final output.
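A minimal, single-process sketch of the Map and Reduce steps just described (plain Python, not the Hadoop API; the input records are illustrative):

```python
from collections import defaultdict

records = ["diabetes", "hypertension", "diabetes", "asthma", "diabetes"]

# Map step: emit a key/value pair for every input record.
mapped = [(diagnosis, 1) for diagnosis in records]

# Shuffle step: group all values that share the same key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce step: aggregate each key's values into the final output.
reduced = {key: sum(values) for key, values in groups.items()}
print(reduced)   # {'diabetes': 3, 'hypertension': 1, 'asthma': 1}
```

In Hadoop, the map and reduce functions run on many TaskTrackers in parallel rather than in one process, but the key/value flow is the same.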

2.5.1.4 Apache spark Apache Spark is an open-source, in-memory data analytics and cluster-computing framework developed in the AMPLab at UC Berkeley. As a MapReduce-like cluster computing engine, Spark also has desirable properties such as scalability and fault tolerance, as MapReduce does [35]. The essential abstraction of Spark is the Resilient Distributed Dataset (RDD), which makes Spark a general-purpose framework well suited to iterative jobs, including PageRank computation, K-means clustering, and so forth. RDDs are unique to Spark and, as such, distinguish Spark from standard MapReduce engines. Moreover, thanks to RDDs, applications on Spark can keep data in memory across queries and reconstruct data lost during failures. An RDD is a read-only data collection, which can be either a dataset stored in an external storage system, for instance HDFS, or a derived dataset created from other RDDs. RDDs store rich information, such as their partitions and a set of dependencies on parent RDDs called lineage; with the help of the lineage, Spark recovers lost data quickly and effectively. Spark shows excellent performance on iterative computation because it can reuse intermediate results and keep data in memory across multiple parallel tasks [36].
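A short sketch of the RDD behaviour described above, using PySpark (the HDFS path and record layout are hypothetical); the cached RDD is reused across two actions instead of being recomputed:

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-lineage-demo")

# Base RDD backed by an external store (path is illustrative).
lines = sc.textFile("hdfs:///data/observations.txt")

# Derived RDDs; each transformation only records lineage, nothing runs yet.
readings = lines.map(lambda line: float(line.split(",")[1]))
abnormal = readings.filter(lambda value: value > 140.0).cache()

# Two actions reuse the cached partitions; lost partitions would be
# rebuilt from the recorded lineage (textFile -> map -> filter).
print(abnormal.count())
print(abnormal.take(5))

sc.stop()
```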


2.5.1.5 Grid computing Grid computing is built from a number of servers interconnected by a high-speed network; each of the servers performs one or many roles. The two predominant benefits of grid computing are the large storage capability and the processing power, realized as data grids and computational grids, respectively [37].

2.5.1.6 Spark structures The Spark usage model, plus in-memory computing, creates significant performance gains for high-volume and diverse data. All these methods permit companies and groups to explore massive volumes of data and to gain business insights from them. There are viable approaches to address the volume problem: we can either reduce the data or invest in appropriate infrastructure, and, based on our cost budget and requirements, we can pick the technologies and methods described earlier [38]. If we have staff with expertise in Hadoop, we can certainly use them.

2.5.1.7 Capacity solutions for records-variety trouble Data processing can be achieved using OLAP (online analytical processing) tools, which establish connections among information and assemble it into a logical format for easy access. OLAP professionals can achieve high speed and less frequent lag time when processing high-volume records. OLAP tools handle all of the documents provided to them, regardless of whether they are relevant or not, and this is one of their drawbacks [39]. Apache Hadoop is open-source software whose most fundamental purpose is to manage vast quantities of data in a very short period with great ease. Hadoop can divide data among multiple systems in an infrastructure that is able to process them; a map of the content is created in Hadoop so it can be accessed and queried without difficulty. SAP HANA is an in-memory data platform that is deployable as on-premise equipment or within the cloud. It is a platform well suited to performing real-time analytics and to developing and deploying real-time applications. New database and indexing architectures make sense of disparate data sources swiftly.

2.5.2 IMAGE MINING AND PROCESSING WITH BIG DATA Different medical image segmentation approaches have been proposed, and many significant improvements have been obtained. Nonetheless, because of inadequacies in health-care imaging systems, medical images can contain different varieties of artifacts. These artifacts can distort the image data and confound the pathology. Magnetic resonance imaging technology can moderate some artifacts, and some require subsequent correction. In clinical research, the natural phenomena encountered are noise, intensity inhomogeneity, and partial volume effects, which are considered open issues in medical image segmentation [40]. There are various procedures to partition an image into regions that are homogeneous. Not every method is suitable for medical image analysis, given the complexity and errors involved [41]. No standard image segmentation technique can produce satisfactory results for all imaging applications, such as brain MRI, brain tumor analysis, and so forth. The optimal selection of features, tissues, brain, and nonbrain components is thought of as the principal constraint for brain-image segmentation. Exact segmentation beyond the full field of view is another limitation, and operator guidance and manual thresholding are further barriers to segmenting the brain image. During segmentation, validation of the results is another source of difficulty [19, 20]. Image segmentation is the problem of separating foreground objects from the background in an image. It is among the most fundamental problems in computer vision, and it has intrigued many investigators over the years. As the everyday use of computers increases, dependable image segmentation is required in more applications in the industrial, medical, and personal fields. Fully automatic segmentation remains an open problem because of the wide variety of possible object combinations, so the use of human "hints" is unavoidable; interactive image segmentation has therefore become increasingly popular among researchers [42]. The objective of interactive segmentation is to separate the object(s) from the background in an accurate way, utilizing user input in a way that requires minimal interaction and minimal response time. This work begins by describing general ways to categorize segmentation approaches, proceeds with a thorough study of existing shape-based interactive image-segmentation methods, and finishes by presenting a new combined editing and segmentation tool. Image segmentation is the pivotal issue in image analysis and image understanding; it is also a fundamental problem of computer vision and pattern recognition [43]. Active contour models (ACM) are among the best techniques for image segmentation; the central idea of ACM is to evolve a curve, guided by specific constraints, to extract the required object. These classical active contour models are divided into edge-based and region-based types, each with its own advantages and drawbacks, and the characteristics of the images govern the choice between them in applications. The edge-based model forms an edge-stopping function using image-edge information, which can stop the contour on the object boundaries; because this term is driven by the image gradient, it can miss the correct boundaries for images with strong noise or weak edges. On the other hand, a region-based model uses statistical information to build a region-stopping function that can halt the contour evolution between distinct regions [44]. Compared to the edge-based model, this model performs better for images with blurred edges; it is not sensitive to the initialization of the level-set function and can detect the object's boundaries. Region-based models are preferred for image segmentation because they offer an improvement over the edge-based model in several respects, yet they also have limitations. The classical region-based models proposed for binary images, under the assumption that every image region is homogeneous, do not work well for images with intensity inhomogeneity; they are sensitive to the initial contour, and the evolving curve may be trapped in local minima. In addition, the Chan-Vese (CV) technique is not suitable for fast processing because, in every iteration, the average intensities inside and outside the previous contour must be recomputed, as shown in Fig. 2.7. Various segmentation methods give different results for the same data, which increases the calculation time [45]. The local binary fitting model, by embedding local image information, can segment images with intensity inhomogeneity considerably more accurately than earlier systems. The essential idea is to introduce a Gaussian kernel; although it segments well the images with intensity inhomogeneity, it has a high computational time and complexity, so the segmentation process takes considerable time compared with older segmentation systems. Zhang proposed an active contour strategy driven by local image-fitting energy, which gives the same segmentation result with less time complexity than local binary fitting.



FIG. 2.7 Comparative study of methods giving different segmentation results.

Reinitializing the level-set function to a signed distance function throughout the evolution was used to maintain stable evolution and desirable results; in practice, however, the reinitialization process is often convoluted and expensive. The region-based level-set method with a modified signed distance function, which eliminates the need for reinitialization and a regularization term, worked well under the high-intensity-inhomogeneity problem and gave better outcomes in comparison with other techniques. It uses both edge and region information to segment an image into nonoverlapping regions and controls the evolving curve through the membership degree of the current pixel inside or outside the active contour. It is implemented through a signed pressure function that uses the local information in the image. The proposed strategy can segment images with intensity inhomogeneity and is applied to MR images to demonstrate the consistency, adequacy, and robustness of the algorithm.
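As a hedged illustration of the region-based (Chan-Vese) family discussed above, scikit-image ships an implementation that can be run on one of its sample images; the parameter values below are defaults or illustrative choices, a recent scikit-image release (with the max_num_iter keyword) is assumed, and this is not the code of the methods cited in the chapter:

```python
import numpy as np
from skimage import data, img_as_float
from skimage.segmentation import chan_vese

# Sample grayscale image shipped with scikit-image (illustrative input).
image = img_as_float(data.camera())

# Region-based active contour: evolves a level set without relying on edges.
segmentation = chan_vese(image, mu=0.25, lambda1=1.0, lambda2=1.0,
                         tol=1e-3, max_num_iter=200,
                         init_level_set="checkerboard")

print("foreground fraction:", np.mean(segmentation))
```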

2.5.3 POTENTIAL ANSWERS FOR VELOCITY TROUBLE Flash memory is much needed for caching records, in particular in dynamic solutions that can classify data as either hot (frequently accessed data) or cold (rarely accessed files).

2.5.3.1 Transactional databases A transactional database is a database management system with the capability to roll back or undo a database transaction or operation if it is not completed appropriately. Such systems can be outfitted with real-time analytics and a quicker reaction for making choices [46]. A private cloud extended with a hybrid model allows bursting of the extra computational power needed for data analysis and makes it possible to identify the hardware, software, and business-process changes required to address high-velocity information needs. Data sampling is another framework: statistical sampling techniques are used to select, manipulate, and analyze a representative subset of data variables to uncover patterns and trends within the big data under review. Sampling shrinks the data and can help deal appropriately with issues such as volume and velocity. Hybrid SaaS/PaaS/IaaS frameworks combined with cloud computation and storage can greatly relieve the speed problem, but at the same time we must manage the security and privacy of data.
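A minimal sketch of the rollback behaviour described at the start of this subsection, using Python's built-in sqlite3 module (the table and values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vitals (patient_id INTEGER, bp_systolic INTEGER)")

try:
    with conn:  # transaction: commits on success, rolls back on any error
        conn.execute("INSERT INTO vitals VALUES (?, ?)", (1, 120))
        conn.execute("INSERT INTO vitols VALUES (?, ?)", (2, 95))  # typo -> error
except sqlite3.OperationalError:
    pass  # the half-finished transaction was rolled back automatically

print(conn.execute("SELECT COUNT(*) FROM vitals").fetchone()[0])  # -> 0
```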

2.5.3.2 Statistics representation If the information concerned is of high quality, visualization works well, because visualizing the data lets us see where outliers and unneeded data lie. To maintain that quality, organizations need data management: governance or data-control frameworks exist to verify that the data are clean, for instance by grouping the data together, or "binning," so that one can effectively picture the records.

2.5.3.3 Massive actualities calculations Data quality and relevance is not a small issue; it matters from the moment we begin handling data, and every bit of data that a firm produces counts. Dirty records are too expensive to keep, costing organizations in the United States many billions of dollars every year. In addition to careful capture and cleaning of the data, big data algorithms can be a smooth step toward cleaning it: there are many algorithms and patterns available, or we can build our own routines to work through the data, clean it, and make it usable [47].

2.5.3.4 Ability solutions for privateers and safety undertaking Cloud organizations that store tremendous amounts of data in the cloud are the conventional means of storage, and because of this we need to address their security aspects. To verify a cloud organization, regular security reviews are required that include a disclaimer for paying penalties if security benchmarks have not been met. Real-time security surveillance or tracking must be part of accessing the records, and threat assessment and intelligence ought to be used to anticipate unauthorized access to information. If an anonymous party accesses the documents, there ought to be a guarantee that every piece of sensitive data is removed from the collected dataset before processing and analysis begin. Continuous security monitoring of enterprise access must verify that there is no case of unauthorized access, and threat intelligence should be in place so that attacks are recognized and organizations can respond to dangers rapidly. The use of an authentication strategy is a way of checking a user's or a device's identity before it accesses the machine [48]. Implementing access controls (authorization) is a method for granting access privileges to users or machines in order to strengthen security. Without key management, document-layer encryption is insufficient if an attacker or a programmer can get access to the encryption keys. In most cases, administrators keep keys on local disks for brief and straightforward access; however, this is noticeably dangerous and insecure, as keys can be recovered without difficulty by the platform administrator or an attacker. Therefore, a key-management service for key distribution, issuing distinct certificates and handling separate keys for every installation, application, interface, and individual instead of holding one key, is the safest strategy against these difficulties. Logging attacks, failures, or anomalies is fundamental, and the logs are the best evidence when reviewing event data: log archives supply a place to look when something fails or is hacked. So, to meet safety and privacy concerns, we must audit the whole system, using secure communication to exchange reports among nodes, interfaces, processes, and applications. Personal data can be abused, and the improper use of personal information must be limited, particularly through linkages from more than one source of data; thus, unapproved and unauthenticated use of data should be prevented. There are two typical systems to protect the security and privacy of big data. The first is to guard access to the data by means of authentication or access controls on the data entries.

2.5.4 ABILITY SOLUTIONS FOR SCALABILITY ASSIGNMENTS 2.5.4.1 Big data and cloud computing A solution in the cloud will scale in a simpler and faster way than an on-premises arrangement. Keeping data anchored in a cloud avoids trouble because the cloud can be grown and partitioned, so that it can be considered nearly boundless. An example of in-house storage of large data is clustered network-attached storage [49]. The setup would begin with a network-attached storage (NAS) pod consisting of a few PCs joined to a PC used as the NAS device; several NAS pods would then be joined to each other through the PC used as the NAS controller. Clustered NAS storage is an expensive prospect for a small-to-medium sized business, whereas a cloud provider can supply the essential storage space at greatly reduced cost. Analyzing a large amount of data is done using a programming paradigm called MapReduce. In the MapReduce paradigm, a query is made, and the data are mapped to find key values considered to relate to the request; the results are then reduced to a dataset that answers the query [50]. The MapReduce paradigm requires that enormous proportions of data be broken down. The mapping is done at the same time by each distinct NAS device, so the mapping requires parallel processing [51]. The parallel processing needs of MapReduce are costly and require the storage architecture noted above. Cloud providers can cope with these necessities.

2.5.4.2 Cloud computing service models Common association models for conveyed processing fuse the arrangement as an organization (PaaS), programming as an organization (SaaS), establishment as an organization (IaaS), and hardware as an organization (HaaS). Cloud-sending game plans can provide benefits that associations would not have the ability to oversee. Associations can in like manner use cloud-sending plans as a test measure before grasping another application or large advancement. For associations using the cloud for PaaS, stage as a service is the use of appropriate registering to offer steps to the headway and method of custom applications (Salesforce.com, 2012). The PaaS courses of action consolidate the application plan and headway mechanical assemblies, application testing, shaping, mix, sending and encouraging, state organization, and other related change gadgets. Associations achieve cost venture reserves using PaaS through the systematization and use of the cloud-based stage over different applications. Distinctive purposes in using PaaS include cutting down threats by using pretested developments, progressing with



shared organizations, improving programming security, and cutting down mastery necessities required for new structures headway. As related to extensive data, PaaS gives associations a phase for making and using anticipated custom applications that would look at large measures of unstructured data quickly and for the most part in a safe and sheltered space. The business would not charge for hardware, only for the information-exchange limit related to the time and number of key customers. The fundamentally favored viewpoint of SaaS is that the course of action empowers associations to move the risks identified with programming acquirement while moving IT from being receptive to proactive. Points of interest for the use of SaaS are easier programming association, modified updates and fix organization, programming comparability over the business, fewer joint exertion requests, and overall transparency. Programming as a service provides associations that are separating large data with programming answers for data examination. The refinement among SaaS and PaaS for this circumstance is that SaaS wouldn’t provide an altered course of action while PaaS will empower the association to develop an answer exclusively fitted to the association’s needs. In the IaaS show, a client business will pay for each usage explanation behind the use of rigging to figure out errands, including limit, gear, servers, and framework organization equipment. An organizational system is the disseminated figuring model that gets the most thought from the market, with a craving for 25% of endeavors expecting to get a procommunity for IaaS. Organizations available to associations through the IaaS indicate that they join disaster recovery, figure as an organization, accumulate as an organization, cultivate the server as an organization, and have a virtual work region system and cloud impacting, which provides apex stack capacity to variable strategies. Favorable circumstances of IaaS consolidate extended money-related flexibility, choice of organizations, business agility, fiscally astute adaptability, and extended security. Although not previously being used as comprehensively as PaaS, SaaS, or IaaS, HaaS has a cloud advantage in the model of timesharing on minicomputers and incorporated servers from the 1970s.

2.5.4.3 Answers RTTS (Real-Time Technology Solutions) conducted a survey and discovered that 60% of organizations tested their big data manually in 2013. Manual checking refers to evaluating data sets extracted from databases and data warehouses by eye. In addition to manual inspection of the data, profiling applications should be used to obtain metadata about data properties and to discover data quality problems. With big data there is an even greater motive to automate testing exercises. Automated checking routines are needed, but the degree of automation may be limited because of the variety of the data. The velocity of the data must also be taken into consideration, and performance problems still need to be overcome. Speed issues can be overcome with proper performance testing, which identifies bottlenecks in the system. A Hadoop performance-monitoring tool can capture performance metrics such as job completion time and throughput, and system-level metrics such as memory utilization are also part of a performance check. Testing unstructured data is very time consuming and complicated. A variety of data sources can be verified after the data are converted into a structured format with the aid of custom-built scripts; the first step is to turn the records into a formal layout [52]. Unstructured data are transformed into a structured format using a scripting language such as Pig, and semistructured data are converted if there are recognized patterns. Big data sizes are continually increasing, from a few dozen terabytes in 2012 to many petabytes of data in a single data set today. Big data creates a great opportunity for the world economy, both in the field of national security and in areas ranging from marketing and credit-risk analysis to medical research and city planning. The great benefits of big data are lessened by privacy and data protection concerns. Any security control used for big data must meet the following requirements: it must not compromise the basic functionality of the cluster; it must scale in the same way as the cluster; it must not compromise essential big data characteristics; and it must address a security threat to big data environments or to data stored within the cluster [53].
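The automated checking and profiling routines mentioned at the start of this subsection can be prototyped with very little code. The sketch below is a minimal, hypothetical example (the column names, the age rule, and the null-ratio threshold are illustrative assumptions, not requirements from this chapter) of how a test team might automatically profile a batch of extracted records before loading them into a warehouse.

```python
import csv
from collections import Counter

def profile_records(path, expected_columns, max_null_ratio=0.05):
    """Profile a CSV extract: row count, per-column null ratios, simple type violations."""
    nulls = Counter()
    bad_ages = 0
    rows = 0
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        missing = set(expected_columns) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"schema check failed, missing columns: {missing}")
        for row in reader:
            rows += 1
            for col in expected_columns:
                if not (row.get(col) or "").strip():
                    nulls[col] += 1
            # Illustrative domain rule: age must be a non-negative integer.
            if not row.get("age", "0").isdigit():
                bad_ages += 1
    report = {
        "rows": rows,
        "type_violations_age": bad_ages,
        "null_ratios": {c: (nulls[c] / rows if rows else 0.0) for c in expected_columns},
    }
    report["passed"] = rows > 0 and all(
        ratio <= max_null_ratio for ratio in report["null_ratios"].values()
    )
    return report

if __name__ == "__main__":
    print(profile_records("patients_extract.csv", ["patient_id", "age", "diagnosis_code"]))
```

In practice such checks would be scheduled for every extract so that data-quality problems are caught before the data reach the analytical cluster.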

2.5.4.4 Use record encryption Encryption guarantees the confidentiality and privacy of consumer data and secures sensitive information. Encryption protects records if malicious users or administrators gain access to the data and directly inspect files, because it renders stolen files or copied disk images unreadable. Data-layer encryption provides consistent protection across different platforms regardless of the OS/platform type, and it meets our requirements for big data security. Open-source products are available for most Linux systems; commercial products additionally provide external key management and full support. It is a cost-effective way to cope with numerous data security threats [54].
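As a minimal sketch of the data-at-rest encryption described above, the following example uses the symmetric Fernet recipe from the open-source Python cryptography package; the record contents and file names are purely illustrative, and a production deployment would obtain the key from an external key management service rather than generating it locally.

```python
from cryptography.fernet import Fernet

# In production the key would come from an external key management service.
key = Fernet.generate_key()
cipher = Fernet(key)

record = b'{"patient_id": "12345", "diagnosis": "hypertension"}'

# Encrypt before the record is written to the cluster's storage layer.
token = cipher.encrypt(record)
with open("record.enc", "wb") as out:
    out.write(token)

# A stolen copy of record.enc is unreadable without the key.
with open("record.enc", "rb") as handle:
    restored = cipher.decrypt(handle.read())
assert restored == record
```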

2.5.4.5 Imposing access controls Authorization is the process of specifying and managing access privileges for a person or a system with respect to protected files. File-layer encryption is not useful if an attacker can gain access to the encryption keys. Many big data installations let administrators keep keys on local disk drives because it is convenient, but this is insecure, as the keys can be obtained by the platform administrator or an attacker. Use of a key management service is preferred to distribute keys and certificates and to manage distinct keys for every group, application, and user.
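The following sketch illustrates the two ideas above, per-group keys retrieved from a central key management service combined with a simple role-based authorization check. KeyManagementClient, its endpoint, and the role table are hypothetical stand-ins for illustration, not the API of any particular product.

```python
# Hypothetical illustration of per-group keys plus role-based authorization.
ROLE_PERMISSIONS = {
    "analyst": {"read"},
    "data_engineer": {"read", "write"},
    "platform_admin": {"read", "write", "rotate_keys"},
}

class KeyManagementClient:
    """Stand-in for an external key management service (KMS)."""
    def __init__(self, endpoint):
        self.endpoint = endpoint
        self._keys = {}          # in a real KMS, key material never leaves the service

    def key_for_group(self, group):
        # One distinct key per group/application, created on first use.
        return self._keys.setdefault(group, f"key-material-for-{group}")

def authorize(user_role, action):
    if action not in ROLE_PERMISSIONS.get(user_role, set()):
        raise PermissionError(f"role '{user_role}' may not perform '{action}'")

def read_encrypted_file(user_role, group, path, kms):
    authorize(user_role, "read")
    key = kms.key_for_group(group)        # fetched on demand, never stored on local disk
    return f"decrypt({path}) with {key}"  # placeholder for the actual decryption call

kms = KeyManagementClient("https://kms.example.internal")
print(read_encrypted_file("analyst", "cardiology", "record.enc", kms))
```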

2.5.4.6 Logging To uncover attacks, diagnose failures, or investigate unusual behavior, we need a record of activity. In contrast to less scalable data-management systems, big data is a natural fit for gathering and handling event data; many web organizations started with big data specifically to manage log files. Logs give us a place to look when something fails or when someone suspects a hack. To meet security requirements, it is best to audit the whole system on a periodic basis, although even routine auditing operations can be time consuming.

Finding relevant and meaningful information is difficult because most of the data may not be applicable to the task at hand. A challenge of big data is distinguishing between the full data set and a representative data set. Huge data sets collected from Twitter may not be representative, even if all of the data are loaded [55]. A massive data set also does not imply accurate data. In some cases, the larger the data set, the better the classifications that can be made, and huge data sets enable better observation of rare but important events. However, large volumes of data can also lead to a focus solely on finding patterns or correlations without attending to the broader dynamics at play. Nonrepresentative samples can offer internally valid conclusions that cannot be generalized, and biased, nonrepresentative samples are avoided with the aid of random sampling. Statistics are not always additive, and conclusions cannot be drawn from subset assessments alone. Processing big data sets requires scalability and performance, so data are usually filtered to produce smaller record sets for analysis. Data use requires finding relevant and meaningful data, knowing the value of the data, and understanding the context and the question asked.

Challenges in data variety arise from the different data types, distinguished as structured, semistructured, and unstructured records. Unstructured data represent records of real life, expressed in natural language without a specific structure or schema. Human-generated unstructured data are full of nuances, variances, and double meanings, so caution is needed in interpreting their content, and semantic inconsistencies complicate evaluation. Metadata can improve consistency by adding a glossary of business terms, hierarchies, and taxonomies for business concepts [56]. Variations dominate when data sets capture human behavior, and standard methods may not apply; for example, statistical measures such as averages are meaningless in sparsely populated data sets. The messiness of big data makes it a demanding task to comprehend the properties and boundaries of a data set, regardless of its size. Processing a variety of data requires converting unstructured and semistructured records into a structured layout so that they can be stored, accessed, and analyzed along with other structured data. Data usage involves understanding the nuances, variations, and double meanings in human-generated unstructured data. Other requirements are the ethical use of result sets and privacy-preserving analysis.

Challenges in the velocity of data include big data accruing in continuous streams, which allows fine-grained customer segmentation based on the current situation rather than segmentation based on historical data. The question of when data are no longer applicable to a current analysis is more pressing for real-time data. The pace attribute is the speed at which records are shared in a human network [57]; the information is used immediately after it flows into the system. The required processing speed is on-demand, real-time accessibility, compared with traditional on-supply, over-time access. Data use requires faster decision-making and faster response times in the enterprise.
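The variety requirement noted above, converting unstructured and semistructured records into a structured layout before they can be stored and analyzed alongside other structured data, is typically met with scripting (for example Pig on Hadoop). The sketch below is a minimal Python stand-in for such a script; the free-text note format and the extracted fields are illustrative assumptions, not a format defined in this chapter.

```python
import csv
import re

# Illustrative free-text clinical notes (unstructured input).
NOTES = [
    "Patient 1001 seen on 2018-03-02; BP 142/91; complains of headache.",
    "Patient 1002 seen on 2018-03-03; BP 118/76; routine follow-up.",
]

# Recognized pattern used to impose structure on the text.
PATTERN = re.compile(
    r"Patient (?P<patient_id>\d+) seen on (?P<visit_date>\d{4}-\d{2}-\d{2}); "
    r"BP (?P<systolic>\d+)/(?P<diastolic>\d+); (?P<remark>.+)\."
)

def to_structured(notes):
    """Convert free-text notes into rows with named, typed fields."""
    for note in notes:
        match = PATTERN.match(note)
        if match is None:
            continue                      # route unparseable notes to manual review
        row = match.groupdict()
        row["systolic"] = int(row["systolic"])
        row["diastolic"] = int(row["diastolic"])
        yield row

with open("visits_structured.csv", "w", newline="") as out:
    writer = csv.DictWriter(
        out, fieldnames=["patient_id", "visit_date", "systolic", "diastolic", "remark"]
    )
    writer.writeheader()
    writer.writerows(to_structured(NOTES))
```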

2.6 DISCUSSION Big data overwhelms current infrastructure and software because of its volume, velocity, variety, and variability. Therefore, when considering the requirements of full-scale big data analytics applications, it is essential to account for each of these characteristics. A few elements are vital to enable big data analytics and to capture the differences between delay-tolerant and real-time applications (a minimal pipeline sketch covering these stages is given after this list).
(1) Acquiring: It is necessary to identify all sources of data and every possible format and structure foreseen. The specifications of the data required should be clearly described and recorded in order to plan the processing capabilities. It is also essential to identify the rate of data acquisition, as it will affect the processing and storage plans. Another essential point is to define the time allotted to data gathering and generation: some data are produced all at once, whereas other data are streamed continuously and are typically handled as they arrive.
(2) Preserving: Efficient storage and organization mechanisms are needed to improve storage-space usage, partitioning, availability, and durability. Some data will come from external sources over which the application developers have no control; other data are acquired by the application. Coordination and compatibility issues should ideally be resolved early. It is also important to have a reasonable estimate of how the volume of data will grow over time and to decide on the storage space and mechanisms to be used. This helps when deciding whether to acquire more hardware to store all data in-house, to use available cloud infrastructure, or to choose a mixture of both. Application designers should consider the general properties of the data required for the application and should have specific descriptions of how the data will be handled and secured. These requirements must be analyzed with the client and recorded clearly to help refine the solution to the needs.
(3) Preprocessing: Most collected data can be processed in advance to produce partial results that later help the actual processing achieve the required results. Possible preprocessing techniques include normalizing the representation of specific fields, indexing on various keys, summarizing for particular objectives, grouping by specific targets, and attaching labels and metadata to unstructured data items.
(4) Processing: An application uses the available data to obtain meaningful results. Given the enormous size of the data collections to be handled, the analysis is a demanding, long, and tedious process and should be optimized for the best execution within the application's requirements and constraints. If the requirements and plans for big data acquisition, preservation, and preprocessing are streamlined, the processing stage can start promptly toward the goal. Regardless, it is vital to design effective processing pipelines for the analysis. Depending on the time constraints of an application, the processing techniques can change radically. Trade-offs may likewise be important here when using approximation techniques that balance accuracy against the time needed to arrive at useful results.
(5) Making/Presenting Results: The last piece of the problem is the analysis results. Requirements on the precision, usability, and delivery mechanisms of the results must be identified. One fundamental point is understanding how the results are used; that will determine how the results are organized, stored, and exchanged. Specific information on what data should be displayed, and how soon, is crucial.
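As referenced above, the five factors describe, in effect, a pipeline. The following outline is a minimal sketch (the stage contents and sample records are illustrative only) of how an application might wire acquisition, preservation, preprocessing, processing, and presentation together so that each stage can be optimized or swapped independently.

```python
import json
import statistics

def acquire():
    """Acquisition: identify sources; here a tiny in-memory batch stands in for them."""
    yield {"patient_id": 1, "systolic": 142}
    yield {"patient_id": 2, "systolic": 118}
    yield {"patient_id": 3, "systolic": 131}

def preserve(records, path="raw_batch.jsonl"):
    """Preservation: persist the raw batch before any transformation."""
    with open(path, "w") as out:
        for rec in records:
            out.write(json.dumps(rec) + "\n")
    return path

def preprocess(path):
    """Preprocessing: label and normalize records for the later analysis."""
    with open(path) as handle:
        for line in handle:
            rec = json.loads(line)
            rec["hypertensive"] = rec["systolic"] >= 140
            yield rec

def process(records):
    """Processing: compute the result the application actually needs."""
    values = [r["systolic"] for r in records]
    return {"count": len(values), "mean_systolic": statistics.mean(values)}

def present(result):
    """Presentation: organize and deliver the result."""
    print(json.dumps(result, indent=2))

present(process(preprocess(preserve(acquire()))))
```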

2.7 CONCLUSION Cloud environments offer large, integrated, fault-tolerant, scalable, and highly available environments to big data systems. The HDFS architecture is designed to detect failures, such as NameNode failures, DataNode failures, and network failures, and to reroute work so that the combined job can continue. Redundancy offers the data locality that is important when working with big data sets, and a backup procedure for the NameNode guarantees accessibility and availability of the data. A big data problem includes most of the deficiencies seen in the current data, which is a valuable asset, along with considerable variation in the quality of the measurements. Demanding situations and requirements are outlined along the three dimensions of big data: volume, variety, and velocity. Building a suitable solution for large and complex data is a venture in which organizations in this field are continually learning and searching for better ways to manage it. One of the greatest drawbacks of big data is the high cost of the infrastructure. Hardware can be extremely expensive for most organizations, even though cloud solutions are available. Every big data system requires enormous processing power and reliable but complicated network designs, which must be built by specialists. Besides the hardware
foundation, software solutions tend to carry excessive costs unless the adopter chooses open-source software. Even with open source, specialists with the required skills are still needed to design the system. The downside of open source is that support and guidance are not provided as they are with paid software. Therefore, keeping a large big data solution operational requires, in most cases, an external support team. Hardware capabilities constrain software solutions, and hardware must be as fast as current technology can provide. A software solution distributes the workload as an approach to processing; the software layer tries to compensate for limited raw speed by planning and ordering requests so that the system can be streamlined for the best overall performance. To handle the data, large data sets may be partitioned for parallel use, which calls for human assessment skills. A computer program can do only what it is programmed to do; it cannot see gray areas and cannot learn or adapt to new kinds of data until it is modified to handle them. Thus, human skills are needed to triage the data together with the hardware, which accelerates the process. Only after that can the results be shown, so the analysis of the results, aimed at assessing the current position or making a forecast, will reduce the recipient's window of opportunity for taking measures or planning well.

GLOSSARY
IoT The Internet of Things (IoT) is the network of physical devices, vehicles, home appliances, and other items embedded with electronics, software, sensors, actuators, and connectivity. These enable the objects to connect and exchange data, creating opportunities for more direct integration of the physical world into computer-based systems and resulting in efficiency improvements, economic benefits, and reduced human intervention.
ACM The Active Contour Model, also called snakes, is a framework in computer vision for delineating an object outline from a possibly noisy 2D image. The snakes model is popular in computer vision, and snakes are used in object tracking, shape recognition, segmentation, edge detection, and stereo matching. A snake is an energy-minimizing, deformable spline influenced by constraint and image forces that pull it toward object contours and by internal forces that resist deformation. Snakes can be understood as a special case of the general technique of matching a deformable model to an image by means of energy minimization. In two dimensions, the active shape model represents a discrete version of this approach, exploiting the point distribution model to restrict the shape range to an explicit domain learned from a training set.
DM Data mining is the process of discovering patterns in large data sets, involving methods at the intersection of machine learning, statistics, and database systems. It is an essential process in which intelligent methods are applied to extract data patterns, and it is an interdisciplinary subfield of computer science. The overall goal of the data mining process is to extract information from a data set and to transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data preprocessing, model and inference considerations, interestingness metrics, complexity considerations, postprocessing of discovered structures, visualization, and online updating. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD.
EHR An electronic health record (EHR), or electronic medical record (EMR), is the systematized collection of patient and population health information stored electronically in a digital format. These records can be shared across different healthcare settings. Records are shared through connected, enterprise-wide information systems and other information networks and exchanges. EHRs may include a range of data, including demographics, medical history, medications and allergies, immunization status, laboratory test results, radiology images, vital signs, personal statistics such as age and weight, and billing information. EHR systems are designed to store data accurately and to capture the state of a patient across time.
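The ACM glossary entry above describes a snake as an energy-minimizing spline. For reference, a standard form of that energy, written here following the classical active contour formulation (the weights α and β are generic smoothness parameters, not values taken from this chapter), is

E_snake = \int_{0}^{1} \left[ \tfrac{1}{2}\left( \alpha\,|v'(s)|^{2} + \beta\,|v''(s)|^{2} \right) + E_{\text{image}}\big(v(s)\big) + E_{\text{con}}\big(v(s)\big) \right] ds

where v(s) is the contour, the first two terms penalize stretching and bending, E_image attracts the contour toward image features such as edges, and E_con encodes external constraints.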

REFERENCES [1] L. Wang, Heterogeneous data and big data analytics, Autom. Control and Inform. Sci. 3 (1) (2017) 8–15, https://doi.org/10.12691/acis-3-1-3. [2] O. Day, T. Khoshgoftaar, A survey on independent transfer learning, J. Big Data 4 (1) (2017), https://doi.org/ 10.1186/s40537-017-0089-0. [3] E. Tromp, M. Pechenizkiy, M. Gaber, Expressive modeling for trusted big data analytics: techniques and applications in sentiment analysis, Big Data Anal. 2 (1) (2017), https://doi.org/10.1186/s41044-016-0018-9. [4] A.E. Hassanien, N. Dey, S. Borra (Eds.), Medical Big Data and Internet of Medical Things: Advances, Challenges, and Applications, Taylor & Francis, 2019. [5] P. Jones, A data linkage strategy for producing census and population statistics from administrative data, Int. J. Popul. Data Sci. 1 (1) (2017), https://doi.org/10.23889/ijpds.v1i1.378. [6] C. Luo, Y. Chen, Research on association rule mining algorithm based on distributed data, Adv. Mater. Res. 998–999 (2014) 899–902, https://doi.org/10.4028/www.scientific.net/amr.998-999.899. [7] N. Dey, A.E. Hassanien, C. Bhatt, A.S. Ashour, S.C. Satapathy (Eds.), Internet of Things and Big Data Analytics Toward Next-Generation Intelligence, Springer International Publishing, 2018. [8] J. Bughin, Big data, Big bang? J. Big Data 3 (1) (2016), https://doi.org/10.1186/s40537-015-0014-3. [9] C. Domenig, M. Aspalter, M. Umathum, T. Holzenbein, Redo pedal bypass surgery after pedal graft failure: gain or gadget? Ann. Vasc. Surg. 21 (6) (2007) 713–718, https://doi.org/10.1016/j.avsg.2007.07.010. [10] M. Hoskins, Common big data challenges and how to overcome them, Big Data 2 (3) (2014) 142–143, https:// doi.org/10.1089/big.2014.0030. [11] S. Boubiche, D. Boubiche, A. Bilami, H. Toral-Cruz, Big data challenges and data aggregation strategies in wireless sensor networks, IEEE Access 6 (2018) 20558–20571, https://doi.org/10.1109/access.2018. 2821445. [12] A. Wilson, Big-data analytics for predictive maintenance modeling: challenges and opportunities, J. Pet. Technol. 68 (10) (2016) 71–72, https://doi.org/10.2118/1016-0071-jpt. [13] M. Desai, The integration of the data scientist into the team: implications and challenges, Data Sci. (2017) 1–6, https://doi.org/10.3233/ds-170008. [14] E. Chalmers, D. Hill, V. Zhao, E. Lou, Prescriptive analytics applied to brace treatment for AIS: a pilot demonstration, Scoliosis 10 (S1) (2015), https://doi.org/10.1186/1748-7161-10-s1-o64. [15] V. Abrar, M.R. Wahyudi, Implementasi Heterogenous Distributed Database System Oracle Xe 10g dan MySQL Rekam Medis Poliklinik UIN Sunan Kalijaga, Creat. Inform. Technol. J. 4 (1) (2016) 9, https:// doi.org/10.24076/citec.2016v4i1.91. [16] V. Boehme-Neßler, Privacy: a matter of democracy. Why democracy needs privacy and data protection, Int. Data Priv. Law 6 (3) (2016) 222–229, https://doi.org/10.1093/idpl/ipw007. [17] W. Cooper, Origin and development of data envelopment analysis: challenges and opportunities, Data Envelop. Anal. J. 1 (1) (2014) 3–10, https://doi.org/10.1561/103.00000002. [18] R. Luethi, M. Phillips, Challenges and solutions for long-term permafrost borehole temperature monitoring and data interpretation, Geogr. Helv. 71 (2) (2016) 121–131, https://doi.org/10.5194/gh-71-121-2016. [19] R. Kashyap, A. Piersson, Big data challenges and solutions in the medical industries, in: Handbook of Research on Pattern Engineering System Development for Big Data Analytics, IGI Global, 2018, pp. 1–24. [20] R. Kashyap, A. 
Piersson, Impact of big data on security, in: Handbook of Research on Network Forensics and Analysis Techniques, IGI Global, 2018, pp. 283–299.


[21] R. Lomotey, R. Deters, Unstructured data mining: use case for CouchDB, Int. J. Big Data Intell. 2 (3) (2015) 168, https://doi.org/10.1504/ijbdi.2015.070597. [22] X. Vu, M. Abel, P. Morizet-Mahoudeaux, A user-centered approach for integrating social data into groups of interest, Data Knowl. Eng. 96–97 (2015) 43–56, https://doi.org/10.1016/j.datak.2015.04.004. [23] K. Yue, W. Liu, Y. Zhu, W. Zhang, A probabilistic-graphical-model based approach for representing lineages in uncertain data, Chin. J. Comput. 34 (10) (2011) 1897–1906, https://doi.org/10.3724/sp.j.1016.2011.01897. [24] J. Haspel, A big data platform to enable the integration of high-quality clinical data and next generation sequencing data, Eur. J. Mol. Clin. Med. 2 (2) (2015) 57, https://doi.org/10.1016/j.nhtm.2014.11.011. [25] A. Telang, P. Deepak, S. Joshi, P. Deshpande, R. Rajendran, Detecting localized homogeneous anomalies over spatiotemporal data, Data Min. Knowl. Disc. 28 (5–6) (2014) 1480–1502, https://doi.org/10.1007/ s10618-014-0366-x. [26] J. Zhao, P. Papapetrou, L. Asker, H. Bostr€om, Learning from unrelated temporal data in electronic health records, J. Biomed. Inform. 65 (2017) 105–119, https://doi.org/10.1016/j.jbi.2016.11.006. [27] H. Xie, X. Chen, Cloud storage-oriented unstructured data storage, J. Comput. Appl. 32 (6) (2013) 1924–1928, https://doi.org/10.3724/sp.j.1087.2012.01924. [28] K. Jabeen, Scalability study of hadoop MapReduce and hive in big data analytics, Int. J. Eng. Comput. Sci. (2016), https://doi.org/10.18535/ijecs/v5i11.11. [29] T. Chou, P. Chou, E. Lin, Improving the timeliness of turning signals for business cycles with monthly data, J. Forecast. 35 (8) (2016) 669–689, https://doi.org/10.1002/for.2379. [30] C. Bhatt, N. Dey, A.S. Ashour (Eds.), Internet of Things and Big Data Technologies for Next-Generation Healthcare, Springer International Publishing, 2017. [31] E. Dumbill, Big data is rocket fuel, Big Data 1 (2) (2013) 71–72, https://doi.org/10.1089/big.2013.0017. [32] K. Jutz, The accuracy of data-based sensitivity indices, SIAM Undergrad. Res. Online 9 (2016), https://doi. org/10.1137/15s014757. [33] C. Changchit, K. Bagchi, Privacy and security concerns with healthcare data and social media usage, J. Inform. Priv. Secur. 13 (2) (2017) 49–50, https://doi.org/10.1080/15536548.2017.1322413. [34] M.S. Kamal, S. Parvin, A.S. Ashour, F. Shi, N. Dey, De-Bruijn graph with the MapReduce framework towards metagenomic data classification, Int. J. Inf. Technol. 9 (1) (2017) 59–75. [35] J. Hester, A robust, format-agnostic scientific data transfer framework, Data Sci. J. 15 (2016) 12, https://doi. org/10.5334/dsj-2016-012. [36] J. Christensen, Effective data visualization: the right chart for the right data, and data visualization: a handbook for data driven design, Technol. Archit. Des. 1 (2) (2017) 242–243, https://doi.org/10.1080/ 24751448.2017.1354629. [37] F. Bajaber, R. Elshawi, O. Batarfi, A. Altalhi, A. Barnawi, S. Sakr, Big data 2.0 processing systems: taxonomy and open challenges, J. Grid Comput. 14 (3) (2016) 379–405, https://doi.org/10.1007/s10723016-9371-1. [38] G. Vogel, Danish sperm counts spark data dispute, Science 332 (6036) (2011) 1369–1370, https://doi.org/ 10.1126/science.332.6036.1369. [39] S. Surender Punia, Improving resource management and solving scheduling problem in data ware house using OLAP and OLTP, Int. J. Mech. Eng. Inform. Technol. (2016), https://doi.org/10.18535/ijmeit/v4i9.01. [40] R. Kashyap, V. 
Tiwari, Active contours using global models for medical image segmentation, Int. J. Comput. Syst. Eng. 4 (2/3) (2018) 195, https://doi.org/10.1504/ijcsyse.2018.091404. [41] H. Das, B. Naik, H.S. Behera, Classification of diabetes mellitus disease (DMD): a data mining (DM) approach, in: Progress in Computing, Analytics and Networking, Springer, Singapore, 2018, pp. 539–549. [42] R. Kashyap, P. Gautam, V. Tiwari, Management and monitoring patterns and future scope, in: Handbook of Research on Pattern Engineering System Development for Big Data Analytics, IGI Global, 2018, pp. 230–251. [43] R. Kashyap, P. Gautam, Fast medical image segmentation using energy-based method, in: Pattern and Data Analysis in Healthcare Settings, Medical Information Science Reference, IGI Global, USA, 2017, pp. 35–60.


[44] R. Kashyap, V. Tiwari, Energy-based active contour method for image segmentation, Int. J. Electron. Healthc. 9 (2–3) (2017) 210–225. [45] A. Upadhyay, R. Kashyap, Robust hybrid energy-based method for accurate object boundary detection, J. Adv. Res. Dyn. Control Syst. 1 (06) (2017) 1476–1490. [46] J. Tiago, T. Guerra, A. Sequeira, A velocity tracking approach for the data assimilation problem in blood flow simulations, Int. J. Numer. Method Biomed. Eng. 33 (10) (2017) e2856. https://doi.org/10.1002/cnm.2856. [47] J. Bacardit, P. Widera, N. Lazzarini, N. Krasnogor, Hard data analytics problems make for better data analysis algorithms: bioinformatics as an example, Big Data 2 (3) (2014) 164–176, https://doi.org/10.1089/ big.2014.0023. [48] O. von Maurich, A. Golkar, Data authentication, integrity and confidentiality mechanisms for federated satellite systems, Acta Astronaut. 149 (2018) 61–76, https://doi.org/10.1016/j.actaastro.2018.05.003. [49] H. Das, A.K. Jena, J. Nayak, B. Naik, H.S. Behera, A novel PSO based back propagation learning-MLP (PSO-BP-MLP) for classification, in: Computational Intelligence in Data Mining-Volume 2, Springer, New Delhi, 2015, pp. 461–471. [50] C. Pradhan, H. Das, B. Naik, N. Dey, Handbook of Research on Information Security in Biomedical Signal Processing, IGI Global, Hershey, PA, 2018, pp. 1–414, https://doi.org/10.4018/978-1-5225-5152-2. [51] R. Sahani, C. Rout, J.C. Badajena, A.K. Jena, H. Das, Classification of intrusion detection using data mining techniques, in: Progress in Computing, Analytics and Networking, Springer, Singapore, 2018, pp. 753–764. [52] G. Ruan, H. Zhang, Closed-loop big data analysis with visualization and scalable computing, Big Data Res. 8 (2017) 12–26, https://doi.org/10.1016/j.bdr.2017.01.002. [53] E. Zeide, The structural consequences of big data-driven education, Big Data 5 (2) (2017) 164–172, https:// doi.org/10.1089/big.2016.0061. [54] K. Gai, M. Qiu, H. Zhao, Privacy-preserving data encryption strategy for big data in mobile cloud computing, IEEE Trans. Big Data (2017) 1, https://doi.org/10.1109/tbdata.2017.2705807. [55] J. Li, G. Tao, K. Zhang, B. Wang, H. Wang, An effective data processing flow for the acoustic reflection image logging, Geophys. Prospect. 62 (3) (2014) 530–539, https://doi.org/10.1111/1365-2478.12103. [56] K.H.K. Reddy, H. Das, D.S. Roy, A data aware scheme for scheduling big-data applications with SAVANNA Hadoop, in: Futures of Network, CRC Press, 2017. [57] G. Boulton, The challenges of a big data earth, Big Earth Data 2 (1) (2018) 1–7, https://doi.org/10.1080/ 20964471.2017.1397411. [58] Salesforce.com. (2019). [online] Available at: https://www.salesforce.com/in/paas/overview/ [Accessed 25 Jan. 2019].

FURTHER READING N. Dey, C. Bhatt, A.S. Ashour, Big Data for Remote Sensing: Visualization, Analysis, and Interpretation, Springer, 2018. R. Kashyap, P. Gautam, Fast level set method for segmentation of medical images, in: Proceedings of the International Conference on Informatics and Analytics (ICIA-16), ACM, New York, NY, 2016. Article 20, seven pages. Y. Wu, A mining model of network log data based on Hadoop, Int. J. Perform. Eng. (2018), https://doi.org/ 10.23940/ijpe.18.05.p27.10781087.

CHAPTER 3

BIG DATA ANALYTICS IN HEALTHCARE: A CRITICAL ANALYSIS

Dibya Jyoti Bora

School of Computing Sciences, Kaziranga University, Jorhat, India

3.1 INTRODUCTION The enormous increase in data in different disciplines, such as the internet, business, and finance, contributes to the development of big data. Healthcare units are also gradually becoming an important discipline with great scope for big data development. Most US nonfederal acute care hospitals have adopted basic automated healthcare record systems [1, 2]. Nowadays, technology such as the Internet of Things makes it possible to collect personal health data from millions of consumers at an increasing rate; wearable fitness trackers and health apps on smartphones are examples of such devices [1]. Data therefore grow enormously even within a single institution, and such institutions exist in vast numbers in the United States; considered worldwide, the number increases dramatically. Big data is thus being generated in medical healthcare units. In healthcare units, big data is concerned with important datasets that are too big, too fast, and too complex for healthcare providers to use with their existing tools. Medical images are among the most important and sensitive parts of these datasets; physicians rely on them for many important decisions about the status of a patient's disease. These big datasets contain both structured and unstructured data. Therefore, a proper prediction of disease needs a comprehensive approach in which structured and unstructured data coming from both clinical and nonclinical resources are exploited for a better perception of the disease state. Big data scientists are always trying to discover the associations and hidden patterns in these big medical data so that improved care and treatment programs can be devised for patients, more accurate decisions can be made about a patient's disease, lives can be saved, and the costs of treatment lessened. How, then, should big data be defined? Section 3.2 below gives a brief introduction to big data.


FIG. 3.1 Big data and its 6V’s.

3.2 BIG DATA Big data can be described as data that grow at such a rate that they surpass the processing power of conventional database systems and do not fit the structures of conventional database architectures [3, 4]. Its characteristics can be defined with 6V's: Volume, Velocity, Variety, Value, Variability, and Veracity [3, 4]. A brief introduction to every V is given below and in Fig. 3.1.
Volume: Volume generally refers to the data size. In this case, data size may be terabytes or TB (nearly 1e+12 bytes), petabytes or PB (nearly 1e+15 bytes), zettabytes or ZB (nearly 1e+21 bytes), and so on [3].
Velocity: Velocity generally denotes the speed at which vast amounts of data are being created, collected, and analyzed. In the case of big data, data are generated at a very high speed.
Variety: Variety can be defined as the different types of data that contribute towards big data. Data may exist in diverse forms such as structured, unstructured, or semistructured. For example, structured data are organized in relational tables, semistructured data are obtained from key-value web clicks, and unstructured data are obtained from email messages, articles, streamed video and audio, etc. [3].
Value: This is one of the most significant characteristics of big data. Value means how much the extracted data are worth [5]. Although big data means an endless amount of data, unless the data can be turned into something of value, they are worthless. Big data has value if, and only if, the collected data add knowledge.
Variability: Variability means that data may change during the processing period or at any stage of their lifecycle [3]. If variety and variability increase, then there is a possibility of discovering unexpected, hidden, and valuable information from big data.
Veracity: Veracity denotes the quality or trustworthiness of the data. By quality, we mean two things: data consistency (or reliability) and data trustworthiness [3]. Big data should not suffer from incompleteness, ambiguity, uncertainty, or deception caused by data inconsistency.
The next question is: what is healthcare big data? Section 3.3 discusses healthcare data.


3.3 HEALTHCARE DATA Healthcare data are the data generated and collected in the healthcare industry. This type of data can be broadly classified as structured, semistructured, and unstructured data. These different types of data are defined below [6].

3.3.1 STRUCTURED DATA Structured data are data that can be captured and stored as discrete coded values. One example is Logical Observation Identifiers Names and Codes (LOINC), a database and universal standard for identifying medical laboratory observations [6]. When healthcare data are stored in a structured way, it becomes very easy for computer systems to pass the data back and forth, which makes the analysis phase easier. Nowadays, structured data are given more preference; for example, data on diagnoses, laboratories, procedure orders, and medications are kept as structured data [7].

3.3.2 UNSTRUCTURED DATA Unstructured data are generally captured and stored as free text. Although humans can easily read such data, they are not readable by computers. Common examples of unstructured data are progress notes, pathology results, and radiology reports [7].

3.3.3 SEMISTRUCTURED DATA Semistructured data can be defined as a mixture of structured and unstructured data. One example of semistructured data can be a data entry interface that may have a grouping of structured data capture and free text. If we categorize the healthcare big data per application, then the following classification is used [3].

3.3.4 GENOMIC DATA Genomic data generally means the genomic and DNA data of an organism. In bioinformatics, scientists have been gathering, accumulating, and processing the genomes of different living things. This type of data is generally very large in size and requires large storage and specially built bioinformatics software for its analysis.

3.3.5 PATIENT BEHAVIOR AND SENTIMENT DATA These are the data generally collected or gathered through different social media avenues. Patients can provide valuable feedback and information about their own experiences with doctors, nurses, and other staff members of a particular hospital or clinic [8, 9]. By sharing this information, they can seek help in different social media channels, patient surveys, and discussion forums. This information is collected in great amounts, and valuable facts can be extracted from it if it is analyzed properly with an efficient big data analytics tool.
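As a minimal illustration of how such feedback might be mined, the following sketch scores patient comments with a small keyword lexicon; the comments and word lists are illustrative assumptions, and a production system would use a trained sentiment model running on a big data platform.

```python
POSITIVE = {"helpful", "caring", "quick", "clean", "friendly"}
NEGATIVE = {"rude", "slow", "dirty", "painful", "confusing"}

def sentiment_score(comment):
    """Return (positive hits - negative hits) for one free-text comment."""
    words = {w.strip(".,!?").lower() for w in comment.split()}
    return len(words & POSITIVE) - len(words & NEGATIVE)

comments = [
    "The nurses were caring and the discharge was quick.",
    "Billing was confusing and the waiting room was dirty.",
]
for comment in comments:
    print(sentiment_score(comment), comment)
```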

3.3.6 CLINICAL DATA AND CLINICAL NOTES Clinical data have always been the main resource for health and medical research problems [10]. Clinical data are generally collected during the course of ongoing patient care, and they can also be generated as part of a formal clinical trial program. These data are mostly unstructured (about 80%); examples include documents, images, and clinical or transcribed notes.


FIG. 3.2 Different types of healthcare big data resources: provider software and hospital EHRs, opt-in genome registries, nonretail outlets, mobile data and wearables, medical claims, patient registries, private payer and plan claims, government health plan claims, and pharmacy claims. Based on Big Data in Healthcare Market Value Share of 20.69% With Cerner Co, Cognizant, Dell, Philips, Siemens and Business Forecast to 2022 | (2018). Retrieved from: https://www.medgadget.com/2018/04/big-data-in-healthcare-market-value-share-of-2069-with-cerner-co-cognizant-dell-philips-siemens-and-business-forecast-to-2022.html; Elliott, R., Morss, P. (2018). Big Data: How It Can Improve Our Health | Elliott Morss. Retrieved from: http://www.morssglobalfinance.com/big-data-how-it-can-improve-our-health/.

3.3.7 CLINICAL REFERENCE AND HEALTH PUBLICATION DATA These are data collected from different publications such as journal articles, clinical practice guidelines (text-based reference), health products, and clinical and medical research materials etc. [11].

3.3.8 ADMINISTRATIVE AND EXTERNAL DATA These are data generally collected through different external sources such as insurance statements and related financial data, billing statements and scheduling [3], and different biometric data such as those collected through fingerprints, handwriting, and iris scans. Fig. 3.2 clearly presents the different sources of healthcare big data discussed above. The following section concentrates on what medical image processing is and why we need it in healthcare big data analysis.

3.4 MEDICAL IMAGE PROCESSING AND ITS ROLE IN HEALTHCARE DATA ANALYSIS Medical image processing means the analysis of medical image data with the help of different image processing algorithms. Such algorithms typically include contrast enhancement, noise reduction, edge sharpening, edge detection, segmentation, etc. These techniques make the manual diagnosis process of disease detection automatic or semiautomatic. Fig. 3.3 is an example of medical image processing in which a soft computing technique [13] is used to segment an MRI brain image that is first contrast-improved with a local region-based enhancement technique [14].

FIG. 3.3 (A) Original image [12], (B) contrast-improved version, and (C) segmented version.

Nowadays, the size and dimensionality of healthcare data have increased tremendously. This results in a high level of complexity in understanding the dependencies among the data [15]. Manually analyzing data of this complexity is unfeasible. The rapid increase in the number of healthcare organizations, as well as in the number of patients undergoing treatment, has increased the need for computer-aided diagnostic models. These models are considered efficient if the embedded image processing algorithms are able to produce an optimal analysis result. The incorporation of such techniques with suitable care has helped clinicians advance diagnostic accuracy [15]. The integration of medical images
with other kinds of EHR (electronic health record) data and genomic data is also required to advance accuracy and lessen the time taken for a diagnosis. Medical imaging can be defined as the technique of generating a visual representation of the interior of a body for clinical investigation and medical intervention, as well as a visual representation of the function of some organs or tissues (physiology) [16]. Some of the frequently used medical imaging techniques are computed tomography (CT), fluoroscopy, magnetic resonance imaging (MRI), mammography, molecular imaging, photoacoustic imaging, positron emission tomography-computed tomography (PET-CT), ultrasound, and X-ray [15]. The size of these medical image data can vary from a few megabytes for a single study (e.g., histology images) to hundreds of megabytes per study (e.g., thin-slice CT readings covering up to 2500+ scans per study [15, 17]). Such data require large amounts of storage with adequate durability, and analyzing them requires fast, high-throughput, and optimal algorithms. An example of such a high-processing system is AMIGO (advanced multimodal image-guided operating), a suite equipped with an angiographic X-ray system, 3D ultrasound, MRI, and PET/CT imaging in the operating room (OR) [18]. This system has been efficiently employed for cancer therapy and has helped to improve the localization and marking of diseased tissue. The medical image processing domain therefore has a large role in determining efficient healthcare big data analysis. Recently, many different techniques and frameworks have been proposed for big data analytics in healthcare research. Some of the noteworthy contributions are described in Section 3.5 below.
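A minimal sketch of the enhancement-then-segmentation pipeline illustrated in Fig. 3.3, written here with the open-source OpenCV library, is shown below. It uses generic CLAHE contrast enhancement and Otsu thresholding rather than the specific soft computing and local region-based techniques of [13, 14], and the file names are illustrative.

```python
import cv2

# Load a grayscale MRI slice (file name is illustrative).
image = cv2.imread("mri_brain_slice.png", cv2.IMREAD_GRAYSCALE)

# Step 1: contrast enhancement with CLAHE (a generic stand-in for the
# local region-based enhancement technique discussed above).
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(image)

# Step 2: segmentation by Otsu thresholding (a generic stand-in for the
# soft computing segmentation technique discussed above).
_, segmented = cv2.threshold(enhanced, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

cv2.imwrite("mri_enhanced.png", enhanced)
cv2.imwrite("mri_segmented.png", segmented)
```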

3.5 RECENT WORKS IN BIG DATA ANALYTICS IN HEALTHCARE DATA In this section, a few recent works in the field of big data analytics in healthcare are briefly presented. Ojha et al. [19] presented an insight into how big data analytics tools like Hadoop can be used with healthcare data, and discussed how meaningful information can be extracted from the EHR (electronic health record). For this work, they conducted their experiments on healthcare data obtained from central India's major government hospital, Maharaja Yeshwantrao Hospital (M.Y.), situated in Indore, Madhya Pradesh, India. This hospital generates a large amount of data every day, which the authors suggest storing in an EHR. A unique number is assigned to every patient whose data are stored there; with this number, information about any patient can be accessed easily and quickly. The data warehouse, an EDH (enterprise data hub), can also be used to store this data, and different data mining techniques such as classification, clustering, and association can be performed on it directly. This provides a steady and efficient medical big data analysis technique. Koppad et al. [20] introduced an application of big data analytics in the healthcare system with the aim of predicting COPD (chronic obstructive pulmonary disease). They used the decision tree technique, a data mining method, to perform COPD diagnosis for an individual patient. The Aadhaar number is used to refer to the patient's details, which are stored in a centralized clinical data repository. Because the Aadhaar number is unique, it links the treatments given to the patient in different hospitals and the doctors in charge. The authors claim encouraging accuracy in diagnosing COPD patients and demonstrate the efficacy of the proposed system through experimental results.
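The decision tree classification used in the COPD study above can be illustrated with a few lines of scikit-learn; the features, values, and labels below are fabricated placeholders for demonstration only and are not the study's data.

```python
from sklearn.tree import DecisionTreeClassifier

# Illustrative features: [age, pack_years_smoked, FEV1/FVC ratio (%)]
X_train = [
    [65, 40, 55],
    [50, 10, 78],
    [72, 35, 58],
    [45,  0, 82],
]
y_train = [1, 0, 1, 0]   # 1 = COPD, 0 = no COPD (toy labels)

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

new_patient = [[68, 30, 60]]
print("Predicted COPD class:", model.predict(new_patient)[0])
```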


Patel et al. [21] discuss the reasons for considering data in healthcare and present the results of various surveys to demonstrate the influence of big data in healthcare. They also present some case studies on big data analytics in healthcare industries and discuss different tools for handling big data problems. Simpao et al. [22] focus on big data analytics in anesthesia and healthcare units. They also focus on visual analytics, the science of analytical reasoning supported by interactive visual interfaces, which can assist with the cognitive activities involved in big data. The Anesthesia Quality Institute and the Multicenter Perioperative Outcomes Group have led significant efforts to gather anesthesia big data for outcomes research, which also aids quality improvement. They suggest that the efficient use of data, combined with quantitative and qualitative analysis for decision-making, can be applied to big data for excellence and performance enhancement; important applications include clinical decision support, predictive risk assessment, and resource management. Jokonya et al. [23] propose a big data integration framework to support the prevention and control of HIV/AIDS, TB, and silicosis (HATS) in the mining industry. The link among HIV/AIDS, TB, and silicosis is the focus of this work. The authors claim that their approach is the first to use big data to understand the linkage between HATS in the mining industry. The proposed big data framework addresses the needs of predictive epidemiology, which is important for forecasting and disease control in the mining industry. They suggest the use of a viable systems model and big data to tackle the challenges of HATS in the mining industry. Weider et al. [24] introduce a big data based recommendation engine for early identification of diseases in the modern healthcare environment. A classification algorithm, Naïve Bayes (NB), is used to build this system. The algorithm runs on top of Apache Mahout and advises on users' health conditions, readmission rates, treatment optimization, and adverse occurrences. The proposed work focuses on analyzing and using new big data methodologies. The approach is efficient in the sense that once a disease is identified, it becomes easy to deliver the correct care to patients; in this way, the average life expectancy of people can be increased if they receive suitable care from the early stages. Chrimes et al. [25] propose a framework built to form a big data analytics (BDA) platform using realistic volumes of healthcare big data. The existing high-performance computing (HPC) architecture is utilized with HBase (a NoSQL database) and Hadoop (HDFS). The generated NoSQL database was emulated from metadata and inpatient profiles of the Vancouver Island Health Authority's hospital system. A special modification of Hadoop's ecosystem and HBase, with the addition of "salt buckets" for ingestion, was utilized. The authors claim that the data migration performance of the proposed BDA platform can capture large volumes of data while decreasing data retrieval times, with implications for innovative processes and configurations. Chawla et al. [26] propose a personalized, patient-centered framework, CARE. The proposed system serves as data-driven computational support for physicians evaluating the disease risks facing their patients.
It can provide early warning indicators of an individual's possible disease risks, which can then be turned into a dialogue between physician and patient, aiding patient empowerment. CARE can be used to its full potential to explore broader disease histories, suggest previously unconsidered concerns, and facilitate discussion about early testing and prevention, as well as wellness strategies that may be more recognizable to the individual and easier to implement.


McGregor in [27] discusses the benefits and effectiveness of using big data in neonatal intensive care units. He claims that it will lead to earlier detection and prevention of a wide range of fatal medical conditions. The capability to process multiple high-speed physiological data streams from numerous patients, in numerous places, and in real time could considerably improve both healthcare efficiency and patient outcomes. Fahim et al. [28] propose ATHENA (Activity-Awareness for Human-Engaged Wellness Applications) to plan and integrate the association between basic health needs and to provide human lifestyle and real-time recommendations for wellbeing services. Their motive is to develop a system that encourages an active lifestyle for individuals and suggests valuable interventions by making comparisons with their past habits. The proposed system processes sensory data through an ML (machine learning) algorithm inside smart devices and exploits cloud infrastructure to decrease the cost involved. Here, big data infrastructure is employed for huge sensory data storage and fast retrieval for recommendations. Das et al. [29] proposed a data-mining-based approach for the classification of diabetes mellitus disease (DMD). They applied J48 and Naïve Bayesian techniques for the early detection of diabetes. Their proposed model is elaborated in consecutive steps to help medical practitioners explore and recognize the discovered rules more easily. The dataset used was collected from a college medical hospital as well as from an online repository. Further practical applications are based on the proposed approach. A PSO (particle swarm optimization) based approach can also be employed for this classification task; one such classification technique can be found in [30]. That technique is a PSO-based evolutionary multilayer perceptron trained using the back-propagation algorithm. Some other advanced techniques, such as the one proposed in [31], can also be adopted for the classification task. The technique in [31] is based on the De Bruijn graph with the MapReduce framework and is used for metagenomic gene classification. The graph-based MapReduce approach has two phases: mapping and reducing. In the mapping phase, a recursive naive algorithm is employed to generate k-mers. The De Bruijn graph is a compact representation of k-mers that finds an optimal path (solution) for genome assembly. The authors utilized similarity metrics for finding similarity among the DNA (deoxyribonucleic acid) sequences. In the reducing phase, Jaccard similarity and purity of clustering are applied as dataset classifiers to classify the sequences based on their similarity. The experimental results indicate that this technique is efficient for metagenomic data clustering. It is also important to discuss the possible security threats that may arise while transferring medical images and data over the internet. To deal with privacy and copyright protection of such a huge amount of medical data, we need more robust and efficient techniques. These different techniques should be studied and tested empirically to find the most efficient technique that is easy to implement and can provide optimal protection. Such techniques are thoroughly discussed in [32, 33]. Different challenges that might be faced during the analysis of medical big data also need to be addressed; a very good study on this can be found in [34].
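The k-mer extraction and Jaccard similarity steps mentioned above for the De Bruijn graph/MapReduce technique of [31] can be sketched in a few lines; the sequences below are toy examples, and the real method distributes this work across mappers and reducers rather than running it in a single process.

```python
def kmers(sequence, k=4):
    """Return the set of all k-mers (length-k substrings) of a DNA sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two k-mer sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

seq1 = "ATGCGTACGTTAGC"
seq2 = "ATGCGTACGTAAGC"

set1, set2 = kmers(seq1), kmers(seq2)
print("k-mers of seq1:", sorted(set1))
print("Jaccard similarity:", round(jaccard(set1, set2), 3))
```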
In summary, the above are some noteworthy contributions toward big data analytics in healthcare data. These works contain innovative ideas for utilizing big data analytics on healthcare data to extract new and valuable information and thereby discover innovative ways to deal with serious diseases. Dealing with healthcare big data requires a sound architectural framework, and with it arises the need for big data analytics tools. These tools are briefly introduced, along with their advantages and disadvantages, in the section below.


3.6 ARCHITECTURAL FRAMEWORK AND DIFFERENT TOOLS FOR BIG DATA ANALYTICS IN HEALTHCARE BIG DATA This section is subdivided into the following two subsections.

3.6.1 ARCHITECTURAL FRAMEWORK An architectural framework is an important requirement for any knowledge discovery process. When it comes to big data analytics in healthcare data, it is obligatory for an accurate analysis and idea generation. A conceptual model provides the guidelines during the whole analysis process, from healthcare data acquisition to the high level of data analysis. Some very good conceptual models of big data analytics in healthcare data can be found in [35] and [36]; a modification of the same is presented below. A simple and easy-to-understand framework is needed for an optimal study. Firstly, a level 0 architectural framework for big data analytics in healthcare data is presented (Fig. 3.4). Secondly, a level 1 architectural framework is presented in Fig. 3.5, which is an elaborated version of Fig. 3.4.

FIG. 3.4 Level 0 architectural framework for big data analytics in healthcare data.

FIG. 3.5 Level 1 architectural framework for big data analytics in healthcare data: internal and external raw healthcare data are transformed into a recognizable big data format and processed by big data platforms and analytics (e.g., Hadoop, MapReduce, Pig and PigLatin, Hive), producing queries, generated reports, OLAP for business intelligence, and data mining for knowledge discovery.

In Fig. 3.5, the external data sources may be web and social media, machine-to-machine data such as readings from different sensors, biometric data, and human-generated data such as that obtained from electronic medical records, doctors' or physicians' notes, etc. Every day these sources generate a huge amount of data, leading to the formation of big data. Also, the sources of this data may be situated in different geolocations, and the data can have different formats such as ASCII/text, flat files, .csv, relational tables, etc. [35]. It is clear that medical big data analytics is different from traditional big data analytics. The main reason for this is the varying nature of raw medical big data and also the heterogeneous sources of generation. Because of this, an extra preprocessing step is needed for data cleansing, data normalization, and data transformation. In the case of medical image big data, special enhancement techniques (most preferably fuzzy-based techniques) should be adopted for better enhancement and hence a better medical diagnosis.

3.6.2 DIFFERENT TOOLS USED IN BIG DATA ANALYTICS IN HEALTHCARE DATA A number of different platforms and tools are used in big data analytics in healthcare data. Among them, those [35] that are frequently used are briefly introduced below:
(a) The Hadoop Distributed File System (HDFS): This is the primary data storage system used by Hadoop applications. A NameNode and DataNode architecture is used to implement a DFS (distributed file system) that provides high-performance access to data across highly scalable Hadoop clusters. It has the ability to rapidly transfer data between different computer nodes. The complexity lies in its usage, and this is its main drawback; it is also not suited to small data.
(b) MapReduce: MapReduce is the central part of the Apache Hadoop software framework. It acts in two vital roles: mapper and reducer. In the first role, it filters and parcels out work to various nodes within the cluster (map) and, in the second role, it organizes and reduces the results from each node into a consistent answer to a query (a minimal sketch of the two phases is given after this list). When tasks are executed, it tracks the processing of each server/node. The main drawback of MapReduce is that it is not suitable for interactive processing.
(c) Hive: Hive is a runtime Hadoop support architecture. It brings Structured Query Language (SQL) to the Hadoop platform and is mainly used to analyze structured and semistructured data. It allows SQL programmers to develop Hive Query Language (HQL) statements akin to typical SQL statements [35]. Its speed is its first advantage, but major drawbacks include no real-time access to data and a complicated process for updating data.
(d) Pig and PigLatin: The Pig programming language is built to incorporate all types of data (structured/unstructured). It has two prime modules: PigLatin (i.e., the language) and the runtime environment in which PigLatin code is executed [35]. Although the biggest advantage of Pig is its short development time, it handles errors poorly; for example, in the case of a simple syntax error it will show "exec error".
(e) Zookeeper: ZooKeeper is a centralized service for preserving configuration information, naming, providing distributed synchronization, and group services [35]. Synchronization is provided across a cluster of servers and is utilized by big data analytics to coordinate parallel processing across big clusters. Its high scalability is its prime advantage; limited support for cross-cluster scenarios is one of its major drawbacks.
(f) Jaql: This is a functional, declarative data processing and query language used for JSON query processing on big data. One of its prime tasks is to convert "high-level" queries into "low-level" queries consisting of MapReduce tasks, and in this way it supports parallel processing. Its problematic error handling is a major drawback.
(g) HBase: HBase is an open-source, nonrelational, distributed database model. It can be described as a column-oriented database management system that sits on top of HDFS and uses a non-SQL approach. Linear and modular scalability is its prime advantage; the lack of exception handling is one major disadvantage.
(h) Cassandra: Apache Cassandra is a free and open-source, widely distributed, column-store NoSQL database management system [35]. It is specially designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure, which is its first advantage. The disadvantages include no ad hoc queries and no aggregations.
(i) Lucene: Lucene is mainly used for text analytics and search and has already been integrated into several open-source projects. Its scope comprises full-text indexing and library search for use within a Java application. Its first advantage is that it is available for free as open source under the liberal Apache Software License; speed and high-performance indexing are further advantages. Efficient and accurate search algorithms are implemented properly, which makes it possible to perform highly accurate searches. These are some distinct features of Lucene.



(j) Oozie: Apache Oozie is a server-based workflow scheduling system to handle Hadoop jobs. Here, workflows are described as a group of control flow and action nodes in a directed acyclic graph. The main benefit of Oozie is its capacity to launch workflows that grow as the cluster grows.
(k) Mahout: Apache Mahout is a project of the Apache Software Foundation to create free implementations of distributed or otherwise scalable machine learning algorithms, focused principally on the areas of collaborative filtering, clustering, and classification, that support big data analytics on the Hadoop platform.
(l) Avro: Avro is a data serialization system with additional features such as schema versioning. Its compact serialized data size is one of its greatest benefits. Its disadvantages include that it is a schema-based system and needs a schema for reading and writing data.
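To make the mapper and reducer roles described in item (b) above concrete, the following is a minimal, illustrative pure-Python sketch of the map/reduce pattern (a word count over text records). It does not use the actual Hadoop MapReduce API, and the record contents are hypothetical.

    # Illustrative sketch only: a pure-Python imitation of the map/reduce pattern
    # described in item (b); it does not use the Hadoop API.
    from collections import defaultdict

    def mapper(line):
        # "Map" role: parcel the work out per record, emitting (key, value) pairs.
        for word in line.split():
            yield word.lower(), 1

    def reducer(pairs):
        # "Reduce" role: organize the mapped output into one consistent answer.
        counts = defaultdict(int)
        for key, value in pairs:
            counts[key] += value
        return dict(counts)

    if __name__ == "__main__":
        records = ["big data in healthcare", "big data analytics"]   # hypothetical records
        mapped = [pair for record in records for pair in mapper(record)]
        print(reducer(mapped))   # e.g. {'big': 2, 'data': 2, 'in': 1, ...}

In Hadoop itself, the framework distributes the map and reduce steps across the cluster nodes and tracks their processing, as described above.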

3.7 CHALLENGES FACED DURING BIG DATA ANALYTICS IN HEALTHCARE The major challenges faced during big data analytics in healthcare are discussed below:
(a) The first challenge is that healthcare data are not in a standardized format and are often found in fragmented or incompatible formats [35]. So, it is suggested that healthcare systems and data should be standardized before further processing.
(b) The second major challenge is the real-time processing issue. Real-time big data analytics is an important requirement in the healthcare industry [36]. To address this issue, the delay between data acquisition and data processing should be dealt with quickly.
(c) The possible time effect is another big challenge: the results of big data analytics may differ from time to time. The reasons may include changes in technology or the adoption of high-end technology, and genetic changes that occur over time in the patient population [37].
(d) The adoption of cloud technology in the healthcare industry is progressing rapidly. For example, during the diagnosis process for a patient, the expert may need to access the electronic medical record (EMR), which contains huge multimedia big data including X-rays, ultrasounds, CT scans, and MRI reports. For easy accessibility, the EMR is often stored in the cloud. But the question is how secure this sensitive data is in the cloud. This is an emerging challenge that needs to be addressed. Emphasis should be placed on security models that may be implemented during clinical data sharing, or any healthcare-based data sharing, over the cloud. Also, the optimality of these models in real-time data sharing should be properly tested and verified. In the literature, we find a number of good models that may be adopted to address this issue. The selection of a cloud service provider should be done very carefully, for which trust values calculated through history-based reputation rating might be adopted [38]. A fog computing facility with pairing-based cryptography may also be beneficial for maintaining the privacy of such sensitive healthcare data in the cloud [39].



(e) For analytical purposes, the different machine learning algorithms (or extended versions of earlier algorithms) that can work at such a large scale of data should be dynamically available and easily accessible, for example through a pull-down menu.
(f) Concurrency in big data analytics should be maintained efficiently so that data inconsistency cannot occur at any instant or at any cost, as this would otherwise seriously affect the whole healthcare industry concerned.
The above are some inevitable challenges that should be addressed appropriately for efficient and optimal big data analytics in healthcare.

3.8 CONCLUSION AND FUTURE RESEARCH Adoption of big data technology is rapidly increasing in the healthcare industry. The medical imaging field, which may be considered an automation of the manual diagnosis process, also depends directly or partially on medical big data, since an accurate diagnosis of a serious disease at the right stage needs continuous study of and research on a huge volume of diagnosis data collected from patients with similar symptoms at different clinics in different geolocations. For efficient big data analytics on healthcare data, there should be a standard framework or model through which an optimal result might be expected. Also, for implementation, we need to select the right platform and tools. Besides this, there are several other challenges that need to be addressed throughout the analysis phase. All these issues are discussed in this paper. Although big data analytics in healthcare has great potential, the discussed challenges need to be addressed and solved to make it successful. For future research, these challenges will be the focus, and a novel framework will be built to include all the necessary steps for accurate medical big data analysis. At first, an empirical study will be conducted to investigate the techniques that can be adopted for the raw medical big data cleansing and normalization process. As medical datasets are a mixture of structured, semistructured, and unstructured data, the data transformation step will be given more importance. And as a large portion of medical big datasets consists of medical images, which are generally fuzzy in nature, developing an advanced fuzzy set-based technique for enhancing such medical image data is an unavoidable aspect of the future research. After the preprocessing step, the remaining portion of the big data analytics will be similar to traditional big data analytics, so future research will mainly target the mandatory preprocessing steps.

REFERENCES [1] R. Zhang, H. Wang, R. Tewari, G. Schmidt, D. Kakrania, Big data for medical image analysis: A performance study, in: Parallel and Distributed Processing Symposium Workshops, 2016 IEEE International, IEEE, 2016, pp. 1660–1664. [2] D. Charles, M. Gabriel, M.F. Furukawa, Adoption of electronic health record systems among US non-federal acute care hospitals: 2008-2012, ONC Data Brief 9 (2013) 1–9. [3] L. Wang, C.A. Alexander, Big data in medical applications and health care, AME Med. J. 6 (1) (2015) 1. [4] E. Dumbill, Making Sense of Big Data, 2013.



[5] J. Cano, The V’s of Big Data: Velocity, Volume, Value, Variety, and Veracity, [online] Xsnet.com. Available from: https://www.xsnet.com/blog/bid/205405/the-v-s-of-big-data-velocity-volume-value-variety-andveracity, 2018. Accessed 9 June 2018. [6] Health Data Knowledge, List of Different Types of Health Data-Health Data Knowledge, [online] Available from: http://www.healthdataknowledge.com/list-of-different-types-of-health-data/, 2018. Accessed 10 June 2018. [7] LOINC, Retrieved from https://en.wikipedia.org/wiki/LOINC, 2018. [8] S. Yang, M. Njoku, C.F. Mackenzie, ‘Big data’approaches to trauma outcome prediction and autonomous resuscitation, Br. J. Hosp. Med. 75 (11) (2014) 637–641. [9] Healthcare Analytics, Patient Behavior and Sentiment Data Patient Behaviors and Preferences, (Retail Purchases e.g. Data Captured in Running Stores), Retrieved from: http://healthcareanalyticsglobal.com/?p¼44, 2018. [10] L. Guides, Data Resources in the Health Sciences: Clinical Data, Retrieved from: http://guides.lib.uw.edu/ hsl/data/findclin, 2018. [11] K. Miller, Big data analytics in biomedical research, Biomed. Comput. Rev. 2 (2012) 14–21. [12] A.K. Vaswani, W.M. Nizamani, M. Ali, G. Aneel, B.K. Shahani, S. Hussain, Diagnostic accuracy of contrastenhanced FLAIR magnetic resonance imaging in diagnosis of meningitis correlated with CSF analysis, ISRN Radiol. 2014 (2014) 1–7. [13] D.J. Bora, HE Stain Image Segmentation Using an Innovative Type-2 Fuzzy Set Based Approach, Histopathological Image Analysis in Medical Decision Making, 2018. [14] D.J. Bora, An efficient innovative approach towards color image enhancement, Int. J. Inf. Retr. Res. 8 (1) (2018) 20–37. [15] A. Belle, R. Thiagarajan, S.M. Soroushmehr, F. Navidi, D.A. Beard, K. Najarian, Big data analytics in healthcare, Biomed. Res. Int. 2015 (2015) 370194. [16] Medical Imaging, Retrieved from: https://en.wikipedia.org/wiki/Medical_imaging, 2018. [17] J.A. Seibert, Modalities and data acquisition, in: Practical Imaging Informatics, Springer, New York, 2009, pp. 49–66. [18] C. Tempany, J. Jayender, T. Kapur, R. Bueno, A. Golby, N. Agar, F.A. Jolesz, Multimodal imaging for improved diagnosis and treatment of cancers, Cancer 121 (6) (2015) 817–827. [19] M. Ojha, K. Mathur, Proposed application of big data analytics in healthcare at Maharaja Yeshwantrao Hospital, in: Big Data and Smart City (ICBDSC), 2016 3rd MEC International Conference, IEEE, 2016, March, pp. 1–7. [20] S.H. Koppad, A. Kumar, Application of big data analytics in healthcare system to predict COPD, in: Circuit, Power and Computing Technologies (ICCPCT), 2016 International Conference, IEEE, 2016, March, pp. 1–5. [21] J.A. Patel, P. Sharma, Big data for better health planning, in: Advances in Engineering and Technology Research (ICAETR), 2014 International Conference, IEEE, 2014, August, pp. 1–5. [22] A.F. Simpao, L.M. Ahumada, M.A. Rehman, Big data and visual analytics in anaesthesia and health care, Br. J. Anaesth. 115 (3) (2015) 350–356. [23] O. Jokonya, Towards a big data framework for the prevention and control of HIV/AIDS, TB and silicosis in the mining industry, Procedia Technol. 16 (2014) 1533–1541. [24] D.Y. Weider, C. Pratiksha, S. Swati, S. Akhil, M. Sarath, A modeling approach to big data based recommendation engine in modern health care environment, in: Computer Software and Applications Conference (COMPSAC), 2015 IEEE 39th Annual, Vol. 1, IEEE, 2015, pp. 75–86. [25] D. Chrimes, M.H. Kuo, B. Moa, W. 
Hu, Towards a real-time big data analytics platform for health applications, Int. J. Big Data Intell. 4 (2) (2017) 61–80. [26] N.V. Chawla, D.A. Davis, Bringing big data to personalized healthcare: a patient-centered framework, J. Gen. Intern. Med. 28 (3) (2013) 660–665.



[27] C. McGregor, Big data in neonatal intensive care, Computer 46 (6) (2013) 54–59. [28] M. Fahim, M. Idris, R. Ali, C. Nugent, B. Kang, E.N. Huh, S. Lee, ATHENA: a personalized platform to promote an active lifestyle and wellbeing based on physical, mental and social health primitives, Sensors 14 (5) (2014) 9313–9329. [29] H. Das, B. Naik, H.S. Behera, Classification of diabetes mellitus disease (DMD): A data mining (DM) approach, in: Progress in Computing, Analytics and Networking, Springer, Singapore, 2018, pp. 539–549. [30] H. Das, A.K. Jena, J. Nayak, B. Naik, H.S. Behera, A novel PSO based back propagation learning-MLP (PSO-BP-MLP) for classification, in: Computational Intelligence in Data Mining, vol. 2, Springer, New Delhi, 2015, pp. 461–471. [31] M.S. Kamal, S. Parvin, A.S. Ashour, et al., De-Bruijn graph with MapReduce framework towards metagenomic data classification, Int. J. Inf. Tecnol. 9 (2017) 59, https://doi.org/10.1007/s41870-017-0005-z. [32] C. Pradhan, H. Das, B. Naik, N. Dey, Handbook of Research on Information Security in Biomedical Signal Processing, IGI Global, Hershey, PA, 2018, pp. 1–414, https://doi.org/10.4018/978-1-5225-5152-2. [33] R. Sahani, C. Rout, J.C. Badajena, A.K. Jena, H. Das, Classification of intrusion detection using data mining techniques, in: Progress in Computing, Analytics and Networking, Springer, Singapore, 2018, pp. 753–764. [34] A.E. Hassanien, N. Dey, S. Borra (Eds.), Medical Big Data and Internet of Medical Things: Advances, Challenges and Applications, Taylor & Francis, 2019. [35] W. Raghupathi, V. Raghupathi, Big data analytics in healthcare: promise and potential, Health Inf. Sci. Syst. 2 (1) (2014) 3. [36] R. Sonnati, Improving healthcare using big data analytics, Int. J. Sci. Technol. Res. 4 (8) (2015) 142–146. [37] M. Cottle, W. Hoover, S. Kanwal, M. Kohn, T. Strome, N. Treister, Transforming Health Care Through Big Data Strategies for Leveraging Big Data in the Health Care Industry, Institute for Health Technology Transformation, 2013. http://ihealthtran.com/big-data-in-healthcare. [38] M.K. Muchahari, S.K. Sinha, Reputation-based trust for selection of trustworthy cloud service providers, in: Proceedings of the International Conference on Computing and Communication Systems, Springer, Singapore, 2018, pp. 65–74. [39] H.A. Al Hamid, S.M.M. Rahman, M.S. Hossain, A. Almogren, A. Alamri, A security model for preserving the privacy of medical big data in a healthcare cloud using a fog computing facility with pairing-based cryptography, IEEE Access 5 (2017) 22313–22328.

FURTHER READING Big Data in Healthcare Market Value Share of 20.69% With Cerner Co, Cognizant, Dell, Philips, Siemens and Business Forecast to 2022, Retrieved from: https://www.medgadget.com/2018/04/big-data-in-healthcaremarket-value-share-of-20-69-with-cerner-co-cognizant-dell-philips-siemens-and-business-forecast-to-2022.html, 2018. H. Chen, R.H. Chiang, V.C. Storey, Business intelligence and analytics: from big data to big impact, MIS Q. 36 (2012) 1165–1188. R. Elliott, P. Morss, Big Data: How It Can Improve Our Health jElliott Morss, Retrieved from: http://www. morssglobalfinance.com/big-data-how-it-can-improve-our-health/, 2018.

CHAPTER 4

TRANSFER LEARNING AND SUPERVISED CLASSIFIER BASED PREDICTION MODEL FOR BREAST CANCER

Md. Nuruddin Qaisar Bhuiyan, Md. Shamsujjoha, Shamim H. Ripon, Farhin Haque Proma, Fuad Khan
Department of Computer Science and Engineering, East West University, Dhaka, Bangladesh

4.1 INTRODUCTION There are several methods available for the diagnosis of breast cancer, such as breast exam, mammography, breast ultrasound, and biopsy [1, 2]. Of these, biopsy is the only way to confirm the presence of breast cancer. The most common biopsy techniques are core needle biopsy, fine needle biopsy, and surgical open biopsy. In the biopsy procedure, breast tissue samples are collected and examined under the microscope by pathologists. The whole procedure is based on visual inspection by pathologists. This is a time-consuming, costly task that requires the full attention of the pathologists examining the tissue. Histopathology image analysis also depends on the experience of the pathologist, which is costly and requires a huge amount of time [3]. As a result, there is a pressing need for an automated system that can differentiate between cancerous tissue (malignant tissue) and noncancerous tissue (benign tissue) and help pathologists make the diagnosis process easier and more time efficient, so that they can focus on more difficult cases. A significant amount of research work has already been undertaken to build computer-aided systems that automate the classification of benign and malignant tumor tissue images using image datasets consisting of images of different sizes. To build an automated system, we used one of the largest datasets of breast cancer images, called BreaKHis, containing 7909 images of benign and malignant tissue at four different magnification factors. Convolution neural networks (ConvNets) are a state-of-the-art technique for image classification. There are many available convolution networks released by renowned organizations and institutions; some architecture examples are ResNet-50, Inception V3, Inception ResNet V2, and Xception. These ConvNets are very deep and are trained on millions of images. Training these ConvNets from scratch requires a significant amount of time and can cause overfitting, since the number of training images available is not large enough. For this reason, in this paper, four pretrained ConvNets were used as fixed feature extractors to extract features from the benign and malignant tissue images. To reduce the dimension of the extracted features from the different ConvNet




models, a dimensionality reduction algorithm, PCA, was applied. Then, to perform classification, three different classifiers, logistic regression (LR), support vector machine (SVM), and K-nearest neighbor (K-NN), were used. This model could help pathologists form a preliminary idea of the class to which a breast tissue image belongs, that is, benign or malignant. After that step, it is easier for the pathologists to confirm whether the image belongs to the predicted class or not.

4.2 RELATED WORK A large amount of research work has been done on the automatic prediction of the presence of breast cancer using different datasets. In paper [3], using the BreaKHis image dataset to classify the images into two classes (benign and malignant), the authors used six different feature extractors with four different classifiers and reported accuracies ranging from 80% to 85%. The authors of paper [4], using different pretrained ConvNet models run for thousands of epochs, reported a highest accuracy of 99.8%. In paper [5], the authors reported accuracies ranging from 81% to 90% using DeCAF features with other classifiers and compared their work to other works. In paper [6], the authors presented a comparative study of different machine learning techniques on breast cancer FNA biopsy data; the K-NN with Euclidean distance approach showed a prediction accuracy of 100% with K values of 5, 10, and 11, and it also showed the same accuracy using city-block distance with a K value of 13. In paper [7], the authors applied different machine learning algorithms to the Wisconsin breast cancer dataset and analyzed the performance; they reported accuracies close to 100%, and SVM gave an accuracy of 100%. In paper [8], the authors used ResNet-50 and VGG16 and reported accuracies of 89% and 84%, respectively. In paper [9], the authors used different machine learning algorithms and reported accuracies of 98.8% and 96.33%, respectively, using SVM on two different datasets. In paper [10], the authors reported their best accuracy of 99.038% using MLP on the Wisconsin breast cancer dataset. A large amount of research work has also been done in the area of cancer with different supervised and semisupervised classification, clustering, and feature detection methods for biomedical images [11–20], and with optimization and information security techniques for medical data [21–25], helping to build computer-aided medical systems.

4.3 DATASET AND METHODOLOGIES Dataset: In this work, we have used the BreaKHis [3] breast cancer histopathological image dataset. This dataset contains 7909 RGB images of benign and malignant tissue at four magnification factors (40×, 100×, 200×, 400×) (Fig. 4.1, Table 4.1).

4.3.1 CONVOLUTION NEURAL NETWORKS (CNNS/CONVNETS) Convolution neural networks are the state-of-the-art models for image classification and they are very similar to ordinary neural networks except they have some extra layers. There are three main building blocks of convolution neural networks: Convolution Layer, Pooling Layer, and Fully Connected Layer. These terms are described further below.

FIG. 4.1 The two major types of images (benign and malignant) at four magnification factors (40×, 100×, 200×, 400×).

Table 4.1 Image Distribution by Magnification Factor and Class

Magnification Factor    # of Benign    # of Malignant    Total Images
40×                     625            1370              1995
100×                    644            1437              2081
200×                    623            1390              2013
400×                    588            1232              1820
Total images            2480           5429              7909
# of patients           24             58                82

Based on F.A. Spanhol, L.S. Oliveira, C. Petitjean, L. Heutte, A dataset for breast cancer histopathological image classification, in IEEE Trans. Biomed. Eng., 63 (2016) 1455–1462, doi: 10.1109/TBME.2015.2496264.







• Convolution Layer: In this layer, a set of filters slides along the height and width of the input image, computes dot products, and produces a two-dimensional activation map for each filter. Several activation functions, such as ReLU or sigmoid, can be applied to the activation maps to add nonlinearity.
• Pooling Layer: With a specified filter size, stride, and pooling method, values are extracted from the activation maps. There are several pooling methods, such as max pooling, min pooling, and average pooling. For example, if the filter size is 2 × 2, the stride is 2, and the pooling method is max, then for each 2 × 2 cell in the activation map the output is the maximum value within that cell, after which the filter moves to the next 2 × 2 cell because the specified stride is 2.
• Fully Connected Layer: In this layer, a number of hidden layers with specified neurons and activation functions are declared, and the flattened features of the previously stacked convolution and pooling layers are passed through this fully connected layer.
More details about ConvNets can be found in [26] (Fig. 4.2).
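As a minimal illustration of how these three building blocks are typically stacked, the following sketch uses the Keras library (the same framework used in the implementation in Section 4.5); the layer sizes and the 224 × 224 × 3 input shape are assumptions for illustration only, not a model proposed in this chapter.

    # A minimal sketch of a small ConvNet with the three building blocks described above.
    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

    model = Sequential([
        Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),  # convolution layer with ReLU
        MaxPooling2D(pool_size=(2, 2), strides=2),                         # max pooling, 2x2 filter, stride 2
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2), strides=2),
        Flatten(),                                                         # flatten the feature maps
        Dense(128, activation='relu'),                                     # fully connected layer
        Dense(2, activation='softmax'),                                    # two output classes (benign/malignant)
    ])
    model.summary()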



FIG. 4.2 Sample convolution network architecture. Data from Mathworks.com, Convolutional Neural Network, 2018. Available from: https://www.mathworks.com/solutions/deep-learning/convolutional-neural-network.html. Accessed 10 June 2018.

In this work, four pretrained ConvNet architectures were used: ResNet50 [27], Inception V3 [28], Inception ResNet V2 [29], and Xception [30], with their default parameter settings and average pooling, as implemented in the Keras [31] deep learning library.

4.3.1.1 Transfer learning and convolution networks Transfer learning is a machine learning method that allows a model trained on one task to be used to perform another task. Since the convolution architectures released by different organizations are very large and are trained on the ImageNet database containing 1.2 million images from 1000 categories, training these types of architectures from scratch for custom datasets is not practical, because such datasets are usually not large enough. So, instead of training the whole network, these pretrained networks are used. There are several ways of using a pretrained convolution network, that is, of doing transfer learning with a convolution network. One of them is to use the pretrained convolution network as a fixed feature extractor [32].

4.3.1.2 Convolution networks as fixed feature extractors To use a convolution network as a feature extractor, remove the last fully connected layer of a pretrained convolution network and then use the rest of the architecture as a fixed feature extractor for the custom dataset. Then the extracted features can be used for other purposes [32] (Fig. 4.3).
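A minimal sketch of this idea, assuming the Keras applications API, is shown below: the pretrained network is loaded without its top fully connected layer and with global average pooling, so a single forward pass yields a fixed-length feature vector. The image file name is hypothetical.

    # A minimal sketch: a pretrained ConvNet used as a fixed feature extractor.
    import numpy as np
    from keras.applications.resnet50 import ResNet50, preprocess_input
    from keras.preprocessing import image

    # include_top=False removes the last fully connected layer; pooling='avg' gives a 1D vector.
    extractor = ResNet50(weights='imagenet', include_top=False, pooling='avg')

    img = image.load_img('tissue_sample.png', target_size=(224, 224))      # hypothetical file name
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    features = extractor.predict(x)                                        # shape (1, 2048)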

4.3.1.3 Dimensionality reduction and principal component analysis (PCA) It is difficult to train a learning algorithm with high-dimensional data; here comes the importance of dimension reduction. Dimensionality reduction is a method of reducing the original dimension of the data to a lower dimension without much loss of information. Dimension reduction techniques have two components: one is feature selection and the other is feature extraction. Feature selection is responsible for selecting a subset of the original attributes with specified parameters, and feature extraction is responsible for projecting the data into a lower dimensional space, that is, forming a new dataset with the selected attributes [33]. PCA is one of the popular dimension reduction algorithms; it uses an orthogonal

FIG. 4.3 Convolution network as feature extractor. Data from Mathworks.com, Convolutional Neural Network, 2018. Available from: https://www.mathworks.com/solutions/deep-learning/convolutional-neural-network.html. Accessed 10 June 2018.

linear transformation to project the dataset into a new coordinate system such that the highest variance of some projection of the data lies on the first coordinate, called the first principal component, the second largest variance lies on the second coordinate, called the second principal component, and so on. The PCA steps are [34]:
1. Calculate the covariance matrix.
2. Calculate the eigenvalues and eigenvectors of the covariance matrix.
3. Select the K largest eigenvalues, where K is the dimension of the new subspace.
4. Calculate the projection matrix from the K selected eigenvectors.
5. Transform the dataset through the projection matrix to form a new dataset of dimension K.
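The following is a minimal NumPy sketch of the five steps listed above, applied to a small hypothetical data matrix; it is illustrative only and is not the implementation used later in the chapter (which relies on scikit-learn).

    # A minimal NumPy sketch of the five PCA steps above (illustrative data).
    import numpy as np

    X = np.random.rand(100, 10)                # hypothetical data: 100 samples, 10 features
    X_centered = X - X.mean(axis=0)

    cov = np.cov(X_centered, rowvar=False)     # 1. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # 2. eigenvalues and eigenvectors

    K = 3                                      # 3. keep the K largest eigenvalues
    order = np.argsort(eigvals)[::-1][:K]
    W = eigvecs[:, order]                      # 4. projection matrix from the selected eigenvectors

    X_reduced = X_centered @ W                 # 5. transform to the new K-dimensional dataset
    print(X_reduced.shape)                     # (100, 3)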

4.3.1.4 Supervised machine learning In supervised machine learning, both input and output pairs are used to train the learning algorithm, and it is the task of the algorithm to learn the mapping function from input to output well enough that, when a new input arrives, the function can map the input to the output [35]. Three supervised machine learning algorithms were used in this work for classification purposes: LR, SVM, and K-NN. These algorithms are described further below.
• LR is borrowed from the field of statistics and named after the logistic function. It is the most popular method for binary classification problems. The logistic function is also known as the sigmoid function, an s-shaped curve that transforms any input into a value between 0 and 1:

y = 1 / (1 + e^(-x)) = e^x / (1 + e^x)

Here, e is the base of the natural logarithm, x is the input, and y is the output.



FIG. 4.4 SVM separates two classes, keeping the maximum margin. Data from Docs.opencv.org, Introduction to Support Vector Machines—OpenCV 2.4.13.6 Documentation, 2018. Available from: https://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html, Accessed 10 June 2018.

An example logistic regression equation is:

Y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))

Here Y is the predicted output, b0 is the bias, and b1 is the coefficient of the input x. Thus, for every input, the logistic equation learns the coefficients and uses these learned coefficients for prediction when an unknown input arrives [36].
• SVM is a supervised machine learning algorithm that can be used for classification problems. It separates the data points of different classes by a hyperplane that maximizes the distance (also called the margin) of the nearest point of each class from the hyperplane, as shown in Fig. 4.4. SVM is also called a maximal margin classifier [37].
• The K-NN algorithm requires no learning. It simply stores the whole dataset and, when a new instance arrives, it measures the distances to the k data points around it and labels the new instance with the same label as the closest instance, as illustrated in Fig. 4.5. K-NN is also called instance-based learning [38].

4.4 PROPOSED MODEL Fig. 4.6 demonstrates the overall architecture of the proposed model. In the proposed model, the images at each of the magnification factors are passed through four pretrained ConvNets (ResNet-50, Inception V3, Inception ResNet V2, and Xception). The outputs of these ConvNets are the image features. Then, on the flattened image features, PCA is applied to reduce the dimension of the feature vector. Then,



FIG. 4.5 K-NN to classify a new instance. Data from Medium, A Quick Introduction to K-Nearest Neighbors Algorithm, 2018. Available from: https://medium.com/@adi.bronshtein/a-quick-introduction-to-k-nearest-neighbors-algorithm-62214cea29c7, Accessed 10 June 2018.

FIG. 4.6 Proposed model.

depending on the explained variance ratio, the dimension of the feature vector can be reduced. The reduced feature set is then passed to the classifiers to perform binary classification, automating the classification of benign and malignant images. Classification is performed by three different classifiers: LR, SVM, and K-NN. All the feature extraction, dimension reduction, and classification is done per magnification factor.
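As a rough sketch of how the per-magnification flow of Fig. 4.6 (features, then PCA, then a classifier) could be chained together, the following assumes scikit-learn's Pipeline; the feature matrix and labels are randomly generated stand-ins, and LR is used as the example classifier.

    # A minimal sketch (not the authors' exact code) of chaining the proposed flow.
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X = np.random.rand(200, 2048)              # hypothetical ConvNet feature vectors
    y = np.random.randint(0, 2, 200)           # hypothetical benign/malignant labels
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    clf = Pipeline([
        ('scale', StandardScaler()),
        ('pca', PCA(n_components=0.95)),       # keep 95% of the explained variance
        ('lr', LogisticRegression()),
    ])
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))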



4.5 IMPLEMENTATION System Description: The proposed model is implemented on a system with 8 GB RAM and a Core i7 processor. Tools: The Python programming language is used, with the Python image processing package scikit-image [39], the machine learning package scikit-learn [40], and, for the ConvNets, the deep learning framework Keras.

4.5.1 FEATURE EXTRACTION I. Every image is resized to the required input size for each of the four ConvNet models. After that, the image pixels are rescaled to [-1, +1] (Table 4.2). II. Then the resized and rescaled images are passed through the ConvNet models and a 1D feature vector is collected (Table 4.3). All the ConvNet models are available in the Keras deep learning framework.
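A minimal sketch of steps I and II is given below, assuming scikit-image for reading and resizing and Keras for the pretrained ConvNet; the file paths and the simple [-1, +1] rescaling used here are illustrative assumptions rather than the authors' exact preprocessing code.

    # A minimal sketch of the feature extraction steps (illustrative assumptions).
    import numpy as np
    from skimage import io, transform
    from keras.applications.inception_v3 import InceptionV3

    extractor = InceptionV3(weights='imagenet', include_top=False, pooling='avg')

    def extract_features(paths, size=(299, 299)):
        vectors = []
        for path in paths:
            img = transform.resize(io.imread(path), size)   # step I: resize; result is in [0, 1]
            img = img * 2.0 - 1.0                            # step I: rescale pixels to [-1, +1]
            vec = extractor.predict(img[np.newaxis, ...])    # step II: 1D feature vector, shape (1, 2048)
            vectors.append(vec.ravel())
        return np.array(vectors)

    # features = extract_features(['slide_001.png', 'slide_002.png'])   # hypothetical paths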

4.5.2 DIMENSIONALITY REDUCTION PCA is applied to the feature vectors to reduce the feature dimension. Before applying PCA, the features are standardized. PCA was trained only on the training set, and the projection to the lower dimensional space was applied to both the training and test sets. PCA is applied with an explained variance ratio of 0.95 and is available in the scikit-learn package (Table 4.4).
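The following is a minimal scikit-learn sketch of this step: standardize the features, fit PCA with an explained variance ratio of 0.95 on the training set only, and project both sets. The feature matrices are random placeholders.

    # A minimal sketch of the dimensionality reduction step described above.
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    X_train = np.random.rand(150, 2048)        # hypothetical training feature vectors
    X_test = np.random.rand(50, 2048)          # hypothetical test feature vectors

    scaler = StandardScaler().fit(X_train)     # standardize using training statistics only
    X_train_std = scaler.transform(X_train)
    X_test_std = scaler.transform(X_test)

    pca = PCA(n_components=0.95)               # keep components explaining 95% of the variance
    X_train_red = pca.fit_transform(X_train_std)   # fit only on the training set
    X_test_red = pca.transform(X_test_std)         # project the test set with the same transform
    print(X_train_red.shape[1])                # reduced dimension (cf. Table 4.4)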

Table 4.2 Input Image Size Required for the Four ConvNet Models

Feature Extractor       Input Size
ResNet50                224 × 224
InceptionV3             299 × 299
Inception-ResNet-v2     299 × 299
Xception                299 × 299

Table 4.3 Feature Dimension by the Four Feature Extractors

Feature Extractor       Feature Dimension
ResNet50                2048
InceptionV3             2048
Inception ResnetV2      1536
Xception                2048



Table 4.4 Feature Dimension Reduced After Applying PCA

Feature Extractor       40×     100×    200×    400×
ResNet50                650     659     658     622
InceptionV3             511     521     507     487
Inception ResnetV2      286     297     291     283
Xception                636     652     637     613

4.5.3 CLASSIFICATION Classifiers: For classification purposes, three classifiers are used (LR, SVM, and K-NN). All these classifiers are available in the scikit-learn package. The reduced image feature vectors are passed to the classifiers for classification. Every classifier is first trained on the training set and tested on the test set.

4.5.4 TUNING HYPERPARAMETERS OF THE CLASSIFIERS Some of the parameters of the classifiers are tuned: for LR, the parameter C; for SVM, the parameters C and gamma, both with the rbf kernel; and for K-NN, the parameter n_neighbors. To tune the hyperparameters, a 10-fold cross validation approach is used. For each of the classifiers, 10-fold cross validation is performed on the training set over combinations of the abovementioned hyperparameters, and the hyperparameters giving the best cross validation score are used as the tuned parameters. The performance of each of the classifiers is described in Section 4.6.
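A minimal sketch of this tuning procedure, assuming scikit-learn's GridSearchCV with cv=10, is shown below; the candidate values in the parameter grids are illustrative guesses, since the chapter does not list the exact grids searched.

    # A minimal sketch of 10-fold cross-validated hyperparameter tuning (illustrative grids).
    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier

    X_train = np.random.rand(150, 300)            # hypothetical reduced feature vectors
    y_train = np.random.randint(0, 2, 150)        # hypothetical labels

    searches = {
        'LR':   GridSearchCV(LogisticRegression(), {'C': [0.01, 0.1, 1, 10]}, cv=10),
        'SVM':  GridSearchCV(SVC(kernel='rbf'), {'C': [1, 10, 100], 'gamma': [1e-3, 1e-4]}, cv=10),
        'K-NN': GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5, 7, 9]}, cv=10),
    }
    for name, search in searches.items():
        search.fit(X_train, y_train)
        print(name, search.best_params_, round(search.best_score_, 4))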

4.6 RESULT AND ANALYSIS To evaluate the performance of the three classifiers along with the four feature extractors, a 10-fold cross validation approach was applied on the training set and then a test result was generated on the test set.

4.6.1 10-FOLD CROSS VALIDATION RESULT Table 4.5 below reports the 10-fold cross validation accuracy on the training set for the different combinations of feature extractors and classifiers, for each magnification factor of the images.

4.6.2 MAGNIFICATION FACTOR WISE ANALYSIS ON VALIDATION ACCURACY 4.6.2.1 Validation accuracy of 40×

Interpretation: On the 40× data, almost all of the combinations of feature extractors and classifiers gave a validation accuracy above 90%, and among them the ResNet-50 and LR classifier gave the best cross validation score of 94.17% (Fig. 4.7).



Table 4.5 10-Fold Cross Validation Results (Magnification Factor Wise 10-Fold Cross Validation Accuracy, %)

Feature Extractors       Classifiers    40×      100×     200×     400×
ResNet-50                LR             94.17    94.41    94.09    92.03
                         SVM            90.72    90.32    90.06    89.28
                         K-NN           90.85    88.45    91.06    89.35
Inception V3             LR             92.60    92.18    91.80    88.73
                         SVM            92.54    90.04    90.55    89.00
                         K-NN           91.97    89.06    89.93    87.42
Inception ResNet V2      LR             92.98    92.13    92.86    88.80
                         SVM            94.04    92.30    94.22    89.42
                         K-NN           89.90    88.52    89.62    85.30
Xception                 LR             92.60    91.64    92.91    89.90
                         SVM            90.40    88.88    90.80    87.15
                         K-NN           89.71    88.88    88.26    85.44

FIG. 4.7 Validation accuracy graph for 40×.

4.6.2.2 Validation accuracy of 100× Interpretation: On the 100× data, all of the combinations of feature extractors and classifiers gave a validation accuracy above 88%, and the ResNet50 and LR classifier gave the best cross validation score of 94.41% (Fig. 4.8).

4.6.2.3 Validation accuracy of 200× Interpretation: On the 200× data, all of the combinations of feature extractors and classifiers gave a validation accuracy above 88%, and the Inception ResNet V2 with the Support Vector classifier gave the best cross validation score of 94.22% (Fig. 4.9).



FIG. 4.8 Validation accuracy graph for 100×.

FIG. 4.9 Validation accuracy graph for 200×.

4.6.2.4 Validation accuracy of 400× Interpretation: On the 400× data, most of the combinations of feature extractors and classifiers gave a validation accuracy above 86%, and the ResNet50 and LR classifier gave the best cross validation score of 92.03% (Fig. 4.10).

4.6.2.5 Best validation accuracy Table 4.6 summarizes the best validation accuracy achieved. It is noticeable that for 40×, 100×, and 400×, the ResNet-50 with LR classifier performed better than any other combination.

4.6.2.6 Performance on the test set To evaluate the performance of the combinations of feature extractors and classifiers, some parameters are described below with the help of a sample confusion matrix. In this work, the positive class is Malignant, which means cancer is present and the negative class is Benign, which means cancer is not present.



FIG. 4.10 Validation accuracy graph for 400×.

Table 4.6 Best Validation Accuracy

Magnification Factor    Feature Extractor       Classifier    Validation Accuracy (%)
40×                     ResNet-50               LR            94.17
100×                    ResNet-50               LR            94.41
200×                    Inception ResNet V2     SVM           94.22
400×                    ResNet-50               LR            92.03

Confusion Matrix

Class           Predicted Yes    Predicted No
Actual Yes      TP               FN
Actual No       FP               TN

Accuracy: Accuracy refers to how often the classifier predicts the correct label and is calculated as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision: Precision refers to how often the classifier's positive ("yes") predictions are correct and is calculated as:

Precision = TP / (TP + FP)

Recall: Recall refers to the true positive rate and is calculated as:

Recall = TP / (TP + FN)

F1-score: This is the harmonic mean of precision and recall and is calculated as:

F1-score = (2 * precision * recall) / (precision + recall)

False Positive Rate (FPR): the ratio of FP to the sum of FP and TN.
False Negative Rate (FNR): the ratio of FN to the sum of FN and TP.
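As a small worked illustration of these measures, the following sketch assumes scikit-learn's metrics functions and a handful of hypothetical labels and predictions, with malignant treated as the positive class.

    # A minimal sketch computing the measures defined above (hypothetical labels).
    from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

    y_true = [1, 1, 1, 0, 0, 1, 0, 1]      # 1 = malignant (positive), 0 = benign (negative)
    y_pred = [1, 1, 0, 0, 1, 1, 0, 1]      # hypothetical predictions

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print('Accuracy :', accuracy_score(y_true, y_pred))      # (TP + TN) / (TP + TN + FP + FN)
    print('Precision:', precision_score(y_true, y_pred))     # TP / (TP + FP)
    print('Recall   :', recall_score(y_true, y_pred))        # TP / (TP + FN)
    print('F1-score :', f1_score(y_true, y_pred))
    print('FPR      :', fp / (fp + tn))
    print('FNR      :', fn / (fn + tp))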



4.6.3 RESULT AND ANALYSIS OF TEST PERFORMANCE 4.6.3.1 Test performance on 40× Interpretation: With ResNet50, LR gave the highest precision while the Support Vector classifier claimed the highest F1-score and recall, though they both have the highest accuracy, because the LR model can correctly classify the positive class better than the SVM model (Table 4.7, Fig. 4.11). Interpretation: With InceptionV3, LR and SVM both have the highest recall value, but the Support Vector classifier had the maximum accuracy and precision, leading to the maximum F1-score (Table 4.8, Fig. 4.12). Interpretation: With Inception ResNet V2, SVM had the best accuracy, precision, recall, and F1-score (Table 4.9, Fig. 4.13).

Table 4.7 Result of ResNet-50 with LR, SVM, and K-NN on 40×

Feature Extractor    Classifier    Accuracy (%)    Precision (%)    Recall (%)    F1-Score (%)
ResNet-50            LR            96.24           96.25            98.60         97.41
                     SVM           96.24           95.02            100           97.44
                     K-NN          93.23           92.71            97.80         95.19

FIG. 4.11 Performance of ResNet50 with three different classifiers on 40×.

Table 4.8 Result of Inception V3 with LR, SVM, and K-NN on 40×

Feature Extractor    Classifier    Accuracy (%)    Precision (%)    Recall (%)    F1-Score (%)
Inception V3         LR            94.49           94.30            98.25         96.23
                     SVM           94.74           94.61            98.25         96.40
                     K-NN          91.23           94.16            93.14         93.65



FIG. 4.12 Performance of InceptionV3 with three different classifiers on 40×.

Table 4.9 Result of Inception ResNet V2 with LR, SVM, and K-NN on 40×

Feature Extractor      Classifier    Accuracy (%)    Precision (%)    Recall (%)    F1-Score (%)
Inception ResNet V2    LR            92.23           93.52            95.80         94.65
                       SVM           95.99           96.55            97.90         97.22
                       K-NN          89.47           92.19            92.19         92.19

FIG. 4.13 Performance of Inception ResNet V2 with three different classifiers on 40×.



Table 4.10 Result of Xception with LR, SVM, and K-NN on 40×

Feature Extractor    Classifier    Accuracy (%)    Precision (%)    Recall (%)    F1-Score (%)
Xception             LR            94.49           95.52            96.85         96.18
                     SVM           93.73           93.65            97.90         95.73
                     K-NN          91.48           92.08            94.94         93.49

FIG. 4.14 Performance of Xception with three different classifiers on 40×.

Interpretation: With Xception, Support Vector had the maximum recall but LR gave the best accuracy, precision, and f1score (Table 4.10, Fig. 4.14).

4.6.3.2 Overall performance on 40 × Interpretation: The ResNet50 with both LR and Support Vector classifier had the maximum accuracy but the Support Vector classifier had best recall value. On the other hand, with Inception ResNet V2, Support Vector gave the highest precision (Fig. 4.15).

4.6.3.3 Test performance on 100× Interpretation: With ResNet50, the Support Vector classifier gave the maximum recall value but K-NN had the maximum precision while the LR gave the best accuracy and f1score (Table 4.11, Fig. 4.16). Interpretation: With InceptionV3, the Support Vector classifier had the maximum recall value, but the LR gave the best accuracy, precision, and f1score (Table 4.12, Fig. 4.17).



FIG. 4.15 Test performance graph for 40×.

Table 4.11 Result of ResNet-50 with LR, SVM, and K-NN on 100×

Feature Extractor    Classifier    Accuracy (%)    Precision (%)    Recall (%)    F1-Score (%)
ResNet-50            LR            92.81           93.17            96.47         94.79
                     SVM           90.41           89.32            97.53         93.24
                     K-NN          91.37           94.24            92.91         93.57

FIG. 4.16 Performance of ResNet50 with three different classifiers for 100×.



Table 4.12 Result of Inception V3 with LR, SVM, and K-NN on 100×

Feature Extractor    Classifier    Accuracy (%)    Precision (%)    Recall (%)    F1-Score (%)
Inception V3         LR            90.89           92.10            94.70         93.38
                     SVM           90.65           90.67            96.11         93.31
                     K-NN          88.25           90.69            92.28         91.48

FIG. 4.17 Performance of InceptionV3 with three different classifiers for 100×.

Table 4.13 Result of Inception ResNet V2 with LR, SVM, and K-NN on 100×

Feature Extractor      Classifier    Accuracy (%)    Precision (%)    Recall (%)    F1-Score (%)
Inception ResNet V2    LR            90.65           91.78            94.70         93.22
                       SVM           91.37           91.58            96.11         93.79
                       K-NN          91.85           94.31            93.64         93.97

Interpretation: With Inception ResNet V2, the Support Vector classifier gave the best recall value, but K-NN gave the highest accuracy, precision, and f1score (Table 4.13, Fig. 4.18). Interpretation: With Xception, the Support Vector classifier gave the highest recall but LR had the highest accuracy, precision, and f1score (Table 4.14, Fig. 4.19).

4.6.3.4 Overall performance on 100 × Interpretation: The ResNet50 with LR gave the highest accuracy but Xception with the Support Vector classifier had a higher recall value than others. On the other hand, with Inception ResNet V2, K-NN had the highest precision (Fig. 4.20).



FIG. 4.18 Performance of Inception ResNet V2 with three different classifiers for 100×.

Table 4.14 Result of Xception with LR, SVM, and K-NN on 100×

Feature Extractor    Classifier    Accuracy (%)    Precision (%)    Recall (%)    F1-Score (%)
Xception             LR            91.85           92.78            95.41         94.08
                     SVM           91.37           90.23            97.88         93.90
                     K-NN          89.21           89.80            95.12         92.39

FIG. 4.19 Performance of Xception with three different classifiers for 100×.



FIG. 4.20 Test performance graph for 100×.

Table 4.15 Result of ResNet-50 with LR, SVM, and K-NN on 200×

Feature Extractor    Classifier    Accuracy (%)    Precision (%)    Recall (%)    F1-Score (%)
ResNet-50            LR            94.29           94.89            96.65         95.76
                     SVM           91.81           90.41            98.14         94.12
                     K-NN          92.06           91.01            97.31         94.05

4.6.3.5 Test performance on 200× Interpretation: With ResNet50, SVM had the highest recall value but LR had the highest accuracy, precision, and f1score (Table 4.15, Fig. 4.21). Interpretation: With InceptionV3, Support Vector had the highest accuracy, recall, and f1score but K-NN had the highest precision (Table 4.16, Fig. 4.22). Interpretation: With Inception ResNet V2, the Support Vector classifier had the highest accuracy, precision, recall, and f1score (Table 4.17, Fig. 4.23). Interpretation: With Xception, the Support Vector classifier had the best recall value, K-NN had the best precision, and LR had the highest accuracy and f1score (Table 4.18, Fig. 4.24). Overall performance on 200×



FIG. 4.21 Performance of ResNet50 with three different classifiers for 200×.

Table 4.16 Result of Inception V3 with LR, SVM, and K-NN on 200×

Feature Extractor    Classifier    Accuracy (%)    Precision (%)    Recall (%)    F1-Score (%)
Inception V3         LR            89.83           90.71            94.42         92.53
                     SVM           91.56           90.66            97.40         93.91
                     K-NN          90.07           92.36            93.66         93.01


Support vector

FIG. 4.22 Performance of InceptionV3 with three different classifiers for 200×.




Table 4.17 Result of Inception ResNet V2 with LR, SVM, and K-NN on 200×

Feature Extractor      Classifier    Accuracy (%)    Precision (%)    Recall (%)    F1-Score (%)
Inception ResNet V2    LR            91.32           90.63            97.03         93.72
                       SVM           94.29           94.24            97.40         95.80
                       K-NN          89.83           92.25            93.24         92.74


K-NN

FIG. 4.23 Performance of Inception ResNet V2 with three different classifiers for 200×.

Table 4.18 Result of Xception with LR, SVM, and K-NN on 200×

Feature Extractor    Classifier    Accuracy (%)    Precision (%)    Recall (%)    F1-Score (%)
Xception             LR            90.82           89.73            97.40         93.40
                     SVM           90.32           88.33            98.51         93.15
                     K-NN          88.34           90.25            92.59         91.41


Support vector

FIG. 4.24 Performance of Xception with three different classifiers for 200×.





100

FIG. 4.25 Test performance graph for 200×.

Interpretation: The ResNet50 with LR and Inception ResNet V2 with Support Vector classifier had the highest accuracy but the Xception with Support Vector classifier had the highest recall value while ResNet50 with LR had the highest precision (Fig. 4.25).

4.6.3.6 Test performance on 400× Interpretation: With ResNet50, the Support Vector classifier had the maximum value for accuracy, precision, recall, and f1score (Table 4.19, Fig. 4.26). Interpretation: With InceptionV3, the Support Vector classifier had the maximum value for accuracy, precision, recall, and f1score (Table 4.20, Fig. 4.27). Interpretation: With Inception ResNet V2, the Support Vector classifier had the highest accuracy, precision, recall, and f1score (Table 4.21, Fig. 4.28). Interpretation: With Xception, the LR had the maximum accuracy, precision, recall, and f1score (Table 4.22, Fig. 4.29).

Table 4.19 Result of ResNet-50 with LR, SVM, and K-NN on 400×

Feature Extractor    Classifier    Accuracy (%)    Precision (%)    Recall (%)    F1-Score (%)
ResNet-50            LR            91.48           92.94            94.80         93.86
                     SVM           92.86           93.75            96.00         94.86
                     K-NN          89.65           91.63            93.12         92.37




K-NN

FIG. 4.26 Performance of ResNet50 with three different classifiers for 400×.

Table 4.20 Result of Inception V3 with LR, SVM, and K-NN on 400×

Feature Extractor    Classifier    Accuracy (%)    Precision (%)    Recall (%)    F1-Score (%)
Inception V3         LR            91.21           91.60            96.00         93.75
                     SVM           92.86           91.79            98.40         94.98
                     K-NN          89.01           89.02            95.53         92.16


Support vector

K-NN

FIG. 4.27 Performance of InceptionV3 with three different classifiers for 400×.

4.6.3.7 Overall performance on 400 × Interpretation: Resnet50 with the Support Vector classifier and InceptionV3 with the Support Vector classifier had the highest accuracy but InceptionV3 with the Support Vector classifier gave the highest recall value and ResNet-50 with the Support Vector gave the highest precision (Fig. 4.30).



Table 4.21 Result of Inception ResNet V2 with LR, SVM, and K-NN on 400×

Feature Extractor      Classifier    Accuracy (%)    Precision (%)    Recall (%)    F1-Score (%)
Inception ResNet V2    LR            90.93           92.55            94.40         93.47
                       SVM           92.03           92.66            96.00         94.30
                       K-NN          87.09           89.59            92.69         91.12


Support vector

K-NN

FIG. 4.28 Performance of Inception ResNet V2 with three different classifiers for 400×.

Table 4.22 Result of Xception with LR, SVM, and K-NN on 400×

Feature Extractor    Classifier    Accuracy (%)    Precision (%)    Recall (%)    F1-Score (%)
Xception             LR            91.21           91.60            96.00         93.75
                     SVM           89.84           90.80            94.80         92.76
                     K-NN          84.34           84.23            93.19         88.48


Logistic regression

Support vector

FIG. 4.29 Performance of Xception with three different classifiers for 400×.

K-NN




100

FIG. 4.30 Test performance graph for 400×.

4.7 DISCUSSION In this work, several pretrained deep learning architectures were applied for feature extraction instead of being used as classifiers, which enabled us to save on training time. The dimensions of the features were reduced so that the classifiers could fit them properly within a shorter period of time. The results reveal that the best validation and test accuracy for each of the magnification factors was quite impressive. Besides accuracy, we also analyzed the performance in terms of precision and recall, since precision and recall are important for medical image classification: we always want to classify the tumorous image correctly rather than the nontumorous image, and this can be considered a tradeoff between precision and recall. The results for each of the combinations of feature extractor and classifier were shown and interpreted graphically. As it is not wise to base a medical diagnosis solely on a machine learning model, this model is intended to assist pathologists in the diagnosis of breast tumors.

4.8 CONCLUSION In this work, we have classified breast cancer histopathological images into two major classes, benign and malignant, with our proposed model using deep feature extractors and supervised classifiers. The field of machine learning is huge and there are many other feature extractors and classifiers that could be used to automate this task. Since the overall performance of this model is not 100%, there is room for improvement.



REFERENCES [1] Who.int, Breast Cancer, Available from: http://www.who.int/cancer/prevention/diagnosis-screening/breastcancer/en/, 2015. (Accessed 12 February 2018). [2] Mayoclinic.org, Breast Cancer–Diagnosis and Treatment–Mayo Clinic, Available from: https://www. mayoclinic.org/diseases-conditions/breast-cancer/diagnosis-treatment/drc-20352475, 2018. (Accessed 10 June 2018). [3] F.A. Spanhol, L.S. Oliveira, C. Petitjean, L. Heutte, A dataset for breast cancer histopathological image classification. IEEE Trans. Biomed. Eng. 63 (7) (July 2016) 1455–1462, https://doi.org/10.1109/ TBME.2015.2496264. [4] F.A. Spanhol, L.S. Oliveira, C. Petitjean, L. Heutte, Breast cancer histopathological image classification using convolutional neural networks. in: 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, 2016, pp. 2560–2567, https://doi.org/10.1109/IJCNN.2016.7727519. [5] F.A. Spanhol, L.S. Oliveira, P.R. Cavalin, C. Petitjean, L. Heutte, Deep features for breast cancer histopathological image classification. in: 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Banff, AB, 2017, pp. 1868–1873, https://doi.org/10.1109/SMC.2017.8122889. [6] H. Youh, G. Rumbe, Comparative study of classification techniques on breast cancer FNA biopsy data, Int. J. Int. Multimed. Artif. Intell. 1 (2010) 6–12. [7] M. Meraliyev, M. Zhaparov, K. Artykbayev, Choosing best machine learning algorithm for breast cancer prediction, Int. J. Adv. Sci. Eng. Technol. 5 (3) (2017) 50–54. [8] L. Shen, End-to-End Training for Whole Image Breast Cancer Diagnosis using An All Convolutional Design, eprint arXiv:1708.09427, 2017. [9] A. Osareh, B. Shadgar, Machine learning techniques to diagnose breast cancer. in: 2010 5th International Symposium on Health Informatics and Bioinformatics, Antalya, 2010, pp. 114–120, https://doi.org/ 10.1109/HIBIT.2010.5478895. [10] Agarap, Abien Fred, On Breast Cancer Detection: An Application of Machine Learning Algorithms on the Wisconsin Diagnostic Dataset, https://doi.org/10.1145/3184066.3184080, arXiv:1711.07831, 2017. [11] J. Kriti Virmani, N. Dey, V. Kumar, PCA-PNN and PCA-SVM based CAD systems for breast density classification, in: Applications of Intelligent Optimization in Biology and Medicine, 2016. https://dblp.uni-trier. de/db/series/isrl/isrl96.html. [12] L. Saba, N. Dey, A.S. Ashour, et al., Automated stratification of liver disease in ultrasound: an online accurate feature classification paradigm, Comput. Methods Prog. Biomed. 130 (2016) 118–134. [13] N. Dey, A. Ashour, Classification and Clustering in Biomedical Signal Processing, first ed., IGI Global, Hershey, PA, USA, 2016. [14] S. Cheriguene, N. Azizi, N. Zemmal, N. Dey, H. Djellali, N. Farah, Optimized tumor breast cancer classification using combining random subspace and static classifiers selection paradigms, in: A.E. Hassanien, C. Grosan, T.M. Fahmy (Eds.), Applications of Intelligent Optimization in Biology and Medicine, Intelligent Systems Reference Library, 96, Springer, Cham, 2016, pp. 289–307. [15] N. Zemmal, N. Azizi, N. Dey, M. Sellami, Adaptive semi supervised support vector machine semi supervised learning with features cooperation for breast cancer classification, J. Med. Imaging Health Inf. 6 (1) (2016) 53–62. [16] S. Kamal, N. Dey, S.F. Nimmy, S.H. Ripon, N.Y. Ali, A.S. Ashour, F. Shi, Evolutionary framework for coding area selection from cancer data, Neural Comput. & Applic. 29 (4) (2018) 1015–1037. [17] A. Bhattacherjee, S. Roy, S. Paul, P. Roy, N. Kausar, N. 
Dey, Classification approach for breast cancer detection using back propagation neural network: A study, in: Biomedical Image Analysis and Mining Techniques for Improved Health Outcomes, IGI Global, 2016, pp. 210–221. [18] H. Das, B. Naik, H.S. Behera, Classification of Diabetes Mellitus Disease (DMD): A Data Mining (DM) Approach, Progress in Computing, Analytics and Networking, Springer, Singapore, 2018, pp. 539–549.



[19] R. Sahani, C. Rout, J.C. Badajena, A.K. Jena, H. Das, Classification of intrusion detection using data mining techniques, in: Progress in Computing, Analytics and Networking, Springer, Singapore, 2018, pp. 753–764. [20] H. Das, A.K. Jena, J. Nayak, B. Naik, H.S. Behera, A Novel PSO Based Back Propagation Learning-MLP (PSO-BP-MLP) for Classification, Computational Intelligence in Data Mining, Vol. 2, Springer, New Delhi, 2015, pp. 461–471. [21] C. Pradhan, H. Das, B. Naik, N. Dey, Handbook of Research on Information Security in Biomedical Signal Processing. IGI Global, Hershey, PA, 2018, pp. 1–414, https://doi.org/10.4018/978-1-5225-5152-2. [22] K.H.K. Reddy, H. Das, D.S. Roy, A Data Aware Scheme for Scheduling Big-Data Applications with SAVANNA Hadoop. Futures of Network, CRC Press, 2017. [23] B.S.P. Mishra, H. Das, S. Dehuri, A.K. Jagadev, Cloud Computing for Optimization: Foundations, Applications, and Challenges, 39 Springer, 2018. [24] P.K. Pattnaik, S.S. Rautaray, H. Das, J. Nayak (Eds.), Progress in Computing, Analytics and Networking: Proceedings of ICCAN 2017, Vol. 710, Springer, 2018. [25] C.R. Panigrahi, M. Tiwary, B. Pati, H. Das, Big data and cyber foraging: Future scope and challenges, in: Techniques and Environments for Big Data Analysis, Springer, Cham, 2016, pp. 75–100. [26] Cs231n.github.io, CS231n Convolutional Neural Networks for Visual Recognition, Available from: http:// cs231n.github.io/convolutional-networks/, 2018. (Accessed 25 September 2018). [27] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, Arxiv.org. Available from: https://arxiv.org/abs/1512.03385, 2018. (Accessed 25 September 2018). [28] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the Inception Architecture for Computer Vision, Arxiv.org. Available from: https://arxiv.org/abs/1512.00567, 2018. (Accessed 25 September 2018). [29] C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi, Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, Arxiv.org. Available from: https://arxiv.org/abs/1602.07261, 2018. (Accessed 25 September 2018). [30] F. Chollet, Xception: Deep Learning with Depthwise Separable Convolutions, Arxiv.org. Available from: https://arxiv.org/abs/1610.02357, 2018. (Accessed 25 September 2018). [31] Keras.io, Keras Documentation, Available from: https://keras.io/, 2018. (Accessed 10 June 2018). [32] Cs231n.github.io, CS231n Convolutional Neural Networks for Visual Recognition, Available from: http:// cs231n.github.io/transfer-learning/, 2018. (Accessed 10 June 2018). [33] GeeksforGeeks, Introduction to Dimensionality Reduction-GeeksforGeeks, Available from: https://www. geeksforgeeks.org/dimensionality-reduction/, 2018. (Accessed 11 June 2018). [34] Plot.ly, Principal Component Analysis, Available from: https://plot.ly/ipython-notebooks/principalcomponent-analysis/, 2018. (Accessed 11 June 2018). [35] J. Brownlee, Supervised and Unsupervised Machine Learning Algorithms. Machine Learning Mastery, Available from: https://machinelearningmastery.com/supervised-and-unsupervised-machine-learningalgorithms/, 2018. (Accessed 10 June 2018). [36] J. Brownlee, Logistic Regression for Machine Learning. Machine Learning Mastery, Available from: https:// machinelearningmastery.com/logistic-regression-for-machine-learning/, 2018. (Accessed 10 June 2018). [37] M. Learning, U. Code, Understanding Support Vector Machine Algorithm From Examples (Along With Code). 
Analytics Vidhya, Available from: https://www.analyticsvidhya.com/blog/2017/09/understaingsupport-vector-machine-example-code/, 2018. (Accessed 10 June 2018). [38] J. Brownlee, K-Nearest Neighbors for Machine Learning. Machine Learning Mastery, Available from: https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/, 2018. (Accessed 10 June 2018). [39] Scikit-image.org, Scikit-Image: Image Processing in Python—Scikit-Image, Available from: http://scikitimage.org/, 2018. (Accessed 10 June 2018). [40] Scikit-learn.org, scikit-Learn: Machine Learning in Python—Scikit-Learn 0.19.1 Documentation, Available from: http://scikit-learn.org/stable/, 2018. (Accessed 10 June 2018).

86

CHAPTER 4 TRANSFER LEARNING AND SUPERVISED CLASSIFIER

FURTHER READING Docs.opencv.org, Introduction to Support Vector Machines—OpenCV 2.4.13.6 Documentation, Available from: https://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html, 2018. (Accessed 10 June 2018). Mathworks.com, Convolutional Neural Network, Available from: https://www.mathworks.com/solutions/ deep-learning/convolutional-neural-network.html, 2018. (Accessed 10 June 2018). Medium, A Quick Introduction to K-Nearest Neighbors Algorithm, Available from: https://medium.com/@adi. bronshtein/a-quick-introduction-to-k-nearest-neighbors-algorithm-62214cea29c7, 2018. (Accessed 10 June 2018).

CHAPTER 5

CHRONIC TTH ANALYSIS BY EMG AND GSR BIOFEEDBACK ON VARIOUS MODES AND VARIOUS MEDICAL SYMPTOMS USING IOT

Rohit Rastogi*, D.K. Chaturvedi†, Santosh Satya‡, Navneet Arora§, Mayank Gupta¶, Vishwas Yadav*, Sumit Chauhan*, Pallavi Sharma*
*ABES Engineering College, Ghaziabad, India; †Dayalbagh Educational Institute, Agra, India; ‡Indian Institute of Technology, Delhi, India; §Indian Institute of Technology, Roorkee, India; ¶Tata Consultancy Services, Noida, India

5.1 INTRODUCTION AND BACKGROUND
5.1.1 BIOFEEDBACK
The term "biofeedback" was voted in place of the term "auto regulation" in 1969. The organization that coined this word was named the "Biofeedback Research Society" (BRS). In 1976, the BRS was renamed the "Biofeedback Society of America" (BSA). The present name of the society, the "Association for Applied Psychophysiology and Biofeedback," came into existence in 1989. Edmund Jacobson, a physician, was one of the earliest contributors to the field of biofeedback. In 1938, he monitored electromyography (EMG) data of patients practicing progressive muscle relaxation to find out whether the muscles actually relaxed. Previously, it was believed that autonomic responses could not be controlled voluntarily. Neal Miller and Leo DiCara demonstrated in 1962 that curarized rats could learn to control their autonomic functions (breathing patterns, muscle tone, blood pressure, salivation, GSR, etc.). In 1966, Joe Kamiya, popularly known as "the father of biofeedback," found that some subjects could learn to discriminate the presence of alpha waves when electroencephalography (EEG) was performed on them. He also found that they could learn to shift their alpha frequency by about 1 Hz, thus establishing that subjects could control their own neuro-biological rhythm. The physicians Marinacci and Whatmore practiced biofeedback even before the term was coined: they used EMG biofeedback to treat stroke patients. However, their work on neuromuscular re-education was not continued by others and remained undeveloped until it was rediscovered. Significant clinical contributions to this field have been made by researchers such as (a) Basmajian, who used surface EMG to study the role of different muscles in movements and used the information for rehabilitation, (b) A. Kegel, who used pneumatic biofeedback devices to train pelvic floor muscles,
(c) Johann Stoyva, who used biofeedback for treating anxiety, and (d) Thomas Budzynski, who used surface electromyography (sEMG) for the treatment of headaches [1]. The Association for Applied Psychophysiology and Biofeedback (AAPB), the Biofeedback Certification Institute of America (BCIA), and the International Society for Neurofeedback and Research (ISNR) summoned a task force of renowned clinicians and scientists in late 2007 to design a standard definition for biofeedback. They defined biofeedback as "a process that enables an individual to learn how to change physiological activity for the purposes of improving health and performance" [2]. Instruments measure physiological activity such as muscle activity, heart function, breathing, skin temperature, etc. These biofeedback instruments quickly and precisely give "feedback" information. The information is used, often in combination with changes in thinking, emotions, and behavior, to support the needed physiological changes [2]. With the use of this information (biofeedback), patients learn increased control over the physiological process (operant learning model) [3]. Over time, these changes can be preserved without continuous use of an instrument [1]. Any learning is facilitated by feedback. The same principle is used in biofeedback therapy, whose main aim is to assist patients in self-regulation of psycho-physiological factors, thereby allowing them to gain voluntary control over physiological parameters. The first report of learned behavioral control over physiological responses was published in 1961. In the 1960s and 1970s, human studies revealed that, through various operant feedback methods, voluntary control could be learned over many physiologic responses (e.g., heart rate, muscle tension, blood pressure, skin conductance, skin temperature, evoked potentials, and various rhythms of the EEG). Biofeedback is apparently free of any adverse side effects and is therefore seemingly the preferable choice for treatment of psychosomatic disorders. Biofeedback therapy has evolved over the last 30 years, and today there are innumerable disorders for which it has been used. Biofeedback therapy is now used for a variety of disorders, such as headaches (migraine, tension, and mixed), urinary incontinence, essential hypertension, etc., with reliable results. Biofeedback, also called neurotherapy, is a self-regulating, relaxation-accelerating technique that is used to control a person's stress level [4]. It works by preventing illness with the help of stress management methods. The treatment enhances the quality of life and sharpens coping skills. In other words, it is a psycho-physiological technique used to enhance the overall wellness of the body and mind [2]. One idea of biofeedback is to reduce stress via self-control [5]. As mentioned earlier, it uses a set of definite techniques for the reduction of tension. These techniques include efficient decision-making capability, twilight learning/permissive concentration, and autogenic feedback training. If a person can use these techniques to gain self-control, they have a better chance of overall wellness. Biofeedback can help in reducing chronic pain symptoms [6] and stress symptoms, and serves as an alternative method of healthcare as opposed to drugs. When biofeedback proves beneficial, it is often preferred over prescription drugs because of the high cost of medication and its potential for dependency [1].
Many previous studies have shown that biofeedback does indeed work, especially in children and young adults. A study set in a college showed benefits to students who practiced the biofeedback technique. They attended workshops and worked in individual sessions. Along with attending workshops, they also kept a daily diary [7]. They also maintained a stress-control log and changed their sleeping patterns. The information was collected and assessed, and the students/technocrats who participated increased their focus and GPA considerably [8].


Another case study, which focused on biofeedback as a form of relaxation training, also reduced stress in its participants. The group of volunteers was measured on a self-report scale along with temperature and an EMG. Although the temperature showed no effect, the volunteers retained their improvements weeks after their biofeedback training.

5.1.2 MENTAL HEALTH INTRODUCTION
Drug treatment and negative lifestyle changes can severely damage the human recovery system [5,6]. In the United Kingdom, about 25% of people suffer from mental health issues. The United States has the highest incidence of people diagnosed with mental health issues [4].

5.1.3 IMPORTANCE OF MENTAL HEALTH, STRESS, AND EMOTIONAL NEEDS AND SIGNIFICANCE OF STUDY
A person experiences basic needs such as autonomy, feeling capable, and developing relationships. When all of these needs are fulfilled, the person experiences improvements in well-being. They also become more flexible, rather than sensitive to problems. However, many common tendencies, such as the inclination to conceal personal problems or to work extensive hours, can reduce the likelihood that such needs are fulfilled. As we all know, the four main stages of human development are infancy, childhood, adolescence, and adulthood. Adolescence is the most transitory state and the most important phase of life for every individual; it is the intermediate period between childhood and adulthood. People with good mental health feel comfortable about themselves. They are not intimidated by their own emotions of fear, anger, love, jealousy, guilt, and worry. They have a tolerant, easy-going attitude toward themselves and others. They are able to give love and consider the interests of others. They are able to meet the demands of life. They solve their problems and are not disturbed by them. They try to shape the environment accordingly. They accept challenges and make use of their capabilities. They set realistic goals for themselves and make their own decisions. They put their own efforts into what they do and get satisfaction out of doing it. "Mental health refers to our cognitive and/or emotional well-being": it is all about how we think, feel, and behave. Having a mental health issue can also mean something short of a mental illness. We selected this work to observe the impact of chronic tension-type headache (TTH) stress and affirmative ideas on the mood states relevant to mental health (anxiety, stress, depression, aggression, fatigue, guilt, extraversion, and arousal) of the students and technocrats of the institutes [DSVV Hardwar and ABES Engineering College, Ghaziabad]. For this, electromyography and electrodermal responses were captured and recorded. The concept of mental health is as old as human beings. In recent years, clinical psychologists as well as educators have started giving proper attention to the study of mental health; however, in India, relatively few works have been conducted. Thus, the concept of mental health takes a "Gestalt" view of the individual. It incorporates the concepts of personality characteristics and behavior all in one. It may also be understood as the behavioral characteristics of the person.


5.1.4 MEANING OF MENTAL HEALTH
Mental health is a level of psychological well-being, or the absence of mental illness. It is the "psychological state of a person who is functioning at a satisfactory level of emotional and behavioral adjustment." From the perspective of positive psychology, mental health may include an individual's ability to enjoy life activities and efforts to achieve psychological flexibility. According to the WHO, mental health includes "subjective well-being, perceived self-efficacy, autonomy, competence, intergenerational dependence and self-actualization of one's intellectual and emotional potential, among others." The WHO further states that the mental health of an individual depends upon the realization of their capabilities, their ability to cope with the stresses of life, productive work, and contribution to their community. However, cultural differences and subjective assessment theories affect how "mental health" is defined.

5.1.5 DEFINITIONS
"Mental health represents those insights and behaviors that determine an individual's overall level of personal effectiveness, success, happiness, and excellence of functioning as a person." Mental health also includes an individual's capability to enjoy life, attaining an equilibrium between life activities and efforts to achieve psychological flexibility. Mental health is also described as "behavioral and emotional normality; the non-presence of a mental illness or behavioral disorder; a state of psychological well-being in which an individual has obtained a satisfactory integration of one's instinctual drives acceptable to both oneself and one's social environment; an appropriate balance of love, work, and leisure pursuits" [Medilexicon's medical dictionary], and as "a person's overall emotional and psychological condition." Mental health is a broad term. Some use it as a simple adjective to describe our brain's health. Others use it more broadly to indicate our psychological state. Still others add emotion into the definition. I believe a good definition includes all of the above factors: mental health describes our social, emotional, and psychological condition, all encapsulated into one.

Mental health, just like our physical health, operates on a continuum. Mental health is not just the absence of mental illness. It is defined as a state of well-being in which one is confident in one's own capabilities, able to cope with the normal stresses of life, able to work productively, and able to make a contribution to his or her community. Mental ill health refers to the kind of common mental health problems that we can all experience in certain stressful circumstances, e.g., poor concentration, mood swings, and sleep disturbances. Such problems are usually of a temporary nature, are relieved once the demands of a particular situation are removed, and generally respond to support and reassurance. Every one of us has suffered from a mental health problem, but that does not mean that one is always mentally ill. Being mentally ill restricts our capabilities as human beings and may lead to more serious problems. Mental illness can be defined as the experience of severe and distressing psychological symptoms to the extent that normal functioning is seriously impaired. Examples of such symptoms include anxiety, depressed mood, obsessive thinking, delusions, and hallucinations.


Some form of medical help is usually needed for recovery/management, and this may take the form of counseling or psychotherapy, drug treatment, and lifestyle changes. Approximately 25% of people in the United Kingdom have mental health problems at some point during their lives. The United States has the highest incidence of people diagnosed with mental illness. Mental health can affect daily life, relationships, and even physical health. After reviewing the literature in this field, the six indices of mental health used herein are emotional stability, adjustment, autonomy, security, self-concept, and intelligence.

5.1.6 FACTORS AFFECTING MENTAL HEALTH
There are generally nine factors that affect mental health:
• Exercise and activity level
• Smoking
• Diet
• Physical activity
• Abuse
• Social and community activities
• Relationships
• Meditation and other relaxation techniques
• Healthy sleep patterns

5.1.7 MODELS OF STRESS: THREE MODELS IN PRACTICE
1. General Adaptation Syndrome (Figs. 5.1–5.3): stages of alarm, resistance, and exhaustion
2. Selye: eustress and distress (Fig. 5.4A)
3. Lazarus: cognitive appraisal model.

FIG. 5.1 Phases of stress: the immune system's adaptive response to stress over time, relative to the normal level of resistance, in four phases (onset shock, compensation, resistance, decompensation).

FIG. 5.2 General adaptation syndrome: resistance to stress over time through the alarm, resistance, and exhaustion stages.

FIG. 5.3 Stress growth simulation: perception of a danger (e.g., a snake) represents danger and produces a change in body state; fear arises as the perception of that body change.

5.1.7.1 Types of stress Stress management can be complicated and confusing, as there are different types of stress: (1) acute stress (emotional problem, muscular pressure), (2) episodic acute stress, and (3) chronic stress, each with its own characteristics, symptoms, duration, and treatment approaches.


FIG. 5.4 (A) Stress performances. (B) Big data characteristics.

5.1.7.2 Causes of stress
Some classes of stressful situations are as follows: (1) uncertainty and under-stimulation, (2) information overload, (3) danger, (4) ego-control failure, (5) ego-mastery failure, (6) threat to self-esteem, and (7) threat to the esteem of others.

5.1.7.3 Symptoms of stress
Stress warning signs and symptoms are as follows. The person or subject under study may show cognitive symptoms: loss of memory, inability to concentrate on day-to-day issues, difficulty in making decisions, a negative or pessimistic attitude, anxious thinking, and excessive worrying. Among behavioral characteristics, the subject may show procrastination, neglect of responsibilities, social withdrawal, sleepiness, habitual pessimism, use of alcohol and cigarettes, and addiction to drugs. The physical state of the subject may also vary, and he or she may suffer from aches and pains, chest pain, a fast heartbeat, reduced interest in sexual relationships, constipation, frequent colds, nausea, and dizziness. Emotional features of such subjects include depressive and sadistic behavior, a sense of isolation and loneliness, moodiness, anger and a short temper, inability to relax, and feeling emotional and agitated.

5.1.8 BIG DATA AND IOT
Big data refers to the collections of data that are being generated at a tremendous rate around the world. These data can be structured or unstructured, and they are so large and complex that it is difficult to process them using traditional data processing applications. To overcome the processing and storage difficulties of big data, the open-source Hadoop framework was introduced. Hadoop is an open-source distributed processing framework that is used to store a tremendous amount of data, i.e., big data and its processing outputs. Big data has different characteristics, which are defined using four V's (Fig. 5.4B) [9].
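To make the processing model concrete, the short sketch below illustrates, in plain Python on a tiny in-memory sample, the map/shuffle/reduce pattern that Hadoop's MapReduce applies at scale; the record fields and group labels are hypothetical and only for illustration.

```python
# Minimal sketch (assumed record layout): the map -> shuffle -> reduce pattern used by
# Hadoop MapReduce, applied to a few hypothetical headache-diary records.
from collections import defaultdict

records = [
    {"subject": 1, "group": "GSRa", "headache_hours": 4.5},
    {"subject": 2, "group": "EMGav", "headache_hours": 2.0},
    {"subject": 3, "group": "GSRa", "headache_hours": 6.0},
]

# Map: emit (key, value) pairs; here the key is the group and the value is headache hours.
mapped = [(r["group"], r["headache_hours"]) for r in records]

# Shuffle: group the emitted values by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: aggregate each key's values; here, total headache hours per group.
totals = {key: sum(values) for key, values in grouped.items()}
print(totals)  # e.g. {'GSRa': 10.5, 'EMGav': 2.0}
```

On a real cluster the same map and reduce steps would run in parallel over HDFS blocks; the logic per record stays the same.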


The Internet of Things (IoT) is a system or network that connects physical objects to the Internet through routers or network devices to collect and share data without manual intervention. IoT provides a common platform and language through which physical devices can dump their data and communicate with each other. Data emitted from various sensors are securely delivered to IoT platforms and integrated there; necessary and valuable information is extracted, and, finally, the results are shared with other devices to improve efficiency and enable better automation of the user experience.
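A minimal sketch of the first step of such a pipeline is shown below: one biofeedback sensor sample is packaged as a timestamped JSON payload for an IoT platform. The function names and field names are assumptions for illustration; `send_to_platform` is a hypothetical placeholder for whatever transport (MQTT, HTTP, etc.) a real deployment would use.

```python
# Illustrative sketch (assumed names): packaging a sensor reading as JSON for an IoT platform.
import json
import time

def build_payload(subject_id: int, sensor: str, value: float, unit: str) -> str:
    """Serialize one sensor sample with a timestamp so the platform can integrate it later."""
    return json.dumps({
        "subject_id": subject_id,
        "sensor": sensor,      # e.g. "EMG" or "GSR"
        "value": value,
        "unit": unit,          # e.g. "microvolt" or "kilo-ohm"
        "timestamp": time.time(),
    })

def send_to_platform(payload: str) -> None:
    # Hypothetical stand-in for the real publish/POST call of the chosen IoT platform.
    print("publishing:", payload)

send_to_platform(build_payload(subject_id=7, sensor="GSR", value=350.0, unit="kilo-ohm"))
```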

5.2 PREVIOUS STUDIES (LITERATURE REVIEW)
5.2.1 TENSION TYPE HEADACHE AND STRESS
Stress is a state of physiological or psychological strain due to undesirable stimuli (physical, mental, or emotional; internal or external) that could disturb the functioning of an individual [10]. Acute stress is generally short-lived and causes no lasting damage, whereas chronic stress can cause a sustained stress response, leading to damage and chronic pain. Stress reactions (responses to stress) amplify physiological parameters such as muscle tension, blood pressure, sweating, etc. This causes disorders in the body such as headaches, irritable bowel syndrome, ulcers, hyperhidrosis, chest pain, etc. Eventually, it results in a vicious cycle wherein stress causes pain or stress-related disorders, and the increased pain or other symptoms lead to further amplification of stress [11]. Mental stress and tension are the most frequently recorded triggers of chronic tension-type headaches (TTH) [12,13]. Genetic or family-related social and environmental factors are also associated with TTH [11,14]. In addition to physical variables such as muscle tension, electro-dermal activity, temperature, etc. and other demographic variables of pain, the occurrence of headaches is empirically associated with psychological risk factors. These comprise social support, hypnotizability, affect, life events, and negative thinking [11]. Negative affectivity is genetically elevated in chronic headache sufferers, causing over-reporting of somatic symptoms such as headaches, irrespective of organic disease [12,13]. This indicates that mental health is strongly affected in patients with TTH, and therefore a good deal of attention should be paid to the psychological component when assessing and taking measures to improve the mental health of patients with TTH. A headache is a clinical syndrome affecting 91% of males and 96% of females at some time during their life. The World Health Organization recognizes primary headaches as among the first 20 major causes of disability. In primary care practice, the tension-type headache is the most commonly diagnosed variety of primary headache. The disorder formerly called tension headache or muscle contraction headache is the most frequently occurring headache disorder; it is the most common of the primary headaches and the most dominant and costly headache. Tension-type headaches are responsible for nearly 90% of all headaches. According to the International Headache Society (IHS), its lifetime occurrence in the overall population observed in different studies varies from 30% to 78%. In spite of its high prevalence, and regardless of the fact that it has the highest socio-economic impact, it is still the least studied of the primary headache disorders. Published estimates of the prevalence of the tension-type headache vary over a wide range, from 1.3% to 65% in men and 2.7% to 86% in women [15]. A World Health Organization (WHO) statement released in 2000 on headache disorders and public health notes that the onset of TTH is often in adolescence, and that prevalence peaks in the fourth decade and subsequently declines [16].


However, cross-sectional epidemiological studies found the average age of onset of TTH to be 25–30 years. The prevalence peaks between the ages of 30 and 39 and decreases slightly with age. Some of the reported risk factors for developing TTH are poor self-rated health, inability to relax after work, and sleeping only a few hours per night. Two Danish studies showed that the number of workdays missed was three times higher for TTH than for migraine in the population, and a US study also found that work absence due to TTH is considerable. In a cohort study that followed elderly patients with chronic tension-type headache (CTTH) over a span of 13 years, the authors found that 30% of patients with CTTH evolved to chronic migraine (CM) or episodic migraine (EM). Therefore, it is important to curb the tension-type headache before it transforms into a migraine, which could be more difficult to treat because of its complex nature. Tension-type headache is clinically and pathophysiologically heterogeneous. The complex interrelation of the various pathophysiological factors of TTH makes this disorder often difficult to treat. Various therapeutic measures have been recommended, to be used in sequence or in combination. Therapies for TTH can be subdivided into short-term, abortive treatment of each attack (mainly pharmacological) and long-term prophylactic treatments (pharmacological and/or nonpharmacological) [15]. Several nonpharmacological treatments have been recommended for the management of TTH, e.g., physical therapy [16], craniocervical training, oro-mandibular treatment, acupuncture, relaxation therapies, cognitive training, biofeedback, etc. However, the scientific evidence for the efficacy of most treatment modalities is sparse [15]. Biofeedback is one of the most prominent behavioral headache treatments. It is a nonpharmacologic technique commonly used in the treatment of migraines and TTH. Many published studies have reported that biofeedback is effective in decreasing the frequency and severity of headaches, thereby limiting the patient's dependence on medication. In line with this, studies have also proposed that biofeedback may bring about a reduction in medical utilization for headaches [17].

5.3 INDEPENDENT VARIABLE: EMOTIONAL NEED FULFILLMENT
The ability of individuals to control their anger, stabilize their mood, or achieve success in work depends 80% on their emotional intelligence and 20% on their verbal intelligence. Basic human needs for emotional fulfillment are universal and include love, acceptance, affection, feeling valued, appreciation, security, companionship, admiration, trust, respect, understanding, conversation, and communication. Maslow, in 1954, described a hierarchy of human needs in his "Theory of Self-Actualization": physiological, safety, belonging, love, esteem, and self-actualization.

5.4 MEDITATION—EFFECTIVE SPIRITUAL TOOL WITH APPROACH OF BIOFEEDBACK EEG
5.4.1 MIND-BODY AND CONSCIOUSNESS
There is a thin layer between the body and the conscious mind, interconnected through the chain of breathing. The body interacts through its senses and responds with its actions (Table 5.1). There are three main stages of the daily routine (waking, dreaming, and deep sleep), the last of which is still largely unknown (Figs. 5.5 and 5.6).


Table 5.1 The Rays Emitted by an Individual in Different Moods

Beta (14–30 Hz): Typical level of daily mental activity, associated with tension or stress.
Alpha (8–13 Hz): Relaxed, passive attention, often considered the goal of relaxation exercises. While this is a very relaxing state, and useful to practice, it is sometimes incorrectly thought to be the goal of Yoga Nidra.
Theta (4–7 Hz): Normally considered to be unconscious, possibly drowsy, or half asleep. This level is also sometimes incorrectly considered to be the level of Yoga Nidra, where there is still the experience of images and streams of thought.
Delta (0.5–3.5 Hz): Considered to be unconscious, dreamless sleep. In Yoga Nidra, the brain waves are at this level, as if the practitioner is in conscious deep sleep, beyond the activities experienced at the other levels.

FIG. 5.5 Biofeedback measurements.

5.5 SENSOR MODALITIES AND OUR APPROACH
5.5.1 BIOFEEDBACK BASED SENSOR MODALITIES
Three professional biofeedback organizations, the Association for Applied Psychophysiology and Biofeedback (AAPB), the Biofeedback Certification International Alliance (BCIA), and the International Society for Neurofeedback and Research (ISNR), concluded the following definition of biofeedback in 2008:


FIG. 5.6 Factors of mood states: the eight dependent variables are anxiety, stress, depression, regression, fatigue, guilt, extraversion, and arousal.

"Biofeedback is a process that enables a person to learn how he or she can change physiological activities for the purposes of enhancing health and performance. Instruments measure brainwaves, heart function, breathing, muscle activity, skin temperature, and other such physiological activity." The instruments give quick and accurate "feedback" to the users. This information is often given in conjunction with changes in thinking, emotions, and behavior. Over time, these changes can endure without the continuous use of an instrument [10].

5.5.2 ELECTROMYOGRAPH
The "Muscle Whistler," shown here with surface EMG electrodes, was an early device developed by Dr. Harry Garland and Dr. Roger Melen in 1971 [5,14]. An electromyograph (EMG) uses surface electrodes to detect muscle action potentials from the underlying skeletal muscles that initiate muscle contraction. Clinicians record the sEMG using one or more active electrodes placed over a target muscle and a reference electrode placed within 6 in. of either active electrode. The sEMG is measured in microvolts (millionths of a volt) [6,8].

5.5.3 ELECTRODERMOGRAPH
An electrodermograph (EDG) measures the skin's electrical activity directly (skin conductance and skin potential) and indirectly (skin resistance) using electrodes placed over the digits or the hand and wrist. Responses to unexpected stimuli, arousal, worry, and cognitive activity can increase eccrine sweat gland activity, raising the conductivity of the skin for electric current [18].


5.5.4 PROPOSED FRAMEWORK
The proposed framework is explained with the help of the different diagrams shown in Fig. 5.7.

5.6 EXPERIMENTS AND RESULTS—STUDY PLOT
5.6.1 STUDY DESIGN AND SOURCE OF DATA
The clinical trial was conducted as a randomized, single-blinded, prospective controlled study. Subjects were recruited from different neurology clinics, and the psychology facility of Dev Sanskriti Vishwa Vidyalaya (DSVV), Shantikunj, Haridwar was used for the biofeedback therapy.

5.6.2 STUDY DURATION AND CONSENT FROM SUBJECTS
Subjects were recruited from mid-January 2017 to May 2018 and followed until July 2018. Consent was obtained from each subject before recruitment into the trial. An informed-consent form approved by the ethical committee was completed by the subjects, and clinical authentication for the trial was obtained. Ethical clearance was granted by the ethical committee formed by the University.

5.6.3 SAMPLING DESIGN AND ALLOCATION PROCESS
Simple random sampling was used, with the lottery method for allocation of subjects to seven groups. Subjects with TTH were enrolled in the study. Subjects who did not consent or who did not meet the eligibility criteria were excluded. The remaining subjects were randomized using the lottery method: chits numbered 1 to 7 were placed in a bowl, and the subjects were asked to pick a chit. Subjects were allocated to the group corresponding to their chit number (a minimal simulation of this allocation is sketched below):
Grp 1: EMG auditory biofeedback (EMGa)
Grp 2: GSR auditory (GSRa)
Grp 3: EMG visual biofeedback (EMGv)
Grp 4: GSR visual (GSRv)
Grp 5: EMG auditory-visual biofeedback (EMGav)
Grp 6: GSR auditory-visual (GSRav)
Grp 7: Control group.
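The sketch below simulates this lottery-style allocation in Python; it is not the investigators' code, and the subject identifiers, group size, and seed are assumptions for illustration.

```python
# Illustrative sketch: simulating the chit/lottery allocation of subjects to seven groups.
import random

GROUPS = ["EMGa", "GSRa", "EMGv", "GSRv", "EMGav", "GSRav", "Control"]

def allocate(subject_ids, per_group=30, seed=None):
    """Build one 'chit' per available group slot, shuffle them, and draw one per subject."""
    chits = [g for g in GROUPS for _ in range(per_group)]
    rng = random.Random(seed)
    rng.shuffle(chits)
    return dict(zip(subject_ids, chits))

# 7 groups x 30 subjects = 210 hypothetical subject IDs
allocation = allocate(range(1, 211), per_group=30, seed=42)
print(allocation[1], allocation[2])
```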

5.6.4 SAMPLE SIZE
The formula used for calculating the sample size was as follows: N = 2(Zα + Zβ)² × pq / (p1 − p2)²

Probability of Type I error was set at .05. Power of the study was set at 80% (0.8).


p1 = 1.0 and p2 = 0.75 were the mean pre-to-post differences (baseline to 1 year) in the average frequency of headaches per month in the EMG biofeedback training group and the pain management group, respectively, from a study by Mullay et al. p = 0.875 was calculated as (p1 + p2)/2, and q = 0.125 was calculated as 1 − p. The sample size thus calculated was 26.6 per group. The sample size was set at 30 per group in order to accommodate dropouts.
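As a quick check, the sketch below plugs the chapter's numbers into the formula, assuming a two-sided Type I error and the standard normal quantiles shown; the exact result depends on the Z values used, so it lands near, but not exactly at, the 26.6 reported above. SciPy is assumed to be available.

```python
# Illustrative check: N = 2(Z_alpha + Z_beta)^2 * p*q / (p1 - p2)^2 with the chapter's inputs.
from scipy.stats import norm

alpha, power = 0.05, 0.80
z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96 for a two-sided test
z_beta = norm.ppf(power)            # ~0.84 for 80% power

p1, p2 = 1.0, 0.75
p = (p1 + p2) / 2                   # 0.875
q = 1 - p                           # 0.125

n_per_group = 2 * (z_alpha + z_beta) ** 2 * p * q / (p1 - p2) ** 2
print(round(n_per_group, 1))        # ~27.5 with these Z values; the chapter reports 26.6,
                                    # and either way it is rounded up to 30 to allow for dropouts.
```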

5.6.5 STUDY POPULATION
5.6.5.1 Inclusion criteria
Subjects included in the study were:
• subjects with headaches fulfilling the criteria for TTH determined by the International Headache Society
• both males and females between 18 and 65 years of age.

5.6.5.2 Exclusion criteria
Subjects excluded from the study were: those who had consumed complementary medicines in the previous 6 months; those who had drug addictions or used analgesic medicines and triptans more than 10 days per month; those suffering from severe psychiatric or somatic issues; those who suffered from more than one type of ache other than TTH, or whose headache started after 50 years of age; and those suffering from a headache disorder of ICHD type 2 classification under international standards.

5.6.6 INTERVENTION
After allocation of the subjects to the seven groups, all subjects were informed about the treatment procedure in detail. Biofeedback (BF) training was provided in a separate room of the physiotherapy research laboratory of DSVV, Hardwar, which had minimal lighting and minimal external noise, to facilitate relaxation. All subjects underwent the respective (EMG/GSR) BF training for 20 min per session for seven sessions, with one session per day. If a subject missed a session, the biofeedback session was provided when the subject reported for therapy again, avoiding an interval of more than 2 days between sessions so as to avoid unlearning and deconditioning. The EMG BF was provided using the EMG-IR Retrainer (Chattanooga Group Inc., United States), and the GSR BF using the GSR biofeedback Biotrainer GPF-2000 (Biotech, India). The EMG BF provided auditory and visual feedback. Auditory feedback was in the form of clicks that increased in frequency and became a continuous sound with an increase in frontalis muscle tension, and fell to no sound with relaxation of the frontalis muscle. Visual feedback on the display monitor consisted of glowing bars along with a numerical readout of the relative EMG activity of the frontalis muscle. The number of glowing bars was directly proportional to the tension in the frontalis muscle. The GSR BF machine similarly provided visual and auditory feedback. The visual display consisted of glowing bars together with the numerical value of the real-time skin resistance in kilo-ohms.


An increase in the number of red glowing bars indicated an increase in tension (a fall in skin resistance), and a decrease in the number of red bars together with an increase in the number of green bars indicated a decrease in stress or tension (an increase in skin resistance). The auditory feedback was similar to that of the EMG, i.e., clicks that became a continuous noise with an increase in stress and fell to no sound with relaxation. The training was given at 2% sensitivity with actual GSR, and no changes in sensitivity were made throughout the training sessions. Skin preparation was done prior to the attachment of electrodes by cleaning the skin with a spirit-soaked cotton pad. After skin preparation, surface EMG electrodes (Ag-AgCl triode electrodes) were applied 2.5 cm above the center of each eyebrow [19], and the GSR electrodes were applied on the middle phalanx of the index and ring fingers [20]. Both EMG and GSR BF electrodes were placed on all subjects, including the control group, irrespective of which group they belonged to or what training they received. The subjects were unaware of whether they were receiving EMG BF or GSR BF. The investigator was aware of the group the subject belonged to and instructed the subject accordingly. The EMG and GSR BF auditory groups received only the respective auditory feedback. These subjects were instructed to reduce the tone and frequency of the sound, which would help them achieve relaxation; during the session, the monitor carrying the visual display was moved out of the subject's field of vision. Similarly, the EMG and GSR BF visual groups were instructed to reduce the number of glowing bars; in the case of GSR, they were instructed to decrease the number of red bars and increase the number of green bars to indicate relaxation, and the volume of the auditory tone was muted during the treatment session. The EMG and GSR audio-visual groups were instructed to lower the tone and frequency of the sound and decrease the number of bars simultaneously. Subjects in the control group were not asked to manipulate either the visual or the auditory display; they were only informed that their stress levels were being recorded by the machines. The subjects were instructed to practice relaxation at home, both during the course of therapy and at the end of the 15 sessions, in a way similar to the relaxation during the biofeedback therapy sessions; however, compliance with the home program was not monitored. All subjects were allowed to take the medications prescribed by their treating physicians, especially preventive/prophylactic medications, and were requested to avoid taking any analgesic/abortive/palliative medication unless the headache was unbearable.
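To make the display logic described above concrete, the sketch below shows one possible way a GSR feedback screen could translate the measured skin resistance into red and green bars. It is not the Biotrainer's actual algorithm; the baseline-relative mapping, bar count, and function name are hypothetical.

```python
# Illustrative sketch (hypothetical mapping): rising skin resistance (relaxation) turns
# bars green, falling resistance (tension) turns them red.
def gsr_bars(skin_resistance_kohm: float, baseline_kohm: float, total_bars: int = 10):
    """Return (red_bars, green_bars) relative to the subject's baseline resistance."""
    ratio = max(0.0, min(2.0, skin_resistance_kohm / baseline_kohm))  # clamp for a bounded display
    green = round(total_bars * min(ratio, 1.0))  # more green as resistance rises toward/above baseline
    red = total_bars - green                     # remaining bars shown red
    return red, green

print(gsr_bars(skin_resistance_kohm=450, baseline_kohm=300))  # relaxed: (0, 10)
print(gsr_bars(skin_resistance_kohm=150, baseline_kohm=300))  # tense:   (5, 5)
```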

5.6.7 OUTCOME PARAMETERS
5.6.7.1 Primary variables
As per the recommendations of the American Headache Society Behavioral Clinical Trials Workgroup, 2005 [21], the primary variables selected for our study were the average frequency, duration, and intensity of headache per week. The IHS has issued guidelines for controlled trials of drug treatments for chronic tension-type headache. As per [22], the recommended way of reporting headache frequency favors meta-analyses and other comparisons across studies and interventions, and is consistent with the IHS guidelines for controlled trials of drug treatments [23,24].


5.6.7.2 Secondary variables
The secondary variables considered in our study were the SF-36 Quality of Life total, physical, and mental scores.

5.6.8 ANALGESIC CONSUMPTION These secondary variables are also termed as “secondary non headache measures” by the American Headache Society Behavioral Clinical Trials Workgroup, 2005 [21].

5.6.9 ASSESSMENT OF OUTCOME VARIABLES
Demographic data regarding age, gender, and chronicity of headache were collected from all the subjects in the trial.

5.6.10 PAIN DIARY
As per the guidelines of the American Headache Society Behavioral Clinical Trials Workgroup, 2005 [21], a pain diary was given to all the subjects, in which they were asked to note down the headache episodes, duration, and intensity of headache they experienced each week. At the end of the week, the averages for that week were calculated. The variables were recorded as: (1) average frequency of headache per week: the number of headache episodes per week; (2) average duration of headache per week: the total hours of all episodes of headache that week divided by the number of episodes in that week; (3) average intensity of headache per week: the average of the 10-point visual analogue scale (VAS) score per headache that week. The subjects were also requested to note down the use of analgesics during any of the pain episodes. For the secondary variables, a licensed SF-36 questionnaire was obtained from Quality Metric Incorporated, United States, in Hindi and English. It is a short-form health survey with 36 questions, which is multipurpose and yields a psychometrically based physical and mental health report as well as a total score. The SF-36 has been judged to be the most broadly evaluated generic patient-assessed health outcome [25]. Its reliability has been estimated using both internal-consistency and test-retest methods, and most of the published reliability statistics exceed 0.80. Validity studies generally support the intended interpretation of high and low SF-36 scores as documented in the original user's manuals [26–28]. Analgesic consumption was recorded from the pain diary of the subjects as well as from the prescriptions for medications given to the subjects by their treating physicians.
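A minimal sketch of the three weekly pain-diary summaries defined above is shown below; the episode records are hypothetical and only illustrate the arithmetic.

```python
# Illustrative sketch: weekly frequency, average duration, and average VAS intensity
# computed from one week of hypothetical pain-diary episodes.
def weekly_summary(episodes):
    """episodes: list of dicts with 'duration_h' (hours) and 'vas' (10-point VAS score)."""
    n = len(episodes)
    if n == 0:
        return {"frequency": 0, "avg_duration_h": 0.0, "avg_intensity_vas": 0.0}
    return {
        "frequency": n,  # number of headache episodes in the week
        "avg_duration_h": sum(e["duration_h"] for e in episodes) / n,
        "avg_intensity_vas": sum(e["vas"] for e in episodes) / n,
    }

week = [{"duration_h": 3.0, "vas": 6}, {"duration_h": 5.0, "vas": 4}]
print(weekly_summary(week))  # {'frequency': 2, 'avg_duration_h': 4.0, 'avg_intensity_vas': 5.0}
```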

5.6.11 DATA COLLECTION
All data were collected at the following time points. Baseline: scores of the primary and secondary variables in the week before the start of treatment. Follow-up: scores at 1, 3, and 6 months and at 1 year, taken as the scores of the last week of the corresponding month.


5.6.12 STATISTICAL ANALYSIS
Effect size was analyzed at 1 year in an inter-group analysis. Percentage improvement was calculated as: percentage improvement = (baseline score − monthly score)/baseline score, converted to a percentage by multiplying by 100. Significance for the results was set at P < .05. Data recording and visualization: data collection and recording were done through the subjects' pain diaries and records in Microsoft Excel, with analytics and visualization in a licensed version of Tableau. Different visuals were used for analysis and representation purposes. The EMG-IR Retrainer was used to provide EMG BF (Chattanooga Group Inc., United States) and the GSR biofeedback Biotrainer GPF-2000 was used to provide GSR BF (Biotech, India).
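The percentage-improvement formula stated above is simple enough to restate directly in code; the example values below are hypothetical.

```python
# Minimal sketch of the percentage-improvement formula used in the analysis.
def percentage_improvement(baseline: float, monthly: float) -> float:
    """(baseline score - monthly score) / baseline score, expressed as a percentage."""
    return (baseline - monthly) / baseline * 100.0

print(percentage_improvement(baseline=8.0, monthly=5.0))  # 37.5 (% improvement)
```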

5.6.13 HYPOTHESIS
Meditation has a significant effect on mental health and stress management, and there is a significant increase in the rate of reduction of chronic TTH using guided meditation with EMG and GSR audio therapy over 12 months. The corresponding null hypotheses were:
– There is no significant difference in mental health (stress and chronic TTH) using EMGa, EMGv, EMGav, GSRa, GSRv, or GSRav BF therapy.
– There is no significant relationship between stress and EMGa, EMGv, EMGav, GSRa, GSRv, or GSRav BF therapy.

5.7 DATA COLLECTION PROCEDURE—GUIDED MEDITATION AS PER FIG. 5.7G
Subjects were given the following instructions by the instructor:
– Adopt any comfortable meditative posture, such as padmasana or siddhasana.
– Try to follow the given instructions.
– Prepare yourself for the practice of meditation.
– Keep the head, neck, and spine upright.
– Close the eyes and relax the whole body.
– Alpha waves are an indicator of a calm, cool, steady, and concentrated mental state. Development of talent and innate caliber happens at this stage. Alpha activity of the brain is affected by stress, anxiety, mood variations, and different environmental factors.
– The pain diary of the subjects was used to record the average intensity, duration, and frequency of chronic TTH.

5.8 RESULTS, INTERPRETATION AND DISCUSSION
This section presents a trend/pattern analysis of the groups undergoing EMGav and GSRav therapies over the 12-month experiment period, for the frequency, intensity, and duration parameters of chronic TTH.


FIG. 5.7 (A) Use of biofeedback and neurofeedback in the machine: the subject under concern and a predefined spiritual-consultant expert system provide the inputs; machine intelligence records and measures the different pains and psychic challenges; a suitable scientific-spirituality method is applied on the basis of the expert system or a spiritual counselor; the difference (deviation) before and after the treatment is measured; feedback from the subject is recorded over certain time periods; if the subject improves, the system is trained with the symptoms, parameters, and results, and a report is produced for the concerned subjects.


FIG. 5.7, Cont'd (B) Spiritual fitness measurement, level 0: the subject provides input to the bio/neurofeedback instrument; parameters 1 to N are quantified and recorded; graph and chart analysis is performed on the different medical parameters.

FIG. 5.7, Cont'd (C) Spiritual fitness measurement, level 1: the symptoms of the subject are recorded and quantified as input to the bio/neurofeedback instrument; a suitable spiritual technique is suggested according to the symptoms; the inputs on the different parameters are recorded as numerical values.


FIG. 5.7, Cont'd Measurement of human body parameters: a set of spiritual tools (Appendix "A") is applied and appropriate activities are performed by a set of people, individually and collectively; the subjects are informed about their physical, mental, social, and spiritual fitness; the difference or deviation between pre- and post-activity measurements is computed at certain time intervals, and if the difference exceeds the set level, the effect of the spiritual practice for those symptoms is declared and recorded.

FIG. 5.12 Correlation between TTH duration and intensity: scatter plots of TTH intensity against TTH duration (in h, roughly 0–20 on the x-axis) with average trend lines, shown per technique (e.g., GSRa) for the individual subjects (1–27) and their average.

The baseline data for both groups showed high TTH intensity and high TTH duration, and the correlation between these parameters was observed to be high. After applying different trend models, such as linear, logarithmic, exponential, polynomial, and power models, we found the best-fitting trend in the power model; hence, the power-model trend was analyzed. The mathematical modeling of the power model is given as a trend line model: a linear trend model is computed for the natural log of the sum of intensity given the natural log of the sum of duration. The model may be significant at P ≤ .05. The factor period may be significant at P ≤ .05 (Tables 5.2–5.4). The analysis of the trends found with the respective techniques is as follows. GSRa: the improvement trend of the duration of TTH pain with the pain intensity was exponential almost throughout the period, with an exception at the end, after using GSRa as the feedback modality.
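For clarity, the fitted relation described above can be restated as a linear model in the logs, which is equivalent to a power law between duration and intensity. The coefficient symbols below are illustrative names, not taken from the chapter; the fitted statistics themselves are reported in Tables 5.2–5.4.

```latex
% Log-linear (power-model) trend assumed from the description above:
% D = TTH duration, I = TTH intensity, beta_0 = intercept, beta_1 = slope.
\[
  \ln(I) \;=\; \beta_1 \, \ln(D) \;+\; \beta_0
  \qquad\Longleftrightarrow\qquad
  I \;=\; e^{\beta_0} \, D^{\beta_1}
\]
```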


Table 5.2 Trend line model, Period and Techniques, ln(Duration) + intercept; reported values: 234, 13, 20, 214, 26.9855, 0.1261, 0.560029, 0.355106.

Table 5.20 Trend line model, Period and Techniques, ln(Duration) + intercept; reported values: 346, 23, 30, 316, 34.9709, 0.110667, 0.642267, 0.332667.