Handbook of Research on Machine Learning Innovations and Trends [Illustrated] 1522522298, 9781522522294

"This book consists of three sections. In the first section, many state-of-the-art techniques are discussed and ana


English · Pages: 1050 [1270] · Year: 2017




Table of Contents:
List of Contributors
Table of Contents
Detailed Table of Contents
Preface
Section 1: State-of-the-Art Techniques
1 T-Spanner Problem: Genetic Algorithms for the T-Spanner Problem • Riham Moharam, Ehab Morsy, Ismail A. Ismail
2 Breast Cancer Diagnosis Using Relational Discriminant Analysis of Malignancy-Associated Changes • Dmitry Klyushin, Natalia Boroday, Kateryna Golubeva, Maryna Prysiazhna, Maksym Shlykov
3 Early-Stage Ovarian Cancer Diagnosis Using Fuzzy Rough Sets with SVM Classification • Nora Shoaip, Mohammed Elmogy, Alaa M. Riad, Hosam Zaghloul, Farid A. Badria
4 Data Storage Security Service in Cloud Computing: Challenges and Solutions • Alshaimaa Abo-alian, Nagwa L. Badr, Mohamed F. Tolba
5 Workload Management Systems for the Cloud Environment • Eman A. Maghawry, Rasha M. Ismail, Nagwa L. Badr, Mohamed F. Tolba
6 Segmentation of Brain Tumor from MRI Images Based on Hybrid Clustering Techniques • Eman A. Abdel Maksoud, Mohammed Elmogy, Rashid Mokhtar Al-Awadi
7 Localization and Mapping for Indoor Navigation: Survey • Heba Gaber, Mohamed Marey, Safaa Amin, Mohamed F. Tolba
8 Enzyme Function Classification: Reviews, Approaches, and Trends • Mahir M. Sharif, Alaa Tharwat, Aboul Ella Hassanien, Hesham A. Hefny
9 A Review of Vessel Segmentation Methodologies and Algorithms: Comprehensive Review • Gehad Hassan, Aboul Ella Hassanien
10 Cloud Services Publication and Discovery • Yasmine M. Afify, Ibrahim F. Moawad, Nagwa L. Badr, Mohamed F. Tolba
Section 2: Applications-Based Machine Learning
11 Enhancement of Data Quality in Health Care Industry: A Promising Data Quality Approach • Asmaa S. Abdo, Rashed K. Salem, Hatem M. Abdul-Kader
12 Investigation of Software Reliability Prediction Using Statistical and Machine Learning Methods • Pradeep Kumar, Abdul Wahid
13 Fuzzy-Based Approach for Reducing the Impacts of Climate Changes on Agricultural Crops • Ahmed M. Gadallah, Assem H. Mohammed
14 Directional Multi-Scale Stationary Wavelet-Based Representation for Human Action Classification • M. N. Al-Berry, Mohammed A.-M. Salem, H. M. Ebeid, A. S. Hussein, Mohamed F. Tolba
15 Data Streams Processing Techniques Data Streams Processing Techniques • Fatma Mohamed, Rasha M. Ismail, Nagwa L. Badr, Mohamed F. Tolba
16 A Preparation Framework for EHR Data to Construct CBR Case-Base • Shaker El-Sappagh, Mohammed Elmogy, Alaa M. Riad, Hosam Zaghloul, Farid A. Badria
17 Detecting Significant Changes in Image Sequences • Sergii Mashtalir, Olena Mikhnova
18 Multiple Sequence Alignment Optimization Using Meta-Heuristic Techniques • Mohamed Issa, Aboul Ella Hassanien
19 Recent Survey on Medical Image Segmentation • Mohammed A.-M. Salem, Alaa Atef, Alaa Salah, Marwa Shams
20 Machine Learning Applications in Breast Cancer Diagnosis • Syed Jamal Safdar Gardezi, Mohamed Meselhy Eltoukhy, Ibrahima Faye
21 A Hybrid Optimization Algorithm for Single and Multi-Objective Optimization Problems • Rizk M. Rizk-Allah, Aboul Ella Hassanien
22 Neuro-Imaging Machine Learning Techniques for Alzheimer’s Disease Diagnosis • Gehad Ismail Sayed, Aboul Ella Hassanien
23 Swarm Intelligence Based on Remote Sensing Image Fusion: Comparison between the Particle Swarm Optimization and the Flower Pollination Algorithm • Reham Gharbia, Aboul Ella Hassanien
24 Grey Wolf Optimization-Based Segmentation Approach for Abdomen CT Liver Images • Abdalla Mostafa, Aboul Ella Hassanien, Hesham A. Hefny
25 3D Watermarking Approach Using Particle Swarm Optimization Algorithm • Mona M. Soliman, Aboul Ella Hassanien
26 Particle Swarm Optimization: A Tutorial • Alaa Tharwat, Tarek Gaber, Aboul Ella Hassanien, Basem E. Elnaghi
27 A Comparison of Open Source Data Mining Tools for Breast Cancer Classification • Ahmed AbdElhafeez Ibrahim, Atallah Ibrahin Hashad, Negm Eldin Mohamed Shawky
28 2D and 3D Intelligent Watermarking • Mourad R. Mouhamed, Ashraf Darwish, Aboul Ella Hassanien
Section 3: Innovative ML Applications
29 Differential Evolution Algorithm with Space Reduction for Solving Large-Scale Global Optimization Problems • Ahmed Fouad Ali, Nashwa Nageh Ahmed
30 Interpreting Brain Waves • Noran Magdy El-Kafrawy, Doaa Hegazy, Mohamed F. Tolba
31 Data Clustering Using Sine Cosine Algorithm: Data Clustering Using SCA • Vijay Kumar, Dinesh Kumar
32 Complex-Valued Neural Networks: A New Learning Strategy Using Particle Swarm Optimization • Mohammed E. El-Telbany, Samah Refat, Engy I. Nasr
33 Text Classification: New Fuzzy Decision Tree Model • Ben Elfadhl Mohamed Ahmed, Ben Abdessalem Wahiba
34 PAGeneRN: Parallel Architecture for Gene Regulatory Network • Dina Elsayad, A. Ali, Howida A. Shedeed, Mohamed F. Tolba
35 Hybrid Wavelet-Neuro-Fuzzy Systems of Computational Intelligence in Data Mining Tasks • Yevgeniy Bodyanskiy, Olena Vynokurova, Oleksii Tyshchenko
36 On Combining Nature-Inspired Algorithms for Data Clustering • Hanan Ahmed, Howida A. Shedeed, Safwat Hamad, Mohamed F. Tolba
37 A Fragile Watermarking Chaotic Authentication Scheme Based on Fuzzy C-Means for Image Tamper Detection • Kamal Hamouda, Mohammed Elmogy, B. S. El-Desouky
38 New Mechanisms to Enhance the Performances of Arabic Text Recognition System: Feature Selection • Marwa Amara, Kamel Zidi
39 Bio-Inspired Optimization Algorithms for Arabic Handwritten Characters • Ahmed.T. Sahlol, Aboul Ella Hassanien
40 Telemetry Data Mining Techniques, Applications, and Challenges • Sara Ahmed, Tarek Gaber, Aboul Ella Hassanien
41 Enhanced Breast Cancer Diagnosis System Using Fuzzy Clustering Means Approach in Digital Mammography • Mohammed A. Osman, Ashraf Darwish, Ayman E. Khedr, Atef Z. Ghalwash, Aboul Ella Hassanien
42 TAntNet-4: A Threshold-Based AntNet Algorithm with Improved Scout Behavior • Ayman M. Ghazy, Hesham A. Hefny
43 Digital Images Segmentation Using a Physical-Inspired Algorithm • Diego Oliva, Aboul Ella Hassanien
44 A Proposed Architecture for Key Management Schema in Centralized Quantum Network • Ahmed Farouk, Mohamed Elhoseny, Josep Batle, Mosayeb Naseri, Aboul Ella Hassanien
45 Secure Image Processing and Transmission Schema in Cluster-Based Wireless Sensor Network • Mohamed Elhoseny, Ahmed Farouk, Josep Batle, Abdulaziz Shehab, Aboul Ella Hassanien
46 Color Invariant Representation and Applications • Abdelhameed Ibrahim, Takahiko Horiuchi, Shoji Tominaga, Aboul Ella Hassanien
47 An Efficient Approach for Community Detection in Complex Social Networks Based on Elephant Swarm Optimization Algorithm • Khaled Ahmed, Aboul Ella Hassanien, Ehab Ezzat
48 Designing Multilayer Feedforward Neural Networks Using Multi-Verse Optimizer • Mohamed F. Hassanin, Abdullah M. Shoeb, Aboul Ella Hassanien
Compilation of References
About the Contributors
Index


Handbook of Research on Machine Learning Innovations and Trends
Aboul Ella Hassanien, Cairo University, Egypt
Tarek Gaber, Suez Canal University, Egypt

A volume in the Advances in Computational Intelligence and Robotics (ACIR) Book Series

Published in the United States of America by
IGI Global
Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue
Hershey PA, USA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: [email protected]
Web site: http://www.igi-global.com

Copyright © 2017 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher. Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.

Library of Congress Cataloging-in-Publication Data
Names: Hassanien, Aboul Ella, editor. | Gaber, Tarek, 1975- editor.
Title: Handbook of research on machine learning innovations and trends / Aboul Ella Hassanien and Tarek Gaber, editors.
Description: Hershey, PA : Information Science Reference, [2017] | Includes bibliographical references and index.
Identifiers: LCCN 2016056940 | ISBN 9781522522294 (hardcover) | ISBN 9781522522300 (ebook)
Subjects: LCSH: Machine learning--Technological innovations. | Machine learning--Industrial applications.
Classification: LCC Q325.5 .H3624 2017 | DDC 006.3/1--dc23
LC record available at https://lccn.loc.gov/2016056940

This book is published in the IGI Global book series Advances in Computational Intelligence and Robotics (ACIR) (ISSN: 2327-0411; eISSN: 2327-042X)

British Cataloguing in Publication Data A Cataloguing in Publication record for this book is available from the British Library. All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the authors, but not necessarily of the publisher. For electronic access to this publication, please contact: [email protected].

Advances in Computational Intelligence and Robotics (ACIR) Book Series
Ivan Giannoccaro, University of Salento, Italy

ISSN: 2327-0411; EISSN: 2327-042X

Mission

While intelligence is traditionally a term applied to humans and human cognition, technology has progressed in such a way as to allow for the development of intelligent systems able to simulate many human traits. With this new era of simulated and artificial intelligence, much research is needed in order to continue to advance the field and also to evaluate the ethical and societal concerns of the existence of artificial life and machine learning. The Advances in Computational Intelligence and Robotics (ACIR) Book Series encourages scholarly discourse on all topics pertaining to evolutionary computing, artificial life, computational intelligence, machine learning, and robotics. ACIR presents the latest research being conducted on diverse topics in intelligence technologies with the goal of advancing knowledge and applications in this rapidly evolving field.

Coverage • Synthetic Emotions • Robotics • Natural language processing • Brain Simulation • Fuzzy Systems • Automated Reasoning • Computational Logic • Artificial Life • Algorithmic Learning • Cognitive Informatics

IGI Global is currently accepting manuscripts for publication within this series. To submit a proposal for a volume in this series, please contact our Acquisition Editors at [email protected] or visit: http://www.igi-global.com/publish/.

The Advances in Computational Intelligence and Robotics (ACIR) Book Series (ISSN 2327-0411) is published by IGI Global, 701 E. Chocolate Avenue, Hershey, PA 17033-1240, USA, www.igi-global.com. This series is composed of titles available for purchase individually; each title is edited to be contextually exclusive from any other title within the series. For pricing and ordering information please visit http://www.igi-global.com/book-series/advances-computational-intelligence-robotics/73674. Postmaster: Send all address changes to above address. Copyright © 2017 IGI Global. All rights, including translation into other languages, reserved by the publisher. No part of this series may be reproduced or used in any form or by any means – graphics, electronic, or mechanical, including photocopying, recording, taping, or information and retrieval systems – without written permission from the publisher, except for non-commercial, educational use, including classroom teaching purposes. The views expressed in this series are those of the authors, but not necessarily of IGI Global.

Titles in this Series

For a list of additional titles in this series, please visit: www.igi-global.com/book-series

Handbook of Research on Soft Computing and Nature-Inspired Algorithms
Shishir K. Shandilya (Bansal Institute of Research and Technology, India) Smita Shandilya (Sagar Institute of Research Technology and Science, India) Kusum Deep (Indian Institute of Technology Roorkee, India) and Atulya K. Nagar (Liverpool Hope University, UK)
Information Science Reference • copyright 2017 • 627pp • H/C (ISBN: 9781522521280) • US $280.00 (our price)

Membrane Computing for Distributed Control of Robotic Swarms: Emerging Research and Opportunities
Andrei George Florea (Politehnica University of Bucharest, Romania) and Cătălin Buiu (Politehnica University of Bucharest, Romania)
Information Science Reference • copyright 2017 • 119pp • H/C (ISBN: 9781522522805) • US $160.00 (our price)

Recent Developments in Intelligent Nature-Inspired Computing
Srikanta Patnaik (SOA University, India)
Information Science Reference • copyright 2017 • 264pp • H/C (ISBN: 9781522523222) • US $185.00 (our price)

Ubiquitous Machine Learning and Its Applications
Pradeep Kumar (Maulana Azad National Urdu University, India) and Arvind Tiwari (DIT University, India)
Information Science Reference • copyright 2017 • 258pp • H/C (ISBN: 9781522525455) • US $185.00 (our price)

Advanced Image Processing Techniques and Applications
N. Suresh Kumar (VIT University, India) Arun Kumar Sangaiah (VIT University, India) M. Arun (VIT University, India) and S. Anand (VIT University, India)
Information Science Reference • copyright 2017 • 439pp • H/C (ISBN: 9781522520535) • US $290.00 (our price)

Advanced Research on Biologically Inspired Cognitive Architectures
Jordi Vallverdú (Universitat Autònoma de Barcelona, Spain) Manuel Mazzara (Innopolis University, Russia) Max Talanov (Kazan Federal University, Russia) Salvatore Distefano (University of Messina, Italy & Kazan Federal University, Russia) and Robert Lowe (University of Gothenburg, Sweden & University of Skövde, Sweden)
Information Science Reference • copyright 2017 • 297pp • H/C (ISBN: 9781522519478) • US $195.00 (our price)

Theoretical and Practical Advancements for Fuzzy System Integration
Deng-Feng Li (Fuzhou University, China)
Information Science Reference • copyright 2017 • 415pp • H/C (ISBN: 9781522518488) • US $200.00 (our price)

701 East Chocolate Avenue, Hershey, PA 17033, USA Tel: 717-533-8845 x100 • Fax: 717-533-8661 E-Mail: [email protected] • www.igi-global.com

List of Contributors

Abdel Maksoud, Eman A. / Mansoura University, Egypt................................................................. 114 Abdo, Asmaa S. / Menoufia University, Egypt................................................................................... 230 Abdul-Kader, Hatem M. / Menoufia University, Egypt..................................................................... 230 Abo-alian, Alshaimaa / Ain Shams University, Egypt......................................................................... 61 Afify, Yasmine M. / Ain Shams University, Egypt.............................................................................. 204 Ahmed, Ben Elfadhl Mohamed / Higher Institute of Management, Tunisia.................................... 740 Ahmed, Hanan / Ain Shams University, Egypt.................................................................................. 826 Ahmed, Khaled / Cairo University, Egypt....................................................................................... 1062 Ahmed, Nashwa Nageh / Suez Canal University, Egypt.................................................................... 671 Ahmed, Sara / Al Azhar University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt.............................................................................................................................................. 915 Al-Awadi, Rashid Mokhtar / Mansoura University, Egypt............................................................... 114 Al-Berry, M. N. / Ain Shams University, Egypt.................................................................................. 295 Ali, A. / Ain Shams University, Egypt................................................................................................. 762 Ali, Ahmed Fouad / Suez Canal University, Egypt............................................................................ 671 Amara, Marwa / SOIE Laboratory, Tunisia..................................................................................... 879 Amin, Safaa / Ain-Shams University, Egypt...................................................................................... 136 Atef, Alaa / Ain Shams University, Egypt.......................................................................................... 424 Badr, Nagwa L. / Ain Shams University, Egypt................................................................. 61,94,204,320 Badria, Farid A. / Mansoura University, Egypt............................................................................ 43,345 Batle, Josep / Universitat de les Illes Balears, Spain................................................................ 997,1022 Bodyanskiy, Yevgeniy / Kharkiv National University of Radio Electronics, Ukraine...................... 787 Boroday, Natalia / National Academy of Sciences of Ukraine, Ukraine............................................. 22 Darwish, Ashraf / Helwan University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt....................................................................................................................................... 652,925 Ebeid, H. M. / Ain Shams University, Egypt....................................................................................... 295 El-Desouky, B. S. / Mansoura University, Egypt................................................................................ 
856 Elhoseny, Mohamed / Mansoura University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt...................................................................................................................... 997,1022 El-Kafrawy, Noran Magdy / Ain Shams University, Egypt............................................................... 695 Elmogy, Mohammed / Mansoura University, Egypt...................................................... 43,114,345,856 Elnaghi, Basem E. / Suez Canal University, Egypt............................................................................ 614 El-Sappagh, Shaker / Mansoura University, Egypt.......................................................................... 345 Elsayad, Dina / Ain Shams University, Egypt.................................................................................... 762 El-Telbany, Mohammed E. / Electronics Research Institute, Egypt.................................................. 727 



Eltoukhy, Mohamed Meselhy / Suez Canal University, Egypt.......................................................... 465 Ezzat, Ehab / Cairo University, Egypt............................................................................................. 1062 Farouk, Ahmed / Zewail City of Science and Technology, Egypt & Mansoura University, Egypt..................................................................................................................................... 997,1022 Faye, Ibrahima / Universiti Teknologi Petronas, Malaysia............................................................... 465 Gaber, Heba / Ain-Shams University, Egypt...................................................................................... 136 Gaber, Tarek / Suez Canal University, Egypt............................................................................. 614,915 Gadallah, Ahmed M. / Cairo University, Egypt................................................................................ 272 Gardezi, Syed Jamal Safdar / Universiti Teknologi Petronas, Malaysia.......................................... 465 Ghalwash, Atef Z. / Helwan University, Egypt.................................................................................. 925 Gharbia, Reham / Nuclear Materials Authority, Egypt.................................................................... 541 Ghazy, Ayman M. / Cairo University, Egypt...................................................................................... 942 Golubeva, Kateryna / Kyiv National Taras Shevchenko University, Ukraine..................................... 22 Hamad, Safwat / Ain Shams University, Egypt................................................................................. 826 Hamouda, Kamal / Mansoura University, Egypt.............................................................................. 856 Hashad, Atallah Ibrahin / Arab Academy for Science, Technology, and Maritime Transport, Egypt.............................................................................................................................................. 636 Hassan, Gehad / Fayoum University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt.187 Hassanien, Aboul Ella / Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt........ 161,187,409,491,522,541,562,582,614,652,897,915,925,975,997,1022,1041,1062,1076 Hassanin, Mohamed F. / Fayoum University, Egypt....................................................................... 1076 Hefny, Hesham A. / Cairo University, Egypt....................................................................... 161,562,942 Hegazy, Doaa / Ain Shams University, Egypt.................................................................................... 695 Horiuchi, Takahiko / Chiba University, Japan............................................................................... 1041 Hussein, A. S. / Arab Open University, Kuwait.................................................................................. 295 Ibrahim, Abdelhameed / Mansoura University, Egypt................................................................... 1041 Ibrahim, Ahmed AbdElhafeez / Arab Academy for Science, Technology, and Maritime Transport, Egypt............................................................................................................................ 636 Ismail, Ismail A. / 6 October University, Egypt..................................................................................... 1 Ismail, Rasha M. 
/ Ain Shams University, Egypt.......................................................................... 94,320 Issa, Mohamed / Zagazig University, Egypt...................................................................................... 409 Khedr, Ayman E. / Helwan University, Egypt.................................................................................... 925 Klyushin, Dmitry / Kyiv National Taras Shevchenko University, Ukraine......................................... 22 Kumar, Dinesh / GJUS&T, India...................................................................................................... 715 Kumar, Pradeep / Maulana Azad National Urdu University, India.................................................. 251 Kumar, Vijay / Thapar University, India........................................................................................... 715 Maghawry, Eman A. / Ain Shams University, Egypt........................................................................... 94 Marey, Mohamed / Ain-Shams University, Egypt............................................................................. 136 Mashtalir, Sergii / Kharkiv National University of Radio Electronics, Ukraine............................... 379 Mikhnova, Olena / Kharkiv Petro Vasylenko National Technical University of Agriculture, Ukraine.......................................................................................................................................... 379 Moawad, Ibrahim F. / Ain Shams University, Egypt.......................................................................... 204 Mohamed, Fatma / Ain Shams University, Egypt............................................................................. 320 Mohammed, Assem H. / Cairo University, Egypt.............................................................................. 272 Moharam, Riham / Suez Canal University, Egypt................................................................................ 1 Morsy, Ehab / Suez Canal University, Egypt........................................................................................ 1



Mostafa, Abdalla / Cairo University, Egypt...................................................................................... 562 Mouhamed, Mourad R. / Helwan University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt............................................................................................................................... 652 Naseri, Mosayeb / Islamic Azad University, Kermanshah, Iran........................................................ 997 Nasr, Engy I. / Ain Shams University, Egypt...................................................................................... 727 Oliva, Diego / Tecnológico de Monterrey, Mexico & Universidad de Guadajalara, Mexico & Tomsk Polytechnic University, Russia & Scientific Research Group in Egypt (SRGE), Egypt..... 975 Osman, Mohammed A. / Helwan University, Egypt......................................................................... 925 Prysiazhna, Maryna / Kyiv National Taras Shevchenko University, Ukraine.................................... 22 Refat, Samah / Ain Shams University, Egypt..................................................................................... 727 Riad, Alaa M. / Mansoura University, Egypt................................................................................ 43,345 Rizk-Allah, Rizk M. / Menoufia University, Egypt............................................................................ 491 Sahlol, Ahmed.T. / Damietta University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt.............................................................................................................................................. 897 Salah, Alaa / Ain Shams University, Egypt........................................................................................ 424 Salem, Mohammed A.-M. / Ain Shams University, Egypt.......................................................... 295,424 Salem, Rashed K. / Menoufia University, Egypt................................................................................ 230 Sayed, Gehad Ismail / Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt.............................................................................................................................................. 522 Shams, Marwa / Ain Shams University, Egypt.................................................................................. 424 Sharif, Mahir M. / Cairo University, Egypt & Omdurman Islamic University, Sudan & Scientific Research Group in Egypt (SRGE), Egypt...................................................................................... 161 Shawky, Negm Eldin Mohamed / Arab Academy for Science, Technology, and Maritime Transport, Egypt............................................................................................................................ 636 Shedeed, Howida A. / Ain Shams University, Egypt................................................................... 762,826 Shehab, Abdulaziz / Mansoura University, Egypt.......................................................................... 1022 Shlykov, Maksym / Kyiv National Taras Shevchenko University, Ukraine......................................... 22 Shoaip, Nora / Mansoura University, Egypt........................................................................................ 43 Shoeb, Abdullah M. / Taibah University, Saudi Arabia................................................................... 1076 Soliman, Mona M. 
/ Scientific Research Group in Egypt, Egypt....................................................... 582 Tharwat, Alaa / Suez Canal University, Egypt & Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt.............................................................................................. 161,614 Tolba, Mohamed F. / Ain Shams University, Egypt....................... 61,94,136,204,295,320,695,762,826 Tominaga, Shoji / Chiba University, Japan..................................................................................... 1041 Tyshchenko, Oleksii / Kharkiv National University of Radio Electronics, Ukraine......................... 787 Vynokurova, Olena / Kharkiv National University of Radio Electronics, Ukraine.......................... 787 Wahiba, Ben Abdessalem / Taif University, Saudi Arabia................................................................ 740 Wahid, Abdul / Maulana Azad National Urdu University, India...................................................... 251 Zaghloul, Hosam / Mansoura University, Egypt.......................................................................... 43,345 Zidi, Kamel / University of Tabouk, Saudi Arabia............................................................................ 879

Table of Contents

Preface............................................................................................................................................xxxviii

Volume I Section 1 State-of-the-Art Techniques Chapter 1 T-Spanner Problem: Genetic Algorithms for the T-Spanner Problem..................................................... 1 Riham Moharam, Suez Canal University, Egypt Ehab Morsy, Suez Canal University, Egypt Ismail A. Ismail, 6 October University, Egypt Chapter 2 Breast Cancer Diagnosis Using Relational Discriminant Analysis of Malignancy-Associated Changes.................................................................................................................................................. 22 Dmitry Klyushin, Kyiv National Taras Shevchenko University, Ukraine Natalia Boroday, National Academy of Sciences of Ukraine, Ukraine Kateryna Golubeva, Kyiv National Taras Shevchenko University, Ukraine Maryna Prysiazhna, Kyiv National Taras Shevchenko University, Ukraine Maksym Shlykov, Kyiv National Taras Shevchenko University, Ukraine Chapter 3 Early-Stage Ovarian Cancer Diagnosis Using Fuzzy Rough Sets with SVM Classification................. 43 Nora Shoaip, Mansoura University, Egypt Mohammed Elmogy, Mansoura University, Egypt Alaa M. Riad, Mansoura University, Egypt Hosam Zaghloul, Mansoura University, Egypt Farid A. Badria, Mansoura University, Egypt Chapter 4 Data Storage Security Service in Cloud Computing: Challenges and Solutions................................... 61 Alshaimaa Abo-alian, Ain Shams University, Egypt Nagwa L. Badr, Ain Shams University, Egypt Mohamed F. Tolba, Ain Shams University, Egypt  



Chapter 5 Workload Management Systems for the Cloud Environment................................................................ 94 Eman A. Maghawry, Ain Shams University, Egypt Rasha M. Ismail, Ain Shams University, Egypt Nagwa L. Badr, Ain Shams University, Egypt Mohamed F. Tolba, Ain Shams University, Egypt Chapter 6 Segmentation of Brain Tumor from MRI Images Based on Hybrid Clustering Techniques............... 114 Eman A. Abdel Maksoud, Mansoura University, Egypt Mohammed Elmogy, Mansoura University, Egypt Rashid Mokhtar Al-Awadi, Mansoura University, Egypt Chapter 7 Localization and Mapping for Indoor Navigation: Survey.................................................................. 136 Heba Gaber, Ain-Shams University, Egypt Mohamed Marey, Ain-Shams University, Egypt Safaa Amin, Ain-Shams University, Egypt Mohamed F. Tolba, Ain-Shams University, Egypt Chapter 8 Enzyme Function Classification: Reviews, Approaches, and Trends.................................................. 161 Mahir M. Sharif, Cairo University, Egypt & Omdurman Islamic University, Sudan & Scientific Research Group in Egypt (SRGE), Egypt Alaa Tharwat, Suez Canal University, Egypt & Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Aboul Ella Hassanien, Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Hesham A. Hefny, Cairo University, Egypt Chapter 9 A Review of Vessel Segmentation Methodologies and Algorithms: Comprehensive Review............ 187 Gehad Hassan, Fayoum University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Aboul Ella Hassanien, Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Chapter 10 Cloud Services Publication and Discovery.......................................................................................... 204 Yasmine M. Afify, Ain Shams University, Egypt Ibrahim F. Moawad, Ain Shams University, Egypt Nagwa L. Badr, Ain Shams University, Egypt Mohamed F. Tolba, Ain Shams University, Egypt



Section 2 Applications-Based Machine Learning Chapter 11 Enhancement of Data Quality in Health Care Industry: A Promising Data Quality Approach.......... 230 Asmaa S. Abdo, Menoufia University, Egypt Rashed K. Salem, Menoufia University, Egypt Hatem M. Abdul-Kader, Menoufia University, Egypt Chapter 12 Investigation of Software Reliability Prediction Using Statistical and Machine Learning  Methods............................................................................................................................................... 251 Pradeep Kumar, Maulana Azad National Urdu University, India Abdul Wahid, Maulana Azad National Urdu University, India Chapter 13 Fuzzy-Based Approach for Reducing the Impacts of Climate Changes on Agricultural Crops.......... 272 Ahmed M. Gadallah, Cairo University, Egypt Assem H. Mohammed, Cairo University, Egypt Chapter 14 Directional Multi-Scale Stationary Wavelet-Based Representation for Human Action Classification........................................................................................................................................ 295 M. N. Al-Berry, Ain Shams University, Egypt Mohammed A.-M. Salem, Ain Shams University, Egypt H. M. Ebeid, Ain Shams University, Egypt A. S. Hussein, Arab Open University, Kuwait Mohamed F. Tolba, Ain Shams University, Egypt Chapter 15 Data Streams Processing Techniques Data Streams Processing Techniques....................................... 320 Fatma Mohamed, Ain Shams University, Egypt Rasha M. Ismail, Ain Shams University, Egypt Nagwa L. Badr, Ain Shams University, Egypt Mohamed F. Tolba, Ain Shams University, Egypt Chapter 16 A Preparation Framework for EHR Data to Construct CBR Case-Base............................................. 345 Shaker El-Sappagh, Mansoura University, Egypt Mohammed Elmogy, Mansoura University, Egypt Alaa M. Riad, Mansoura University, Egypt Hosam Zaghloul, Mansoura University, Egypt Farid A. Badria, Mansoura University, Egypt



Chapter 17 Detecting Significant Changes in Image Sequences............................................................................ 379 Sergii Mashtalir, Kharkiv National University of Radio Electronics, Ukraine Olena Mikhnova, Kharkiv Petro Vasylenko National Technical University of Agriculture, Ukraine Chapter 18 Multiple Sequence Alignment Optimization Using Meta-Heuristic Techniques................................ 409 Mohamed Issa, Zagazig University, Egypt Aboul Ella Hassanien, Cairo University, Egypt Chapter 19 Recent Survey on Medical Image Segmentation................................................................................. 424 Mohammed A.-M. Salem, Ain Shams University, Egypt Alaa Atef, Ain Shams University, Egypt Alaa Salah, Ain Shams University, Egypt Marwa Shams, Ain Shams University, Egypt Chapter 20 Machine Learning Applications in Breast Cancer Diagnosis.............................................................. 465 Syed Jamal Safdar Gardezi, Universiti Teknologi Petronas, Malaysia Mohamed Meselhy Eltoukhy, Suez Canal University, Egypt Ibrahima Faye, Universiti Teknologi Petronas, Malaysia Chapter 21 A Hybrid Optimization Algorithm for Single and Multi-Objective Optimization Problems.............. 491 Rizk M. Rizk-Allah, Menoufia University, Egypt Aboul Ella Hassanien, Cairo University, Egypt Chapter 22 Neuro-Imaging Machine Learning Techniques for Alzheimer’s Disease Diagnosis........................... 522 Gehad Ismail Sayed, Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Aboul Ella Hassanien, Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt

Volume II Chapter 23 Swarm Intelligence Based on Remote Sensing Image Fusion: Comparison between the Particle Swarm Optimization and the Flower Pollination Algorithm............................................................... 541 Reham Gharbia, Nuclear Materials Authority, Egypt Aboul Ella Hassanien, Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt



Chapter 24 Grey Wolf Optimization-Based Segmentation Approach for Abdomen CT Liver Images................. 562 Abdalla Mostafa, Cairo University, Egypt Aboul Ella Hassanien, Cairo University, Egypt Hesham A. Hefny, Cairo University, Egypt Chapter 25 3D Watermarking Approach Using Particle Swarm Optimization Algorithm.................................... 582 Mona M. Soliman, Scientific Research Group in Egypt, Egypt Aboul Ella Hassanien, Scientific Research Group in Egypt, Egypt Chapter 26 Particle Swarm Optimization: A Tutorial............................................................................................ 614 Alaa Tharwat, Suez Canal University, Egypt Tarek Gaber, Suez Canal University, Egypt Aboul Ella Hassanien, Cairo University, Egypt Basem E. Elnaghi, Suez Canal University, Egypt Chapter 27 A Comparison of Open Source Data Mining Tools for Breast Cancer Classification......................... 636 Ahmed AbdElhafeez Ibrahim, Arab Academy for Science, Technology, and Maritime Transport, Egypt Atallah Ibrahin Hashad, Arab Academy for Science, Technology, and Maritime Transport, Egypt Negm Eldin Mohamed Shawky, Arab Academy for Science, Technology, and Maritime Transport, Egypt Chapter 28 2D and 3D Intelligent Watermarking................................................................................................... 652 Mourad R. Mouhamed, Helwan University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Ashraf Darwish, Helwan University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Aboul Ella Hassanien, Cairo University Egypt & Scientific Research Group in Egypt (SRGE), Egypt Section 3 Innovative ML Applications Chapter 29 Differential Evolution Algorithm with Space Reduction for Solving Large-Scale Global Optimization Problems........................................................................................................................ 671 Ahmed Fouad Ali, Suez Canal University, Egypt Nashwa Nageh Ahmed, Suez Canal University, Egypt



Chapter 30 Interpreting Brain Waves..................................................................................................................... 695 Noran Magdy El-Kafrawy, Ain Shams University, Egypt Doaa Hegazy, Ain Shams University, Egypt Mohamed F. Tolba, Ain Shams University, Egypt Chapter 31 Data Clustering Using Sine Cosine Algorithm: Data Clustering Using SCA..................................... 715 Vijay Kumar, Thapar University, India Dinesh Kumar, GJUS&T, India Chapter 32 Complex-Valued Neural Networks: A New Learning Strategy Using Particle Swarm  Optimization........................................................................................................................................ 727 Mohammed E. El-Telbany, Electronics Research Institute, Egypt Samah Refat, Ain Shams University, Egypt Engy I. Nasr, Ain Shams University, Egypt Chapter 33 Text Classification: New Fuzzy Decision Tree Model........................................................................ 740 Ben Elfadhl Mohamed Ahmed, Higher Institute of Management, Tunisia Ben Abdessalem Wahiba, Taif University, Saudi Arabia Chapter 34 PAGeneRN: Parallel Architecture for Gene Regulatory Network....................................................... 762 Dina Elsayad, Ain Shams University, Egypt A. Ali, Ain Shams University, Egypt Howida A. Shedeed, Ain Shams University, Egypt Mohamed F. Tolba, Ain Shams University, Egypt Chapter 35 Hybrid Wavelet-Neuro-Fuzzy Systems of Computational Intelligence in Data Mining Tasks........... 787 Yevgeniy Bodyanskiy, Kharkiv National University of Radio Electronics, Ukraine Olena Vynokurova, Kharkiv National University of Radio Electronics, Ukraine Oleksii Tyshchenko, Kharkiv National University of Radio Electronics, Ukraine Chapter 36 On Combining Nature-Inspired Algorithms for Data Clustering........................................................ 826 Hanan Ahmed, Ain Shams University, Egypt Howida A. Shedeed, Ain Shams University, Egypt Safwat Hamad, Ain Shams University, Egypt Mohamed F. Tolba, Ain Shams University, Egypt



Chapter 37 A Fragile Watermarking Chaotic Authentication Scheme Based on Fuzzy C-Means for Image Tamper Detection................................................................................................................................. 856 Kamal Hamouda, Mansoura University, Egypt Mohammed Elmogy, Mansoura University, Egypt B. S. El-Desouky, Mansoura University, Egypt Chapter 38 New Mechanisms to Enhance the Performances of Arabic Text Recognition System: Feature Selection............................................................................................................................................... 879 Marwa Amara, SOIE Laboratory, Tunisia Kamel Zidi, University of Tabouk, Saudi Arabia Chapter 39 Bio-Inspired Optimization Algorithms for Arabic Handwritten Characters....................................... 897 Ahmed.T. Sahlol, Damietta University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Aboul Ella Hassanien, Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Chapter 40 Telemetry Data Mining Techniques, Applications, and Challenges.................................................... 915 Sara Ahmed, Al Azhar University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Tarek Gaber, Suez Canal University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Aboul Ella Hassanien, Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Chapter 41 Enhanced Breast Cancer Diagnosis System Using Fuzzy Clustering Means Approach in Digital Mammography..................................................................................................................................... 925 Mohammed A. Osman, Helwan University, Egypt Ashraf Darwish, Helwan University, Egypt Ayman E. Khedr, Helwan University, Egypt Atef Z. Ghalwash, Helwan University, Egypt Aboul Ella Hassanien, Cairo University, Egypt Chapter 42 TAntNet-4: A Threshold-Based AntNet Algorithm with Improved Scout Behavior.......................... 942 Ayman M. Ghazy, Cairo University, Egypt Hesham A. Hefny, Cairo University, Egypt



Chapter 43 Digital Images Segmentation Using a Physical-Inspired Algorithm................................................... 975 Diego Oliva, Tecnológico de Monterrey, Mexico & Universidad de Guadajalara, Mexico & Tomsk Polytechnic University, Russia & Scientific Research Group in Egypt (SRGE), Egypt Aboul Ella Hassanien, Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Chapter 44 A Proposed Architecture for Key Management Schema in Centralized Quantum Network............... 997 Ahmed Farouk, Zewail City of Science and Technology, Egypt & Mansoura University, Egypt Mohamed Elhoseny, Mansoura University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Josep Batle, Universitat de les Illes Balears, Spain Mosayeb Naseri, Islamic Azad University, Kermanshah, Iran Aboul Ella Hassanien, Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Chapter 45 Secure Image Processing and Transmission Schema in Cluster-Based Wireless Sensor  Network.............................................................................................................................................. 1022 Mohamed Elhoseny, Mansoura University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Ahmed Farouk, Zewail City of Science and Technology, Egypt & Mansoura University, Egypt Josep Batle, Universitat de les Illes Balears, Spain Abdulaziz Shehab, Mansoura University, Egypt Aboul Ella Hassanien, Cairo University, Giza, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Chapter 46 Color Invariant Representation and Applications.............................................................................. 1041 Abdelhameed Ibrahim, Mansoura University, Egypt Takahiko Horiuchi, Chiba University, Japan Shoji Tominaga, Chiba University, Japan Aboul Ella Hassanien, Cairo University, Egypt Chapter 47 An Efficient Approach for Community Detection in Complex Social Networks Based on Elephant Swarm Optimization Algorithm........................................................................................................ 1062 Khaled Ahmed, Cairo University, Egypt Aboul Ella Hassanien, Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Ehab Ezzat, Cairo University, Egypt



Chapter 48 Designing Multilayer Feedforward Neural Networks Using Multi-Verse Optimizer........................ 1076 Mohamed F. Hassanin, Fayoum University, Egypt Abdullah M. Shoeb, Taibah University, Saudi Arabia Aboul Ella Hassanien, Cairo University, Egypt Compilation of References................................................................................................................... xl About the Contributors.................................................................................................................... clxii Index.................................................................................................................................................. clxix

Detailed Table of Contents

Preface............................................................................................................................................xxxviii

Volume I

Section 1: State-of-the-Art Techniques

Chapter 1
T-Spanner Problem: Genetic Algorithms for the T-Spanner Problem..................................................... 1
Riham Moharam, Suez Canal University, Egypt
Ehab Morsy, Suez Canal University, Egypt
Ismail A. Ismail, 6 October University, Egypt
The t-spanner problem is a popular combinatorial optimization problem with different applications in communication networks and distributed systems. This chapter considers the problem of constructing a t-spanner subgraph H of a given undirected edge-weighted graph G, in the sense that the distance between every pair of vertices in H is at most t times the shortest distance between the two vertices in G. The value of t, called the stretch factor, quantifies the quality of the distance approximation of the corresponding t-spanner subgraph. The chapter studies two variations of the problem, the Minimum t-Spanner Subgraph (MtSS) and the Minimum Maximum Stretch Spanning Tree (MMST). Given a value for the stretch factor t, the MtSS problem asks for the t-spanner subgraph of minimum total weight in G. The MMST problem looks for a tree T in G that minimizes the maximum distance between all pairs of vertices in V (i.e., minimizes the stretch factor of the constructed tree). It is easy to conclude from the literature that both problems are NP-hard. This chapter presents genetic algorithms that return high-quality solutions for these two problems (a brief illustrative sketch of the t-spanner property follows the Chapter 4 entry below).

Chapter 2
Breast Cancer Diagnosis Using Relational Discriminant Analysis of Malignancy-Associated Changes.................................................................................................................................................. 22
Dmitry Klyushin, Kyiv National Taras Shevchenko University, Ukraine
Natalia Boroday, National Academy of Sciences of Ukraine, Ukraine
Kateryna Golubeva, Kyiv National Taras Shevchenko University, Ukraine
Maryna Prysiazhna, Kyiv National Taras Shevchenko University, Ukraine
Maksym Shlykov, Kyiv National Taras Shevchenko University, Ukraine
The chapter is devoted to the description of a novel method of breast cancer diagnostics based on the analysis of the distribution of the DNA concentration in interphase nuclei of epitheliocytes of buccal epithelium, with the aid of novel algorithms of statistical machine learning, namely a novel proximity measure between multivariate samples, a novel algorithm for constructing tolerance ellipsoids, a novel statistical depth, and a novel method of multivariate ordering. In contrast to common diagnostic methods used in oncology, this method is non-invasive and offers a high rate of accuracy and sensitivity.

Chapter 3
Early-Stage Ovarian Cancer Diagnosis Using Fuzzy Rough Sets with SVM Classification................. 43
Nora Shoaip, Mansoura University, Egypt
Mohammed Elmogy, Mansoura University, Egypt
Alaa M. Riad, Mansoura University, Egypt
Hosam Zaghloul, Mansoura University, Egypt
Farid A. Badria, Mansoura University, Egypt
Ovarian cancer is one of the most dangerous cancers among women and ranks high among the cancers causing death. Ovarian cancer diagnosis is very difficult, especially at an early stage, because most symptoms associated with ovarian cancer, such as difficulty eating or feeling full quickly, pelvic or abdominal pain, and bloating, are common and found in women who do not have ovarian cancer. The CA-125 test is used as a tumor marker, and high levels could be a sign of ovarian cancer, but this is not always true because not all women with ovarian cancer have high CA-125 levels; only about 20% of ovarian cancers are found at an early stage. In this chapter, we try to find the most important rules helping in early-stage ovarian cancer diagnosis by evaluating the significance of the relationship between ovarian cancer and the amino acids. Therefore, we propose a Fuzzy Rough feature selection with Support Vector Machine (SVM) classification model. In the pre-processing stage, we use Fuzzy Rough set theory for feature selection. In the post-processing stage, we use SVM classification, which is a powerful method for obtaining good classification performance. Finally, we compare the output results of the proposed system with other classification techniques to guarantee the highest classification performance.

Chapter 4
Data Storage Security Service in Cloud Computing: Challenges and Solutions................................... 61
Alshaimaa Abo-alian, Ain Shams University, Egypt
Nagwa L. Badr, Ain Shams University, Egypt
Mohamed F. Tolba, Ain Shams University, Egypt
Cloud computing is an emerging computing paradigm that is rapidly gaining attention as an alternative to other traditional hosted application models. The cloud environment provides on-demand, elastic, and scalable services; moreover, it can provide these services at lower costs. However, this new paradigm poses new security issues and threats because cloud service providers are not in the same trust domain as cloud customers. Furthermore, data owners cannot control the underlying cloud environment. Therefore, new security practices are required to guarantee the availability, integrity, privacy, and confidentiality of the outsourced data. This chapter highlights the main security challenges of the cloud storage service and introduces some solutions to address those challenges. The proposed solutions present a way to protect data integrity, privacy, and confidentiality by integrating data auditing and access control methods.
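The t-spanner property quoted in Chapter 1's abstract — every pairwise distance in the subgraph H is at most t times the corresponding shortest distance in G — can be checked directly. The following is a minimal illustrative sketch only, not the chapter's genetic algorithm; it assumes the networkx library, and the toy graph and subgraph are hypothetical.

```python
# Illustrative check of the t-spanner property from Chapter 1's abstract.
# Assumes networkx; the graph G and subgraph H below are hypothetical toy examples.
import networkx as nx

def stretch_factor(G, H):
    """Maximum over vertex pairs of dist_H(u, v) / dist_G(u, v)."""
    dG = dict(nx.all_pairs_dijkstra_path_length(G, weight="weight"))
    dH = dict(nx.all_pairs_dijkstra_path_length(H, weight="weight"))
    worst = 1.0
    for u in G.nodes:
        for v in G.nodes:
            if u != v:
                worst = max(worst, dH[u][v] / dG[u][v])
    return worst

def is_t_spanner(G, H, t):
    """True if H spans all vertices of G and stretches no shortest distance by more than t."""
    return set(H.nodes) == set(G.nodes) and stretch_factor(G, H) <= t

if __name__ == "__main__":
    G = nx.Graph()
    G.add_weighted_edges_from([("a", "b", 1), ("b", "c", 1), ("c", "d", 1),
                               ("a", "d", 2), ("a", "c", 1)])
    H = G.edge_subgraph([("a", "b"), ("b", "c"), ("c", "d")]).copy()
    print(stretch_factor(G, H))     # 2.0 for this toy graph
    print(is_t_spanner(G, H, t=2))  # True
```

A genetic algorithm such as the chapter describes would evaluate candidate subgraphs with a fitness built from exactly these quantities (total edge weight and stretch factor).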



Chapter 5
Workload Management Systems for the Cloud Environment................................................................ 94
Eman A. Maghawry, Ain Shams University, Egypt
Rasha M. Ismail, Ain Shams University, Egypt
Nagwa L. Badr, Ain Shams University, Egypt
Mohamed F. Tolba, Ain Shams University, Egypt
Workload management is a performance management process in which an autonomic database management system in a cloud environment efficiently makes use of its virtual resources. Workload management for concurrent queries is one of the challenging aspects of executing queries over the cloud. The core problem is to manage any unpredictable overload with respect to varying resource capabilities and performances. This chapter proposes an efficient workload management system for controlling query execution over a cloud. The chapter presents an architecture to improve the query response time. It handles users' queries and then selects the suitable resources for executing these queries. Furthermore, it manages the life cycle of virtual resources by responding to any load that occurs on the resources. This is done by dynamically rebalancing the query distribution load across the resources in the cloud. The results show that applying this workload management system improves the query response time by 68%.

Chapter 6
Segmentation of Brain Tumor from MRI Images Based on Hybrid Clustering Techniques............... 114
Eman A. Abdel Maksoud, Mansoura University, Egypt
Mohammed Elmogy, Mansoura University, Egypt
Rashid Mokhtar Al-Awadi, Mansoura University, Egypt
The popularity of clustering in segmentation encouraged us to develop a new medical image segmentation system based on two hybrid clustering techniques. Our medical system provides accurate detection of brain tumors with minimal time. The hybrid techniques make full use of the merits of these clustering techniques and overcome their shortcomings. The first is based on K-means and fuzzy C-means (KIFCM). The second is based on K-means and particle swarm optimization (KIPSO). KIFCM helps fuzzy C-means overcome its slow convergence speed (a minimal sketch of this K-means/fuzzy C-means hand-off follows the Chapter 9 entry below). KIPSO provides global optimization in less time: it helps K-means escape from local optima by using particle swarm optimization (PSO), and it helps PSO reduce the computation time by using K-means. Comparisons were made between the proposed techniques and K-means, fuzzy C-means, expectation maximization, mean shift, and PSO using three benchmark brain datasets. The results clarify the effectiveness of our second proposed technique (KIPSO).

Chapter 7
Localization and Mapping for Indoor Navigation: Survey.................................................................. 136
Heba Gaber, Ain-Shams University, Egypt
Mohamed Marey, Ain-Shams University, Egypt
Safaa Amin, Ain-Shams University, Egypt
Mohamed F. Tolba, Ain-Shams University, Egypt
Mapping and exploration for the purpose of navigation in unknown or partially unknown environments is a challenging problem, especially in indoor environments where GPS signals cannot give the required accuracy. This chapter discusses the main aspects of designing a Simultaneous Localization and Mapping (SLAM) system architecture with the ability to function in situations where map information or current positions are initially unknown or partially unknown and where environment modifications are possible. Achieving this capability makes these systems significantly more autonomous and ideal for a large range of applications, especially indoor navigation for humans and for robotic missions. The chapter surveys the existing algorithms and technologies used for localization and mapping and highlights the use of SLAM algorithms for indoor navigation. The approach proposed for the current research is also presented.

Chapter 8
Enzyme Function Classification: Reviews, Approaches, and Trends.................................................. 161
Mahir M. Sharif, Cairo University, Egypt & Omdurman Islamic University, Sudan & Scientific Research Group in Egypt (SRGE), Egypt
Alaa Tharwat, Suez Canal University, Egypt & Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt
Aboul Ella Hassanien, Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt
Hesham A. Hefny, Cairo University, Egypt
Enzymes are important in our lives and play a vital role in most biological processes in living organisms, such as metabolic pathways. The classification of enzyme functionality from sequence data, structure data, or extracted features remains a challenging task. Traditional experiments consume considerable time, effort, and cost. On the other hand, automated classification of enzymes saves effort, money, and time. The aim of this chapter is to review the different approaches developed and conducted to classify and predict the functions of enzyme proteins, in addition to the new trends and challenges that could be considered now and in the future. The chapter addresses the three main approaches used to classify the function of enzymatic proteins and illustrates the mechanism, pros, cons, and examples of each one.

Chapter 9
A Review of Vessel Segmentation Methodologies and Algorithms: Comprehensive Review............ 187
Gehad Hassan, Fayoum University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt
Aboul Ella Hassanien, Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt
“Prevention is better than cure” is a true statement that all of us neglect. One of the main factors in a speedy recovery from any disease is discovering it before it reaches advanced stages. From here comes the importance of computer systems that save time and achieve accurate results in identifying diseases and their first symptoms. One of these systems is the retinal image analysis system, which plays a key role as the first step of Computer-Aided Diagnosis (CAD) systems, in addition to monitoring the patient's health status under different treatment methods to assess how they affect the disease. In this chapter, the authors examine most of the approaches used for vessel segmentation in retinal images, and a review of techniques is presented, comparing their quality and accessibility and analyzing and categorizing them. The chapter gives a description and highlights the key points and the performance measures of each one.
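The K-means-then-fuzzy-C-means hand-off described in Chapter 6's entry (KIFCM) can be illustrated in a few lines: K-means supplies initial centroids so that fuzzy C-means starts near a good solution and converges in fewer iterations. This is a minimal sketch on synthetic data, assuming NumPy and scikit-learn; it is not the chapter's exact algorithm or its MRI segmentation pipeline.

```python
# Minimal KIFCM-style sketch: K-means centroids seed a fuzzy C-means refinement.
# Synthetic 2-D data and all parameters are hypothetical, not the chapter's MRI setup.
import numpy as np
from sklearn.cluster import KMeans

def fuzzy_c_means(X, centers, m=2.0, n_iter=50, eps=1e-6):
    """Run standard FCM updates starting from the supplied centers."""
    for _ in range(n_iter):
        # distances of every point to every center: shape (n_points, n_clusters)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # membership update: u[k, i] = 1 / sum_j (d[k, i] / d[k, j]) ** (2 / (m - 1))
        u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
        # weighted-mean center update
        new_centers = (u.T ** m) @ X / np.sum(u.T ** m, axis=1, keepdims=True)
        converged = np.linalg.norm(new_centers - centers) < eps
        centers = new_centers
        if converged:
            break
    return centers, u

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc, 0.3, size=(100, 2)) for loc in ((0, 0), (3, 0), (0, 3))])
    init = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).cluster_centers_
    centers, memberships = fuzzy_c_means(X, init)
    print(np.round(centers, 2))  # roughly the three synthetic cluster centers
```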



Chapter 10 Cloud Services Publication and Discovery.......................................................................................... 204 Yasmine M. Afify, Ain Shams University, Egypt Ibrahim F. Moawad, Ain Shams University, Egypt Nagwa L. Badr, Ain Shams University, Egypt Mohamed F. Tolba, Ain Shams University, Egypt Cloud computing is an information technology delivery model accessed over the Internet, and its adoption rate is dramatically increasing. Diverse cloud service advertisements make it harder for cloud users to locate and identify the service offers they require. These challenges highlight the need for a consistent cloud service registry to serve as a mediator between cloud providers and users. In this chapter, state-of-the-art research work related to cloud service publication and discovery is surveyed. Based on the survey findings, a set of key limitations is emphasized, and a discussion of challenges and future requirements is presented. In order to contribute to the cloud services publication and discovery area, a semantic-based system for unified Software-as-a-Service (SaaS) service advertisements is proposed. Its foundation is a focus on the business-oriented perspective of SaaS services and on semantics. A service registration template, a guided registration model, and a registration system are introduced. Additionally, a semantic similarity model for matchmaking of service metadata is presented. Section 2 Applications-Based Machine Learning Chapter 11 Enhancement of Data Quality in Health Care Industry: A Promising Data Quality Approach.......... 230 Asmaa S. Abdo, Menoufia University, Egypt Rashed K. Salem, Menoufia University, Egypt Hatem M. Abdul-Kader, Menoufia University, Egypt Ensuring data quality is a growing challenge, particularly with emerging big data applications. This chapter highlights data quality concepts, terminologies, and techniques, as well as research issues. Recent studies have shown that databases often suffer from inconsistent data, which ought to be resolved in the cleaning process. Data mining techniques can play a key role in ensuring data quality and can be reused efficiently in the data cleaning process. In this chapter, we introduce an approach that autonomously and dependably generates rules from the databases themselves in order to detect data inconsistency problems in large databases. The proposed approach employs confidence and lift measures together with integrity constraints to guarantee that the generated rules are minimal, non-redundant, and precise. Since healthcare applications are critical and managing healthcare environments efficiently improves patient care, the proposed approach is validated against several datasets from the healthcare environment. It provides clinicians with an automated approach for enhancing the quality of electronic medical records. We experimentally demonstrate that the proposed approach achieves significant enhancement over existing approaches.
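As a rough illustration of the confidence and lift measures mentioned in the Chapter 11 summary above, the sketch below computes them for one candidate rule over a toy record set. The data, the attribute names, and the helper function are invented for the example; they are not taken from the chapter.

```python
# Illustrative sketch (not the chapter's implementation): support, confidence,
# and lift of a candidate rule "antecedent -> consequent" over a list of records.
def rule_metrics(records, antecedent, consequent):
    n = len(records)
    a = sum(1 for r in records if antecedent.issubset(r))                   # count(A)
    b = sum(1 for r in records if consequent.issubset(r))                   # count(B)
    ab = sum(1 for r in records if (antecedent | consequent).issubset(r))   # count(A and B)
    support = ab / n
    confidence = ab / a if a else 0.0
    lift = confidence / (b / n) if b else 0.0
    return support, confidence, lift

# Toy medical-record style data: each record is a set of attribute=value items.
records = [
    {"diagnosis=diabetes", "hba1c=high", "medication=metformin"},
    {"diagnosis=diabetes", "hba1c=high", "medication=insulin"},
    {"diagnosis=diabetes", "hba1c=normal", "medication=metformin"},
    {"diagnosis=healthy", "hba1c=normal"},
]
print(rule_metrics(records, {"diagnosis=diabetes", "hba1c=high"}, {"medication=metformin"}))
```

Rules with high confidence and a lift above 1 indicate genuine, non-redundant associations, which is why such measures are typically combined with integrity constraints when filtering automatically generated rules.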



Chapter 12 Investigation of Software Reliability Prediction Using Statistical and Machine Learning Methods............................................................................................................................................... 251 Pradeep Kumar, Maulana Azad National Urdu University, India Abdul Wahid, Maulana Azad National Urdu University, India Software reliability is a statistical measure of how well software operates with respect to its requirements. There are two related software engineering research issues about reliability requirements. The first is achieving the necessary reliability, i.e., choosing and employing appropriate software engineering techniques in system design and implementation. The second is the assessment of reliability as a method of assurance that precedes system deployment. In the past few years, various software reliability models have been introduced. These models have been developed in response to the need of software engineers, system engineers, and managers to quantify the concept of software reliability. This chapter investigates the performance of some classical and intelligent machine learning techniques, such as linear regression (LR), radial basis function networks (RBFN), generalized regression neural networks (GRNN), and support vector machines (SVM), for predicting software reliability. The effectiveness of LR and the machine learning methods is demonstrated with the help of sixteen datasets taken from the Data & Analysis Centre for Software (DACS). Two performance measures, root mean squared error (RMSE) and mean absolute percentage error (MAPE), obtained from rigorous experiments, are compared quantitatively. Chapter 13 Fuzzy-Based Approach for Reducing the Impacts of Climate Changes on Agricultural Crops.......... 272 Ahmed M. Gadallah, Cairo University, Egypt Assem H. Mohammed, Cairo University, Egypt Climate changes play a significant role in the crop plantation process. Such changes affect the suitability of planting many crops on their traditional plantation dates in a given place. Conversely, many such crops become more suitable for planting at new dates in their traditional locations, or in other new locations, as the climate changes. This chapter presents a fuzzy-based approach for optimizing crop planting dates under the ongoing changes in climate at a given place. The proposed approach comprises four phases: the first prepares the climate data, the second defines suitability membership functions, the third performs automatic fuzzy clustering, and the fourth carries out fuzzy selection and optimization of the most suitable plantation dates for each crop. The chapter consists of an introduction, related work, the proposed approach, two case studies, a discussion of results, future research directions, and finally the chapter conclusion. Chapter 14 Directional Multi-Scale Stationary Wavelet-Based Representation for Human Action Classification........................................................................................................................................ 295 M. N. Al-Berry, Ain Shams University, Egypt Mohammed A.-M. Salem, Ain Shams University, Egypt H. M. Ebeid, Ain Shams University, Egypt A. S. Hussein, Arab Open University, Kuwait Mohamed F. Tolba, Ain Shams University, Egypt Human action recognition is a very active field in computer vision.
Many important applications depend on accurate human action recognition, which is based on accurate representation of the actions. These



applications include surveillance, athletic performance analysis, driver assistance, robotics, and human-centered computing. This chapter presents a thorough review of the field, concentrating on recent action representation methods that use spatio-temporal information. In addition, the authors propose a stationary wavelet-based representation of natural human actions in realistic videos. The proposed representation utilizes the 3D Stationary Wavelet Transform to encode the directional multi-scale spatio-temporal characteristics of the motion available in a frame sequence. It was tested using the Weizmann and KTH datasets, and produced good preliminary results while having reasonable computational complexity compared to existing state-of-the-art methods. Chapter 15 Data Streams Processing Techniques....................................... 320 Fatma Mohamed, Ain Shams University, Egypt Rasha M. Ismail, Ain Shams University, Egypt Nagwa L. Badr, Ain Shams University, Egypt Mohamed F. Tolba, Ain Shams University, Egypt Many modern applications in several domains, such as sensor networks, financial applications, web logs, and click-streams, operate on continuous, unbounded, rapid, time-varying streams of data elements. These applications present new challenges that are not addressed by traditional data management techniques. For the query processing of continuous data streams, we consider in particular continuous queries, which are evaluated continuously as data streams continue to arrive. The answer to a continuous query is produced over time, always reflecting the stream data seen so far. One of the most critical requirements of stream processing is fast processing, so parallel and distributed processing are good solutions. This chapter gives (1) an analysis of the different continuous query processing techniques; (2) a comparative study of data stream execution environments; and (3) a proposed integrated system for processing data streams based on cloud computing, which applies a continuous query optimization technique in a cloud environment. Chapter 16 A Preparation Framework for EHR Data to Construct CBR Case-Base............................................. 345 Shaker El-Sappagh, Mansoura University, Egypt Mohammed Elmogy, Mansoura University, Egypt Alaa M. Riad, Mansoura University, Egypt Hosam Zaghloul, Mansoura University, Egypt Farid A. Badria, Mansoura University, Egypt Diabetes mellitus diagnosis is an experience-based problem, and Case-Based Reasoning (CBR) is the first choice for such problems. CBR depends on the quality of its case-base structure and contents; however, building a case-base is a challenge. Electronic Health Record (EHR) data can be used as a starting point for building case-bases, but this requires a set of preparation steps. This chapter proposes an EHR-based case-base preparation framework. It has three phases: data preparation, coding, and fuzzification. The first two phases are discussed in this chapter using a diabetes diagnosis dataset collected from the EHRs of 60 patients. The result is the case-base knowledge. The first phase uses some machine-learning algorithms for case-base data preparation. For the encoding phase, we propose and apply an encoding methodology based on SNOMED-CT, and we build an OWL 2 ontology from the collected SNOMED-CT concepts. A CBR prototype has been designed, and the results show enhancements to the diagnosis accuracy.
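To make the idea of a continuous query (Chapter 15 above) concrete, the following minimal sketch maintains a sliding-window average that is refreshed as each stream element arrives. It is an in-memory simplification for illustration only, not the cloud-based system proposed in that chapter; the class and window size are invented for the example.

```python
from collections import deque

# Minimal sketch of a continuous query: a sliding-window average whose answer
# is updated incrementally as each new stream element arrives.
class SlidingWindowAverage:
    def __init__(self, window_size):
        self.window = deque()
        self.window_size = window_size
        self.total = 0.0

    def update(self, value):
        self.window.append(value)
        self.total += value
        if len(self.window) > self.window_size:
            self.total -= self.window.popleft()   # expire the oldest element
        return self.total / len(self.window)      # current answer of the query

query = SlidingWindowAverage(window_size=3)
for reading in [10, 12, 11, 50, 9]:               # e.g. a stream of sensor readings
    print(reading, "->", round(query.update(reading), 2))
```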



Chapter 17 Detecting Significant Changes in Image Sequences............................................................................ 379 Sergii Mashtalir, Kharkiv National University of Radio Electronics, Ukraine Olena Mikhnova, Kharkiv Petro Vasylenko National Technical University of Agriculture, Ukraine In this chapter the authors provide an overview of contemporary artificial intelligence techniques designed for change detection in image and video sequences. A variety of image features have been analyzed for content presentation at a low level. In an attempt towards high-level interpretation by a machine, a novel approach to image comparison has been proposed and described in detail. It utilizes techniques of salient point detection, video scene identification, spatial image segmentation, feature extraction, and analysis. Metrics implemented for image partition matching enhance the performance and quality of the results, which has been confirmed by several evaluations. A review of evaluation measures is also given, along with references to publicly available test datasets. A conclusion is provided regarding trends of future development in image and video processing. Chapter 18 Multiple Sequence Alignment Optimization Using Meta-Heuristic Techniques................................ 409 Mohamed Issa, Zagazig University, Egypt Aboul Ella Hassanien, Cairo University, Egypt Sequence alignment is a vital process in many biological applications such as phylogenetic tree construction, DNA fragment assembly, and structure/function prediction. The two kinds of alignment are pairwise alignment, which aligns two sequences, and multiple sequence alignment (MSA), which aligns more than two sequences. The exact method of alignment is based on the dynamic programming (DP) approach, whose running time grows exponentially with the length and the number of the aligned sequences. Stochastic or meta-heuristic techniques speed up the alignment but achieve near-optimal alignment accuracy rather than that of DP. Hence, this chapter reviews recent developments in MSA using meta-heuristic algorithms. In addition, two recent techniques are examined in more depth: the first is fragmented protein sequence alignment using two-layer particle swarm optimization (FTLPSO); the second is multiple sequence alignment using a multi-objective bacterial foraging optimization algorithm (MO-BFO). Chapter 19 Recent Survey on Medical Image Segmentation................................................................................. 424 Mohammed A.-M. Salem, Ain Shams University, Egypt Alaa Atef, Ain Shams University, Egypt Alaa Salah, Ain Shams University, Egypt Marwa Shams, Ain Shams University, Egypt This chapter presents a survey of the techniques of medical image segmentation. Image segmentation methods are given in three groups based on the image features used by each method. The advantages and disadvantages of the existing methods are evaluated, and the motivations to develop new techniques with respect to the addressed problems are given. Digital images and digital videos are pictures and films, respectively, that have been converted into a computer-readable binary format consisting of logical zeros and ones. An image is a still picture that does not change in time, whereas a video evolves in time



and generally contains moving and/or changing objects. An important feature of digital images is that they are multidimensional signals, i.e., they are functions of more than a single variable. In the classical study of digital signal processing, the signals are usually one-dimensional functions of time. Images, however, are functions of two, and perhaps three, space dimensions in the case of colored images, whereas a digital video as a function includes a third (or fourth) time dimension as well. A consequence of this is that digital image processing is data-intensive, meaning that significant computational and storage resources are required. Chapter 20 Machine Learning Applications in Breast Cancer Diagnosis.............................................................. 465 Syed Jamal Safdar Gardezi, Universiti Teknologi Petronas, Malaysia Mohamed Meselhy Eltoukhy, Suez Canal University, Egypt Ibrahima Faye, Universiti Teknologi Petronas, Malaysia Breast cancer is one of the leading causes of death in women worldwide, and early detection is the key to reducing mortality rates. Mammography screening has proven to be one of the effective tools for the diagnosis of breast cancer. A computer-aided diagnosis (CAD) system is a fast, reliable, and cost-effective tool for assisting radiologists/physicians in diagnosing breast cancer. CAD systems play an increasingly important role in the clinic by providing a second opinion, and clinical trials have shown that CAD systems have improved the accuracy of breast cancer detection. A typical CAD system involves three major steps: segmentation of suspected lesions, feature extraction, and classification of these regions as normal or abnormal and further as benign or malignant. The diagnostic ability of any CAD system depends on accurate segmentation, feature extraction techniques, and, most importantly, classification tools able to discriminate normal tissue from abnormal tissue. In this chapter we discuss the application of machine learning algorithms (e.g., ANN, binary trees, SVM) together with segmentation and feature extraction techniques in CAD system development. Various methods used in the detection and diagnosis of breast lesions in mammography are reviewed, and a brief introduction to the machine learning tools used in diagnosis, together with their classification performance on various segmentation and feature extraction techniques, is presented. Chapter 21 A Hybrid Optimization Algorithm for Single and Multi-Objective Optimization Problems.............. 491 Rizk M. Rizk-Allah, Menoufia University, Egypt Aboul Ella Hassanien, Cairo University, Egypt This chapter presents a hybrid optimization algorithm, namely FOA-FA, for solving single- and multi-objective optimization problems. The proposed algorithm integrates the benefits of the fruit fly optimization algorithm (FOA) and the firefly algorithm (FA) to avoid entrapment in local optima and premature convergence of the population. FOA operates in the direction of seeking the optimum solution, while the firefly algorithm (FA) is used to accelerate the optimum-seeking process and speed up convergence to the global solution. Further, the multi-objective optimization problem is scalarized to a single-objective problem by the weighting method, and the proposed algorithm is then used to derive the non-inferior solutions rather than a single optimal solution.
Finally, the proposed FOA-FA algorithm is tested on different benchmark problems, in both single- and multi-objective settings, as well as on two engineering applications. The numerical comparisons reveal the robustness and effectiveness of the proposed algorithm.
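The weighting (weighted-sum) method referred to in the Chapter 21 summary collapses a vector of objectives into one scalar that any single-objective optimizer can minimize; sweeping the weights yields different non-inferior solutions. The snippet below illustrates this with two made-up objectives and a crude grid search standing in for the FOA-FA hybrid; it is not the chapter's algorithm.

```python
# Generic weighted-sum scalarization of two invented objectives f1 and f2.
def f1(x):
    return x ** 2            # first objective

def f2(x):
    return (x - 2.0) ** 2    # second objective

def scalarized(x, w1, w2):
    return w1 * f1(x) + w2 * f2(x)

# Sweeping the weights and minimizing each scalarized problem (here by a crude
# grid search in place of a metaheuristic) yields a set of non-inferior solutions.
candidates = [i / 100.0 for i in range(-100, 301)]
for w1 in (0.1, 0.5, 0.9):
    w2 = 1.0 - w1
    best = min(candidates, key=lambda x: scalarized(x, w1, w2))
    print(f"w1={w1:.1f}  x*={best:.2f}  f1={f1(best):.3f}  f2={f2(best):.3f}")
```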



Chapter 22 Neuro-Imaging Machine Learning Techniques for Alzheimer’s Disease Diagnosis........................... 522 Gehad Ismail Sayed, Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Aboul Ella Hassanien, Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Alzheimer’s disease (AD) is considered one of the most common forms of dementia, affecting seniors aged 65 and over. The standard methods for identifying AD are usually based on behavioral, neuropsychological, and cognitive tests, sometimes followed by a brain scan. Advanced medical imaging modalities such as MRI, together with pattern recognition techniques, have become good tools for predicting AD. In this chapter, an automatic AD diagnosis system based on machine learning tools applied to MRI images is proposed. A benchmark dataset is used to evaluate the performance of the proposed system; the adopted dataset consists of 20 patients for each diagnostic class: cognitive impairment, Alzheimer’s disease, and normal. Several evaluation measures are used to assess the robustness of the proposed diagnosis system, and the experimental results reveal its good performance.
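At its core, an MRI-based diagnosis system of the kind summarized for Chapter 22 is a feature-extraction step followed by a supervised classifier evaluated with cross-validation. The scikit-learn sketch below only shows that structure; the random stand-in features, the class layout, and the SVM settings are assumptions made for illustration, not the chapter's actual pipeline.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for features extracted from MRI scans: 60 subjects, 3 classes
# (normal, cognitive impairment, AD), 50 hypothetical image features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 50))
y = np.repeat([0, 1, 2], 20)

# Generic classification pipeline; the accuracy here is meaningless because the
# features are random -- the point is only the structure of such a system.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(model, X, y, cv=5)
print("cross-validated accuracy:", scores.mean())
```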

Volume II Chapter 23 Swarm Intelligence Based on Remote Sensing Image Fusion: Comparison between the Particle Swarm Optimization and the Flower Pollination Algorithm............................................................... 541 Reham Gharbia, Nuclear Materials Authority, Egypt Aboul Ella Hassanien, Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt This chapter presents remote sensing image fusion based on swarm intelligence. Image fusion combines multi-sensor images into a single, more informative image. Remote sensing image fusion is an effective way to extract a large volume of data from multisource images. However, traditional image fusion approaches cannot meet the requirements of applications because they can lose spatial information or distort spectral characteristics. The core of image fusion is the fusion rule, and the main challenge is obtaining suitable weights for that rule. This chapter proposes using swarm intelligence to optimize the image fusion rule. Swarm intelligence algorithms are a family of global optimizers inspired by swarm phenomena in nature and have shown good performance. In this chapter, two remote sensing image fusion methods based on swarm intelligence algorithms, Particle Swarm Optimization (PSO) and the flower pollination algorithm, are presented to obtain an adaptive image fusion rule, and a comparison between them is given. Chapter 24 Grey Wolf Optimization-Based Segmentation Approach for Abdomen CT Liver Images................. 562 Abdalla Mostafa, Cairo University, Egypt Aboul Ella Hassanien, Cairo University, Egypt Hesham A. Hefny, Cairo University, Egypt In recent years, a great deal of research has focused on the segmentation of different organs in medical images. Liver segmentation is an initial phase in liver diagnosis; it is also a challenging task because the liver’s intensity values are similar to those of other organs. This chapter proposes a grey wolf optimization-



based approach for segmenting the liver from abdominal CT images. The proposed approach combines several components to achieve this goal: grey wolf optimization, a statistical image of the liver, simple region growing, and the mean shift clustering technique. The initial cleaned image is passed to the grey wolf (GW) optimization technique, which calculates the centroids of a predefined number of clusters. Each pixel in the image is then labeled with the number of the cluster nearest to its intensity value. A binary statistical image of the liver is used to extract the potential area in which the liver might exist; it is multiplied by the clustered image to obtain an initial segmented liver. Then region growing (RG) is used to enhance the segmented liver. Finally, the mean shift clustering technique is applied to extract the regions of interest in the segmented liver. A set of 38 images, taken in the pre-contrast phase, was used for liver segmentation and for testing the proposed approach. For evaluation, the similarity index measure is used to validate the success of the approach. The experimental results show that the proposed approach achieves an overall accuracy of 94.08%. Chapter 25 3D Watermarking Approach Using Particle Swarm Optimization Algorithm.................................... 582 Mona M. Soliman, Scientific Research Group in Egypt, Egypt Aboul Ella Hassanien, Scientific Research Group in Egypt, Egypt This work proposes a watermarking approach that utilizes bio-inspired techniques, such as swarm intelligence, to optimize watermarking algorithms for 3D models. We present a new robust 3D mesh watermarking authentication method that ensures minimal surface distortion while at the same time ensuring high robustness of the extracted watermark. To achieve these requirements, this work uses Particle Swarm Optimization (PSO) as the bio-inspired optimization technique. The experiments were executed using different sets of 3D models. In all experimental results we consider two important factors: imperceptibility and robustness. The experimental results show that the proposed approach yields a watermarked object with good visual definition; at the same time, the embedded watermark is robust against a wide variety of common attacks. Chapter 26 Particle Swarm Optimization: A Tutorial............................................................................................ 614 Alaa Tharwat, Suez Canal University, Egypt Tarek Gaber, Suez Canal University, Egypt Aboul Ella Hassanien, Cairo University, Egypt Basem E. Elnaghi, Suez Canal University, Egypt Optimization algorithms are necessary to solve many problems such as parameter tuning. Particle Swarm Optimization (PSO) is one of these optimization algorithms. The aim of PSO is to search for the optimal solution in the search space. This chapter highlights the basic background needed to understand and implement the PSO algorithm. It starts with basic definitions of the PSO algorithm and shows how the particles move in the search space to find the optimal or a near-optimal solution. Moreover, a numerical example illustrates how the particles move in a convex optimization problem, and another numerical example illustrates how PSO can become trapped in a local minimum.
Two experiments are conducted to show how the PSO searches for the optimal parameters in one-dimensional and two-dimensional spaces to solve machine learning problems.
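The particle update at the heart of the PSO tutorial summarized above combines inertia with attraction toward each particle's personal best and the swarm's global best. The sketch below implements those textbook update equations on the sphere function; the parameter values (w, c1, c2) and swarm size are common defaults chosen for illustration, not necessarily those used in the chapter.

```python
import random

def sphere(x):
    return sum(v * v for v in x)   # simple convex test function

def pso(dim=2, swarm=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    pos = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(swarm)]
    vel = [[0.0] * dim for _ in range(swarm)]
    pbest = [p[:] for p in pos]                      # personal bests
    gbest = min(pbest, key=sphere)[:]                # global best
    for _ in range(iters):
        for i in range(swarm):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                # velocity update: inertia + cognitive pull + social pull
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if sphere(pos[i]) < sphere(pbest[i]):
                pbest[i] = pos[i][:]
                if sphere(pbest[i]) < sphere(gbest):
                    gbest = pbest[i][:]
    return gbest, sphere(gbest)

print(pso())   # should converge close to the origin, the minimum of the sphere function
```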



Chapter 27 A Comparison of Open Source Data Mining Tools for Breast Cancer Classification......................... 636 Ahmed AbdElhafeez Ibrahim, Arab Academy for Science, Technology, and Maritime Transport, Egypt Atallah Ibrahin Hashad, Arab Academy for Science, Technology, and Maritime Transport, Egypt Negm Eldin Mohamed Shawky, Arab Academy for Science, Technology, and Maritime Transport, Egypt Data mining is a field that interconnects areas of computer science and tries to discover knowledge from databases in order to simplify decision making. Classification is a data mining task that learns from a set of instances in order to precisely predict the target class of new instances. Open source data mining tools can be used for classification. This chapter compares four tools: KNIME, Orange, Tanagra, and Weka. Our goal is to discover the most precise tool and technique for breast cancer classification. The experimental results show that some tools achieve better results than others, and that fusion classification proved better than single classification over the four datasets used. We also present a comparison between complete datasets, obtained by substituting missing feature values, and incomplete ones; the results show that some datasets achieve better accuracy when complete. Chapter 28 2D and 3D Intelligent Watermarking................................................................................................... 652 Mourad R. Mouhamed, Helwan University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Ashraf Darwish, Helwan University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Aboul Ella Hassanien, Cairo University Egypt & Scientific Research Group in Egypt (SRGE), Egypt These days, the enormous advances in information technology and the wide use of the web have made information security a major issue for achieving data protection. The authentication and copyright of information are a critical part of this issue. Researchers have sought solutions, and watermarking and cryptology are two of them. Digital watermarking refers to the process of embedding imperceptible information, called a digital watermark, into a cover multimedia object so that the information may be detected or extracted later for security purposes. The cover multimedia object used to hide the watermark information can be any digital medium that we use in our daily life for data distribution, such as audio, 2D images, 3D images, and video. The problem facing researchers in developing watermarking techniques is the trade-off between the imperceptibility and the robustness of the watermark; this chapter focuses on how intelligent algorithms can help with this issue. The chapter surveys 2D and 3D watermarking techniques and concludes that watermarking techniques are efficient for different areas of application.



Section 3 Innovative ML Applications Chapter 29 Differential Evolution Algorithm with Space Reduction for Solving Large-Scale Global Optimization Problems........................................................................................................................ 671 Ahmed Fouad Ali, Suez Canal University, Egypt Nashwa Nageh Ahmed, Suez Canal University, Egypt The differential evolution (DE) algorithm is one of the most widely applied meta-heuristic algorithms for solving global optimization problems. However, contributions applying DE to large-scale global optimization problems are still limited compared with those for low-dimensional problems. In this chapter, a new differential evolution algorithm, called differential evolution with space partitioning (DESP), is proposed to solve large-scale optimization problems. In the DESP algorithm, the search variables are divided into small groups of partitions. Each partition contains a certain number of variables and is manipulated as a subspace in the search process. Searching a limited number of variables in each partition prevents the DESP algorithm from wandering in the search space, especially in large-scale spaces. The proposed algorithm is investigated on 15 benchmark functions and compared against three DE variants. The results show that the proposed algorithm is computationally cheap and obtains good results in a reasonable time. Chapter 30 Interpreting Brain Waves..................................................................................................................... 695 Noran Magdy El-Kafrawy, Ain Shams University, Egypt Doaa Hegazy, Ain Shams University, Egypt Mohamed F. Tolba, Ain Shams University, Egypt A BCI (Brain-Computer Interface) gives you the power to manipulate things around you just by thinking of what you want to do. It allows your thoughts to be interpreted by the computer and hence acted upon. This could be used to help disabled people, to control robots remotely, or even to obtain personalized systems that adapt to your mood. The most important part of any BCI application is interpreting the brain signals, as there are many mental tasks to be considered. In this chapter, the authors focus on interpreting motor imagery tasks, more specifically imagining the left hand, right hand, foot, and tongue. Interpreting the signal consists of two main steps: feature extraction and classification. For feature extraction, Empirical Mode Decomposition (EMD) was used, and for classification, a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel was used. The authors evaluated this system using the BCI competition IV dataset and reached a very promising accuracy. Chapter 31 Data Clustering Using Sine Cosine Algorithm: Data Clustering Using SCA..................................... 715 Vijay Kumar, Thapar University, India Dinesh Kumar, GJUS&T, India Clustering techniques suffer from cluster-center initialization and local optima problems. In this chapter, a new metaheuristic algorithm, the Sine Cosine Algorithm (SCA), is used as a search method to solve these problems. The SCA explores the search space of the given dataset to find near-optimal cluster centers. A center-based encoding scheme is used to evolve the cluster centers. The proposed



SCA-based clustering technique is evaluated on four real-life datasets, and its performance is compared with recently developed clustering techniques. The experimental results reveal that SCA-based clustering gives better values in terms of cluster quality measures. Chapter 32 Complex-Valued Neural Networks: A New Learning Strategy Using Particle Swarm Optimization........................................................................................................................................ 727 Mohammed E. El-Telbany, Electronics Research Institute, Egypt Samah Refat, Ain Shams University, Egypt Engy I. Nasr, Ain Shams University, Egypt In this chapter, the authors address the problem of training complex-valued neural networks (CVNNs) using particle swarm optimization (PSO), one of the open topics in the machine learning community. Quantitative structure-activity relationship (QSAR) modelling is one of the well-developed areas in drug development through computational chemistry. The relationship between molecular structure and change in biological activity is the central focus of QSAR modelling. Machine learning algorithms are important tools for QSAR analysis and, as a result, are integrated into the drug production process. The problem of predicting real-valued drug activity is modelled by a CVNN and learned using a new PSO-based strategy. The trained CVNNs are tested on two drug sets as real-world benchmark problems. The results show that the prediction and generalization abilities of CVNNs are superior to those of conventional real-valued neural networks (RVNNs). Moreover, the convergence of CVNNs is much faster than that of RVNNs in most cases. Chapter 33 Text Classification: New Fuzzy Decision Tree Model........................................................................ 740 Ben Elfadhl Mohamed Ahmed, Higher Institute of Management, Tunisia Ben Abdessalem Wahiba, Taif University, Saudi Arabia In this chapter, supervised automatic text document classification using the fuzzy decision tree technique is proposed. Whatever algorithm is used to build fuzzy decision trees, there must be a criterion for choosing the discriminating attribute at each node to be partitioned. For fuzzy decision trees, two heuristics have usually been used to select the discriminating attribute at the node to be partitioned. In the field of text document classification there is a heuristic that has not yet been tested; this chapter tests it, analyzing and adapting it to the authors’ approach for text document classification. Chapter 34 PAGeneRN: Parallel Architecture for Gene Regulatory Network....................................................... 762 Dina Elsayad, Ain Shams University, Egypt A. Ali, Ain Shams University, Egypt Howida A. Shedeed, Ain Shams University, Egypt Mohamed F. Tolba, Ain Shams University, Egypt Gene expression analysis is an important research area of bioinformatics. Gene expression data analysis aims to understand gene interaction phenomena, gene functionality, and the effects of gene mutations. Gene regulatory network analysis is one of the gene expression data analysis tasks; it studies the topological organization of gene interactions. The regulatory network is critical for understanding pathological phenotypes and normal cell physiology. There



are many studies focusing on gene regulatory network analysis, but unfortunately some algorithms are affected by data size: their runtime is proportional to the size of the data. Therefore, parallel algorithms have been presented to enhance runtime and efficiency. This work presents background, mathematical models, and comparisons of the different techniques for gene regulatory network analysis. In addition, this work proposes a Parallel Architecture for Gene Regulatory Networks (PAGeneRN). Chapter 35 Hybrid Wavelet-Neuro-Fuzzy Systems of Computational Intelligence in Data Mining Tasks........... 787 Yevgeniy Bodyanskiy, Kharkiv National University of Radio Electronics, Ukraine Olena Vynokurova, Kharkiv National University of Radio Electronics, Ukraine Oleksii Tyshchenko, Kharkiv National University of Radio Electronics, Ukraine This work is devoted to the synthesis of adaptive hybrid systems based on Computational Intelligence (CI) methods (especially artificial neural networks (ANNs)) and Group Method of Data Handling (GMDH) ideas to obtain new qualitative results in data mining, intelligent control, and other scientific areas. GMDH artificial neural networks (GMDH-ANNs) are currently well known; their nodes are two-input N-Adalines. On the other hand, these ANNs can require a considerable number of hidden layers to reach the necessary approximation quality. The introduced Q-neurons can provide higher quality using quadratic approximation, and their main advantage is a high learning rate. Universal approximating properties of the GMDH-ANNs can be achieved with the help of compartmental R-neurons, each representing a two-input RBFN with grid partitioning of the input variables’ space. An adjustment procedure for the synaptic weights as well as for both the centers and the receptive fields is provided. At the same time, Epanechnikov kernels (whose derivatives are linear in the adjusted parameters) can be used instead of conventional Gaussian functions in order to increase the learning rate. More complex tasks deal with stochastic time series processing; this kind of task can be solved with the help of the introduced adaptive W-neurons (wavelets). The learning algorithms are characterized by both tracking and smoothing properties based on the quadratic learning criterion, and robust algorithms that eliminate the influence of abnormal outliers on the learning process are introduced as well. Theoretical results are illustrated by multiple experiments that confirm the effectiveness of the proposed approach. Chapter 36 On Combining Nature-Inspired Algorithms for Data Clustering........................................................ 826 Hanan Ahmed, Ain Shams University, Egypt Howida A. Shedeed, Ain Shams University, Egypt Safwat Hamad, Ain Shams University, Egypt Mohamed F. Tolba, Ain Shams University, Egypt This chapter proposes different hybrid clustering methods based on combining particle swarm optimization (PSO), the gravitational search algorithm (GSA), and parameter-free central force optimization (CFO) with each other and with the k-means algorithm. The proposed methods were applied to 5 real datasets from the University of California, Irvine (UCI) machine learning repository. Comparative analysis was done in terms of three measures: the sum of intra-cluster distances, the running time, and the distances between cluster centroids. The initial populations of the algorithms were enhanced to minimize the sum of intra-cluster distances.
Experimental results show that increasing the number of iterations does not have a noticeable impact on the sum of intra-cluster distances, while it has a negative impact on the running time. K-means combined with GSA (KM-GSA) and PSO combined with GSA (PSO-GSA) gave



the best performance according to the sum of intra-cluster distances, while K-means combined with PSO (KM-PSO) and KM-GSA were the best in terms of running time. Overall, KM-GSA and GSA gave the best performance. Chapter 37 A Fragile Watermarking Chaotic Authentication Scheme Based on Fuzzy C-Means for Image Tamper Detection................................................................................................................................. 856 Kamal Hamouda, Mansoura University, Egypt Mohammed Elmogy, Mansoura University, Egypt B. S. El-Desouky, Mansoura University, Egypt In the last two decades, several fragile watermarking schemes have been proposed for image authentication. In this chapter, a novel fragile watermarking authentication scheme based on chaotic maps and the Fuzzy C-Means (FCM) clustering technique is proposed. In order to improve tamper localization, detection accuracy, and the security of the watermarking system, a hybrid technique combining chaotic maps and FCM is introduced. In addition, the scheme can be applied to images of any size, not only square or even-sized images. The proposed scheme performs especially well in terms of security because the watermark passes through two levels of protection: first, the FCM clustering technique makes the watermark dependent on the plain image; second, the chaotic maps are sensitive to initial values. Experimental results show that the proposed scheme achieves superior tamper detection and localization accuracy under different attacks. Chapter 38 New Mechanisms to Enhance the Performances of Arabic Text Recognition System: Feature Selection............................................................................................................................................... 879 Marwa Amara, SOIE Laboratory, Tunisia Kamel Zidi, University of Tabouk, Saudi Arabia The recognition of a character begins with analyzing its form and extracting the features that will be exploited for identification. Primitives can be described as tools for distinguishing an object of one class from an object of another class. It is necessary to define the significant primitives during the development of an optical character recognition system. Primitives are defined by experience or by intuition. Several primitives can be extracted, but some are irrelevant or redundant. The primitive vector can become large if a large number of primitives are extracted, including redundant and irrelevant features; as a result, the performance of the recognition system becomes poor, and as the number of features increases, so does the computing time. Feature selection, therefore, is required to ensure the selection of a subset of features that gives accurate recognition and has low computational overhead. We use feature selection techniques to improve the discrimination capacity of Multilayer Perceptron Neural Networks (MLPNNs).
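A common way to score the feature subsets produced by selection methods such as those in Chapter 38 is a wrapper: the fitness of a candidate subset is the cross-validated accuracy of a classifier restricted to those features. The sketch below uses synthetic data and a k-nearest-neighbors classifier as a fast stand-in for the MLPNN mentioned in the chapter; everything in it is illustrative rather than the chapter's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a character-recognition feature matrix.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)

def subset_fitness(mask):
    """Fitness of a binary feature mask = cross-validated accuracy of a
    classifier trained only on the selected columns (the quantity a
    feature-selection search would maximize)."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, cols], y, cv=5).mean()

all_features = np.ones(20, dtype=bool)
half_features = np.array([True] * 10 + [False] * 10)
print("all features :", round(subset_fitness(all_features), 3))
print("first half   :", round(subset_fitness(half_features), 3))
```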



Chapter 39 Bio-Inspired Optimization Algorithms for Arabic Handwritten Characters....................................... 897 Ahmed T. Sahlol, Damietta University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Aboul Ella Hassanien, Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt There are still many obstacles to achieving high recognition accuracy in Arabic handwritten optical character recognition systems: each character has a different shape, and there are strong similarities between characters. In this chapter, several bio-inspired optimization algorithms for feature selection, including the Bat Algorithm, Grey Wolf Optimization, the Whale Optimization Algorithm, Particle Swarm Optimization, and the Genetic Algorithm, are presented, and Arabic handwritten character recognition is chosen as the application for assessing their ability and accuracy in recognizing Arabic characters. The experiments were performed on a benchmark dataset, CENPARMI, using k-nearest neighbors, linear discriminant analysis, and random forests. The results show that the features selected by the optimization algorithms are superior to the whole feature set in terms of both classification accuracy and processing time. Chapter 40 Telemetry Data Mining Techniques, Applications, and Challenges.................................................... 915 Sara Ahmed, Al Azhar University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Tarek Gaber, Suez Canal University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Aboul Ella Hassanien, Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt The most recent rise of telemetry centers on the use of radio-telemetry technology for tracking the traces of moving objects. Radio telemetry was first used in the 1960s for studying the behavior and ecology of wild animals. Nowadays, a wide spectrum of applications can benefit from radio-telemetry tracking methods, such as path discovery, location prediction, movement behavior analysis, and so on. Accordingly, the rapid advance of telemetry tracking systems boosts the generation of large-scale trajectory data tracing moving objects. In this study, we survey various applications of trajectory data mining and review an extensive collection of existing trajectory data mining techniques, to be used as a guideline for designing future trajectory data mining solutions.



Chapter 41 Enhanced Breast Cancer Diagnosis System Using Fuzzy Clustering Means Approach in Digital Mammography..................................................................................................................................... 925 Mohammed A. Osman, Helwan University, Egypt Ashraf Darwish, Helwan University, Egypt Ayman E. Khedr, Helwan University, Egypt Atef Z. Ghalwash, Helwan University, Egypt Aboul Ella Hassanien, Cairo University, Egypt Breast cancer, or malignant breast neoplasm, is the most common type of cancer in women, and researchers are not sure of its exact cause. If the cancer can be detected early, the options for treatment and the chances of total recovery increase. Computer Aided Diagnostic (CAD) systems can help researchers and specialists detect abnormalities early. The main goal of computerized breast cancer detection in digital mammography is to identify the presence of abnormalities such as mass lesions and microcalcification clusters (MCCs). Early detection and diagnosis of breast cancer are the key to breast cancer control and can increase the success of treatment. This chapter investigates a new CAD system for diagnosing benign and malignant breast tumors from digital mammography. X-ray mammograms are considered the most effective and reliable method for early detection of breast cancer. In this chapter, the breast tumor is segmented from the medical image using Fuzzy Clustering Means (FCM), and features are extracted from the mammogram images. These features are then used to train the classifier that classifies the tumors. The effectiveness and performance of this work are examined using classification accuracy, sensitivity, and specificity, and the practical part of the proposed system distinguishes tumors with high accuracy. Chapter 42 TAntNet-4: A Threshold-Based AntNet Algorithm with Improved Scout Behavior.......................... 942 Ayman M. Ghazy, Cairo University, Egypt Hesham A. Hefny, Cairo University, Egypt A Traffic Routing System (TRS) is one of the most important intelligent transport systems; it is used to direct vehicles to good routes and reduce congestion on the road network. The performance of a TRS mainly depends on a dynamic routing algorithm, due to the dynamic nature of traffic on the road network. The AntNet algorithm is a routing algorithm inspired by the foraging behavior of ants. TAntNet is a family of dynamic routing algorithms that uses a threshold travel time to enhance the performance of the AntNet algorithm when applied to road traffic networks. TAntNet-1 and TAntNet-2 adopt different path-update techniques to converge quickly on a discovered good route and to preserve that route. TAntNet-3 was recently proposed, inspired by the scout behavior of bees, to avoid the bad effect of forward ants that take bad routes. This chapter presents a new member of the TAntNet family, called TAntNet-4, which uses two scouts instead of one, in contrast to TAntNet-2. The new algorithm also stores the route discovered by each of the two scouts so that the corresponding backward ant can use the better of them. The experimental results confirm the high performance of TAntNet-4 compared with AntNet and the other members of the TAntNet family.
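The fuzzy C-means style clustering used in Chapters 37 and 41 above assigns each pixel a graded membership in every cluster and alternates between updating memberships and recomputing cluster centers; the tumor region then corresponds to the cluster with the appropriate (e.g., brightest) center. The sketch below runs the standard FCM updates on a toy 1-D intensity vector; it illustrates the generic algorithm only, not those chapters' exact procedures.

```python
import numpy as np

def fcm_step(x, centers, m=2.0):
    """One iteration of fuzzy C-means on 1-D data (e.g. pixel intensities):
    update the membership matrix, then recompute the cluster centers."""
    d = np.abs(x[:, None] - centers[None, :]) + 1e-12          # distances, shape (n, c)
    power = 2.0 / (m - 1.0)
    # u[k, i] = 1 / sum_j (d[k, i] / d[k, j]) ** power
    u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** power, axis=2)
    um = u ** m
    new_centers = (um * x[:, None]).sum(axis=0) / um.sum(axis=0)
    return u, new_centers

intensities = np.array([12., 15., 14., 200., 210., 190.])      # toy pixel values
centers = np.array([50., 150.])                                # initial guesses
for _ in range(10):
    u, centers = fcm_step(intensities, centers)
print("centers:", centers.round(1))
print("memberships:\n", u.round(2))
```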



Chapter 43 Digital Images Segmentation Using a Physical-Inspired Algorithm................................................... 975 Diego Oliva, Tecnológico de Monterrey, Mexico & Universidad de Guadajalara, Mexico & Tomsk Polytechnic University, Russia & Scientific Research Group in Egypt (SRGE), Egypt Aboul Ella Hassanien, Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Segmentation is one of the most important tasks in image processing. It classifies the pixels into two or more groups depending on their intensity levels and a threshold value. The classical methods exhaustively search for the best thresholds for a specific image. This process requires high computational effort; to avoid it, the use of evolutionary algorithms has increased. The Electromagnetism-Like algorithm (EMO) is an evolutionary method that mimics the attraction-repulsion mechanism among charges to evolve the members of a population. Unlike other algorithms, EMO exhibits interesting search capabilities while maintaining a low computational overhead. This chapter introduces a multilevel thresholding (MT) algorithm based on EMO with Otsu’s method as the objective function. The combination of these techniques yields a multilevel segmentation algorithm that can effectively identify the threshold values of a digital image while reducing the number of iterations. Chapter 44 A Proposed Architecture for Key Management Schema in Centralized Quantum Network............... 997 Ahmed Farouk, Zewail City of Science and Technology, Egypt & Mansoura University, Egypt Mohamed Elhoseny, Mansoura University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Josep Batle, Universitat de les Illes Balears, Spain Mosayeb Naseri, Islamic Azad University, Kermanshah, Iran Aboul Ella Hassanien, Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Most existing realizations of quantum key distribution (QKD) are point-to-point systems with one source transferring to only one destination. The development of these single-receiver systems has now reached a reasonably sophisticated level. However, many communication systems operate in a point-to-multipoint (multicast) configuration rather than in point-to-point mode, so it is crucial to demonstrate compatibility with this type of network in order to maximize the application range of QKD. Therefore, this chapter proposes an architecture for implementing a multicast quantum key distribution scheme. The proposed architecture is designed as a multicast centralized key management scheme using quantum key distribution and classical symmetric encryption. In this architecture, a secure key generation and distribution solution is proposed for a single host sending to two or more (N) receivers, using a centralized Quantum Multicast Key Distribution Centre and classical symmetric encryption.
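Otsu's method, used as the objective function for the multilevel thresholding of Chapter 43 above, scores a set of thresholds by the between-class variance of the intensity classes they induce; EMO (or any other optimizer) then searches for the threshold tuple that maximizes this score. The helper below evaluates that objective from an image histogram; the toy bimodal histogram is invented for illustration and the code is not the chapter's EMO implementation.

```python
import numpy as np

def otsu_between_class_variance(hist, thresholds):
    """Between-class variance of the classes induced by a list of thresholds
    on a 256-bin histogram (the objective an optimizer would maximize)."""
    p = hist / hist.sum()                      # normalized histogram
    levels = np.arange(len(hist))
    mu_total = (p * levels).sum()
    edges = [0] + sorted(thresholds) + [len(hist)]
    variance = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        w = p[lo:hi].sum()                     # class probability
        if w > 0:
            mu = (p[lo:hi] * levels[lo:hi]).sum() / w   # class mean
            variance += w * (mu - mu_total) ** 2
    return variance

# Toy bimodal histogram: evaluating two candidate single thresholds.
hist = np.zeros(256)
hist[40:60] = 100     # dark object
hist[180:200] = 100   # bright background
print(otsu_between_class_variance(hist, [120]))   # good split -> larger value
print(otsu_between_class_variance(hist, [50]))    # poor split -> smaller value
```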



Chapter 45 Secure Image Processing and Transmission Schema in Cluster-Based Wireless Sensor Network.............................................................................................................................................. 1022 Mohamed Elhoseny, Mansoura University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Ahmed Farouk, Zewail City of Science and Technology, Egypt & Mansoura University, Egypt Josep Batle, Universitat de les Illes Balears, Spain Abdulaziz Shehab, Mansoura University, Egypt Aboul Ella Hassanien, Cairo University, Giza, Egypt & Scientific Research Group in Egypt (SRGE), Egypt WSNs, as a new category of computer-based computing platforms and network structures, are enabling new applications in different areas such as environmental monitoring, health care, and military applications. Although many secure image processing schemes have been designed for image transmission over a network, limited resources and the dynamic environment make them unsuitable for use with Wireless Sensor Networks (WSNs). In addition, the current secure data transmission schemes in WSNs concentrate on text data and are not applicable to image transmission applications. Furthermore, secure image transmission is a big challenge in WSNs, especially for applications that use images as their main data, such as military applications, because of the limited resources of the sensor nodes, which are usually deployed in unattended environments. This chapter introduces a secure image processing and transmission scheme in WSNs using Elliptic Curve Cryptography (ECC) and Homomorphic Encryption (HE). Chapter 46 Color Invariant Representation and Applications.............................................................................. 1041 Abdelhameed Ibrahim, Mansoura University, Egypt Takahiko Horiuchi, Chiba University, Japan Shoji Tominaga, Chiba University, Japan Aboul Ella Hassanien, Cairo University, Egypt Illumination factors such as shading, shadow, and highlight observed on object surfaces affect the appearance and analysis of natural color images. Representations invariant to these factors have been presented in several ways. Most of these methods use the standard dichromatic reflection model, which assumes an inhomogeneous dielectric material; the standard model cannot describe metallic objects. This chapter introduces an illumination-invariant representation that is derived from the standard dichromatic reflection model for inhomogeneous dielectrics and the extended dichromatic reflection model for homogeneous metals. The illumination color is estimated from two inhomogeneous surfaces to recover the surface reflectance of an object without using a reference white standard. The overall performance of the invariant representation is examined in detail in experiments using real-world objects, including metals and dielectrics. The feasibility of the representation for effective edge detection is demonstrated and compared with state-of-the-art illumination-invariant methods.



Chapter 47 An Efficient Approach for Community Detection in Complex Social Networks Based on Elephant Swarm Optimization Algorithm........................................................................................................ 1062 Khaled Ahmed, Cairo University, Egypt Aboul Ella Hassanien, Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Ehab Ezzat, Cairo University, Egypt Complex social network analysis is an important research trend that is basically based on community detection. Community detection is the process of dividing a complex social network into a dynamic number of clusters based on their edge connectivity. This chapter presents an efficient Elephant Swarm Optimization algorithm for the community detection problem (EESO) as an optimization approach. EESO can dynamically define the number of communities within a complex social network. Experimental results prove that EESO can handle the community detection problem and define the structure of complex networks with high accuracy and quality, in terms of the NMI and modularity measures, over four popular benchmarks: Zachary Karate Club, Bottlenose Dolphin, American college football, and Facebook. EESO shows highly promising results against eight community detection algorithms: the discrete krill herd algorithm, the discrete bat algorithm, the artificial fish swarm algorithm, fast greedy, label propagation, walktrap, Multilevel, and InfoMap. Chapter 48 Designing Multilayer Feedforward Neural Networks Using Multi-Verse Optimizer........................ 1076 Mohamed F. Hassanin, Fayoum University, Egypt Abdullah M. Shoeb, Taibah University, Saudi Arabia Aboul Ella Hassanien, Cairo University, Egypt Artificial neural network (ANN) models are involved in many applications because of their great computational capabilities. Training the multi-layer perceptron (MLP) is the most challenging problem in network preparation, and many techniques have been introduced to alleviate it. The back-propagation algorithm is a powerful technique for training multilayer feedforward ANNs; however, it suffers from the local minima drawback. Recently, meta-heuristic methods have been introduced to train MLPs, such as the Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Cuckoo Search (CS), the Ant Colony Optimizer (ACO), Social Spider Optimization (SSO), Evolutionary Strategy (ES), and Grey Wolf Optimization (GWO). This chapter applies the Multi-Verse Optimizer (MVO) to MLP training. Seven datasets are used to show MVO’s capabilities as a promising trainer for the multilayer perceptron. Comparisons with PSO, GA, SSO, ES, ACO, and GWO show that MVO outperforms all these algorithms. Compilation of References................................................................................................................... xl About the Contributors.................................................................................................................... clxii Index.................................................................................................................................................. clxix


Preface

Nowadays, Machine Learning (ML) is one of the hottest research fields from both practical and theoretical perspectives. Even though ML is not a new science, it is increasingly gaining popularity and momentum due to the wide range of real-life applications that can be developed with ML techniques. ML is about enabling computers, without human intervention, to discover hidden insights that can be turned into faster and better decisions in real-time applications; in this way, ML can improve our environment in a smart way. There are various applications that machine learning can support. Examples include image recognition, search engines, speech analysis, smart agriculture, precision farming, filtering tools, brain-computer interfaces, forecasting for weather and business marketing, and robotics. For example, Google's search algorithm uses ML to learn from millions of users' searches and behaviors every day in order to continuously improve the search results. Likewise, Facebook's algorithms can detect suspicious activities by using ML techniques that learn from users' posts and behavior. The development of such systems usually starts from experimental data, aiming to optimize the performance of a given algorithm according to a predefined maximization or minimization criterion. Recently, owing to the many applications of machine learning, a huge number of patents, papers, and practical applications related to this science have been published. An important feature of ML is that it utilizes and integrates knowledge from different areas, such as statistics (Monte Carlo methods, Bayesian methods, bootstrapping, ...), pattern recognition (support vector machines, deep learning, neural networks, decision trees, ...), data mining (time series prediction, modeling, ...), signal processing (Markov models), and many other fields. Hence, ML is considered a multidisciplinary research topic that requires researchers with a dedicated bibliography compiling its different techniques and trends. The literature contains a number of good references on ML techniques and applications, but the recent trends and innovations of ML have not been presented and discussed in any single reference book. In this handbook, theoretical techniques and practical applications are discussed. In the first section, many state-of-the-art techniques are discussed and analyzed; in the second, different applications-based ML chapters are presented; while innovative ML applications are presented in the third section. The book is thus divided into three sections:

•  State-of-the-Art Techniques
•  Applications-Based Machine Learning
•  Innovative ML Applications


The first section of the handbook comprises 10 chapters discussing and reviewing the state of the art of machine learning applications such as breast cancer diagnosis, vessel segmentation, and cloud computing. The second section contains 18 chapters presenting different machine-learning-based applications, such as software reliability prediction using statistical methods, data quality in health care, the impact of climate change on agricultural crops, Alzheimer's disease diagnosis, and intelligent watermarking. The third section consists of 20 chapters, most of them using bio-inspired techniques to propose solutions for problems such as space reduction, data clustering, image tamper detection, community detection, and quantum networks.


Section 1

State-of-the-Art Techniques


Chapter 1

T-Spanner Problem: Genetic Algorithms for the T-Spanner Problem
Riham Moharam, Suez Canal University, Egypt
Ehab Morsy, Suez Canal University, Egypt
Ismail A. Ismail, 6 October University, Egypt

ABSTRACT
The t-spanner problem is a popular combinatorial optimization problem and has different applications in communication networks and distributed systems. This chapter considers the problem of constructing a t-spanner subgraph H in a given undirected edge-weighted graph G in the sense that the distance between every pair of vertices in H is at most t times the shortest distance between the two vertices in G. The value of t, called the stretch factor, quantifies the quality of the distance approximation of the corresponding t-spanner subgraph. This chapter studies two variations of the problem, the Minimum t-Spanner Subgraph (MtSS) and the Minimum Maximum Stretch Spanning Tree (MMST). Given a value for the stretch factor t, the MtSS problem asks to find the t-spanner subgraph of minimum total weight in G. The MMST problem looks for a tree T in G that minimizes the maximum distance ratio between all pairs of vertices in V (i.e., minimizing the stretch factor of the constructed tree). It is easy to conclude from the literature that the above problems are NP-hard. This chapter presents genetic algorithms that return high quality solutions for these two problems.

DOI: 10.4018/978-1-5225-2229-4.ch001

Copyright © 2017, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.


INTRODUCTION
Let G = (V, E) be an undirected edge-weighted graph with vertex set V and edge set E such that |V| = n and |E| = m. A spanning subgraph H of G is said to be a t-spanner subgraph if the distance between every pair of vertices in H is at most t times the shortest distance between the two vertices in G. The value of t, called the stretch factor, quantifies the quality of the distance approximation of the corresponding t-spanner subgraph. The goodness of a t-spanner subgraph H is estimated by either its total weight or the distance approximation of H (the stretch factor t of H) (D. Peleg and J. D. Ulman, 1989). We consider the following two problems. The first problem, called the Minimum t-Spanner Subgraph (MtSS), is given a value of the stretch factor t and requires finding the t-spanner subgraph of minimum total weight in G. The problem of finding a tree t-spanner with the smallest possible value of t is known as the Minimum Maximum Stretch Spanning Tree (MMST) problem (Y. Emek and D. Peleg, 2008). The t-spanner subgraph problem is widely applied in communication networks and distributed systems. For example, the MMST problem is applied to the arrow distributed directory protocol that supports mobile object routing (M. J. Demmer & M. P. Herlihy, 1998). In particular, it is used to minimize the delay of mobile object routing from the source node to every client node in the case of concurrent requests through a routing tree. The worst-case overhead ratio of the protocol is proportional to the maximum stretch factor of the tree (see (D. Peleg & E. Reshef, 2001)). Kuhn and Wattenhofer (2006) showed that the arrow protocol is a distributed ordering algorithm with low maximum stretch factor. Another application of the MMST is in the analysis of competitive concurrent distributed queuing protocols that intend to minimize the message transit time in a routing tree (M. Herlihy, et al, 2001). Also, low-weight spanners have recently found interesting practical applications in areas such as metric space searching (G. Navarro, et al, 2002) and broadcasting in communication networks (M. Farley, 2004). A spanner can be used as a compact data structure for holding information about (approximate) distances between pairs of objects in a large metric space, say, a collection of electronic documents; by using a spanner instead of a full distance matrix, significant space reductions can be obtained when using search algorithms like AESA (G. Navarro, et al, 2002). For message distribution in networks, spanners can simultaneously offer both low cost and low delay when compared to existing alternatives such as minimum spanning trees (MSTs) and shortest path trees. Experiments with constructing spanners for realistic communication networks show that spanners can achieve a cost that is close to the cost of an MST while significantly reducing the delay (shortest paths between pairs of nodes) cost (A. M. Farley, et al, 2004). It is well known that the MtSS problem is NP-complete (L. Cai, 1994). For any t ≥ 1, the problem of deciding whether G contains a tree t-spanner is NP-complete (L. Cai and D. Corneil, 1995), and consequently the MMST problem is also NP-complete. In this chapter, we present efficient genetic algorithms for these two problems. Our experimental results show that the proposed algorithms return high quality solutions for both problems.
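To make the stretch factor concrete, the following sketch (not from the chapter; the weighted adjacency-dict graph representation and the function names are illustrative assumptions) computes the maximum ratio d_H(u, v)/d_G(u, v) over all vertex pairs; H is a t-spanner of G exactly when the returned value is at most t.

```python
import heapq

def dijkstra(adj, src):
    # Single-source shortest distances in a weighted adjacency-dict graph.
    dist = {v: float("inf") for v in adj}
    dist[src] = 0.0
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in adj[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

def stretch_factor(G, H):
    # Maximum of d_H(u, v) / d_G(u, v) over vertex pairs; H is a t-spanner iff this is <= t.
    worst = 1.0
    for u in G:
        dG, dH = dijkstra(G, u), dijkstra(H, u)
        for v in G:
            if v != u and dG[v] > 0:
                worst = max(worst, dH[v] / dG[v])
    return worst

# Tiny made-up example: G is a weighted triangle, H drops one edge.
G = {1: {2: 1.0, 3: 1.0}, 2: {1: 1.0, 3: 1.0}, 3: {1: 1.0, 2: 1.0}}
H = {1: {2: 1.0}, 2: {1: 1.0, 3: 1.0}, 3: {2: 1.0}}
print(stretch_factor(G, H))  # 2.0, so H is a t-spanner for any t >= 2
```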

BACKGROUND
In this section, we present results on related problems.


For an unweighted graph G, (L. Cai and D. Corneil, 1995) produced a linear time algorithm to find a tree t-spanner in G for any given t ≤ 2. Moreover, they showed that, for any t ≥ 4, the problem of finding a tree t-spanner in G is NP-complete. (Brandstädt, et al, 2007) improved the hardness result in (L. Cai and D. Corneil, 1995) by showing that tree t-spanner is NP-complete even on chordal graphs (graphs in which every induced cycle has length 3) whenever t ≥ 4 and on chordal bipartite graphs (bipartite graphs in which every induced cycle has length 4) whenever t ≥ 5. Peleg and Tendler (D. Peleg and D. Tendler, 2001) proposed a polynomial time algorithm to determine the minimum value of t for the tree t-spanner over outerplanar graphs. In (S. P. Fekete and J. Kremer, 2001), it was shown that it is NP-hard to determine the minimum value of t for which a tree t-spanner exists, even for planar unweighted graphs. They designed a polynomial time algorithm that decides whether a planar unweighted graph with bounded face length contains a tree t-spanner for any fixed t. Moreover, they proved that, for t = 3, it can be decided in polynomial time whether an unweighted planar graph has a tree t-spanner. Whether tree t-spanner is polynomially solvable for t ≥ 4 was left open. Afterwards, this open problem was solved by (F. F. Dragan, et al, 2010): they proved that, for any fixed t, the tree t-spanner problem is linear time solvable not only for planar graphs, but also for the class of sparse graphs, which includes graphs of bounded genus. (Y. Emek and D. Peleg, 2008) presented an O(log n)-approximation algorithm for the tree t-spanner problem in a graph of size n. Moreover, they established that, unless P = NP, the problem cannot be approximated additively by any o(n) term. (M. Sigurd and M. Zachariasen, 2004) presented an exact algorithm for the Minimum Weight Spanner problem; they proposed an integer programming formulation based on column generation. They showed that the total weight of a spanner constructed by the greedy spanner algorithm is typically within a few percent of the optimum. Recently, (F. F. Dragan and E. Köhler, 2014) examined the tree t-spanner problem on chordal graphs, generalized chordal graphs and general graphs. For every n-vertex, m-edge unweighted graph G, their new algorithm constructs a tree (2 log2 n)-spanner in O(m log n) time for chordal graphs, a tree (2ρ log2 n)-spanner in O(m log² n) time or a tree (12ρ log2 n)-spanner in O(m log n) time for graphs that admit a Robertson–Seymour tree-decomposition with bags of radius at most ρ in G, and a tree (2⌈t/2⌉ log2 n)-spanner in O(mn log² n) time or a tree (6t log2 n)-spanner in O(m log n) time for graphs that admit a tree t-spanner. They obtained the same approximation ratio as in (Y. Emek and D. Peleg, 2008) but with a better running time.

GENETIC ALGORITHMS
In this section, we propose genetic algorithms for the two variants of the t-spanner problem: the problem of finding a tree t-spanner in a given edge-weighted graph that minimizes the stretch factor t (MMST), and the problem of finding the t-spanner subgraph with minimum total weight for a given value of t (MtSS) (see the Introduction). We first introduce some terminology that will be used throughout this section. Let G′ be a subgraph of G. The sets V(G′) and E(G′) denote the set of vertices and edges of G′, respectively. The short-


est distance between two vertices u and v in G ′ is denoted by dG ' (u, v ) . For two subgraphs G1 and

G2 of G , let G 1 ∪ G 2 , G 1 ∩ G 2 , and G 1 − G 2 denote the subgraph induced by E (G 1) ∪ E (G 2 ) , E (G 1) ∩ E (G 2) , and E (G 1) − E (G 2) , respectively.

Algorithm Overview
The Genetic Algorithm (GA) is an iterative optimization approach based on the principles of genetics and natural selection (Andris P. Engelbrecht, 2007). We first have to define a suitable data structure to represent individual solutions (chromosomes), and then construct a set of candidate solutions as an initial population (first generation) of an appropriate cardinality pop − size. The following typical procedure is repeated until a predefined stopping criterion is met. Starting with the current generation, we use a predefined selection technique to repeatedly choose a pair of individuals (parents) in the current generation to reproduce, with probability pc, a new set of individuals (offsprings) by exchanging some parts between the two parents (crossover operation). To avoid local minima, we try to keep an appropriate diversity among different generations by applying a mutation operation, with a specific probability pm, to genes of individuals of the current generation. Finally, based on the values of an appropriate fitness function, we select a new generation from both the offspring and the current generation (the more suitable solutions have more chances to reproduce). Note that determining the representation method, population size, selection technique, crossover and mutation probabilities, and stopping criteria in genetic algorithms is crucial since they mainly affect the convergence of the algorithm (see (O. Abdoun, et al, 2012, J. Hesser and R. Männer, 1991, W. Y. LIN, et al, 2001, O. Roeva, et al, 2013, K. Vekaria and C. Clack, 1998)). The rest of this section is devoted to describing the steps of the above algorithm in detail.
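As a rough illustration of this generic loop, the following sketch assumes that the problem-specific operators (population initializer, selection, crossover, mutation, fitness) are supplied by the caller; none of the names below come from the chapter.

```python
import random

def genetic_algorithm(init_population, fitness, select_pair, crossover, mutate,
                      pop_size, max_gen, pc, pm):
    # Generic GA loop sketch: crossover() returns a list of offspring,
    # mutate() returns a single offspring, and fitness is minimized.
    population = init_population(pop_size)
    for _gen in range(max_gen):
        offspring = []
        for _ in range(pop_size):
            p1, p2 = select_pair(population, fitness)
            if random.random() < pc:            # crossover with probability pc
                offspring.extend(crossover(p1, p2))
        for chrom in population:
            if random.random() < pm:            # mutation with probability pm
                offspring.append(mutate(chrom))
        # keep the pop_size fittest individuals (both problems minimize fitness)
        population = sorted(population + offspring, key=fitness)[:pop_size]
    return min(population, key=fitness)
```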

Representation
Let G = (V, E) be a given undirected graph such that each vertex in V is assigned a distinct label from the set {1, 2, …, n}, i.e., V = {1, 2, …, n}. Clearly, each edge e ∈ E with end points i and j is unique-

ly defined by the unordered pair i, j . Moreover, every subgraph of G is uniquely defined by the set of unordered pairs of all its edges. In particular, every spanning tree T in G is induced by a set of exactly n − 1 unordered pairs corresponding to its edges since T is a subgraph of G that spans all vertices in V and has no cycles. Therefore, each chromosome (t-spanner subgraph) can be represented as a set of unordered pairs of integers each of which represent a gene (edge) in the chromosome.

Initial Population
Constructing an initial generation is the first step in typical genetic algorithms. We first have to decide the population size pop − size, one of the decisions that affect the convergence of the genetic algorithm. It is expected that a small population size may lead to weak solutions, while a large population size in-


Figure 1. The representation of chromosome

creases the space and time complexity of the algorithm. Many studies have examined the influence of the population size on the performance of genetic algorithms (see (O. Roeva, et al, 2013) and the references therein). In this chapter, we discuss the effect of the population size on the convergence time of the algorithm. One of the most common methods is to apply random initialization to obtain an initial population.

t-Spanner Subgraph
We compute each chromosome (spanning subgraph) in the initial population by applying the following two-phase procedure. In the first phase, we repeatedly add a new edge to the subgraph constructed so far as long as the cardinality of the set of visited vertices is less than ⌈n/2⌉. Let H denote the subgraph constructed so far by the procedure (initially, H consists of a random vertex from V(G)). We first select a random vertex v ∈ V(G) from the set of the neighbors of all vertices in H, and then add the edge e = (u, v) to H if e ∉ E(H), where u is the neighbor of v in H. In the second phase, we repeatedly add a new vertex to the subgraph output from the first phase as long as the number of vertices of the subgraph is less than n. We first select a random vertex v ∉ V(H) from the set of the neighbors of all vertices in H, and then add the edge e = (u, v) to H, where u is the neighbor of v in H. It is easy to verify that the above procedure returns a spanning subgraph. The generated subgraph H is added to the initial population if it is a t-spanner, and the above procedure is repeated as long as the number of constructed chromosomes is less than pop − size.
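A minimal sketch of this two-phase construction, assuming the same adjacency-dict representation as above; the chapter would additionally keep the result only if it passes a t-spanner check (for example, the stretch_factor sketch earlier).

```python
import math
import random

def random_spanning_subgraph(adj):
    # Phase 1: grow from a random start vertex, possibly adding cycle-creating
    # edges, until about half the vertices are reached.
    # Phase 2: attach only unvisited vertices until the subgraph spans G.
    # Assumes adj describes a connected graph.
    n = len(adj)
    start = random.choice(list(adj))
    H = {v: {} for v in adj}
    visited = {start}
    while len(visited) < math.ceil(n / 2):
        u = random.choice(list(visited))
        v = random.choice(list(adj[u]))
        if v not in H[u]:
            H[u][v] = H[v][u] = adj[u][v]
        visited.add(v)
    while len(visited) < n:
        frontier = [(u, v) for u in visited for v in adj[u] if v not in visited]
        u, v = random.choice(frontier)
        H[u][v] = H[v][u] = adj[u][v]
        visited.add(v)
    return H
```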


Tree t-Spanner
We compute each chromosome (spanning tree) in the initial population by repeatedly applying the following simple procedure as long as the cardinality of the set of traversed edges is less than n − 1. Let T denote the tree constructed so far by the procedure (initially, T consists of a random vertex from V(G)). We first select a random vertex v ∉ V(T) from the set of the neighbors of all vertices in T, and then add the edge e = (u, v) to T, where u is the neighbor of v in T. It is easy to verify that the above algorithm visits the set of all vertices in the underlying graph after exactly n − 1 iterations; thus, it returns a spanning tree. The generated tree T is added to the initial population (see Figure 2 for an illustrative example). The above algorithm is repeated as long as the number of constructed chromosomes is less than pop − size.

Fitness Function
The fitness function is used to evaluate each chromosome. Here, the objective function of the underlying problem is used as the corresponding fitness function. Namely, for the MtSS problem, the fitness function is the total weight of the chromosome, i.e., the fitness value of chromosome H equals Σ_{e ∈ E(H)} w(e). For the MMST problem, the maximum distance ratio over all pairs of vertices in the chromosome is used as its fitness, i.e., max_{u,v ∈ V} dT(u, v)/dG(u, v) is the fitness value of T, where dT(u, v) and dG(u, v) are the distances between u and v in T and G, respectively. Note that in both problems we look for a chromosome with the least fitness value.
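The two fitness functions might be sketched as follows; this is illustrative only, and mmst_fitness reuses the dijkstra/stretch_factor helpers sketched earlier, which is an assumption of this example.

```python
def mtss_fitness(H):
    # Total weight of the spanning subgraph (each undirected edge is stored twice).
    return sum(w for u in H for w in H[u].values()) / 2.0

def mmst_fitness(T, G):
    # Maximum ratio d_T(u, v) / d_G(u, v) over all vertex pairs,
    # which is exactly the stretch factor of T with respect to G.
    return stretch_factor(G, T)
```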

Figure 2. Construct each chromosome randomly: in (a) T starts with v4 chosen at random, and then v3 is selected at random from among its neighbors; after that, e34 is added to T as in (b). This procedure is repeated n − 1 times until the set of all vertices in the underlying graph has been visited, as in (c).

Selection Process
In this chapter, we present three common selection techniques: roulette wheel selection, stochastic universal sampling selection, and tournament selection. All these techniques are called fitness-proportionate


selection techniques since they are based on a predefined fitness function used to evaluate the quality of individual chromosomes. Throughout the execution of the proposed algorithm, the reciprocal of the fitness value is used as the selection weight of the corresponding chromosome, since both problems are minimization problems. We assume that the same selection technique is used throughout the whole algorithm. The rest of this section briefly describes these selection techniques.

Roulette Wheel Selection (RWS)
In roulette wheel selection, the probability of selecting a chromosome is based on its fitness value (Andris P. Engelbrecht, 2007, A. Chipperfield, et al, 1994). More precisely, each chromosome is selected with probability equal to its normalized fitness value, i.e., the ratio of its fitness value to the total fitness values of all chromosomes in the set from which it is selected (see Figure 3 for an illustrative example).
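A sketch of roulette wheel selection for these minimization problems, using the reciprocal of the fitness value as the selection weight as described above; the function names are illustrative assumptions.

```python
import random

def roulette_wheel_select(population, fitness):
    # Selection weight is 1 / fitness (smaller fitness -> larger slice of the wheel).
    weights = [1.0 / fitness(c) for c in population]
    total = sum(weights)
    r = random.uniform(0.0, total)
    acc = 0.0
    for chrom, w in zip(population, weights):
        acc += w
        if r <= acc:
            return chrom
    return population[-1]   # numerical safety fallback
```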

Stochastic Universal Sampling Selection (SUS)
Instead of the single selection pointer used in the roulette wheel approach, SUS uses h equally spaced pointers, where h is the number of chromosomes to be selected from the underlying population (T. Blickle and L. Thiele, 1995, A. Chipperfield, et al, 1994). All chromosomes are mapped onto a number line, and a single pointer ptr ∈ (0, 1/h] is generated at random to indicate the first chromosome to be selected.

Figure 3. Example for roulette wheel selection: The circumference of the roulette wheel is the sum of all six individuals' fitness values. Individual 5 is the fittest individual and occupies the largest interval, whereas individuals 6 and 4 are the least fit and have correspondingly smaller intervals within the roulette wheel. To select an individual, a random number is generated in the interval (0, 1) and the individual whose segment spans the random number is selected. This process is repeated until the desired number of individuals has been selected.


The remaining h − 1 individuals whose fitness spans the positions of the pointers ptr + i / h, i = 1, 2,..., h −1 are then chosen. (see Figure 4 for an illustration example).
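A sketch of SUS under the same reciprocal-weight convention; the single random offset plays the role of ptr and the h pointers are spaced total/h apart (the names are illustrative assumptions).

```python
import random

def stochastic_universal_sampling(population, fitness, h):
    # h equally spaced pointers over the cumulative (reciprocal) fitness line;
    # one random offset in (0, total/h] determines all selected chromosomes.
    weights = [1.0 / fitness(c) for c in population]
    total = sum(weights)
    step = total / h
    start = random.uniform(0.0, step)
    pointers = [start + i * step for i in range(h)]
    selected, acc, idx = [], weights[0], 0
    for p in pointers:
        while acc < p:
            idx += 1
            acc += weights[idx]
        selected.append(population[idx])
    return selected
```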

Tournament Selection (TRWS)
This is a two-stage selection technique (Andris P. Engelbrecht, 2007, T. Blickle and L. Thiele, 1995). We first select a set of k < pop − size chromosomes randomly from the current population. From the selected set, we choose the fitter chromosome by applying the roulette wheel selection approach. Tournament selection is repeated until the required number of chromosomes has been selected.

Crossover Process
In each iteration of the algorithm we repeatedly select a pair of chromosomes (parents) from the current generation and then apply the crossover operator with probability pc to the selected chromosomes to get new chromosomes (offsprings). Simulations and experimental results reported in the literature show that a typical crossover probability lies between 0.75 and 0.95. There are two common crossover techniques: single-point crossover and multi-point crossover. Many researchers have studied the influence of the crossover approach and crossover probability on the efficiency of the whole genetic algorithm; see for example (W-Y. LIN, et al, 2001, K. Vekaria and C. Clack, 1998) and the references therein. In this chapter, we use a multi-point crossover approach by exchanging a randomly selected set of edges between the two parents. In particular, for each selected pair of chromosomes H1 and H2, we generate a random number s ∈ (0, 1). If s < pc holds, we apply the crossover operator to H1 and H2 as follows.

Figure 4. Example for stochastic universal sampling selection: For 6 individuals to be selected, h = 6, the distance between the pointers is 1/h = 0.167, and the random number ptr in the range (0, 0.167) is 0.1. After selection the new population consists of the individuals 1, 2, 3, 4, 6, and 8.

Define the two sets E1 = E(H1) − E(H2) and E2 = E(H2) − E(H1). Generate a random number k from (1, |E1|). We first choose a random subset E1′ of cardinality k from E1, and then add E1′ to H2 to get a subgraph H′ (i.e., H′ = H2 ∪ E1′). H′ contains k cycles, each of which contains a distinct edge from E1′. For every edge e = (u, v) in E1′, we apply the following procedure to fix the cycle containing e. Let H̃ be the current subgraph (initially, H̃ = H′). We first find a path P(u, v) between u and v in


H̃ − {e}. We then choose an edge e′ in P(u, v) at random and delete it from the subgraph H̃ (see Figure 5 for an illustrative example). Similarly, we apply the above crossover technique by interchanging the roles of H1 and H2, and of E1 and E2, to get one more offspring. For the MtSS problem, we add each of the resulting spanning subgraphs to the set of generated offsprings if it is a t-spanner, while for the MMST problem, we add each of the resulting spanning trees to the set of generated offsprings.
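One possible reading of this crossover, sketched for adjacency-dict chromosomes; it is illustrative only, produces a single offspring, and the second offspring would be obtained by swapping the roles of the parents as described above.

```python
import random
from collections import deque

def edge_set(H):
    # Undirected edge set of an adjacency-dict chromosome.
    return {frozenset((u, v)) for u in H for v in H[u]}

def find_path(H, src, dst):
    # Breadth-first search; returns the list of edges on one src-dst path.
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in H[u]:
            if v not in prev:
                prev[v] = u
                queue.append(v)
    path, v = [], dst
    while prev[v] is not None:
        path.append(frozenset((v, prev[v])))
        v = prev[v]
    return path

def crossover(H1, H2, G):
    # Insert a random subset of the edges appearing only in H1 into a copy of H2,
    # then break each created cycle by removing a random edge of the path that
    # already joined the inserted edge's endpoints.
    child = {u: dict(H2[u]) for u in H2}
    only_in_h1 = list(edge_set(H1) - edge_set(H2))
    if not only_in_h1:
        return child
    k = random.randint(1, len(only_in_h1))
    for e in random.sample(only_in_h1, k):
        u, v = tuple(e)
        path = find_path(child, u, v)      # the path that will close the cycle
        child[u][v] = child[v][u] = G[u][v]
        a, b = tuple(random.choice(path))
        del child[a][b], child[b][a]
    return child
```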

Figure 5. Example of the crossover process: E1 = E(H1) − E(H2) = {e1, e2, e3} is the set of solid edges that are in H1 (b) and not contained in H2 (a). A random number k is generated from the interval (1, |E1|): here k = 2. In (c) a random subset E1′ = {e1, e3} is selected from E1 and added to H2. After fixing the two cycles, a new offspring is generated as in (d).

Mutation Process
To maintain the diversity among different generations of the population (and hence avoid local minima), we apply a genetic (mutation) operator to chromosomes of the current generation with a predefined (usu-


ally small) probability pm. Namely, for each chromosome, we generate a random number s ∈ (0, 1), and then mutate H if s < pm holds by replacing a random edge (gene) in H with a random edge from E(G) − E(H). Many results have analyzed the role of the mutation operator in genetic algorithms (O. Abdoun, et al, 2012, J. Hesser and R. Männer, 1991, W-Y. LIN, et al, 2001). Formally, a chromosome H is mutated as follows. We first select a random edge e = (u, v) in the graph G but not in the chromosome, i.e., e is randomly chosen from the set E(G) − E(H) of edges. It is easy to see that the subgraph H ∪ {e} contains exactly one cycle including e. We then select a random edge e′ in the path PH(u, v) between u and v in H. Let H′ denote the offspring obtained from H by

exchanging the two edges e and e′, i.e., H′ = (H − {e}) ∪ {e′}. It is easy to see that H′ is a spanning subgraph of G (see Figure 6 for an illustrative example). For the MMST, we add the resulting spanning tree to the set of generated offsprings, while for the MtSS, we add the resulting spanning subgraph only if it is a t-spanner. The main structure of the proposed genetic algorithm is given in Figure 7. A formal description of the proposed genetic algorithm is given in Algorithm 1.
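A sketch of the mutation operator under the same representation, reusing the find_path helper from the crossover sketch (an assumption of this example); vertex labels are assumed comparable, as in the chapter's labeling 1, …, n.

```python
import random

def mutate(H, G):
    # Pick a random edge e = (u, v) of G that is not in the chromosome, insert it,
    # and delete a random edge of the path P_H(u, v) that e closes into a cycle.
    candidates = [(u, v) for u in G for v in G[u] if u < v and v not in H[u]]
    if not candidates:
        return H
    u, v = random.choice(candidates)
    child = {w: dict(H[w]) for w in H}
    path = find_path(child, u, v)      # path between u and v before insertion
    child[u][v] = child[v][u] = G[u][v]
    a, b = tuple(random.choice(path))
    del child[a][b], child[b][a]
    return child
```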

Figure 6. Example of the mutation process: in (a) a random edge e is selected from the set E(G) − E(H) of edges. After that, the cycle is fixed by selecting a random edge e′ in PH(v4, v5) as in (b). A new offspring is generated in (c).

EXPERIMENTAL RESULTS
In this section, we evaluate the proposed genetic algorithm by applying it to several random edge-weighted graphs. In particular, we generate a random graph G of n nodes by applying the Erdős–Rényi approach (P. Erdos and A. Renyi, 1959), in which an edge is independently included between each pair of nodes of G with a given probability p. Here, we generate random graphs with sizes 20, 50, and 100,


Figure 7. The main structure of the proposed GA

and a randomly chosen probability p. Moreover, all edge weights of the generated graphs are set to random integers from the interval (1, 1000). For each of the generated graphs, we apply the proposed algorithm with different selection techniques. We set the population size pop − size = 30, the maximum number of iterations that the genetic algorithm executes maxgen = 300, the crossover probability pc = 0.9, and the mutation probability pm = 0.2. All parameters of the proposed genetic algorithms are summarized with their assigned values in Table 1. These values are based on common settings in the literature or were defined through our preliminary experimental results. The algorithm terminates if either the number of iterations exceeds maxgen or the solution does not change for three consecutive iterations. All obtained solutions are compared with the corresponding optimal solutions obtained by exhaustively enumerating all spanning trees in the underlying graphs. All experiments presented in this section were performed in MATLAB R2014b on a computer with a Core i7 processor and 16 GB of RAM.
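The experimental instances could be generated along these lines; this is a sketch, not the authors' MATLAB code, and the retry-until-connected loop is an added assumption so that spanning subgraphs exist.

```python
import random

def random_weighted_graph(n, p, w_low=1, w_high=1000, seed=None):
    # Erdos-Renyi G(n, p) with integer edge weights, retried until connected.
    rng = random.Random(seed)
    while True:
        adj = {v: {} for v in range(1, n + 1)}
        for u in range(1, n + 1):
            for v in range(u + 1, n + 1):
                if rng.random() < p:
                    adj[u][v] = adj[v][u] = rng.randint(w_low, w_high)
        # connectivity check by depth-first search from vertex 1
        seen, stack = {1}, [1]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        if len(seen) == n:
            return adj
```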

t-Spanner Subgraph
In this section, we present our experimental results for the MtSS problem, the problem of finding a t-spanner subgraph with minimum total weight. We apply our algorithm with different values of t from the range (1, 2). The results of applying our genetic algorithm to random graphs with sizes n = 20,


Algorithm 1. Genetic Algorithm for the t-Spanner Problem

Input: An edge-weighted graph G, a population size pop − size, a maximum number of generations maxgen, a crossover probability pc, a mutation probability pm.
Output: A tree t-spanner that minimizes t.
Step 1: Compute an initial population I0.
Step 2: gen ← 1.
Step 3: While (gen ≤ maxgen) do
Step 4:   For i = 1 to pop − size do
Step 5:     Select a pair of chromosomes from Igen−1.
Step 6:     Apply the crossover operator with probability pc to the selected pair of chromosomes to get two offsprings.
Step 7:   End for
Step 8:   For each chromosome in Igen−1, apply the mutation operator with probability pm to get an offspring.
Step 9:   Extend Igen−1 with the valid offsprings output from Steps 6 and 8.
Step 10:  Find the chromosome Tgen−1 with the best fitness value in Igen−1.
Step 11:  If gen ≥ 2 and the fitness values of Tgen−2, Tgen−1, and Tgen are identical, then break.
Step 12:  Select pop − size chromosomes from Igen−1 to form Igen.
Step 13:  gen ← gen + 1.
Step 14: End while
Step 15: Output Tgen.

Table 1. Parameters settings

Parameters    Definitions                      Values
n             Graph size                       20, 50, 100
pop − size    Population size                  30
maxgen        Maximum number of iterations     300
pc            Crossover probability            0.9
pm            Mutation probability             0.2

n = 50, and n = 100 are shown in Table 2. It is seen that the proposed algorithm outputs the optimal solution to MtSS for all the instances to which it is applied. We evaluate the influence of the population size on the convergence of the proposed genetic algorithm. For a graph of n nodes, we apply the algorithm with population sizes n/3, 2n/3, n, 4n/3, 5n/3, 2n, 7n/3 and 8n/3. Figures 8-10 illustrate the running time of the algorithm applied to graphs of sizes n = 20, n = 50, and n = 100, respectively, for the fixed value t = 1.3. We observe that the algorithm attains the least running time when the population size is set to a constant fraction of the graph size n. Figures 11-13 show the influence of the value of t and the used selection technique on the execution time of the algorithm for graphs of sizes n = 20, n = 50 and n = 100, respectively. It is expected that the number of valid offsprings obtained in each iteration increases by relaxing the value of t, and consequently it is more likely that the execution time of the algorithm decreases as the value of t increases.

Table 2. Minimum weight t-spanner subgraphs corresponding to random graphs with sizes n = 20, 50, 100

t      MtSS-TRWS    MtSS-SUS    MtSS-RWS    MtSS-Optimal
n = 20
1.1    455          452         453         452
1.2    455          452         453         452
1.3    455          452         452         452
1.4    452          448         448         448
1.5    452          448         448         448
1.6    448          448         448         448
1.7    445          445         445         445
1.8    445          445         445         445
1.9    445          445         445         445
2      445          445         445         445
n = 50
1.1    667          668         668         667
1.2    667          668         668         667
1.3    659          659         660         659
1.4    659          659         659         659
1.5    659          659         659         659
1.6    659          659         659         659
1.7    649          649         650         649
1.8    649          649         649         649
1.9    649          649         649         649
2      649          649         649         649
n = 100
1.1    883          885         883         883
1.2    883          883         883         883
1.3    883          883         883         883
1.4    875          875         877         875
1.5    875          875         875         875
1.6    866          869         869         866
1.7    866          866         866         866
1.8    860          860         860         860
1.9    860          860         860         860
2      860          860         860         860

Tree t-Spanner
In this section, we present our experimental results for the MMST problem, the problem of finding a tree t-spanner with the smallest value of t. The results of applying our genetic algorithm to random graphs with sizes n = 20, n = 50 and n = 100 are shown in Table 3. It is seen that the proposed algorithm outputs the optimal solution to MMST for all the instances to which it is applied.

Figure 8. The influence of pop − size on the running time of the algorithm (n = 20)


Figure 9. The influence of pop − size on the running time of the algorithm ( n = 50 )

Figure 10. The influence of pop − size on the running time of the algorithm ( n = 100 )


Figure 11. The influence of t on the running time of the algorithm (n = 20)

Figure 12. The influence of t on the running time of the algorithm (n = 50)


Figure 13. The influence of t on the running time of the algorithm (n = 100)

Figure 14. The influence of pop − size on the running time of the algorithm ( n = 20 )


Table 3. Values of t corresponding to random graphs with size n

n      t-Optimal    t-RWS      t-SUS      t-TRWS
20     1.0132       1.0132     1.0484     1.0132
50     1.109        1.109      1.109      1.109
100    1.148        1.148      1.148      1.148

Also, we discuss the effect of the population size pop − size on the convergence of the algorithm. Given a random graph of size n, we apply the algorithm with population sizes n/3, 2n/3, n, 4n/3, 5n/3, 2n, 7n/3 and 8n/3. Figures 14-16 illustrate the running time of the algorithm applied to graphs of sizes n = 20, n = 50 and n = 100, respectively. The algorithm attains the least running time when the population size is set to a constant fraction of the graph size n.

CONCLUSION
In this chapter, we have studied two variants of the t-spanner problem. In the first, called the Minimum t-Spanner Subgraph (MtSS), we are given a value of the stretch factor t and the problem requires finding

Figure 15. The influence of pop − size on the running time of the algorithm ( n = 50 )


Figure 16. The influence of pop − size on the running time of the algorithm ( n = 100 )

the t-spanner subgraph of minimum total weight in G. The second problem, that of finding a tree t-spanner with the smallest possible value of t, is known as the Minimum Maximum Stretch Spanning Tree (MMST) problem; it aims to find a spanning tree T in a given graph G such that the maximum ratio of the distance between every pair of vertices in T to the shortest distance between the two vertices in G is minimized. We have designed genetic algorithms for both problems and evaluated them by applying them to random instances of the problems. Experimental results have shown that the proposed algorithms output high quality solutions to both problems.

ACKNOWLEDGMENT
This work is partially supported by the Alexander von Humboldt Foundation.

REFERENCES
Abdoun, Abouchabaka, & Tajani. (2012). Analyzing the performance of mutation operators to solve the travelling salesman problem. CoRR, abs/1203.3099.
Blickle, T., & Thiele, L. (1995). A comparison of selection schemes used in genetic algorithms. Academic Press.


Brandstädt, Dragan, Le, & Uehara. (2007). Tree spanners for bipartite graphs and probe interval graphs. Algorithmica, 27–51.
Brandstädt, A., Dragan, F. F., Le, H. O., & Le, V. B. (2004). Tree spanners on chordal graphs: Complexity and algorithms. Theoretical Computer Science, 310(1–3), 329–354. doi:10.1016/S0304-3975(03)00424-9
Cai, L. (1994). NP-completeness of minimum spanner problems. Discrete Applied Mathematics, 48(2), 187–194. doi:10.1016/0166-218X(94)90073-6
Cai, L., & Corneil, D. (1995). Tree spanners. SIAM Journal on Discrete Mathematics, 8(3), 359–387. doi:10.1137/S0895480192237403
Chipperfield, Fleming, Pohlheim, & Fonseca. (1994). The MATLAB Genetic Algorithm User's Guide. UK SERC.
Demmer, M. J., & Herlihy, M. P. (1998). The arrow distributed directory protocol. Proceedings of the 12th International Symposium on Distributed Computing (DISC), 119–133.
Dragan, Fomin, & Golovach. (2010). Spanners in sparse graphs. Journal of Computer and System Sciences, 1108–1119.
Dragan, F. F., & Köhler, E. (2014). An approximation algorithm for the tree t-spanner problem on unweighted graphs via generalized chordal graphs. Algorithmica, 69(4), 884–905. doi:10.1007/s00453-013-9765-4
Emek, Y., & Peleg, D. (2008). Approximating minimum max-stretch spanning trees on unweighted graphs. SIAM Journal on Computing, 1761–1781.
Engelbrecht. (2007). Computational Intelligence: An Introduction. John Wiley & Sons.
Erdos, P., & Renyi, A. (1959). On random graphs. Publ. Math., 290.
Farley, M., Zappala, D., Proskurowski, A., & Windisch, K. (2004). Spanners and message distribution in networks. Discrete Applied Mathematics.
Fekete, S. P., & Kremer, J. (2001). Tree spanners in planar graphs. Discrete Applied Mathematics, 85–103.
Herlihy, M., Tirthapura, S., & Wattenhofer, R. (2001). Competitive concurrent distributed queuing. Proceedings of the 20th Annual ACM Symposium on Principles of Distributed Computing, 127–133.
Hesser, & Männer. (1991). Towards an optimal mutation probability for genetic algorithms. Proceedings of the 1st Workshop on Parallel Problem Solving from Nature.
Kuhn, F., & Wattenhofer, R. (2006). Dynamic analysis of the arrow distributed protocol. Theory of Computing Systems, 875–901.
Lin, L., & Hong. (2001). Adapting crossover and mutation rates in genetic algorithms. The Sixth Conference on Artificial Intelligence and Applications, Kaohsiung, Taiwan.
Navarro, G., Paredes, R., & Chavez, E. (2002). t-Spanners as a data structure for metric space searching. International Symposium on String Processing and Information Retrieval, 298–309. doi:10.1007/3-540-45735-6_26


Peleg, D., & Reshef, E. (2001). Low complexity variants of the arrow distributed directory. Journal of Computing, 474–485.
Peleg, D., & Ulman, J. D. (1989). An optimal synchronizer for the hypercube. SIAM Journal on Computing, 18(4), 740–747. doi:10.1137/0218050
Peleg, & Tendler. (2001). Low stretch spanning trees for planar graphs. Tech. Report MCS01-14. Weizmann Science Press of Israel.
Roeva, Fidanova, & Paprzycki. (2013). Influence of the population size on the genetic algorithm performance in case of cultivation process modelling. Proceedings of the Federated Conference on Computer Science and Information Systems.
Sigurd, M., & Zachariasen, M. (2004). Construction of minimum-weight spanners. Springer.
Vekaria, K., & Clack, C. (1998). Selective crossover in genetic algorithms: An empirical study. LNCS, 1498, 438–447.


Chapter 2

Breast Cancer Diagnosis Using Relational Discriminant Analysis of Malignancy-Associated Changes
Dmitry Klyushin, Kyiv National Taras Shevchenko University, Ukraine

Kateryna Golubeva, Kyiv National Taras Shevchenko University, Ukraine
Natalia Boroday, National Academy of Sciences of Ukraine, Ukraine
Maryna Prysiazhna, Kyiv National Taras Shevchenko University, Ukraine
Maksym Shlykov, Kyiv National Taras Shevchenko University, Ukraine

ABSTRACT
The chapter is devoted to the description of a novel method of breast cancer diagnostics based on the analysis of the distribution of the DNA concentration in interphase nuclei of epitheliocytes of the buccal epithelium with the aid of novel algorithms of statistical machine learning, namely: a novel proximity measure between multivariate samples, a novel algorithm for the construction of tolerance ellipsoids, a novel statistical depth, and a novel method of multivariate ordering. In contrast to common methods of diagnostics used in oncology, this method is non-invasive and offers a high rate of accuracy and sensitivity.

INTRODUCTION
Today, the problem of early diagnosis of breast cancer is one of the most challenging problems. The authors' point of view is based on the premise that the human organism has cytological reactivity and that the appearance of a tumor causes malignancy-associated changes in the buccal epithelium. These changes (nuclear heterogeneity and the presence of numerous Feulgen-negative zones) have a subvisual character, and to detect them it is necessary to apply novel achievements of machine learning.
DOI: 10.4018/978-1-5225-2229-4.ch002

Copyright © 2017, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.


The chapter is devoted to the description of a novel method of breast cancer diagnostics based on the analysis of the distribution of the DNA concentration in interphase nuclei of epitheliocytes of the buccal epithelium with the aid of novel algorithms of statistical machine learning. In contrast to common methods of diagnostics used in oncology, this method is non-invasive and offers a high rate of accuracy and sensitivity; these characteristics are considerably greater than those of the majority of common methods. Besides, the method has a formal scientific background, since it uses modern achievements of statistical machine learning. We give an accurate and clear description of the method.

BACKGROUND
Today, it is well known that in the presence of a malignant tumor in the human organism, malignancy-associated changes (MAC) in distant tissues occur (see, for example, Susnik et al. (1995), Mairinger et al. (1999), Us-Krasovec et al. (2005), Hassanien et al. (2014), Moftah et al. (2014)). However, these changes have a subvisual nature and their quantitative measurement is a difficult problem. That is why, until now, oncomorphologists have studied only qualitative changes characterizing the influence of a tumor on various organs and tissues of an organism distant from the tumor. In this chapter the authors describe methods of quantitative estimation of these changes to discover significant statistical properties of the DNA content distribution in nuclei of epitheliocytes of the buccal epithelium in the presence of pre-tumor processes and cancers of the mammary gland. The aim of the research is to compare the indices characterizing the state of chromatin and DNA content in the epithelial cells of the mammary gland among patients suffering from breast cancer and fibroadenomatosis, and healthy women.

MATERIAL AND METHOD
For the investigation, groups of women suffering from breast cancer (stages T2 and T3) and fibroadenomatosis, aged from 25 to 53 years (25 cases of breast cancer and 25 cases of fibroadenomatosis), were taken. The scrapes were taken from the spinous layer of the buccal mucosa after gargling and removal of the superficial cell layer. The smears were dried at room temperature and fixed for 30 min in the Nikiforov mixture. Then, the Feulgen reaction with cold hydrolysis in 5 n. HCl for 15 min at a temperature of t = 21–22 °C was made. The optical density of the nuclei was registered with a LOMO scanning cytospectrophotometer at a wavelength of 575 nm and a probe diameter of 0.05 μm. We investigated from 10 to 30 nuclei in each preparation. The DNA–fuchsine content in the nuclei of the epitheliocytes was defined as the product of the optical density and the area. Thus, the investigation of an interphase nucleus yields a rectangular matrix {rij}, i = 1, …, m, j = 1, …, n, whose entries characterize the DNA content in the corresponding grid cells (mostly, n and m were equal to 8 or 9). On the basis of these cytospectrophotometric indices, the following morpho- and densitometric features characterizing structural and textural peculiarities of the chromatin were calculated.


1. Area of nuclei: x1 is the number of elements of the matrix R with rij ≥ 0.08.
2. Area of condensed chromatin: x2 is the number of elements of R with rij ≥ 0.35.
3. Area of decondensed chromatin: x3 is the number of elements of R with 0.08 ≤ rij < 0.35.
4. Area of strongly decondensed chromatin: x4 is the number of elements of R with 0.08 ≤ rij < 0.15.
5. Specific area of condensed chromatin: x5 = x2/x1.
6. Specific area of decondensed chromatin: x6 = x3/x1.
7. Integral density: x7 = Σ_{i=1..m} Σ_{j=1..n, rij ≥ 0.08} rij, where the condition rij ≥ 0.08 under the sum means that summation is taken over the indices i and j for which rij ≥ 0.08.
8. Mean density: x8 = x7/(nm − p), where p is the number of elements with rij < 0.08.
9. Averaged sum of overfalls: x9 = (1/q) Σ_{k=1..q} vk, where q is the number of elements such that min(rij, ri+1,j, ri,j+1, ri+1,j+1) ≥ 0.08 and vk = max(rij, ri+1,j, ri,j+1, ri+1,j+1) − min(rij, ri+1,j, ri,j+1, ri+1,j+1), k = 1, …, q. (The summation is taken over the elements mentioned above.)
10. General cluster index: x10 = (1/q) Σ_{k=1..q} vk².
11. Dispersion coefficient: x11 = (Σ_{k=1..q} (vk − x9)²/(q − 1))^(1/2).
12. Index of overall variation: x12 = x9 + x11.
13. Relief index: x13 = Σ_{i=2..m} Σ_{j=1..n} |rij − ri−1,j| / (2mn − m + n − q), where q is the number of points (i, j) such that max(rij, ri−1,j) < 0.08.
14. Textural coefficient: x14 = x13/ε, where ε = Σ_{i=1..m} Σ_{j=1..n, rij ≥ 0.08} (rij − x7) / (mn − p) and p is defined as for x8.
15. Coefficient of mutual disposition: x15 = a/(x8²·b), where
a = Σ_{i=1..m} Σ_{j=1..n} ( Σ_{k=i+1..m} Σ_{l=1..n} rij·rkl / ((k − i)² + (l − j)²) + Σ_{k=1..m} Σ_{l=j+1..n} rij·rkl / ((k − i)² + (l − j)²) ),
b = Σ_{i=1..m} Σ_{j=1..n} ( Σ_{k=i+1..m} Σ_{l=1..n} 1 / ((k − i)² + (l − j)²) + Σ_{k=1..m} Σ_{l=j+1..n} 1 / ((k − i)² + (l − j)²) ).
Moreover, the summation for both a and b is taken over elements such that min(rij, rkl) > 0.875 · max{rij : i = 1, …, m; j = 1, …, n; rij ≥ 0.08}.
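For illustration, a few of these indices can be computed from the matrix R as follows. This is a sketch, not the authors' code; the thr parameter and function name are assumptions, and the mean density uses x7 in the numerator, following the definitions as reconstructed above.

```python
def chromatin_features(R, thr=0.08):
    # R is a list of equal-length rows of DNA-content values.
    m, n = len(R), len(R[0])
    cells = [R[i][j] for i in range(m) for j in range(n)]
    x1 = sum(1 for r in cells if r >= thr)           # area of nuclei
    x2 = sum(1 for r in cells if r >= 0.35)          # area of condensed chromatin
    x3 = sum(1 for r in cells if thr <= r < 0.35)    # area of decondensed chromatin
    x5 = x2 / x1                                     # specific area of condensed chromatin
    x7 = sum(r for r in cells if r >= thr)           # integral density
    p = sum(1 for r in cells if r < thr)
    x8 = x7 / (m * n - p)                            # mean density
    # Averaged sum of overfalls over 2x2 windows whose minimum is >= thr.
    v = []
    for i in range(m - 1):
        for j in range(n - 1):
            window = (R[i][j], R[i + 1][j], R[i][j + 1], R[i + 1][j + 1])
            if min(window) >= thr:
                v.append(max(window) - min(window))
    x9 = sum(v) / len(v) if v else 0.0
    return {"x1": x1, "x2": x2, "x3": x3, "x5": x5, "x7": x7, "x8": x8, "x9": x9}

# Tiny made-up example matrix.
R = [[0.05, 0.12, 0.40], [0.20, 0.36, 0.10], [0.09, 0.50, 0.30]]
print(chromatin_features(R))
```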

COMPACTNESS HYPOTHESIS AND DIMENSIONALITY REDUCTION
The novel method for the classification of multivariate samples is the result of a synthesis of two main approaches of modern machine learning: quadratic and relational (featureless) discriminant analysis. The object of the research is a matrix rather than a vector, as in classic schemes. Such situations are common in cytometric research, where the results of measurements are obtained from a set of cells: the rows of a matrix correspond to cells, and the columns correspond to measurements of cells. The most popular tools of quadratic discriminant analysis are ellipsoids of minimal volume (MVE) containing a given set of points. There are various algorithms for the construction of MVEs (see Silverman & Titterington, 1980; Weltzl, 1991; Petunin & Rublev, 1996; Lyashko & Rublev, 2003; Kumar & Yildirim, 2005). The fundamental ideas of relational discriminant analysis are the mapping from a vector space of features to a proximity space and the compactness hypothesis, stating that the similarity between objects in the same class is greater than the similarity between objects in distinct classes. The compactness hypothesis in a proximity space means that similar objects have similar distances from an alternative class. Among the first papers in this area were Petunin and Klyushin (1997) and Duin et al. (1997). Later these ideas were developed in the papers of Duin and Pekalska (1999, 2001, 2005), Mottl (2001), and others. Often, the use of ellipsoids in the classification of multivariate samples yields uncertainty caused by the overlapping of tolerance regions. A natural way to reduce the uncertainty is multivariate ordering (Barnett (1976)). Barnett proposed to divide methods of multivariate ordering into four groups: marginal, reduced, partial and conditional. Among the most effective of these are methods based on a function of the statistical depth of a sample with respect to the center of a distribution and the corresponding peeling algorithms (see Cascos, 2009; Koshevoy & Mosler, 1997; Liu, 1990; Oja, 1983; Tukey & Tukey, 1975; Zuo & Serfling, 2000; Lange & Mozharovsky, 2010, etc.). The chapter describes a novel method of reducing the uncertainty in quadratic discriminant analysis based on Petunin ellipsoids and multivariate peeling (Petunin & Rublev, 1996; Lyashko & Rublev, 1997; Lyashko et al., 2014). At first, consider the multistage scheme of transition from a multivariate vector space of features to a 2D proximity space. Let G1 and G2 be general populations (classes). From these populations N training samples are taken: uk = (x1, x2, …, xn) and vl = (y1, y2, …, yn), where k, l = 1, 2, …, N, and n is the number


of features, and xi and yj are vectors consisting of m numbers. So n is the number of columns and m is the number of rows of the original matrix. The transition procedure consists of the following stages:
1. Compute the interclass and intra-class proximity measures between samples from G1 and G2.
2. Construct the proximity space and compute the coordinates of the samples in it. The coordinates in the proximity space correspond to the average proximity measures between samples with respect to different indices.
The result of this procedure is n(n − 1)/2 planes, according to the number of pairs that may be formed from the indices. Each point on these planes corresponds to a unique patient and represents her average similarity to the other patients from the alternative classes. Now, consider the algorithm for computing the proximity measure between two samples (Klyushin & Petunin, 2003).

PROXIMITY MEASURE
Let Gx and Gy be general populations with continuous hypothetical distribution functions Fx(u) and Fy(u), respectively. Suppose we have two samples X = (x1, x2, …, xn) and Y = (y1, y2, …, ym) from the general populations Gx and Gy, such that the sample values are mutually independent. Consider the following criterion for testing the hypothesis H about the equality of the distribution functions Fx(u) and Fy(u) on the basis of the samples X and Y. Let x(1) ≤ … ≤ x(n) be the variation series constructed from the sample X, and let x be a sample value from Gx which does not depend on X. Then, on the basis of the results of the paper (see Madreimov and Petunin, 1982),

p(x ∈ (x(i), x(j))) = (j − i)/(n + 1), i < j.    (1)

If we suppose that hypothesis H is true, the probability of the random event Aij = {yk ∈ (x(i), x(j))} can be calculated by formula (1). Using the known sample Y we can calculate the frequency hij of the random event Aij and the confidence limits pij(1), pij(2) for the probability pij corresponding to a given significance level 2β: B = {pij ∈ (pij(1), pij(2))}, p(B) = 1 − 2β. These limits are calculated by the formulae (see Van der Waerden, 1957):

pij(1) = (hij·m + g²/2 − g·√(hij(1 − hij)m + g²/4)) / (m + g²),
pij(2) = (hij·m + g²/2 + g·√(hij(1 − hij)m + g²/4)) / (m + g²),    (2)


where g satisfies the condition Φ(g) = 1 − β, and Φ(u) is the normal distribution function (if m is small then, according to the "3σ rule", g = 3). Denote by N = n(n − 1)/2 the number of all confidence intervals Iij = (pij(1), pij(2)), and by L the number of those intervals Iij which contain the probability pij; put h = ρ(X, Y) = L/N; h is the proximity measure between X and Y. As far as h is a frequency of the random event B = {pij ∈ Iij} having the probability p(B) = 1 − β, setting hij = h, m = N and g = 3 in formulae (2) we get a confidence interval I = (p(1), p(2)) for the probability p(B) which has a confidence level of approximately 0.95. A criterion for testing the hypothesis H with a significance level of about 0.05 may be formulated in the following way: if the confidence interval I = (p(1), p(2)) contains the probability p(B) = 1 − β, then hypothesis H is accepted; otherwise it is rejected. The statistic h is called the p-statistic (Petunin's statistic); it is a measure of the proximity ρ(X, Y) between the samples X and Y. It should be noted that the function ρ(X, Y) of the two samples X and Y is, in general, non-symmetric. The justification of this statistical test

was given in the paper Petunin and Klyushin (2000).
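A compact sketch of this proximity measure (illustrative only; it fixes g = 3 as in the "3σ rule" and returns the fraction of confidence intervals (2) that cover the probabilities pij from (1)).

```python
from math import sqrt

def petunin_proximity(X, Y, g=3.0):
    # p-statistic sketch: fraction of confidence intervals for the hit frequency
    # of Y in (x_(i), x_(j)) that cover p_ij = (j - i) / (n + 1).
    xs = sorted(X)
    n, m = len(xs), len(Y)
    covered = total = 0
    for i in range(n):
        for j in range(i + 1, n):
            p_ij = (j - i) / (n + 1.0)
            h = sum(1 for y in Y if xs[i] < y < xs[j]) / m
            centre = (h * m + g * g / 2.0) / (m + g * g)
            half = g * sqrt(h * (1 - h) * m + g * g / 4.0) / (m + g * g)
            covered += (centre - half) <= p_ij <= (centre + half)
            total += 1
    return covered / total     # this is rho(X, Y); note it is not symmetric

# Tiny made-up example.
X = [1.2, 3.4, 2.2, 5.0, 4.1]
Y = [1.9, 2.7, 3.3, 4.6]
print(petunin_proximity(X, Y))
```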

PROXIMITY SPACE
Compute the proximity measure between samples from G1 and G2 with respect to two indices. Consider the matrices of indices of the kth patient from G1 and the lth patient from G2:

uk = (xij(k)), i = 1, …, m, j = 1, …, n, and vl = (yij(l)), i = 1, …, m, j = 1, …, n.

Extract the columns corresponding to the ith index from uk and vl: Xi(k) = (x1i(k), x2i(k), …, xmi(k))^T and Yi(l) = (y1i(l), y2i(l), …, ymi(l))^T. Then compute the proximity between the samples Xi(k) and Yi(l) and construct the vector of proximity measures between uk and vl with respect to every index:

μkl(1) = ρ(X1(k), Y1(l)), μkl(2) = ρ(X2(k), Y2(l)), …, μkl(n) = ρ(Xn(k), Yn(l)).

Then compute the average proximity measures

νk(1) = (1/N) Σ_{t=1..N} μkt(1), νk(2) = (1/N) Σ_{t=1..N} μkt(2), …, νk(n) = (1/N) Σ_{t=1..N} μkt(n)

between uk and the patients from G2 with respect to the ith index. The same scheme may be applied for comparing uk with the other patients from G1:

νk(1) = (1/(N − 1)) Σ_{s=1..N, s≠k} μks(1), νk(2) = (1/(N − 1)) Σ_{s=1..N, s≠k} μks(2), …, νk(n) = (1/(N − 1)) Σ_{s=1..N, s≠k} μks(n).

Join the average proximity measures into pairs and plot points in the proximity space corresponding to the ith and jth indices: (νt(i), νt(j)) and (νs(i), νs(j)), i, j = 1, 2, …, n; t, s = 1, 2, …, N.

As a result, in the proximity space there are two sets of points consisting of average interclass proximity measures and average intra class proximity measures.

PETUNIN ELLIPSES AND ELLIPSOIDS
In this part we consider the Petunin ellipsoid construction algorithm and its modification. For convenience we divide the original algorithm into two cases that depend on the input space dimension m. Also, we denote by Mn = {x1, …, xn} the input set of points.

Case m=2
Build the convex hull of Mn = {(x1, y1), …, (xn, yn)}. Find the two vertices (xk, yk) and (xl, yl) of the convex hull that determine its diameter. Build the straight line L through them. Find the points most distant from L: (xr, yr) and (xq, yq). Build the lines L1 and L2 that are parallel to L and contain (xr, yr) and (xq, yq). Build the lines L3 and L4 that are orthogonal to L and contain (xk, yk) and (xl, yl). The intersections of L1, L2, L3 and L4 determine a rectangle. Denote its side lengths by a and b (without loss of generality let a ≤ b). Perform a translation, rotation and scaling with coefficient α = a/b so that the rectangle becomes a square. Denote its center by (x0′, y0′). Consider the images (x1′, y1′), (x2′, y2′), …, (xn′, yn′) of the input points after all these transformations and find the distances from them to the square center; denote these by r1, r2, …, rn. Let R = max(r1, r2, …, rn). Build a circle with center (x0′, y0′) and radius R (all the points (x1′, y1′), (x2′, y2′), …, (xn′, yn′) are now inside this circle). Perform the inverse transformations on this circle. The result will be an ellipse. It is easy to see that the computational complexity of this algorithm is determined by the convex hull construction and is O(n lg n).
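A simplified sketch of this two-dimensional construction. The assumptions are mine: the diameter is found by brute force rather than via the convex hull, and the ellipse is returned as a centre plus an axes matrix M so that the ellipse is {centre + M·u : ||u|| ≤ 1}.

```python
import math

def petunin_ellipse_2d(points):
    # Rotate so the most distant pair of points is horizontal, squeeze the
    # bounding rectangle into a square, enclose the points in a circle around
    # its centre, and map that circle back to an ellipse in the input frame.
    pts = [(float(x), float(y)) for x, y in points]
    p, q = max(((a, b) for a in pts for b in pts),
               key=lambda ab: (ab[0][0] - ab[1][0]) ** 2 + (ab[0][1] - ab[1][1]) ** 2)
    theta = math.atan2(q[1] - p[1], q[0] - p[0])
    c, s = math.cos(-theta), math.sin(-theta)
    rot = [(x * c - y * s, x * s + y * c) for x, y in pts]
    xs, ys = [x for x, _ in rot], [y for _, y in rot]
    cx, cy = (max(xs) + min(xs)) / 2.0, (max(ys) + min(ys)) / 2.0
    w, h = max(xs) - min(xs), max(ys) - min(ys)        # w is the diameter, so w >= h
    alpha = h / w if (w > 0 and h > 0) else 1.0        # scaling coefficient a/b
    sq = [((x - cx) * alpha, y - cy) for x, y in rot]  # rectangle becomes a square
    R = max(math.hypot(x, y) for x, y in sq) or 1.0
    # Map the circle of radius R back through the inverse transformations.
    c2, s2 = math.cos(theta), math.sin(theta)
    centre = (cx * c2 - cy * s2, cx * s2 + cy * c2)
    M = [[R / alpha * c2, -R * s2],
         [R / alpha * s2,  R * c2]]
    return centre, M

# Tiny usage example with made-up points.
centre, M = petunin_ellipse_2d([(0, 0), (4, 1), (2, 3), (1, 1), (3, 0)])
print(centre, M)
```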


Case m>2
Build the convex hull of Mn = {x1, …, xn}. Find two vertices xk and xl of the convex hull that determine its diameter. Perform a rotation and translation so that the diameter lies along Ox1′. Project all the points onto the orthogonal complement of Ox1′. Continue the operations mentioned above (convex hull construc-

tion, rotation and translation) until the orthogonal complement becomes two-dimensional. Perform the transformations like in case m=2 without scaling. Build a minimum volume axis-aligned box in m-di  mensional space that contains the images x1′,..., xn′ of input points. Perform scaling so that our box    becomes a hypercube. Denote its center by x0 and find distances r1, r2 ,..., rn from it to x1′,..., xn′ . Let R  = max(r1, r2, …, rn). Build a hypersphere with the center x0 and radius R. Perform inverse transformations on it and obtain an ellipsoid in the input space. It’s easy to see that computational complexity is also O(n lg n). The algorithm described above is not perfect. In particular in case m>2 it is difficult to find the rotation at each step when the input space’s dimension is greater than three. Now, we are going to modify the algorithm in such a way that it will become easy to apply it in arbitrary finite-dimensional space. The first fact to notice is that the main goal of the algorithm is to construct a linear transformation that maps diameter vectors found at each step into axes. All we have to do after that is scaling and finding the most distant point from the center of a hypercube. Our goal is to describe the construction of the linear transformation introduced above. Assume we did not perform rotation at each step, but only made projections and translations. This means we have all the diameters of convex hulls (they are also diameters for corresponding sets of points). By construc   tion these diameters form an orthogonal system. Let’s normalize it and denote by Bm = {a1, a2 , …, a m } . After that we need to find an unitary transformation that maps this orthonormalized system into basis. This problem becomes trivial after taking into account the fact that the inverse transformation maps basis vectors into our system. Thus, we know the matrix of the inverse transformation:    U −1 = ( a 1 | a 2 | … | a m ) or equivalently    U = (a1 | a2 | … | a m )−1 Another convenient fact about unitary operators is U–1 = UT which means that the only thing to make for finding the inverse transformation is transposition. Out next goal is to simplify the process of moving to the space of lower dimension and make only one translation. Consider the first step of the algorithm for that. Let we found the diameter vectors of   our input set (denote them by x k and x l ). As we don’t make a rotation and translation, we must project  our points onto the affine subspace x k + L . Here L is the orthogonal complement of the line determined       by x k − x l . In fact it is a hyperplane which contains xk and is orthogonal to x k − x l . Denote the points


 obtained after projection by M n(1) = xi(1), i = 1, n − 1 . It’s worth to say that the number of points will    decrease by one after projection, because the projection of xk is xl by construction. We show then that on the next steps we do not have to move to the spaces with lower dimension. Instead of that it’s enough to perform projection onto the hyperplane that is orthogonal to the diameter of the corresponding set.  Let at some step we have M n(k) = xi(k), i = 1, n − k — the set of points in Rm that lie in some affine   subspace y+L1, whose dimension is p. Assume we found the diameter of this set an it equals to xl( k ) − xt( k )  and consider the hyperplane that is orthogonal to this vector and contains xt( k ) . Its equation looks as follows:


      ( x, xl( k ) − xt( k ) ) = ( xt( k ), xl( k ) − xt( k ) ) . The projection operator for this hyper plane is:  (k ) (k ) (k ) (k ) (k )   ( x, xl − xt ) − ( xt , xl − xt )  ( k )  ( k ) ( xl − xt ) . Px = x −   || xl( k ) − xt( k ) ||2      As xt( k ) lies in L 1 we have y + L1 = xl( k ) + L1 . That’s why x ∈= xt( k ) + L1 implies that   (k )   x = x t + y, y ∈ L1  (k )        Px = x − α( xl( k ) − xt( k ) ) = x t + y − α( xl( k ) − xt( k ) ) ,       ( x, xl( k ) − xt( k ) ) − ( xt( k ), xl( k ) − xt( k ) ) α= .   || xl( k ) − xt( k ) ||2     Also we know that α( xl( k ) − xt( k ) ) ∈ L1 and this implies that Px ∈ xt( k ) + L1 . At the same time by the       definition of projection Px belongs to the hyperplane ( x, xl( k ) − xt( k ) ) = ( xt( k ), xl( k ) − xt( k ) ) which is also affine subspace. Thus the projections of our points will lie in the intersection of two affine subspaces  one of which is our hyperplane. Denote it by xt( k ) + L2 , where L2 is a (m-1)-dimensional subspace. The    intersection of these affine subspaces is affine subspace xt( k ) + L1 ∩ L2 . At the same the fact that xl( k ) − xt( k )

is orthogonal to L_2 means that L_1 + L_2 = R^m. To find the dimension of L_1 ∩ L_2, consider Grassmann's formula for our subspaces:

m = (m − 1) + p − dim(L_1 ∩ L_2),

and we have



dim(L_1 ∩ L_2) = p − 1. Knowing this, we can perform the projection onto the hyperplane which is orthogonal to the diameter and contains one of its points; the dimension of the affine subspace containing the projected points will nevertheless decrease. In the original algorithm we moved to the orthogonal complement instead. Finally, we have the following algorithm:

Input Data: M_n = {x_1, ..., x_n}, vectors from R^m.

Algorithm:
1. M_n^(0) ← M_n, B ← ∅, k ← 0.
2. While k < m do:
   a. Find x_l^(k) and x_t^(k), the diameter points of M_n^(k).
   b. B ← B ∪ {(x_l^(k) − x_t^(k)) / ||x_l^(k) − x_t^(k)||}.
   c. M_n^(k+1) ← P_L M_n^(k), where L is the hyperplane (x, x_l^(k) − x_t^(k)) = (x_t^(k), x_l^(k) − x_t^(k)).
   d. k ← k + 1.
3. Build the matrix U whose columns are the vectors of B.
4. M_n' ← U^T M_n.
5. Find the minimal axis-aligned bounding box for the points of M_n'. Denote by c the center of this bounding box.
6. Build the scaling transformation S which maps this bounding box into a hypercube.
7. M_n'' ← S M_n', c' ← S c.
8. R ← max ||x − c'||, x ∈ M_n''.
9. E ← (1/R²) I, where I is the unit matrix of size m.
10. E ← U S^T E S U^T, c ← U c.

Output Data: E is the matrix of our ellipsoid, c is its center.

The only constraint for the application of our algorithm is that the ellipsoid must exist and be non-degenerate; in other words, there must not exist a proper subspace of the input space R^m that contains the input set of points. Consider the projection onto an affine subspace and the comparison of vectors as elementary operations. Asymptotically, the most complex part of both algorithms is the search for the diameter of a finite set of points. This procedure requires n(n − 1)/2 operations. The result may be improved for two- and three-dimensional spaces with convex hull construction: we may build a convex hull for our set and perform the search on its vertices. Asymptotically this requires O(n lg n + z(z − 1)/2) operations, where z is the number of vertices of the convex hull. In the worst case the complexity is the same, but for most samples the number of vertices of the convex hull is small in comparison with the sample size, which means that the complexity for such samples is O(n lg n). For R^m with dimension greater than three the complexity is O(n²).
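For readers who want to experiment with the modified algorithm, the following base-R sketch follows steps 1-10 above under the stated non-degeneracy assumption. The function names (diameter_pair, bounding_ellipsoid) and the brute-force diameter search are our own illustrative choices, not the authors' implementation.

# A minimal base-R sketch of the modified bounding-ellipsoid algorithm described above,
# assuming a non-degenerate point set; names are illustrative, not the authors' code.

diameter_pair <- function(X) {
  # Brute-force search for the two most distant rows of X: n(n - 1)/2 comparisons.
  n <- nrow(X); best <- c(1, 2); best_d <- -Inf
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    d <- sum((X[i, ] - X[j, ])^2)
    if (d > best_d) { best_d <- d; best <- c(i, j) }
  }
  best
}

bounding_ellipsoid <- function(X) {            # X: n x m matrix, one point per row
  m <- ncol(X)
  B <- matrix(0, nrow = m, ncol = 0)           # columns = normalized diameter directions
  P <- X
  for (k in 1:m) {
    idx <- diameter_pair(P)
    d <- P[idx[1], ] - P[idx[2], ]
    u <- d / sqrt(sum(d^2))
    B <- cbind(B, u)
    # Project onto the hyperplane through x_t orthogonal to the diameter direction (step 2c).
    xt <- P[idx[2], ]
    alpha <- as.numeric(P %*% u) - sum(xt * u)
    P <- P - outer(alpha, u)
  }
  U <- B                                       # step 3: orthonormal by construction
  Xp <- X %*% U                                # step 4: images of the points under U^T
  lo <- apply(Xp, 2, min); hi <- apply(Xp, 2, max)
  centre <- (lo + hi) / 2                      # step 5: center of the bounding box
  s <- 1 / pmax(hi - lo, .Machine$double.eps)  # step 6: scaling that maps the box to a hypercube
  Xs <- sweep(sweep(Xp, 2, centre), 2, s, "*")
  R2 <- max(rowSums(Xs^2))                     # steps 7-8: squared radius of the enclosing sphere
  S <- diag(s)
  E <- U %*% S %*% (diag(m) / R2) %*% S %*% t(U)   # steps 9-10: back to the input space
  list(E = E, centre = as.numeric(U %*% centre))
}

set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
ell <- bounding_ellipsoid(X)
max(apply(X, 1, function(x) t(x - ell$centre) %*% ell$E %*% (x - ell$centre)))

In the toy usage at the end, every sample point satisfies (x − c)^T E (x − c) ≤ 1 up to rounding, which is exactly the containment property the construction is meant to guarantee.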



Memory complexity of our algorithm is O(n), as all we need to keep is the input sample, the projected sample (which changes at each step), and the matrices of our ellipsoid and transformations.

We compared the original and the modified algorithms in two- and three-dimensional spaces. Both of them were implemented in the R programming language. We generated 100 samples of 200 vectors each. All the vectors were drawn from a Gaussian multivariate distribution with the same covariance matrix. Every sample consisted of two groups of 100 vectors each with different mean vectors. For each algorithm the total time of its work on all 100 samples was computed. Taking the running time of the original algorithm in two-dimensional space as a unit, the comparison is shown in Table 1.

Table 1. Comparison of the two algorithms by running time

Space | Original Algorithm | Modified Algorithm
R² | 1 | 2.0082
R³ | 1.9951 | 2.9404

It is worth noting that for three-dimensional space, after the first projection in the original algorithm we obtain the case m = 2. This means that after the rotation we can use the algorithm for m = 2 and build the bounding box after that. This has been taken into account while implementing the original algorithm. The optimization decreases the number of diameter searches by one, so they are performed m − 1 times, while the modified algorithm performs the diameter search m times. All other actions performed by the algorithms are analogous. Thus the ratio between the running times of the two algorithms on the same datasets must be close to (m − 1)/m, and this can be observed in Table 1. Summing up, the modified algorithm works slightly longer, but as the dimension of the input increases the ratio of the times converges to 1. Also, as mentioned above, the original algorithm is hard to apply in spaces of dimension greater than three.

DECISION RULE

The compactness hypothesis in a proximity space implies symmetry of the relation of closeness between objects inside and between classes. For this reason the decision rule must consider the intraclass and interclass distances between objects of the classes G1 and G2. Therefore, it is necessary to construct four types of ellipses:
1. E11, the ellipses containing the points representing values of the measure of closeness between objects within class G1.
2. E21, the ellipses containing the points representing values of the measure of closeness between objects of class G2 and class G1.
3. E22, the ellipses containing the points representing values of the measure of closeness between objects within class G2.
4. E12, the ellipses containing the points representing values of the measure of closeness between objects of class G1 and class G2.



Classification is executed on the basis of a set of couples of planes containing the ellipses E11 and E21 on the first plane and E22 and E12 on the second plane. The number of planes depends on the number of couples of indices which can be formed. Since it is possible to form n(n − 1)/2 couples of n indices, multiplying this number by 2 gives the total number of couples of ellipses, equal to n(n − 1).

If the point falls into an ellipse E11, the average measure of closeness between the patient to whom it corresponds and the patients of class G1 is characteristic of patients of class G1; therefore, the patient should be assigned to class G1. Similarly, if the point falls into an ellipse E22, the average measure of closeness between the patient and the patients of class G2 is characteristic of patients of class G2; therefore, the patient should be assigned to class G2. If the point falls into an ellipse E21, the average measure of closeness between the patient and the patients of class G1 is characteristic of patients of class G2; therefore, the patient should be assigned to class G2. Similarly, if the point falls into an ellipse E12, the average measure of closeness between the patient and the patients of class G2 is characteristic of patients of class G1; therefore, the patient should be assigned to class G1.

If the point falls into the intersection of ellipses, the ranks of the point in each of the ellipses are compared according to their statistical depth. This requires no additional computation, as the ranks are determined automatically during the construction of a Petunin ellipse. The point belongs to the ellipse in which it has the smaller rank, i.e., lies "more deeply". The final decision for the tested patient is made by a simple vote.
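As an illustration of this voting scheme, the base-R sketch below classifies a tested patient from the ellipse couples. It assumes each ellipse is represented by the matrix E and center c produced by the construction above, and it approximates the within-ellipse rank by the value of the quadratic form (a smaller value meaning a deeper point); the function and field names are illustrative and simplified, not the authors' implementation.

# Schematic voting rule over the ellipse couples; each ellipse is a list(E = ..., centre = ...).
quad_form <- function(ell, p) {
  d <- p - ell$centre
  as.numeric(t(d) %*% ell$E %*% d)          # <= 1 means the point is inside the ellipse
}
vote_on_plane <- function(e_in, e_cross, p, lab_in, lab_cross) {
  q_in <- quad_form(e_in, p); q_cross <- quad_form(e_cross, p)
  if (q_in <= 1 && q_cross <= 1) {
    if (q_in < q_cross) lab_in else lab_cross   # overlap: the deeper point wins
  } else if (q_in <= 1) {
    lab_in
  } else if (q_cross <= 1) {
    lab_cross
  } else {
    NA_character_                               # the point falls into neither ellipse
  }
}
classify_patient <- function(couples, pts1, pts2) {
  # couples[[i]]: list with ellipses E11, E21, E22, E12 for the i-th pair of indices;
  # pts1[[i]], pts2[[i]]: the patient's 2-D points on the first and second plane of that pair.
  v1 <- mapply(function(cp, p) vote_on_plane(cp$E11, cp$E21, p, "G1", "G2"), couples, pts1)
  v2 <- mapply(function(cp, p) vote_on_plane(cp$E22, cp$E12, p, "G2", "G1"), couples, pts2)
  votes <- c(v1, v2); votes <- votes[!is.na(votes)]
  names(which.max(table(votes)))               # simple majority vote over all planes
}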

STATISTICAL DEPTH

The concept of statistical depth functions is widely used in classification problems. In particular, a rather wide and detailed review of this issue is given in the work of Lange and Mozharovsky (2010). In this section a new probabilistic understanding of the concept of statistical data depth is developed. It makes it possible to order a multivariate random sample according to the probability that a new point from the same general population falls into a given decision region.

Let us give a general definition of the depth function.

Definition 1: Let P be some distribution in d-dimensional space R^d. A function D(x; P), defined for each point x of the distribution, that orders the points of the distribution by their decay from the center is called a depth function. The value of the depth function D(x; P) at particular points of the distribution P is called the depth of these points (Zuo and Serfling, 2000).

Remark 1: The distribution center is determined by the particular depth function (it may be a median, a centroid, a geometric center of the distribution, etc.).

In fact, the depth function provides a representation of the output data from d-dimensional Euclidean space in a one-dimensional form. This gives the depth function several advantages; the most significant ones are:
1. Unity of the ordering feature.
2. Relative simplicity of ordering.
3. Independence from affine transformations and the choice of coordinate system.



Let us describe the classical properties of depth functions presented in the work of Zuo and Serfling (2000), which allow a depth function to be called a statistical depth function in the classical sense:

A1 - Affine Invariance: The depth of a point is independent of the chosen coordinate system and does not change under an affine transformation of the original data.
A2 - Maximum at the Center: The depth of the point which is the center of the considered distribution is the greatest among all points of the distribution.
A3 - Monotonicity Relative to the Deepest Point: The depth of the points of the considered distribution monotonically decreases along any ray from the distribution center, i.e., from the deepest point of the distribution.
A4 - Behavior at Infinity: If ||x − x_c|| → ∞ (where x is a point of the distribution P, x_c is the distribution center, and ||·|| is the usual Euclidean norm in d-dimensional space R^d), then D(x; P) → 0.

Remark 2: For a better understanding of the practical application of particular statistical depth functions, we consider finite samples from the distributions instead of the distributions themselves. In this case, the statistical depth function is denoted by D_n(x; M_n), where M_n is the considered sample.

Tukey Depth

Let us introduce the definition of the sample center in Tukey's sense in order to present the concept of the Tukey depth (Tukey, 1974).

Definition 2: Let M_n be a sample from some distribution P in d-dimensional space R^d. The point x_c such that any hyperplane passing through it divides the sample into two approximately equal subsets is called the sample center. The center point need not be one of the data points, just as a median need not be. Each non-empty set of points (points without repetitions) has at least one center point.

Remark 3: In statistics and computational geometry, the concept of the center (center point) is one of the generalizations of the concept of the median to a multidimensional Euclidean space.

Definition 3: Let M_n be a sample from some distribution P in d-dimensional space R^d. The minimum number of sample points that lie on one side of a hyperplane passing through a given point x ∈ M_n is called the Tukey depth of this point. In other words,

TD(x_0; P) = inf_H {C(H) : x_0 ∈ H},

where C(H) is the number of sample points that lie on one side of a hyperplane H passing through the given point x_0.

Lemma 1: The Tukey depth satisfies conditions A1-A4.
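A simple way to approximate the Tukey depth in practice is to scan random hyperplane directions and keep the smallest one-sided count; the following base-R sketch (an illustration, not an exact algorithm) does exactly that.

# Approximate Tukey depth: the smallest number of sample points on one side of a hyperplane
# through x0, estimated over randomly drawn hyperplane normals.
tukey_depth <- function(x0, X, n_dir = 1000) {
  d <- ncol(X)
  depths <- replicate(n_dir, {
    u <- rnorm(d); u <- u / sqrt(sum(u^2))        # random hyperplane normal
    proj <- as.numeric(X %*% u) - sum(x0 * u)     # signed distances to the hyperplane
    min(sum(proj >= 0), sum(proj <= 0))           # points on either side
  })
  min(depths)
}

set.seed(2)
X <- matrix(rnorm(200), ncol = 2)
tukey_depth(colMeans(X), X)    # a central point gets a large depth
tukey_depth(c(5, 5), X)        # an outlying point gets depth close to 0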



The Simplex Depth

The simplex depth was introduced in the work of Liu (1990).

Definition 4: Let M_n be a sample from some distribution P in d-dimensional space R^d. The simplex depth of a point x ∈ M_n is the fraction of simplices built on d + 1 sample points that contain the point x. In other words,

SD_n(x) = (1 / C_n^(d+1)) Σ_{1 ≤ i_1 < i_2 < ... < i_(d+1) ≤ n} I(x ∈ S(x_{i_1}, x_{i_2}, ..., x_{i_(d+1)})),

where S(x_{i_1}, x_{i_2}, ..., x_{i_(d+1)}) is the simplex with vertices at the d + 1 sample points and I is the indicator function.

Lemma 2: If the considered distribution is continuous, its simplex depth satisfies conditions A1-A4.
Lemma 3: If the considered distribution is discrete, its simplex depth does not satisfy conditions A2 and A3.
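For a two-dimensional sample (d = 2, so each simplex is a triangle on d + 1 = 3 sample points), Definition 4 can be estimated by Monte Carlo sampling of triangles, as in the following illustrative base-R sketch.

# Monte Carlo estimate of the simplex depth of x0 in a 2-D sample X (triangles of 3 sample points).
in_triangle <- function(p, a, b, c) {
  # p is inside triangle abc iff the three edge sign tests do not disagree
  s <- function(u, v, w) (u[1] - w[1]) * (v[2] - w[2]) - (v[1] - w[1]) * (u[2] - w[2])
  d1 <- s(p, a, b); d2 <- s(p, b, c); d3 <- s(p, c, a)
  !((d1 < 0 || d2 < 0 || d3 < 0) && (d1 > 0 || d2 > 0 || d3 > 0))
}
simplex_depth <- function(x0, X, n_sim = 2000) {
  hits <- replicate(n_sim, {
    v <- X[sample(nrow(X), 3), ]
    in_triangle(x0, v[1, ], v[2, ], v[3, ])
  })
  mean(hits)                                    # fraction of sampled triangles containing x0
}

set.seed(3)
X <- matrix(rnorm(100), ncol = 2)
simplex_depth(colMeans(X), X)   # central point: high depth
simplex_depth(c(4, 4), X)       # outlying point: depth near 0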

Oja Depth

The Oja depth is the result of a generalization of the simplex depth and was presented in the work of Oja (1983).

Definition 5: Let M_n be a sample from some distribution P in d-dimensional space R^d. The Oja depth of a point x ∈ M_n is the average volume of the simplices built on d sample points and the point x. In other words,

D_n(x) = (1 / C_n^d) Σ_{1 ≤ i_1 < i_2 < ... < i_d ≤ n} ν(x, x_{i_1}, x_{i_2}, ..., x_{i_d}),

where ν(x, x_{i_1}, x_{i_2}, ..., x_{i_d}) is the volume of the d-dimensional simplex in the space R^d.

Mahalanobis Depth

The Mahalanobis distance between two points x, y ∈ R^d relative to a positive definite d × d matrix M is calculated as d_M²(x, y) = (x − y)^T M^(-1) (x − y). The concept of the Mahalanobis depth is based on the Mahalanobis distance and is described in the work of Zuo and Serfling (2000):



MHD(x; F) = (1 + d²_{Σ(F)}(x, µ(F)))^(-1),

where µ(F) is the expectation of the distribution and Σ(F) is its covariance matrix.
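Since the Mahalanobis depth only requires plug-in estimates of µ(F) and Σ(F), it is straightforward to compute with base R; the short sketch below uses the sample mean and covariance.

# Mahalanobis depth of the sample points, with sample mean and covariance as plug-in estimates.
mahalanobis_depth <- function(X) {
  mu <- colMeans(X); S <- cov(X)
  d2 <- mahalanobis(X, center = mu, cov = S)   # squared Mahalanobis distances (base R, stats)
  1 / (1 + d2)
}

set.seed(4)
X <- matrix(rnorm(100), ncol = 2)
head(mahalanobis_depth(X))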

Zonoid Depth

The zonoid depth was proposed in the work of Koshevoy and Mosler (1997).

Definition 6: The zonoid of some distribution P in d-dimensional space R^d is the set of points

Z = { z(µ, h) | h : R^d → [0, 1] is measurable },

where z(µ, h) = (z_0(µ, h), ζ(µ, h)) ∈ R^(d+1), z_0(µ, h) = ∫_{R^d} h(x) dµ(x), and ζ(µ, h) = ∫_{R^d} x h(x) dµ(x).

Definition 7: Let M_n = {x_1, ..., x_n} be a sample from some distribution P in d-dimensional space R^d. The zonoid depth of a point y relative to the sample M_n is the number

d_z(y | x_1, ..., x_n) = sup{α : y ∈ D_α(x_1, ..., x_n)},

where

D_α(x_1, ..., x_n) = { Σ_{i=1}^n λ_i x_i : Σ_{i=1}^n λ_i = 1, λ_i ≥ 0, αλ_i ≤ 1/n for all i }.

Statistical Depth Based on Convex Hull Peeling

The convex hull of a set of points is the minimum convex shape that contains the given points. In other words, it is the set of points that form the "perimeter" of the polygon obtained by connecting them. The peeling (screening) method of convex hulls consists of systematically identifying and removing the set of points that forms the convex hull of the sample (Barnett, 1976).
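The peeling idea is easy to reproduce for a two-dimensional sample with the base-R function chull(): points on the outermost hull get layer 1, points on the next hull get layer 2, and so on. The sketch below is only an illustration of the procedure.

# Convex-hull peeling depth for a 2-D sample: the layer number of the hull on which a point is removed.
peeling_depth <- function(X) {
  depth <- rep(NA_integer_, nrow(X)); idx <- seq_len(nrow(X)); layer <- 0L
  while (length(idx) > 2) {
    layer <- layer + 1L
    hull <- chull(X[idx, , drop = FALSE])   # indices (within idx) of the current convex hull
    depth[idx[hull]] <- layer
    idx <- idx[-hull]                       # peel the hull off and repeat
  }
  if (length(idx) > 0) depth[idx] <- layer + 1L
  depth
}

set.seed(5)
X <- matrix(rnorm(60), ncol = 2)
table(peeling_depth(X))   # how many points fall on each peeled layer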

NOVEL STATISTICAL DATA DEPTH BASED ON PETUNIN ELLIPSOIDS

In this section we propose a new probabilistic understanding of the concept of the statistical depth function. It is based on a new ordering method and a probabilistic feature. Thus, besides ordering the samples directly, we can calculate the probability that the considered point falls into a given sample, and further identify the rank of the given point relative to the sample.



Elliptical Statistical Depth (PD)

Confidence Petunin ellipses allow building a new elliptical statistical depth function. Since the Petunin ellipses form level lines of the function, that is, points on the borders of the ellipsoids E_n have the same statistical depth, equal to the confidence level (n − 1)/(n + 1), a monotone and continuous elliptic depth function can be constructed, for example, as a linear spline on these points (Lyashko, Klyushin, & Alexeyenko, 2013). Obviously, the continuity of the statistical depth can be achieved not only by linear splines.

Lemma 5: The elliptical statistical depth satisfies conditions A1-A4.
Proof: Let us verify properties A1-A4.
A1: The method of constructing Petunin ellipses is invariant with respect to changes of the coordinate system and to affine transformations (due to the respective invariance of ellipses and rectangles).
A2: If the multivariate random variable has an elliptical distribution (in particular, normal), the centers of the Petunin ellipses converge to the median, and thus the Petunin statistical depth function attains its maximum value at this point. For other distributions this function provides an approximation of the median, whose accuracy depends on how far the distribution is from an elliptical one.
A3: P(x ∈ E_n) = (n − 1)/(n + 1) → 1 as n → ∞; i.e., the closer a point is to the distribution center (in this case the median), the larger the probability and hence the larger the value (ellipse index) of the point's depth.
A4: The farther the considered point x is from the distribution center, the greater the probability that it falls into an ellipse with a lower sequence number. If ||x|| → ∞, then, by the compactness hypothesis for the general population under consideration, the probability that this point does not fall into any of the ellipses increases, and thus the value of its depth tends to 0. The proof is completed.

Remark 5: Obviously, the concentric Petunin ellipses with confidence levels (n − 1)/(n + 1) define a new way of ordering multivariate samples.

DEPTH-ORDERED REGIONS

Cascos (2007) proposed the definition and the basic properties which depth-ordered regions formed by a statistical depth function must satisfy.

Definition 8: A depth-ordered (central) region is the set of points whose statistical depth is not less than a predetermined value, i.e.,

D_α(P_X) = { x ∈ R^d : D(x; P_X) ≥ α },



where D(x; P_X) is the statistical depth of the point x taken from the multivariate general population with distribution P_X. Cascos (2007) also introduced the desired properties of depth-ordered regions:
1. Affine Equivariance: For each random vector x ∈ R^d, D_α(Ax + b) = A D_α(x) + b for any nonsingular d × d matrix A and b ∈ R^d.
2. Nesting: If α ≥ β, then D_α(P_X) ⊂ D_β(P_X).
3. Monotonicity: If x ≺ y componentwise, then D_α(P_X) + a ⊂ D_β(P_X) + a, where a ∈ R^d is a random positive vector.
4. Compactness: D_α(P_X) is a compact region.
5. Convexity: D_α(P_X) is a convex region.
6. Subadditivity: D_α(P_{X+Y}) ⊂ D_α(P_X) + D_α(P_Y).

Theorem: Depth-ordered Petunin ellipses satisfy conditions 1-5.
Proof: The analysis of the algorithm for constructing Petunin ellipses shows that properties 1-5 (affine equivariance, nesting, monotonicity, compactness and convexity) obviously hold. Cascos (2007) noted that subadditivity is only a desirable, not a necessary, property in financial risk research. Since the main use of Petunin ellipses is multivariate data classification, and not the assessment of financial risks, the failure of the depth-ordered regions to satisfy the subadditivity property is not a disadvantage.

RESULTS

Consider the results of the test on 25 cases of breast cancer and 12 cases of fibroadenomatosis using cross-validation. This proportion of cases was selected to take into account the "domination effect" described in Petunin et al. (2001) and to compare the new results with the previous ones. Let D1 be the diagnosis "breast cancer", D2 be the diagnosis "fibroadenomatosis", v11 be the frequency of the event D1 for the samples of breast cancer, v21 be the frequency of the event D2 for the samples of breast cancer, v12 be the frequency of the event D1 for the samples of fibroadenomatosis, and v22 be the frequency of the event D2 for the samples of fibroadenomatosis. The results are shown in Table 2.

Table 2. Results of diagnostics of breast cancer and fibroadenomatosis

Test | v11 | v21 | v22 | v12
Quadratic | 0.92 | 0.08 | 0.42 | 0.58
Quadratic with peeling | 0.96 | 0.04 | 0.58 | 0.42

Thus, the specificity of breast cancer diagnosis using the quadratic test is over 90 percent, but the sensitivity is less than 50 percent. Using peeling to eliminate uncertainty (ranking of points in overlapping ellipses) raised the sensitivity by 16 percentage points. Despite this positive result, to raise the sensitivity to an acceptable level it is necessary to use either other indexes or methods, or independent repetitions of the test based on other smears. It is interesting that the combination of the quadratic test with peeling has the same sensitivity and specificity as the sophisticated method described in Petunin et al. (2001).

FUTURE RESEARCH DIRECTIONS

The paradigm of relational discriminant analysis and peeling using statistical depth is very promising and highly fruitful. Future research opportunities within this topic include: 1) developing novel proximity (similarity) measures; 2) developing novel statistical depths; 3) developing more effective algorithms for constructing Petunin ellipses and ellipsoids using genetic and other algorithms; 4) searching for new sets of textural indexes of Feulgen-stained nuclei; and 5) using bootstrap methods for the voting scheme (decision rule).

CONCLUSION

It has been shown that the sensitivity and specificity of the novel method of breast cancer diagnostics, based on the analysis of the distribution of DNA concentration in interphase nuclei of epitheliocytes of buccal epithelium with the aid of novel algorithms of statistical machine learning, are comparable with the method described in Petunin et al. (2001). The fact that the specificity of the method is greater than 50 percent allows the analysis to be repeated. If it is known for certain that the patient may be suffering from only one of two diseases (breast cancer or fibroadenomatosis), the value 1/2^n quickly tends to zero as the number of repetitions n grows. Due to its very high sensitivity, the proposed method of computer-aided diagnosis of mammary gland cancer may be useful for screening the population to select patients for the risk group.

REFERENCES

Barnett, V. (1976). The ordering of multivariate data. Journal of the Royal Statistical Society. Series A (General), 139(3), 318–355. doi:10.2307/2344839

Cascos, I. (2007). Depth function as based of a number of observation of a random vector. Working Paper 07-29, Statistic and Econometric Series 07, 21–28.

Duin, R. P. W., Pekalska, E., & Ridder, D. (1999). Relational discriminant analysis. Pattern Recognition Letters, 20(11-13), 1175–1181. doi:10.1016/S0167-8655(99)00085-9

Duin, R. P. W., Ridder, D., & Tax, D. M. J. (1997). Experiments with object based discriminant functions; a featureless approach to pattern recognition. Pattern Recognition Letters, 18(11-13), 1159–1166. doi:10.1016/S0167-8655(97)00138-4

Ella, H. A., Moftah, H. M., Azar, A. T., & Shoman, M. (2014). MRI breast cancer diagnosis hybrid approach using adaptive ant-based segmentation and multilayer perceptron neural networks classifier. Applied Soft Computing, 14, 62–71. doi:10.1016/j.asoc.2013.08.011



Klyushin, D. A., & Petunin, Yu. I. (2003). A nonparametric test for the equivalence of populations based on a measure of proximity of samples. Ukrainian Mathematical Journal, 55(2), 181–198. doi:10.1023/A:1025495727612

Koshevoy, G., & Mosler, K. (1997). Zonoid trimming for multivariate distributions. Annals of Statistics, 25(5), 1998–2017. doi:10.1214/aos/1069362382

Kumar, P., & Yildirim, E. A. (2005). Minimum volume enclosing ellipsoids and core sets. Journal of Optimization Theory and Applications, 126(1), 1–21. doi:10.1007/s10957-005-2653-6

Lange, T. I., & Mozharovsky, P. F. (2010). Determination of the depth for multivariate data sets. Inductive Simulation of Complex Systems, 2, 101–119. (in Russian)

Liu, R. J. (1990). On a notion of data depth based on random simplices. Annals of Statistics, 18(1), 405–414. doi:10.1214/aos/1176347507

Lyashko, S. I., Klyushin, D. A., & Alexeyenko, V. V. (2013). Multivariate ranking using elliptical peeling. Cybernetics and Systems Analysis, 49(4), 511–516. doi:10.1007/s10559-013-9536-x

Lyashko, S. I., & Rublev, B. V. (2003). Minimal ellipsoids and maximal simplexes in 3D Euclidean space. Cybernetics and Systems Analysis, 39(6), 831–834. doi:10.1023/B:CASA.0000020224.83374.d7

Mairinger, T., Mikuz, G., & Gschwendtner, A. (1999). Nuclear chromatin texture analysis of nonmalignant tissue can detect adjacent prostatic adenocarcinoma. The Prostate, 41(1), 12–19. doi:10.1002/(SICI)1097-0045(19990915)41:13.0.CO;2-# PMID:10440871

Moftah, M. H., Azar, A. T., Al-Shammari, E. T., Ghali, N. I., Hassanien, A. E., & Shoman, M. (2014). Adaptive K-means clustering algorithm for MR breast image segmentation. Neural Computing & Applications, 24(7-8), 1917–1928. doi:10.1007/s00521-013-1437-4

Mottl, V., Dvoenko, S., Seredin, O., Kulikowski, C., & Muchnik, I. (2001). Featureless pattern recognition in an imaginary Hilbert space and its application to protein fold classification. Lecture Notes in Computer Science, 2123, 322–336. doi:10.1007/3-540-44596-X_26

Oja, H. (1983). Descriptive statistics for multivariate distributions. Statistics & Probability Letters, 1(6), 327–332. doi:10.1016/0167-7152(83)90054-8

Pekalska, E., & Duin, R. P. W. (2001). Automatic pattern recognition by similarity representations. Electronics Letters, 37(3), 159–160. doi:10.1049/el:20010121

Pekalska, E., & Duin, R. P. W. (2002). Dissimilarity representations allow for building good classifiers. Pattern Recognition Letters, 23(8), 943–956. doi:10.1016/S0167-8655(02)00024-7

Pekalska, E., & Duin, R. P. W. (2005). The dissimilarity representation for pattern recognition, foundations and applications. Singapore: World Scientific. doi:10.1142/5965

Petunin, Yu. I., Kljushin, D. A., & Andrushkiw, R. I. (1997). Nonlinear algorithms of pattern recognition for computer-aided diagnosis of breast cancer. Nonlinear Analysis, 30(8), 5431–5436. doi:10.1016/S0362-546X(96)00119-8



Petunin, Y. I., Klyushin, D. A., Andrushkiw, R. I., Ganina, K. P., & Boroday, N. V. (2001). Computer-aided differential diagnosis of breast cancer and fibroadenomatosis based on malignancy associated changes in buccal epithelium. Automedica, 19(3-4), 135–164.

Petunin, Yu. I., & Rublev, B. V. (1996). Pattern recognition using quadratic discriminant functions. Numerical and Applied Mathematics, 80, 89–104. (in Russian)

Redon, C. E., Dickey, J. S., Nakamura, A. J., Kareva, I. G., Naf, D., Nowsheen, S., & Sedelnikova, O. A. et al. (2010). Tumors induce complex DNA damage in distant proliferative tissues in vivo. Proceedings of the National Academy of Sciences of the United States of America, 107(42), 17992–17997. doi:10.1073/pnas.1008260107 PMID:20855610

Silverman, B. W., & Titterington, D. M. (1980). Minimum covering ellipses. SIAM Journal on Scientific and Statistical Computing, 1(4), 401–409. doi:10.1137/0901028

Susnik, B., Worth, A., LeRiche, J., & Palcic, B. (1995). Malignancy-associated changes in the breast: Changes in chromatin distribution in epithelial cells in normal-appearing tissue adjacent to carcinoma. Analytical and Quantitative Cytology and Histology, 17(1), 62–68. PMID:7766270

Tukey, J. W. (1975). Mathematics and the picturing of data. Proceedings of the International Congress of Mathematicians, 523–531.

Us-Krasovec, M., Erzen, J., & Zganec, M. (2005). Malignancy associated changes in epithelial cells of buccal mucosa: A potential cancer detection test. Analytical and Quantitative Cytology and Histology, 27(5), 254–262. PMID:16447817

Welzl, E. (n.d.). Smallest enclosing disks (balls and ellipsoids). In Proceedings of New Results and New Trends in Computer Science (vol. 555, pp. 359–370). Springer.

Zuo, Y., & Serfling, R. (2000). General notions of statistical depth function. Annals of Statistics, 28(2), 461–482. doi:10.1214/aos/1016218226



Chapter 3

Early-Stage Ovarian Cancer Diagnosis Using Fuzzy Rough Sets with SVM Classification

Nora Shoaip, Mansoura University, Egypt
Alaa M. Riad, Mansoura University, Egypt
Mohammed Elmogy, Mansoura University, Egypt
Hosam Zaghloul, Mansoura University, Egypt
Farid A. Badria, Mansoura University, Egypt

ABSTRACT

Ovarian cancer is one of the most dangerous cancers among women and ranks high among the cancers causing death. Ovarian cancer is very difficult to diagnose, especially at an early stage, because most symptoms associated with it, such as difficulty eating or feeling full quickly, pelvic or abdominal pain, and bloating, are common and also found in women who do not have ovarian cancer. The CA-125 test is used as a tumor marker, and high levels could be a sign of ovarian cancer, but this is not always true because not all women with ovarian cancer have high CA-125 levels; moreover, only about 20% of ovarian cancers are found at an early stage. In this paper, we try to find the most important rules helping in early-stage ovarian cancer diagnosis by evaluating the significance of the relationship between ovarian cancer and the amino acids. Therefore, we propose a fuzzy-rough feature selection with Support Vector Machine (SVM) classification model. In the pre-processing stage, we use fuzzy-rough set theory for feature selection. In the post-processing stage, we use SVM classification, which is a powerful method to obtain good classification performance. Finally, we compare the output results of the proposed system with other classification techniques to guarantee the highest classification performance.

DOI: 10.4018/978-1-5225-2229-4.ch003

Copyright © 2017, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.


INTRODUCTION

Ovarian cancer forms in the tissues of a woman's ovary (National Cancer Institute, 2014). Abnormal cancer cells can be found in one or both ovaries and have the ability to spread to the pelvis and abdomen and then throughout the body (Mayoclinic, 2014). Annually, ovarian cancer is diagnosed in nearly a quarter of a million women worldwide and is responsible for hundreds of thousands of deaths each year. Only a small percentage of women with ovarian cancer, about 45%, survive for five years, compared to 89% of women with breast cancer (World Ovarian Cancer, 2014). The risk of ovarian cancer is increased in certain groups of women, for example, women aged above 50 years, women who did not give birth or had difficulty in pregnancy, or women who have relatives with cancers such as breast, ovarian, colon, uterine, or cervical cancer (Centers for Disease Control and Prevention CDC, 2014).

Cancer antigen-125 (CA-125) is a protein found at higher rates in ovarian cancer cells than in other, normal cells. CA-125 is created on the surface of cells and then moves into the blood stream (Johns Hopkins University, 2014). Physicians use CA-125 as a tumor marker for ovarian cancer by measuring the level of the CA-125 protein in a woman's blood. However, it is not an adequate early detection tool. High levels of CA-125 can be used as a sign of ovarian cancer, but this is neither accurate nor effective, because not all women with ovarian cancer have high CA-125 levels. Ovarian cancer is difficult to diagnose in its early stages, before it spreads to different body parts. More than 60% of women discover this cancer at stage III or stage IV (Wikipedia, 2014a).

Medical diagnosis involves a high degree of difficulty and faces two main problems. The first problem is that medical diagnosis is a classification process: it must analyze many factors in difficult circumstances, such as diagnosis disparity and limited observation. The second problem is the uncertainty of the processed data, which affects the diagnosis process. Among the machine learning techniques that can deal efficiently with different degrees of difficulty in the data, such as incomplete, uncertain, and inconsistent data, rough set theory (Pawlak, 1982) is used to analyze vague and uncertain data. In practice, rough sets classify discrete attributes with high accuracy, but they do not perform well with real-valued, continuous attributes. This leads to the creation of hybrid systems that integrate rough set theory with other machine learning techniques, such as fuzzy sets. These methods are complementary to each other, and their combination can provide improved solutions for dealing with continuous attributes. The fuzzy-rough set theory is a successful hybrid model between rough sets and fuzzy sets, proposed by Dubois and Prade, and has become one of the most common and efficient feature selection approaches (Dubois & Prade, 1990; Hassanien Aboul Ella et al., 2010). The fuzzy-rough set (Jensen & Shen, 2004; Jensen, 2005; Shen & Jensen, 2007; Chen et al., 2008) is a generalization of the lower and upper approximations of the rough set that provides greater flexibility in handling uncertainty. As compared to rough sets, fuzzy-rough sets handle the uncertainty present in real-valued data in a better way, without requiring any transformation such as discretization.
On the other hand, several classification methods are widely used on medical data based on symptoms and health conditions (Lee & Chang, 2010; Chen et al., 2011; Yoo et al., 2012; Abdul Khaleel et al., 2013). The Naïve Bayes classifier is a simple and rapid classification algorithm, but it relies on strong feature-independence assumptions, which is sometimes a significant disadvantage. Decision tree classifiers provide a good visualization of the data that allows users to understand the structure of all the data easily, but a decision tree may become too complicated when a dataset contains many attributes.


Neural networks consist of computational nodes that emulate the functions of the neurons in the brain, making them a self-learning and strong classification technique. However, they have many disadvantages: they do not achieve good accuracy on massive data, their classification performance is very sensitive to the selected parameters, and their training or learning process is very slow and computationally expensive. Fuzzy-rough nearest neighbour combines fuzzy-rough approximations with the ideas of the classical fuzzy K-nearest neighbor (FNN) approach. In this algorithm, the lower and the upper approximations of a decision class are calculated by means of the nearest neighbours of a test object, which provides good clues for predicting the membership of the test object to that class. On the other hand, Vapnik (Meyer, 2014) proposed the Support Vector Machine (SVM) for binary classification. It finds a decision boundary between two classes that is maximally far from any point in the training data. When it cannot find a linear separator, kernel techniques are used to map the data points into a space where they become separable. SVM has many advantages (Tharwat et al., 2016; Sewell, 2014): it has a simple geometric interpretation, and it gives a sparse solution. The limitation of the SVM approach lies in the choice of the kernel parameters needed to obtain good results, which practically means that an extensive search must be conducted in the parameter space before results can be trusted. ANNs and SVM are two popular strategies for supervised machine learning and classification. They are often viewed as black-box models that do not produce an understandable form from which a domain expert can extract rules directly (Kamruzzaman & Islam, 2006; Suganya & Dhivya, 2011).

This paper is organized as follows. Section 2 reviews prior work on medical diagnosis based on machine learning techniques. Section 3 discusses the framework of our proposed fuzzy-rough with SVM model. In Section 4, the experimental results are introduced to illustrate how the proposed model is applied to the practical case of detecting early-stage ovarian cancer. In Section 5, a comparative study of the output results of the SVM and other classification techniques, such as FuzzyRoughNN, J48, NaiveBayes, and ANNs, is provided. Finally, the conclusion and the future work are presented in Section 6.

RELATED WORK

Medical data usually contain high degrees of uncertainty, and medical diagnosis is essentially a classification process. High classification accuracy plays a major role in ensuring the success of early-stage medical diagnosis, which affects human life. Therefore, improving the diagnosis process is an urgent task. It poses a great challenge for many researchers, who have tried to find the best machine learning technique to accomplish the job. One of these successful and widely used techniques is ANNs (Amato et al., 2013), because they have good accuracy and are able to classify nonlinear systems with very complex relationships among variables. Wilding et al. (1994) used ANNs as a classification tool to help in diagnosing two common cancers, ovarian and breast cancer. Their dataset is based on the age of the patient, a tumor marker (CA 15-3 or CA 125), and a group of laboratory tests, such as albumin, cholesterol, and triglyceride. ANNs achieved good results for the ovarian cancer study (85.5% specificity and 80.6% sensitivity), while the CA 125 tumor marker alone gave results of 82.3% and 77.8%, respectively. Ganesan et al. (2010) used ANNs in the actual clinical diagnosis of lung cancer. Initially, data for 100 lung cancer patients were collected from various hospitals and used to train the neural networks, giving more than 87% accuracy.



Khan et al. (2013) aimed to evaluate ANNs in two cases. The first is acute nephritis disease, where the data are the disease symptoms; the second is heart disease. The results were very good: the network was able to classify 99% of the acute nephritis cases and 95% of the heart disease cases in the testing set. Jeetha & Malathi (2013) evaluated the performance of ANNs and the Genetic Algorithm (GA) in the diagnosis of ovarian cancer using a proven ovarian dataset. The resulting ANNs had a high accuracy of 98% and outperformed the GA in the classification of the ovarian dataset. Verma (2014) evaluated ANNs on three diseases: diabetes, hypertension, and obesity. His results showed that the proposed medical-diagnosis ANNs could be useful for identifying infected persons. Yingchi et al. (2014) evaluated an ANN model in the diagnosis of pancreatic cancer using three serum markers (CA19-9, CA125, and CEA). ANN analysis of multiple markers yielded a high level of diagnostic accuracy (83.53%) compared to logistic regression (74.90%).

On the other hand, SVM is an excellent classification technique. Marchiori & Sebag (2005) proposed a method to improve SVM classification with recursive feature selection (SVM-RFE). The SVM was applied to four publicly available cancer datasets: ovarian, leukemia, colon, and lymphoma cancer data. The results on these datasets confirmed the robustness and stability of their proposed model. Zhou et al. (2010) collected metabolite levels from 50 healthy women and 44 cases of late-stage ovarian cancer and then performed classification using a customized functional SVM algorithm; 100% accuracy was obtained through a 64-30 split validation test and a stringent series of leave-one-out cross-validations. Srivastava & Bhambhu (2010) presented an SVM learning method applied to different types of datasets (diabetes, heart, satellite, and shuttle data). Their experimental results showed that the choice of the kernel function and of the best parameter values for the particular kernel are the critical factors. Sasirekha and Kumar (2013) used SVM to classify heartbeat time series. The results showed that the proposed approach worked well in detecting different attacks and achieved a high performance of 96.63%. Kumari & Chitra (2013) proposed a hybrid model based on SVM with a radial basis function kernel and used it to classify a diabetes dataset. The performance parameters were the classification accuracy (78%), the sensitivity (80%), and the specificity (76.5%) of the SVM and RBF. When the dimension of the input space is large, the training of the classification algorithms takes longer and the performance is affected. Our survey of the literature shows that many researchers have used feature selection algorithms to improve classification accuracy on different medical datasets. For example, Bhatia et al. (2008) used the GA for feature selection and enhanced the performance of the SVM to a great extent, with high accuracy: their system's accuracy was 72.55%, obtained by using only 6 out of 13 features, against an accuracy of only 61.93% using all the features. Akay (2009) proposed F-score feature selection with SVM classification for a breast cancer dataset; the SVM achieved excellent classification accuracy (99.51%) with the five most important features, as compared to previously reported results. Tsai et al. (2011) addressed the problem, associated with an ovarian cancer dataset, of a large number of gene chip variables by using linear regression and analysis of variance (ANOVA) and then classifying with SVM with different kernels. They found that the SVM had a considerably good effect in classification and that the results change when a different kernel function is used.



Srivastava et al. (2010) introduced a Rough-SVM classification method for a heart dataset, which makes good use of the advantages of SVM's high generalization performance and of rough set theory's effectiveness in attribute reduction. The classification accuracy increased by 4.29% over the general SVM. Chen et al. (2009) and LI and Ye-LI (2010) proposed hybrid models between SVM and rough sets. The introduction of rough sets greatly reduces the amount of data fed into the SVM, and the system speed is improved. Gandhi & Prajapati (2014) reduced the features of the Pima Indian diabetes dataset by using the F-score method and K-means clustering before performing SVM classification; as a result, SVM achieved an empirical accuracy of 98%. Several comparative studies have been done in medical classification. For example, Tiwari et al. (2013) evaluated the performance of several classification algorithms, such as SVM, self-organizing maps (SOM), back propagation (BP), and radial basis function networks (RBF), for classifying liver patient data; the SVM provided a high rate of correctly classified instances. Ubaidillaha et al. (2013) used four common cancer datasets (ovarian, prostate, liver, and breast) to compare the performance of SVM and ANNs. The SVM classifier performed better on the datasets with a smaller number of input features (breast cancer and liver cancer), while the ANN classifier obtained good classification performance on the datasets with a larger number of input features (prostate and ovarian cancer). Although both classifiers achieved good accuracy and AUC, the SVM classifier was better at classifying the data belonging to each tumor. SVM generates black-box models that do not have an explainable or understandable form. Nunez et al. (2002) proposed an SVM model with a prototype method that combines prototype vectors with support vectors for each class to define an ellipsoid or rectangle in the input space, which is then transformed into if-then rules; however, this model becomes complicated for a large number of patterns. Barakat & Diederich (2005) handled the SVM rule-extraction task in three basic steps: training, propositional rule extraction, and rule-quality evaluation, achieving high accuracy. Barakat & Bradley (2006) evaluated the quality of the rules extracted from an SVM model by using the receiver operating characteristic (ROC) curve. Martens et al. (2008) presented a survey of the techniques proposed for rule extraction from SVM and ANNs. Another effort to extract rules that are correct and valid from the medical point of view and consistent with clinical knowledge of diabetes risk factors was made by Barakat & Bradley (2010), who treated rule extraction as a learning task; the decision trees and rule sets produced by C5.0 offered an explanation of the concepts learned by the SVM.

In this paper, we try to discover the relationship between ovarian cancer and the amino acids in order to find a new tumor marker for ovarian cancer that can help detect it at an early stage. The dataset is collected from Mansoura Cancer Center, Egypt. This dataset contains several attributes, all of real-valued data types. We choose the fuzzy-rough method as a feature selection algorithm that can deal with complex data without requiring any transformation of the real data, such as discretization, thereby avoiding loss of information. The fuzzy-rough technique reduces the condition attributes to improve classification accuracy. The classification task is performed using SVM, which has become one of the most popular classification techniques. This model has the merits of dealing with real and complex data, performing quick learning, and having good classification performance. Finally, we compare the output results of the SVM with other classification techniques to guarantee the highest classification performance.



THE PROPOSED FUZZY-ROUGH WITH SVM CLASSIFICATION MODEL

Mining suitable rules to explore the most important factors affecting medical diagnosis from complex, real medical data, with quick learning and good classification performance, is a significant challenge. Therefore, this paper proposes a fuzzy-rough with SVM classification model. It integrates fuzzy-rough sets and SVM in four phases, as shown in Figure 1. The first phase prepares the input data to construct the information table. The second phase reduces the features to obtain the minimal essential ones and excludes redundant attributes, saving training time. The classification process is done in the third phase using SVM. The last phase extracts suitable rules in an understandable form to help the physician diagnose ovarian cancer at an early stage. In the following subsections, the stages of our proposed model are introduced in more detail.

Phase One: Data Preparation and Information Table Construction

In this stage, we prepare the medical data in a suitable form to be processed by our proposed system. The dataset is processed to remove redundancy, represent the non-numerical attributes in an appropriate numerical form, check for missing values, and rescale the real values of the decision attribute, as shown in Figure 2.

Figure 1. The block diagram of the fuzzy-rough based on SVM classification model



Figure 2. The stages of the data preparation and information table construction

This is done to construct the information table for extracting knowledge in the form of a two-dimensional table. For dealing with real-valued data, we do not have to use discretization algorithms. To avoid loss of data, we assign membership degrees (fuzzification) in preparation for fuzzy-rough feature selection.

Phase Two: Reducing Features

Reducing features is a very critical part of building classification systems. It certainly has a good effect on the classification accuracy by limiting the number of input features for the classifier (Huang et al., 2008; Xu et al., 2009; Devi & Rajagopalan, 2011; Zhao & Zhang, 2011; Hong, 2011; Gangwal & Bhaumik, 2012; Waleed et al., 2016). We eliminate the redundant attributes by using the fuzzy-rough QuickReduct algorithm (Jensen, 2005), which is based on the dependency function: essential attributes are added to the current reduct until the addition of any remaining attribute no longer increases the dependency, as shown in Figure 3 and sketched in the listing below. Fuzzy-rough sets have achieved more accurate results than Pawlak rough sets by translating the crisp rough set into a fuzzy set and extending the indiscernibility relation to a fuzzy equivalence relation. Therefore, we do not have to make any transformation of the real data, such as discretization. On the other hand, SVM becomes powerless when dealing with uncertain information and cannot determine what kind of data is redundant or useful (Srivastava et al., 2010). Therefore, we use fuzzy-rough feature selection as a pre-processing stage to obtain a minimal reduct. Then, the result is supplied as an input to the SVM to get a balance between the training data and SVM performance.
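A schematic version of that greedy loop is sketched below. For brevity the dependency measure shown is the crisp rough-set dependency on discretized attributes; the fuzzy-rough version (Jensen, 2005) replaces it with a dependency based on fuzzy lower approximations but uses the same add-the-best-attribute loop. All names and the toy iris example are ours, not the chapter's dataset.

# QuickReduct skeleton: greedily add the attribute that most increases the dependency
# of the decision on the current reduct, and stop when no attribute improves it.
crisp_dependency <- function(data, attrs, decision) {
  if (length(attrs) == 0) return(0)
  keys <- apply(data[, attrs, drop = FALSE], 1, paste, collapse = "|")
  consistent <- ave(as.integer(factor(data[[decision]])), keys,
                    FUN = function(v) length(unique(v)) == 1)
  sum(consistent == 1) / nrow(data)          # size of the positive region / |U|
}

quick_reduct <- function(data, conditional, decision, dependency = crisp_dependency) {
  reduct <- character(0)
  best <- dependency(data, reduct, decision)
  repeat {
    gains <- sapply(setdiff(conditional, reduct),
                    function(a) dependency(data, c(reduct, a), decision))
    if (length(gains) == 0 || max(gains) <= best) break   # no attribute increases dependency
    best <- max(gains)
    reduct <- c(reduct, names(which.max(gains)))
  }
  reduct
}

# Toy usage with discretized iris data
d <- iris
d[1:4] <- lapply(d[1:4], function(x) cut(x, 3))
quick_reduct(d, names(d)[1:4], "Species")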

Phase Three: SVM Classification

The classification task is performed using SVM. The data is partitioned into training and testing subsets, as shown in Figure 4. We tested the performance of SVM with different parameter values for the polynomial and RBF kernels to choose the SVM kernel with the best classification. Statistical measures such as accuracy, sensitivity, specificity, and the area under the ROC curve, given in Equations (1)-(3) below, are good tools for evaluating the ability of the SVM to differentiate between positive and negative ovarian cancer cases (Kumari & Chitra, 2013; Gandhi & Prajapati, 2014). A minimal stand-alone sketch of this step follows.
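The chapter's experiments were run in Weka; purely as an illustration of this phase, the following R sketch uses the svm() function from the e1071 package (our choice of tool, not the authors') to compare a polynomial and an RBF kernel on a 66% training split of synthetic stand-in data.

# Illustrative train/test evaluation of SVM kernels with e1071 (not the authors' Weka setup).
library(e1071)

set.seed(6)
n <- 150
X <- data.frame(a = rnorm(n), b = rnorm(n))
y <- factor(ifelse(X$a^2 + X$b > 0.5, "positive", "negative"))   # synthetic stand-in labels

train <- sample(n, round(0.66 * n))                              # 66% training split
fit_poly <- svm(x = X[train, ], y = y[train], kernel = "polynomial", degree = 3)
fit_rbf  <- svm(x = X[train, ], y = y[train], kernel = "radial")

pred_poly <- predict(fit_poly, X[-train, ])
pred_rbf  <- predict(fit_rbf,  X[-train, ])
mean(pred_poly == y[-train])   # test accuracy, polynomial kernel
mean(pred_rbf  == y[-train])   # test accuracy, RBF kernel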



Figure 3. The fuzzy-rough QuickReduct flow chart

Figure 4. The SVM classification steps



Accuracy = (TP + TN) / (TP + TN + FP + FN) (1)

Sensitivity = TP / (TP + FN) (2)

Specificity = TN / (TN + FP) (3)
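Equations (1)-(3) can be computed directly from a confusion matrix; the small base-R helper below (illustrative names, synthetic labels) shows how.

# Accuracy, sensitivity, and specificity from predicted vs. true labels.
classification_metrics <- function(pred, truth, positive = "positive") {
  TP <- sum(pred == positive & truth == positive)
  TN <- sum(pred != positive & truth != positive)
  FP <- sum(pred == positive & truth != positive)
  FN <- sum(pred != positive & truth == positive)
  c(accuracy    = (TP + TN) / (TP + TN + FP + FN),
    sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP))
}

pred  <- c("positive", "positive", "negative", "negative", "positive")
truth <- c("positive", "negative", "negative", "negative", "positive")
classification_metrics(pred, truth)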

Phase Four: Extracting Suitable Rules

To be successful in medical diagnosis, our proposal has to produce its results in an understandable form (if-then rules). However, the SVM algorithm is considered a black-box model: it does not have the ability to explain its results in a form understandable to a domain expert (Nunez et al., 2002). We faced this problem in our model. Therefore, we extract rules from the information provided by the SVM using boundaries (ellipsoids or hyper-rectangles) defined in the input space. They are a combination of support vectors and prototype vectors: the prototype is computed through a clustering algorithm, and a rule is defined for each region.

Experimental Results

Since CA-125 is not a good tumor marker for detecting early-stage ovarian cancer, we tried to determine the effect of the amino acids in predicting ovarian cancer. We applied our proposed model in two different studies: the first between the amino acids and ovarian cancer, and the second between the amino acids and the CA-125 tumor marker. These two studies were conducted to explore better knowledge and the most important factors useful for detecting early-stage ovarian cancer. We have used Weka (Version 3.7.2) (Markov, 2008; Jensen et al., 2014) to carry out the proposed model experiments.

• Samples Collection: This investigation is a case-control study, approved by the ethical institutional review board at Mansoura University, which complies with acceptable international standards. Written informed consent for participation was obtained from each participant. Case patients were recruited from the population of patients with diagnosed ovarian cancer who were evaluated and treated at Mansoura Cancer Center, Egypt. The inclusion criteria were as follows: a pathologically confirmed diagnosis of ovarian cancer, CA-125 (1.9-16.3 U/ml) as a diagnostic cut-off value, and Egyptian residency. From February 2011 through December 2013, the statistical analyses indicated that the eligible patients who were not recruited did not differ from the recruited patients regarding demographical, epidemiological, or clinical factors (retrieved from patients' medical records). The control subjects were healthy, recruited from the diagnostic biochemical lab, AutoLab of Mansoura institution, and matched by age, sex, and ethnicity to the case subjects. The eligibility criteria for controls were the same as those for patients, except for having a cancer diagnosis. A short structured questionnaire was used to screen potential controls based on the eligibility criteria. Analysis of the answers received on the short questionnaire indicated that 80% of those questioned agreed to participate in clinical research.




• Samples Analysis: For the CA-125 blood analysis, samples (5 mL) were taken and centrifuged, and the serum was separated and stored at −80 °C until analyzed. Serum samples were assayed for CA-125 by enzyme-linked immunosorbent assay with commercial kits (Abbott, North Chicago, IL). The amino acid analysis was done with a high-performance liquid chromatography (HPLC) amino acid analyzer, LC 3000 Eppendorf (Germany).

The Relation between the Amino Acids and the Ovarian Cancer

The data were obtained from 135 female patients, including 22 negative cases and 113 positive cases. The class target, or decision attribute, of ovarian cancer takes the values "negative" or "positive", meaning a negative or a positive test for ovarian cancer, respectively. Table 1 shows the description of the used attributes. After preparing the dataset and the information table, the fuzzy-rough reduction is used to calculate the reduct of all amino acid attributes. We found that attributes 2, 3, 13, and 14 were the most important amino acids with a main effect on ovarian cancer. The reduced amino acids dataset is split into a training dataset (66%) and a test dataset (33%). We selected the polynomial kernel as the appropriate SVM kernel function, suitable for nonlinear data with high complexity. We obtained excellent correctly classified instances (100%) and a precision equal to 1, as shown in Table 2. The proposed fuzzy-rough with SVM (polynomial kernel) classification model achieved a classification accuracy greater than SVM (polynomial kernel) by 4.347%. This confirms that feature selection is an important stage for achieving high classification accuracy by limiting the number of input features to the classifier. We extracted the rules from the information provided by the SVM. As shown in Figure 5, the classification error is 0.0%, and it is easy to extract rules by using boundaries (hyper-rectangles) defined in the input space. We found that the phenylalanine amino acid (no. 13) can be used as an excellent tumor marker, with a classification error equal to 0.0%. Table 3 shows the important rules for detecting ovarian cancer at an early stage.

Table 1. The description of ovarian cancer attributes

Amino Acids (Condition Attribute) | Class
1-aspartic, 2-threonine, 3-serine, 4-glutamine, 5-glycine, 6-alanine, 7-cystine, 8-valine, 9-methionine, 10-isoleucine, 11-leucine, 12-tyrosine, 13-phenylalanine, 14-histidine, 15-lysine, 16-Arginine, 17-Proline. *All of them are Numeric (real) data type. *No missing data. | Ovarian cancer: Nominal {negative, positive}

Table 2. The comparison of the performance of the proposed model and SVM using different kernels for the amino acids and the ovarian cancer dataset

Classification Technique | Kernel | Accuracy | TP Rate | FP Rate | ROC Area
Fuzzy-Rough with SVM | Polynomial | 100% | 1 | 0 | 1
Fuzzy-Rough with SVM | RBF | 97.82% | 0.978 | 0.145 | 0.917
SVM | Polynomial | 95.65% | 0.957 | 0.29 | 0.833
SVM | RBF | 86.95% | 0.87 | 0.87 | 0.5


Figure 5. The classification error for 13-phenylalanine and ovarian cancer (X: 13-phenylalanine, Y: ovarian cancer)

Table 3. The primary rule for detecting ovarian cancer, based on phenylalanine, the amino acid most affected by ovarian cancer

13-Phenylalanine: (phenylalanine > 33.9 and phenylalanine < …) => Positive ovarian cancer case; Else => Negative ovarian cancer case

The Relation between the Amino Acids and CA-125

The current experiment extracts the relationship between the amino acids and the tumor marker (CA-125). The class target, or decision attribute, is CA-125, which is a real-valued data type. We rescale this attribute so that it takes the value "normal" inside the range [1.9-16.3 U/ml]. There are 113 positive ovarian cancer cases: 53 cases in the normal range and 60 abnormal cases. After using fuzzy-rough reduction, we find that the selected attributes (1, 2, 3, 5, 6, 7, and 9) are the minimal set of amino acids that have the main effect on the tumor marker CA-125. For the classification process, it is necessary to validate the model with cross-validation; hence, 3-fold, 5-fold, and 10-fold cross-validation techniques are used on the amino acids dataset. The comparative results show that SVM with the polynomial kernel works better for 5-fold cross-validation. We obtain correctly classified instances of 60.17%, as shown in Table 5.
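As a reminder of how such a validation is set up, the following base-R fragment sketches the index bookkeeping for 5-fold cross-validation; the model-fitting call is only a placeholder, and the fold count and sample size are assumptions chosen for illustration.

# k-fold cross-validation index splitting (illustrative values: k = 5, n = 113 patients).
k <- 5
n <- 113
folds <- sample(rep(1:k, length.out = n))   # random fold assignment for every case
for (f in 1:k) {
  test_idx  <- which(folds == f)
  train_idx <- setdiff(seq_len(n), test_idx)
  # fit the classifier on train_idx and evaluate it on test_idx here
}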


Table 4. Description of the amino acids and CA-125 dataset attributes
Amino acids (condition attributes): 1-aspartic, 2-threonine, 3-serine, 4-glutamine, 5-glycine, 6-alanine, 7-cystine, 8-valine, 9-methionine, 10-isoleucine, 11-leucine, 12-tyrosine, 13-phenylalanine, 14-histidine, 15-lysine, 16-arginine, 17-proline. All are numeric (real); there are no missing data.
Class (decision attribute): CA-125, numeric (real); 53 normal cases within [1.9-16.3 U/ml] and 60 abnormal cases outside that range, out of 113 positive ovarian cancer cases.

Table 5. Comparison of the performance of the proposed model and SVM using different kernels on the amino acids and CA-125 dataset
Fuzzy-Rough with SVM (Polynomial kernel): Accuracy 60.17%, TP Rate 0.602, FP Rate 0.405, ROC Area 0.599
Fuzzy-Rough with SVM (RBF kernel): Accuracy 53.097%, TP Rate 0.531, FP Rate 0.531, ROC Area 0.5
SVM (Polynomial kernel): Accuracy 52.212%, TP Rate 0.522, FP Rate 0.482, ROC Area 0.52
SVM (RBF kernel): Accuracy 52.212%, TP Rate 0.522, FP Rate 0.539, ROC Area 0.492

Comparative Study

In this section, we compare the performance of the SVM with other classification methods, namely NaiveBayes, decision trees, ANNs, and Fuzzy-Rough Nearest Neighbour. Naive Bayes, the independent-feature model, has many advantages: it is among the simplest and fastest classification algorithms, it is very easy to construct, it can be used when the dimensionality of the input is high, it does not require complicated parameter estimation, and it performs well with a small amount of training data (Wikipedia, 2014b). However, its strong feature-independence assumption is considered a significant disadvantage (Wikibooks, 2014). Decision trees are a non-parametric supervised learning technique used for regression and classification (scikit-learn, 2014). They provide a good visualization of the data, which allows users to readily understand its overall structure; they can handle both numerical and categorical data, work more efficiently with discrete attributes, and classify data quickly. However, they suffer from several problems: building a tree takes a long time, and this time increases with the number of classes; the tree may become too complex when a dataset contains many attributes; they are prone to overfitting and to propagating errors; and they are difficult to use with continuous values. Fuzzy-Rough Nearest Neighbour (Jensen & Cornelis, 2011) combines fuzzy-rough approximations with the ideas of the classical fuzzy K-nearest neighbor (FNN) approach. It helps in drawing a more meaningful interpretation from the output, which in turn provides the decision maker with more valuable information (Sarojini, 2014). It provides good clues for predicting the membership of the test object in each class, and it can perform better in partially exposed and unbalanced domains. ANNs emulate the functions of the neurons in the brain. Based on the learning type, ANNs are divided into two categories: supervised and unsupervised (Samant & Rao, 2013).

The most widely used ANN is the multi-layer perceptron with back-propagation, because its performance is considered superior to that of other ANN algorithms, including the self-organizing map (SOM) (Delen et al., 2005). An ANN classifier requires less formal statistical training, can detect nonlinear relationships between variables, has a high tolerance for noisy data, and offers multiple available training algorithms (Adebowale et al., 2013). Nevertheless, ANNs do not perform well on massive data, their classification performance is very sensitive to the selected parameters, and their training (learning) process is very slow and computationally expensive. Another important technique is the SVM, which has become one of the most popular classification methods and can compete with the other techniques. Therefore, we produced a comparative study between SVM and other classification methods (J48, NaiveBayes, ANNs, and FuzzyRoughNN) using the reduced attributes of our two experiments (ovarian cancer based on amino acids, and CA-125 based on amino acids). Table 6 shows the comparative results on our reduced dataset for ovarian cancer based on amino acids. SVM achieved perfect accuracy and outperforms J48 (97.82%), NaiveBayes (95.65%), and ANNs (86.96%), while being equal to the Fuzzy-Rough-NN classification technique (100%). It is therefore interesting that two classification techniques achieve perfect accuracy, which supports the validity of our medical rules. Table 7 shows the different classification techniques on the reduced dataset for the CA-125 tumor marker based on amino acids. SVM achieved the highest accuracy compared with the other classification techniques: J48 (48.67%), NaiveBayes (53.98%), ANNs (53.097%), and Fuzzy-Rough-NN (53.98%).

Table 6. Comparison of the performance of the SVM with other classification techniques on the reduced ovarian cancer based on amino acids dataset
SVM (Polynomial kernel): Accuracy 100%, TP Rate 1, FP Rate 0, ROC Area 1
J48: Accuracy 97.82%, TP Rate 0.978, FP Rate 0.145, ROC Area 0.917
NaiveBayes: Accuracy 95.65%, TP Rate 0.957, FP Rate 0.29, ROC Area 1
ANNs: Accuracy 86.96%, TP Rate 0.87, FP Rate 0.87, ROC Area 0.25
FuzzyRoughNN: Accuracy 100%, TP Rate 1, FP Rate 0, ROC Area 1

Table 7. Comparison of the performance of the SVM with other classification techniques on the reduced CA-125 tumor marker based on amino acids dataset
SVM (Polynomial kernel): Accuracy 60.17%, TP Rate 0.602, FP Rate 0.405, ROC Area 0.599
J48: Accuracy 48.67%, TP Rate 0.487, FP Rate 0.533, ROC Area 0.511
NaiveBayes: Accuracy 53.98%, TP Rate 0.54, FP Rate 0.448, ROC Area 0.522
ANNs: Accuracy 53.097%, TP Rate 0.531, FP Rate 0.507, ROC Area 0.51
FuzzyRoughNN: Accuracy 53.98%, TP Rate 0.54, FP Rate 0.466, ROC Area 0.522
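The comparison reported in Tables 6 and 7 can be reproduced in outline as follows. This is a rough sketch with common scikit-learn substitutes (DecisionTreeClassifier stands in for J48/C4.5, MLPClassifier for the ANN, and a plain k-nearest-neighbour classifier for fuzzy-rough NN), not the Weka setup actually used, and the data are random stand-ins.

```python
# Sketch of the comparative study: evaluate several classifiers on one reduced dataset.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier       # stands in for J48 (C4.5)
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier      # stands in for the ANN
from sklearn.neighbors import KNeighborsClassifier    # plain k-NN stand-in for fuzzy-rough NN

rng = np.random.default_rng(2)
X = rng.normal(size=(135, 4))                          # stand-in features, not the study data
y = rng.integers(0, 2, size=135)

classifiers = {
    "SVM (polynomial kernel)": SVC(kernel="poly"),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "NaiveBayes": GaussianNB(),
    "ANN (MLP)": MLPClassifier(max_iter=2000, random_state=0),
    "k-NN": KNeighborsClassifier(),
}
for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {acc:.3f}")
```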


CONCLUSION

In this paper, we tried to discover an effective tumor marker for detecting ovarian cancer at an early stage. Our framework is based on fuzzy-rough and SVM techniques to diagnose ovarian cancer based on amino acids: we chose the fuzzy-rough technique as the feature selection method and the SVM technique for classification. We performed two experimental studies. The first study examined the relation between ovarian cancer and the amino acids. We found that there is a strong relation between the amino acids and ovarian cancer, and that the phenylalanine amino acid (no. 13) is a perfect tumor marker for ovarian cancer and the most important factor for detecting early-stage ovarian cancer. The second study examined the relation between the amino acids and the CA-125 tumor marker and achieved a moderate accuracy of 60.177%. We also compared SVM with several classification techniques, such as J48, NaiveBayes, ANNs, and FuzzyRoughNN, and showed that the SVM classifier outperformed these techniques on our two datasets. In future work, we hope to apply our hybrid model to larger datasets with additional attributes to explore the primary factors that affect the diagnosis of other types of cancer.

REFERENCES Abdul Khaleel, M., Pradham, k., & Dash, G. N. (2013). A Survey of Data Mining Techniques on Medical Data for Finding Locally Frequent Diseases. International Journal of Advanced Research in Computer Science and Software Engineering, 3(8), 149–153. Adebowale, A. (2013). Comparative Study of Selected Data Mining Algorithms Used for Intrusion Detection. International Journal of Soft Computing and Engineering, 3(3), 237–241. Akay, M. F. (2009). Support vector machines combined with feature selection for breast cancer diagnosis. Expert Systems with Applications, 36(2), 3240–3247. doi:10.1016/j.eswa.2008.01.009 Amato, F., Lopez, A., Pena-Mendez, E. M., Vanhara, P., Hampl, A., & Havel, J. (2013). Artificial neural networks in medical diagnosis. Journal of Applied Biomedicine, 11(2), 47–58. doi:10.2478/v10136012-0031-x Barakat, N., & Bradley, P. A. (2006). Rule Extraction from Support Vector Machines: Measuring the Explanation Capability Using the Area under the ROC Curve. IEEE, 7695-2521. Barakat, N., & Bradley, P. A. (2010). Rule extraction from support vector machines: A review. Neurocomputing, 74(1), 178–190. doi:10.1016/j.neucom.2010.02.016 Barakat, N., & Diederich, J. (2005). Eclectic Rule-Extraction from Support Vector Machines. International Journal of Computational Intelligence, 2(1), 59-62. Bhatia, S., Prakash, P., & Pillai, G. N. (2008). SVM Based Decision Support System for Heart Disease Classification with Integer-Coded Genetic Algorithm to Select Critical Features. Proceedings of World Congress on Engineering and Computer Science (WCECS). Centers for Disease Control and Prevention (CDC). (2014). Ovarian Cancer. Retrieved from http://www. cdc.gov/cancer/ovarian/index.htm


Chen, H., Wang, M., Qian, F., Jiang, Q., Yu, F., & Luo, Q. (2008). Research on Combined Rough Sets with Fuzzy Sets. IEEE Computer Society, 1, 163–167. Chen, H.-L., Liu, D.-H. B. Y., Liu, J., Wang, G., & Wang, S. (2011). An Adaptive Fuzzy k-Nearest Neighbor Method Based on Parallel Particle Swarm Optimization for Bankruptcy Prediction. Springer. Chen, R.-C., Cheng, K.-F., & Hsieh, C.-F. (2009). Using Rough Set and Support Vector Machine for Network Intrusion Detection. International Journal of Network Security & Its Applications, 1(1), 1–13. Delen, D. W. G., & Kadam, A. (2005). Predicting breast cancer survivability: A comparison of three data mining methods. Artificial Intelligence in Medicine, 34(2), 113–127. doi:10.1016/j.artmed.2004.07.002 PMID:15894176 Devi, S. N., & Rajagopalan, S. P. (2011). A study on Feature Selection Techniques in Bio-Informatics. International Journal of Advanced Computer Science and Applications, 2(1), 138–144. Dubois, D., & Prade, H. (1990). Rough fuzzy sets and fuzzy rough sets. International Journal of General Systems, 17(2–3), 191–209. doi:10.1080/03081079008935107 Gandhi, K. K., & Prajapati, N. B. (2014). Diabetes prediction using feature selection and classification. International Journal of Advance Engineering and Research Development, 1(5), 1–7. Ganesan, N., Venkatesh, K., Rama, M. A., & Palani, A. M. (2010). Application of Neural Networks in Diagnosing Cancer Disease Using Demographic Data. International Journal of Computers and Applications, 1(26), 76–85. Gangwal, C., & Bhaumik, R. N. (2012). Intuitionistic Fuzzy Rough Relation in Some Medical Applications. International Journal of Advanced Research in Computer Engineering Technology, 1, 28–32. Hassanien, Hameed, & Ajith. (2010). Rough Hybrid Scheme: An application of breast cancer imaging. Chapman & Hall. Hong, J. (2011). An Improved Prediction Model based on Fuzzy-rough Set Neural Network. International Journal of Computer Theory and Engineering, 3(1), 158–162. doi:10.7763/IJCTE.2011.V3.299 Huang, C.-L., Liao, H.-L., & Chen, M.-C. (2008). Prediction model building and feature selection with support vector machines in breast cancer diagnosis. Expert Systems with Applications, 34(1), 578–587. doi:10.1016/j.eswa.2006.09.041 Jeetha, B., & Malathi, M. (2013). Diagnosis of Ovarian Cancer Using Artificial Neural Network. International Journal of Computer Trends and Technology, 4(10), 3601–3606. Jensen, R. (2005). Combining Rough and Fuzzy Sets for Feature Selection. The University of Edinburg. Retrieved from http://books.google.com.eg/books?id=84ESSQAACAAJ Jensen, R. (n.d.). Fuzzy-rough data mining with Weka. Retrieved from http://users.aber.ac.uk/rkj/Weka.pdf Jensen, R., & Cornelis, C. (2011). Fuzzy Rough Nearest Neighbour Classification and Prediction. Theoretical Computer Science, 1–29.


Jensen, R., Parthaláin, N. M., & Shen, Q. (2014). Tutorial: Fuzzy-rough data mining (using the Weka data mining suite). Retrieved from http://users.aber.ac.uk/rkj/wcci-tutorial-2014 Jensen, R., & Shen, Q. (2004). Semantics-Preserving Dimensionality Reduction: Rough and FuzzyRough-Based Approaches. IEEE Transactions on Knowledge and Data Engineering, 16(12), 1457–1471. doi:10.1109/TKDE.2004.96 Johns Hopkins University. (2014). Ovarian Cancer. Retrieved from http://ovariancancer.jhmi.edu/ ca125qa.cfm Kamruzzaman, S. M., & Islam, M. M. (2006). An Algorithm to Extract Rules from Artificial Neural Networks for Medical Diagnosis Problems. International Journal of Information Technology, 12(8), 41–59. Khan, I. Y., Zope, P., & Suralkar, S. (2013). Importance of Artificial Neural Network in Medical Diagnosis disease like acute nephritis disease and heart disease. International Journal of Engineering Science and Innovative Technology, 2(2), 210–217. Kumari, V. A., & Chitra, R. (2013). Classification Of Diabetes Disease Using Support Vector Machine. International Journal of Engineering Research and Applications, 3, 1797–1801. Lee, M., & Chang, T. (2010). Comparison of Support Vector Machine and Back Propagation Neural Network in Evaluating the Enterprise Financial Distress. CoRR, abs/1007.5133 Li, Y.-B., & Ye-Li. (2010). Survey on Uncertainty Support Vector Machine and Its Application in Fault Diagnosis. IEEE, 9, 561-565. Marchiori, E., & Sebag, M. (2005). Bayesian learning with local support vector machines for cancer classification with gene expression data. Verlag, Springer., 3449, 74–83. Markov, Z., & Russell, I. (2008). An Introduction to the WEKA Data Mining System. Workshop #5 at The 39th ACM Technical Symposium on Computer Science Education, Portland, Oregon, March. Retrieved from http://www.cs.ccsu.edu/~markov/weka-tutorial.pdf Martens, D., Huysmans, J., Setiono, R., Vanthienen, J., & Baesens, B. (2008). Rule Extraction from Support Vector Machines: An Overview of Issues and Application in Credit Scoring. Rule Extraction from Support Vector Machines, 80, 33-63. Mayoclinic. (2014). Staging ovarian cancer. Retrieved from http://www.mayoclinic.org/diseasesconditions/ovarian-cancer/basics/tests-diagnosis/con-20028096 Meyer, D. (2014). Support Vector Machines. The Interface to libsvm in package e1071. Online-Documentation of the package e1071 for Support Vector Machines. National Cancer Institute. (2014). Definition of ovarian cancer. Retrieved from http://www.cancer.gov/ cancertopics/types/ovarian Núñez, H., Angulo, C., & Catala, A. (2002). Rule-extraction from Support Vector Machines. Proc. of European Symposium on Artificial Neural Networks (ESANN), 107-112. Olson, D. L., & Delen, D. (2008). Advanced Data Mining Techniques. Springer Publishing Company, Incorporated.


Pawlak, Z. (1982). Rough sets. International Journal of Computer & Information Sciences, 11(5), 341–356. doi:10.1007/BF01001956 Rohilla, M., & Singal, V. (2014). Design and Comparative Analysis of SVM and PR Neural Network for Classification of Brain Tumour in MRI. International Journal of Computers and Applications, 91(17), 15–21. doi:10.5120/16101-5287 Samant, R., & Rao, S. (2013). Comparative Study of Artificial Neural Network Models in Prediction of Essential Hypertension. International Journal of Latest Trends in Engineering and Technology, 3(2), 114–120. Sarojini, B. (2014). A Wrapper Based Feature Subset Evaluation Using Fuzzy Rough K-NN. IACSIT International Journal of Engineering and Technology, 5(6), 4672–4676. Sasirekha, A., & Kumar, P. G. (2013). Support Vector Machine for Classification of Heartbeat Time Series Data. International Journal of Emerging Science and Engineering, 1, 38–41. Scikit-Learn. (n.d.). Decision Trees. Retrieved from http://scikit-learn.org/stable/modules/tree.html Sewell, M. (2014). Support Vector Machines (SVMs). Retrieved from http://www.svms.org Shen, Q., & Jensen, R. (2007). Rough Sets, their Extensions, and Applications. International Journal of Automation and Computing, 4(3), 217–228. doi:10.1007/s11633-007-0217-y Srivastava, D. K., & Bhambhu, L. (2010). Data Classification Using support vector machine. Journal of Theoretical and Applied Information Technology, 12(1), 1–7. Srivastava, D. K., Patnaik, K. S., & Bhambhu, L. (2010). Data Classification: A Rough - SVM Approach. Contemporary Engineering Sciences, 3(2), 77–86. Suganya, G., & Dhivya, D. (2011). Extracting Diagnostic Rules from Support Vector Machine. Journal of Computer Applications, 4, 95–98. SVMs. (2014). Support Vector Machines vs. Artificial Neural Networks. Retrieved from http://www. svms.org/anns.html Tharwat, A., Hassanien, A. E., & Elnaghi, B. E. (2016). A BA-based algorithm for parameter optimization of Support Vector Machine. Pattern Recognition Letters. doi:10.1016/j.patrec.2016.10.007 Tiwari, A.K., & Sharma, L.K., & Krishna, G.R. (2013). Comparative Study of Artificial Neural Network based Classification for Liver Patient. Journal of Information Engineering and Applications, 3(4), 1–5. Tsai, M.-H., Wang, S.-H., Wu, K.-C., Chen, J.-M., & Chiu, S.-H. (2011). Human Ovarian carcinoma microarray data analysis based on Support Vector Machines with different kernel functions. International Conference on Environment Science and Engineering (IPCBEE), 8, 138-142. Ubaidillaha, S. H. S. A., Sallehuddina, R., & Alia, N. A. (2013). Cancer Detection Using Artificial Neural Network and Support Vector Machine: A Comparative Study. Jurnal Teknologi (Sciences & Engineering), 65(1), 73–81.


Verma, M. (2014). Medical Diagnosis using Back-Propagation Algorithm in ANN. International Journal of Science, Engineering and Technology Research, 3, 94–99. Wikibooks. (n.d.). Data Mining Algorithms In R/Classification/NaiveBayes. Retrieved from http:// en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/Na%C3%AFve_Bayes Wikipedia. (2014a). Ovarian Cancer. Retrieved from http://en.wikipedia.org/wiki/Ovarian_cancer Wikipedia. (2014b). Naïve Bayesian classifier. Retrieved from http://en.wikipedia.org/w/index. php?oldid=422757005 Wilding, P., Morgan, M., Grygotis, A., Shoffner, M., & Rosato, E. (1994). Application of back propagation neural networks to diagnosis of breast and ovarian cancer. Cancer Letters, 77(2-3), 145–153. doi:10.1016/0304-3835(94)90097-3 PMID:8168061 World, O. C. D. (2014). About Ovarian Cancer. Retrieved from http://ovariancancerday.org/en/aboutovarian Xu, F. F., Miao, D. Q., & Wei, L. (2009). Fuzzy-rough Attribute Reduction via Mutual Information with an Application to Cancer Classification. Computers & Mathematics with Applications (Oxford, England), 57(6), 1010–1017. doi:10.1016/j.camwa.2008.10.027 Yamany, W., Emary, E., Hassanien, A. E., Schaefer, G., & Zhu, S. Y. (2016). An Innovative Approach for Attribute Reduction Using Rough Sets and Flower Pollination Optimisation, Procedia. Computer Science, 96, 403–409. Yingchi, Y., Hui, C., Dong, W., Wei, L., Biyun, Z., & Zhongtao, Z. (2014). Diagnosis of pancreatic carcinoma based on combined measurement of multiple serum tumor markers using artificial neural network analysis. Chinese Medical Journal, 127(10), 1891–1896. PMID:24824251 Yoo, I., Alafaireet, P., Marinov, M., Pena-Hernandez, K., Gopidi, R., Chang, J., & Hua, L. (2012). Data Mining in Healthcare and Biomedicine: A Survey of the Literature. Journal of Medical Systems, 1(4), 2431–2448. doi:10.1007/s10916-011-9710-5 PMID:21537851 Zhao, J., & Zhang, Z. (2011). Fuzzy Rough Neural Network and Its Application to Feature Selection. Academic Journal., 13(4), 270–275. Zhou, M., Guan, W., Walker, L. D., Mezencey, R., Benigno, B. B., Gray, A., & McDonald, J. F. et al. (2010). Rapid Mass Spectrometric Metabolic Profiling of Blood Sera Detects Ovarian Cancer with High Accuracy. American Association for Cancer Research, 1. doi:10.1158/1055-9965.EPI-10-0126 PMID:20699376


Chapter 4

Data Storage Security Service in Cloud Computing: Challenges and Solutions

Alshaimaa Abo-alian, Ain Shams University, Egypt
Nagwa L. Badr, Ain Shams University, Egypt
Mohamed F. Tolba, Ain Shams University, Egypt

DOI: 10.4018/978-1-5225-2229-4.ch004

ABSTRACT

Cloud computing is an emerging computing paradigm that is rapidly gaining attention as an alternative to other traditional hosted application models. The cloud environment provides on-demand, elastic and scalable services; moreover, it can provide these services at lower costs. However, this new paradigm poses new security issues and threats because cloud service providers are not in the same trust domain as cloud customers. Furthermore, data owners cannot control the underlying cloud environment. Therefore, new security practices are required to guarantee the availability, integrity, privacy and confidentiality of the outsourced data. This paper highlights the main security challenges of the cloud storage service and introduces some solutions to address those challenges. The proposed solutions present a way to protect the data integrity, privacy and confidentiality by integrating data auditing and access control methods.

INTRODUCTION

Cloud computing can be defined as a type of computing in which dynamically scalable resources (i.e., storage, network, and computing) are provided on demand as a service over the Internet. The service delivery model of cloud computing is the set of services provided by cloud computing, often referred to as the SPI model, i.e., Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS). In a SaaS model, the cloud service providers (CSPs) install and operate application
software in the cloud and the cloud users can then access the software from cloud clients. The users do not purchase software, but rather rent it for use on a subscription or pay-per-use model, e.g. Google Docs (Attebury, George, Judd, & Marcum, 2008). The SaaS clients do not manage the cloud infrastructure and platform on which the application is running. In a PaaS model, the CSPs deliver a computing platform which includes the operating system, programming language execution environment, web server and database. Application developers can subsequently develop and run their software solutions on a cloud platform. With PaaS, developers can often build web applications without installing any tools on their computer, and can hereafter deploy those applications without any specialized system administration skills (Tim, Subra, & Shahed, 2009). Examples of PaaS providers are Windows Azure (Chambers, 2013) and Google App Engine (Pandey & Anjali, 2013). The IaaS model provides the infrastructure (i.e., computing power, network and storage resources) to run the applications. Furthermore, it offers a pay-per-use pricing model and the ability to scale the service depending on demand. Examples of IaaS providers are Amazon EC2 (Gonzalez, Border, & Oh, 2013) and Terremark (Srinivasan, 2014). Cloud services can be deployed in four ways depending upon the clients’ requirements. The cloud deployment models are public cloud, private cloud, community cloud and hybrid cloud. In the public cloud (or external cloud), a cloud infrastructure is hosted, operated, and managed by a third party vendor from one or more data centers (Tim, Subra, & Shahed, 2009).The network, computing and storage infrastructure is shared with other organizations. Multiple enterprises can work simultaneously on the infrastructure provided. Users can dynamically provide resources through the internet from an off-site service provider (Bhadauria & Sanyal, 2012). In the private cloud, cloud infrastructure is dedicated to a specific organization and is managed either by the organization itself or third party service provider. This emulates the concept of virtualization and cloud computing on private networks. Infrastructure, in the community cloud, is shared by several organizations for a shared reason and may be managed by themselves or a third party service provider. Infrastructure is located at the premises of a third party. Hybrid cloud consists of two or more different cloud deployment models bound together by standardized technology, which enables data portability between them. With a hybrid cloud, organizations might run non-core applications in a public cloud, while maintaining core applications and sensitive data in-house in a private cloud (Tim, Subra, & Shahed, 2009). A cloud storage system (CSS) can be considered a network of distributed data centers which typically uses cloud computing technologies like virtualization, and offers some kind of interface for storing data (Borgmann, et al., 2012). Data may be redundantly stored at different locations in order to increase its availability. Examples of such basic cloud storage services are Amazon S3 (Berriman, et al., 2013) and Rackspace (Garg, Versteeg, & Buyya, 2013). One fundamental advantage of using a CSS is the cost effectiveness, where data owners avoid the initial investment of expensive large equipment purchasing, infrastructure setup, configuration, deployment and frequent maintenance costs (Abo‐alian, Badr, & Tolba, 2015). 
Instead, data owners pay for only the resources they actually use and for only the time they require them. Elasticity is also a key advantage of using a CSS, as storage resources can be allocated dynamically as needed, without human interaction. Scalability is another gain of adopting a CSS, because cloud storage architecture can scale horizontally or vertically according to demand, i.e., new nodes can be added or dropped as needed. Moreover, a CSS offers more reliability and availability, as data owners can access their data from anywhere and at any time (Abo-alian, Badr, & Tolba, 2016). Furthermore, cloud service providers use several replicated sites for business continuity and disaster recovery reasons.

Despite the appealing advantages of cloud storage services, they also bring new and challenging security threats towards users' outsourced data. Since cloud service providers (CSPs) are separate administrative
entities, the correctness and the confidentiality of the data in the cloud is at risk due to the following reasons: First, since cloud infrastructure is shared between organizations, it is still facing the broad range of both internal and external threats to data integrity, for example, outages of cloud services such as the breakdown of Amazon EC2 in 2010 (Miller, 2010). Second, users no longer physically possess the storage of their data, i.e., data is stored and processed remotely. So, they may worry that their data could be misused or accessed by unauthorized users. For example, a dishonest CSP may sell the confidential information about an enterprise to the enterprise’s closest business competitors for profit. Third, there are various motivations for the CSP to behave disloyally towards the cloud users regarding their outsourced data status. For example, the CSP might reclaim storage for monetary reasons by discarding data that has not been or is rarely accessed or even hide data loss incidents to maintain a reputation (Cong, Ren, Lou, & Li, 2010). In a nutshell, although outsourcing data to the cloud is economically attractive for long term large scale storage, the data security in cloud storage systems is a prominent problem. Cloud storage systems do not immediately offer any guarantee of data integrity, confidentiality, and availability. As a result, the CSP should adopt data security practices to ensure that their clients’ data is available, correct and safe from unauthorized access and disclosure. Downloading all the data and checking on retrieval is the traditional method for verifying the data integrity but it causes high transmission costs and heavy I/O overheads. Furthermore, checking data integrity when accessing data is not sufficient to guarantee the integrity and availability for all the stored data in the cloud because there is no assurance for the data that is rarely accessed. Thus, it is essential to have storage auditing services that verify the integrity of outsourced data and to provide proof to data owners that their data is correctly stored in the cloud. Traditional server-based access control methods such as Access Control List (ACL) (Shalabi, Doll, Reilly, & Shore, 2011) cannot be directly applied to cloud storage systems because data owners do not fully trust the cloud service providers. Additionally, traditional cryptographic solutions cannot be applied directly while sharing data on cloud servers because these solutions require complicated key management and high storage overheads on the server. Moreover, the data owners have to stay online all the time to deliver the keys to new users. Therefore, it is crucial to have an access control method that restricts and manages access to data and ensure that the outsourced data is safe from unauthorized access and disclosure. In this paper, an extensive survey of cloud storage auditing and access control methods is presented. Moreover, an evaluation of these methods against different performance criteria is conducted. The rest of the paper is organized as follows. Section 2 overviews various cloud storage auditing methods. Section 3 presents some literature methods for cloud access control. A comparative analysis of some existing data security methods in cloud computing is provided in section 4. Section 5 discusses the limitations of different data security methods in cloud computing and provides some concluding remarks that can be used in designing new data security practices. Finally, we conclude in section 6.

DATA STORAGE AUDITING METHODS

This section first defines the system model and security model of data storage auditing schemes within cloud computing. Then, the existing data storage auditing methods, classified into different categories, are presented.


Data storage auditing can be defined as a method that enables data owners to check the integrity of remote data without downloading the data or having explicit knowledge of the entire data (Blum, Evans, Gemmell, Kannan, & Naor, 1994). Any system model of an auditing scheme consists of three entities, as mentioned in (Liu, Zhang, & Lun, 2013):

1. Data Owner: An entity which has large data files to be stored in the cloud; it can be either an individual consumer or an organization.
2. Cloud Storage Server (CSS): An entity which is managed by a Cloud Service Provider (CSP) and has significant storage space and computation resources to maintain clients' data.
3. Third Party Auditor or Verifier (TPA): An entity which has the expertise and capabilities to check the integrity of data stored on the CSS.

In the security model of most data auditing schemes, the auditor is assumed to be honest-but-curious: it performs honestly during the entire auditing protocol, but it is curious about the received data. So, it is essential to keep the data confidential and invisible to the auditor during the auditing protocol. The cloud storage server, however, could be dishonest and may launch the following attacks (Yang & Xiaohua, 2014):

1. Replace Attack: The server may choose another valid and uncorrupted pair of data block and data tag (mk, tk) to replace the challenged pair of data block and data tag (mi, ti), when it has already discarded mi or ti.
2. Forge Attack: The server may forge the data tag of a data block and deceive the auditor if the owner's secret tag keys are reused for different versions of the data.
3. Replay Attack: The server may generate the proof from a previous proof or other information, without retrieving the actual owner's data.

As illustrated in Figure 1, a data storage auditing scheme should basically consist of five algorithms (Abo-Alian, Badr, & Tolba, 2016); a minimal sketch of this five-step workflow is given after the classification below:

1. Key Generation: Run by the data owner. It takes a security parameter 1^λ as input and outputs a pair of private and public keys (sk, pk).
2. Tag Generation: Run by the data owner to generate the verification metadata, i.e., the data block tags. It takes as inputs a public key pk, a secret key sk and the file blocks b, and outputs the verifiable block tags Tb.
3. Challenge: Run by the auditor in order to randomly generate a challenge that indicates specific blocks. These random blocks are used as a request for a proof of possession.
4. Response/Proof: Run by the CSP, upon receiving the challenge, to generate a proof that is used in the verification. In this process, the CSP proves that it is still correctly storing all file blocks.
5. Verify: Run by the auditor in order to validate the proof of possession. It outputs TRUE if the verification equation passes, or FALSE otherwise.

The existing data storage auditing schemes can be basically classified into two main categories:


Figure 1. Structure of an auditing scheme

1. Provable Data Possession (PDP) Methods: For verifying the integrity of data without retrieving it from untrusted servers, the auditor verifies probabilistic proofs of possession by sampling random sets of blocks from the cloud service provider (Ateniese, et al., 2007).
2. Proof of Retrievability (PoR) Methods: A set of randomly valued check blocks called sentinels are embedded into the encrypted file. For auditing the data storage, the auditor challenges the server by specifying the positions of a subset of sentinels and asking the server to return the associated sentinel values (Juels & Kaliski, 2007).

The existing data storage auditing methods can be further classified into several categories according to:

1. The Type of the Auditor/Verifier: Public auditing or private auditing.
2. The Distribution of the Data to be Audited: Single-copy or multiple-copy data.
3. The Data Persistence: Static or dynamic data.
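To make the five-step auditing workflow above concrete, the following is a minimal, hypothetical sketch of a spot-checking auditor. It uses simple per-block HMAC tags rather than the homomorphic authenticators used by real PDP/PoR constructions, so the proof size grows with the number of challenged blocks and only private auditing (by the key holder) is possible; all names and parameters are illustrative only.

```python
# Toy spot-checking audit: KeyGen, TagGen, Challenge, Prove, Verify (standard library only).
import hashlib
import hmac
import secrets

BLOCK_SIZE = 64  # bytes per block; an arbitrary choice for this sketch

def key_gen() -> bytes:
    """Key Generation: the data owner creates a secret MAC key (no public key in this toy scheme)."""
    return secrets.token_bytes(32)

def tag_gen(sk: bytes, blocks: list) -> list:
    """Tag Generation: one tag per block, bound to the block index to prevent block swapping."""
    return [hmac.new(sk, str(i).encode() + b, hashlib.sha256).digest() for i, b in enumerate(blocks)]

def challenge(num_blocks: int, sample_size: int = 3) -> list:
    """Challenge: the auditor picks a random subset of block indices to spot-check."""
    return secrets.SystemRandom().sample(range(num_blocks), sample_size)

def prove(blocks: list, tags: list, chal: list) -> list:
    """Response/Proof: the server returns the challenged blocks together with their stored tags."""
    return [(i, blocks[i], tags[i]) for i in chal]

def verify(sk: bytes, proof: list) -> bool:
    """Verify: the auditor recomputes each tag and compares it with the returned one."""
    return all(
        hmac.compare_digest(hmac.new(sk, str(i).encode() + b, hashlib.sha256).digest(), t)
        for i, b, t in proof
    )

# Example run: the owner outsources the blocks and tags, keeps only sk, and later audits the server.
data = secrets.token_bytes(10 * BLOCK_SIZE)
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
sk = key_gen()
tags = tag_gen(sk, blocks)          # blocks and tags are handed to the cloud storage server
chal = challenge(len(blocks))
assert verify(sk, prove(blocks, tags, chal))
```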

Public Auditing vs. Private Auditing

Considering the role of the auditor in the auditing model, data storage auditing schemes fall into two categories: private auditing and public auditing (Zheng & Xu, 2012). In private auditing, only data owners can challenge the CSS to verify the correctness of their outsourced data (Yang & Jia, 2012). Unfortunately, private auditing schemes have two limitations:


(a) they impose an online burden on the data owner to verify data integrity, and (b) the data owner must have huge computational capabilities for auditing. Examples of auditing schemes that only support private auditing are (Chen & Curtmola, 2012; Chen & Curtmola, 2013; Mukundan, Madria, & Linderman, 2012; Etemad & Kupcu, 2013). In public auditing, or third-party auditing (Zhu, Ahn, Hu, Yau, An, & Hu, 2013), data owners are able to delegate the auditing task to an independent third party auditor (TPA) without devoting their own computational resources. However, public auditing schemes should guarantee that the TPA keeps no private information about the verified data. Several variations of PDP schemes that support public auditing, such as (Chen & Curtmola, 2013; Mukundan, Madria, & Linderman, 2012; Fujisaki & Okamoto, 1999; Abo-alian, Badr, & Tolba, 2015), were proposed under different cryptographic primitives.

Auditing Single-Copy vs. Multiple-Copy Data

For verifying the integrity of outsourced data in the cloud storage, various auditing schemes have been proposed, which can be categorized into:

1. Auditing schemes for single-copy data.
2. Auditing schemes for multiple-copy data.

Auditing Schemes for Single-Copy Data

Shacham and Waters (Shacham & Waters, 2013) proposed two fast PoR schemes based on a homomorphic authenticator that enables the storage server to reduce the complexity of the auditing by aggregating the authentication tags of individual file blocks. The first scheme is built from BLS signatures and allows public auditing. The second scheme is built on pseudorandom functions and allows only private auditing, but its response messages are shorter than those of the first scheme, i.e., 80 bits only. Both schemes use the Reed-Solomon erasure encoding method (Plank, 1997) to support an extra feature which allows the client to recover the data outsourced in the cloud. However, their encoding and decoding are slow for large files. Yuan and Yu (Yuan & Yu, 2013) managed to suppress the need for a tradeoff between communication costs and storage costs. They proposed a PoR scheme with public auditing and constant communication cost by combining techniques such as constant-size polynomial commitments, BLS signatures, and homomorphic linear authenticators. On the other hand, their data preparation process requires (s+3) exponentiation operations, where s is the block size, and their scheme does not support data dynamics. Xu and Chang (Xu & Chang, 2011) proposed a new PDP model called POS. POS requires no modular exponentiation in the setup phase and uses a small number, i.e., about 102 for a 1 GB file, of group exponentiations in the verification phase. POS only supports private auditing for static data, and its communication cost is linear in the number of encoded elements in the challenge query. Ateniese et al. (Ateniese G., et al., 2011) introduced a PDP model with two advantages: lightweight operation and robustness. Their challenge/response protocol transmits a small, constant amount of data, which minimizes network communication. Furthermore, it incorporates mechanisms for mitigating arbitrary amounts of data corruption. On the other hand, it relies on the Reed-Solomon encoding scheme, in which the time required for encoding and decoding an n-block file is O(n^2).


Lou et al. (Cao, Yu, Yang, Lou, & Hou, 2012) proposed a public auditing scheme that is suitable for distributed storage systems with concurrent user’s access. In order to efficiently recover the exact form of any corrupted data, they utilized the exact repair method (Rashmi, Shah, Kumar, & Ramchandran, 2009) where the newly generated blocks are the same as those previously stored blocks. So, no verification tags need to be generated on the fly for the repaired data. Consequently, it relieves the data owner from the online burden. However, their scheme increases the storage overheads at each server, uses an additional repair server to store the original packets besides the encoded packets, and does not support data dynamics.

Auditing Schemes for Multiple-Copy Data

Barsoum and Hasan (Barsoum & Hasan, 2011; Barsoum & Hasan, 2012) proposed two dynamic multi-copy PDP schemes: Tree-Based and Map-Based Dynamic Multi-Copy PDP (TB-DMCPDP and MB-DMCPDP, respectively). These schemes prevent the CSP from cheating and maintaining fewer copies, using the diffusion property of the AES encryption scheme. The TB-DMCPDP scheme is based on the Merkle hash tree (MHT), whereas the MB-DMCPDP scheme is based on a map-version table to support outsourcing of dynamic data. The setup cost of the TB-DMCPDP scheme is higher than that of the MB-DMCPDP scheme. On the other hand, the storage overhead is independent of the number of copies for the MB-DMCPDP scheme, while it is linear in the number of copies for the TB-DMCPDP scheme. However, the authorized users should know the replica number in order to generate the original file, which may require the CSP to reveal its internal structure to the users. Zhu et al. (Zhu, Hu, Ahn, & Yu, 2012) proposed a co-operative provable data possession scheme (CPDP) for multi-cloud storage integrity verification, along with two fundamental techniques: hash index hierarchy (HIH) and homomorphic verifiable response (HVR). Using the hash index hierarchy, multiple responses to the clients' challenges that are computed by multiple CSPs can be combined into a single response as the final result. The homomorphic verifiable response supports distributed cloud storage in a multi-cloud storage environment and implements an efficient collision-resistant hash function. Wang and Zhang (Wang & Zhang, 2014) proved that the CPDP (Zhu, Hu, Ahn, & Yu, 2012) is insecure because it does not satisfy knowledge soundness, i.e., any malicious CSP or malicious organizer is able to pass the verification even if they have deleted all the stored data. Additionally, the CPDP does not support data dynamics. Etemad and Kupcu (Etemad & Kupcu, 2013) proposed a distributed and replicated DPDP (DR-DPDP) that provides transparent distribution and replication of the data over multiple servers, where the CSP may hide its internal structure from the client. This scheme uses persistent rank-based authenticated skip lists to handle dynamic data operations more efficiently, such as insertion, deletion, and modification. On the other hand, DR-DPDP has three noteworthy disadvantages: First, it only supports private auditing. Second, it does not support the recovery of corrupted data. Third, the organizer is a central entity that may get overloaded and may cause a bottleneck. Mukundan et al. (Mukundan, Madria, & Linderman, 2012) proposed a Dynamic Multi-Replica Provable Data Possession scheme (DMR-PDP) that uses the Paillier probabilistic encryption for replica differentiation, so that it prevents the CSP from cheating and maintaining fewer copies than what is paid for. DMR-PDP also supports efficient dynamic operations, such as block modification, insertion, and deletion on data replicas over cloud servers. However, it supports only private auditing and does not provide any security proofs.


Chen and Curtmola (Chen & Curtmola, 2013) proposed a remote data checking scheme for replication-based distributed storage systems, called RDC-SR. RDC-SR enables server-side repair and places a minimal load on the data owner who only has to act as a repair coordinator. In RDC-SR, each replica constitutes a masked/encrypted version of the original file in order to support replica differentiation. In order to overcome the replicate-on-the-fly (ROTF) attack, they make replica creation a more timeconsuming process. However, RDC-SR has three remarkable limitations: First, the authorized users must know the random numbers used in the masking step in order to generate the original file. Second, it only supports private auditing. Third, it works only on static data.

Static vs. Dynamic Data Auditing

Considering data persistence, existing auditing schemes can be categorized into auditing schemes that support only static archived data, such as (Chen & Curtmola, 2013; Shacham & Waters, 2013; Yuan & Yu, 2013), and auditing schemes that support data dynamics, i.e., insertion, deletion and modification (Abo‐alian, Badr, & Tolba, 2016). For enabling dynamic operations, existing data storage schemes utilize different authenticated data structures, including: (1) the Merkle hash tree (Merkle, 1980), (2) the balanced update tree (Zhang & Blanton, 2013), (3) the skip list (Pugh, 1990; Goodrich, Tamassia, & Schwerin, 2001; Erway, Küpçü, Papamanthou, & Tamassia, 2009) and (4) the map-version table (index table) (Chen & Curtmola, 2013; Fujisaki & Okamoto, 1999), which are illustrated in detail as follows:

Merkle Hash Tree

The Merkle Hash Tree (MHT) (Merkle, 1980) is a binary tree structure used to efficiently verify the integrity of the data. As illustrated in Figure 2, the MHT is a tree of hashes where the leaves of the tree are the hashes of the data blocks. Wang et al. (Wang, Wang, Ren, Lou, & Li, 2011) proposed a public auditing protocol that supports fully dynamic data operations by manipulating the classic MHT construction for block tag authentication in order to achieve efficient data dynamics. They also achieve batch auditing, where different users can delegate multiple auditing tasks to be performed simultaneously by the TPA. Unfortunately, their protocol does not maintain data privacy, and the TPA could derive users' data from the information collected during the auditing process. Additionally, their protocol does not support data recovery in case of data corruption. Lu et al. (Liu, Gu, & Lu, 2011) addressed the security problem of the previous auditing protocol (Wang, Wang, Ren, Lou, & Li, 2011) in the signature generation phase, which allows the CSP to deceive the auditor by using blocks from different files during verification. They presented a secure public auditing protocol based on a homomorphic hash function and the BLS short signature scheme, which is publicly verifiable, supports data dynamics, and also preserves privacy. However, their protocol suffers from massive computational and communication costs. Wang et al. (Wang, Chow, Wang, Ren, & Lou, 2013) proposed a privacy-preserving public auditing scheme using random masking and homomorphic linear authenticators (HLAs) (Ateniese, Kamara, & Katz, 2009). Their auditing scheme also supports data dynamics using the MHT, and it enables the auditor to perform audits for multiple users simultaneously and efficiently. Unfortunately, their scheme is vulnerable to a TPA offline guessing attack.
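The following is a minimal sketch of the MHT idea, assuming for brevity that the number of leaves is a power of two: the verifier keeps only the root hash and can check any single block against it using a short audit path of sibling hashes. This illustrates the data structure only, not any particular auditing protocol discussed above.

```python
# Build a Merkle hash tree over file blocks and verify one block with its audit path.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(blocks):
    """Return all tree levels, from the leaf hashes up to the root."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def audit_path(levels, index):
    """Sibling hashes needed to recompute the root for the leaf at `index`."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1                               # sibling sits next to the node at this level
        path.append((level[sibling], sibling < index))    # (hash, sibling-is-on-the-left?)
        index //= 2
    return path

def verify_block(block, path, root):
    """Recompute the root from the block and its audit path and compare."""
    node = h(block)
    for sibling, is_left in path:
        node = h(sibling + node) if is_left else h(node + sibling)
    return node == root

blocks = [b"block0", b"block1", b"block2", b"block3"]
levels = build_levels(blocks)
root = levels[-1][0]                                       # only the root needs to be kept by the verifier
assert verify_block(b"block2", audit_path(levels, 2), root)        # intact block passes
assert not verify_block(b"tampered", audit_path(levels, 2), root)  # modified block fails
```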


Figure 2. Merkle hash tree Merkle, 1980.

Balanced Update Tree

Zhang and Blanton (Zhang & Blanton, 2013) proposed a new data structure, called the "balanced update tree," to support dynamic operations while verifying data integrity. In the update tree, each node corresponds to a range of data blocks on which an update (i.e., insertion, deletion, or modification) has been performed. The challenge with constructing such a tree was to ensure that: (1) a range of data blocks can be efficiently located within the tree, and (2) the tree remains balanced after applying the updates caused by clients' queries, i.e., the size of this tree is independent of the overall file size, as it depends only on the number of updates. However, it introduces more storage overhead on the client. Besides, the auditing scheme requires the retrieval, i.e., downloading, of data blocks, which leads to high communication costs. Figure 3 illustrates an example of balanced update tree operations.

Skip List

A skip list (Pugh, 1990) is a hierarchical structure of linked lists that is used to store an ordered set of items. An authenticated skip list (Goodrich, Tamassia, & Schwerin, 2001) is constructed using a collision-resistant hash function and keeps a hash value in each node. Due to the collision resistance of the hash function, the hash value of the root can be used later for validating integrity. Figure 4 (Erway, Küpçü, Papamanthou, & Tamassia, 2009) illustrates an example of a rank-based authenticated skip list, where the number inside each node represents its rank, i.e., the number of nodes at the bottom level that can be reached from that node. Lu et al. (Liu, Gu, & Lu, 2011) used the skip list structure to support data dynamics in their PDP model, which reduces the computational and communication complexity from log(n) to constant. However, the use of a skip list creates some additional storage overhead, i.e., about 3.7% of the original file at the CSP side and 0.05% at the client side. Additionally, it only supports private auditing.


Figure 3. Example of balanced update tree operations Zhang & Blanton, 2013.

Figure 4. Rank-based authenticated skip list Erway, Küpçü, Papamanthou, & Tamassia, 2009.

Index Table

A map-version table, or index table, is a small data structure created by the owner and stored on the verifier side to validate the integrity and consistency of all file copies stored by the CSP (Barsoum & Hasan, 2011). The map-version table consists of three columns: serial number (SN), block number (BN), and version number (VN). The SN is an index to the file blocks that indicates the physical position of a block in the data file. The BN is a counter used to give a logical numbering/indexing to the file blocks. The VN indicates the current version of each file block. Yang et al. (Li, Chen, Tan, & Yang, 2012; Li, Chen, Tan, & Yang, 2013) proposed a fully dynamic PDP scheme that uses a map-version table to support data block updates. They then discussed how to extend their scheme to support other features, including public auditing, privacy preservation, fairness, and multiple-replica checking.
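A minimal, hypothetical sketch of such a map-version table follows, based only on the SN/BN/VN description above (the class and method names are illustrative, not taken from any cited scheme): modification bumps a block's version, insertion assigns a fresh logical block number, and the serial numbers always track physical positions.

```python
# Simplified map-version (index) table kept on the verifier side: one row per block.
from dataclasses import dataclass

@dataclass
class Row:
    SN: int   # physical position of the block in the file
    BN: int   # logical block number (never reused)
    VN: int   # current version of the block

class IndexTable:
    def __init__(self, num_blocks: int):
        self.rows = [Row(SN=i, BN=i, VN=1) for i in range(num_blocks)]
        self.next_bn = num_blocks

    def modify(self, position: int):
        self.rows[position].VN += 1            # an update only bumps the version

    def insert(self, position: int):
        self.rows.insert(position, Row(SN=position, BN=self.next_bn, VN=1))
        self.next_bn += 1
        self._renumber()

    def delete(self, position: int):
        del self.rows[position]
        self._renumber()

    def _renumber(self):
        for i, row in enumerate(self.rows):    # SN always reflects the physical position
            row.SN = i

table = IndexTable(4)
table.modify(2)      # block at position 2 is now version 2
table.insert(1)      # a new logical block (BN = 4) is inserted at position 1
```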


Recently, Yang and Xiaohua (Yang & Xiaohua, 2014) presented Third-party Storage Auditing Scheme (TSAS) which is a privacy preserving auditing protocol. It also supports data dynamic operations using Index Table (ITable). Moreover, they add a new column to the index table, Ti, that is the timestamp used for generating the data tag to prevent the replay attack. They applied batch auditing for integrity verification for multiple data owners in order to reduce the computational cost on the auditor. However, this scheme moved the computational loads of the auditing from the TPA to the CSP, i.e., pairing operation, which introduced high computational cost on the CSP side.

ACCESS CONTROL METHODS

Despite the cost effectiveness and reliability of cloud storage services, data owners may consider them an uncertain storage pool outside their enterprises. Data owners may worry that their data could be misused or accessed by unauthorized users as a result of sharing the cloud infrastructure. An important aspect for the cloud service provider is to have in place an access control method that ensures the confidentiality and privacy of the owners' data, i.e., that their data are safe from unauthorized access and disclosure (Abo-alian, Badr, & Tolba, 2016). Generally, access control can be defined as restricting access to resources/data to privileged/authorized entities (Menezes, Van Oorschot, & Vanstone, 1996). Various access control models have emerged, including Discretionary Access Control (DAC) (Li N., 2011), Mandatory Access Control (MAC) (McCune, Jaeger, Berger, Caceres, & Sailer, 2006), and Role-Based Access Control (RBAC) (Ferraiolo, Kuhn, & Chandramouli, 2003). In these models, subjects (e.g., users) and objects (e.g., data files) are identified by unique names, and access control is based on the identity of the subject or its roles. DAC, MAC, and RBAC are effective for closed and relatively unchangeable distributed systems that deal only with a set of known users who access a set of known services, where the data owner and the service provider are in the same trust domain (Cha, Seo, & Kim, 2012). In cloud computing, the relationship between services and users is more ad hoc and dynamic; service providers and users are not in the same security domain, and users are usually identified by their characteristics or attributes rather than by predefined identities (Cha, Seo, & Kim, 2012). Therefore, a cryptographic solution, such as encrypting the data before outsourcing, can be the trivial solution for keeping sensitive data confidential against unauthorized users and untrusted CSPs. Unfortunately, cryptographic solutions alone are inefficient for encrypting a file to multiple recipients and fail to support fine-grained access control, i.e., granting differential access rights to a set of users and allowing flexibility in specifying the access rights of individual users. Moreover, traditional ACL-based access control methods require attaching a list of authorized users to every data object. When ACLs are enforced with cryptographic methods, the complexity of each data object, in terms of its ciphertext size and/or the corresponding data encryption operations, is linear in the number of users in the system, which makes the system less scalable (Yu, 2010). The following sub-sections overview the existing access control methods and highlight their main advantages and drawbacks.

Traditional Encryption

One method for enforcing access control and assuring data confidentiality is to store sensitive data in encrypted form. Only users authorized to access the data have the required decryption key.


There are two main classes of encryption schemes: 1) symmetric key encryption and 2) public key encryption (Fujisaki & Okamoto, 1999). Several schemes (Kallahalla, Riedel, Swaminathan, Wang, & Fu, 2003; Vimercati, Foresti, Jajodia, Paraboschi, & Samarati, 2007; Goh, Shacham, Modadugu, & Boneh, 2003) have been proposed in the area of access control of outsourced data, addressing the issue of data access control with conventional symmetric-key or public-key cryptography. Although these schemes are suitable for conventional file systems, most of them are less suitable for fine-grained data access control in large-scale data centers, which may have a large number of users and data files. Obviously, neither class should be applied directly while sharing data on cloud servers, since they are inefficient in encrypting a file to multiple recipients in terms of key size, ciphertext length and computational cost for encryption. Moreover, they fail to support fine-grained attribute-based access control and key delegation.
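The per-recipient cost that makes this "trivial" approach hard to scale can be seen in a short, hypothetical sketch: the file is encrypted once under a data key, but that data key must then be wrapped separately for every authorized user, so the key-management work and the wrapped-key list grow linearly with the number of recipients. The sketch assumes the third-party `cryptography` package and uses symmetric key wrapping for brevity; it is an illustration of the scaling problem, not any scheme cited in this chapter.

```python
# Envelope encryption to multiple recipients: O(number of users) wrapped keys per file.
from cryptography.fernet import Fernet

def share_file(plaintext: bytes, user_keys: dict):
    data_key = Fernet.generate_key()                 # one symmetric key for the file
    ciphertext = Fernet(data_key).encrypt(plaintext)
    # one wrapped copy of the data key per authorized user -> linear overhead
    wrapped_keys = {user: Fernet(k).encrypt(data_key) for user, k in user_keys.items()}
    return ciphertext, wrapped_keys

def read_file(user: str, user_key: bytes, ciphertext: bytes, wrapped_keys: dict) -> bytes:
    data_key = Fernet(user_key).decrypt(wrapped_keys[user])
    return Fernet(data_key).decrypt(ciphertext)

users = {name: Fernet.generate_key() for name in ("alice", "bob", "carol")}
ct, wrapped = share_file(b"sensitive record", users)
assert read_file("bob", users["bob"], ct, wrapped) == b"sensitive record"
```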

Broadcast Encryption

In broadcast encryption (BE), a sender encrypts a message for some subset S of users who are listening on a broadcast channel, so that only the recipients in S can use their private keys to decrypt the message. The problem of practical broadcast encryption was first formally studied by Fiat and Naor in 1994 (Fiat & Naor, 1994). Their BE system is secure against a collusion of k users, which means that it may be insecure if more than k users collude. Since then, several solutions have been described in the literature, such as the schemes presented by Boneh et al. (Boneh, Gentry, & Waters, 2005) and Halevy and Shamir (Halevy & Shamir, 2002), which are the best known BE schemes. However, the efficiency of both schemes depends on the size of the authorized user set, and they also require the broadcaster/sender to refer to its database of user authorizations. Paillier et al. (Delerable, Paillier, & Pointcheval, 2007) proposed a dynamic public-key broadcast encryption scheme that simultaneously benefits from the following properties: receivers are stateless; encryption is collusion-secure for arbitrarily large collusions of users and security is tight in the standard model; and new users can join dynamically, i.e., without modification of user decryption keys or of the ciphertext size. Recently, Seberry et al. (Kim, Susilo, Au, & Seberry, 2013) proposed a semi-static secure broadcast encryption scheme with constant-sized private keys and ciphertexts that improves on the scheme introduced by Gentry and Waters (Gentry & Waters, 2009); it reduces the private key and ciphertext sizes by half. In addition, the sizes of the public key and the private key do not depend on the total number of users. Unfortunately, it is only secure against adaptive chosen plaintext attacks (CPA). A BE system thus achieves one-to-many encryption with acceptable performance. However, it may not be applied directly while sharing data on cloud servers, since it fails to support attribute-based access control and key delegation.

Identity-Based Encryption

Identity-based encryption (IBE) was proposed by Shamir in 1984. However, constructing a fully functional IBE scheme remained an open problem for many years, until the scheme proposed by Boneh and Franklin. IBE can be defined as a type of public-key cryptography (PKC) in which any arbitrary string corresponding to unique user information is a valid public key, such as an email address or a physical IP address. The corresponding private key is computed by a trusted third party (TTP) called the private key generator (PKG), as illustrated in Figure 5 (Wikipedia, 2014).


Figure 5. Example of an identity- based encryption Wikipedia, 2014.

Compared with the traditional PKC, the IBE system eliminates online look-ups for the recipients authenticated public key. However, an IBE system introduces several problems: First, there is only one PKG to distribute private keys to each user which introduces the key escrow problem, i.e., the PKG knows the private keys of all users and may decrypt any message. Second, the PKG is a centralized entity; it may get overloaded and can cause a bottleneck. Third, if the PKG server is compromised, all messages used by that server are also compromised. Recently, Lou et al. (Li J., Chen, Jia, & Lou, 2013) proposed a revocable IBE scheme that handles the critical issue of overhead computation at the Private Key Generator (PKG) during user revocation. They employ a hybrid private key for each user, in which an AND gate is involved to connect and bond the identity component and the time component. At first, the user is able to obtain the identity component and a default time component, i.e., PKG can issue his/her private key for a current time period. Then, unrevoked users need to periodically request a key update for the time component to a newly introduced entity named Key Update Cloud Service Provider (KU-CSP) which introduces further communication costs.

Hierarchical Identity-Based Encryption

Horwitz and Lynn (Horwitz & Lynn, 2002) introduced the concept of a hierarchical identity-based encryption (HIBE) system in order to reduce the workload on the root PKG. They proposed a two-level HIBE scheme in which the root PKG needs only to generate private keys for domain-level PKGs that, in turn, generate private keys for all the users in their domains at the next level. Their scheme has chosen-ciphertext security in the random oracle model. In addition, it achieves total collusion resistance at the upper levels and partial collusion resistance at the lower levels.


Gentry and Halevi (Gentry & Halevi, 2009) proposed an HIBE scheme with total collusion resistance at an arbitrary number of levels, which has chosen-ciphertext security in the random oracle model under the BDH assumption and key randomization. It is noteworthy that their scheme has a valuable feature, namely one-to-many encryption: an encrypted file can be decrypted by a recipient and all his ancestors, using their own secret keys, respectively. However, the length of ciphertexts and private keys, as well as the time of encryption and decryption, grows linearly with the depth of a recipient in the hierarchy. Figure 6 (Gagné, 2011) illustrates an example of HIBE in which the root PKG generates the system parameters for the HIBE system and the secret keys for the lower-level PKGs, which, in turn, generate the secret keys for the entities in their domains at the bottom level. In other words, a user public key is an ID-tuple, which consists of the user's identity (ID) and the IDs of the user's ancestors. Each PKG uses its secret keys (including a master key and a private key) and a user public key to generate secret keys for each user in its domain. Liu et al. (Liu, Wang, & Wu, 2010) utilized the one-to-many encryption feature of (Gentry & Halevi, 2009) and proposed a scheme for efficient sharing of secure cloud storage services. In their scheme, a sender can specify several users as the recipients of an encrypted file by taking the number and public keys of the recipients as inputs of an HIBE system. Using their scheme, the sender needs to encrypt a file only once and store only one copy of the corresponding ciphertext, regardless of the number of intended recipients. The limitation of their scheme is that the length of ciphertexts grows linearly with the number of recipients, so it can only be used for a confidential file involving a small set of recipients. Recently, Zhang et al. (Mao, Zhang, Chen, & Zhan, 2013) presented a new HIBE system in which the ciphertext sizes as well as the decryption costs are independent of the hierarchy depth, i.e., constant-length ciphertexts and a constant number of bilinear map operations in decryption. Moreover, their scheme is fully secure in the standard model. The HIBE system obviously achieves key delegation, and some HIBE schemes achieve one-to-many encryption with adequate performance. However, it may not be applied directly to sharing data on cloud servers because it fails to efficiently support fine-grained access control.

Figure 6. Hierarchical identity-based encryption (Gagné, 2011).
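The one-to-many property just described (a file encrypted for a recipient can be decrypted by that recipient and every ancestor in the hierarchy) reduces, at the level of identities, to a prefix relation over ID-tuples. The following is a minimal, non-cryptographic Python sketch of that relation only; the identities are hypothetical and the actual HIBE key derivation is not modelled.

```python
def can_decrypt(decryptor_id_tuple, recipient_id_tuple):
    """True if the decryptor is the recipient or one of its ancestors,
    i.e., the decryptor's ID-tuple is a prefix of the recipient's ID-tuple."""
    d, r = decryptor_id_tuple, recipient_id_tuple
    return len(d) <= len(r) and r[:len(d)] == d

# Hypothetical hierarchy: root PKG -> domain PKG -> department -> user
recipient = ("root-pkg", "example-domain", "radiology", "alice")

print(can_decrypt(("root-pkg", "example-domain"), recipient))              # True (ancestor)
print(can_decrypt(recipient, recipient))                                   # True (the recipient)
print(can_decrypt(("root-pkg", "example-domain", "oncology"), recipient))  # False
```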


Attribute-Based Encryption

An attribute-based encryption (ABE) scheme is a generalization of the IBE scheme. In an IBE system, a user is identified by only one attribute, i.e., the ID, while in an ABE scheme a user is identified by a set of attributes, e.g., specialty, department, location, etc. Sahai and Waters (Sahai & Waters, 2005) first introduced the concept of ABE schemes, in which a sender encrypts a message specifying an attribute set and a number d, so that only a recipient who has at least d of the specified attributes can decrypt the message. Although their scheme, which is referred to as threshold encryption, is collusion resistant and has selective-ID security, it has three drawbacks: First, it is difficult to define the threshold, i.e., the minimum number of attributes that a recipient must have to decrypt the ciphertext. Second, revoking a user requires redefining the attribute set. Third, it lacks expressiveness, which limits its applicability to larger systems. As extensions of the ABE scheme, two variants have been proposed in the literature: 1) the key-policy ABE (KP-ABE) scheme and 2) the ciphertext-policy ABE (CP-ABE) scheme.

Key-Policy Attribute-Based Encryption

A key-policy attribute-based encryption (KP-ABE) scheme was first proposed by Goyal et al. (Goyal, Pandey, Sahai, & Waters, 2006); it supports any monotonic access formula consisting of AND, OR, or threshold gates and is therefore considered a fine-grained and expressive access control. KP-ABE is a scheme in which the access structure or policy is specified in the users' private keys, while ciphertexts are associated with sets of descriptive attributes. As stated in (Waters, 2011), any monotonic access structure can be represented as an access tree over data attributes. For example, Figure 7 presents an access structure and attribute sets that can be generated in a healthcare application (Yu, Wang, Ren, & Lou, 2010). The data owner encrypts the data file using a selected set of attributes (e.g., diabetes, A, Asian, etc.) before uploading it to the cloud. Only users whose access structure, as specified in their private keys, matches the file attributes can decrypt the file.

Figure 7. Key-policy attribute-based encryption in a healthcare system (Yu, Wang, Ren, & Lou, 2010).


In other words, a user with the access structure diabetes ∧ (Asian ∨ white) can decrypt the data file encrypted under the attributes diabetes and Asian. More recently, the authors of (Yu, Wang, Ren, & Lou, 2010; Si, Wang, & Zhang, 2013) proposed a user-private-key revocable KP-ABE scheme with a non-monotonic access structure, which can be combined with the XACML policy (di Vimercati, Samarati, & Jajodia, 2005) to move the complex access control process to the cloud and to construct a provably secure, publicly verifiable cloud access control scheme. All of the prior work described above assumes the use of a single trusted authority (TA) that manages (e.g., adds, issues, revokes) all attributes in the system domain. This assumption not only may create a load bottleneck but also suffers from the key escrow problem, as the TA can decrypt all the files, causing a privacy disclosure. Thus, Chase (Chase, 2007) provided a construction for a multi-authority ABE scheme that supports many different authorities operating simultaneously, each administering a different set of domain attributes, i.e., handing out secret keys for a different set of attributes (Li, Yu, Ren, & Lou, 2010). However, this scheme is still not ideal because of three main problems: first, there is a central authority that can decrypt all ciphertexts because it masters the system secret keys, so a key escrow problem arises; second, it is very easy for colluding authorities to build a complete profile of all of the attributes corresponding to each global identifier (GID); third, the set of authorities is predetermined. Chase and Chow (Chase & Chow, 2009) proposed a more practical multi-authority KP-ABE system, which removes the trusted central authority to preserve the user's privacy. Their scheme allows the users to communicate with AAs via pseudonyms instead of having to provide their GIDs. Moreover, it prevents the AAs from pooling their data and linking multiple attribute sets belonging to the same user. Yu et al. (Yu, Wang, Ren, & Lou, 2010) combined the techniques of ABE, proxy re-encryption (PRE) (Blaze, Bleumer, & Strauss, 1998), and lazy re-encryption (LRE) (Goh, Shacham, Modadugu, & Boneh, 2003) to allow the data owner to delegate most of the computation tasks involved in user revocation to untrusted CSPs without disclosing the underlying data contents. PRE eliminates the need for direct interaction between the data owner and the users for decryption key distribution, whereas LRE allows the CSP to aggregate the computation tasks of multiple user revocation operations. For example, once a user is revoked, the CSP merely records this fact; only when a data file is actually requested does the CSP re-encrypt the requested files and update the requesting user's secret key. Recently, Zeng et al. (Li, Xiong, Zhang, & Zeng, 2013) proposed an expressive decentralized KP-ABE scheme with a constant ciphertext size, i.e., the ciphertext size is independent of the number of attributes used in the scheme. In their construction, there is no trusted central authority to conduct the system setup, and the access policy can be expressed as any non-monotonic access structure. In addition, their scheme is semantically secure in the so-called Selective-Set model based on the n-DBDHE assumption.
Hohenberger and Waters (Hohenberger & Waters, 2013) proposed a KP-ABE scheme in which ciphertexts can be decrypted with a constant number of pairings, without any restriction on the number of attributes. However, the size of the user's private key is increased by a factor of the number of distinct attributes in the access policy. Furthermore, there is a single trusted authority that generates private keys for users, which violates user privacy and causes a key escrow problem. Unfortunately, in all KP-ABE schemes the data owners have no control over who has access to the data they encrypt, except through their choice of the set of descriptive attributes for the data. Rather, they must trust that the key issuer issues the appropriate keys to grant or deny access to the appropriate users.


Additionally, the size of the user’s private key and the computation costs in encryption and decryption operations depend linearly on the number of attributes involved in the access policy.
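To make the notion of an access tree concrete, the following Python sketch evaluates a small monotonic access structure built from threshold gates (AND is n-of-n, OR is 1-of-n) against a set of descriptive attributes. It models only the policy-matching logic of KP-ABE, not the underlying pairing-based cryptography; the healthcare attributes reuse the example above.

```python
from dataclasses import dataclass, field

@dataclass
class Gate:
    """Threshold gate: satisfied when at least `threshold` children are satisfied.
    AND is threshold == len(children); OR is threshold == 1."""
    threshold: int
    children: list = field(default_factory=list)

def satisfied(node, attributes):
    if isinstance(node, str):                 # a leaf is simply an attribute name
        return node in attributes
    hits = sum(satisfied(child, attributes) for child in node.children)
    return hits >= node.threshold

# Access structure embedded in a user's private key: diabetes AND (Asian OR white)
key_policy = Gate(2, ["diabetes", Gate(1, ["Asian", "white"])])

print(satisfied(key_policy, {"diabetes", "Asian", "A"}))    # True: matching attributes, can decrypt
print(satisfied(key_policy, {"diabetes", "hypertension"}))  # False: cannot decrypt
```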

Ciphertext-Policy Attribute-Based Encryption

In the ciphertext-policy ABE (CP-ABE) scheme introduced by Bethencourt, Sahai, and Waters (Bethencourt, Sahai, & Waters, 2007), the roles of the ciphertexts and keys are reversed in contrast with the KP-ABE scheme: the data owner determines the policy under which the data can be decrypted, while the secret key is associated with a set of attributes. Most of the proposed CP-ABE schemes incur large ciphertext sizes and computation costs in the encryption and decryption operations, which depend at least linearly on the number of attributes involved in the access policy. Therefore, Chen et al. (Chen, Zhang, & Feng, 2011) proposed two CP-ABE schemes, both with constant-size ciphertexts and constant computation costs for an access policy consisting of an AND gate with wildcards. The first scheme is provably CPA-secure in the standard model under the decision n-BDHE assumption, while the second scheme is provably CCA-secure in the standard model under the decision n-BDHE assumption and the existence of collision-resistant hash functions. Yan Zhu et al. (Zhu Y., Hu, Ahn, Huang, & Wang, 2012; Zhu Y., Hu, Ahn, Yu, & Zhao, 2012) proposed a comparison-based encryption scheme that supports a complete set of comparison relations, e.g., =, ≠, ≤, ≥, in the policy specification in order to implement various range constraints on integer attributes, such as temporal and level attributes. They combined proxy re-encryption with CP-ABE to support key delegation and to reduce computational overheads on lightweight devices by outsourcing the majority of decryption operations to the CSP. Their scheme provides O(1)-size private keys and ciphertexts for each range attribute. Additionally, it is provably secure under the RSA and CDH assumptions. However, their scheme depends on a single central authority to conduct the system setup and manage all attributes, and it does not provide a mechanism for efficient user revocation. Zhang and Chen (Zhang & Chen, 2012) proposed the idea of "access control as a service" for public cloud storage, where the data owner controls the authorization, and the PDP (Policy Decision Point) and PEP (Policy Enforcement Point) can be securely delegated to the CSP by utilizing CP-ABE and proxy re-encryption. However, it incurs high communication and setup costs. The main limitations of most traditional CP-ABE schemes are as follows. First, they compromise the user's privacy, since the access structure is embedded in the ciphertext and may reveal the scope of the data file and the authorized users who have access to it. The obvious solution to this problem, as proposed by Nishide et al. (Nishide, Yoneyama, & Ohta, 2008), is to hide the ciphertext policy, i.e., to use a hidden access structure. Subsequently, various efforts have improved the traditional CP-ABE scheme to support privacy-preserving access policies, such as (Doshi & Jinwala, 2012; Qian, Li, & Zhang, 2013; Jung, Li, Wan, & Wan, 2013). The other limitation of the traditional CP-ABE schemes is their dependence on a single central authority for monitoring and issuing users' secret keys. Recently, many CP-ABE schemes have considered multi-authority environments (Jung, Li, Wan, & Wan, 2013; Li, Yu, Zheng, Ren, & Lou, 2013; Yang, Jia, Ren, & Zhang, 2013). To achieve fine-grained and scalable data access control for personal health records (PHRs), Lou et al. (Li, Yu, Zheng, Ren, & Lou, 2013) utilized CP-ABE techniques to encrypt each patient's PHR file while focusing on the multiple-data-owner scenario.
Moreover, they adopted proxy re-encryption and lazy revocation to efficiently support attribute and user revocation. However, the computational costs of the key generation, encryption, and decryption processes are all linear with the number of attributes.


Hierarchical Attribute-Based Encryption

The hierarchical attribute-based encryption (HABE) model, as described in (Wang, Liu, & Wu, 2010; Wang, Liu, Wu, & Guo, 2011), integrates properties of both the HIBE model (Gagné, 2011) and the ABE model (Sahai & Waters, 2005). As illustrated in Figure 8, it consists of a root master (RM) and multiple domains, where the RM functions as the TTP and the domains are enterprise users. More precisely, a domain consists of many domain masters (DMs), corresponding to internal trusted parties (ITPs), and numerous users, corresponding to end users. The RM, whose role closely follows that of the root PKG in an HIBE system, is responsible for the generation and distribution of system parameters and domain keys. The DM, whose role integrates both the properties of a domain PKG in an HIBE system and an AA in an ABE system, is responsible for delegating keys to the DMs at the next level and for distributing secret keys to users. Wang et al. (Wang, Liu, & Wu, 2011) proposed a fuzzy and precise identity-based encryption (FPIBE) scheme that supports full key delegation and requires only a constant number of bilinear map operations during decryption. The FPIBE scheme efficiently achieves flexible access control by combining the HIBE system and the CP-ABE system. Using the FPIBE scheme, a user can encrypt data by specifying a recipient ID set, or an access control policy over attributes, so that only users whose IDs belong to the ID set or whose attributes satisfy the access control policy can decrypt the corresponding data. However, the ciphertext length and the encryption time are proportional to the number of authorized users, and the size of a user's secret key is proportional to the depth of the user in the hierarchy. To support compound and multi-valued attributes in scalable, flexible, and fine-grained access control, Wan et al. (Wan, Liu, & Deng, 2012) proposed hierarchical attribute-set-based encryption (HASBE) by extending ciphertext-policy attribute-set-based encryption (CP-ASBE) with a hierarchical structure of users. HASBE employs multiple value assignments for the access expiration time to deal with user revocation more efficiently. However, the cost of the granting-access operation is proportional to the number of attributes in the key structure.

Figure 8. Hierarchical attribute-based encryption model (Wang, Liu, Wu, & Guo, 2011).


Zhou et al. (Chen, Chu, Tzeng, & Zhou, 2013) proposed a new hierarchical key assignment scheme, called CloudHKA, that addresses the cryptographic key assignment problem of enforcing a hierarchical access control policy over cloud data. CloudHKA possesses many advantages: 1) each user needs to store only one secret key, 2) each user can be flexibly authorized with Read or Write access rights, or both, 3) it supports a dynamic user set and access hierarchy, and 4) it is provably secure against collusion attacks. However, the re-key cost in the case of user revocation is linear with the number of users in the same security class. CloudHKA does not consider expressive user attributes, so it can be considered a coarse-grained access control scheme. Qin Liu et al. (Wang, Liu, & Wu, 2014) recently extended the hierarchical CP-ABE scheme of (Wang, Liu, & Wu, 2010; Wang, Liu, Wu, & Guo, 2011) by incorporating the concept of time to perform automatic proxy re-encryption. More specifically, they proposed a time-based proxy re-encryption (TimePRE) scheme that allows a user's access right to expire automatically after a predetermined period of time, so the data owner can remain offline during user revocation. Unfortunately, the TimePRE scheme has two drawbacks: first, it assumes that there is a global time shared among all entities; second, the user secret key size is O(mn), where m is the number of nodes in the time tree corresponding to the user's effective time period and n is the number of user attributes.

Role-Based Access Control

In a role-based access control (RBAC) system, access permissions are assigned to roles and roles are assigned to users/subjects (Wikipedia, 2014). Roles can be created, modified, or disabled according to the system requirements. Role-permission assignments are relatively stable, while user-role assignments change quite frequently (e.g., personnel moving across departments, reassignment of duties, etc.), so managing user-role assignments is significantly easier than managing the rights of users individually (Ferrara, Madhusudan, & Parlato, 2013). Zhou et al. (Zhou, Varadharajan, & Hitchens, 2011) proposed a role-based encryption (RBE) scheme for secure cloud storage. This scheme specifies a set of roles assigned to the users, each role having a set of permissions. Roles can be defined in a hierarchy, which means a role can have sub-roles (successor roles). The owner of the data can encrypt the private data to a specific role; only the users in the specified role or in predecessor roles are able to decrypt that data. The decryption key size remains constant regardless of the number of roles that the user has been assigned to. However, the decryption cost is proportional to the number of authorized users in the same role.
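The access rule of an RBE scheme, namely that data encrypted to a role is decryptable by members of that role or of any predecessor role, can be illustrated without any cryptography. The Python sketch below uses a hypothetical role hierarchy; it only captures the who-can-decrypt relation, not the constant-size key construction of the cited scheme.

```python
# Map each role to its immediate sub-roles (successor roles); hypothetical hierarchy.
sub_roles = {
    "director": ["physician", "pharmacist"],
    "physician": ["intern"],
    "pharmacist": [],
    "intern": [],
}

def predecessors(role):
    """All roles from which `role` is reachable in the hierarchy, including `role` itself."""
    result = {role}
    changed = True
    while changed:
        changed = False
        for parent, children in sub_roles.items():
            if parent not in result and result & set(children):
                result.add(parent)
                changed = True
    return result

def can_decrypt(user_role, target_role):
    # Data encrypted to `target_role` is readable by that role and all its predecessors.
    return user_role in predecessors(target_role)

print(can_decrypt("director", "intern"))   # True: director is a predecessor of intern
print(can_decrypt("intern", "director"))   # False
```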

COMPARATIVE ANALYSIS OF SECURITY METHODS ON CLOUD DATA STORAGE

This section evaluates the performance of the auditing and access control methods that were presented in the previous sections against different performance criteria.

Performance Analysis of Data Storage Auditing Methods

This sub-section assesses different characteristics of some existing data storage auditing schemes, such as auditor type, support for dynamic data, replication/multiple copies, and data recovery, as illustrated in Table 1. In addition, it evaluates their performance in terms of computational complexity at the CSP and the auditor, storage overheads at the CSP and the auditor, and communication complexity between the auditor and the CSP, as illustrated in Table 3. As shown in Table 1, most auditing schemes focus on a single copy of the file and provide no proof that the CSP stores multiple copies of the data owner's file. Although data owners may need their critical data to be replicated on multiple servers across multiple data centers to guarantee the availability of their data, only a few schemes (Etemad & Kupcu, 2013; Barsoum & Hasan, 2012; Zhu, Hu, Ahn, & Yu, 2012) support auditing for multiple replicas of the data owner's file. Table 2 gives further notation for the cryptographic operations used in the different auditing methods. Let r, n, and k denote the number of replicas, the number of blocks per replica, and the number of sectors per block (in the case of block fragmentation), respectively. s denotes the block size and c denotes the number of challenged blocks. Let λ be the security parameter, which is usually the key size. Let p denote the order of the groups and φ(N) denote the Euler function on the RSA modulus N.

Table 1. Characteristics of data storage auditing schemes

Scheme | Auditor Type | Dynamic Data | Replication | Data Recovery
(Liu, Zhang, & Lun, 2013) | Public | Yes | No | No
(Yang & Xiaohua, 2014) | Public | Yes | No | No
(Ateniese, et al., 2007) | Public | No | No | No
(Wang, Chow, Wang, Ren, & Lou, 2013) | Public | Yes | No | No
(Shacham & Waters, 2013) (BLS) | Public | No | No | Yes
(Shacham & Waters, 2013) (RSA) | Public | No | No | Yes
(Shacham & Waters, 2013) (MAC) | Private | No | No | Yes
(Yuan & Yu, 2013) | Public | No | No | Yes
(Xu & Chang, 2011) | Public | No | No | Yes
(Barsoum & Hasan, 2012) | Public | Yes | Yes | No
(Zhu, Hu, Ahn, & Yu, 2012) | Public | No | Yes | No
(Etemad & Kupcu, 2013) | Private | Yes | Yes | No

Table 2. Notations of cryptographic operations

Notation | Cryptographic Operation
MUL | Multiplication in group G
EXP | Exponentiation in group G
ADD | Addition in group G
H | Hashing into group G
MOD | Modular operation in ZN
Pairing | Bilinear pairing e(u, v)
SEncr | Stream encryption


As shown in Table 3, MAC-based schemes (Shacham & Waters, 2013) require storage overheads for metadata, i.e., block tags, as large as the data block size itself. On the other hand, they have efficient computational complexity at the CSP and the auditor. The homomorphic tags based on BLS signatures (Ateniese, et al., 2007; Shacham & Waters, 2013) are much shorter than the RSA-based homomorphic tags (Shacham & Waters, 2013), while the verification cost, i.e., the computational cost at the auditor, in the BLS-based auditing schemes is higher than that of the RSA-based auditing schemes, because bilinear pairing operations consume more time than other cryptographic operations. To reduce the storage overheads, the data owner can store the tags together with the data blocks only on the CSP. Upon being challenged, the CSP then sends the data proof and the tags to the auditor instead of only the data proof, but this solution increases the communication cost between the CSP and the auditor. In fact, there is a tradeoff between the storage overheads at the auditor and the communication costs between the CSP and the auditor.

Table 3. Performance analysis of different auditing schemes

Scheme | Computational Complexity (CSP) | Computational Complexity (Auditor) | Storage Overhead (CSP) | Storage Overhead (Auditor) | Communication Complexity
(Liu, Zhang, & Lun, 2013) | 1 SEncr + O(c) [MUL + EXP + ADD] | O(c) [MUL + H + EXP] + 2 Pairing | (1+1/k)n | O(1) | O(1)
(Yang & Xiaohua, 2014) | O(c) [ADD + MUL + EXP] + O(k) [EXP + Pairing] | O(c) [H + EXP + MUL] + 2 Pairing | s·n/k | O(1) | O(c)
(Ateniese, et al., 2007) | O(c) [H + MUL + EXP] | O(c) [H + MUL + EXP] | n·|N| | O(λ) | O(3|N|)
(Wang, Chow, Wang, Ren, & Lou, 2013) | O(c) [MUL + EXP + ADD] + H + 1 Pairing | O(c) [MUL + H + EXP] + 2 Pairing | (1+1/k)n | O(1+1/k) | O(kc)
(Shacham & Waters, 2013) (BLS) | O(c) [ADD + MUL + EXP] | O(k) [MUL + EXP] + O(c) [H + EXP] + 2 Pairing | n·|p| | O(λ) | O(c+|p|)
(Shacham & Waters, 2013) (RSA) | O(c) [ADD + MUL + EXP] | O(k) [MUL + EXP] + O(c) [H + EXP] | n·|N| | O(λ) | O(c+|N|+s)
(Shacham & Waters, 2013) (MAC) | O(c) | O(c) H | n·|p| | N/A | O(c+|p|+s)
(Yuan & Yu, 2013) | O(c+s) [MUL + EXP] | O(c) [MUL + EXP] + 4 Pairing | (1+1/s)n | O(1) | O(1)
(Xu & Chang, 2011) | O(s) EXP | 3 EXP + O(c) [ADD + MUL + PRF] | (1+1/s)n | O(1) | O(λ)
(Barsoum & Hasan, 2012) | O(c) [EXP + MUL + ADD] | O(c) [H + MUL + EXP] + O(k) [MUL + EXP] + O(r) ADD + 2 Pairing | 2n | O(2n) | O(kr)
(Zhu, Hu, Ahn, & Yu, 2012) | O(c) Pairing + O(ck) EXP | 3 Pairing + O(k) EXP | n·|p| | O(λ) | O(c+k)
(Etemad & Kupcu, 2013) | O(1 + log nr) [EXP + MUL] | O(1 + log nr) [EXP + MUL] | log nr | O(1) | O(log nr)


The scheme of Yuan and Yu (Yuan & Yu, 2013) avoids the tradeoff between communication costs and storage costs: the messages exchanged between the CSP and the auditor during the auditing procedure consist of a constant number of group elements. However, it requires four bilinear pairing operations for proof verification, which results in a high computational cost at the auditor. The communication cost can be reduced by using short homomorphic tags such as BLS tags (Mukundan, Madria, & Linderman, 2012; Etemad & Kupcu, 2013), which enable the CSP to reduce the communication complexity of the auditing by aggregating the authentication tags of individual file blocks into a single tag. Batch auditing (Yang & Xiaohua, 2014; Wang, Chow, Wang, Ren, & Lou, 2013) can further reduce the communication cost by allowing the CSP to send a linear combination of all the challenged data blocks, whose size is equal to one data block, instead of sending them sequentially. Auditing schemes that support dynamic data (Yang & Xiaohua, 2014; Etemad & Kupcu, 2013; Barsoum & Hasan, 2012; Wang, Chow, Wang, Ren, & Lou, 2013; Liu, Zhang, & Lun, 2013) add more storage overheads at the auditor. MHT-based schemes (Wang, Chow, Wang, Ren, & Lou, 2013) keep the metadata at the auditor side (i.e., the root of the MHT) smaller than schemes that utilize an index table (Yang & Xiaohua, 2014; Barsoum & Hasan, 2012), since the total number of index table entries is equal to the number of file blocks. On the contrary, the computational and communication costs of the MHT-based schemes are higher than those of the table-based schemes. During the dynamic operations of the MHT-based schemes, the data owner sends a modification request to the CSP and receives the authentication paths; the CSP updates the MHT according to the required dynamic operations, regenerates a new root, and sends it to the auditor. On the other hand, for the table-based schemes, the data owner only sends a request to the CSP and updates the index table without the use of any cryptographic operations. However, auditing schemes based on the index table (Yang & Xiaohua, 2014; Barsoum & Hasan, 2012) suffer a performance penalty during insertion or deletion operations, i.e., O(n) in the worst case, because the indexes of all the blocks after the insertion/deletion point change and all the tags of these blocks must be recalculated, while the complexity of insertion or deletion operations when using skip lists (Etemad & Kupcu, 2013) is O(log n).
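As a minimal illustration of the Merkle-hash-tree machinery that the MHT-based schemes rely on (not the exact construction of any cited scheme), the sketch below builds a tree over the file blocks with SHA-256, keeps only the root on the verifier side, and checks a challenged block against its authentication path; duplicating the last node on odd-sized levels is an assumption made for this sketch.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def _levels(blocks):
    """All tree levels, leaves first, up to the root level."""
    level = [_h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                      # duplicate the last node on odd levels
            level = level + [level[-1]]
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def root(blocks):
    return _levels(blocks)[-1][0]

def auth_path(blocks, index):
    """Sibling hashes (with direction flags) from the challenged leaf up to the root."""
    path = []
    for level in _levels(blocks)[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        path.append((level[index ^ 1], index % 2 == 1))   # (sibling, node is right child)
        index //= 2
    return path

def verify(block, path, expected_root):
    node = _h(block)
    for sibling, node_is_right in path:
        node = _h(sibling + node) if node_is_right else _h(node + sibling)
    return node == expected_root

blocks = [b"block-%d" % i for i in range(5)]
r = root(blocks)                      # the auditor stores only this value
proof = auth_path(blocks, 2)          # the CSP returns block 2 together with this path
print(verify(b"block-2", proof, r))   # True
print(verify(b"tampered", proof, r))  # False
```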

Performance Analysis of Access Control Methods

This sub-section presents an evaluation of different access control methods in terms of ciphertext size, user secret key size, decryption cost, user revocation cost, and the existence of multiple authorities. As illustrated in Table 4, user revocation is a very challenging issue in access control methods: it requires re-encryption of the data files accessible to the revoked user and may require updates of the secret keys of all non-revoked users, so user revocation imposes a heavy computation overhead on the data owner and may also require him/her to be always online. Access control methods that utilize proxy encryption, such as (Zhang & Chen, 2012; Li, Yu, Zheng, Ren, & Lou, 2013; Yang, Jia, Ren, & Zhang, 2013; Chen, Chu, Tzeng, & Zhou, 2013), can perform the user revocation operation efficiently. To further improve the complexity of user revocation, hierarchical access control methods such as (Wan, Liu, & Deng, 2012; Zhou, Varadharajan, & Hitchens, 2011) use the concept of key delegation. Ciphertext and secret key sizes are other challenging issues in current access control methods that may cause high storage overheads and communication costs; they usually grow linearly with the number of attributes in the system domain.


Table 4. Performance analysis of different access control methods

Access Control Method | Approach | Ciphertext Size | User Key Size | Decryption Cost | User Revocation Cost | Multiple Authority
(Hohenberger & Waters, 2013) | KP-ABE | Linear with no. of attributes | Linear with no. of attributes | Constant no. of pairings | N/A | No
(Doshi & Jinwala, 2012) | CP-ABE | Constant | Constant | Constant no. of pairings | N/A | No
(Qian, Li, & Zhang, 2013) | CP-ABE with fully hidden access structure | Linear with no. of attributes | Linear with total no. of attributes | Linear with no. of attribute authorities | N/A | Yes
(Li, Yu, Zheng, Ren, & Lou, 2013) | CP-ABE with proxy encryption and lazy re-encryption | Linear with no. of attributes and no. of AAs | Linear with no. of attributes in the secret key | Linear with total no. of attributes | Linear with no. of attributes in the secret key | Yes
(Yang, Jia, Ren, & Zhang, 2013) | CP-ABE with proxy encryption | Linear with no. of attributes in the policy | Linear with total no. of attributes | Constant | Linear with no. of non-revoked users who hold the revoked attribute | Yes
(Wang, Liu, & Wu, 2011) | HIBE + CP-ABE | Linear with no. of users and the max. depth of the hierarchy | Linear with no. of attributes of a user | Constant | N/A | Yes
(Wan, Liu, & Deng, 2012) | CP-ABE + HIBE | Constant | Linear with no. of attributes of a user | Linear with no. of attributes in the key | Constant | Yes
(Chen, Chu, Tzeng, & Zhou, 2013) | HIBE with proxy encryption | Constant | Read and Write keys independent of the number of ciphertexts | Linear with the depth of the user in the hierarchy | Linear with no. of authorities and no. of ciphertexts accessed by the revoked user | Yes
(Zhou, Varadharajan, & Hitchens, 2011) | Hierarchical RBAC | Constant | Constant | Linear with no. of users in the same role | Linear with no. of roles | Yes
(Zhang & Chen, 2012) | CP-ABE + proxy encryption | Constant | Linear with no. of attributes | Constant no. of pairings | Constant | No

The methods of (Zhang & Chen, 2012; Doshi & Jinwala, 2012; Wan, Liu, & Deng, 2012; Chen, Chu, Tzeng, & Zhou, 2013; Zhou, Varadharajan, & Hitchens, 2011) have constant ciphertext sizes, while only the methods of (Doshi & Jinwala, 2012; Zhou, Varadharajan, & Hitchens, 2011) achieve a constant secret key size. In most of the access control methods, the decryption cost scales with the complexity of the access policy or the number of attributes, which is infeasible for lightweight devices such as mobile phones. Some access control methods, such as (Hohenberger & Waters, 2013; Zhang & Chen, 2012; Chen & Curtmola, 2013; Yang, Jia, Ren, & Zhang, 2013; Doshi & Jinwala, 2012; Wang, Liu, & Wu, 2011), have constant decryption costs. Some access control methods, e.g., (Hohenberger & Waters, 2013; Zhang & Chen, 2012; Doshi & Jinwala, 2012), assume the use of a single trusted authority (TA) that manages (e.g., adds, issues, revokes) all attributes in the system domain. This assumption not only may create a load bottleneck but also suffers from the key escrow problem, as the TA can decrypt all the files, causing a privacy disclosure. Instead, recent access control methods, such as (Wang, Liu, & Wu, 2011; Wan, Liu, & Deng, 2012; Chen, Chu, Tzeng, & Zhou, 2013), consider the existence of different entities,


called attribute authorities (AAs), that are responsible for managing different attributes of a person; e.g., the Department of Motor Vehicles can certify whether you can drive, or a university can certify that you are a student. Each AA manages a disjoint subset of attributes, while none of them alone is able to control the security of the whole system.

DISCUSSIONS AND CONCLUDING REMARKS

From the extensive survey in the previous sections, we conclude that data security in the cloud is one of the major issues that act as a barrier to the adoption of cloud storage services. The most critical security concerns relate to data integrity, availability, privacy, and confidentiality. For validating data integrity and availability in cloud computing, many auditing schemes have been proposed under different security levels and cryptographic assumptions; most of them are provably secure in the random oracle model. The main limitations of existing auditing schemes can be summarized as follows: (1) Dealing only with archival, static data files and not considering dynamic operations such as insert, delete, and update; (2) Relying on spot checking, which can detect whether a large fraction of the data stored at the CSP has been corrupted but cannot detect corruption of small parts of the data (see the detection-probability sketch after the feature list below); (3) Relying on the Reed-Solomon encoding scheme to support data recovery in case the data is lost or corrupted, which leads to inefficient encoding and decoding for large files; (4) Supporting only private auditing, which imposes computational and online burdens on the data owners for periodically auditing their data; (5) Verifying only single-copy data files and not considering replicated data files; and (6) Incurring high computational costs and storage overheads. Therefore, to overcome these limitations, an ideal auditing scheme should have the following features:

1. Public Auditing: To enable the data owners to delegate the auditing process to a TPA in order to verify the correctness of the outsourced data on demand.
2. Privacy-Preserving Assurance: To prevent the leakage of the verified data during the auditing process.
3. Data Dynamics: To efficiently allow the clients to perform block-level operations on the data files, such as insertion, deletion, and modification, while maintaining the same level of data correctness assurance and guaranteeing data freshness.
4. Robustness: To efficiently recover from an arbitrary amount of data corruption.
5. Availability and Reliability: To support auditing for distinguishable multi-replica data files, to make sure that the CSP is storing all the data replicas agreed upon.
6. Blockless Verification: To allow the TPA to verify the correctness of the cloud data on demand without possessing or retrieving a copy of the challenged data blocks.
7. Stateless Verification: The TPA should not need to maintain a state between audits.
8. Efficiency: To achieve the following aspects:
   a. Minimum computation complexity at the CSP and the TPA.
   b. Minimum communication complexity between the CSP and the TPA.
   c. Minimum storage overheads at the CSP and the TPA.
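Limitation (2) above can be quantified with the standard spot-checking analysis used throughout the PDP/PoR literature (e.g., Ateniese, et al., 2007): if t out of the n blocks are corrupted and the auditor challenges c blocks chosen uniformly at random without replacement, the probability of catching at least one corrupted block is P = 1 − ∏_{i=0}^{c−1} (n − t − i)/(n − i). The short Python sketch below simply evaluates this expression; the numbers are illustrative and not taken from any cited scheme.

```python
def detection_probability(n, t, c):
    """Probability that challenging c of n blocks (without replacement)
    hits at least one of the t corrupted blocks."""
    p_miss = 1.0
    for i in range(c):
        p_miss *= (n - t - i) / (n - i)
    return 1.0 - p_miss

# 10,000 blocks: 460 random challenges detect 1% corruption with ~99% probability,
# but a single corrupted block is caught with a probability of only about 4.6%.
print(round(detection_probability(10_000, 100, 460), 3))   # ~0.99
print(round(detection_probability(10_000, 1, 460), 3))     # ~0.046
```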


For ensuring data confidentiality in cloud computing, various access control methods have been proposed to restrict access to data and to guarantee that the outsourced data is safe from unauthorized access. However, these methods suffer from many penalties, such as: (1) Heavy computation overheads and a cumbersome online burden on the data owner because of the user revocation operation; (2) Large ciphertext and secret key sizes, as well as high computation costs in the encryption and decryption operations, which depend at least linearly on the number of attributes involved in the access policy; (3) Compromising the users' privacy by revealing some information about the data file and the authorized users in the access policy; and (4) Relying on a single authority for managing the different attributes in the access policy, which may cause a bottleneck and a key escrow problem. Consequently, we propose that ideal access control methods should satisfy the following requirements:

1. Fine-Grained: Granting different access rights to a set of users and allowing flexibility and expressibility in specifying the access rights of individual users.
2. Privacy-Preserving: The access policy does not reveal any information to the CSP about the scope of the data file or the kind or attributes of the users authorized to access it.
3. Scalability: The number of authorized users cannot affect the performance of the system.
4. Efficiency: The method should be efficient in terms of ciphertext size, user secret key size, and the costs of encryption, decryption, and user revocation.
5. Forward and Backward Security: A revoked user should not be able to decrypt new ciphertexts encrypted with the new public key, and newly joined users should be able to decrypt previously published ciphertexts encrypted with the previous public key if they satisfy the access policy.
6. Collusion Resistance: Different users cannot collude with each other and combine their attributes to decrypt the encrypted data.
7. Multiple Authority: To overcome the load bottleneck and key escrow problems, there should be multiple authorities that manage the user attributes and issue the secret keys, rather than a single central trusted authority.

CONCLUSION

Cloud storage services have become extremely attractive due to their cost effectiveness and reliability. However, these services also bring many new challenges for data security and privacy; therefore, cloud service providers have to adopt security practices to ensure that the clients' data is safe. In this chapter, the different security challenges within a cloud storage service were introduced, and a survey of the different methods for assuring data integrity, availability, and confidentiality was presented. Comparative evaluations of these methods were also conducted according to various pre-defined performance criteria. Finally, we suggest the following research directions for data security within cloud computing environments: (1) Integrate attribute-based encryption with role-based access control models such that the user-role and role-permission assignments can be separately constructed using access policies applied to the attributes of users, roles, the objects, and the environment. (2) Develop a context-aware role-based access control model, incorporate it into the Policy Enforcement Point of a cloud, and enable/activate a role only when the user is located within the logical positions, time intervals, and platforms specified, in order to prevent malicious insiders from disclosing the authorized user's identity. (3) Integrate efficient access control and auditing methods with new hardware architectural and virtualization features that can help protect the confidentiality and integrity of the data and resources. (4) Incorporate the relationship between auditing and access control for guaranteeing secure cloud storage services. (5) Extend current auditing methods with data recovery features.

REFERENCES Abo‐alian, A., Badr, N., & Tolba, M. (2015b). Keystroke dynamics‐based user authentication service for cloud computing. Concurrency and Computation. Abo-alian, A., Badr, N., & Tolba, M. (2016d). Hierarchical Attribute-Role Based Access Control for Cloud Computing. The 1st International Conference on Advanced Intelligent System and Informatics (AISI2015), 381-389. Abo-alian, A., Badr, N. L., & Tolba, M. F. (2015a). Auditing-as-a-Service for Cloud Storage. Intelligent Systems, 2014, 559–568. Abo-alian, A., Badr, N. L., & Tolba, M. F. (2016a). Authentication as a Service for Cloud Computing. Proceedings of the International Conference on Internet of things and Cloud Computing (pp. 36-42). ACM. doi:10.1145/2896387.2896395 Abo‐alian, A., Badr, N. L., & Tolba, M. F. (2016b). Integrity as a service for replicated data on the cloud. Concurrency and Computation. Abo-Alian, A., Badr, N. L., & Tolba, M. F. (2016c). Integrity Verification for Dynamic Multi-Replica Data in Cloud Storage. Asian Journal of Information Technology, 15(6), 1056–1072. Ateniese, G., Burns, R., Curtmola, R., Herring, J., Khan, O., Kissner, L., & Song, D. et al. (2011). Remote Data Checking Using Provable Data Possession. ACM Transactions on Information and System Security, 14(1), 121–155. doi:10.1145/1952982.1952994 Ateniese, G., Burns, R., Curtmola, R., Herring, J., Kissner, L., Peterson, Z., & Song, D. (2007). Provable data possession at untrusted stores. The 2007 ACM Conference on Computer and Communications Security (pp. 598-609). ACM. Ateniese, G., Kamara, S., & Katz, J. (2009). Proofs of Storage from Homomorphic Identification Protocols. In Advances in Cryptology–ASIACRYPT (pp. 319–333). Springer Berlin Heidelberg. Attebury, R., George, J., Judd, C., & Marcum, B. (2008). Google Docs: A Review. Against the Grain, 20(2), 14–17. Barsoum, A. F., & Hasan, M. A. (2011). On Verifying Dynamic Multiple Data Copies over Cloud Servers. IACR Cryptology ePrint Archive. Barsoum, A. F., & Hasan, M. A. (2012). Integrity verification of multiple data copies over untrusted cloud servers. The 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (pp. 829-834). IEEE Computer Society.


Berriman, G. B., Deelman, E., Good, J., Juve, G., Kinney, J., Merrihew, A., & Rynge, M. (2013). Creating A Galactic Plane Atlas With Amazon Web Services. arXiv preprint arXiv:1312.6723 Bethencourt, J., Sahai, A., & Waters, B. (2007). Ciphertext-Policy Attribute-Based Encryption. IEEE Symposium on Security and Privacy (pp. 321-334). IEEE. Bhadauria, R., & Sanyal, S. (2012). Survey on Security Issues in Cloud Computing and Associated Mitigation Techniques. International Journal of Computers and Applications, 47(18), 47–66. doi:10.5120/7292-0578 Blaze, M., Bleumer, G., & Strauss, M. (1998). Divertible protocols and atomic proxy cryptography. In Advances in Cryptology—EUROCRYPT’98 (pp. 127–144). Springer Berlin Heidelberg. doi:10.1007/ BFb0054122 Blum, M., Evans, W., Gemmell, P., Kannan, S., & Naor, M. (1994). Checking the correctness of memories. Algorithmica, 12(2), 225–244. doi:10.1007/BF01185212 Boneh, D., Gentry, C., & Waters, B. (2005). Collusion resistant broadcast encryption with short ciphertexts and private keys. Advances in Cryptology–CRYPTO, 2005, 258–275. Borgmann, M., Hahn, T., Herfert, M., Kunz, T., Richter, M., Viebeg, U., & Vowe, S. (2012). On the Security of Cloud Storage Services. Fraunhofer-Verlag. Cao, N., Yu, S., Yang, Z., Lou, W., & Hou, Y. T. (2012). LT Codes-based Secure and Reliable Cloud Storage Service. In Processing of 2012 IEEE INFOCOM (pp. 693–701). IEEE. Cha, B., Seo, J., & Kim, J. (2012). Design of attribute-based access control in cloud computing environment. The International Conference on IT Convergence and Security, 41-50. Chambers, J. (2013). Windows Azure Web Sites. John Wiley & Sons. Chase, M. (2007). Multi-authority attribute based encryption. In Theory of Cryptography (pp. 515–534). Springer Berlin Heidelberg. Chase, M., & Chow, S. (2009). Improving privacy and security in multi-authority attribute-based encryption. The 16th ACM conference on Computer and communications security (pp. 121-130). ACM. Chen, B., & Curtmola, R. (2012). Robust Dynamic Provable Data Possession. The 32nd International IEEE Conference on Distributed Computing Systems Workshops (pp. 515-525). IEEE. Chen, B., & Curtmola, R. (2013). Towards self-repairing replication-based storage systems using untrusted clouds. The 3rd ACM conference on Data and application security and privacy (pp. 377-388). ACM. Chen, C., Zhang, Z., & Feng, D. (2011). Efficient ciphertext policy attribute-based encryption with constant-size ciphertext and constant computation-cost. In Provable Security (pp. 84–101). Springer Berlin Heidelberg. Chen, Y., Chu, C., Tzeng, W., & Zhou, J. (2013). Cloudhka: A cryptographic approach for hierarchical access control in cloud computing. In Applied Cryptography and Network Security (pp. 37–52). Springer Berlin Heidelberg.


Cong, W., Ren, K., Lou, W., & Li, J. (2010). Toward publicly auditable secure cloud data storage services. IEEE Network, 24(4), 19–24. doi:10.1109/MNET.2010.5510914 Delerable, C., Paillier, P., & Pointcheval, D. (2007). Fully collusion secure dynamic broadcast encryption with constant-size ciphertexts or decryption keys. Pairing-Based Cryptography–Pairing, 2007, 39–59. doi:10.1007/978-3-540-73489-5_4 di Vimercati, S. D., Samarati, P., & Jajodia, S. (2005). Policies, models, and languages for access control. In Databases in Networked Information Systems (pp. 225–237). Springer Berlin Heidelberg. doi:10.1007/978-3-540-31970-2_18 Doshi, N., & Jinwala, D. (2012). Hidden access structure ciphertext policy attribute based encryption with constant length ciphertext. In Advanced Computing, Networking and Security (pp. 515–523). Springer Berlin Heidelberg. Doshi, N., & Jinwala, D. (2012). Hidden access structure ciphertext policy attribute based encryption with constant length ciphertext. In Advanced Computing, Networking and Security (pp. 515–523). Springer Berlin Heidelberg. Erway, C., Küpçü, A., Papamanthou, C., & Tamassia, R. (2009). Dynamic provable data possession. The 16th ACM conference on Computer and communications security (pp. 213-222). ACM. Etemad, M., & Kupcu, A. (2013). Transparent Distributed and Replicated Dynamic Provable Data Possession. The 11th international conference on Applied Cryptography and Network Security (pp. 1-18). Springer Berlin Heidelberg. Ferraiolo, D., Kuhn, D. R., & Chandramouli, R. (2003). Role-based access control. Artech House. Ferrara, A., Madhusudan, P., & Parlato, G. (2013). Policy analysis for self-administrated role-based access control. In Tools and Algorithms for the Construction and Analysis of Systems (pp. 432–447). Springer Berlin Heidelberg. Fiat, A., & Naor, M. (1994). Broadcast encryption. In Advances in Cryptology—CRYPTO’93 (pp. 480–491). Springer Berlin Heidelberg. doi:10.1007/3-540-48329-2_40 Fujisaki, E., & Okamoto, T. (1999). Secure integration of asymmetric and symmetric encryption schemes. In Advances in Cryptology (pp. 537–554). Springer Berlin Heidelberg. Gagné, M. (2011). Identity-Based Encryption. In Encyclopedia of Cryptography and Security (pp. 594–596). Springer Science Business Media. Garg, S. K., Versteeg, S., & Buyya, R. (2013). A framework for ranking of cloud computing services. Future Generation Computer Systems, 29(4), 1012–1023. doi:10.1016/j.future.2012.06.006 Gentry, C., & Halevi, S. (2009). Hierarchical identity based encryption with polynomially many levels. In Theory of Cryptography (pp. 437–456). Springer Berlin Heidelberg. Gentry, C., & Waters, B. (2009). Adaptive security in broadcast encryption systems (with short ciphertexts). Advances in Cryptology-EUROCRYPT, 2009, 171–188.


Goh, E., Shacham, H., Modadugu, N., & Boneh, D. (2003). Sirius: Securing remote untrusted storage. Network and Distributed System Security (NDSS) Symposium, 131-145. Gonzalez, C., Border, C., & Oh, T. (2013). Teaching in amazon EC2. The 13th annual ACM SIGITE conference on Information technology education (pp. 149-150). ACM. Goodrich, M. T., Tamassia, R., & Schwerin, A. (2001). Implementation of an authenticated dictionary with skip lists and commutative hashing. DARPA Information Survivability Conference (pp. 68-82). IEEE. doi:10.1109/DISCEX.2001.932160 Goyal, V., Pandey, O., Sahai, A., & Waters, B. (2006). Attribute-based encryption for fine-grained access control of encrypted data. The 13th ACM conference on Computer and communications security (pp. 89-98). ACM. Halevy, D., & Shamir, A. (2002). The LSD broadcast encryption scheme. In Advances in Cryptology— CRYPTO 2002 (pp. 47–60). Springer Berlin Heidelberg. doi:10.1007/3-540-45708-9_4 Hohenberger, S., & Waters, B. (2013). Attribute-based encryption with fast decryption. Public-Key Cryptography–PKC, 2013, 162–179. Horwitz, J., & Lynn, B. (2002). Toward hierarchical identity-based encryption. In Advances in Cryptology—EUROCRYPT 2002 (pp. 466–481). Springer Berlin Heidelberg. doi:10.1007/3-540-46035-7_31 Juels, A., & Kaliski, B. (2007). Pors: Proofs of retrievability for large files. The 2007 ACM Conference on Computer and Communications Security (pp. 584-597). ACM. Jung, T., Li, X., Wan, Z., & Wan, M. (2013). Privacy preserving cloud data access with multi-authorities. In The 2013 IEEE INFOCOM (pp. 2625–2633). IEEE. Kallahalla, M., Riedel, E., Swaminathan, R., Wang, Q., & Fu, K. (2003). Plutus: Scalable Secure File Sharing on Untrusted Storage. 2nd usinex conference on file and storage technologies, 29-42. Kim, J., Susilo, W., Au, M. H., & Seberry, J. (2013). Efficient Semi-static Secure Broadcast Encryption Scheme. Pairing-Based Cryptography–Pairing, 2013, 62–76. Li, C., Chen, Y., Tan, P., & Yang, G. (2012). An Efficient Provable Data Possession Scheme with Data Dynamics. The International Conference on Computer Science & Service System (pp. 706-710). IEEE. Li, C., Chen, Y., Tan, P., & Yang, G. (2013). Towards comprehensive provable data possession in cloud computing. Wuhan University Journal of Natural Sciences. Li, J., Chen, X., Jia, C., & Lou, W. (2013). Identity-based Encryption with Outsourced Revocation in Cloud Computing. IEEE Transactions on Computers, 1–12. Li, M., Yu, S., Ren, K., & Lou, W. (2010). Securing personal health records in cloud computing: Patient-centric and fine-grained data access control in multi-owner settings. In Security and Privacy in Communication Networks (pp. 89–106). Springer Berlin Heidelberg. Li, M., Yu, S., Zheng, Y., Ren, K., & Lou, W. (2013). Scalable and secure sharing of personal health records in cloud computing using attribute-based encryption. IEEE Transactions on Parallel and Distributed Systems, 24(1), 131–143. doi:10.1109/TPDS.2012.97


Li, N. (2011). Discretionary Access Control. In Encyclopedia of Cryptography and Security (pp. 864866). Springer US. Li, Q., Xiong, H., Zhang, F., & Zeng, S. (2013). An expressive decentralizing kp-abe scheme with constant-size ciphertext. International Journal of Network Security, 15(3), 161–170. Liu, F., Gu, D., & Lu, H. (2011). An improved dynamic provable data possession model. The IEEE International Conference on Cloud Computing and Intelligence Systems (pp. 290-295). IEEE. Liu, H., Zhang, P., & Lun, J. (2013). Public Data Integrity Verification for Secure Cloud Storage. Journal of Networks, 8(2), 373–380. doi:10.4304/jnw.8.2.373-380 Liu, Q., Wang, G., & Wu, J. (2010). Efficient sharing of secure cloud storage services. The 10th International Conference on Computer and Information Technology (CIT) (pp. 922-929). IEEE. Mao, Y., Zhang, X., Chen, M., & Zhan, Y. (2013). Constant Size Hierarchical Identity-Based Encryption Tightly Secure in the Full Model without Random Oracles. The 2013 Fourth International Conference on Emerging Intelligent Data and Web Technologies (EIDWT) (pp. 652-657). IEEE. McCune, J. M., Jaeger, T., Berger, S., Caceres, R., & Sailer, R. (2006). Shamon: A system for distributed mandatory access control. 22nd Annual Computer Security Applications Conference (pp. 23-32). IEEE. doi:10.1109/ACSAC.2006.47 Menezes, A. J., Van Oorschot, P. C., & Vanstone, S. A. (1996). Handbook of applied cryptography. CRC Press. doi:10.1201/9781439821916 Merkle, R. C. (1980). Protocols for public key cryptosystms. IEEE Symposium on Security and Privacy (pp. 122-122). IEEE Computer Society. Miller, R. (2010). Amazon Addresses EC2 Power Outages Data Center Knowledge. Retrieved from http:// www.datacenterknowledge.com/archives/2010/05/10/amazon-addresses-ec2-power-outages/ Mukundan, R., Madria, S., & Linderman, M. (2012). Replicated Data Integrity Verification in Cloud. A Quarterly Bulletin of the Computer Society of the IEEE Technical Committee on Data Engineering, 35(4), 55–64. Nishide, T., Yoneyama, K., & Ohta, K. (2008). Attribute-Based Encryption with Partially Hidden Encryptor-Specified Access Structures. In Applied cryptography and network security (pp. 111–129). Springer Berlin Heidelberg. Pandey, U. S., & Anjali, J. (2013). Google app engine and performance of the Web Application. International Journal (Toronto, Ont.), 2(2). Plank, J. S. (1997). A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems. Software, Practice & Experience, 27(9), 995–1012. doi:10.1002/(SICI)1097-024X(199709)27:93.0.CO;2-6 Pugh, W. (1990). Skip lists: A probabilistic alternative to balanced trees. Communications of the ACM, 33(6), 668–676. doi:10.1145/78973.78977


Qian, H., Li, J., & Zhang, Y. (2013). Privacy-Preserving Decentralized Ciphertext-Policy AttributeBased Encryption with Fully Hidden Access Structure. Information and Communications Security (pp. 363-372). Springer International Publishing. Rashmi, K. V., Shah, N. B., Kumar, P. V., & Ramchandran, K. (2009). Explicit construction of optimal exact regenerating codes for distributed storage. 47th Annual Allerton Conference onCommunication, Control, and Computing (pp. 1243-1249). IEEE. doi:10.1109/ALLERTON.2009.5394538 Sahai, A., & Waters, B. (2005). Fuzzy identity-based encryption. Advances in Cryptology–EUROCRYPT, 2005, 457–473. Shacham, H., & Waters, B. (2013). Compact Proofs of Retrievability. Journal of Cryptology, 26(3), 442–483. doi:10.1007/s00145-012-9129-2 Shalabi, S. M., Doll, C. L., Reilly, J. D., & Shore, M. (2011). Patent No. U.S. Patent Application 13/311,278. Washington, DC: US Patent Office. Si, X., Wang, P., & Zhang, L. (2013). KP-ABE Based Verifiable Cloud Access Control Scheme. The 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom) (pp. 34-41). IEEE. Srinivasan, S. (2014). Cloud Computing Providers. In Cloud Computing Basics (pp. 61–80). Springer New York. doi:10.1007/978-1-4614-7699-3_4 Tim, M., Subra, K., & Shahed, L. (2009). Cloud security and privacy. O’Reilly & Associates. Vimercati, S. D., Foresti, S., Jajodia, S., Paraboschi, S., & Samarati, P. (2007). Over-encryption: Management of Access Control Evolution on Outsourced Data. The 33rd international conference on Very large databases (pp. 123-134). VLDB Endowment. Wan, Z., Liu, J., & Deng, R. H. (2012). HASBE: A hierarchical attribute-based solution for flexible and scalable access control in cloud computing. IEEE Transactions on Information Forensics and Security, 7(2), 743–754. doi:10.1109/TIFS.2011.2172209 Wang, C., Chow, S. S., Wang, Q., Ren, K., & Lou, W. (2013). Privacy-Preserving Public Auditing for Secure Cloud Storage. IEEE Transactions on Computers, 62(2), 362–375. doi:10.1109/TC.2011.245 Wang, G., Liu, Q., & Wu, J. (2010). Hierarchical attribute-based encryption for fine-grained access control in cloud storage services. The 17th ACM conference on Computer and communications security (pp. 735-737). ACM. Wang, G., Liu, Q., & Wu, J. (2011). Achieving fine-grained access control for secure data sharing on cloud servers. Concurrency and Computation, 23(12), 1443–1464. doi:10.1002/cpe.1698 Wang, G., Liu, Q., & Wu, J. (2014). Time-based proxy re-encryption scheme for secure data sharing in a cloud environment. Information Sciences, 258, 355–370. doi:10.1016/j.ins.2012.09.034 Wang, G., Liu, Q., Wu, J., & Guo, M. (2011). Hierarchical attribute-based encryption and scalable user revocation for sharing data in cloud servers. Computers & Security, 30(5), 320-331.


Wang, H., & Zhang, Y. (2014). On the Knowledge Soundness of a Cooperative Provable Data Possession Scheme in Multicloud Storage. IEEE Transactions on Parallel and Distributed Systems, 25(1), 264–267. doi:10.1109/TPDS.2013.16 Wang, Q., Wang, C., Ren, K., Lou, W., & Li, J. (2011). Enabling Public Auditability and Data Dynamics for Storage Security in Cloud Computing. IEEE Transactions on Parallel and Distributed Systems, 22(5), 847–859. doi:10.1109/TPDS.2010.183 Waters, B. (2011). Ciphertext-Policy Attribute-Based Encryption: An Expressive, Efficient, and Provably Secure Realization. Public Key Cryptography–PKC, 2011, 53–70. Wikipedia. (2014a, April). ID-based encryption. Retrieved from http://en.wikipedia.org/wiki/IDbased_encryption Wikipedia. (2014b). Role-based access control. Retrieved from http://en.wikipedia.org/wiki/Rolebased_access_control Xu, J., & Chang, E. C. (2011). Towards efficient provable data possession. IACR Cryptology ePrint Archive. Yang, K., & Jia, X. (2012). Data storage auditing service in cloud computing: Challenges, methods and opportunities. World Wide Web (Bussum), 15(4), 409–428. doi:10.1007/s11280-011-0138-0 Yang, K., Jia, X., Ren, K., Zhang, B., & Xie, R. (2013). Dac-macs: Effective data access control for multi-authority cloud storage systems. IEEE Transactions on Information Forensics and Security, 8(11), 1790–1801. doi:10.1109/TIFS.2013.2279531 Yang, K., & Xiaohua, J. (2014). TSAS: Third-Party Storage Auditing Service. In Security for Cloud Storage Systems (pp. 7–37). Springer New York. doi:10.1007/978-1-4614-7873-7_2 Yu, S. (2010). Data sharing on untrusted storage with attribute-based encryption (PhD dissertation). Worcester Polytechnic Institute. Yu, S., Wang, C., Ren, K., & Lou, W. (2010). Achieving secure, scalable, and grained data access control in cloud computing. In The 2010 IEEE INFOCOM (pp. 1–9). IEEE. Yuan, J., & Yu, S. (2013). Proofs of retrievability with public verifiability and constant communication cost in cloud. Proceedings of the 2013 international workshop on Security in cloud computing (pp. 1926). ACM. doi:10.1145/2484402.2484408 Zhang, Y., & Blanton, M. (2013). Efficient dynamic provable possession of remote data via balanced update trees. The 8th ACM SIGSAC symposium on Information, computer and communications security (pp. 183-194). ACM. Zhang, Y., & Chen, J. (2012). Access control as a service for public cloud storage. 32nd International Conference on Distributed Computing Systems Workshops (ICDCSW) (pp. 526-536). IEEE. doi:10.1109/ ICDCSW.2012.65 Zheng, Q., & Xu, S. (2012). Secure and Effcient Proof of Storage with Deduplication. The second ACM conference on data and application security and privacy (pp. 1-12). ACM.

92

 Data Storage Security Service in Cloud Computing

Zhou, L., Varadharajan, V., & Hitchens, M. (2011). Enforcing role-based access control for secure data storage in the cloud. The Computer Journal, 54(10), 1675–1675. doi:10.1093/comjnl/bxr080 Zhu, Y., Ahn, G., Hu, H., Yau, S., An, H., & Hu, C. (2013). Dynamic Audit Services for Outsourced Storages in Clouds. IEEE Transactions on Services Computing, 6(2), 227–238. doi:10.1109/TSC.2011.51 Zhu, Y., Hu, H., Ahn, G., Huang, D., & Wang, S. (2012). Towards temporal access control in cloud computing. In The 2012 IEEE INFOCOM (pp. 2576–2580). IEEE. Zhu, Y., Hu, H., Ahn, G., Yu, M., & Zhao, H. (2012). Comparison-based encryption for fine-grained access control in clouds. The second ACM conference on Data and Application Security and Privacy (pp. 105-116). ACM. Zhu, Y., Hu, H., Ahn, G. J., & Yu, M. (2012). Cooperative Provable Data Possession for Integrity Verification in Multicloud Storage. IEEE Transactions on Parallel and Distributed Systems, 23(12), 2231–2244. doi:10.1109/TPDS.2012.66

93

94

Chapter 5

Workload Management Systems for the Cloud Environment

Eman A. Maghawry Ain Shams University, Egypt

Nagwa L. Badr Ain Shams University, Egypt

Rasha M. Ismail Ain Shams University, Egypt

Mohamed F. Tolba Ain Shams University, Egypt

ABSTRACT

Workload management is a performance management process in which an autonomic database management system on a cloud environment makes efficient use of its virtual resources. Workload management for concurrent queries is one of the challenging aspects of executing queries over the cloud. The core problem is to manage any unpredictable overload with respect to varying resource capabilities and performance. This chapter proposes an efficient workload management system for controlling query execution over a cloud. The chapter presents an architecture that improves the query response time: it handles the users' queries, selects the suitable resources for executing these queries, and manages the life cycle of the virtual resources by responding to any load that occurs on them. This is done by dynamically rebalancing the distribution of queries across the resources in the cloud. The results show that applying this Workload Management System improves the query response time by 68%.

INTRODUCTION

Modern business software and services demand high availability and scalability. This has resulted in an increasing demand for large-scale infrastructures, which generally tend to be highly complicated and expensive for the organizations requesting them. These higher resource and maintenance costs have given rise to a paradigm shift towards cloud computing, whose services are offered and maintained by various providers over the Internet. These service offerings range from software applications to virtualized platforms and infrastructures (Mittal, 2010). Cloud computing is the combination of the parallel and distributed computing paradigms. This distributed computing paradigm is driven by economies of scale, in which a pool of virtualized and managed computing power, storage, platforms, and services is delivered on demand to remote users over the Internet (Foster, Zhao, Raicu, & Lu, 2008). It is helping in
realizing the potential of large-scale data-intensive computing by providing effective scaling of resources (Mian, Martin, Brown, & Zhang, 2011). A cloud computing environment is also widely employed in scientific, business, and industrial applications as a usage and delivery model that relies on the Internet to provide dynamic virtualized resources (Jeong & Park, 2012). It offers the vision of a virtually infinite pool of computing, storage, and networking resources where applications can be scalably deployed (Hayes, 2008). As data continues to grow, it enables remote clients, with differing expectations, to store their data in a cloud storage environment over the Internet. An increasing amount of data is stored in cloud repositories, which provide high availability, accessibility, and scalability (Duggan, Chi, Hacigumus, Zhu, & Cetintemel, 2013). As clouds are built over wide area networks, they often rely on large-scale computer clusters built from low-cost hardware and network equipment, where resources are allocated dynamically amongst the users of the cluster (Yang, Li, Han, & Wang, 2013). The cloud storage environment has therefore resulted in an increasing demand to coordinate access to the shared resources in order to improve the overall performance. Users can purchase traditional data centers, sub-divide hardware into virtual machines, or outsource all of their work to one of many cloud providers (Duggan et al., 2013). As the number of users submitting queries increases, the load and traffic on the virtual resources increase as well, so it is essential to incorporate a mechanism that balances the load across these virtual resources (Somasundaram, Govindarajan, Rajagopalan, & Madhusudhana, 2012). Workload management is the discipline of effectively managing, controlling, and monitoring application workloads across computing systems (Niu, Martin, & Powley, 2009). A workload is a set of requests that access and process data under some constraints. The data access performed by a query can vary from the retrieval of a single record to the scan of an entire file or table. Since the load on the data resources in a cloud can fluctuate rapidly among its multiple workloads, it is impossible for system administrators to manually adjust the system configurations in order to maintain the workloads' objectives during their execution. Managing the query workload automatically in a cloud computing environment is therefore a challenge in satisfying the cloud users. This is done by reallocating resources through admission control in the presence of workload fluctuations (Niu, Martin, Powley, Horman, & Bird, 2006). The workload produced by queries can change very quickly; consequently, this can decrease the overall performance (e.g., the query processing time), depending on the number and type of requests made by remote users. In this case, a cloud service provider must manage the unpredictable workloads by making decisions about which requests from which users are to be executed on which computational resources. These decisions are based on feedback about the progress of the workload or the behavior of the resources, in order to recover from any load imbalance that may occur. This can improve the overall performance during the execution of queries over the distributed resources. Adaptive query processing changes the way in which a query is evaluated while the query is executing over the computational resources (Paton, Aragão, & Fernandes, 2012).
Furthermore, the importance of managing the workload arises from the need to revise resource allocation decisions dynamically. The challenge addressed in this chapter is how to provide a fast and efficient monitoring process for the queries executed over the distributed running resources, and how to respond to any failure or load imbalance that occurs during the queries execution. This is done by generating an assessment plan that redistributes the queries execution over the replicated resources.


This chapter focuses on presenting an enhancement of the Workload Management sub-system of our previous architecture, presented in (Maghawry, Ismail, Badr, & Tolba, 2012), to overcome the challenge of slow query response time. It is beneficial to manage the queries execution after implementing the query optimization sub-system. The main objective of this chapter is to minimize the overall query response time in our query processing architecture presented in (Maghawry et al., 2012).

BACKGROUND

This section reviews related work on query processing and database workloads, regarding how to characterize workloads and monitor query progress. Duggan et al. (2011) present a modeling approach to estimate the impact of concurrency on query performance for analytical workloads. Their solution relies on the analysis of query behavior in isolation, pairwise query interactions, and sampling techniques to predict resource contention. They also introduce a metric that accurately captures, in a single value, the joint effects of disk and memory contention on query performance. In addition, they predict the execution behavior of a time-varying query workload through query interaction timelines, i.e., a fine-grained estimation of the time segments during which discrete query mixes are executed concurrently. Liu and Karimi (2008) proposed a query optimization technique to improve the query response time, together with a selection module for query execution that selects a sub-set of resources and applies ranking functions to improve the execution performance of individual queries; however, their technique did not consider other queries running at the same time or the load imbalance that may occur during the queries execution. In addition, Albuitiu and Kemper (2009) proposed an approach based on the fact that multiple requests executed concurrently may have a positive impact on each other's execution time, due to caching or complementary resource consumption, or may impede each other's execution, e.g., in the case of resource contention. Those impacts are reflected in the execution time of the workload. They applied a monitoring approach that derives those impacts, called synergies between request types, fully automatically at runtime from measured execution times. Thereby, their approach works completely independently of changing synergies or configurations and handles new query types. Gounaris et al. (2005) presented an architecture for Adaptive Query Processing (AQP) whose components communicate with each other asynchronously according to the publish/subscribe model, in order to dynamically re-balance intra-operator parallelism across Grid nodes for both stateful and stateless operations. Wei et al. (2012) presented joint query support in CloudTPS, a middleware layer that stands between a Web application and its data store. The system enforces strong data consistency and scales linearly under a demanding workload composed of join queries and read-write transactions. Walsh et al. (2004) proposed an architecture consisting of a two-level structure of independent autonomic elements that supports modularity, flexibility, and self-management. Individual autonomic elements manage application resource usage to optimize local service-level utility functions, and a global arbiter allocates the resources among the application environments, considering the resource-level utility functions obtained from the managers of the applications. They presented empirical data demonstrating their utility-function scheme in handling realistic, fluctuating Web-based transactional workloads running on a cluster of nodes. A workload management approach is proposed by Krompass et al. (2007) for controlling the execution of individual queries based on realistic customer service-level objectives. To validate their approach, they implemented an experimental system that includes a dynamic execution controller that leverages fuzzy logic. Several techniques have been proposed in the work of Paton et al. (2009) for dynamically re-distributing processor load assignments throughout a computation to take account of varying resource capabilities. They presented a simulation-based evaluation of these autonomic parallelization techniques in a uniform environment. These techniques, which are designed for use in open and unpredictable environments, differ from most work on dynamic load balancing for parallel databases. They also proposed a novel approach to adaptive load balancing based on incremental replication of operator state. Paton et al. (2009, 2012) describe the use of utility functions to coordinate adaptations that assign resources to query fragments from multiple queries, and demonstrate how a common framework can be used to support different objectives, specifically to minimize overall query response times and to maximize the number of queries meeting quality-of-service goals. They proposed an autonomic workload mapper that adaptively assigns tasks in the workload to execution sites and revises the assignment during workload execution on the basis of feedback on the overall progress of the submitted requests. The goal of their autonomic workload mapper is to explore the space of alternative mappings with a view to maximizing utility as measured by the utility function. The utility functions are combined with optimization algorithms that seek to maximize utility for a workload given certain resources. Chen et al. (2011) presented a Merge-Partition (MP) query reconstruction algorithm. Their algorithm is able to exploit data sharing opportunities among concurrent sub-queries, which can reduce the average communication overheads. Lee et al. (2007) presented request window, a mechanism that can detect and employ data sharing opportunities across concurrent distributed queries. By combining multiple similar data requests issued to the same distributed data source into a common data request, request window allows concurrent query execution processes to share the common result data, with the benefits of reduced source burdens and data transfers. Gounaris et al. (2006) addressed the resource scheduling problem for Grid databases in its entirety, allowing for arbitrarily high degrees of partitioned parallelism across heterogeneous nodes, by leveraging and adjusting existing proposals in a practical manner. The contributions of their work include a system-independent algorithm that does not restrict the degree of intra-operator parallelism and that takes into account the heterogeneity and availability of the resources, as well as a comprehensive analysis of the limitations of existing parallel database techniques for solving the resource scheduling problem in Grid settings. A class of load distribution algorithms proposed in the work of Lau et al. (2006) allows a batch of requests to be transferred during each negotiation session. The core of the algorithms is a protocol that ensures that a sender–receiver pair arrives at a suitable batch size. It takes into account the processing speeds of the sender and receiver, as well as their relative workload, thus ensuring the maximal benefit for each negotiation session. Ahmad et al. (2011) proposed an experimental study to highlight how these interactions can be fairly complex.
They argue that workload profiling or individual database staging is not an adequate approach to understanding the performance of the consolidated system; their initial investigations suggest machine learning approaches that use monitored data to model the system. They developed an interaction-aware query scheduler for report-generation workloads. Their scheduler uses two query scheduling techniques that leverage models of query interactions: the first is optimized for workloads where queries are submitted in large batches, while the second targets workloads where queries arrive continuously and scheduling decisions have to be made online.


A framework implementation of a query scheduler that performs workload adaptations in a database management system is presented by Niu et al. (2006). Their system manages multiple classes of queries through admission control to meet their performance goals by allocating database resources in the presence of workload fluctuations. Ilyas et al. (2003) proposed a method for estimating the optimizer compilation time of a query, which is used in their work to monitor the progress of workload analysis tools. Soror et al. (2008) introduced a virtualization design advisor that uses information about the anticipated workloads of each of the database systems to recommend workload-specific configurations offline; in addition, runtime information collected after the deployment of the recommended configurations can be used to refine the recommendation. To estimate the effect of a particular resource allocation on workload performance, they used the query optimizer in a new what-if mode. DITN 'Data In The Network', presented by Raman et al. (2005), uses an alternative intra-fragment parallelism in which each node executes an independent select-project-join, with no tuple exchange between running nodes. This method cleanly handles heterogeneous nodes with different capabilities and adapts well during execution to node failures or load spikes. On the other hand, the adaptive distributed query processing grid service proposed in the work of Porto et al. (2006) dynamically schedules and allocates query execution engine modules onto grid nodes, providing adaptability of query execution to variations in environment conditions. They developed a Configurable Data Integration system (CoDIMS-G), a distributed grid service for the evaluation of scientific queries. Their module is attached to the grid data service layer and provides high-level services for data-intensive grid applications, focusing on query evaluation strategies for grid environments. Shah et al. (2003) presented Flux, a dataflow operator that encapsulates dataflow routing and adaptive state partitioning. Flux is placed between a producer and a consumer in a dataflow pipeline to repartition stateful operators while the pipeline is still running. They presented the Flux architecture with repartitioning policies that can be used for operators under memory loads. Gounaris et al. (2009) proposed adaptive techniques that balance the load across plan partitions supporting intra-operator parallelism; this is done by removing bottlenecks in pipelined plans supporting inter-operator parallelism on grid nodes. Curino et al. (2011) concentrated on scenarios where each logical database places a moderate but non-trivial load on the underlying system, which is what enables consolidation. They proposed techniques and models to estimate resource requirements for combined workloads. They also introduced a method to analyze the resource consumption of each database over time in order to produce an assignment of databases to physical machines. Furthermore, they proposed nonlinear optimization techniques to find assignments of databases onto physical resources that maximize load balance across the consolidated servers and minimize the number of machines required to support the clients' workload.
The authors of Avnur and Hellerstein (2000) and Tian and DeWitt (2004) continually monitor the speed of query operators and use this information to modify the query plan by changing the way tuples are routed. They used a queuing network to define performance metrics for response times and system throughputs, and they also proposed several practical routing policies for a distributed stream management system. The Polar-Star system is proposed by Gounaris et al. (2004) to allocate query processing work to grid compute nodes using intra-operator parallelism. It considers a plan with exchange operators and chooses the degree of parallelism of each exchange operator in a cost-based fashion. In their work, Xiong et al. (2011) addressed the issue of how to intelligently manage the resources in a shared cloud database management system. They proposed SmartSLA, a cost-aware resource management system. SmartSLA consists of two main components: the resource allocation decision module and the system modeling module. The system modeling module uses machine learning techniques to learn a model that describes the potential profit margins for each user under different resource allocations. Based on the learned model, the resource allocation decision module dynamically evaluates the resource allocations in order to achieve optimum profits. They evaluated SmartSLA using the TPC-W benchmark with workload characteristics derived from real-life systems. Several workload management techniques for resource allocation are presented in the works of Mehta and DeWitt (1993), Krompass et al. (2006), and Schroeder et al. (2006) to manage resource allocation for database queries with widely varying resource requirements and capabilities in multi-workload environments. Krompass et al. (2006) presented an adaptive quality-of-service management approach based on an economic model; their model develops a database component that schedules requests depending on their deadline and their associated penalty. The workload manager presented by Subramanian et al. (2000) uses feedback control to adjust and govern resources. It receives data workload performance through an API from performance monitoring created by the application provider. Furthermore, the workload manager uses a proportional controller to specify the suitable resources that must be allocated to a workload. This workload manager has been designed to take advantage of the existing infrastructure of tools, which includes a Process Resource Manager and the Event Monitoring Service; these tools raise alarms when performance goals are not being met. Past research on query processing has dealt only with issues of query optimization, scheduling, or resource allocation, and most of the work in this area has concentrated on processing single queries without considering multi-user and multi-resource issues. Although previous research addresses several issues in query processing and optimization, our proposed architecture combines query optimization and query resource allocation with monitoring of the concurrent queries execution over the running resources. Furthermore, it responds to any load imbalance by applying our workload management system on the cloud environment.
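The utility-driven workload mapping of Paton et al. (2009, 2012) discussed earlier in this section can be pictured with a small sketch. The Python fragment below is only a toy illustration of the idea of scoring alternative fragment-to-site assignments with a utility function: the function names are ours, the exhaustive enumeration stands in for the optimization algorithms used in the cited work, and nothing here reflects an actual implementation.

```python
import itertools
from typing import Callable, Dict, List

def best_mapping(fragments: List[str],
                 sites: List[str],
                 utility: Callable[[Dict[str, str]], float]) -> Dict[str, str]:
    """Enumerate fragment-to-site assignments and return the one with the
    highest utility (e.g. minus the predicted overall response time, or the
    number of queries expected to meet their quality-of-service goals)."""
    best, best_utility = {}, float("-inf")
    for choice in itertools.product(sites, repeat=len(fragments)):
        mapping = dict(zip(fragments, choice))
        value = utility(mapping)
        if value > best_utility:
            best, best_utility = mapping, value
    return best
```

In the cited approaches the utility function is re-evaluated as progress feedback arrives, so the mapping can be revised while the workload is still executing.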

THE PROPOSED SYSTEM ARCHITECTURE

Our proposed architecture overcomes the challenge of slow query response time. This is done by optimizing the sub-queries to detect shared data and then reconstructing these sub-queries to exploit the overlapping data among them. In addition, it determines the order of query execution by applying a scheduling technique and assigns the sub-queries to the appropriate resources. During the queries execution, our system monitors the execution by collecting information about the resources' utilization to diagnose any failure. Using this information, it responds to any load imbalance by re-distributing the failed queries to other suitable resources. Finally, it collects the queries results from the resources and returns them to the users. Our proposed architecture, which is shown in Figure 1, involves three main sub-systems (Maghawry et al., 2012):

1. Query Optimization Sub-System: Accepts queries from remote users and translates each submitted query into an Abstract Query Tree (AQT). It then detects the data sharing among the sub-queries of the submitted queries and merges them based on the exploited data sharing, and it specifies the ordering of the queries execution by applying scheduling techniques. Finally, it allocates the queries to the appropriate available resources on a cloud environment based on ranking functions. It takes into consideration each resource's capability and availability to run the queries concurrently while minimizing contention at the resources. The output of this sub-system is the list of the resources that are responsible for the execution of the queries.

2. The Workload Manager Sub-System: Manages and controls the queries execution on the running resources to ensure that our system executes in an optimal state. To achieve workload balance during execution, this sub-system ensures that resources are used effectively and utilized as fully as possible without overloading any resource. It detects and responds to any load imbalance that may occur throughout the queries execution, taking into consideration the running resources' processor time and utilization. It consists of Observer, Planner, and Responder components to achieve its goals. This chapter focuses on presenting the Workload Manager sub-system, whose main advantages are the improvement of the query execution performance and of the overall query response time. It holds the following main processes (a minimal sketch of their interaction is given after this list):
a. Observer: Collects information about the throughput and utilization values of each running resource during the queries execution. If these collected values exceed specific thresholds, there is a fault with the current execution; the Observer therefore generates a notification that contains these values and the failing resource, and sends it to the Planner process to announce a load imbalance on a specific resource.
b. Planner: Performs the assessment and planning phase. It receives the notification from the Observer and creates an assessment plan to recover from the load and contention on the resources during the queries execution. This is done by assigning the failed queries to another suitable replica to avoid resource overloading. In addition, it takes into consideration the utilization of the other available resources by collecting their values and checking them against the threshold values. Finally, it sends the assessment plan as a notification to the Responder process.
c. Responder: Receives the assessment plan from the Planner as a notification and executes it by running the failed queries on the specific replica chosen by the Planner. Its main goal is to respond to the failure that occurs during the queries execution.

3. Integrator Sub-System: Is responsible for collecting the concurrent queries results from the running resources and then partitioning the results in the case of merged queries. The merged queries are generated by the query optimization sub-system. In the query merging step, all the pairs of original queries that can be merged are recorded so that their answers can be computed in the query partitioning process within the Integrator sub-system. This process is responsible for re-organizing the results of the reconstructed sub-queries into those of the original sub-queries. When the data is returned from the data sources, the integrator decides whether the returned results need a partitioning step or not. In the case of merged queries, the results must be partitioned to obtain the overlapping data and the remaining unshared data between the original queries, as presented in (Chen et al., 2011), and finally the queries results are returned to the remote users.
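The interaction among the Observer, Planner, and Responder described above can be pictured as a simple monitor-plan-respond loop. The following Python sketch is illustrative only: the class and function names (ResourceStats, observe, plan, respond) are our own, the threshold values anticipate those reported later in this chapter, and the snippet is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, List

CPU_THRESHOLD = 0.80      # processor-time limit used by the Observer (80%)
TPS_BASELINE = 54.0       # throughput baseline (transactions per second)

@dataclass
class ResourceStats:
    name: str               # e.g. "VM1"
    cpu_utilization: float  # fraction of time the processor spends on queries
    throughput: float       # requests received per second
    replicas: List[str]     # replicas holding the same data partitions

def observe(stats: ResourceStats) -> bool:
    """Observer: report an overload when both monitored values exceed their thresholds."""
    return stats.cpu_utilization > CPU_THRESHOLD and stats.throughput > TPS_BASELINE

def plan(overloaded: ResourceStats,
         all_stats: Dict[str, ResourceStats],
         failed_queries: List[str]) -> dict:
    """Planner: build an assessment plan that moves the failed queries to the
    least-loaded replica of the overloaded resource."""
    candidates = [all_stats[r] for r in overloaded.replicas if not observe(all_stats[r])]
    if not candidates:                        # fall back to any replica if all are loaded
        candidates = [all_stats[r] for r in overloaded.replicas]
    target = min(candidates, key=lambda s: s.cpu_utilization)
    return {"from": overloaded.name, "to": target.name, "queries": failed_queries}

def respond(assessment: dict, kill_query, dispatch_query) -> None:
    """Responder: abort the queries on the loaded resource and re-run them on the replica."""
    for query in assessment["queries"]:
        kill_query(assessment["from"], query)
        dispatch_query(assessment["to"], query)
```

In the system described in this chapter, the Observer samples these values every 15 seconds, as explained in the implementation section below.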

THE IMPLEMENTATION OF THE PROPOSED SYSTEM

The proposed system accepts queries from the remote users and then applies the query optimization sub-system. This sub-system is responsible for exploiting the data sharing between the submitted queries
through the optimizing and merging processes. In addition, it applies a scheduling technique to determine the queue of query executions, and finally it selects the appropriate resources to run the queries by implementing the resource ranker process. The main goal of this sub-system is to prepare and execute the queries on the available resources. The query optimization sub-system is presented and implemented in (Maghawry et al., 2012). During the queries execution, the importance of managing the workload arises from dynamically revising the resource allocation. To achieve workload balance during the queries execution, our architecture involves the workload manager sub-system, which ensures that the running resources are effectively utilized without overloading any resource, and which detects and responds to any load imbalance that may occur throughout the queries execution. The workload management sub-system is therefore implemented in this chapter through three main processes:

1. Observer: Dynamically revises and monitors the running resources during the query executions. In addition, it detects whether a failure may occur and notifies the Planner with the performance information of the overloaded resource. This is done by the following steps:
a. Firstly, it collects the performance information about the running resources every 15 seconds, which is an ideal interval for benchmarking scenarios (https://software.intel.com/en-us/articles/use-windows-performancemonitor-for-infrastructure-health). The performance information is the percentage of processing time that each resource spends on processing queries. The other measurement is the query throughput, which is the number of requests received per second. Such information gives a good indicator of how much load is being placed on the system, which correlates directly with how much load is being placed on the processor of the resources (Fritchey & Dam, 2009).
b. Secondly, the Observer uses the collected performance information to detect whether there is an overloaded resource during the queries execution. It checks the utilization of the running resources' processors, in other words whether the query throughput exceeds the specific baseline value and whether the processing time on a specific resource exceeds 80% (Fritchey & Dam, 2009).
c. Finally, the Observer generates a notification that contains the updated information of the loaded resource. It then sends this notification to the Planner process to generate an assessment plan for handling the failure that may occur during the execution.

2. Planner: Is implemented to receive the notifications from the Observer when a failure or overload may occur on a specific running resource. This notification includes the updated performance information of that resource. The Planner then creates an assessment plan for handling this load imbalance during the queries execution based on the available replicas. The following steps are used to implement the Planner:
a. Firstly, it initiates when receiving a notification from the Observer process. In this case, it collects the performance information about the available replicas of the loaded resource and then determines the most available unloaded replica to execute the failed queries.
b. Secondly, it generates the assessment plan based on the suitable replica that can execute these queries. The plan includes the loaded resource and the chosen replica that will handle the resource failure. Furthermore, it contains the queries that will be executed on the chosen replica.
c. Finally, the Planner generates a notification about the assessment plan and sends it to the Responder process, which executes this assessment plan with the new query allocation distribution.

3. Responder: Is implemented to receive and carry out the assessment plan sent by the Planner process as a notification. The main goal of this process is to respond to the load imbalance that has occurred during the queries execution. The following steps are used to implement the Responder:
a. Firstly, it receives the notification from the Planner to obtain the assessment plan. It uses the assessment plan to identify the queries and the resource with a failed execution, and it kills the queries execution on the loaded resource.
b. Secondly, it uses the received notification to determine the corresponding replica. Furthermore, it loads the information about this suitable replica from previously stored environment information.
c. Finally, it loads the location information of the corresponding replica and dispatches the queries to the replica.

The main goal of the Workload Management sub-system is to minimize the queries response time by handling any load imbalance that occurs on the running resources during the query execution. The final sub-system is the integration sub-system. It is responsible for dynamically collecting the queries results from the running resources and then partitioning the results in the case of merged queries. Finally, it sends the queries results to the remote users after collecting the results from the distributed resources. This sub-system consists of the following steps (a code sketch of the partitioning step follows this list):
• Firstly, it collects the results returned from the resources and then loads the required information about the running queries to determine whether the query results have been completely returned from the distributed resources.
• Secondly, in the case of merged queries, it loads the information about the original queries from which the new merged query was constructed. The merged queries are generated by the query optimization sub-system; in that step, all the pairs of original queries that can be merged are recorded so that their answers can be computed by the query partitioning process. This process partitions the results to obtain the overlapping data and the remaining unshared data between the original queries, as presented in (Chen et al., 2011).
◦◦ For example, if there is data sharing between two queries q1 and q2 as shown in Figure 2, then after the answers to q*1, q*2, and q*3 are retrieved, the answer to q1 can be computed directly from the answers to q*1 and q*3, and the answer to q2 can be extracted directly from the answers to q*2 and q*3.
• Finally, it prepares the results of the finished queries by collecting the results of the same query from the distributed resources and then sends them to the remote users.
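As a concrete illustration of the partitioning step just described, the small Python sketch below splits the rows returned by a merged query back into the answers of the two original queries, following the q*1/q*2/q*3 decomposition of Figure 2. The function and variable names are ours and the predicates are toy examples; this is not the integrator's actual code.

```python
from typing import Callable, Iterable, List, Tuple

def partition_merged_result(rows: Iterable[dict],
                            pred_q1: Callable[[dict], bool],
                            pred_q2: Callable[[dict], bool]) -> Tuple[List[dict], List[dict]]:
    """Split the result of a merged query into the answers of the two original
    queries.  Rows satisfying both predicates form the shared part q*3; rows
    satisfying only one predicate belong to q*1 or q*2 respectively (Figure 2)."""
    answer_q1, answer_q2 = [], []
    for row in rows:
        if pred_q1(row):
            answer_q1.append(row)   # q*1 together with the shared part q*3
        if pred_q2(row):
            answer_q2.append(row)   # q*2 together with the shared part q*3
    return answer_q1, answer_q2

# Toy usage with two hypothetical original predicates that overlap for 10 < l_quantity < 20.
rows = [{"l_quantity": q} for q in (5, 16, 25)]
q1_rows, q2_rows = partition_merged_result(
    rows,
    pred_q1=lambda r: r["l_quantity"] > 10,
    pred_q2=lambda r: r["l_quantity"] < 20,
)
# q1_rows holds quantities 16 and 25; q2_rows holds 5 and 16 (16 is the overlapping row).
```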

EXPERIMENTAL ENVIRONMENT

The TPC-H database (http://www.tpc.org/tpch/) is used as our dataset (scale factor 1) to test our work. The TPC-H database has eight relations: REGION, NATION, CUSTOMER, SUPPLIER, PART, PARTSUPP, ORDERS, and LINEITEM.


Eight virtual machines are deployed as in Figure 3, which shows their capabilities and the distribution of the relations amongst them, under the assumption that the relations are partitioned horizontally into two parts (e.g., Lineitem_P1 and Lineitem_P2). Furthermore, replicas of each virtual machine are created for our experiment to cover the case of overloading on a specific resource. These virtual machines are connected together to simulate a cloud computing platform. The cloud environment is simulated with the help of VMware Workstation; VMware is a global leader in virtualization and cloud infrastructure (http://www.vmware.com/). Twenty queries using the relations LINEITEM, ORDERS, PART, and CUSTOMER are used in our experiment. Table 1 shows examples of the queries that have been used during our experiment. The ordering of the queries was randomly assigned prior to the execution.

Table 1. Examples of the queries used in our experiment

select p_type,l_extendedprice,l_discount from lineitem, part where l_partkey = p_partkey and l_shipdate >= '01-09-1995'
select p_type,l_extendedprice,l_discount from lineitem, part where l_partkey = p_partkey and l_shipdate < '01-09-1995'
select l_returnflag, l_linestatus from lineitem where l_shipdate < '1998/12/08'
select l_shipmode from orders, lineitem where o_orderkey = l_orderkey and l_commitdate < l_receiptdate and l_shipdate < l_commitdate and l_receiptdate >= '1997-01-01' and l_receiptdate < '1998-01-01'
select o_orderpriority from orders where o_orderdate >= '1993-07-01' and o_orderdate < '1993-07-01' and o_orderstatus = 'F'
select c_name, c_address from customer c, orders o where c.c_custkey = o.o_custkey and o.o_orderdate > '1-1-1994'
select o_order_key, o_orderstatus from orders o, lineitem l where o.o_orderkey = l.l_orderkey and l_shipdate > '1-1-1996'
select c_phone, o_orderstatus from orders o, customer c where o.o_custkey = c.c_custkey and o.o_totalprice > 15000 and o.o_orderstatus = 'F'
select c_name, c_address from customer c, orders o where c.c_custkey = o.o_custkey and o.o_orderdate < '1-1-1992'
select p_type, p_size,l_linenumber from lineitem l, part p where l.l_partkey = p.p_partkey and l_quantity > 15 and l_returnflag = 'A'
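For testing purposes, the environment just described can be captured in a simple lookup structure that the workload manager consults when an overload is detected. The partition-to-VM assignments below are illustrative placeholders (only the Lineitem_P1/Lineitem_P2 naming and the VM1/VM5 replica pairing used later in the evaluation come from this chapter); the actual layout is given in Figure 3.

```python
# Hypothetical description of the simulated cloud: each virtual machine hosts a
# set of horizontal partitions and has a replica that can take over its queries
# when the workload manager detects an overload.
environment = {
    "VM1": {"partitions": ["Lineitem_P1", "Orders_P1"],  "replica": "VM5"},
    "VM2": {"partitions": ["Lineitem_P2", "Part_P1"],    "replica": "VM6"},
    "VM3": {"partitions": ["Orders_P2", "Customer_P1"],  "replica": "VM7"},
    "VM4": {"partitions": ["Customer_P2", "Part_P2"],    "replica": "VM8"},
}

def replica_of(vm: str) -> str:
    """Replica that should execute the failed queries of an overloaded machine."""
    return environment[vm]["replica"]

print(replica_of("VM1"))   # -> VM5, the substitution reported in the evaluation
```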


Each execution was repeated several times and the average values are reported. Between executions, the database management system was re-started to clear all monitor elements, warm up the buffer, and bring the database management system to a steady state. The threshold of the query throughput is specified by estimating the average transactions per second during peak activity and then using this value as a baseline against which the query throughput is compared during any stage of the execution. Microsoft Windows Server 2008 and Microsoft SQL Server (MS SQL Server) are used to deploy the TPC-H database.
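The baseline just described is simply the mean transactions-per-second figure observed under peak activity. The minimal Python sketch below shows how such a baseline could be computed and compared against a later measurement; the sample values are invented so that the baseline comes out at the 54 transactions per second reported in the evaluation.

```python
from statistics import mean

# Peak-activity samples of transactions per second (illustrative values chosen
# so that the baseline matches the 54 tx/s used later in the evaluation).
peak_tps_samples = [52.0, 55.0, 55.0]

baseline = mean(peak_tps_samples)   # -> 54.0 transactions per second
current_tps = 58.0                  # throughput observed at some later stage of execution
print(current_tps > baseline)       # True: the throughput threshold is exceeded
```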

EVALUATIONS

Table 2 shows examples of the resources that are assigned to execute the merged and non-merged queries. This assignment is produced after applying our query optimization sub-system, which is responsible for optimizing and scheduling the queries and then selecting the suitable resources to execute them by applying the resource ranker process presented in (Maghawry et al., 2012).

Table 2. The resources selected to execute the merged and non-merged queries (query → assigned virtual machines)

select l_returnflag, l_linestatus from lineitem where l_shipdate < '1998/12/08' → VM1 – VM2
select l_shipmode from orders, lineitem where o_orderkey = l_orderkey and l_commitdate < l_receiptdate and l_shipdate < l_commitdate and l_receiptdate >= '1997-01-01' and l_receiptdate < '1998-01-01' → VM1 – VM3
select o_orderpriority from orders where o_orderdate >= '1993-07-01' and o_orderdate < '1993-07-01' → VM1 – VM3
select c_name, c_address from customer c, orders o where c.c_custkey = o.o_custkey and o.o_orderdate > '1-1-1994' or o.o_orderdate < '1-1-1992' → VM3 – VM4
select o_order_key, o_orderstatus from orders o, lineitem l where o.o_orderkey = l.l_orderkey and l_shipdate > '1-1-1996' → VM1 – VM3
select c_phone, o_orderstatus from orders o, customer c where o.o_custkey = c.c_custkey and o.o_totalprice > 15000 and o.o_orderstatus = 'F' → VM3 – VM4
select p_type, p_size,l_linenumber from lineitem l, part p where l.l_partkey = p.p_partkey and l_quantity > 15 and l_returnflag = 'A' → VM1 – VM2
select p_type,l_extendedprice,l_discount from lineitem, part where l_partkey = p_partkey and l_shipdate >= '01-09-1995' or l_shipdate < '01-09-1995' → VM1 – VM2

After executing the twenty queries, the VM1 transactions per second exceed 54, which is the baseline value, and its processor utilization exceeds 80%, which means there is a load on VM1 during the queries execution. By applying our Workload Management sub-system, the queries execution is controlled and monitored to handle any failure that may occur. In our experiment, the Observer detects the overloading on the running resource (VM1); it generates a notification with the performance information of this resource and sends it to the Planner. The Planner checks the available replicas of VM1 and decides to select the replica VM5 to execute the failed queries instead of VM1, based on the performance information of each replica. Finally, the Planner sends the notification about the queries with failed execution and the selected replica to the Responder. The Responder then kills the queries execution on the overloaded resource and executes the failed queries on the selected replica VM5. Figure 4 shows that our proposed technique can reduce the queries execution time, as our proposed system combines the query optimization and query resource allocation of the submitted concurrent queries with monitoring of the queries execution by the Workload Management sub-system, which is not considered in (Liu & Karimi, 2008). This experiment was executed three times and the average response time was calculated. The results show that using our proposed Workload Management sub-system reduces the queries execution time by 68% compared with the technique presented in (Liu & Karimi, 2008). The average queries execution times for our proposed technique and the query processing technique in (Liu & Karimi, 2008) are shown in Table 3.
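The reported scenario can be replayed with a few lines of Python. The utilization figures below are illustrative inventions; the chapter only states that VM1 exceeded the 54 transactions/second baseline and 80% processor utilization, and that its failed queries were re-dispatched to the replica VM5.

```python
# Replaying the reported overload scenario with the thresholds used in the experiment.
TPS_BASELINE, CPU_THRESHOLD = 54.0, 0.80

metrics = {
    "VM1": {"tps": 58.0, "cpu": 0.86, "replicas": ["VM5"]},   # illustrative readings
    "VM5": {"tps": 12.0, "cpu": 0.25, "replicas": []},
}

for vm, m in metrics.items():
    if m["tps"] > TPS_BASELINE and m["cpu"] > CPU_THRESHOLD:
        # Planner step: pick the least-loaded replica of the overloaded resource.
        target = min(m["replicas"], key=lambda r: metrics[r]["cpu"])
        print(f"Kill queries on {vm} and re-execute them on {target}")   # VM1 -> VM5
```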

Issues

Cloud computing is becoming the newest evolution of computing. It is a pool of virtualized and managed computing, storage, and services, delivered on demand over the Internet to remote users with different objectives. This leads to an increase in the number of queries that need to access and process data from multiple distributed sources on a cloud environment. Since the load on the data resources in a cloud can fluctuate rapidly among its multiple queries, it is impossible for system administrators to manually adjust the resources' load distribution in order to maintain the workloads during execution. Therefore, an effective query processing technique is required to handle the increasing amount of data on cloud virtual resources. Workload management for concurrent queries is one of the challenging aspects of query execution over the cloud computing environment. The key challenge raised by this problem is how to increase control over the running resources to improve the overall performance and response time of the query execution. In this chapter, the problem of slow query response time is examined. The workloads produced by the queries can change very quickly, which can lead to poor performance and slow query processing times, depending on the type and number of queries submitted by the users. The core problem is to diagnose and manage any unpredictable load imbalance in the resource-query allocations with respect to varying resource capabilities and performance.


SOLUTIONS AND RECOMMENDATIONS

The query processing architecture in this chapter overcomes the challenge of slow query response time by optimizing the queries. Furthermore, it assigns the queries to the appropriate resources after applying scheduling techniques to determine the order of the queries execution. It also manages the queries execution through the proposed Workload Management System in order to respond to any failure that may occur, ensuring that the running resources are utilized effectively and that no resource is overloaded before the results are sent to the remote users.

FUTURE RESEARCH DIRECTIONS

Future work will extend our query processing to handle queries with multiple predicates and of different types. In these cases the queries may become more complex, which means that query scheduling needs to be improved and that the number of virtual resources included in the testing environment should be increased.

CONCLUSION

Workload management is one of the challenging aspects of executing queries over the cloud computing environment. The workloads produced by queries can change very quickly; consequently, this can decrease the overall performance (e.g., the query processing time). The importance of managing the workload arises from dynamically revising the queries' resource allocation, and the key challenge is to manage any unpredictable load imbalance with respect to varying resource capabilities and performance. To achieve workload balance during execution, the Workload Management System must ensure that resources are used effectively and utilized as fully as possible without overloading any resource. The challenge addressed in this chapter is how to provide a fast and efficient monitoring process for the queries executed over the distributed running resources, and how to respond to any failure or load imbalance that may occur during the queries execution. In this chapter, a Workload Management technique is proposed to minimize the overall query execution time over a cloud computing environment. This is done by monitoring the performance of the running resources, such as the processing time of the resources and the query throughput, over the cloud computing environment. Furthermore, the technique detects any load imbalance that may occur across these running resources and then creates an assessment plan for the resource allocation distribution, taking into consideration the available replicas of the overloaded resource. Finally, it executes the failed queries on a suitably selected replica. The results show that applying our proposed system, which combines query optimization and query resource allocation with monitoring of the concurrent queries execution by the Workload Management technique over the running resources in a cloud computing environment, significantly improves the response time of the concurrent queries.


REFERENCES Ahmad, M., Aboulnaga, A., Babu, S., & Munagala, K. (2011). Interaction-aware scheduling of reportgeneration workloads. Journal of Very Large Database, 20(4), 589–615. doi:10.1007/s00778-011-0217-y Albuitiu, M. C., & Kemper, A. (2009). Synergy based Workload Management. Proceedings of the VLDB PhD Workshop. Avnur, R., & Hellerstein, J. M. Eddies (2000). Continuously adaptive query processing. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data (vol. 29, pp. 261-272). doi:10.1145/342009.335420 Chen, G., Wu, Y., Liu, J., Yang, G., & Zheng, W. (2011). Optimization of Sub-query Processing in Distributed Data Integration Systems. Journal of Network and Computer Applications, 34(4), 1035–1042. doi:10.1016/j.jnca.2010.06.007 Curino, C., Jones, E. P., Madden, S., & Balakrishnan, H. (2011). Workload-Aware Database Monitoring and Consolidation. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data - SIGMOD (pp. 313–324). doi:10.1145/1989323.1989357 Duggan, J., Cetintemel, U., Papaemmanouil, O., & Upfal, E. (2011). Performance prediction for concurrent database workloads. Athens: SIGMOD. Duggan, J., Chi, Y., Hacigumus, H., Zhu, S., & Cetintemel, U. (2013). Packing light: portable workload performance prediction for the cloud. In IEEE 29th International Conference on Data Engineering Workshops (ICDEW) (pp. 258-265). Brisbane: IEEE. doi:10.1109/ICDEW.2013.6547460 Foster, I., Zhao, Y., Raicu, I., & Lu, S. (2008). Cloud computing and grid computing 360-degree compared. In Grid Computing Environments Workshop (pp. 1–10). doi:10.1109/GCE.2008.4738445 Fritchey, G., & Dam, S. (2009). SQL Server 2008 Query Performance Tuning Distilled (2nd ed.). doi:10.1007/978-1-4302-1903-3 Gounaris, A., Sakellariou, R., Paton, N. W., & Fernandes, A. A. A. (2004). Resource scheduling for parallel query processing on computational grids. In Proceedings of the 5th IEEE/ACM Intl. Workshop on Grid Computing (pp. 396–401). doi:10.1109/GRID.2004.55 Gounaris, A., Sakellariou, R., Paton, N. W., & Fernandes, A. A. A. (2006). A novel approach to resource scheduling for parallel query processing on computational grids. Journal of Distributed and Parallel Databases, 19(2-3), 87–106. doi:10.1007/s10619-006-8490-2 Gounaris, A., Smith, J., Paton, N. W., Sakellariou, R., Fernandes, A. A., & Watson, P. (2009). Adaptive workload allocation in query processing in autonomous heterogeneous environments. Journal of Distributed Parallel Databases, 25(3), 125–164. doi:10.1007/s10619-008-7032-5 Gounaris, A., Smith, J., Paton, N. W., Sakellariou, R., Fernandes, A. A. A., & Waston, P. (2005). Adapting to changing resource performance in grid query processing. In J. M. Pierson (Ed.), VLDB DMG. LNCS (Vol. 3836, pp. 30–44). Heidelberg: Springer. Hayes, B. (2008). Cloud computing. Communications of the ACM, 51(7), 9–11. doi:10.1145/1364782.1364786


Ilyas, I. F., Rao, J., Lohman, G., Gao, D., & Lin, E. (2003). Estimating compilation time of a query optimizer. In Proceedings of the ACM SIGMOD international conference on Management of data (pp. 373-384). doi:10.1145/872757.872803 Jeong, H., & Park, J. (2012). An efficient cloud storage model for cloud computing environment. In 7th International Conference in Advances in Grid and Pervasive Computing (pp. 370–376). doi:10.1007/9783-642-30767-6_32 Krompass, S., Gmach, D., Scholz, A., Seltzsam, S., & Kemper, A. (2006). Quality of service enabled database applications. In A. Dan & W. Lamersdorf (Eds.), ICSOC. LNCS (Vol. 4294, pp. 215–226). Heidelberg: Springer. Krompass, S., Kuno, H., Dayal, U., & Kemper, A. (2007). Dynamic workload management for very large data warehouses: juggling feathers and bowling balls. In 33rd international conference on VLDB (pp. 1105-1115). Lau, S. M., Lu, Q., & Leung, K. S. (2006). Adaptive load distribution algorithms for heterogeneous distributed systems with multiple task classes. Journal of Parallel and Distributed Computing, 66(2), 163–180. doi:10.1016/j.jpdc.2004.01.007 Lee, R., Zhou, M., & Liao, H. (2007). Request window: an approach to improve throughput of rdbmsbased data integration system by utilizing data sharing across concurrent distributed queries. In 33rd International Conference on Very Large Data Bases (pp. 1219-1230). Liu, S., & Karimi, A. H. (2008). Grid Query Optimizer to Improve Query Processing in Grids. Journal of Future Generation Computer Systems, 24(5), 342–353. doi:10.1016/j.future.2007.06.003 Maghawry, E. A., Ismail, R. M., Badr, N. L., & Tolba, M. F. (2012). An enhanced resource allocation approach for optimizing a sub-query on cloud. In A. E. Hassanien, A.-B. M. Salem, R. Ramadan, & T.-h. Kim (Eds.), AMLTA 2012. CCIS (Vol. 322, pp. 413–422). Heidelberg: Springer. doi:10.1007/9783-642-35326-0_41 Mehta, M., & DeWitt, D. J. (1993). Dynamic memory allocation for multiple-query workloads. In Proceedings of the 19th International Conference on Very Large Data Bases (pp.354 -367). Mian, R., Martin, P., Brown, A., & Zhang, M. (2011). Managing data-intensive workloads in a cloud. In G. Aloisio & S. Fiore (Eds.), Grid and Cloud Database Management (pp. 235–258). Springer. doi:10.1007/978-3-642-20045-8_12 Mittal, R. (2010). Query processing in the cloud (Master’s thesis). Swiss Federal Institute of Technology Zurich. Niu, B., Martin, P., & Powley, W. (2009). Towards autonomic workload management in DBMSs. Journal of Database Management, 20(3), 1–17. doi:10.4018/jdm.2009070101 Niu, B., Martin, P., Powley, W., Horman, R., & Bird, P. (2006). Workload adaptation in autonomic dbms. In CASCON ‘06 Proceedings of the 2006 conference of the Center for Advanced Studies on Collaborative research (pp.161-173).


Paton, N. W., Buenabad, J. C., Chen, M., Raman, V., Swart, G., Narang, I., & Fernandes, A. A. A. et al. (2009). Autonomic Query Parallelization using Non-dedicated Computers: An Evaluation of Adaptivity Options. VLDB, 18(1), 119–140. doi:10.1007/s00778-007-0090-x Paton, N. W., de Aragão, M. A. T., & Fernandes, A. A. A. (2012). Utility-driven adaptive query workload execution. Journal of Future Generation Computer Systems, 28(7), 1070–1079. doi:10.1016/j. future.2011.08.014 Paton, N. W., de Aragão, M. A. T., Lee, K., Fernandes, A. A. A., & Sakellariou, R. (2009). Optimizing utility in cloud computing through autonomic workload execution. Journal of IEEE Data Engineering Bulletin, 32, 51–58. Performance Monitoring. (n.d.). Use Windows* Performance Monitor for Infrastructure Health. Intel Developer Zone. Retrieved from: https://software.intel.com/en-us/articles/use-windows-performancemonitor-for-infrastructure-health Porto, F., da Silva, V. F. V., Dutra, M. L., & Schulze, B. (2006). An adaptive distributed query processing grid service. In J.-M. Pierson (Ed.), VLDB DMG 2005. LNCS (Vol. 3836, pp. 45–57). Heidelberg: Springer. doi:10.1007/11611950_5 Raman, V., Han, W., & Narang, I. (2005). Parallel querying with non- dedicated computers. In VLDB ‘05 Proceedings of the 31st international conference on Very large data bases (pp. 61–72). Schroeder, B., Harchol-Balter, M., Iyengar, A., & Nahum, E. (2006). Achieving class based QoS for transactional workloads. In Proceedings of the 22nd International Conference on Data Engineering (pp.153). doi:10.1109/ICDE.2006.11 Shah, M., Hellerstein, J., Chandrasekaran, S., & Franklin, M. (2003). Flux: an adaptive partitioning operator for continuous query systems. In Proceedings of the 19th International Conference on of Data Engineering (pp. 25–36). doi:10.1109/ICDE.2003.1260779 Somasundaram, T. S., Govindarajan, K., Rajagopalan, M. R., & Madhusudhana Rao, S. (2012). A broker based architecture for adaptive load balancing and elastic resource provisioning and deprovisioning in multi-tenant based cloud environments. In International Conference on Advances in Computing (vol. 174, pp. 561-573). doi:10.1007/978-81-322-0740-5_67 Soror, A. A., Minhas, U. F., Aboulnaga, A., Salem, K., Kokosielis, P., & Kamath, S. (2008). Automatic virtual machine configuration for database workloads. In SIGMOD Conference (pp. 953–966). doi:10.1145/1376616.1376711 Subramanian, I., McCarthy, C., & Murphy, M. (2000). Meeting performance goals with the HP-UX workload manager. In Proceedings of the 1st conference on Industrial Experiences with Systems Software (pp.10-10). Tian, F., & DeWitt, D. J. (2004). Tuple routing strategies for distributed eddies. In Proceedings of the 29th international conference on Very large data bases. LNCS (vol. 2944, pp. 333-344). Springer. Transaction Processing and Database Benchmark. (n.d.). TPC. Retrieved from: http:// www.tpc.org/tpch/


VMware. (n.d.). Retrieved from: http://www.vmware.com/ Walsh, W., Tesauro, G., Kephart, J., & Das, R. (2004). Utility functions in autonomic systems. In Proceedings of the IEEE International Conference on Autonomic Computing (pp. 70-77). Wei, Z., Pierre, G., & Chi, C. (2012). Scalable join queries in cloud data stores. In IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (pp. 547-555). doi:10.1109/CCGrid.2012.28 Xiong, P., Chi, Y., Zhu, S., Moon, H. J., Pu, C., & Hacıgumus, H. (2011). Intelligent Management of Virtualized Resources for Database Systems in Cloud Environment. In IEEE ICDE Conference (pp. 87–98). doi:10.1109/ICDE.2011.5767928 Yang, D., Li, J., Han, X., & Wang, J. (2013). Ad hoc aggregation query processing algorithms based on bit-store in a data intensive cloud. Journal of Future Generation Computer System, 29, 725–1735. doi:10.1016/j.future.2012.03.009

ADDITIONAL READING Das, S., Nishimura, S., Agrawal, D., & El Abbadi, A. (2011). Albatross: lightweight elasticity in shared storage databases for the cloud using live data migration. In Proceedings of the VLDB Endowment (vol. 4, pp. 494-505). doi:10.14778/2002974.2002977 Elmore, A., Das, S., Agrawal, D., & Abbadi, A. (2011). Zephyr: live migration in shared nothing databases for elastic cloud platforms. In Proceedings of the ACM SIGMOD International Conference on Management of data(pp.301–312). doi:10.1145/1989323.1989356 Farooq, U. M., Lui, R., & Aboulnaga, A. Salem.Kenneth,Ng J., Robertson s.(2012). Elastic scale-out for partition-based database systems. In Proceedings of the IEEE International Conference on Data Engineering (pp. 281–288). Ghanbari, H., Simmons, B., Litoiu, M., & Iszlai, G. (2012). Feedback-based optimization of a private cloud. Journal of Future Generation Computer Systems, 28(1), 104–111. doi:10.1016/j.future.2011.05.019 Missier, P., Paton, N. W., & Belhajjame, K. (2010).Fine-grained and efficient lineage querying of collection-based workflow provenance. In Proceedings of the 13th International Conference on Extending Database Technology (pp. 299-310). doi:10.1145/1739041.1739079 Schnaitter, K., Spiegel, J., & Polyzotis, N. (2009). Depth estimation for ranking query optimization. Journal of Very Large Database, 18(2), 521–542. doi:10.1007/s00778-008-0124-z Vasic, N., Novakovic, D., Miucin, S., Kostic, D., & Bianchini, R. (2012). DejaVu: Accelerating resource allocation in virtualized environments. In proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (vol. 12, pp. 423-436). Zheng, W., Xu, P., Huang, X., & Wu, N. (2010). Design a cloud storage platform for pervasive computing environments. Journal of Cluster Computing, 13(2), 141–151. doi:10.1007/s10586-009-0111-1


KEY TERMS AND DEFINITIONS

Abstract Query Tree: Generated by the database management system parser, which produces the query execution plan as a tree. The internal nodes of the tree hold the operations of the query, such as the select operation, and the relations used in those operations are placed in the leaves of the tree.

Autonomic Database Management System: The system's ability to manage itself automatically, without increasing costs or the size of the management team, thereby achieving the administrator's goals. Such systems must adapt quickly to new conditions integrated into them.

Cloud Computing: A kind of Internet-based computing that relies on sharing computing resources instead of having local and personal servers to access applications. It is based on the delivery of on-demand computing resources over the Internet on a pay-for-use basis.

Cloud Storage: A data storage environment in which digital data is stored in logical pools across multiple servers in a cloud computing environment; the physical environment is typically owned and managed by a hosting company.

Distributed Resources: A set of resources whose storage devices are not all attached to a common processing unit; it consists of multiple computers located in different physical locations and dispersed over a network of interconnected computers.

Load Distribution: Distributes tasks across multiple computing resources that provide the requested database service. Its main goal is to optimize resource utilization and minimize response time without overloading any resource.

Query Processing: The process of how queries are processed and optimized within the database management system. It consists of a series of steps that take the query as input and produce its result as output.

Resource Replication: The creation of multiple instances of the same resource; it enables data from one resource to be replicated to one or more resources. It is typically performed when a resource's availability and performance need to be enhanced.

Workload Management: The number of tasks, or amount of work, assigned to a particular resource over a given period. It manages workload distribution to provide optimal performance for applications and users.
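To make the abstract query tree definition concrete, the following minimal Python sketch (illustrative only; the class, relation, and predicate names are hypothetical and not part of the chapter) builds such a tree for a simple select-over-join query, with operations in internal nodes and relations in the leaves:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class QueryNode:
    """A node of an abstract query tree: internal nodes hold operations,
    leaves hold the base relations used by those operations."""
    label: str                                   # e.g. an operation or a relation name
    children: List["QueryNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

# SELECT over a JOIN of two relations (names are made up for illustration)
tree = QueryNode("SELECT salary > 5000", [
    QueryNode("JOIN emp.dept_id = dept.id", [
        QueryNode("EMPLOYEE"),      # leaf: base relation
        QueryNode("DEPARTMENT"),    # leaf: base relation
    ])
])

print(tree.children[0].children[0].label)  # -> EMPLOYEE
```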


APPENDIX: FIGURES

Figure 1. The proposed enhanced query workload architecture

Figure 2. Partitioning the answers of queries q1 and q2 into queries q*1, q*2, and q*3 to eliminate overlapping data


Figure 3. Capabilities of each resource with relations distribution

Figure 4. The average query execution time


Chapter 6

Segmentation of Brain Tumor from MRI Images Based on Hybrid Clustering Techniques

Eman A. Abdel Maksoud
Mansoura University, Egypt

Mohammed Elmogy
Mansoura University, Egypt

Rashid Mokhtar Al-Awadi
Mansoura University, Egypt

ABSTRACT

The popularity of clustering in segmentation encouraged us to develop a new medical image segmentation system based on two hybrid clustering techniques. Our medical system provides accurate detection of brain tumors in minimal time. The hybrid techniques make full use of the merits of the individual clustering techniques while overcoming their shortcomings. The first is based on K-means and fuzzy C-means (KIFCM). The second is based on K-means and particle swarm optimization (KIPSO). KIFCM helps fuzzy C-means overcome its slow convergence speed. KIPSO provides global optimization in less time: it helps K-means escape from local optima by using particle swarm optimization (PSO), and it helps PSO reduce the computation time by using K-means. Comparisons were made between the proposed techniques and K-means, fuzzy C-means, expectation maximization, mean shift, and PSO using three benchmark brain datasets. The results clarify the effectiveness of our second proposed technique (KIPSO).

INTRODUCTION

Image segmentation is a fundamental and critical task in image processing. In most cases, segmentation is a pre-step for many image processing applications; therefore, if the segmentation is accurate, the other tasks that depend on it will also be accurate. It refers to the process of partitioning a digital image into multiple non-overlapping regions to make it more understandable and meaningful (Bai & Wang, 2014).


There are many image segmentation techniques, such as edge-based, clustering, and region-based techniques (Patil & Deore, 2013). Despite this variety, the selection of an appropriate technique for a particular type of image is a challenging problem; not all techniques are suitable for all types of images (Dass & Devi, 2012). The major problems in segmentation algorithms are over-segmentation and under-segmentation. Medical image segmentation is a particularly challenging problem because the images often have poor contrast, noise, and missing or diffuse boundaries (Fazli & Ghiri, 2014; Gaber et al., 2015; Gaber et al., 2016; Tharwat et al., 2015; Ahmed et al., 2015; Ali et al., 2015). The anatomy of the brain can be viewed by imaging modalities such as magnetic resonance imaging (MRI) and computed tomography (CT). MRI is more comfortable than CT for diagnosis because it does not use any radiation; it is based on a magnetic field and radio waves (Patel & Doshi, 2014). On the other side, a brain tumor is one of the main causes of death among people. Brain tumors are not rare; thousands of people are diagnosed every year with tumors of the brain. Typically, a brain tumor affects the cerebrospinal fluid (CSF). It is an abnormal growth of cells in the brain caused by abnormal and uncontrolled cell division, occurring either in the brain itself, the cranial nerves, or the brain envelopes, or spreading from cancers primarily located in other organs. Brain tumors are either primary or secondary. The former includes any tumor that starts in the brain and is classified as benign or malignant (Leela & Kumari, 2014). Benign tumors can be removed, and they rarely grow back; they usually have a border or an edge and do not spread. The second type is more serious than the first: such tumors grow rapidly and spread to other parts. False detection of this disease may lead the physician to give treatment for stroke rather than for the tumor. Therefore, accurate and early detection of the tumor is critical. Consequently, an efficient medical image segmentation technique should be developed that offers minimum user interaction, fast computation, and accurate, robust segmentation results to help physicians in diagnosis and early treatment. The most widely used techniques for image segmentation are clustering techniques. Clustering is an unsupervised learning technique that requires the user to determine the number of clusters in advance in order to classify pixels (Neshat et al., 2012). A cluster is therefore a collection of pixels that are similar to each other in some attributes and dissimilar to the pixels of other clusters (Madhulatha, 2012). Clustering techniques can perform clustering either by partitioning or by grouping pixels (Acharya et al., 2013). In partitioning, the clustering algorithm divides the whole image into smaller clusters. In grouping, the algorithm merges clusters into larger clusters based on some assumptions. In this chapter, we focus on clustering techniques to detect the brain tumor. We experimented with the five most famous and currently used clustering techniques: K-means, fuzzy C-means, expectation maximization, mean shift, and PSO. We selected them from five different categories of clustering, namely exclusive, overlapping, probabilistic, hierarchical, and optimization-based, respectively. We applied these techniques to three different datasets of brain images.
The selected techniques were K-means, fuzzy C-means, and PSO because of their accuracy; however, fuzzy C-means and PSO take more time than K-means, while K-means is less accurate than the other two. The MRI images were pre-processed first to enhance the quality of the processed images. We integrated two different image clustering techniques in order to combine their advantages and overcome their limitations. Then, we used a thresholding technique to extract the tumor clusters automatically without user interaction. Next, we performed post-processing by filtering the resulting thresholded image using a median filter. To contour these clusters, we used the level set method.


Finally, we calculate the tumor area. The last stage is the validation stage, in which the results are compared with the ground truth and the performance (confusion) matrices are computed. We represent the results by charts to make them clearer. This chapter is organized as follows. In Section 2, we introduce a general background on current scientific research in medical image segmentation. Section 3 presents the materials and methods used in this work and shows the phases of the proposed medical image segmentation system based on our two proposed hybrid clustering techniques. Section 4 depicts the experimental results obtained from the evaluation of the two proposed techniques using three types of datasets. Finally, the conclusion and future work are drawn in Section 5.

BACKGROUND

Medical image segmentation is considered a hot research topic, and several authors have introduced various methodologies and techniques for image segmentation (Ali, 2015). For example, Yang et al. (2013) developed a hybrid technique called PSOKHM that integrates K-harmonic means (KHM) with PSO. The PSOKHM algorithm applies KHM with four iterations to the particles in the swarm every eight generations, which improves the fitness value of each particle. It searches for the cluster centers using, as a metric, the sum over all data points of the harmonic average of the distance from a data point to all the centers. The advantages of this method are that it improves the speed of PSO and helps KHM escape from local optima; however, its main drawback is that it requires more runtime than KHM. Jumb et al. (2014) used K-means and Otsu's adaptive thresholding to segment color images. They converted the RGB image to the HSV color model and extracted the value channel. Then, they applied Otsu's multi-thresholding on the value channel depending on the separation factor (SF). After that, they applied K-means clustering and, finally, morphological processing. The main disadvantages of their method are that the preprocessing and image enhancement stage is missing and that the segmentation is not fully automatic: it depends on the user's judgment whether to apply K-means or not. Kumar and Sivasangareswari (2014) detected the brain tumor by using fuzzy C-means with local information and a kernel metric. At first, they filtered the brain images with a median filter. Second, they performed clustering using the fuzzy C-means technique; their modified fuzzy C-means depends on the spatial distance of all neighboring pixels. They used the gray level co-occurrence matrix (GLCM) for feature extraction and used SVM for classification (Gaber et al., 2016; Tharwat et al., 2015). The limitation of their work is that they did not perform skull removal, which increases the amount of memory used and the processing time. Abdul-Nasir et al. (2013) presented a color image segmentation technique to detect malaria parasites in red blood cells. They applied a partial contrast stretching technique to the malaria color images and extracted the color components from the enhanced images. After that, they used K-means clustering and a median filter and, finally, the seeded region growing area extraction algorithm. The limitation of this method is that no single color model is best in all cases: the HSI color model gives the best results in recall, but the C-Y color model gives the best results in precision. Joseph et al. (2014) presented brain tumor MRI segmentation. They started with a preprocessing stage that converts the RGB input image to grayscale and applies a median filter. The preprocessed image is supplied to the K-means clustering algorithm, followed by morphological filtering if there are no clustered regions. The main disadvantage is that they did not use any technique or procedure for skull removal before segmentation.


Wang et al. (2013) performed segmentation by combining PSO and K-means for local and global search. A mutation operation was used to accelerate poor particles in the population. The main disadvantage is that they did not reduce the dataset size by using feature extraction, which could reduce the number of iterations and the execution time. Ghamisi et al. (2012) presented two thresholding image segmentation methods based on the Fractional-Order Darwinian Particle Swarm Optimization (FODPSO) and the Darwinian Particle Swarm Optimization (DPSO). The authors then compared the proposed methods with a number of current segmentation methods, such as the genetic algorithm (GA) and bacterial foraging (BF), for performance evaluation. FODPSO provides a higher fitness value than DPSO and is more efficient than the other methods, particularly when the level of segmentation increases; it can find better thresholds in less time. Although FODPSO segmentation (like all thresholding-based methods in general) controls the convergence rate, it suffers from the following disadvantages: it cannot handle inhomogeneity; it fails when the intensity of the object of interest does not appear as a peak in the histogram; and FODPSO-based segmentation takes into account only the between-class variance, thus disregarding any feedback from the within-class variance. Combining or integrating another method, such as fuzzy C-means or mean shift, with FODPSO may overcome these disadvantages. Ghamisi et al. (2014) combined FODPSO and mean shift segmentation (MSS) to classify hyperspectral images. The output of FODPSO can be segmented again by MSS, and the output of MSS can be used as preprocessing for classification; the authors performed the classification with SVM. Although the combination of the two segmentation methods increased the classification accuracy, tuning the kernel size can be considered the main difficulty of MSS, and the obtained result may be considerably affected by the kernel size; a small kernel size may cause a large increase in CPU processing time. Cabria and Gondra (2012) presented a mean-shift-based initialization method for K-means (MS-KMeans). In this method, the authors used the mean shift clustering technique as initialization for the K-means clustering technique. Although K-means is very fast for very large amounts of data, its performance depends on the selection of K and of the initial cluster centers. On the other hand, mean shift clustering does not need initialization of the number of clusters as K-means does. Besides, the modes of the probability density function of the observations found by mean shift clustering can serve as the initial cluster centers for the K-means method. In their tests, they found that the differences between similar points disappear as the precision is increased, but the computational time increases. They used a parameter, the radius of modes, to calculate the positions of the modes: if the distance between two points is smaller than the radius of modes, then only one of them is a mode and the second is not. They compared the method's performance with random-based, uniform-based, density-based, and sphere-based initialization. They tested nine kernel functions: normal or Gaussian, Epanechnikov, triangular, quartic, uniform, triweight, tricube, cosine, and Lorentz.
The comparisons showed that, regarding the quality of the resulting clustering, MS-KMeans outperforms the other initialization methods. However, it has a main weakness: it has the highest computational cost of all the methods. Ali et al. (2014) presented a fuzzy C-means based image segmentation approach with an optimum threshold using a measure of fuzziness, called FCM-t, and applied it to liver CT images. The authors started by applying fuzzy C-means to the liver CT image; the second step consists of measuring the membership value, or fuzziness, by using Zadeh's S-function. After that, they calculated the optimum threshold based on the measures of fuzziness. They revealed the ambiguous pixels by assigning them to the appropriate clusters, identifying pixels by 1 and 2 if they satisfied the condition.


The condition states that the membership values must be greater than or equal to the threshold. The other pixels, with membership values less than the threshold, are defined as ambiguous. They are also assigned to the appropriate clusters, calculated by rounding to the nearest integer the average of the cluster values in the 3 × 3 neighborhood of the uncertain pixel. The authors performed experiments on 30 liver CT images and compared the traditional FCM with the FCM-t approach. The differences between the two algorithms were analyzed using a one-way MANOVA on both the Jaccard Index (JI) and CPU processing time. The CT dataset used is divided, based on the tumor type, into benign (CY, HG, HA, and FNH) and malignant (HCC, CC, and MS) cases. Their FCM-t presents significantly better results than the traditional FCM, but at the cost of some processing power: the optimum threshold approach increased the segmentation performance by approximately three times compared with the traditional FCM while doubling the computational complexity. In future work, the authors aim to provide a post-processing stage in which the ground truth is determined without the surrounding objects, and they plan to decrease the computational complexity of the optimization by using PSO. Jiang et al. (2013) proposed a method that operates on multimodal brain MR images, including T1, T2, post-Gadolinium T1, and FLAIR sequences. It depends on two classifiers: the first is a global classifier, trained using samples from the population feature set, and the second is a custom classifier, trained using samples from seed points in the testing image. The procedure of their proposed method consists of four steps. It starts with feature extraction by Gabor filters, where the feature volumes are generated by stacking the features of the 2D image slices. After that, distance metric learning is applied by learning the training set in the feature space. Then the classifiers are trained, and the last phase is optimization by graph cut. The main disadvantages of the authors' method are that it is semi-automatic and that the high dimension of the features slows down the convergence of the algorithm. The most time-consuming stage is the training of the global classifier: it takes nearly two hours to train it on 57 cases with 4 modalities. Dhanalakshemi et al. (2013) used KM (K-means) to detect the brain tumor and calculate its area. The authors start by preprocessing the MRI images, filtering out noise and other artifacts and sharpening the edges in the image; they use a median filter for noise removal. Then, they carry out the segmentation by KM. The segmented image is fed to a feature extraction step that extracts the tumor by an edge detection method. After that, they use thresholding in an approximate reasoning step to calculate the area of the tumor. The performance of the brain tumor segmentation based on KM clustering is evaluated using a dataset of MRI images with a size of 181x272. This dataset consists of 40 brain MRI images, of which 20 contain a tumor and the remaining images do not. The dataset is divided into training and testing sets: the first is used to segment the tumor in the brain images, and the second is used to analyze the performance of their proposed technique. The authors certainly benefit from the KM technique in detecting the tumor.
The KM technique is very fast and simple, but in most cases it fails to detect the malignant tumor. They used the median filter to remove noise from the MRI images, but they did not perform skull removal. Moreover, this method is not suitable for all types of MRI images. In our proposed system, the primary objective is to detect the brain tumor accurately in minimal execution time, and we consider accuracy and minimum execution time in each stage. In the preprocessing stage, we apply the median filter to enhance the overall image quality and remove the skull from the processed image; this stage reduces both the processing time and the amount of memory used. In the segmentation stage, all the advantages of K-means, fuzzy C-means, and PSO are preserved, while their main problems are solved by the proposed hybrid techniques.


The over-segmentation and under-segmentation problems were solved, as shown in the experimental results, and the iterations and computation time were reduced. User interaction is eliminated. Thresholding is applied to present a clear brain tumor clustering. Finally, the level set stage is applied to present the contoured tumor area on the original image. The system provides the physician with an overall picture from which to observe and plan the correct treatment. The overall picture comprises the original smoothed image, the clustered image, the thresholded image with the extracted tumor regions, and the contoured image with the marked tumor areas. The contoured image appears as if an expert had marked the tumor regions with a green line marker.

THE PROPOSED MRI BRAIN SEGMENTATION SYSTEM

Owing to the simplicity and speed of the K-means technique, different image segmentation systems are based on it to detect the brain tumor. K-means mainly suffers from incomplete detection of malignant tumors, although it detects the mass tumor. On the other hand, some other systems are based on the fuzzy C-means technique. It has the main advantage of retaining more information from the original image and can detect malignant tumor cells accurately (Anandgaonkar & Sable, 2014), but it is sensitive to noise and outliers and takes a long execution time. Besides these systems, other methods use PSO to segment the tumor from brain images (Arulraj et al., 2014). PSO may reach the optimal solution or come close to it, but it takes more computation time, especially in color image segmentation. In our proposed medical image segmentation system, we benefit from the advantages of the three selected techniques (K-means, fuzzy C-means, and PSO). As shown in Figure 1, the proposed medical image segmentation system consists of five phases: pre-processing, clustering by one of the two proposed clustering techniques, post-processing (filtering again), tumor extraction and contouring, and validation. The main idea behind integrating the clustering techniques is to increase quality and reduce runtime. In the KIFCM clustering technique, fuzzy C-means can detect the tumor accurately, but it performs more iterations, which leads to a long execution time; therefore, we integrated it with K-means to reduce the time and speed up the convergence. By integrating K-means and PSO, we can reduce the computation time PSO needs to reach the optimal clustering. The main phases and sub-phases of the proposed system are discussed in more detail in the following subsections.

Phase One: Pre-Processing Stage

This stage is implemented by applying a series of initial processing procedures to the image before any special-purpose processing. The main purpose of this stage is to improve the image quality and remove the noise. Since brain images are very sensitive, they should be noise-free and of high quality. Therefore, this stage consists of de-noising and skull removal sub-stages. De-noising is important for medical images so that they are sharp, clear, and free of noise and artifacts. MRI images are normally corrupted by Gaussian and Poisson noise (Rodrigues et al., 2008). We used a median filter, which is a nonlinear filter often used in image processing to remove salt-and-pepper noise (Abinaya & Pandiselvi, 2014). It moves pixel by pixel through the image and replaces each value with the median value of the neighboring pixels: the filter sorts the pixel values from the window of neighboring pixels into numerical order and replaces the pixel being considered with the median pixel value.


Median filtering is better than linear filtering for removing noise in the presence of edges (Kumar et al., 2014). On the other hand, the image background does not usually contain any useful information, but it increases processing time. Therefore, removing the background, the skull, and all contents that are not of interest decreases the memory used and increases the processing speed. We used the BSE (brain surface extractor) procedure to remove the skull; it is used only with MRI images (MIPAV, 2014). It filters the image to remove noise, detects edges, performs morphological erosions, isolates the brain, and cleans its surface. The output of this phase is the cleaned, isolated brain in the MRI image.
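As a rough illustration of this pre-processing stage (not the exact BSE procedure, which is a MIPAV tool), the following Python sketch applies a median filter and a simple threshold-plus-morphology approximation of skull stripping; the window size, threshold fraction, and number of morphological iterations are assumptions made for the example:

```python
import numpy as np
from scipy import ndimage

def preprocess_slice(img, median_size=3):
    """Denoise a 2D MRI slice and roughly strip non-brain background.

    Illustrative approximation of the chapter's pre-processing phase;
    the real system uses a median filter followed by the BSE tool.
    """
    # 1. De-noising: median filter removes salt-and-pepper noise
    smoothed = ndimage.median_filter(img, size=median_size)

    # 2. Rough foreground mask: keep pixels above a fraction of the maximum
    mask = smoothed > 0.1 * smoothed.max()

    # 3. Morphological clean-up: erode, keep the largest connected
    #    component (assumed to be the brain), then dilate back
    mask = ndimage.binary_erosion(mask, iterations=2)
    labels, n = ndimage.label(mask)
    if n > 0:
        sizes = ndimage.sum(mask, labels, range(1, n + 1))
        mask = labels == (np.argmax(sizes) + 1)
    mask = ndimage.binary_dilation(mask, iterations=2)

    # 4. Zero-out everything outside the brain mask
    return smoothed * mask

if __name__ == "__main__":
    demo = np.random.rand(128, 128)   # stand-in for a real MRI slice
    brain_only = preprocess_slice(demo)
    print(brain_only.shape)
```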

Figure 1. The framework of the proposed medical image segmentation system

Phase Two: Clustering Stage

The noise-free MRI images are then fed to one of the two proposed clustering techniques, either KIFCM or KIPSO. In the case of KIFCM, we initialize the number of clusters K, the maximum number of iterations, and the termination parameter, and we calculate the initial cluster centers using Eq. 1:

MU = ((1:k) * m) / (k + 1)     (1)

where MU is the vector of initial means, computed from K, and m = max(MRI image) + 1. Then, each point is assigned to the nearest cluster center based on the minimum distance, and the new cluster centers are re-computed; this repeats until a convergence criterion is met. The resulting image can then be clustered by fuzzy C-means, initialized with a number of centroids (the centroid of a cluster is the mean of all points in that cluster) equal to K. This reduces the iterations and the processing time; if the number of initial centroids differs from K, it may increase the time in some cases. Fuzzy C-means then proceeds by calculating the distances and updating the membership and mean values until the closing condition is reached. The outputs of the technique are the clustering time, the number of iterations, and the clustered image. The pseudo code of the proposed KIFCM algorithm is stated in detail in (Abdel Maksoud et al., 2015). On the other hand, the noise-free MRI images are fed to the second proposed approach (KIPSO) by initializing the number of clusters K, the population of particles, the inertia weight, and the number of iterations. The algorithm follows the same steps as KIFCM until the cluster means are determined from the initial K. Each particle is then updated by two best values. The first is the personal best (pbest), which is the best solution or fitness achieved so far by that particle. The second is the global best (gbest), which is the best value obtained so far by any particle in the neighborhood of that particle. First, we evaluate the fitness value and then calculate the current particle position and current velocity by formulas 3 and 4, respectively. If the fitness value of a particle is greater than its pbest, pbest is modified, and the particle is then checked against gbest; if the fitness value is not greater than the modified pbest, only the velocity and position are updated, but if the fitness value of the particle is greater than gbest, gbest is modified. Each particle modifies its position using the current positions, the current velocities, the distance between the current position and pbest, and the distance between the current position and gbest. Each particle keeps updating its velocity and position until the maximum number of iterations is reached. The outputs of the algorithm are a clustered image with an optimal number of clusters, the optimal cluster centers, and the computation time. The pseudo code of the second proposed KIPSO algorithm is stated in detail in Algorithm 1.
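To make the KIFCM idea concrete, the following minimal Python sketch (an illustrative reconstruction, not the authors' published code; function names and parameter values are assumptions) runs a few K-means iterations on the image intensities, starting from the Eq. 1 means, and then uses the resulting centers to warm-start fuzzy C-means:

```python
import numpy as np

def kmeans_centers(pixels, k, iters=10):
    """Plain K-means on 1-D intensities; returns the final centers."""
    m = pixels.max() + 1.0
    centers = np.arange(1, k + 1) * m / (k + 1)   # initial means as in Eq. (1)
    for _ in range(iters):
        labels = np.argmin(np.abs(pixels[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean()
    return centers

def fcm(pixels, centers, m=2.0, iters=20, eps=1e-4):
    """Fuzzy C-means warm-started with K-means centers (the KIFCM idea)."""
    c = centers.astype(float).copy()
    for _ in range(iters):
        dist = np.abs(pixels[:, None] - c[None, :]) + 1e-9
        # membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        u = 1.0 / np.sum((dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1)), axis=2)
        # center update: weighted mean of the intensities
        new_c = (u ** m).T @ pixels / np.sum(u ** m, axis=0)
        if np.max(np.abs(new_c - c)) < eps:
            c = new_c
            break
        c = new_c
    return c, u

if __name__ == "__main__":
    img = np.random.rand(64, 64)            # stand-in for a pre-processed slice
    pix = img.ravel()
    init = kmeans_centers(pix, k=4)          # fast, rough centers
    centers, memberships = fcm(pix, init)    # refined fuzzy clustering
    segmented = centers[np.argmax(memberships, axis=1)].reshape(img.shape)
    print(centers)
```

The point of the warm start is that fuzzy C-means begins close to a sensible partition, so it needs fewer iterations than when started from arbitrary centers, which is exactly the runtime saving KIFCM is designed to achieve.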

Phase Three: Extraction and Contouring Stage

In this stage, we combine two segmentation methods: thresholding and active contours (level set). Thresholding is an intensity-based segmentation method and one of the most important, simple, and popular segmentation techniques (Narkhede, 2013). It converts a multilevel image into a binary image and is used to extract or separate the objects in an image from the background. The segmented image obtained from thresholding methods has several advantages: it requires little storage, it speeds up processing, and it is very easy to manipulate (Saini & Dutta, 2012). The output of this stage is a segmented image with a dark background and a lighted object, which is the brain tumor. On the other hand, active contours are very important in image segmentation and have been used effectively for boundary detection since the first introduction of snakes by Kass et al. (1988). The idea of active contours is to start with an initial boundary, represented as a closed curve (a contour), and to iteratively shrink or expand it according to the constraints. The advantage of active contour methods in image segmentation lies in partitioning the image into regions with continuous boundaries.


Algorithm 1. The Pseudo Code of the Proposed KIPSO Algorithm

1. INITIALIZE number of clusters K, Maxiteration, number of particles, c1=c2=2, Wmax=0.9, Wmin=0.4
2. SET G=10, M=15   % number of particles
3. SET pbest=zeros(M, 2), gbest1=gbest2=GG=0, r=0, s=0
4. READ image
5. ASSIGN m; h=zeros(1, m); hc=h
6. FOR i=0 to length of image
7.     IF image(i) > 0 THEN
8.         Add one to h(image(i))
9.     END IF
10. END FOR
11. CALCULATE Formula (1)
12. WHILE (true)
        Old mean = MU
        FOR i=1 to length(find(h))
            CALCULATE C = abs(Ind(i) - MU)
            CALCULATE CC = find(c == min(c))
        END FOR
        FOR i=1 to k
            a = FIND(hc == i)
            CALCULATE the new means: MU(i) = sum(a * h(a)) / sum(h(a))
        END FOR
        IF MU = old mean THEN Break
13. END WHILE
14. SET IMA = clustering image
15. SET IMA = clustered image
16. GET length of IMA a, b
17. GET p, x of IMA from imhist[IMA, 256]
18. SET L = x, LP = P ./ (a*b)
19. CALCULATE current position X = min(L) + fix(max(L)) - min(L) * rand(1, M)   (3)
20. CALCULATE current velocity v = min(L) + max(L) - min(L) * rand(1, M)   (4)
21. CALCULATE fitness value
22. FOR y=1 to G
        W(y) = Wmax - (Wmax - Wmin) * y/G   (5)
        FOR i=1 to M
            CALCULATE t = length(find(x(i))) >= L
            FOR j=1 to t
                SET r = r + LP(j)
                SET s = s + L(j) * LP(j)
            END FOR
            SET W0(i) = r, SET W1(i) = 1 - r
            SET U0(i) = s/r, SET U1(i) = (m - s) / (1 - r)
23. END FOR
24. FOR i0=1 to M
        CALCULATE BB(i0) = W0(i0) * W1(i0) * ((U1(i0) - U0(i0))^2)
25. END FOR
26. FOR i=1 to M
        IF pbest(i,2) = gbest2 THEN
            SET gbest2 = Max
            SET gbest1 = X(cc)
30.     END IF
31. SET GG(y) = gbest2
32. FOR i=1 to M
33.     SET v(i) = round(w(k)*v(i)) + c1*rand*(pbest(i,1) - X(i)) + c2*rand*(gbest1 - X(i))
34.     SET X(i) = v(i) + X(i)
35. END FOR
36. SAVE clustering images
37. DISPLAY clustering image and computation time
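The particle update in steps 33 and 34 follows the standard PSO velocity and position rules with the linearly decreasing inertia weight of formula (5). The following short Python sketch (illustrative only; the parameter values and the use of candidate thresholds as particle positions are assumptions) shows these updates in isolation:

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_step(X, V, pbest_pos, gbest_pos, w, c1=2.0, c2=2.0):
    """One PSO update: new velocity = inertia + cognitive + social terms."""
    r1, r2 = rng.random(X.shape), rng.random(X.shape)
    V = w * V + c1 * r1 * (pbest_pos - X) + c2 * r2 * (gbest_pos - X)
    return X + V, V

# Linearly decreasing inertia weight, as in formula (5)
Wmax, Wmin, G = 0.9, 0.4, 10
weights = [Wmax - (Wmax - Wmin) * y / G for y in range(1, G + 1)]

X = rng.random(15) * 255          # candidate positions (one per particle)
V = np.zeros_like(X)
pbest_pos, gbest_pos = X.copy(), X[0]
for w in weights:
    X, V = pso_step(X, V, pbest_pos, gbest_pos, w)
print(X[:3])
```

In the full KIPSO algorithm, pbest and gbest would of course be re-evaluated against the fitness value (steps 21 to 31) at every generation; here they are kept fixed purely to isolate the update equations.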

Therefore, we use the level set method to contour the boundary of the tumor area continuously after thresholding; the level set method is described in detail by Lee (2005). Using the level set after thresholding gives the user the resulting segmented version of the original image with contoured tumor areas. Figure 3 shows the contoured tumor area in the extraction step together with its calculations. The figure displays the thresholded image after the post-processing phase, which is described in Phase Four; it contains only the tumor area, shown as white light on a dark background. We count all white pixels as the tumor area, so the preprocessing phase is very important: it ensures that all unwanted contents, such as noise, artifacts, and the skull, are removed and do not affect our calculations. As shown in Figure 3, we calculate only the tumor area, as there are no boundary pixels or other unwanted regions. Post-processing also helps us produce accurate calculations: with the median filter, we remove the salt-and-pepper noise from the thresholded image before applying the calculation algorithm. The pseudo code of the extraction and contouring stage is demonstrated in detail in (Abdel Maksoud et al., 2015).
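A minimal sketch of this extraction step might look as follows: threshold the clustered image, median-filter the binary mask (three passes, as in the post-processing phase), and count the white pixels as the tumor area. The threshold choice and the pixel-to-mm² factor are assumptions, and the level set contouring is omitted:

```python
import numpy as np
from scipy import ndimage

def extract_tumor(clustered, threshold=None, pixel_area_mm2=1.0):
    """Threshold the clustered image, clean the mask, and compute the area."""
    if threshold is None:
        # assumption: the tumor cluster has the highest intensity
        threshold = 0.9 * clustered.max()
    mask = clustered >= threshold

    # post-processing: three median-filter passes remove salt-and-pepper specks
    for _ in range(3):
        mask = ndimage.median_filter(mask.astype(np.uint8), size=3).astype(bool)

    area_pixels = int(mask.sum())
    return mask, area_pixels * pixel_area_mm2

if __name__ == "__main__":
    demo = np.random.rand(64, 64)         # stand-in for a clustered slice
    tumor_mask, area = extract_tumor(demo)
    print("tumor area:", area, "mm^2")
```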

Phase Four: Post-Processing

This phase starts after the thresholding technique. It filters the resulting image again with a median filter, applied for three iterations. The thresholded image shows the tumor cluster as light regions on a dark background, but some small scattered points may remain; these points look like salt-and-pepper noise. The median filter removes them, and the resulting image displays only the infected regions, i.e., the tumor clusters, as shown in Figure 2. Figure 2.a shows the thresholded image with salt-and-pepper noise, and Figure 2.b shows the thresholded image after the median filter.

Phase Five: Validation Stage

In the validation stage, the segmented images resulting from the two proposed clustering techniques are compared with the ground truth, as illustrated in the experimental results. The results are evaluated with a performance matrix based on the percentages of false positive and true positive pixels.


Figure 2. (a) The thresholded image with pepper and salt noise; (b) the resulting image after median filter

Figure 3. The area calculation of the tumor by using Matlab

The false positive rate, also called incorrect segmentation, is the percentage of pixels that do not belong to a cluster but were detected in that cluster. The correct segmentation represents the percentage of true positives: the pixels that belong to a cluster and were detected in that cluster. The primary measures of the performance of the two proposed techniques are precision and recall. Precision corresponds to the correct segmentation, while recall, or sensitivity, is the number of true positives divided by the total number of elements that belong to the positive cluster (Dakua & Prasad, 2013).
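For reference, the following small Python sketch (an illustration, not the authors' evaluation code) computes the confusion-matrix entries and the derived precision, recall, and accuracy from a predicted binary mask and a ground-truth mask:

```python
import numpy as np

def confusion_metrics(pred, truth):
    """Pixel-wise TP/TN/FP/FN plus precision, recall, and accuracy."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)
    tn = np.sum(~pred & ~truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return dict(TP=tp, TN=tn, FP=fp, FN=fn,
                precision=precision, recall=recall, accuracy=accuracy)

if __name__ == "__main__":
    pred = np.zeros((8, 8), bool); pred[2:6, 2:6] = True
    truth = np.zeros((8, 8), bool); truth[3:7, 3:7] = True
    print(confusion_metrics(pred, truth))
```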


EXPERIMENTAL RESULTS

We applied our proposed system to three benchmark datasets. The first is the Digital Imaging and Communications in Medicine (DICOM) dataset. It consists of 21 images that contain brain tumors and has no ground truth for the provided images; therefore, we depended on visual observation. The second dataset is BrainWeb, which contains simulated brain MRI data based on normal and multiple sclerosis cases; it contains 151 slices. The last one is the BRATS dataset, collected from the Multimodal Brain Tumor Segmentation challenge; it consists of multi-contrast MRI scans with ground truth images and includes 81 images.

Table 1. The main stages of the proposed framework by using KIFCM applied on three benchmark datasets


In this section, we present the results of our two proposed hybrid techniques. The results of our proposed medical system were obtained using real MRI brain images from the datasets above. This work was implemented using MATLAB 7.12.0 (R2011a). We ran our experiments on a Core i5/2.4 GHz processor with 8 GB RAM and an NVIDIA (1 GB VRAM) VGA card. Tables 1 and 2 demonstrate the results of applying the five main stages of our framework, where the contouring by level set was done after the post-processing stage on the thresholded image; the framework includes our proposed techniques KIFCM and KIPSO on the three image datasets. Table 3 shows that EM (expectation maximization) behaves like KM in accuracy but takes a longer time (T, in seconds) than KM. On the other hand, the mean shift technique (MS) needs initialization of the bandwidth and threshold value; it produces a number of clusters K and a time (T, in seconds).

Table 2. The main stages of the proposed framework by using KIPSO applied on three benchmark datasets


Table 3. The comparison between KM, EM, and MS clustering algorithms

MS takes less runtime, but it does not give accurate results, especially for low cluster numbers, as in the second dataset (DS2) when the number of clusters is K=3. We observed that the processing time increased when we did not use a skull removal algorithm such as BSE: the processing time of all techniques increased, as in the first dataset (DS1), memory was wasted, and the calculations of the tumor area were not accurate. As we can see from Figure 2 (a, b), although we filtered the thresholded image to calculate the white pixels that represent the tumor area, the boundary pixels were also counted because the skull in the image had not been removed; thus, when we did the thresholding, it included the skull boundary, and the calculations were not accurate. For these reasons, removing the skull is very important. On the contrary, when we removed the skull, as in the second dataset (DS2), or used images without skulls or scalps, as in the third dataset (DS3), the processing time was very short compared to the processing time for the first dataset (DS1), as shown in Table 3, and the calculations were accurate, as we can see in Figure 3. In Table 4, KIFCM appears to match FCM in accuracy, but KIFCM takes less processing time (T) than FCM with fewer iterations. From Table 5, we can notice that KIPSO matches PSO in accuracy,

Table 4. The comparison between FCM and our proposed technique (KIFCM)


but KIPSO takes less time (T) than PSO. Tables 6 and 7 describe the confusion matrix (True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN), besides the precision, recall, and accuracy) of KM and EM. The results show that expectation maximization may match KM in accuracy in DS2 and DS3, but in DS1 the KM accuracy is 85.7% whereas the EM accuracy is 66.6%. From Tables 6 and 8, we observe that MS matches KM in the performance matrix elements except in DS2. Tables 9 and 10 show that KIFCM is more accurate than FCM; this result is very clear in DS1, where the accuracy of KIFCM is 90.5% and that of FCM is 85.7%. Tables 11 and 12 describe the performance matrix comparisons between PSO and KIPSO. The results prove that they are the same in accuracy, but PSO takes a long time compared to KIPSO, as shown in Table 5.

Table 5. The comparison between PSO and our proposed technique (KIPSO)

Table 6. The performance metrics of KM

Data Sets   TP      TN   FP   FN     Accuracy   Precision   Recall
DS1         85.7    0    0    14.3   85.7       100         85.7
DS2         96.7    0    0    3.3    96.7       100         96.7
DS3         95.06   0    0    4.94   95.06      100         95.06

Table 7. The performance metrics of EM

Data Sets   TP      TN   FP   FN     Accuracy   Precision   Recall
DS1         66.6    0    0    33.4   66.6       100         66.6
DS2         95.4    0    0    4.6    95.4       100         95.4
DS3         95.06   0    0    4.94   95.06      100         95.06

Table 8. The performance matrices of MS

Data Sets   TP      TN   FP   FN     Accuracy   Precision   Recall
DS1         85.7    0    0    14.3   85.7       100         85.7
DS2         96.05   0    0    3.95   96.05      100         96.05
DS3         95.06   0    0    4.94   95.06      100         95.06

Table 9. The performance matrices of FCM

Data Sets   TP      TN   FP   FN     Accuracy   Precision   Recall
DS1         85.7    0    0    14.3   85.7       100         85.7
DS2         100     0    0    0      100        100         100
DS3         100     0    0    0      100        100         100

Table 10. The performance matrices of KIFCM

Data Sets   TP      TN   FP   FN     Accuracy   Precision   Recall
DS1         90.5    0    0    9.5    90.5       100         90.5
DS2         100     0    0    0      100        100         100
DS3         100     0    0    0      100        100         100

Table 11. The performance matrices of PSO

Data Sets   TP      TN   FP   FN     Accuracy   Precision   Recall
DS1         95      0    0    5      95         100         95
DS2         100     0    0    0      100        100         100
DS3         100     0    0    0      100        100         100

Table 12. The performance matrices of KIPSO

Data Sets   TP      TN   FP   FN     Accuracy   Precision   Recall
DS1         95      0    0    5      95         100         95
DS2         100     0    0    0      100        100         100
DS3         100     0    0    0      100        100         100

The results also give the descending order of the execution time of the seven techniques: FCM is on the first level, taking the longest processing time (T, in seconds) in clustering; EM takes the second level; PSO the third; KIFCM the fourth; KIPSO the fifth; KM the sixth; and MS the last level, with the shortest processing time. The comparison of the execution (processing) time of the seven techniques for the first dataset (DS1) is demonstrated by a bar chart in Figure 4. Figure 5 clarifies the time comparison between K-means, fuzzy C-means, and their integration KIFCM; KIFCM takes less time (T) than FCM.


Figure 4. The comparison of the execution time of the seven techniques for DS1

Figure 5. The comparison of the execution time of K-means, fuzzy C-means, and KIFCM for DS1

Figure 6 represents the processing time comparison between K-means, PSO, and KIPSO; KIPSO takes less time (T) than PSO. In Figure 7, the comparison is between the two proposed hybrid clustering techniques, KIFCM and KIPSO; we notice that KIPSO takes less time (T) than KIFCM. Figure 8 shows the accuracies of the seven clustering techniques. We observe that FCM, PSO, KIFCM, and KIPSO are the same in DS2 and DS3, but in DS1, KIPSO gives the highest accuracy, close to that of PSO. KIFCM gives lower accuracy than KIPSO and PSO, but higher than FCM.


Figure 6. The comparison of the execution time of K-means, particle swarm optimization, and KIPSO for DS1

Figure 7. The comparison of the execution time of the two proposed clustering techniques KIFCM and KIPSO

From the previous figures and tables, we can say that our proposed medical image segmentation system proved its effectiveness in detecting the brain tumor accurately with minimal execution time. This objective is achieved most effectively with the second proposed clustering technique, KIPSO, which takes less processing time.


Figure 8. The accuracies of the seven clustering techniques for the three datasets

CONCLUSION

Image segmentation plays a very important role in medical image processing applications. In this chapter, we proposed a new medical image segmentation system based on two proposed hybrid clustering techniques. The first integrates fuzzy C-means with K-means and is called KIFCM; the second integrates PSO with K-means and is called KIPSO. We applied the overall medical system, using each of the two proposed clustering techniques, to three different real datasets to detect the brain tumor. Our medical system consists of five main phases: pre-processing, clustering, segmentation (tumor extraction and contouring), post-processing, and validation. The experiments proved the effectiveness of our system, and of the proposed techniques, in segmenting the brain tumor accurately with minimal processing time. We compared our techniques with five state-of-the-art clustering techniques: K-means, expectation maximization, mean shift, fuzzy C-means, and PSO. The comparisons were made in terms of processing time and accuracy. The four elements of the confusion matrix (TP, FP, TN, FN), obtained by comparing the resulting images of the techniques with the ground truth, produced the precision, recall, and accuracy values. The result of KIPSO is very close to that of KIFCM in accuracy and time, but the difference between them on the first dataset shows that the KIPSO results are more accurate and clearer than those of KIFCM: the accuracy of KIPSO is 95%, while that of KIFCM is 90.5%. Moreover, KIPSO takes less processing time than KIFCM on all datasets.


In future work, 3D MRI brain tumor segmentation will be carried out using new 3D datasets. We will also test current initialization techniques for K-means to reduce the time as much as possible, especially when segmenting 3D images. In addition, we will reconstruct the 2D slices of the 3D image and give the user the ability to see all the clustered 2D slices. The evaluation will be carried out based on the confusion matrices, the ground truth, and the results of 3D Slicer or MIPAV.

REFERENCES

Abdel Maksoud, E. A., Elmogy, M., & Al-Awadi, R. M. (2015). Brain tumor segmentation based on a hybrid clustering technique. Egyptian Informatics Journal, 16(1), 1–11. doi:10.1016/j.eij.2015.01.003

Abdul-Nasir, A. S., Mashor, M. Y., & Mohamed, Z. (2013). Colour Image Segmentation Approach for Detection of Malaria Parasites Using Various Colour Models and k-Means Clustering. Journal of WSEAS Transactions on Biology and Biomedicine, 10(1), 41–55.

Abinaya, K. S., & Pandiselvi, T. (2014). Brain tissue segmentation from magnitude resonance image using particle swarm optimization algorithm. International Journal of Computer Science and Mobile Computing, 3(3), 404–408.

Acharya, J., Gadhiya, S., & Raviya, K. (2013). Segmentation Techniques For Image Analysis: A review. International Journal of Computer Science and Managment Research, 2(1), 1218–1221.

Ali, A., Couceiro, M., Anter, A. M., Hassanien, A. E., Tolba, M. F., & Snásel, V. (2014). Liver CT Image Segmentation with an Optimum Threshold Using Measure of Fuzziness. IBICA, 83-92.

Ali, M. A., Sayed, G. I., Gaber, T., Hassanien, A. E., Snasel, V., & Silva, L. F. (2015). Detection of breast abnormalities of thermograms based on a new segmentation method. Proceedings of Federated IEEE Conference on Computer Science and Information Systems (FedCSIS), 255-261. doi:10.15439/2015F318

Anandgaonkar, G., & Sable, G. (2014). Brain Tumor Detection and Identification from T1 Post Contrast MR Images Using Cluster-Based Segmentation. International Journal of Science Researches, 3(4), 814–817.

Arulraj, M., Nakib, A., Cooren, Y., & Siarry, P. (2014). Multicriteria Image Thresholding Based on Multiobjective Particle Swarm Optimization. Journal of Applied Mathematical Sciences, 8(3), 131–137. doi:10.12988/ams.2014.3138

Bai, X., & Wang, W. (2014). Saliency-SVM: An automatic approach for image segmentation. Journal of Neuro Computing, 136, 243–255.

Cabria, I., & Gondra, I. (2012). A mean shift-based initialization method for k-means. Computer and Information Technology (CIT), 12th IEEE International Conference on, 579-586. doi:10.1109/CIT.2012.124

Dakua, & Prasad, S. (2013). Use of chaos concept in medical image segmentation. Journal of Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1(1), 28–36.

Dass, R., Priyanka, & Devi, S. (2012). Image segmentation techniques. International Journal of Electronics and Communication Technology, 3(1), 66–70.


Dhanalakshemi, P., & Kanimozhi, T. (2013). Automatic segmentation of brain tumor using k means clustering and its area calculation. International Journal of Advanced Electrical and Electronics Engineering, 2(2), 130–134.

Fazli, S., & Ghiri, S. F. (2014). A Novel Fuzzy C-Means Clustering with Hybrid Local and Non Local Spatial Information for Brain Magnetic Resonance Image Segmentation. Journal of Applications and Engineering, 2(4), 40–46.

Gaber, T., Ismail, G., Anter, A., Soliman, M., Ali, M., Semary, N., ... Snasel, V. (2015, August). Thermogram breast cancer prediction approach based on Neutrosophic sets and fuzzy c-means algorithm. In 2015 37th IEEE Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (pp. 4254-4257). doi:10.1109/EMBC.2015.7319334

Gaber, T., Tharwat, A., Hassanien, A. E., & Snasel, V. (2016). Biometric cattle identification approach based on Webers Local Descriptor and AdaBoost classifier. Computers and Electronics in Agriculture, 122, 55–66. doi:10.1016/j.compag.2015.12.022

Ghamisi, P., Couceiro, M. S., Benediktsson, J. A., & Ferreira, M. N. F. (2012). An efficient method for segmentation of images based on fractional calculus and natural selection. International Journal of Expert Systems with Applications, 39(16), 12407–12417. doi:10.1016/j.eswa.2012.04.078

Ghamisi, P., Couceiro, M. S., Fauvel, M., & Benediktsson, J. A. (2014). Integration of Segmentation Techniques for Classification of Hyperspectral Images. IEEE Geoscience and Remote Sensing Letters, 11(1), 342–346. doi:10.1109/LGRS.2013.2257675

Jiang, J., Wu, Y., Huang, M., Yang, W., Chen, W., & Feng, Q. (2013). 3D brain tumor segmentation in multimodal MR images based on learning population- and patient-specific feature sets. Journal of Computerized Medical Imaging and Graphics, 1-10.

Joseph, R. P., Singh, C. S., & Manikandan, M. (2014). Brain tumor MRI image segmentation and detection in image processing. International Journal of Research and Engineering Technology, 3(1), 1–5.

Jumb, V., Sohani, M., & Shrivas, A. (2014). Color Image Segmentation Using K-Means Clustering and Otsu's Adaptive Thresholding. International Journal of Innovative Technology and Exploring Engineering, 3(9), 72–76.

Kass, M., Witkin, A., & Terzopoulos, D. (1988). Snakes: Active Contour Models. International Journal of Computer Vision, 1(4), 321–331. doi:10.1007/BF00133570

Kumar, K. S., & Sivasangareswari, P. (2014). Fuzzy C-Means Clustering with Local Information and Kernel Metric for Image Segmentation. International Journal of Advanced Research in Computer Science and Technology, 2(1), 95–99.

Kumar, S. S., Jeyakumar, A. E., Vijeyakumar, K. N., & Joel, N. K. (2014). An adaptive threshold intensity range filter for removal of random value impulse noise in digital images. Journal of Theoretical and Applied Information Technology, 59(1), 103-112.

Lee, G. P. (2005). Robust image segmentation using active contours: Level set approaches. (Ph.D.). North Carolina State University.


Leela, G. A., & Kumari, H. M. V. (2014). Morphological Approach for the Detection of Brain Tumour and Cancer Cells. Journal of Electronics and Communication Engineering Research, 2(1), 7–12.

Madhulatha, T. S. (2012). An overview on clustering methods. Journal of Engineering, 2(4), 719–725.

Medical Image Processing Analysis and Visualization. (2014). Retrieved from http://mipav.cit.nih.gov/pubwiki/index.php/Extract_Brain:_Extract_Brain_Surface_

Mohan, P., AL, V., Shyamala, B. R., & Kavitha, B. (2013). Intelligent Based Brain Tumor Detection Using ACO. International Journal of Innovative Research in Computer and Communication Engineering, 1(9), 2143–2150.

Narkhede, H. P. (2013). Review of Image Segmentation Techniques. International Journal of Science and Modern Engineering, 1(8), 54–61.

Neshat, M., Yazdi, S. F., Yazdani, D., & Sargolzaei, M. (2012). A New Cooperative Algorithm Based on PSO and K-Means for Data Clustering. Journal of Computer Science, 8(2), 188–194. doi:10.3844/jcssp.2012.188.194

Patel, J., & Doshi, K. (2014). A Study of Segmentation Methods for Detection of Tumor in Brain MRI. Advance in Electronic and Electric Engineering, 4(3), 279–284.

Patil, D. D., & Deore, S. G. (2013). Medical Image Segmentation: A Review. International Journal of Computer Science and Mobile Computing, 2(1), 22–27.

Rodrigues, I., Sanches, J., & Dias, J. (2008). Denoising of Medical Images corrupted by Poisson Noise. Image Processing ICIP, 15th IEEE International Conference, 1756-1759. doi:10.1109/ICIP.2008.4712115

Saini, R., & Dutta, M. (2012). Image Segmentation for Uneven Lighting Images using Adaptive Thresholding and Dynamic Window based on Incremental Window Growing Approach. International Journal of Computers and Applications, 56(13), 31–36. doi:10.5120/8954-3140

Tharwat, A., Gaber, T., & Hassanien, A. E. (2015). Two biometric approaches for cattle identification based on features and classifiers fusion. International Journal of Image Mining, 1(4), 342–365. doi:10.1504/IJIM.2015.073902

Wang, X., Guo, Y., & Liu, G. (2013). Self-adaptive Particle Swarm Optimization Algorithm with Mutation Operation based on K-means. Advanced Materials Research, 2nd International Conference on Computer Science and Electronics Engineering, 2194-2198.

Yang, F., Sun, T., & Zhang, C. (2009). An efficient hybrid data clustering method based on K-harmonic means and Particle Swarm Optimization. International Journal of Expert Systems with Applications, 36(6), 9847–9852. doi:10.1016/j.eswa.2009.02.003


Chapter 7

Localization and Mapping for Indoor Navigation: Survey

Heba Gaber Ain-Shams University, Egypt

Safaa Amin Ain-Shams University, Egypt

Mohamed Marey Ain-Shams University, Egypt

Mohamed F. Tolba Ain-Shams University, Egypt

ABSTRACT

Mapping and exploration for the purpose of navigation in unknown or partially unknown environments is a challenging problem, especially in indoor environments where GPS signals cannot give the required accuracy. This chapter discusses the main aspects of designing a Simultaneous Localization and Mapping (SLAM) system architecture with the ability to function in situations where map information or current positions are initially unknown or partially unknown and where environment modifications are possible. Achieving this capability makes these systems significantly more autonomous and ideal for a large range of applications, especially indoor navigation for humans and for robotic missions. This chapter surveys the existing algorithms and technologies used for localization and mapping and highlights the use of SLAM algorithms for indoor navigation. The approach proposed for the current research is also presented.

INTRODUCTION

Navigation includes two main subjects, outdoor navigation and indoor navigation, where data is the main ingredient for navigation and route planning. The outdoor navigation problem can be solved by using systems with GPS support, where a wide variety of data sources is already available from a mix of local and global data providers; the main spatial data providers are Navteq, TeleAtlas, and Google. Indoor navigation is a broad topic covering a large spectrum of different algorithms, technologies, and applications. In order to build a coherent working framework suitable for navigation, Environment Exploration, Modeling, Perception, Localization, Mapping, Path Planning, and Path Execution algorithms are all needed.


The problem with the current indoor data sources is the huge diversity in data structure, completeness, availability, data coverage, level of detail, linkage to the outdoor networks, and geocoding. Most of the existing navigation systems require a priori knowledge of the environment or modification of the environment by adding artificial infrastructure and landmarks (e.g., Radio Frequency Identification (RFID) tags, Bluetooth beacons, Quick Response (QR) codes). Also, building an accurate environment model requires huge storage on the mobile agent's device or a high network load for data exchange. Finding solutions for consistent localization and mapping that allow precise and robust localization in dynamic, real-world environments is a very challenging research problem. Concise map building and map updating that take into account the agent's limited computation capabilities and allow it to plan its navigation path autonomously and smoothly in a dynamic environment are also a big challenge. In this chapter, localization and mapping algorithms and technologies are studied in terms of accuracy, cost, potential for widespread adoption, and adaptability to environmental changes. The main challenge discussed is how to build a scalable, dynamic, and cheap system for the purpose of indoor navigation. The authors' approach is thus to leverage the advances in SLAM algorithms to build an optimized environment model that dynamically represents the environment and is easily adaptable to environment changes without the need for additional environment infrastructure. Moreover, the authors mainly focus on environment representations that minimize both the required storage and the network load and emphasize building environment models that represent the necessary environment features. The chapter is organized as follows. In the beginning, a basic background on SLAM, which is considered a basis for building an environment model for indoor navigation, is given. Afterwards, different technologies and algorithms for localization are presented. Next, different approaches for environment mapping and representation are explained. Then the focus moves to existing vision-based navigation systems. After that, location-based services based on context awareness are demonstrated, followed by insights into the standardization efforts and some industrial projects related to indoor navigation. Afterwards, the main aspects of evaluating a successful navigation system are discussed. Then the proposed research approach and the future research directions are shown, and finally the conclusions of the current research are discussed.

BACKGROUND

SLAM addresses the problem of acquiring a spatial map of the environment while simultaneously localizing the mobile agent relative to this model (Thrun S., 2008). SLAM and navigation techniques have been covered in many publications; most of them have been applied to autonomous wheeled mobile robots, and some have been applied to smartphones and personal PDAs for human indoor navigation applications that support people with disabilities, firefighting teams and people navigating complex and large buildings. SLAM can be considered the basic step for building an optimized and dynamic indoor navigation system when there is no prior knowledge about the environment. The environment mapping procedure gradually discovers the environment while moving. Objects and regions perceived from a given location must be related to the previously perceived ones and integrated with them in a consistent manner; the result of this integration is a map representing the layout of the environment or parts of it. Localization is necessary to correctly relate the newly perceived areas with the already known ones.


Hence, incremental environment mapping and localization appear to be two intimately related processes. Solving this problem requires identifying the same environment elements perceived from different positions. Sensors are always imperfect; their data are incomplete, noisy and inaccurate. Environment representations should therefore explicitly take uncertainties into account to solve the data association problem, i.e., the ability to recognize the same features from different perceptions (Chatila, 2008; Thrun S., 2002). There are many paradigms and algorithms for SLAM; the method chosen by the practitioner will depend on a number of factors such as the desired map resolution, the update time and the nature of the features in the map. There are numerous practical uses for SLAM as a means of gaining knowledge in areas humans cannot access. SLAM must deal with the uncertainty of locations resulting from inaccurate sensor data. As discussed in (Naminski, 2013), the SLAM process can be divided into five repetitive steps:

Step 1: Odometry Readings and Location Prediction. Information about the current position is gathered and stored in the control vector. The odometer gives information about changes in orientation and in the distance travelled since the last state, which can be used to predict the current location within the map and the direction the agent faces.

Step 2: Sensor Readings and Data Association. Readings on the visible landmarks are gathered from the current position. This information gives the range and bearing of the landmarks relative to the location. These values are then used in conjunction with the stored positions of the landmarks to estimate the position in the map through triangulation, and in the middle of this step the process of data association is performed. This is the process by which the system attempts to associate the landmarks currently visible with landmarks already observed from previous positions. The problem is that a single incorrect data association can induce divergence into the map estimate, often causing catastrophic failure of the localization algorithm.

Step 3: Location Correction. A new position estimate is determined. The probability distribution for the location is formed from the two estimated locations, from the odometer and from the landmark triangulation, and a new estimate is calculated as a combination of the two values. The formula used to combine these two values depends on the algorithm selected for solving SLAM.

Step 4: Landmark Location Updates. The estimated locations of the landmarks are updated, using the range and bearing data gathered through the sensors and the estimate of the current position. Correlations between landmark locations are also gathered and updated, and the map of the environment is updated at this step.

Step 5: Add New Landmarks. New landmarks may be observed from the new position. These landmarks are identified and their estimated locations are added to the list of landmark locations, together with the correlations between the new landmarks and the other visible landmarks. This allows for the exploration of new parts of the environment.

Assuming a static world, the classical formulation of the SLAM problem requires that the probability distribution be computed for all times $k$. This probability distribution describes the joint posterior density of the vehicle position $x_k$ at time $k$ and the landmark locations $m$ (i.e.,
the map), given the history of observations $Z_k$ up to time $k$, the history of motion commands $U_k$ up to time $k$ and the initial pose $x_0$, as shown in Equation (1).


$p(x_k, m \mid Z_k, U_k, x_0)$   (1)

The observation model describes the probability of making an observation $z_k$ when the vehicle location and landmark locations are known. It is assumed that, once the vehicle location and map are defined, observations are conditionally independent given the map and the current vehicle state. The observation model is generally described by Equation (2).

$p(z_k \mid x_k, m)$   (2)

The motion model for the vehicle is described in terms of a state transition probability distribution, given in Equation (3).

$p(x_k \mid x_{k-1}, u_k)$   (3)

That is, the state transition is assumed to be a Markov process in which the next state $x_k$ depends only on the immediately preceding state $x_{k-1}$ and the applied control $u_k$, and is independent of both the observations and the map. Solutions to the SLAM problem can be distinguished along many different dimensions. The three main paradigms for the SLAM problem are the Kalman filter, particle filters and graph-based SLAM (Thrun S., 2008). Moreover, SLAM can be parallelized between multiple navigating agents (Roweis & Salakhutdinov, 2005). Agents communicate information about their part of the map to other agents, and various kinds of sensor data collected from laser range finders, ultrasonic sensors, vision sensors, etc. can be consolidated in an integrated system model.
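To make the prediction/correction structure behind Equations (1)-(3) concrete, the following minimal Python sketch implements a discrete Bayes filter over a one-dimensional corridor. It is an illustration of the recursion only, not the chapter's algorithm: the map (the landmark cells) is assumed known, so only the localization part of the SLAM recursion is shown, and the corridor size, landmark positions and noise probabilities are invented values.

```python
import numpy as np

# Minimal sketch (assumed values, not from the chapter): discrete Bayes filter
# over a 1-D corridor, illustrating the motion/observation models of Eqs. (2)-(3).
N_CELLS = 20                      # corridor discretized into 20 cells
landmarks = {3, 9, 15}            # hypothetical known landmark positions (the map m)

def predict(belief, u, p_move=0.8):
    """Motion model p(x_k | x_{k-1}, u_k): move u cells, succeeding with p_move."""
    new_belief = np.zeros_like(belief)
    for x_prev, p in enumerate(belief):
        new_belief[(x_prev + u) % N_CELLS] += p_move * p   # intended motion
        new_belief[x_prev] += (1.0 - p_move) * p           # wheel slip: stay put
    return new_belief

def correct(belief, z, p_hit=0.9):
    """Observation model p(z_k | x_k, m): z is True if a landmark is sensed."""
    likelihood = np.array([
        p_hit if (x in landmarks) == z else 1.0 - p_hit
        for x in range(N_CELLS)
    ])
    posterior = likelihood * belief
    return posterior / posterior.sum()

# Uniform prior over x_0, then one prediction/correction cycle per time step.
belief = np.full(N_CELLS, 1.0 / N_CELLS)
for u, z in [(1, False), (1, False), (1, True)]:   # odometry command, sensor reading
    belief = predict(belief, u)                     # Step 1: odometry-based prediction
    belief = correct(belief, z)                     # Steps 2-3: sensing and correction

print("most likely cell:", int(np.argmax(belief)))
```

In a full SLAM solution the state would also contain the landmark locations (Steps 4-5), which is what the Kalman-filter, particle-filter and graph-based paradigms handle in different ways.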

LOCALIZATION TECHNOLOGIES AND ALGORITHMS

A localization system is used to locate and track an object within buildings or other closed environments. There are several techniques for localization, including incremental motion integration (e.g., through odometry or inertial units) and recognizing known features in the environment and using them as landmarks; the latter solution requires knowing the environment map (the landmark positions). Localization systems consist of localization technologies that provide the data used by localization algorithms.

Localization Technologies

From the technology perspective, positioning technologies can be classified into active positioning systems and passive positioning systems. Table 1 shows a comparison between different localization technologies.

• Passive Localization Systems: Position is estimated by measuring the received signal or by video processing. Most passive solutions are based on triangulation and multi-point positioning technology. They use light, ultrasonic or radio signals to represent the position information of objects. These technologies include WLAN/Wi-Fi, RFID and Bluetooth (Zhu, Yang, Wu, & Liu, 2014). Most of them depend on a set of measurements captured by the receivers, such as arrival time, signal strength and direction. In addition, vision based systems depend on visual sensors for positioning and navigation. Examples of passive localization technologies are:
◦◦ Wi-Fi: A local area wireless technology that allows an electronic device to participate in computer networking using the 2.4 GHz UHF and 5 GHz SHF ISM radio bands. Most buildings, such as shopping malls or office buildings, have already deployed Wi-Fi hotspots that provide whole-building coverage as network access points, and most commercial products, like phones, laptops and tablets, support Wi-Fi. For Wi-Fi based indoor localization systems, achieving room level accuracy requires four to five Wi-Fi hotspots. Wi-Fi indoor positioning is typically based on triangulation, where the distance to a known base station is measured by either time of flight (TOF) or signal strength (a simple sketch for converting signal strength to an approximate distance is given after this list). A detailed survey of wireless techniques can be found in (Liu & Liu, 2007). Localization solutions based on Wi-Fi include (WifFarer, 2015) and (InfSoft, 2015).
◦◦ RFID: The wireless use of electromagnetic fields to transfer data, for the purposes of automatically identifying and tracking tags attached to objects. The tags contain electronically stored information; some tags are powered by electromagnetic induction from magnetic fields produced near the reader. The reader uses radio frequency electromagnetic fields to read the data in the tag and obtain the identification of the object the tag is attached to. A tag can either have a battery or not, which makes it an active tag or a passive tag. Passive tags can be very cheap and have a long lifetime, which is ideal for cost sensitive scenarios. Active tags can store up to 128 KB, while passive tags typically store less than 128 bytes. In (Gandhi, 2014), RFID based indoor navigation systems, including support for evacuating buildings with hazards, are explained, and another application is presented to help visually impaired or blind people find their way without assistance through audio directions. While passive tags are much less expensive, they have much shorter ranges and can store much less data, which increases their installation cost. RFID positioning systems for elderly people are also demonstrated in (Tsirmpas, Rompas, Fokou, & Koutsouris, 2014). RFID tags can be used to store an identifier, or location information may be embedded in the tag itself.
◦◦ Bluetooth: Bluetooth is widely used for short distance communication, as in earphones or cell phones. Bluetooth conserves power by using very low transmission power; therefore, the coverage of Bluetooth is shorter than that of Wi-Fi and other WLAN technologies. Localization systems based on Bluetooth are explained in detail in (Subhan, Hasbullah, Rozyyev, & Bakhsh, 2011) and (Iglesias, Barral, & Escudero, 2012).
◦◦ UWB (Ultra-Wide Band): UWB uses sub-nanosecond radio pulses to transmit data over a wide bandwidth (normally greater than 500 MHz). Its transmission can be regarded as background noise by other wireless technologies; hence, in theory it can use any spectrum without interfering with other users. It uses a small transmission power of -41.4 dBm/MHz (limited by the FCC), which means that the power consumption is low. Another theoretical advantage of UWB is its immunity to multi-path problems (Chang, 2004).


◦◦ Cellular-Based: A number of systems have used Global System for Mobile Communications / Code Division Multiple Access (GSM/CDMA) cellular networks to estimate the location of outdoor mobile clients. However, the accuracy of methods using the cell ID or enhanced observed time difference (E-OTD) is generally low (in the range of 50-200 m), depending on the cell size. The accuracy is higher in densely covered areas (e.g., urban places) and much lower in rural environments. Indoor positioning based on mobile cellular networks is possible if the building is covered by several base stations, or by one base station whose strong RSS is received by indoor mobile clients. In (Otsason, Varshavsky, LaMarca, & Lara, 2005), a GSM indoor localization system that achieves a median accuracy of 5 meters in large multi-floor buildings is presented. The key idea that makes accurate GSM-based indoor localization possible is the use of wide signal strength fingerprints.
◦◦ Infrared (IR): IR localization uses IR transmitters installed at known positions, where each transmitter broadcasts a unique ID in a cone shaped region. The user carries an IR receiver that picks up data from IR transmitters in range. In some systems, transmitters not only broadcast the position of the user but also provide information about the environment and graphical walking directions. Locating identifiers may be hard, as IR requires line of sight due to its narrow transmission angle. A further drawback of IR is that natural and artificial light can interfere with it, and IR systems are costly to install due to the large number of tags that need to be installed (Fallaha, Apostolopoulosa, Bekrisb, & Folmera, 2013).
◦◦ Quick Response (QR) Code Scanning: QR codes are two-dimensional codes in which the data is encoded in an optically readable format. Localization with QR codes is implemented by placing a small label containing the identifier of the location on the wall. This content is read by a content parser module, which parses it to find the location details. In (N. Gule & R. N. Devikar, 2014), a QR based navigation system using a smartphone camera is presented; the proposed system is fed with the map, and the phone sends the location details decoded from the QR code whenever needed.
◦◦ Vision Based Localization: Vision sensing has emerged as a popular alternative, where cameras can be used to reduce the overall costs while maintaining a high degree of intelligence, flexibility and robustness. Recent advancements in computer technology have led to the development of computer or machine vision that tries to mimic vision in humans and many living creatures such as insects (Chatterjee, Rakshit, & Singh, 2013). Attempts have been made to incorporate machine vision and visual features, using artificial intelligence, into unmanned aerial vehicles (UAVs) and other machines for navigation in natural and changing environments.
• Active Positioning Systems: In an active system, the tracked target carries electronic devices or tags that transmit a signal to a management station, which calculates the position using a positioning algorithm based on certain signal parameters such as TOA (Time of Arrival), TDOA (Time Difference of Arrival), RSS (Received Signal Strength) and AOA (Angle of Arrival) (Zhang, Xia, Yang, Yao, & Zhao, 2010). Active technologies include the Inertial Navigation System (INS), magnetic field positioning technology and ultrasonic positioning. Examples of active positioning systems are:
◦◦ Inertial Navigation (IN): A localization technique in which the values obtained by inertial sensors (accelerometers and gyroscopes) are used to estimate the location and orientation without requiring external references. In (Terraa, Figueiredoa, Barbosaa, & Anacletoa, 2014), an ankle mounted INS is used to estimate the distance traveled by a pedestrian. The use of an INS for estimating the successive displacements, in conjunction with the Dead Reckoning (DR) technique, allows estimation of the current location based on an initial one. The distance is estimated from the number of steps taken by the user, and the proposed method uses force sensors to enhance the results obtained from the INS.
◦◦ Magnetic Field Positioning: Geomagnetic fluctuations can be used to estimate the position of a mobile target. Engineers in Finland designed a smartphone application named IndoorAtlas (FinlandTeam, 2014). Indoor concrete and steel structures create specific distortions of the Earth's magnetic field, so the non-uniform magnetic field environment leads to different observed field values along different paths; IndoorAtlas uses this fact to determine the position of mobile targets. The engineers took inspiration from homing pigeons, which use the Earth's magnetic field for orientation, and applied digital signal processing technology to develop new indoor navigation solutions. Users only need to upload the indoor layout to IndoorAtlas; the application is then used to record the target's geomagnetic field in different directions for navigation. This geomagnetic data is recorded and uploaded to IndoorAtlas, and other people can use the recorded data to calculate indoor positions, with an accuracy that can reach 0.1 m to 2 m. This smartphone application is a software-only positioning system that requires neither additional wireless access points nor other external hardware.
◦◦ Ultrasonic Systems: Ultrasonic sensor networks can provide high update rates, high accuracy and robustness through redundancy in the sensors. Each sensor is equipped with an ultrasonic receiver, some computing power and a wireless link to a central computer. A pulse emitted from the transmitter on the flying robot is received by some of the sensors, which report the pulse arrival time to the central computer. (Priyantha, Chakraborty, & Balakrishnan, 2000) presents a localization system where the ultrasound emitter is carried by the user and receivers are installed in the environment, so the user's location is determined centrally. A disadvantage of ultrasound is that walls may reflect or block ultrasound signals, which results in less accurate localization. The other drawback of using ultrasound for localization is the requirement of line of sight between the receivers and beacons.
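The Wi-Fi entry above notes that the distance to a base station can be derived from signal strength. A common way of doing this, not detailed in the chapter, is the log-distance path-loss model; the short Python sketch below illustrates the idea. The calibration values (reference RSS at 1 m and path-loss exponent) are invented, so this should be read as an assumption-laden example rather than a prescribed method.

```python
import math  # not strictly needed here, kept for clarity if a log form is preferred

# Sketch of converting a Wi-Fi RSS reading to an approximate distance using the
# standard log-distance path-loss model. The chapter only states that distance
# can be derived from signal strength; the model form and the parameters below
# are illustrative assumptions, not values from the text.

def rss_to_distance(rss_dbm, rss_at_1m_dbm=-40.0, path_loss_exponent=2.5):
    """Invert RSS(d) = RSS(1 m) - 10*n*log10(d) to estimate d in metres."""
    return 10 ** ((rss_at_1m_dbm - rss_dbm) / (10.0 * path_loss_exponent))

# Example: a reading of -67 dBm from an access point calibrated at -40 dBm @ 1 m.
print(round(rss_to_distance(-67.0), 1), "m")   # roughly 12 m under these assumptions
```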

Localization Algorithms

All navigation systems must include a basic form of localization, i.e., the determination of a user's position and/or orientation. Localization methods can be grouped into four different techniques: (1) dead reckoning, (2) direct sensing, (3) triangulation and (4) pattern recognition, which are discussed below.

1. Dead Reckoning: Dead reckoning localization techniques estimate a user's location based on a previously estimated or known position. While the user is moving, the dead reckoning system estimates the user's location through the aggregation of odometry readings, which can be acquired through a combination of sensors such as accelerometers, magnetometers, compasses and gyroscopes, or using a user-specific walking pattern (such as the user's average walking speed) (Fallaha, Apostolopoulosa, Bekrisb, & Folmera, 2013). In (Terraa, Figueiredoa, Barbosaa, & Anacletoa, 2014), dead reckoning with an INS for indoor localization based on the user's walking steps is shown. Since the location estimation is a recursive process, inaccuracy in location estimation results in errors that accumulate over time. The accumulated error can be corrected using environmental knowledge, RFID tags, ultrasound beacons and map matching. The inaccuracy of dead reckoning and the need to combine it with other localization techniques are the main drawbacks of this method. If a system uses RFID for error correction, it has all the disadvantages of RFID localization, such as changes to the infrastructure and the need for users to carry an RFID reader. If map matching or landmarks are used for error correction, some previous knowledge of the environment is required, which might be costly to prepare.


Table 1. Localization technologies

| Technology | Accuracy | References | Cost/Power | Env. Change Needed | Type | Comment |
|---|---|---|---|---|---|---|
| Wi-Fi | 1-5 m | (Liu & Liu, 2007) | High | Yes | Passive | High cost; low accuracy |
| RFID | 10 cm-1 m | (Gandhi, 2014), (Tsirmpas, Rompas, Fokou, & Koutsouris, 2014) | Medium | Yes | Active/Passive | RFID tags are inexpensive but installing them in large environments is costly; hard to update |
| Bluetooth | 20 cm-5 m | (Tsirmpas, Rompas, Fokou, & Koutsouris, 2014), (Iglesias, Barral, & Escudero, 2012) | Low | Yes | Passive | High cost; low power consumption; depends on the number of beacons installed |
| UWB | 0.5-1 m | (Chang, 2004) | Low | Yes | Passive | Low power consumption; high cost for the required emitters/receivers |
| Cellular Based | 50-200 m | (Otsason, Varshavsky, LaMarca, & Lara, 2005) | Low | No | Passive | Suits outdoor localization; low accuracy indoors |
| Infrared | 1-2 m | (Fallaha, Apostolopoulosa, Bekrisb, & Folmera, 2013) | High | Yes | Passive | Costly to install due to the large number of tags that need to be installed |
| QR code | 1 m | (N. Gule & R. N. Devikar, 2014) | Low | Yes | Passive | Depends on the number of codes installed in the environment |
| Vision Based | Relative to the sensors/algorithms deployed | (Chatterjee, Rakshit, & Singh, 2013) | Low | No | Passive | Computation costs depend on the landmarks |
| Inertial Navigation | | (Terraa, Figueiredoa, Barbosaa, & Anacletoa, 2014) | High | No | Active | Problems with error accumulation |
| Magnetic Field | | (FinlandTeam, 2014) | Low | No | Active | Cannot be used as a standalone technique; must be combined with other localization technologies |
| Ultrasonic | 10 cm-1 m | (Priyantha, Chakraborty, & Balakrishnan, 2000) | Low | No | Active | Walls may reflect or block ultrasound signals, resulting in less accurate localization |
2. Direct Sensing: Direct sensing based localization methods determine the location of the user by sensing identifiers or tags that have been installed in the environment. Two different approaches exist with regard to determining the user's location: (1) location information and information on the user's environment is stored in the tag itself; or (2) this information is retrieved from a database using the tag's unique identifier


(Fallaha, Apostolopoulosa, Bekrisb, & Folmera, 2013). Examples of the technologies used for direct sensing are RFID, IR, ultrasound, Bluetooth beacons and barcodes.

3. Triangulation: Triangulation uses the geometric properties of triangles to estimate the target location. It has two derivations: lateration and angulation.
a. Trilateration Using Time of Arrival: Lateration estimates the position of an object by measuring its distances from multiple reference points; these are also called range measurement techniques. The principle of TOA is to measure the propagation time between the mobile target and at least three known beacon nodes and multiply it by the signal speed to calculate the distances between the target and the beacon nodes. Then, taking the beacon nodes as the centers of circles and the distances between the target and the beacon nodes as the radii, the intersection of the circles gives the coordinates of the mobile target, as illustrated in Figure 1-(a) (Zhu, Yang, Wu, & Liu, 2014). A minimal trilateration sketch is given at the end of this section.
b. Triangulation Using Angle of Arrival: In range-based methods, instead of measuring the distance directly, the received signal strength (RSS), time of arrival (TOA) or time difference of arrival (TDOA) is usually measured, and the distance is derived by computing the attenuation of the emitted signal strength or by multiplying the radio signal velocity by the travel time; roundtrip time of flight (RTOF) or the received signal phase method is also used for range estimation in some systems. Angulation, in contrast, locates an object by computing angles relative to multiple reference points (Liu & Liu, 2007). As shown in Figure 1-(b), the estimated location is calculated from the angles formed by two reference points and the target node (Disha, 2013).

4. Pattern Recognition: Pattern recognition based localization methods use data from one or more sensors carried or worn by the user and compare this perceived data with a set of previously collected raw sensor data that has been coupled with an environment map. This map of sensor data can be created by sampling different locations or by creating it manually. Most human navigation systems use a combination of different sensing techniques (Fallaha, Apostolopoulosa, Bekrisb, & Folmera, 2013):
a. Computer Vision Based Localization Techniques: These require the user to either carry a camera or use a camera embedded in a handheld device such as a cell phone. While the moving agent navigates in an environment, the camera captures images of the environment, and by matching those images against a database of images with known locations, the user's position and orientation can be determined. A disadvantage of this technique is the high storage capacity required for storing the images that are coupled with the environment map. Significant computing power may be required to perform the image matching, which may be challenging to implement on a handheld device, and users are often required to carry supporting computing equipment, which may impede their mobility.
b. Signal Distribution or Fingerprinting Localization Techniques: These compare the unique signal data from one or more external sources sensed at a particular location with a map of prerecorded data. This technique requires a training phase, in which the received signal strength at different locations is acquired and stored in a database to create a map (a small fingerprint matching sketch is given at the end of this section).
In the next phase, when the user is navigating, the received signal strength or its distribution over time is measured and compared with the map to find the closest match.


Figure 1. (a) Trilateration: Time of Arrival (TOA), (b) Triangulation: Angle of Arrival (AOA) Source: (Liu & Liu, 2007).

The signal strength from WLAN (Wireless Local Area Network) access points is an example of signal distribution localization. An advantage of WLAN signal localization is the relatively small number of base stations required for localizing the user. Due to the increased prevalence of wireless networks in indoor environments, often no investment in infrastructure is required, as existing base stations can be used. Other signal distribution localization techniques typically rely on a combination of low cost sensors such as an accelerometer, a magnetometer (measuring the strength and direction of a magnetic field), and temperature and light sensors. Creating a map for a multitude of sensors is often time consuming; furthermore, the map may not be reliable, as some signals, such as temperature and light, may be subject to daily or seasonal fluctuations.
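The following Python sketch illustrates the TOA/lateration idea from item 3(a): given measured distances to at least three beacons at known positions, the target position is recovered as the least-squares intersection of the corresponding circles. The beacon coordinates and ranges are invented for the example, and the linearized solver is one common choice rather than the specific method of any system surveyed here.

```python
import numpy as np

# Minimal trilateration sketch (illustrative values, not from the chapter):
# each beacon i at (x_i, y_i) with measured range d_i defines a circle; the
# target position is the least-squares intersection of those circles.

def trilaterate(beacons, distances):
    """Linearized least-squares solution; beacons is an (n, 2) array, n >= 3."""
    beacons = np.asarray(beacons, dtype=float)
    d = np.asarray(distances, dtype=float)
    x0, y0, d0 = beacons[0, 0], beacons[0, 1], d[0]
    # Subtracting the first circle equation from the others gives linear equations.
    A = 2.0 * (beacons[1:] - beacons[0])
    b = (d0**2 - d[1:]**2
         + beacons[1:, 0]**2 - x0**2
         + beacons[1:, 1]**2 - y0**2)
    position, *_ = np.linalg.lstsq(A, b, rcond=None)
    return position

beacons = [(0.0, 0.0), (10.0, 0.0), (0.0, 8.0)]        # known anchor positions (m)
true_pos = np.array([4.0, 3.0])
ranges = [np.linalg.norm(true_pos - np.array(b)) for b in beacons]
print(trilaterate(beacons, ranges))                     # ~ [4. 3.]
```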
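Similarly, the fingerprinting approach of item 4(b) can be sketched as a nearest-neighbour match in signal space between a live RSS vector and an offline radio map. The access point layout, location labels and RSS values below are invented; real systems typically use much denser radio maps and more robust matching than this plain nearest-neighbour rule.

```python
import numpy as np

# Sketch of the fingerprinting idea: an offline radio map of RSS vectors is
# recorded at known locations; online, the live RSS vector is matched to the
# closest stored fingerprint. All names and numbers here are made up.

radio_map = {                      # location label -> RSS from (AP1, AP2, AP3) in dBm
    "room_101": np.array([-45.0, -70.0, -80.0]),
    "room_102": np.array([-60.0, -55.0, -75.0]),
    "corridor": np.array([-70.0, -65.0, -60.0]),
}

def locate(live_rss):
    """Return the radio-map location whose fingerprint is closest in signal space."""
    return min(radio_map,
               key=lambda loc: np.linalg.norm(radio_map[loc] - live_rss))

print(locate(np.array([-58.0, -57.0, -74.0])))   # -> "room_102"
```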

Environment Mapping/Representation

For successful navigation, both a geometric model of a given space region and the relationships between regions of space, used to decide on the general roadmap to follow, are needed. Hence, geometric and topological models are both useful and complementary. In addition, semantics, which defines the nature of objects or of space regions, is an important kind of knowledge. Navigation systems require storing and retrieving different types of information. The stored information can be used for localization, path planning, generating directions and providing location information. Depending on the approach employed by the system, this information may include floor plans, the location and description of objects in the indoor environment, the locations of identifier tags, or data collected using sensors. It can be a simple two-dimensional (2D) map of the environment representing walls and doors with room numbers, digital road maps, or a graph of accessible paths with the associated cost for each link. Several mapping approaches used to model the representation of the environment are demonstrated below:

• Grid Based/Metric Map: Metric maps can be considered as discrete, two-dimensional occupancy grids. Grid representations capture the presence, or the probability of presence, of objects in space areas organized as preset grid cells. Even though this kind of representation is easy to construct, it requires grid cells to be recognized as such for updating. The most commonly used variants are regular grids, quadtrees and octrees. Regular grids are also called occupancy grids, because each element in the grid holds a value representing whether the corresponding location in space is occupied or empty; occupancy grids have been implemented successfully in various systems. In a metric map, poses are defined by position and orientation $[x\ y\ \theta]^T$. (A small sketch combining a grid map with a topological graph is given after this list.)


• Topological Maps: These are also called landmark-based approaches. A topological map represents the world as a graph with a set of places (nodes) and the relative location information between the places (edges). In comparison to metric approaches, topological maps provide a more compact representation that scales better with the size of the environment. Since humans have a mostly topological understanding of the surrounding environment, it is of great help when a similar representation can be provided. This capability can be called environment understanding, which implies the process of environment modeling, leading to the construction of an environment model and a quantitative evaluation of the suitability of places as reference positions. From a scientific point of view, topological maps (Thrun S., 2002) rely on a higher level of representation than metric mapping, allowing for semantically transparent planning and navigation; at the same time, visual topological navigation is biologically more plausible, as it is closer to animal behavior, providing a proof of concept. Topological modeling is presented in Figure 2 (Fraundorfer, Engels, & Nister, 2007).
• Hybrid Approach: Many researchers have noted that, for the purposes of navigation, a globally consistent metric map is not a necessity; rather, over larger spaces, topological connections with rough metric information suffice for planning, while local metric information can be used for more precise localization and obstacle avoidance. In the hybrid approach, a topological map connects local metric maps in order to avoid the requirement of global metric consistency. This allows for a compact environment model that permits both precision and robustness, and allows the handling of loops in the environment during automatic mapping by means of the information from the multi-modal topological localization (Tomatis, 2008). An example of a hybrid environment representation is shown in Figure 3.
• Multi-Layered Maps: Some of the existing work in environment representation introduces a multi-layered environment approach, where each layer represents a level of detail or different properties of the environment, such as occupancy, traversability or elevation.
◦◦ In (Nieto, 2005), a framework that combines feature maps with other dense metric sensory information is presented. The global feature map is partitioned into a set of connected Local Triangular Regions (LTRs), which provide a reference for a detailed multi-layer description of the environment.
◦◦ In (Perez-Lorenzo, Vazquez-Martin, Antunez, & Bandera, 2009), a dual graph data structure and the maximal independent edge set (MIES) decimation process are used to integrate the grid based and topological paradigms. The dual graph makes it possible to preserve the topology of the metric map and to correctly encode the relations of adjacency and inclusion between topological regions, integrating the topological and metric maps into the same framework through a dual graph pyramid. The dual graph pyramid is built over the metric map; each level of the pyramid can be seen as a topological map at a different level of resolution, and each node of the topological map has an associated region of the metric map.




• Cognitive Maps: Cognitive mapping (Yeap & Jefferies, 2000) is concerned with the process by which humans create a mental map of the environment. While navigating, many decisions need to be made based on previous knowledge of the environment that is stored in the cognitive map. While different senses may be used for navigation, vision is the most effective way of creating a cognitive map, as great levels of detail about the environment can be acquired in a relatively short time. Cognitive researchers have focused mainly on "the knowledge problems": they investigated what people remember most when visiting new places and how their conceptually rich knowledge of their environment is organized into hierarchies. They discussed the use of landmarks and high level cognitive capabilities such as the use of shortcuts and the ability to orient oneself in complex spaces. Bio-inspired algorithms can be used while building environment maps through cognitive map techniques (Chatila, 2008). Many examples showing how to use cognitive maps are presented in (Nehmzow, 2008).
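As a small illustration of the first two representations above, the Python sketch below pairs a metric occupancy grid (per-cell occupancy probabilities) with a topological map stored as an adjacency list of places. The grid size, place names and edges are invented, so this is only a toy model of the two layers, not a mapping algorithm from the surveyed literature.

```python
import numpy as np

# Toy model (assumed layout): a metric occupancy grid plus a topological graph.

# Metric layer: 10 x 10 occupancy grid initialised to "unknown" (0.5).
occupancy = np.full((10, 10), 0.5)
occupancy[4, 2:8] = 0.95           # a wall observed with high occupancy probability
occupancy[1, 1] = 0.05             # free space near the start pose [x y theta]^T

# Topological layer: places as nodes, traversable connections as edges.
topological_map = {
    "entrance": ["corridor"],
    "corridor": ["entrance", "office_A", "office_B"],
    "office_A": ["corridor"],
    "office_B": ["corridor"],
}

def reachable(start):
    """Breadth-first traversal over the topological graph (used for route planning)."""
    seen, frontier = {start}, [start]
    while frontier:
        node = frontier.pop(0)
        for neighbour in topological_map[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append(neighbour)
    return seen

print(reachable("office_A"))       # all places reachable from office_A
```

A hybrid representation in the sense discussed above would attach a small local grid like `occupancy` to each node of `topological_map` instead of keeping one global grid.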

VISION BASED NAVIGATION SYSTEMS

When cameras are employed as the only exteroceptive sensors, the approach is called visual SLAM; the terms vision based SLAM and vSLAM are also used. Visual SLAM refers to using images as the only source of external information in order to establish the position of a robot, vehicle or moving camera in an environment while, at the same time, constructing a representation of the explored zone.

Figure 2. The world environment is represented as a linked collection of waypoint images. Image matching closes loops in the topological representation. Local geometry allows the agent to follow a previously traversed path
Source: (Fraundorfer, Engels, & Nister, 2007).


Figure 3. The space is represented by places given by their metric maps and nodes representing topological locations. The graph represents the topological map, which is used for traveling. When interaction with the environment is needed, the local metric map is used.
Source: (Tomatis, 2008).

The computer vision techniques employed in visual SLAM, such as the detection, description and matching of salient features and image recognition and retrieval, among others, are still susceptible to improvement (Fuentes-Pacheco, Ruiz-Ascencio, & Rendón-Mancha, 2012). Moreover, visual SLAM systems can be complemented with information from proprioceptive sensors with the aim of increasing accuracy and robustness; this approach is known as visual-inertial SLAM. The key challenges for vSLAM are dynamic environments and environments with too many or very few salient features. Large scale environments are also challenging, especially during erratic movements of the camera and when partial or total occlusions of the sensor occur.

Types of Imaging Sensors for SLAM

Different imaging systems have been used for visual SLAM. Cameras provide rich information about the environment, enabling the detection of stable features. Furthermore, cameras are low cost, light and compact (Milford & Wyeth, 2007), easily available, offer passive sensing and have low power consumption. All these features make cameras very attractive for SLAM.




• Single Camera: Single camera SLAM is also referred to as bearing-only SLAM, as a single image provides only the direction of features present in the environment and does not provide depth information. To get the 3D location of a feature, multiple images from different viewpoints are required. Some visual SLAM implementations using single cameras are (Ethan Eade & Drummond, 2009) and (Muhammad, Fofi, & Ainouz, 2009).
• Omni-Directional Camera: The main advantage of an omni-directional camera is that it provides a 360-degree view of the environment. However, the mirror geometry introduces radial distortion and non-uniform resolution in the image, so conventional image processing techniques are not directly applicable to omni-directional images. In order to apply image processing techniques designed for conventional cameras, omni-directional images are usually unwrapped to perspective views, which removes the radial distortion. Recovering the panoramic view from the wrapped view requires the reverse mapping of points from the wrapped view onto the panoramic view (Goedemé & Gool, 2010), (LIU, 2013).

• Stereo Pair: A stereo pair can provide the 3D location of the observed features in the environment, which makes it readily usable for visual SLAM (a short depth-from-disparity sketch follows this list). In (Lemaire, Berger, Jung, & Lacroix, 2007), visual SLAM is implemented using stereo pairs for ground and aerial robots. Another implementation of visual SLAM using a stereo pair is presented in (Berger & Lacroix, 2008).
• Multiple Camera Rigs: Rigs of multiple cameras have also been used for visual SLAM. One advantage is that the use of multiple cameras increases the field of view and enables the tracking of features over wider robot motions. Another advantage is that the spatial resolution over the field of view of a multiple camera rig is uniform, unlike catadioptric sensors, which also offer a large field of view. An example of using an eight camera rig offering a 360° field of view to carry out visual SLAM can be found in (Kaess & Dellaert, 2006).
• Catadioptric Sensors: These are attractive for visual SLAM because they offer a wide field of view, and they have been used in different configurations. (Lemaire & Lacroix, SLAM with panoramic vision) presents an implementation using a single catadioptric sensor mounted on a ground robot.
• Trinocular Camera: An arrangement of three RGB cameras capturing the same scene. The search for corresponding pixels is sped up compared to a stereo camera, and the result is more accurate (Particle-based Sensor Modeling for 3D-Vision SLAM, 2007).
• Kinect: Highly accurate and precise, the Kinect gives an accurate 3D point cloud and provides high resolution depth videos (640 × 480). It uses IR projection to obtain the depth information and was released as a motion sensor device for the Microsoft Xbox 360 gaming console (Henry, Krainin, Herbst, Ren2, & Fox, 2012).
• Cartographer: A backpack equipped with Simultaneous Localization and Mapping (SLAM) technology (Lardinois, 2014). As the backpack wearer walks through a building, SLAM technology generates the floor plan in real time and displays it on an Android tablet connected to the backpack's computer. The wearer can then add points of interest on the go, such as a T-rex replica in a museum.
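For the stereo pair entry above, the depth of a matched feature in a rectified stereo pair follows directly from its disparity via the standard relation Z = f * B / d. The short Python sketch below illustrates this; the focal length, baseline and pixel coordinates are invented example values rather than figures from the chapter.

```python
# Depth-from-disparity sketch for a rectified stereo pair (illustrative values).

def stereo_depth(focal_px, baseline_m, x_left_px, x_right_px):
    """Depth (metres) of a feature seen at x_left / x_right in a rectified pair."""
    disparity = x_left_px - x_right_px
    if disparity <= 0:
        raise ValueError("feature must have positive disparity")
    return focal_px * baseline_m / disparity

# 700 px focal length, 12 cm baseline, 21 px disparity -> 4 m away.
print(stereo_depth(700.0, 0.12, 350.0, 329.0))
```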

Vision Based SLAM and Navigation Systems

A considerable amount of research has been carried out on SLAM using vision sensors during the last decade. Table 2 summarizes references to vision based systems for SLAM and navigation. Navigation Assistance for Visually Impaired (NAVI) systems aim to assist or guide people with vision loss, ranging from partially sighted to totally blind, by means of sound. In (Jain, 2014), experimental results for vision based indoor navigation systems for visually impaired people with audio instructions are presented; the proposed system relies on wearable devices to detect the surrounding features. In (Aladren, Lopez-Nicolas, Puig, & Guerrero, 2014), an implementation of a system that combines range information with color information to address the task of NAVI is given. Range information is used to detect and classify the main structural elements of the scene; due to the limitations of the range sensor, the color information is used jointly with the range information to extend the floor segmentation to the entire scene. A topological scene recognition algorithm using omni-directional cameras to help disabled people with wheelchairs find their way and navigate in a complex building is presented in (LIU, 2013). The algorithm is modeled by a Dirichlet process mixture model (DPMM). Navigation among topological nodes uses waypoint based topological modeling of the environment, centered on a visual homing framework using image based visual servoing (IBVS). A finite state machine enables topological navigation between waypoints by fusing visual homing and odometry motion. The work was applied on a wheelchair robot.


Table 2. Vision based navigation systems

| Citation | Application | Implementation | Methodology |
|---|---|---|---|
| (Aladren, Lopez-Nicolas, Puig, & Guerrero, 2014) | Navigation | NAVI system | Combines range information and color information |
| (LIU, 2013) | Navigation | Disabled people with wheelchairs | Omni-directional camera, Dirichlet Process Mixture Model (DPMM), Image Based Visual Servoing (IBVS) |
| (Becerra & Sagues, 2014) | Navigation | Wheeled robot navigation | Visual control based on epipolar geometry and the trifocal tensor |
| (Wang, 2015) | Navigation | | Computational geometry, set covering, non-smooth optimization, combinatorial optimization and optimal control |
| (Chacon, 2011) | SLAM and Navigation | RoboCup Rescue Competition | Visual topological mapping, building a visual topological map using salient image features |
| (GARCIA-FIDALGO & ORTIZ, 2013) | SLAM and Navigation | SLAM and Navigation | Appearance based approach for topological visual mapping and localization using local invariant features |
| (Goedemé & Gool, 2010) | SLAM and Navigation | Robotic wheelchair navigation | Omni-directional camera + FAST wide baseline feature matching |
| (Huitl, Schroth, Hilsenbeck, Schweiger, & Steinbach, 2012), (Schroth, 2013) | SLAM and Navigation | SLAM and indoor navigation with a smartphone | Smartphone camera for localization and content based image retrieval techniques for SLAM and navigation |
| (Terraa, Figueiredoa, Barbosaa, & Anacletoa, 2014) | SLAM and Navigation | SLAM and navigation with wearable devices | Ankle mounted Inertial Navigation System (INS) used to estimate the distance traveled by a pedestrian |
| (Bhanage, 2014) | SLAM and Navigation | SLAM and navigation with a smartphone | Application that delivers an interactive indoor navigation experience through augmented graphical views aligned with indoor objects |
| (Lardinois, 2014) | SLAM | SLAM | Using Cartographer for SLAM |
| (Particle-based Sensor Modeling for 3D-Vision SLAM, 2007) | SLAM | SLAM | Uses Kinect to generate a 3D dense map |
| (Particle-based Sensor Modeling for 3D-Vision SLAM, 2007) | SLAM | SLAM | Using a trinocular camera for creating the environment model |
| (Kaess & Dellaert, 2006) | SLAM | SLAM | Eight camera rig offering a 360° field of view to carry out visual SLAM |
| (Lemaire, Berger, Jung, & Lacroix, 2007) | SLAM | SLAM for ground and aerial robots | Visual SLAM using stereo pairs for ground and aerial robots |
| (Berger & Lacroix, 2008) | SLAM | SLAM | Visual SLAM using stereo pairs |

Visual control of wheeled robot navigation is discussed in (Becerra & Sagues, 2014), which presents visual servoing schemes as well as visual control based on epipolar geometry and the trifocal tensor. In (Wang, 2015), visibility based path and motion planning algorithms built on computational geometry, set covering, non-smooth optimization, combinatorial optimization and optimal control are shown.


In (Chacon, 2011), visual topological mapping for the RoboCup Rescue competition is presented. It shows that navigation can be implemented in a dynamic environment by building a visual topological map using salient image features; this map describes the spatial and geometric relations between several visual landmarks in a given environment. In (GARCIA-FIDALGO & ORTIZ, 2013), an appearance based approach for topological visual mapping and localization using local invariant features is proposed. The approach is based on an efficient matching scheme between features. In order to avoid redundant information in the resulting maps, a map refinement framework is presented; the proposed approach takes into account the visual information stored in the map for refining the final topology of the environment. These refined maps save storage space and improve the execution times of localization tasks. In (Goedemé & Gool, 2010), FAST wide baseline feature matching is used for autonomous mobile robot navigation, with only an omni-directional camera as a sensor. This system is able to build, automatically and robustly, accurate topologically organized environment maps of a complex, natural environment; the implemented system was applied to a robotic wheelchair platform. The Navigation based on Visual Information (NAVVIS) project uses images captured by a smartphone as visual fingerprints of the environment. The captured images are matched to a previously recorded geotagged reference database with content based image retrieval (CBIR) techniques. This work introduces an extensive benchmark dataset for visual indoor localization, available to the research community for download (NAVVIS, 2012), so that experimental results can be compared with the published results. The TUMindoor vision based indoor navigation system is discussed in (Huitl, Schroth, Hilsenbeck, Schweiger, & Steinbach, 2012) and (Schroth, 2013), which focus on the major challenges of developing a mobile visual localization system. The first challenge is the computational complexity of matching the images captured by the mobile device against the reference images in the database. To address it, they introduced a quantization structure for quantizing visual features, named "visual words"; through visual words, text querying/retrieval approaches can be used for image recognition problems (a small visual word quantization sketch is given below). The second challenge is the network latency of retrieving information from the centralized server. They addressed this challenge by preloading selected feature information in "partial visual vocabularies", which aim at enhancing application performance with minimum dependency on network exchange. Additionally, they introduced a novel text detector that facilitates robust, low complexity localization by recognizing the text in the images. Moreover, they introduced a virtual view technique that computes virtual reference images at 16 locations; this computation helps to reduce the data that needs to be stored on and retrieved by the devices.
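The following Python sketch gives a rough feel for the visual-word idea mentioned above: local image descriptors are quantized against a fixed vocabulary, each image becomes a word histogram, and a query image is matched to the reference image with the most similar histogram. The vocabulary, descriptors and reference database here are tiny random stand-ins; a real system such as the one cited builds them from training imagery and uses far more sophisticated indexing.

```python
import numpy as np

# Illustrative bag-of-visual-words retrieval (random stand-in data, not the
# cited system's pipeline).
rng = np.random.default_rng(0)
vocabulary = rng.normal(size=(50, 32))        # 50 visual words, 32-D descriptors

def to_histogram(descriptors):
    """Assign each descriptor to its nearest visual word; return a normalized histogram."""
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

# Reference database: one histogram per geotagged image (descriptors simulated).
database = {f"image_{i}": to_histogram(rng.normal(size=(200, 32))) for i in range(5)}

def query(descriptors):
    """Return the reference image whose word histogram is closest to the query's."""
    q = to_histogram(descriptors)
    return min(database, key=lambda name: np.linalg.norm(database[name] - q))

print(query(rng.normal(size=(200, 32))))      # best-matching reference image
```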

CONTEXT AWARENESS

Indoor navigation systems research has been enriched by introducing context awareness based services. Contextual information can be defined as any information that is gathered and can be used to enrich the knowledge about the user's state, his or her physical surroundings and the capabilities of the mobile device(s) used. Context varies according to application constraints, taking into account the way users act in the environment as well as the interface used to interact with them. Two generic concepts have been introduced as super-classes that encapsulate contextual dimensions. The context of use includes: (1) user-centered dimensions of context such as the user profile, preferences and physical/cognitive capabilities; the user-centered dimensions also include the user interface that provides direct interaction with the system (i.e.,


input data manipulation and output communication operations); and (2) the environmental context, which refers to the parameters that influence the user (e.g., location, time, temperature, light). From another perspective, the context of execution models the behavior of the system and encompasses the infrastructure dimension, covering the topological organization of the system components, and the system dimension, which evaluates resource utilization (e.g., memory, processor and network) and the capabilities of the user's mobile device(s) (Afyouni, Ray, & Claramunt, 2012). The combination of information on place, time and further context, however, enables a broad variety of applications that go beyond navigational tasks or vicinity searches. For example, the smartphone can be used as a virtual tourist guide to display information about a pointed-at exhibit. At the airport, an indoor positioning system could indicate personalized duty free offers on the way to the gate or help to find the shortest security line. Dynamic information, like waiting times at a queue or last minute deals, would be of great benefit not only to the user but also to the service provider, raising client satisfaction and enabling real time marketing by analyzing the customer flow. Context aware positioning systems are also the basis for many applications in monitoring and activity recognition. For example, in a supermarket, by obtaining the relevant position information of the consumers and target commodities, the supermarket is able to provide route guidance and intelligent shopping guide services. In (Bhanage, 2014), augmented reality for indoor navigation systems is introduced to enhance the user experience, displaying relevant information along with the location information; the techniques were applied in a vision based smartphone application for indoor navigation.

INDOOR NAVIGATION INDUSTRY LANDSCAPE

In this section the authors demonstrate the standardization efforts for indoor navigation frameworks; a survey of some of the existing solutions deployed for indoor navigation is also presented.

Standardization Efforts

There currently exist many approaches that target establishing a common schema framework for indoor navigation applications and industry standards. Common frameworks facilitate interoperability between indoor navigation systems and enable easy data sharing. The Indoor Geometry Markup Language standard working group (IndoorGML SWG) aims to develop an application schema of the Open Geospatial Consortium (OGC) GML and to progress the document to the state of an adopted OGC standard. The goal of this candidate standard is to establish a common schema framework for indoor navigation applications (Nagel, et al., 2010). The InLocation Alliance (ILA) (Inlocationalliance, 2015) was founded by the mobile industry to accelerate the adoption of indoor positioning solutions that will enhance the mobile experience by opening up new opportunities for consumers and venue owners. ILA targets accurate indoor positioning to unlock a new set of possibilities for mobile services, focusing on creating solutions that offer high accuracy, low power consumption, mobility, feasibility and usability. The other important task of the Alliance is to ensure a multi-vendor environment by promoting open interfaces and a standards based approach. ILA founding members include: Broadcom, CSR, Dialog Semiconductor, Eptisa, Geomobile, Genasys, Indra, Insiteo, Nokia, Nomadic Solutions, Nordic Semiconductor, Nordic Technology Group, NowOn, Primax


Electronics, Qualcomm, RapidBlue Solutions, Samsung Electronics, Seolane Innovation, Sony Mobile Communications, TamperSeal AB, Team Action Zone and Visioglobe. Their primary solutions will be based on enhanced Bluetooth 4.0 Low Energy technology and Wi-Fi standards, using relevant existing or upcoming features of those technologies.

Indoor Navigation Solutions

Intensive work is currently under way, and many solutions exist, for human indoor navigation that leverage smartphones for navigation in complex indoor environments. Table 3 shows a list of existing industrial solutions for indoor navigation. Google Indoor Maps has been activated for over 10,000 floor plans throughout the world; these indoor spaces include airports, malls, museums, etc. Its indoor navigation algorithm is based on Wi-Fi access points and mobile towers to determine the user's location (Moses, 2013). In (Zheng, Bao, Fu, & Pahlavan, 2012), a Wi-Fi based indoor localization algorithm that works with Google Indoor Maps is presented. The initial position of the mobile station (MS) is estimated according to the received signal strength (RSS) from the calibrated Wi-Fi access points (APs); then the position of the MS is refined by using the simulated annealing (SA) algorithm to search for a better solution. In (Lardinois, 2014), using SLAM to build maps for navigation is presented. It uses Trekker, the backpack that includes a complete Street View camera setup for mapping the environment. The Cartographer uses SLAM for mapping new locations, and Google is now using it to map anything from hotels to museums. As the backpack wearer walks through a building, the floor plan is automatically generated in real time, says Google. The wearer also uses a tablet to add points of interest while walking around the building (say, room numbers in a hotel or the exhibits in a museum). The Place Lab architecture (LaMarca, et al., 2005), (PlaceLab, 2014), developed for research purposes, consists of three key elements: radio beacons installed at various places indoors, databases containing the beacon location information, and clients that estimate their location from this data.

Table 3. Indoor navigation solutions

| Solution Name | Reference | Technology/Methodology | Sponsor/Provider |
|---|---|---|---|
| Google Indoor | (Moses, 2013), (Zheng, Bao, Fu, & Pahlavan, 2012) | Mapping and navigation | Google |
| Cartographer for indoor navigation | (Lardinois, 2014) | SLAM technique for vision based navigation | Google |
| Qualcomm | (Kim, Mitra, Yan, & Guibas, 2012) | Wi-Fi based navigation solution | Qualcomm |
| In Place Lab | (LaMarca, et al., 2005), (PlaceLab, 2014) | Radio beacons | Intel |
| InfSoft | (InfSoft, 2015) | Wi-Fi, Bluetooth, GPS | InfSoft |
| NAVVIS | (Huitl, Schroth, Hilsenbeck, Schweiger, & Steinbach, 2012) | Vision based with content based image retrieval technique | Technical University of Munich |
| Nexthome | (NextHome, 2015) | Wi-Fi, Bluetooth based navigation solution | Nexthome |
| Wi-Farer | (WifFarer, 2015) | Wi-Fi, Bluetooth based navigation solution | Wi-Farer |
| Smart Indoor | (SmartIndoor, 2015) | Wi-Fi, Bluetooth based navigation solution | Smart Indoor |


Place Lab provides locations based on the known positions of the access points, which are provided by a database cached on the detecting device. Place Lab is entirely dependent on the availability of beacon locations, without which it cannot estimate anything about the current location. InfSoft (InfSoft, 2015) makes use of multiple wireless technologies, Wi-Fi, Bluetooth and GPS, to localize an object indoors. With the help of InfSoft's in-house navigation technology, a position can be identified inside enclosed buildings to within a few meters, without installing additional hardware in the building. For this purpose, a variety of the sensors in a mobile device (smartphone) are used and evaluated: GSM, 3G/4G (LTE), Wi-Fi, magnetic fields, compass, air pressure (barometer), accelerometer, gyroscope, Bluetooth and GPS. Thanks to the unique combination of these technologies, a position can be displayed to within 1-meter accuracy, and the associated floor within a building can also be shown. In-house positioning switches automatically to GPS when leaving a building, thus enabling seamless use inside and outside buildings. The NAVVIS positioning system (Huitl, Schroth, Hilsenbeck, Schweiger, & Steinbach, 2012) works on a large database of images of indoor places and generates a 3D model of each place. The user needs to take a picture of his or her surroundings, and NAVVIS compares it with the images in the database to compute the user's current location and orientation. It is also smart enough to analyze the picture provided by the user for any changes of objects indoors, and it further implements augmented reality to overlay the device's camera view with relevant navigational information. Qualcomm's indoor location technology (IZat) is a chip-based platform that facilitates the delivery of location aware networks. Solutions based on Qualcomm Atheros' 802.11ac and 802.11n network access points include advanced Wi-Fi based calculations to localize devices indoors with an accuracy of 5 meters. These access points, in conjunction with a server component they interact with, form a cohesive indoor positioning system (Kim, Mitra, Yan, & Guibas, 2012). Nexthome (NextHome, 2015), Wi-Farer (WifFarer, 2015) and Smart Indoor (SmartIndoor, 2015) are indoor positioning and navigation systems based on Wi-Fi and Bluetooth beacon technologies. These systems rely on an existing infrastructure of such devices and are compatible with the Apple iBeacon standard and with smartphones, either iOS or Android, equipped with Bluetooth 4.0 Low Energy technology.

FUTURE RESEARCH DIRECTIONS

The authors' research approach focuses on building a dynamic and cheap algorithm for indoor navigation. The targeted work is the design and analysis of an affordable indoor navigation system that:


• Minimizes computational overheads on the agent level and allows for maximum autonomous behavior and capabilities with minimum network loads.
• Considers the tradeoff between cost, performance and accuracy.
• Utilizes multi-agent capabilities for SLAM and environment exploration.
• Makes best use of bio-inspired algorithms (cognitive maps and genetic algorithms) in building the proposed vision based navigation system.
• Is scalable, i.e., the system can be deployed in very large and complex environments with minimum cost.


The recommended approach is based on two steps, as demonstrated in Figure 4. The first step is utilizing vision-based SLAM algorithms for building an optimized environment model; this model shall be adaptable to the dynamics of the environment. Vision-based approaches for mobile indoor localization were selected because they do not rely on infrastructure and are therefore scalable and cheap. From the mapping perspective, we are working on integrating various techniques for environment abstraction that yield a concise information representation, minimum processing, and maximum autonomous behavior for the moving agents. This shall include topological maps and multi-layered graph representations mixed with bio-inspired algorithms for environment abstraction, map building, and navigation. The second step is indoor navigation based on the environment model built in the first step. We are focusing on location-based services with smartphones for human navigation. Users can localize themselves within the building through their phone camera, and the system will generate the path for the user to reach the target location. Data collected by the devices will be used to update the environment model and to serve context-awareness services. The proposed work can serve in navigating unknown environments (e.g., firefighters or visually impaired people) and in artificial agents' indoor surveillance missions. It provides a scalable and cheap navigation system with a concise environment model. The implementation results will be compared with the benchmark dataset published by NAVVIS (NAVVIS, 2012) for a vision-based navigation system based on SLAM implemented by the Technical University of Munich.
Figure 4. Proposed architecture


CONCLUSION

Accurate indoor navigation can enable a wide range of location-based services across many sectors and can benefit many applications, whether in robotics or in human navigation. Due to the complexity of indoor environments, the development of indoor navigation systems is always accompanied by a set of challenges, as demonstrated in this chapter. Many technologies have been used for indoor positioning and navigation; the accuracy of those technologies varies according to the environment setup. For example, the accuracy of Bluetooth navigation depends on the number of beacons used and how they are distributed. Wireless systems are error-prone and require at least four to five Wi-Fi hot spots in the vicinity to allow for room-level accuracy. Thus, additional hot spots need to be installed throughout the building, which poses a major obstacle to widespread adoption. RFID requires active tags for indoor navigation; the accuracy is directly proportional to the number of active tags used in the system. Active tags are self-powered, so they are costly. Moreover, a close pass-by is required to sense RFID tags, and the user needs to be aware of the tag positions. RFID tags themselves are relatively inexpensive, but installing them in large environments may be costly, since such tags need to be embedded in floors or walls for users to sense them. Another disadvantage of this technique is that the human body can block RF signals (Fallaha, Apostolopoulosa, Bekrisb, & Folmera, 2013). Based on the previous investigations, vision sensors are attractive for SLAM for a number of reasons, including their rich sensing, easy availability, and cost effectiveness. The authors' approach is to use vision-based SLAM to build a concise multi-layered environment model that minimizes computational overhead and achieves maximum effectiveness for navigation and environment model updates. Vision sensors have huge potential for SLAM, and progressively more robust visual SLAM methods are being developed. After building a robust environment model that can be easily updated upon environment changes, the proposed model can be used for positioning and navigation. In addition, data gathered during users' navigation will facilitate providing context-awareness based services.

REFERENCES Afyouni, I., Ray, C., & Claramunt, C. (2012). Spatial models for context-aware indoor navigation systems: A survey. Journal of Spatial Information Science, 85–123. Aladren, A., Lopez-Nicolas, G., Puig, L., & Guerrero, J. J. (2014). Navigation Assistance for the Visually Impaired Using RGB-D Sensor with Range Expansion. Systems Journal, IEEE, 1 - 11. Becerra, H., & Sagues, C. (2014). Visual Control of Wheeled Mobile Robots, Unifying Vision and Control in Generic Approaches (Vol. 103). Springer Tracts in Advanced Robotics. doi:10.1007/978-3-319-05783-5 Berger, C., & Lacroix, S. (2008). Using planar facets for stereovision SLAM. Intelligent Robots and Systems, 1606 - 1611. Bhanage, N. (2014). Improving User Experiences in Indoor Navigation Using Augmented Reality (Master degree thesis). Technical Report No. UCB/EECS-2014-73. Electrical Engineering and Computer Sciences University of California at Berkeley.


Chacon, J. D. (2011, June). Visual-Topological Mapping: An approach for indoor robot navigation (MSC thesis project). University of Groningen. Chang, C. (2004). Localization and Object-Tracking in an Ultrawideband Sensor Network. MSC thesis. Chatila, R. (2008). Robot mapping: An Introduction. Robotics and Cognitive Approaches to Spatial Mapping, 38, 9–13. doi:10.1007/978-3-540-75388-9_2 Chatterjee, A., Rakshit, A., & Singh, N. N. (2013). Vision Based Autonomous Robot Navigation (Vol. 455). Studies in Computational Intelligence, Springer Berlin Heidelberg. doi:10.1007/978-3-642-33965-3 Disha, A. (2013). A Comparative Analysis on indoor positioning Techniques and Systems. International Journal of Engineering Research and Applications, 3, 1790–1796. Eade, E., & Drummond, T. (2009). Edge landmarks in monocular SLAM. Image and Vision Computing, 588–596. Fallaha, N., Apostolopoulosa, I., Bekrisb, K., & Folmera, E. (2013). Indoor Human Navigation Systems-a Survey. Interacting with Computers, 21–33. FinlandTeam. (2014). Finland Team uses Earth’s Magnetic Field for Phone Indoor Positioning System. Retrieved from http://phys.org/news/2012-07-finland-team-earth-magnetic-field.html Fraundorfer, F., Engels, C., & Nister, D. (2007). Topological mapping, localization and navigation using image collections. Intelligent Robots and Systems, IROS2007, 3872–3877. Fuentes-Pacheco, J., Ruiz-Ascencio, J., & Rendón-Mancha, J. M. (2012). Visual simultaneous localization and mapping: A survey. Artificial Intelligence Review, 55–81. Gandhi, S. R. (2014). A Real Time Indoor Navigation and Monitoring System for Firefighters and Visually Impaired (Master Thesis). University of Massachusetts. Garcia-Fidalgo, E., & Ortiz, A. (2013). Vision-Based Topological Mapping and Localization by means of Local Invariant Features and Map Refinement. Universitat de les Dies Balears. Goedemé, T., & Gool, L. V. (2010). Omnidirectional vision based topological navigation. Mobile Robot Navigaion (pp. 171-196). Springer. Gule & Devikar. (2014). A Survey- Qr Code Based Navigation System for Closed Building Using Smart Phones. International Journal for Research in Applied Science and Engineering Technology, 2, 442-445. Henry, P., Krainin, M., Herbst, E., Ren, X., & Fox, D. (2012). RGB-D Mapping: Using Kinect-Style Depth Cameras for Dense 3D Modeling of Indoor Environments. The International Journal of Robotics Research, 31(5), 647–663. doi:10.1177/0278364911434148 Huitl, R., Schroth, G., Hilsenbeck, S., Schweiger, F., & Steinbach, E. (2012). TUMindoor: an extensive image and point cloud dataset for visual indoor localization and mapping. 19th IEEE International Conference on Image Processing (ICIP) (pp. 1773 - 1776). Orlando, FL: IEEE. doi:10.1109/ICIP.2012.6467224


Iglesias, H. J., Barral, V., & Escudero, C. J. (2012). Indoor person localization system through RSSI Bluetooth fingerprinting. Proceedings of the 19th International Conference on Systems, Signals and Image Processing (IWSSIP ’12), 40–43. InfSoft. (2015). Retrieved from http://www.infsoft.com/Products/Indoor-Navigation Inlocationalliance. (2015). Retrieved from http://inlocationalliance.org/ Jain, D. (2014). Path-guided indoor navigation for the visually impaired using minimal building retrofitting. 16th international ACM SIGACCESS conference on Computers & accessibility, 225-232. Kaess, M., & Dellaert, F. (2006). Visual SLAM with a Multi-Camera Rig. Technical Report GITGVU-06-06, Georgia Institute of Technology. Kim, Y. M., Mitra, N. J., Yan, D.-M., & Guibas, L. (2012). Acquiring 3D Indoor Environments with Variability and Repetition. ACM Transactions on Graphics, 31, 138:1-138:11. LaMarca, A., Chawathe, Y., Consolvo, S., Hightower, J., Smith, I., Scott, J.,... Schilit, B. (2005). Place Lab: Device Positioning Using Radio Beacons in the Wild. Pervasive Computing, 116-133. Lardinois, F. (2014). Google Unveils The Cartographer, Its Indoor Mapping Backpack. Retrieved from http://techcrunch.com/2014/09/04/google-unveils-the-cartographer-its-indoor-mapping-backpack/ Lemaire, T., Berger, C., Jung, I., & Lacroix, S. (2007). Vision-Based SLAM: Stereo and Monocular Approaches. International Journal of Computer Vision, 74(3), 343–364. doi:10.1007/s11263-007-0042-3 Lemaire, T., & Lacroix, S. (n.d.). SLAM with panoramic vision. Journal of Field Robotics, 24, 91-111. Liu, M. (2013). Topological Scene Recognition and Visual Navigation for Mobile Robots using Omnidirectional Camera (PHD thesis). ETH Zurich Uni. Liu, H., & Liu, J. (2007). Survey of Wireless Indoor Positioning Techniques and Systems. IEEE Transactions on Systems, Man and Cybernetics. Part C, Applications and Reviews, 37(6), 1067–1080. doi:10.1109/TSMCC.2007.905750 Milford, M., & Wyeth, G. (2007). Featureless Vehicle-Based Visual SLAM with a Consumer Camera. Proceedings of Australasian Conference on Robotics and Automation. Moses, A. (2013). Inside out: Google launches indoor maps. Retrieved from http://www.smh.com.au/ digital-life/digital-life-news/inside-out-google-launches-indoor-maps-20130312-2fxz2.html Muhammad, N., Fofi, D., & Ainouz, S. (2009). Current state of the art of vision based SLAM. IS&T/ SPIE Electronic Imaging, 72510F-72510F-12. Nagel, C., Becker, T., Kaden, R., Li, K.-J., Lee, J., & Kolbe, T. H. (2010). Requirements and Space-Event Modeling for Indoor Navigation. Open Geospatial Consortium. Naminski, M. R. (2013). An Analysis of Simultaneous Localization and Mapping (SLAM) Algorithms. Macalester Math, Statistics, and Computer Science Department. Retrieved from http://digitalcommons. macalester.edu/mathcs_honors/29


NAVVIS. (2012). Retrieved from http://www.navvis.lmt.ei.tum.de/dataset/ Nehmzow, U. (2008). Emergent Cognitive Mappings in Mobile Robots Through Self-organisation. In Robotics and Cognitive Approaches to Spatial Mapping (pp. 83–104). doi:10.1007/978-3-540-75388-9_6 NextHome. (2015). NextHome. Retrieved from http://www.nextome.org/en/indoor-positioning-technology. php Nieto, J. I. (2005). Detailed Environment Representation for the SLAM Problem (PHD thesis). The University of Sydney. Otsason, V., Varshavsky, A., LaMarca, A., & Lara, E. d. (2005). Accurate GSM Indoor Localization. UbiComp 2005: Ubiquitous Computing, 141-158. Perez-Lorenzo, J., Vazquez-Martin, R., Antunez, E., & Bandera, A. (2009). A Dual Graph Pyramid Approach to Grid-Based and Topological Maps Integration for Mobile Robotics. 10th International Work-Conference on Artificial Neural Networks, 781–788. PlaceLab. (2014). PlaceLab. Retrieved from http://ils.intel-research.net/place-lab Priyantha, N., Chakraborty, A., & Balakrishnan, H. (2000). The cricket location-support system. MobiCom ‘00: Proceedings of the 6th annual international conference on Mobile computing and networking, 32-43. doi:10.1145/345910.345917 Roweis, S. T., & Salakhutdinov, R. R. (2005). Simultaneous Localization and Surveying with Multiple Agents. LNCS, 3355, 313–332. Schroth, G. (2013, July). Mobile Visual Location Recognition (PhD Thesis). Munich: Technische Universität München. SmartIndoor. (2015). Retrieved from SmartIndoor: http://smartindoor.com/#indoor-nav Subhan, F., Hasbullah, H., Rozyyev, A., & Bakhsh, S. T. (2011). Indoor positioning in Bluetooth networks using fingerprinting and lateration approach. Proceedings of the International Conference on Information Science and Applications (ICISA ’11). Terraa, R., Figueiredoa, L., Barbosaa, R., & Anacletoa, R. (2014). Traveled Distance Estimation Algorithm for Indoor Localization. Conference on Electronics, Telecommunications and Computers – CETC 2013. 17, (pp. 248 – 255). Procedia Technology. doi:10.1016/j.protcy.2014.10.235 Thrun, S. (2002). Robotic Mapping: A Survey. In G. A. Lakemeyer (Ed.), Exploring Artificial Intelligence in the New Millenium. Morgan Kaufmann. Thrun, S. (2008). Simultaneous Localization and Mapping. Robotics and Cognitive Approaches to Spatial Mapping, 38, 13–41. doi:10.1007/978-3-540-75388-9_3 Tomatis, N. (2008). Hybrid, Metric-Topological Representation for Localization and Mapping. Robotics and Cognitive Approaches to Spatial Mapping, 43–63. Tsirmpas, C., Rompas, A., Fokou, O., & Koutsouris, D. (2014). An indoor navigation system for visually impaired and elderly people based on Radio Frequency Identification (RFID). Information Sciences.


Wang, P. K.-C. (2015). Visibility-based Optimal Path and Motion Planning (Vol. 568). Springer. WifFarer. (2015). Retrieved from http://www.wifarer.com/technology Yeap, W., & Jefferies, M. (2000). On early cognitive mapping. Spatial Cognition and Computation, 2(2), 85–116. doi:10.1023/A:1011447309938 Zhang, D., Xia, F., Yang, Z., Yao, L., & Zhao, W. (2010). Localization Technologies for Indoor Human Tracking. 5th International Conference on Future Information Technology (FutureTech), 1-6. Zheng, X., Bao, G., Fu, R., & Pahlavan, K. (2012). The Performance of Simulated Annealing Algorithms for Wi-Fi Localization Using Google Indoor Map. Vehicular Technology Conference (VTC Fall), 1 - 5. doi:10.1109/VTCFall.2012.6399302 Zhu, L., Yang, A., Wu, D., & Liu, L. (2014). Survey of Indoor Positioning Technologies and Systems. Life System Modeling and Simulation, 461, 400–409.


Chapter 8

Enzyme Function Classification: Reviews, Approaches, and Trends

Mahir M. Sharif Cairo University, Egypt & Omdurman Islamic University, Sudan & Scientific Research Group in Egypt (SRGE), Egypt Alaa Tharwat Suez Canal University, Egypt & Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt

Aboul Ella Hassanien Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Hesham A. Hefny Cairo University, Egypt

ABSTRACT

Enzymes are important in our lives and play a vital role in most biological processes in living organisms, such as metabolic pathways. Classifying enzyme functionality from sequences, structure data, or extracted features remains a challenging task. Traditional experiments consume considerable time, effort, and cost; on the other hand, automated classification of enzymes saves effort, money, and time. The aim of this chapter is to review the different approaches that have been developed to classify and predict the functions of enzyme proteins, in addition to the new trends and challenges that could be considered now and in the future. The chapter addresses the three main approaches used to classify the function of enzymatic proteins and illustrates the mechanism, pros, cons, and examples for each one.

INTRODUCTION

Proteins play important roles in living organisms. They carry out a majority of the cellular processes and act as structural constituents, catalysis agents, signaling molecules, and molecular machines of every biological system. They make up a large proportion of living cells and are important (i) as enzymes carrying out all the metabolic reactions (biological catalysts) going on inside the cell; (ii) as structural proteins making up connective tissue, muscle, bones, and the cellular division machinery; (iii) as receptors and ion channels to communicate from the outside to the inside of the cell and to allow the flow of ions



and small molecules across the cell membrane; (iv) as the basis of our immune system in the form of antibodies; and (v) as participants in macro-molecular complexes with RNA, DNA, and carbohydrate molecules (Eisenberg, 2000; Gennis, 2013). Enzymes represent one of the most basic and important types of proteins. The enzyme or protein sequences contained in various organisms are very easy to determine, but finding the function of a protein experimentally remains a tedious and expensive task, so a faster and more cost-effective approach is needed. Thus, biologists are interested in automatic approaches that can help them filter among the numerous possibilities, with the aid of computational methods, by developing models to classify and predict the functions of various enzymes based on similarities between their sequences and/or their spatial structures (Tharwat et al., 2015a).

The IUBMB enzyme classification system represents the first enzyme classification, set by the Enzyme Commission (EC) some five decades ago, to classify enzymes and to keep up to date with newly discovered ones. The IUBMB Enzyme Nomenclature decided to retain systematic names as the basis for classification for several reasons: the code number alone is only useful for identification of an enzyme; the systematic name stresses the type of reaction; systematic names can be formed for new enzymes by the discoverer; and systematic names are helpful in finding recommended names that are in accordance with the general pattern. According to the above recommendations, the EC developed a new scheme of enzyme classification and numbering consisting of code numbers prefixed by EC, which is now widely in use. An EC classification code contains four numbers separated by points: the first number shows to which of the six main divisions (classes) the enzyme belongs, the second number indicates the subclass, the third number gives the sub-subclass, and the fourth number is the serial number of the enzyme in its sub-subclass. The EC classifies enzymes based on the chemical reactions they catalyze into six main classes: (i) Oxidoreductases, to which belong all enzymes catalyzing oxidoreduction reactions; (ii) Transferases, enzymes transferring a group, e.g. a methyl group or a glycosyl group; (iii) Hydrolases, enzymes that catalyze the hydrolytic cleavage of C−O, C−N, C−C, and some other bonds; (iv) Lyases, enzymes cleaving C−C, C−O, C−N, and other bonds by elimination, leaving double bonds or rings, or conversely adding groups to double bonds; (v) Isomerases, enzymes that catalyze geometric or structural changes within one molecule; and (vi) Ligases, enzymes catalyzing the joining together of two molecules coupled with the hydrolysis of a pyrophosphate bond in ATP or a similar triphosphate (Sharif et al., 2015a; Tipton, 1994).

The rest of this chapter is organized as follows: Section 2 presents some of the state-of-the-art methods. Section 3 introduces a detailed explanation of enzyme function classification using sequence alignment; in this section, sequence alignment methods and the similarity score calculations are explained. Section 4 presents details about enzyme function classification using enzymes' structures. Enzyme function classification using features is introduced in Section 5. Finally, conclusions and future work are presented in Section 6.
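Because the chapter repeatedly refers to the hierarchical EC code, a tiny example may help make the four-level structure concrete. The sketch below is illustrative only: the class names are the six IUBMB main classes listed above, while the parsing helper and the example EC number are ours and not part of any cited model.

```python
# A minimal sketch of how a four-part EC number maps onto the IUBMB hierarchy.
# The helper function and the example EC number are illustrative, not from any cited work.

EC_MAIN_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}

def parse_ec_number(ec: str) -> dict:
    """Split an EC code such as 'EC 2.7.1.1' into its four hierarchical levels."""
    digits = [int(part) for part in ec.replace("EC", "").strip().split(".")]
    main_class, subclass, sub_subclass, serial = digits
    return {
        "class": EC_MAIN_CLASSES[main_class],
        "subclass": subclass,
        "sub-subclass": sub_subclass,
        "serial": serial,
    }

if __name__ == "__main__":
    # EC 2.7.1.1 (hexokinase) is used purely to show the numbering scheme.
    print(parse_ec_number("EC 2.7.1.1"))
    # {'class': 'Transferases', 'subclass': 7, 'sub-subclass': 1, 'serial': 1}
```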

BACKGROUND

Three approaches are widely used for enzyme prediction and classification: approaches based on sequence alignment, on the protein's features, and on the protein's structure.

Many studies have been conducted, and a group of models developed, based on sequence alignment to classify and predict enzymes/proteins. Cai et al. developed a method based on the Support Vector Machine (SVM) for classification of enzymes into functional families defined by the Enzyme Nomenclature Committee of the IUBMB. Their model is based on a two-class classification platform: one class contains proteins in a particular enzyme family, and the other class consists of representative proteins outside that family, including both enzymes of the remaining 45 enzyme families and non-enzymes. Out of 8291 enzymes, their model correctly classified 6658 enzyme sequences, an accuracy of 80.3% (Cai et al., 2004). Shen and Chou developed a top-down predictor, called EzyPred, which consists of three layers: the first layer is the prediction engine for identifying enzymes from non-enzyme proteins; the second layer determines the main functional classes; and the third layer determines the sub-functional class. They achieved an overall accuracy of 90%. The main reason for EzyPred's high success is the fusion of the FunD (Functional Domain) approach and the Pse-PSSM (Pseudo Position-Specific Scoring Matrix) approach (Shen et al., 2007). Espadaler et al. developed a novel method to predict the first three EC digits. They used PSI-BLAST in their proposed method, and their dataset consisted of 3890 protein sequences divided into 1227 enzymes and 2663 non-enzymes. They called their method ModFun and achieved prediction accuracies of 80%, 90%, and 80% for the first three EC digits (class, sub-class, sub-sub-class), respectively (Espadaler et al., 2008). In (Sharif et al., 2015a), a model was proposed to classify enzymes based on their sequence alignment. In that model, the sequences were first aligned using local and global methods; then, a similarity score was calculated using different score matrices, BLOSUM30 and BLOSUM62 (the default score matrix). They achieved reasonable classification accuracies of 88.16%, 89.94%, and 93.3% when they applied global alignment using BLOSUM62, global alignment using BLOSUM30, and local alignment, respectively. In Tharwat et al., an enzyme sequence function classification model based on a score fusion technique was proposed (Tharwat et al., 2015b). In their model, the outputs of seven pairwise local sequence alignment processes using different score matrices were represented by vectors of scores; all vectors were combined to determine the final vector of scores, and the function or class label of an unknown sequence is the function of the candidate with the maximum score. The model achieved a reasonable classification accuracy of 93.9% when the dataset was unbalanced and 92.2% when the dataset was balanced, improving on the model proposed in (Tharwat et al., 2015a). Sharif et al. proposed an enzyme sequence function classification model based on a ranking aggregation technique. In this model, the outputs of seven pairwise local sequence alignments based on different score matrices are represented by ranked lists, which are then combined to obtain the final ranked list; the function or class label of the unknown sequence is the function of the highest-ranked candidate. The model achieved a classification accuracy of 93.5% when the dataset was unbalanced and 90.5% when the dataset was balanced, improving on the results of our experiment in (Sharif et al., 2015b). Table 1 lists some state-of-the-art methods that used sequence alignment to predict enzyme function.

Many studies have also been developed based on enzyme structure to classify and predict enzymes/proteins. Liu et al. developed a new method, called AADP-PSSM, that combines SVM with PSI-BLAST to predict the protein structural class for low-similarity sequences. They transformed the PSSM profiles of proteins into fixed-length feature vectors by extracting the Amino Acid Composition (AAC) and DiPeptide Composition (DPC) from the profiles. They used the 1189 dataset, with 1092 protein domains and sequence similarity lower than 40%, and the 25PDB dataset, with 1673 protein domains and 25% sequence similarity. They reported overall accuracies of 70.7% and 72.9% for the 1189 and 25PDB datasets, respectively (Liu et al., 2010).


Table 1. List of enzyme classification attempts based on enzyme sequences

Method/Tool | Model Name | Accuracy | Notes | Reference
SVM | - | 80.3% | - | (Cai et al., 2004)
Neural Network, MEME | - | (95.76%, 25) and (95.38%, 25) | The corresponding results for MAST and SAM, respectively | (Blekas et al., 2005)
Reverse PSI-BLAST | EzyPred | 90% | Overall, three levels | (Shen and Chou, 2007)
BLAST | Sequence information | 40.5% | For all EC classes | (Audit et al., 2007)
PSI-BLAST | ModFun | 80%, 90%, 80% | Levels 1, 2, 3, respectively | (Espadaler et al., 2008)
Neural Network | FANN-GO | 75% | FANN-GO was evaluated against three different strategies | (Clark and Radivojac, 2011)
BLOSUM30 & BLOSUM62 | - | 88.2%, 89.9%, 93.3% | GA BLOSUM62, GA BLOSUM30, and LA, respectively | (Sharif et al., 2015b)
BLOSUM62, BLOSUM100, BLOSUM30, PAM10, PAM100, DAYHOFF, GONNET | - | 93.5%, 90.5% | Unbalanced and balanced data experiments, respectively | (Sharif et al., 2016)
BLOSUM62, BLOSUM100, BLOSUM30, PAM10, PAM100, DAYHOFF, GONNET | - | 93.9%, 92.2% | Unbalanced and balanced data experiments, respectively | (Tharwat et al., 2015b)

Liang et al. proposed a novel model called MBMGAC-PSSM by fusing PSSM with three autocorrelation descriptors: normalized Moreau-Broto autocorrelation, Moran autocorrelation, and Geary autocorrelation. They started from a 560-dimensional feature vector and used Principal Component Analysis (PCA) to reduce these features to 175. They performed rigorous jackknife cross-validation tests on three widely used low-similarity datasets, 1189, 25PDB, and 640. Their model achieved competitive prediction accuracies and outperformed other existing PSSM-based methods, reaching 76.3%, 77.2%, and 79.1% for the 1189, 25PDB, and 640 datasets, respectively (Liang et al., 2015). Zhang et al. developed a new computational method to predict the protein structural class by incorporating two novel features, the alternating word frequency and the normalized Lempel-Ziv complexity, to represent a protein sample. They reported prediction accuracies of 83.6%, 81.8%, and 83.6% for the 25PDB, 1189, and 640 benchmarks, respectively; comparison with other methods showed that their approach achieved promising results (Zhang et al., 2014). Table 2 shows a comparison between many studies that used protein structure for enzyme function prediction.

One of the most notable studies conducted in this direction is the model developed by Kumar and Choudhary, who proposed a supervised machine learning model to predict the function class and subclass of enzymes based on a set of derived features (Kumar & Choudhary, 2012). They used an efficient data mining algorithm, Random Forest, and a supervised machine learning model using a set


Table 2. Comparison between related studies based on protein structure

Dataset | Method/Author/Reference | All-α | All-β | α/β | α+β | OA (%)
25PDB | AADP-PSSM (Liu et al., 2010) | 69.1 | 83.7 | 85.6 | 35.7 | 70.7
25PDB | AAC-PSSM-AC (Liu et al., 2012) | 85.3 | 81.7 | 73.7 | 55.3 | 74.1
25PDB | (Zhang et al., 2013) | 95.7 | 80.8 | 82.4 | 75.5 | 83.7
25PDB | (Ding et al., 2014) | 91.7 | 80.8 | 79.8 | 64.0 | 79.0
25PDB | (Wang et al., 2015) | 98.9 | 89.6 | 85.6 | 78.9 | 88.4
25PDB | MBMGAC-PSSM (Liang et al., 2015) | 86.7 | 81.5 | 79.5 | 61.7 | 77.2
640 | RKS-PPSC (Yang et al., 2010) | 89.1 | 85.1 | 88.1 | 71.4 | 83.1
640 | (Ding et al., 2014) | 92.8 | 88.3 | 85.9 | 66.1 | 82.7
640 | (Zhang et al., 2014) | 92.0 | 81.8 | 87.6 | 74.3 | 83.6
640 | (Kong et al., 2014) | 94.2 | 80.5 | 87.6 | 77.2 | 84.5
640 | MBMGAC-PSSM (Liang et al., 2015) | 86.2 | 83.1 | 85.3 | 63.2 | 79.1
640 | (Wang et al., 2015) | 93.7 | 83.3 | 91.5 | 79.5 | 87.5
FC699 | 11 features (Liu and Jia, 2010) | 97.7 | 88.0 | 89.1 | 84.2 | 89.6
FC699 | (Kong et al., 2014) | 96.2 | 90.7 | 96.3 | 69.5 | 92.0
FC699 | (Wang et al., 2015) | 97.7 | 97.4 | 97.1 | 79.3 | 95.6
1189 | RKS-PPSC (Yang et al., 2010) | 89.2 | 86.7 | 82.6 | 65.6 | 81.3
1189 | AAC-PSSM-AC (Liu et al., 2012) | 80.7 | 86.4 | 81.4 | 45.2 | 74.6
1189 | (Zhang et al., 2013) | 91.5 | 86.7 | 82.0 | 66.4 | 81.8
1189 | (Ding et al., 2014) | 89.2 | 88.8 | 85.6 | 58.5 | 81.2
1189 | (Kong et al., 2014) | 91.9 | 84.4 | 85.3 | 72.2 | 83.5
1189 | MBMGAC-PSSM (Liang et al., 2015) | 79.8 | 85.0 | 84.7 | 50.6 | 76.3
1189 | (Wang et al., 2015) | 98.7 | 89.1 | 89.2 | 73.4 | 87.6

* Values are prediction accuracies (%) per structural class; OA: Overall Accuracy.

of 73 sequence-derived features in order to construct a top-down model consisting of three layers: the first layer discriminates enzyme proteins from non-enzyme proteins; the second layer determines the six main classes of enzymes according to the definition of the International Union of Biochemistry and Molecular Biology (IUBMB); and the third layer identifies the sub-class of the enzyme protein. Their model reported overall classification accuracies of 94.87% for the first layer, 87.7% for the second, and 84.25% for the third layer. Yang et al. developed a new method based on numerical features extracted from the sequence, e.g. hydrophobicity, polarity, and charge properties, which were used to create feature vectors for a k-nearest neighbor (kNN) classifier. They used local and global sequence information to avoid some of the problems facing methods based on sequence similarity. They classified the 20 amino acids (residues) into five different classes based on their physicochemical characteristics, such as hydrophobicity, polarity, and acid-base properties (Yang et al., 2012). Table 3 reports some studies that used features for enzyme function prediction.


Table 3. List of enzyme classification attempts based on extracted features

Method | Feature Used | Accuracy | Notes | Reference
ANN, PSSM | Sequence-derived features | 79% | - | (Naik et al., 2007)
Self-Organizing Map (SOM) | Reaction describer, amino acid | 92%, 80%, 70% | For class, subclass, and sub-subclass levels, respectively | (Latino et al., 2008)
Structure template matching | Structure information | 87% | Accuracy in functional annotation of enzymes | (Kristensen et al., 2008)
Nearest neighbor algorithm | Sequence descriptors, amino acid composition | 95% | Accuracy to the level of enzyme class | (Nasibov and Kandemir-Cavas, 2009)
Support Vector Machine (SVM) | Composition and conjoint triad feature | 81% to 98% | Accuracy of predicting the first three EC digits | (Wang et al., 2011)
Random Forest, RFCRYS | Physicochemical features | 80% | - | (Jahandideh and Mahdavi, 2012)
Random Forest (RS) | Sequence-derived features | 87% | The model consists of 3 layers | (Kumar and Choudhary, 2012)
KNN | Physicochemical properties | 88.53% | - | (Yang et al., 2012)
Ensemble of five classifiers (AdaBoost.M1, LogitBoost, Naive Bayes, MLP, and SVM) | Physicochemical-based attributes | 90.3%, 96.6%, 74.8%, 76.7% | Using Z277, Z498, 1189, and 25PDB, respectively | (Dehzangi et al., 2013)
Boolean-Like Training Algorithm (BLTA) | 6 extracted features | 100% | With varied computational time (91.51±1.03, 100.26±3.83) | (Bharill et al., 2015)

ENZYME FUNCTION CLASSIFICATION BASED ON SEQUENCE SIMILARITY

As expected, sequence comparison through sequence alignment is central to most bioinformatics analyses. It is the first step towards understanding the evolutionary relationship and the pattern of divergence between two sequences. The relationship between two sequences also helps to predict the potential function of an unknown sequence, thereby indicating protein family relationships (Choudhuri, 2014). The first step in enzyme function classification using the sequence alignment method is to align the sequences; the similarity score between them is then calculated. In this section, more details about sequence alignment methods and calculating the similarity score are given.

Sequence Alignment

Sequence alignment is the process of arranging two (pairwise alignment) or more (Multiple Sequence Alignment, MSA) sequences of characters to define regions of similarity and calculate a similarity score. There are many problems and challenges that face the process of calculating similarity scores, such as (1) different lengths and different gaps in the sequences, and (2) a small matching/similar region with respect to the length of each sequence (Edgar, 2004; Blazewicz et al., 2013; Orobitg et al., 2014). The two pairwise sequence alignment methods, namely global and local sequence alignment, which are considered the most popular alignment methods, are explained below.

Pairwise Sequence Alignment

Assume X and Y are a pair of two different protein or DNA sequences, where X ≡ x_1 x_2 … x_m and Y ≡ y_1 y_2 … y_n, and the x_i and y_j are letters chosen from the alphabet. Figure 1 shows the result of aligning two different sequences. As shown, the outcomes of the alignment process are: (1) a match, when the two letters are the same; (2) a mismatch, when the two letters are different; (3) an insertion gap; and (4) a deletion gap. Gaps are an arbitrary number of null characters or spaces, represented by dashes. A gap may be understood as the insertion of a character into one sequence (insertion gap) or the deletion of a character from the other one (deletion gap), as shown in Figure 1. Gap penalties/scores can be calculated using different methods. The simplest gap penalty method is the constant gap penalty, in which the penalty is always a constant (g), while the linear gap penalty depends on the length of the gap. Gap penalties are letter independent. The gap penalty value is subtracted from the total alignment score to calculate the final score (Edgar, 2004).
Figure 1. Example of a pairwise sequence alignment process

Global Alignment

In the global alignment method, all letters and nulls in each sequence must be aligned from one end to the other, as shown in Figure 2a. To align two sequences, a score matrix is first built, and the two sequences are then aligned by tracing back through the cells of the score matrix. The score of each cell of the score matrix in global alignment is calculated as follows:

SIM(i, j) = max { SIM(i−1, j−1) + s(x_i, y_j), SIM(i−1, j) + g, SIM(i, j−1) + g }     (1)

where s(a, b) represents the score for aligning letters a and b, g represents the gap score, x_i is the i-th letter of X ≡ x_1 x_2 … x_m, y_j is the j-th letter of Y ≡ y_1 y_2 … y_n, and SIM(i, j) represents the similarity of x_i and y_j. In the first term of Equation (1), x_i and y_j are aligned, while in the second and third terms a null is aligned with x_i or y_j, respectively (Huang, 1994). Consider two different sequences X = TGCCGTG and Y = CTGTCGCTGCCG, whose lengths are m and n, respectively. The size of the scoring matrix is therefore (m + 1) × (n + 1). The score matrix is filled by assigning the two sequences' letters and the gap penalties to the left and top headers of the table, as shown in Table 4. The score of each cell in the matrix is the maximum over the three neighboring cells:

C(i, j) = max { C(i−1, j−1) + SIM(S_i, T_j), C(i−1, j) − gap, C(i, j−1) − gap }

After filling the score matrix, the two sequences are aligned by tracing back through the cells, starting from the bottom-right cell. In Table 4, bold font marks the trace-back of our example. In the trace-back, a vertical or horizontal move corresponds to a deletion or insertion gap, respectively, while a diagonal move represents matched or mismatched letters. Figure 2a shows the global alignment between the two sequences.
Figure 2. A comparison between local and global sequence alignment of a pair of sequences
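To make the dynamic-programming recurrence in Equation (1) concrete, the following is a minimal Python sketch of global (Needleman-Wunsch) alignment scoring. The match, mismatch, and gap values used here are illustrative defaults chosen for the sketch, not necessarily the exact scores used to produce Table 4.

```python
# Minimal global (Needleman-Wunsch) alignment scoring sketch.
# Scoring values are illustrative; any substitution matrix and gap penalty can be plugged in.

def global_alignment_score(x, y, match=2, mismatch=-1, gap=-2):
    m, n = len(x), len(y)
    # (m+1) x (n+1) score matrix; first row/column hold accumulated gap penalties.
    sim = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        sim[i][0] = sim[i - 1][0] + gap
    for j in range(1, n + 1):
        sim[0][j] = sim[0][j - 1] + gap
    # Fill the matrix with the recurrence of Equation (1).
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            sim[i][j] = max(sim[i - 1][j - 1] + s,   # align x_i with y_j
                            sim[i - 1][j] + gap,     # align x_i with a gap
                            sim[i][j - 1] + gap)     # align y_j with a gap
    return sim[m][n]

if __name__ == "__main__":
    # The two example sequences used in the chapter.
    print(global_alignment_score("TGCCGTG", "CTGTCGCTGCCG"))
```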

Local Alignment

Smith and Waterman proposed the first algorithm that uses the local alignment method to align different sequences (Xiong, 2006). In other words, local alignment is used to find the most similar regions in the two sequences being aligned, as shown in Figure 2b. In Figure 2 we note that global alignment aligns the whole length of the two sequences, while local alignment aligns the most similar regions in the two sequences. Hence, local alignment isolates regions in the sequences and makes repeats easy to detect, while global alignment is suitable for overall similarity detection. The similarity between two letters in local alignment is calculated as in Equation (2).


Table 4. Example of the global alignment method

    | gap | T   | G   | C   | C   | G   | T   | G
gap | 0   | -2  | -4  | -6  | -8  | -10 | -12 | -14
C   | -2  | -2  | -4  | 0   | -2  | -4  | -6  | -8
T   | -4  | 2   | 0   | -2  | -2  | -4  | 0   | -2
G   | -6  | 0   | 6   | 4   | 2   | 2   | 0   | 4
T   | -8  | -2  | 4   | 4   | 2   | 0   | 6   | 4
C   | -10 | -4  | 2   | 8   | 8   | 6   | 4   | 4
G   | -12 | -6  | 0   | 6   | 6   | 12  | 10  | 8
C   | -14 | -8  | -2  | 4   | 10  | 10  | 10  | 8
T   | -16 | -10 | -4  | 2   | 2   | 8   | 14  | 12
G   | -18 | -12 | 0   | 0   | 0   | 6   | 12  | 18
C   | -20 | -14 | -2  | 4   | 4   | 4   | 10  | 16
A   | -22 | -16 | -4  | 2   | 2   | 2   | 8   | 14
C   | -24 | -18 | -6  | 0   | 6   | 4   | 6   | 12
G   | -26 | -20 | -8  | -2  | 4   | 10  | 8   | 10

SIM(i, j) = max { 0, SIM(i−1, j−1) + s(x_i, y_j), SIM(i−1, j) + g, SIM(i, j−1) + g }     (2)

Given two different sequences X and Y with lengths m and n, respectively, where X = [TGCCGTG] and Y = [CTGTCGCTGCCG], the local and global alignment methods follow the same steps. The scoring matrix is built with size (m + 1) × (n + 1), and the first row and first column are initialized with zeros, as shown in Table 5. Equation (2) is used to calculate the score of each cell of the score matrix; the zero term keeps every cell non-negative, which is what distinguishes local from global alignment. After filling all cells of the score matrix, the two sequences are aligned by tracing back through the cells, starting from the cell with the maximum score, analogously to the trace-back in global alignment.
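For comparison with the global case, here is a minimal sketch of the local (Smith-Waterman) scoring recurrence of Equation (2); again, the match, mismatch, and gap values are illustrative and not necessarily those used to build Table 5.

```python
# Minimal local (Smith-Waterman) alignment scoring sketch.
# The zero in the max() is the only structural difference from the global recurrence.

def local_alignment_score(x, y, match=1, mismatch=-1, gap=-1):
    m, n = len(x), len(y)
    sim = [[0] * (n + 1) for _ in range(m + 1)]  # first row/column stay zero
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            sim[i][j] = max(0,
                            sim[i - 1][j - 1] + s,
                            sim[i - 1][j] + gap,
                            sim[i][j - 1] + gap)
            best = max(best, sim[i][j])
    return best  # score of the best-matching local region

if __name__ == "__main__":
    print(local_alignment_score("TGCCGTG", "CTGTCGCTGCCG"))
```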

Multiple Sequence Alignment (MSA)

Multiple Sequence Alignment (MSA) is an extension of pairwise sequence alignment: it is the alignment of three or more biological sequences, possibly of different lengths. In pairwise alignment, the alignment process searches a two-dimensional space, while in MSA the algorithms must search a much higher-dimensional space to align the sequences, as shown in Figure 3. This means that MSA algorithms need much more computation; hence, in MSA it is not feasible to search exhaustively for the optimal alignment, even for a small number of short sequences (Edgar, 2006; Notredame et al., 2000). The following subsections describe several methods used to tackle this problem, such as dynamic programming, progressive alignment, iterative alignment, statistical methods, and heuristic methods.


Table 5. Score matrix of the local alignment example

    | gap | T | G | C | C | G | T | G
gap | 0   | 0 | 0 | 0 | 0 | 0 | 0 | 0
C   | 0   | 0 | 0 | 1 | 1 | 0 | 0 | 0
T   | 0   | 1 | 0 | 0 | 0 | 0 | 1 | 0
G   | 0   | 0 | 1 | 0 | 0 | 1 | 0 | 1
T   | 0   | 1 | 0 | 0 | 0 | 0 | 2 | 1
C   | 0   | 0 | 0 | 1 | 1 | 0 | 1 | 1
G   | 0   | 0 | 1 | 0 | 0 | 2 | 1 | 2
C   | 0   | 0 | 0 | 2 | 1 | 1 | 1 | 1
T   | 0   | 1 | 0 | 1 | 1 | 0 | 2 | 1
G   | 0   | 0 | 2 | 1 | 0 | 2 | 1 | 3
C   | 0   | 0 | 1 | 3 | 2 | 1 | 1 | 2
A   | 0   | 0 | 0 | 2 | 2 | 1 | 0 | 1
C   | 0   | 0 | 0 | 1 | 3 | 2 | 1 | 0
G   | 0   | 0 | 1 | 0 | 2 | 4 | 3 | 2

Figure 3. Example of multiple sequence alignment

Dynamic Programming

Dynamic programming is one solution to the MSA problem. In dynamic programming, the problem is divided into smaller independent subproblems. First, all sequences are compared to each other (pairwise comparison). Second, the sequences are divided into groups of the most similar sequences. Third, clusters or subgroups that are similar are clustered together. In each cluster, the two most similar sequences are aligned, and the result is used to align them with a third sequence, and so on; this process continues until all sequences in the clusters have been processed. One advantage of the dynamic programming technique is that it can reach the globally optimal solution (Lipman et al., 1989). However, it is only practical for a small number of sequences, because it needs a great deal of CPU time and memory: the complexity of the dynamic programming algorithm for MSA is O(L^N), where L represents the length of the sequences and N represents the number of sequences. Hence, it is rarely used in this field.


Progressive/Hierarchical Alignment Construction

Progressive, hierarchical, or tree methods are considered the most popular MSA approaches for aligning multiple sequences, and they are widely used in many web tools such as T-Coffee (Notredame et al., 2000). First, the most similar sequences are aligned and a score or distance matrix is constructed. These scores are used to build the guide tree, which describes the relationship between sequences based on the pairwise alignments. The final step is to align the most similar pair of sequences and then align pairs of pairs, and so on, as shown in Figure 4. The main disadvantage of progressive alignment is that the guide tree topology may be considerably wrong; hence, aligning pairs of sequences may create errors that propagate through to the final result. The main steps of progressive alignment are summarized as follows:

1. Compute the pairwise alignments for all sequences and construct a distance matrix that holds the similarity scores. If we have N sequences, N(N − 1)/2 pairwise alignments are required; hence, this step needs considerable CPU time.
2. The distance matrix from the previous step is used to build the guide tree. This step starts by grouping the sequences that have the maximum similarity, and the procedure is repeated until all scores have been used. There are two types of trees, namely simple and compound trees. In simple trees, the branching order follows simple clustering, while compound trees have sub-clusters.
3. The sequences are aligned in bottom-up order, starting from the most similar (neighboring) pairs, then aligning pairs of pairs, and finally aligning sequences clustered to pairs of pairs deeper in the tree (Notredame et al., 1998; Gotoh, 1990; Vingron & Haeseler, 1997; Feng & Doolittle, 1987).

Assume we have four different sequences as follows: Seq1 = [TTT], Seq2 = [TTT], Seq3 = [GTT], and Seq4 = [GGT]. First, compute the pairwise alignments between all sequences, as shown in Table 6. The distance matrix is then constructed as in Table 7. Finally, the guide tree is built as in Figure 5.
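The short Python sketch below illustrates the first step of the progressive procedure on the four example sequences: computing the pairwise scores that populate Table 6 and that are then arranged into the distance matrix of Table 7. The per-letter scores (match on T = +6, match on G = +9, mismatch = −3) are read off the worked example and are assumed here only for illustration.

```python
# Sketch of progressive-alignment step 1 for the worked example:
# pairwise scores (Table 6) collected for the distance/score matrix (Table 7).
# Per-letter scores are inferred from the example and used for illustration only.
from itertools import combinations

SEQS = {"Seq1": "TTT", "Seq2": "TTT", "Seq3": "GTT", "Seq4": "GGT"}

def letter_score(a, b):
    if a == b:
        return 9 if a == "G" else 6   # example match scores
    return -3                          # example mismatch score

def pair_score(x, y):
    # Sequences in the example have equal length, so no gaps are needed here.
    return sum(letter_score(a, b) for a, b in zip(x, y))

scores = {}
for (n1, s1), (n2, s2) in combinations(SEQS.items(), 2):
    scores[(n1, n2)] = pair_score(s1, s2)

for pair, score in scores.items():
    print(pair, score)
# Expected: ('Seq1','Seq2') 18, ('Seq1','Seq3') 9, ('Seq1','Seq4') 0,
#           ('Seq2','Seq3') 9, ('Seq2','Seq4') 0, ('Seq3','Seq4') 12
```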

Table 6. Distance or score matrix of aligning the different sequences (first step in MSA)

Aligned Sequences | Score
Seq1 and Seq2 | 6+6+6 = 18
Seq1 and Seq3 | -3+6+6 = 9
Seq1 and Seq4 | -3-3+6 = 0
Seq2 and Seq3 | -3+6+6 = 9
Seq2 and Seq4 | -3-3+6 = 0
Seq3 and Seq4 | 9-3+6 = 12


Figure 4. Progressive sequence alignment

Table 7. Pairwise distances between the four sequences in the example

     | Seq1 | Seq2 | Seq3 | Seq4
Seq1 | -    | -    | -    | -
Seq2 | 18   | -    | -    | -
Seq3 | 9    | 9    | -    | -
Seq4 | 0    | 0    | 12   | -

Iterative Alignment

In the iterative method, the steps are approximately similar to those of the progressive method, but the iterative method repeatedly realigns the initial sequences as well as adding new sequences to the growing MSA. Iterative alignment tries to remove the weak point of progressive alignment: in progressive alignment, the accuracy depends on the initial pairwise alignments, whereas in the iterative method the MSA is re-iterated, starting with the pairwise re-alignment of sequences within subgroups, after which the subgroups are re-aligned. The choice of subgroups can be made randomly or via sequence relations on the guide tree, and so on. The steps of the iterative method are as follows:

1. Compute the pairwise alignments for all sequences against all sequences and construct a distance matrix.
2. Build the guide tree.
3. Iterate until the MSA does not change (convergence).


Figure 5. Guided Tree of the progressive alignment example

The iterative method is used in many software packages and web tools such as MUSCLE (MUltiple Sequence alignment by Log-Expectation) (Wallace et al., 2005), PRRN/PRRP (Gotoh, 1996), and the CHAOS/DIALIGN suite (Brudno et al., 2003). However, since iterative MSA is an optimization method, the process can get trapped in local minima and can be much slower.

Statistical Methods

The main idea of statistical methods is to assign a likelihood to all combinations of matches, mismatches, and gaps in order to obtain a set of multiple sequence alignments. Hidden Markov Models (HMMs) are among the most famous statistical methods. An HMM generates a family of possible alignments and offers relatively fast algorithms with good results (Mount, 2004). HMMs solve many problems by representing them as a Directed Acyclic Graph (DAG); hence, an MSA is represented in an HMM as a form of DAG consisting of a series of nodes representing possible entries in the columns of the MSA. There are many software packages that use HMMs, such as HMMER (Durbin, 1998), the SAM (Sequence Alignment and Modeling) system (Hughey & Krogh, 1996), and HHsearch (Soding, 2005).

Heuristic Algorithms

The main idea of heuristic methods is to search for a solution in a search space that is determined by an objective function formulated according to the problem. The best solutions are selected and then used to generate other solutions until the number of iterations ends or the error is minimized. However, heuristics may not find the optimal alignment, due to many factors. Heuristic methods are efficient for large-scale problems, while many other methods, such as dynamic programming, need more CPU time and storage, which may cause difficulties. Thus, many tools such as FASTA and BLAST use these heuristic search methods (Mount, 2004).

Calculating Similarity Score

After aligning a pair of sequences, the alignment or similarity score is calculated. Calculating the alignment score depends on the values of the matching and mismatching scores, which in turn depend mainly on the substitution matrix that holds the matching scores and the mismatching penalty scores. In other words, a substitution or scoring matrix is a set of values representing the likelihood of one residue being substituted by another. The scoring matrices for proteins are 20 × 20 matrices, as shown in Figure 6. The figure shows the matching and mismatching scores of the two well-known families of scoring matrices for proteins, i.e. PAM and BLOSUM. BLOSUM (BLOcks SUbstitution Matrices) is one of the most widely used families of substitution matrices. In the BLOSUM method, blocks are ungapped multiple sequence alignments corresponding to the most conserved regions of the sequences involved. The default BLOSUM matrix is BLOSUM62, meaning that the sequences used to create the BLOSUM62 matrix have approximately 62% identity. In PAM (Point Accepted Mutation or Percent Accepted Mutation), a substitution of one amino acid by another is one that has been fixed by natural selection. In Figure 6, the (i, j)-th cell of a PAM matrix denotes the probability that amino acid i will be replaced by amino acid j (Xiong, 2006). The original PAM matrix is PAM1; in PAM1, 1% of the amino acids in a sequence are expected to accept mutation. BLOSUM matrices were tested by Henikoff and Henikoff, who found that BLOSUM matrices performed better than PAM matrices. A PAM score matrix applies when the divergence is low; the higher the suffix number in PAM matrices, the better the matrix deals with distant sequence alignments. In contrast with PAM matrices, the higher the suffix number in BLOSUM, the better the matrix deals with closer sequences (Xiong, 2006). The alignment score is the sum of the scores (matching or mismatching) for aligning pairs of letters, i.e. the alignment of two letters, and the gap scores, i.e. the alignment of a gap with a letter.
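As a small illustration of how such a matrix is used, the sketch below scores an already-aligned pair of short peptide fragments with a hand-written fragment of a substitution table plus a linear gap penalty. The few substitution values, the gap penalty, and the example fragments are made up for the example; in practice the full BLOSUM62 or PAM table would be used.

```python
# Scoring an existing (gapped) alignment with a substitution matrix and a linear gap penalty.
# The tiny substitution table and gap penalty below are illustrative values only.

SUBS = {
    ("A", "A"): 4, ("A", "G"): 0, ("G", "G"): 6,
    ("A", "L"): -1, ("G", "L"): -4, ("L", "L"): 4,
}
GAP = -4  # illustrative linear gap penalty per gap position

def pair_score(a, b):
    if a == "-" or b == "-":
        return GAP
    return SUBS.get((a, b), SUBS.get((b, a), 0))  # the matrix is symmetric

def alignment_score(aligned_x, aligned_y):
    assert len(aligned_x) == len(aligned_y), "aligned sequences must have equal length"
    return sum(pair_score(a, b) for a, b in zip(aligned_x, aligned_y))

if __name__ == "__main__":
    # Two short, already-aligned peptide fragments (hypothetical example).
    print(alignment_score("GAL-A", "GA-LA"))
```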

Protein Enzyme Classification and Prediction Based on Enzyme Structure

Predicting protein function from structure has been the most successful approach, but since protein structures are known for less than 1% of known protein sequences, most proteins of newly sequenced genomes have to be characterized by their amino-acid sequences alone (Volpato et al., 2013). The structural class is one of the most important features for characterizing the overall folding type of a protein and has played a pivotal role in rational drug design, pharmacology, and many other applications (Zhou & Munt, 2001). The exponential growth of newly discovered protein sequences has created a large gap between the number of sequence-known and the number of structure-known proteins. Accurate prediction of protein secondary structural classes is a step towards understanding protein tertiary structural classes, on the grounds that the tertiary structure builds on the secondary structure with further refinements. Hence, there is a critical need to develop automated methods for fast and accurate determination of protein secondary structural classes in order to reduce this gap.

All proteins are made up of the 20 standard amino acids, which are encoded by DNA, and all these amino acids share a common general structure. The amino acid structure consists of an amino group and a carboxyl group; the amino acids differ among themselves in the R group, which distinguishes them from each other. The identity and properties of an amino acid depend on the nature of its R group, as shown in Figure 7. The nature of the amino acids and the way they combine specify the type of protein, and Figure 10 illustrates the biology and chemistry of protein structure. The amino acids are joined together by the ribosome to make a polypeptide through the elimination of water to form peptide bonds. As shown in Figure 8, the amino acids assemble to create the primary protein structure; in the secondary structure, the amino acids fold into two shapes, (a) the alpha helix and (b) the beta pleated sheet, and Figure 9 illustrates these two shapes.


Figure 6. Two examples of substitution matrices: (a) BLOSUM62 substitution matrix; (b) PAM120 substitution matrix (Zvelebil & Baum, 2008)

Figure 7. The basic structure of an amino acid

Figure 8. Levels of structure in proteins (Ball et al., 2011)


Figure 9. Illustration of (a) an alpha helix and (b) a beta pleated sheet (Ball et al., 2011)

The tertiary structure of a protein consists of alpha helices and beta pleated sheets arranged through a process called packing. Meanwhile, interactions between folded chains result in another form of protein known as the quaternary structure, which is more complex than the other structures (Satyanarayana et al., 2013). As a typical pattern recognition problem, computational methods for protein structural class prediction consist of three main stages: i) protein feature representation; ii) algorithm selection for classification; and iii) optimal feature selection. Among the three stages, feature extraction is the most critical factor and the most important phase in the success of protein structural class prediction. For this phase, models in common use include amino acid composition (AAC), polypeptide composition, functional domain composition, physicochemical features (Rao et al., 2011), PSI-BLAST profiles (Ding et al., 2014), and function annotation information (Li et al., 2012). Despite some success in prediction tasks, a carefully engineered integrated feature model generally offers higher accuracy and stability than a single feature (Li et al., 2014). It is known that proteins have irregular surfaces and complex 3D structures, but they fold into regular regional patterns at the secondary structure level. The protein secondary structures are classified into four categories: all-α, all-β, α/β, and α+β, where all-α proteins consist of only α-helices and all-β proteins consist of only β-strands. The α/β and α+β proteins mix α-helices and β-strands, where the former contain mainly parallel β-strands and the latter mainly anti-parallel β-strands (Li et al., 2014). The problem of protein structural class prediction is thus defined as categorizing a given protein into one of the four structural classes, namely all-α, all-β, α+β, and α/β. Because enough knowledge exists about the relation between protein structural class and function, determining the structural class of proteins that have low-similarity sequences remains a hot and promising topic in bioinformatics and computational biology. We need to bridge the gap between the exponentially growing number of newly discovered protein sequences and the number of structure-known proteins; meanwhile, relying on experimental techniques is time-consuming and expensive. Hence, there exists a critical challenge to develop automated methods


for fast and accurate determination of the structures of proteins in order to reduce the gap. Therefore, there is a need to develop reliable and effective computational methods for identifying the structural classes of newly found proteins based on their primary sequences. The protein secondary structural class prediction problem has been studied widely for almost a quarter of a century, and a group of different algorithms and methods have been proposed to predict protein secondary structural classes from amino acid sequences. Recently, some new methods have appeared. Petrey et al. (Petrey et al., 2015) developed a structure-based approach to predict whether two proteins interact, which relies heavily on homology models. These structures are used to identify geometrically similar proteins when one or more shared domains are found between putative interaction partners and a structure in the PDB or homology model databases. The scenario of this method consists of several steps: (a) scan a library of templates with known function; (b) templates can be proteins with various binding partners, including other proteins; (c) determine whether the query has functional properties similar to the template; and (d) use a machine learning technique, with properties of the three-dimensional structure, e.g. sequence conservation and covariation in the interface, as the input to the machine learning approach (Petrey et al., 2015).
Figure 10. Steps of the structure-based model (Petrey et al., 2015)

Protein Enzyme Classification and Prediction Based on Feature Extraction

In this section, the enzyme/protein sequence or structure is transformed into more biologically meaningful features (the feature-based approach), which makes it easier to distinguish between proteins from different functional classes. This method consists of two main steps, namely feature extraction and feature selection or dimensionality reduction.

Feature Extraction

The feature extraction step includes the definition of features of a sequence that can be used to encode the desired properties of a protein. Several of the popularly used features are motifs derived from a set of functionally or evolutionarily related proteins, functional domains, n-grams, and more biologically meaningful features such as the molecular weight, isoelectric point (theoretical pI), protein length, number of atoms, grand average of hydropathicity (GRAVY), amino acid composition, periodicity, physicochemical properties, predicted secondary structures, and the Van der Waals volume, as illustrated in Table 8 (Lee et al., 2007). Increasing the number of features leads to (1) the curse of dimensionality problem and (2) noisy and/or redundant features. This problem can be addressed using dimensionality reduction methods.
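As an illustration of the kind of sequence-derived features listed in Table 8, the sketch below computes a few of them (number of residues, amino acid composition, and GRAVY) for a protein sequence in plain Python. The Kyte-Doolittle hydropathy values used for GRAVY are the standard published values; the example sequence is arbitrary and used only to show the output.

```python
# A small sketch computing a few sequence-derived features from Table 8:
# number of residues, amino acid composition (%), and GRAVY (grand average of hydropathicity).
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def sequence_features(seq):
    seq = seq.upper()
    n = len(seq)
    counts = Counter(seq)
    composition = {aa: 100.0 * counts.get(aa, 0) / n for aa in KD}  # percentage of each amino acid
    gravy = sum(KD[aa] for aa in seq if aa in KD) / n               # grand average of hydropathicity
    return {"length": n, "composition": composition, "GRAVY": gravy}

if __name__ == "__main__":
    # Arbitrary short example sequence, used only to show the feature values.
    features = sequence_features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
    print(features["length"], round(features["GRAVY"], 3))
```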

Dimensionality Reduction

The goal of dimensionality reduction methods is to remove noisy, irrelevant, or redundant features. Dimensionality reduction can be achieved either by eliminating data closely related to other data in the set or by combining data to make a smaller set of features (Addison et al., 2003; Tharwat, 2016c, d, e). There are many dimensionality reduction methods, such as Principal Component Analysis (PCA) (Tharwat et al., 2012; Tharwat et al., 2013; Tharwat et al., 2014; Tharwat et al., 2015b; Tharwat, 2016b), Linear Discriminant Analysis (LDA) (Scholkopft & Mullert, 1999; Tharwat, 2016a; Semary et al., 2015), and Canonical Correlation Analysis (CCA) (Thompson, 2005). PCA is one of the most famous dimensionality reduction methods. It finds a linear projection of high-dimensional data into a lower-dimensional subspace that maximizes the variance and minimizes the least-squares reconstruction error. Because of this, PCA has been found very effective for feature extraction (Li & Li, 2008; Gaber et al., 2016). It has also been used extensively in studies involving the analysis of spectral data, with good and acceptable efficiency (Grill & Rush, 2000).
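A minimal PCA sketch in NumPy follows, projecting a feature matrix onto its first two principal components; the feature matrix X is a random stand-in for the kind of sequence-derived feature vectors discussed above.

```python
# Minimal PCA via SVD: center the data, then project onto the top principal components.
import numpy as np

def pca_project(X, n_components=2):
    X_centered = X - X.mean(axis=0)              # center each feature
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]               # top principal directions
    return X_centered @ components.T             # projected (reduced) data

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 73))               # e.g., 100 proteins x 73 derived features
    X_reduced = pca_project(X, n_components=2)
    print(X_reduced.shape)                       # (100, 2)
```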

Feature Selection

Feature selection is used to select the most discriminating features; hence, it has the same goal as dimensionality reduction. Feature selection is also known as subset selection, attribute selection, or variable selection. In other words, the goal of feature selection is to choose a small subset of features that is sufficient to predict the target classes accurately. Applying feature selection techniques in prediction methods achieves high performance because feature selection (i) avoids over-fitting and improves the prediction performance of the generated model; (ii) reduces the computational complexity of learning and prediction algorithms; (iii) provides faster and more cost-effective models; and (iv) gives a deeper insight into the underlying processes that generated the data (Saeys et al., 2007). Considering these benefits, feature selection techniques have been widely applied in the development of prediction methods. In the context of classification (prediction), feature selection techniques are grouped into three categories (Dash & Liu, 1997).
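A minimal sketch of a filter-style feature selection step is given below, assuming scikit-learn: each feature is scored against the class labels and only the top k are kept. X and y are placeholders for a protein feature matrix and enzyme class labels.

```python
# Minimal sketch: filter-based feature selection with mutual information scores.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))           # 200 proteins, 50 candidate features
y = rng.integers(0, 6, size=200)         # e.g. 6 top-level enzyme classes

selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                          # (200, 10)
print(np.flatnonzero(selector.get_support()))    # indices of the retained features
```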

Table 8. Discriminative features used for enzyme function classification (Lee et al., 2009)

1. Number of amino acids: Number of residues in each protein
2. Molecular weight: Molecular weight of the protein
3. Theoretical pI: The pH at which the net charge of the protein is zero (isoelectric point)
4. Amino acid composition: Percentage of each amino acid in the protein
5. Positively charged residue 2: Percentage of positively charged residues in the protein (lysine and arginine)
6. Positively charged residue 3: Percentage of positively charged residues in the protein (histidine, lysine, and arginine)
7. Number of atoms: Total number of atoms
8. Carbon: Total number of carbon atoms in the protein sequence
9. Hydrogen: Total number of hydrogen atoms in the protein sequence
10. Nitrogen: Total number of nitrogen atoms in the protein sequence
11. Oxygen: Total number of oxygen atoms in the protein sequence
12. Sulphur: Total number of sulphur atoms in the protein sequence
13. Extinction coefficient All: Amount of light a protein absorbs at a certain wavelength
14. Extinction coefficient No: Amount of light a protein absorbs at a certain wavelength
15. Instability index: The stability of the protein
16. Aliphatic index: The relative volume of the protein occupied by aliphatic side chains
17. GRAVY: Grand average of hydropathicity
18. PPR: Percentage of continuous changes from positively charged residues to positively charged residues
19. NNR: Percentage of continuous changes from negatively charged residues to negatively charged residues
20. PNPR: Percentage of continuous changes from positively charged residues to negatively charged residues or vice versa
21. NNRDist(x; y): Percentage of NNR from x to y (local information)
22. PPRDist(x; y): Percentage of PPR from x to y (local information)
23. PNPRDist(x; y): Percentage of PNPR from x to y (local information)
24. Charged: Physicochemical property
25. Negatively charged residues: Percentage of negatively charged residues in the protein
26. Polar: Physicochemical property
27. Aliphatic: Physicochemical property
28. Aromatic: Physicochemical property
29. Small: Physicochemical property
30. Tiny: Physicochemical property
31. Bulky: Physicochemical property
32. Hydrophobic: Physicochemical property
33. Hydrophobic and aromatic: Physicochemical property
34. Neutral, weakly and hydrophobic: Physicochemical property
35. Hydrophilic and acidic: Physicochemical property
36. Hydrophilic and basic: Physicochemical property
37. Acidic: Physicochemical property
38. Polar and uncharged: Physicochemical property
39. Amino acids pair ratio: Percentage composition for each of the 400 possible amino acid dipeptides

CONCLUSION

In this chapter, we introduced a general overview of the main approaches used to predict and classify enzyme proteins, together with several models and techniques developed by the authors and by others. We also covered some of the basic concepts


and terminology of the area, e.g. sequence alignment, multiple sequence alignment (MSA), feature extraction and selection, and dimensionality reduction. This chapter is an attempt to shed light on important aspects of bioinformatics applications that remain hot research topics today and in the foreseeable future.

REFERENCES

Addison, D., Wermter, S., & Arevian, G. (2003). A comparison of feature extraction and selection techniques. Proceedings of the International Conference on Artificial Neural Networks (Supplementary Proceedings), 212–215.

Ball, D. W., Hill, J. W., & Scott, R. J. (2011). The basics of general, organic, and biological chemistry. Academic Press.

Bharill, N., Tiwari, A., & Rawat, A. (2015). A novel technique of feature extraction with dual similarity measures for protein sequence classification. Procedia Computer Science, 48, 796–802. doi:10.1016/j.procs.2015.04.217

Blazewicz, J., Frohmberg, W., Kierzynka, M., & Wojciechowski, P. (2013). G-MSA: A GPU-based, fast and accurate algorithm for multiple sequence alignment. Journal of Parallel and Distributed Computing, 73(1), 32–41. doi:10.1016/j.jpdc.2012.04.004

Blekas, K., Fotiadis, D. I., & Likas, A. (2005). Motif-based protein sequence classification using neural networks. Journal of Computational Biology, 12(1), 64–82. doi:10.1089/cmb.2005.12.64 PMID:15725734

Brudno, M., Chapman, M., Gottgens, B., Batzoglou, S., & Morgenstern, B. (2003). Fast and sensitive multiple alignment of large genomic sequences. BMC Bioinformatics, 4(1), 66. doi:10.1186/1471-2105-4-66 PMID:14693042

Cai, C., Han, L., Ji, Z., & Chen, Y. (2004). Enzyme family classification by support vector machines. Proteins: Structure, Function, and Bioinformatics, 55(1), 66–76. doi:10.1002/prot.20045

Chen, W., Lin, H., Feng, P.-M., Ding, C., Zuo, Y.-C., & Chou, K.-C. (2012). iNuc-PhysChem: A sequence-based predictor for identifying nucleosomes via physicochemical properties. PLoS ONE, 7(10), e47843. doi:10.1371/journal.pone.0047843 PMID:23144709

Chou, K. C. (2005). Progress in protein structural class prediction and its impact to bioinformatics and proteomics. Current Protein & Peptide Science, 6(5), 423–436. doi:10.2174/138920305774329368 PMID:16248794

Choudhuri, S. (2014). Bioinformatics for Beginners: Genes, Genomes, Molecular Evolution, Databases and Analytical Tools. Elsevier.

Clark, W. T., & Radivojac, P. (2011). Analysis of protein function and its prediction from amino acid sequence. Proteins: Structure, Function, and Bioinformatics, 79(7), 2086–2096. doi:10.1002/prot.23029

Dash, M., & Liu, H. (1997). Feature selection for classification. Intelligent Data Analysis, 1(3), 131–156.

Ding, S., Li, Y., Shi, Z., & Yan, S. (2014). A protein structural classes prediction method based on predicted secondary structure and PSI-BLAST profile. Biochimie, 97, 60–65. doi:10.1016/j.biochi.2013.09.013 PMID:24067326

Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern classification (2nd ed.). John Wiley & Sons.

Durbin, R. (1998). Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press. doi:10.1017/CBO9780511790492

Edgar, R. C. (2004). MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5), 1792–1797. doi:10.1093/nar/gkh340 PMID:15034147

Edgar, R. C., & Batzoglou, S. (2006). Multiple sequence alignment. Current Opinion in Structural Biology, 16(3), 368–373. doi:10.1016/j.sbi.2006.04.004 PMID:16679011

Eisenberg, D., Marcotte, E. M., Xenarios, I., & Yeates, T. O. (2000). Protein function in the post-genomic era. Nature, 405(6788), 823–826. doi:10.1038/35015694 PMID:10866208

Espadaler, J., Eswar, N., Querol, E., Aviles, F. X., Sali, A., Marti-Renom, M. A., & Oliva, B. (2008). Prediction of enzyme function by combining sequence similarity and protein interactions. BMC Bioinformatics, 9(1), 1. doi:10.1186/1471-2105-9-249 PMID:18505562

Faria, D., Ferreira, A. E., & Falcão, A. O. (2009). Enzyme classification with peptide programs: A comparative study. BMC Bioinformatics, 10(1), 231. doi:10.1186/1471-2105-10-231 PMID:19630945

Feng, D. F., & Doolittle, R. F. (1987). Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution, 25(4), 351–360. doi:10.1007/BF02603120 PMID:3118049

Gaber, T., Tharwat, A., Hassanien, A. E., & Snasel, V. (2016). Biometric cattle identification approach based on Weber's Local Descriptor and AdaBoost classifier. Computers and Electronics in Agriculture, 122, 55–66. doi:10.1016/j.compag.2015.12.022

Gennis, R. B. (2013). Biomembranes: Molecular structure and function. Springer Science & Business Media.

Gotoh, O. (1990). Consistency of optimal sequence alignments. Bulletin of Mathematical Biology, 52(4), 509–525. doi:10.1007/BF02462264 PMID:1697773

Gotoh, O. (1996). Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. Journal of Molecular Biology, 264(4), 823–838. doi:10.1006/jmbi.1996.0679 PMID:8980688

Grill, C. P., & Rush, V. N. (2000). Analysing spectral data: Comparison and application of two techniques. Biological Journal of the Linnean Society, 69(2), 121–138. doi:10.1111/j.1095-8312.2000.tb01194.x

Huang, X. (1994). On global sequence alignment. Computer Applications in the Biosciences (CABIOS), 10(3), 227–235. PMID:7922677

Hughey, R., & Krogh, A. (1996). Hidden Markov models for sequence analysis: Extension and analysis of the basic method. Computer Applications in the Biosciences (CABIOS), 12(2), 95–107. PMID:8744772

Jahandideh, S., & Mahdavi, A. (2012). RFCRYS: Sequence-based protein crystallization propensity prediction by means of random forest. Journal of Theoretical Biology, 306, 115–119. doi:10.1016/j.jtbi.2012.04.028 PMID:22726810

Kong, L., Zhang, L., & Lv, J. (2014). Accurate prediction of protein structural classes by incorporating predicted secondary structure information into the general form of Chou's pseudo amino acid composition. Journal of Theoretical Biology, 344, 12–18. doi:10.1016/j.jtbi.2013.11.021 PMID:24316044

Kumar, C., & Choudhary, A. (2012). A top-down approach to classify enzyme functional classes and sub-classes using random forest. EURASIP Journal on Bioinformatics & Systems Biology, 2012(1), 1–14. doi:10.1186/1687-4153-2012-1 PMID:22376768

Kurgan, L. A., & Homaeian, L. (2006). Prediction of structural classes for protein sequences and domains: Impact of prediction algorithms, sequence representation and homology, and test procedures on accuracy. Pattern Recognition, 39(12), 2323–2343. doi:10.1016/j.patcog.2006.02.014

Latino, D. A., Zhang, Q.-Y., & Aires-de Sousa, J. (2008). Genome-scale classification of metabolic reactions and assignment of EC numbers with self-organizing maps. Bioinformatics, 24(19), 2236–2244. doi:10.1093/bioinformatics/btn405 PMID:18676416

Lee, B. J., Lee, H. G., Lee, J. Y., & Ryu, K. H. (2007). Classification of enzyme function from protein sequence based on feature representation. Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering (BIBE 2007), 741–747. doi:10.1109/BIBE.2007.4375643

Lee, B. J., Shin, M. S., Oh, Y. J., Oh, H. S., & Ryu, K. H. (2009). Identification of protein functions using a machine-learning approach based on sequence-derived properties. Proteome Science, 7(1), 27. doi:10.1186/1477-5956-7-27 PMID:19664241

Li, F. M., & Li, Q. Z. (2008). Predicting protein subcellular location using Chou's pseudo amino acid composition and improved hybrid approach. Protein and Peptide Letters, 15(6), 612–616. doi:10.2174/092986608784966930 PMID:18680458

Li, L., Cui, X., Yu, S., Zhang, Y., Luo, Z., Yang, H., & Zheng, X. et al. (2014). PSSP-RFE: Accurate prediction of protein structural class by recursive feature extraction from PSI-BLAST profile, physical-chemical property and functional annotations. PLoS ONE, 9(3), e92863. doi:10.1371/journal.pone.0092863 PMID:24675610

Li, L., Zhang, Y., Zou, L., Li, C., Yu, B., Zheng, X., & Zhou, Y. (2012). An ensemble classifier for eukaryotic protein subcellular location prediction using gene ontology categories and amino acid hydrophobicity. PLoS ONE, 7(1), e31057. doi:10.1371/journal.pone.0031057 PMID:22303481

Liang, Y. Y., Liu, S. Y., & Zhang, S. L. (2015). Prediction of protein structural class based on different autocorrelation descriptors of position-specific scoring matrix. MATCH: Communications in Mathematical and in Computer Chemistry, 73(3), 765–784.

Lipman, D. J., Altschul, S. F., & Kececioglu, J. D. (1989). A tool for multiple sequence alignment. Proceedings of the National Academy of Sciences of the United States of America, 86(12), 4412–4415. doi:10.1073/pnas.86.12.4412 PMID:2734293

Liu, T., Geng, X., Zheng, X., Li, R., & Wang, J. (2012). Accurate prediction of protein structural class using auto covariance transformation of PSI-BLAST profiles. Amino Acids, 42(6), 2243–2249. doi:10.1007/s00726-011-0964-5 PMID:21698456

Liu, T., & Jia, C. (2010). A high-accuracy protein structural class prediction algorithm using support vector machine and PSI-BLAST profile. Biochimie, 92(10), 1330–1334. doi:10.1016/j.biochi.2010.06.013 PMID:20600567

Liu, T., Zheng, X., & Wang, J. (2010). Prediction of protein structural class for low-similarity sequences using support vector machine and PSI-BLAST profile. Biochimie, 92(10), 1330–1334. doi:10.1016/j.biochi.2010.06.013 PMID:20600567

Min, J.-L., Xiao, X., & Chou, K.-C. (2013). iEzy-Drug: A web server for identifying the interaction between enzymes and drugs in cellular networking. BioMed Research International.

Mount, D. W. (2004). Sequence and genome analysis. In Bioinformatics. Cold Spring Harbor Laboratory Press.

Naik, P. K., Mishra, V. S., Gupta, M., & Jaiswal, K. (2007). Prediction of enzymes and non-enzymes from protein sequences based on sequence derived features and PSSM matrix using artificial neural network. Bioinformation, 2(3), 107–112. doi:10.6026/97320630002107 PMID:18288334

Nanni, L., Mazzara, S., Pattini, L., & Lumini, A. (2009). Protein classification combining surface analysis and primary structure. Protein Engineering, Design & Selection, 22(4), 267–272. doi:10.1093/protein/gzn084 PMID:19188137

Notredame, C., Higgins, D. G., & Heringa, J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology, 302(1), 205–217. doi:10.1006/jmbi.2000.4042 PMID:10964570

Notredame, C., Holm, L., & Higgins, D. G. (1998). COFFEE: An objective function for multiple sequence alignments. Bioinformatics, 14(5), 407–422. doi:10.1093/bioinformatics/14.5.407 PMID:9682054

Orobitg, M., Guirado, F., Cores, F., Llados, J., & Notredame, C. (2014). High performance computing improvements on bioinformatics consistency-based multiple sequence alignment tools. Parallel Computing.

Petrey, D., Chen, T. S., Deng, L., Garzon, J. I., Hwang, H., Lasso, G., & Honig, B. et al. (2015). Template-based prediction of protein function. Current Opinion in Structural Biology, 32, 33–38. doi:10.1016/j.sbi.2015.01.007 PMID:25678152

Rao, H., Zhu, F., Yang, G., Li, Z., & Chen, Y. (2011). Update of PROFEAT: A web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Research, 39(suppl 2), W385–W390. doi:10.1093/nar/gkr284 PMID:21609959

Saeys, Y., Inza, I., & Larranaga, P. (2007). A review of feature selection techniques. Bioinformatics, 23(19), 2507–2517.

Satyanarayana, T., Littlechild, J., & Kawarabayasi, Y. (2013). Thermophilic microbes in environmental and industrial biotechnology: Biotechnology of thermophiles. Springer Science & Business Media. doi:10.1007/978-94-007-5899-5

Scholkopft, B., & Mullert, K. R. (1999). Fisher discriminant analysis with kernels. Neural Networks for Signal Processing, 9(1).

Semary, N. A., Tharwat, A., Elhariri, E., & Hassanien, A. E. (2015). Fruit-based tomato grading system using features fusion and support vector machine. Proceedings of the IEEE Conference on Intelligent Systems, 401–410. doi:10.1007/978-3-319-11310-4_35

Sharif, M. M., Tharwat, A., Hassanien, A. E., & Hefeny, H. A. (2015). Automated enzyme function classification based on pairwise sequence alignment technique. Proceedings of the Second Euro-China Conference on Intelligent Data Analysis and Applications (ECC 2015), 499–510. doi:10.1007/978-3-319-21206-7_43

Sharif, M. M., Tharwat, A., Hassanien, A. E., Hefeny, H. A., & Schaefer, G. (2015). Enzyme function classification based on Borda count ranking aggregation method. Proceedings of the 6th International Conference (Industrial Sessions).

Shen, H.-B., & Chou, K.-C. (2007). EzyPred: A top-down approach for predicting enzyme functional classes and subclasses. Biochemical and Biophysical Research Communications, 364(1), 53–59. doi:10.1016/j.bbrc.2007.09.098 PMID:17931599

Soding, J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics, 21(7), 951–960. doi:10.1093/bioinformatics/bti125 PMID:15531603

Tharwat, A. (2016a). Linear vs. quadratic discriminant analysis classifier: A tutorial. International Journal of Applied Pattern Recognition, 3(2), 145–180. doi:10.1504/IJAPR.2016.079050

Tharwat, A. (2016b). Principal component analysis: A tutorial. International Journal of Applied Pattern Recognition, 3(3), 197–240. doi:10.1504/IJAPR.2016.079733

Tharwat, A., Gaber, T., & Hassanien, A. E. (2016e). One-dimensional vs. two-dimensional based features: Plant identification approach. Journal of Applied Logic. doi:10.1016/j.jal.2016.11.021

Tharwat, A., Gaber, T., & Hassanien, A. E. (2014). Cattle identification based on muzzle images using Gabor features and SVM classifier. Proceedings of the Second Conference of Advanced Machine Learning Technologies and Applications, 236–247. doi:10.1007/978-3-319-13461-1_23

Tharwat, A., Ghanem, A. M., & Hassanien, A. E. (2013). Three different classifiers for facial age estimation based on k-nearest neighbor. Proceedings of the 9th International Computer Engineering Conference (ICENCO), 55–60.

Tharwat, A., Hassanien, A. E., & Elnaghi, B. E. (2016c). A BA-based algorithm for parameter optimization of Support Vector Machine. Pattern Recognition Letters. doi:10.1016/j.patrec.2016.10.007

Tharwat, A., Ibrahim, A., & Ali, H. A. (2012). Personal identification using ear images based on fast and accurate principal component analysis. Proceedings of the 8th International Conference on Informatics and Systems (INFOS), 56–59.

Tharwat, A., Ibrahim, A., Hassanien, A. E., & Schaefer, G. (2015). Ear recognition using block-based principal component analysis and decision fusion. In Pattern Recognition and Machine Intelligence (pp. 246–254). Springer. doi:10.1007/978-3-319-19941-2_24

Tharwat, A., Moemen, Y. S., & Hassanien, A. E. (2016d). A predictive model for toxicity effects assessment of biotransformed hepatic drugs using iterative sampling method. Scientific Reports, 6. PMID:27934950

Tharwat, A., Sharif, M. M., Hassanien, A. E., & Hefeny, H. A. (2015). Improving enzyme function classification performance based on score fusion method. Proceedings of the 10th International Conference on Hybrid Artificial Intelligent Systems, 530–542. doi:10.1007/978-3-319-19644-2_44

Thompson, B. (2005). Canonical correlation analysis. In Encyclopedia of Statistics in Behavioral Science. doi:10.1002/0470013192.bsa068

Tipton, K. (1994). Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB). Enzyme nomenclature. Recommendations 1992. Supplement: Corrections and additions. European Journal of Biochemistry, 223(1), 1–5.

Vingron, M., & von Haeseler, A. (1997). Towards integration of multiple alignment and phylogenetic tree construction. Journal of Computational Biology, 4(1), 23–34. doi:10.1089/cmb.1997.4.23 PMID:9109035

Volpato, V., Adelfio, A., & Pollastri, G. (2013). Accurate prediction of protein enzymatic class by n-to-1 neural networks. BMC Bioinformatics, 14(Suppl 1), S11. doi:10.1186/1471-2105-14-S1-S11 PMID:23368876

Wallace, I. M., & Higgins, D. G. (2005). Evaluation of iterative alignment algorithms for multiple alignment. Bioinformatics, 21(8), 1408–1414. doi:10.1093/bioinformatics/bti159 PMID:15564300

Wang, J., Wang, C., Cao, J., Liu, X., Yao, Y., & Dai, Q. (2015, January). classes for low-similarity sequences using reduced PSSM and position-based secondary structural features. Gene, 554(2), 241–248. doi:10.1016/j.gene.2014.10.037 PMID:25445293

Wang, Y.-C., Wang, Y., Yang, Z.-X., & Deng, N.-Y. (2011). Support vector machine prediction of enzyme function with conjoint triad feature and hierarchical context. BMC Systems Biology, 5(1), 1. doi:10.1186/1752-0509-5-S1-S1 PMID:21689481

Xiao, X., Min, J.-L., Wang, P., & Chou, K.-C. (2013). iCDI-PseFpt: Identify the channel-drug interaction in cellular networking with PseAAC and molecular fingerprints. Journal of Theoretical Biology, 337, 71–79. doi:10.1016/j.jtbi.2013.08.013 PMID:23988798

Xiong, J. (2006). Essential bioinformatics. Cambridge University Press. doi:10.1017/CBO9780511806087

Xu, Y., Ding, J., Wu, L.-Y., & Chou, K.-C. (2013a). iSNO-PseAAC: Predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition. PLoS ONE, 8(2), e55844. doi:10.1371/journal.pone.0055844 PMID:23409062

Xu, Y., Shao, X.-J., Wu, L.-Y., Deng, N.-Y., & Chou, K.-C. (2013b). iSNO-AAPair: Incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins. PeerJ, 1, e171. doi:10.7717/peerj.171 PMID:24109555

Yang, A., Li, R., Zhu, W., & Yue, G. (2012). A novel method for protein function prediction based on sequence numerical features. MATCH: Communications in Mathematical and in Computer Chemistry, 67(3), 833.

Yang, J., Zhang, D., Frangi, A. F., & Yang, J.-Y. (2004). Two-dimensional PCA: A new approach to appearance-based face representation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(1), 131–137.

Yang, J.-Y., Peng, Z.-L., & Chen, X. (2010). Prediction of protein structural classes for low-homology sequences based on predicted secondary structure. BMC Bioinformatics, 11(1), 1. PMID:20043860

Zhang, L., Zhao, X., & Kong, L. (2013). A protein structural class prediction method based on novel features. Biochimie, 95(9), 1741–1744. doi:10.1016/j.biochi.2013.05.017 PMID:23770446

Zhang, S., Ding, S., & Wang, T. (2011). High-accuracy prediction of protein structural class for low-similarity sequences based on predicted secondary structure. Biochimie, 93(4), 710–714. doi:10.1016/j.biochi.2011.01.001 PMID:21237245

Zhang, S., Liang, Y., & Yuan, X. (2014). Improving the prediction accuracy of protein structural class: Approached with alternating word frequency and normalized Lempel-Ziv complexity. Journal of Theoretical Biology, 341, 71–77. doi:10.1016/j.jtbi.2013.10.002 PMID:24140787

Zhou, G., & Assa-Munt, N. (2001). Some insights into protein structural class prediction. Proteins: Structure, Function, and Bioinformatics, 44(1), 57–59. doi:10.1002/prot.1071

Zvelebil, M. J., & Baum, J. O. (2008). Understanding Bioinformatics. Garland Science.


Chapter 9

A Review of Vessel Segmentation Methodologies and Algorithms: Comprehensive Review

Gehad Hassan Fayoum University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Aboul Ella Hassanien Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt

ABSTRACT

"Prevention is better than cure" is a true statement that all of us tend to neglect. One of the main factors behind a speedy recovery from any disease is discovering it before it reaches an advanced stage. Hence the importance of computer systems that save time and achieve accurate results in detecting diseases and their first symptoms. One of these systems is the retinal image analysis system, which plays a key role as the first step of Computer Aided Diagnosis (CAD) systems, in addition to monitoring the patient's health status under different treatment methods to assess how they affect the disease. In this chapter the authors examine most of the approaches used for vessel segmentation of retinal images and present a review of techniques, comparing their quality and accessibility, analyzing and categorizing them. The chapter gives a description of each approach and highlights its key points and performance measures.

INTRODUCTION

Retinal image analysis is one of the systems that helps in diagnosing many diseases before they reach advanced stages, such as hypertension, diabetic retinopathy, hemorrhages, macular degeneration, glaucoma, neovascularization and vein occlusion, in addition to achieving accurate results and saving time (Bernardes, Serranho, & Lobo, 2011). It is the main first step of Computer Aided Diagnosis (CAD) systems and of the registration of patient images. The diagnosis is done by detecting some morphological features and attributes of the


retinal vasculature, such as width, length, branching pattern, tortuosity and angles. At another level, manual detection of the retinal vasculature is very difficult because of the complexity and the low contrast of blood vessels in the retinal image (Asad, Azar, & Hassanien, 2014). Hence the importance of vessel segmentation as a preliminary step in most medical applications. No single method exists that segments the vasculature from every retinal image modality, so when classifying segmentation methods we should keep in mind factors such as the application domain, whether the method is automated or semi-automated, the imaging modality, and other factors (Miri & Mahloojifar, 2011; Fraz, Remagnino, Hoppe, & Barman, 2013). We should also not lose sight of the amount of effort and time taken by manual retinal blood vessel segmentation, in addition to the need for training and skill. Sometimes a preprocessing step is needed before the actual segmentation algorithm is executed, due to factors such as noise or bad acquisition that affect image quality. Conversely, some methods perform post-processing in order to treat problems that arise after segmentation, and other methods do neither. In this chapter, the authors present a review of the main methodologies of blood vessel segmentation; provide researchers with a ready reference to the algorithms employed for vessel segmentation; discuss the advantages and limitations of these approaches; and discuss current trends and future challenges that remain open. The chapter then discusses the proposed approach for vessel segmentation, which is explained in detail in the next sections.

BACKGROUND

Retinal Image Processing

Retinal Photography

Creating a photograph of the interior surface of the eye, containing the retina, macula, optic disc, and posterior pole, is called fundus photography (also called fundography) (Lee, et al., 2000). A fundus camera is used to take this fundus photograph; it consists of a specialized low-power microscope with an attached camera (Cassin & Solomon, 1990b; Saine, 2011). The fundus camera basically operates in three modes:

1. Color Photography: Examining the retina in full color under white light illumination.
2. Red Free Photography: Improving the contrast of the vessels and other structures, where the imaging light is filtered to remove red colors.
3. Angiography Photography: The vessels are brought into high contrast by intravenous injection of a fluorescent dye. The retina is illuminated with an excitation color which fluoresces light of another color where the dye is present. By filtering to exclude the excitation color and pass the fluorescent color, a very high-contrast image of the vessels is produced. Shooting a timed sequence of photographs of the progression of the dye into the vessels reveals the flow dynamics and related pathologies. Specific methods include sodium fluorescein angiography (abbreviated FA or FAG) and indocyanine green (abbreviated ICG) angiography (Cassin & Solomon, 1990a).


Medical Image Analysis

The medical field is an important source of images. A large amount of image information is generated because of the multiplicity of imaging modalities such as MRI (Magnetic Resonance Imaging), CT (Computed Tomography), PET (Positron Emission Tomography), etc. Not only do image resolution and size grow with new, improved technology, but the number of image dimensions also increases. Previously, medical staff mainly studied two-dimensional images produced by X-ray, but now three-dimensional volumes are common in routine practice (Läthén, 2010), and four-dimensional data (three-dimensional images that change over time) are also used. Key technical challenges are introduced by this huge increase in size and dimensionality: we need to store, transmit, inspect and find relational information in all of these data. Here, automatic or semi-automatic algorithms become of interest. We want algorithms that automatically detect lesions, diseases and tumors and point out their locations in the huge heap of images. But another problem presents itself: we must also be able to trust the results of these algorithms. This is an especially important issue in medical applications; we cannot accept algorithms that miss fatal diseases or that raise false alarms. It is therefore important to perform validation studies to make algorithm results for medical image analysis usable. Another dimension is added to the research process, which involves communication between two dissimilar worlds: the medical world, which focuses on the patient, and the technical world, which focuses on the computer. Coexistence between these two worlds is rarely found, and both sides must join together and make great efforts to achieve a common goal.

Retinal Vessel Segmentation

Image segmentation is one of the most popular problems in computer vision (Terzopoulos, 1984). Handling effects in general images such as shadows, highlights, object occlusion, and transparency is considered a very difficult problem. Segmentation may be an easy task or a difficult one depending on the image characteristics. On the one hand, an anatomic region is usually what the imaging is focused on (Asad, Azar, & Hassanien, 2012). While context may provide some guidance in segmenting general images (e.g., indoor vs. outdoor, people vs. animals, city vs. nature), a medical imaging task is more constrained because the method, conditions, and identity of the imaged organ are known. In addition, there are limitations on pose variations, and prior knowledge of the Region of Interest (ROI) and the number of tissues is available (Deserno & Thomas, 2010). On the other hand, the poor quality of the images produced in the medical field is one of the challenges; because of it, we find it difficult to segment the anatomical region from the background. To discriminate between foreground and background, we depend not only on intensity variations but also on additional cues to isolate ROIs. In summary, in medical imaging segmentation is used as an essential tool for many purposes, one of which is detection or diagnosis, such as the segmentation of anatomical surfaces of blood vessels; this is what we discuss next in more detail, starting with the anatomy of the retina. The retinal vasculature consists of both arteries and veins, which appear as outspread features with visible tributaries within the retinal medical image. Vessel widths vary depending on both the image resolution and the vessel itself, ranging from one pixel to twenty pixels. The ocular fundus image shows other structures, including the optic disc, the retina boundary, and pathologies which take the form of bright and dark lesions, cotton wool spots, and exudates. If we take the cross-sectional intensity profile of a vessel in a


retinal medical image, we note an approximately Gaussian shape. The grey-level intensity and the orientation of a vessel change gradually along its length. Regarding vessel shape, the vasculature seems to take the structure of a connected tree (Emary, Zawbaa, Hassanien, Tolba, & Sansel, 2014). However, there is huge variation in shape, local grey level, and size, and, on the other hand, some features of the background may have attributes similar to vessels. Vessel crossing and branching can further complicate the profile model. As with the processing of most medical images, signal noise, drift in image intensity and lack of image contrast pose significant challenges to the extraction of blood vessels. A central vessel reflex, which indicates the presence of a strong reflection along the retinal vessel centerline, is clearer in arteries than in veins; it is stronger in images taken at long wavelengths and in the retinal images of younger patients. Contrary to classical segmentation, vessel segmentation has some characteristics that depend on its particular aims, such as:

1. Complex topologies and branches should be correctly detected;
2. Vessels should be detected with different thicknesses (ranging from very thick to very thin);
3. Small occlusions (false disconnections) should be repaired;
4. Noise which is incorrectly segmented should be removed; and
5. The vessel's minimum thickness should be kept under control.

Moreover, robust, automatic, and efficient methods must be taken into account when vessel segmentation is used in a medical real-time environment (Emary, Zawbaa, Hassanien, Schaefer, & Azar, 2014), so in return for all these requirements we find very challenging problems in this domain.

CLASSIFICATION OF RETINAL VESSEL SEGMENTATION APPROACHES

The authors have divided the retinal vessel segmentation algorithms into seven main categories:

• Pattern recognition techniques.
• Matched filtering.
• Vessel tracking/tracing.
• Mathematical morphology.
• Multiscale approaches.
• Model based approaches.
• Parallel/hardware based approaches.

Some of these categories are further divided into subcategories.

PATTERN CLASSIFICATION AND MACHINE LEARNING

Pattern recognition algorithms handle the automatic detection and classification of blood vessel features in retinal images as opposed to non-vessel objects, the main one being the background. Humans are well adapted to performing


pattern recognition tasks. There are two main categories of pattern recognition techniques for vessel segmentation:

1. Supervised Approaches: A pixel is decided to be vessel or non-vessel based on prior labeling information, which is exploited to make this decision.
2. Unsupervised Approaches: Vessel segmentation is performed without any prior labeling knowledge.

Supervised Approaches

In these methods, the gold standard for vessel extraction is a training set of reference images that have been manually processed and segmented; an ophthalmologist provides this gold standard by precisely marking the images. In a supervised method, the algorithm performs its classification according to given features, so classified ground truth data have to be available, which is the main condition of this type of classification; in some real-life applications such data are not available. On healthy retinal images, supervised methods usually produce better results than unsupervised methods because of their dependence on pre-classified data. As noted above, a supervised method classifies each pixel as vessel or non-vessel. In (Niemeijer, Staal, Van Ginneken, Loog, & Abramoff, 2004), a 31-feature set is extracted using Gaussians and their derivatives and classified with the k-Nearest Neighbor (kNN) classifier. In (Staal, Abramoff, Niemeijer, Viergever, & Van Ginneken, 2004), the algorithm was improved using ridge-based detection: the image is partitioned by assigning each pixel to its nearest ridge element, and a 27-feature set is computed for each pixel and used by the kNN classifier. These methods have two main disadvantages: first, the large size of the feature sets, which slows the algorithm down, and second, the dependency on the training data and the sensitivity to false edges. Another method, presented in (Soares, Leandro, Cesar, Jelinek, & Cree, 2006), applies a Gaussian Mixture Model (GMM) classifier to a 6-feature set extracted using Gabor wavelets. This method is also characterized by dependence on its training data and requires many hours to train GMM models with a mixture of 20 Gaussians. The method in (Ricci & Perfetti, 2007) uses line operators and a support vector machine (SVM) classifier, with a 3-feature set extracted per pixel; it is sensitive to the training data and computationally intensive because it uses the SVM classifier. Boosting and bagging strategies are applied in (Fraz, et al., 2012b) for vessel classification with 200 decision trees and Gabor filters that extract a 9-feature set; because of the boosting strategy, this method has high computational complexity. A method with an independent training data set is presented in (Marin, Aquino, Gegundez-Arias, & Bra, 2011). It extracts a 7-feature set using a moment-invariants-based method and neighborhood parameters, with a neural network as classifier; the motivation of this method is to design an algorithm with low dependence on training data and fast computation. The issue of computational complexity has also been addressed in (Perfetti, Ricci, Casali, & Costantin, 2007) and (Lam, Gao, & Liew, 2010). In (Roychowdhury, Koozekanani, & Parhi, 2014), a Gaussian Mixture Model (GMM) classifier is used with 8 features extracted from the pixel neighborhood and first- and second-order gradient images. This method has good consistency in vessel segmentation accuracy because it reduces the number of pixels to be classified and identifies an optimal feature set, and it also has low computational complexity. The performance measures adopted for evaluating the efficiency of supervised classification of retinal vessels are illustrated in Table 1.
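The sketch below illustrates the general pattern shared by these supervised methods (it is not any specific published pipeline): per-pixel features are built from Gaussian derivative responses, and a classifier is trained on pixels whose labels come from a manually segmented ground-truth image. The arrays `image` and `ground_truth` are random placeholders standing in for, e.g., a DRIVE image and its manual segmentation.

```python
# Minimal sketch of supervised, pixel-level vessel classification.
import numpy as np
from scipy import ndimage as ndi
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
image = rng.random((128, 128))                 # stand-in for a green-channel image
ground_truth = (image > 0.7).astype(int)       # stand-in for manual vessel labels

def pixel_features(img, scales=(1, 2, 4)):
    """Stack smoothed intensities and gradient magnitudes at several scales."""
    maps = []
    for s in scales:
        maps.append(ndi.gaussian_filter(img, s))
        maps.append(ndi.gaussian_gradient_magnitude(img, s))
    return np.stack(maps, axis=-1).reshape(-1, 2 * len(scales))

X = pixel_features(image)
y = ground_truth.ravel()

# Train on a random subset of labelled pixels, then predict a full vessel map.
idx = rng.choice(len(y), size=5000, replace=False)
clf = KNeighborsClassifier(n_neighbors=15).fit(X[idx], y[idx])
vessel_map = clf.predict(X).reshape(image.shape)
print(vessel_map.sum(), "pixels labelled as vessel")
```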


Table 1. Performance measures for supervised methods

Test Data | Drive Test: Acc | Drive Test: Specificity | Drive Test: Sensitivity | Stare Test: Acc | Stare Test: Specificity | Stare Test: Sensitivity
(Niemeijer, Staal, Van Ginneken, Loog, & Abramoff, 2004) | 0.942 | 0.969 | 0.689 | _ | _ | _
(Staal, Abramoff, Niemeijer, Viergever, & Van Ginneken, 2004) | 0.944 | 0.977 | 0.719 | 0.952 | 0.981 | 0.697
(Soares, Leandro, Cesar, Jelinek, & Cree, 2006) | 0.946 | 0.978 | 0.733 | 0.948 | 0.975 | 0.72
(Ricci & Perfetti, 2007) | 0.959 | 0.972 | 0.775 | 0.965 | 0.939 | 0.903
(Fraz, et al., 2012b) | 0.948 | 0.981 | 0.74 | 0.953 | 0.976 | 0.755
(Marin, Aquino, Gegundez-Arias, & Bra, 2011) | 0.945 | 0.98 | 0.706 | 0.952 | 0.982 | 0.694
(Lam, Gao, & Liew, 2010) | 0.947 | _ | _ | 0.957 | _ | _
(Roychowdhury, Koozekanani, & Parhi, 2014) | 0.952 | 0.983 | 0.725 | 0.951 | 0.973 | 0.772

Unsupervised Approaches

Unsupervised classification approaches aim to find inherent patterns of blood vessels and then use these patterns to determine whether a particular pixel is a vessel or not. In these approaches the training data do not directly contribute to the design of the algorithm. An unsupervised approach presented in (Kande, Subbaiah, & Savithri, 2010) corrects the non-uniform illumination of color fundus images by using the intensity information of the red and green channels of the same retinal image. Matched filtering is then used to enhance the contrast of blood vessels against the background. Finally, the vascular tree structure of the retinal image is identified by applying connected component labeling after weighted fuzzy C-means clustering. In (Ng, Clay, Barman, & Feilde, 2010), a vessel detection system is presented based on maximum likelihood inversion of a model of image formation. Second-derivative Gaussian filters are applied to the images at several scales, and the presence of vessels and their properties are inferred from the filter outputs. For blood vessel detection, a generative model of a Gaussian-profiled valley is proposed and the corresponding filter outputs are calculated. A Gaussian noise model is assumed, and the covariance of the filter outputs is calculated for the isotropic case. These models are incorporated into a maximum likelihood estimator to estimate the image and noise model parameters. The system estimates the contrast, width, and direction of the blood vessel at every point in the image, and also produces likelihoods of the model with additive noise. The vessel centerline is then detected by using these likelihoods in conjunction with the previously estimated vessel parameters, and finally the model marks the vessel by combining the estimated width parameter with this centerline. The Gray-Level Co-occurrence Matrix (GLCM) in combination with local entropy information is used in (Castaldi, Fabiola, & River, 2010) for vessel segmentation: a matched filter is applied to enhance the vessel structure, the GLCM is then computed, and a statistical feature calculated from it is used as a threshold. Another method is presented in (Zhang, Cui, Jiang, & Wang, 2015). It first constructs a multidimensional feature vector from the green channel intensity and from the vessel intensity enhanced by a morphological operation. As a second step, pixel clustering is performed by constructing a Self-Organizing Map (SOM), an unsupervised neural network. In the last stage, each neuron in the output layer of the SOM is classified as vessel or non-vessel using Otsu's method, and finally local entropy thresholding is applied to segment the vessel network. The performance measures adopted for evaluating the efficiency of unsupervised classification of retinal vessels are illustrated in Table 2.
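The sketch below gives the flavor of such pipelines under simplified assumptions: pixels are clustered without labels, and the cluster with the stronger vessel-like response is kept. Plain k-means is used here as a stand-in for the fuzzy C-means and SOM clustering steps described above, and `green` is a random placeholder for a fundus green channel.

```python
# Minimal sketch of unsupervised vessel segmentation by pixel clustering.
import numpy as np
from scipy import ndimage as ndi
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
green = rng.random((128, 128))                       # stand-in fundus green channel

# Simple vessel enhancement: difference between a smoothed background and the image
# (vessels are darker than the background in the green channel).
background = ndi.median_filter(green, size=15)
enhanced = background - green

features = np.stack([green.ravel(), enhanced.ravel()], axis=1)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Keep the cluster with the stronger mean enhanced response as the vessel cluster.
vessel_cluster = np.argmax([enhanced.ravel()[labels == k].mean() for k in (0, 1)])
vessel_mask = (labels == vessel_cluster).reshape(green.shape)
print(vessel_mask.mean(), "fraction of pixels assigned to the vessel cluster")
```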

Table 2. Performance measures for unsupervised methods

Test Data | Drive Test: Acc | Drive Test: Specificity | Drive Test: Sensitivity | Stare Test: Acc | Stare Test: Specificity | Stare Test: Sensitivity
(Kande, Subbaiah, & Savithri, 2010) | 0.891 | _ | _ | 0.898 | _ | _
(Ng, Clay, Barman, & Feilde, 2010) | _ | 0.953 | 0.700 | _ | _ | _
(Castaldi, Fabiola, & River, 2010) | 0.976 | 0.948 | 0.9648 | 0.948 | 0.975 | 0.72
(Zhang, Cui, Jiang, & Wang, 2015) | 0.940 | _ | _ | _ | _ | _

MATCHED FILTERING

In this approach, a 2-D kernel is convolved with the retinal image to detect the vasculature. The kernel models a feature of the image at some orientation and position, and the matched filter response (MFR) is used as an indicator of the presence of that feature (Sreejini & Govindan, 2015). The following properties are used to design the matched filter kernel:

1. Vessels usually have limited curvature and may be approximated by piecewise linear segments.
2. The farther the vessels move radially outward from the optic disc, the smaller their diameter.
3. The cross-sectional intensity profile of a line segment approximately takes the shape of a Gaussian curve.

The convolution kernel is large and has to be applied at several rotations, resulting in a computational overhead. Also, to ensure an optimal response of the kernel, the underlying Gaussian function specified by the kernel must have roughly the same standard deviation as most vessels, so it is possible that the kernel does not respond to vessels with a different profile. Another source of false responses is the variation of the background and the existence of pathologies in the retinal image; this increases the number of false responses because pathologies and vessels may have similar local attributes. A matched filter achieves a good response when it is applied together with other processing techniques. A matched filter approach is used in (Cinsdikici & Aydin, 2009): the image is first preprocessed, then the matched filter and an ANT algorithm are applied to the image in parallel, and the results are combined and followed by length filtering to completely extract the vasculature. In (Zhang, Zhang, Zhang, & Karray, 2010), the classical matched filter is generalized and extended with the first-order derivative of the Gaussian (MF-FDOG). The method exploits the property that a blood vessel has a symmetric, Gaussian-shaped cross section with respect to its peak position, whereas non-vessel edges, e.g. the step edges of lesions, are asymmetric. To detect vessels, both the zero-mean Gaussian filter (MF) and the first-order derivative of the Gaussian (FDOG) are used. For a vessel structure, the MF response around the peak position is high while the local mean of the FDOG response is close to zero; in contrast, for a non-vessel structure, both the MF response and the local mean of the FDOG response are high. The advantage of this method is that many vessels missed by the MF are correctly detected, so the false detections produced by the original MF are reduced. Phase congruency is used in (Amin & Yan, 2011) to detect the retinal blood vessels. Phase congruency is first computed for the retinal image; this feature is soft and invariant to changes in both luminosity and contrast. Log-Gabor filters are applied to measure phase congruency, and finally the binary vessel tree is extracted by thresholding. The performance measures adopted for evaluating the efficiency of matched filtering methods are illustrated in Table 3.
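A minimal sketch of the classical matched-filter idea (not the MF-FDOG or phase-congruency variants) is shown below: a kernel with a zero-mean Gaussian cross-section is rotated over a set of orientations, each rotated kernel is convolved with the image, and the maximum response at every pixel is kept. The array `green` is a random placeholder for a green-channel image in which vessels appear darker than the background; the threshold at the end is a crude illustration, not a tuned value.

```python
# Minimal sketch of matched filtering for vessel detection.
import numpy as np
from scipy import ndimage as ndi

def matched_filter_kernel(sigma=2.0, length=9):
    """Negative Gaussian cross-section across x, constant along y, made zero-mean."""
    half = int(3 * sigma)
    x = np.arange(-half, half + 1)
    profile = -np.exp(-x**2 / (2 * sigma**2))
    kernel = np.tile(profile, (length, 1))
    return kernel - kernel.mean()

def matched_filter_response(image, n_angles=12, sigma=2.0, length=9):
    base = matched_filter_kernel(sigma, length)
    responses = []
    for angle in np.linspace(0, 180, n_angles, endpoint=False):
        k = ndi.rotate(base, angle, reshape=True, order=1)
        responses.append(ndi.convolve(image, k, mode='nearest'))
    return np.max(responses, axis=0)       # strongest response over all orientations

rng = np.random.default_rng(0)
green = rng.random((128, 128))             # stand-in retinal image (dark vessels)
mfr = matched_filter_response(green)
vessels = mfr > mfr.mean() + 2 * mfr.std() # crude global threshold
print(vessels.sum(), "pixels above threshold")
```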

Table 3. Performance measures for matched filtering methods

Test Data | Drive Test: Acc | Drive Test: Specificity | Drive Test: Sensitivity | Stare Test: Acc | Stare Test: Specificity | Stare Test: Sensitivity
(Cinsdikici & Aydin, 2009) | 0.929 | _ | _ | _ | _ | _
(Zhang, Zhang, Zhang, & Karray, 2010) | 0.938 | 0.972 | 0.712 | 0.948 | 0.975 | 0.718
(Amin & Yan, 2011) | 0.920 | _ | _ | _ | _ | _

MORPHOLOGICAL PREPROCESSING

The term morphology comes from a branch of biology that deals with the structures and forms of animals and plants. Mathematical morphology, in turn, is a tool for extracting image components that are useful in the description and representation of region shapes, such as boundaries, features, skeletons and convex hulls, and it provides a powerful and unified approach to a large number of image processing problems. Morphological image processing (Serra, 1982; Hassan, Elbendary, Hassanien, Shoeb, & Snasel, 2015) is a collection of techniques for digital image processing based on mathematical morphology. Morphological operators apply structuring elements (SE) to images, in particular to binary images or gray-level images. There are two main operators, dilation and erosion. In dilation, objects are expanded by a structuring element, holes are filled, and disjoint regions become connected. In erosion, objects are shrunk by a structuring element. There are two further compound operations, closing and opening: closing is a dilation followed by an erosion, and opening is an erosion followed by a dilation. In medical image segmentation, such operations are used as enhancement tools. The top-hat transformation performs a morphological opening to estimate the local background and then subtracts it from the original image, which enhances the vessels and leads to better results later in the segmentation process. Viewed from the morphological point of view, the vasculature is a collection of linear segments connected together to form the final shape. Reviewing the advantages and disadvantages of identifying shapes with morphological processing, we find that noise resistance and speed are among its main advantages; on the other hand, morphological processing uses long structuring elements, which makes fitting highly tortuous vessels difficult, and the known vessel cross-sectional shape is not exploited. Mathematical morphology and a fuzzy clustering algorithm are combined in (Yang, Huang, & Rao, 2008): a top-hat morphological operation enhances the blood vessels and removes the background, and the vessels are then extracted using fuzzy clustering. A morphological multi-scale enhancement method is presented in (Sun, Chen, Jiang, & Wang, 2011) for the extraction of blood vessels in angiograms, using a fuzzy filter and the watershed transformation. Multi-scale non-linear morphological opening operators with structuring elements of varying size are used to estimate the background, which is then subtracted from the image to achieve contrast normalization. A combined fuzzy morphological operation is applied to the normalized angiogram with twelve linear structuring elements of nine pixels length, rotated every 15° between 0° and 180°. The filtered image is thresholded to obtain the vessel region, a thinning operation is applied to approximate the vessel centerlines, and finally watershed techniques are applied to the vessel centerlines to detect the vessel boundaries. Another method, presented in (Fraz, et al., 2012a), combines a unique vessel centerline detection with morphological bit plane slicing. The first-order derivative of a Gaussian filter in four directions is used to extract the centerlines, followed by an average derivative and derivative signs applied to the extracted centerlines. Mathematical morphology has proven its worth as a brilliant technique for segmenting the blood vessels of the retina. A multidirectional morphological top-hat operation with a linear structuring element is applied to the gray-scale image of the blood vessels to obtain the orientation map and shape, and the enhanced vessels are then subjected to bit plane slicing; to obtain the vessel tree, these maps are combined with the centerlines. In (Miri & Mahloojifar, 2011), the fast discrete curvelet transform (FDCT) with multi-structure mathematical morphology is proposed: FDCT is performed for contrast enhancement, a multi-structure morphological transformation is applied to detect the blood vessel edges, morphological opening is applied to the resulting image to remove the false edges, and finally a connected adaptive component analysis is applied to obtain the complete final vascular tree. Another automated enhancement and segmentation method for blood vessels is presented in (Hou, 2014). This method decreases the influence of the optic disc and emphasizes the vessels by applying a multidirectional morphological top-hat transform with rotating structuring elements to the background of the retinal image. An improved multi-scale line detector is then applied to produce a vessel response image and the final blood vessel tree. As the line detectors in the multi-scale detector have different line responses, the longer line detectors produce more vessel responses than the shorter ones; the improved multi-scale detector combines all the responses at the different scales, setting different weights for different scales. The performance measures adopted for evaluating the efficiency of morphological processing methods of retinal vessels are illustrated in Table 4.
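A minimal sketch of morphological vessel enhancement with a top-hat transform is given below, assuming scikit-image is available. Because retinal vessels are darker than the background in the green channel, a black top-hat (closing minus image) highlights them; a simple threshold and small-object removal then give a rough vessel mask. The image and parameter values are illustrative placeholders only.

```python
# Minimal sketch of top-hat based vessel enhancement.
import numpy as np
from skimage import morphology, filters

rng = np.random.default_rng(0)
green = rng.random((128, 128))                    # stand-in green-channel image

selem = morphology.disk(8)                        # structuring element wider than the vessels
enhanced = morphology.black_tophat(green, selem)  # bright response on dark, thin structures

mask = enhanced > filters.threshold_otsu(enhanced)
mask = morphology.remove_small_objects(mask, min_size=30)  # drop isolated noise blobs
print(mask.sum(), "vessel pixels after cleaning")
```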

Table 4. Performance measures for morphological processing methods

Test Data | Drive Test: Acc | Drive Test: Specificity | Drive Test: Sensitivity | Stare Test: Acc | Stare Test: Specificity | Stare Test: Sensitivity
(Fraz, et al., 2012a) | 0.943 | 0.977 | 0.715 | 0.944 | 0.968 | 0.731
(Miri & Mahloojifar, 2011) | 0.946 | 0.979 | 0.735 | _ | _ | _
(Hou, 2014) | 0.942 | 0.969 | 0.735 | 0.934 | 0.965 | 0.735

MULTI-SCALE APPROACHES

The width of a vessel decreases as it travels radially outward from the optic disc, and this change in vessel caliber is a gradual one: the farther the vessel travels from the optic disc, the smaller its width becomes. Based on this property, a vessel can be defined as a contrasted pattern with a Gaussian-like, piecewise-connected, locally linear cross-section profile and a gradually decreasing width. To extract the complete vasculature of the retinal image with these methods, information related to the blood vessels at different scales is separated out. A supervised method for vessel segmentation in red-free retinal images is presented in (Anzalone, Bizzarri, Parodi, & Storace, 2008). The background of the retinal image is normalized for uneven illumination, and the vessels are then enhanced via scale-space theory. A supervised optimization algorithm is used to determine the optimal scale factor and the threshold for binarizing the segmented image, and a cleaning operation is then performed to ensure complete spur removal. Another blood vessel segmentation method, based on the multi-scale line operator (MSLO), is investigated in (Farnell, et al., 2008). Using Gaussian sampling, a series of images at progressively coarser length scales with respect to the original image (a Gaussian pyramid of sub-sampled images) is constructed, and the line operator is applied separately to the images at every level of this pyramid. Using a cubic spline, the resulting image at each level is mapped back to the original scale, and the final image is the sum of all images of the Gaussian pyramid. A weight is obtained for each length scale in the MSLO image, and a threshold is then calculated to produce a binary segmented image. Finally, any remaining noise is removed using a simple region growing algorithm. In (Vlachos & Dermatas, 2010), a multi-scale line tracking algorithm for blood vessel segmentation is proposed. First, contrast and luminosity normalization is performed, and a brightness selection rule based on the normalized histogram is used to derive the seeds for line tracking. Varying vessel widths are handled by initializing the line tracking at multiple scales, and several cross-sectional conditions are defined as termination conditions of the line tracking. The results of all multi-scale line tracking runs are combined to obtain a confidence image map, which is quantized in order to derive the initial vessel network. Because of disconnected vessel lines and remaining noise, a median filter is applied for restoration, and finally morphological reconstruction is performed to remove the remaining artifacts. In (Hou, 2014), another vascular segmentation method is presented, as explained before in the section on morphological processing; it combines morphological processing, to reduce the influence of the optic disc, with a multi-scale line detector that produces the final vascular tree of the retinal image. The performance measures adopted for evaluating the efficiency of multi-scale approaches of retinal vessels are illustrated in Table 5.
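The sketch below illustrates the multi-scale principle with a readily available filter: a vessel-enhancement response is computed at several scales and the strongest response per pixel is kept, so that both wide and narrow vessels respond. The Frangi vesselness filter from scikit-image is used here as an example of such a multi-scale filter; it is not one of the specific methods reviewed above, and the threshold is illustrative only.

```python
# Minimal sketch of multi-scale vessel enhancement.
import numpy as np
from skimage.filters import frangi

rng = np.random.default_rng(0)
green = rng.random((128, 128))                   # stand-in green-channel image

# sigmas spans the expected range of vessel widths; the filter keeps the
# strongest vesselness response over all scales at every pixel.
vesselness = frangi(green, sigmas=range(1, 6), black_ridges=True)
mask = vesselness > vesselness.mean() + 2 * vesselness.std()
print(mask.sum(), "pixels flagged as vessel-like")
```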

Table 5. Performance measures for multiscale approaches

Test Data | Drive Test: Acc | Drive Test: Specificity | Drive Test: Sensitivity | Stare Test: Acc | Stare Test: Specificity | Stare Test: Sensitivity
(Anzalone, Bizzarri, Parodi et al., 2008) | 0.942 | _ | _ | _ | _ | _
(Vlachos & Dermatas, 2010) | 0.929 | 0.955 | 0.747 | _ | _ | _
(Hou, 2014) | 0.942 | 0.969 | 0.735 | 0.934 | 0.965 | 0.735

MODEL BASED APPROACHES

In these approaches, explicit vessel models are applied in order to extract the vasculature. In (Vermeer, Vos, Lemij, & Vossepoel, 2004), a segmentation method is presented that extracts the blood vessels by convolving the image with a Laplacian kernel; a threshold is then calculated to segment the vessels, and finally the broken lines are connected. In (Lam & Yan, 2008), some improvements are made to the previous methodology: the Laplacian operator is used to extract the vasculature, and noisy objects are pruned according to their centerlines. An advantage of this methodology is that it can extract vessels from images with bright abnormalities, but it cannot handle red lesions in retinal images (such as microaneurysms or hemorrhages). The method in (Lam, Gao, & Liew, 2010) proposed perceptive transformation approaches for segmenting the vasculature in retinal images with both bright and red lesions. A model-based method in (Jiang & Mojon, 2003) performs adaptive local thresholding, and vessel information is integrated in the verification process. Although this method has a lower overall accuracy, it is more generalizable than matched-filter methods. Another approach, presented in (Al-Diri, Hunter, & Steel, 2009), uses active contour models but suffers from high computational complexity. The performance measures adopted for evaluating the efficiency of model-based methods for retinal vessels are illustrated in Table 6.
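Before the performance summary in Table 6, the core idea shared by (Vermeer, Vos, Lemij, & Vossepoel, 2004) and (Lam & Yan, 2008), namely convolving with a Laplacian-type kernel and thresholding the response, can be sketched as follows. This is a hedged illustration rather than either paper's actual pipeline; the Gaussian-regularised Laplacian, the data-driven threshold, and the morphological clean-up are simplifying assumptions.

```python
import numpy as np
from scipy import ndimage

def laplacian_vessel_map(green_channel, sigma=1.5, threshold=None):
    """Convolve with a (Gaussian-regularised) Laplacian and threshold the response."""
    img = green_channel.astype(np.float32)
    # Vessels are darker than the background, so a positive Laplacian-of-Gaussian
    # response highlights dark, elongated structures such as vessel cross-sections.
    log = ndimage.gaussian_laplace(img, sigma=sigma)
    if threshold is None:
        threshold = log.mean() + 1.5 * log.std()   # simple data-driven cut-off
    mask = log > threshold
    # Remove isolated noisy responses; broken centrelines would be reconnected
    # in a post-processing step in the original methods.
    mask = ndimage.binary_opening(mask, structure=np.ones((3, 3)))
    return mask
```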

Table 6. Performance measures for model based approaches

Method | DRIVE Acc | DRIVE Specificity | DRIVE Sensitivity | STARE Acc | STARE Specificity | STARE Sensitivity
(Vermeer, Vos, Lemij, & Vossepoel, 2004) | 0.929 | – | – | – | – | –
(Lam & Yan, 2008) | – | – | – | 0.965 | – | –
(Lam, Gao, & Liew, 2010) | 0.947 | – | – | 0.957 | – | –
(Jiang & Mojon, 2003) | 0.891 | 0.900 | 0.830 | 0.901 | 0.900 | 0.857
(Al-Diri, Hunter, & Steel, 2009) | – | 0.955 | 0.728 | – | 0.968 | 0.752

PARALLEL HARDWARE BASED IMPLEMENTATIONS

Parallel hardware-based implementations address the high computational cost of vascular segmentation algorithms as well as real-time performance requirements. Cellular neural networks represent one appealing paradigm for parallel real-time image processing (Manganaro, Arena, & Fortuna, 1999), (Roska & Chua, 1993), and VLSI chips are used to implement them. The Insight Segmentation and Registration Toolkit (ITK) is also used for parallel implementations of vascular segmentation algorithms on high-resolution images (Ibanez, Schroeder, Ng, & Cates, 2003). A pixel-parallel method that ensures fast vascular extraction is presented in (Alonso-Montes, Vilario, Dudek, & Penedo, 2008). This method is an improvement of the original proposal (Alonso-Montes, Vilarino, & Penedo, 2005), which implements and tests morphological operations and local dynamic convolutions, together with logical and arithmetic operations, on a fine-grain single-instruction multiple-data (SIMD) parallel processor array. An ITK-based parallel implementation is presented in (Palomera-perez, Martinez-Peez, Benitez-Perez, & Ortega-Arhona, 2010). It achieves accuracy similar to its serial counterpart with much shorter processing times (8-10 times faster), which makes it possible to handle high-resolution images and large datasets. The image is divided into overlapping sub-images, which are distributed across computers; each computer performs feature extraction and region growing, and finally the segmentation results from the different computers are combined. However, there are no guidelines for tuning its design parameters (the neighborhood size, the scaling factors of the variance and local mean, and the structuring element for morphological operations), so they must be tuned empirically. In addition, nonlinear CNN templates are required to estimate the local variance in the image. The approach in (Costantini, Casali, & Todisco, 2010) overcomes the drawbacks of the previous approach by exploiting the geometrical properties of the blood vessels. Line strength measures are calculated for the blood vessels on the green plane of the color retinal image. The CNN algorithm requires only linear space-invariant 3 × 3 templates, so it can be implemented using one of the existing CNN chips. The performance measures adopted for evaluating the efficiency of parallel hardware-based implementations for retinal vessels are illustrated in Table 7.
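The overlapped sub-image partitioning used in the ITK-based parallel implementation can be illustrated with a short sketch. It is a generic tiling scheme written for clarity, not code from (Palomera-perez, Martinez-Peez, Benitez-Perez, & Ortega-Arhona, 2010); the tile size, overlap, process-pool execution, and the `segment_tile` worker function are all assumptions.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def split_with_overlap(image, tile=512, overlap=32):
    """Yield (row, col, sub_image) tiles that overlap so vessels crossing
    tile borders are still segmented consistently."""
    h, w = image.shape[:2]
    for r in range(0, h, tile):
        for c in range(0, w, tile):
            r0, c0 = max(r - overlap, 0), max(c - overlap, 0)
            r1, c1 = min(r + tile + overlap, h), min(c + tile + overlap, w)
            yield r, c, image[r0:r1, c0:c1]

def segment_in_parallel(image, segment_tile, tile=512, overlap=32):
    """Run a per-tile segmentation function on worker processes and stitch the
    results back into a full-size binary mask (logical OR on the overlaps)."""
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    tiles = list(split_with_overlap(image, tile, overlap))
    with ProcessPoolExecutor() as pool:          # segment_tile must be picklable
        results = list(pool.map(segment_tile, [t[2] for t in tiles]))
    for (r, c, _), sub_mask in zip(tiles, results):
        r0, c0 = max(r - overlap, 0), max(c - overlap, 0)
        mask[r0:r0 + sub_mask.shape[0], c0:c0 + sub_mask.shape[1]] |= sub_mask
    return mask
```

Any per-tile segmentation routine (for example, a matched filter or line detector) can be passed as `segment_tile`; the overlap keeps vessels that cross tile borders from being cut.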

Table 7. Performance measures for parallel hardware implementation based methods

Method | DRIVE Acc | DRIVE Specificity | DRIVE Sensitivity | STARE Acc | STARE Specificity | STARE Sensitivity
(Alonso-Montes, Vilario, Dudek, & Penedo, 2008) | 0.919 | – | – | – | – | –
(Palomera-perez, Martinez-Peez, Benitez-Perez, & Ortega-Arhona, 2010) | 0.925 | 0.967 | 0.64 | 0.926 | 0.945 | 0.769

FUTURE RESEARCH DIRECTIONS

The future direction of vessel segmentation research is to develop faster and more accurate automatic techniques. Segmentation accuracy is a critical and essential point in this research because the work deals with the human body, which demands the utmost care and precision. In order to achieve high accuracy, two important factors should be addressed: the acquisition phase, where images with high resolution and suitable brightness facilitate the subsequent processing phase, and the development of hybrid approaches with optimization techniques that achieve faster and more accurate results. In the end, it must not be forgotten that these methods are applied to human patients, so precision must remain the first concern.

CONCLUSION

Segmentation algorithms are at the heart of medical image applications such as multimodal image registration, radiological diagnostic systems, visualization, the creation of anatomical atlases, and computer-aided diagnosis systems. A large number of different techniques exist in this area; however, there are still aspects that need more research. In the future, the authors aim to develop more accurate automated segmentation techniques. The quick progress in radiological imaging systems leads to an increase in the volume of patient images. Consequently, image processing in radiological diagnostic systems will require faster segmentation algorithms. Developing parallel algorithms is one way of achieving faster segmentation results; Cronemeyer, for example, exploited the nature of parallel hardware to achieve a faster skeleton algorithm. Neural network-based approaches also achieve faster segmentation because of their parallel nature, and multi-scale approaches are considered fast because they can extract major structures from low-resolution images and fine structures from high-resolution images. The authors presented a survey of current vessel segmentation algorithms, covering both older and newer research related to vessel segmentation approaches and techniques. The aim was to introduce the current vessel segmentation methods and to give researchers a baseline and a framework for the existing research.

REFERENCES

Al-Diri, B., Hunter, A., & Steel, D. (2009). An active contour model for segmenting and measuring retinal vessels. IEEE Transactions on Medical Imaging, 28(9), 1488–1497. Alonso-Montes, C., Vilarino, D., & Penedo, M. (2005). CNN-based automatic retinal vascular tree extraction. Proceedings of the 2005 9th International Workshop on Cellular Neural Networks and their Applications (pp. 61-64). Alonso-Montes, C., Vilario, D., Dudek, P., & Penedo, M. (2008). Fast retinal vessel tree extraction: A pixel parallel approach. International Journal of Circuit Theory and Applications, 36(5-6), 641–651. doi:10.1002/cta.512 Amin, M., & Yan, H. (2011). High speed detection of retinal blood vessels in fundus image using phase congruency. Soft Computing, 15(6), 1217–1230. doi:10.1007/s00500-010-0574-2 Anzalone, A., Bizzarri, F., Parodi, M., & Storace, M. (2008). A modular supervised algorithm for vessel segmentation in red-free retinal images. Computers in Biology and Medicine, 38(8), 913–922. doi:10.1016/j.compbiomed.2008.05.006 PMID:18619588


Asad, A., Azar, A., & Hassanien, A. (2012). Integrated Features Based on Gray-Level and Hu MomentInvariants with Ant Colony System for Retinal Blood Vessels Segmentation. International Journal of Systems Biology and Biomedical Technologies, 1(4), 61–74. doi:10.4018/ijsbbt.2012100105 Asad, A., Azar, A., & Hassanien, A. (2014). A New Heuristic Function of Ant Colony System for Retinal Vessel Segmentation. International Journal of Rough Sets and Data Analysis, 1(2), 14–31. doi:10.4018/ ijrsda.2014070102 Bernardes, R., Serranho, P., & Lobo, C. (2011). Digital ocular fundus imaging: A review. Ophthalmologica, 226(4), 161–181. doi:10.1159/000329597 PMID:21952522 Cassin, B., & Solomon, S. (1990a). Dictionary of Eye Terminology (2nd ed.). Gainesville, Florida: Triad Publishing Company. Cassin, B., & Solomon, S. (1990b). Dictionary of Eye Terminology (1st ed.). Gainesville, Florida: Triad Publishing Company. Castaldi, V., Fabiola, M., & River, F. (2010). A fast, efficient and automated method to extract vessels from fundus images. Journal of Visualization, 13(3), 263–270. doi:10.1007/s12650-010-0037-y Cinsdikici, M., & Aydin, D. (2009). Detection of blood vessels in ophthalmoscope images using MF/ant (matched filter/ant colony) algorithm. Computer Methods and Programs in Biomedicine, 96(2), 85–95. doi:10.1016/j.cmpb.2009.04.005 PMID:19419790 Costantini, G., Casali, D., & Todisco, M. (2010). A hardware-implementable system for retinal vessel segmentation. Proceedings of the 14th WSEAS international conference on Computers: part of the 14th WSEAS CSCC multiconference (Vol. 2, pp. 568-573). Deserno, T., & Thomas, M. (2010). Fundamentals of Biomedical Image Processing. In Biological and Medical Physics, Biomedical Engineering (pp. 1-51). Verlag, Berlin, Heidelberg: Springer. doi:10.1007/978-3-642-15816-2_1 Emary, E., Zawbaa, H., Hassanien, A., Schaefer, G., & Azar, A. (2014). Retinal Blood Vessel Segmentation using Bee Colony Optimisation and Pattern Search. Proceedings of the annual IEEE International Joint Conference on Neural Networks (IJCNN), Beijing, China (pp. 1001-1006). Emary, E., Zawbaa, H., Hassanien, A., Tolba, M., & Sansel, V. (2014). Retinal vessel segmentation based on flower pollination search algorithm. Proceedings of the 5th International Conference on Innovations in Bio-Inspired Computing and Applications, Ostrava, Czech Republic (pp. 93-100). Farnell, D., Hatfield, F., Knox, P., Reakes, M., Spencer, S., Parry, D., & Harding, S. P. (2008). Enhancement of blood vessels in digital fundus photographs via the application of multiscale line operators. Journal of the Franklin Institute, 345(8), 748–765. doi:10.1016/j.jfranklin.2008.04.009 Fraz, M., Barman, S., Remagnino, P., Hoppe, A., Basit, A., Uyyanonvara, B., & Owen, C. G. et al. (2012a). An approach to localize the retinal blood vessels using bit planes and centerline detection. Computer Methods and Programs in Biomedicine, 108(2), 600–616. doi:10.1016/j.cmpb.2011.08.009 PMID:21963241


Fraz, M., Remagnino, P., Hoppe, A., & Barman, S. (2013). Retinal image analysis aimed at extraction of vascular structure using linear discriminant classifier. Proceedings of the International Conference on Computer Medical Applications, Sousse, Tunisia. doi:10.1109/ICCMA.2013.6506180 Fraz, M., Remagnino, P., Hoppe, A., Uyyanonvara, B., Rudnicka, A., Owen, C., & Barman, S. A. (2012b). An ensemble classification-based approach applied to retinal blood vessel segmentation. IEEE Transactions on Bio-Medical Engineering, 59(9), 2538–2548. doi:10.1109/TBME.2012.2205687 PMID:22736688 Hassan, G., Elbendary, N., Hassanien, A., Shoeb, A., & Snasel, V. (2015). Retinal Blood Vessel Segmentation Approach Based on Mathematical Morphology. Procedia Computer Science, 62, 612–622. doi:10.1016/j.procs.2015.09.005 Hou, Y. (2014). Automatic Segmentation of Retinal Blood Vessels Based on Improved Multiscale Line Detection. Journal of Computing Science and Engineering, 8(2), 119–128. doi:10.5626/JCSE.2014.8.2.119 Ibanez, L., Schroeder, W., Ng, L., & Cates, J. (2003, August 21). The itk software guide. Kitware, Inc. Jiang, X., & Mojon, D. (2003). Adaptive local thresholding by verification-based multithreshold probing with application to vessel detection in retinal images. Pattern Analysis and Machine Intelligence. IEEE Transactions on, 25(1), 131–137. Kande, G., Subbaiah, P., & Savithri, T. (2010). Unsupervised fuzzy based vessel segmentation in pathological digital fundus images. Journal of Medical Systems, 34(5), 849–858. doi:10.1007/s10916-0099299-0 PMID:20703624 Lam, B., Gao, Y., & Liew, A. (2010). General retinal vessel segmentation using regularization-based multiconcavity modeling. IEEE Transactions on Medical Imaging, 29(7), 1369–1381. doi:10.1109/ TMI.2010.2043259 PMID:20304729 Lam, B., & Yan, H. (2008). A novel vessel segmentation algorithm for pathological retina images based on the divergence of vector fields. IEEE Transactions on Medical Imaging, 27(2), 237–246. Läthén, G. (2010). Segmentation Methods for Medical Image Analysis Blood vessels, multi-scale filtering and level set methods [Thesis]. Center for Medical Image Science and Visualization (CMIV) Linköping University Institute of Technology, Sweden. Lee, Y., Lin, R., Sung, C., Yang, C., Chien, K., Chen, W., & Huang, Y.-C. et al. (2000). Chin-Shan Community Cardiovascular Cohort in Taiwan-baseline data and five-year follow-up morbidity and mortality. Journal of Clinical Epidemiology, 53(8), 838–846. doi:10.1016/S0895-4356(00)00198-0 PMID:10942867 Manganaro, G., Arena, G., & Fortuna, L. (1999). Cellular Neural Networks: Chaos, Complexity and VLSI Processing. Springer-Verlag New York, Inc. doi:10.1007/978-3-642-60044-9 Marin, D., Aquino, A., Gegundez-Arias, M., & Bra, J. (2011). A new supervised method for blood vessel segmentation in retinal images by using gray-level and moment invariants-based features. IEEE Transactions on Medical Imaging, 30(1), 146–158. doi:10.1109/TMI.2010.2064333 PMID:20699207 Miri, M., & Mahloojifar, A. (2011). Retinal image analysis using curvelet transform and multistructure elements morphology by reconstruction. IEEE Transactions on Biomedical Engineering, 58(5), 1183–1192.


Ng, J., Clay, S., Barman, S., & Feilde, A. (2010). Maximum likelihood estimation of vessel parameters from scale space analysis. Image and Vision Computing, 28(1), 55–63. doi:10.1016/j.imavis.2009.04.019 Niemeijer, M., Staal, J., Van Ginneken, B., Loog, M., & Abramoff, M. D. (2004). Comparative study of retinal vessel segmentation methods on a new publicly available database. In Medical Imaging 2004 (pp. 648–656). doi:10.1117/12.535349 Palomera-perez, M., Martinez-Peez, M., Benitez-Perez, H., & Ortega-Arhona, J. (2010). Parallel multiscale feature extraction and region growing: Application in retinal blood vessel detection. IEEE Transactions on Information Technology in Biomedicine, 14(2), 500–506. Perfetti, R., Ricci, E., Casali, D., & Costantin, G. (2007). Cellular neural networks with virtual template expansion for retinal vessel segmentation. IEEE Transactions on Circuits and Wystems. II, Express Briefs, 54(2), 141–145. doi:10.1109/TCSII.2006.886244 Ricci, E., & Perfetti, R. (2007). Retinal blood vessel segmentation using line operators and support vector classification. IEEE Transactions on Medical Imaging, 26(10), 1357–1365. doi:10.1109/TMI.2007.898551 PMID:17948726 Roletschek, R. (2010, 12 7). Retrieved from Image Texture. Wikipedia. Retrieved from http://en.wikipedia. org/wiki/File:2010-12-07-funduskamera-by-RalfR-02.jpg Roska, T., & Chua, L. (1993). The CNN universal machine: an analogic array computer. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 40(3), 163–173. Roychowdhury, S., Koozekanani, D., & Parhi, K. (2014). Blood Vessel Segmentation of Fundus Images by major Vessel Extraction and Sub-Image Classification. IEEE Journal of Biomedical and Health Informatics, 19(3), 2168–2194. Saine, P. (2011). Fundus Imaging. Opthalmic photographers Socity OPS. Retrieved from http://www. opsweb.org/page/fundusimaging Serra, J. (1982). Image analysis and mathematical morphology (1st ed.). Academic press. Soares, J., Leandro, J., Cesar, R., Jelinek, H., & Cree, M. (2006). Retinal vessel segmentation using the 2-D Gabor wavelet and supervised classification. IEEE Transactions on Medical Imaging, 25(9), 1214–1222. doi:10.1109/TMI.2006.879967 PMID:16967806 Sreejini, K., & Govindan, V. (2015). Improved multiscale matched filter for retina vessel segmentation using PSO algorithm. Egyptian Informatics Journal, 16(3), 253–260. doi:10.1016/j.eij.2015.06.004 Staal, J., Abramoff, M., Niemeijer, M., Viergever, M., & Van Ginneken, B. (2004). Ridge-based vessel segmentation in color images of the retina. IEEE Transactions on Medical Imaging, 23(4), 501–509. doi:10.1109/TMI.2004.825627 PMID:15084075 Sun, K., Chen, Z., Jiang, S., & Wang, Y. (2011). Morphological multiscale enhancement, fuzzy filter and watershed for vascular tree extraction in angiogram. Journal of Medical Systems, 35(5), 811–824. doi:10.1007/s10916-010-9466-3 PMID:20703728


Terzopoulos, D. (1984). Multiresolution Computation of Visible-Surface Representation. Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. Massachusetts Institute of Technology. Vermeer, K., Vos, F., Lemij, H., & Vossepoel, A. (2004). A model based method for retinal blood vessel detection. Computers in Biology and Medicine, 34(3), 209–219. doi:10.1016/S0010-4825(03)00055-6 PMID:15047433 Vlachos, M., & Dermatas, E. (2010). Multi-scale retinal vessel segmentation using line tracking. Computerized Medical Imaging and Graphics, 34(3), 213–227. doi:10.1016/j.compmedimag.2009.09.006 PMID:19892522 Yang, Y., Huang, S., & Rao, N. (2008). An automatic hybrid method for retinal blood vessel extraction. International Journal of Applied Mathematics and Computer Science, 18(3), 399–407. doi:10.2478/ v10006-008-0036-5 Zhang, B., Zhang, L., Zhang, L., & Karray, F. (2010). Retinal vessel extraction by matched filter with first-order derivative of Gaussian. Computers in Biology and Medicine, 40(4), 438–445. doi:10.1016/j. compbiomed.2010.02.008 PMID:20202631 Zhang, J., Cui, Y., Jiang, W., & Wang, L. (2015). Blood Vessel Segmentation of Retinal Images Based on Neural Network. In Image and Graphics, LNCS (Vol. 9218, pp. 11–17). Springer International Publishing. doi:10.1007/978-3-319-21963-9_2

KEY TERMS AND DEFINITIONS

Blood Vessel Extraction: An automatic processing step that extracts the vessels from an image in order to investigate the existence of disease.
Blood Vessel: A flexible tubular channel, such as a vein, an artery, or a capillary, through which blood passes to the eye.
Hemorrhages: Escape of blood resulting from a ruptured blood vessel.
Lesions: Pathologic changes in the tissues and individual points of multifocal disease.
Macular Degeneration: An eye disease that destroys the macula and causes blindness because it affects the center of vision.
Magnetic Resonance Imaging: A method used to obtain images of the interior of objects, such as humans and animals, using radio-frequency waves.
Neural Network: A machine learning technology that simulates the nature of the brain to solve pattern recognition problems.


Chapter 10

Cloud Services Publication and Discovery

Yasmine M. Afify, Ain Shams University, Egypt
Ibrahim F. Moawad, Ain Shams University, Egypt
Nagwa L. Badr, Ain Shams University, Egypt
Mohamed F. Tolba, Ain Shams University, Egypt

ABSTRACT

Cloud computing is an information technology delivery model accessed over the Internet. Its adoption rate is dramatically increasing. Diverse cloud service advertisements introduce more challenges to cloud users to locate and identify required service offers. These challenges highlight the need for a consistent cloud service registry to serve as a mediator between cloud providers and users. In this chapter, state-of-the-art research work related to cloud service publication and discovery is surveyed. Based on the survey findings, a set of key limitations is emphasized, and a discussion of challenges and future requirements is presented. In order to contribute to the cloud services publication and discovery area, a semantic-based system for unified Software-as-a-Service (SaaS) service advertisements is proposed. Its back-end foundation is the focus on the business-oriented perspective of the SaaS services and semantics. A service registration template, a guided registration model, and a registration system are introduced. Additionally, a semantic similarity model for services metadata matchmaking is presented.

INTRODUCTION

Cloud computing is exceptionally evolving due to its vast benefits. It provides flexibility, scalability, lower cost, faster time to market, and ease of use to its users (Ali & Soar, 2014). Beneficiaries vary from individual users and small business organizations to huge institutions and governments.

As shown in Figure 1, service delivery in cloud computing comprises three models (Liu et al., 2011). First, Infrastructure-as-a-Service (IaaS), where infrastructure-level resources are provided on demand in the form of Virtual Machines (VM); IaaS providers include Amazon EC2, GoGrid, etc. Second, Platform-as-a-Service (PaaS), where platform-level resources such as operating systems and development environments are provided on demand; examples include Google App Engine, Microsoft Windows Azure, etc. Third, Software-as-a-Service (SaaS), where applications are made available to users on demand for business process operations; example applications include Salesforce.com, Rackspace, Gmail, etc. Deployment models in cloud computing comprise four models: public, private, community, and hybrid cloud service usage.

Figure 1. Cloud service models. Source: (Liu et al., 2011).

The cloud service lifecycle includes service design, publication, discovery, selection, negotiation, usage, and termination. In the service design phase, the syntactic and semantic descriptions of the services are prepared for the service to be published. In the service publication phase, cloud providers may register their services in service repositories in addition to publishing the service information on their portals. In the service discovery phase, users search for and retrieve a list of services that satisfy their requirements. In the service selection phase, users select the most appropriate service for their functional and non-functional requirements. In the service negotiation phase, users negotiate the Service Level Agreement (SLA) and billing with the provider. In the service usage phase, users start using the service, and the service key performance metrics are monitored and optimized. Finally, in the service termination phase, services are withdrawn or deactivated and feedback is kept.

The major cloud players include cloud users, cloud service providers, cloud brokers, cloud auditors, and cloud carriers (Liu et al., 2011). The cloud user is an individual or organization that requires a service satisfying specific functional and non-functional requirements. Service usage is considered optimal for the cloud user when he receives the promised service quality with no extra effort. The cloud service provider is usually an organization that offers its services. The cloud broker works as an intermediary between users and service providers in order to facilitate the matchmaking process. The cloud auditor conducts independent performance and security monitoring of cloud services. The cloud carrier is the organization that has the responsibility of transferring the data, akin to the power distributor for the electric grid.

SaaS is the new business model of the information technology landscape; it is sometimes referred to as on-demand software. SaaS separates the ownership of software from its use, as applications are hosted by cloud providers and accessed by users through a thin client via a web browser or a program interface. SaaS delivers one application to many users regardless of their location. "SaaS runs on top of PaaS that in turn runs on top of IaaS. SaaS has not only its business model but also its unique development processes and computing infrastructure" (Tsai, Bai, & Huang, 2014, p.1).

SaaS services are typically published on the cloud provider's portal, and each provider describes his offered service using his own vocabulary. The major challenge to SaaS service discovery is the lack of a standard description language or naming convention for service advertisements (Noor, Sheng, Alfazi, Ngu, & Law, 2013). In particular, different terms are used by providers to describe the same concepts. Services can be searched using search engines such as Google, Yahoo, MSN, etc. However, these search engines constantly build indexes of words and pages, and when a user submits a request, keywords from the request are compared to words in pages in order to retrieve relevant pages. This mechanism works for general product search. Different from product search, cloud services have special functional and non-functional requirements, which are not supported by the design of traditional search engines. Therefore, the returned pages may not be related to the user request. Furthermore, when traditional search engines return some URLs, the user has to manually go through each link in order to check its suitability for his requirements. Consequently, the search process is time-consuming and error-prone. Moreover, due to the diversity of services, sometimes the user cannot settle on the service that best suits his requirements (Chen, Bai, & Liu, 2011; Crasso, Zunino, & Campo, 2011; Garg, Versteeg, & Buyya, 2013).

Service directories that collect information about existing services facilitate the search process (Kehagias, Giannoutakis, Gravvanis, & Tzovaras, 2012; Sun, Dong, Hussain, Hussain, & Chang, 2014). Recently, service directories have been leveraged to list cloud services (Open Data Center Alliance Inc., 2011; Spillner & Schill, 2013), such as (Cloudbook, 2013; CloudServiceMarket, 2009; Cloud Showplace, 2012; GetApp, 2014; OpenCrowd Project, 2010; ReadySaaSGo, 2014; SaaS Directory, n.d.; SaaS Lounge, n.d.; SAP Service Marketplace, 1972). Most of these directories list the services according to predefined categories with a short description of the service and a link to its website. However, they only provide a browsing capability using the service name and the application domain.

In this chapter, the related background is introduced first, and then a thorough survey of relevant recent research work related to the cloud services advertisement, publication, and discovery areas is offered. Based on this survey, a set of key limitations of existing approaches is highlighted. Moreover, a discussion of the challenges and future requirements is presented. Finally, a proposed service publication and discovery system is presented as a solution for some of the mentioned issues.

BACKGROUND

In this section, concepts related to the cloud services publication and discovery research areas are presented.


Functional Requirements

Functional requirements represent a detailed description of the functionalities provided by the service. They should include information about the data to be input, the operation workflow, and the expected outputs.

Non-Functional Requirements

Non-functional requirements are quality attributes and constraints that represent the performance of a service operation rather than specific behavior; they are the criteria used to judge service quality. Non-functional requirements can be identified in terms of quality metrics that elaborate on the performance characteristics of the service. Some metrics are related to service execution and can be measured, such as response time, throughput, and security. Other metrics are related to the service evolution over time, such as maintainability, extensibility, and scalability.

Service Domain Ontology

In computer science, an ontology is normally defined as a formal specification of a shared conceptualization, where a conceptualization refers to a state of affairs in the real world (Guarino, 1998). A service domain ontology comprises information about common service-level semantics, such as service description, operations, resources, functional and non-functional information, characteristics, etc. In the context of cloud services publication and discovery, a service domain ontology is usually employed to eliminate semantic heterogeneity in order to improve system recall. It is designed by domain experts who identify commonly approved concepts and the relationships among them and construct them as a hierarchy of service concepts, where each concept is an abstraction of its sub-concepts. It is integrated into service discovery systems in order to enrich the service descriptions via semantic annotation, to reason over and match services metadata to the user request, and to formulate ontology-based queries.
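A toy sketch of how such a concept hierarchy can support matching is given below. The concept names, the parent relation, and the matching rule are illustrative assumptions only; a real service domain ontology would be authored in OWL and queried with a reasoner.

```python
# A toy fragment of a service domain ontology: each concept maps to its parent
# (more abstract) concept. Concept names are hypothetical, not from a real ontology.
PARENT = {
    "InvoiceManagement": "Billing",
    "Billing": "ERP",
    "LeadTracking": "CRM",
    "CRM": "BusinessService",
    "ERP": "BusinessService",
}

def ancestors(concept):
    """Return the chain of increasingly abstract concepts above `concept`."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def concept_match(request_concept, service_concept):
    """A service concept satisfies a request if it equals the requested concept
    or is one of its sub-concepts (i.e., the request is an ancestor of it)."""
    return (request_concept == service_concept
            or request_concept in ancestors(service_concept))

# Example: a request for "Billing" is satisfied by an "InvoiceManagement" service.
assert concept_match("Billing", "InvoiceManagement")
```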

Semantic Annotation

Semantic annotation is the process of augmenting natural language text with semantic metadata, usually using an ontology. In the context of cloud services publication and discovery, cloud service descriptions are semantically annotated in order to improve the discovery process in three steps: identify the useful keywords from the service description via pre-processing (tokenization, stop-word removal, and stemming), disambiguate the keywords, and associate semantic metadata with the keywords.
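A minimal sketch of the first and third steps is shown below, assuming Python with NLTK (the stopword and WordNet corpora must be downloaded beforehand with `nltk.download`). Disambiguation is skipped, and WordNet synonyms stand in for richer ontology-based metadata, so this is only an approximation of the annotation pipeline described above.

```python
import re
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer

def annotate(description):
    """Pre-process a service description and attach WordNet synonyms as a simple
    form of semantic metadata (one possible realisation of the three steps)."""
    stemmer, stop = PorterStemmer(), set(stopwords.words("english"))
    tokens = re.findall(r"[a-z]+", description.lower())        # tokenization
    keywords = [t for t in tokens if t not in stop]             # stop-word removal
    annotations = {}
    for word in keywords:
        synonyms = {l.name() for s in wordnet.synsets(word) for l in s.lemmas()}
        annotations[stemmer.stem(word)] = sorted(synonyms - {word})  # stemming + metadata
    return annotations

print(annotate("Cloud service for managing customer invoices and billing"))
```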

Service Registry

A service registry is a recognized service directory that publishes service information. Primary information includes the service name, description, link, and provider. A service registry may also include information about service policies, QoS, price models, etc., and some documents may be included to provide detailed technical and operational information about the service. Users navigate the registry entries or search its contents using browsing capabilities provided by the registry. Significant features supported by some service registries include service evaluation, availability monitoring, and versioning. Passive registries depend on voluntary registration by the service providers. On the other hand, active registries employ a crawler that collects service information available on the Internet.

RELATED WORK

Cloud Service Advertisement and Publication Systems

Relevant recent research work in the areas of cloud service advertisement, publication, and discovery is discussed. The research work is divided according to the richness of the service information addressed. The first group, functional-based systems, addresses functional information of the services only. The second group, quality-aware systems, addresses non-functional information as well.

Functional-Based Service Publication Systems

Functional-based service publication systems (AbuJarour, 2011; Bernstein & Vij, 2010; Org, U. D. D. I., 2000; Steinmetz, Lausen, & Brunner, 2009; Tahamtan, Beheshti, Anjomshoaa, & Tjoa, 2012) address information related to the operation, functionalities, and behavior of the services, neglecting the non-functional requirements. The most commonly used implementation of a service repository is the Universal Description, Discovery and Integration (UDDI) registry (Org, U. D. D. I., 2000). It is a registry for Web Services (WS) publishing and querying. Web services are indexed as Web Service Description Language (WSDL) descriptions that model the interface and implementation of web services. The functional properties of each service can then be queried. It has three main limitations: a) services are organized according to their categories, not to the WS capabilities, b) limited syntactic service discovery, as it cannot match requested functionalities to capabilities provided by the services, and c) lack of support for non-functional properties.

As opposed to passive registries such as UDDI, some work adopts the active registry approach, where a crawler is used to collect web information (AbuJarour, 2011; Steinmetz et al., 2009). Steinmetz et al. (2009) proposed building unique web service objects from multiple resources. The collected information represents WSDL service descriptions and related documents, in addition to web pages that informally describe APIs. They built a focused crawler to collect and identify relevant content and then aggregated the information in order to annotate the services. They collected more than 28,000 web services. In order to address the lack of rich service descriptions, a novel approach and platform, Depot, was proposed in (AbuJarour, 2011). It investigates the benefits of information integration in SOC, where information about web services is gathered by proactively crawling multiple sources, e.g., service providers, consumers, invocations, etc., and integrated in rich universal service descriptions that enable the proactive service registry. The aim was to enrich service descriptions, maximize the benefits of existing descriptions, and integrate all available resources to provide the highest precision for service discovery and selection.

Other functional-based works focus on cloud services, such as (Bernstein & Vij, 2010; Tahamtan et al., 2012). Targeting a federated-cloud environment, intercloud directories and exchanges were proposed in (Bernstein & Vij, 2010). The authors provided a mediator for enabling connectivity and collaboration among cloud providers. The mediation mechanism utilizes a cloud computing resources catalog approach, defined using the Semantic Web Resource Definition Framework (RDF) and an ontology for cloud computing resources. The semantic model guarantees that the requirements of an intercloud-enabled provider are automatically matched to the infrastructure capabilities. Cloud provider infrastructure features and capabilities are captured, grouped, and exposed as standardized configuration units to be used by other providers. However, this work only focuses on infrastructure resource capabilities and features. On the other hand, Tahamtan et al. (2012) focused on business functions. They introduced an integrated ontology for business functions and cloud providers that matches cloud services according to their functional and non-functional requirements. It particularly addresses the demand for flexibility and exchangeability in the cloud computing paradigm and serves as a repository of services. Moreover, users can query the ontology in order to discover providers and services that match their requirements. The provided data on cloud providers is based on market research. The limitations of this work are the strict matching of the user query to business functions, and the query representation language, which significantly restricts the use of the ontology to experienced users only.

Quality-Aware Service Publication Systems

Quality-based service publication systems (Barros, Oberle, Kylau, & Heinzl, 2012; Chen & Li, 2010; Fang, Liu, & Romdhani, 2014; Menychtas et al., 2014; Mindruta & Fortis, 2013; Spillner & Schill, 2013) take into account the non-functional requirements of services as well as their functional requirements. Chen and Li (2010) proposed SRC, a service registry model that extends the keyword-based service registry model. SRC provides behavior-aware and QoS-aware service discovery. It is deployed as a cloud application with ports for service publication, discovery, and feedback collection. SRC stores the semantic descriptors of web services in the services repository and the feedback on the dynamic status of QoS in the feedback database. The Google File System (GFS) is used to store the data in the cloud, and the MapReduce mechanism is used to find function-matched and QoS-matched services.

A semantic registry of cloud services was proposed in (Mindruta & Fortis, 2013). The framework contains core ontological definitions and extension mechanisms used to define ontologies for cloud services. It contains several ontologies intended to model the cloud computing domain. The work focused on defining ontological support for aspects of semantic discovery of cloud services together with their related artifacts. The proposed registry is relevant in the context of service marketing, selection, composition, and interoperability. However, this framework does not include service types for SaaS service models.

A versatile and extensible Everything-as-a-Service (XaaS) registration entry, ConQo, was proposed in (Spillner & Schill, 2013). The authors proposed a three-concerns solution: an extensible description language for services, a registration model, and a system for registration and subsequent service discovery operations. Evaluation proved that ConQo, with its role-oriented service interfaces and efficient query and synchronization mechanisms, avoids the overhead and complexity of fully distributed registries while still achieving suitable scalability. However, no details were given on the request-service matchmaking algorithm.

As opposed to focusing on the technical aspects of cloud services, other research works include business services as well, such as (Barros et al., 2012; Fang et al., 2014; Menychtas et al., 2014). As a main building block for the Internet of Services (IoS), the Unified Service Description Language (USDL) was proposed in (Barros et al., 2012) to capture the operational, technical, and business-related aspects of services in a comprehensive description model. A USDL Marketplace contains a business-oriented registry dedicated to the exchange of USDL artefacts. The marketplace recommends services for business scenarios and allows for business value network creation based on contracts and subcontract relationships.


Fang et al. (2014) proposed a Loosely-Coupled Cloud Service Ontology (LCCSO) model, which takes advantage of flexible concept naming as well as loosely-coupled axiom assertions to ultimately comprise comprehensive specifications of diverse cloud services from distinct levels and categories. It utilizes a wide range of ontology assertion types to comprehensively reveal details regarding service functions, characteristics, and features. The ontology serves as a service registry that stores all services under one class, which enables effective service processing and lookup. Moreover, they proposed a Cloud Service Explorer (CSE) tool that supports user-friendly access to service information. The system consists of three main components, namely, an ontology manager, a service search engine, and a user interface. The request-service matchmaking process is based on semantic matching of request keywords to services metadata from LCCSO. In contrast, the work proposed in this chapter utilizes a hybrid matchmaking process that integrates semantic-based metadata matching and ontology-based hierarchical matching.

Menychtas et al. (2014) presented a comprehensive marketplace environment for cloud services. The provision of cloud services is supported through a dynamic and fair ecosystem of all involved stakeholders. They emphasized the interrelations between technical and business service properties and defined a series of pricing models to adapt to different contexts and user requirements. Providers utilize marketplace functionalities to define new product offerings, business-related information, and service level objectives, while users may pose specific service requests and contract providers. The operation of the implemented approach is evaluated using a real-world scenario.

Other service registries were also introduced (IBM WebSphere Service Registry, n.d.; Membrane SOA Registry, 2008; Oracle Service Registry, n.d.). IBM WebSphere Service Registry and Repository (IBM WebSphere Service Registry, n.d.) provides functions for service-oriented architecture enterprise applications, including service life cycle management and visualization of services and their dependencies, among other features. Membrane SOA Registry (Membrane SOA Registry, 2008) is an open-source web service registry dedicated to services described using WSDL files. It supports versioning, contract monitoring, and availability monitoring, among other features. Finally, Oracle Service Registry (Oracle Service Registry, n.d.) provides a reference for the runtime infrastructure to dynamically discover and bind to deployed services and endpoints.

Cloud Service Discovery Systems

Functional-Based Service Discovery Systems

Functional-based service discovery systems (Di Modica & Tomarchio, 2014; Noor et al., 2013; Parhi, Pattanayak, & Patra, 2015; Rodríguez-García, Valencia-García, García-Sánchez, & Samper-Zapater, 2014a, 2014b; Sim, 2012) match the user request to service operations, neglecting the non-functional requirements of the user. Sim (2012) proposed a cloud-focused semantic search engine for cloud service discovery, called Cloudle. Cloudle consists of: 1) a service discovery agent that consults a cloud ontology to determine the similarities between providers' service specifications and consumers' service requirements, and 2) multiple cloud crawlers for building its database of services. Cloudle supports three types of reasoning: similarity, compatibility, and numerical reasoning. Using the Cloudle web interface, the user runs a query against the cloud services registered in the search engine's database by matching consumers' functional, technical, and budgetary requirements. However, the business perspective of the services has not been considered in the reasoning process.


Noor et al. (2013) developed a Cloud Services Crawler Engine (CSCE). They developed a Cloud Services Ontology (CSO) to provide the crawler engine with meta-information and to describe the data semantics of cloud services. CSCE was used to collect, validate, and categorize cloud services available on the web. The collected cloud services can be continuously updated for effective cloud services discovery. Based on the collected data, they conducted a statistical analysis including the distribution of cloud provider categorization and the relationship between cloud computing and SOC. The most interesting finding is that there is no strong evidence that SOC plays a significant role in enabling cloud computing as a technology, in addition to the lack of standardization in current cloud products and services, which makes cloud services discovery a very difficult task. However, no details were given on the discovery or matchmaking mechanism.

A semantically-enhanced platform that assists in the process of discovering the cloud services that best match user needs was proposed in (Rodríguez-García et al., 2014a, 2014b). The system is composed of three main modules: (i) the semantic annotation module, (ii) the semantic indexing module, and (iii) the semantic search engine. The cloud services semantic repository is generated by annotating the cloud service descriptions with semantic content from the domain ontology and then creating a semantic vector for each service. Service descriptions are directly matched to the user query in the search process, and the semantic similarity value between the query and the service descriptions is calculated. Relevant cloud services are ranked and displayed to the user. However, the semantic-based matching of the user query relies on the service descriptions only, without taking into account any information about the service functionalities.

Di Modica and Tomarchio (2014) proposed a semantic discovery framework that facilitates the operation of a cloud market. A semantic model assists providers and consumers of cloud services in characterizing their specific business requirements according to their knowledge of the cloud domain. The service demands and offers are matched to each other in a way that maximizes both the provider's business objective and the customer's utility. The business aspects of the supply-demand matchmaking and interoperability are addressed. Although many cloud service features have been considered in this work, they were not related to the functionalities supported by the service offer.

A multi-agent framework integrated with an ontology for cloud service description and discovery was proposed in (Parhi et al., 2015). The framework mainly assists in describing the cloud service providers and their attributes in a standardized way by using an ontology and helps the users in discovering suitable services according to their requirements. The framework consists of three types of agents. First, the consumer agent provides a user-friendly graphical user interface that helps the user formulate a query for a cloud service. Second, the discovery agent is responsible for discovering the requested cloud services from the semantic service registry using the information provided by the service consumer. Third, the provider agent helps the service provider register a new cloud service or upgrade an existing service. Nevertheless, no details were given on the matchmaking reasoning rules.

Quality-Aware Service Discovery Systems

Quality-based service discovery systems (Al-Masri & Mahmoud, 2010; Dastjerdi, Tabatabaei, & Buyya, 2010; Dong, Hussain, & Chang, 2011, 2013; Lin, Dou, Xu, & Chen, 2013; Liu, Yao, Qin, & Zhang, 2014; Nagireddi & Mishra, 2013; Sukkar, 2010; Wright et al., 2012; Xu, Gong, & Wang, 2012) consider the non-functional requirements and preferences of users in the service discovery process. In general, the QoS attributes are used to rank the set of discovered services. Web Service Broker (WSB), a universal access point for discovering web services registered across heterogeneous service registries, was introduced in (Al-Masri & Mahmoud, 2010). The authors presented a crawler to collect web service information, a monitoring scheme to monitor web service performance, a querying technique that enables users to tailor their queries according to their needs, and a ranking model based on QoS parameters. The results proved that applying Information Retrieval (IR) techniques combined with structural matching improves the relevancy ranking of web services. However, their approach is specific to WSDL-described services, which is not the case for most SaaS services.

Dastjerdi et al. (2010) proposed ontology-based discovery for QoS-aware deployment of appliances on IaaS providers. Using an ontology solves the problem that providers and users do not use the same notation in describing requirements and services. The proposed architecture facilitates the discovery of appropriate appliances from different providers and their dynamic deployment on IaaS providers. A desired attribute of the proposed architecture is to allow users to present their requirements in terms of high-level and general software and hardware characteristics, which are then mapped to appliances and virtual units. This work focused on a neglected issue, namely the consideration of the cloud computing environment as a service deployment resource provider.

A framework for a semantic service search engine that retrieves services based on domain-specific QoS criteria in the digital ecosystem environment was presented in (Dong et al., 2011). The ultimate goal of this search engine is to allow service users to effectively retrieve and evaluate services published by the service providers. The system consists of four primary components: a service knowledge base, a service reputation database, a service search module, and a service evaluation module. Moreover, this framework provides a QoS-based service evaluation and ranking methodology. The authors then presented a systematic framework for online service advertising information search in (Dong et al., 2013). It comprises an ontology-learning-based focused crawler for service information discovery and classification, a faceted semantic search component for service concept selection, and a user-click-based similarity component for service concept-based ranking adjustment. The framework follows a keyword-based search style and is inspired by the philosophy of user-centric design, which is reflected in two aspects: the involvement of service users in ontology-based service request denotation, and the reference to users' click behavior for service concept recommendation. However, their framework is based on service advertising information only, without taking into account its feature details.

Lin et al. (2013) investigated a QoS-aware service discovery method for elastic cloud computing in an unstructured peer-to-peer network. The deployment consists of two phases: service registration and discovery. Service registration comprises both functional and non-functional information, and a flooding-based method is adopted for service registering. A probabilistic flooding-based QoS-aware service discovery method, combining the Simple Additive Weighting (SAW) technique and skyline filtering, is proposed to perform QoS-aware service discovery.

An ontology-based and SLA-aware service discovery method was proposed in (Liu et al., 2014). It is based on modeling semantically enriched cloud services, ontology reasoning, and logic matchmaking. They designed an IaaS cloud ontology to enhance the service semantic information, which defines the hierarchical relations of cloud concepts. The matching algorithm considers concept equivalence to increase the service matching success rate. To select the best services, candidate services are ranked using the user's non-functional preferences. The ranking method combines the Analytic Hierarchy Process (AHP) with the preference ranking organization method for enrichment evaluations (PROMETHEE) to rank the available services.
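The Simple Additive Weighting technique mentioned above is a standard multi-criteria scoring scheme. A generic sketch is given below; the QoS attributes, weights, and service values are illustrative and are not taken from the surveyed papers.

```python
def saw_rank(services, weights, benefit=("availability", "throughput")):
    """Rank services by Simple Additive Weighting over normalised QoS values.
    Attributes in `benefit` are better when larger; all others (e.g. price,
    response time) are better when smaller."""
    attrs = list(weights)
    lo = {a: min(s[a] for s in services.values()) for a in attrs}
    hi = {a: max(s[a] for s in services.values()) for a in attrs}

    def norm(a, v):
        if hi[a] == lo[a]:
            return 1.0
        span = hi[a] - lo[a]
        return (v - lo[a]) / span if a in benefit else (hi[a] - v) / span

    scores = {name: sum(weights[a] * norm(a, qos[a]) for a in attrs)
              for name, qos in services.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical QoS data for two candidate services.
services = {
    "ServiceA": {"price": 10, "response_time": 0.8, "availability": 0.99},
    "ServiceB": {"price": 15, "response_time": 0.3, "availability": 0.97},
}
print(saw_rank(services, weights={"price": 0.3, "response_time": 0.4, "availability": 0.3}))
```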


Generic cloud services search based on a developed cloud ontology was presented in (Nagireddi & Mishra, 2013). The cloud ontology uses a standard naming convention for all cloud providers. An intercloud registry is used to store the cloud services and their attributes in a structured form. The user query keywords are rewritten as concepts from the cloud ontology, and then the query is arranged based on the services hierarchy. Service ranking is based on the service SLA attributes.

A system architecture for automated SaaS service discovery and selection was presented in (Sukkar, 2010). The system facilitates service search and recommends service options to users based on the functional and non-functional properties of the SaaS services. The authors presented the design and integration of a solution for the service evaluation and recommendation problem based on service quality. In particular, monitoring results of past service invocations are used to generate automated ratings in order to identify cloud providers with bad behavior. However, the service characteristics were not considered in the recommendation process.

Wright et al. (2012) proposed an infrastructure resource discovery engine that operates in a multi-provider cloud environment with different types of user requirements. The application hosting requirements are specified using constraints, and a two-phase constraints-based model is utilized. In the first phase, a set of candidate resources is identified. In the second phase, a heuristic is utilized in order to select the most appropriate resource; the heuristic is chosen based on the application type.

Xu et al. (2012) proposed a Formal Concept Analysis-based (FCA) service discovery approach for cloud services. Semantic matchmaking of user requirements is utilized to filter out irrelevant services. Then, using concept lattices, the services are classified based on their QoS level. Finally, the generated lattices identify appropriate services suitable for the user requirements with less time and effort. This approach is particularly significant when the number of irrelevant domains is large.

RESEARCH ISSUES

Based on the above survey, the following key findings are highlighted. First, there is a lack of standardization in cloud service advertisement. Each provider uses his own vocabulary in describing his offer on his portal; as a result, different terms are used to describe the same concept by different providers. Moreover, different styles and formats are used, which complicates the identification and comparison of service offers. Second, existing cloud service registries support limited browsing capabilities. Search categories include search by service name and a set of predefined domains. Few systems support keyword-based search, and even when supported, it is limited to semantic term-matching between the user request and services metadata, which has a negative effect on discovery precision. Third, few publication systems provide sufficient Quality of Service (QoS) information about the services, which makes it difficult for the user to determine whether a service satisfies his non-functional requirements. Fourth, there is a lack of a robust focused crawler for automatic collection and update of service information. Fifth, the business-oriented capabilities of business services are insufficiently addressed; very few approaches consider the business aspect of cloud services in the discovery process, and even when considered, it is limited to matching the user request to services description metadata.


SOLUTIONS AND RECOMMENDATIONS

Requirements of a Cloud Service Advertisement and Publication System

From the authors' point of view, a consistent cloud service publication and discovery system is urgently required to connect cloud users and service providers. The requirements of an ideal cloud service publication and discovery system are as follows.

First, support a unified semantic-based service specification for cloud services to facilitate lookup search by users.

Second, leverage business-oriented matchmaking of services in addition to semantic-based service metadata matching. That is, instead of restricting matching to user keywords against the functional description of the services, detailed technical operational information should also be utilized to improve the accuracy of the discovery process. This implies that the service functionalities should be comprehensively described in a business-oriented context.

Third, consider non-functional requirements in addition to functional ones due to their significant impact on the user's choice of service. Service characteristics, features, and performance quality play a major role in selecting the suitable service that satisfies the user requirements.

Fourth, contain an active service registry repository, which employs a focused crawler that collects service information from multiple sources such as cloud provider portals, SLAs, and customer feedback. Collected information should be filtered, relevant information should be semantically enriched, and the result properly stored in the service registry.

Fifth, support diverse types of cloud services and consider their relations and dependencies in the service discovery process. Most of the existing literature focuses on a specific type of cloud service, especially IaaS and PaaS. The real success of a cloud service discovery system implies support for user needs with interconnected requirements across different cloud layers.

Sixth, support automated service composition based on user requirements. Service composition allows customized complex services to be offered to users from the same or different cloud providers (Jula, Sundararajan, & Othman, 2014). The composite service may be added as a new service for reuse by other users (Akolkar et al., 2012).

Seventh, provide a user-friendly web-based interface that accepts keyword-based user requests. As a result, users can freely describe their requirements even if they do not have technical knowledge of the cloud domain.

Eighth, provide personalized recommendation of cloud services based on the user's preferences and history. The recommendation process should be based on service operation, characteristics, QoS, and service reputation.

Ninth, support ontology evolution in order to accommodate changes in the domain and user requirements. Both timely adaptation of the ontology and management of user-performed changes should be considered, and the consistency between the modified ontology and the service annotations should be maintained (Rodríguez-García et al., 2014a).

Tenth, provide service performance evaluation based on monitoring. Moreover, the system should accept user ratings or feedback about the cloud services and providers; these ratings can be used as a factor in ranking the discovered services for other users.


Proposed Cloud Service Publication and Discovery System

In order to address some of the above issues, the first, second, third, and seventh requirements are considered in the proposed system. The proposed system exploits the business perspective of SaaS services together with a semantic approach. On the one hand, the business aspect of a SaaS service and its technical aspects are combined to describe the service: the service metadata comprises functional, non-functional, and technical information, so the service matching process is enhanced compared to matching based on the service description only. On the other hand, a SaaS ontology is developed to provide a comprehensive description of the SaaS services domain knowledge and to enable a standard method of service specification and matching. The proposed system contributes to SaaS service advertisement research by, first, eliminating the lack-of-standardization problem and, second, providing proficient search capabilities based on concrete service functionalities. Hybrid matchmaking between the user request and the services is based on semantic service metadata and ontology-based hierarchical matching. A business-oriented advertisement template, a guided registration model, and a registration system for SaaS services are proposed. In our previous work (Afify, Moawad, Badr, & Tolba, 2013, 2014a, 2014b), a system for SaaS offer publication, discovery, and selection was introduced. To extend this work, the contributions in this chapter include a new semantic-based SaaS service registration model and a semantic similarity model for service matchmaking. Using a common meta-model, the proposed system standardizes the advertisement process and serves as a semantic-based registry for the service offerings; subsequently, it provides competent search capabilities for the user. As shown in Figure 2, in addition to the SaaS registry, the system consists of four modules, namely: service preprocessing, WordNet expansion, guided registration, and clustering. The SaaS registry is implemented as an ontology that integrates knowledge on the SaaS business service domain, service characteristics, QoS metrics, and real service offers. It is a semantic repository for the service functional capabilities and non-functional quality guarantees. The business services domain ontology comprises concepts that cover the domains of four SaaS applications: Customer Relationship Management (CRM), Enterprise Resource Planning (ERP), Document Management (DM), and Collaboration. At present, the developed ontology consists of more than 700 concepts represented in the Web Ontology Language (OWL) (Antoniou & Van Harmelen, 2009).

Figure 2. SaaS service advertisement system


More details about the developed ontology can be found in (Afify, Moawad, Badr, & Tolba, 2013, 2014a). A template for registering SaaS service offers is proposed, which describes the service functionality as well as its quality information. The proposed SaaS service advertisement template consists of four sections, namely: general, functional, non-functional, and quality. The general section includes the service name, cloud provider, service description, Uniform Resource Locator (URL), application domain, and price. The functional section includes a description of the service features and supported functionalities. The non-functional section includes information about service characteristics such as payment model, security, license type, standardization, formal agreement, user group, and cloud openness. Finally, the quality section includes the Quality of Service (QoS) values guaranteed by the cloud provider. The service preprocessing module is responsible for preprocessing the service description (Salton & McGill, 1983); preprocessing consists of tokenization, stop-word removal, and stemming. The WordNet expansion module enriches the service description by retrieving synonyms of the description tokens from WordNet (Miller & Fellbaum, 1998). The expanded service description (ESD) is then stored in the SaaS registry. The guided registration module semantically enriches the service with functional metadata using the business services domain ontology: in order to better characterize the new service, the cloud provider is assisted in mapping the service features onto recommended ontology concepts. The feature-to-concept mapping process is significant because it is the basis for efficient business-related cloud service search. The recommended concepts represent service-related business functions. The SaaS service guided registration workflow is demonstrated in the following pseudo code.
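The preprocessing and WordNet expansion steps can also be illustrated with a short Python sketch. This is an illustration only, using the NLTK library rather than the Java WordNet API used in the prototype described later in this chapter; the function names are ours and the NLTK corpora (punkt, stopwords, wordnet) are assumed to be installed.

Computer Code (Python sketch):
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

_stop = set(stopwords.words("english"))
_stemmer = PorterStemmer()

def preprocess(description):
    # Tokenize, lower-case, and drop stop words and non-alphabetic tokens.
    tokens = word_tokenize(description.lower())
    return [t for t in tokens if t.isalpha() and t not in _stop]

def expand_with_wordnet(tokens):
    # Expanded service description (ESD): tokens plus their WordNet synonyms.
    esd = set(tokens)
    for t in tokens:
        for syn in wordnet.synsets(t):
            esd.update(lemma.name().replace("_", " ") for lemma in syn.lemmas())
    return esd

def stem_all(terms):
    # Porter stems, as used before matching against ontology concept stems.
    return {_stemmer.stem(t) for t in terms}

print(stem_all(expand_with_wordnet(preprocess("You can share large files"))))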

Computer Code Algorithm: SaaS Service Registration
Input: Service S, Service Description SD, appDomain, recommendationApproach
Output: Set of Recommended Concepts RC, Expanded Service Description ESD
1. BEGIN
2. Replace any delimiter from set of delimiters D to space
3. FOR each letter l in SD DO
4.     IF l ∈ D THEN
5.         Reset l to space
6.     END IF
7. END FOR
8. Generate set of service description tokens by splitting SD on space
9. Generate set of relevantTokens by removing common words from set stopWords
10. FOR each token t in tokens DO
11.     IF t ∉ stopWords THEN
12.         relevantTokens = relevantTokens ∪ t
13.     END IF
14. END FOR
15. Generate ESD by finding synonyms of relevantTokens set from the WordNet
16. FOR each token rt in relevantTokens DO


17.     ESD = ESD ∪ getSynonyms(rt)
18. END FOR
19. Generate set of stems stems of ESD using Porter Stemmer algorithm
20. CASE ‘recommendationApproach’ OF
21. ‘semanticAnnotation’:
22.     Generate set ontConceptStems by retrieving ontology concept stems
23.     FOR each stem st in stems DO
24.         IF st ∈ ontConceptStems THEN
25.             RC = RC ∪ st
26.         END IF
27.     END FOR
28. ‘applicationDomain’:
29.     Generate set DOC by retrieving relevant concepts to appDomain
30.     FOR each stem st in stems DO
31.         IF st ∈ DOC THEN
32.             RC = RC ∪ st
33.         END IF
34.     END FOR
35. ‘hybrid’:
36.     DO lines 22-27
37.     DO lines 29-34
38. END CASE
39. Display set of recommended concepts RC to the cloud provider
40. Read selected concepts selectedConcepts
41. Return selectedConcepts, ESD
42. END

Three concept recommendation methods are proposed: semantic annotation, application domain, and hybrid. In recommendation via semantic annotation, semantic annotations of the service description are retrieved. In recommendation via application domain, concepts related to the application domain specified in the service information are retrieved. In the hybrid method, the results of the two methods are combined. Finally, the set of selected concepts is stored in the SaaS registry. The service clustering module groups the service offers into functionally similar clusters using the Agglomerative Hierarchical Clustering (AHC) approach (Salton & McGill, 1983). Our previous hybrid service matchmaking algorithm, which applies both semantic-based metadata matching and ontology matching, is used to measure the similarity between two services (Afify et al., 2013, 2014a). After clustering, cluster signature vectors are created and kept in the SaaS registry. To search for a service, the user enters his business function requirements and the system returns the matching SaaS services. Comprehensive details of the search process can be found in our previous work (Afify et al., 2013, 2014a).
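As an illustration only, the following Python sketch groups offers with agglomerative hierarchical clustering using SciPy. The hybrid matchmaking score is represented by a placeholder similarity function, and the cut-off threshold is an assumption rather than a value from the proposed system.

Computer Code (Python sketch):
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_services(services, similarity, threshold=0.5):
    # Build a symmetric distance matrix from pairwise similarities in [0, 1].
    n = len(services)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = 1.0 - similarity(services[i], services[j])
    # Average-linkage AHC over the condensed distance matrix.
    Z = linkage(squareform(dist), method="average")
    # Cut the dendrogram so that services closer than `threshold` share a cluster.
    return fcluster(Z, t=1.0 - threshold, criterion="distance")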


Services Matchmaking Semantic Similarity Model

In our previous work (Afify et al., 2013, 2014a), the Vector Space Model (VSM) was used for the semantic-based metadata matching, i.e., to compute the similarity between two SaaS service descriptions. In this chapter, an adaptation of the Extended Case Based Reasoning (ECBR) algorithm of (Dong et al., 2011), called SerECBR, is introduced. The adapted SerECBR works as follows. First, the synonyms of the two service descriptions SD1 and SD2 are retrieved; SS1 and SS2 denote the synonym sets of the first and second service descriptions, respectively. Each service description and its synonym terms are grouped into one list: T1 and T2 represent the first and second service term lists, ∆ and Ω denote a term occurring in T1 and T2, respectively, and LT1 is the number of terms in the first service term list. For each service term list, weights are associated with the terms as follows:

w∆ = 1 if ∆ ∈ SD1, 0.5 if ∆ ∈ SS1;    wΩ = 1 if Ω ∈ SD2, 0.5 if Ω ∈ SS2   (1)

To compute the semantic similarity between two SaaS services S1 and S2, the lists T1 and T2 are compared using the following cases. Case 1: a term exists in both lists with weight 1; a value of 1 is given to this term. Case 2: a term exists in both lists with weight 0.5; a value of 0.25 is given to the term. Case 3: a term exists in both lists with different weights; a value of 0.5 is given to the term. Case 4: a term from one list does not exist in the other list; a value of 0 is given to the term. Finally, the sum of all term values is normalized by the length of the first service term list. The semantic similarity between two services S1 and S2 is calculated using (2), where the match function is defined in (3):

simSerECBR(S1, S2) = ( Σ∆∈T1, Ω∈T2 match(∆, Ω) ) / LT1   (2)

match(∆, Ω) =
    1     if (∆ = Ω) ∧ (w∆ = wΩ = 1)
    0.5   if (∆ = Ω) ∧ (w∆ ≠ wΩ)
    0.25  if (∆ = Ω) ∧ (w∆ = wΩ = 0.5)
    0     if (∆ ∈ T1 ∧ ∆ ∉ T2) ∨ (Ω ∈ T2 ∧ Ω ∉ T1)   (3)
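A minimal Python sketch of the SerECBR computation in equations (1)-(3) is given below. It assumes the description term sets and synonym sets have already been produced by the preprocessing and WordNet expansion modules; the variable names and the toy example terms are ours.

Computer Code (Python sketch):
def ser_ecbr(sd1, ss1, sd2, ss2):
    # Term weights for each list: 1 for description terms, 0.5 for synonyms (eq. 1).
    w1 = {t: 1.0 for t in sd1}
    w1.update({t: 0.5 for t in ss1 if t not in sd1})
    w2 = {t: 1.0 for t in sd2}
    w2.update({t: 0.5 for t in ss2 if t not in sd2})

    def match(term):
        # Cases of equation (3); terms missing from the other list contribute 0.
        if term not in w2:
            return 0.0
        if w1[term] != w2[term]:
            return 0.5
        return 1.0 if w1[term] == 1.0 else 0.25

    # Equation (2): sum of matches normalized by the length of the first term list.
    return sum(match(t) for t in w1) / len(w1) if w1 else 0.0

print(ser_ecbr({"share", "file"}, {"portion", "document"},
               {"file", "store"}, {"document", "keep"}))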


Implementation and Experimental Evaluation

Real and synthetic cloud data were used to demonstrate the effectiveness of the proposed system. Experiments were conducted on an Intel Core i3 2.13 GHz processor with 5.0 GB of RAM. The system was built using Java, the Jena API, and the WordNet API within the Eclipse IDE. The data set consists of 500 SaaS services, of which 40 are real and the remainder are pseudo services. The following subsections present a case study of the guided registration process, the experimental evaluation of the semantic annotation process, and the evaluation of the matchmaking semantic similarity model.

Guided Registration Process Case Study

In order to populate the SaaS registry, cloud providers are expected to register their services in our system. A speedy and proficient registration process helps the proposed system gain wide acceptance; in particular, thorough and accurate matching of the service features to domain ontology concepts is vital while also considering the time factor. The aim of this case study is to demonstrate the guided registration process. For example, a service provider registers the cloud service Box.net and uses concept recommendation to match domain ontology concepts to the new service features. The cloud provider enters the service advertisement details for the cloud service and chooses concept recommendation via the semantic annotation method. Part of the service description is “You can share large files.” The processed service description tokens share and file are matched to the domain ontology concepts, which results in the set of recommended business functions document_sharing, file_lock, file_sharing, file_storage, file_synchronization, file_transfer, file_types, and sharing. The recommended business functions related to the service description keywords are displayed, and the cloud provider selects the business functions that accurately describe the service features and registers the service. The time taken by the registration process is significantly reduced compared to our previous work (Afify et al., 2013, 2014a), where the cloud provider had to navigate through all domain ontology concepts to properly characterize the service features.
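Concept recommendation via semantic annotation can be sketched as a simple stem lookup between the expanded service description and the ontology concept labels. The sketch below is illustrative only: the function name, the way concept labels are split into word stems, and the example expanded-description terms are our assumptions; the concept names come from the case study above.

Computer Code (Python sketch):
from nltk.stem import PorterStemmer

def recommend_concepts(esd_terms, ontology_concepts):
    # esd_terms: expanded service description terms (tokens plus WordNet synonyms).
    ps = PorterStemmer()
    esd_stems = {ps.stem(t) for t in esd_terms}
    recommended = set()
    for concept in ontology_concepts:
        # Concept labels such as "file_sharing" are split into their word stems.
        concept_stems = {ps.stem(w) for w in concept.replace("_", " ").split()}
        if concept_stems & esd_stems:
            recommended.add(concept)
    return recommended

concepts = {"document_sharing", "file_lock", "file_sharing", "file_storage",
            "file_synchronization", "file_transfer", "file_types", "sharing"}
print(recommend_concepts({"share", "file", "portion", "document"}, concepts))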

Semantic Annotation Process Evaluation

The objective of this experiment is twofold: to measure the time taken by the guided registration module to semantically annotate the service description, and to study the effect of the WordNet expansion of the service description on the annotation process. For the first objective, the processing time taken to semantically annotate the expanded service descriptions was computed; as shown in Figure 3, the semantic annotation time is negligible. For the second objective, the semantic annotation process is analyzed in two cases, with and without WordNet, and the number of matching Business Functions (BF) returned in each case is compared. As shown in Figure 4, the semantic expansion of the service description generally increases the number of retrieved business functions. Nevertheless, the increase is neither uniform nor proportional to the number of expanded description terms; it depends significantly on the terms used by the cloud provider in the service description and on their closeness to the concepts in the services domain ontology.


Figure 3. Semantic annotation time of service descriptions

Figure 4. Service description semantic annotation process

Matchmaking Semantic Similarity Model Evaluation

The objective of this experiment is to compare the semantic similarity values produced by service description matchmaking under the VSM and the proposed serECBR similarity models. A sample of the results is shown in Table 1. The results indicate that the serECBR model reflects the similarity among the service descriptions better than the VSM model, with an average increase of 49%.
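For reference, the VSM baseline can be reproduced in a few lines. The sketch below uses a generic TF-IDF cosine similarity from scikit-learn, which only approximates the VSM configuration used in our previous work; the function name and example descriptions are assumptions.

Computer Code (Python sketch):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def vsm_similarity(desc_a, desc_b):
    # Cosine similarity between the TF-IDF vectors of two service descriptions.
    tfidf = TfidfVectorizer().fit_transform([desc_a, desc_b])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

print(vsm_similarity("share and store large files online",
                     "online file sharing and storage service"))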


Table 1. VSM vs. serECBR service matchmaking semantic similarities

Service                 Service        VSMSim    serECBR Sim
Oracle CRM On Demand    Intouchcrm     0.54      0.67
Box.net                 Egnyte         0.65      0.37
Blue Link Elite         NetSuite       0.16      0.29
IBM Lotus Live          CubeTree       0.31      0.66
HyperOffice             GetDropBox     0.21      0.36
OrderHarmony            Plex Online    0.01      0.03
Incipi Workspace        HyperOffice    0.19      0.57

FUTURE RESEARCH DIRECTIONS

Two of the most important enablers of the growing adoption of cloud services by users and organizations are discussed below.

Cloud Service Knowledge Base

The core module of an ideal cloud service publication and discovery system is a comprehensive cloud service knowledge base. Such a knowledge base is a key enabler for automating the service life cycle. In particular, it should consist of the following components: a cloud ontology, a services domain ontology, a services repository, an SLA ontology, and a QoS ontology. The cloud ontology should include information about the cloud computing service models, deployment models, resources, languages, APIs, and providers. The services domain ontology comprises a broad conceptualization of the cloud services domain; it should cover both the technical and the business-oriented aspects of all cloud service models (IaaS, PaaS, and SaaS). The services repository comprises the details of the service offers along with sufficient semantic metadata about them; a service offer includes the service name, provider, address, description, functional and non-functional requirements, characteristics, QoS, SLA, and semantic tags. The QoS ontology enables semantic interoperability of QoS: it contains detailed information about the QoS concepts used in service offers by different cloud providers and in service demands by cloud users, it should account for QoS levels and the relationships among QoS properties, and QoS properties can be grouped and prioritized according to different application domains (Tran, Tsuji, & Masuda, 2009). The SLA ontology contains concepts that relate to all elements of an SLA. The SLA represents the contractual agreement in which the provider lists QoS guarantees on the functional and non-functional aspects of the provided service (Modica, Petralia, & Tomarchio, 2013); some of the SLA aspects are QoS, cost, legal issues, and support details, and it also describes the remedy, such as a reduced fee, if these guarantees are not met by the provider. The most complete ontology proposed so far is the mOSAIC ontology (Moscato, Aversa, Di Martino, Fortis, & Munteanu, 2011).
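To make the idea of such a knowledge base concrete, the sketch below records a few triples about one service offer using Python's rdflib. The namespace, property names, and values are illustrative assumptions only; the ontologies surveyed in this chapter are typically expressed in OWL and accessed through Java and Jena rather than rdflib.

Computer Code (Python sketch):
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS, XSD

CS = Namespace("http://example.org/cloud#")        # hypothetical namespace
g = Graph()
g.bind("cs", CS)

offer = URIRef("http://example.org/offers/box-net")
g.add((offer, RDF.type, CS.SaaSOffer))
g.add((offer, RDFS.label, Literal("Box.net")))
g.add((offer, CS.businessFunction, CS.file_sharing))
g.add((offer, CS.availability, Literal("99.9", datatype=XSD.decimal)))
g.add((offer, CS.slaRemedy, Literal("reduced fee")))

print(g.serialize(format="turtle"))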


Cloud Service Marketplace

The need has arisen for a single focal point where cloud providers can publish their services and users can find them efficiently. “Today we are still far from an open and competitive cloud and service market where cloud resources are traded as in conventional markets” (Di Modica & Tomarchio, 2014, p. 1). A successful marketplace should address both users' and providers' needs. It should be resourceful, that is, it should contain detailed, up-to-date information about a vast number of services of different types. For cloud providers, the service publishing process should be easy, fast, organized, and efficient; the provider should be able to supply all the kinds of information required to properly describe a service in order to maximize its reachability. Business-oriented information such as service characteristics, non-functional requirements, QoS, SLA, price models, billing, and contracting information should also be supported. For cloud users, the marketplace should be inclusive, up to date, easy to use, and personalized. Furthermore, it should provide intelligent support for associating users' incomplete and short requests with appropriate service offers, even if the users have no technical background in the cloud domain terminology. This implies that the marketplace holds comprehensive semantic-based information about the services to improve the request-service matchmaking process (Di Modica & Tomarchio, 2014). Moreover, the marketplace should provide valuable features such as handling price models and billing, providing customized services tailored to the user requirements when no off-the-shelf service is available, and learning from its experience of dealing with users and providers (Akolkar et al., 2012).

CONCLUSION

Cloud services are flourishing, and the huge number of available cloud services brings new challenges to the service identification and discovery processes. A comprehensive survey of the existing literature on cloud service advertisement, publication, and discovery was conducted. Significant open issues were highlighted based on this survey, including the lack of standardization of cloud service advertisement, the limited browsing capabilities of cloud service registries, the lack of a robust focused crawler for the automatic collection and update of cloud service information, and the deficient treatment of the business-oriented capabilities of business services. Moreover, a set of requirements for a consistent cloud service publication and discovery system was specified: it should support a unified semantic-based service specification for cloud services, leverage business-oriented matchmaking of services, contain an active service registry, consider both functional and non-functional requirements in the service selection process, and provide personalized recommendation and performance evaluation. In this chapter, a SaaS service advertisement system was presented that addresses some of these key requirements. It exploits the business perspective of services by introducing functionality-related metadata into the service descriptions; the semantic metadata enrichment is accomplished during the registration process via domain ontology concept recommendation. In order to effectively group functionally similar services, a semantic similarity model was proposed for matchmaking the service metadata. The uniformity, integrity, and comprehensive representation of SaaS offers maximize the efficiency of service discovery and close the gap between service supply and demand in the cloud market. The effectiveness of the proposed system was demonstrated by the experimental results.


REFERENCES AbuJarour, M. (2011). A Proactive Service Registry With Enriched Service Descriptions. Proceedings of the 5th Ph. D. Retreat of the HPI Research School on Service-oriented Systems Engineering (Vol. 5, p. 191). Afify, Y. M., Moawad, I. F., Badr, N. L., & Tolba, M. F. (2013, November). A semantic-based Softwareas-a-Service (SaaS) discovery and selection system. Proceedings of the 2013 8th International Conference on Computer Engineering & Systems (ICCES) (pp. 57-63). IEEE. Afify, Y. M., Moawad, I. F., Badr, N. L., & Tolba, M. F. (2014a). Cloud Services Discovery and Selection: Survey and New Semantic-Based System. Proceedings of the Bio-inspiring Cyber Security and Cloud Services: Trends and Innovations (pp. 449-477). Springer. Afify, Y. M., Moawad, I. F., Badr, N. L., & Tolba, M. F. (2014b). Concept recommendation system for cloud services advertisement. Proceedings of the International Conference of Advanced Machine Learning Technologies and Applications (pp. 57-66). Springer International Publishing. doi:10.1007/978-3319-13461-1_7 Akolkar, R., Chefalas, T., Laredo, J., Peng, C. S., Sailer, A., Schaffa, F.,... Tao, T. (2012, June). The future of service marketplaces in the cloud. Proceedings of the 2012 IEEE Eighth World Congress on Services (SERVICES) (pp. 262-269). IEEE. doi:10.1109/SERVICES.2012.59 Al‐Masri, E., & Mahmoud, Q. H. (2010). WSB: A broker‐centric framework for quality‐driven web service discovery. Software, Practice & Experience, 40(10), 917–941. doi:10.1002/spe.989 Ali, O., & Soar, J. (2014, May). Challenges and Issues Within Cloud Computing Technology. Proceedings of the Fifth International Conference on Cloud Computing, GRIDs, and Virtualization CLOUD COMPUTING 2014 (pp. 55-63). Antoniou, G., & Van Harmelen, F. (2009). Web ontology language: OWL. In Handbook on ontologies (pp. 91–110). Springer. doi:10.1007/978-3-540-92673-3_4 Barros, A., Oberle, D., Kylau, U., & Heinzl, S. (2012). Design Overview of USDL. In Handbook of Service Description (pp. 187-225). Springer US. doi:10.1007/978-1-4614-1864-1_8 Bernstein, D., & Vij, D. (2010, July). Using Semantic Web Ontology for Intercloud Directories and Exchanges. Proceedings of the International Conference on Internet Computing (pp. 18-24). Chen, F., Bai, X., & Liu, B. (2011). Efficient service discovery for cloud computing environments. In Advanced Research on Computer Science and Information Engineering (pp. 443–448). Springer. doi:10.1007/978-3-642-21411-0_72 Chen, H. P., & Li, S. C. (2010, December). SRC: a service registry on cloud providing behavior-aware and QoS-aware service discovery. Proceedings of the 2010 IEEE International Conference on ServiceOriented Computing and Applications (SOCA) (pp. 1-4). IEEE. doi:10.1109/SOCA.2010.5707179 Cloudbook. (2013). The Cloud Computing & SaaS Information Resource. Retrieved from http://www. cloudbook.net/directories/product


CloudServiceMarket. (2009). Retrieved from http://www.cloudservicemarket.info/ Crasso, M., Zunino, A., & Campo, M. (2011). A survey of approaches to Web Service discovery in Service-Oriented Architectures. Journal of Database Management, 22(1), 102–132. doi:10.4018/ jdm.2011010105 Dastjerdi, A. V., Tabatabaei, S. G. H., & Buyya, R. (2010, May). An effective architecture for automated appliance management system applying ontology-based cloud discovery. Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid) (pp. 104-112). IEEE. doi:10.1109/CCGRID.2010.87 Di Modica, G., & Tomarchio, O. (2014). Matching the business perspectives of providers and customers in future cloud markets. Cluster Computing. Dong, H., Hussain, F. K., & Chang, E. (2011). A service search engine for the industrial digital ecosystems. IEEE Transactions on Industrial Electronics, 58(6), 2183–2196. Dong, H., Hussain, F. K., & Chang, E. (2013). UCOSAIS: A framework for user-centered online service advertising information search. Proceedings of the Web Information Systems Engineering WISE 2013 (pp. 267-276). Springer Berlin Heidelberg. doi:10.1007/978-3-642-41230-1_23 Fang, D., Liu, X., & Romdhani, I. (2014, May). A Loosely-coupled Semantic Model for Diverse and Comprehensive Cloud Service Search and Retrieval. Proceedings of the Fifth International Conference on Cloud Computing, GRIDs, and Virtualization CLOUD COMPUTING 2014 (pp. 6-11). Garg, S. K., Versteeg, S., & Buyya, R. (2013). A framework for ranking of cloud computing services. Future Generation Computer Systems, 29(4), 1012–1023. doi:10.1016/j.future.2012.06.006 GetApp. (2014). Retrieved February 14, 2014, from http://www.getapp.com/ N. Guarino (Ed.). (1998, June 6-8). Formal ontology in information systems: Proceedings of the first international conference (FOIS’98), Trento, Italy (Vol. 46). IOS press. IBM WebSphere Service Registry. (n. d.). Retrieved from http://www-01.ibm.com/software/integration/ wsrr/ Jula, A., Sundararajan, E., & Othman, Z. (2014). Cloud computing service composition: A systematic literature review. Expert Systems with Applications, 41(8), 3809–3824. doi:10.1016/j.eswa.2013.12.017 Kehagias, D. D., Giannoutakis, K. M., Gravvanis, G. A., & Tzovaras, D. (2012). An ontology‐based mechanism for automatic categorization of web services. Concurrency and Computation, 24(3), 214–236. doi:10.1002/cpe.1818 Lin, W., Dou, W., Xu, Z., & Chen, J. (2013). A QoS‐aware service discovery method for elastic cloud computing in an unstructured peer‐to‐peer network. Concurrency and Computation, 25(13), 1843–1860. doi:10.1002/cpe.2993 Liu, F., Tong, J., Mao, J., Bohn, R., Messina, J., Badger, L., & Leaf, D. (2011). NIST cloud computing reference architecture. NIST special publication.


Liu, L., Yao, X., Qin, L., & Zhang, M. (2014, July). Ontology-based service matching in cloud computing. Proceedings of the 2014 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) (pp. 2544-2550). IEEE. doi:10.1109/FUZZ-IEEE.2014.6891698 Membrane S.O.A. Registry. (2008). Retrieved from http://www.membrane-soa.org/soa-registry/ Menychtas, A., Vogel, J., Giessmann, A., Gatzioura, A., Garcia Gomez, S., Moulos, V., & Varvarigou, T. et al. (2014). 4CaaSt marketplace: An advanced business environment for trading cloud services. Future Generation Computer Systems, 41, 104–120. doi:10.1016/j.future.2014.02.020 Miller, G., & Fellbaum, C. (1998). Wordnet: An electronic lexical database. Mindruta, C., & Fortis, T. F. (2013, March). A Semantic Registry for Cloud Services. Proceedings of the 2013 27th International Conference on Advanced Information Networking and Applications Workshops (WAINA) (pp. 1247-1252). IEEE. doi:10.1109/WAINA.2013.100 Modica, G. D., Petralia, G., & Tomarchio, O. (2013, March). An SLA ontology to support service discovery in future cloud markets. Proceedings of the 2013 27th International Conference on Advanced Information Networking and Applications Workshops (WAINA) (pp. 1161-1166). IEEE. doi:10.1109/ WAINA.2013.68 Moscato, F., Aversa, R., Di Martino, B., Fortis, T., & Munteanu, V. (2011, September). An analysis of mOSAIC ontology for Cloud resources annotation. Proceedings of the 2011 Federated Conference on Computer Science and Information Systems (FedCSIS) (pp. 973-980). IEEE. Nagireddi, V. S. K., & Mishra, S. (2013, April). An ontology based cloud service generic search engine. Proceedings of the 2013 8th International Conference on Computer Science & Education (ICCSE) (pp. 335-340). IEEE. doi:10.1109/ICCSE.2013.6553934 Noor, T. H., Sheng, Q. Z., Alfazi, A., Ngu, A. H., & Law, J. (2013, June). CSCE: A Crawler Engine for Cloud Services Discovery on the World Wide Web. Proceedings of the 2013 IEEE 20th International Conference on Web Services (ICWS) (pp. 443-450). IEEE. doi:10.1109/ICWS.2013.66 Open Data Center Alliance Inc. (2011). SERVICE CATALOG Retrieved from http://www.opendatacenteralliance.org/accelerating-adoption/usage-models OpenCrowd Project. (2010). Retrieved from http://cloudtaxonomy.opencrowd.com/taxonomy/ Oracle Service Registry. (n. d.). Retrieved from http://www.oracle.com/technetwork/middleware/registry/ overview/index.html Parhi, M., Pattanayak, B. K., & Patra, M. R. (2015). A Multi-agent-Based Framework for Cloud Service Description and Discovery Using Ontology. In Intelligent Computing, Communication and Devices (pp. 337-348). Springer India. doi:10.1007/978-81-322-2012-1_35 ReadySaaSGo. (n. d.). Retrieved from http://www.readysaasgo.com/ UDDI.org. (2000). UDDI technical white paper. Retrieved from http://www.uddi.org/pubs/Iru_UDDI_ Technical_White_Paper.pdf/20000906.html


Rodríguez-García, M. Á., Valencia-García, R., García-Sánchez, F., & Samper-Zapater, J. J. (2014a). Ontology-based annotation and retrieval of services in the cloud. Knowledge-Based Systems, 56, 15–25. doi:10.1016/j.knosys.2013.10.006 Rodríguez-García, M. Á., Valencia-García, R., García-Sánchez, F., & Samper-Zapater, J. J. (2014b). Creating a semantically-enhanced cloud services environment through ontology evolution. Future Generation Computer Systems, 32, 295–306. doi:10.1016/j.future.2013.08.003 Saa, S. Directory (n. d.). Retrieved from http://www.saasdirectory.com/ Saa, S. Lounge (n. d.). Retrieved from http://www.saaslounge.com/saas-directory/ Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. SAP Service Marketplace. (1972). Retrieved from http://service.sap.com/ Showplace, C. (2012). Retrieved from http://www.cloudshowplace.com/application/ Sim, K. M. (2012). Agent-based cloud computing. IEEE Transactions on Services Computing, 5(4), 564–577. Spillner, J., & Schill, A. (2013, May). A Versatile and Scalable Everything-as-a-Service Registry and Discovery (pp. 175–183). CLOSER. Steinmetz, N., Lausen, H., & Brunner, M. (2009). Web service search on large scale. In Service-Oriented Computing (pp. 437–444). Springer. doi:10.1007/978-3-642-10383-4_32 Sukkar, M. (2010). Design and Implementation of a Service Discovery and Recommendation Architecture [Doctoral dissertation]. University of Waterloo. Sun, L., Dong, H., Hussain, F. K., Hussain, O. K., & Chang, E. (2014). Cloud service selection: Stateof-the-art and future research directions. Journal of Network and Computer Applications, 45, 134–150. doi:10.1016/j.jnca.2014.07.019 Tahamtan, A., Beheshti, S. A., Anjomshoaa, A., & Tjoa, A. M. (2012, June). A cloud repository and discovery framework based on a unified business and cloud service ontology. Proceedings of the 2012 IEEE Eighth World Congress on Services (SERVICES) (pp. 203-210). IEEE. doi:10.1109/SERVICES.2012.42 Tran, V. X., Tsuji, H., & Masuda, R. (2009). A new QoS ontology and its QoS-based ranking algorithm for Web services. Simulation Modelling Practice and Theory, 17(8), 1378–1398. doi:10.1016/j. simpat.2009.06.010 Tsai, W., Bai, X., & Huang, Y. (2014). Software-as-a-service (SaaS): Perspectives and challenges. Science China Information Sciences, 57(5), 1–15. doi:10.1007/s11432-013-5050-z Wright, P., Sun, Y. L., Harmer, T., Keenan, A., Stewart, A., & Perrott, R. (2012). A constraints-based resource discovery model for multi-provider cloud environments. Journal of Cloud Computing, 1(1), 1–14. Xu, J., Gong, W., & Wang, Y. (2012, October). A cloud service discovery approach based on FCA. Proceedings of the 2012 IEEE 2nd International Conference on Cloud Computing and Intelligent Systems (CCIS) (Vol. 3, pp. 1357-1361). IEEE. doi:10.1109/CCIS.2012.6664607


ADDITIONAL READING Crasso, M., Zunino, A., & Campo, M. (2011). Combining query-by-example and query expansion for simplifying web service discovery. Information Systems Frontiers, 13(3), 407–428. doi:10.1007/s10796009-9221-9 Da Silva, E. G., Pires, L. F., & Van Sinderen, M. (2011). Towards runtime discovery, selection and composition of semantic services. Computer Communications, 34(2), 159–168. doi:10.1016/j.comcom.2010.04.003 Hao, Y., Zhang, Y., & Cao, J. (2010). Web services discovery and rank: An information retrieval approach. Future Generation Computer Systems, 26(8), 1053–1062. doi:10.1016/j.future.2010.04.012 Loutas, N., Peristeras, V., Zeginis, D., & Tarabanis, K. (2012). The Semantic Service Search Engine (S3E). Journal of Intelligent Information Systems, 38(3), 645–668. doi:10.1007/s10844-011-0171-6 Nagireddi, V.S.K., & Mishra, S. (2013). A Generic Search Based Cloud Service Discovery Mechanism. Proceedings of the Science and Information Conference, London, UK, pp. 915-923. Paliwal, A. V., Shafiq, B., Vaidya, J., Xiong, H., & Adam, N. (2012). Semantics-based automated service discovery. IEEE Transactions on Services Computing, 5(2), 260–275. Rodriguez, J. M., Crasso, M., Mateos, C., & Zunino, A. (2013). Best practices for describing, consuming, and discovering web services: A comprehensive toolset. Software, Practice & Experience, 43(6), 613–639. doi:10.1002/spe.2123 Sabou, M., & Pan, J. (2007). Towards Semantically Enhanced Web Service Repositories. Journal of Web Semantics, 5(2), 142–150. doi:10.1016/j.websem.2006.11.004 Sadiku, M., Musa, S., & Momoh, O. (2014). Cloud computing: Opportunities and challenges. Potentials, IEEE, 33(1), 34–36. doi:10.1109/MPOT.2013.2279684 Tang, M., Zheng, Z., Chen, L., Liu, J., Cao, B., & You, Z. (2014). A Trust-Aware Search Engine for Complex Service Computing. International Journal of Web Services Research, 11(1), 57–75. doi:10.4018/ ijwsr.2014010103

KEY TERMS AND DEFINITIONS Cloud Computing: An information technology model in which computing resources and applications are provided on demand over the Internet. Cloud Provider: An organization/company that offers remote access to any of the cloud computing resources, such as: infrastructure resources, operating systems, development platforms, applications, etc. Domain Ontology: A knowledge framework of common concepts in a specific domain created by the domain experts. Non-Functional Requirements: Set of parameters that focus on the quality of the system performance rather than the behavior of the system, such as: availability, response time, ease of use, throughput, etc.


Semantic Annotation: The process of associating semantic metadata to an object. Semantic metadata is usually retrieved from an ontology. Semantic Similarity: A metric that measures the likeness that the two objects have similar meaning (semantic content). Service Description: A representation of the service operations, features, and non-functional properties. Service Discovery: The process of finding a service that matches the user functional and nonfunctional requirements. Services Metadata: Set of data that gives detailed information about the services description, operations, non-functional requirements, features, characteristics, etc. User-Centric Design: A design philosophy where the main focus in work design is on the user requirements and needs satisfaction.


Section 2

Applications-Based Machine Learning


Chapter 11

Enhancement of Data Quality in Health Care Industry: A Promising Data Quality Approach

Asmaa S. Abdo, Menoufia University, Egypt
Rashed K. Salem, Menoufia University, Egypt
Hatem M. Abdul-Kader, Menoufia University, Egypt

ABSTRACT

Ensuring data quality is a growing challenge, particularly in emerging big data applications. This chapter highlights data quality concepts, terminologies, and techniques, as well as research issues. Recent studies have shown that databases often suffer from inconsistent data, which ought to be resolved in the cleaning process. Data mining techniques can play a key role in ensuring data quality and can be reutilized efficiently in the data cleaning process. In this chapter, we introduce an approach for dependably generating rules from the databases themselves, autonomously, in order to detect data inconsistency problems in large databases. The proposed approach employs confidence and lift measures with integrity constraints to guarantee that the generated rules are minimal, non-redundant, and precise. Since healthcare applications are critical, managing healthcare environments efficiently results in improved patient care. The proposed approach is validated against several datasets from the healthcare environment; it provides clinicians with an automated approach for enhancing the quality of electronic medical records. We experimentally demonstrate that the proposed approach achieves significant enhancement over existing approaches.

DOI: 10.4018/978-1-5225-2229-4.ch011



INTRODUCTION

Massive amounts of data are generated in the health care application domain and are a great asset to healthcare organizations in today's economy. The value of data strongly depends on its degree of quality (Saha et al., 2014). Quality is applied to data in the sense of “fitness for use” or “potential for use” (Martín et al., 2010). Data quality is an essential characteristic that determines the reliability of data for data management purposes (Wang et al., 2014). The quality of data is an increasingly pervasive problem, as data in real-world databases quickly degenerates over time and affects the results of mining; this results in what is called “dirty data” (Chiang et al., 2008; Li et al., 2014). Such dirty data often emerges due to violations of integrity constraints, which results in incorrect statistics and ultimately wastes time and money (Yakout et al., 2010). It has been estimated that erroneous data leads to the loss of billions of dollars annually, because poor data quality in decision making negatively affects customer satisfaction (Fan et al., 2012; Yakout et al., 2010). As a result, detecting inconsistent data is a very important task in the data cleaning process. Doubtless, ensuring high-quality, dependable data is a competitive advantage for all businesses, and it requires accurate data cleaning solutions (Fan et al., 2012; Wang et al., 2014). We need to verify four attributes of data quality, as shown in Figure 1: complete, accurate, available, and timely. Completeness means that all related data about one entity is linked. Accuracy refers to data free from common problems such as spelling mistakes, typographical errors, and random abbreviations. Availability means that the required data is accessible on demand, so that the customer does not need to search manually for information. Timeliness means the data is kept up to date and available for management purposes. Indeed, to ensure the quality of data there is a need for data cleaning. Data cleaning, also called data cleansing or scrubbing, refers to the process of detecting and correcting corrupted and/or inaccurate records in order to enhance the quality of data. This process is mandatory in the data management cycle before mining and analyzing data (Mezzanzanica et al., 2013). A manual data cleansing process is exhausting, time consuming, and itself prone to errors (Li et al., 2014). This makes us search for automated solutions: powerful tools that automate or greatly assist the data cleansing process in order to achieve a reasonable quality level in existing data. As this

Figure 1. Four attributes of data quality


may seem to be sound solutions, we notice that little basic research has been directly aimed at methods to support such tools. Data cleaning essentially found in central areas such as data warehousing, data quality management and Knowledge discovery in databases. The ultimate goal of data cleaning research is to clearly and accurately address data cleansing problems. Data mining is mainly used today by several application domains with a strong consumer attention. Data mining is defined as the process of finding hidden and unknown patterns in databases and using these facts to build models (Koh et al., 2011). That provides the methods and technology to transform huge amounts of data to be processed and analyzed into useful information for data management purposes. This make data mining is becoming more and more popular and essential by many application domains especially in health care domains. Data mining techniques can be efficiently used in data cleaning process, which focus in analyzing, and processing extremely massive amount of data into meaningful knowledge through clean and correct erroneous data record (Maletic et al., 2010). Data mining applications can enormously advantage all parties included in medical services and health care industry. For example, help in healthcare insurers detect fraud and abuse, keep management of customer relationship decisions, help physicians identify effective treatments, patients receive better and more inexpensive health care accommodations, and in healthcare management purposes. The success of electronic medical data mining depends on the availability of cleaning healthcare data (Koh et al., 2011). In addition, most of the existing current research directions in data cleaning methods from the literature focus on identifying and eliminate duplicate records (Fan et al., 2011). This also called record matching, merge/purge, record linkage, instance identification (Bharambe et al., 2012; Herzog et al., 2007; Maletic et al., 2010), which focus on match master cleaned records with a probably imprecise records. Henceforward, existence of inconsistency issues in data intensely decreases their assessment, making them misinformed, or even harmful. This make still necessary to tackle the problem of data inconsistency with the help of data themselves and without the need for external master copy of data. During resolving data inconsistencies, several integrity constraints are ensured, e.g., Functional Dependencies (FDs) and Conditional Functional Dependencies (CFDs) (Fan et al., 2011; Liu et al., 2012). Medical application domain is one of the most critical applications that suffers from inconsistent and dirty data issues. As ensuring data quality of electronic medical records is very important in healthcare data management purposes, whereas critical decisions are based on patient status apprised from medical records (Chang et al., 2012; Kazley et al., 2012; Mans et al., 2015; Rodríguez et al., 2010). Herein, we are interested to generate data quality rules, which then used for resolving data inconsistencies in such medical databases. Figure 2 indicates the consequence of poor data quality on Electronic Medical Records (EMR). This chapter is organized as follows: Next section discuss central research areas in data quality. Followed by presenting data quality dimensions in next section. Then presents background and definitions about several aspects in data quality. Then discuss existing data cleaning methods in next section. 
The section after that presents the motivation and problem statement, followed by a section that highlights data cleaning in medical applications. The chapter then introduces the proposed approach for generating dependable data cleaning rules and discusses the experimental study and results conducted on different medical datasets. Finally, we conclude the proposed work and highlight future trends of this research.


Figure 2. Poor data quality in electronic health care

Central Research Areas in Data Quality

There are several research areas related to data quality, namely data integration, data cleaning, statistical data analysis, management information systems, and knowledge representation, as indicated in Figure 3. Each area focuses on improving data quality in order to obtain a high-quality competitive advantage (Carey et al., 2006).

•	Data Integration: Aims to present a unified view of data from heterogeneous data sources in distributed and peer-to-peer systems, identifying and resolving conflicts among values coming from different data sources in order to enhance data quality (Batini et al., 2009).
•	Data Cleaning: Deals with detecting and correcting (or removing) corrupt or inaccurate data, such as inconsistent, incomplete, duplicated, incorrect, or irrelevant data. It focuses on ensuring the quality of all parts of the data by replacing, modifying, or deleting dirty data (Maletic et al., 2010; Mezzanzanica et al., 2013).
•	Statistical Data Analysis: Refers to a set of methods that analyze and interpret data, focusing on summarizing and exploring data to obtain accurate statistical results (Mayfield et al., 2010; Srivastava et al., 2015).
•	Management Information Systems: Aim to enhance the cost and optimization of the overall organization and provide the data and knowledge necessary for management and control purposes (Zu et al., 2008).
•	Knowledge Representation: Focuses on the complementary issue of how to represent knowledge in an explicit and declarative manner (Carey et al., 2006).

Data Quality Dimensions

Several data quality dimensions are used for enhancing data quality in different application domains (Fan et al., 2014; Li et al., 2014), as shown in Figure 4. Data quality dimensions capture the central issues of how to ensure the quality of data. Here, we focus on four significant dimensions: data consistency, data de-duplication, data currency, and data accuracy.


Figure 3. Data quality research issues (Carey et al., 2006)

Figure 4. Data quality dimensions

•	Data Consistency: Refers to the process of validating data against a set of predefined constraints in order to detect conflicts and inconsistency problems, so that the data actually represents real-world entities.
•	Data De-Duplication: Aims to identify and eliminate duplicate, conflicting records from the database and to keep the correct, unique values that refer to the same real-world entities (Benjelloun et al., 2009; Fan et al., 2011).
•	Data Currency: Refers to answering customer queries with up-to-date, current values rather than old values, which helps data management obtain high-quality, current data rather than outdated data (Fan et al., 2011).
•	Data Accuracy: Ensures the closeness of the stored data values to the actual real-world data (Batini et al., 2009).


BACKGROUND AND DEFINITIONS

The main problem addressed here is how to detect inconsistency errors in large databases. We present the traditional methods used in data cleaning that attempt to enhance the quality of electronic data records. Data dependencies play an important role in data quality management. The traditional methods begin with traditional dependencies, also called functional dependencies (FD), followed by Conditional Functional Dependencies (CFD), Matching Dependencies (MD), Fixing Rules, the Constant CFD Problem, Closed Frequent Patterns, and Pruning of the Search Space. Assume a relation schema R and a set of attributes (X, Y).





•	Functional Dependencies (FD): Traditional dependencies between attributes within a relation. A functional dependency exists when one attribute in a relation uniquely determines another attribute of the same relation. We say that X functionally determines Y if and only if each value of attribute X is associated with exactly one value of attribute Y. An FD is written in the form X → Y, where X is called the antecedent and Y the consequent. FDs were developed mainly for schema design and are often insufficient to capture the semantics of data (Hartmann et al., 2012; Liu et al., 2012; Papenbrock et al., 2015; Yao et al., 2008).
•	Conditional Functional Dependencies (CFD): An extension of functional dependencies that aims to detect inconsistencies between tuples in a single relation. A CFD φ on relation R is a pair (X → Y, tp), where X → Y is a standard FD on R and tp is the pattern tuple of φ with attributes in X and Y. For each attribute i in X ∪ Y, tp[i] is either a constant in the domain of i or an unnamed variable ‘_’, thereby incorporating the binding of semantically related values in a single relation (Fan et al., 2012; Fan et al., 2011).
•	Matching Dependencies (MD): A dependency-based approach for detecting duplicates by matching records. MDs specify the identification or matching of certain attribute values in pairs of database tuples when some similarity conditions are satisfied. Example 1: Consider the following database instance of a relation P. A matching dependency rule between two tuples is as follows:

P[Phone] ≈ P[Phone] ∧ P[Address] ≈ P[Address] → P[Name] ⇌ P[Name]. This means that if the phone number values are similar and the address values are similar, then the Name values are concluded to refer to the same person, and this duplication is detected using matching similarity techniques (Elmagarmid et al., 2007; Fan et al., 2011).

•	Fixing Rules: Manually designed rules for detecting inconsistency errors. A fixing rule contains an evidence pattern, a set of negative patterns, and a fact value. Given a tuple, the evidence pattern and the

Table 1. Sample from employee relation

Name        Address            Phone
Joe Smith   4-50 Dak St.       523-4589
J. Smith    50 Dak St. Ap. 4   (860) 523-4589



Table 2. Sample from employee relation

Name   Country   Capital              ZIP
Jim    China     Beijing              08557
Mike   China     Shanghai (Beijing)   09788
Lan    Canada    Toronto (Ottawa)     01223

negative patterns of the rule are combined to tell precisely which attribute is wrong, and the fact indicates how to correct it (Wang et al., 2014). The syntax of a fixing rule on a relation schema R is formalized as φ: ((X, tp[X]), (Y, Tp−[Y])) → tp+[Y]. Example 2: Consider an employee schema (name, country, capital, city, ZIP) and the rules ω1: (([country], [China]), (capital, {Shanghai, Hongkong})) → Beijing and ω2: (([country], [Canada]), (capital, {Toronto})) → Ottawa.



Constant CFD Problem: The problem is to discover minimal set of frequent constant conditional functional dependencies, which include non-redundant CFD. This also mean discover conditional functional dependencies with constant patterns only (Li et al., 2013; Stefan, 2010). Closed Frequent Patterns: Pattern is frequent closed if it is not included in a proper superset having the same support. A generator Y of a frequent closed pattern X, is a pattern constraint with it has the same support as X, and it does not have any subset having the same support (Fan et al., 2011). Pruning Search Space: Pruning is to remove infrequent nodes from search space domain using predefined support value. (Li et al., 2013).

EXISTING DATA CLEANING METHODS In this section, we address current data cleaning methods proposed for enhancing data quality. Unfortunately, despite the urgent need for precise and dependable techniques for enhancing data quality and data cleaning problems, there is not vital solution up to now to these problems. There has been little discussion and analysis about enhancing data inconsistency. However, most of recent work focus on record matching and duplicate detection (Bharambe et al., 2012). Firstly, database and data quality researchers have discussed variety of integrity constraints based on Functional Dependencies (FD) (Cong et al., 2007; Hartmann et al., 2012; Liu et al., 2012; Yao et al., 2008). In (Yao et al., 2008) Propose FD_Mine algorithm that discover functional dependency from given relation. A survey and comprehensive comparison on seven algorithm for discovering functional dependencies (Papenbrock et al., 2015). This enable us to choose best algorithm for a given dataset and



also compare these algorithms in runtime and memory behavior. Algorithms used in comparison namely, TANE, FUN, FD_Mine, DFD, Dep-Miner, FastFDs, FDEP as indicated extensively in (Papenbrock et al., 2015). Nevertheless, traditional FDs were developed mainly for schema design, but are often not able to detect the semantic values errors of data. Other researchers focus on extension of FD, they have proposed what is so-called Conditional Functional Dependencies (CFD) and Conditional Inclusion Dependencies (CID) for capturing errors in data (Bohannon et al., 2007). Algorithms proposed for discovering CFDs rules from relation such, CFD Miner algorithm for discovering constant conditional functional dependencies, CTANE algorithm that extend TANE to discover general CFDs, and FastCFD for discovering general CFDs by employing a depthfirst search strategy instead of the level wise approach as used in CTANE algorithm (Fan et al., 2011). Several data quality techniques are proposed to clean missy tuples from databases (Fan et al., 2010). As researchers aim to find critical information missing from databases. In (Fan et al., 2010) propose three models to specify relative information completeness of databases from which both tuples and values may be missing. Statistical inference approaches are studied in (Mayfield et al., 2010), which propose approach for inferring missing information and correcting such errors automatically. Proposed approach based on two statistical data-driven methods for inferring missing data values in relational databases. These approaches tackle missing values in order to enhance quality of data. From technological part, there are several open source tools which developed for handling messy data. Open Refine and Data Wrangler are two open source tool for working with missing data for cleaning it as detailed in (Larsson, 2013). Besides, data transformation methods such as commercial ETL (Extract, Transformation and Loading) tools. Extract focus on extracts data from homogeneous or heterogeneous data sources. Transform method purpose is to store data in proper format or structure for querying and analysis purpose. Loading concern with load data into single data source repository such data warehouse or other unified data source depending on the requirements of the organization. These tools are developed for data cleaning in order to support any changes in the structure, representation or content of data (Vassiliadis et al., 2009). The usage of editing rules in combination with master data is discussed in (Fan et al., 2012). Such rules are able to find certain fixes by updating input tuples with master data. In construct to constraints, editing rules have dynamic semantics and are relative to master data. Given an input tuple t that matches a pattern, editing rules tell us which attributes of given tuple t should be updated and what values from master data should be assigned to them. This approach requires defining editing rules manually for both relations, i.e., master relation and input relation, which is very expensive and time-consuming. Repairing use heuristic solution based on min cost function of two updates that not provide with deterministic fix. Editing rules require users to examine every tuple, which is expensive. 
Furthermore, a lots of work are proposed in the literature relying on domain specific similarity and matching operators, such works include record matching, record linkage, duplicate detection, and merge purge (Bharambe et al., 2012; Fan et al., 2011; Herzog et al., 2007). These approaches define two functions; namely match and merge (Benjelloun et al., 2009). While match function identifies duplication of records, the merge function combines the two duplicated records into one. We conclude that previous methods do not guarantee that we have deterministic and reliable fixes to the consistency problem. As they do not work when detecting errors in critical data such Electronic Medical Records (EMRs) in health care hospitals. From the literature, we reutilized constant CFD as a special case of association rules (Fan et al., 2011; Kalyani, 2013). We focus on problem of detecting errors and inconsistencies from data. The relationship between minimal constant CFD and item set mining, association rule as similar relationship to CFD that work on non-redundant rules (Zaki, 2004). 237


MOTIVATION AND PROBLEM STATEMENT

Ensuring the quality of data raises several data cleaning challenges, including the following:

•	Real-life data is often dirty (inconsistent, inaccurate, incomplete, and stale).
•	Enhancing data quality takes a large amount of time, which is generally cost ineffective.
•	Users are often directly involved in the data cleaning process.
•	Most data cleaning techniques have scalability issues.

Based on these challenges, data cleaning for data quality needs to be improved by using data mining techniques to generate dependable rules and to use them for object identification and attribute value correction. Given an instance r of a relation schema R and support s and confidence c thresholds, the proposed approach discovers Interest-based Constant Conditional Functional Dependencies (ICCFDs); we abbreviate it as the ICCFD-Miner approach. The discovered ICCFDs ensure finding interesting, minimal, non-redundant, dependable Constant Conditional Functional Dependency (CCFD) rules with constant patterns in r. Indeed, the main contribution of this chapter is an approach for generating dependable data cleaning rules. The discovered rules are exploited not only for detecting inconsistent data but also for correcting it. The proposed approach utilizes data mining techniques for discovering dependable rules; it is based mainly on frequent closed patterns and their associated generators, which speed up the rule generation process. The experimental results conducted over medical datasets from the health care application domain verify the effectiveness and accuracy of the proposed approach against the CCFD-ZartMNR algorithm (Kalyani, 2013).

DATA CLEANING IN ELECTRONIC MEDICAL RECORDS

With the progress of information systems in several application domains, especially medical applications (Mans et al., 2015), data are being digitally collected and stored incrementally and are expanding rapidly. We need to enhance the quality of medical records to obtain user satisfaction and to eliminate the risks arising from poor or dirty data. Electronic Medical Records (EMRs) are one of the major automation concerns of current hospitals; they are repositories of knowledge about patients' medical and clinical data (Chang et al., 2012). EMRs need to be quality assured in order to obtain user satisfaction, which affects the overall effectiveness of the system. User satisfaction with electronic medical data is considered one of the measures used to ensure quality in EMRs (Koh et al., 2011). Figure 5 shows a schematic diagram of the data cleaning process for electronic medical records. Healthcare stakeholders and providers of health care services need quality information and knowledge not only at the point of service but also at the point of clinical treatment decisions in order to improve health care quality (Groves et al., 2013; Weiskopf et al., 2012). They require such knowledge with precise quality to maximize the benefit of the decision making process (Kush et al., 2008). Maintaining exact and reliable information about the diseases treated depends on precise data stored about the patient (Koh et al., 2011). Therefore, clinical and healthcare service research aims at accurate, reliable, and complete statistical information about the use of health care services within a community.


Figure 5. Schematic diagram for data cleaning process in electronic medical records

PROPOSED METHODOLOGY

The main purpose of the proposed approach is to discover interest-based minimal non-redundant constant CFD rules that cover the full set of rules. In other words, the discovered rules are minimal and complete with respect to the specified support and confidence thresholds. The discovered rules are employed for dirty data identification and for treating data inconsistencies. The proposed approach relies on generating closed frequent patterns and their associated generators, i.e., the generator patterns whose closure gives each closed frequent pattern, out of the set of all frequent patterns. The flow of the two main steps for generating minimal non-redundant rules is shown in Figure 6. The proposed approach is detailed as follows:

Input: A dataset and two predefined thresholds, i.e., minimum support (minsup) and minimum confidence (minconf), are the input to the ICCFD-Miner approach.

Step 1: Given the user-defined minimum support threshold, the list of closed frequent patterns and their associated generators is produced. In order to minimize the search space and save rule generation time, the proposed approach works on these closed frequent patterns and their associated generators instead of generating all frequent patterns (Kalyani, 2013). The search space domain is indicated in Figure 7. Let us define the support of a CFD φ: (X → A) as the number of records in the dataset that contain X ∪ A relative to the total number of records in the database. The support threshold is based on the idea that values that occur together frequently provide more evidence that they are correlated. The support of a CFD φ: (X → A), where X is a generator pattern and A is a (closed/generator) pattern (Chiang et al., 2008), is defined as:


Figure 6. The proposed ICCFD-Miner approach

support(X → A) = (Number of tuples containing the values in X and A) / (Total number of tuples in the relation)     (1)

Step 2: Given the user-defined confidence threshold, the set of interest-based minimal non-redundant constant conditional functional dependency data quality rules is generated. While the literature utilizes only support and confidence for generating such rules, the proposed approach also takes the interest measure into account in order to generate more dependable and reliable rules. To form the rules, each frequent generator pattern X is matched with its proper supersets A from the set of frequent closed patterns. Then, from X and A, the rule antecedent (generator) → consequent (closed/generator) is added as φ: X → A. Let us define the confidence of a CFD as the number of records in the dataset that satisfy the CFD divided by the number of records that satisfy the left-hand side of the rule.

Figure 7. Search space domain


confidence(X → A) = support(φ) / support(X)     (2)

Confidence measures the reliability of a rule; its value is a real number between 0 and 1.0 (Medina et al., 2009). The pitfall of confidence is that it ignores the support of the right-hand side of the rule. As a consequence, we add a data quality measure called interest (lift), which yields more dependent rules when it is required to be greater than one. Let us define the lift of a CFD as a measure of the degree of compatibility between the left-hand side and the right-hand side of a rule, i.e., the co-occurrence of both sides (Hussein et al., 2015). We require a lift value > 1 to obtain dependent ICCFD-Miner rules. For example, the lift of the CFD rule φ: (X → A) is

lift(X → A) = confidence(φ) / support(A)     (3)

Finally, this approach optimizes the rule generation process compared with the most closely related methods.
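To make the three measures concrete, the short sketch below computes the support, confidence, and lift of a single constant rule over a tabular relation using pandas. The function name, the toy patient relation, and the attribute values are illustrative assumptions only; this is not the ICCFD-Miner implementation itself.

```python
import pandas as pd

def rule_measures(df: pd.DataFrame, lhs: dict, rhs: dict):
    """Compute support, confidence, and lift of a constant CFD-style rule
    lhs -> rhs, where lhs and rhs are {attribute: constant} patterns."""
    n = len(df)

    def matches(pattern):
        mask = pd.Series(True, index=df.index)
        for attr, value in pattern.items():
            mask &= df[attr] == value
        return mask

    lhs_mask = matches(lhs)
    rhs_mask = matches(rhs)
    both_mask = lhs_mask & rhs_mask

    support = both_mask.sum() / n                                  # Equation (1)
    confidence = both_mask.sum() / max(lhs_mask.sum(), 1)          # Equation (2)
    rhs_support = rhs_mask.sum() / n
    lift = confidence / rhs_support if rhs_support > 0 else float("nan")  # Equation (3)
    return support, confidence, lift

# Illustrative usage on a toy patient relation (values are invented)
records = pd.DataFrame({
    "diagnosis": ["hypothyroid", "hypothyroid", "negative", "hypothyroid"],
    "TSH_level": ["high", "high", "normal", "high"],
})
sup, conf, lift = rule_measures(records, {"TSH_level": "high"}, {"diagnosis": "hypothyroid"})
print(sup, conf, lift)  # a candidate rule is kept only if sup >= minsup, conf >= minconf, and lift > 1
```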

EXPERIMENTAL STUDY

Using five real-life datasets, we evaluate the proposed approach described in the previous section in order to assess its performance in a real-life application domain, in particular a critical one such as medical applications. The datasets contain a large amount of information about patients and their status. The proposed approach is used to generate dependable rules for these datasets in order to enhance their data quality. The cleaned data become available on demand for decision makers in healthcare systems, enabling accurate decisions based on data of precise quality. We evaluate the effect of the following factors on the efficiency and accuracy of the ICCFD-Miner rules produced: the support threshold (sup), the confidence threshold (conf), the size of the sample relation r (the number of instances in r), the arity of relation r (the number of columns in r), and time complexity.

Experimental Setting

The experiments are conducted using five real-life datasets about diseases taken from the UCI machine learning repository (http://archive.ics.uci.edu/ml/), namely Thyroid (hypothyroid), primary-tumor, cleveland-14-heart, Cardiotocography, and Pima_diabets. Table 3 shows the number of attributes and the number of instances for each dataset. The proposed approach is implemented in Java (JDK 1.7). The implementation is tested on a machine equipped with an Intel(R) Pentium(R) Dual CPU T3400 @ 2.16 GHz processor with 2.00 GB of memory, running the Windows 7 operating system. The proposed approach runs mainly in main memory. Each experiment is repeated at least five times and the average is reported here.


Table 3. Datasets description

Dataset Name          | Arity (Number of Columns) | Size (Number of Instances)
Thyroid (hypothyroid) | 30                        | 3772
primary-tumor         | 18                        | 339
cleveland-14-heart    | 14                        | 303
Cardiotocography      | 23                        | 2126
Pima_diabets          | 9                         | 768

Experimental Results

We now show and discuss the results on the real-world datasets described in the previous section. Note that we aim to evaluate the effectiveness of rule generation of the proposed approach against the CCFD-ZartMNR algorithm (Kalyani, 2013). Experiments show that the proposed approach always produces fewer but more accurate rules; the generated rules are interest-based, minimal, and non-redundant. Experiments also show that ICCFD-Miner outperforms the other algorithm with respect to rule generation time.

Experiment 1: In this experiment, rules are generated from the thyroid (hypothyroid) dataset. This dataset contains 30 attributes and 3772 records describing patient information about hypothyroid diagnoses. By varying the values of the support (sup) and confidence (conf) thresholds as shown in Figure 8, we notice that the proposed approach always generates accurate interest-based minimal non-redundant rules compared to the CCFD-ZartMNR algorithm. For example, in Figure 8, at minimum support = 0.97 and minimum confidence = 0.99, the number of rules generated by the proposed ICCFD-Miner approach is 85, compared to 220 rules generated by CCFD_ZartMNR. The results in Figure 9 show that the proposed approach generates rules at different sup and conf values in less time than the CCFD-ZartMNR algorithm. For example, in Figure 9, at minimum support = 0.97 and minimum confidence = 0.99, the response time of the proposed ICCFD-Miner approach is 312 ms, compared to 368 ms for the existing CCFD_ZartMNR approach.

Figure 8. Total number of rules generated of thyroid (hypothyroid) dataset

Figure 9. Response time measure about thyroid (hypothyroid) dataset

Experiment 2: The experimental results for the primary-tumor disease dataset are shown in Figures 10 and 11. This dataset contains 18 attributes and 339 records describing patient information about primary-tumor disease diagnoses. Figures 10 and 11 validate the efficiency of the proposed approach against the CCFD-ZartMNR algorithm with respect to both the number of rules generated and the response time measure.

Figure 10. Total number of rules generated of primary-tumor dataset


Figure 11. Response time measure about primary-tumor dataset

Experiment 3: We conduct an experiment to generate rules from the cleveland-14-heart disease dataset. This dataset contains 14 attributes and 303 records describing patient information about cleveland-14-heart disease diagnoses. Figures 12 and 13 validate the effectiveness of the proposed approach against the CCFD-ZartMNR algorithm in terms of both the minimal non-redundant rules generated and the response time measure.

Figure 12. Total number of rules generated of Cleveland-heart-disease dataset

Figure 13. Response time measure about Cleveland-heart-disease dataset

Experiment 4: We evaluate the effectiveness of the ICCFD-Miner approach against CCFD-ZartMNR over the Cardiotocography disease patient dataset. This dataset contains 23 attributes and 2126 records describing patient information about Cardiotocography diagnoses. The results shown in Figures 14 and 15 confirm that the proposed approach is more effective than the other algorithm.

Figure 14. Total number of rules generated of cardiotocography dataset

Figure 15. Response time measure about Cardiotocography dataset

Experiment 5: In this experiment, rules are generated from the Pima_diabets dataset. This dataset contains 9 attributes and 768 records describing patient information about disease diagnoses. The experimental results are shown in Figures 16 and 17.

Figure 16. Total number of rules generated of Pima_diabets dataset

Finally, we believe that the proposed approach, i.e., ICCFD-Miner, outperforms CCFD_ZartMNR in generating precise rules and in response time because it incorporates the lift measure when generating dependable rules. Furthermore, the proposed approach focuses on closed patterns together with their associated generators (the closed patterns being proper supersets of their generators) as the search space for generating more accurate and reliable rules.

Figure 17. Response time measure about Pima_diabets dataset

CONCLUSION AND FUTURE DIRECTIONS

This chapter introduced data quality concepts, dimensions, methodologies, and related scientific research issues. The scope is focused on data preprocessing, i.e., the data cleaning process, of inconsistent databases. We presented the ICCFD-Miner approach, which discovers precise data quality rules for resolving data inconsistency errors. The proposed approach yields a promising method for detecting semantic data inconsistency errors and for keeping the database in a consistent state. The generated rules are exploited as a data cleaning solution to resolve inconsistency problems in several application domains. ICCFD-Miner relies on the lift measure, in addition to the support and confidence measures, for generating dependable, minimal, and non-redundant rules. ICCFD-Miner is validated and evaluated over five real-life datasets from the medical application domain. The experimental results confirm the effectiveness and usefulness of the proposed approach against the CCFD_ZartMNR algorithm. The proposed approach performs well across several dimensions such as effectiveness, accuracy of the number of rules generated, and run time. Finally, we plan to investigate a technique for fixing errors autonomously using the rules generated by ICCFD-Miner.

REFERENCES Batini, C., Cappiello, C., Francalanci, C., & Maurino, A. (2009). Methodologies for data quality assessment and improvement. ACM Computing Surveys, 41(3), 16. doi:10.1145/1541880.1541883 Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S. E., & Widom, J. (2009). Swoosh: a generic approach to entity resolution. The VLDB Journal, 18(1), 255-276. Bharambe, D., Jain, S., & Jain, A. (2012). A Survey: Detection of Duplicate Record. International Journal of Emerging Technology and Advanced Engineering, 2(11). Bohannon, P., Fan, W., Geerts, F., Jia, X., & Kementsietsidis, A. (2007, April). Conditional functional dependencies for data cleaning. Proceedings of the IEEE 23rd International Conference on Data Engineering ICDE ‘07 (pp. 746-755). IEEE. doi:10.1109/ICDE.2007.367920 Carey, M. J., Ceri, S., Bernstein, P., Dayal, U., Faloutsos, C., Freytag, J. C.,... & Whang, K. Y. (2006). Data Centric Systems and Applications. Springer.


Chang, I. C., Li, Y. C., Wu, T. Y., & Yen, D. C. (2012). Electronic medical record quality and its impact on user satisfaction—Healthcare providers point of view. Government Information Quarterly, 29(2), 235–242. doi:10.1016/j.giq.2011.07.006 Chiang, F., & Miller, R. J. (2008). Discovering data quality rules. Proceedings of the VLDB Endowment, 1(1), 1166–1177. doi:10.14778/1453856.1453980 Cong, G., Fan, W., Geerts, F., Jia, X., & Ma, S. (2007, September). Improving data quality: Consistency and accuracy. Proceedings of the 33rd international conference on Very large data bases (pp. 315-326). VLDB Endowment. Elmagarmid, A. K., Ipeirotis, P. G., & Verykios, V. S. (2007). Duplicate record detection: A survey. Knowledge and Data Engineering. IEEE Transactions on, 19(1), 1–16. Fan, W., & Geerts, F. (2010, June). Capturing missing tuples and missing values. Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (pp. 169178). ACM. doi:10.1145/1807085.1807109 Fan, W., & Geerts, F. (2012). Foundations of data quality management. Synthesis Lectures on Data Management, 4(5), 1–217. doi:10.2200/S00439ED1V01Y201207DTM030 Fan, W., Geerts, F., & Wijsen, J. (2011). Determining the currency of data. ACM Transactions on Database Systems, 37(4), 25. Fan, W., Ma, S., Tang, N., & Yu, W. (2011). Interaction between record matching and data repairing. Journal of Data and Information Quality, 4(4), 16. Fan, W., Li, J., Ma, S., Tang, N., & Yu, W. (2012). Towards certain fixes with editing rules and master data. The VLDB Journal, 21(2), 213-238. Fan, W., Gao, H., Jia, X., Li, J., & Ma, S. (2011). Dynamic constraints for record matching. The VLDB Journal, 20(4), 495–520. doi:10.1007/s00778-010-0206-6 Fan, W., Geerts, F., Li, J., & Xiong, M. (2011). Discovering conditional functional dependencies. IEEE Transactions on Knowledge and Data Engineering, 23(5), 683–698. Groves, P., Kayyali, B., Knott, D., & Van Kuiken, S. (2013). The ‘big data’ revolution in healthcare. The McKinsey Quarterly. Hartmann, S., Kirchberg, M., & Link, S. (2012). Design by example for SQL table definitions with functional dependencies. The VLDB Journal, 21(1), 121-144. Herzog, T. N., Scheuren, F. J., & Winkler, W. E. (2007). Data quality and record linkage techniques. Springer Science & Business Media. Hussein, N., Alashqur, A., & Sowan, B. (2015). Using the interestingness measure lift to generate association rules. Journal of Advanced Computer Science & Technology, 4(1), 156–162. doi:10.14419/ jacst.v4i1.4398 Kalyani, D. D. (2013). Mining Constant Conditional Functional Dependencies for Improving Data Quality. International Journal of Computers and Applications, 74(15).


Kazley, A. S., Diana, M. L., Ford, E. W., & Menachemi, N. (2012). Is electronic health record use associated with patient satisfaction in hospitals? Health Care Management Review, 37(1), 23–30. doi:10.1097/ HMR.0b013e3182307bd3 PMID:21918464 Koh, H. C., & Tan, G. (2011). Data mining applications in healthcare. Journal of Healthcare Information Management, 19(2), 65. PMID:15869215 Kush, R. D., Helton, E., Rockhold, F. W., & Hardison, C. D. (2008). Electronic health records, medical research, and the Tower of Babel. The New England Journal of Medicine, 358(16), 1738–1740. doi:10.1056/NEJMsb0800209 PMID:18420507 Larsson, P. (2013). Evaluation of Open Source Data Cleaning Tools: Open Refine and Data Wrangler. Li, J., Liu, J., Toivonen, H., & Yong, J. (2013). Effective pruning for the discovery of conditional functional dependencies. The Computer Journal, 56(3), 378–392. doi:10.1093/comjnl/bxs082 Li, L., Peng, T., & Kennedy, J. (2014). A rule based taxonomy of dirty data. Journal on Computing, 1(2). Liu, J., Li, J., Liu, C., & Chen, Y. (2012). Discover dependencies from Data—A review. IEEE Transactions on Knowledge and Data Engineering, 24(2), 251–264. doi:10.1109/TKDE.2010.197 Maletic, J. I., & Marcus, A. (2010). Data cleansing: A prelude to knowledge discovery. In Data Mining and Knowledge Discovery Handbook (pp. 19-32). Springer US. Mans, R. S., van der Aalst, W. M., & Vanwersch, R. J. (2015). Data Quality Issues. In Process Mining in Healthcare (pp. 79-88). Springer International Publishing. doi:10.1007/978-3-319-16071-9_6 Martin, E., & Ballard, G. (2010). Data Management Best Practices and Standards for Biodiversity Data Applicable to Bird Monitoring Data. North American Bird Conservation Initiative, Monitoring Subcommittee. Retrieved from http://www.nabcius.org/aboutnabci/bestdatamanagementpractices.pdf Mayfield, C., Neville, J., & Prabhakar, S. (2010, June). ERACER: a database approach for statistical inference and data cleaning. Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (pp. 75-86). ACM. doi:10.1145/1807167.1807178 Medina, R., & Nourine, L. (2009). A unified hierarchy for functional dependencies, conditional functional dependencies and association rules. In Formal Concept Analysis (pp. 98–113). Springer Berlin Heidelberg. doi:10.1007/978-3-642-01815-2_9 Mezzanzanica, M., Boselli, R., Cesarini, M., & Mercorio, F. (2013). Automatic Synthesis of Data Cleansing Activities. In DATA (pp. 138-149). Papenbrock, T., Ehrlich, J., Marten, J., Neubert, T., Rudolph, J. P., Schönberg, M., & Naumann, F. et al. (2015). Functional Dependency Discovery: An Experimental Evaluation of Seven Algorithms. Proceedings of the VLDB Endowment, 8(10), 1082–1093. doi:10.14778/2794367.2794377 Rodríguez, C. C. G., & Riveill, M. (2010). e-Health monitoring applications: What about Data Quality? Saha, B., & Srivastava, D. (2014, March). Data quality: The other face of big data. Proceedings of the 2014 IEEE 30th International Conference on Data Engineering (ICDE) (pp. 1294-1297). IEEE.


Srivastava, M., Garg, R., & Mishra, P. K. (2015, March). Analysis of Data Extraction and Data Cleaning in Web Usage Mining. Proceedings of the 2015 International Conference on Advanced Research in Computer Science Engineering & Technology (ICARCSET ‘15) (p. 13). ACM. doi:10.1145/2743065.2743078 Brüggemann, S. (2010). Addressing Internal Consistency with Multidimensional Conditional Functional Dependencies. In COMAD (p. 139). doi:10.1145/2743065.2743078 Vassiliadis, P., & Simitsis, A. (2009). Extraction, transformation, and loading. In Encyclopedia of Database Systems (pp. 1095-1101). Springer US. Wang, J., & Tang, N. (2014, June). Towards dependable data repairing with fixing rules. Proceedings of the 2014 ACM SIGMOD international conference on Management of data (pp. 457-468). ACM. doi:10.1145/2588555.2610494 Weiskopf, N. G., & Weng, C. (2012). Methods and dimensions of electronic health record data quality assessment: Enabling reuse for clinical research. Journal of the American Medical Informatics Association, 20(1), 144–151. doi:10.1136/amiajnl-2011-000681 PMID:22733976 Yakout, M., Elmagarmid, A. K., & Neville, J. (2010, March). Ranking for data repairs. Proceedings of the 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW) (pp. 23-28). IEEE. doi:10.1109/ICDEW.2010.5452767 Yao, H., & Hamilton, H. J. (2008). Mining functional dependencies from data. Data Mining and Knowledge Discovery, 16(2), 197–219. doi:10.1007/s10618-007-0083-9 Zaki, M. J. (2004). Mining non-redundant association rules. Data Mining and Knowledge Discovery, 9(3), 223–248. doi:10.1023/B:DAMI.0000040429.96086.c7 Zu, X., Fredendall, L. D., & Douglas, T. J. (2008). The evolving theory of quality management: The role of Six Sigma. Journal of Operations Management, 26(5), 630–650. doi:10.1016/j.jom.2008.02.001


Chapter 12

Investigation of Software Reliability Prediction Using Statistical and Machine Learning Methods

Pradeep Kumar, Maulana Azad National Urdu University, India
Abdul Wahid, Maulana Azad National Urdu University, India

ABSTRACT

Software reliability is a statistical measure of how well software operates with respect to its requirements. There are two related software engineering research issues about reliability requirements. The first issue is achieving the necessary reliability, i.e., choosing and employing appropriate software engineering techniques in system design and implementation. The second issue is the assessment of reliability as a method of assurance that precedes system deployment. In the past few years, various software reliability models have been introduced. These models have been developed in response to the need of software engineers, system engineers, and managers to quantify the concept of software reliability. This chapter investigates the performance of some classical and intelligent machine learning techniques, such as linear regression (LR), radial basis function networks (RBFN), generalized regression neural networks (GRNN), and support vector machines (SVM), for predicting software reliability. The effectiveness of LR and the machine learning methods is demonstrated with the help of sixteen datasets taken from the Data & Analysis Centre for Software (DACS). Two performance measures, root mean squared error (RMSE) and mean absolute percentage error (MAPE), are compared quantitatively based on results obtained from rigorous experiments.

DOI: 10.4018/978-1-5225-2229-4.ch012

Copyright © 2017, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.


INTRODUCTION

Software reliability modeling has gained a lot of importance in many critical and daily-life applications, which has led to tremendous work being carried out in software reliability engineering. Software reliability growth models (SRGMs) have been used successfully for estimating and predicting the number of errors remaining in software. Software practitioners and potential users can assess current and future reliability during testing using these SRGMs. Many analytical models, such as the times-between-failures model, the nonhomogeneous Poisson process (NHPP) model, Markov processes, and operational profile models, have been proposed over the past four decades for software reliability prediction. SRGMs fall into two broad categories: parametric models and non-parametric models. Most parametric SRGMs are based on the NHPP, which has been widely and successfully used in practice. Non-parametric SRGMs based on machine learning are more flexible and can predict reliability metrics such as cumulative failures detected, failure rate, time between failures, and next time to failure. Both parametric and non-parametric models can be used to estimate current reliability measures and predict their future trends. Therefore, SRGMs can be used as mathematical tools for measuring, assessing, and predicting software reliability quantitatively.

Despite the application of various machine learning methods over the past few decades, NHPP-based models have remained among the most attractive reliability growth models for monitoring and tracking reliability improvement. However, their hard-core assumptions limit their validity, relevance, and usefulness in real-world scenarios. On the other hand, the learning and generalization capability of artificial neural networks (ANNs), and their proven success on complex problems, have made them a viable alternative for predicting software failures in the testing phase. The main advantage of ANNs over NHPP-based models is that they require only the failure history as input; no assumptions or a priori postulation of a parametric model is required.

Several regression techniques, such as linear regression, and machine learning methods (DTs, ANNs, SVMs, GA) have been proposed in the literature for predicting software reliability. The major challenge for these models does not lie in their technical soundness, but in their validity and applicability to real-world projects, in particular to modern computing systems. Linear regression (LR) is the most widely used and easily understood method, but it rarely works well on real-life data: since LR is restricted to fitting straight-line functions, it is not well suited for modeling non-linear functions. Some empirical studies based on multivariate linear regression and neural network methods have been carried out for predicting software reliability growth trends. However, multivariate linear regression can only address linear relationships and requires a large sample size and more independent variables. The use of the support vector machine (SVM) approach in place of classical techniques has shown remarkable improvement in software reliability prediction in recent years. The design of SVM is based on extracting a subset of the training data that serves as support vectors and therefore represents a stable characteristic of the data.
A GRNN-based reliability prediction model incorporating test coverage information, such as blocks and branches, has also been applied for software reliability prediction. The prediction accuracy of software reliability models can be further improved by adding other important factors affecting final software quality, such as historical information from software development, including developer capability, testing effort, and test coverage. SVMs represent the state of the art due to their generalization performance, ease of use, and rigorous theoretical foundations, and they can be used in practice for solving regression problems.


However, the major limitation of support vector machines is that their computing and storage requirements increase with the number of training examples. Gene Expression Programming (GEP) is an evolutionary machine learning technique that has been found to be a robust approach in recent years and can be applied to predicting software reliability. Moreover, modern intelligent techniques such as FIS, ANFIS, GEP, GMDH, and MARS have been found to be more effective than classical machine learning methods such as ANNs [Ferreira91]. The present challenge is to make prediction even more efficient by incorporating fairly new techniques that can improve the prediction rate while requiring fewer computational resources. It is therefore quite natural for software practitioners and potential users to want to know which particular method tends to work well for a given type of dataset, and to what extent, quantitatively. The authors conducted an empirical study of statistical and machine learning methods for predicting software reliability, which is essential for building an adequate body of knowledge from which stronger conclusions, and eventually widely accepted and well-formed theories, can be drawn.

This chapter focuses on and investigates three main issues: (i) How accurately and precisely can machine learning based models predict the reliability of software at any point of time during the testing phase? (ii) Is the performance of machine learning methods such as RBFN, GRNN, SVM, FIS, ANFIS, GEP, GMDH, and MARS better than the classical method (LR)? (iii) How do machine learning methods correlate with the statistical approach for software reliability prediction, given that their performance varies when applied to different past failure data in a realistic environment? The contribution of this chapter is summarized as follows: first, we applied the linear regression method and analyzed the correlation metrics; second, we applied modern machine learning methods (RBFN, GRNN, SVM, FIS, ANFIS, GEP, GMDH, and MARS) to study the impact of statistical failure data when making future predictions.

Figure 1. Overview of LR and Machine Learning methods for predicting software reliability

Background

In the literature, many empirical studies based on multivariate linear regression and neural network methods have been carried out for predicting software reliability growth trends. Multivariate linear regression can address linear relationships, but it requires a large sample size and more independent variables (Jung-Hua 2010). The use of the support vector machine (SVM) approach in place of classical techniques has shown remarkable improvement in the prediction of software reliability in recent years (Xiang-Li 2007). The design of SVM is based on extracting a subset of the training data that serves as support vectors and therefore represents a stable characteristic of the data. SVM can be applied as an alternative approach because of its generalization performance, ease of use, and rigorous theoretical foundations, which make it practical for solving regression problems [Ping and Lyu 2005]. However, the major limitation of support vector machines is that their computational and storage requirements increase with the number of training examples [Chen 2005].

The group method of data handling (GMDH) network, based on the principle of heuristic self-organization, has also been applied for predicting the time of future software failure occurrences and the optimal cost for the software release instant during the testing phase [Dohi et al., 2000]. The authors numerically illustrated that GMDH networks are able to overcome the problem of determining a suitable network size in multilayer perceptron neural networks. GMDH can also provide a more accurate measure in software reliability assessment than other classical prediction methods. Another reliability prediction method, a neural network based approach using a Back-Propagation Neural Network (BPN), has been applied for estimating the failures of a software system in the maintenance phase [Chen et al., 2009]. It is therefore quite natural for software practitioners and potential users to want to know which particular method tends to work well for a given type of dataset, and to what extent, quantitatively [Aggarwal and Singh, 2006]. The objective of our study is to assess the effect of past and present failure data detected during software testing using soft computing techniques in a realistic environment. This will help project managers optimize testing efforts in order to market the product on time and maximize profit.

Reliability (Zuzana, 2007) is one of the major issues for electronic devices, hardware, and application software. Over the years many software reliability growth models have been employed for predicting reliability. Yet there is no universal agreement among researchers in the field of software reliability modeling that a single correct or best model can exist, because one modeller might consider certain aspects of reality very important and thus give them significant weight in his model, while another modeller may have dissimilar views, which results in a different model.

Evaluation Criteria

In order to analyze and compare the performance of the statistical and machine learning methods, various statistics such as the correlation coefficient, MAE, MAPE, and RMSE are computed as follows:


1. Correlation coefficient (CC) is the correlation between the output and target values. This is a statistical measure of how well the predicted values from a prediction model fit the observed values of real-life data.

2. Mean absolute percentage error: MAPE = (1/n) Σ_{i=1}^{n} |P_i − A_i| / A_i

3. MAPE is applied as a standard performance measure for predicting software reliability in our study. Here n represents the number of test samples, P represents the estimated value, and A is the actual value.

4. Root mean square error: RMSE = √( (1/n) Σ_{i=1}^{n} (P_i − A_i)² ), where n is the number of observations, P is the predicted value, and A is the actual value observed during the testing phase.
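As a concrete reference, the following sketch computes these three measures with NumPy. Variable names are illustrative, and MAPE is reported here as a percentage, which is one common convention.

```python
import numpy as np

def evaluate(predicted, actual):
    """Return correlation coefficient, MAPE (%), and RMSE for two 1-D arrays."""
    p = np.asarray(predicted, dtype=float)
    a = np.asarray(actual, dtype=float)

    cc = np.corrcoef(p, a)[0, 1]                      # correlation coefficient
    mape = np.mean(np.abs((p - a) / a)) * 100.0       # mean absolute percentage error
    rmse = np.sqrt(np.mean((p - a) ** 2))             # root mean square error
    return cc, mape, rmse

# Example: predicted vs. actual cumulative failure counts (toy values)
cc, mape, rmse = evaluate([10, 21, 29, 41], [11, 20, 31, 40])
print(f"CC={cc:.4f}  MAPE={mape:.2f}%  RMSE={rmse:.4f}")
```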

METHODOLOGY

This section presents a brief description of Linear Regression (LR) and the machine learning methods used to predict software reliability, namely Radial basis function networks (RBFN), Generalized regression neural networks (GRNN), Support vector machines (SVM) (Tharwat et al., 2014), Fuzzy inference systems (FIS), Adaptive neuro-fuzzy inference systems (ANFIS), Gene expression programming (GEP), the Group method of data handling (GMDH), and Multivariate adaptive regression splines (MARS). The machine learning methods deal with the issue of how to build and design computer programs that improve their performance for some specific task, such as reliability prediction, based on statistical observations (software failure data).

Linear Regression

Linear regression (LR) is a well-known and widely used predictive model that minimizes the sum of squared errors to fit a straight line to a set of data points; that is, a linear regression model fits a linear function to the data. Thus, LR can be used to find the relationship between a continuous response variable (failure rate) and a set of predictor variables (testing time). When used for classification, LR is a two-step process: the first step estimates the probability of belonging to each group, and the second step applies a cut-off point to these probabilities to classify each case into one of the groups. The parameters of the model are estimated with the method of maximum likelihood through successive iterations. The basic purpose of regression analysis is to determine the parameter values that minimize the sum of the squared residuals for a set of observations; this is known as least squares regression, or ordinary least squares (OLS) regression. Since linear regression is restricted to fitting linear functions to data, it rarely works as well on real-world data as machine learning techniques such as ANNs, SVMs (Tharwat et al., 2016), and DTs, which can model non-linear functions. However, a linear regression model is usually much faster than methods using machine learning techniques. A univariate linear regression model relating the dependent variable (failure rate) to the independent variable (testing time) is implemented through DTREG [Phillip03] [Kohavi95]. Table 1 shows the performance measures of linear regression for the various datasets in terms of MAE, MAPE, and RMSE.
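As a minimal illustration of this univariate setting (not the DTREG implementation used in the chapter), the sketch below fits a straight line relating testing time to cumulative failures with NumPy; the toy data are invented for demonstration.

```python
import numpy as np

# Toy failure history: testing time (weeks) vs. cumulative failures observed
t = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
failures = np.array([5, 9, 15, 18, 24, 27, 33, 36], dtype=float)

# Ordinary least squares fit of failures = slope * t + intercept
slope, intercept = np.polyfit(t, failures, deg=1)
predicted = slope * t + intercept

rmse = np.sqrt(np.mean((predicted - failures) ** 2))
print(f"slope={slope:.2f}, intercept={intercept:.2f}, training RMSE={rmse:.3f}")

# Extrapolate one step ahead (week 9)
print("predicted cumulative failures at t=9:", slope * 9 + intercept)
```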


Table 1. Summary of predictions using linear regression

Data Sets | Correlation Coefficient | MAE     | MAPE    | Training RMSE | Testing RMSE
1         | 0.9635                  | 8.7007  | 60.5268 | 10.4962       | 10.6199
2         | 0.9641                  | 3.6212  | 37.5689 | 4.1360        | 4.2431
3         | 0.9769                  | 1.8032  | 37.0326 | 2.3433        | 2.4638
4         | 0.9806                  | 2.2322  | 13.7921 | 2.9925        | 3.4081
5         | 0.9899                  | 27.8925 | 12.6645 | 33.8929       | 33.9643
6         | 0.9821                  | 3.1974  | 20.1987 | 3.9621        | 4.1622
14C       | 0.9457                  | 2.7876  | 25.0475 | 3.3762        | 3.6952
17        | 0.9833                  | 1.4737  | 18.4014 | 1.9916        | 2.1847
27        | 0.9740                  | 2.1978  | 33.1756 | 2.6769        | 2.8596
40        | 0.9469                  | 8.0909  | 48.5443 | 9.3706        | 9.4589
SS1A      | 0.9885                  | 4.1104  | 19.1870 | 4.8825        | 4.9510
SS1B      | 0.9969                  | 7.1080  | 25.9329 | 8.4471        | 8.4776
SS1C      | 0.9788                  | 13.4437 | 49.6297 | 16.3452       | 16.3854
SS2       | 0.9904                  | 6.6395  | 35.7266 | 7.6319        | 7.6818
SS3       | 0.9887                  | 10.1431 | 25.8302 | 11.9884       | 12.0538
SS4       | 0.9975                  | 3.1779  | 7.7162  | 3.9439        | 3.9746

Radial Basis Function Network

The ability of ANNs to model complex non-linear relationships and to approximate any measurable function makes them attractive for solving complex tasks without having to build an explicit model of the system. A radial basis function network is used here to approximate the software failure rate using the Gaussian function. The response of the Gaussian function is non-negative for all values of x and is defined as:

f(x) = exp(−x²)

The architecture of a radial basis function network consists of three layers, namely an input layer, a hidden layer, and an output layer. The output of the Gaussian transfer function associated with each neuron in the hidden layer is inversely related to the distance from the center of the neuron. There are n input neurons and m neurons in the hidden layer, which lies between the input and output layers. The interconnections between the input layer and the hidden layer form hypothetical connections, while weighted connections are generated between the hidden and output layers of the network. The RBFN is implemented using DTReg [Phillip03] and applied to the sixteen datasets. Table 2 shows the performance measures of RBFN for the corresponding datasets in terms of MAE, MAPE, and RMSE.
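The following sketch shows, under simplifying assumptions, one way such a network can be trained: Gaussian activations are computed around a handful of fixed centers and the output weights are then obtained by linear least squares. It is illustrative only and is not the DTReg implementation; the centers, width, and toy data are assumptions.

```python
import numpy as np

def train_rbfn(x, y, n_centers=5, width=1.0):
    """Fit a simple RBF network: Gaussian hidden layer + linear output weights."""
    centers = np.linspace(x.min(), x.max(), n_centers)               # fixed, evenly spaced centers
    phi = np.exp(-((x[:, None] - centers[None, :]) / width) ** 2)    # hidden-layer activations
    phi = np.hstack([phi, np.ones((len(x), 1))])                     # bias column
    weights, *_ = np.linalg.lstsq(phi, y, rcond=None)                # output-layer weights
    return centers, width, weights

def predict_rbfn(x, centers, width, weights):
    phi = np.exp(-((x[:, None] - centers[None, :]) / width) ** 2)
    phi = np.hstack([phi, np.ones((len(x), 1))])
    return phi @ weights

# Toy data: testing time vs. cumulative failures
t = np.arange(1, 13, dtype=float)
failures = 40 * (1 - np.exp(-0.2 * t)) + np.random.default_rng(0).normal(0, 0.5, t.size)

centers, width, w = train_rbfn(t, failures, n_centers=4, width=3.0)
pred = predict_rbfn(t, centers, width, w)
print("training RMSE:", np.sqrt(np.mean((pred - failures) ** 2)))
```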

Generalized Regression Neural Network (GRNN)

The Generalized Regression Neural Network (GRNN) is applied for predicting software reliability more realistically. The GRNN consists of a radial basis layer and a special linear layer used for function


Table 2. Summary of predictions using RBFN

Data Sets | Correlation Coefficient | MAE    | MAPE    | Training RMSE | Testing RMSE
1         | 0.9987                  | 1.5490 | 5.1722  | 1.9533        | 2.5358
2         | 0.9967                  | 0.9678 | 7.4530  | 1.2514        | 1.7548
3         | 0.9796                  | 1.6869 | 33.2174 | 2.1987        | 3.1999
4         | 0.9804                  | 2.2648 | 19.7630 | 3.0123        | 9.6656
5         | 0.9997                  | 4.5427 | 4.5950  | 5.6237        | 6.0624
6         | 0.9852                  | 2.7497 | 18.8771 | 3.6006        | 26.7822
14C       | 0.9919                  | 0.9971 | 15.0227 | 1.3128        | 1.6037
17        | 0.9973                  | 0.6751 | 9.0649  | 0.8045        | 2.1912
27        | 0.9821                  | 1.6518 | 31.9911 | 2.2247        | 3.1574
40        | 0.9984                  | 1.2650 | 6.3919  | 1.6146        | 1.8667
SS1A      | 0.9981                  | 1.5400 | 9.7827  | 1.9771        | 2.0001
SS1B      | 0.9994                  | 2.7626 | 6.9266  | 3.5137        | 3.6471
SS1C      | 0.9996                  | 1.5860 | 3.5959  | 2.1055        | 3.6890
SS2       | 0.9995                  | 1.3217 | 3.6011  | 1.6451        | 1.9507
SS3       | 0.9991                  | 2.4935 | 12.4061 | 3.2642        | 3.3550
SS4       | 0.99971                 | 3.2117 | 15.8869 | 4.2316        | 3.0196

approximation with a sufficient number of hidden neurons. GRNN was introduced by Specht (1991) as a normalized radial basis function (RBF) network in which there is a hidden unit centred at every training case. The radial basis function employs a probability density function such as the Gaussian function, called the kernel. The main drawback of GRNN is that it suffers badly from the curse of dimensionality: GRNN cannot handle irrelevant inputs without major modifications to the basic algorithm. The general regression neural network is an extended form of the probabilistic neural network based on mathematical statistics theory. GRNN is a one-pass learning algorithm with a highly parallel structure that converges to the underlying linear or nonlinear regression surface. GRNN networks are very similar to RBF networks; the main difference is that GRNNs have one neuron for each point in the training file, whereas RBF networks have a variable number of neurons for the training points. GRNNs are more accurate than RBF networks for small to medium size training sets, but impractical for large training datasets. DTREG was used for training and testing the GRNN [Phillip03] [Kohavi95] [Ross93]. Table 3 shows the performance measures of GRNN in terms of MAE, MAPE, and RMSE.
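A GRNN prediction is essentially a kernel-weighted average of the training targets, with the kernel bandwidth (sigma) as its only free parameter. The sketch below illustrates that idea in NumPy; it is a simplified, assumed formulation rather than the DTREG implementation, and the toy data are invented.

```python
import numpy as np

def grnn_predict(x_train, y_train, x_query, sigma=1.0):
    """GRNN-style prediction: Gaussian-kernel weighted average of training targets."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    preds = []
    for x in np.atleast_1d(x_query):
        w = np.exp(-((x - x_train) ** 2) / (2.0 * sigma ** 2))  # one "neuron" per training point
        preds.append(np.sum(w * y_train) / np.sum(w))           # normalized weighted average
    return np.array(preds)

# Toy failure history: testing time vs. cumulative failures
t = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
failures = np.array([4, 9, 13, 19, 22, 27, 30, 34], dtype=float)

print(grnn_predict(t, failures, [4.5, 9.0], sigma=1.5))  # interpolation and one-step extrapolation
```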

Support Vector Machine Modeling

A support vector machine (SVM) takes a set of input data and predicts, for each given input, which of two possible classes the input belongs to, which makes the SVM a non-probabilistic binary linear classifier. The SVM is a learning system that constructs an N-dimensional hyperplane that optimally separates the data set into two categories. In an SVM, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class, called the functional margin; ideally, the larger the


Table 3. Summary of predictions using GRNN

Data Sets | Correlation Coefficient | MAE    | MAPE    | Training RMSE | Testing RMSE
1         | 0.9942                  | 1.0796 | 3.7954  | 1.4956        | 4.2162
2         | 0.9982                  | 0.7014 | 4.5846  | 0.9209        | 1.3877
3         | 0.9901                  | 0.8424 | 14.2916 | 1.5322        | 1.9965
4         | 0.9940                  | 1.3406 | 10.5657 | 1.6734        | 2.2745
5         | 0.9991                  | 7.5037 | 10.9796 | 9.8531        | 18.8977
6         | 0.9822                  | 3.6524 | 37.4181 | 4.5788        | 13.0233
14C       | 0.9972                  | 0.4341 | 3.7649  | 0.7771        | 1.3913
17        | 0.9984                  | 0.4295 | 4.2077  | 0.6152        | 6.8174
27        | 0.9813                  | 1.7477 | 29.1200 | 2.3054        | 8.8480
40        | 0.9992                  | 0.8579 | 4.6111  | 1.1478        | 2.0952
SS1A      | 0.9983                  | 1.4389 | 6.6309  | 1.9032        | 5.5483
SS1B      | 0.9990                  | 3.2674 | 10.9546 | 4.7431        | 13.8030
SS1C      | 0.9984                  | 3.1953 | 11.1501 | 4.4178        | 17.1370
SS2       | 0.9998                  | 0.6926 | 1.7199  | 1.0387        | 2.0952
SS3       | 0.9997                  | 2.1420 | 12.5917 | 3.0196        | 4.2479
SS4       | 0.9998                  | 0.6568 | 1.9642  | 1.0917        | 2.3424

margin, the lower the generalization error of the classifier will be. The basic purpose of SVM modeling is to find the optimal hyperplane that separates clusters of vectors in such a way that cases with one category of the dependent variable lie on one side of the plane and cases with the other category lie on the other side. The support vectors are the vectors near the hyperplane, and SVM modelling finds the oriented hyperplane so that the margin between the support vectors is maximized. Support vector machine models are closely related to ANNs [Jung10] [Bo07]; SVMs can therefore be used as an alternative training method for polynomial, radial basis function, and multilayer perceptron networks by means of a kernel function. In SVMs, the weights of the network are found by solving a quadratic programming problem with linear constraints, rather than by solving a non-convex, unconstrained minimization problem as in classical neural network training. To make optimal use of the failure datasets and avoid overfitting, cross-validation is applied to evaluate the performance of the SVM model [Phillip03] [Kohavi95] [Ross93]. A summary of the prediction measures obtained using the SVM model is presented in Table 4.
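For a regression task such as failure prediction, the classification machinery above is typically replaced by support vector regression. The sketch below is only an assumed stand-in for the tool used in the chapter: it fits scikit-learn's SVR with an RBF kernel to a toy failure history and scores it with cross-validation; the hyperparameters and data are illustrative.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

# Toy failure history: testing time (weeks) vs. cumulative failures
t = np.arange(1, 21, dtype=float).reshape(-1, 1)
failures = 50 * (1 - np.exp(-0.15 * t.ravel()))

model = SVR(kernel="rbf", C=100.0, epsilon=0.5)

# 5-fold cross-validated RMSE (negative MSE is scikit-learn's scoring convention)
scores = cross_val_score(model, t, failures, cv=5, scoring="neg_mean_squared_error")
print("CV RMSE:", np.sqrt(-scores).mean())

model.fit(t, failures)
print("predicted failures at week 21:", model.predict([[21.0]])[0])
```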

Fuzzy Inference System (FIS)

Here, the performance of the software reliability prediction model is measured with the help of a fuzzy inference system. A fuzzy inference system has a neural-network-like structure capable of mapping inputs through input membership functions and associated parameters. The parameters associated with the membership functions and the corresponding output parameters are used to interpret the final output of the system. The FIS structure was generated using the genfis2 function from the Matlab Fuzzy Logic Toolbox. The basic steps of the model are identification of the input variables (cumulative failures and


Table 4. Summary of predictions using SVM

Data Sets | Correlation Coefficient | MAE    | MAPE    | Training RMSE | Testing RMSE
1         | 0.9979                  | 1.9813 | 7.4314  | 2.4943        | 2.6423
2         | 0.9965                  | 0.9681 | 7.4197  | 1.2991        | 2.5887
3         | 0.9877                  | 1.3231 | 18.8908 | 1.7212        | 2.5692
4         | 0.9882                  | 1.9003 | 16.9503 | 2.3417        | 2.5898
5         | 0.9995                  | 5.2685 | 4.9519  | 7.3104        | 13.8895
6         | 0.9928                  | 1.9436 | 13.1532 | 2.5205        | 3.6771
14C       | 0.9850                  | 1.3036 | 28.4965 | 1.8158        | 2.3397
17        | 0.9970                  | 0.7425 | 8.9119  | 0.8777        | 1.1929
27        | 0.9938                  | 0.9821 | 19.6292 | 1.3193        | 4.5056
40        | 0.9987                  | 1.1498 | 6.5999  | 1.4467        | 2.2218
SS1A      | 0.9988                  | 1.1443 | 4.7200  | 1.5311        | 1.9042
SS1B      | 0.9993                  | 2.9369 | 6.3394  | 3.8716        | 5.4320
SS1C      | 0.9987                  | 3.0602 | 8.3793  | 3.9734        | 5.1429
SS2       | 0.9987                  | 2.2645 | 4.9857  | 2.8062        | 3.1739
SS3       | 0.9992                  | 2.4311 | 10.7239 | 3.1103        | 3.3697
SS4       | 0.9989                  | 2.0029 | 4.4581  | 2.6230        | 3.1910

failure interval length) and the output variable (failure time), development of fuzzy profiles of these input/output variables, and definition of the relationships between the input and output variables using the fuzzy inference system (FIS). Moreover, a fuzzy inference system is capable of making decisions under uncertainty, which can be exploited for reliability prediction when it is applied to unknown datasets. The reasoning capability of fuzzy logic can be used as a technique for arriving at a concrete decision on whether or not to release the software under test. Table 5 shows the performance measures of FIS in terms of MAE, MAPE, and RMSE.

Adaptive Neuro-Fuzzy Inference System (ANFIS)

The adaptive neuro-fuzzy inference system (ANFIS) is applied for predicting software reliability in order to assess its improved performance over FIS. In a fuzzy inference system the membership functions are chosen arbitrarily or are fixed, but in the case of ANFIS the membership functions and associated parameters can be chosen automatically, which results in better prediction accuracy. ANFIS supports only Sugeno-type systems and cannot accept all of the customization options allowed by a fuzzy inference system. The ANFIS was generated using the anfis function from the Matlab Fuzzy Logic Toolbox (http://www.mathworks.com). The ANFIS model consists of two inputs, each associated with two membership functions, and four rules with four corresponding output membership functions that were generated to provide a single output. The adaptive neuro-fuzzy inference system can establish an input-output relation with the help of the back-propagation algorithm using a connectionist approach. Thus, ANFIS is a hybrid intelligent system model that combines the low-level computational power of the connectionist approach with the high-level reasoning capability of a fuzzy inference system. Moreover, the neuro-fuzzy inference system is an adaptive model that is used to fine-


Table 5. Summary of predictions using FIS

Data Sets | Correlation Coefficient | MAE    | MAPE   | Training RMSE | Testing RMSE
1         | 0.9948                  | 0.0104 | 0.0856 | 2.6424        | 5.3262
2         | 0.9933                  | 0.0097 | 0.0847 | 1.2037        | 10.1067
3         | 0.9584                  | 0.1490 | 0.2752 | 1.9758        | 52.5764
4         | 0.9886                  | 0.0344 | 0.1263 | 1.5622        | 43.8102
5         | 0.9981                  | 0.0263 | 0.0562 | 6.6603        | 52.6511
6         | 0.9943                  | 0.0531 | 0.1256 | 1.4776        | 31.6250
14C       | 0.9768                  | 0.0138 | 0.0138 | 1.4832        | 14.7064
17        | 0.9984                  | 0.0080 | 0.0432 | 0.4129        | 34.6184
27        | 0.9900                  | 0.0754 | 0.1732 | 1.0979        | 31.0034
40        | 0.9970                  | 0.0159 | 0.0753 | 1.3464        | 57.0903
SS1A      | 0.9984                  | 0.0055 | 0.0541 | 1.2242        | 17.3331
SS1B      | 0.9988                  | 0.5960 | 0.0998 | 3.5984        | 6.8076
SS1C      | 0.9981                  | 0.0554 | 0.1124 | 3.2871        | 15.9369
SS2       | 0.9980                  | 0.0218 | 0.0756 | 2.3476        | 10.5509
SS3       | 0.9985                  | 0.0846 | 0.1374 | 2.9688        | 70.1806
SS4       | 0.9982                  | 0.0048 | 0.0519 | 2.2484        | 24.0991

tune the parameters of fuzzy inference systems for predicting software reliability based on statistical failure data. Table 6 shows the performance measures of ANFIS in terms of MAE, MAPE, and RMSE.

Gene Expression Programming

Gene Expression Programming (GEP), proposed by Candida Ferreira, is a technique for the creation of computer programs [Ferreira01]. GEP uses chromosomes composed of genes organized in a head and a tail. The chromosomes are subjected to modification by means of mutation, inversion, transposition, and recombination. The technique performs with high efficiency that greatly surpasses existing adaptive techniques. The fitness function measures the quality of the predictions in terms of MSE. GEP is applied here to create a computer program for modeling the software failure phenomenon; GEP can be used to create different types of models, such as decision trees, neural networks, and polynomial constructs, using DTReg. A genetic algorithm (GA) is a search procedure that can be used to find a solution in a multidimensional space. Since a GA is many times faster than exhaustive search procedures, GAs can be used for finding solutions to high-dimensional problems. In gene expression programming, symbols are the basic unit and consist of functions, variables, and constants. The symbols used for variables and constants are called terminals, because they have no arguments. A gene is an ordered set of symbols, whereas an ordered set of genes forms a chromosome. In GEP programs, genes typically range from 4 to 20 symbols, and chromosomes are typically built from 2 to 10 genes; a chromosome may also consist of only a single gene. The detailed configuration of the GEP model is shown in Table 7, and Table 8 shows the performance measures of GEP in terms of MAE, MAPE, and RMSE.


Table 6. Summary of predictions using ANFIS

Data Sets | Correlation Coefficient | MAE    | MAPE   | Training RMSE | Testing RMSE
1         | 0.9971                  | 0.0022 | 0.0766 | 1.9828        | 6.5960
2         | 0.9945                  | 0.0119 | 0.0797 | 1.0859        | 40.3818
3         | 0.9743                  | 0.0397 | 0.1996 | 1.5580        | 67.9723
4         | 0.9913                  | 0.0190 | 0.0845 | 1.3676        | 100.674
5         | 0.9992                  | 0.0160 | 0.0843 | 6.3421        | 47.7345
6         | 0.9995                  | 0.0317 | 0.0936 | 1.3121        | 52.7800
14C       | 0.9764                  | 0.0216 | 0.1846 | 1.4970        | 32.5615
17        | 0.9990                  | 0.0022 | 0.0308 | 0.3263        | 30.0982
27        | 0.9930                  | 0.0591 | 0.1543 | 0.9200        | 18.1164
40        | 0.9976                  | 0.0057 | 0.0462 | 1.2104        | 54.2149
SS1A      | 0.9984                  | 0.0002 | 0.0575 | 1.2213        | 16.9866
SS1B      | 0.9988                  | 0.0573 | 0.0979 | 3.5371        | 8.4116
SS1C      | 0.9984                  | 0.0625 | 0.1160 | 3.0309        | 46.9538
SS2       | 0.9982                  | 0.0157 | 0.0674 | 2.2639        | 9.5940
SS3       | 0.9985                  | 0.0850 | 0.1376 | 2.9681        | 66.0502
SS4       | 0.9982                  | 0.0082 | 0.0547 | 2.2424        | 23.6423

Table 7. Parameters of the GEP model

Population size                          | 50
Genes per chromosome                     | 4
Gene head length                         | 8
Maximum generations                      | 2000
Generations required to train the model  | 1822
Generations required for simplification  | 114
Linking function                         | Addition
Fitness function                         | Mean squared error

GMDH Polynomial Network

The group method of data handling (GMDH) is a popular non-linear mathematical modelling method that can be applied to predicting software reliability accurately. GMDH networks are self-organizing networks in which the connections between neurons are selected during training to optimize the network. The number of layers in the network is selected automatically to produce maximum accuracy without overfitting [Yumei11] [Specht91]. GMDH is a heuristic self-organizing method particularly useful for modeling multi-input, single-output data. GMDH-based modeling algorithms are self-organizing because various parameters such as


Table 8. Summary of predictions using GEP

Data Sets | Correlation Coefficient | MAE     | MAPE    | RMSE    | Normalized MSE
1         | 0.9939                  | 3.4958  | 10.1472 | 4.3223  | 0.0121
2         | 0.9677                  | 3.4367  | 23.6446 | 3.9295  | 0.0635
3         | 0.9861                  | 1.4647  | 19.8107 | 1.8369  | 0.0280
4         | 0.9827                  | 2.1421  | 13.8306 | 2.8394  | 0.0344
5         | 0.9802                  | 27.7873 | 19.2261 | 33.6880 | 0.0197
6         | 0.9826                  | 3.2177  | 13.4645 | 3.9094  | 0.0344
14C       | 0.9591                  | 2.6089  | 25.2379 | 2.9399  | 0.0800
17        | 0.9802                  | 1.6656  | 17.1815 | 2.1716  | 0.0392
27        | 0.9839                  | 1.8229  | 19.0914 | 2.1119  | 0.0318
40        | 0.9768                  | 5.3763  | 30.8686 | 6.2392  | 0.0457
SS1A      | 0.9907                  | 3.6408  | 18.2592 | 4.6112  | 0.0203
SS1B      | 0.9978                  | 6.0333  | 15.6955 | 7.1072  | 0.0043
SS1C      | 0.9831                  | 13.0330 | 25.9202 | 15.5901 | 0.0380
SS2       | 0.9912                  | 6.4927  | 23.5289 | 7.3431  | 0.0175
SS3       | 0.9912                  | 9.0939  | 12.9690 | 10.5797 | 0.0173
SS4       | 0.9974                  | 3.2575  | 7.6756  | 4.0682  | 0.0051

the number of neurons, the number of layers, and the actual behavior of each neuron are adjusted during the process of self-organization [Muller99]. Table 9 shows the performance measures of GMDH in terms of MAE, MAPE, and RMSE.

Multivariate Adaptive Regression Splines

Multivariate adaptive regression splines (MARS) is a regression technique that allows the analyst to use automated procedures to fit models to large, complex datasets. MARS, introduced by Friedman (1991), is a novel exploratory modeling technique for software reliability prediction. It automates the building of accurate predictive models for both continuous and binary dependent variables, and a toolkit is available for academic purposes at http://salford-systems.com. MARS is an innovative and flexible modeling tool used in our study to automate the building of software reliability prediction models for a continuous dependent variable (failure rate). MARS is a regression technique for flexible modeling of high-dimensional data. The MARS model takes the form of an expansion in product spline basis functions, where the number of basis functions is determined automatically from the data. This procedure is based on recursive partitioning, as in CART, and shares its ability to capture high-order interactions. MARS has the flexibility to model relationships that involve interactions among a few variables, producing continuous models with continuous derivatives [Jerome Friedman]. MARS is a statistical learning methodology that can be applied for both classification and regression. It is very useful for high-dimensional problems and shows great promise for fitting nonlinear multivariate functions. MARS uses the forward


Table 9. Summary of predictions using GMDH

Data Sets | Correlation Coefficient | MAE    | MAPE    | Training RMSE | Testing RMSE
1         | 0.9978                  | 2.0107 | 8.5399  | 2.6138        | 2.9594
2         | 0.9519                  | 3.3527 | 37.8665 | 4.9206        | 6.9727
3         | 0.9890                  | 1.2595 | 16.3331 | 1.6230        | 20.4716
4         | 0.9773                  | 2.2365 | 15.5510 | 3.3557        | 12.5254
5         | 0.9989                  | 9.0388 | 7.1133  | 11.2219       | 11.4992
6         | 0.9917                  | 2.2275 | 13.0819 | 2.6974        | 3.7341
14C       | 0.9924                  | 0.9197 | 11.6421 | 1.2815        | 1.8306
17        | 0.9941                  | 0.8502 | 8.1355  | 1.1929        | 15.0646
27        | 0.3635                  | 4.9880 | 31.7350 | 18.1730       | 10.7460
40        | 0.9982                  | 1.4051 | 8.8940  | 1.7383        | 2.1096
SS1A      | 0.9984                  | 1.4255 | 6.5131  | 1.7766        | 2.2924
SS1B      | 0.9986                  | 4.3323 | 6.1238  | 5.5760        | 6.0338
SS1C      | 0.9978                  | 4.1210 | 10.6744 | 5.1990        | 5.1088
SS2       | 0.9989                  | 1.9867 | 4.7054  | 2.5082        | 3.0445
SS3       | 0.9991                  | 2.5299 | 12.7349 | 3.2275        | 3.1743
SS4       | 0.9986                  | 2.4826 | 5.2845  | 2.9648        | 3.2984

and backward stepwise algorithms for estimating the model functions [Phillip, 2003; Kohavi, 1995; Ross, 1993]. Table 10 shows the performance measures of MARS in terms of MAE, MAPE, and RMSE.

Training and Validation Method

The machine learning methods for software reliability prediction are validated using the failure datasets taken from DACS listed in Table 1. The input xi is the cumulative number of failures detected during software testing time ti, and the model predicts x(i+1) as output. Each dataset is divided into two parts: training data and testing data. The training data are applied to the prediction model and the parameters that lead to the best accuracy are then selected. In general, separate training and validation datasets are desired for testing the accuracy of a prediction model. However, some of the databases used for modeling in this study, such as DS2, DS3, DS4, DS6, and DS40, are small, so testing is performed on relatively small samples; as a result, the goodness-of-fit results may be sensitive to random variation in the subsets selected for training and testing. We therefore apply k-fold cross-validation, an alternative procedure that allows more of the data to be used for fitting and testing. In k-fold cross-validation, the entire dataset is randomly divided into k subsets (here k = 10); each time, one of the k subsets is held out to validate the prediction model and the remaining (k − 1) subsets are used for training. Thus, to maximize the utilization of the failure datasets, cross-validation repeatedly resamples the same dataset by randomly reordering it and then splitting it into 10 folds of equal length [Kohavi95]. Figure 1 shows the overview of the software reliability prediction model using machine learning methods.
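A compact illustration of this 10-fold scheme with scikit-learn is sketched below; the model choice and the toy failure series are placeholder assumptions, since the chapter's experiments were run in other tools.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVR

# Toy failure history: x_i = cumulative failures at step i, target = x_{i+1}
cumulative = 50 * (1 - np.exp(-0.1 * np.arange(1, 41)))
X = cumulative[:-1].reshape(-1, 1)   # x_i
y = cumulative[1:]                   # x_{i+1}

rmses = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = SVR(kernel="rbf", C=100.0)
    model.fit(X[train_idx], y[train_idx])             # train on k-1 folds
    pred = model.predict(X[test_idx])                  # validate on the held-out fold
    rmses.append(np.sqrt(np.mean((pred - y[test_idx]) ** 2)))

print("mean 10-fold RMSE:", np.mean(rmses))
```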


Table 10. Summary of predictions using MARS

| Data Sets | Correlation Coefficient | MAE | MAPE | RMSE | MSE |
|---|---|---|---|---|---|
| 1 | 0.9975 | 1.5407 | 0.0669 | 1.9376 | 3.9891 |
| 2 | 0.9920 | 1.1501 | 0.1045 | 1.3900 | 1.9322 |
| 3 | 0.9536 | 1.8421 | 0.3756 | 2.3618 | 5.8879 |
| 4 | 0.9912 | 1.1893 | 0.0868 | 1.4292 | 2.0426 |
| 5 | 0.9991 | 5.5250 | 0.0410 | 6.9574 | 48.4057 |
| 6 | 0.9948 | 1.2094 | 0.1249 | 1.5154 | 2.2965 |
| 14C | 0.9861 | 0.9144 | 0.1426 | 1.2228 | 1.4954 |
| 17 | 0.9910 | 0.8290 | 0.1225 | 1.0355 | 1.0724 |
| 27 | 0.9780 | 1.4534 | 0.2000 | 1.7519 | 3.0692 |
| 40 | 0.9983 | 0.9958 | 0.0542 | 1.2012 | 1.4430 |
| SS1A | 0.9985 | 0.9023 | 0.0554 | 1.2327 | 1.5197 |
| SS1B | 0.9978 | 3.9139 | 0.0897 | 5.0102 | 25.1027 |
| SS1C | 0.9992 | 1.7265 | 0.0812 | 2.1696 | 4.7075 |
| SS2 | 0.9988 | 1.4442 | 0.0335 | 1.8520 | 3.4300 |
| SS3 | 0.9978 | 2.1162 | 0.0813 | 2.6816 | 7.1912 |
| SS4 | 0.9986 | 1.5994 | 0.0432 | 2.0684 | 4.2786 |

ANALYSIS RESULTS

This section presents a summary of the results for all sixteen datasets using the statistical and machine learning methods in terms of CC, MAE, MAPE, and RMSE. The statistics shown in Table 2 to Table 11 summarize how accurately each model is able to predict software reliability realistically. Figure 1 is the graphical representation of the performance analysis of the statistical and machine learning methodology

Table 11. Summary of predictions using statistical and machine learning techniques

| Methods | Training RMSE | Validation RMSE |
|---|---|---|
| LR | 8.0298 | 8.1615 |
| RBFN | 2.5208 | 4.7800 |
| SVM | 2.5663 | 3.7768 |
| FIS | 2.2210 | 29.9013 |
| ANFIS | 2.0541 | 38.9230 |
| GRNN | 2.5696 | 6.6326 |
| GMDH | 4.3793 | 6.9290 |
| GEP | 7.0804 | * |
| MARS | 2.2385 | * |

*A blank entry shows that the method is not applicable or does not hold for a particular parameter.


for software reliability prediction applied in this study. The analysis examines in detail the relationship between the statistical method (LR) and the machine learning methods (RBFN, GRNN, GMDH, SVM, FIS, ANFIS, GEP, and MARS) applied for software reliability prediction; the corresponding results are shown in Table 1 to Table 11.

SUMMARY

The application of machine-learning techniques for the prediction of software reliability in place of traditional statistical techniques has shown remarkable improvement in recent years. It is therefore desirable to relate classical statistical techniques and modern machine learning approaches in practice. Accurate software reliability prediction can not only enable developers to improve the quality of software but also provide useful information to help them plan valuable resources. This chapter employs multivariate adaptive regression splines (MARS), a novel exploratory modeling technique, for software reliability prediction. The prediction accuracy of MARS is evaluated and compared against linear regression models (LR), radial basis function networks (RBFN), generalized regression neural networks (GRNN), the group method of data handling (GMDH), support vector machines (SVM), fuzzy inference systems (FIS), adaptive neuro-fuzzy inference systems (ANFIS), and gene expression programming (GEP). The experimental results suggest that MARS can predict software reliability more accurately than the other typical modeling techniques considered (such as the neural network, FIS, and ANFIS models). The MARS model is shown to be more reliable and accurate, with better generalization capability and less dependence on the sample size. The performance of the machine learning methods has been evaluated using sixteen empirical databases extracted from DACS to predict the failure intensity of the system. Based on the experimental results, it can be concluded that such models can support reliability estimation and prediction by focusing on a project's failure dataset from a realistic environment. Some specific observations on predicting software reliability using machine learning are summarized as follows:

1. Linear regression (LR) is a good choice for minimizing the sum of squared errors when fitting a straight line to a set of data points. The problem with LR, however, is that it is restricted to fitting linear functions and rarely works well on real-world data in comparison with machine learning techniques such as ANNs, SVMs, ANFIS, GEP, GMDH, and MARS, which can model non-linear functions more accurately. On the other hand, a linear regression model is usually much faster and requires less storage than the machine learning methods.
2. The rigorous experiments conducted show that it is easy to design reliability growth models of varying complexity for a given dataset using ANNs. However, the effectiveness of neural-network-based prediction models depends on the behavior of the dataset, which is fluctuating by nature, so ANNs suffer from overfitting when dealing with unknown, real-life, large datasets. Overfitting usually occurs when the parameters of a model are tuned in such a way that the model fits the training data well but has poor accuracy when applied to separate data not used for training. Radial basis function (RBF) networks and GRNNs are prone to overfitting on unknown failure data.
3. ANN models also face the problem of learning in a dynamic environment, which requires a larger sample size for training and testing the network along with more independent variables. Based on the experimental results shown in Table 4 to Table 6, RBFN and GRNN do not fit better in terms of MAPE and RMSE.
4. Modern machine learning methods, particularly ANFIS, SVMs, GMDH, GEP, and MARS, can approximate continuous functions more accurately, which implies that these approximated functions can be employed effectively for the estimation and prediction of the cumulative failures observed by time t in software reliability modeling.
5. The robustness and validity of machine-learning-based models make them easier to use in real-world applications, such as modeling complex failure phenomena for accurate software reliability prediction. Moreover, ANFIS, GMDH, GEP, and MARS models are better suited to modeling nonlinear functional relationships, which are difficult to model with the classical reliability prediction techniques used in practice.
6. Some machine learning techniques generalize well even in high-dimensional spaces and with small training datasets. Therefore, software reliability prediction models can be built much earlier with SVM, GMDH, GEP, and MARS than with other conventional techniques, with relatively good performance, and can be applied extensively in many fields of software engineering.
7. The inclusion of fuzzy logic systems for predicting software reliability has led to more efficient and decisive systems. The fuzzy inference system and the adaptive neuro-fuzzy inference system have therefore been found to be more efficient than the classical prediction techniques. The future challenge is to make them even more efficient by incorporating newer techniques that improve the prediction rate and require fewer computational resources.
8. The main advantage of modeling statistical software failure data is decision making about the software system, that is, whether to release the system for deployment or to continue further testing. Modern machine learning techniques such as FIS, ANFIS, GEP, and MARS can therefore be utilized as a more powerful tool for reliability prediction than a conventional expert system.
9. The group method of data handling applied to software reliability prediction is capable of self-organizing the networks for optimization. GMDH is a heuristic self-organizing method that can be useful in characterizing the failure data for accurate prediction. GMDH-based modeling algorithms are self-organizing and can adjust various parameters, such as the number of neurons, the number of layers, and the actual behavior of each neuron, resulting in better accuracy and adaptability to unknown failure datasets.
10. Gene expression programming is a promising technique for creating computer programs that predict software reliability with more precision using chromosomes composed of genes. GEP achieves high accuracy with the help of mutation, inversion, transposition, and recombination operators, which can be fine-tuned to outperform existing adaptive techniques in practice.
11. Multivariate adaptive regression splines is an innovative and flexible modeling technique that can be applied to build software reliability prediction models automatically. Moreover, MARS is flexible enough to model relationships that involve interactions among several variables, such as the failure phenomena modeled for software reliability prediction. It is very useful for high-dimensional problems and shows great promise for fitting nonlinear multivariate functions.
Finally, the ability of modern machine learning techniques to model complex non-linear relationships, together with their capability of approximating any measurable function, makes them attractive prospects for solving regression tasks without having to build an explicit model of the system.


FUTURE RESEARCH DIRECTIONS

The work on software reliability prediction methods presented in this chapter may be improved further by addressing the following issues:

• The current practices of software reliability engineering (SRE) collect failure data during the integration testing or system testing phases. Failure data collected during these late testing phases may be too late for fundamental design changes.
• The failure data collected during in-house testing may be limited and therefore may not represent the failures that would be uncovered in the actual operational environment. This is particularly important for high-quality software systems, which require extensive and wide-range testing. Reliability estimation and prediction using such restricted testing data may cause accuracy problems, although it is understood that exhaustive testing is not feasible.
• The current practices of SRE are based on various unrealistic assumptions that make the reliability estimates too optimistic relative to real situations.

Thus, although SRE has been around for decades, credible software reliability modeling techniques are still urgently needed, particularly for modern software systems using more intelligent soft computing techniques.

CONCLUSION

Software reliability growth models have proved to be a very effective technique for the quantitative measurement of software quality, particularly the non-parametric software reliability prediction method based on the ANN approach. We have also discussed the usefulness of the connectionist approach using neural network models, which is more flexible and makes less restrictive assumptions in a realistic environment. The ANN technique requires only the failure history as input and then develops its own internal model of the failure process using the back-propagation learning algorithm, in which the network weights are adapted using errors propagated back through the output layer of the network. The ability of neural networks to model nonlinear patterns and learn from the statistical failure data therefore makes them a valuable alternative methodology for characterizing the failure process, one that realistically yields smaller prediction errors than the conventional parametric models. Specifically, from a researcher's point of view, the artificial neural network approach offers the distinct advantage for software reliability assessment that model development is automatic, using a training algorithm such as back propagation with a feed-forward neural network.

REFERENCES Aggarwal, K. K., Singh, Y., Kaur, A., & Malhotra, R. (2008). Empirical analysis for investigating the effect of object-oriented metrics on fault proneness: A replicated case study. Software Process Improvement and Practice, 14(1), 39–62. doi:10.1002/spip.389 Breiman, L. (2001). Random Forests. Machine Learning, 35(1), 5–32. doi:10.1023/A:1010933404324


Funatsu, K. (2011). Knowledge-Oriented Applications in Data Mining. In Tech. Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques. India: Morgan Kaufmann Publishers. Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer. doi:10.1007/978-0-387-21606-5 Ho, S., Xie, M., & Goh, T. (2003). A study of the connectionist models for software reliability prediction. Computers & Mathematics with Applications, 46(7), 1037–1045. doi:10.1016/S0898-1221(03)90117-9 Karunanithi, N., Whitley, D., & Malaiya, Y. (1992). Prediction of software reliability using connectionist models. IEEE Transactions on Software Engineering, 18(7), 563–574. doi:10.1109/32.148475 Kohavi, R. (1995). The power of decision tables. Proceedings of the Eighth European Conference on Machine Learning (ECML-95), Heraclion, Greece (pp. 174-189). Kuei, C., Yeu, H., & Tzai, L. (2008). A study of software reliability growth from the perspective of learning effects. Reliability Engineering & System Safety, 93(10), 1410–1421. doi:10.1016/j.ress.2007.11.004 Lyu, M. R. (1999). Handbook of Software Reliability Engineering (pp. 131–151). India: McGraw Hill. Malhotra, R., Singh, Y., & Kaur, A. (2009). Comparative analysis of regression and machine learning methods for predicting fault proneness models. International Journal of Computer Applications in Technology, 35(2), 183–193. Mueller, J., & Lemke, F. (1999). Self-Organizing Data Mining: An Intelligent Approach to Extract Knowledge from Data. Musa, D. (2009). Software Reliability Engineering: More Reliable Software Faster and Cheaper (2nd ed.). India: McGraw-Hill. Musa. (n. d.). Software Life Cycle Empirical/Experience Database (SLED). Data & Analysis Center for Software (DACS). Retrieved from http://www.dacs.org Raj, K., & Ravi, V. (2008). Software reliability prediction by using soft computing techniques. Journal of Systems and Software, 81(4), 576–583. doi:10.1016/jss.2007.05.005 Ross, Q. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufman Publishers. Salford predictive modelling system. (n. d.). Retrieved from http//www.salford-systems.com Scott, E., & Christian, L. (1991). The Cascade-Correlation Learning Architecture. Sherrod, P.H. (2003). DTReg predictive modeling software. Retrieved from http://www.dtreg.com Singh, Y., Kaur, A., & Malhotra, R. (2009). Application of support vector machine to predict fault prone classes. ACM SIGSOFT Software Engineering Notes, 34(1). doi:http://doi.acm.org/10.1145/1457516.1457529 Singh, Y., & Kumar, P. (2010). A software reliability growth model for three-tier client-server system. International Journal of Computers and Applications, 1(13), 9–16. doi:10.5120/289-451


Singh, Y., & Kumar, P. (2010). Application of feed-forward networks for software reliability prediction. ACM SIGSOFT Software Engineering Notes, 35(5), 1-6. DOI:10.1145/1838687.1838709 Singh, Y., & Kumar, P. (2010). Determination of software release instant of three-tier client server software system. International Journal of Software Engineering, 1(3), 51–62. Singh, Y., & Kumar, P. (2010). Prediction of Software Reliability using Feed Forward Neural Networks. Proceedings of the 2010 International Conference on Computational Intelligence and Software Engineering (CiSE), Wuhan, China. doi:10.1109/CISE.2010.5677251 Sitte, R. (1999). Comparison of software reliability growth predictions: Neural Networks vs. Parametric Recalibration. IEEE Transactions on Reliability, 48(3), 285–291. doi:10.1109/24.799900 Tharwat, A., Gaber, T., & Hassanien, A. E. (2014, November). Cattle identification based on muzzle images using gabor features and SVM classifier. Proceedings of the International Conference on Advanced Machine Learning Technologies and Applications (pp. 236-247). Springer International Publishing. doi:10.1007/978-3-319-13461-1_23 Tharwat, A., Hassanien, A. E., & Elnaghi, B. E. (2016). A BA-based algorithm for parameter optimization of Support Vector Machine. Pattern Recognition Letters. doi:10.1016/j.patrec.2016.10.007 Witten, I., & Frank, E. (2011). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations (3rd ed.). San Francisco, CA: Morgan Kaufman. Zheng, J. (2009). Predicting software reliability with neural network ensembles. Expert Systems with Applications, 36(2), 216–222. doi:10.1016/j.eswa.2007.12.029

KEY TERMS AND DEFINITIONS Availability: The probability that a system or a capability of a system is functional at a given time in a specified environment or the fraction of time during which a system is functioning acceptably. Basic Failure Intensity: Failure intensity that would exist at start of system test for new operations for a project without reviews (requirement, design, or code) or fault tolerance. Bugs: The mistakes committed by the developers while coding the program(s). Client: A node that makes request of services in a network or that uses resources available through the servers. Client-Server Computing: Processing capability or available information distributed across multiple nodes. Constant Failure Rate: The period during which failures of some units occur at an approximately uniform rate. Corrective Action: A documented design process or materials changes implemented and validated to correct the cause of a failure. Correlation: A statistical technique that determines the relationship between two variables (dependent and independent).


Data: The representation of facts or instructions in a manner suitable for processing by computers or analyzing by human. Debugging: The process of detection, location, and correction of errors or bugs in hardware or software systems. Dependent Variable: The variable quantity in an experimental setting that depends on the action of the independent variable. Developed Code: New or modified executable delivered instructions. Developer: A person or an individual or team assigned a particular task. Deviation: Any departure of system behavior in execution from expected behavior. Error: Incorrect or missing action by a person or persons that causes a fault in a program. Error may be a syntax error or misunderstanding of specifications, or logical errors. An error may lead to one or more faults. Errors: Human actions that result in the software containing a fault. Examples of such faults are the omission or misinterpretation of the user’s requirements, a coding error etc. Estimation: Determination of software reliability model parameters and quantities from failure data. Execution Time: The time a processor(s) is / is executing non-filler operations in execution hour. Failure Category: The set of failures that have the same kind of impact on users such as safety or security. Failure Density: At any point in the life of a system, the incremental change in the number of failures per associated incremental change in time. Failure Intensity: Failures per time unit, is an alternative way of expressing reliability. Failure Rate: At a particular time, the rate of change of the number of units that have failed divided by the number of units surviving. Failure Time: Accumulated elapsed time at which a failure occurs. Failure: A failures occurs when a fault executes. It is the departure of output of the program from the expected output. Thus, failure is dynamic. Fault: Defect in system that causes a failure when executed. A software fault is a defect in the code. Thus, a fault is the representation of an error, where representation is the mode of expression such as narrative text, data flow diagrams, Entity-Relationships diagrams, or source code. Moreover, a fault may lead to many failures. That is, a particular fault may cause different failures depending on how it has been exercised. Prediction: The determination of software reliability model parameters and quantities from characteristics of the software product and development process. Probability: The fraction of occasions on which a specified value or set of values of a quantity occurs, out of all possible values for that quantity. Product: A software system that is sold to the customers. Program: A set of complete instructions (operators with operands specified) that executes within a single computer and relates to the accomplishment of some major function. Reliability: Reliability is the probability or the capability of a system that will continue to function without failure for a specified period in a specified environment. The period may be specified in natural or time units. Software Engineering: A systematic approach to the development and maintenance of software that begins with analysis of the software’s goals of purposes.


Software Error: An error made by a programmer or designer, e.g., a typographical error, an incorrect numerical value, an omission, etc. Software Failure: A failure that occurs when the user perceives that the software has ceased to deliver the expected result with respect to the specified input values. The user may need to identify the severity of the failure, such as catastrophic, critical, major, or minor, depending on its impact on the system. Software Fault: A fault in the software caused by an error; software faults can remain undetected until a software failure results.


Chapter 13

Fuzzy-Based Approach for Reducing the Impacts of Climate Changes on Agricultural Crops

Ahmed M. Gadallah, Cairo University, Egypt
Assem H. Mohammed, Cairo University, Egypt

ABSTRACT

Climate changes play a significant role in the crop plantation process. Such changes affect the suitability of planting many crops on their traditional plantation dates in a given place. Conversely, many of these crops become more suitable for planting on other new dates in their traditional places, or in other new places, as the climate changes. This chapter presents a fuzzy-based approach for optimizing crop planting dates under the ongoing changes in climate at a given place. The proposed approach incorporates four phases: the first phase is concerned with climate data preparation; the second with defining suitability membership functions; the third is responsible for automatic fuzzy clustering; and the fourth performs fuzzy selection and optimization of the more suitable plantation dates for each crop. The chapter consists of an introduction, related works, the proposed approach, a first case study, a second case study, a discussion of the results, future research directions, and finally the chapter conclusion.

INTRODUCTION

Generally, a spatial database can be defined as a set of objects located in some reference space that models some aspects of an enterprise in the real world. A spatial agro-climatic database is a spatial database that contains the data of the climate variables for some places during specific periods of time. Often, the data stored in such databases need to be searched in a more flexible, human-like manner.


For example, there is a need for a query approach that allows queries such as "select the more suitable dates for planting a specific crop in Giza with a matching degree around 88%". Commonly, the structured query language (SQL) is the most widely used language for querying relational databases. It was initially presented by Chamberlin and Boyce for data retrieval and manipulation. Traditional SQL uses two-valued (crisp) logic in generating and processing a query statement. However, most real-world problems abound in uncertainty, and any attempt to model aspects of the world should include mechanisms for handling uncertainty, as illustrated in Zhang et al. (2002) and Beaubouef and Petry (2010). Mohammed et al. (2014), Sabour et al. (2014), Kumar and Pradheep (2016), and Werro (2015) show how to avoid this limitation of SQL when dealing with uncertainty by using the flexibility of fuzzy logic. Fuzzy set theory was initially proposed by Zadeh (1965), and since then much research in many fields has built on the ideas of fuzzy logic. In consequence, fuzzy queries have appeared over the last 30 years to cope with the necessity of softening the sharpness of Boolean logic in database queries. Commonly, a fuzzy query system can be defined as an interface that lets users retrieve information from a database using human linguistic words, which are qualitative by nature, as presented in Branco et al. (2005). This area of research is still of interest, as existing approaches need further improvement. This chapter proposes a fuzzy-based query approach for querying a spatial agro-climatic database using human-like queries. Such a query approach helps in determining the more suitable planting dates of crops in a specific place. Nowadays, climate change is among the phenomena most affecting almost all aspects of our life. Climate change is a change in the statistical distribution of weather patterns when that change lasts for an extended period of time. It is caused by factors such as variations in the solar radiation received by the Earth, plate tectonics, and volcanic eruptions, as shown in America's Climate Choices (2010). It also has significant impacts on the conditions affecting agriculture. For any crop, there exist optimal climate conditions for its growth, so the effect of an increase in temperature depends on how far it is from the optimal temperature for the crop's growth and production. Accordingly, Karl et al. (2009) showed that the ongoing changes in climate variables such as humidity and temperature affect the suitability of the traditional plantation dates of some crops in a given place.

RELATED WORK

Many approaches have been proposed to address the effects of climate changes on agriculture. Some approaches aim mainly to show the impact of climate change on crop production, such as Defang et al. (2014), Bizikova et al. (2015), and Hamid et al. (2013). They show that agricultural productivity has been affected by climate changes, and they advise adapting planting dates or planting crops that are less sensitive to climate changes in order to overcome such effects. Climate changes also greatly affect the water resources in regions where agriculture depends on rain. As water resources are one of the most important parameters in the plantation process, approaches have been developed to adapt crop planting dates to climate changes, as in Moussa et al. (2014) and Mohaddes and Mohayidin (2008). Another approach, for crop yield forecasting, was presented in Kumar (2011) to map the relations between climate data and crop yield; this technique is based on 27 years of time-series yield and weather data. Other approaches, such as fuzzy-based decision support systems for evaluating land suitability and selecting the most suitable crops to be planted, are provided in Hartati and Sitanggang (2010) and


Rajeshwar et al. (2013). In these works, fuzzy rule-based systems were developed for evaluating land suitability and selecting the appropriate crops to be planted, considering the decision maker's requirements in crop selection and making the best use of the powerful reasoning and explanation capabilities of a DSS. Unfortunately, none of the above algorithms provides a weight or matching measure for the suitability of the selected plantation period of the underlying crop after adapting to climate changes. Also, none of them clusters the climate data; this means that the algorithm evaluates the suitability of a period day by day every time a suitable planting period is sought for any crop, which takes a long time. Some algorithms depend on the average values of the climate data, as in Mohaddes and Mohayidin (2008), Kumar (2011), and Hartati and Sitanggang (2010), which is inappropriate because most crops have minimum and maximum suitable values of the climate variables. Generally, the temperature degree is one of the most important climate variables affecting crop plantation in Egypt. This chapter presents a new approach that handles the effects of the change in temperature on the plantation dates of the squash crop in the governorate of Alexandria. It performs automatic fuzzy clustering on the predicted climate data of the next year, where each cluster (period) consists of consecutive days whose number is greater than or equal to the age of the crop under study. Consequently, it becomes convenient to generate a fuzzy query statement to retrieve the clusters of days matching the suitable conditions of the underlying crop. The selection of suitable clusters of days also takes into account the maximum and minimum values of the climate variables suitable for the plantation of the underlying crop. Finally, the retrieved clusters of days are optimized to enhance their suitability degrees, and the optimized clusters are ranked according to their suitability degree for the plantation of the underlying crop.

THE PROPOSED APPROACH

The proposed approach aims to select and optimize the more suitable plantation periods for a specific crop. It is applied to the spatial agro-climatic database of Egypt. At first, it divides the incoming year into fuzzy clusters (periods) with respect to the daily values of climatic variables such as temperature, humidity, and sunshine. After that, it allows a set of suitability membership functions to be defined, representing the suitable climatic requirements of the crop. Finally, the proposed approach performs a fuzzy query on the clustered data to find the clusters more suitable for planting the given crop. The architecture of the proposed fuzzy query approach for such selection and optimization is shown in Figure 1. It consists of four main phases:

1. Climate Data Preparation Phase: In this phase, the existing climate data of Egypt is prepared for use. This data is then used to predict the values of the climate variables of Egypt for a set of incoming years, based on the historical spatial agro-climatic database.
2. Define Suitability Membership Functions Phase: This phase defines a set of fuzzy membership functions describing the suitable values of the climate variables for the crop. These functions are used to evaluate the suitability of each day for planting the crop of interest.
3. Automatic Fuzzy Clustering Phase: In this phase, the expected data of the next year is clustered into continuous periods. Each period size is equal to or greater than the period required by the crop.


Figure 1. The architecture of the proposed approach

4. Fuzzy Selection and Optimization Phase: This phase performs a fuzzy query selection from the clustered periods of days based on the crop climate requirements defined in phase 2. After that, an optimization operation is applied to each resulting cluster, aiming to increase its suitability degree as much as possible.

Climate Data Preparation Phase

This phase uses the climate data of Egypt from 2009 to 2013 obtained from the Central Laboratory for Agricultural Expert Systems (CLAES); some data are also gathered from the World Weather Online website, "Worldweatheronline" (2013). This phase includes the following two steps:

1. The first step checks for missing or extreme values. If a value of any climate variable is missing, the proposed approach predicts it: it takes the values of that climate variable on the same day in a set of previous and subsequent years, and a weighted average of these values is computed to represent the missing value.
2. The second step predicts the climate data of the incoming year using a prediction method called climatology, presented in Atmos (2013). In the literature, climatology is a simple forecasting method that averages weather statistics collected over many historical years. For example, equation (1) is used to predict the value of the temperature on a specific day of the incoming year.


$$
PT_d = \alpha_1 \, AT_{d,\,1\text{yr}} + \alpha_2 \, AT_{d,\,2\text{yr}} + \cdots + \alpha_n \, AT_{d,\,n\text{yr}} \qquad (1)
$$

where $PT_d$ is the predicted temperature on the day with date $d$; $AT_{d,\,n\text{yr}}$ denotes the actual temperature on the same day $n$ years earlier; and $\alpha_1, \alpha_2, \ldots, \alpha_n$ are weighting factors such that $\alpha_1 + \alpha_2 + \cdots + \alpha_n = 1$.
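A minimal Python sketch of equation (1) is shown below; the function name, the five-year horizon, and the example weights are illustrative assumptions rather than values taken from the chapter.

```python
def predict_climatology(same_day_history, weights):
    """Weighted climatology forecast (equation 1).

    same_day_history -- values observed on this calendar day in the previous
                        n years, ordered from the most recent year backwards.
    weights          -- alpha_1 .. alpha_n; they must sum to 1.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("the weighting factors must sum to 1")
    return sum(w * v for w, v in zip(weights, same_day_history))

# Hypothetical temperatures for one calendar day in the five preceding years,
# with larger weights given to the more recent years.
history = [21.3, 20.8, 19.9, 20.1, 19.5]
alphas = [0.30, 0.25, 0.20, 0.15, 0.10]
print(predict_climatology(history, alphas))   # predicted temperature for next year
```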

Defining Suitability Membership Functions Phase

This phase defines a set of fuzzy membership functions describing the suitable climatic requirements of a crop for the underlying climate variables at a specific place. It is divided into the following two steps:

1. Determining the suitable values of the climatic requirements for the plantation of the crop under study, either from experts or from references.
2. Defining a set of fuzzy membership functions based on the climatic requirement values determined in the previous step.

For example, based on the data gathered from Kenanaonline (2013) and Caae-eg (2013), the suitability of the temperature degree for a specific crop can be defined as follows:

• Temperature degrees in [b, c] are the most suitable, with a full matching degree of 1;
• Temperature degrees in [a, b[ and [c, d[ are partially suitable, with a matching degree in [0, 1[; and
• Temperature degrees greater than d or less than a are not suitable at all, with a matching degree of 0.

Consequently, the above description of the suitability of temperature degrees for a specific crop can easily be defined as the trapezoidal fuzzy membership function depicted in Figure 2 and equation (2). Also, the suitability of the humidity climate variable can be defined as follows:

• Humidity degrees in [b, c] are the most suitable, with a full matching degree of 1;
• Humidity degrees in [a, b[ and [c, d[ are partially suitable, with a matching degree in [0, 1[; and
• Humidity degrees greater than d or less than a are not suitable at all, with a matching degree of 0.

In consequence, the above description of the suitability of humidity degrees for a specific crop can easily be defined using the trapezoidal fuzzy membership function depicted in Figure 3 and equation (3). Finally, the suitability of the sunshine climate variable can be defined as follows:

• Sunshine values in [b, c] are the most suitable, with a full matching degree of 1;
• Sunshine values in [a, b[ and [c, d[ are partially suitable, with a matching degree in [0, 1[; and
• Sunshine values greater than d or less than a are not suitable at all, with a matching degree of 0.

Accordingly, the suitability of sunshine values for a specific crop can easily be defined as the trapezoidal fuzzy membership function depicted in Figure 4 and equation (4).


Figure 2. The fuzzy membership function of the squash suitable temperature degrees

$$
\mu_{\text{temp\_suitability}}(x) =
\begin{cases}
0 & \text{if } x < a \\
\dfrac{x-a}{b-a} & \text{if } a \le x < b \\
1 & \text{if } b \le x < c \\
\dfrac{d-x}{d-c} & \text{if } c \le x < d \\
0 & \text{if } x \ge d
\end{cases}
\qquad (2)
$$

where a = 6, b = 16, c = 25, and d = 32.

$$
\mu_{\text{humid\_suitability}}(x) =
\begin{cases}
0 & \text{if } x < a \\
\dfrac{x-a}{b-a} & \text{if } a \le x < b \\
1 & \text{if } b \le x < c \\
\dfrac{d-x}{d-c} & \text{if } c \le x < d \\
0 & \text{if } x \ge d
\end{cases}
\qquad (3)
$$

where a = 30, b = 65, c = 70, and d = 100.

Figure 3. The fuzzy membership function of the squash suitable humidity degrees


Figure 4. The fuzzy membership function of the squash suitable sunshine degrees

$$
\mu_{\text{sunshine\_suitability}}(x) =
\begin{cases}
0 & \text{if } x < a \\
\dfrac{x-a}{b-a} & \text{if } a \le x < b \\
1 & \text{if } b \le x < c \\
\dfrac{d-x}{d-c} & \text{if } c \le x < d \\
0 & \text{if } x \ge d
\end{cases}
\qquad (4)
$$

where a = 9, b = 12, c = 13, and d = 14.
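The three membership functions above share the same trapezoidal shape, so a single helper can evaluate all of them. The sketch below is an illustrative Python implementation using the parameter values of equations (2)-(4); the final min() line only suggests one plausible way to combine the three degrees for a single day and is not prescribed by the chapter.

```python
def trapezoidal(x, a, b, c, d):
    """Trapezoidal suitability membership as in equations (2)-(4)."""
    if x < a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x < c:
        return 1.0
    return (d - x) / (d - c)

# Squash suitability functions with the parameters of equations (2)-(4).
temp_suitability = lambda t: trapezoidal(t, 6, 16, 25, 32)
humid_suitability = lambda h: trapezoidal(h, 30, 65, 70, 100)
sunshine_suitability = lambda s: trapezoidal(s, 9, 12, 13, 14)

# Suitability of one day (20 degrees, 68% humidity, 12.5 sunshine hours); the
# minimum of the three degrees mirrors the min() used later in Algorithm 2.
day_degree = min(temp_suitability(20), humid_suitability(68), sunshine_suitability(12.5))
print(day_degree)
```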

Automatic Fuzzy Clustering Phase

Commonly, data clustering divides the data elements into clusters so that the data items in the same cluster are as similar as possible, while items in different clusters are as dissimilar as possible. In hard clustering, data items are divided into distinct clusters and each data element belongs to only one cluster. In fuzzy clustering, on the other hand, a data item can belong to one or more clusters; accordingly, the same data item may have different membership values in different clusters, and these values indicate the relationship between the data item and a particular cluster. Fuzzy clustering is thus the process of assigning membership values to the data items, which are then used to assign the data items to one or more clusters. In this phase, the climate data of a year for the governorate under study is divided into fuzzy clusters, see Algorithm 1. Each fuzzy cluster consists of a number of consecutive days at least equal to the crop lifetime, and there is an accepted threshold for the standard deviation within each resulting fuzzy cluster.

Fuzzy Selection and Optimization Phase

This phase takes as inputs the clusters (periods) resulting from the fuzzy clustering phase, the temperature and humidity requirements of the crop plantation, and an accepted threshold for suitability. Then it calculates


Algorithm 1. Fuzzy clustering algorithm

Input: climate data of the governorate under study for one year, the permissible standard deviation, and the accepted threshold for the standard-deviation membership function.
Output: clustered periods.

    Cluster_items = empty
    For each day in year_data_table
        selected_day = day
        If Cluster_items is empty
            Add selected_day to Cluster_items
        Else
            Temp_cluster_items = Cluster_items + selected_day
            New_cluster_items_variance = variance(Temp_cluster_items)
            Membership_value = fuzzy_mem_function(accepted_variance, Max_accepted_variance, New_cluster_items_variance)
            // check the value against the acceptance threshold
            If Membership_value >= acceptance_threshold
                Add selected_day to Cluster_items
            Else
                Save_new_cluster(Cluster_items)
                Cluster_items = empty        // start a new cluster
            End if
        End if
    End for each
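A compact Python rendering of Algorithm 1 is sketched below. The variance-based membership function and the toy temperature series are assumptions made for illustration, since the chapter does not spell out the exact form of the standard-deviation membership function.

```python
import statistics

def variance_membership(accepted, maximum, value):
    """Assumed decreasing membership: 1 up to the accepted variance, falling
    linearly to 0 at the maximum permissible variance."""
    if value <= accepted:
        return 1.0
    if value >= maximum:
        return 0.0
    return (maximum - value) / (maximum - accepted)

def cluster_days(daily_values, accepted_var, max_var, threshold):
    """Group consecutive days whose values stay within the variance threshold."""
    clusters, current = [], []
    for value in daily_values:
        candidate = current + [value]
        if len(candidate) < 2 or variance_membership(
                accepted_var, max_var, statistics.pvariance(candidate)) >= threshold:
            current = candidate              # the day joins the running cluster
        else:
            clusters.append(current)         # close the cluster and start a new one
            current = [value]
    if current:
        clusters.append(current)
    return clusters

# Toy daily temperatures: two stable regimes separated by a jump.
temps = [20.0, 20.5, 21.0, 20.8, 29.0, 29.5, 30.0, 29.8]
print([len(c) for c in cluster_days(temps, 0.1, 1.0, 0.8)])   # -> [4, 4]
```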

the overall suitability membership degree for each cluster. Consequently, it selects the suitable periods from the clusters whose suitability membership degree is equal to or greater than the accepted threshold; see Algorithm 2. After that, an optimization process takes place on each selected cluster by shifting it to the left or to the right and re-testing the suitability degree of the new cluster. In other words, the optimization can easily be achieved by removing a few days (for example, 5 days) from the beginning of a cluster, adding the same number of days to its end, and re-evaluating the suitability of the modified cluster. At the end, the set of resulting suitable periods has the highest suitability degrees reached by the optimization operation. The following query shows how to retrieve the clusters with suitability greater than or equal to an accepted threshold:

    Select cluster from clusters of periods
    where period_suitability(max temperature, min temperature, humidity) >= suitability_threshold


Algorithm 2. Fuzzy selection algorithm

Input: fuzzy clustered periods resulting from the automatic fuzzy clustering phase, crop plantation data, and the accepted threshold for the suitability membership function.
Output: suitable periods.

    Begin
        Suitable_periods = empty
        For each Period in clustered_Periods
            Temperature_suitability = temp_suitability(Max_Temperature, Min_Temperature, Period)
            Humidity_suitability = humidity_suitability(humidity, Period)
            Sunshine_suitability = sunshine_suitability(sunshine, Period)
            If min(Temperature_suitability, Humidity_suitability, Sunshine_suitability) >= accepted_threshold then
                Optimized_period = Period_optimization(Period)
                Add Optimized_period to Suitable_periods
            Else
                Continue
            End if
        End for each
    End
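The selection and shift-based optimization can be sketched in Python as follows. The per-day aggregation (taking the minimum of the three membership degrees and averaging over the period) and the 5-day shift step are assumptions used for illustration, not details fixed by the chapter.

```python
def period_suitability(period, temp_fn, humid_fn, sun_fn):
    """Average day suitability, each day scored by the minimum of its three degrees."""
    degrees = [min(temp_fn(d["temp"]), humid_fn(d["humid"]), sun_fn(d["sun"]))
               for d in period]
    return sum(degrees) / len(degrees)

def optimize_period(year_days, start, length, score, step=5, max_shift=30):
    """Shift a candidate period left or right in `step`-day moves and keep the
    start date with the highest suitability score."""
    best_start, best_score = start, score(year_days[start:start + length])
    for shift in range(-max_shift, max_shift + 1, step):
        s = start + shift
        if 0 <= s and s + length <= len(year_days):
            candidate = score(year_days[s:s + length])
            if candidate > best_score:
                best_start, best_score = s, candidate
    return best_start, best_score

def select_periods(clusters, score, threshold, year_days, length):
    """Algorithm 2: keep and optimize every cluster whose score meets the threshold."""
    suitable = []
    for start, period in clusters:              # each cluster = (start index, list of days)
        if score(period) >= threshold:
            suitable.append(optimize_period(year_days, start, length, score))
    return suitable
```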

THE FIRST ILLUSTRATIVE CASE STUDY

The proposed approach is tested on the spatial agro-climatic database for Alexandria governorate in Egypt. This illustrative case study addresses the effects of changes in temperature, humidity, and sunshine on the plantation of the squash crop in Alexandria governorate. It assumes that the water required for the squash crop plantation is available; in other words, the approach calculates the suitability with respect to the effects of the specified climate variables only. The historical data of five years, from 2010 to 2014, are used, with a weight for each year that increases from the oldest to the newest year. The climate requirements for the squash crop, as given in Kenanaonline (2013) and Caae-eg (2013), are as follows:

• The average age of the plant is 100 days.
• The maximum temperature is less than 32º.
• The minimum temperature is above the frost point.
• The best temperature is from 16º to 25º.
• The best humidity is from 65% to 70%.
• The best sunshine is from 12 to 13 hours.

Accordingly, the fuzzy clustering is performed according to the following requirements:


• The average age of the plant is 100 days, so each period contains at least 100 days.
• The accepted standard deviation (S.D.) of the data is less than or equal to 0.1.
• The accepted threshold is 0.8.

The Suitability of Traditional Periods

Figures 5-7 show the suitability of the winter buttonhole, summer buttonhole, and Nile buttonhole for squash plantation in Alexandria governorate with respect to the changes in the climate variables temperature (T), humidity (H), and sunshine (S). Table 1 shows the suitability of the traditional plantation dates for squash in Alexandria with respect to climate changes.

Table 1. Suitability of traditional squash plantation dates

| Plantation Dates | Traditional Old Dates | Suitability |
|---|---|---|
| Winter buttonhole (Figure 5) | 1 Dec to 11 Mar | 81.95% |
| Summer buttonhole (Figure 6) | 1 Feb to 12 May | 85.56% |
| Nile buttonhole (Figure 7) | 1 Jul to 9 Oct | 46.26% |

Figure 5. Suitability of traditional Winter buttonhole for planting squash


Figure 6. Suitability of traditional Summer buttonhole for planting squash

Figure 7. Suitability of traditional Nile buttonhole for planting squash


The Suggested Periods by the Proposed Approach

The proposed approach starts by dividing the days of the year into clusters of days with similar climate values. Each cluster has a period length greater than or equal to the average age of the crop under study, and the variance between the climate data in a cluster should be less than a specified threshold (i.e., 0.1). The resulting clusters are stored in the clusters database. On the other hand, the proposed approach generates the suitability membership functions based on the climatic requirements of the crop under study. Finally, the selection and optimization are performed on the clusters stored in the clusters database, depending on the generated suitability membership functions of the crop. The result includes all clusters with a matching degree greater than the specified threshold. Figures 8-12 show the suitability of the resulting periods for planting the squash crop in Alexandria with respect to a threshold of 75%. Also, Table 2 shows the suitability of the predicted plantation dates for squash in Alexandria with respect to climate changes.

Table 2. The predicted squash plantation dates

| Plantation Dates | New Predicted Dates | Suitability |
|---|---|---|
| First date (Figure 8) | 10 Jan to 20 Apr | 83.19% |
| Second date (Figure 9) | 14 Feb to 25 May | 86.16% |
| Third date (Figure 10) | 16 Sep to 25 Dec | 75.19% |
| Fourth date (Figure 11) | 19 Oct to 27 Jan | 85.5% |
| Fifth date (Figure 12) | 1 Nov to 9 Feb | 83.08% |

Figure 8. Suitability of the predicted first buttonhole for planting squash


Figure 9. Suitability of the predicted second buttonhole for planting squash

Figure 10. Suitability of the predicted third buttonhole for planting squash


Considering both Table 1 and Table 2, it is obvious that the climate changes, namely in temperature, humidity, and sunshine, affected the squash plantation dates. As shown in Table 1, the traditional period "Nile buttonhole" has become unsuitable for squash plantation, since its suitability drops to 46.26%; planting squash in this period would damage the crop. On the other hand, some other periods have become more suitable for squash plantation, as shown in Table 2. The proposed approach discovered new suitable periods for squash plantation, such as the fourth period, an ideal period with a suitability degree of around 85.5% that runs from 19 Oct to 27 Jan, as shown in Figure 11. Also, the fifth period, with a suitability degree of around 83.08%, is another suitable period that runs from 1 Nov to 9 Feb, as shown in Figure 12. The first, second, and third periods also represent suitable periods for squash plantation, with suitability degrees of around 83.19%, 86.16%, and 75%, as shown in Figure 8, Figure 9, and Figure 10 respectively. As noted, all periods resulting from the proposed approach are either wholly discovered or at least adjusted by shifting by at least 30 days. According to this sample of results, the squash plantation dates should be changed to reflect climate changes in order to increase the productivity of the crop and reduce the cost of production. Consequently, climate change means that some crops are no longer suitable for planting in their old places during

Figure 11. Suitability of the predicted fourth buttonhole for planting squash


Figure 12. Suitability of the predicted fifth buttonhole for planting squash

the same traditional periods. Yet, they can still be planted in their traditional places but at different periods of time. When the results of the proposed approach were shown to some experts at the Agriculture Research Center in Egypt, they accepted and appreciated the results, and they explained that they currently perform this adaptation by human observation.

SECOND ILLUSTRATIVE CASE STUDY

This case study is concerned with discovering the more suitable periods for planting the maize crop in Alexandria governorate in Egypt. It takes into account three climate variables, namely temperature, humidity, and sunshine. The climate requirements for the maize crop, as given in Deltaagro (2013), are as follows:

• The average age of the plant is 120 days.
• The maximum and minimum temperatures are 40º and 5º respectively.
• The best temperature is from 27º to 35º, and the best humidity is from 30% to 70%.
• The best sunshine is from 12 to 14 hours.

The fuzzy clustering is performed according to the following requirements:


• The average age of the plant is 100 days; accordingly, each period contains at least 100 days.
• The accepted standard deviation (S.D.) of the data in each period is less than or equal to 0.1.
• The accepted threshold is 0.8.

Suitability of Traditional Periods The results of evaluating the maize crop traditional planting dates using the evaluating algorithm in the proposed approach are sown in Figure 13 and Figure 14. Also, Table 3 shows the suitability of the start of a traditional plantation dates for Maize in Alexandria respecting climate changes.

The Proposed Approach Suggested Periods The proposed approach is applied to predict the more suitable planting periods for planting maize crop in Alexandria. The result includes all periods that have matching degree greater than the specified threshold Figures 15-19 show the suitability of the resulted predicted periods for planting maize crop in Alexandria respecting a threshold of 80%. Also, Table 4 shows ranges of suitability degrees of range of start and end plantation of such predicted periods. Considering both of Table 3, Table 4 and Figures from Figure 13 to Figure 19, it is obvious that climate changes namely temperature, humidity and sunshine degrees affect clearly the plantation dates Figure 13. Suitability of first traditional plantation date of maize in Alexandria


Figure 14. Suitability of second traditional plantation date of maize in Alexandria

Table 3. The maize traditional old plantation dates

| The Traditional Plantation Dates | Traditional Old Dates | Suitability |
|---|---|---|
| The first traditional plantation dates (Figure 13) | From 1 Apr. to 30 Jul. | 81.9% |
| The second traditional plantation dates (Figure 14) | From 31 May to 28 Sep. | 93.33% |

Table 4. Maize plantation dates discovered by the proposed approach

| The Predicted Plantation Periods | The Ranges of Predicted Plantation Period Start Dates | Suitability |
|---|---|---|
| The first suitable plantation dates, shown in Figure 15 and Figure 16 | From 24 Mar. to 25 Aug. | 80.22% to 80.73% |
| The second suitable plantation dates, shown in Figure 17 and Figure 18 | From 11 May to 27 Jul. | 90.11% to 90.08% |
| The third (most suitable) plantation dates, shown in Figure 19 | From 16 Jun. to 20 Jun. | 94.5% |


Figure 15. Suitability of planting maize at the start of first predicted period in Alexandria

Figure 16. Suitability of the end of first predicted plantation date of maize in Alexandria


Figure 17. Suitability of planting maize at the start of second predicted period in Alexandria

Figure 18. Suitability of the end of second predicted plantation date of maize in Alexandria


of the maize crop at Alexandria governorate in Egypt. As shown in Table 3, Figure 13, and Figure 14, the climate changes have little effect on the old traditional maize plantation dates. On the other hand, climate changes produce more suitable plantation dates for the maize crop. Firstly, as shown in Table 4 and Figure 15, the start of the maize plantation date could be shifted to the last week of March with a good suitability degree of more than 80%. Secondly, the end of the maize plantation date could be shifted to the last week of August with a good suitability degree of more than 80%. Thirdly, the proposed approach discovered more suitable plantation dates with a suitability degree of more than 90%, as in Figure 18. Fourthly, the proposed approach discovered that the most suitable date for maize plantation is in the first half of June, as shown in Figure 19. Finally, the most important note is that the period of maize plantation dates has been extended from two months to nearly five months. Such a note represents a positive effect of climate changes on maize production.

RESULTS AND DISCUSSIONS

The proposed approach presented in this chapter handles several shortcomings of the previous approaches. Firstly, the proposed approach converts the provided climate data of the year into clusters of similar climate values; the length of such clusters is greater than or equal to the lifetime of the specified crop, so the selection process is performed quickly. Secondly, the suitability of each cluster is evaluated using the climate values of each day, not the weekly or monthly averages. Thirdly, the proposed

Figure 19. Suitability of the most suitable, third, predicted plantation date of maize in Alexandria


approach provides a matching measure for each selected plantation period of the underlying crop with respect to climate changes. Finally, the proposed approach bases the suitability calculation on the real minimum and maximum values of the climate variables (for those variables that have minimum and maximum values), not on their average values. Table 5 presents a comparison between the proposed approach and some other similar approaches. On the other hand, considering the results of applying the proposed approach, a set of notes can be observed:

1. There appears to be an increase in temperatures in recent years.
2. The suitability of crop plantation in Alexandria is affected by the changes in the climate variables.
3. There is an important need to adapt plantation dates every few years.
4. Modern technology must address how to adapt to the climate changes, as such adaptation is already imposed in reality.
5. The governorates must pay attention to the fact that, over time, many traditional crops will not be suitable for planting in some areas in Egypt. Accordingly, governmental plans should respect such imposed changes or pay the invoice of catastrophic crop damage.

Table 5. Comparison between the proposed approach and other previous approaches

A Crop Model and Fuzzy Rule Based Approach for Optimizing Maize Planting Dates in Burkina Faso, West Africa [Moussa, 2014]
• Inputs: the average values of the climate variables.
• Processing: no clustering of the year into periods according to climate data before the selection of the planting date.
• Outputs: a single plantation date for the selected crop, within two weeks.

Crop Yield Forecasting by Adaptive Neuro Fuzzy Inference System [Pankaj, 2011]
• Inputs: the average weekly values of the climate variables.
• Processing: no clustering of the year into periods according to climate data before the selection of the planting date.
• Outputs: a single plantation date for the selected crop.

Consideration of Climate Conditions in Reservoir Operation Using Fuzzy Inference System (FIS) [Hamid et al., 2013]
• Inputs: the climate variables that affect the availability of water.
• Processing: no clustering of the year into periods according to climate data before the selection of the planting date.
• Outputs: suggests a single crop plantation date based only on the availability of the needed amount of water for the crops.

The proposed fuzzy-based approach
• Inputs: the values of the climate variables for each day of the year, in addition to the maximum and minimum values of some climate variables such as temperature.
• Processing: the approach clusters the year into periods whose climate conditions are similar to each other, with a variance not exceeding the accepted threshold.
• Outputs: aims to discover all suitable plantation periods for each crop, with a suitability degree for each one.

CONCLUSION

This chapter presented a proposed fuzzy approach for adapting crop plantation dates to climate changes. Firstly, it uses the available historical spatial agro-climatic data to predict the values of climate variables such as temperature, sunshine, and humidity for the incoming year at a specific area. Secondly, the proposed approach divides the days of the year into clusters of days with similar climate values.


The length of such clusters is suitable for the specified crop plantation lifetime. On the other hand, the suitable climate values are set for the specified crop. Consequently, the proposed approach selects the more suitable clusters of days for planting the specified crop with respect to a specified threshold. Finally, an optimization process within the proposed approach takes place in order to adjust the obtained clusters to be more suitable. The proposed approach was tested on the available historical spatial agro-climatic database for Alexandria governorate in Egypt and the climate requirements of the squash and maize crops. As a result, the proposed approach discovers new periods of time that are more suitable for squash and maize planting than the old periods, with a suitability degree for each period. It also discovers that some traditional periods have become completely unsuitable for planting squash in Alexandria. Accordingly, the proposed approach guides and helps agricultural investors in a flexible manner to adjust the plantation dates of any crop at any location, given that the historical spatial agro-climatic data for that location are available. In consequence, such an approach greatly helps decision makers in drawing agricultural maps. It also helps in distributing crop planting throughout the year. Moreover, it directly increases the profit, decreases the cost of any crop plantation, and helps prevent the catastrophic damage that can happen as a result of climate changes.

REFERENCES

America's Climate Choices, Panel on Advancing the Science of Climate Change, National Research Council. (2010). Advancing the Science of Climate Change. Washington, DC: The National Academies Press.
Atmos. (2014). Retrieved 11 October 2014 from http://ww2010.atmos.uiuc.edu/%28Gh%29/guides/mtr/fcst/mth/oth.rxml
Beaubouef, & Petry. (2010). Fuzzy and Rough Set Approaches for Uncertainty in Spatial Data. Springer.
Bizikova, L., Nijnik, M., & Nijnik, A. (2015). Exploring institutional changes in agriculture to inform adaptation planning to climate change in transition countries. Mitigation and Adaptation Strategies for Global Change, 20(8), 1385–1406. doi:10.1007/s11027-014-9552-9
Branco, A., Evsukoff, A., & Ebecken, N. (2005). Generating fuzzy queries from weighted fuzzy classifier rules. ICDM Workshop on Computational Intelligence in Data Mining.
Caae-eg. (2013). Retrieved 20 October 2013 from http://www.caae-eg.com/new/index.php/2012-12-2510-49-19/2010-09-18-17-00-51/2011-01-15-19-27-42/234-2011-05-25-17-20-36.html
Defang, N., Manu, I., Bime, M., Tabi, O., & Defang, H. (2014). Impact of climate change on crop production and development of Muyuka subdivision – Cameroon. International Journal of Agriculture, Forestry and Fisheries, 2(2), 40–45.
Deltaagro. (2013). Retrieved 20 October 2013 from http://www.deltaagro.com/DeltaLibraryar.aspx
Hamid, R., Mohammad, A., & Mohammad, H. (2013). Consideration of Climate Conditions in Reservoir Operation Using Fuzzy Inference System (FIS). British Journal of Environment & Climate Change, 3(3), 444–463. doi:10.9734/BJECC/2013/2295


Karl, & Melillo. (2009). Global Climate Change Impacts in the United States. New York, NY: Cambridge University Press.
Kenanaonline. (2013). Retrieved 20 October 2013 from http://kenanaonline.com/users/zidangroup/posts/95467
Kumar, P. (2011). Crop Yield Forecasting by Adaptive Neuro Fuzzy Inference System. Mathematical Theory and Modeling, 1, 3.
Kumar, P. (2016). Fuzzy-Based Querying Approach for Multidimensional Big Data Quality Assessment. In Handbook of Research on Fuzzy and Rough Set Theory in Organizational Decision Making.
Mohaddes, & Mohayidin. (2008). Application of the Fuzzy Approach for Agricultural Production Planning in a Watershed, a Case Study of the Atrak Watershed, Iran. American-Eurasian J. Agric. & Environ. Sci., 3(4), 636–648.
Mohammed, Allah, & Hefny. (2014). Fuzzy time series approach for optimizing crops planting dates with climate changes. 2014 10th International Computer Engineering Conference (ICENCO). IEEE.
Moussa, W., Patrick, L., Moussa, S., & Haraldk, S. (2014). A Crop Model and Fuzzy Rule Based Approach for Optimizing Maize Planting Dates in Burkina Faso, West Africa. Journal of Applied Meteorology and Climatology, 53.
Rajeshwar, G. (2013). Predicting Suitability of Crop by Developing Fuzzy Decision Support System. IJETAE, 3(Special Issue), 2.
Sabour, A. A., Gadallah, A. M., & Hefny, H. A. (2014). Flexible Querying of Relational Databases: Fuzzy Set Based Approach. International Conference on Advanced Machine Learning Technologies and Applications. Springer International Publishing. doi:10.1007/978-3-319-13461-1_42
Werro, N. (2015). Relational Databases & Fuzzy Classification. In Fuzzy Classification of Online Customers. Springer International Publishing. doi:10.1007/978-3-319-15970-6_3
Worldweatheronline. (2013). Retrieved 20 October 2013 from worldweatheronline.com
Yonia, Hartati, & Sitanggang. (2010). A Fuzzy Based Decision Support System for Evaluating Land Suitability & Selecting Crops. Journal of Computer Science, 6(4), 417–424. doi:10.3844/jcssp.2010.417.424
Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8(3), 338–353. doi:10.1016/S0019-9958(65)90241-X
Zhang, J., & Goodchild, M. (2002). Uncertainty in Geographical Information. London: Taylor & Francis. doi:10.4324/9780203471326


Chapter 14

Directional Multi-Scale Stationary Wavelet-Based Representation for Human Action Classification

M. N. Al-Berry Ain Shams University, Egypt

H. M. Ebeid Ain Shams University, Egypt

Mohammed A.-M. Salem Ain Shams University, Egypt

A. S. Hussein Arab Open University, Kuwait

Mohamed F. Tolba Ain Shams University, Egypt

ABSTRACT

Human action recognition is a very active field in computer vision. Many important applications depend on accurate human action recognition, which in turn depends on accurate representation of the actions. These applications include surveillance, athletic performance analysis, driver assistance, robotics, and human-centered computing. This chapter presents a thorough review of the field, concentrating on recent action representation methods that use spatio-temporal information. In addition, the authors propose a stationary wavelet-based representation of natural human actions in realistic videos. The proposed representation utilizes the 3D Stationary Wavelet Transform to encode the directional multi-scale spatio-temporal characteristics of the motion available in a frame sequence. It was tested using the Weizmann and KTH datasets and produced good preliminary results while having reasonable computational complexity compared to existing state-of-the-art methods.

DOI: 10.4018/978-1-5225-2229-4.ch014

INTRODUCTION

Recently, intelligent cognitive systems have begun to appear, with a vision that ambient intelligence will be part of our daily life in the near future (Pantic, Nijholt, Pentland, & Huanag, 2008). This opened the challenge that computers should be able to understand actions performed by humans and respond according to this understanding.


Many applications depend on human action and activity recognition. These applications can be classified into surveillance, control, and analysis applications (Moeslund, Hilton, & Kruger, 2006). Intelligent surveillance is the monitoring process that analyses the scene and interprets object behaviors, and involves event detection, object detection, recognition, and tracking. This includes security systems that detect abnormal behavior (Huang & Tan, 2010; Roshtkhari & Levine, 2013) in security-sensitive areas like airports (Aggarwal & Cai, 1999), surveillance of crowd behavior (Chen & Huang, 2011; Sharif, Uyaver, & Djeraba, 2010), group activity recognition (Cheng, Qin, Huang, Yan, & Tian, 2014), and person identification using behavioral biometrics (Turaga, Chellappa, Subrahamanian, & Udrea, 2008; Sarkar, Phillips, Liu, Vega, Grother, & Bowyer, 2005). Control applications are the category of applications that depend on interaction between human and computer (Pantic, Nijholt, Pentland, & Huanag, 2008; Poppe, 2010; Pantic, Pentland, Nijholt, & Huanag, 2007; Rautaray & Agrawal, 2012). These applications recognize human gestures to control something such as smart houses (Brdiczka, Langet, Maisonnasse, & Crowley, 2009; Fatima, Fahim, Lee, & Lee, 2013) and intelligent vehicles (Wu & Trivedi, 2006). Analysis applications include content-based image and video retrieval (Laptev, Marszalek, Schmid, & Rozenfeld, 2008), driver sleeping detection, robotics (Freedman, Jung, Grupen, & Zilberstein, 2014), and athletic performance analysis.

The field of action and activity recognition is still an open research area because of the various challenges that face it. For action recognition, challenges arise from variations in the rate of execution of actions (Cristani, Raghavendra, Del Bue, & Murino, 2013; Thi, Cheng, Zhang, Wang, & Satoh, 2012; Ashraf, Sun, & Foroosh, 2014). As the number of individuals and interactions increases, the complexity of the task increases. Therefore, higher-level behavior understanding faces more difficult challenges, including the number of modalities to be used, how to fuse them, and how to make use of the context in the process of learning and recognition (Vishwakarma & Agrawal, 2013).

Poppe (Poppe, 2010) defined vision-based human action recognition as "the process of labeling image sequences with action labels". Following Weinland et al. (Weinland, Ranford, & Boyer, 2011), an action is a sequence of movements generated by a performer during the performance of a task, and an action label is a name such that an average human agent can understand and perform the named action. Different methods have been proposed for segmenting, representing, and classifying actions. These methods can be classified into different taxonomies (Weinland, Ranford, & Boyer, 2011; Pantic, Pentland, Nijholt, & Huanag, 2006; Turaga, Chellappa, Subrahamanian, & Udrea, 2008). One of the best-known methods used for holistic motion representation is the Motion History Image (MHI) (Davis, 2001; Babu & Ramakrishnan, 2004; Ahad, Tan, Kim, & Ishikawa, 2012). Motion History Images are temporal templates that are simple but robust in motion representation, and they have been used for action recognition by several research groups (Ahad, Tan, Kim, & Ishikawa, 2012). This chapter aims at providing a review of the recent advances in the field, with a focus on spatio-temporal action representation.
In addition, the chapter proposes a multi-scale spatio-temporal action representation based on 3D stationary wavelet analysis. The proposed representation is based on the 3D Stationary Wavelet Transform (SWT) that was proposed and used in (Al-Berry, Salem, Hussein, & Tolba, 2014) for spatio-temporal motion detection. The 3D SWT succeeded in detecting motion in the presence of illumination variations in both indoor and outdoor scenarios while having reasonable complexity. In the proposed action representation, the 3D SWT is used to encode the action into three directional wavelet-based templates. Hu invariant moments (Hu, 1962) are used for describing the templates, and the resulting descriptors are concatenated into a combined feature vector. The preliminary results obtained using these simple features show that the proposed representation is comparable to state-of-the-art methods. The proposed method also has the advantage that it can be used for joint action detection and representation: the 3D SWT is computed once and its output is used for all subsequent tasks, thus reducing the computations.

RELATED WORK

In this section, a thorough review of the field of action and activity recognition is presented, with special concentration on spatio-temporal action representation and description techniques. First, basic definitions are presented. Different categories of action representation techniques are then described. The following subsection gives a detailed review of spatio-temporal action representation, and the last subsection gives a short overview of action classification techniques.

Definitions and Complexity Levels

The words "action" and "activity" are used in the literature in two different ways: sometimes interchangeably, and sometimes to describe different targeted levels of complexity in recognizing the movement. For example, Bobick (Bobick, 1997) used "movement" for low-level atomic actions, "activity" for a sequence of movements, and "action" for higher-order events. Poppe (Poppe, 2010) used the name "action primitive" for simple movements, "action" for a set of simple actions that describe whole-body movements, and "activities" for a number of successive actions. Vishwakarma and Agrawal (Vishwakarma & Agrawal, 2013) classified the complexity of human activities into four levels. The first level was referred to as "gestures": motion of a part of the body in a very short time period. The second was "actions": a single-person activity composed of temporally ordered gestures. "Interactions", the third level, describe two or more persons performing an activity, or a person interacting with an object. The last level is "group activities", which are activities performed by groups of multiple objects. Figure 1 shows the different levels of abstraction presented by different researchers.

Figure 1. Different taxonomies of action levels


The task of automatic human action recognition is the process in which a visual system can analyze the motion performed by a human and describe it using a name that can be understood by an average person (Turaga, Chellappa, Subrahamanian, & Udrea, 2008), (Poppe, 2010), (Lv, Nevatia, & Wai Lee, 2005). The name given to the action is called an action label (Weinland, Ranford, & Boyer, 2011).

Classification of Action and Activity Recognition Techniques

Action and activity recognition techniques have been categorized based on various criteria. Aggarwal and Cai (Aggarwal & Cai, 1999) classified recognition techniques into template matching approaches and state-space approaches. Template matching approaches convert the frame sequence into a static pattern that is matched to one of the learned patterns. State-space approaches model the action as a sequence of states; each state is a fixed posture, and these postures are combined in a predefined sequence with certain probabilities. State-space techniques are not sensitive to variation in action execution, but require expensive computing iterations in the training process. On the other hand, template matching approaches have the advantage of low computational complexity, but may be sensitive to differences in the rate of action execution (Aggarwal & Cai, 1999). Lv et al. (Lv, Nevatia, & Wai Lee, 2005) classified the techniques based on the used data types into 2D and 3D approaches. Pantic et al. (Pantic, Pentland, Nijholt, & Huanag, 2006) stated that most action recognition techniques are classified into model-based techniques or appearance-based techniques. Model-based techniques model the body parts using geometric primitives like cones or spheres, while appearance-based techniques use appearance information like color and texture. Another criterion has been used by Turaga et al. (Turaga, Chellappa, Subrahamanian, & Udrea, 2008). They classified the approaches based on the complexity of the recognized actions. For simple actions, three categories of approaches were described as follows:

•	Non-Parametric: Extract some features from each video frame and match these features to a pre-stored template. This includes 2D templates that represent the action in a single 2D image, and 3D templates that represent the action as an object in (x, y, t) space.
•	Volumetric: Based on processing the whole video as a volume and extracting spatio-temporal features from it. This includes techniques based on the extraction of space-time interest points to describe the action, and techniques that match sub-volumes of the video to templates.
•	Parametric: Explicitly model the temporal dynamics of actions. The parameters of these models are learned from the training data.

For activities that are more complex, the authors described graphical-model, syntactic, and knowledge-based approaches. These approaches enable higher-level representations and modeling of the inherent semantics in scenes that contain complex scenarios. Poppe (Poppe, 2010) classified representation techniques into global and local representations. Global representations use an image descriptor to represent the global motion, while local representations describe the motion as a set of independent regions calculated around some extracted interest points. These are then combined into a final representation.


In (Weinland, Ranford, & Boyer, 2011), Weinland et al. suggested that action representation, segmentation, and recognition techniques can be classified according to the body parts involved (hand gestures, full body gestures, etc.), the selected image features (optical flow, interest points, etc.), or the class of technique used for learning and recognition. They adopted a classification of techniques based on the spatial and temporal structure of actions. Spatial action representations include body models, image models, and spatial statistics. Temporal representations include action grammars, action templates, and temporal statistics.

Shao et al. (Shao, Ji, Liu, & Zhang, 2012) classified action recognition techniques into model-based approaches, spatio-temporal template based approaches, and bag-of-words approaches. Model-based approaches require modeling the human body in a 2D or 3D representation, which is then used for action recognition (Aggarwal & Cai, 1999). For example, Gupta et al. (Gupta, Singh, Dixit, Semwal, & Dubey, 2013) used a model-based approach to recognize actions by modeling the motion of the legs only. In spatio-temporal template based approaches, space and time information is used to construct a representation of the action that highlights where and when motion occurs in the sequence of frames, as found in the work of Bobick and Davies (Bobick & Davies, The recognition of human movement using temporal templates, 2001) and Weinland et al. (Weinland, Ronfard, & Boyer, Free Viewpoint Action Recognition using Motion History Volumes, 2006). Bag-of-words approaches detect local features (visual words) to describe actions, as found in (Laptev, Marszalek, Schmid, & Rozenfeld, 2008).

In the survey by Vishwakarma and Agrawal (Vishwakarma & Agrawal, 2013), action recognition techniques were classified into two main categories: hierarchical and non-hierarchical. Non-hierarchical techniques are those used for simple primitive actions, while hierarchical techniques are used for complex interactions and group activities. These two main categories are further divided into sub-categories in a way that is relatively close to the classification used by Turaga et al. (Turaga, Chellappa, Subrahamanian, & Udrea, 2008).

More recently, Rahman et al. (Rahman, Song, Leung, Lee, & Lee, 2014) classified techniques that depend on low-level features into feature-tracking based techniques, intensity and gradient based techniques, and silhouette based techniques. Feature-tracking based techniques recognize actions based on the trajectories of tracked features in either 2D or 3D space; they require modeling of the body segments in 2D or 3D. Intensity and gradient-based techniques utilize gradient or intensity based features to recognize actions: space-time interest points are detected, and then actions are recognized by modeling these interest points. Silhouette based methods start by obtaining silhouettes of the moving object, and then extract features from the obtained silhouettes to recognize actions.

Here, the authors classify spatio-temporal action recognition techniques based on the data type of the action representation into 2D and 3D techniques. A 2D representation encodes the spatio-temporal motion information into a single 2D template, while a 3D representation represents the motion as a 3D volume. Both representations have been used and reported in the literature with different performances and applications.
2D and 3D techniques are in turn classified, based on the type of representation used, into global and local techniques. Global techniques build a holistic representation of the action using spatial and temporal information, while local techniques use the spatio-temporal information to extract local features. Local representations have the advantage of requiring low computational and storage resources; their disadvantage is that they lack the global relationships between the extracted features. Global and local representations can be fused into a combined representation (Liu, Liu, Zhang, & Lu, 2010). The classification is detailed in the following sub-sections.


2D Spatio-Temporal Representation and Description

Global Representation

One of the most famous global 2D spatio-temporal representations is the Motion History Image (MHI) and Motion Energy Image (MEI) pair proposed by Davies and Bobick (Davies & Bobick, 1997; Bobick & Davies, 2001). In (Davis, 2001), Davis described the accumulation of the global motion into one 2D template by layering the motion regions over time. The MHI indicates how recent the motion is by using different intensity levels, while the MEI encodes the whole motion into a single binary template. Figure 2 shows examples of MHIs of different exercise types. Following this representation, many researchers used this technique for action recognition in different applications (Babu & Ramakrishnan, 2004; Shao, Ji, Liu, & Zhang, 2012).

Figure 2. First row: different exercise types; second row: their corresponding MHIs (Shao, Ji, Liu, & Zhang, 2012).

Different types of features can be used to describe this representation. Hu invariant moments (Hu, 1962) are often used in combination with the MEI and MHI (Sharma, Kumar, Kumar, & McLachlan, Representation and Classification of Human Movement Using Temporal Templates and Statistical Measure of Similarity, 2002). Other types of features may also be used. In (Sharma, Kumar, Kumar, & McLachlan, Wavelet Directional Histograms for Classification of Human Gestures Represented by Spatio-Temporal Templates, 2004), the authors represented actions using MHIs. They modified this representation to be invariant to translation and scale, applied two-dimensional, 3-level dyadic wavelet transforms on them, and concluded that the directional sub-bands by themselves are not efficient for action classification. In (Sharma & Kumar, Moments and Wavelets for Classification of Human Gestures Represented by Spatio-Temporal Templates, 2004), they used the orthogonal Legendre moments to describe the histograms of MHIs, and modeled the wavelet sub-bands by generalized Gaussian density (GGD) parameters (shape factor and standard deviation). They showed that this description enhances the classification accuracy of the directional information contained in the wavelet sub-bands. In (Shao, Ji, Liu, & Zhang, 2012), MHIs were used to detect continuous actions. The authors used the Pyramid of Correlogram of Oriented Gradient (PCOG) to describe the MHI and MEI, and this feature proved to provide good discrimination between different action classes.


Qian et al. (Qian, Mao, Xiang, & Wang, 2010) constructed the contour coding of the motion energy image (CCMEI). They used the CCMEI in combination with some local features to describe actions, which were then classified using a multi-class Support Vector Machine (SVM). The accuracy of their technique was tested using their own dataset and the KTH dataset (Schüldt, Laptev, & Caputo, 2004). MHIs are view-dependent and only suitable for representing actions parallel to the camera plane. They also require accurate extraction of the actor's silhouettes, which can be a difficult task under bad illumination conditions (Aggarwal & Xia, 2014). In (Ahad, Tan, Kim, & Ishikawa, 2012), the authors gave an extensive review of this type of representation and discussed some of its drawbacks.

Local Representation

In local spatio-temporal action recognition, space-time interest points are detected and described using suitable descriptors. These descriptors are then quantized into video words, and the histogram of these video words is used for recognizing actions (Bregonzio, Xiang, & Gong, 2012). Bregonzio et al. (Bregonzio, Xiang, & Gong, 2012) proposed a spatio-temporal method in which they used the global information of interest points. They accumulated interest points over multiple time scales into clouds, and fused the features based on Multiple Kernel Learning. In (Moussa, Hamayed, Fayek, & El Nemr, 2013), interest points in each frame are extracted using the Scale Invariant Feature Transform (SIFT) (Lowe, 2004) descriptor with a tuning step to decrease the number of detected points. The SIFT feature vectors are passed to a KNN clustering algorithm to build a codebook. Figure 3 shows some of the extracted features.

Figure 3. Local space-time interest points extracted (Moussa, Hamayed, Fayek, & El Nemr, 2013).

Jargalsaikhan et al. (Jargalsaikhan, Direkoglu, Little, & O'Connor, 2014) evaluated and compared four local descriptors, Trajectory (TRAJ), Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF), and Motion Boundary Histogram (MBH), by combining them with a standard bag-of-features representation and a Support Vector Machine classifier. Their results showed that the combination of TRAJ and MBH achieves the best performance in the presence of both partial and heavy occlusion. A minimal sketch of the bag-of-visual-words step is given below.
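The following sketch quantizes local descriptors against a learned codebook and represents a video by a histogram of video words. It is a generic illustration, not any cited author's implementation: the random descriptors stand in for real SIFT/HOG/HOF features, and k-means is used here for codebook construction.

```python
# Minimal bag-of-visual-words sketch: local descriptors are quantized against a learned
# codebook and each video is represented by a normalized histogram of "video words".
# Descriptor arrays are random placeholders; sizes are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
train_descriptors = rng.normal(size=(5000, 128))       # pooled from training videos
codebook = KMeans(n_clusters=100, n_init=10, random_state=0).fit(train_descriptors)

def bow_histogram(video_descriptors, codebook):
    words = codebook.predict(video_descriptors)         # nearest codeword per descriptor
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()                            # normalized word histogram

video_feat = bow_histogram(rng.normal(size=(300, 128)), codebook)
print(video_feat.shape)                                 # (100,) -> fed to an SVM/KNN classifier
```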


3D Spatio-Temporal Representation and Description

3D techniques add time as a third dimension to the two spatial dimensions and represent the action either as a global 3D volume or as a group of local 3D patches.

Global Representation

One way of representing actions in a global 3D template is the one used by Yilmaz and Shah (Yilmaz & Shah, 2005) and Mokhber et al. (Mokhber, Achard, & Milgram, 2008). Yilmaz and Shah represented actions using view-invariant spatio-temporal volumes. These Space-Time Volumes (STV) were generated by computing point correspondences between object contours extracted from consecutive frames, using a graph-theoretical approach. The STVs were analyzed to compute important action descriptors that correspond to changes in direction, speed, and shape of the contour, which are reflected on the STV surface (see Figure 4). A set of these action descriptors was called the action sketch and was used for action classification. Their technique proved to be view-invariant.

Figure 4. Walk action from two different view angles: the first row shows sample frames, the second row shows extracted contours, and the third row shows the corresponding STVs. Derived from the work of Yilmaz & Shah, 2005.


Blank et al. (Blank, Gorelick, Shechtman, Irani, & Basri, 2005) manipulated actions as 3D shapes formed by the silhouettes in the space-time volume, as shown in Figure 5. They used the solution to the Poisson equation to extract space-time features for describing actions. Gorelick et al. (Gorelick, Blank, Shechtman, Irani, & Basri, 2007) showed that a single space-time cube contains enough information for action classification in simple cases, and that in more complex cases reliable performance can be achieved by integrating the information from all space-time cubes.

Figure 5. Space-time shapes of the "jack", "walk", and "run" actions (Blank, Gorelick, Shechtman, Irani, & Basri, 2005).

Local Representation

Dollar et al. (Dollar, Rabaud, Cottrell, & Belongie, 2005) proposed a local 3D action representation. Interest points were detected, and at each detected interest point they extracted a cuboid containing the spatio-temporally windowed pixel values. The size of the cuboid was set such that it contains most of the volume of data that contributed to the response function at the corresponding interest point. These cuboids were used as features for action recognition. Figure 6 shows a space-time volume and the corresponding extracted cuboids. Ballan et al. (Ballan, Bertini, Del Bimbo, Seidenari, & Serra, 2009) proposed using a combination of a 3D gradient descriptor and an optical flow descriptor to describe space-time interest points. Their approach was tested on two popular datasets and showed good results. Jin and Shao (Jin & Shao, 2010) combined the brightness gradient and the 3D shape context to increase the discriminative power of the features. They extracted cuboids around the space-time interest points and formed the feature vector using the gradient values of the pixels in the cuboids. After applying principal component analysis to reduce the dimension of the feature vector, they used another feature, the 3D shape context (3DSC), to describe the correlation of the extracted interest points. Bao and Shibata (Bao & Shibata, 2013) proposed a hierarchical approach for spatio-temporal action representation and description. First, they represented the motion by motion fields calculated from the input video sequences and manipulated by max filters.


Figure 6. Space-time volume and extracted space-time interest points (cuboids) (Dollar, Rabaud, Cottrell, & Belongie, 2005).

Then a set of prototype patches was used to recognize actions by comparing local features in the query with the prototypes. In addition, their proposed hierarchical structure showed promising results in recognizing actions without the need for pre-processing. Figure 7 illustrates the technique.

Classification

After obtaining a good representation and description of the actions, the rest of the action recognition process becomes a classification problem. This classification is performed using the direct classification of templates or by explicitly modeling variations in time (state-space modeling) (Poppe, 2010).

Figure 7. Overview of action representation and description using spatio-temporal motion field patches Derived from (Bao & Shibata, 2013).


Direct classification approaches do not deal with temporal order. They manipulate the obtained image representation to learn and assign action labels. This class of approaches includes discriminative classifiers and nearest neighbor classifiers. Discriminative classifiers are trained to discriminate between different action classes rather than modeling each class independently; Support Vector Machines (SVM) (Shao, Ji, Liu, & Zhang, 2012; Babu & Ramakrishnan, 2004; Moussa, Hamayed, Fayek, & El Nemr, 2013) are good examples of this class. Nearest neighbor classifiers are based on statistical pattern recognition: the input pattern (observation) is assigned to the nearest class in the training set, which requires a distance metric to determine the "closest" class (a minimal nearest-neighbor sketch is given at the end of this subsection). Neural Networks (NN) (Babu & Ramakrishnan, 2004; Babu, Suresh, & Savitha, 2012), K-Nearest Neighbor (KNN) classifiers (Babu & Ramakrishnan, 2004; Jin & Choi, 2013), and Bayes classifiers (Babu & Ramakrishnan, 2004; Cilla, Patricio, Berlanga, & Molina, 2012) are examples of this class.

A dimension reduction step is usually performed on the feature vectors before classification, especially when using the bag-of-interest-points approach. Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) (Tharwat, 2016; Weinland, Ronfard, & Boyer, 2006) are commonly used. In (Kong, Zhang, Hu, & Jia, 2011), the authors argued that PCA does not take class separability into account, and so may not be an efficient way to build a codebook. They proposed weighted adaptive metric learning (WAML), which enables selecting the essential dimensions for building an efficient codebook (Kong, Zhang, Hu, & Jia, 2011). A large amount of labeled data is usually required to train classifiers. Zhang et al. (Zhang, Liu, Xu, & Lu, 2011) proposed a boosted multi-class semi-supervised learning algorithm in which the CO-EM algorithm is adopted to leverage the information from unlabeled data.

State-space approaches are based on state-space action modeling, where actions are modeled as a sequence of states connected with defined probabilities. They can be generative or discriminative. Generative models learn a certain action class, while discriminative models focus on the differences between classes. Examples of generative models include Hidden Markov Models (HMM) and grammars. Conditional Random Fields (CRF) are good examples of discriminative models.
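The following minimal sketch illustrates direct (nearest-neighbor) classification of template descriptors; the feature vectors and labels are synthetic placeholders, and the Euclidean metric and single neighbor are illustrative choices rather than any cited author's settings.

```python
# Direct classification of template descriptors with a 1-nearest-neighbor rule.
# Feature vectors (e.g., Hu moments of action templates) and labels are synthetic.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X_train = rng.normal(size=(90, 7))        # 90 training templates, 7 features each
y_train = rng.integers(0, 10, size=90)    # 10 action classes
X_test = rng.normal(size=(10, 7))

knn = KNeighborsClassifier(n_neighbors=1, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.predict(X_test))                # label of the closest training template
```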

PROPOSED DIRECTIONAL WAVELET-BASED TEMPLATES

This section describes the proposed stationary wavelet-based action representation method. The proposed representation is motivated by motion history images (Davies & Bobick, 1997) and based on the 3D SWT proposed in (Al-Berry, Salem, Hussein, & Tolba, 2014). First, the directional wavelet energy images are introduced; the features used in classification are then described.

Proposed Directional Wavelet Energy Images

The first proposal is to build a directional Wavelet-based Energy Image (WEI) using the 3D SWT proposed in (Al-Berry, Salem, Hussein, & Tolba, 2014), where the video sequence is represented as a 3D volume of frames with time as the third dimension. The video sequence is divided into blocks of 8 frames, and a 3-level SWT is applied to each block. The coefficients of three sub-bands (ADD, DAD, DDD) are thresholded to obtain foreground images.


The foregrounds obtained at the eighth layer of the 3 different scales are fused into three sub-band foreground images $O_d(x, y, t)$, d = 1, 2, 3. These sub-band foreground images encode the directional motion energy during the processed 8 frames at 3 different scales, and thus can be used to represent the action over the duration of these 8 frames. This is illustrated in Figure 8 and is formulated as follows. First, the 3D SWT coefficients $w_j^d(x, y, t)$, where (x, y) are the spatial coordinates of the frames, t is the time, j is the resolution level, and d is the sub-band orientation, and the approximation coefficients $c_j(x, y, t)$ computed by the associated scaling function, are obtained as described in (Al-Berry, Salem, Hussein, & Tolba, 2014):

$c_{j+1}[k, l, m] = (\bar{h}^{(j)}\,\bar{h}^{(j)}\,\bar{h}^{(j)} * c_j)[k, l, m]$ (1)

$w^1_{j+1}[k, l, m] = (\bar{g}^{(j)}\,\bar{h}^{(j)}\,\bar{h}^{(j)} * c_j)[k, l, m]$ (2)

$w^2_{j+1}[k, l, m] = (\bar{h}^{(j)}\,\bar{g}^{(j)}\,\bar{h}^{(j)} * c_j)[k, l, m]$ (3)

$w^3_{j+1}[k, l, m] = (\bar{g}^{(j)}\,\bar{g}^{(j)}\,\bar{h}^{(j)} * c_j)[k, l, m]$ (4)

$w^4_{j+1}[k, l, m] = (\bar{h}^{(j)}\,\bar{h}^{(j)}\,\bar{g}^{(j)} * c_j)[k, l, m]$ (5)

$w^5_{j+1}[k, l, m] = (\bar{g}^{(j)}\,\bar{h}^{(j)}\,\bar{g}^{(j)} * c_j)[k, l, m]$ (6)

$w^6_{j+1}[k, l, m] = (\bar{h}^{(j)}\,\bar{g}^{(j)}\,\bar{g}^{(j)} * c_j)[k, l, m]$ (7)

$w^7_{j+1}[k, l, m] = (\bar{g}^{(j)}\,\bar{g}^{(j)}\,\bar{g}^{(j)} * c_j)[k, l, m]$ (8)

where g and h are the analysis filters of the wavelet function and the associated scaling function, respectively; $h[n]$ and $g[n]$ are the impulse responses of the filters h and g, and $\bar{h}[n] = h[-n]$ and $\bar{g}[n] = g[-n]$, $n \in \mathbb{Z}$, are their time-reversed versions; and k, l, m are the translations in the x, y, and t directions, respectively. Motion is highlighted in the temporal changes that happen along the t-axis, which are represented in the detail sub-images denoted $w_j^d(x, y, t)$, d = 5, 6, 7.
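As a concrete illustration of Eqs. (1)-(8), the following is a minimal one-level 3D undecimated (stationary) decomposition sketch. It is not the authors' implementation: the Haar filter pair, the reflective border handling, and the (rows, columns, frames) array layout are illustrative assumptions, and higher decomposition levels would additionally require upsampled (à trous) filters.

```python
# Minimal sketch of one level of the 3D stationary wavelet transform of Eqs. (1)-(8):
# separable low-pass (h) / high-pass (g) filtering along x, y and t with no decimation.
# Haar filters and 'reflect' borders are illustrative assumptions, not the chapter's choices.
import numpy as np
from scipy.ndimage import convolve1d

h = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass (scaling) filter
g = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass (wavelet) filter

def swt3d_level1(volume):
    """volume: (rows, cols, frames) block, e.g. 8 consecutive gray-level frames.
    Returns the approximation c1 and the detail sub-bands w1..w7 of Eqs. (1)-(8)."""
    def separable(data, kernels):
        out = data
        for axis, kernel in enumerate(kernels):        # axis 0 = x, 1 = y, 2 = t
            out = convolve1d(out, kernel, axis=axis, mode="reflect")
        return out

    c1 = separable(volume, (h, h, h))                  # Eq. (1): approximation
    kernel_sets = [(g, h, h), (h, g, h), (g, g, h),    # Eqs. (2)-(4)
                   (h, h, g), (g, h, g), (h, g, g), (g, g, g)]  # Eqs. (5)-(8)
    w = {d: separable(volume, ks) for d, ks in enumerate(kernel_sets, start=1)}
    return c1, w

block = np.random.rand(120, 160, 8)                    # toy 8-frame block
c1, w = swt3d_level1(block)
print(c1.shape, w[5].shape, w[6].shape, w[7].shape)    # undecimated: same size as input
```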


These coefficients at the different scales are thresholded to obtain motion images $O_j^d(x, y, t)$:

$$O_j^d(x, y, t) = \begin{cases} 1 & \text{if } w_j^d(x, y, t) > \tau \\ 0 & \text{otherwise} \end{cases} \qquad (9)$$

The threshold value τ can be determined manually or using an automatic technique. Here we use Otsu's thresholding technique described in (Gonzalez & Woods, 2008). This thresholding technique maximizes the between-class variance and is entirely based on the computation of the image histogram (Gonzalez & Woods, 2008). The final multi-scale motion energy image for each directional sub-band d (MMEI_d) is obtained by logically ORing the motion images obtained at the different resolutions for this sub-band. For each directional multi-scale energy image, the seven Hu moments (Hu, 1962) are computed. The Hu moments are known to give good discrimination of shapes while being translation, scale, mirroring, and rotation invariant (Gonzalez & Woods, 2008).

Multiple energy images can be combined into a Directional Wavelet History Image (DWH). The motion history image is a way of representing the motion sequence in one gray-scale view-based template. The motion history image $H_\tau(x, y, t)$ is computed using an update function $\psi(x, y, t)$ as follows (Ahad, Tan, Kim, & Ishikawa, 2012):

$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } \psi(x, y, t) = 1 \\ \max\left(0,\ H_\tau(x, y, t-1) - \delta\right) & \text{otherwise} \end{cases} \qquad (10)$$

Figure 8. Forming directional wavelet-based templates using 3D SWT


Here (x, y) and t are the location and time, ψ(x, y, t) indicates the presence of motion in the current video image, τ indicates the temporal extent of the movement, and δ is the decay parameter. In the proposed method, when multiple blocks of frames are processed using the 3D SWT, the resulting foreground images can be combined into three directional Wavelet-based History Image (DWH_τ) templates as follows:

$$DWH_\tau^d(x, y, t) = \begin{cases} \tau & \text{if } MMEI_d(x, y, t) = 1 \\ \max\left(0,\ DWH_\tau^d(x, y, t-1) - \delta\right) & \text{otherwise} \end{cases} \qquad (11)$$

where (x, y) and t are the location and time, d is the sub-band, and MMEI_d(x, y, t) is the binary motion mask obtained by the motion detector for this sub-band, i.e., it signals the presence of motion in this direction. A hedged sketch of this thresholding and accumulation step is given below.
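The following minimal sketch, under the same illustrative assumptions as the previous one, shows how the temporal detail coefficients of one sub-band could be thresholded with Otsu's method (Eq. 9), ORed across scales into MMEI_d, and accumulated over successive blocks into a directional history template (Eq. 11). The parameter values and the use of scikit-image's threshold_otsu are assumptions, not the authors' settings.

```python
# Sketch of Eqs. (9) and (11): per-scale Otsu thresholding of |w_j^d|, logical OR across
# scales (MMEI_d), and the decaying history update (DWH_d). Parameter values are illustrative.
import numpy as np
from skimage.filters import threshold_otsu

def mmei(detail_coeffs):
    """detail_coeffs: list of w_j^d arrays (one per scale j) for a single sub-band d."""
    masks = [np.abs(w) > threshold_otsu(np.abs(w)) for w in detail_coeffs]  # Eq. (9)
    return np.logical_or.reduce(masks)                                      # OR over scales

def update_dwh(dwh, motion_mask, tau=255.0, delta=32.0):
    """Eq. (11): set the history to tau where motion is present, otherwise decay by delta."""
    return np.where(motion_mask, tau, np.maximum(0.0, dwh - delta))

# Toy example: two consecutive 8-frame blocks, 3 scales of coefficients per block
rng = np.random.default_rng(0)
dwh = np.zeros((120, 160))
for _ in range(2):
    coeffs = [rng.normal(size=(120, 160)) for _ in range(3)]
    dwh = update_dwh(dwh, mmei(coeffs))
print(dwh.min(), dwh.max())
```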

Feature Extraction

For describing the obtained templates, the seven Hu invariant moments (Hu, 1962) are computed. The Hu moments are known to result in good discrimination of shapes while being translation, scale, mirroring, and rotation invariant (Gonzalez & Woods, 2008). The set of seven invariant moments is defined as follows (Gonzalez & Woods, 2008):

$\phi_1 = \eta_{20} + \eta_{02}$ (12)

$\phi_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2$ (13)

$\phi_3 = (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2$ (14)

$\phi_4 = (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2$ (15)

$\phi_5 = (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]$ (16)

$\phi_6 = (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03})$ (17)

$\phi_7 = (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] + (3\eta_{12} - \eta_{30})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]$ (18)

where $\eta_{pq}$ is the normalized central moment defined as

$\eta_{pq} = \dfrac{\mu_{pq}}{\mu_{00}^{\gamma}}$ (19)

$\mu_{pq}$ is the central moment of order (p + q), with p = 0, 1, 2, … and q = 0, 1, 2, …, and

$\gamma = \dfrac{p + q}{2} + 1$ (20)

for p + q = 2, 3, …. The features extracted from the three directional templates can be combined into a single feature vector by simple concatenation.

Description of Dataset The proposed representation was tested using two datasets (Weizmann, KTH). Weizmann dataset (Blank, Gorelick, Shechtman, Irani, & Basri, 2005) contains 93 video sequences for 10 actions, bend, jack, jump, pjump, run, side, skip, wave1, wave2, and walk. They are performed by 9 persons in front of a static background. Sample frames from the dataset are shown in Figure 9. KTH dataset contains video sequences for six different actions performed by 25 different people in four different scenarios s1, s2, s3, and s4 (Schüldt, Laptev, & Caputo, 2004). The first scenario s1 is “outdoors”, s2 is “outdoors with scale variation”, s3 is “outdoors with different clothes”, and s4 is “indoors”. The sequences have a spatial resolution of 160 x 120 and a frame rate of 25 fps. Each action is performed four times in each sequence. This dataset is considered one of the largest single view datasets with respect to the number of sequences and is widely used by researchers. Different sample frames and scenarios are shown in Figure 10.

Experimental Setup

The human action recognition system consists of a feature extraction step and a classification step. In the first experiment, a feature space is constructed for each directional template using the seven Hu invariant moments. A quadratic discriminant analysis classifier is used to determine the true class of the test patterns. The classifier assumes Gaussian class-conditional densities and does not use prior probabilities or costs for fitting.


Figure 9. Four sample frames of the “jack” action from Weizmann dataset

Blank, Gorelick, Shechtman, Irani & Basri, 2005.

Figure 10. Sample of KTH dataset actions and scenarios

The human action recognition experiments are performed on the Weizmann and KTH datasets. In the Weizmann experiments, 10 actions are considered, each performed by 9 different persons. The correct classification rate (CCR), i.e., the percentage of correctly classified samples of the dataset, is used for evaluation. A hedged sketch of the classification and evaluation step is given below.
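The following sketch uses scikit-learn's quadratic discriminant analysis for the classification step; the synthetic feature matrix stands in for the Hu-moment vectors, and the split sizes are illustrative rather than the evaluation protocol used in the chapter.

```python
# Sketch of the classification step: a quadratic discriminant analysis classifier is fit on
# Hu-moment feature vectors, and the correct classification rate (CCR) is the fraction of
# correctly labelled test samples. The data below are random stand-ins for real features.
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_classes, per_class, n_features = 10, 30, 7
y = np.repeat(np.arange(n_classes), per_class)
X = rng.normal(size=(n_classes * per_class, n_features)) + y[:, None]   # separable toy classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
qda = QuadraticDiscriminantAnalysis()        # Gaussian class-conditional densities
qda.fit(X_tr, y_tr)
ccr = 100.0 * (qda.predict(X_te) == y_te).mean()
print(f"CCR = {ccr:.2f}%")
```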


Experiments and Results

First, the 3D SWT is used to build the directional Wavelet-based Energy Images (WEI). After that, the motion history image is used to represent the motion sequence in one gray-scale view-based template. The coefficients of the three sub-bands (ADD, DAD, DDD) are used to check which one achieves better performance. Figure 11 shows the directional wavelet energy (first row) and history (second row) templates obtained for the "jack" action. Then, the seven Hu invariant moments are computed for both the directional wavelet energy and the history templates. Finally, the feature vector that contains the seven Hu moments of the corresponding directional energy or history template is passed to a quadratic discriminant analysis classifier.

In the first experiment, the discrimination power of the features extracted from single sub-bands was tested. Confusion matrices obtained from classifying the Directional Wavelet Energy Images (WEI) of the Weizmann dataset are shown in Figure 12. The ADD band gave the highest correct classification rate (90.32%), while using the DDD band resulted in high confusion between the different action classes. The classification results for the Weizmann dataset obtained using the proposed directional wavelet history templates are illustrated in Figure 13. Again, the DDD band did not contain enough information to discriminate between the different action classes, while in this case the DAD band recorded the highest classification rate. It can be concluded that adding the history of one more block does not enhance the results significantly.

For the KTH dataset, the same experiment was performed. The results obtained using the motion energy images are shown in Figure 14. It is clear that, after increasing the number of samples per class, the discrimination power of the features of a single sub-band decreased dramatically. The same result was obtained when using the history images.

Figure 11. Directional wavelet energy (first row) and directional wavelet history (second row) templates of the jack action


Figure 12. Confusion matrices obtained using different directional wavelet energy sub-bands on the Weizmann dataset: (a) ADD band, (b) DAD band, and (c) DDD band.

Figure 13. Confusion matrices obtained using different directional history sub-bands using Weizmann dataset (a) ADD band, (b) DAD band, and (c) DDD band

In the second experiment, the features of the three directional templates are combined into a single feature vector by simple concatenation. Figure 15 shows the results obtained for the two studied datasets. The performance obtained for the Weizmann dataset decreased. This may be a result of overfitting caused by the small number of samples and the larger number of features; this problem can be addressed using feature selection. The KTH dataset recorded good performance, with an average CCR of 91.67%.

The performance of the proposed method is compared with some of the state-of-the-art methods. The selected methods are Ballan et al. (Ballan, Bertini, Del Bimbo, Seidenari, & Serra, 2009), Kong et al. (Kong, Zhang, Hu, & Jia, 2011), and Bregonzio et al. (Bregonzio, Xiang, & Gong, 2012). Table 1 presents the comparison. These methods were chosen for comparison because they all share a common property with the proposed method, which is the use of spatio-temporal information. The difference is that the proposed method is a global representation, while the others use local features. The proposed method recorded comparable performance while having the advantage of low computational complexity; a rough complexity analysis shows that the complexity of the method is O(N^2) (Al-Berry, Salem, Hussein, & Tolba, 2014). Other methods have a higher CCR, but at the expense of more complex computations. A second advantage of the proposed method is that wavelets are parallelizable, so the performance can be further enhanced to operate in real time.


Figure 14. Confusion matrices obtained using different directional history sub-bands on the KTH dataset. Left-hand column: ADD band, center column: DAD band, and right-hand column: DDD band

Figure 15. Results obtained using the combined feature vector


Table 1. Performance comparison between the proposed method and state-of-the-art methods

Method | KTH | Weizmann
Ballan et al. [37] | 92.1% | 92.41%
Kong et al. [38] | 88.81% | 92.2%
Bregonzio et al. [39] | 94.33% | 96.66%
Proposed method | 91.67% | 81.72%

CONCLUSION AND FUTURE WORK

Human action recognition is one of the most active fields in computer vision, as many computer vision applications depend on action recognition. These applications include robotics, human-computer interaction, intelligent surveillance, and content-based video retrieval, among others. Although the problem of action recognition is relatively old, it is still immature and not completely solved, because of the number of challenges that face the task in many applications. Examples of these challenges include varying and dynamic backgrounds and illumination, variations in action performance, and variations in recording settings. In this chapter, the authors review basic information needed by researchers who are interested in this topic, including basic definitions, levels of abstraction, and different techniques and taxonomies, with a special focus on the class of spatio-temporal techniques. In addition, a set of directional stationary wavelet-based action representations is proposed. The proposed directional representations are first evaluated individually to test the efficiency of the directional information contained in the templates. Hu invariant moments have been used for action description, in combination with discriminant analysis classification, using the Weizmann and KTH benchmark datasets. The features extracted from all directional templates are also combined into a single feature vector, and the performance is investigated. Preliminary results and a comparison with state-of-the-art methods show that the information in some of the proposed directional representations can be efficient and promising for action classification, while having reasonable complexity. Future work may include extracting local features from the proposed templates and combining them with the global representation. Other types of features can also be extracted from the 3D wavelet coefficient volume for a better classification rate.

REFERENCES Aggarwal, J., & Cai, Q. (1999). Human Motion Analysis: A Review. Computer Vision and Image Understanding, 73(3), 428–440. doi:10.1006/cviu.1998.0744 Aggarwal, J. K., & Xia, L. (2014). Human activity recognition from 3D data: A review. Pattern Recognition Letters, 48, 70–80. doi:10.1016/j.patrec.2014.04.011 Ahad, M., Tan, J., Kim, H., & Ishikawa, S. (2012). Motion history image: Its variants and applications. Machine Vision and Applications, 23(2), 255–281. doi:10.1007/s00138-010-0298-4


Al-Berry, M. N., Salem, M. A.-M., Hussein, A. S., & Tolba, M. F. (2014). Spatio-Temporal Motion Detection for Intelligent Surveillance Applications. International Journal of Computational Methods, 11(1). Ashraf, N., Sun, C., & Foroosh, H. (2014). View invariant action recognition using projective depth. Computer Vision and Image Understanding, 123, 41–52. doi:10.1016/j.cviu.2014.03.005 Babu, R. V., & Ramakrishnan, K. R. (2004). Recognition of human actions using motion history information extrcted from the compressed video. Image and Vision Computing, 22(8), 597–607. doi:10.1016/j. imavis.2003.11.004 Babu, R. V., Suresh, S., & Savitha, R. (2012). Human action recognition using a fast learning fully complex-valued classifier. Neurocomputing, 89, 202–212. doi:10.1016/j.neucom.2012.03.003 Ballan, L., Bertini, M., Del Bimbo, A., Seidenari, L., & Serra, G. (2009). Recognizing Human Actions by fusing Spatio-temporal Appearance and Motion Descriptors. Academic Press. Bao, R., & Shibata, T. (2013). A hardware friendly algorithm for action recognition using spatio-temporal motion patches. Neurocomputing, 100, 98–106. doi:10.1016/j.neucom.2011.12.041 Blank, M., Gorelick, L., Shechtman, E., Irani, M., & Basri, R. (2005). Actions as Space-time Shapes. International conference on Computer Vision ICCV’2005, (pp. 1395-1402). Bobick, A. F. (1997). Movement, activity, and action: The role of knowledge in the perception of motion. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 352(1358), 1257–1265. doi:10.1098/rstb.1997.0108 PMID:9304692 Bobick, A. F., & Davies, J. (2001). The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3), 257–267. doi:10.1109/34.910878 Brdiczka, O., Langet, M., Maisonnasse, J., & Crowley, J. L. (2009). Detecting Human Behavior Models From Multimodal Observations in a Smart Home. IEEE Transactions on Automation Science and Engineering, 6(4), 588–597. doi:10.1109/TASE.2008.2004965 Bregonzio, M., Xiang, T., & Gong, S. (2012). Fusing appearance and distribution information of interest points for action recognition. Pattern Recognition, 45(3), 1220–1234. doi:10.1016/j.patcog.2011.08.014 Chen, D.-Y., & Huang, P.-C. (2011). Motion-based unusual event detection in human crowds. Journal of Visual Communication and Image Representation, 22(2), 178–186. doi:10.1016/j.jvcir.2010.12.004 Chen, M.-Y., & Hauptmann, A. (2009). MoSIFT: Recognizing Human Actions in Surveillance Videos. School of Cegie Mellon Universityomputer Science at Research Showcase, Carnegie Mellon University, Computer Science. Cheng, Z., Qin, L., Huang, Q., Yan, S., & Tian, Q. (2014). Recognizing human group action by layered model with multiple cues. Neurocomputing, 136, 124–135. doi:10.1016/j.neucom.2014.01.019 Cilla, R., Patricio, M. A., Berlanga, A., & Molina, J. M. (2012). A probabilistic, discriminative and distributed system for the recognition of human actions from multiple views. Neurocomputing, 75(1), 78–87. doi:10.1016/j.neucom.2011.03.051


Cristani, M., Raghavendra, R., Del Bue, A., & Murino, V. (2013). Human Behavior Analysis in Video Surveillance: Social Signal Processing Perspective. Neurocomputing, 100, 86–97. doi:10.1016/j.neucom.2011.12.038 Davies, J., & Bobick, A. F. (1997). The representation and recognition of human movements using temporal templates. Proc of IEEE CVPR, (pp. 928–934). doi:10.1109/CVPR.1997.609439 Davis, J. W. (2001). Representing and Recognizing Human Motion: From Moion Templates to Movement Categories. Digital Human Modeling Workshop, IROS 2001. Dollar, P., Rabaud, V., Cottrell, G., & Belongie, S. (2005). Behavior Recognition via Sparse SpatioTemporal Features. International Conference on Computer Communications and Networks. Fatima, I., Fahim, M., Lee, Y.-K., & Lee, S. (2013). A Unified Framework for Activity RecognitionBased Behavior Analysis and Action Prediction in Smart Homes. Sensors (Basel, Switzerland), 13(2), 2682–2699. doi:10.3390/s130202682 PMID:23435057 Freedman, R. G., Jung, H.-T., Grupen, R. A., & Zilberstein, S. (2014). How Robots Can Recognize Activities and Plans Using Topic Models. Academic Press. Gonzalez, R. C., & Woods, R. E. (2008). Digital Image Processing (3rd ed.). Printice Hall. Gorelick, L., Blank, M., Shechtman, E., Irani, M., & Basri, R. (2007). Actions as Space-Time Shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12), 2247–2253. doi:10.1109/ TPAMI.2007.70711 PMID:17934233 Gupta, J. P., Singh, N., Dixit, P., Semwal, V. B., & Dubey, S. R. (2013). Human Activity Recognition using Gait Pattern. International Journal of Computer Vision and Image Processing, 3(3), 31–53. doi:10.4018/ijcvip.2013070103 Hu, M.-K. (1962). Visual Pattern Recognition by Moment Invariants. IEEE Transactions on Information Theory, 8(2), 179–187. doi:10.1109/TIT.1962.1057692 Huang, K., & Tan, T. (2010). Vs-star: A visual interpretation system for visual surveillance. Pattern Recognition Letters, 31(14), 2265–2285. doi:10.1016/j.patrec.2010.05.029 Jargalsaikhan, I., Direkoglu, C., Little, S., & O’Connor, N. E. (2014). An evaluation of local action descriptors fo human action classification in the presence of occulusion. In Multimedia Modeling (pp. 56-67). Jin, R., & Shao, L. (2010). Retrieving Human Actions Using Spatio-Temporal Features and Relevance Feedback. In Multimedia Interaction and Intelligent User Interfaces, Advance in Pattern Recognition (pp. 1-23). Springer-Verlag. doi:10.1007/978-1-84996-507-1_1 Jin, S.-Y., & Choi, H.-J. (2013). Clustering Space-Time Interest Points for Action Representation. Sixth International Conference on Machine Vision (ICMV 2013). Kim, I. S., Choi, H. S., Yi, K. M., Choi, J. Y., & Kong, S. G. (2010). Intelligent Visual Surveillance: A survey. International Journal of Control, Automation, and Systems, 8(5), 926–939. doi:10.1007/ s12555-010-0501-4


Kong, Y., Zhang, X., Hu, W., & Jia, Y. (2011). Adaptive learning codebook for action recognition. Pattern Recognition Letters, 32(8), 1178–1186. doi:10.1016/j.patrec.2011.03.006 Laptev, I., Marszalek, M., Schmid, C., & Rozenfeld, B. (2008). Larning Realistic Human Actions from Movies (pp. 1–8). Computer Vision and Pattern Recognition. Liu, S., Liu, J., Zhang, T., & Lu, H. (2010). Human Action Recognition in Videos Using Hybrid Features. In Advances in Multimedia Modeling, Lecture Notes in Computer Science (Vol. 5916, pp. 411-421). doi:10.1007/978-3-642-11301-7_42 Lowe, D. G. (2004). Distinctive image features from scale invariant key points. International Journal of Computer Vision, 60(2), 91–110. doi:10.1023/B:VISI.0000029664.99615.94 Lv, F., Nevatia, R., & Wai Lee, M. (2005). 3D Human Action Recognition Using Spatio-temporal Motion Templates. Lecture Notes in Computer Science, 3766, 120 - 130. Moeslund, T. B., Hilton, A., & Kruger, V. (2006). A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104(2-3), 90–126. doi:10.1016/j. cviu.2006.08.002 Mokhber, A., Achard, C., & Milgram, M. (2008). Recognition of human behavior by space-time Silhouette characterization. Pattern Recognition Letters, 29(1), 81–89. doi:10.1016/j.patrec.2007.08.016 Moussa, M. M., Hamayed, E., Fayek, M. B., & El Nemr, H. A. (2013). An enhanced method for human action recognition. Journal of Advanced Research. PMID:25750750 Pantic, M., Nijholt, A., Pentland, A., & Huanag, T. S. (2008). Human- Centered Intelligent HumanComputer Interaction (HCI2): How far are we from attaining it? Int. J. Autonomous and Adaptive Communications Systems, 1(2), 168–187. doi:10.1504/IJAACS.2008.019799 Pantic, M., Pentland, A., Nijholt, A., & Huanag, T. (2006). Machine Understanding of Human Behavior. ACM Int’l Conf Multimodal Interface. Pantic, M., Pentland, A., Nijholt, A., & Huanag, T. (2007). Human Computing and Machine Understanding of Human Behavior: A Survey. In Human Computing (pp. 47–71). Springer-Verlag. doi:10.1007/9783-540-72348-6_3 Poppe, R. (2010). A survey on vision-based human action recognition. Image and Vision Computing, 28(6), 976–990. doi:10.1016/j.imavis.2009.11.014 Qian, H., Mao, Y., Xiang, W., & Wang, Z. (2010). Recognition of human activities using SVM multiclass classifier. Pattern Recognition Letters, 31(2), 100–111. doi:10.1016/j.patrec.2009.09.019 Rahman, S. A., Song, I., Leung, M. K., Lee, I., & Lee, K. (2014). Fast action recognition using negative space features. Expert Systems with Applications, 41(2), 574–587. doi:10.1016/j.eswa.2013.07.082 Rapantzikos, K., Avrithis, Y., & Kollias, S. (2007). Spatiotemporal saliency for event detection and representation in the 3D Wavelet Domain: Potential in human action recognition. 6th ACM International Conference on Image and Video Retrieval, (pp. 294-301).

317

 Directional Multi-Scale Stationary Wavelet-Based Representation for Human Action Classification

Rapantzikos, K., Tsapatsoulis, N., Avrithis, Y., & Kollias, S. (2009). Spatiotemporal saliency for video classification. Signal Processing Image Communication, 24(7), 557–571. doi:10.1016/j.image.2009.03.002 Rautaray, S. S., & Agrawal, A. (2012). Real Time Multiple Hand Gesture Recognition System for Human Computer Interaction. International Journal of Intelligent Systems and Applications, 5(5), 56–64. doi:10.5815/ijisa.2012.05.08 Roshtkhari, M. J., & Levine, M. D. (2013). An on-line, real-time learning method for detecting anomalies in videos using spatio-temporal compositions. Computer Vision and Image Understanding, 31(11), 864–876. Sarkar, S., Phillips, P. J., Liu, Z., Vega, I. R., Grother, P., & Bowyer, K. W. (2005). The HumanID Gait Challenge Problem: Datasets, Performance, and Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 72(2), 162–177. doi:10.1109/TPAMI.2005.39 PMID:15688555 Schindler, K., & Gool, L. V. (2008). Action snippets: how many frames does human action recognition require. IEEE Conf. Computer Vision and Pattern Recognition CVPR’08, (pp. 1-8). doi:10.1109/ CVPR.2008.4587730 Sch¨uldt, C., Laptev, I., & Caputo, B. (2004). Recognizing Human Actions: A Local SVM Approach. ICPR. Scovanner, P., Ali, S., & Shah, M. (2007). A 3-dimensional sift descriptor and its application to action recognition. International Conference on Multimedia, (pp. 357–360). doi:10.1145/1291233.1291311 Shao, L., Ji, L., Liu, Y., & Zhang, J. (2012). Human action segmentation and recognition via motion and shape analysis. Pattern Recognition Letters, 33(4), 438–445. doi:10.1016/j.patrec.2011.05.015 Sharif, H., Uyaver, S., & Djeraba, C. (2010). Crowd Behavior Surveillance Using Bhattacharyya Distance Metric. In CompIMAGE 2010 (pp. 311–323). Springer-Verlag. doi:10.1007/978-3-642-12712-0_28 Sharma, A., & Kumar, D. K. (2004). Moments and Wavelets for Classification of Human Gestures Represented by Spatio-Temporal Templates. Academic Press. Sharma, A., Kumar, D. K., Kumar, S., & McLachlan, N. (2002). Representation and Classification of Human Movement Using Temporal Templates and Statistical Measure of Similarity. WITSP’2002, Wollongong, NSW, Australia. Sharma, A., Kumar, D. K., Kumar, S., & McLachlan, N. (2004). Wavelet Directional Histograms for Classification of Human Gestures Represented by Spatio-Temporal Templates. 10th International Multimedia Modeling Conference MMM’04. doi:10.1109/MULMM.2004.1264967 Tharwat, A. (2016). Principal component analysis-a tutorial. International Journal of Applied Pattern Recognition, 3(3), 197–240. doi:10.1504/IJAPR.2016.079733 Thi, T. H., Cheng, L., Zhang, J., Wang, L., & Satoh, S. (2012). Structured learning of local features for human action classification and localization. Image and Vision Computing, 30(1), 1–14. doi:10.1016/j. imavis.2011.12.006

318

 Directional Multi-Scale Stationary Wavelet-Based Representation for Human Action Classification

Turaga, P., Chellappa, R., Subrahamanian, V., & Udrea, O. (2008). Machine Recognition of Human Activities: A Survey. IEEE Transactions on Circuits and Systems for Video Technology, 18(11), 1473–1487. doi:10.1109/TCSVT.2008.2005594 Vishwakarma, S., & Agrawal, A. (2013). A survey on activity recognition and behavior understanding in video surveillance. The Visual Computer, 29(10), 983–1009. doi:10.1007/s00371-012-0752-6 Weinland, D., Ranford, R., & Boyer, E. (2011). A Survey of Vision-Based Methods for Action Representation Segmentation and Recognition. Computer Vision and Image Understanding, 115(2), 224–241. doi:10.1016/j.cviu.2010.10.002 Weinland, D., Ronfard, R., & Boyer, E. (2006). Free Viewpoint Action Recognition using Motion History Volumes. Computer Vision and Image Understanding, 104(2), 249–257. doi:10.1016/j.cviu.2006.07.013 Wu, J., & Trivedi, M. M. (2006). Visual Modules for Head Gesture Analysis in Intelligent Vehicle Systems. IEEE Intelligent Vehicle Symposium (pp. 103-115). Tokyo, Japan: Springer-Berlin. Yilmaz, A., & Shah, M. (2005). Actions as Objects: A Novel Action Representation. CVPR. Zhang, T., Liu, S., Xu, C., & Lu, H. (2011). Boosted multi-class semi-supervised learning for human action recognition. Pattern Recognition, 44(10-11), 2334–2342. doi:10.1016/j.patcog.2010.06.018

319

320

Chapter 15

Data Streams Processing Techniques

Fatma Mohamed Ain Shams University, Egypt

Nagwa L. Badr Ain Shams University, Egypt

Rasha M. Ismail Ain Shams University, Egypt

Mohamed F. Tolba Ain Shams University, Egypt

ABSTRACT Many modern applications in domains such as sensor networks, finance, web logs, and click-streams operate on continuous, unbounded, rapid, time-varying streams of data elements. These applications present new challenges that are not addressed by traditional data management techniques. For query processing over continuous data streams, we consider in particular continuous queries, which are evaluated continuously as data streams arrive. The answer to a continuous query is produced over time, always reflecting the stream data seen so far. One of the most critical requirements of stream processing is fast processing, so parallel and distributed processing are good solutions. This chapter gives (1) an analysis of different continuous query processing techniques; (2) a comparative study of data stream execution environments; and (3) a proposal for an integrated system for processing data streams based on cloud computing, which applies continuous query optimization techniques in a cloud environment.

INTRODUCTION Recently, a new class of data-intensive applications has become widely recognized: applications in which the data are best modeled not as persistent relations but as transient data streams. The continuous arrival of data in multiple, rapid, time-varying, possibly unpredictable and unbounded streams yields fundamentally new research problems. These applications also have inherent real-time requirements, and queries on the streaming data should be finished within their respective deadlines (Kapitanova, Son, Kang & Kim, 2011; Lijie & Yaxuan, 2010). In this context, researchers have proposed


a new computing paradigm based on Stream Processing Engines (SPEs). SPEs are computing systems designed to process continuous streams of data with minimal delay. Data streams are not stored, but are processed on-the-fly using continuous queries. The latter differ from queries in traditional database systems because a continuous query is constantly "standing" over the streaming tuples and results are continuously output. In the last few years, there have been substantial advances in the field of data stream processing, from centralized SPEs to Distributed Stream Processing Engines (DSPEs), which distribute different queries among a cluster of nodes (interquery parallelism) or even distribute different operators of a query across different nodes (interoperator parallelism). However, some applications have reached the limits of current distributed data streaming infrastructures (Gulisano, Jimenez-Peris, Patino-Martinez, Soriente & Valduriez, 2012). Because input rates change continuously, DSPEs need techniques for adjusting resources dynamically as the workload changes; deciding when and how to update resource allocation in response to workload changes is an important issue. Effective algorithms for elastic resource management and load balancing have been proposed, which resize the number of VMs in a DSPE deployment in response to workload demands by taking throughput measurements of each involved VM (Cerviño, Kalyvianaki, Salvachúa & Pietzuch, 2012; Fernandez, Migliavacca, Kalyvianaki & Pietzuch, 2013; Gulisano et al., 2012). Thus, cloud computing has emerged as a flexible platform for facilitating resource management for elastic application deployments at unprecedented scale. Cloud providers offer a shared set of machines to cloud tenants, often following an Infrastructure-as-a-Service (IaaS) model. Tenants create their own virtual infrastructures on top of physical resources through virtualization, and virtual machines (VMs) then act as execution environments for applications (Cerviño et al., 2012). We therefore categorize research challenges in data streams into: 1) Continuous query processing, which focuses on continuous query optimization, how to provide real-time answers to continuous queries, how to process different types of continuous queries, and how to efficiently process multiple continuous queries. 2) Data stream execution environments, where different environments such as parallel, distributed, and cloud environments have been proposed for executing data streams; these exploit parallelism and distribution techniques for fast data stream processing, and virtualization strategies in the cloud to provide an elastic processing environment in response to workload demands. The rest of the chapter is organized as follows: related background is introduced first, followed by efficient processing techniques for continuous queries, including different algorithms for effective continuous query optimization. We then present different execution environments for data streams, which include parallel, distributed, and cloud environments, followed by the related research issues. After that, our proposed system for data stream processing over cloud computing is presented, followed by future research directions. Finally, we present the conclusion.

BACKGROUND Data Streams Data streams are the data generated by many recent applications such as sensor networks, real-time internet traffic analysis, and on-line financial trading. These data have a continuous, unbounded, rapid and time-varying nature, in contrast to the finite stored data sets generated by traditional applications. Thus, traditional database management systems (DBMSs) are not suitable for such


data. To match the nature of data streams, the data stream management system (DSMS) was introduced. DSMSs deal with transient (time-evolving) rather than static data, and answer persistent rather than transient queries.

Data Streams Processing Because of their continuous and time-varying nature, data streams need real-time and continuous processing. For query processing over data streams, continuous queries are considered. A continuous query is evaluated continuously over time as data streams arrive, so it makes sense for users to ask a query once and receive updated answers over time.

Sliding Windows A challenging issue in processing data streams is that they are of unbounded length, so storing the entire stream is impossible. To process unbounded data streams efficiently, the sliding window model is used. In the sliding window model, only the most recent N elements are used when answering continuous queries, where N is the window size, and old data must be removed as time goes on.
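To make the model concrete, the following minimal sketch (in Python, with illustrative names) shows a count-based sliding window that keeps only the most recent N elements and re-emits an updated average whenever a new element arrives — the kind of continuous, incrementally maintained answer described above.

```python
from collections import deque

class SlidingWindowAverage:
    """Count-based sliding window: keeps only the most recent N elements
    and re-emits an updated answer whenever a new element arrives."""

    def __init__(self, n):
        self.n = n              # window size N
        self.window = deque()   # most recent elements, oldest on the left
        self.total = 0.0

    def insert(self, value):
        self.window.append(value)
        self.total += value
        if len(self.window) > self.n:          # expire the oldest element
            self.total -= self.window.popleft()
        return self.total / len(self.window)   # updated continuous answer

# Example: a continuous AVG query over a stream, window size 3
query = SlidingWindowAverage(3)
for tuple_value in [10, 20, 30, 40]:
    print(query.insert(tuple_value))   # 10.0, 15.0, 20.0, 30.0
```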

Continuous Query Optimization Continuous query optimization is based on two factors: time and accuracy. Optimization based on the time factor focuses on minimizing the processing time of continuous, rapid data streams, while optimization based on the accuracy factor focuses on providing accurate results for data stream queries. Most traditional database systems use single-plan optimizers, which select the best query plan for processing all the data based on its overall average statistics. However, this is not efficient for data streams, whose statistics change over time. Because of the disadvantages of using a single plan for data stream processing, some systems instead pre-compute several query plans and choose the best execution plan for the incoming tuples on-the-fly at runtime as their optimization strategy.

CONTINUOUS QUERIES PROCESSING This section presents different algorithms for effective continuous query optimization, which provide real-time answers with high accuracy. It also covers the processing of different types of continuous queries and how to process multiple continuous queries using efficient methods.

Continuous Queries Processing Based on Multiple Query Plans To improve query performance, a query plan optimization and migration strategy based on multiple factors has been proposed. It not only chooses the optimized query plan according to variable data stream characteristics, but also ensures smooth migration between query plans (Li, Wang, Huang & Chen, 2014).


The Query Mesh (QM) framework was proposed by Nehme, Works, Lei, Rundensteiner and Bertin (2013); it computes multiple query plans, each designed for a particular subset of the data with distinct statistical properties. Traditional query optimizers select one query plan for processing all data based on the overall statistics of the data. Because of the nature of data streams and their non-uniform distributions, selecting a single query plan gives ineffective query results, so QM is a good alternative to the data stream processing approaches in the literature. QM improves execution time by up to 44% over the state-of-the-art approaches, but it increases memory usage because of its offline work. The semantic query optimization (SQO) approach was proposed by Ding, Works and Rundensteiner (2011). SQO uses dynamic substream metadata at runtime to find the best query plan for processing the incoming streams. The most important advantage of SQO is that it relies on identifying four herald-enabled semantic query optimization opportunities to reduce query execution cost; thus, SQO does not need a cost model to apply these optimizations. SQO also reduces query execution times by up to 60% compared to the approaches in the literature. Despite these advantages, SQO is not applicable to the semantics of sensor observations, especially in the case of mobile sensors. Lim and Babu (2013) proposed Cyclops to process continuous queries efficiently. Cyclops is a continuous query processing platform that manages windowed aggregation queries in an ecosystem composed of multiple continuous query execution engines. It employs a cost-based approach for picking the most suitable engine and plan for executing a given query. An important advantage of Cyclops is that it executes continuous queries using a combination selected from various execution plan and execution engine choices; it shows the cost spectrum of query execution plans across three different execution engines—Esper, Storm, and Hadoop—and presents an interactive visualization of the rich execution plan space of windowed aggregation queries. Adaptive multi-route query processing (AMR) was proposed in (Works, Rundensteiner & Agu, 2013) for processing stream queries in highly fluctuating environments. AMR dynamically routes the incoming tuples to operators based on up-to-date system statistics instead of processing all incoming tuples with the same fixed plan. An Adaptive Multi-Route Index (AMRI) was also proposed, which employs a bitmap time-partitioned design. The main advantage of AMRI is that it balances efficient processing of continuous data streams against the index overhead; AMRI improves the overall throughput by 68% over the state-of-the-art approach. Table 1 presents a comparison between these four continuous query optimization techniques (QM (Nehme et al., 2013), SQO (Ding et al., 2011), Cyclops (Lim & Babu, 2013) and AMR (Works et al., 2013)): QM, Cyclops, and AMR depend on a cost model, while SQO does not because it relies on the four herald-enabled semantic query optimization opportunities, which ensure reduced query execution cost. QM needs a classifier to test the incoming tuples at runtime, whereas SQO, Cyclops, and AMR do not use a classifier.
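The following toy sketch illustrates the general idea shared by these multi-plan approaches — routing each incoming tuple to one of several pre-computed, logically equivalent plans at runtime. The plans, the classifier rule, and the tuple layout are illustrative assumptions, not the published QM or AMR algorithms.

```python
# Two logically equivalent plans that differ only in predicate evaluation order.
def plan_selective_filter_first(t):
    # Plan 1: apply the highly selective predicate first.
    return t if t["price"] > 100 and t["region"] == "EU" else None

def plan_cheap_filter_first(t):
    # Plan 2: apply the cheap equality predicate first.
    return t if t["region"] == "EU" and t["price"] > 100 else None

def classify(t):
    # Toy classifier: route by a statistical property of the substream.
    return plan_selective_filter_first if t["price"] > 500 else plan_cheap_filter_first

def process(stream):
    for t in stream:
        plan = classify(t)          # assign a plan per tuple at runtime
        result = plan(t)
        if result is not None:
            yield result

stream = [{"price": 900, "region": "EU"}, {"price": 50, "region": "US"}]
print(list(process(stream)))        # [{'price': 900, 'region': 'EU'}]
```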

Continuous Queries with Probabilistic Guarantees Two different techniques were proposed in (Papapetrou, Garofalakis & Deligiannakis, 2012; Zhang & Cheng, 2013) that provide probabilistic guarantees for the answers of continuous queries. A sketching technique (termed ECM-sketch) was proposed in (Papapetrou et al., 2012). It considers the problem of complex query answering over distributed, high-dimensional data streams in the sliding-window model, and allows effective summarization of streaming data over both time-based and count-based sliding windows with probabilistic accuracy guarantees.


Table 1. Comparison between QM, SQO, Cyclops, and AMR

| Query Optimization Technique | Objective | Offline Work | Cost Model | Assigning Plan | Classifier |
| --- | --- | --- | --- | --- | --- |
| QM | Finding multiple query plans, each for a substream | Select best query plans for training data | Uses a cost model to estimate the cost of each plan | Online | Uses a classifier to assign a plan to each incoming tuple |
| SQO | Finding multiple query plans, each for a substream | Identify the semantic query optimization opportunities | No need for a cost model | Online | Does not use a classifier |
| Cyclops | Picking the most suitable engine and plan for the query execution | Select the best plan-engine pair | Uses a cost model to select the best pair | Online | Does not use a classifier |
| AMR | Continuous routing of tuples based on up-to-date statistics | Determine up-to-date statistics | Uses a cost model to find the best index configuration | Online | Does not use a classifier |

Despite these advantages, ECM-sketch does not support uncertain data streams. Zhang and Cheng (2013) proposed the probabilistic filters protocol for probabilistic queries, which considers data impreciseness and provides statistical guarantees on answers. The main advantage of this protocol is that it suits pervasive applications, which deploy many sensor devices, and it reduces the communication and energy costs of those devices. Based on probabilistic tolerance, the probabilistic filters protocol provides accurate answers for continuous queries and reduces resource utilization, which improves the overall performance. The ECM-sketch proposed by Papapetrou et al. (2012) provides answers for complex queries, which were not handled by Zhang and Cheng (2013), and it is the only one of the two that handles high-dimensional data streams. Unlike the probabilistic filters protocol, however, ECM-sketch is not tailored to sensor networks: the probabilistic filters protocol reduces the communication and energy costs of sensor devices in such networks, and it also considers multiple user query requests, which ECM-sketch does not.
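ECM-sketch builds on compact stream summaries in the family of the count-min sketch (extended in the original work with sliding-window structures). As a hedged illustration of the summarization idea only — not of the full ECM-sketch — the following sketch shows a basic count-min summary answering frequency queries with a one-sided, probabilistically bounded error.

```python
import hashlib

class CountMinSketch:
    """Basic count-min sketch: a fixed-size summary that answers frequency
    queries over a stream with a one-sided overestimation error."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # True count <= estimate; the error is bounded with high probability.
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:
    cms.add(ip)
print(cms.estimate("10.0.0.1"))   # 2 (possibly more due to collisions)
```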

Continuous Bounded Queries Processing Different algorithms were proposed in (Bhide & Ramamritham, 2013; Huang & Lin, 2014) for answering continuous bounded queries. Huang and Lin (2014) proposed the continuous Within Query-based (CWQ-based) algorithm and the continuous min–max distance bounded query (CM2DBQ) algorithm for efficient processing of CM2DBQs. “Given a moving query object q, a minimal distance dm, and a maximal distance dM, a CM2DBQ retrieves the bounded objects whose road distances to q are within the range [dm, dM] at every time instant”. A bounded-objects updating mechanism was also proposed for real-time evaluation of continuous query results when objects are updated. Bhide and Ramamritham (2013) proposed the Caicos system for efficiently processing continuous bounded queries over a data stream. The main advantage of Caicos is that it continuously tracks the evolving information of highly dynamic data streams and efficiently provides real-time, up-to-date answers to continuous queries. An iterative algorithm and an efficient dynamic programming-based algorithm


were proposed for identifying the accurate metadata that needs to be updated to provide efficient continuous results. The Caicos system (Bhide & Ramamritham, 2013) considers scheduling multiple jobs on multiple parallel machines, which is not considered in (Huang & Lin, 2014), because Caicos runs in a multiprocessor environment. Concerning accuracy, Caicos depends on accurate metadata to provide accurate results, whereas the algorithms in (Huang & Lin, 2014) do not address it. Real-time processing of distributed data streams in large road networks, on the other hand, is handled only by Huang and Lin (2014).

Continuous Queries Processing Over Sliding Window Join Two approaches were proposed in (Kim, 2013; Qian, Wang, Chen & Dong, 2012) for improving the processing of data streams over sliding window joins. Qian et al. (2012) proposed a hardware co-processor called the Uncertain Data Window Join Special co-Processor (UWJSP) for accelerating join processing of data streams; it processes the window join operation over multiple uncertain data streams. The main advantage of UWJSP is that it greatly increases the processing speed of an Uncertain Data Stream Management System (UDSMS), while providing low cost and high performance. Kim (2013) proposed a structure for sliding window equijoins to process data streams efficiently, based on an alternative hash table organization. The main advantage of this organization is that expired tuples are easily found and discarded, because they are always grouped in the oldest hash tables, and the results confirm that the method improves the overall performance of data stream processing. UWJSP (Qian et al., 2012) handles the processing of multiple uncertain data streams, which is not handled in (Kim, 2013). UWJSP also uses instruction sets to efficiently track changing queries, which (Kim, 2013) does not, so UWJSP provides higher scalability and flexibility than the structure proposed by Kim (2013). However, the sliding window equijoin structure in (Kim, 2013) allocates a hash table for each set of tuples arriving within a sliding window interval rather than for each stream source, which improves the performance of the sliding window join.
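The following minimal sketch (illustrative names and layout, not the published design) shows a symmetric sliding-window equijoin in which tuples arriving in the same window interval are grouped in one hash table, so an entire table can be dropped when its interval falls out of the window — the organization idea attributed to Kim (2013) above.

```python
from collections import defaultdict, deque

class WindowedEquiJoin:
    """Symmetric sliding-window equijoin sketch. Tuples arriving in the same
    window interval are grouped in one hash table, so an entire table can be
    discarded once the interval falls out of the window."""

    def __init__(self, window_intervals):
        self.w = window_intervals
        # one deque of (interval, hash_table) per input stream
        self.state = [deque(), deque()]

    def _expire(self, side, now):
        tables = self.state[side]
        while tables and tables[0][0] <= now - self.w:
            tables.popleft()                      # drop a whole expired table

    def insert(self, side, interval, key, tuple_):
        other = 1 - side
        self._expire(0, interval)
        self._expire(1, interval)
        # probe the other stream's live hash tables
        results = [(tuple_, match)
                   for _, table in self.state[other]
                   for match in table.get(key, [])]
        # build: add the new tuple to this stream's table for its interval
        tables = self.state[side]
        if not tables or tables[-1][0] != interval:
            tables.append((interval, defaultdict(list)))
        tables[-1][1][key].append(tuple_)
        return results

join = WindowedEquiJoin(window_intervals=2)
join.insert(0, 1, "a", ("R", 1))
print(join.insert(1, 1, "a", ("S", 1)))   # [(('S', 1), ('R', 1))]
```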

Continuous Pattern Mining Different algorithms were proposed in (Chen, 2014; Dou, Lin, Kalogeraki & Gunopulos, 2014; Yang, Rundensteiner & Ward, 2013) for continuous pattern mining. Dou et al. (2014) proposed indexing algorithms for efficiently providing real-time answers to historical online queries (constrained aggregate queries, historical online sampling queries, and pattern matching queries) in sensor networks, where the answers are based on the flash storage of the sensor devices. An index construction algorithm was proposed to create multi-resolution indexes, and two techniques called checkpointing and reconstruction were proposed for failure recovery in sensor devices; the results prove the efficiency of the proposed algorithms. Yang et al. (2013) proposed methods for the incremental detection of neighbor-based patterns, in particular density-based clusters and distance-based outliers, over sliding stream windows, because incremental computation for pattern detection continuous queries is an important challenge. The main advantage of these methods is real-time detection of patterns within the stream's sliding window, and a cost model was developed to measure the performance of the proposed strategies.


A data structure called the sliding window top-k pattern tree (SWTP-tree) was proposed by Chen (2014) for efficiently mining frequent patterns in a sliding window. The SWTP-tree scans the data stream in the sliding window continuously, and an improved FP-growth mining algorithm generates the top-k frequent patterns from the SWTP-tree, returning them in descending order of frequency. Constrained aggregate queries and historical online sampling queries are handled only in (Dou et al., 2014), and only the algorithms in (Yang et al., 2013) are suited to traffic monitoring. Yang et al. (2013) and Chen (2014) treat continuous pattern mining in more depth than Dou et al. (2014).

Continuous Top-k Queries Processing Sandhya and Devi (2013) describe how to efficiently answer the top-k dominating query, which selects the k data objects that dominate the highest number of objects in a data set. The Improved Event Based Algorithm (IEVA) was proposed to answer this query efficiently; IEVA uses event scheduling and rescheduling to avoid examining every point for inclusion in the top-k. Li et al. (2014) proposed an approach called the Voronoi Diagram based Algorithm (VDA) for efficiently answering the top-n bi-chromatic reverse k-nearest neighbor (BRkNN) query, together with a method for finding the candidate region of a BRkNN query and a filter-refinement based algorithm; their results confirm that the proposed algorithms improve the overall performance of answering top-n BRkNN queries. A distributed quantile filter-based algorithm was proposed in (B. Chen, Liang & Yu, 2014) to answer top-k queries in a wireless sensor network so as to maximize the network lifetime, and an online algorithm was proposed for answering time-dependent top-k queries with different values of k. Comparing the approaches in (B. Chen et al., 2014; H. Chen, 2014; Li et al., 2014; Sandhya & Devi, 2013): the safe interval used for obtaining the top-k data points is computed only in (Sandhya & Devi, 2013). This safe interval speeds up the query response time because the top-k data points are obtained without examining all points for inclusion in the top-k, and there is no need to build the Voronoi Diagram (VD) as in (Li et al., 2014) or the SWTP-tree as in (H. Chen, 2014); updating the safe intervals is also easier than updating nodes in (H. Chen, 2014; Li et al., 2014). In (Li et al., 2014) the Voronoi Diagram is needed to answer the BRkNN query without computing the BRkNN answer set for every data point. Li et al. (2014) address k-nearest-neighbor answers, which the other approaches do not, so their algorithms outperform the rest in location-based applications. The quantile filter-based algorithm in (B. Chen et al., 2014) is well suited to wireless sensor networks; it is the only one that addresses maximizing the network lifetime.
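For reference, the sketch below shows the semantics of a continuous top-k query over a count-based sliding window using naive re-evaluation on every arrival; the surveyed techniques exist precisely to avoid this full re-examination (e.g., via safe intervals), but the answer they must produce is the same. Names and the toy stream are illustrative.

```python
import heapq
from collections import deque

def continuous_top_k(stream, k, window_size):
    """Naive continuous top-k over a count-based sliding window: re-evaluate
    the answer whenever a tuple arrives (old tuples expire automatically)."""
    window = deque(maxlen=window_size)     # expired tuples drop off the left
    for score, obj in stream:
        window.append((score, obj))
        yield heapq.nlargest(k, window)    # updated top-k answer

stream = [(5, "a"), (9, "b"), (1, "c"), (7, "d")]
for answer in continuous_top_k(stream, k=2, window_size=3):
    print(answer)
# final window holds b, c, d -> [(9, 'b'), (7, 'd')]
```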

Continuous Nearest Neighbor Queries Processing Wang et al. (2014) proposed an effective filtering-and-refinement framework for efficiently answering continuous visible k nearest neighbor (VkNN) queries. “A VkNN query retrieves k objects that are visible and nearest to the query object”. Two pruning algorithms, safe region based pruning and invisible time period based pruning, were proposed to reduce the search space for query processing; the results confirm that these pruning techniques provide high efficiency.


A two-dimensional index structure called the bit-vector R-tree (bR-tree) was proposed by Jung, Chung and Liu (2012) for efficiently processing generalized k-nearest neighbor (GkNN) queries, not only for spatial but also for non-spatial data objects in wireless networks, together with a search algorithm for efficiently pruning the search space of the bR-tree. Their results showed the efficiency of the bR-tree in terms of energy consumption, latency, and memory usage. In (Yi, Ryu, Son, & Chung, 2014) a method was proposed for efficiently answering continuous view field nearest neighbor queries. “Given the view field and location of a user, the view field nearest neighbor query retrieves a data object that is nearest to the user’s location and falls within the user’s view field”. These queries are processed in two phases, an initial phase and an update phase: the Naive Exploration Algorithm and the Fan-shaped Exploration Algorithm were proposed for the initial phase, and the Fan-shaped Monitoring Algorithm for the update phase. An approach was proposed in (Cho, 2013) for efficiently answering continuous range k-nearest neighbor (CRNN) queries in vehicular ad hoc networks. Three algorithms, decide current valid interval, decide next split point, and find next split point, efficiently decide the current valid interval of the results. The main advantages of this approach are minimizing network bandwidth usage, reducing the computational cost, and decreasing local memory usage. Elmongui, Mokbel and Aref (2013) proposed algorithms to answer continuous aggregate nearest neighbor (CANN) queries for moving objects in spatio-temporal systems: a holistic algorithm (H-CANN) computes the query results, and a progressive algorithm (P-CANN) improves the performance of answering CANN queries. The aggregate nearest neighbor query in uncertain graphs (UG-ANN) was proposed by Liu, C. Wang and J. Wang (2014) for answering ANN queries over uncertain graphs, making it suitable for applications such as social network analysis and road network monitoring; two pruning algorithms, structural pruning and instance pruning, improve the performance of UG-ANN. In (D. Zhang, Chow, Li, X. Zhang & Xu, 2013) a server-side spatial mashup framework called SMashQ was proposed to answer k-NN queries in time-dependent road networks. This framework addresses the problem of collecting real-time traffic data from vehicles or roadside sensors in road networks; its main advantage is that it enables a database server to access route and travel time information from external Web mapping services such as Google Maps. A greedy object grouping algorithm was also proposed to reduce the number of external web mapping requests, and SMashQ can scale up to a large number of objects. B. Wang, Qu, X. Wang, G. Wang and Kitsuregawa (2013) proposed a two-layer index structure called the virtual grid quadtree with Voronoi diagram (VGQ-Vor) to answer mobile k nearest neighbor (MkNN) queries. This structure relies on the Voronoi diagram for efficient processing of MkNN queries, and a moving kNN query algorithm based on VGQ-Vor was proposed that checks mobile objects in the regions neighboring the query object.
Fangzhou, Guohui, Li, Xiaosong and Cong (2013) proposed methods based on the Voronoi Diagram for Uncertain Data (UV-Diagram) to answer probabilistic nearest neighbor queries over uncertain data objects via wireless data broadcast (BPNN). UV-Hilbert-Partition was proposed to partition the UV-Diagram into several grid cells, along with a method for organizing the UV-Diagram's cells, and a distributed index named UVHilbert-DI was proposed to handle BPNN query processing.


Concerning data stream processing in emerging applications, only the framework in (Wang et al., 2014) is suitable. Only the algorithms proposed by Fangzhou et al. (2013), Li et al. (2014), and B. Wang et al. (2013) depend on the Voronoi diagram for answering nearest neighbor queries. The bR-tree proposed by Jung et al. (2012) handles both spatial and non-spatial specifications of data objects, which the other approaches do not, and the algorithms in (Fangzhou et al., 2013; Jung et al., 2012) are the only ones that address data stream processing in wireless broadcasting systems. Only the algorithms in (Fangzhou et al., 2013; Liu et al., 2014) consider uncertain data streams, and only (Yi et al., 2014) considers data stream processing in range-based applications such as augmented reality systems, tour guide systems, and CCTV-based surveillance systems. The algorithms in (Cho, 2013) outperform the other techniques in processing data streams in vehicular ad hoc networks, because the other techniques do not consider the requirements of these networks. Aggregate nearest neighbor queries are addressed only by Elmongui et al. (2013) and Liu et al. (2014), and road networks are addressed only in (D. Zhang et al., 2013).

Continuous Queries Processing Based on Tree Structures Jung, Kim and Chung (2014) proposed a query indexing structure called the Query Region tree (QR-tree) for processing continuous range queries (CRQs) over moving objects. “A CRQ is the query which continuously retrieves the moving objects that are currently within a given query region of interest”. One advantage of the QR-tree is that it reduces how often a moving object must contact the server to receive new resident domains or to update the query results. Different algorithms were proposed in (Tang, Zhou, Niu & Wang, 2014) for efficiently answering continuous region queries in sensor networks: an ECH-tree generation algorithm reduces the energy consumption in sensor networks, and a time-correlated region query method answers continuous region queries efficiently; the results confirm the efficiency of these methods. In (Baldoni, Bonomi, Cerocchi & Querzoni, 2013) an overlay network topology called the Virtual Tree (VT) was proposed for efficiently answering interval-valid aggregation queries in large-scale dynamic networks, together with a distributed query protocol called the overlay management protocol (OMP) that guarantees connectivity and path stability in the VT. The Merkle Skyline R-tree and Partial S4-tree were proposed by Lin, Xu, Hu and Lee (2013) for efficiently answering location-based arbitrary-subspace skyline queries (LASQs), which consider both spatial and non-spatial attributes, and a prefetching-based approach provides efficient one-shot authentication of LASQs in outsourced databases and reduces the server processing time. Comparing the tree-based solutions in (Baldoni et al., 2013; Chen, 2014; Jung et al., 2012; Jung et al., 2014; Lin et al., 2013; Tang et al., 2014): non-spatial specifications of data objects are handled in (Jung et al., 2012; Jung et al., 2014; Lin et al., 2013) but not in the rest, and only the techniques in (Jung et al., 2012; Jung et al., 2014; Lin et al., 2013; Tang et al., 2014) are suitable for location-based services. The bR-tree in (Jung et al., 2012) is the only one adaptable to wireless broadcasting systems, and data stream processing in peer-to-peer systems is addressed only in (Baldoni et al., 2013), whose techniques are the best of this group at answering aggregation queries. Query authentication is addressed only by Lin et al. (2013), and only the structure in (Chen, 2014) considers pattern mining.


Continuous Skyline Queries Processing Nagendra and Candan (2013) show how skyline-window-join (SWJ) queries over pairs of data streams are computed. Given a set of data objects, the skyline query returns the objects that are not dominated by any other object. A Layered Skyline-window-Join (LSJ) operator was proposed that partitions the overall process into processing layers; LSJ outperforms existing approaches, which are not designed to eliminate redundant work across multiple processing layers. A two-phase approach for continuous skyline monitoring in two-tier streaming settings was proposed in (Lu, Zhou & Haustad, 2013). In the initialization phase, the initial query result is obtained by correctly merging the local skylines from all data sites; then all data tuples are categorized with respect to their membership in the local and global skylines. One of the main advantages of this approach is that it minimizes the bandwidth consumption between the server and the data sites. The k-Skyband algorithm was proposed by Gao, Miao, Cui, Chen and Li (2014) for efficiently answering k-skyband (kSB) queries on incomplete data; the kSB query differs from the traditional skyline query in that some dimensional values are missing. The method is based on three concepts, expired skyline, shadow skyline, and thickness warehouse, to improve its results, and constrained skyline (CS) and group-by skyline (GBS) queries over incomplete data are also handled. Huang, Chang and Lee (2012) proposed algorithms for efficiently answering two distance-based skyline queries, the continuous de-skyline query (Cde-SQ) and the continuous k nearest neighbor skyline query (Cknn-SQ), in road networks: the Cde-SQ and Cde-SQ+ algorithms answer Cde-SQ, and the Cknn-SQ and Cknn-SQ+ algorithms answer Cknn-SQ. In (Lin, Xu & Hu, 2013) index-based (I-SKY) and non-index-based (N-SKY) algorithms were proposed for efficiently answering range-based skyline queries, together with incremental versions of both algorithms that avoid re-computing the query results, saving computation cost for highly dynamic data streams; efficient methods for computing the valid scope of each skyline object were also proposed. A sliding window skyline model was proposed by Ding, Lian, Chen and Jin (2012) for efficiently answering probabilistic skyline queries over uncertain data streams, along with a candidate list approach that efficiently determines the candidate skylines, which may become skylines in the future, and an enhanced refinement strategy that reduces the computation cost and provides more accurate results. A filtering algorithm called FSKY was proposed in (Yin, Lin, Yu & Luo, 2014) to efficiently answer skyline queries for anti-correlated and clustered databases in sensor networks, together with a scheme for data cluster representation and a sampling method that reduces the communication cost and saves energy; despite these advantages, approximate skyline and subspace skyline queries are not handled. The SWJ approach in (Nagendra & Candan, 2013) handles continuous skyline monitoring over multiple data streams, which none of the other skyline processing techniques mentioned above handles, and layering the sliding windows to eliminate redundant work across consecutive windows is considered only in (Nagendra & Candan, 2013).
Only the algorithms in (Gao et al., 2014; Lu et al., 2013; Yin et al., 2014) address skyline monitoring in a distributed environment such as a wide-area sensor network, and two-tier streaming processing is handled in (Lu et al., 2013; Yin et al., 2014) but not in the rest. Skyline computation over incomplete data, constrained skylines, and group-by skylines are handled


only by Gao et al. (2014). Answering skyline queries for anti-correlated and clustered databases is handled only in (Yin et al., 2014), and only the algorithms proposed by Huang et al. (2012) are suitable for road networks. Uncertain data processing is treated only by Ding et al. (2012), while probabilistic skyline computation is considered only in (Ding et al., 2012; Lin et al., 2013). The algorithms by Lin et al. (2013) are the ones suitable for location-based services, and both spatial and non-spatial attributes of data are addressed only by Lin et al. (2013).
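The dominance test underlying all of the skyline variants above can be summarized as follows; the block-nested-loops formulation shown here is a baseline sketch, not any of the surveyed algorithms, and the example data are illustrative.

```python
def dominates(p, q):
    """p dominates q if p is at least as good in every dimension and strictly
    better in at least one (here, smaller values are better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Block-nested-loops skyline: keep the points not dominated by any other."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Example: (price, distance) pairs; both attributes are to be minimized.
hotels = [(50, 8), (60, 5), (70, 9), (45, 10)]
print(skyline(hotels))   # [(50, 8), (60, 5), (45, 10)] -- (70, 9) is dominated
```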

Multi Continuous Queries Processing To provide practical solutions for matching highly dynamic data streams with multiple, dynamically-updated continuous queries, a stream processing system should support incremental evaluation over new data and query optimization for continuous queries, including computation sharing among multiple queries. Multi-query optimization has been shown to be a powerful technique for improving the efficiency of query processing, by exploiting the overlap between queries in terms of shared operators and streams (Kalyvianaki, Wiesemann, Vu, Kuhn & Pietzuch, 2011). Ray, Madria and Linderman (2012) showed how multiple queries can be represented in the form of an operator tree, such that their commonalities can be easily exploited for multi-query plan generation; the operator tree shows how the plans from different queries can be merged. Multiple queries on the same or related data streams may share their processing to reduce processing, storage, and communication costs. Continuous queries may be complex and involve various operators, and not all of these operators are supported at every node in the operator tree. Consider blocking operators such as join or count: they require a set of stream tuples to be stored before the operation can be applied, and some nodes may not have enough storage capacity to process them; this is an important problem for the operator tree approach. A Field Programmable Gate Array (FPGA)-based real-time data analytics platform was proposed by Sadoghi et al. (2012). An FPGA is especially powerful in exploiting parallelism because any form of parallel execution can be directly mapped to logic circuits in hardware, and FPGAs can provide the elasticity required to scale out to meet increasing throughput demands. The FPGA approach exploits the overlapping components among the given query plans to further improve resource utilization and generates a global Select-Project-Join (SPJ) query plan; it therefore outperforms the operator tree approach (Ray et al., 2012) when dealing with join operators. T. Chen, L. Chen, Ozsu and Xiao (2013) show how to provide real-time responses for multiple top-k queries. Sharing the results of queries is a key factor in saving computation cost and providing real-time responses; they rely on the frequency, which specifies upper bounds on the re-execution intervals of queries, to share the results of multiple top-k queries. In (Park & Lee, 2012) the adaptive Sharing-based Extended Greedy Optimization Approach (A-SEGO) was proposed. A-SEGO can be used to produce an optimized global execution plan for multiple continuous queries; it uses a cost model to determine when the current optimized global plan is no longer efficient, and in that case it generates a newly optimized plan in a timely manner. In (T. Chen et al., 2013; Park & Lee, 2012; Ray et al., 2012; Sadoghi et al., 2012) four sharing-based techniques for multiple query optimization were thus proposed (operator tree, FPGA, A-SEGO, and frequency-based top-k). Table 2 presents a comparison between the four techniques.


Table 2. Comparison between operator tree, FPGA, A-SEGO, and frequency-based top-k optimization

| Optimization Method | Query Type | Sharing Strategy | Output | Cost Based |
| --- | --- | --- | --- | --- |
| Operator tree | Select-project queries | Share queries processing | Single global query plan | Non-based |
| FPGA | Select-project-join queries | Share queries processing | Single global query plan | Non-based |
| A-SEGO | Multi-way join queries | Share common join operations results | Single global query plan | Based |
| Frequency based | Top-K queries | Share queries intermediate results | Single global query plan | Non-based |

Operator tree, FPGA, and A-SEGO share the processing of multiple queries, but FPGA and A-SEGO outperform the operator tree approach when dealing with join operators. The frequency-based approach shares the intermediate results of top-k queries that have the same frequency. A-SEGO depends on a cost model, whereas the others do not. FAst Skyline compuTation for multiple queries (FAST) was proposed by Y. Lee, K. Lee and Kim (2013) for processing multiple continuous skyline queries over a data stream. FAST uses a filtering technique that can discard early any object that will not be a member of any future skyline of the continuous queries, and a discriminant that can efficiently determine which objects in memory are skyline objects for which queries.
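The sketch below illustrates the basic sharing idea behind these multi-query optimizers: a predicate common to two continuous queries is evaluated once per tuple and its result is reused by both. The queries, predicates, and tuple layout are illustrative assumptions, not any of the published plans.

```python
def shared_selection(t):
    return t["symbol"] == "ACME"          # operator common to both queries

def query1(t):                            # Q1: ACME trades above 100
    return t if t["price"] > 100 else None

def query2(t):                            # Q2: large ACME trades
    return t if t["volume"] > 1000 else None

def process(stream):
    for t in stream:
        if not shared_selection(t):       # evaluated once, result shared
            continue
        for name, q in (("Q1", query1), ("Q2", query2)):
            r = q(t)
            if r is not None:
                yield name, r

trades = [{"symbol": "ACME", "price": 120, "volume": 500},
          {"symbol": "XYZ", "price": 200, "volume": 5000}]
print(list(process(trades)))              # only the ACME tuple reaches Q1
```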

DATA STREAMS EXECUTION ENVIRONMENTS This section presents different execution environments for data streams that achieve high performance and low latency. It also presents elastic stream processing algorithms and dynamic load balancing techniques.

Parallel and Distributed Stream Processing One of the most critical requirements of query processing in data stream systems is fast processing, so parallel processing of queries using multiple processing units is a good solution. Achieving data parallelism across multiple processing nodes requires partitioning the input stream of an operator into distinct and independent sub-streams that the nodes can process concurrently (Safaei, Sharifrazavian, Sharifi & Haghjoo, 2012; Backman, Fonseca & Çetintemel, 2012). Distributed stream processing systems must also function efficiently for data streams that fluctuate in arrival rates and data distributions. In a Distributed Data Stream Management System (DDSMS), the task of data stream processing is distributed over several processing nodes, and the results are assembled through the cooperation of these nodes. Thus, a DDSMS achieves higher performance in terms of supported data stream rates, better scalability, and a larger number of concurrent queries (Kalyvianaki et al., 2011; Shan, Xuejiao, Li & Lizhen, 2012).
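A minimal sketch of such stream partitioning is shown below: tuples are hash-partitioned on a key so that each node receives a distinct, independent sub-stream. The number of nodes, the key choice, and the tuple layout are illustrative assumptions.

```python
from collections import defaultdict

NUM_NODES = 4

def partition(tuple_, key):
    # Hash-partition on the grouping key so that all tuples with the same key
    # land on the same node, keeping per-key state local to that node.
    return hash(tuple_[key]) % NUM_NODES

sub_streams = defaultdict(list)
stream = [{"user": "u1", "clicks": 3}, {"user": "u2", "clicks": 1},
          {"user": "u1", "clicks": 7}]
for t in stream:
    sub_streams[partition(t, "user")].append(t)

# Each node now processes its own disjoint, independent sub-stream.
for node, tuples in sub_streams.items():
    print(node, tuples)
```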

Data Streams Processing Based on the MapReduce Framework The MapReduce framework has been introduced as a scalable and fault-tolerant data processing framework that enables the processing of a massive volume of data in parallel on clusters of horizontally scalable


commodity machines. However, it is not adequate for supporting real-time stream processing tasks (Aly et al., 2012; Bedini, Sakr, Theeten, Sala & Cogan, 2013). Main-Memory MapReduce (M3) was introduced by Aly et al. (2012) as a framework for parallel computing in which continuous queries over streams of data can be answered efficiently with high scalability, high performance, and short response time. M3 extends Hadoop by supporting continuous execution of the Map and Reduce phases, where individual Mappers and Reducers never terminate. The MapUpdate framework was proposed in (Lam et al., 2012) for parallel and distributed processing of data streams. MapUpdate is an extension of the MapReduce framework: it enables the developer to write a few functions, which are then automatically executed over a cluster of machines, and it achieves low latency and high scalability. The main drawback of MapUpdate is that it does not efficiently provide dynamic partitioning of the incoming data streams.
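The following single-process toy sketch illustrates the map/update style of processing that these streaming MapReduce extensions adopt — mappers emit key-value pairs continuously and an update function folds each pair into long-lived per-key state instead of waiting for a finite input to finish. It is not the actual M3 or MapUpdate API, just an illustration of the idea.

```python
state = {}                                  # long-lived per-key state

def map_fn(line):
    for word in line.split():
        yield word, 1

def update_fn(key, value):
    state[key] = state.get(key, 0) + value  # incremental; never "terminates"
    return key, state[key]

def process(stream_of_lines):
    for line in stream_of_lines:            # unbounded in a real deployment
        for key, value in map_fn(line):
            yield update_fn(key, value)

for answer in process(["to be or not to be"]):
    print(answer)                            # ('to', 1) ... ('be', 2)
```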

Data Streams Processing Based on the Query Mega Graph (QMG) Safaei et al. (2012) introduced the Dynamic Tuple Routing (DTR) algorithm for parallel processing of continuous queries in a multiprocessing environment. In DTR, each input data stream tuple is processed through the shortest path in the Query Mega Graph (QMG) (Safaei & Haghjoo, 2010). Table 3 presents a comparison between the M3, MapUpdate, and QMG frameworks (Aly et al., 2012; Lam et al., 2012; Safaei et al., 2012). The three frameworks provide parallel processing for data streams and answer queries efficiently with high scalability, high performance, and short response time. M3 and MapUpdate extend the MapReduce framework to process continuous streams, whereas QMG processes data streams over the Query Mega Graph.

Data Streams Processing Based on Task Graph Structure Two replication methodologies, the Data Parallel Replication Mechanism (DPRM) and the Task Copy Replication Mechanism (TCRM), were proposed in (Guirado, Roig & Ripoll, 2013) to efficiently improve the throughput of parallel and distributed data stream processing. The main advantage of DPRM is that it applies concurrent processing to different pieces of the same data, which reduces computation time and decreases latency.

Table 3. Comparison between M3, MapUpdate, and QMG

| Framework | Tuples Execution | Dividing Tuples Decisions over Execution Units | Number of Processing Units for Each Tuple | Routing through Processing |
| --- | --- | --- | --- | --- |
| M3 | Through a group of mappers and reducers | Based on rate split | One mapper and one reducer | Through RMI |
| MapUpdate | Through a group of mappers and updaters | Using a deterministic tie-breaking procedure | One mapper and one updater | Through a deterministic tie-breaking procedure |
| QMG | Through K logical machines | Based on a cost model | One for each step | Through the processing node itself |


Ajwani et al. (2013) proposed a graph-based framework for generating the task graphs used in data stream processing. Graph generation techniques were proposed to generate directed graphs with a specified degree distribution, and one of the most important advantages of these techniques is the correct generation of the undirected graphs. In (Guirado et al., 2013) data and task parallelism are considered to improve the throughput, which is not considered in (Ajwani et al., 2013), and task replication is handled only by Guirado et al. (2013). However, the techniques in (Ajwani et al., 2013) are more suitable for generating large streaming task graphs than those in (Guirado et al., 2013), because they generate them in a reasonable time.

Data Streams Processing in Distributed Traffic Networks Anceaume and Busnel (2014) proposed the AnKLe distributed algorithm to efficiently process data streams in distributed systems. It performs online estimation of the similarity between observed data streams and expected ones, to detect in real time the presence of intrusions in network traffic. One of the main advantages of the AnKLe algorithm is that its estimates come with guaranteed error bounds while requiring little storage and processing capacity. A framework built on Discretized Streams was proposed by Hunter, Das, Zaharia, Abbeel, and Bayen (2013) to provide scalable traffic estimation for large-scale data streams; to this end, an online Expectation Maximization (EM) algorithm was proposed. The main advantage of the EM algorithm is that it efficiently computes travel time distributions of traffic through incremental online updates; it was validated with a large dataset of GPS traces, scales to very large road networks, and can update the traffic state in a few seconds. In (Anceaume & Busnel, 2014; Hunter et al., 2013) two different frameworks were thus proposed for data stream processing in distributed systems. Both are suitable for traffic networks, but the EM approach proposed by Hunter et al. (2013) is suitable for estimating traffic on a very large city network, whereas AnKLe (Anceaume & Busnel, 2014) provides more accurate results because its approximations come with guaranteed error bounds, and it also decreases memory usage.

Controlling Streams Processing Under Overload Conditions Shan et al. (2012) proposed the improved Pair-Wise Algorithm to adjust the load of processing nodes in a DDSMS and balance the load among all the computing nodes. The Pair-Wise Algorithm is composed of two main steps: 1) initial distribution and 2) improved load balancing; based on the initial distribution, the improved Pair-Wise algorithm adjusts the load of the processing nodes. In (Lei, Rundensteiner & Guttman, 2013) Robust Load Distribution (RLD) was proposed to provide robust query processing performance in distributed processing environments. RLD provides ρ-optimal query performance under load fluctuations without suffering the performance penalty caused by load migration. A co-scheduling framework based on fuzzy logic was proposed in (Cao, Zhang & Tan, 2012) for processing data streams. A dynamic control method is proposed whose main advantage is avoiding resource shortage as well as overprovisioning: fuzzy logic control is applied so that CPUs can be co-scheduled and co-allocated for data stream processing, and an iterative algorithm allocates bandwidth by closely watching the processing and storage status.


A streaming warehouse model was proposed by Golab, Johnson and Shkapenyuk (2012) for update scheduling in real-time data stream warehouses. This model combines the main features of traditional data warehouses with the real-time, continuous processing of data stream systems, and a scheduling algorithm was proposed to solve the streaming warehouse update problem. In (Tang & Gedik, 2013) an auto-pipelining solution for data stream processing was proposed. It takes advantage of multicore processors to improve the throughput of streaming applications in an effective and transparent way: it provides good utilization of resources by dynamically finding and exploiting sources of pipeline parallelism in streaming applications, using a base optimization algorithm. Cho, Tsai, Chiu and Yang (2014) proposed a scheduling algorithm named power and deadline-aware multicore scheduling (PDAMS) for real-time processing in multicore systems. Its advantages include workload balancing and energy saving in multicore systems, and each processor core in PDAMS can manage itself; despite these advantages, PDAMS considers neither static power consumption nor the overhead of scaling voltage and frequency. In (Cao et al., 2012; Cho et al., 2014; Golab et al., 2012; Lei et al., 2013; Shan et al., 2012; Tang & Gedik, 2013) different algorithms were thus proposed to avoid the system failures and performance degradation that occur under overload conditions. All of them except RLD depend on dynamic updating and scheduling for real-time workload balancing, whereas RLD relies on providing a robust physical solution as its control methodology; RLD is also the only one that processes data streams over multiple logical plans. Efficient parallel processing for good resource utilization is handled only in (Cho et al., 2014; Tang & Gedik, 2013).

Data Streams Processing Based on Cloud Environment Cloud computing offers an elastic infrastructure that distributed stream processing systems (DSPSs) can use to obtain resources on demand, but an open problem is to provide a scalable and elastic stream processing engine for processing large data stream volumes. In (Gulisano, 2012; Gulisano et al., 2010, 2012) a transparent query parallelism technique called “StreamCloud” was presented, whose elastic protocols exhibit low intrusiveness and enable effective adjustment of resources to the incoming load. In StreamCloud, users express regular continuous queries that are automatically parallelized. Fernandez et al. (2013) describe an integrated approach for stateful operator scale-out and recovery called “Fault-Tolerant Scale Out”. It scales out the number of cloud-hosted machines on demand, parallelizing operators when the workload increases, and efficiently recovers from resource failures. To support dynamic scale-out, the algorithm adds a bottleneck detector based on system statistics to identify the bottleneck operators in the query; to support fault tolerance, it adds a failure detector to recover a failed operator. Cervino et al. (2012) propose an adaptive algorithm, called “Adaptive Cloud Stream Processing”, that resizes the number of VMs in a DSPS deployment in response to the input stream rates. It maintains low latency at a given throughput while keeping VMs operating at their maximum processing capacity; the algorithm is invoked periodically and calculates the new number of VMs needed to support the current workload demand. In (Saleh, Gropengieβer, Betz, Mandarawi & Sattler, 2013) a Complex Event Processing (CEP)-based resource monitoring framework for data stream processing on the cloud is developed, which continuously monitors resource utilization and manages and adjusts these resources in real time to meet the SLAs while


not overprovisioning resources. It controls hosts' status automatically according to user-specified rules and metrics, such as increasing or decreasing the amount of memory, the number of cores, and the disk resources of VMs (scaling up and down). Auto-scaling techniques were proposed by Heinze, Pappalardo, Jerzak and Fetzer (2014) for elastic data stream processing; they address the problem of finding the right time to scale resources in or out, which is one of the major challenges for elastic systems, and aim to maximize system utilization while guaranteeing a low end-to-end latency. Hu, Jiang, Liu and Wang (2014) proposed a multi-step-ahead load forecasting method to adjust the number of resources on demand for data stream processing on the cloud. Because of the complex and dynamic characteristics of cloud computing, this method is based on statistical learning and support vector regression (SVR) to provide an efficient solution that adapts to the cloud environment. In (Kailasam, Gnanasambandam, Dharanipragada & Sharma, 2013), an autonomic cloud bursting approach was proposed for optimizing ordered throughput for near real-time, data-intensive, independent computations. Three scheduling heuristics were proposed as part of this approach; they optimize ordered throughput despite changes in workload characteristics, variation in bandwidth, and available resources. Based on the workload characteristics, the autonomic cloud bursting approach can provision an appropriate number of resources in the external cloud, because it determines the number of instances that can be optimally utilized there; the main advantage of cloud bursting is the efficient execution of online data stream workloads across multiple clouds. An elastic auto-parallelization solution for data stream processing was proposed by Gedik, Schneider, Hirzel and Wu (2013). It dynamically adjusts the number of channels used for processing to achieve high throughput without unnecessarily wasting resources, using a control algorithm that periodically re-evaluates the number of channels based on local run-time metrics it maintains. In (Dahiphale et al., 2014), Cloud MapReduce (CMR) was proposed. CMR overcomes the limitations of traditional MapReduce in that it supports data stream processing and uses a pipeline between the Map and Reduce phases; this pipelined MapReduce approach increases parallelism between the two phases. In addition, CMR takes advantage of cloud computing and supports flexible pricing using Amazon Cloud's spot instances. In (Aly et al., 2012; Dahiphale et al., 2014; Lam et al., 2012), three extensions of the MapReduce framework were proposed. Table 4 presents a comparison between the M3, Map Update and CMR frameworks, all of which extend MapReduce to support stream processing; a minimal sketch of the threshold-based scaling decision shared by several of the elastic approaches above follows Table 4.

Table 4. Comparison between M3, Map Update and CMR frameworks

| Framework | Supporting Streams Processing | Basic Environment | Cloud Based | Connection between Map and Reduce Phases |
|---|---|---|---|---|
| M3 | Yes | MapReduce framework | Non-cloud based | Through RMI |
| Map Update | Yes | MapReduce framework | Non-cloud based | Through deterministic tiebreaking procedure |
| CMR | Yes | MapReduce framework | Cloud based | Through pipelining model |
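To make the elastic-provisioning idea concrete, the following is a minimal, hypothetical sketch of the threshold-based scaling decision that systems such as StreamCloud, Adaptive Cloud Stream Processing, and the auto-scaling techniques apply in some form: the number of VMs is re-evaluated periodically from the observed input rate and an assumed per-VM capacity. All function names, thresholds, and parameters are illustrative assumptions, not the API of any surveyed system.

```python
import math

def required_vms(input_rate, per_vm_capacity, headroom=0.2):
    """Estimate how many VMs are needed for the current input rate.

    A fractional safety headroom is kept so that short bursts do not
    immediately overload the provisioned machines.
    """
    effective_capacity = per_vm_capacity * (1.0 - headroom)
    return max(1, math.ceil(input_rate / effective_capacity))

def scaling_decision(current_vms, input_rate, per_vm_capacity,
                     scale_out_threshold=0.9, scale_in_threshold=0.5):
    """Return the new VM count: scale out when utilization is high,
    scale in when it is low, otherwise keep the current allocation."""
    utilization = input_rate / (current_vms * per_vm_capacity)
    if utilization > scale_out_threshold or utilization < scale_in_threshold:
        return required_vms(input_rate, per_vm_capacity)
    return current_vms

# Example: 12,000 tuples/s arriving, 3 VMs provisioned, 3,000 tuples/s per VM.
print(scaling_decision(current_vms=3, input_rate=12000, per_vm_capacity=3000))
```

The same loop can be inverted for scale-in: when the stream input rate drops, the recomputed VM count falls and the surplus machines can be released.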


An information flow control model was proposed in (Xie, Ray, Adaikkalavan & Gamble, 2013) for securely and efficiently processing data streams; the processing of multiple queries is also shared on the cloud using the operator tree (Ray et al., 2012). The main advantage of this model is that it protects information against unauthorized disclosure and modification and prevents the leakage of information across organizations that process their data on the cloud. All of the previously mentioned cloud-based frameworks, except the information flow control model proposed by Xie et al. (2013), consider cloud resource provisioning on demand; however, the information flow control model outperforms the rest with regard to security and multiple-query processing, as it is the only one that handles these issues. Only the frameworks in (Dahiphale et al., 2014; Dou et al., 2014; Gulisano, 2010, 2012) handle parallel processing of data streams on cloud computing. The approach proposed by Castro et al. (2013) is the only one that addresses fault recovery based on resource provisioning, the method in (Hu et al., 2014) is the only one based on SVR to better adapt to the cloud computing environment, and Complex Event Processing is considered only by Saleh et al. (2013).

RESEARCH ISSUES

Based on our survey of the different environments for executing data streams, we conclude that the best environment is the cloud, because it provides greater elasticity and higher performance than the other environments. Cloud computing can scale the number of processing virtual machines up or down on demand, based on the continuous change in the input rate of the data streams. It adapts continuously to the incoming workload: it efficiently detects overload conditions and then scales up the number of processing resources (virtual machines), and it detects reductions in the stream input rate and then scales down the number of processing resources. Many techniques have been proposed to manage the continuously changing stream workload and to provision processing virtual machines accordingly, and cloud computing provides real-time load balancing over the processing resources. There is also considerable research on how to efficiently recover from failures that occur during data stream processing in the cloud; the cloud has an additional recovery option over other environments, since it can scale up the number of processing resources and route the workload from the damaged virtual machines to new ones. Multiple continuous queries are also processed efficiently on the cloud. In addition, many parallelization techniques have been proposed for the cloud environment, which provides parallel and distributed processing of data streams, and cloud computing offers strong security for stream processing. For all these reasons, the cloud environment is the best environment for executing data streams. In Table 5 we present a comparison between cloud-based (Castro et al., 2013; Cervino et al., 2012; Dahiphale et al., 2014; Gedik et al., 2013; Gulisano, 2012; Gulisano et al., 2010, 2012; Heinze et al., 2014; Hu et al., 2014; Kailasam et al., 2013; Saleh et al., 2013) and non-cloud-based (Aly et al., 2012; Lei et al., 2013; Safaei et al., 2012; Shan et al., 2012) execution environments.


Table 5. Comparison between cloud-based and non-cloud-based environments

| Environment Type | Methodologies | Overload Controlling | Elasticity | Overload Effect on Data | Overall Performance |
|---|---|---|---|---|---|
| Cloud based | StreamCloud method; Fault-Tolerant Scale Out algorithm; Adaptive Provisioning algorithm; CMR; CEP-based resource monitoring; Multi-step-ahead load forecasting; Auto-scaling | Scaling the number of VMs | Elastic environment | No effect (data is processed on provisioned VMs) | Higher performance and higher accuracy |
| Non-cloud based | M3 method; DTR algorithm; RLD algorithm; Improved Pair-Wise algorithm | Load rebalancing | Non-elastic environment | Data is lost | Lower performance and lower accuracy |

SOLUTIONS AND RECOMMENDATIONS

From the previous sections it is clear that real-time answering of data stream queries is a very important issue, and that existing systems improve data stream processing from only one perspective: some focus on the optimization techniques used regardless of the processing environment, while others focus only on improving the processing environment. We therefore propose the optimized cloud query mesh system, which is based on the idea of the query mesh (QM) solution for data stream processing; the basic idea of QM is to process data streams over multiple query plans. Our proposed system addresses the limitation of improving stream processing from one perspective only by combining the two viewpoints: it applies continuous query optimization on the cloud environment. It takes advantage of the cloud processing environment, exploiting the virtualization of cloud computing to process streams in an elastic and scalable way, and it applies a continuous query optimization technique not previously applied in the cloud: the data streams are processed over multiple query plans, each suitable for a subset of data with the same statistics, instead of executing all data over a single query plan derived from the average statistics of the whole stream. We divide the system into two subsystems, each responsible for a set of functions. The first is the Continuous Query Optimization Sub-System, our offline phase. It consists of two main blocks, the continuous query optimizer and the operators' distributer: the continuous query optimizer produces a set of query plans, one for each subset of data with distinct statistical properties, and the operators' distributer generates a physical plan suitable for all query plans produced by the optimizer. The second is the Streams Cloud-Based Execution Sub-System, our online phase. It consists of four main blocks: the input manager, which manages the incoming tuples and assigns the suitable plan to each tuple; the execution machines, which execute each tuple based on the assigned plan; the observer, which detects overload conditions; and the global manager, which provisions the demanded VMs in the case of overload (Figure 1).
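The sketch below illustrates, under purely hypothetical names and statistics, the online behavior just described: an input manager that routes each tuple to one of several query plans according to its statistical profile, and an observer that flags overload so a global manager could provision more VMs. It is a toy illustration of the query mesh idea, not the actual implementation of the proposed system.

```python
import random

# Hypothetical query plans: each plan is just a function applied to a tuple.
def plan_low_selectivity(t):
    return ("low-plan", t["value"])

def plan_high_selectivity(t):
    return ("high-plan", t["value"])

PLANS = {"low": plan_low_selectivity, "high": plan_high_selectivity}

def assign_plan(stream_tuple):
    """Input manager: pick the plan whose statistics profile matches the tuple.
    Here the routing key is a single attribute threshold; a real query mesh
    would classify tuples by learned statistics over several attributes."""
    return "high" if stream_tuple["value"] > 100 else "low"

def observer(pending_tuples, capacity, threshold=0.8):
    """Observer: report overload when the pending queue exceeds a fraction
    of the execution machines' capacity."""
    return pending_tuples / capacity > threshold

def process(stream, capacity=50):
    overloaded = False
    for i, t in enumerate(stream):
        PLANS[assign_plan(t)](t)                 # execution machine runs the assigned plan
        if observer(len(stream) - i, capacity):  # global manager would add VMs here
            overloaded = True
    return overloaded

stream = [{"value": random.randint(0, 200)} for _ in range(100)]
print("overload detected:", process(stream))
```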


Figure 1. The proposed optimized cloud query mesh system

FUTURE RESEARCH DIRECTIONS

Most research on data stream processing in the literature improves performance either by proposing new query optimization techniques with no attention to the processing environment, or by improving the processing environment alone. However, data stream processing faces many challenges, such as execution time, memory usage, and online overheads, and these challenges cannot be handled efficiently by focusing on only one viewpoint of improvement. Future work should therefore address both perspectives: to provide real-time processing of data streams, hybrid techniques are needed that combine query optimization with environment improvements, and data streams should be processed on an elastic environment that adapts to their changeable nature over time. In this way we will obtain real-time and more accurate answers for continuous queries.

CONCLUSION

Many challenging research issues appear when processing data streams because continuous queries need real-time answers; the ability to provide real-time answering is an important consideration in data stream processing because of the changeable nature of streams. Also, the accuracy of results and


reducing data loss are important issues. Much recent research focuses on providing frameworks and proposing new techniques for data stream processing, and different environments have been proposed for executing continuous queries. In this paper we present a survey of recent algorithms for data stream processing and continuous query optimization, along with techniques for efficiently processing multiple continuous queries. In addition, we present the different execution environments for data streams, review recent work on parallel and distributed processing of data streams, and present novel techniques for cloud-based stream processing. Based on the identified research challenges, we propose, as future work, a system for continuous query optimization based on cloud computing. This system will provide real-time answers for continuous queries and elastic scaling of cloud resources based on the streams' input rates.

REFERENCES Ajwani, D., Ali, S., Katrinis, K., Li, C. H., Park, A. J., Morrison, J. P., & Schenfeld, E. (2013). Generating synthetic task graphs for simulating stream computing systems. Journal of Parallel and Distributed Computing, 73(10), 1362–1374. doi:10.1016/j.jpdc.2013.06.002 Aly, A. M., Sallam, A., Gnanasekaran, B. M., Nguyen-Dinh, L., Aref, W. G., Ouzzani, M., & Ghafoor, A. (2012, April). M3: Stream processing on main-memory mapreduce. In Proceedings of Data Engineering (ICDE), 2012 IEEE 28th International Conference on (pp. 1253-1256). IEEE. Anceaume, E., & Busnel, Y. (2014). A distributed information divergence estimation over data streams. Journal of Parallel and Distributed Systems. IEEE Transactions on, 25(2), 478–487. Backman, N., Fonseca, R., & Çetintemel, U. (2012, April). Managing parallelism for stream processing in the cloud. In Proceedings of the 1st International Workshop on Hot Topics in Cloud Data Processing (p. 1). ACM. doi:10.1145/2169090.2169091 Baldoni, R., Bonomi, S., Cerocchi, A., & Querzoni, L. (2013). Virtual Tree: A robust architecture for interval valid queries in dynamic distributed systems. Journal of Parallel and Distributed Computing, 73(8), 1135–1145. doi:10.1016/j.jpdc.2013.03.017 Bedini, I., Sakr, S., Theeten, B., Sala, A., & Cogan, P. (2013, April). Modeling performance of a parallel streaming engine: bridging theory and costs. In Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering (pp. 173-184). ACM. doi:10.1145/2479871.2479895 Bhide, M., & Ramamritham, K. (2013). Category-Based Infidelity Bounded Queries over Unstructured Data Streams. Knowledge and Data Engineering. IEEE Transactions on, 25(11), 2448–2462. Cao, J., Zhang, W., & Tan, W. (2012). Dynamic control of data streaming and processing in a virtualized environment. Journal of Automation Science and Engineering. IEEE Transactions on, 9(2), 365–376. Castro Fernandez, R., Migliavacca, M., Kalyvianaki, E., & Pietzuch, P. (2013). Integrating scale out and fault tolerance in stream processing using operator state management. In Proceedings of the 2013 international conference on Management of data (pp. 725-736). ACM. doi:10.1145/2463676.2465282


Cervino, J., Kalyvianaki, E., Salvachua, J., & Pietzuch, P. (2012, April). Adaptive provisioning of stream processing systems in the cloud. In Proceedings of Data Engineering Workshops (pp. 295–301). IEEE. doi:10.1109/ICDEW.2012.40 Chen, B., Liang, W., & Yu, J. X. (2014). Energy-efficient top-k query evaluation and maintenance in wireless sensor networks. Journal of Wireless Networks, 20(4), 591-610. Chen, H. (2014). Mining top-k frequent patterns over data streams sliding window. Journal of Intelligent Information Systems, 42(1), 111–131. doi:10.1007/s10844-013-0265-4 Chen, T., Chen, L., Ozsu, M. T., & Xiao, N. (2013). Optimizing multi-top-k queries over uncertain data streams. Journal of Knowledge and Data Engineering. IEEE Transactions on, 25(8), 1814–1829. Cho, H. J. (2013). Continuous range k-nearest neighbor queries in vehicular ad hoc networks. Journal of Systems and Software, 86(5), 1323–1332. doi:10.1016/j.jss.2012.12.034 Cho, K. M., Tsai, C. W., Chiu, Y. S., & Yang, C. S. (2014). A High Performance Load Balance Strategy for Real-Time Multicore Systems. The Journal of Scientific World. Dahiphale, D., Karve, R., Vasilakos, A. V., Liu, H., Yu, Z., Chhajer, A., & Wang, C. et al. (2014). An Advanced MapReduce: Cloud MapReduce, Enhancements and Applications. Journal of Network and Service Management. IEEE Transactions on, 11(1), 101–115. Ding, L., Works, K., & Rundensteiner, E. A. (2011). Semantic stream query optimization exploiting dynamic metadata. In Proceedings of IEEE Conference on Data Engineering (ICDE), (pp. 111-122). IEEE. doi:10.1109/ICDE.2011.5767840 Ding, X., Lian, X., Chen, L., & Jin, H. (2012). Continuous monitoring of skylines over uncertain data streams. Journal of Information Science, 184(1), 196–214. doi:10.1016/j.ins.2011.09.007 Dou, A., Lin, S., Kalogeraki, V., & Gunopulos, D. (2014). Supporting historic queries in sensor networks with flash storage. Journal of Information Systems, 39, 217–232. doi:10.1016/j.is.2012.04.002 Elmongui, H. G., Mokbel, M. F., & Aref, W. G. (2013). Continuous aggregate nearest neighbor queries. Journal of GeoInformatica, 17(1), 63–95. doi:10.1007/s10707-011-0149-0 Fangzhou, Z., Guohui, L., Li, L., Xiaosong, Z., & Cong, Z. (2013). Probabilistic nearest neighbor queries of uncertain data via wireless data broadcast. Journal of Peer-to-Peer Networking and Applications, 6(4), 363–379. doi:10.1007/s12083-013-0210-x Gao, Y., Miao, X., Cui, H., Chen, G., & Li, Q. (2014). Processing k-skyband, constrained skyline, and group-by skyline queries on incomplete data. Journal of Expert Systems with Applications, 41(10), 4959–4974. doi:10.1016/j.eswa.2014.02.033 Gedik, B., Schneider, S., Hirzel, M., & Wu, K. (2013). Elastic scaling for data stream processing. IEEE Transactions on Parallel and Distributed Systems, 25(6), 1447–1463. doi:10.1109/TPDS.2013.295 Golab, L., Johnson, T., & Shkapenyuk, V. (2012). Scalable scheduling of updates in streaming data warehouses. Journal of Knowledge and Data Engineering. IEEE Transactions on, 24(6), 1092–1105.


Guirado, F., Roig, C., & Ripoll, A. (2013). Enhancing throughput for streaming applications running on cluster systems. Journal of Parallel and Distributed Computing, 73(8), 1092–1105. doi:10.1016/j. jpdc.2013.04.006 Gulisano, V., Jimenez-Peris, R., Patino-Martinez, M., Soriente, C., & Valduriez, P. (2012). Streamcloud: An elastic and scalable data streaming system. IEEE Transactions on Parallel and Distributed Systems, 23(12), 2351–2365. doi:10.1109/TPDS.2012.24 Gulisano, V., Jimenez-Peris, R., Patino-Martinez, M., & Valduriez, P. (2010, June). Streamcloud: A large scale data streaming system. In Proceedings of the Distributed Computing Systems (ICDCS), 2010 IEEE 30th International Conference on (pp. 126-137). IEEE. doi:10.1109/ICDCS.2010.72 Gulisano, V. M. (2012). StreamCloud: An Elastic Parallel-Distributed Stream Processing Engine (Doctoral dissertation). Informatica. Heinze, T., Pappalardo, V., Jerzak, Z., & Fetzer, C. (2014, March). Auto-scaling techniques for elastic data stream processing. In Proceedings of the 30th International Conference on Data Engineering Workshops (ICDEW), (pp. 296-302). IEEE. Hu, R., Jiang, J., Liu, G., & Wang, L. (2014). Efficient Resources Provisioning Based on Load Forecasting in Cloud. The Journal of Scientific World. Huang, Y. K., Chang, C. H., & Lee, C. (2012). Continuous distance-based skyline queries in road networks. Journal of Information Systems, 37(7), 611–633. doi:10.1016/j.is.2012.02.003 Huang, Y. K., & Lin, L. F. (2014). Efficient processing of continuous min–max distance bounded query with updates in road networks. Journal of Information Science, 278, 187–205. doi:10.1016/j.ins.2014.03.040 Hunter, T., Das, T., Zaharia, M., Abbeel, P., & Bayen, A. M. (2013). Large-Scale Estimation in Cyberphysical Systems Using Streaming Data: A Case Study With Arterial Traffic Estimation. Journal of Automation Science and Engineering. IEEE Transactions on, 10(4), 884–898. Jung, H., Chung, Y. D., & Liu, L. (2012). Processing generalized k-nearest neighbor queries on a wireless broadcast stream. Journal of Information Science, 188, 64–79. doi:10.1016/j.ins.2011.11.007 Jung, H., Kim, Y. S., & Chung, Y. D. (2014). QR-tree: An efficient and scalable method for evaluation of continuous range queries. Journal of Information Science, 274, 156–176. doi:10.1016/j.ins.2014.02.061 Kailasam, S., Gnanasambandam, N., Dharanipragada, J., & Sharma, N. (2013). Optimizing ordered throughput using autonomic cloud bursting schedulers. IEEE Transactions on Software Engineering, 39(11), 1564–1581. doi:10.1109/TSE.2013.26 Kalyvianaki, E., Wiesemann, W., Vu, Q. H., Kuhn, D., & Pietzuch, P. (2011, April). SQPR: Stream query planning with reuse. In Proceedings of the Data Engineering (ICDE), 2011 IEEE 27th International Conference on (pp. 840-851). IEEE. Kapitanova, K., Son, S. H., Kang, W., & Kim, W. T. (2011). Modeling and Analyzing Real-Time Data Streams. In Proceedings of International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing (pp. 91-98). IEEE.


Kim, H. G. (2013). A Structure for Sliding Window Equijoins in Data Stream Processing. In Proceedings of the Computational Science and Engineering (CSE), 2013 IEEE 16th International Conference on (pp. 100-103). IEEE. doi:10.1109/CSE.2013.25 Lam, W., Liu, L., Prasad, S. T. S., Rajaraman, A., Vacheri, Z., & Doan, A. (2012). Muppet: MapReduce-style processing of fast data. Proceedings of the VLDB Endowment, 5(12), 1814–1825. doi:10.14778/2367502.2367520 Lee, Y. W., Lee, K. Y., & Kim, M. H. (2013). Efficient processing of multiple continuous skyline queries over a data stream. Journal of Information Science, 221, 316–337. doi:10.1016/j.ins.2012.09.040 Lei, C., Rundensteiner, E. A., & Guttman, J. D. (2013, April). Robust distributed stream processing. In Proceedings of the Data Engineering (ICDE), 2013 IEEE 29th International Conference on (pp. 817828). IEEE. Li, C. L., Wang, E. T., Huang, G. J., & Chen, A. L. (2014). Top-n query processing in spatial databases considering bi-chromatic reverse k-nearest neighbors. Journal of Information Systems, 42, 123–138. doi:10.1016/j.is.2014.01.001 Lijie, Z., & Yaxuan, W. (2010). Query Plan Optimization and Migration Strategy over Data Stream. In Proceedings of International forum on Information Technology and Applications (Vol. 3, pp. 72-75). IEEE. doi:10.1109/IFITA.2010.199 Lim, H., & Babu, S. (2013). Execution and optimization of continuous queries with cyclops. In Proceedings of the 2013 international conference on Management of data (pp. 1069-1072). ACM. doi:10.1145/2463676.2465248 Lin, X., Xu, J., & Hu, H. (2013). Range-based skyline queries in mobile environments. Journal of Knowledge and Data Engineering. IEEE Transactions on, 25(4), 835–849. Lin, X., Xu, J., Hu, H., & Lee, W. (2013). Authenticating Location-Based Skyline Queries in Arbitrary Subspaces. IEEE Transactions on Knowledge and Data Engineering, 26(6). Liu, Z., Wang, C., & Wang, J. (2014). Aggregate nearest neighbor queries in uncertain graphs. Journal of World Wide Web, 17(1), 161–188. doi:10.1007/s11280-012-0200-6 Lu, H., Zhou, Y., & Haustad, J. (2013). Efficient and scalable continuous skyline monitoring in two-tier streaming settings. Journal of Information Systems, 38(1), 68–81. doi:10.1016/j.is.2012.05.005 Nagendra, M., & Candan, K. S. (2013). Layered processing of skyline-window-join (SWJ) queries using iteration-fabric. In Proceedings of the Data Engineering (ICDE), 2013 IEEE 29th International Conference on (pp. 985-996). IEEE. Nehme, R. V., Works, K., Lei, C., Rundensteiner, E. A., & Bertino, E. (2013). Multi-route query processing and optimization. Journal of Computer and System Sciences, 79(3), 312–329. doi:10.1016/j. jcss.2012.09.010 Papapetrou, O., Garofalakis, M., & Deligiannakis, A. (2012). Sketch-based querying of distributed sliding-window data streams. Proceedings of the VLDB Endowment, 5(10), 992–1003. doi:10.14778/2336664.2336672


Park, H. K., & Lee, W. S. (2012). Adaptive optimization for multiple continuous queries. Journal of Data & Knowledge Engineering, 71(1), 29–46. doi:10.1016/j.datak.2011.07.008 Qian, J., Li, Y., Wang, Y., Chen, H., & Dong, Y. (2012). An embedded co-processor for accelerating window joins over uncertain data streams. Journal of Microprocessors and Microsystems, 36(6), 489–504. doi:10.1016/j.micpro.2012.04.007 Ray, I., Madria, S. K., & Linderman, M. (2012, October). Query Plan Execution in a Heterogeneous Stream Management System for Situational Awareness. In Proceedings of Reliable Distributed Systems (pp. 424–429). SRDS. doi:10.1109/SRDS.2012.54 Sadoghi, M., Javed, R., Tarafdar, N., Singh, H., Palaniappan, R., & Jacobsen, H. A. (2012, April). Multi-query stream processing on fpgas. In Proceedings of Data Engineering (ICDE), 2012 IEEE 28th International Conference on (pp. 1229-1232). IEEE. doi:10.1109/ICDE.2012.39 Safaei, A. A., & Haghjoo, M. S. (2010). Parallel processing of continuous queries over data streams. Journal of Distributed and Parallel Databases, 28(2-3), 93–118. doi:10.1007/s10619-010-7066-3 Safaei, A. A., Sharifrazavian, A., Sharifi, M., & Haghjoo, M. S. (2012). Dynamic routing of data stream tuples among parallel query plan running on multi-core processors. Journal of Distributed and Parallel Databases, 30(2), 145–176. doi:10.1007/s10619-012-7090-6 Saleh, O., Gropengießer, F., Betz, H., Mandarawi, W., & Sattler, K. U. (2013). Monitoring and autoscaling IaaS clouds: a case for complex event processing on data streams. In Proceedings of the 2013 IEEE/ ACM 6th International Conference on Utility and Cloud Computing (pp. 387-392). IEEE Computer Society. doi:10.1109/UCC.2013.78 Sandhya, G., & Devi, S. K. (2013). An adaptive sliding window based continuous Top-K dominating queries. In Proceedings of the Intelligent Systems and Control (ISCO), 2013 7th International Conference on (pp. 349-353). IEEE. doi:10.1109/ISCO.2013.6481177 Shan, L., Xuejiao, H., Li, Y., & Lizhen, X. (2012, November). Research and Improvement of Load Balancing Algorithm in Distributed Sonar Data Stream Management System. In Proceedings of Web Information Systems and Applications (WISA), 2012 Ninth (pp. 163-169). IEEE. doi:10.1109/WISA.2012.17 Tang, J., Zhou, Z., Niu, J., & Wang, Q. (2014). An energy efficient hierarchical clustering index tree for facilitating time-correlated region queries in the Internet of Things. Journal of Network and Computer Applications, 40, 1–11. doi:10.1016/j.jnca.2013.07.009 Tang, Y., & Gedik, B. (2013). Autopipelining for Data Stream Processing. Journal of Parallel and Distributed Systems. IEEE Transactions on, 24(12), 2344–2354. Wang, B., Qu, J., Wang, X., Wang, G., & Kitsuregawa, M. (2013). VGQ-Vor: Extending virtual grid quadtree with Voronoi diagram for mobile k nearest neighbor queries over mobile objects. Journal of Frontiers of Computer Science, 7(1), 44–54. doi:10.1007/s11704-012-2069-z Wang, Y., Zhang, R., Xu, C., Qi, J., Gu, Y., & Yu, G. (2014). Continuous visible k nearest neighbor query on moving objects. Journal of Information Systems, 44, 1–21. doi:10.1016/j.is.2014.02.003


Works, K., Rundensteiner, E. A., & Agu, E. (2013). Optimizing adaptive multi-route query processing via time-partitioned indices. Journal of Computer and System Sciences, 79(3), 330–348. doi:10.1016/j. jcss.2012.09.011 Xie, X., Ray, I., Adaikkalavan, R., & Gamble, R. (2013, June). Information flow control for stream processing in clouds. In Proceedings of the 18th ACM symposium on Access control models and technologies (pp. 89-100). ACM. doi:10.1145/2462410.2463205 Yang, D., Rundensteiner, E. A., & Ward, M. O. (2013). Mining neighbor-based patterns in data streams. Journal of Information Systems, 38(3), 331–350. doi:10.1016/j.is.2012.08.001 Yi, S., Ryu, H., Son, J., & Chung, Y. D. (2014). View field nearest neighbor: A novel type of spatial queries. Journal of Information Science, 275, 68–82. doi:10.1016/j.ins.2014.02.022 Yin, B., Lin, Y., Yu, J., & Luo, Q. (2014). Energy-efficient filtering for skyline queries in cluster-based sensor networks. Journal of Computers & Electrical Engineering, 40(2), 350–366. doi:10.1016/j.compeleceng.2013.03.021 Zhang, D., Chow, C. Y., Li, Q., Zhang, X., & Xu, Y. (2013). SMashQ: Spatial mashup framework for k-NN queries in time-dependent road networks. Journal of Distributed and Parallel Databases, 31(2), 259–287. doi:10.1007/s10619-012-7110-6 Zhang, Y., & Cheng, R. (2013). Probabilistic filters: A stream protocol for continuous probabilistic queries. Journal of Information Systems, 38(1), 132–154. doi:10.1016/j.is.2012.06.003

KEY TERMS AND DEFINITIONS

Cloud Computing: A type of Internet-based computing that is based on sharing computing resources rather than having local servers or personal devices handle applications.
Continuous Query: A data stream query that is evaluated continuously over time as data stream elements arrive.
Data Streams: Continuous, unbounded, rapid, and time-varying data elements generated by many modern applications such as sensor networks, financial applications, and web log applications.
MapReduce: A programming model that processes massive amounts of unstructured data in parallel on a distributed cluster of processors.
Nearest Neighbor Query: A type of query used to find the nearest neighboring objects to a given point in space.
Pattern Mining: An important concept in data mining that is used to find existing patterns in data.
Skyline Query: A type of query used to return the objects that are not dominated by any other objects.
Sliding Window: A processing model used to process continuous data streams in an incremental manner.


Chapter 16

A Preparation Framework for EHR Data to Construct CBR Case-Base

Shaker El-Sappagh, Mansoura University, Egypt
Alaa M. Riad, Mansoura University, Egypt
Mohammed Elmogy, Mansoura University, Egypt
Hosam Zaghloul, Mansoura University, Egypt
Farid A. Badria, Mansoura University, Egypt

ABSTRACT

Diabetes mellitus diagnosis is an experience-based problem, and Case-Based Reasoning (CBR) is the first choice for such problems. CBR depends on the quality of its case-base structure and contents; however, building a case-base is a challenge. Electronic Health Record (EHR) data can be used as a starting point for building case-bases, but they need a set of preparation steps. This chapter proposes an EHR-based case-base preparation framework with three phases: data preparation, coding, and fuzzification. The first two phases are discussed in this chapter using a diabetes diagnosis dataset collected from the EHRs of 60 patients; the result is the case-base knowledge. The first phase uses machine-learning algorithms for case-base data preparation. For the encoding phase, we propose and apply an encoding methodology based on SNOMED-CT and build an OWL 2 ontology from the collected SNOMED-CT concepts. A CBR prototype has been designed, and the results show enhancements to the diagnosis accuracy.

INTRODUCTION

Diabetes Mellitus (DM) is a serious disease. If it is not treated on time and properly, it can lead to serious complications, including death. This makes diabetes one of the main priorities in medical science research, which in turn generates huge amounts of data. These data are transactional and distributed in the patient's EHR. An early diabetes diagnosis is the most critical step in diabetes management. The


diagnosis of diabetes is an ill-formed problem and depends on the physician's experience. Case-Based Reasoning (CBR) is considered the most suitable Clinical Decision Support System (CDSS) for dealing with such problems, where physicians share their experience (Richter and Weber, 2013; Blanco, 2013), and case-base creation is therefore a critical step. CBR is appealing in medical domains because a case-base already exists in the form of the stored symptoms, medical history, physical examinations, lab tests, diagnoses, treatments, and outcomes for each patient (Andritsos et al., 2014). However, because clinical data are usually incomplete, inconsistent, and noisy, these data need a set of preparation steps before being converted into CDSS knowledge (Abidi & Manickam, 2002). The first step is the data preprocessing stage, which is applied to enhance data quality; applying a set of machine learning algorithms improves the accuracy of CBR case retrieval. The second step is the coding stage, which represents the pre-processed data with a standard coding terminology such as SNOMED CT (SCT) (Lee et al., 2013). We have proposed a diabetes diagnosis reference set from SCT version 2013 and modeled it in an OWL 2 ontology (El-Sappagh et al., 2014); this ontology is used to encode the unstructured (i.e., textual) contents of the case-base knowledge base. A lack of standard data affects the accuracy of CDSS implementations (Ahmadian et al., 2011), and data standardization is critical for CBR systems for many reasons: the encoded knowledge supports (1) the creation of distributed CBR systems; (2) the integration and interoperability between the CDSS and the EHR environment (Ahmadian et al., 2011); and (3) the creation of knowledge-intensive CBR systems. As a result, CBR supports semantic retrieval algorithms, and its intelligence is increased (Melton et al., 2006). Finally, the third step is the data fuzzification stage, which is used to handle vague knowledge. Physicians always describe patients using vague terms, such as "the sugar level is high" or "the patient is obese", and patients often describe their conditions using imprecise terms. As Zadeh (2003) argued, much of the knowledge that humans acquire through experience is perception-based and thus subject to imprecision and inaccuracy; such knowledge, when not treated in a way that considers and conveys its inherent imprecision, usually reduces the effectiveness of the knowledge-based systems that use it. Vagueness can be handled using fuzzy logic (Zadeh, 2003), which has been used in diabetes diagnosis rule-based systems (Lee and Wang, 2011). Moreover, fuzzy logic has been integrated with CBR in hybrid systems (Abdul et al., 2014) and used for calculating fuzzy similarity between cases (Khanum et al., 2009); however, in the diabetes diagnosis domain, there are no studies of fuzzy CBR systems. Authors in (Burnum, 1989; Weiner & Embi, 2009) stated that the introduction of health information technology such as EHRs has not led to improvements in the quality of the data being recorded, but rather to the recording of a greater quantity of bad data. As a result, Lei (1991) proposed what he called the first law of informatics: "data shall be used only for the purpose for which they were collected." At the same time, the EHR contains all of the patient's current and historical medical data, and these data can be used as a complete source for building the CBR case-base (Abidi & Manickam, 2002).
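As an illustration of the fuzzification step described above, the following is a minimal sketch, under assumed breakpoints, of how a crisp fasting plasma glucose (FPG) reading could be mapped to memberships in the vague terms "normal", "high", and "very high" using trapezoidal membership functions; the cut-off values are illustrative only and not clinically validated.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], flat on [b, c], falls on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

def fuzzify_fpg(fpg_mg_dl):
    """Map a crisp FPG value (mg/dL) to fuzzy memberships (assumed breakpoints)."""
    return {
        "normal":    trapezoid(fpg_mg_dl, 0, 1, 95, 110),
        "high":      trapezoid(fpg_mg_dl, 100, 115, 130, 150),
        "very_high": trapezoid(fpg_mg_dl, 135, 160, 400, 401),
    }

print(fuzzify_fpg(120))  # an FPG of 120 is fully "high" and no longer "normal"
```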
The quality of CBR is based on the quality of the case-base content (Andritsos, 2014). Measuring and improving EHR data quality must therefore be an essential step when using these data in a CDSS knowledge base (Abidi & Manickam, 2002), and data preprocessing is the first and foremost step to improve the accuracy of CBR systems (Borges et al., 2012). Focusing on DM diagnosis, its medical datasets are seldom complete (Jayalskshmi & Santhakumaran, 2010). Moreover, because diabetes is a lifelong disease, even the data available for an individual patient may be massive and complicated to interpret. Data preprocessing steps include deleting low-quality rows and columns, feature selection, feature mining, integration, transformation (i.e., normalization and discretization), data cleaning, feature weighting, etc. (Begum et al., 2010). An example of a system focusing on feature mining is the dietary counseling system by Wu et al. (2004).


Jagannathan and Petrovic (2009) concluded that missing values in the case-base pose a common and serious problem that impairs the performance of the system, and they provided an imputation method to deal with missing values; however, they handled missing values for only some attributes and left the others. Floyd et al. (2008) concluded that applying preprocessing techniques such as feature selection to a case-base can increase the performance of a CBR system. Case retrieval is the most important phase in a CBR system, but it depends mainly on the types of case-base knowledge and their quality, and all existing case retrieval algorithms depend on one type of knowledge for retrieval. We combine the three most important data preparation techniques to enhance the case retrieval process, especially for the medical domain. In this paper, our proposed framework results in cleaned, normalized, fuzzified, and encoded knowledge; these types of knowledge support different queries and different types of similarity algorithms, which improve the retrieval phase of the CBR system. Diabetes diagnosis is used as a case study for applying this framework. To the best of our knowledge, there are no CBR systems that utilize machine-learning algorithms to prepare case-base knowledge. Moreover, most CBR systems for diabetes diagnosis have used a diabetes-specific data set, yet diabetes, as a chronic disease, leads to many other diseases such as nephropathy, retinopathy, neuropathy, heart diseases, stroke, and others (Michael, 2008). In our dataset, the patient is described by 70 different features, as shown in Table 1; these features link diabetes with other diseases, such as cancer, kidney disease, and liver disease. A CBR-based CDSS for diabetes diagnosis is designed using the myCBR 3 Protégé plugin (myCBR3 Project, 2014); myCBR is an open-source similarity-based retrieval tool and software development kit (SDK). The diagnosis is the patient's diabetes status plus his or her future risk of other complications or diseases, such as cancers. The paper is organized as follows: Section 2 provides related work, Section 3 describes our diabetes dataset, Section 4 presents the proposed preparation framework, Section 5 provides a case study for the CBR system, and Section 6 provides the conclusion and future work.
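Since the chapter repeatedly ties case-base quality to the retrieval step, the following is a minimal sketch of the kind of weighted nearest-neighbor retrieval a CBR prototype typically performs over cleaned, normalized cases; the feature names, weights, ranges, and local similarity functions are illustrative assumptions rather than the chapter's actual configuration.

```python
def local_similarity(a, b, feature_range):
    """Similarity of two numeric values, scaled by the feature's value range."""
    return 1.0 - abs(a - b) / feature_range

def case_similarity(query, case, weights, ranges):
    """Weighted global similarity between a query and a stored case."""
    total_weight = sum(weights.values())
    score = sum(w * local_similarity(query[f], case[f], ranges[f])
                for f, w in weights.items())
    return score / total_weight

def retrieve(query, case_base, weights, ranges, k=3):
    """Return the k most similar cases (nearest-neighbor retrieval)."""
    scored = [(case_similarity(query, c, weights, ranges), c) for c in case_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Illustrative mini case-base; weights could come from an expert or a learner.
ranges  = {"age": 60.0, "bmi": 30.0, "hba1c": 10.0}
weights = {"age": 0.2, "bmi": 0.3, "hba1c": 0.5}
case_base = [
    {"age": 45, "bmi": 33, "hba1c": 7.8, "diagnosis": "type 2 DM"},
    {"age": 30, "bmi": 22, "hba1c": 5.1, "diagnosis": "normal"},
]
print(retrieve({"age": 48, "bmi": 31, "hba1c": 7.5}, case_base, weights, ranges, k=1))
```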

RELATED WORK

The related work is divided into two parts: work related to data pre-processing and work related to data encoding.

Case-Base Knowledge Pre-Processing

The quality of EHR data should be measured before using them as a knowledge source. Weiskopf and Weng (2013) determined five dimensions for measuring the quality of EHR data, and Klompas et al. (2013) asserted that EHR data can improve diabetes management even if their raw quality is low. To achieve data quality, a preprocessing or data preparation step is critical for any knowledge-based CDSS (Esfandiari et al., 2014). For example, case retrieval algorithms such as nearest neighbor require data cleaning and normalization (Jagannathan and Petrovic, 2009; Kuhn and Johnson, 2013). Low-quality data also need handling inside CBR itself: Xie et al. (2013) handled missing values and unmatched features in the case retrieval algorithm, and Guessouma et al. (2014) proposed five approaches for managing the missing data problem in a medical CBR system. The data preprocessing steps include data cleaning (Jagannathan and Petrovic, 2009), data transformation (Jayalskshmi and Santhakumaran, 2010), feature mining and selection (Piramuthu, 2004), etc. Especially for CBR, the


conversion of the database structure into a case-base structure is a critical step (Abidi and Manickam, 2002). A database record is similar to a case in the case-base, but the transformation from a generic EHR to a specialized case-base is not a straightforward mapping of attributes: to change a simple database record into a case, an experience must be associated with the record (Baig, 2008). Abidi and Manickam (2002) assumed that the structure of the case-base is defined in advance and mapped the structure and then the contents of the EHR to the case-base; however, this assumption is not realistic, because the structure of the case-base depends on the structure and contents of the EHR and must be inferred from it. As the EHR contains the patient's raw chronological data, a temporal abstraction preparation step is critical to aggregate these data and provide trends (Bottrighi et al., 2010). Moreover, feature weights must be specified for the case retrieval algorithms (Richter and Weber, 2013); they can be determined manually by a domain expert or from CPGs (Guessouma et al., 2014), or automatically using machine learning algorithms (Gopal, 2007) such as neural networks (Abidi and Manickam, 2002). The choice of the case features that best distinguish classes of instances has a large impact on the similarity measure (Baig, 2008), and it improves performance and decreases complexity (Andritsos et al., 2014); this choice can be made automatically (Xiong and Funk, 2010), using techniques such as information gain (Shanga et al., 2013) and Relief (Huang et al., 2007), or manually (Kwiatkowska and Atkins, 2004; Balakrishnan et al., 2012) according to a domain expert or CPGs. Kotsiantis et al. (2006) surveyed data preprocessing algorithms for each step, and Han et al. (2008) proposed preprocessing steps for a DM dataset using RapidMiner (RapidMiner, 2014). There is no single sequence of data pre-processing algorithms with the best performance (Kotsiantis et al., 2006); as a result, data preprocessing is a set of unordered steps (Esfandiari et al., 2014). It can be an inherent component of a CBR system, or it can be performed as a separate preprocessing step; for example, the eXiT*CBR system (Pla et al., 2013) incorporates basic preprocessing steps such as discretization, normalization, and feature selection. However, the complete list of needed preprocessing steps differs according to the nature of the data and the purpose of the CBR system (Andritsos et al., 2014). As a result, this chapter provides preprocessing as a separate phase before the CBR system processes.

Case-Base Knowledge Encoding

With respect to the data encoding issue, calculating the similarity between two patients' conditions based on simple string matching between clinical terms is insufficient; the similarity between ontology concepts is calculated by clinical distance and semantic distance, and intelligent case retrieval depends on the existence of a domain ontology and an encoded case base. El-Sappagh et al. (2014) proposed a domain background ontology derived from the SCT standard terminology; the remaining problem is how to encode the case-base knowledge so that it is represented using SCT concepts. Coded data support semantic retrieval in CBR in many ways. For example, concepts can be represented at different levels of granularity (e.g., in SCT, Type II diabetes mellitus with neuropathic arthropathy IS_A Type II diabetes mellitus with arthropathy IS_A diabetes mellitus type 2 IS_A diabetes mellitus), which enriches the user query. Moreover, the CBR retrieval algorithm can catch different descriptions of the same concept (e.g., myocardial infarction = heart attack = cardiac infarction). In this way, the CBR system reasons more like a domain expert, and it can integrate data from different systems such as hospitals, doctors' offices, and outpatient departments. SCT is a comprehensive clinical terminology: it contains more than 388,000 active concepts organized in 19 hierarchies, 1.14 million descriptions, and 1.38 million relationships (International Health Terminology Standards Development


Organization, 2014). Silva et al. (2011) concluded that the concept coverage of SCT was 98.5% for the coding of problem lists and diagnoses. Studies that describe how SCT is implemented in clinical settings are few and focus mostly on data capture, data retrieval, and decision support (Silva et al., 2011). The existing methodologies for mapping clinical text in EHRs to SCT concepts range from manual to semi-automatic to fully automatic ones (Lee et al., 2010; Barrett et al., 2012; Lamy et al., 2013). Lamy et al. (2006) presented a semi-automatic semantic method for mapping SCT concepts to VCM icons. Kim et al. (2012) proposed an EAV-based data model derived from CPGs for pressure ulcer wound assessment and encoded its data elements using SCT, relying mainly on pre-coordination; EAV is a flexible method for data representation and exchange between databases and CDSSs. Kooij et al. (2006) asserted that standardizing the EHR requires using the HL7 RIM data model and an SCT code for every item. Lau et al. (2008) described a methodology for encoding problem lists used in general practice with SCT; this method was complemented by Lee et al. (2010). These two methods encoded raw data sets, and pre-coordination was given the highest priority. Lee's methodology is a complete method; however, it concentrated on the data cleaning, normalization, and matching steps, did not address the physical storage structure of the data (such as EAV), did not define whether the EHR data model is standardized using RIM, and did not discuss how the codes and their values are semantically stored. Using a terminology in information systems requires decisions on how the terminology should fit into the information structure, for example together with information standards; this process is often called terminology binding. Based on our proposed SCT reference set for diabetes diagnosis concepts, we propose a novel encoding methodology and use it to encode our diabetes diagnosis case-base knowledge into a standard form. This case base contains 60 cases collected from the EHRs of patients from hospitals of Mansoura University, Egypt.
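To make the ontology-based retrieval idea concrete, the following is a minimal sketch of a path-based semantic similarity over a tiny, hand-written IS_A hierarchy; the concept names mimic SNOMED CT descriptions, but the hierarchy, its extent, and the similarity formula are illustrative assumptions only, not actual SCT content or the chapter's method.

```python
# Toy IS_A hierarchy: child -> parent (a small, assumed fragment).
IS_A = {
    "Type II diabetes mellitus with neuropathic arthropathy": "Type II diabetes mellitus with arthropathy",
    "Type II diabetes mellitus with arthropathy": "Diabetes mellitus type 2",
    "Diabetes mellitus type 2": "Diabetes mellitus",
    "Diabetes mellitus type 1": "Diabetes mellitus",
    "Diabetes mellitus": "Disorder of glucose metabolism",
}

def ancestors(concept):
    """Return the concept followed by all of its ancestors up to the root."""
    chain = [concept]
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain

def semantic_similarity(c1, c2):
    """Path-based similarity: the closer the least common ancestor, the higher."""
    a1, a2 = ancestors(c1), ancestors(c2)
    common = next((c for c in a1 if c in a2), None)
    if common is None:
        return 0.0
    distance = a1.index(common) + a2.index(common)
    return 1.0 / (1.0 + distance)

print(semantic_similarity("Diabetes mellitus type 2", "Diabetes mellitus type 1"))
print(semantic_similarity("Type II diabetes mellitus with arthropathy", "Diabetes mellitus type 2"))
```

A concept-similarity measure of this kind is what lets an encoded case-base match a query stated at a coarser or finer level of granularity than the stored case, which plain string matching cannot do.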

DATASET DESCRIPTION

The paper uses a dataset from the diagnostic biochemical lab (AutoLab) of Mansoura University institutions, Mansoura, Egypt, collected in the period from January 2010 through August 2013. The control subjects were healthy, were recruited from the diagnostic biochemical lab, and were matched by age, sex, and ethnicity to the case subjects. The eligibility criteria for controls were the same as those for patients, except for having a cancer diagnosis. A short structured questionnaire was used to screen potential controls based on the eligibility criteria; analysis of the answers indicated that 80% of those questioned agreed to participate in clinical research. A total of 67 eligible subjects were ascertained in the current study; however, seven control subjects were excluded due to limited blood samples for testing AFP. Blood samples (5 mL) were taken and centrifuged, and the serum was separated and stored at −20°C until analyzed. Serum samples were assayed for AFP by enzyme-linked immunosorbent assay with commercial kits (Abbott, North Chicago, IL), and for alanine aminotransferase (ALT) and aspartate aminotransferase (AST) with an auto-analyzer (Hitachi Model 736, Japan) and commercial kits. The problem features include: Demographics (Residence, Occupation, Gender, Age, BMI); Lab tests (HbA1C, 2h PG, FPG); Hematological Profile (e.g., Prothrombin INR, Red cell count, Hemoglobin, Haematocrit (PCV), MCV, MCH, MCHC, Platelet count, White cell count, Basophils, Lymphocytes, Monocytes, Eosinophils); Symptoms (Urination frequency, Vision, Thirst, Hunger, Fatigue); Kidney Function Lab tests (Serum Potassium, Serum Urea, Serum Uric Acid, Serum Creatinine, Serum Sodium);


Table 1. Patient attributes used to describe cases

| Feature Type | Feature Name | Data Type | Normal Range | UoM | Min–Mean–Max | F. No. |
|---|---|---|---|---|---|---|
| Demographics | Residence | C | {Urban, Rural} | - | - | 1 |
| Demographics | Occupation | C | {Farmer, …} | - | - | 2 |
| Demographics | Gender | C | {Male, Female} | - | - | 3 |
| Demographics | Age | N | - | year | 29 – 48.117 – 74 | 4 |
| Demographics | BMI | N | 18.5 – 25 | kg/m2 | 20 – 33.117 – 45 | 5 |
| Diabetes Lab Tests | HbA1C | N | | | | |

(The remaining feature groups listed in the table are Hematological Profile, Symptoms, Kidney Function Lab tests, Lipid Profile, and Tumor Markers.)

\[
X(t+1) =
\begin{cases}
X^{*}(t) - A \cdot D & \text{if } p < 0.5 \\[4pt]
D' \cdot e^{bl} \cdot \cos(2\pi l) + X^{*}(t) & \text{if } p \geq 0.5
\end{cases}
\tag{18}
\]


where \(p \in [0, 1]\) represents the probability of choosing either the shrinking encircling mechanism or the spiral model to update the position of the whales. The humpback whales also search randomly for prey; therefore, their locations are updated by choosing a randomly selected search agent (position) instead of the best one, as follows:

\[
D = \lvert C \cdot X_{r}(t) - X(t) \rvert \tag{19}
\]

\[
X(t+1) = X_{r}(t) - A \cdot D \tag{20}
\]

where \(X_{r}(t)\) is a random position vector chosen from the current population. The whale optimization algorithm is described in more detail in Algorithm (3). The WOA is modified here to select features: the search space is modeled as an n-dimensional Boolean space and the fitness function is based on the neighborhood rough set; we call this modification the binary WOA. The solution of the binary WOA is a binary vector in which a 1 means that the corresponding feature is selected and a 0 means that it is not. The algorithm starts by selecting a random position for each whale, and the position is then converted into a binary vector as:

\[
x(t+1) = X_{r}(t) - A \cdot D \tag{21}
\]

Table 4. Algorithm 3: Whale optimization algorithm

Initialize the maximum number of iterations, the global best fitness, and the number of whales N; generate a population of N whales
Current iteration = 1
For each whale
    Compute the fitness of the current whale
    If (current fitness is less than the global best fitness) then
        Best position = current position; global best fitness = current fitness
    End If
End For
Repeat
    Decrease the value of a from 2 to 0
    For each whale
        Compute A, C and the probability p
        If p < 0.5 then
            Update the position using the corresponding update equation
        Else
            If |A| < 0.5 then
                Update the position using the corresponding update equation
            Else
                Update the position using the corresponding update equation
            End If
        End If
    End For
    Current iteration = Current iteration + 1
Until the current iteration reaches the maximum number of iterations




where ε is a random value. The fitness function for each whale is then computed as:

\[
F(R) = \alpha \, \gamma_{R}(d) + (1 - \alpha)\left(1 - \frac{\lvert R \rvert}{\lvert C \rvert}\right) \tag{22}
\]

where \(\gamma_{R}(d)\) is defined as:

\[
\gamma_{R}(d) = \frac{\lvert POS_{R}(d) \rvert}{\lvert U \rvert} \tag{23}
\]

This function can be considered a general fitness function for feature selection algorithms; here, however, the lower approximation of the neighborhood rough set is used instead of that of the classical rough set. \(\lvert R \rvert\) is the number of selected features and \(\lvert C \rvert\) is the total number of features. For each whale, the fitness function is computed and compared with the global best fitness. The value α balances the accuracy of classification against the number of selected features.
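As a concrete illustration of this wrapper-style fitness, the sketch below evaluates a binary feature mask with the weighted trade-off of Eq. (22). Because computing the neighborhood rough-set dependency \(\gamma_R(d)\) needs the full approximation machinery, the sketch substitutes a cross-validated classifier accuracy for that term; this substitution, and all parameter values, are assumptions of the example, not the chapter's exact computation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y, alpha=0.9):
    """Fitness of a binary feature mask: reward accuracy on the selected
    features, penalize the fraction of features kept (Eq. 22 structure)."""
    selected = np.flatnonzero(mask)
    if selected.size == 0:
        return 0.0
    # Stand-in for gamma_R(d): mean cross-validated accuracy of a 1-NN classifier.
    gamma = cross_val_score(KNeighborsClassifier(n_neighbors=1),
                            X[:, selected], y, cv=3).mean()
    reduction = 1.0 - selected.size / X.shape[1]
    return alpha * gamma + (1.0 - alpha) * reduction

# Tiny synthetic example: 2 informative features out of 6.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
mask = np.array([1, 1, 0, 0, 0, 0])
print(round(fitness(mask, X, y), 3))
```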

Particle Swarm Optimization (PSO) Algorithm

In the PSO algorithm, the particle swarm consists of N particles, and the position of each particle stands for a potential solution in a D-dimensional space. A particle updates its position, and hence the obtained solution, according to the following: (1) keeping its inertia (its own current position); (2) moving according to its own best position so far, named pbest; and (3) moving according to the swarm's best position, the social experience named gbest. The position of each particle in the swarm is affected both by the best position found during its own movement and by the position of the best particle in its surroundings (Eberhart & Kennedy, 1995). The repositioning of each particle is done according to equation (24):

\[
x_{i,d}^{t+1} = x_{i,d}^{t} + v_{i,d}^{t+1} \tag{24}
\]

where t is the iteration number, i is the particle index, d is the dimension, \(x_{i,d}^{t}\) is the current position of particle i along dimension d at iteration t, \(x_{i,d}^{t+1}\) is its updated position, and \(v_{i,d}^{t+1}\) is its updated velocity. The velocity vector is calculated as given in equation (25):

\[
v_{i,d}^{t+1} = v_{i,d}^{t} + c_{1} r_{1}^{t} \left(pbest_{i,d}^{t} - x_{i,d}^{t}\right) + c_{2} r_{2}^{t} \left(gbest_{d}^{t} - x_{i,d}^{t}\right) \tag{25}
\]

where \(v_{i,d}^{t}\) is the current velocity of dimension d for particle i at iteration t, \(pbest_{i,d}^{t}\) is the best position found by particle i along dimension d up to time t, \(gbest_{d}^{t}\) is the best position found by the swarm up to time t along dimension d, and \(c_{1}\) and \(c_{2}\) represent the amount of loyalty and selfishness of the particles.


Usually, \(c_{1}\) is equal to \(c_{2}\) and both are set to 2, and \(r_{1}^{t}\) and \(r_{2}^{t}\) are random factors drawn uniformly from the range [0, 1]. PSO is described in more detail in Algorithm (4).
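The sketch below is a minimal, self-contained implementation of the velocity and position updates of Eqs. (24)-(25) on a toy minimization problem. The objective function, swarm size, velocity clamping, and other parameters are illustrative choices, not the configuration used in the chapter's experiments.

```python
import numpy as np

def pso(objective, dim=5, n_particles=20, iterations=100, c1=2.0, c2=2.0, seed=1):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))   # positions
    v = np.zeros((n_particles, dim))             # velocities
    pbest = x.copy()
    pbest_val = np.apply_along_axis(objective, 1, x)
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(iterations):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        v = v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # Eq. (25)
        v = np.clip(v, -4.0, 4.0)                               # classic Vmax clamping
        x = x + v                                               # Eq. (24)
        values = np.apply_along_axis(objective, 1, x)
        improved = values < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], values[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

best_position, best_value = pso(lambda z: float(np.sum(z ** 2)))
print(best_value)  # best value found for the sphere function
```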

Genetics Algorithm (GA)

GA mimics the natural evolution process on a population of initial individuals represented numerically. Some individuals of the initial population are exposed to crossover and mutation operations to produce better individuals, which contribute to the next generation of the population. To determine which individuals deserve to participate in crossover and mutation, all the population's individuals undergo a selection process that selects the fittest individuals according to a predetermined fitness function. The crossover operation randomly chooses pairs of these selected individuals to breed, and the mutation of some individuals keeps diversity in the population; see Algorithm (5).
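The following is a minimal sketch of the selection, crossover, and mutation loop just described, written for binary feature-selection chromosomes; the population size, operator rates, and the toy fitness function are illustrative assumptions only.

```python
import numpy as np

def genetic_algorithm(fitness_fn, n_features, pop_size=20, generations=30,
                      crossover_rate=0.9, mutation_rate=0.05, seed=2):
    rng = np.random.default_rng(seed)
    population = rng.integers(0, 2, (pop_size, n_features))
    for _ in range(generations):
        scores = np.array([fitness_fn(ind) for ind in population])
        order = scores.argsort()[::-1]
        parents = population[order[: pop_size // 2]]      # selection: keep the fittest half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            if rng.random() < crossover_rate:              # one-point crossover
                point = rng.integers(1, n_features)
                child = np.concatenate([a[:point], b[point:]])
            else:
                child = a.copy()
            flip = rng.random(n_features) < mutation_rate  # mutation keeps diversity
            child[flip] = 1 - child[flip]
            children.append(child)
        population = np.vstack([parents, children])
    scores = np.array([fitness_fn(ind) for ind in population])
    return population[scores.argmax()]

# Toy fitness: prefer masks that select exactly three features.
best = genetic_algorithm(lambda m: -abs(int(m.sum()) - 3), n_features=10)
print(best)
```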

THE PROPOSED ARABIC CHARACTER RECOGNITION SYSTEM

In this section the proposed system is introduced. It consists of three main phases, namely preprocessing, feature extraction (including feature selection), and classification. The overall flow of the proposed system is illustrated in Figure 1.

Preprocessing Phase The input character image has to be prepared for the next steps, this makes the extracted features more efficient in recognition. In this work, we used the techniques in (Sahlol, Suen, Elbasyoni, & Sallam, 2014a) which included binarization and several methods of noise removal (including morphological and Table 5. Algorithm 4: Particle Swarm Optimization algorithm Randomly initialize particles position and velocity. While (Stopping criterion doesn’t met) { Evaluate each particle swarm fitness. For (p = 1 to N} If (F(xp) < F(pbestp)) { Update pbestp = Xp Else gbest = Xp For (D = 1 to d (number of dimensions) Update particle’s velocity and position

Table 6. Algorithm 5: Genetic algorithm

Randomly initialize a population of individuals
Evaluate the individuals' fitness
While (the stopping criterion is not met)
    Select the best individuals
    Generate new individuals by using crossover and mutation operations
    Evaluate the new individuals' fitness
    Replace the worst individuals by the new best individuals


Figure 1. The overall process of the proposed system

statistical operations, dilation, and median filtering); further preprocessing steps used in this work are shown in Figure 2.

Feature Extraction and Selection Phase

Several features were used to cope with the wide variations of Arabic characters and the similarities between distinct character shapes. We adopted the features in (Sahlol, Suen, Elbasyoni, & Sallam, 2014b), which are: features from the whole character (main body and secondaries), features from only the main body of the character, and features from only the secondaries; in addition, we used gradient features. In this work each character image was normalized to 128×128; this normalization scale was chosen because it achieves better results than the 64×64 or 32×32 scales, which also matches

Figure 2. Current pixel “p” and neighboring based data/noise decision

907

 Bio-Inspired Optimization Algorithms for Arabic Handwritten Characters

the results achieved by (Sagheer, 2010). Consequently, the gradient algorithm with Sobol filter has been used in this work.

Feature Vector Normalization
The min–max normalization method (Hann & Kamber, 2011) was used because the feature ranges were very wide; it is also simple and efficient in terms of computation time. Before building the feature extraction process, two problems must be defined: feature extraction and feature selection. Feature extraction concerns which technique will be used to extract representative features from the handwritten Arabic character, while feature selection searches for the most relevant features to improve the classification accuracy. The statistics and machine learning communities have studied the feature selection problem for many years; in machine learning, feature selection is a global optimization problem. To improve the performance of the Arabic character recognition system, we present a feature selection method based on the Bat Algorithm, Grey Wolf Optimization, the Whale Optimization Algorithm (used to optimize the neighborhood rough set), Particle Swarm Optimization and the Genetic Algorithm; these optimizers were discussed in more detail in the section Preliminaries: swarm-based optimization.
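A minimal numpy sketch of the min–max scaling described above is shown below; the small eps guard for constant-valued features is an added assumption, not part of the cited method.

```python
import numpy as np

def min_max_normalize(features, eps=1e-12):
    """Scale every feature column to [0, 1]: x' = (x - min) / (max - min)."""
    fmin = features.min(axis=0)
    fmax = features.max(axis=0)
    return (features - fmin) / (fmax - fmin + eps)   # eps guards constant columns
```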

Classification Phase
K-Nearest Neighbors (K-NN)
K-NN is a supervised pattern recognition technique; it can be considered a non-parametric method for classification and regression (in this work it is used as a classifier). It classifies by comparing each new sample to a collection of labeled examples in a training set: a new sample is assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbor.
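The sketch below shows how such a K-NN classifier could be trained and evaluated with scikit-learn; the random placeholder matrix stands in for the selected character features and is purely illustrative, not the chapter's data or code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data: rows stand in for the selected character features,
# labels stand in for the 28 basic Arabic characters.
rng = np.random.default_rng(0)
X = rng.random((280, 40))
y = np.repeat(np.arange(28), 10)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.35,
                                          stratify=y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3)   # k = 1 would copy the single nearest label
knn.fit(X_tr, y_tr)
print("K-NN accuracy:", knn.score(X_te, y_te))
```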

Linear Discriminant Analysis (LDA) It is a classification method that assumes that different classes generate data based on different Gaussian distributions. It learns a linear classification boundary for the training samples space. LDA fits a multivariate normal density to each class, with a pooled estimate of covariance (Gaber et al., 2015). LDA achieved the best recognition with Arabic OCR system among the other classifiers that were used in the experiment, with a minimum number of features (Khedher, Abandah & Al-Khawaldeh, 2005).

Random Forest (RF)
Random Forest is a general term for ensemble methods using tree-based classifiers. In training, the Random Forest algorithm creates a number of trees (a user-defined parameter to which the algorithm is not very sensitive); each tree is trained on a sample of the original training data and, at each node, searches only across a randomly selected subset of the input variables to determine the split. For classification, each tree in the Random Forest casts a unit vote for the most popular class for an input, and the output of the classifier is determined by a majority vote of the trees. Often, the size of the randomly selected subset of variables is set to the square root of the number of inputs; as a result, the Random Forest algorithm can handle high-dimensional data and use a large number of trees in the ensemble (Gislason, Benediktsson & Sveinsson, 2006).
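As a hedged illustration of this ensemble, the following scikit-learn sketch uses the square-root rule for the per-split variable subset; the placeholder data and the choice of 100 trees are assumptions for demonstration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((280, 40))                  # placeholder feature matrix
y = np.repeat(np.arange(28), 10)           # placeholder character labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.35,
                                          stratify=y, random_state=0)
# n_estimators is the user-defined number of trees; max_features="sqrt" lets each
# split search only about sqrt(n_features) randomly chosen input variables.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print("Random Forest accuracy:", rf.score(X_te, y_te))   # majority vote of the trees
```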

EXPERIMENTAL RESULTS AND DISCUSSION
The CENPARMI dataset (Alamri, Sadri, Suen & Nobile, 2008) is used to measure the efficiency of the proposed algorithm. The dataset contains about 21,000 images of Arabic handwritten characters written by 328 writers. The samples were carefully selected to represent the initial, medial, and final forms of the 28 Arabic characters; in this work we use only the 28 basic Arabic characters. Table 7 shows the parameter settings of the used bio-inspired algorithms, while Table 8 shows a collection of samples from the dataset. The results of the used classifiers were evaluated using a well-known statistical measure, the overall accuracy:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (26)

where TP is the number of true positive samples, TN the number of true negatives, FP the number of false positives and FN the number of false negatives.

Table 7. WOA, BA, PSO, GWO and GA parameters used in this work

Optimizer   Parameter                               Value(s)
WOA         Number of Search Agents                 30
            Number of Iterations                    50
GWO         Number of Search Agents                 5
            Number of Iterations                    100
BAT         Number of Search Agents                 5
            Number of Iterations                    100
            Loudness (A)                            0.5
            Pulse rate (r)                          0.5
            Frequency minimum (Qmin)                0
            Frequency maximum (Qmax)                2
GA          Crossover fraction                      0.8
PSO         Inertia factor                          0.1
            Individual-best acceleration factor     0.1


Table 8. Variation in characters from dataset

The proposed optimization approach has been implemented in MATLAB; Figure 3 and Tables 9 and 10 summarize the results. Table 9 shows that most of the optimization algorithms achieve better recognition accuracy compared with the non-optimized features (the whole feature set). Also, the WOA-NRS and the PSO achieve the best results among the optimizers, respectively. From Table 10, it is obvious that the proposed feature selection algorithms achieved better results compared to previous works; most interestingly, we achieved better accuracy along with less computation, which translates into lower running time and fewer resources as well.

Figure 3. Best three experiments for GWO, PSO, GA, WOA-NRS AND BA


Table 9. Recognition accuracy of features optimized by GA, PSO, GWO, BA and WOA-NRS, based on the K-NN, RF and LDA classifiers

Optimizer     K-NN     RF       LDA
None          80.66    83.01    84.02
GA            81.13    81.53    88.20
PSO           80.99    82.68    88.07
WOA (NRS)     81.94    84.70    89.01
BA            80.05    82.34    86.99
GWO           81.26    82.41    88.07

Table 10. Comparisons between previous works and the proposed algorithm

Previous Work                                 Classifier                      Optimization               Recognition Accuracy (%)
(Al-Taani & Al-Haj, 2010)                     Decision tree                   -                          75.3
(Abandah, Younis & Khedher, 2008)             Linear Discriminant Analysis    Selection of (95) features 87
(Sahlol, Suen, Elbasyoni, & Sallam, 2014a)    Feed Forward Neural Network     -                          88
The proposed system                           Linear Discriminant Analysis    WOA (NRS)                  89.01
                                                                              GA                         88.20
                                                                              PSO                        88.07
                                                                              GWO                        88.07
                                                                              BA                         86.99

CONCLUSION AND FUTURE WORK
This paper presents an approach for feature selection using bio-inspired optimization algorithms to improve the recognition rate for handwritten Arabic characters. The used optimizers included the Bat Algorithm (BA), Grey Wolf Optimization (GWO), the Whale Optimization Algorithm (which optimizes the neighborhood rough set), Particle Swarm Optimization (PSO) and the Genetic Algorithm (GA). Each character has to pass through the OCR steps of pre-processing, feature extraction, and classification. Several important pre-processing operations were performed, and features were extracted to overcome the variations of character shapes in a standard dataset (CENPARMI). The optimizers were used as feature selectors to choose the most significant features, which improves the recognition rate. The experimental results showed that the used optimization algorithms improve the classification rate significantly; the results were measured with the well-known RF, LDA and K-NN classifiers. In the future, we intend to work on the updating mechanisms of the optimization algorithms for feature selection to further minimize the number of features and maximize the classification accuracy. We may also examine them on large OCR datasets with a large number of features.


REFERENCES Abandah, G. A., & Anssari, N. (n.d.). Novel moment features extraction for recognizing handwritten Arabic letters. Journal of Computer Science, 5(3), 226-232. Abandah, G. A., Younis, K. S., & Khedher, M. Z. Handwritten Arabic Character Recognition Using Multiple Classifiers Based on Letter Form. Proceedings of the 5th International Conference on Signal Processing, Pattern Recognition, and Applications (SPPRA), 128-133. Aghdam, M. H., Ghasem-Aghaee, N., & Basiri, M. E. (2009, April). Text feature selection using ant colony optimization. Expert Systems with Applications, 36(3), 6843–6853. doi:10.1016/j.eswa.2008.08.022 Al-Taani, A. T., & Al-Haj, S. (n.d.). Recognition of On-line Arabic handwritten characters using structural features. Journal of Pattern Recognition Research, 1, 23-37. Alamri, H., Sadri, J., Suen, C. Y., & Nobile, N. (n.d.). A novel comprehensive database for Arabic offline handwriting recognition. Proceedings of 11th International Conference on Frontiers in Handwriting Recognition, 8, 664-669. Eberhart, R. C., & Kennedy, J. (n.d.). A New Optimizer Using Particle Swarm Theory. Proceeding of the Sixth International Symposium on Micro Machine and Human Science, 39-43. doi:10.1109/ MHS.1995.494215 El-Gaafary, A. A., Mohamed, Y. S., Hemeida, A. M. & Mohamed, A. A. (n.d.). Grey wolf optimization for multi input multi output system. Universal Journal of Communications and Network, 3(1), 1-6. Emary, E., Zawbaa, H. M., Grosan, C., & Hassenian, A. (n.d.). Feature subset selection approach by gray-wolf optimization. Afro-European Conference for Industrial Advancement, 41(7), 1-13. Enache, A.-C., & Sgârciu, V. (n.d.). An Improved Bat Algorithm Driven by Support Vector Machines for Intrusion Detection. International Joint Conference, 41-51. Fouad, M. M., Zawbaa, H. M., Gaber, T., Snasel, V., & Hassanien, A. E. (2016). A Fish Detection Approach Based on BAT Algorithm. In The 1st International Conference on Advanced Intelligent System and Informatics (AISI2015), (pp. 273-283). Springer International Publishing. doi:10.1007/978-3-31926690-9_25 Gaber, T., Tharwat, A., Ibrahim, A., Snášel, V., & Hassanien, A. E. (2015, September). Human Thermal Face Recognition Based on Random Linear Oracle (RLO) Ensembles. In Proceedings of the 2015 International Conference on Intelligent Networking and Collaborative Systems (pp. 91-98). IEEE Computer Society. doi:10.1109/INCoS.2015.67 Gislason, M. P. O., Benediktsson, J. A., & Sveinsson, J. R. (2006, March). Random forests for land cover classification. Pattern Recognition Letters, 27(4), 294–300. doi:10.1016/j.patrec.2005.08.011 Hann, J., & Kamber, M. (n.d.). Data Mining: Concepts and techniques (3rd ed.). Morgan Kaufman. Hassanien, A. E., Alamry, E., & Intelligence, S. (2015). Principles, Advances, and Applications. CRC – Taylor & Francis Group.


Hu, Q., Yu, D., Liu, J., & Wu, C. (2008, September). Neighborhood Rough Set Based Heterogeneous Feature Subset Selection. Information Sciences, 178(18), 3577–3594. doi:10.1016/j.ins.2008.05.024 Huang, C. L. & Dun, J. F. (n.d.). A distributed PSOSVM hybrid system with feature selection and parameter optimization. Journal of Applied Soft Computing, 8(4), 1381-1391. Kennedy, J., & Eberhart, R. (n.d.). Particle swarm optimization. Proceedings of IEEE International Conference on Neural Networks, 1942-1948. Khedher, M. Z., Abandah, G. A. & Al-Khawaldeh, A. M. (n.d.). Optimizing Feature Selection for Recognizing Handwritten Arabic Characters. World Academy of Science, Engineering and Technology, 1(4), 1023-1026. Kumar, S. U., & Hannah, H. I. (n.d.). PSO-based feature selection and neighborhood rough set-based classification for BCI multiclass motor imagery task. Neural Computing and Applications, 1-20. Liu, Y., Wang, G., Chen, H., Dong, H., Zhu, X., & Wang, S. (n.d.). An improved particle swarm optimization for feature selection. Journal of Bionic Engineering, 8(2), 191-200. Mirjalili, S., & Lewisa, A. (2016, May). The Whale Optimization Algorithm. Advances in Engineering Software, 95, 51–67. doi:10.1016/j.advengsoft.2016.01.008 Mirjalili, S., Mirjalili, S.M., & Lewis, A. (n.d.). Grey Wolf Optimizer. Journal of Advances in Engineering Software, 69(7), 46–61. Nakamura, R. Y. M., Pereira, L., Acuckoo, M., Costa, K. A., Rodrigues, D., Papa, J. P., & Yang, X. S. (n.d.). BBA: a binary bat algorithm for feature selection. Proceedings of the XXV SIBGRAPI-Conference on Graphics Patterns and Images, 291-297. doi:10.1109/SIBGRAPI.2012.47 Nguyen, Bach, H., Xue, B., Liu, I., Andreae, P., & Zhang, M. (n.d.). New mechanism for archive maintenance in PSO-based multiobjective feature selection. Soft Computing, 1-20. Rani, A., & Rajalaxmi, R. R. (n.d.). Unsupervised feature selection using binary bat algorithm. 2nd International Conference on Electronics and Communication Systems (ICECS), 451-456. doi:10.1109/ ECS.2015.7124945 Sagheer, M. W. (n.d.). Novel Word Recognition and Word Spotting Systems for Offline Urdu Handwriting (Master thesis). Concordia University, Montreal, Quebec, Canada. Sahlol, A.T., Suen, C.Y., Elbasyoni, M.R., & Sallam, A.A. (n.d.). A proposed OCR Algorithm for cursive Handwritten Arabic Character Recognition. Journal of Pattern Recognition and Intelligent Systems, 90-104. Sahlol, A.T., Suen, C.Y., Elbasyoni, M.R., & Sallam, A.A. (n.d.). Investigating of Preprocessing Techniques and Novel Features in Recognition of Handwritten Arabic Characters. Artificial Neural Networks in Pattern Recognition, 264-276. Si-Yuan, J. (n.d.). A hybrid genetic algorithm for feature subset selection in rough set theory. Soft Computing, 18, 1373–1382.


Stjepan, O., & Oreski, G. (n.d.). Genetic algorithm-based heuristic for feature selection in credit risk assessment. Expert Systems with Applications, 41(4), 2052–2064. Tsai, C.-F., Eberle, W., & Chu, C.-Y. (2013, February). Genetic algorithms in feature and instance selection. Knowledge-Based Systems, 39, 240–247. doi:10.1016/j.knosys.2012.11.005 Yang, X.-S. (2011). Bat algorithm for multi-objective optimisation. International Journal of Bio-inspired Computation, 3(5), 267–274. doi:10.1504/IJBIC.2011.042259 Zahran, B. M., & Kanaan, G. (n.d.). Text Feature Selection using Particle Swarm Optimization Algorithm. World Applied Sciences Journal, 7, 69-74.



Chapter 40

Telemetry Data Mining Techniques, Applications, and Challenges Sara Ahmed Al Azhar University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Tarek Gaber Suez Canal University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt Aboul Ella Hassanien Cairo University, Egypt & Scientific Research Group in Egypt (SRGE), Egypt

ABSTRACT
The most recent rise of telemetry is around the use of radio-telemetry technology for tracking the traces of moving objects. Radio telemetry was first used in the 1960s for studying the behavior and ecology of wild animals. Nowadays, a wide spectrum of applications can benefit from radio telemetry technology combined with tracking methods, such as path discovery, location prediction, movement behavior analysis, and so on. Accordingly, the rapid advance of telemetry tracking systems has boosted the generation of large-scale trajectory data tracing the movement of objects. In this study, we survey various applications of trajectory data mining and review an extensive collection of existing trajectory data mining techniques to be used as a guideline for designing future trajectory data mining solutions.

INTRODUCTION
Telemetry is the automatic measurement of data from remote sources and its wireless transmission to a central or host location (Mcdermott, 2006). There, it can be monitored and used to control a process at the remote site. In other words, telemetry is the process by which an object's characteristics are measured and the results transmitted to a distant station where they are displayed and processed according to user specifications (Al-Serafi, 2015). Today, telemetry applications include measuring and transmitting data from sensors located in automobiles, smart meters, power sources, robots and even wildlife, in what is commonly called the Internet of Things (IoT). Thanks to the wide adoption of GPS and other telemetry tracking systems, massive amounts of trajectory data tracing the movement of objects have been collected. Moving object data can relate to humans, man-made objects (e.g., airplanes, vehicles, spacecraft and ships), animals, and/or natural forces (e.g., hurricanes and tornadoes). Since trajectories are sequences of real-valued locations with errors and missing values, mining them is not a straightforward task; hence, trajectory mining research has attracted a great deal of attention in recent years (Xiaoliang, 2012). In this survey we present a brief review of trajectory data mining techniques and applications, and report some research challenges and future work. The rest of the paper is structured as follows: in the following section, the trajectory data model is defined; applications of trajectory data mining are then discussed in the second section; in the third section, the main techniques for trajectory data mining tasks are presented; finally, the conclusion is given.

TRAJECTORY DATA MODEL
A spatial trajectory is a trace generated by a moving object in geographical space, consisting of an ordered set of spatiotemporal points (Frentzos, 2007). Any trajectory T can thus be seen as an ordered set of spatiotemporal points with three dimensions: location in terms of the x-coordinate "x" and y-coordinate "y", and the temporal dimension in terms of time "t". This is formally defined as follows, and can be seen in Figure 1 (Al-Serafi, 2015):

T = {(x1, y1, t1), (x2, y2, t2), …, (xn, yn, tn)}    (1)
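For concreteness, a trajectory as defined in equation (1) can be represented directly as an ordered list of (x, y, t) points; the following Python sketch is purely illustrative (the type names are not from the chapter) and shows one minimal way to build such a structure.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Point:
    x: float   # x-coordinate
    y: float   # y-coordinate
    t: float   # timestamp

Trajectory = List[Point]   # T = [(x1, y1, t1), ..., (xn, yn, tn)], ordered by time

def make_trajectory(samples: List[Tuple[float, float, float]]) -> Trajectory:
    """Build a trajectory from raw (x, y, t) tuples, sorted chronologically."""
    return [Point(x, y, t) for x, y, t in sorted(samples, key=lambda s: s[2])]

traj = make_trajectory([(0.0, 0.0, 0.0), (1.0, 0.5, 1.0), (2.0, 1.5, 2.0)])
```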

Figure 1. Trajectory consisting of "n" points

APPLICATION OF TRAJECTORY DATA MINING
There exists a wide spectrum of applications driven and improved by trajectory data mining, since knowing moving objects' locations in advance can be substantial. Discovery of behavioral patterns and prediction of future movement can greatly influence different fields, such as analysis of wild animals' movement in order to predict their migrations, monitoring and analysis of vehicle movement in order to predict traffic congestion, prediction of mobile user movement and access point availability in order to assure the requested level of quality of service, or analysis and location prediction of the movement of aircraft or spacecraft (Ivana, 2006). In this section, we classify these applications based on the categories from which trajectories are derived. The derivation of trajectories can be classified into four major categories: mobility of people, mobility of animals, mobility of vehicles and mobility of natural phenomena.

Mobility of People
People have been recording their real-world movements in the form of spatial trajectories, passively and actively, for a long time (Zhengi, 2011). Trajectory data provide many opportunities to analyze the movement behavior of people, and discovery of movement patterns is crucial for understanding human behavior. Knowledge discovery on movement data has been one of the most productive research communities, having generated substantial scientific output, as seen by the vast number of algorithms and methods developed in the last decade. The majority of these methods have focused on mining the geometric properties of a trajectory. Recent research has addressed the challenges of generating new knowledge for understanding human mobility behavior (Zhenhuii, 2010). Predicting human behavior accurately under emergency conditions is a crucial issue for disaster alarming, disaster management, disaster relief and societal reconstruction after disasters; one recent work in this field (Song, 2014) analyzes the emergency behavior of human beings and their mobility patterns after a big nuclear accident in Japan, leveraging a large human mobility database. It shows that emergency behavior after disasters sometimes correlates with people's normal mobility patterns. Furthermore, several impacting factors, e.g., social relationships, intensity of a disaster, damage level, news reporting and population flow, are investigated and a predictive model is thus derived. Another recent work (González, 2008) explores individual human mobility patterns by studying a large amount of anonymous position data from mobile phone users and reveals a high degree of temporal and spatial regularity in human trajectories (Zhenni, 2016).

Mobility of Animals
Biologists have been collecting the moving trajectories of animals to track animal movements at fine spatial and temporal scales for the purpose of group behavior analysis (Lord, 1963; Dipto, 2015). Many data mining technologies can be applied to animal movement data to abstract animal mobility-related phenomena; for example, (Yuwei, 2016) proposed the notion of continuous behavior patterns as a concise representation of popular migration routes and the underlying sequential behaviors during migration.

Mobility of Vehicles
There exists a wide spectrum of applications driven and improved by trajectories generated by vehicles. Path discovery is one of the most common applications: it aims to find at least one path that satisfies a predefined objective given a source and a destination, such as the fastest path problem (Chen, 2011). Another is the location/destination prediction problem, which predicts the final destination of vehicle trips (Xue, 2013).


Mobility of Natural Phenomena Meteorologists, environmentalists, climatologists, and oceanographers are busy collecting the trajectories of some natural phenomena, such as hurricanes, tornados, and ocean currents. These trajectories capture the change of the environment and climate, helping scientists deal with natural disasters and protect the natural environment we live in (ZHENGi, 2015).

TRAJECTORY DATA MINING TECHNIQUES
Mining trajectory data, or mobility data, is an emerging area of research that aims at the analysis of mobility data by means of appropriate patterns and models extracted by efficient algorithms. In general, many data mining methods have been developed for analyzing moving object data. Based on the nature of the problems, these methods can be categorized into classification, clustering, segmentation and pattern mining.

Trajectory Classification
Classification is one of the most important research areas of knowledge discovery in trajectory data. It aims at explaining the behavior of current moving objects and predicting that of future ones. Classification is one of the fundamental problems in machine learning theory: suppose we are given n classes of trajectories; when we are faced with a new, previously unseen trajectory, we have to assign it to one of the classes. The problem can be formalized as follows:

(T1, c1), …, (Tm, cm) ∈ (TD × C)    (2)

where TD is a non-empty set of trajectory samples as defined in equation (1) and, in the present context, C = {1, ..., n}; the ci ∈ C are called labels and contain information about which class a particular trajectory belongs to. Classification means generalization to unseen trajectory data (T, c), i.e., we want to predict the label c ∈ C given some new trajectory T ∈ TD. Formally, this amounts to the estimation of a function f : T → C using the input-output training data, generated independently and identically distributed according to an unknown probability distribution (Lokeshi, 2010).

A number of trajectory classification methods have been proposed that build a model on training data and then apply the trained model to predict the labels of test trajectories. Most of the earlier proposed methods use the shapes of trajectories to do classification, e.g., by modeling a whole trajectory with a single mathematical function such as the Hidden Markov Model (HMM); for instance, (Bashir, 2011) presented a framework for classifying human motion trajectories which uses the HMM with a mixture of Gaussians. Another example is a classification model based on distance similarity measures, which are considered a fundamental ingredient of the classification process (Ayeldeen, 2015) and can effectively determine the similarity of trajectories. (Lokeshi, 2010) focuses on a trajectory similarity technique to measure the distance: a Nearest Neighbor classification method for trajectory data is proposed in that work, where the main issue of a Nearest Neighbor classifier is measuring the distance between two items; the classification results demonstrated good classification accuracy as well as classification efficiency. In recent work, (Patel, 2012) focuses on duration information to boost prediction accuracy, unlike earlier works, which did not consider the duration information of a trajectory. The method utilizes information theory to obtain regions where the trajectories have similar speeds and directions. Further, trajectories are summarized into a network based on the minimum description length (MDL) principle that takes into account the duration difference among trajectories of different classes, since duration information greatly contributes to differentiating moving objects that travel at similar speeds.

Trajectory Clustering
Trajectory clustering is the most popular topic in current trajectory data mining. It aims at discovering the similarity (distance) in a moving object database, grouping similar trajectories into the same cluster, and finding the most common movement behaviors (ZX, 2012). The measurement of trajectory similarity (or distance) is one of the key points in defining trajectory clustering; the trajectory-related definitions and their associated attributes are defined as follows. A trajectory (TR) is a chronological sequence consisting of multi-dimensional locations, denoted by equation (3):

TRi = {P1, P2, …, Pm} (1 ≤ i ≤ n)    (3)

Each sampling point Pj (1 ≤ j ≤ m) in TRi is represented as <Locationj, Tj>, which means that the position of the moving object is Locationj at time Tj; Locationj is a multi-dimensional point. A sequence Pc1, Pc2, …, Pci (1 ≤ c1 < c2 < … ≤ m) represents a trajectory segment or sub-trajectory of TRi, denoted as TS (Trajectory Segment), TSi = Li1, Li2, …, Linum (Guan, 2016).

Similarity Measures for Trajectories
Similarity measurement is one of the most important parts of a clustering algorithm: the similarity or distance of two distinct data items must be compared before they can be grouped into clusters (Zhang, 2006). The distance between trajectories needs to be carefully defined in order to reflect the true underlying similarity, because trajectories are essentially high-dimensional data with both spatial and temporal attributes, which need to be considered in similarity measures. Many methods have been widely applied to facilitate query processing and clustering of trajectory data, e.g., Euclidean distance (ED) (Jonkery et al. 1980), Dynamic Time Warping (DTW) (Soong & Rosenberg 1988), distance based on the Longest Common Subsequence (LCSS) (Kearney & Hansen 1990), Edit Distance with Real Penalty (ERP) (Chen & Ng 2004), and Edit Distance on Real sequence (EDR) (Chen et al. 2005) (Haozhou, 2013).
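As a hedged illustration of two of the measures listed above, the following Python sketch implements a simple point-wise Euclidean distance (assuming equal-length trajectories) and a basic Dynamic Time Warping distance over sequences of (x, y) points; both are simplified relative to the cited formulations.

```python
import numpy as np

def euclidean_distance(t1, t2):
    """Mean point-wise Euclidean distance; assumes equal-length trajectories."""
    a, b = np.asarray(t1, float), np.asarray(t2, float)
    return float(np.sqrt(((a - b) ** 2).sum(axis=1)).mean())

def dtw_distance(t1, t2):
    """Basic Dynamic Time Warping distance between two sequences of (x, y) points."""
    a, b = np.asarray(t1, float), np.asarray(t2, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

p = [(0, 0), (1, 0), (2, 0)]
q = [(0, 0), (1, 1), (2, 0), (3, 0)]
print(dtw_distance(p, q))
```

Unlike the Euclidean measure, DTW tolerates trajectories of different lengths and sampling rates, which is why it is often preferred for real tracking data.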


Currently, trajectory clustering has become an attractive topic in the moving object data mining field. Hung et al. (Hung, 2015) proposed a framework called clustering and aggregating clues of trajectories to find useful patterns for discovering trajectory routes that represent the frequent movement behaviours of a user. Besides clustering the locations of moving object trajectories, many studies also focus on the shape of the trajectories based on a series of location points. Shape-based clustering analyses moving object patterns depending mainly on the moving object's location along with the time information; (Yanagisawa, 2003) defined a shape-based similarity query method using Euclidean distance and DTW distance to find trajectories which are similar to others in shape. In time-dependent clustering algorithms, both relative time and absolute time instances are needed to find similar movement patterns, such as periodic patterns and other time-related applications. Time-dependent clustering provides an effective way to discover the intrinsic structure of, and condense information over, temporal data by exploring the dynamic regularities underlying temporal data in an unsupervised learning way (Chis, 2009). For many application domains, useful information may only be extracted from trajectory data if their semantics and the background geographic information are considered; several works for trajectory data analysis have been developed that consider geographic information as the background, such as (Yuan, 2012) and (Wang, 2013). None of the previous approaches considers the underlying uncertainty. Since trajectory data consist of movements of objects whose positions are recorded as they evolve over time, the concept of uncertainty appears in various ways: data imprecision due to sampling and/or measurement errors, uncertainty in querying and answering, deliberate fuzziness during pre-processing for preserving anonymity, and so on (Nikos, 2011). In (Trajcevski, 2004), a model for uncertain trajectories is proposed that associates an uncertainty threshold ε with each trajectory point (x, y, t). Fuzzy c-means (FCM) is a clustering method which allows one piece of data to belong to two or more clusters (Xiao, 2007); it is widely applied in modern applications of trajectory clustering. In (Pelekis, 2009), the effect of uncertainty in trajectory clustering was studied: the authors proposed an intuitionistic point vector representation of trajectories that encompasses the underlying uncertainty, introduced an effective distance metric to cope with uncertainty, and then proposed a variant of the FCM clustering algorithm; the experimental evaluation over real-world trajectory data demonstrates the efficiency and effectiveness of the proposed approach.
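Since FCM plays a central role in the uncertainty-aware clustering just described, the following minimal sketch shows a generic fuzzy c-means loop over feature vectors (for trajectories, each row of X could be a per-trajectory feature vector); the random initialization, tolerance and fuzzifier m = 2 are illustrative assumptions, not the cited authors' exact algorithm.

```python
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy c-means: returns cluster centres V and the membership matrix U."""
    X = np.asarray(X, float)
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)        # each point's memberships sum to 1
    for _ in range(n_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]                  # centre update
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
        U_new = 1.0 / ratio.sum(axis=2)                           # membership update
        if np.abs(U_new - U).max() < tol:
            return V, U_new
        U = U_new
    return V, U
```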

Trajectory Segmentation
In many application scenarios, a trajectory is partitioned into sub-trajectories, each of which is often called a segment, a partition or a frame. This may serve different purposes, e.g., data handling (efficient storage and searching) or analysis (learning about the underlying structures in trajectory data). (Buchin, 2011) and (Maike, 2010) proposed a framework for trajectory segmentation, used to segment the trajectories of migrating birds, based on spatiotemporal criteria. (Buchin et al. 2011) segments a trajectory such that each segment is homogeneous in the sense that a set of spatiotemporal criteria is fulfilled. For this, it considers spatiotemporal attributes such as heading, speed, and location, and uses criteria on these attributes, such as bounding the heading angular range, bounding the speed ratio, or requiring the locations to lie in a disk of a given radius. The criteria are required to be monotone in the sense that if a segment fulfills a criterion, then so does every sub-segment of the segment. Criteria can be combined by a Boolean or a linear combination. The framework provides algorithms for segmenting a trajectory into the fewest number of pieces such that each piece fulfils the set of criteria. The algorithms are greedy strategies, incremental-search and double-and-search, which can be described as follows. The algorithm starts at the beginning of the trajectory and finds the longest segment that fulfills the set of criteria. Then it starts anew at the end point of the segment just found and again finds the longest segment fulfilling the criteria, and so on. Incremental search finds the longest segment by incrementing the test segment by one point in each step. Double-and-search first uses an exponential search on the segment size until this fails, and then a binary search between the last two points (the last where it succeeded and the first where it failed). The run time of both algorithms depends on the chosen criterion; in most cases, an overall O(n log n) run time is achieved, where n is the number of edges of the trajectory to be segmented (Maike, 2010). (Buchin et al. 2016) extend and implement the framework presented in (Buchin et al. 2011) for segmentation based on the movement states of an object; they proposed a semi-automatic framework in the sense that the parameters for segmentation need to be input manually, and the resulting segmentation is then computed automatically.
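A minimal Python sketch of the incremental-search greedy strategy is shown below; the speed-bound criterion is only an illustrative example of a monotone criterion, and the code is not the cited authors' implementation.

```python
def greedy_segmentation(traj, criterion):
    """Incremental-search segmentation: grow a segment from the current start point
    until the (monotone) criterion fails, cut there, and start again."""
    segments, start = [], 0
    while start < len(traj):
        end = start + 1
        while end + 1 <= len(traj) and criterion(traj[start:end + 1]):
            end += 1
        segments.append(traj[start:end])
        start = end
    return segments

def speed_bounded(seg, vmax=2.0):
    """Illustrative monotone criterion: every step's speed stays below vmax."""
    for (x0, y0, t0), (x1, y1, t1) in zip(seg, seg[1:]):
        if t1 > t0 and ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 / (t1 - t0) > vmax:
            return False
    return True

traj = [(0, 0, 0), (1, 0, 1), (2, 0, 2), (9, 0, 3), (10, 0, 4)]
print(greedy_segmentation(traj, speed_bounded))   # the fast jump forces a cut
```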

Trajectory Pattern Mining
Trajectory pattern mining is an emerging and rapidly developing topic in the areas of data mining and query processing that aims at discovering groups of trajectories based on their proximity in either a spatial or a spatiotemporal sense (Hoyoung, 2011; Zhenhuii, 2011). There are various types of patterns, such as gathering/group patterns, sequential patterns and periodic patterns (Zhenni, 2016). Periodicity is one of the most frequently occurring phenomena for moving objects: animals usually have periodic movement behaviours, such as daily foraging or yearly migration. Such periodic behaviours are the key to understanding animal movement, and they also reflect the seasonal, climatic, or environmental changes of the ecosystem. Periodic pattern mining of trajectory data concerns the discovery of periodic object behaviour, i.e., objects that follow (approximately) the same routes over regular time intervals (Li, 2010). (Li, 2012) addresses the problem of mining periodic behaviours of object movements for animal and biological sustainability studies; the work solves the two crucial sub-problems of detecting periods and mining periodic movement behaviour based on reference locations and probabilistic models, respectively. Regarding each trajectory as a sequence, a sequential pattern is often defined as a subsequence shared by at least δ trajectories, where δ is a user-specified threshold. (Zheng, 2014) addresses the problem of mining sequential patterns in semantic trajectories and develops a set of novel techniques to tackle the challenge of efficiently discovering gathering patterns in archived trajectory datasets; the authors also proposed an online discovery solution, applying a series of optimization schemes, which can keep track of gathering patterns as new trajectory data arrive.

CONCLUSION
A spatial trajectory is a trace generated by a moving object in geographical space, usually represented by a series of chronologically ordered points. Trajectory data mining is an emerging area of research with a large variety of applications, because the wide availability of location-acquisition techniques and tracking methods has generated massive spatial trajectory data, which represent the mobility of a diversity of moving objects, such as people, vehicles and animals. This study offers an overview of the trajectory data mining model and identifies several important trajectory applications, such as path discovery, location prediction, and movement behavior analysis. It also conducts a systematic survey of the major trajectory data mining tasks, such as trajectory classification, clustering, segmentation and pattern mining, and reviews an extensive collection of existing studies within the proposed framework of trajectory data mining.

REFERENCES Lord, R. D. (1963). A radio tracking system for wild animals. The Journal of Wildlife Management, 2, 9–24. Al-Serafi, A., & Elragal, A. (2013), Trajectory Data Mining: a Novel Distance Measure. Proceedings of The Fifth International Conference on Advanced Geographic Information Systems, Applications, and Services, 125-132. Ayeldeen, H., Mahmood, M. A., & Hassanien, A. E. (2015). Effective Classification and Categorization for Categorical Sets: Distance Similarity Measures. Information Systems Design and Intelligent Applications, 339, 359–368. Bashir. A., Khokhar. A. A. & Schonfeld. D., (2011). Object trajectory-based activity classification and recognition using hidden Markov models. IEEE Trans. on Image Processing, 16, 1912-1919. Buchin, M., Driemel, A., Van, M., & Sacristán, V. (2011). Segmenting trajectories: a framework and algorithms using spatiotemporal criteria. J Spatial Inf Sci, 3, 33–63. Chen, Z., Shen, H. T., & Zhou, X. (2011). Discovering popular routes from trajectories. Proceedings of 2th Int. Conf. Data Eng. (ICDE), 900–911. Chis, M., Banerjee, S., & Hassanien, A. E. (2009). Clustering Time Series Data: An Evolutionary Approach. Foundations of Computational Intelligence, 6, 193–207. Dipto, S., Colin, A., Larry, G., & Raja, S. (2015). Analyzing Animal Movement Characteristics from Location Data. Transactions in GIS, 19(4), 516–534. doi:10.1111/tgis.12114 Frentzos, E., Gratsias, K., & Theodoridis, Y. (2007). Index-based Most Similar Trajectory Search. Proceedings of IEEE 23rd International Conference on Data Engineering, ICDE09, 816 - 825. González, M. C., Hidalgo, C. A., & Barabási, A.-L. (2008). Understanding individual human mobility patterns. Nature, 453(7196), 9–82. doi:10.1038/nature06958 PMID:18528393 Guan, Y., Penghui, S., Jie, Z., Daxing, L., & Canwei, W. (2016). A review of moving object trajectory clustering algorithms. An International Science and Engineering Journal, 46, 1–22. Haozhou, W., Han, S., Kai, Z., Shazia, S., & Xiaofang, Z. (2013). An Effectiveness Study on Trajectory Similarity Measure. Proceedings of The Twenty-Fourth Australasian Database Conference (ADC), 13-22.


Hoyoung, J., Man, L., & Christian, S. (2011). Trajectory Pattern Mining. In Computing with Spatial Trajectories. Springer. Hung, C., Peng, W., & Lee, W.-C. (2015). Clustering and aggregating clues of trajectories for mining trajectory patterns and routes. The VLDB Journal, 24(2), 169–192. doi:10.1007/s00778-011-0262-6 Ivana, N., & Fertalj, K. (2010). Automation of the Moving Objects Movement Prediction Process. Computer Science and Information Systems, 7(4), 931–945. doi:10.2298/CSIS090608020N Lavanya, N. P., Sarvani, A., Soujanya Kumari, K. S., Swathi, L. Y., & Purnachandra, M. R. (2014). Design of Xilinx Based Telemetry System Using Verilog. International Journal of Scientific Engineering and Research, 2, 135–139. Li, Z., Ding, B., Han, J., Kays, R., & Nye, P. (2010). Mining periodic behaviors for moving objects. Proceedings of 16th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 1099-1108. doi:10.1145/1835804.1835942 Li, Z., Han, J., Ding, B., & Kays, R. (2012). Mining periodic behaviors of object movements for animal and biological sustainability studies. Data Mining and Knowledge Discovery, 24(2), 355–386. doi:10.1007/ s10618-011-0227-9 Lokesh, K. S., Prakash, V., Simon, S., & Ajaya, K. (2010). Nearest Neighbour Classification for Trajectory Data. Proceedings of International Conference, ICT 2010, 180-185. Maike, B., Helmut, K., & Andrea, K. (2013). Segmenting Trajectories by Movement States. Advances in Spatial Data Handling, 15-25. Mcdermott, S., & Irving, K. (2006). The Application of Satellite Communication Technology to Operational Knowledge Acquisition. Proceedings of IEEE Military Communications Conference, 1-5. doi:10.1109/MILCOM.2006.302268 Nikos, P., Ioannis, K., Evangelo, E., Kotsifakos, E. F., & Yannis, T. (2011). Clustering uncertain trajectories. Knowl Inf Syst, 28, 11-14. Patel, D., Sheng, C., Hsu, W., & Lee, M. L. (2012). Incorporating duration information for trajectory classification. Proceedings of IEEE 28th Int. Conf. Data Eng. (ICDE), 1132–1143. Pelekis, N., Kopanakis, I., Kotsifakos, E., Frentzos, E., & Theodoridis, Y. (2009). Clustering trajectories of moving objects in an uncertain world. In Proceedings of The IEEE International Conference on Data Mining (ICDM’09). Miami, FL: IEEE CS Press. doi:10.1109/ICDM.2009.57 Shinde, S. (2014). A Survey Paper on Trajectory Pattern Mining for Pattern Matching Query. Proceedings of International Journal of Computer Applications, 86(1), 2-30. Song, X., Zhang, Q., Sekimoto, Y., & Shibasaki, R. (2014). Prediction of human emergency behavior and their mobility following large-scale disaster. Proceedings of 20th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), 5–14. Trajcevski, G., Wolfson, O., Hinrichs, K., & Chamberlain, S. (2004). Managing uncertainty in moving objects databases. ACM Transactions on Database Systems, 29(3), 463–507. doi:10.1145/1016028.1016030


Wang, X., Li, G., Jiang, G., & Shi, Z. (2013). Semantic trajectory-based event detection and event pattern mining. Knowledge and Information Systems, 3(2), 305–329. doi:10.1007/s10115-011-0471-8 Xiao, K., Hock, S. H., & Hassanien, A. E. (2007). Fuzzy c-means clustering with adjustable feature weighting distribution for brain MRI ventricles segmentation. Proceedings of the Ninth IASTED International Conference on Signal and Image Processing, 483-489. Xiaoliang, G., Hiroki, A., & Takeaki, U. (2012). Pattern Mining from Trajectory GPS Data. Proceedings of International Conference on Advanced Applied Informatics (AAI 2012), 60-65. doi:10.1109/ IIAI-AAI.2012.21 Xue, A. Y., Zhang, R. Y., Zheng, X. X., Huang, J., & Xu, Z. (2013). Destination prediction by subtrajectory synthesis and privacy protection against such prediction. Proceedings of 29th IEEE Int. Conf. Data Eng. (ICDE), 254–265. Yanagisawa, Y., Akahani, J., & Satoch, T. (2003). Shape-based similarity query for trajectory of mobile objects. Proceedings of the 4th International Conference on MDM, 63-77. doi:10.1007/3-540-36389-0_5 Yasodha, M., & Ponmuthurama, L. (2012). A survey on temporal data clustering. Int J Adv Res Comput Commun Eng, 1(9), 2–86. Yuan, G., Xia, S., Zhang, L., Zhou, Y., & Ji, C. (2012). An efficient trajectory-clustering algorithm based on an index tree. Int J Trans Inst Meas Control, 34(7), 850–861. doi:10.1177/0142331211423284 Yuwei, W., Ze, L., John, T., Diann, P., Yan, X., Scott, N., … Sivananinth, A. (2016). A new method for discovering behavior patterns among animal movements. Int J Geogr Inf Sci, 30, 929–94. Zhang, Z., Kaiqi, H., & Tieniu, T. (2006). Comparison of Similarity Measures for Trajectory Clustering in Outdoor Surveillance Scenes. Proceedings of 18th International Conference on Pattern Recognition (ICPR ‘06), 1135–1138. doi:10.1109/ICPR.2006.392 Zheng, Y. (2015). Trajectory Data Mining: An Overview. ACM Transactions on Intelligent Systems and Technology, 6(3). Zheng, K., Zheng, Y., Yuan, N. J., Shang, S., & Zhou, X. (2014). Online discovery of gathering patterns over trajectories. IEEE Transactions on Knowledge and Data Engineering, 26(8), 1974–1988. doi:10.1109/TKDE.2013.160 Zheng, Y., & Zhou, X. (2011). Trajectory pre-processing. In Computing with Spatial Trajectories. Springer. Zhenhui, L., Jiawei, H., Bolin, D., & Roland, K. (2011). Mining periodic behaviors of object movements for animal and biological sustainability studies. Data Mining and Knowledge Discovery, 24, 355–386. Zhenhui, L., Jiawei, H., Ming, J., Lu, A., Yintao, Y., & Bolin, D. (2010). MoveMine: Mining Moving Object Data for Discovery of Animal Movement Patterns. ACM Journal, 1, 111–146. Zhenni, F., & Yanmin, Z. (2016). A Survey on Trajectory Data Mining. Techniques and Applications, IEEE ACCESS, 4, 2056–2067. doi:10.1109/ACCESS.2016.2553681 Zx, Y., Chakraborty, D., Parent, C., Spaccapietra, S., & Abere, K. (2012). Semantic trajectories: Mobility data computation and annotation. ACM Transactions on Intelligent Systems and Technology, 9(4), 1–34.



Chapter 41

Enhanced Breast Cancer Diagnosis System Using Fuzzy Clustering Means Approach in Digital Mammography Mohammed A. Osman Helwan University, Egypt

Ayman E. Khedr Helwan University, Egypt

Ashraf Darwish Helwan University, Egypt

Atef Z. Ghalwash Helwan University, Egypt Aboul Ella Hassanien Cairo University, Egypt

ABSTRACT Breast cancer or malignant breast neoplasm is the most common type of cancer in women. Researchers are not sure of the exact cause of breast cancer. If the cancer can be detected early, the options of treatment and the chances of total recovery will increase. Computer Aided Diagnostic (CAD) systems can help the researchers and specialists in detecting the abnormalities early. The main goal of computerized breast cancer detection in digital mammography is to identify the presence of abnormalities such as mass lesions and Micro calcification Clusters (MCCs). Early detection and diagnosis of breast cancer represent the key for breast cancer control and can increase the success of treatment. This chapter investigates a new CAD system for the diagnosis process of benign and malignant breast tumors from digital mammography. X-ray mammograms are considered the most effective and reliable method in early detection of breast cancer. In this chapter, the breast tumor is segmented from medical image using Fuzzy Clustering Means (FCM) and the features for mammogram images are extracted. The results of this work showed that these features are used to train the classifier to classify tumors. The effectiveness and performance of this work is examined using classification accuracy, sensitivity and specificity and the practical part of the proposed system distinguishes tumors with high accuracy.



INTRODUCTION AND RELATED WORK
Breast cancer is one of the major causes of death among women, especially in developed countries, and the early detection of this disease can reduce the rate of death in women. CAD can help the radiologist detect abnormalities in an efficient way. Mammograms can be used for the detection of breast cancer according to the World Health Organization's International Agency for Research on Cancer (Gaber et al., 2015). Mammographic images are X-ray images of the breast region (Verma & Zhang, 2007; Hassanien et al., 2003; Hassanien and Tai-hoon, 2012). Computer-assisted breast tumor classification, which is based on image analysis techniques, provides more useful information. The conventional method for breast tumor classification consists of a three-step process: the first step involves the segmentation of the breast tumor from the image, the second step is feature extraction, and the third is the classification process using a classifier. The goal of this study is to increase the diagnostic accuracy of image processing for optimum classification between benign and malignant abnormalities in digital mammograms. An image enhancement module is a vital preprocessing part of any image processing technique; image processing techniques such as morphological operations and thresholding are applied in this study to enhance the mammogram images for the computerized detection of breast cancer. Segmentation of medical images is an important step (Ali et al., 2015). This study employs a fuzzy segmentation algorithm for segmenting the mammogram images; texture-based features are extracted from the segmented images and fed to the classifier for the classification process. The binary classification accuracy of the developed system is measured using Receiver Operating Characteristic (ROC) analysis with performance measures such as sensitivity, specificity and accuracy. This chapter proposes a new technique based on a fuzzy algorithm and an ANN to diagnose breast cancer from digital mammograms.
Research in the area of Computer-Aided Diagnostic (CAD) systems has developed rapidly within the last decade. In early studies, investigators outlined many approaches and limitations of Computer-Aided Diagnosis (CAD) in mammography. Winsberg, Elkin, Macy, Brodaz, and Weymouth (1967) described a method that compares density between the left and right breasts. An algorithm was developed by Kimme, O'Loughlin, and Sklansky (1975) to detect abnormal breast regions; they calculated seven features for breast images and compared them for corresponding regions of the left and right breasts. Smith, Wagner, Guenther, and Solmon (1977) introduced a measure to distinguish between malignant and benign cancer. Hand, Semmlow, Ackerman, and Alcorn (1979) constructed fourteen parameters from three basic textural features, intensity, roughness and directionality, to detect malignant areas on xeromammograms; they achieved a sensitivity of 87%. More recent studies are characterized by greater use of image processing, feature analysis and artificial intelligence methods. Varela, Tahoces, Mendez, Souto and Vidal (2007) applied an iris filter and adaptive thresholding to segment images and extract features to train a neural network classifier; the system achieved sensitivities of 88% and 94% at 1.02 false positives per image. Kumar and Moni (2010) applied Fuzzy Clustering Means (FCM) to extract tumors from Computed Tomography (CT), with textural information obtained using the curvelet transform; consequently, 94.3% accuracy was obtained after classification. This chapter presents a new method for early detection of breast cancer. The remainder of this chapter is organized as follows: Section 2 presents the proposed model and its framework; Section 3 presents the research methodology and implementation together with the performance evaluation and analysis of the results; Section 4 concludes the chapter and presents future work.


THE PROPOSED MODEL
Computer-aided diagnosis systems are necessary to support medical staff in achieving high capability and effectiveness. The framework consists of three main stages: mammogram image processing and segmentation, texture feature extraction and selection, and the classification process. The main property of an ANN is its ability to learn: training or learning is a method of parameter change by which a neural network adapts itself to a stimulus so that the desired output is produced. There are mainly three kinds of learning: (a) supervised learning, (b) unsupervised learning, and (c) reinforcement learning. The main goal of this work in medical diagnostics is to develop user-friendly systems, procedures and methodologies for clinicians. Mammography breast cancer images were collected from the Digital Database for Screening Mammography (DDSM); image processing was applied to remove any artifacts or noise, followed by image contrast enhancement. Image segmentation was applied and verified using the FCM algorithm, feature extraction was used to mine the data for the preferred features, and finally classification was established to distinguish malignant tumors from benign ones. Figure 1 shows the overview of the proposed framework for classification. In the first stage, image processing techniques and algorithms are applied to the digital mammographic images for the purpose of image preprocessing and image segmentation. In this research, mammogram preprocessing includes noise removal (using median filtering), background separation (using global thresholding), artifact and label separation (using morphological operations) and segmentation of the images using the Fuzzy Clustering Means (FCM) algorithm (Weijie et al., 2006). In the second stage, texture feature analysis is performed; feature extraction is a very important stage in pattern classification. Several types of features can be extracted from the mammograms. To build a system for the diagnosis of benign and malignant breast tumors, we must obtain all the available information existing in the mammograms; however, not all features can differentiate between benign and malignant tumors, so we used features that can (Kumar et al., 2011). The last stage in Figure 1 uses the optimal subset of texture features to construct a classification engine (classifier) using an Artificial Neural Network (ANN); neural networks have proven themselves a strong tool for tumor classification. The accuracy of the proposed classification system is evaluated using performance measures such as sensitivity and specificity, computed as for a medical diagnostic test. The output of the proposed system classifies the tested samples (ROIs) as malignant or benign (Toshiyuki et al., 2008).

RESEARCH METHODOLOGY AND IMPLEMENTATION
In the implementation work, different images have been tested for benign and malignant tumors; the images used have the same type and the same size. In this research, the image processing and texture analysis techniques are applied using the MATLAB programming language on a PC under Windows 7. The implementation work covers image processing, the fuzzy clustering technique and the neural network.

Figure 1. Overview of the proposed model

Mammogram Database
All of the mammograms used in our work were obtained from the Digital Database for Screening Mammography (DDSM) distributed by the University of South Florida. The DDSM is a database of digitized mammograms with associated ground truth and other information. The purpose of this database is to provide a large set of mammograms that are free and can be used by researchers to evaluate and compare the performance of Computer-Aided Detection (CAD) algorithms. In this implementation 100 images were used, divided into training and testing sets with 65 images for training and 35 for testing: a set of 40 malignant and 25 benign images was used for training the network, and another set of 20 malignant and 15 benign images was used for testing the classifier. Figure 2 shows some examples from the DDSM.


Figure 2. Examples of mammography images from DDSM

Mammogram Image Preprocessing
The major problem with the precise segmentation of the breast region is the existence of artifacts (such as labels and noise), which may cause trivial segmentation algorithms to fail. The mammogram preprocessing stage therefore involves noise removal and artifact separation in order to suppress the background (black pixels) in the mammogram images. Another purpose of mammogram preprocessing is to improve the reliability and robustness of the mammogram segmentation, as discussed in the following sections.

Noise Removal and Artifact Separation
The grayscale mammography images are digitally represented using the MATLAB Image Processing Toolbox; the intensity values of the acquired mammogram images span [0, 255] gray levels. Digitization noise such as horizontal and vertical lines tends to appear on most of the mammogram images, as shown by the arrows in Figures 3 and 4. This noise is removed from the mammogram images by applying a two-dimensional median filtering approach over a 3-by-3 connected neighborhood; the horizontal and vertical lines are removed without affecting the breast profile (Zhou et al., 2005).
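The study itself is implemented in MATLAB; as a hedged, language-neutral sketch of the same 3-by-3 median filtering step, the following Python snippet uses scipy's median_filter in place of the MATLAB toolbox call.

```python
import numpy as np
from scipy.ndimage import median_filter

def remove_digitization_noise(mammogram: np.ndarray) -> np.ndarray:
    """Suppress thin horizontal/vertical line noise with a 3x3 median filter.

    mammogram : 2-D uint8 array with grey levels in [0, 255]
    """
    return median_filter(mammogram, size=3)   # 3-by-3 neighbourhood, as in the text
```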


Figure 3. Digitization noises (lines) in mammographic images

Figure 4. Mammogram images after noise removal using 2D median filtering

In order to use the area morphology, the grayscale [0,255] mammogram image needs to be transformed into the binary [0, 1] format. The simplest technique for transforming a grayscale image into binary is by using threshold. In order to convert a grayscale image into binary, a grayscale threshold for that image needs to be determined in order segment the artifacts and the background, while keeping the breast-skin edges in contact, so as not to lose information from the breast profile. In background and artifact separation, a global threshold (T) is a user determined value that is used to optimally segment the background region and the artifacts from the breast profile for a mammogram image dataset. In order to determine a global threshold (T) for a mammogram image dataset, a trial and



To determine the global threshold (T) for a mammogram image dataset, a trial-and-error procedure is typically used, where the segmentation performance over all the mammogram images is evaluated for every possible threshold level between 0 and 255. In this research, visual inspection of the segmented mammogram images determined the global threshold to be T = 18. Figure 5 illustrates the separation of artifacts.
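As an illustration of the thresholding and artifact-separation step just described (cf. panels (b) and (c) of Figure 5), here is a hedged Python sketch. The helper name, the SciPy connected-component labelling, and the masking strategy are assumptions rather than the chapter's MATLAB implementation.

```python
# Illustrative sketch: global threshold T = 18 followed by selection of the
# largest object by area, which separates the breast profile from labels
# and other artifacts.
import numpy as np
from scipy.ndimage import label

def separate_artifacts(mammogram: np.ndarray, threshold: int = 18) -> np.ndarray:
    """Threshold the image, keep the largest connected object (the breast
    profile), and suppress everything else (background, labels, artifacts)."""
    binary = mammogram > threshold                 # grayscale [0, 255] -> binary
    labels, n_objects = label(binary)              # connected-component labelling
    if n_objects == 0:
        return mammogram
    areas = np.bincount(labels.ravel())
    areas[0] = 0                                   # ignore the background label
    breast_mask = labels == areas.argmax()         # largest object by area
    return np.where(breast_mask, mammogram, 0)
```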

Image Contrast Enhancement

Contrast enhancement is performed by finding the limits within which to contrast-stretch the image, where the tolerance level is a scalar quantity that saturates fractions of the lowest and highest pixel values. In this research, the default limits t = [min, max] are used, where min is the smallest and max is the largest grayscale value in the mammogram image. Figure 6 shows the contrast enhancement technique applied to the mammogram images. As indicated in Figure 6 (d), the histogram of the original image in Figure 6 (b) is stretched, which increases the number of brighter pixels.
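A minimal sketch of the min-max contrast stretch described above, assuming the default limits t = [min, max]; it is an illustration only, not the chapter's MATLAB code.

```python
# Illustrative sketch: stretch the gray levels of the preprocessed mammogram
# to the full [0, 255] range.
import numpy as np

def stretch_contrast(mammogram: np.ndarray) -> np.ndarray:
    """Linear min-max stretch of the gray levels to the full [0, 255] range."""
    img = mammogram.astype(np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:                                   # flat image: nothing to stretch
        return mammogram
    return ((img - lo) / (hi - lo) * 255.0).astype(np.uint8)
```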

Image Segmentation

To perform segmentation on a digital mammogram, each pixel of the image has to be assigned to one region. The method presented in the following section is based on a fuzzy membership value assigned to each pixel. First, the algorithm divides the image into two regions. Then each region is divided into two new segments (sub-regions), and the process can be continued until the desired number of segmentation levels is reached (Pan et al., 2001).

Fuzzy Clustering Algorithm

In this step, the breast tumor is segmented using the Fuzzy Clustering Means (FCM) algorithm, which is based on the minimization of an objective function. Fuzzy partitioning is carried out through an iterative optimization of the membership function based on the similarity between the data and the center of a cluster. FCM assigns different degrees of membership to each point, so the membership of a point is shared among the various clusters. This creates the concept of fuzzy boundaries, which differs from the traditional concept of well-defined boundaries. Thus, FCM varies the threshold between clusters through an iterative process; as a result, the threshold is determined appropriately for every slice and the tumor region can be successfully extracted. The objective function J_m(U, v) and the membership function u_{ik} are defined in equations (1) and (2):

J_m(U, v) = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^m (d_{ik})^2    (1)



Figure 5. Separation of artifacts: (a) Original grayscale image with artifact and label, (b) Thresholded image using a value of T = 18, (c) Selection of the largest object with respect to Area, (d) Grayscale image with artifacts separated

u_{ik} = 1 \Big/ \sum_{j=1}^{c} \left( d_{ik} / d_{jk} \right)^{2/(m-1)}    (2)

where d_{ik} is the distance between the k-th data point (pixel value) and the center of the i-th cluster, and v_i denotes the center of the i-th cluster; these are defined by equations (3) and (4) as follows:

d_{ik}^2 = \lVert X_k - V_i \rVert^2    (3)


Figure 6. Contrast enhancement of a mammogram image: (a) Mammogram image obtained after mammogram preprocessing, (b) Histogram of the original image in (a), (c) Contrast enhancement applied to the mammogram image in (a), (d) Histogram of contrast enhanced image in (c)

v_i = \sum_{k=1}^{n} (u_{ik})^m x_k \Big/ \sum_{k=1}^{n} (u_{ik})^m    (4)

where x_k is the intensity of the k-th pixel, n is the number of data points (pixels), c is the number of clusters, and m is the exponent weight. The pixels in the background (low intensity) are included in the first cluster, the pixels in the tumor region (medium intensity) in the second cluster, and the pixels in the breast region other than the tumor (high intensity) in the third cluster. The tumor region is output for further analysis. The segmentation results are shown in Figure 7.
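The following Python sketch implements the FCM iteration of equations (1)-(4) on the pixel intensities, with three clusters as described above. The fuzzifier m = 2, the convergence tolerance, and the random initialization are assumptions; the chapter does not fix these values here.

```python
# Illustrative sketch: Fuzzy Clustering Means (FCM) on pixel intensities,
# following equations (1)-(4), with three clusters (background, tumor,
# remaining breast tissue).
import numpy as np

def fcm_segment(pixels: np.ndarray, c: int = 3, m: float = 2.0,
                max_iter: int = 100, tol: float = 1e-5, seed: int = 0):
    """pixels: 1-D array of intensities. Returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    n = pixels.size
    u = rng.random((c, n))
    u /= u.sum(axis=0)                                  # memberships sum to 1 per pixel
    for _ in range(max_iter):
        um = u ** m
        centers = (um @ pixels) / um.sum(axis=1)        # equation (4)
        d = np.abs(pixels[None, :] - centers[:, None])  # d_ik = |x_k - v_i|
        d = np.fmax(d, 1e-10)                           # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        u_new = inv / inv.sum(axis=0)                   # equation (2)
        if np.max(np.abs(u_new - u)) < tol:
            u = u_new
            break
        u = u_new
    return centers, u

# Example: segment a grayscale image and extract the medium-intensity cluster.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    image = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
    centers, u = fcm_segment(image.ravel())
    labels = u.argmax(axis=0).reshape(image.shape)
    tumor_cluster = int(np.argsort(centers)[1])         # middle cluster by intensity
    tumor_mask = labels == tumor_cluster
    print(np.sort(centers), tumor_mask.sum())
```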



Figure 7. (a) Breast cancer image; (b) segmented tumor

Feature Extraction and Selection

From the perspective of pattern classification, feature extraction is a very important stage. The fuzzy segmentation algorithm identifies regions that are suspected to be masses. In the second step of our detection method, several types of features are extracted from the mammograms. To build a system for diagnosing benign and malignant breast tumors, we must exploit all of the available information in the mammograms (Lee et al., 2007). However, not all features can differentiate between benign and malignant tumors, so we used only those that can. A set of 10 features was calculated.

1. Standard Deviation: It measures how values spread out in a dataset with respect to the mean.

s = \sqrt{ \sum (x - \bar{x})^2 / (n - 1) }    (5)

2. Variance: It measures the dispersion of a set of data points around their mean value.

Var = s^2    (6)

3. Mean: It represents the average gray level in the window.

\bar{x} = \sum x / n    (7)


4. Skewness: It is a measurement of the asymmetry of the data around the sample mean. If skewness is negative, the data are spread out more to the left of the mean than to the right. If skewness is positive, the data are spread out more to the right.

y = E[(x - \mu)^3] / \sigma^3    (8)

5. Kurtosis: It is a measurement of how outlier-prone a distribution is.

K = E[(x - \mu)^4] / \sigma^4    (9)

6. Entropy: A statistical measure of randomness that can be used to characterize the texture of the image.

S_E = -\sum_{b=0}^{L-1} P(b) \log_2 P(b)    (10)

7. Contrast: It measures the local variations in the gray-level co-occurrence matrix.

CON = \sum_{i, j \in G} (i - j)^2 \cdot co(i, j)    (11)

8. Energy: It provides the sum of squared elements in the gray-level co-occurrence matrix (GLCM); it is also known as uniformity or the angular second moment.

ASM = \sum_{i, j \in G} co(i, j)^2    (12)

9. Homogeneity: It measures the closeness of the distribution of elements in the gray-level co-occurrence matrix (GLCM) to the GLCM diagonal.

HOM = \sum_{i, j} P(i, j) / (1 + |i - j|)    (13)

10. Correlation: It measures the joint probability of occurrence of the specified pixel pairs.

COR = \sum_{i, j = 0}^{G-1} P(i, j)(i - \mu_i)(j - \mu_j) / (\sigma_i \sigma_j)    (14)



Feature Normalization

In neural networks and other data mining approaches, the texture feature values obtained need to be represented on a normalized scale. All selected features are therefore scaled (normalized) to the range between 0 and 1. Feature normalization is performed using the following expression:

Nf(x) = ( f(x) - \min(f(x)) ) / ( \max(f(x)) - \min(f(x)) )    (15)

where f(x) represents the feature, and min(f(x)) and max(f(x)) represent the minimum and maximum values of the feature f(x) (Sujana et al., 1996).
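The sketch below illustrates how the ten features of equations (5)-(14) and the normalization of equation (15) could be computed in Python. The GLCM offset (one pixel, horizontal), the 256 gray levels, and the helper names are assumptions not fixed by the chapter.

```python
# Illustrative sketch of the ten texture features (equations (5)-(14)) computed
# from a segmented region, followed by min-max normalization (equation (15)).
import numpy as np

def glcm(img: np.ndarray, levels: int = 256) -> np.ndarray:
    """Gray-level co-occurrence matrix for a horizontal offset of one pixel."""
    co = np.zeros((levels, levels), dtype=np.float64)
    left, right = img[:, :-1].ravel(), img[:, 1:].ravel()
    np.add.at(co, (left, right), 1.0)
    return co / co.sum()

def texture_features(region: np.ndarray) -> np.ndarray:
    x = region.astype(np.float64).ravel()
    mean = x.mean()                                   # eq. (7)
    s = x.std(ddof=1)                                 # eq. (5)
    var = s ** 2                                      # eq. (6)
    skew = np.mean((x - mean) ** 3) / x.std() ** 3    # eq. (8)
    kurt = np.mean((x - mean) ** 4) / x.std() ** 4    # eq. (9)
    hist, _ = np.histogram(region, bins=256, range=(0, 256))
    p = hist / hist.sum()
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))   # eq. (10)
    co = glcm(region.astype(np.intp))
    i, j = np.indices(co.shape)
    contrast = np.sum((i - j) ** 2 * co)              # eq. (11)
    energy = np.sum(co ** 2)                          # eq. (12)
    homogeneity = np.sum(co / (1.0 + np.abs(i - j)))  # eq. (13)
    mu_i, mu_j = np.sum(i * co), np.sum(j * co)
    sd_i = np.sqrt(np.sum((i - mu_i) ** 2 * co))
    sd_j = np.sqrt(np.sum((j - mu_j) ** 2 * co))
    correlation = np.sum((i - mu_i) * (j - mu_j) * co) / (sd_i * sd_j)  # eq. (14)
    return np.array([s, var, mean, skew, kurt, entropy,
                     contrast, energy, homogeneity, correlation])

def normalize_features(feature_matrix: np.ndarray) -> np.ndarray:
    """Scale each feature column to [0, 1] as in equation (15)."""
    fmin, fmax = feature_matrix.min(axis=0), feature_matrix.max(axis=0)
    return (feature_matrix - fmin) / np.where(fmax > fmin, fmax - fmin, 1.0)
```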

Classification Engine

Artificial neural networks (ANNs) have become a popular modeling approach over the last two decades. A neural network, as shown in Figure 8, starts with an input layer, where each node corresponds to a predictor variable. These input nodes are connected to a number of nodes in a hidden layer; each input node is connected to every node in the hidden layer. The nodes in the hidden layer may be connected to nodes in another hidden layer or to an output layer, and the output layer consists of one or more response variables (Ren et al., 2011). The optimal subset of normalized features is used to build a neural network classification engine for pattern classification. When an unknown input is fed into the trained network, it produces a result based on its past experience. The output layer produces either 1 for normal or 0 for cancer. In this study, the input layer has 10 nodes, the hidden layer has 10 nodes, and the output layer has one node. The neural network is trained by adjusting the weights so as to predict the correct class.
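As a hedged illustration of such a 10-10-1 classification engine (the chapter's own implementation is in MATLAB), the following sketch uses scikit-learn; the placeholder data, the solver defaults, and the iteration limit are assumptions.

```python
# Illustrative sketch: a feed-forward classifier with 10 inputs, one hidden
# layer of 10 nodes, and a single binary output, analogous to the ANN above.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(42)
X_train = rng.random((65, 10))          # 65 training cases x 10 normalized features
y_train = rng.integers(0, 2, 65)        # 1 = normal/benign, 0 = cancer (as in the text)
X_test = rng.random((35, 10))           # 35 testing cases

clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)
print(clf.predict(X_test)[:10])
```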

Figure 8. Sample neural network



Training and Testing Data Separation

To implement the ANN (Artificial Neural Network), the normalized features need to be separated into two distinct sets: the training set and the testing-validation set. As shown in Table 1, a set of 100 images was obtained from the DDSM database. The proposed method was trained with 65 images (40 malignant, 25 benign) and tested with 35 images (20 malignant, 15 benign). ANN training is performed in MATLAB using the ANN classification engine and the training samples indicated in Table 1; these samples are chosen randomly by the neural network classification engine. The Receiver Operating Characteristic (ROC) curve for the training model is shown in Figure 9. After the training process, the ANN is tested using the 35 samples indicated in Table 1. The testing model relies on the past experience of the training model to produce its result. The confusion matrix and the Receiver Operating Characteristic (ROC) curve for the testing model are shown in Figure 10.

Table 1. Values of the samples used for training and testing

Class             Training Set    Testing Set
Malignant         40              20
Benign            25              15
Total Samples     65              35

Figure 9. ROC curve for training model



Figure 10. ROC curve for testing model

Performance Evaluation

A number of different measures are commonly used to evaluate the performance of the proposed method. These measures include accuracy, sensitivity, and specificity. Sensitivity is the proportion of tumors that were marked and correctly classified as tumor: Sensitivity = True Positive / (True Positive + False Negative). Specificity is the proportion of non-tumor cases that were correctly not classified as tumor: Specificity = True Negative / (True Negative + False Positive). Accuracy measures the overall quality of the binary classification: Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative). The confusion matrix is defined as in Table 2.
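A small sketch of these three measures follows; the confusion-matrix counts in the example are chosen only to be consistent with the percentages reported below and are not taken from the chapter's raw data.

```python
# Illustrative sketch: sensitivity, specificity, and accuracy computed from the
# confusion-matrix counts defined in Table 2.
def evaluate(tp: int, fp: int, fn: int, tn: int) -> dict:
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"sensitivity": sensitivity, "specificity": specificity, "accuracy": accuracy}

# Counts consistent with the reported results (20 malignant, 15 benign test cases):
print(evaluate(tp=20, fp=1, fn=0, tn=14))
# -> sensitivity 1.0, specificity ~0.933, accuracy ~0.971
```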

Results and Discussion

To evaluate this study, experiments were conducted on a set of 100 images obtained from the DDSM database. The proposed method was trained with 65 images (40 malignant, 25 benign) and tested with 35 images (20 malignant, 15 benign). The confusion matrix for classification is shown in Figure 11.

Table 2. Confusion matrix

                        Predicted
Actual        Positive                  Negative
Positive      True Positives (TP)       False Positives (FP)
Negative      False Negatives (FN)      True Negatives (TN)



Figure 11. Confusion matrix for testing result

Table 3 shows the computed sensitivity, specificity, and accuracy for the proposed method. The obtained classification accuracy is 97.1%, whereas sensitivity and specificity are 100% and 93.3%, respectively. The overall accuracy for benign cases is 100% and for cancer cases is 95.2%.

CONCLUSION AND FUTURE WORK

Image processing and image analysis techniques help radiologists in the difficult task of mammographic tumor diagnosis. The proposed system segments the tumor from breast images with the Fuzzy Clustering Means (FCM) technique. Texture features are then extracted and used to train the ANN (Artificial Neural Network) classifier to classify breast tumors as benign or malignant, helping radiologists during their medical decision process. The proposed system can be extended to other types of images or to other classes of diseases, such as liver and brain tumors. The method employed in this study gives good performance: the maximum accuracy rate for tumor classification is 97.1%, and the performance can be increased further by increasing the number of samples. For future work, the current features can be combined with statistical moment features to improve the classification results for mammogram images, and the proposed system can be extended to the diagnosis of other medical diseases.

Table 3. Performance measures

Tested Cases     Specificity     Sensitivity     Accuracy
35 cases         93.3%           100%            97.1%



REFERENCES Ali, M. A., Sayed, G. I., Gaber, T., Hassanien, A. E., Snasel, V., & Silva, L. F. (2015, September). Detection of breast abnormalities of thermograms based on a new segmentation method. In Proceedings of Federated Conference on Computer Science and Information Systems (FedCSIS), 2015 (pp. 255-261). IEEE. doi:10.15439/2015F318 Chen, W., Giger, M. L., & Bick, U. (2006). A fuzzy c-means (FCM)-based approach for computerized segmentation of breast lesions in dynamic contrast enhanced MR images. In Academic Radiology (pp. 63-72). Gaber, T., Ismail, G., Anter, A., Soliman, M., Ali, M., Semary, N.,... Snasel, V. (2015, August). Thermogram breast cancer prediction approach based on Neutrosophic sets and fuzzy c-means algorithm. In 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (pp. 4254-4257). IEEE. doi:10.1109/EMBC.2015.7319334 Hand, W., Semmlow, J., Ackerman, L., & Alcorn, F. (1979). Computer screening of xeromammograms: a technique for defining suspicious areas of the breast. In Computer and Biomedical Research (pp. 445469). doi:10.1016/0010-4809(79)90031-4 Hassanien, A. E., & Abed, E. T. (2003). Digital mammography image analysis system based on mathematical morphology. In IEEE computer society 7th International Conference On Intelligent Engineering Systems INES (pp. 4-6). Hassanien, A. E., & Kim, T. (2012). Breast cancer MRI diagnosis approach using support vector machine and pulse coupled neural networks. Journal of Applied Logic, 10(4), 277–284. doi:10.1016/j. jal.2012.07.003 Kimme, C., O’Loughlin, B., & Sklansky, J. (1975). Automatic detection of suspicious abnormalities in breast radiographs. In A. Klinger, K. S. Fu, & T. L. Kunii (Eds.), Data Structures, Computer Graphics, and Pattern Recognition (pp. 429–447). Kumar, S. S., Moni, R. S., & Rajeesh, J. (2011). Automatic Segmentation of Liver and Tumor for CAD of Liver. Journal of Advances in Information Technology, 63-70. doi:10.4304/jait.2.1.63-70 Kumar, S., & Moni, R. (2010). Diagnosis of liver tumor from CT images using curvelet transform. International Journal on Computer Science and Engineering, 1173-1178. Lee, J., Kim, N., Lee, H., Seo, J. B., Won, H. J., Shin, Y. M.,... Kim, S. H. (2007). Efficient liver segmentation using a level-set method with optimal detection of the initial liver boundary from level-set speed images. In Computer Methods and Programs in Biomedicine (pp. 26-38). doi:10.1016/j.cmpb.2007.07.005 Okada, Shimada, Hori, Nakamoto, Chen, Nakamura, & Yoshinobu. (2008). Automated Segmentation of the Liver from 3D CT Images Using Probabilistic Atlas and Multilevel Statistical Shape Model. In Academic Radiology (pp. 1390-1403). Pan, S., & Dawant, B. M. (2001). Automatic 3D segmentation of the liver from abdominal CT images: a level-set approach. In Medical Imaging (pp. 128-138).



Ren, Wang, & Jiang. (2011). Effective recognition of MCCs in mammograms using an improved neural classifier. In Engineering Applications of Artificial Intelligence (pp. 638-645). Smith, K., Wagner, S., Guenther, R., & Solmon, D. (1977). The diagnosis of breast cancer in mammograms by the evaluation of density patterns. Radiology. Sujana, H., Swarnamani, S., & Suresh, S. (1996). Application of artificial neural networks for the classification of liver lesions by image texture parameters, Ultrasound. In Medicine and Biology (pp. 1177-1181). Varela, C., Tahoces, P., Mendez, A., Souto, M., & Vidal, J. (2007). Computerized detection of breast mass in digitized mammograms. In Computers in Biology and Medicine (pp. 214-226). Verma, B., & Zhang, P. (2007). A novel neural-genetic algorithm to find the most significant combination of features in digital mammograms. In Applied Soft Computing (pp. 513-525). doi:10.1016/j. asoc.2005.02.008 Winsberg, F., Elkin, M., Macy, J., Brodaz, V., & Weymouth, W. (1967). Detection of radiographic abnormalities in mammograms by means of optical scanning and computer analysis. Radiology. Zhou, X., Kitagawa, T., Okuo, K., Hara, T., Fujita, H., Yokoyama, R., & Hoshi, H. et al. (2005). Construction of a probabilistic atlas for automated liver segmentation in non-contrast torso CT images. International Congress Series.



Chapter 42

TAntNet-4:

A Threshold-Based AntNet Algorithm with Improved Scout Behavior

Ayman M. Ghazy, Cairo University, Egypt
Hesham A. Hefny, Cairo University, Egypt

DOI: 10.4018/978-1-5225-2229-4.ch042

ABSTRACT

Traffic Routing System (TRS) is one of the most important intelligent transport systems; it is used to direct vehicles to good routes and to reduce congestion on the road network. The performance of a TRS depends mainly on a dynamic routing algorithm, owing to the dynamic nature of traffic on the road network. AntNet is a routing algorithm inspired by the foraging behavior of ants. TAntNet is a family of dynamic routing algorithms that uses a threshold travel time to enhance the performance of the AntNet algorithm when applied to traffic road networks. TAntNet-1 and TAntNet-2 adopt different techniques for path update to quickly direct traffic to a discovered good route and to conserve this good route. TAntNet-3 has recently been proposed, inspired by the scout behavior of bees, to avoid the bad effect of forward ants that take bad routes. This chapter presents a new member of the TAntNet family, called TAntNet-4, that uses two scouts instead of one as compared with TAntNet-2. The new algorithm also saves the route discovered by each of the two scouts so that the corresponding backward ant can use the better of the two. The experimental results confirm the high performance of TAntNet-4 compared with AntNet and the other members of the TAntNet family.

INTRODUCTION

Traffic congestion is a serious problem in most modern cities. Nowadays, traffic jams and congestion on roads have become one of the most difficult problems people face every day. Traffic congestion wastes fuel and hours of work time, and the construction of new roads may be very costly or impossible in many cases.



Hence the importance of finding methods for the efficient utilization of the existing infrastructure. Researchers devote a lot of effort to minimizing traffic congestion and to improving road utilization and safety. Vehicle routing is one of the most important services provided by modern Intelligent Transportation Systems. The traffic on a road network is dynamic; in other words, the traffic flow on the roads changes over time. Although there usually exist many alternative routes to a destination, most drivers choose a route based on their previous experience. But the route state changes over time, so a good route may become a bad one after some time and vice versa. Hence it is clear that depending on the drivers' experience is not optimal. This motivates the use of an intelligent route guidance system that takes into account the real-time road state, i.e. the dynamic data of the network, such as congestion. Such a navigation system can direct drivers to good routes and consequently transfer the overload to other routes, so that the road network can operate efficiently at peak volume levels. Swarm intelligence has been widely applied to traffic routing for both computer networks and road networks, and one of the most promising swarm intelligence methodologies applied to traffic routing is ant-based routing. The AntNet algorithm, introduced by Di Caro and Dorigo (1998) for routing through communication networks, is inspired by the foraging behavior of real ants. This chapter focuses on the TAntNet family, which appeared in (Ghazy, 2011; Ghazy et al., 2012; Ghazy & Hefny, 2014) as a modification of AntNet for dynamic traffic routing on road networks. The chapter also presents a new member of the TAntNet family called TAntNet-4.

BACKGROUND AND RELATED WORKS

Ant routing algorithms are one of the most promising swarm intelligence (SI) methodologies and have been studied in many works (Di Caro & Dorigo, 1998; Kassabalidis et al., 2002; Kroon & Rothkrantz, 2003; Suson, 2010; Claes & Holvoet, 2012; Jabbarpour et al., 2014a; Yousefi & Zamani, 2013; Ghazy & Hefny, 2014; Jabbarpour et al., 2014b; Girme, 2015). The AntNet algorithm was introduced in 1998 by Di Caro and Dorigo (1998) for routing in data communication networks. With its distributed, multi-agent characteristics, the algorithm attracted many researchers, who adopted it for both data communication networks and road traffic networks. It has been shown that under varying traffic loads of data networks, the AntNet algorithm performs better than Dijkstra's shortest path algorithm (Dhillon & Van Mieghem, 2007). It has also been shown by Kiruthika and Kalyanasundaram (2015) that the AntNet algorithm gives better results than the Ad hoc On-demand Multipath Distance Vector algorithm. Many improvements to the AntNet algorithm have been proposed. Baran and Sosa (2001) presented a modified algorithm that initializes the routing table with data reflecting previous knowledge about the network topology, rather than the uniform probability distribution assumed in the original AntNet algorithm. Tekiner et al. (2004) proposed a new version of the AntNet algorithm that utilizes the ant/packet ratio to limit the number of ants used. Soltani et al. (2006) introduced a new type of ant, called helping ants, to increase cooperation among neighboring nodes, thereby reducing the AntNet algorithm's convergence time. Gupta et al. (2012) presented a study of the computation of the pheromone values in AntNet.



Radwan et al. (2011) introduced a modified AntNet with blocking-expanding ring search and a local retransmission technique for routing in mobile ad hoc networks (MANETs). Sharma et al. (2013) showed that load balancing is successfully fulfilled by ant-based techniques. The Ant Based Control (ABC) algorithm has been applied to road networks: Kroon and Rothkrantz (2003) used ABC for dynamically routing vehicles through a city, and Suson (2010) presented a modification of Ant Based Control (ABC) and AntNet for routing vehicle drivers using historically based traffic information. A cooperative ACO algorithm for finding routes based on cooperative pheromone among ants was presented by Claes and Holvoet (2012). Kponyo et al. (2015) showed that by using ACO the global traffic situation can be improved through cooperation among the vehicles. An optimal routing method for car navigation systems, based on a combination of the divide-and-conquer method and the ant colony algorithm, was proposed by Yousefi and Zamani (2013). Their method divides the road network into small areas, performs the learning operation within these small areas, and then combines the different learned paths to build the complete paths; this approach balances the traffic load over the road network. Tatomir and Rothkrantz (2004) applied a version of the AntNet algorithm to improve the traveling time over a road traffic network, with the ability to divert traffic from congested routes. Boehlé et al. (2008) presented a city-based parking routing system (CBPRS) that uses ant-based routing. Kammoun et al. (2010) proposed an adaptive vehicle guidance system inspired by ant behavior; their system adjusts the route choice according to real-time changes in the road network, such as new congestion and jams. Claes and Holvoet (2011) introduced an Ant Colony Optimization algorithm combined with link travel time prediction, which can reduce the travel time. In (Ghazy, 2011; Ghazy et al., 2012; Ghazy & Hefny, 2014), a family of modified AntNet algorithms called threshold-based AntNet (TAntNet) was presented for dynamic traffic routing on road networks; this family uses thresholds and new types of ants, with a new technique for launching the different types of ants, to route vehicles on the road network. Recently, many researchers have worked on producing hybrid algorithms that combine features from both ant and bee behavior (Kashefikia et al., 2011; Raghavendran et al., 2012). Rahmatizadeh et al. (2009) proposed an Ant-Bee Routing algorithm, inspired by the behavior of both ants and bees, to solve the routing problem. The algorithm is based on the AntNet algorithm and is enhanced by using bee agents: it uses a forward agent inspired by ants and a backward agent inspired by bees (Rahmatizadeh et al., 2009). Pankajavalli and Arumugam (2011) introduced and implemented a hybrid algorithm based on ant and bee behavior, called BADSR, for routing in mobile ad hoc networks. The algorithm integrates the best of ant colony optimization (ACO) and bee colony optimization (BCO): it uses forward ant agents to collect data and backward bee agents to update the link states, where the bee agents update the data based on checking a threshold. Simulation results showed better performance for the BADSR algorithm in terms of reliability and energy consumption (Pankajavalli & Arumugam, 2011).
Suguna and Maheswari (2012) presented an on-demand ad hoc routing algorithm based on the foraging behavior of ant colony optimization and bee colony optimization. Their algorithm depends on bee agents to collect data about the neighborhood of a node and on forward ant agents to update the pheromone state of the links. The results showed that the proposed algorithm has the potential to become an appropriate routing strategy for mobile ad hoc networks (Suguna & Maheswari, 2012).



THE ANTNET ALGORITHM

The AntNet algorithm (Di Caro & Dorigo, 1998) maintains a data structure at each node of the network and uses two types of agents to manipulate these data structures and find the optimal route between each source and destination. The agents are the following:

•	Forward Ant: This type of ant is responsible for gathering information about the state of the network while moving from a source node to a destination node.
•	Backward Ant: This type of ant is responsible for using the data collected by the forward ant to update the routing tables of each node while moving backward from the destination node to the source node.

Data Structures at Each Node on the Network

Ants communicate through the information they concurrently read and write in the data structures stored at each network node. The data structures for each node k over a network of N nodes are: routing table (Tk) and local traffic table (Mk) as shown in Figure 1. Each row in Tk corresponds to one destination in the network and each column corresponds to one of the neighbors of node k. For each destination d and each neighbor node n, the probability pnd of choosing n as the next node when the destination is d is stored in Tk such that:

\sum_{n \in N_k} P_{nd} = 1    (1)

where d ∈ [1, N] and N_k = neighbors(k). The local traffic table M_k stores the following statistical information about the network:

Figure 1. The data structure at each node k on a network of N nodes



µ_d and σ_d^2: the mean and variance of the trip times of ants launched from node k to node d. µ_d and σ_d^2 are updated as follows:

µ_d ← µ_d + η (t_{kd} − µ_d)    (2)

σ_d^2 ← σ_d^2 + η ( (t_{kd} − µ_d)^2 − σ_d^2 )    (3)

where:
t_{kd}: the trip time observed by the new forward ant launched from node k to destination d.
η ∈ (0, 1]: weighs the number of recent samples that affect µ_d and σ_d^2.
W_d: the moving observation window of size W_max, used to save the trip times of the last W_max ants launched from node k to node d.
η is related to W_max as follows:

W_max = 5c / η,  c < 1    (4)
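A minimal Python sketch of the local traffic model M_k and of its update rules (equations (2)-(4)) follows; the class layout, the parameter values η = 0.1 and c = 0.3, and the use of a deque for the window W_d are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the local traffic model M_k kept at each node k:
# running mean/variance updates (equations (2)-(3)) and the moving
# observation window of size W_max (equation (4)).
from collections import deque

class LocalTrafficModel:
    def __init__(self, eta: float = 0.1, c: float = 0.3):
        self.eta = eta
        self.w_max = int(5 * c / eta)          # equation (4), with c < 1
        self.mu = {}                           # mu_d per destination d
        self.var = {}                          # sigma_d^2 per destination d
        self.window = {}                       # last W_max trip times per destination

    def update(self, d, trip_time: float) -> None:
        mu = self.mu.get(d, trip_time)
        var = self.var.get(d, 0.0)
        mu += self.eta * (trip_time - mu)                 # equation (2)
        var += self.eta * ((trip_time - mu) ** 2 - var)   # equation (3)
        self.mu[d], self.var[d] = mu, var
        self.window.setdefault(d, deque(maxlen=self.w_max)).append(trip_time)

    def best_time(self, d) -> float:
        return min(self.window[d])             # t_best_d over the window W_d
```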

The AntNet Algorithm

The AntNet algorithm can be described as follows (Di Caro & Dorigo, 1998):

1. At regular intervals, a random destination d is chosen and a forward ant F_sd is launched.
2. While F_sd is traveling towards the destination d, it keeps a memory of its path. When the ant arrives at a node k coming from node j, the identifier of the visited node k and the travel time needed to reach k from j are pushed onto a memory stack S_sd.
3. At each visited node k, the next node is selected from the neighbors that have not yet been visited, according to the probability P_nd. If all the neighboring nodes have already been visited, the next node is chosen among all the neighbors.
4. If a cycle is detected, i.e. if the ant is forced to return to an already visited node, the cycle's nodes are popped from the ant's stack and all memory about the cycle is destroyed.
5. When the destination node d is reached, a backward ant B_ds is generated, the forward ant transfers the memory contained in its stack S_sd to the backward ant, and then the forward ant dies.
6. The backward ant takes the opposite direction along the same path as the corresponding forward ant. At each node k, the backward ant pops the stack S_sd to know the next node.
7. When arriving at a node k coming from a neighbor node h, the backward ant updates both the routing table T_k and the local traffic table M_k for all the entries corresponding to the destination node d.



Updates are also performed on the entries corresponding to every node k′ ∈ S_sd, k′ ≠ d, on the sub-path followed by F_sd, when the elapsed trip time is less than µ + I(µ, σ). The mean µ_d and variance σ_d^2 of the model M_k are updated using formulas (2) and (3). The routing table T_k is updated as follows: the probability P_{hd′} is increased by the reinforcement value r as

P_{hd′} ← P_{hd′} + r (1 − P_{hd′})    (5)

while the probabilities of the other neighbors P_{nd′} are decreased by the negative reinforcement

P_{nd′} ← P_{nd′} − r P_{nd′},  ∀ n ≠ h, n ∈ N_k    (6)

where h is the next node after k on the path chosen by the forward ant, d′ is the destination or sub-path destination, and N_k is the set of neighbors of node k. Formulas (5) and (6) express the fact that every path found by the forward ants receives a positive reinforcement. The reinforcement value r ∈ (0, 1] is calculated as follows:

r = c_1 ( t_{best_d} / t_{kd} ) + c_2 ( (t_{sup} − t_{best_d}) / ( (t_{sup} − t_{best_d}) + (t_{kd} − t_{best_d}) ) )    (7)

where t_{kd} is the trip time observed by the forward ant from node k to the destination d, and t_{best_d} is the best trip time experienced by the forward ants traveling towards the destination d over the observation window W_d.

The coefficients c_1 and c_2 weight the importance of each term in (7). It has been observed that the value of c_2 should not be too large (0.35 is an upper limit). The value of t_sup is calculated as

t_{sup} = µ_d + σ_d / \sqrt{(1 − γ) W_max}    (8)

where:



γ: gives the selected confidence level. It has been observed that the best results are obtained for γ ∈ [0.75, 0.8]. The value of r calculated in (7) is finally transformed by means of a squash function s(x), defined by

s(x) = 1 / ( 1 + exp(a / x) ),  x ∈ (0, 1],  a ∈ R^+    (9)

r ← s(r) / s(1)    (10)

The squash function s(x) is used to make small values of r practically negligible when updating the routing tables.
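The following sketch ties together the backward-ant update: the reinforcement of equation (7), the squash of equations (9)-(10), and the probability updates of equations (5)-(6). The coefficient values c_1 = 0.7, c_2 = 0.3, the value of a, and the clamping of r are assumptions added only to keep the example runnable.

```python
# Illustrative sketch of the backward-ant routing-table update at one node.
import math

def squash(x: float, a: float = 10.0) -> float:
    """Equation (9); the exponent is capped only to keep the sketch overflow-safe."""
    return 1.0 / (1.0 + math.exp(min(a / x, 700.0)))

def reinforcement(t_kd: float, t_best: float, t_sup: float,
                  c1: float = 0.7, c2: float = 0.3, a: float = 10.0) -> float:
    r = c1 * (t_best / t_kd)                                   # first term of (7)
    denom = (t_sup - t_best) + (t_kd - t_best)
    if denom > 0.0:
        r += c2 * ((t_sup - t_best) / denom)                   # second term of (7)
    r = min(max(r, 1e-6), 1.0)                                 # keep r in (0, 1]
    return squash(r, a) / squash(1.0, a)                       # equation (10)

def update_routing_row(row: dict, h, r: float) -> None:
    """row maps each neighbor n of node k to P_{nd'}; h is the reinforced neighbor."""
    for n in row:
        if n == h:
            row[n] += r * (1.0 - row[n])                       # equation (5)
        else:
            row[n] -= r * row[n]                               # equation (6)

row = {"a": 0.40, "b": 0.35, "c": 0.25}
update_routing_row(row, "b", reinforcement(t_kd=12.0, t_best=10.0, t_sup=15.0))
print(row, sum(row.values()))    # the probabilities still sum to 1
```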

THRESHOLD BASED ANTNET FAMILY

This section focuses on the TAntNet family, which appeared in (Ghazy, 2011; Ghazy & Hefny, 2014) and aims to enhance the performance of the AntNet algorithm of (Di Caro & Dorigo, 1998). TAntNet is a family of algorithms for routing through the traffic of a road network, and its enhancements depend on the following ideas:

•	Using thresholds.
•	Using new types of agents.
•	Increasing the use of the learned information.

In the following, we briefly discuss each member of the TAntNet family and then propose a new, enhanced member.

TAntNet-1

In the AntNet algorithm, even when the ants have already found a good route between a source and a destination node, they keep searching for another, better route among all paths over all nodes of the network. This may be logical and understandable in the case of high-speed data communication networks, where the best routes among different hosts may change in a few milliseconds. However, this is not the case for dynamic traffic routing on road networks. Once a good route between a source and a destination point has been found, it is likely to remain a good route for a considerable period of time, which can be assigned for each area by traffic road experts. Within such a period of time, there is no need to keep launching ants from the source node to all other nodes in the network searching for what has already been discovered and is already in hand. After such a period of time, the route needs to be checked again to determine whether it is still good or whether there is a need to search for another good route.



Realizing this fact for the case of road networks saves much of the unneeded computation and reduces the computational cost of the AntNet algorithm. This is the idea behind the TAntNet-1 algorithm (Ghazy, 2011; Ghazy et al., 2012), the first member of the TAntNet family. TAntNet-1 uses, as a threshold, pre-known information about the good travel time between a source and a destination in order to detect a good route. The good travel times among different nodes can be obtained in several ways:

1. From traffic road experts.
2. From the maximum speed available on each link of the road network, from which a route of shortest travel time can be computed between every source and destination.
3. By running the classical AntNet routing system without any pre-assumption about good traveling times between any source and any destination (in other words, the good travel time is initially considered to be zero). After running the routing system for a pre-determined period of time, the best routes found by the launched ants are saved, together with their good travel times, in the routing tables of every considered node. This process can also be repeated for different intervals of the day, for example in the morning, at the peak hours, and at night, and it can be used to extract the best data for specific periods of the year, such as the schooling period, during which the roads become quite crowded.

Exploitation of the pre-known good travel times as threshold values achieves the following benefits:

•	Removing unneeded computations.
•	Conserving the discovered good path.
•	Quickly redirecting to the discovered good path.

The TAntNet-1 algorithm uses a new type of ant, called the check ant, and follows a different technique for launching ants. Check ants are responsible for periodically checking the discovered good route. Two new columns are also added to the routing table. The first column, T_Good, contains the previously known good travel time between the source node and every destination; this column is used as the threshold in the TAntNet-1 algorithm and is preset before the initialization of the algorithm. The second column, G, is used to mark the destinations for which good routes have been discovered; it serves as a guide to determine which type of ant needs to be launched (i.e., forward or check ants). The modifications introduced in TAntNet-1 allow the algorithm to preserve the discovered good routes and to converge rapidly towards good routes. Simulation results show better results for TAntNet-1 compared with the traditional AntNet algorithm for traffic routing on road networks.
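A hedged sketch of the two columns that TAntNet-1 adds to the routing table, and of the launch decision they drive, is given below; the data-class layout and function names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch: a per-destination routing-table entry extended with the
# T_Good threshold and the G flag, and the forward-ant / check-ant decision.
from dataclasses import dataclass

@dataclass
class DestinationEntry:
    probabilities: dict            # P_nd for each neighbor n
    t_good: float                  # pre-known good travel time to this destination
    good_found: bool = False       # the G flag: a good route has been discovered

def choose_ant_type(entry: DestinationEntry) -> str:
    """Launch a check ant for destinations already marked good, else a forward ant."""
    return "check" if entry.good_found else "forward"

def on_backward_update(entry: DestinationEntry, trip_time: float) -> None:
    """Mark the destination as good when the observed trip time beats the threshold."""
    if trip_time <= entry.t_good:
        entry.good_found = True

# Example usage:
entry = DestinationEntry(probabilities={"a": 0.5, "b": 0.5}, t_good=120.0)
on_backward_update(entry, trip_time=110.0)
print(choose_ant_type(entry))      # -> "check"
```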

TAntNet-2

TAntNet-1 suffers from the problem of losing the discovered good route through the effect of sub-path updates. This raised the need to modify the algorithm, and TAntNet-2 appeared to overcome the problem by preventing sub-path updates for the discovered good routes (Ghazy, 2011). From the above, it can be seen that both TAntNet-1 and TAntNet-2 use the same threshold to detect and conserve the discovered good route. TAntNet-2 shows better performance compared with TAntNet-1 and the traditional AntNet algorithm. See Algorithm 1.



Algorithm 1. Threshold-based AntNet (TAntNet-2)

/* Main loop */
FOR each (Node s)                       /* Concurrent activity */
    t = current time
    WHILE t ≤ T                         /* T is the total experiment time */
        Set d := Select destination node
        Set Tsd = 0                     /* Tsd: travel time from s to d */
        IF (Gd = yes)
            Launch Check Ant (s, d)     /* From s to d */
        ELSE
            Launch Forward Ant (s, d)   /* From s to d */
        END IF
    END WHILE
END FOR

Check Ant (source node: s, destination node: d)
    IF (Tsd > T_Goodsd)
        Set Gd = No
    END IF
END CHECK ANT

Forward Ant (source node: s, destination node: d)
    WHILE (current_node ≠ destination_node)
        Select next node using routing table
        Push on stack (next_node, travel_time)
        Set current_node = next_node
    END WHILE
    Launch backward ant
    Die
END Forward Ant

Backward Ant (source node: s, destination node: d)
    WHILE (current node ≠ source node) do
        Choose next node by popping the stack
        Update the traffic model
        Update the routing table as follows:
        IF (Tsd