Grid, Cloud, and Cluster Computing [1 ed.] 9781683925699, 9781601324993

Proceedings of the 2019 International Conference on Grid, Cloud, and Cluster Computing (GCC'19), held July 29 - August 1, 2019, Las Vegas, Nevada, USA


WORLDCOMP’19

PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON GRID, CLOUD, & CLUSTER COMPUTING

Grid, Cloud, and Cluster Computing

GCC'19
Editors: Hamid R. Arabnia, Leonidas Deligiannidis, Fernando G. Tinetti

U.S. $49.95 ISBN 9781601324993


Publication of the 2019 World Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE’19) July 29 - August 01, 2019 | Las Vegas, Nevada, USA https://americancse.org/events/csce2019

Copyright © 2019 CSREA Press


This volume contains papers presented at the 2019 International Conference on Grid, Cloud, & Cluster Computing. Their inclusion in this publication does not necessarily constitute endorsement by the editors or the publisher.

Copyright and Reprint Permission: Copying without a fee is permitted provided that the copies are not made or distributed for direct commercial advantage, and credit to the source is given. Abstracting is permitted with credit to the source. Please contact the publisher for other copying, reprint, or republication permission.

American Council on Science and Education (ACSE)

Copyright © 2019 CSREA Press
ISBN: 1-60132-499-5
Printed in the United States of America
https://americancse.org/events/csce2019/proceedings

Foreword

It gives us great pleasure to introduce this collection of papers to be presented at the 2019 International Conference on Grid, Cloud, and Cluster Computing (GCC'19), July 29 - August 1, 2019, at Luxor Hotel (a property of MGM Resorts International), Las Vegas, USA. The preliminary edition of this book (available in July 2019 for distribution on site at the conference) includes only a small subset of the accepted research articles. The final edition (available in August 2019) will include all accepted research articles. This is due to deadline extension requests received from most authors who wished to continue enhancing the write-up of their papers (by incorporating the referees' suggestions). The final edition of the proceedings will be made available at https://americancse.org/events/csce2019/proceedings .

An important mission of the World Congress in Computer Science, Computer Engineering, and Applied Computing, CSCE (the federated congress with which this conference is affiliated) includes "Providing a unique platform for a diverse community of constituents composed of scholars, researchers, developers, educators, and practitioners. The Congress makes concerted effort to reach out to participants affiliated with diverse entities (such as: universities, institutions, corporations, government agencies, and research centers/labs) from all over the world. The congress also attempts to connect participants from institutions that have teaching as their main mission with those who are affiliated with institutions that have research as their main mission. The congress uses a quota system to achieve its institution and geography diversity objectives." By any definition of diversity, this congress is among the most diverse scientific meetings in the USA. We are proud to report that this federated congress has authors and participants from 67 different nations, representing a variety of personal and scientific experiences that arise from differences in culture and values. As can be seen below, the program committee of this conference, as well as the program committees of all other tracks of the federated congress, are as diverse as its authors and participants.

The program committee would like to thank all those who submitted papers for consideration. About 70% of the submissions were from outside the United States. Each submitted paper was peer-reviewed by two experts in the field for originality, significance, clarity, impact, and soundness. In cases of contradictory recommendations, a member of the conference program committee was charged to make the final decision; often, this involved seeking help from additional referees. In addition, papers whose authors included a member of the conference program committee were evaluated using the double-blind review process. One exception to the above evaluation process was for papers that were submitted directly to chairs/organizers of pre-approved sessions/workshops; in these cases, the chairs/organizers were responsible for the evaluation of such submissions. The overall paper acceptance rate for regular papers was 18%; 20% of the remaining papers were accepted as poster papers (at the time of this writing, we had not yet received the acceptance rate for a couple of individual tracks). We are very grateful to the many colleagues who offered their services in organizing the conference.
In particular, we would like to thank the members of the Program Committee of GCC'19, members of the congress Steering Committee, and members of the committees of federated congress tracks that have topics within the scope of GCC. Many individuals listed below will be requested after the conference to provide their expertise and services for selecting papers for publication (extended versions) in journal special issues as well as for publication in a set of research books (to be prepared for publishers including Springer, Elsevier, BMC journals, and others).

Prof. Emeritus Nizar Al-Holou (Congress Steering Committee); Professor and Chair, Electrical and Computer Engineering Department; Vice Chair, IEEE/SEM-Computer Chapter; University of Detroit Mercy, Detroit, Michigan, USA

Prof. Hamid R. Arabnia (Congress Steering Committee); Graduate Program Director (PhD, MS, MAMS); The University of Georgia, USA; Editor-in-Chief, Journal of Supercomputing (Springer); Editor-in-Chief, Transactions of Computational Science & Computational Intelligence (Springer); Fellow, Center of Excellence in Terrorism, Resilience, Intelligence & Organized Crime Research (CENTRIC)

Prof. Dr. Juan-Vicente Capella-Hernandez; Universitat Politecnica de Valencia (UPV), Department of Computer Engineering (DISCA), Valencia, Spain

Prof. Emeritus Kevin Daimi (Congress Steering Committee); Director, Computer Science and Software Engineering Programs, Department of Mathematics, Computer Science and Software Engineering, University of Detroit Mercy, Detroit, Michigan, USA

Prof. Leonidas Deligiannidis (Congress Steering Committee); Department of Computer Information Systems, Wentworth Institute of Technology, Boston, Massachusetts, USA; Visiting Professor, MIT, USA

Prof. Mary Mehrnoosh Eshaghian-Wilner (Congress Steering Committee); Professor of Engineering Practice, University of Southern California, California, USA; Adjunct Professor, Electrical Engineering, University of California Los Angeles (UCLA), California, USA

Prof. Louie Lolong Lacatan; Chairperson, Computer Engineering Department, College of Engineering, Adamson University, Manila, Philippines; Senior Member, International Association of Computer Science and Information Technology (IACSIT), Singapore; Member, International Association of Online Engineering (IAOE), Austria

Prof. Hyo Jong Lee; Director, Center for Advanced Image and Information Technology, Division of Computer Science and Engineering, Chonbuk National University, South Korea

Dr. Ali Mostafaeipour; Industrial Engineering Department, Yazd University, Yazd, Iran

Dr. Houssem Eddine Nouri; Informatics Applied in Management, Institut Superieur de Gestion de Tunis, University of Tunis, Tunisia

Prof. Dr., Eng. Robert Ehimen Okonigene (Congress Steering Committee); Department of Electrical & Electronics Engineering, Faculty of Engineering and Technology, Ambrose Alli University, Edo State, Nigeria

Ashu M. G. Solo (Publicity); Fellow of British Computer Society; Principal/R&D Engineer, Maverick Technologies America Inc.

Prof. Fernando G. Tinetti (Congress Steering Committee); School of Computer Science, Universidad Nacional de La Plata, La Plata, Argentina; also at Comision Investigaciones Cientificas de la Prov. de Bs. As., Argentina

Prof. Layne T. Watson (Congress Steering Committee); Fellow of IEEE; Fellow of The National Institute of Aerospace; Professor of Computer Science, Mathematics, and Aerospace and Ocean Engineering, Virginia Polytechnic Institute & State University, Blacksburg, Virginia, USA

Prof. Jane You (Congress Steering Committee); Associate Head, Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong

Dr. Farhana H. Zulkernine; Coordinator of the Cognitive Science Program, School of Computing, Queen's University, Kingston, ON, Canada

We would like to extend our appreciation to the referees and the members of the program committees of individual sessions, tracks, and workshops; their names do not appear in this document, but they are listed on the web sites of individual tracks. As Sponsors-at-large, partners, and/or organizers, each of the following (separated by semicolons) provided help for at least one track of the Congress: Computer Science Research, Education, and Applications Press (CSREA); US Chapter of World Academy of Science; American Council on Science & Education & Federated Research Council (http://www.americancse.org/). In addition, a number of university faculty members and their staff (names appear on the cover of the set of proceedings), several publishers of computer science and computer engineering books and journals, chapters and/or task forces of computer science associations/organizations from 3 regions, and developers of high-performance machines and systems provided significant help in organizing the conference as well as providing some resources. We are grateful to them all.

We express our gratitude to keynote, invited, individual conference/track, and tutorial speakers; the list of speakers appears on the conference web site. We would also like to thank the following: UCMSS (Universal Conference Management Systems & Support, California, USA) for managing all aspects of the conference; Dr. Tim Field of APC for coordinating and managing the printing of the proceedings; and the staff of Luxor Hotel (Convention department) in Las Vegas for the professional service they provided. Last but not least, we would like to thank the Co-Editors of GCC'19: Prof. Hamid R. Arabnia, Prof. Leonidas Deligiannidis, and Prof. Fernando G. Tinetti.

We present the proceedings of GCC'19.

Steering Committee, 2019 http://americancse.org/

Contents

SESSION: HIGH-PERFORMANCE COMPUTING - CLOUD COMPUTING

The Design and Implementation of Astronomical Data Analysis System on HPC Cloud ... 3
Jaegyoon Hahm, Ju-Won Park, Hyeyoung Cho, Min-Su Shin, Chang Hee Ree

SESSION: HIGH-PERFORMANCE COMPUTING - HADOOP FRAMEWORK

A Speculation and Prefetching Model for Efficient Computation of MapReduce Tasks on Hadoop HDFS System ... 9
Lan Yang

SESSION: LATE BREAKING PAPER: CLOUD MIGRATION

Critical Risk Management Practices to Mitigate Cloud Migration Misconfigurations ... 15
Michael Atadika, Karen Burke, Neil Rowe


SESSION: HIGH-PERFORMANCE COMPUTING - CLOUD COMPUTING
Chair(s): TBA


The Design and Implementation of Astronomical Data Analysis System on HPC Cloud

Jaegyoon Hahm1, Ju-Won Park1, Hyeyoung Cho1, Min-Su Shin2, and Chang Hee Ree2
1 Supercomputing Infrastructure Center, Korea Institute of Science and Technology Information, Daejeon, Republic of Korea
2 Galaxy Evolution Research Group, Korea Astronomy and Space Science Institute, Daejeon, Republic of Korea

Abstract - Astronomy is a representative data-intensive science that can take advantage of cloud computing because it requires flexible infrastructure services for variable workloads and various data analysis tools. The purpose of this study is to show the usefulness of cloud computing as a research environment for analyzing large-scale data in sciences such as astronomy. We implemented an OpenStack cloud and a Kubernetes-based orchestration service for scientific data analysis. On the cloud, we have successfully constructed data analysis systems with a task scheduler and an in-memory database tool to support the task processing and data I/O environment required in astronomical research. Furthermore, we aim to construct a high-performance cloud service for data-intensive research in more scientific fields.

Keywords: cloud computing, astronomical data analysis, data analysis platform, OpenStack, Kubernetes

1 Introduction

Recently, in the field of science and technology, more and more data is generated through advanced data-capturing sources [1]. Naturally, researchers are increasingly using cutting-edge data analysis techniques, such as big data analysis and machine learning. Astronomy is a typical field that collects and analyzes large amounts of data through various observation tools, such as astronomical telescopes, and its data growth rate will increase rapidly in the near future. As a notable example, the Large Synoptic Survey Telescope (LSST) will start to produce large volumes of data, up to 20 TB per day, from observing a large area of the sky in full operations from 2023. The total database for ten years is expected to be 60 PB for the raw data and 15 PB for the catalog database [2]. As another big data project, the Square Kilometer Array (SKA), which will be constructed as the world's largest radio telescope by 2024, is projected to generate and archive 130-300 PB per year [3]. In this era of data deluge, there is a growing demand for utilizing cloud computing for data-intensive sciences.

In particular, astronomical research demands the ability that cloud computing offers: acquiring resources for simulation-driven numerical experiments or mass data analysis in an immediate and dynamic way. Therefore, the types of cloud service expected by astronomical science researchers will be Infrastructure as a Service (IaaS), providing flexible resources for running existing software and research methodologies, and Platform as a Service (PaaS), to be applied with new data analytic tools. In this paper, we propose a methodology for, and show the feasibility of, cloud computing that focuses on flexible use of resources and on astronomical science researchers' problems when using cloud services. Section 2 introduces related research, and Section 3 describes the features and requirements of the target application. In Section 4 we describe the implementation of the data analysis system for the target application. Finally, in Section 5 we provide conclusions and future plans.

2 Related Works

There have been several examples of cloud applications for astronomical research. The Gemini Observatory has been building a new archive using EC2, EBS, S3, and Glacier from the Amazon Web Services (AWS) cloud to replace the existing Gemini Science Archive (GSA) [4]. In addition, Williams et al. (2018) conducted studies to reduce the Panchromatic Hubble Andromeda Treasury (PHAT) photometric data set using Amazon EC2 [5]. Unlike these cases of using public clouds, there are also studies that build and use a private cloud environment to perform astronomical research. AstroCloud [6] is a distributed cloud platform which integrates many data management and processing tasks for the Chinese Virtual Observatory (ChinaVO). In addition, Hahm et al. (2012) developed a platform for constructing virtual machine-based Condor clusters for analyzing astronomical time-series data in a private cloud [7]. The purpose of that study was to confirm the possibility of constructing a cluster-type analysis platform to perform mass astronomical data analysis in a cloud environment.


3 Application Requirements and Design

The application used in this study is the MAGPHYS SED fitting code, which reads and analyzes data on the brightness and color of galaxies to estimate their physical properties. The data used is the large-scale survey data of Galaxy And Mass Assembly (GAMA), a project to exploit the latest generation of ground-based and space-borne survey facilities to study cosmology and galaxy formation and evolution [8]. On a single processor, MAGPHYS typically takes 10 minutes to run for a single galaxy. In Figure 1, the data analysis in the application starts with the data obtained by preprocessing the original image data collected from the telescope. The preprocessed data is a text file DB, which is the input data for the analysis. The application extracts the data one line at a time from the input file, submits it to the task queue together with the spectral analysis code, and creates a new DB by storing the analyzed result in the output file. In a traditional research environment, the analysis would be done by building a dedicated cluster for data analysis, or through a job scheduler in a shared cluster.

Fig. 1. Data Analysis Workflow

The main technical challenges of the application are to achieve faster data I/O and to use its own task queue for convenience. The GAMA dataset has information on approximately 197,000 galaxies. File-based I/O is too slow and hard to manage for a dataset of this size. Therefore, a fast data I/O tool and a job scheduler for high-throughput batch processing are required.

To satisfy these requirements, we designed two types of lightweight data analysis systems. First, data is read through file I/O as usual and the data processing environment is configured using an asynchronous task scheduler for the analysis work (see Figure 2). In this case, we need a shared file system that can be accessed by the large number of workers performing analysis tasks. Second, as shown in Figure 3, data input is performed through an in-memory DB instead of file reading for faster I/O, and the output of the analysis is also stored in the in-memory DB.

Fig. 2. Data Analysis with Task Scheduler and File I/O

Fig. 3. Data Analysis with In-memory DB

4 Cloud Data Analysis System Implementation

4.1 KISTI HPC Infrastructure as a Service

The Korea Institute of Science and Technology Information (KISTI) is building a high-performance cloud service in order to support data-intensive research in various science and technology fields. This is because emerging data-centric sciences require a more flexible and dynamic computing environment than traditional HPC services; big data and deep learning research especially needs customized HPC resources in a flexible manner. So, the KISTI cloud will be a service providing customizable high-performance computing and storage resources, such as a supercomputer, GPU cluster, etc.

In the first stage, the cloud service will be implemented on a supercomputer. KISTI's newly introduced supercomputer NURION is a national strategic research infrastructure to support R&D in various fields. In particular, there is a plan to utilize it for data-intensive computing and artificial intelligence. In order to build such a service environment, we will leverage cloud computing technologies.

Fig. 4. OpenStack Cloud Testbed

In order to design the service and verify the required skills, we constructed an OPENSTACK-based IaaS cloud testbed system using a computational cluster (see Figure 4). OPENSTACK [9] is a cloud deployment platform that is used as a de facto standard in industry and research, and it is well suited to cloud deployments for high-performance computing too. The cluster used for the testbed has thirteen Intel Xeon-based servers: one deployment node, one controller node, three storage nodes, and compute nodes for the rest. The OPENSTACK services implemented here are NOVA (Compute), GLANCE (Image), NEUTRON (Network), CINDER (Block Storage), SWIFT (Object Storage), KEYSTONE (Identity), HEAT (Orchestration), HORIZON (Dashboard), MANILA (Network Filesystem), and MAGNUM (Container). In the case of storage, CEPH [10] storage was configured using three servers and used as a backend for the GLANCE, CINDER, SWIFT, and MANILA services. Apart from this, we configured a Docker-based KUBERNETES orchestration environment using MAGNUM. KUBERNETES is an open source platform for automating Linux container operations [11]. In this study, it is composed of one KUBERNETES master and four workers.

4.2 Implementation of Data Analysis Platform in the Cloud

The data analysis system constructed in this study focuses on how to configure the task scheduler and the data I/O environment for task processing. We describe the architecture of the analysis system in Figure 5. First, the task scheduler should efficiently distribute and process individual tasks asynchronously. We adopted a lightweight task scheduler that can be dynamically configured and used independently, unlike the shared job schedulers, such as PBS and SLURM, in conventional HPC systems. In particular, tasks for astronomical data analysis, which require long running times, are often processed asynchronously rather than synchronously. In the experiments, we used DASK [12] and CELERY [13] as task schedulers; both are readily available to scientific and technological researchers and are likely to be used in a common data analysis environment. The structure of the scheduler is very simple, consisting of a scheduler and workers. We wrote Python code to submit tasks to the scheduler and manage data. The difference between DASK and CELERY is that DASK allocates and monitors tasks in its own scheduler module, whereas CELERY workers' tasks are assigned from a separate message queue, such as RABBITMQ.

Fig. 5. Data Analytics System Architecture

In the data I/O environment configuration, the initial experiment was conducted by configuring a shared file system using OPENSTACK MANILA for the file-based I/O processing used in the existing analysis environment. However, in data processing, file I/O is significantly slower than computing, which causes severe performance degradation in analyzing the entire dataset. In order to solve this bottleneck problem and improve the overall performance, we used an in-memory DB tool called REDIS [14]. REDIS is a memory-based key-value store that is known to handle more than 100,000 transactions per second.

heat_template_version: queens
…
parameters:
  worker_num:
    default: 16
    …
resources:
  scheduler:
    type: OS::Nova::Server
    properties:
      name: dask-scheduler
      image: Ubuntu 18.04 Redis with Dask
      …
      template: |
        #!/bin/bash
        pip3 install dask distributed --upgrade
        …
        dask-scheduler &
  workers:
    type: OS::Heat::ResourceGroup
    properties:
      count: { get_param: worker_num }
      resource_def:
        type: OS::Nova::Server
        properties:
          name: dask-worker%index%
          image: Ubuntu 18.04 Redis with Dask
          …
          template: |
            #!/bin/bash
            apt-get install redis-server -y
            pip3 install dask distributed --upgrade
            …
            dask-worker dask-scheduler:8786 &
outputs:
  instance_name: …
  instance_ip: …

Fig. 6. HEAT Template for Analysis Platform with REDIS & DASK


A combination of task scheduler and data I/O environment can be created and configured automatically in an orchestration environment through OPENSTACK HEAT or KUBERNETES in our self-contained cloud. Figure 6 shows the structure of one of the HEAT templates used in this experiment. The template is structured with parameters and resources. The resources comprise the scheduler and the workers, and the required software is installed and configured on each scheduler and worker after boot-up.
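To make the resulting workflow concrete, the following is a minimal Python sketch of the submission pattern described above: one asynchronous task per galaxy, with input photometry read from REDIS and results written back to it. The run_magphys stub, the Redis key layout, the host names, and the galaxy identifiers are illustrative assumptions, not code from the paper.

import redis
from dask.distributed import Client

def run_magphys(photometry):
    # Placeholder for invoking the MAGPHYS SED-fitting code on one galaxy.
    return b"fitted-parameters"

def analyze_galaxy(galaxy_id):
    # Each worker reads one galaxy's data from the in-memory DB,
    # runs the fit, and stores the result back in REDIS.
    db = redis.Redis(host="dask-scheduler", port=6379)
    photometry = db.get(f"galaxy:input:{galaxy_id}")
    db.set(f"galaxy:output:{galaxy_id}", run_magphys(photometry))
    return galaxy_id

galaxy_ids = ["G00001", "G00002"]                 # hypothetical GAMA identifiers
client = Client("dask-scheduler:8786")            # connect to the dask-scheduler VM
futures = client.map(analyze_galaxy, galaxy_ids)  # one asynchronous task per galaxy
client.gather(futures)                            # block until all fits complete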

5 Conclusion and Future Work

Through experiments, we successfully analyzed brightness and color data for about 5,300 galaxies in a parallel distributed processing environment consisting of DASK or CELERY with REDIS. Figure 7 shows an example galaxy from the GAMA data with the result of the MAGPHYS analysis in the cloud. With the OPENSTACK-based cloud, we confirmed that the research environment, especially a data analysis system with tools like a task scheduler and an in-memory DB, can be automatically configured and well utilized. In addition, we confirmed the availability of an elastic service environment through the cloud to meet volatile demand for large-scale data analysis.

Fig. 7. An example result of the MAGPHYS analysis on the cloud

In this study, we have identified some useful aspects of the cloud for data-driven research. First, we confirmed that it is easy to build an independent execution environment that provides the necessary software stack for research through the cloud. Also, in a cloud environment, researchers can easily reuse the same research environment and share research experience by reusing virtual machines or container images deployed by the research community. In the next step, we will configure an environment for real-time processing of in-memory cache data. For practical real-time data processing, it is necessary to construct an optimal environment for data I/O as well as memory-based stream data processing, and various experiments need to be performed through the cloud. Based on the experience of building an astronomical big data processing environment in this study, we will provide a more flexible and high-performance cloud service and let researchers utilize it in various fields of data-centric research.

6 References

[1] T. Hey, S. Tansley and K. Tolle, The Fourth Paradigm: Data-Intensive Scientific Discovery, Microsoft Research, 2009.
[2] LSST Corporation. About LSST: Data Management. [Online]. Available from: https://www.lsst.org/about/dm/ (accessed 2019.03.10).
[3] P. Diamond, SKA Community Briefing. [Online]. Available from: https://www.skatelescope.org/skacommunity-briefing-18jan2017/ (accessed 2019.03.10).
[4] P. Hirst and R. Cardenes, "The new Gemini Observatory archive: a fast and low cost observatory data archive running in the cloud", Proc. SPIE 9913, Software and Cyberinfrastructure for Astronomy IV, 99131E (8 August 2016); doi: 10.1117/12.2231833.
[5] B. F. Williams, K. Olsen, R. Khan, D. Pirone and K. Rosema, "Reducing and analyzing the PHAT survey with the cloud", The Astrophysical Journal Supplement Series, Volume 236, Number 1, 2018.
[6] C. Cui et al., "AstroCloud: a distributed cloud computing and application platform for astronomy", Proc. WCSN2016.
[7] J. Hahm et al., "Astronomical time series data analysis leveraging science cloud", Proc. Embedded and Multimedia Computing Technology and Service, pp. 493-500, 2012.
[8] S. P. Driver et al., "Galaxy And Mass Assembly (GAMA): Panchromatic Data Release (far-UV-far-IR) and the low-z energy budget", MNRAS 455, 3911-3942, 2016.
[9] OpenStack Foundation. OpenStack Overview. [Online]. Available from: https://www.openstack.org/software/ (accessed 2019.03.10).
[10] Red Hat Inc. Ceph Introduction. [Online]. Available from: https://ceph.com/ceph-storage/ (accessed 2019.03.10).
[11] The Kubernetes Authors. What is Kubernetes? [Online]. Available from: https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/ (accessed 2019.03.10).
[12] Dask Core Developers. Why Dask? [Online]. Available from: https://docs.dask.org/en/latest/why.html (accessed 2019.03.10).
[13] A. Solem. Celery - Distributed Task Queue. [Online]. Available from: http://docs.celeryproject.org/en/latest/index.html (accessed 2019.03.10).
[14] S. Sanfilippo. Introduction to Redis. [Online]. Available from: https://redis.io/topics/introduction (accessed 2019.03.10).


SESSION: HIGH-PERFORMANCE COMPUTING - HADOOP FRAMEWORK
Chair(s): TBA


A Speculation and Prefetching Model for Efficient Computation of MapReduce Tasks on Hadoop HDFS System

Lan Yang
Computer Science Department, California State Polytechnic University, Pomona
Pomona, CA 91768, USA

Abstract - The MapReduce programming model and the Hadoop software framework are keys to big data processing on high performance computing (HPC) clusters. The Hadoop Distributed File System (HDFS) is designed to stream large data sets at high bandwidth. However, Hadoop suffers from a set of drawbacks, particularly issues with small files as well as dynamic datasets. In this research we target big data applications working with many on-demand datasets of varying sizes. We propose a speculation model that prefetches anticipated datasets for upcoming tasks in support of efficient big data processing on HPC clusters.

Keywords: Prefetching, Speculation, Hadoop, MapReduce, High performance computing cluster.

1 Introduction

Along with the emerging technology of cloud computing, Google proposed the MapReduce programming model [1], which allows for massive scalability of unstructured data across hundreds or thousands of high performance computing nodes. Hadoop is an open source software framework that performs distributed processing of huge data sets across clusters of commodity servers simultaneously [2]. Now distributed as Apache Hadoop [3], it is employed by many cloud services, such as AWS, Cloudera, HortonWorks, and IBM InfoSphere Insights, to offer big data solutions. The Hadoop Distributed File System (HDFS) [2], inspired by the Google File System (GFS) [4], is a reliable filesystem of Hadoop designed for storing very large files on a cluster of commodity hardware. To process big data in Apache Hadoop, the client submits data and a program to Hadoop; HDFS stores the data while MapReduce processes it. While Hadoop is a powerful tool for processing massive data, it suffers from a set of drawbacks, including issues with small files, no real-time data processing, and suitability for batch processing only [5]. Apache Spark [6] partially solved Hadoop's real-time and batch processing problems by introducing in-memory processing [7]. As a member of the Hadoop ecosystem, Spark doesn't have its own distributed filesystem, though it can use HDFS. Hadoop does not suit small data because HDFS, with its high-capacity design, lacks the ability to efficiently support random reading of small files. Small files are the major problem in HDFS. In this research, we study a special type of iterative MapReduce task working on HDFS with input datasets coming from many small files dynamically, i.e., on-demand. We propose a data prefetching speculation model aimed at improving the performance and flexibility of big data processing on Hadoop HDFS for that special type of MapReduce task.

2 Background

2.1 Description of a special type of MapReduce tasks

In today's big data world, the MapReduce programming model and the Hadoop software framework remain popular tools for big data processing. Based on a number of big data applications performed on Hadoop, we observed the following:

(1) An HDFS file splits into chunks, typically 64-128 MB in size. To benefit from Hadoop's parallel processing ability, an HDFS file must be large enough to be divided into multiple chunks. Therefore, a file is considered small if it is significantly smaller than the HDFS chunk size.

(2) While many big data applications use large data files that can be pushed to the HDFS input directory prior to task execution, some applications use many small datasets distributed across a wide range.


(3) With the increasing demand for big data processing, more and more applications now require multiple rounds (or iterations) of processing, with each round requiring new datasets determined by the outcome of the previous computation. For example, in a data processing application for a legal system, the first round of MapReduce computation uses prestored case documents, while the second round might require access to certain assets or utilities datasets based on the case outcomes resulting from the first-round analysis. The assets or utilities datasets could consist of hundreds to thousands of files ranging from 1 KB to 10 MB, with only dozens of files relevant depending on the outcome of the first round. It would be very inefficient and inflexible if we had to divide these two rounds into separate client requests. Also, if we could overlap computation and data access time by speculating and prefetching data, we could reduce the overall processing time significantly. Here we refer to big data applications with one or more of the above characteristics (i.e., requiring iterative or multiple passes of MapReduce computation, using many small files to form an HDFS chunk, and using dynamic datasets that depend on the outcome of previous rounds of computation) as a special type of MapReduce task.

2.2 Observation: execution time and HDFS chunks

We conducted several dozen big data applications using Hadoop on a high-performance computing cluster. Table 1 summarizes the MapReduce performance of three relatively large big data analytics tasks.

Table 1: Performance data for some big data applications (*requires multi-phase analysis)

2.3 Computation time vs. data fetch time

In this research, we first tested and analyzed data access times for data ranging from 1 KB to 16 MB on an HPC cluster which consists of 2 DL360 management nodes, 20 DL160 compute nodes, 3.3 TB RAM, 40 Gbit InfiniBand, and a 10 Gbit external Ethernet connection, with overall system throughput of 36.6 Tflops in double precision mode and 149.6 Tflops. The Slurm job scheduler [8] is the primary software we use for our testing. The performance data shown in Figure 1 serve as our basis for deriving the performance of our speculation algorithms.

Figure 1: Data Access Performance Base

3 Speculation and Prefetching Models

3.1 Speculation model

We establish a connection graph (CG) to represent relations among commonly used tasks, with tasks as nodes and edges as links between tasks. For example, a birthday party planning task is linked to restaurant reservation tasks as well as entertainment or recreation tasks; an address change task is linked with moving or furniture shopping tasks. The links in the CG are prioritized; for example, for the birthday task, the restaurant task is initially set with a higher priority than the movie ticketing task. Priorities are in the 0.0 to 1.0 range and are dynamically updated based on the outcome of our prediction. For example, based on the connections in the CG and the priorities of the links, we predict that the top two tasks following the birthday task are, in order, the restaurant task and the movie task. If for a particular application it turns out the movie task is the correct choice, we increase its priority by a small fraction, say 0.1, capped at a 1.0 maximum.


3.2 Prefetching algorithm

The prefetching concept is inspired by the compiler-directed instruction/data prefetching technique that speculates and prefetches instructions for multiprocessing [9][10]. Our basic fetching strategy is: overlapping with the computation of the current task, we prefetch the associated datasets for the next round of computation based on the task speculation. The association between tasks and data files can be represented as a many-to-many relation. Each task is pre-associated with a list of files in the order of established ranks; for example, the restaurant task could be associated with pizza delivery files, restaurant location files, etc. The ranks are initialized based on the popularity of the service, with a value in the 0.0 to 1.0 range and higher values for the most popular or most recommended services. The ranks are then adjusted based on the network distance of file locations, with priority given to local or nearby files. Again, after task execution, if a prefetched file turned out to be irrelevant (i.e., the whole file was filtered out at an early MapReduce stage), the rank of that file with regard to that task is reduced. Based on the system configuration, we also preset two constants K and S, with K the optimized/recommended number of containers and S the size of each container (we suggest S be the HDFS chunk size and K the desired number of chunks with regard to the requested compute nodes). When prefetching datasets for a set of speculated tasks, the prefetching process repeatedly reads files until it fills up all the containers.

4 Simulation Design

We used a Python dictionary to implement the connection graph CG, with each distinct task name as a key. The value for a key is a list of task links sorted in descending order of priority. The task and data relations are also represented as a Python dictionary, with task names as keys and lists of data file names sorted in descending order of rank as values. Currently we simulate the effectiveness of prefetching by using parallel processes created by Slurm directly. Once the approaches are validated, we will test them on Hadoop/HDFS.

4.1 Speculation Model

For any current task t, the simulated speculation model always fetches the top candidate task p from the CG dictionary, i.e., CG[t][0] as p, and starts the prefetching process. When t completes, it will choose the next task t'. If t' is the same as p, let t be p and the process continues. If t' is different from p, we restart the prefetching process, reduce the priority of p by one level (currently 0.1) but not below 0.0, and increase the priority of t' by 0.1 (capped at 1.0) if it is already in t's connection link, or add it to t's connection link with a randomly assigned priority (between 0.1 and 0.5) if it is not in t's connection link yet.
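As a concrete illustration, the following is a minimal Python sketch of this update rule, assuming the dictionary layout described at the start of Section 4; the task names and initial priorities are invented for the example.

import random

# CG maps a task name to a list of (successor, priority) pairs,
# kept sorted in descending order of priority.
CG = {
    "birthday": [("restaurant", 0.8), ("movie", 0.5)],
    "restaurant": [("movie", 0.4)],
    "movie": [],
}

def speculate(t):
    # Top candidate CG[t][0]; None when t has no known successors.
    return CG[t][0][0] if CG[t] else None

def update(t, p, t_next):
    # Adjust priorities after observing that t_next actually followed t.
    links = dict(CG[t])
    if p is not None and t_next != p:
        links[p] = max(0.0, links[p] - 0.1)                # demote the wrong guess
        if t_next in links:
            links[t_next] = min(1.0, links[t_next] + 0.1)  # promote the observed task
        else:
            links[t_next] = random.uniform(0.1, 0.5)       # new link, random priority
    CG[t] = sorted(links.items(), key=lambda kv: -kv[1])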

4.2 Prefetching Model

(1) Configuration: one file node N (i.e., a process that only reads data in and writes it to a certain shared location) and four shared storages (arrays or dictionaries) representing the containers, C1 to C4. Initially all Ci are empty, and each container has a current capacity and a maximum capacity (all containers may have the same maximum capacity). This is easily extendable to multiple file nodes and a larger number of containers.

(2) Assume the task p selected by the speculation scheme is associated with n small files, say F1, ..., Fn. Read in the files in the order F1, ..., Fn. For each file read in, record its size as sj, then search for a container whose current capacity + sj < maximum capacity; lock it once found and push the content in. If no available container is found, the file content is set aside and we increase our failure count by 1 (the failure count is initially set to 0). Continue to fetch the next file until reaching the condition spelled out in (3).

(3) The prefetching process ends when all containers reach a certain percentage full (e.g., at least 80% full) or when the failure count reaches a certain number (say 3). Note: one failure doesn't mean the containers are full; it could be that we fetched a very large dataset that couldn't fit into any of the current containers. In this case we may further fetch the files next in the list, as these might be smaller files; a sketch of this loop follows.
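The following is a minimal Python sketch of this container-filling loop, assuming the ranked file list arrives as (name, size) pairs already sorted by rank; the capacity constant plays the role of S, the container count that of K, and the stop thresholds are those given above.

MAX_CAP = 128 * 1024 * 1024   # S: per-container capacity, e.g., one HDFS chunk
FULL_FRACTION = 0.8           # stop once every container is at least 80% full
MAX_FAILURES = 3              # ...or after this many placement failures

def prefetch(files, num_containers=4):
    # Greedily pack (name, size) pairs, pre-sorted by rank, into containers.
    containers = [[] for _ in range(num_containers)]
    used = [0] * num_containers
    failures = 0
    for name, size in files:
        for i in range(num_containers):
            if used[i] + size < MAX_CAP:       # first container that still fits
                containers[i].append(name)
                used[i] += size
                break
        else:
            failures += 1                      # nothing fits; may be a huge file
        if failures >= MAX_FAILURES or all(u >= FULL_FRACTION * MAX_CAP for u in used):
            break                              # stop conditions from step (3)
    return containers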

5 Conclusions

In this research work, we studied the possibility of congregating small datasets dynamically to form large data chunks suitable for MapReduce tasks on Hadoop HDFS. We proposed task speculation and file prefetching models to speed up overall processing tasks. We have set up a primitive simulation test suite to assess the feasibility of the speculation and prefetching models. Since we are currently designing the schemes in Slurm multiprocess environments without using HDFS, no performance gain could be measured. Our future (and ongoing) work is to move the design schemes from HPC Slurm processes onto the Hadoop HDFS system and measure their effectiveness using real-world big data applications.

6 References

[1] Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google Research, https://research.google.com/archive/mapreduceosdi04.pdf
[2] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler, The Hadoop Distributed File System, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).
[3] Apache Hadoop, https://hadoop.apache.org/
[4] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System, https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
[5] DATAFLAIR Team, 13 Big Limitations of Hadoop & Solution To Hadoop Drawbacks, https://dataflair.training/blogs/13-limitations-of-hadoop/, March 7, 2019.
[6] Apache Spark, https://spark.apache.org/
[7] Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, and Ion Stoica, Spark: Cluster Computing with Working Sets, Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010.
[8] Slurm job scheduler, https://slurm.schedmd.com/
[9] Seung Woo Son, Mahmut Kandemir, Mustafa Karakoy, and Dhruva Chakrabarti, A compiler-directed data prefetching scheme for chip multiprocessors, Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '09).
[10] Ricardo Bianchini and Beng-Hong Lim, Evaluating the Performance of Multithreading and Prefetching in Multiprocessors, https://doi.org/10.1006/jpdc.1996.0109


SESSION: LATE BREAKING PAPER: CLOUD MIGRATION
Chair(s): TBA


Critical Risk Management Practices to Mitigate Cloud Migration Misconfigurations

M. Atadika, K. Burke, and N. Rowe
Computer Science Department, Naval Postgraduate School, Monterey, CA, US
Regular Research Paper

Abstract - We identified that as private enterprises continue to gravitate toward the cloud to benefit from cost savings, they may be unprepared to confront four major issues inherent to cloud architecture. Mitigating risks will require that migrating organizations properly recognize and understand: the critical misalignment between service model selection and consumer expectations within the cloud architecture, the cloud-borne vulnerabilities and cloud-specific threats that together create technological challenges, the causal relationship between customer misconfigurations and cloud spills, and the complexity of implementing security controls. Collectively, the four substantive issues cause risk management to manifest itself in more complicated permutations in the cloud. To address these vexing cybersecurity risks, this paper introduces the unifying concept of transformational migration and recommends decoding the cloud service model selection, employing cryptographic erase for applicable use cases, consulting the broadest cloud security control catalogs in addressing cloud-negative controls, managing supply-chain risk through cloud service providers, and adopting a reconfigured Development Security Operations (DevSecOps) workforce.

Keywords: Cloud, Misconfigurations, Risk, Migration, Service model

1 Introduction

During the five-year period from 2014-18, the largest cloud service provider, Amazon Web Services (AWS), a proxy for the accelerating technological migration, experienced revenue growth at a compound annual growth rate of 47.9% [1] (see Figure 1). This growth in revenue directly corresponds to a growing trend of data departing on-premises architectures for cloud destinations. Cost may be a primary causal factor for this uptick in cloud migration: cloud service providers charge fixed unitized fees for the work/cycles performed by each instance of utilization. The tradeoff for these cost savings, however, is potentially magnified insecurity.

Figure 1. Quarterly revenue of AWS from Q1 to Q4 (in USD millions). Source: [1].

For example, on June 1, 2017, the Washington Post reported that a large federal contractor for the Department of Defense (DoD) accidentally leaked government passwords on an AWS server related to a work assignment for the National Geospatial-Intelligence Agency [2]. Regrettably, this is not an isolated episode but the third recently documented instance of data mishandling by the well-established government contracting firm. The report went on to describe a prevalence of government agencies pivoting to the cloud, with industry leaders substantiating that this is, in fact, indicative of a more universal shift toward cloud-centric computing [2]. As private enterprises rush to the cloud to reap the benefits of financial savings and increased services, they will confront four major issues inherent to cloud architecture. This paper posits that the velocity of cloud adoption, multiplied by the immaturity of the available cloud workforce pool, warrants a rigorous investigation into the sufficiency of risk management capabilities and preparedness. Managing or mitigating risks will require that migrating organizations properly recognize and understand:

- the critical misalignment between service model selection and consumer expectations within the cloud architecture,
- the cloud-borne vulnerabilities and cloud-specific threats that together create technological challenges,
- the causal relationship between customer misconfigurations and cloud spills, and
- the complexity of implementing security controls.

Collectively, the four substantive issues cause risk management to manifest itself in more complicated machinations in the cloud. To address these issues and their related vexing cybersecurity risks, this paper introduces the unifying concept of transformational migration and recommends decoding the cloud service model selection, employing cryptographic erase for applicable use cases, consulting the broadest cloud security control catalogs in addressing cloud-negative controls, managing supply-chain risk through cloud service providers, and adopting a reconfigured Development Security Operations (DevSecOps) workforce [3]. The following sections of this paper justify these recommendations.

2 Background

Prior to the cloud, on-premises system applications were highly customized and expected to operate within a standalone data center. Accordingly, application data was structured for minimal to no interaction with other applications [4]. In contrast, cloud applications are highly agile and are expected to operate in multiple data centers; permissioned cloud data is available on-demand for maximal interaction with other applications. For transference to the cloud, which uses multiple servers, legacy applications developed prior to 2005 likely need to have their source code refactored, since it is doubtful those applications were written to accommodate running on multiple servers [5]. Consumers often execute a cloud migration incorrectly, assuming that they can simply port their entire traditional IT architecture to the cloud without any modification (often referred to as lift and shift [6]). The lift and shift assumption can contribute to recurring consumer misconfigurations. Lift and shift is problematic because the logic assumes that an on-premises application and its security controls are technically compatible with cloud architectures without any modifications [7]. Contrary to popular perception, it is not the responsibility of the cloud service provider to make a lift and shift migration work, because this cloud transition "strategy" is orthogonal to cloud architectures. This underscores why on-premises applications require changes at multiple logical layers to properly function in a cloud service model, as depicted in the cloud stack in Figure 2. The high customization of on-premises system applications created two deficiencies relative to the scalability of cloud systems: it inhibited applications from leveraging data from other applications, and it limited administrator knowledge to a small subset of applications, creating pockets of specialization.

Figure 2. Cloud Security Responsibility Matrix (On-premises Application). Source: [8].

Whereas there is heterogeneity in on-premises applications, cloud applications have greater homogeneity, requiring fewer application development specialists and more generalists. When an organization transitions to the cloud, it loses governance because it no longer owns resources but rather rents them. The computing resources are also remote to the organization and under the control of the cloud service provider. The cloud interjects the cloud service provider relationship into the customer's workflow. The SANS Institute's 2016 white paper, titled Implementing the Critical Security Controls in the Cloud, underscores that roles have to be clearly defined to accommodate the interjection of the cloud service provider [9]. The consumer and the cloud service provider both share responsibilities in the cloud relationship. From the cloud service provider's perspective, the client has responsibilities that span from running applications down to the guest operating system, while the provider is responsible for the host operating system down to the physical data center [10]. This type of cooperative security is commonly referred to as a "shared responsibility model" in the cloud, and this very division of responsibilities can create confusion or uncertainty that contributes to customer misconfiguration.

Fortunately (or unfortunately), there is no one right way to configure all of the available settings: a one-size-fits-all cloud does not exist. To this point, the Central Intelligence Agency (CIA) and National Security Agency (NSA) pursued two different paths in achieving similar cloud computing capabilities. Even though they are two seemingly similar, technologically sophisticated U.S. intelligence agencies, each organization had to make requirement-dependent choices with respect to cloud deployment and service models. In a 2014 interview, the CIA's chief information officer (CIO), Doug Wolfe, confirmed that the two clandestine agencies chose to build out their respective cloud architectures differently [11]. Wolfe explained that the CIA cloud was built using commercial cloud products with participation from a commercial cloud service provider, while the NSA cloud was designed in-house, also using commercially available products but without participation from a commercial cloud service provider [11]. The service model an organization selects determines the level of involvement it must have in application development within the cloud service provider's environment. The service model selection in itself also determines a particular set of consumer challenges balanced against greater autonomy in managing cloud-specific settings, configurations, and controls.

Thankfully, the selection of the cloud service model can highlight the cloud layers for which the consumer is responsible and, therefore, which security controls to implement. The security controls may essentially be the same; however, the implementer of the controls may (and probably will) differ by cloud service model. This suggests that any meaningful discussion about cloud security will not refer to the ubiquitous cloud but will instead reference a specific selected architecture instantiation, reflecting committed organizational choices. Once organizations have selected the appropriate service model, they must then address the technological challenges inherent in the cloud with respect to their service model selection. The confounding aspect of interoperability is that the cloud integrates multiple sophisticated technologies, cloud service providers, servicing counterparties, logical layers, hardware, and endpoint devices. A cloud service provider's trustworthiness is compromised if any of the multiple parties or technological interchanges is compromised. At the National Institute of Standards and Technology (NIST), the NIST Cloud Computing Forensic Science Working Group (NCC FSWG) shares an example of the enmeshed relationships a forensic investigator may have to unwind: "A cloud Provider that provides an email application (SaaS [software as a service]) may depend on a third-party provider to host log files (i.e., PaaS [platform as a service]), which in turn may rely on a partner who provides the infrastructure to store log files (IaaS [infrastructure as a service])" [12]. Therefore, technological capabilities and limitations dictate the realities that cloud service providers must integrate.

The remote delivery of cloud services and the cloud service provider's capacity as an intermediary give rise to organizational boundary challenges. The multi-geographical operations of cloud service providers create additional legal challenges, as consumers might fall under regulations in multiple jurisdictions if they do not limit the location of servers to only organizationally acceptable jurisdictions. The cloud's remote delivery also presents obstacles to data retrieval that are foreign to on-premises systems. Unlike in on-premises systems, cloud storage is neither local nor persistent; data storage is physically attached only temporarily, once the abstraction that enables pooling and dynamic customer provisioning is applied. The process of abstraction decouples the physical resources through a process called virtualization, which enables resource pooling. Furthermore, storage is designated as a cloud service provider responsibility, as depicted in Figure 2. The NCC FSWG characterized the separation of a virtual machine from local persistent storage: "Thus, the operational security model of the application, which assumes a secure local log file store, is now broken when moved into a cloud environment" [12]. The consequence of this break in the event of a cyber-incident is the inefficiency of locating stored media, which includes artifacts, log files, and other evidentiary traces [12]. In on-premises systems, the operating system dependably and centrally manages the consistent generation and storage of valuable evidence traces, and the information is well documented. The NCC FSWG also observed that in the cloud, "user based login and controls are typically in the application rather than in the operating system" [12]. Cloud technologies decouple user identification credentials from a corresponding physical workstation [12]. These idiosyncrasies of cloud architecture also create inefficiencies in data retrieval.

Not only do organizations need to consider the primary and secondary consequences of diverging from traditional operating security models, but they also must recognize that the cloud exposes them to new vulnerabilities and threats. Several cloud vulnerabilities are distinct and completely cloud-specific. Before designating a vulnerability as cloud-native, it needs to meet a set of criteria: a litmus test to decide whether a vulnerability should be classed as cloud-specific. Determining whether a vulnerability is cloud-native is helpful in discussions with reluctant managers about the relative risk of the cloud. Published by the Institute of Electrical and Electronics Engineers (IEEE), "Understanding Cloud Computing Vulnerabilities" provides a rubric that helps determine if vulnerabilities are cloud-specific [13]. According to the rubric, a vulnerability is cloud-specific if it:

- is intrinsic to or prevalent in a core cloud computing technology,
- has its root cause in one of NIST's essential cloud characteristics,
- is caused when cloud innovations make tried and tested security controls difficult or impossible to implement, or
- is prevalent in established state-of-the-art cloud offerings [13].

The first bullet refers to web applications, virtualization, and cryptography as the core cloud technologies [13]. The second bullet alludes to the five essential characteristics attributed to NIST: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service [13]. The third bullet identifies instances when on-premises system security practices do not transfer to the cloud, for example, the "cloud-negative controls" identified by [9] and elaborated upon in section 3.1, which covers the implementation of security controls. The fourth bullet describes the cloud as pushing present technological boundaries: if a vulnerability is identified in an advanced cloud offering, one that has not been previously identified, then it must be a cloud-specific vulnerability. Although there is some merit to the argument, the IEEE paper erroneously includes weak authentication implementations, which are not technically exclusive to the cloud [13]. Due to the flaw in this interpretation, this fourth indicator can only be seen as partially attributed, or a hybrid cloud-specific vulnerability.




Considering the vulnerabilities borne of cloud architectures, it is important to determine which cloud-specific threats could exploit them. All organizations must update their threat models to include cloud-generated threats. For example, bad actors are presently exploiting cloud services in their attacks by remaining anonymous inexpensively, decentralizing their operations across multiple cloud service providers, and provisioning superior computing power for a fraction of the cost with pay-as-you-go pricing. In 2016, the Cloud Security Alliance released the “Treacherous 12: Cloud Computing Top Threats in 2016,” which it compiled by surveying cloud industry experts. The Treacherous 12 ranks a dozen security concerns in order of severity (Table 1).

Table 1. Treacherous 12 Threats Summary. Adapted from [14].

Rank  Threat in Conventional Architectures     Threat in the Cloud
 1    Data breaches                            Data breaches
 2    Weak access management                   Weak access management
 3    (none)                                   Insecure APIs
 4    System and application vulnerabilities   System and application vulnerabilities
 5    Account hijacking                        Account hijacking
 6    Malicious insiders                       Malicious insiders
 7    Advanced Persistent Threats              Advanced Persistent Threats
 8    Data loss                                Data loss
 9    Insufficient due diligence               Insufficient due diligence
10    (none)                                   Nefarious use of cloud services
11    Denial of service                        Denial of service
12    (none)                                   Shared technology vulnerabilities

Note that of the 12 greatest estimated threats that experts say emanate from the cloud, only three point to truly cloud-specific vulnerabilities. Insecure application programming interfaces (APIs) (no. 3), nefarious use of cloud services (no. 10), and shared technology vulnerabilities (no. 12) are the cloud-specific threats that merit additional defense-in-depth security measures. While not cloud-specific, weak access management, account hijacking, malicious insiders, and insufficient due diligence form the next tier of cloud threats to address.

3  Risk Management Considerations

To operate securely in the cloud, an organization’s risk management must protect its assets against a range of undesirable events and their associated consequences. Cloud spills are a compelling example of such events. A data spill is any event involving the unauthorized transfer of confidential data from an accredited information system to one that is not accredited [8]. A cloud spill is a type of data spill, specifically one originating from a cloud environment. As early as 2013, the government had investigated data spillage specific to the cloud, as documented in a February 14, 2013, Department of Homeland Security (DHS) presentation, “Spillage and Cloud Computing.” Clearly, all migrating organizations, and especially agencies involved in national security matters, must effectively reduce cloud spills; however, no complete solution to this problem has yet been found. Instead of reacting to the aftereffects of cloud spills, migrating organizations need to determine how to anticipate and mitigate them.

An informed service model selection can facilitate better prioritization of the pertinent cloud services, logical layers, and underlying data structures. The initial benefit of focusing on service model selection is that doing so raises awareness of additional cloud security challenges, enabling the consumer to abate these issues through a combination of policy changes and contracts for additional security services. Data security considerations directly address the cloud’s information structures: the data to be stored or processed by computing processes. Application security considerations directly address the cloud’s application structures, which comprise both the application services used in building applications and the resulting cloud-deployed applications themselves [6]. Infrastructure security considerations directly address the cloud’s scalable and elastic infrastructure: the cloud service provider’s vast pooled core computing, networking, and storage resources. Configuration, management, and administrative security considerations directly affect the cloud’s metastructures, which enable the cohesive functioning of communication and interoperability protocols between the various layer interfaces; critical configuration and management settings are embedded in metastructure signals [6]. The merit of this lower-level understanding is a firmer comprehension of how standard cloud communication functions at different layers within the cloud’s shared-responsibility model. Accordingly, security practitioners map their organizational responsibilities to their service model selections, an approach that maximizes the information security signal-to-noise ratio by isolating only the actionable logical layers (see the sketch following this discussion).

Migrating organizations can begin by replacing applications with software as a service to abandon legacy code, followed by rebuilding cloud-native or refactoring backward-compatible application code with platform as a service, and finally by re-hosting (“lift and shift”) to infrastructure as a service those applications that will not benefit from either current or future cloud capabilities [15]. Software-as-a-service solutions lack customer customization and can lead to vendor lock-in by making the porting of data more challenging. One vendor lock-in mitigant is service-oriented architecture development, which produces applications that are treated like “services,” as in “anything as a service.” Once an application can be treated as a service, it should be able to port, or “plug,” into any cloud service provider seamlessly, tempering fears of large-scale changes to existing code bases for interoperability with the proprietary requirements of a new cloud service provider. Service-oriented architecture is easily reconfigurable.
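Returning to the mapping of responsibilities to service model selections, the minimal sketch below pairs each model with the logical layers the consumer must secure. The layer lists are simplified assumptions for illustration, not any provider’s published shared-responsibility matrix:

# A simplified, assumed mapping from service model to the logical layers
# a consumer must secure; real shared-responsibility matrices vary by
# provider and by contract.
CONSUMER_LAYERS = {
    "SaaS": ("data", "identity and access management"),
    "PaaS": ("data", "identity and access management", "application"),
    "IaaS": ("data", "identity and access management", "application",
             "runtime", "guest operating system", "virtual network config"),
}

def consumer_scope(service_model: str) -> tuple:
    """Return the logical layers the consumer must map controls onto."""
    return CONSUMER_LAYERS[service_model]

# Isolating only the actionable layers raises the security signal-to-noise
# ratio: a SaaS consumer covers two layers here, an IaaS consumer six.
for model in ("SaaS", "PaaS", "IaaS"):
    print(model, "->", ", ".join(consumer_scope(model)))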



The prevailing methods for either mitigating or responding to cloud data spills are insufficient in terms of consumer autonomy and cloud confidentiality. With regard to autonomy, cloud service providers have invented the concept of bring your own key (BYOK), which fosters a false sense of security about consumer-encrypted data. BYOK solutions imply that the consumer’s key is the sole key involved in encrypting and decrypting the customer’s data, which is not the case [16]. In fact, the consumer key is not a necessary input for the cloud service provider to access the consumer’s data (e.g., when responding to subpoena requests). In practice, the cloud service provider first uses its own key to encrypt the data and then uses the customer key to encrypt the cloud service provider’s key.

The DoD has recognized this deficiency of the BYOK construct and has secured an alternate remediation: cryptographic erase. The DoD credits cryptographic erase as “high-assurance data destruction … [in which] media sanitization is performed by sanitizing the cryptographic keys used to encrypt the data, as opposed to sanitizing the storage locations on media containing the encrypted data itself” [8]. Sanitization is the process of making data unrecoverable. Cryptographic erase achieves the goal of data destruction indirectly, by way of key erasure. It also accommodates “partial sanitization,” in which only a subset of the data is sanitized, though this requires a unique key for each subset [8]. Cryptographic erase paired with file deletion is more expedient than physically sanitizing a cloud service provider environment. However, cryptographic erase is only effective for encrypted data. The DoD therefore explicitly tasks its components and agencies with ensuring that all DoD data at rest is encrypted, acknowledging that any unencrypted data is data at risk. Furthermore, the DoD must retain exclusive control of both the encryption keys and key management; this enables the DoD to perform high-assurance data destruction unilaterally, without any cloud service provider cooperation [8].

However, cryptographic erase is not a panacea. It is an effective tool for resolving data spills caused by human error, but it would likely prove ineffective against data spills initiated by malicious code, because it cannot contain a running process while data is still in use. Additionally, cryptographic erase is only effective in infrastructure-as-a-service (and some platform-as-a-service) deployments in which the consumer determines exactly how the data is stored. Although the DoD has been able to resort to cryptographic erase as a reactive measure, private enterprise consumers now aware of the BYOK misnomer should focus their attention on prevention. Customer misconfiguration prevention begins when the consumer directly maps security controls to the logical layers for which they are explicitly responsible as a result of their service model selection.
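The BYOK and cryptographic-erase mechanics described above can be sketched in a few lines. The following illustration assumes the open-source Python cryptography package; the variable names and flow are our own, not any provider’s BYOK or key-management API:

# Sketch of envelope encryption as BYOK schemes use it, and of
# cryptographic erase by key destruction. Illustrative only.
from cryptography.fernet import Fernet, InvalidToken

# The provider encrypts the data with its own data-encryption key (DEK).
provider_dek = Fernet.generate_key()
ciphertext = Fernet(provider_dek).encrypt(b"consumer records")

# The customer key (BYOK) merely wraps the provider's DEK.
customer_key = Fernet.generate_key()
wrapped_dek = Fernet(customer_key).encrypt(provider_dek)
assert Fernet(customer_key).decrypt(wrapped_dek) == provider_dek

# The provider still holds the DEK, so it can decrypt WITHOUT the
# customer key -- the false sense of security described above.
assert Fernet(provider_dek).decrypt(ciphertext) == b"consumer records"

# Cryptographic erase: sanitize the key, not the storage media. Once
# every copy of the DEK is destroyed, the ciphertext is unrecoverable.
provider_dek = None  # stands in for high-assurance key destruction
try:
    Fernet(Fernet.generate_key()).decrypt(ciphertext)  # any other key fails
except InvalidToken:
    print("ciphertext is cryptographically erased")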

3.1  Implementation of Security Controls

When consumers transition from on-premises systems, they will find gaps between their existing security policies and the contracted terms and conditions of an executed service-level agreement. Wide variability exists among cloud service providers with respect to defined terms and related metrics [17], so consumers should focus on the definitions used in each agreement. The goals of each specific cloud project, service model, and cloud service provider platform are the critical inputs in determining the additional countermeasures the project should integrate.


Organizations should generate their requirements, map their architecture, and conclude by diagnosing and then prioritizing the remaining security gaps of the cloud service provider [6]. A 2016 SANS paper argues that any organization’s security architect must be able to discern how on-premises networks differ from virtualized architectures. The paper categorizes security controls into cloud-positive, cloud-negative, and cloud-neutral controls, the three tiers corresponding to the ease of implementation in the cloud [9]. Building on this awareness, SANS recommends that the security architect direct greater attention to the cloud-negative controls, which emerge when implementation is more difficult or cost-prohibitive in the cloud [9]. The paper specifically identifies logging, boundary defense, and incident response management as cloud-negative controls.

NIST 800-53 is heralded as an exhaustive set of security controls. However, the first revision of NIST 800-53, published in December 2006, predates widespread cloud adoption and was better suited to on-premises environments. FedRAMP (the Federal Risk and Authorization Management Program), a 2011 federal policy, details the minimum security authorization procedures with which an agency must comply when contracting cloud services from a cloud service provider. FedRAMP was specifically drafted to direct federal cloud computing acquisitions, with the goals of accelerating the adoption of cloud services and enforcing standardized cybersecurity requirements government-wide. Cloud requirements for the DoD exceed those for other federal government agencies; for that reason, the DoD issued the Cloud Computing Security Requirements Guide [8], which describes FedRAMP+. FedRAMP+ adds DoD-specific security controls to fulfill the DoD’s mission requirements and is the cloud-computing-customized approach to the NIST 800-53 security controls. These controls “were selected primarily because they address issues such as the Advanced Persistent Threat (APT) and/or Insider Threat, and because the DoD … must categorize its systems in accordance with CNSSI 1253, beginning with its baselines, and then tailoring as needed” [8]. CNSSI 1253 is Committee on National Security Systems Instruction No. 1253, Security Categorization and Control Selection for National Security Systems [18]. A comparison of security controls indicates that 32 CNSSI 1253 controls were added to the NIST SP 800-53 moderate baseline and 88 NIST 800-53 moderate controls were subtracted from the CNSSI 1253 moderate baseline [8]. Non-DoD entities seeking security controls that surpass federal government agency standards may also refer to CNSSI 1253 for more granular control options. Additionally, the Cloud Controls Matrix, published by the CSA, is a rational catalog to begin with because it maps its controls side by side with many other control catalogs for easy comparison; a toy version of such a comparison is sketched below.
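As a toy illustration of baseline comparison, control catalogs can be treated as sets of control identifiers and diffed directly. The identifiers below are invented examples, not the actual contents of the NIST SP 800-53 or CNSSI 1253 baselines:

# Comparing two control baselines as sets of control IDs, in the spirit
# of the CNSSI 1253 vs. NIST SP 800-53 moderate-baseline comparison
# above. The IDs are made up for the example.
nist_800_53_moderate = {"AC-2", "AU-6", "CP-9", "IR-4", "SC-7"}
cnssi_1253_moderate = {"AC-2", "AU-6", "IR-4", "SC-7", "SC-8", "SI-4"}

added = sorted(cnssi_1253_moderate - nist_800_53_moderate)       # CNSSI-only
subtracted = sorted(nist_800_53_moderate - cnssi_1253_moderate)  # NIST-only
shared = sorted(nist_800_53_moderate & cnssi_1253_moderate)

print(f"added by CNSSI 1253: {added}")           # ['SC-8', 'SI-4']
print(f"dropped from NIST baseline: {subtracted}")  # ['CP-9']
print(f"common controls: {shared}")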

4  Transformational Migration

Ultimately, there is a viable solution for the challenges that migrating organizations face when transitioning to a robust, secure cloud environment.




However, the solution requires those organizations to reorganize people and processes to minimize the existing gaps between how traditional applications operate and how cloud computing applications are configured. It also requires organizations to incorporate broad use of encryption, digital forensic incident-response processes tailored to cloud architectures, practicable workarounds that address cloud-negative security controls, and continuous mandatory cloud training. Transformational migration accounts for these requirements by better aligning processes with how the cloud actually functions.

Transformational migration mandates the collocation of relevant data sets through secure application programming interface calls. Additionally, it supports extending the perimeter from the network boundary to the boundary of specific chunks of data. Extending the perimeter enables the migrating organization to leverage metadata tagging to administer stricter enforcement of file authorizations and legal compliance; a minimal sketch of such tag-based enforcement follows at the end of this section. Transformational migration mandates security through the complete data security lifecycle: creating, storing, processing, sharing, archiving, and destroying [6].

“Deliver Uncompromised” is a new strategy for addressing cybersecurity lapses that extends to DoD contractors [19]. Deliver Uncompromised encourages adding security assessment attainment levels to the awarding of contracts, alongside traditional cost and performance considerations. This supply-chain risk management strategy holds that the cloud can contribute to protecting the DoD supply chain, specifically by encouraging contractors “to shift information systems and applications to qualified, secure cloud service providers” [19]. The strategy can also be applied to non-DoD supply chains.
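The tag-based enforcement mentioned above can be sketched as follows. The tag vocabulary and clearance table are invented for illustration; production implementations would rely on a provider’s object-metadata and policy services:

# Toy data-boundary enforcement via metadata tagging: the perimeter
# follows the data chunk rather than the network.
from dataclasses import dataclass, field

@dataclass
class DataChunk:
    name: str
    tags: dict = field(default_factory=dict)  # e.g. {"classification": "secret"}

# Classification labels each role may read; unknown roles get nothing.
CLEARANCES = {
    "analyst": {"public", "internal"},
    "security_officer": {"public", "internal", "secret"},
}

def may_read(role: str, chunk: DataChunk) -> bool:
    """Default-deny: untagged data is treated as most restricted."""
    label = chunk.tags.get("classification", "secret")
    return label in CLEARANCES.get(role, set())

doc = DataChunk("migration-plan.docx", {"classification": "secret"})
assert not may_read("analyst", doc)       # enforcement travels with the file
assert may_read("security_officer", doc)
print("tag-based authorization enforced")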

5  Conclusion

Transformational migration is a strategy to overcome the well-worn pattern in which human misunderstandings drive cloud misconfigurations that eventually become cloud data spills requiring a digital forensic incident response. A better understanding of how the service model relates to the intent of the application can reduce the risk of customer misconfigurations and thereby produce a more robust cybersecurity risk posture. Migrating organizations will also need to transition application professionals to a new dynamic: a transformational workforce with the dexterity to remediate issues at multiple cloud logical layers. The DevSecOps model, composed of both newly hired and retrained staff, is an integrated team of problem solvers with diverse experience across the application development, engineering, and security disciplines. DevSecOps teams are tasked with developing and continuously tuning applications by addressing security at multiple layers and across the complete data life cycle. The DevSecOps model is endorsed by the Defense Innovation Board for its comprehensive resolution of existing misalignments between information security professionals and cloud technologies [3].

Using the recommendations of transformational migration as a guide, DevSecOps teams will be able to implement critical risk management controls more effectively while avoiding detrimental misconfigurations when migrating to the cloud. The research presented in this paper is part of Michael Atadika’s thesis, conducted at and published for public release by the Naval Postgraduate School [20].

6  References

[1] Statista [Internet]. [date unknown]. Amazon Web Services: quarterly revenue 2014-2018. Hamburg (Germany): Statista; [cited 2019 Feb 22]. Available from: https://www.statista.com/statistics/250520/forecast-of-amazon-web-services-revenue

[2] Gregg, A [Internet]. 2017 Jun 1. Booz Allen Hamilton employee left sensitive passwords unprotected online. Washington (DC): Washington Post; [cited 2018 Mar 2]. Available from: https://www.washingtonpost.com/business/capitalbusiness/government-contractor-left-sensitive-passwords-unprotected-online/2017/06/01/916777c6-46f8-11e7-bcde-624ad94170ab_story.html?utm_term=.6cad14ff8b95

[3] [DoD] Department of Defense [Internet]. [updated 2018 Oct 2; cited 2019 Mar 14]. Defense Innovation Board do’s and don’ts for software. Washington (DC): Department of Defense. Available from: https://media.defense.gov/2018/Oct/09/2002049593/-1/-1/0/DIB_DOS_DONTS_SOFTWARE_2018.10.05.PD

[4] Bommadevara, N., Del Miglio, A., Jansen, S [Internet]. 2018. Cloud adoption to accelerate IT modernization. New York (NY): McKinsey Digital; [cited 2018 May 18]. Available from: https://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/cloud-adoption-to-accelerate-it-modernization

[5] Odell, L., Wagner, R., & Weir, T [Internet]. 2015. Department of Defense use of commercial cloud computing capabilities and services. Alexandria (VA): Institute for Defense Analyses; [cited 2018 Aug 23]. Available from: http://www.dtic.mil/dtic/tr/fulltext/u2/1002758.pdf

[6] [CSA] Cloud Security Alliance [Internet]. 2017. Security guidance: for critical areas of focus in cloud computing v4.0. Seattle (WA): Cloud Security Alliance; [cited 2018 Apr 10]. Available from: https://cloudsecurityalliance.org/guidance/#_overview

[7] van Eijk, P H J [Internet]. 2018. Cloud migration strategies and their impact on security and governance. Seattle (WA): Cloud Security Alliance; [cited 2019 Mar 14]. Available from: https://blog.cloudsecurityalliance.org/2018/06/29/cloud-migration-strategies-impact-on-security-governance/



[8] [DISA] Defense Information Systems Agency [Internet]. 2017 Mar 6. Department of Defense cloud computing security requirements guide, version 1, release 3. Washington (DC): Department of Defense; [cited 2018 Apr 10]. Available from: https://www.complianceweek.com/sites/default/files/department_of_defense_cloud_computing_security_requirements_guide.pdf

[9] SANS Institute [Internet]. 2016. Implementing the critical security controls in the cloud. North Bethesda (MD): SANS Institute; [cited 2017 Oct 20]. Available from: https://www.sans.org/reading-room/whitepapers/critical/implementing-critical-security-controls-cloud-36725

[10] Clarke, G [Internet]. 2015 Apr 13. Self preservation is AWS security’s biggest worry, says gros fromage. London (UK): The Register; [cited 2017 Oct 9]. Available from: https://www.theregister.co.uk/2015/04/13/aws_security_sleepless_nights/

[11] [CIA] Central Intelligence Agency [Internet]. 2014 Dec 17. CIA creates a cloud: an interview with CIA’s chief information officer, Doug Wolfe, on cloud computing at the agency. Washington (DC): Central Intelligence Agency; [cited 2018 Mar 8]. Available from: https://www.cia.gov/news-information/featured-story-archive/2014-featured-story-archive/cia-creates-a-cloud.html

[12] [NCC FSWG] NIST Cloud Computing Forensic Science Working Group [Internet]. 2014. NIST cloud computing forensic science challenges, draft NISTIR 8006. Gaithersburg (MD): NIST; [cited 2018 May 7]. Available from: https://csrc.nist.gov/publications/detail/nistir/8006/draft


[13] Grobauer, B., Walloschek, T., & Stöcker, E. 2011. Understanding cloud computing vulnerabilities. IEEE Sec & Pri [Internet]. [cited 2017 Oct 15]. 9(2), 50-57. Available from: https://doi.org/10.1109/MSP.2010.115

[14] [CSA] Cloud Security Alliance [Internet]. 2016. The treacherous 12: cloud computing top threats in 2016. Seattle (WA): Cloud Security Alliance; [cited 2017 Nov 1]. Available from: https://downloads.cloudsecurityalliance.org/assets/research/top-threats/Treacherous-12_Cloud-Computing_Top-Threats.pdf

[15] Woods, J [Internet]. 2011. Five options for migrating applications to the cloud: rehost, refactor, revise, rebuild or replace. Stamford (CT): Gartner; [cited 2018 Aug 22]. Available from: https://gartnerinfo.com/futureofit2011/MEX38L_A2%20mex38l_a2.pdf

[16] Rich, P [Internet]. 2017. SaaS encryption: lies, damned lies, and hard truths. Redmond (WA): Microsoft; [cited 2019 Mar 11]. Available from: https://channel9.msdn.com/Events/Ignite/Microsoft-Ignite-Orlando-2017/BRK2392

[17] [CIO & CAO] Chief Information Officer Council & Chief Acquisition Officers Council [Internet]. 2012. Creating effective cloud computing contracts for the federal government: best practices for acquiring IT as a service. Washington (DC): Chief Information Officer Council & Chief Acquisition Officers Council; [cited 2018 Dec 07]. Available from: https://www.cio.gov/2012/02/24/cloud-computing-update-best-practices-for-acquiring-it-as-a-service/

[18] [CNSS] Committee on National Security Systems [Internet]. 2014. Security categorization and control selection for national security systems, CNSSI No. 1253. Washington (DC): Department of Defense; [cited 2018 May 21]. Available from: http://www.dss.mil/documents/CNSSI_No1253.pdf

[19] Nakashima, E., Sonne, P [Internet]. 2018 Aug 13. Pentagon is rethinking its multibillion-dollar relationship with U.S. defense contractors to boost supply chain security. Washington (DC): Washington Post; [cited 2018 Aug 13]. Available from: https://www.washingtonpost.com/world/national-security/the-pentagon-is-rethinking-its-multibillion-dollar-relationship-with-us-defense-contractors-to-stress-supply-chain-security/2018/08/12/31d63a06-9a79-11e8-b60b-1c897f17e185_story.html?utm_term=.60664aebdfb8

[20] Atadika, M. Applying U.S. military cybersecurity policies to cloud architectures [master’s thesis]. Monterey (CA): Naval Postgraduate School. 2018. 102 p.

ISBN: 1-60132-499-5, CSREA Press ©

Author Index

Atadika, Michael - 15
Burke, Karen - 15
Cho, Hyeyoung - 3
Hahm, Jaegyoon - 3
Park, Ju-Won - 3
Ree, Chang Hee - 3
Rowe, Neil - 15
Shin, Min-Su - 3
Yang, Lan - 9