

BIG DATA AND HIGH PERFORMANCE COMPUTING

Advances in Parallel Computing

This book series publishes research and development results on all aspects of parallel computing. Topics may include one or more of the following: high-speed computing architectures (Grids, clusters, Service Oriented Architectures, etc.), network technology, performance measurement, system software, middleware, algorithm design, development tools, software engineering, services and applications.

Series Editor: Professor Dr. Gerhard R. Joubert

Volume 26

Recently published in this series:

Vol. 25. M. Bader, A. Bode, H.-J. Bungartz, M. Gerndt, G.R. Joubert and F. Peters (Eds.), Parallel Computing: Accelerating Computational Science and Engineering (CSE)
Vol. 24. E.H. D'Hollander, J.J. Dongarra, I.T. Foster, L. Grandinetti and G.R. Joubert (Eds.), Transition of HPC Towards Exascale Computing
Vol. 23. C. Catlett, W. Gentzsch, L. Grandinetti, G. Joubert and J.L. Vazquez-Poletti (Eds.), Cloud Computing and Big Data
Vol. 22. K. De Bosschere, E.H. D'Hollander, G.R. Joubert, D. Padua and F. Peters (Eds.), Applications, Tools and Techniques on the Road to Exascale Computing
Vol. 21. J. Kowalik and T. Puźniakowski, Using OpenCL – Programming Massively Parallel Computers
Vol. 20. I. Foster, W. Gentzsch, L. Grandinetti and G.R. Joubert (Eds.), High Performance Computing: From Grids and Clouds to Exascale
Vol. 19. B. Chapman, F. Desprez, G.R. Joubert, A. Lichnewsky, F. Peters and T. Priol (Eds.), Parallel Computing: From Multicores and GPU's to Petascale
Vol. 18. W. Gentzsch, L. Grandinetti and G. Joubert (Eds.), High Speed and Large Scale Scientific Computing
Vol. 17. F. Xhafa (Ed.), Parallel Programming, Models and Applications in Grid and P2P Systems
Vol. 16. L. Grandinetti (Ed.), High Performance Computing and Grids in Action
Vol. 15. C. Bischof, M. Bücker, P. Gibbon, G.R. Joubert, T. Lippert, B. Mohr and F. Peters (Eds.), Parallel Computing: Architectures, Algorithms and Applications

Volumes 1–14 published by Elsevier Science. ISSN 0927-5452 (print) ISSN 1879-808X (online)

Big Data and High Performance Computing

Edited by

Lucio Grandinetti
Italy

Gerhard Joubert
Netherlands/Germany

Marcel Kunze
Germany

and

Valerio Pascucci
USA

Amsterdam • Berlin • Tokyo • Washington, DC

© 2015 The authors and IOS Press.
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without prior written permission from the publisher.

ISBN 978-1-61499-582-1 (print)
ISBN 978-1-61499-583-8 (online)
Library of Congress Control Number: 2015952250

Publisher
IOS Press BV
Nieuwe Hemweg 6B
1013 BG Amsterdam
Netherlands
fax: +31 20 687 0019
e-mail: [email protected]

Distributor in the USA and Canada
IOS Press, Inc.
4502 Rachael Manor Drive
Fairfax, VA 22032
USA
fax: +1 703 323 3668
e-mail: [email protected]

LEGAL NOTICE
The publisher is not responsible for the use which might be made of the following information.

PRINTED IN THE NETHERLANDS




Preface

At the International Research Workshop on Advanced High Performance Computing Systems held in Cetraro in July 2014 a number of topics related to the concepts and processing of large data sets were discussed. In this volume a selection of contributions is published that on the one hand cover the concept of Big Data and on the other the processing and storage of very large data sets.

The advantages offered by Big Data have been expounded in the popular press over recent years. Reports on results obtained by analysing large data sets covering various fields, such as marketing, medicine, finance, etc., led to the claim that any important real-world problem could be solved if sufficient data is available. This claim ignores the fact that a deeper understanding of the fundamental characteristics of a problem is essential in order to ensure that an obtained result is valid and correct.

The collection, processing and storage of large data sets must always be seen in relative terms, compared to the communication, processing and storage capabilities of the computing platforms available at a given point in time. Thus the problem Herman Hollerith faced with the collection, processing and archiving of the data collected for the census of 1890 in the USA was a Big Data problem: data sets that were very large compared to the processing and storage capabilities of his punch-card tabulating machine had to be processed. The collected data were analysed with the aim of detecting different patterns in order to suggest answers to diverse questions. Although researchers today have a large selection of extremely powerful computers available, they are faced with similar problems. The data sets that must be processed, analysed and archived are vastly bigger. Thus the requirement to process large data sets with available equipment has not changed fundamentally over the last century.

The way scientific research is conducted has, however, changed through the advent of high performance and large scale computing systems. With the introduction of the Third Paradigm of Scientific Research, computer simulations have replaced (or complemented) experimental and theoretical procedures. Simulations routinely create massive data sets that need to be analysed in detail, both in situ and in post-processing. The use of available data for a purpose other than that for which they were originally collected introduced the so-called Fourth Paradigm of Scientific Research. As mentioned above, this approach was already used by Herman Hollerith in the various analyses of the collected census data. Recently this approach, popularly called Big Data Analytics, received widespread attention. Many such analyses detected patterns in data sets that led to significant new insights. Such successes prompted the collection of data on a massive scale in many areas. Many of these massive data collections are created without necessarily having specific directions from a well-identified experiment. In astronomy, for example, massive sky surveys replace individual, targeted observation and the actual discoveries are achieved at a later stage with pure data analysis tools. In other words, the construction of massive data collections has become an independent, preliminary stage of the scientific process. The assumption is that accumulating enough data will automatically result in a repository rich in new events or features of interest.
Therefore, it is hoped that a proper data mining process will find such new elements and lead to novel scientific insight. While in principle this is not a guaranteed outcome, experience shows that indeed




collecting massive amounts of data has systematically led to new discoveries in astronomy, genomics, climate, and many other scientific disciplines. In each case, the validity and correctness of such new discoveries must, however, be verified. Such verifications are essential in order to substantiate or repudiate scientific theories, such as, for example, the possible cause of an illness, the existence of a particle or the cause of climate change. The papers selected for publication in this book discuss fundamental aspects of the definition of Big Data as well as considerations from practice where complex data sets are collected, processed and stored. The concepts, problems, methodologies and solutions presented are of much more general applicability than may be suggested by the particular application areas considered. In this sense these papers are also a contribution to the now emerging field of Data Science. The editors hope that readers can benefit from the contributions on theoretical and practical views and experiences included in this book. The editors are especially indebted to Dr Maria Teresa Guaglianone for her valuable assistance, as well as to Microsoft for making their CMT system available.

Lucio Grandinetti, Italy
Gerhard Joubert, Netherlands/Germany
Marcel Kunze, Germany
Valerio Pascucci, USA




Contents

Preface ..... v
  Lucio Grandinetti, Gerhard Joubert, Marcel Kunze and Valerio Pascucci

Fundamentals and General Concepts

Big Data: Insight and the Scientific Method ..... 3
  Gerhard R. Joubert

Programming Visual and Script-Based Big Data Analytics Workflows on Clouds ..... 18
  Loris Belcastro, Fabrizio Marozzo, Domenico Talia and Paolo Trunfio

Big Data from Scientific Simulations ..... 32
  John Edwards, Sidharth Kumar and Valerio Pascucci

Towards a Comprehensive Set of Big Data Benchmarks ..... 47
  Geoffrey C. Fox, Shantenu Jha, Judy Qiu, Saliya Ekanayake and Andre Luckow

Technologies, Processing and Storage

Big Data Technologies ..... 69
  Marcel Kunze

Cori: A Pre-Exascale Supercomputer for Big Data and HPC Applications ..... 82
  Nicholas J. Wright, Sudip S. Dosanjh, Allison K. Andrews, Katerina B. Antypas, Brent Draney, R. Shane Canon, Shreyas Cholia, Christopher S. Daley, Kirsten M. Fagnan, Richard A. Gerber, Lisa Gerhardt, Larry Pezzaglia, Prabhat, Karen H. Schafer and Jay Srinivasan

Architectural Implications for Exascale Based on Big Data Workflow Requirements ..... 101
  René Jäkel, Ralph Müller-Pfefferkorn, Michael Kluge, Richard Grunzke and Wolfgang E. Nagel

Case Studies

Networking Materials Data: Accelerating Discovery at Experimental Facilities ..... 117
  Ian Foster, Rachana Ananthakrishnan, Ben Blaiszik, Kyle Chard, Ray Osborn, Steven Tuecke, Michael Wilde and Justin Wozniak

Big Data Research at DKRZ – Climate Model Data Production Workflow ..... 133
  Michael Lautenschlager, Panagiotis Adamidis and Michael Kuhn

Subject Index ..... 157

Author Index ..... 159


Fundamentals and General Concepts


Big Data and High Performance Computing L. Grandinetti et al. (Eds.) IOS Press, 2015 © 2015 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-61499-583-8-3


Big Data: Insight and the Scientific Method

Gerhard R. JOUBERT 1
Clausthal University of Technology, Germany

Abstract. The general concept of the scientific method or procedure consists in systematic observation, experiment and measurement, and the formulation, testing and modification of hypotheses. In many cases a hypothesis is formulated in the form of a model, for example a mathematical or simulation model. The correctness of a solution of a problem produced by a model is verified by comparing it with collected data. Alternatively, observational data may be collected without a clear specification that the data could also apply to the solution of other, unforeseen problems. In such cases data analytics is used to extract relationships from and detect structures in data sets. In accordance with the scientific method, the results obtained can then be used to formulate one or more hypotheses and associated models as solutions for such problems. This approach allows for ensuring the validity of the solutions obtained. The results thus obtained may lead to a deeper insight into such problems and can represent significant progress in scientific research. The increased interest in so-called Big Data resulted in a growing tendency to consider the structures detected by analysing large data sets as solutions in their own right. A notion is thus developing that the scientific method is becoming obsolete. In this paper it is argued that data, hypotheses and models are essential to gain deeper insights into the nature of the problems considered and to ensure that plausible solutions were found. A further aspect to consider is that the processing of increasingly larger data sets results in an increased demand for HTC (High Throughput Computing) in contrast to HPC (High Performance Computing). The demand for HTC platforms will impact the future development of parallel computing platforms.
Keywords. Big data, purposed data, data pools, scientific method, high throughput computing

Introduction

From time to time new buzzwords emerge focusing attention on a particular information technology field. In most cases such terms do not depict something new; they rather focus public attention on existing technologies [1]. Such newly focussed attention accentuates the possible advantages that could be offered by these technologies. The increased awareness also impacts the distribution of research funds, as projects that somehow focus on these (apparently) new areas are more readily supported. Recent examples of this phenomenon are buzzwords such as Expert Systems, Grids, Clouds, Internet of Things, and of course Big Data [2]. Although the term Big Data was already used towards the end of the 1990s [3], it was catapulted into the limelight during the first decade of this century with reports on the advantages that can be achieved through Big Data analytics.

1 Corresponding Author: Lange-Feld-Str. 45, 30559 Hanover, Germany. E-mail: [email protected]


Thus the report that Google could detect regional influenza outbreaks in the USA seven to ten days sooner than the Centres for Disease Control and Prevention by monitoring increased search term activity for phrases associated with influenza symptoms, made people realise the potential of the analysis of large data sets [4]. The impression was created that the mere analysis of data to detect patterns could solve problems of well-nigh any complexity. How large such Big Data repositories should be was not specified. In subsequent years the Google method returned erroneous results, underlining the fact that a fundamental understanding of the processes involved in, for example, the spread of diseases, is essential. This is an excellent example of how Big Data analytics can lead to unforeseen correlations, but not necessarily to the underlying causality. This fact did not, however, reduce the belief in the huge benefits that can be gained with Big Data analytics. The view that the larger the data set, the more accurate the obtained results will be, gained in popularity, with the ultimate goal being the establishment of fully automated problem-solving mechanisms based on Big Data analysis methods.

The term Big Data remained vague. In spite of the lack of a definition the term became popular and widely used. It is today to be found in international research programs, as a theme for workshops and conferences, journals and books, and it is used in marketing novel software products for handling complex and large data processing tasks. In many cases Big Data is used to indicate that large datasets are involved, but with no clear indication of how large such a repository must be in order to be called "big". In other cases the term is used to indicate complex data sets without the complexity being defined. The result of all this is that, when given a data repository, there exists no clear way to determine whether this particular data set could be considered to be Big Data.

The development of more powerful parallel computers with ever larger memories has made the analysis of increasingly large data sets possible. These analyses result in the detection of hitherto unknown patterns or relationships. Such new insights obtained from large and integrated data sets continue to support the euphoria about the advantages offered by Big Data. Thus new insights into, for example, possible causes of illnesses or economic development and climate change processes, obtained by analysing very large and diverse integrated data repositories, resulted in significant progress towards a better understanding of many real-world phenomena. A conjecture that resulted from this euphoria is that, if one can collect, store and analyse all pertinent data about any phenomenon, one could gain a full understanding of the particular phenomenon, thus obviating the need for further scientific research. The age-old standard approach used in scientific research, the so-called scientific method, appears to have lost its significance [5, 6]. There are three fundamental problems with this view:
• It may prove impossible, even in future, to collect, store and analyse the very large resulting data sets in an acceptable time frame and with affordable energy consumption.
• Hypotheses combined with models, as part of the scientific method, offer an instrument to gain insight into the nature of a phenomenon, thus reducing the need for collecting excessive amounts of data.
• The analysis of data (observations) may lead to erroneous results if no additional method to verify their validity is employed.
The notion of considering correlations between observations as solutions is not new and certainly not a result emanating from large data sets. Thus, for example, in ancient


times people observed, i. e. collected Big Data, that the sun, moon, planets and stars moved around the earth. Using a coordinate system centred at the centre of the earth the observed movements of the celestial objects, including the planets, correlated with the patterns obtained with Big Data analysis. The cause of the rather strange movements of the planets could, however, not be explained. Only the scientific approach, establishing the sun as the centre of the coordinate system that seemed to contradict the observed Big Data, resulted in a correct description of the celestial objects’ movements. The correct solution that the sun is at the centre of the solar system took ages to be fully accepted. Many examples of correlations, i. e. detected patterns, indicating causes that are not true solutions of the problems considered, exist. Wrong conclusions based on detected correlations are thus inherent to data analysis, regardless of the size of the data sets. Enlarging the size of a data set does not protect against detecting irrelevant patterns or patterns that are not a true solution of the problem considered. It is this inherent weakness of superficial interpretations of observed patterns without verification that leads to erroneous conclusions and generalisations. An example is the result obtained that people who drink lots of coffee are more liable to contract lung cancer. It was later realised that a large percentage of coffee drinkers also smoke and it is the latter habit that is the real cause. Beliefs that are based on erroneous results detected through the analysis of data (observations) are often stronger than the ability to accept true results obtained through scientific research. The purpose with this paper is to contribute to the discussion about what is meant within the framework of Informatics by the term Big Data. The main focus is to determine fundamental differences between regular data and Big Data, thus enabling easy classification. The analytics and storage techniques used for Big Data analyses will not be considered in further detail. The reader is referred to the extensive literature on these topics.
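The coffee-and-smoking example above can be made concrete with a small synthetic experiment. The sketch below uses invented probabilities: a hidden confounder (smoking) raises both the likelihood of heavy coffee drinking and the likelihood of illness, while coffee itself has no effect in the model; the overall correlation between coffee drinking and illness is nevertheless clearly positive.

```python
# Illustrative only: synthetic data with invented parameters, not real epidemiology.
# A hidden confounder (smoking) makes coffee drinking correlate with illness
# even though coffee has no causal effect in this model.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

smoker = rng.random(n) < 0.3                              # 30% smoke
# Smokers in this toy model are also more likely to drink a lot of coffee.
heavy_coffee = rng.random(n) < np.where(smoker, 0.7, 0.3)
# Illness probability depends on smoking only, never on coffee.
ill = rng.random(n) < np.where(smoker, 0.05, 0.005)

def corr(a, b):
    return np.corrcoef(a.astype(float), b.astype(float))[0, 1]

print("corr(coffee, illness), all subjects :", round(corr(heavy_coffee, ill), 3))
print("corr(coffee, illness), smokers only :", round(corr(heavy_coffee[smoker], ill[smoker]), 3))
print("corr(coffee, illness), non-smokers  :", round(corr(heavy_coffee[~smoker], ill[~smoker]), 3))
```

Within each smoking stratum the correlation is close to zero; the apparent pattern disappears once the confounder is controlled for, which is exactly the kind of verification step the scientific method demands.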

1. Advent of Big Data

As a result of increasing processing power, storage sizes and communication speeds, it has been possible to process ever larger data sets since the first digital computers became available. Such very large data collections are, for example, recorded as part of a variety of scientific research projects, such as SETI (Search for Extraterrestrial Intelligence), the LHC (Large Hadron Collider), producing about 15 petabytes of data per year, and the planned SKA (Square Kilometre Array), which will produce about one petabyte of stored data per day by ca. 2025. The data collected within the framework of planned experiments are usually well structured and the data elements clearly defined. An important aspect is that the accuracy of the collected data must meet the requirements of the problem considered and is thus known. These facts make it possible to use the recorded data for various analyses to validate or invalidate hypotheses. An alternative source of data in practice is data that is collected in order to build and maintain records of some sort. Typical examples are:
• Preferences expressed explicitly or implicitly by members of social networks or users searching for particular products or services in the Internet
• Medical records of patients, including their diagnosed ailments, the treatment and their responses to the treatment




• Geodata, including satellite images, maps of infrastructures such as transport systems, electricity networks, etc.
• Weather data
• Multimedia data (unstructured data), including sound recordings, images, videos.

The data thus collected can be interpreted in various ways and be used to achieve very different objectives. In the first instance the data can be interpreted according to the goal with which it was originally recorded. Thus a medical record will normally be interpreted as a description of the diagnosis and treatment of a particular patient. The file produced describes the medical history of the patient. An alternative view of recorded data, often widely different from the original goal, emerges when the collected data is analysed to detect new structures and relationships. Thus a collection of patient records may be analysed with the view to determine the effectiveness of a particular treatment for an illness. Or the habits (exercise, eating, smoking, etc.) of patients may be analysed and correlated with their diagnosed maladies. Going a step further, various data repositories from different fields may be integrated in order to search for higher-level, more complex structures that can be interpreted as problem solutions. This is, for example, often the case with Geodata projects. Thus city planning data may be combined with actual installation maps of water, gas and sewerage networks in order to ensure that construction work will not interrupt essential services. A fundamental problem is in how far the use of data that was collected with a particular purpose in mind meets the quality requirements of another problem than the original one.

1.1. Definitions

As was mentioned already the term "Big Data" was initially used to indicate a large set of collected data. No clear definition of what Big Data is was formulated. Thus, given a set of data, it is impossible to decide whether it is actually Big Data. The claims that Big Data can offer great advantages compared to other solution methods, such as the scientific method, necessitated clarification of what differentiates Big Data from normal data [5, 6]. A number of authors subsequently attempted to formulate acceptable definitions. The most commonly used are:

1. The 3Vs definition: Volume, Variety and Velocity
2. All data available in an organisation
3. Large data sets compared to memory size of topical computer systems:
   • Yesterday: Terabytes (TB)
   • Today: Petabytes (PB)
   • Tomorrow: Exabytes
   • Then: Zettabytes, Yottabytes, etc.

The most popular, according to its use in the literature, is the 3Vs definition [7, 8]. The 3Vs (volume, variety and velocity) are considered as three defining properties or dimensions of big data. Volume refers to the amount of data, variety refers to the number of types of data and velocity refers to the speed of data processing. According to the 3Vs model, the challenges of big data management result from the expansion of all three properties, rather than just the amount of data (volume) to be managed. A problem with this is that none of the three components are clearly defined. It is not clear what is meant by volume, variety or velocity or how and according to which norms these are to be measured in order to compare different data sets.


It is thus not possible to use this definition to determine whether a given data repository actually represents a Big Data set. Extensions of the 3Vs system to 4Vs and 5Vs have been published. These extensions do not solve the fundamental problem with the 3Vs definition. These will thus not be discussed in more detail here [9].

An alternative is to consider all data available in an organisation as Big Data. This implies that all data available in a large company, such as Walmart, Amazon, Facebook or Google, as well as all data available in the small shop round the corner, will be Big Data. In spite of this objection this definition is usable in practical situations. It does allow for all possible analyses to be applied to the data available in the particular organisation, whether big or small. This shows the dilemma with the term BIG Data, as the size of the data set does not necessarily need to be large. On the other hand, if one considers all data on earth as belonging to some organisation, whether private, governmental or corporate, then these will all be classified as Big Data, resulting in nearly all existing data being classified as Big Data. This does not make much sense as then the distinction between Big Data and normal Data becomes irrelevant.

The third definition is perhaps the one coming closest to a definition that could form the basis of measurement and comparison. Thus one could define Big Data as being of size:
• ≧ 1 Terabyte up to year n1
• ≧ 1 Petabyte from year (n1+1) to year n2
• ≧ 1 Exabyte from year (n2+1) to year n3
• ≧ 1 Zettabyte, 1 Yottabyte, etc. from year (n3+1) onwards,
where ni < ni+1, i = 1, 2, …

This definition has the advantage that it is measurable and can thus be used to determine whether a data set is Big Data within a particular time frame. Whether this sliding scale definition is acceptable and useful in practice is another matter. It is also unsatisfactory in that a single byte may mean the difference between being classified as Big Data or not. These examples of definitions show the basic requirements of a usable definition. The ideal situation is to have a definition that is based on well-defined metrics that enable measurement and comparison of certain characteristics of the object of interest. In practice this is not always the case as well-defined metrics may not be available. In such cases indirect metrics can be used, i.e. metrics that do not directly measure a characteristic, but measure a phenomenon that can be related to the object of interest. An example is a thermometer, where the length of expansion of mercury in a small tube can be related to the ambient temperature. Metrics, whether direct or indirect, can be represented by salient characteristics of the object considered. Such characteristics must be identifiable, measurable and comparable. A definition based on selected characteristics must clearly differentiate between objects. Thus a definition must utilise metrics that make it possible to clearly differentiate between data sets that are Big Data and those that are not. Such metrics must be clearly defined, easily measurable and repeatable, i.e. giving the same results when the measurements are repeated. In practice an important additional requirement is that the selected metrics must be applicable with limited resources.
The third definition given above meets these requirements and it differentiates clearly between sets of data. The metric used is well-defined, easily measurable and the determination of the size of a data repository can be repeated. As this definition only considers the magnitude of data sets, it does not tell much about the data itself. It does not consider any other, perhaps more important, characteristics of Big Data.
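The third, size-based definition can be written down directly as a small classifier. The sketch below is only an illustration: the threshold years are hypothetical placeholders, since the chapter leaves n1, n2 and n3 open.

```python
# Sketch of the sliding-scale size definition. The threshold years are
# hypothetical placeholders; the chapter does not fix n1, n2, n3.
THRESHOLDS = [
    (2010, 10**12),   # up to year n1:        >= 1 Terabyte
    (2020, 10**15),   # year n1+1 to year n2: >= 1 Petabyte
    (2030, 10**18),   # year n2+1 to year n3: >= 1 Exabyte
]
LATER = 10**21        # from year n3+1 onwards: >= 1 Zettabyte, ...

def is_big_data(size_bytes: int, year: int) -> bool:
    """Return True if a repository of the given size counts as Big Data in that year."""
    for last_year, threshold in THRESHOLDS:
        if year <= last_year:
            return size_bytes >= threshold
    return size_bytes >= LATER

print(is_big_data(5 * 10**12, 2008))   # 5 TB in 2008 -> True
print(is_big_data(5 * 10**12, 2015))   # 5 TB in 2015 -> False (Petabyte era)
```

The example also makes the weakness stated above visible: a single byte, or a single calendar year, flips the classification.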


The second definition given above also uses a metric that is clearly defined. In practice the application of the metric is not easy to apply as it may not be clear whether a datum belongs to the organisation in the sense that it is meaningful data. This makes the use of the definition in practical situations unclear. Different analysts may turn up with data sets that differ significantly. It does not differentiate between conventional data and Big Data in an organisation as all data is considered to be Big Data. The 3V definition attempts to create the impression that the three characteristics are metrics that can be measured and compared. Volume could perhaps be compared with the second or third definitions, whichever one chooses, but the remaining two criteria are not well-defined metrics. Thus it is not clear how to measure variety or velocity, nor when the particular values obtained indicate that a particular data set is Big Data or not. For an example see [8] Although these definitions do not give clarity about the fact that a data set agrees with some definition of Big Data, the use of this buzz word achieved wide-spread acceptance. Thus many results obtained through the analysis of data sets, regardless of their size, are real and useful contributions to research in many different areas, such as economics, health care, medical diagnosis and treatments, marketing, etc. Such results are obtained through a combination of sophisticated analysis techniques (data analytics) and relevant data sets. In [10] a survey of Big Data definitions was made and a new definition formulated. According to this definition Big Data is a term describing the storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning. This definition attempts to combine the salient aspects of the descriptions and definitions surveyed. As a result it suffers from the same deficiency that the metrics are not well-defined. For example, what does “complex data sets” mean; how can this complexity be measured? Although the above definitions are not useful, it must be acknowledged that Big Data as such does have an impact that cannot be ignored. Perhaps a better term would have been Big Impact Data, which raises the question of what characteristic of a data repository creates the possibility to discover novel patterns. The answer to this question may result in a discerning and measurable factor that can form the basis for a definition of Big Data. 1.2. Patterns The whole purpose of data analytics is to detect patterns in collections of data. Such patterns can, for example, be the searching and buying behaviour of particular groups of people in the Internet, or comparisons of life styles of groups of patients having the same disease. Such discovered patterns can be, and often are, interpreted as answers to specific questions (problems). It is usually not easy or even possible to ascertain whether such solutions are actually true. To establish the correctness and uniqueness of solutions, additional investigations are needed. The latter fact is easily ignored and the discovered patterns are accepted as true solutions without further questioning. This gave rise to the notion that all, or at least the majority, of real-world problems could be solved by collecting extensive data and analysing these with sophisticated methods. 
This would then make the methodical, complex and laborious scientific method, requiring theories and models to verify these, superfluous [5, 6]. Such a conclusion ignores fundamental and essential aspects of the scientific method that form the basis of scientific research in all fields, for example:
• How were the data sources selected?


• How were the data for the analysis chosen?
• Are the selected data representative of the problem considered?
• Can the result(s) be reproduced and substantiated with new/additional data?

Without considering such fundamental aspects it becomes highly questionable whether the solutions resulting from data analytics are correct and of any value. These may then be reduced to nothing more than mere indications of possible solutions. In order to illustrate the use of a solution based only on collected data, consider the well-known problem of predicting the weather. It is quite common to predict local weather by observing certain phenomena. These are often a combination of various observations, such as cloud formations and animal behaviour. The simplest approach to predict local weather is to:
• Observe the local weather condition for one day
• Predict that the weather will be the same the following day.

On average the result will be correct for considerably more than 50% of predictions made. The accuracy of the predictions depends on the part of the world considered, as there are regions where the weather remains nearly constant over longer periods, and others where weather changes may occur in rapid succession. On average it is, however, difficult to better this simple method with sophisticated methods. Additional collected data, such as observing the behaviour of ants, etc. can improve the accuracy of the predictions. If the area over which weather data is collected is increased and also extended into the atmosphere, the predictions can be further improved. It thus becomes possible not only to more accurately predict the weather, but also for longer periods. This fact supports the notion that the more data one has available the more accurately one can solve the weather prediction problem. Having complete global data available, i.e. Big Data, it may even be postulated that then the weather prediction problem can be solved completely for a reasonable period. Although this example supports the notion that Big Data analytics obviates the need for the scientific method, it also makes it clear that applying a solution method that relies only on the analysis of data cannot give a deeper insight into the nature of weather behaviour. Thus longer term forecasts or prediction of more complex weather patterns, such as hurricanes, cannot be handled well with the pure Big Data analytics approach.
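The simple rule described above is the classical persistence forecast. The sketch below shows how its accuracy would be measured; the sequence of daily observations is invented, but with real records the same few lines give the baseline that any more sophisticated prediction method has to beat.

```python
# Persistence forecast: predict that tomorrow's weather equals today's.
# The observation sequence here is invented purely for illustration.
observations = ["sun", "sun", "sun", "rain", "rain", "sun", "sun", "cloud",
                "cloud", "cloud", "sun", "sun", "rain", "sun", "sun", "sun"]

predictions = observations[:-1]          # forecast for day i+1 is the value seen on day i
actual      = observations[1:]

hits = sum(p == a for p, a in zip(predictions, actual))
print(f"persistence accuracy: {hits}/{len(actual)} = {hits / len(actual):.0%}")
```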

2. Scientific Method

The scientific method was developed and tested over a long period of time. It forms the basis of scientific research and is not limited to the natural sciences, but applies to all knowledge disciplines. This method or procedure comprises two phases:
1. Data collection through systematic observation, experiment and measurement, and the
2. Formulation, testing and modification of hypotheses.

The order in which these two phases (hypothesis formulation and data collection) are executed in practice depends on the nature of the particular problem considered. When humans started to domesticate plants and animals some ten millennia ago, the effects of climate changes on agriculture became of paramount importance. Predicting changes in the weather became very important. Attempts to predict the local weather were based on collecting weather-related data and attempting to find


dependencies between observed phenomena and weather changes. Thus the behaviour of animals such as ants gave a certain indication of impending short term weather changes. This did not, however, solve the problem to predict the annual seasonal climate changes. The moon months and years proved insufficient to predict such longer term changes. It was noted that the movement of the sun and especially the solstices and equinoxes were somehow connected to the seasonal climate changes. Tracing the movement of the sun thus allowed farming communities to at least know when spring or summer was approaching2. This was a great step forward enabling the planning of agricultural activities in advance. These data based methods did not, however, enable forecasts of wet or dry periods, showing the limitations of purely data based approaches. The need for improved weather forecasts resulted in extensive research into the physical laws that determine the weather. The hypotheses and the complexity of the resulting models stimulated the development of parallel computers enabling the computation of forecasts within an acceptable time frame. It is the combination of more advanced data collection techniques, especially satellite data, advanced models and powerful compute platforms that enable improved weather forecasts.

Figure 1: Human - World - Model interaction

2 See for example the tracing of the solstices and equinoxes in neolithic temples such as Hagar Qim, Malta.


An example of a problem that did not start with data collection, but with theory, is the search for the Higgs particle (boson). The theory that such a particle possibly exists was first postulated by a number of physicists. The data to prove its existence could only be collected more than forty years later.

2.1. Hypotheses and Models

A hypothesis is usually understood to be an explanation of an observable phenomenon, where observable implies that what is observed can be measured according to well-defined metrics; see for example the weather prediction problem in the previous section (Figure 1). Hypotheses may also predict the observation of one or more phenomena that are, due to the limited nature of human senses or measuring equipment, not measurable as yet, but will hopefully be in future, such as was the case with the Higgs boson mentioned above. A hypothesis can thus be shown to be true, true in part or false. Models are descriptions, usually in mathematical terms, of hypotheses and thus also of the associated observable phenomena, whether from physics, chemistry, engineering, economics, the social sciences, etc. The nature of a phenomenon may be either static or dynamic or a mixture of both. The description of static phenomena has been done for many ages [11]. Dynamic change over time is much more difficult to describe and it is only through the work of Leibniz, Newton and others that dynamic processes could be described, for example with the aid of differential equations. In complex cases the observed phenomena can be simulated by use of experiments and tests or by constructing physical systems that show behaviour that in some way resembles the nature of the real phenomena. The data thus obtained can be used as input values for more refined mathematical models. Alternative approaches involve the use of Monte Carlo methods or the simulation of phenomena by constructing a computational system using FPGAs, neural nets, cellular automata, etc. In practice mathematical models are often too complex to be solved directly (analytically). Computers can be used to compute complex solutions if the analytical models are approximated by numerical models. These again allow for the construction of computational models that can be executed by a suitable computer. In order to optimise computational efficiency the software algorithms must be adapted for the particular computer architecture employed (Figure 2).

Figure 2: Computational models
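As a minimal illustration of the chain sketched in Figure 2, the following toy example approximates a quantity whose analytical solution is known, the integral of sin(x) over [0, π] with exact value 2, first with a simple numerical model (the trapezoidal rule) and then with a Monte Carlo estimate. The example is generic and not taken from the chapter; the deviations it prints are exactly the kind of approximation and rounding errors discussed next.

```python
# Toy comparison of an analytical result with two computational approximations.
import math, random

exact = 2.0                               # analytical model: integral of sin(x) on [0, pi]

# Numerical model: trapezoidal rule with N intervals.
N = 1_000
h = math.pi / N
trapezoid = h * (0.5 * (math.sin(0) + math.sin(math.pi))
                 + sum(math.sin(i * h) for i in range(1, N)))

# Monte Carlo model: average of sin at random points, scaled by the interval length.
random.seed(0)
samples = 100_000
monte_carlo = math.pi * sum(math.sin(random.uniform(0, math.pi))
                            for _ in range(samples)) / samples

print(f"exact       : {exact}")
print(f"trapezoidal : {trapezoid:.6f}  (error {abs(trapezoid - exact):.2e})")
print(f"Monte Carlo : {monte_carlo:.6f}  (error {abs(monte_carlo - exact):.2e})")
```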


Such computer solutions can introduce additional errors. These include errors resulting from the construction of the numerical approximation, errors in the formulation of the computational model, compute (rounding) errors and inaccuracies in the input data. In summary: the use of computational models to verify hypotheses is error prone. These facts may be used as arguments against the use of the scientific method, but this does not mean that such models are completely useless and cannot be refined. Due to the complexity and/or a lack of knowledge and understanding of the nature of a particular phenomenon, hypotheses and the resulting models are more often than not simplified and/or incomplete. One may thus state that they are (usually) wrong [12]. On the other hand the development and refinement of hypotheses and models to reach a more exact description of a phenomenon contributes substantially to a better understanding of the real world. This refinement process may show up flaws in a conjecture and generate improved hypotheses. Within the framework of the scientific method it is the process of measuring and observing, developing hypotheses and models, generating solutions for the models and comparing the results with the observed characteristics of the phenomena that is important. Mathematical modelling techniques and the numerical methods that enable their solution with the aid of ever more powerful computers are continuously improved. See, for example, the new developments regarding the solution of stochastic partial differential equations (SPDEs) [13]. Eventually a stage may be reached where a well-tested hypothesis may lay the foundation for formulating a new theory. Mathematical models have a further advantage in that they allow for sensitivity analyses that indicate which data have the greatest influence on the accuracy of the end results, as well as the impact of errors in particular measurements on the computed solution. This information is invaluable in gaining a deeper understanding of the problem or phenomenon studied.

2.2. Validity of Patterns

If a pattern is detected by data analytics methods applied to available data sets, this result in itself may seem interesting. Additional verification is needed, however, to ensure that the detected pattern is indeed a reasonable conclusion. Thus the analysis of searches executed by a user may indicate an interest shown by a client in a particular product, but this does not necessarily mean that he/she wishes to buy the product. Great care should thus be taken to confirm the validity of analytics results before assuming their correctness. Such a verification, for example through hypotheses and models or the analysis of additional data, may not lead to a conclusive proof of the correctness of a result, but may strengthen the confidence in its validity. An additional complication in practice is that a detected pattern may appear to be a solution of a particular problem. The question whether this problem is well-posed is easily overlooked. The expectations created by overhasty, unverified successes, reported perhaps more from a marketing point of view or by organisations that wish to promote their own expertise in Big Data analytics, will inevitably result in disappointment. Such negative experiences will in the end denigrate Big Data to such an extent that the real value of pattern detection in large and complex data sets may be ignored in future.
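One simple form of the verification argued for here is to treat a detected pattern as a hypothesis and to test it on data that was not used to find it. The sketch below works on invented, purely random data: searching fifty attributes for the one most correlated with an outcome always produces a "best" pattern, but on an independent verification set the apparent relationship typically does not reappear.

```python
# Sketch: a "pattern" found by searching one data set is re-tested on new data.
# All data here are pure noise, so any detected correlation is spurious.
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_attributes = 200, 50

def strongest(data, outcome):
    """Index and value of the attribute most correlated with the outcome."""
    corrs = [np.corrcoef(data[:, j], outcome)[0, 1] for j in range(data.shape[1])]
    best = int(np.argmax(np.abs(corrs)))
    return best, corrs[best]

# Discovery set: search all attributes and keep the best-looking correlation.
discovery = rng.normal(size=(n_subjects, n_attributes))
outcome1 = rng.normal(size=n_subjects)
j, c1 = strongest(discovery, outcome1)
print(f"discovery set: attribute {j} correlates with the outcome at r = {c1:+.2f}")

# Verification set: measure the same attribute again on independent data.
verification = rng.normal(size=(n_subjects, n_attributes))
outcome2 = rng.normal(size=n_subjects)
c2 = np.corrcoef(verification[:, j], outcome2)[0, 1]
print(f"verification set: the same attribute gives only r = {c2:+.2f}")
```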


3. Data: Big or Small

3.1 Purposed Data

Whenever data is collected and recorded it is crucial that the purpose for which the data is to be used must be clear. The requirement for particular data is normally determined by the problem considered. Often only particular aspects of a problem are of interest and this limits the spectrum of data to be collected. Once it has been determined which data is needed, it must be decided for each datum which characteristics, such as accuracy, frequency, format, etc. must be fulfilled. All metrics, for example the measurement of frequency, must be defined. In general the problem to be solved determines the amount of data needed. In practice one usually strives to collect as little data as possible, as the recording, communication, storage, processing and archiving of data consume resources. Due to such constraints data sets are usually kept relatively small. If a phenomenon is recognised, but not well understood, a suitable hypothesis cannot be formulated. To gain more insight into the phenomenon, data can be collected and analysed with the particular purpose of gaining insight and a better understanding of the phenomenon. The gained insight can then hopefully result in the definition of a first hypothesis as a step in the scientific method. The hypothesis can lead to a first model and, with the collected data, hopefully result in a verification and possible refinement of the hypothesis. This results in an iterative process through which the problem solution can hopefully be improved. Common to all such cases is that the data is collected with a particular goal in mind. In the following sections such data will be denoted as Purposed Data. Purposed Data is thus all data collected for use in the scientific method to solve real-world problems. The quality and quantity of the data must meet the purpose of the data collection effort. This is of greater importance than the size of the data set.

3.2 Data Pools

In contrast to Purposed Data that are collected to solve a particular problem, data is increasingly also collected for administrative, legal, insurance, financial planning, medical care, etc. purposes. Such data are not intended in the first place to gain insight into a phenomenon or to solve a real-world problem. Examples can be found in repositories on, for example:
• Data on purchases made in the Internet, supermarkets, etc.
• Patient data collected by hospitals, medical doctors, dispensaries, etc.
• Traffic flow data in cities, highways, sea, air, etc. combined with data on accidents, congestions, emissions, etc.
• Satellite data and images of the earth showing global weather conditions, crops and crop diseases, flooded areas, geological characteristics, etc.
• Textual data published in books and journals, blogs and social media
• Geodata comprising environmental, geographical, geological and infrastructure data.

The collected records may, and usually do, contain data that may be useful for other purposes than those originally intended. Such repositories form Data Pools that may be analysed to detect hitherto unknown patterns. Such patterns can lead to new insights and unexpected innovative solutions of real-world problems.


In practice Data Pools may comprise very large data repositories posing particular problems for storing, communicating and processing the contained data. A further complicating factor is that data collected from various sources, where different criteria regarding formats and metrics were used, may be combined to form a more comprehensive Data Pool. Examples are medical records from various sources that are combined in a study of patient responses to particular treatments, and Geodata on existing infrastructures used for city planning, which originate from different city departments. The integration of such diversely sourced data is a major problem that may require significant investments in restructuring and transforming data. In many cases the only solution may prove to be to recollect some of the data using improved criteria.

3.3 Purposed Data versus Data Pools

A fundamental difference exists between Purposed Data and Data Pools. Purposed Data is collected for solving a particular problem, whereas data stored in a Data Pool is collected for one purpose, but then analysed to detect patterns that may result in a solution of a different problem. This implies that Purposed Data may also become part of a Data Pool, but then under views different from the original one. These new views may result in the detection of new patterns that form possible solutions of new problems. The distinction made between Purposed Data and Data Pools is thus based on the views with which the data are analysed. If Purposed Data are used for solving the problem for which the data were originally collected, i.e. in the context of the scientific method, such data represent Data in the classical sense.

3.4 Big Data

Data Pools make it possible to detect new patterns amongst data sets that have little relation with each other. This represents the advantages that are claimed for Big Data. Thus the term Big Data depicts Data Pools; the two terms are synonymous. The term Data Silos is used by some authors to describe the same concept. Here the term Data Pool is preferred to Data Silo, as the concept of a pool underlines the heterogeneity of the data sets contained. The size of the data repositories considered is dependent on what is available and what is useful. The actual size may thus be relatively small, making the name Big Data a misnomer. The fundamental characteristic is that of a Data Pool. Big Data is thus data that is used, or perhaps misused, for a purpose for which it was not originally intended. This opens up a spectrum of problems related to Big Data. These must be carefully considered in all aspects of data analytics implemented on such data. Thus the characteristics of data contained in a Big Data (or Data Pool) repository are often unknown or defined in an unsuitable way with uncertain quality. Such aspects become more important when various data sets from different sources are integrated, so-called Data Fusion. The data elements are often defined inconsistently amongst the data sets to be integrated, different metrics may have been used and some of the metrics used may not be well defined. Thus Purposed Data used in a Big Data context may have been well-defined for its original purpose, but the new purpose for which it may be used as part of Big Data analytics may require different quality and quantity criteria. Typical examples can be found in the use of Geodata from various sources in GeoInformation Systems (GIS) or Big Data analytics applied to medical Data Pools.
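The integration problem can be seen even in a tiny example. The sketch below fuses two hypothetical patient registries that record the same quantity under different field names, units and date formats; all names and records are invented for illustration.

```python
# Minimal data-fusion sketch: harmonising two hypothetical sources into one schema.
# Field names, units and records are invented for illustration.
from datetime import datetime

source_a = [  # weight in kilograms, ISO dates
    {"patient": "A-17", "weight_kg": 82.5, "visit": "2014-03-02"},
]
source_b = [  # weight in pounds, US-style dates, different key names
    {"id": "B-9", "wt_lb": 150.0, "date": "03/15/2014"},
]

def from_a(rec):
    return {"patient_id": rec["patient"], "weight_kg": rec["weight_kg"],
            "visit_date": datetime.strptime(rec["visit"], "%Y-%m-%d").date()}

def from_b(rec):
    return {"patient_id": rec["id"], "weight_kg": round(rec["wt_lb"] * 0.453592, 1),
            "visit_date": datetime.strptime(rec["date"], "%m/%d/%Y").date()}

fused = [from_a(r) for r in source_a] + [from_b(r) for r in source_b]
print(fused)
```

What such code cannot repair, such as differing measurement accuracy or undocumented collection criteria, is precisely the quality problem discussed here.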


Such quality and quantity aspects are in addition to the different formatting standards used in cases where different data sets are to be integrated to form a Big Data repository. It is not a purpose of this paper to discuss such problems in detail. This can only be done within the framework of a particular application. Results obtained through Big Data analytics must thus be carefully assessed and verified. Do the solutions obtained solve a particular problem and are these solutions valid? Answering such questions may require the use of the scientific method in a suitable manner.

3.5 Big Data Definition

These considerations lead to a definition of Big Data that is based on the use made of a set of data. The size of a data set is irrelevant. This concept is clearly defined and easily applied. It also underlines the fact that a datum may belong both to a Big Data repository and to a standard data set, depending on what it is used for.

4. Compute Platforms

The collection, communication, analysis and storage of large data repositories require compute platforms that are able to sustain high data throughput rates, so-called High Throughput Computing (HTC) systems, also referred to as Data Intensive Computing (DIC) systems. The main requirements are that such systems must be able to retrieve, process and store large data sets at high speed. Thus high-speed access to fast data storage systems is essential. An additional problem results from the difficulty of finding complex structures, which may require that ready access to large data sets must be possible. This requires large internal memories that are directly accessible. On the other hand this complicates the use of multiple processors working in parallel in order to achieve the high processing speeds that are demanded. It is not an aim of this paper to discuss various architectures of such HTC (or DIC) systems. Due to the need for systems that offer improved data processing capabilities in addition to or in lieu of sheer processing power, a number of suppliers of parallel systems are addressing these needs. To what extent the available machines will be able to meet users' requirements will have to be established in each individual case. No clearly defined metrics for measuring and comparing the performance of HTC/DIC systems are as yet available.

5. Conclusions

The value of so-called Big Data can only be fully utilised if it is clearly understood what Big Data as such implies and, more importantly, which quality criteria must be met. In this paper it is argued that Purposed Data represents the normal data that is collected within the framework of the scientific method. Whether such data collections are huge or small does not change the logic. These form an integral part of the scientific approach as applied in all areas of scientific research.


Big Data refers to data that were collected for a particular purpose, but which are subsequently analysed to detect new, unforeseen patterns, i.e. the data are used for a purpose other than the original one. Such analyses are often executed by combining related data repositories. This fact raises particular questions regarding compliance with quality and formatting requirements for such data. These must be clearly defined. In practice these requirements are often ignored, which may lead to erroneous results. Big Data analytics offers the potential of discovering solutions or gaining new insights that would not normally have been possible. Due to the uncertainties regarding the accuracy of the data used, in particular in those cases where different data repositories are integrated, the results obtained must be verified. This can, in the end, only be done by a scientific approach. In conclusion: Big Data analytics needs the scientific method, including hypotheses and models, to verify the validity of discovered results. What is needed is: Insight, not Data.

References

[1] G. Press: http://www.forbes.com/sites/gilpress/2013/05/09/a-very-short-history-of-big-data/, 2013.
[2] Gartner's 2014 Hype Cycle for Emerging Technologies Maps the Journey to Digital Business. http://www.gartner.com/newsroom/id/2819918, 2014.
[3] D. N. Kenwright, D. C. Banks, S. Bryson, R. Haimes, R. V. Liere, S. P. Uselton: Automation or Interaction: What's Best for Big Data? Panel Session, Proceedings Visualization '99, IEEE (1999), 491-495.
[4] M. Helft: Google Uses Searches to Track Flu's Spread. New York Times, 11/11/2008.
[5] C. Anderson: The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired Magazine 16.07, 2008.
[6] M. Graham: Big Data and the End of Theory? The Guardian, http://www.guardian.co.uk/news/datablog/2012/mar/09/big-data-theory, 2012.
[7] D. Laney: 3D Data Management: Controlling Data Volume, Velocity, and Variety. META Group Inc., http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf, 2001.
[8] M. A. Levin: Defining Big Data. Slides of presentation, www.stahq.org/index.php/download_file/view/363/184, 2013.
[9] Demystifying Big Data: A Practical Guide to Transforming the Business of Government. Prepared by the TechAmerica Foundation's Federal Big Data Commission, TechAmerica Foundation, 601 Pennsylvania Ave., N.W., North Building, Washington D.C. 20004.
[10] J. S. Ward and A. Barker: Undefined By Data: A Survey of Big Data Definitions. http://arxiv.org/abs/1309.5821v1, 2013.


[11] D. E. Knuth: Ancient Babylonian Algorithms. Communications of the ACM, 15 (1972), 671-677.
[12] G. E. P. Box, N. R. Draper: Empirical Model-Building and Response Surfaces. Wiley, ISBN 0471810339, 1987.
[13] M. Hairer: Solving the KPZ equation. Annals of Mathematics, 178 (2013), 559-664.


Big Data and High Performance Computing L. Grandinetti et al. (Eds.) IOS Press, 2015 © 2015 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-61499-583-8-18

Programming Visual and Script-based Big Data Analytics Workflows on Clouds

Loris BELCASTRO, Fabrizio MAROZZO, Domenico TALIA 1 and Paolo TRUNFIO
DIMES, University of Calabria, Italy

Abstract. Data analysis applications often include large datasets and complex software systems in which multiple data processing tools are executed in a coordinated way. Data analysis workflows are effective in expressing task coordination and they can be designed through visual- and script-based programming paradigms. The Data Mining Cloud Framework (DMCF) supports the design and scalable execution of data analysis applications on Cloud platforms. A workflow in DMCF can be developed using a visual- or a script-based language. The visual language, called VL4Cloud, is based on a design approach for high-level users, e.g., domain expert analysts having a limited knowledge of programming paradigms. The script-based language JS4Cloud is provided as a flexible programming paradigm for skilled users who prefer to code their workflows through scripts. Both languages implement a data-driven task parallelism that spawns ready-to-run tasks to Cloud resources. In addition, they exploit implicit parallelism that frees users from duties like workload partitioning, synchronization and communication. In this chapter, we present the DMCF framework and discuss how its workflow paradigm has been integrated with the MapReduce model. In particular, we describe how VL4Cloud/JS4Cloud workflows can include MapReduce tools, and how these workflows are executed in parallel on DMCF enabling scalable data processing on Clouds.
Keywords. Big data, workflows, Data Mining Cloud Framework, MapReduce

Introduction

Cloud computing systems provide elastic services, high performance and scalable data storage to a large and ever-increasing number of users [1]. Clouds have enlarged the computing and storage offering of distributed computing systems by providing advanced Internet services that complement and complete the functionalities of distributed computing provided by the Web, Grids, and peer-to-peer networks. In fact, Cloud computing systems provide large-scale computing infrastructures for complex high-performance applications. Most of those applications use big data repositories and often access and analyze them to extract useful information.

Big data is a new and over-used term that refers to massive, heterogeneous, and often unstructured digital content that is difficult to process using traditional data management tools and techniques. The term includes the complexity and variety of data and data types, real-time data collection and processing needs, and the value that can be obtained by smart analytics. Advanced data mining techniques and associated tools can help extract information from large, complex datasets that is useful in making informed decisions in many business and scientific applications including advertising, market sales, social studies, bioinformatics, and high-energy physics. Combining big data analytics and knowledge discovery techniques with scalable computing systems will produce new insights in a shorter time [2].

Although only a few Cloud-based analytics platforms are available today, current research work foresees that they will become common within a few years. Some current solutions are open source systems such as Apache Hadoop and SciDB, while others are proprietary solutions provided by companies such as Google, IBM, Microsoft, EMC, BigML, Splunk Storm, Kognitio, and InsightsOne. As such platforms become available, researchers are increasingly porting powerful data mining programming tools and strategies to the Cloud to exploit complex and flexible software models, such as the distributed workflow paradigm. The increasing use of service-oriented computing in many application domains is accelerating this trend. Developers and researchers can adopt the software as a service (SaaS), platform as a service (PaaS), and infrastructure as a service (IaaS) models to implement big data analytics solutions in the Cloud. In this way, data mining tasks and knowledge discovery applications can be offered as high-level services on Clouds. This approach creates a new way to deliver data analysis software that is called data analytics as a service (DAaaS).

This chapter describes a Data Mining Cloud Framework (DMCF) that we developed according to this approach. In DMCF, data analysis workflows can be designed through a visual- or a script-based formalism. The visual formalism, called VL4Cloud, is a very effective design approach for high-level users, e.g., domain expert analysts having a limited knowledge of programming languages. As an alternative, the script-based language, called JS4Cloud, offers a flexible programming approach for skilled users who prefer to program their workflows using a more technical approach. We discuss how the DMCF framework based on the workflow model has been integrated with the MapReduce paradigm. In particular, we describe how VL4Cloud/JS4Cloud workflows can include MapReduce algorithms and tools, and how these workflows can be executed in parallel on DMCF to support scalable data analysis on Clouds.

The remainder of the chapter is organized as follows. Section 1 presents the Data Mining Cloud Framework, introducing its architecture, the parallel execution model and the workflow-based programming paradigm offered by VL4Cloud and JS4Cloud. Section 2 describes how the VL4Cloud and JS4Cloud languages have been extended to integrate with the MapReduce model. Section 3 describes a data mining application implemented using the proposed approach. Section 4 discusses related work. Finally, Section 5 concludes the chapter.

1. Data Mining Cloud Framework

The Data Mining Cloud Framework (DMCF) is a software system developed to allow users to design and execute data analysis workflows on Clouds. DMCF supports a large variety of data analysis processes, including single-task applications, parameter sweeping applications, and workflow-based applications. A Web-based user interface allows users to compose their applications and to submit them for execution to a Cloud platform, according to a Software-as-a-Service approach.


Figure 1. Architecture of Data Mining Cloud Framework

The DMCF's architecture includes a set of components that can be classified as storage and compute components [3] (see Figure 1).

The storage components include:

• A Data Folder that contains data sources and the results of knowledge discovery processes. Similarly, a Tool Folder contains libraries and executable files for data selection, pre-processing, transformation, data mining, and results evaluation.
• The Data Table, Tool Table and Task Table, which contain metadata information associated with data, tools, and tasks.
• The Task Queue, which manages the tasks to be executed.

The compute components are:

• A pool of Virtual Compute Servers, which are in charge of executing the data mining tasks.
• A pool of Virtual Web Servers, which host the Web-based user interface.

The user interface provides three functionalities:

i) App submission, which allows users to submit single-task, parameter sweeping, or workflow-based applications;
ii) App monitoring, which is used to monitor the status and access results of the submitted applications;
iii) Data/Tool management, which allows users to manage input/output data and tools.

The Data Mining Cloud Framework architecture has been designed as a reference architecture to be implemented on different Cloud systems. However, a first implementation of the framework has been carried out on the Microsoft Azure Cloud platform (http://azure.microsoft.com/) and has been evaluated through a set of data analysis applications executed on a Microsoft Cloud data center.

The DMCF framework takes advantage of cloud computing features, such as elastic resource provisioning. In DMCF, at least one Virtual Web Server runs continuously in the Cloud, as it serves as the user front-end. In addition, users specify the minimum and maximum number of Virtual Compute Servers. DMCF can exploit the auto-scaling features of Microsoft Azure, which allow Virtual Compute Servers to be dynamically spun up or shut down based on the number of tasks ready for execution in the DMCF's Task Queue. Since storage is managed by the Cloud platform, the number of storage servers is transparent to the user. The remainder of the section outlines application execution in DMCF, and describes the DMCF's visual- and script-based formalisms used to implement workflow applications.

1.1. Applications execution

For designing and executing a knowledge discovery application, users interact with the system performing the following steps:

1. The Website is used to design an application (either single-task, parameter sweeping, or workflow-based) through a Web-based interface that offers both the visual programming interface and the script-based one.
2. When a user submits an application, the system creates a set of tasks and inserts them into the Task Queue on the basis of the application requirements.
3. Each idle Virtual Compute Server picks a task from the Task Queue and concurrently executes it.
4. Each Virtual Compute Server gets the input dataset from the location specified by the application. To this end, a file transfer is performed from the Data Folder where the dataset is located to the local storage of the Virtual Compute Server.
5. After task completion, each Virtual Compute Server puts the result in the Data Folder.
6. The Website notifies the user as soon as her/his task(s) have completed, and allows her/him to access the results.

The set of tasks created in the second step depends on the type of application submitted by a user. In the case of a single-task application, just one data mining task is inserted into the Task Queue. If a user submits a parameter sweeping application, a set of tasks corresponding to the combinations of the input parameter values is executed in parallel. If a workflow-based application has to be executed, the set of tasks created depends on how many data analysis tools are invoked within the workflow. Initially, only the workflow tasks without dependencies are inserted into the Task Queue [3].

1.2. Workflow formalisms

The DMCF allows creating data mining and knowledge discovery applications using workflow formalisms. Workflows may encompass all the steps of discovery based on the execution of complex algorithms and the access and analysis of scientific data. In data-driven discovery processes, knowledge discovery workflows can produce results that can confirm real experiments or provide insights that cannot be achieved in laboratories. In particular, DMCF allows users to program workflow applications using two languages:

- VL4Cloud (Visual Language for Cloud), a visual programming language that lets users develop applications by programming the workflow components graphically [4].
- JS4Cloud (JavaScript for Cloud), a scripting language for programming data analysis workflows based on JavaScript [5].

Both languages use two key programming abstractions:

- Data elements denote input files or storage elements (e.g., a dataset to be analyzed) or output files or stored elements (e.g., a data mining model).
- Tool elements denote algorithms, software tools or complex applications performing any kind of operation that can be applied to a data element (data mining, filtering, partitioning, etc.).

Another common element is the task concept, which represents the unit of parallelism in our model. A task is a Tool, invoked in the workflow, which is intended to run in parallel with other tasks on a set of Cloud resources. According to this approach, VL4Cloud and JS4Cloud implement a data-driven task parallelism. This means that, as soon as a task does not depend on any other task in the same workflow, the runtime asynchronously spawns it to the first available virtual machine. A task Tj does not depend on a task Ti belonging to the same workflow (with i ≠ j) if Tj, during its execution, does not read any data element created by Ti.

In VL4Cloud, workflows are directed acyclic graphs whose nodes represent data and tool elements. The nodes can be connected with each other through directed edges, establishing specific dependency relationships among them. When an edge is created between two nodes, a label is automatically attached to it representing the type of relationship between the two nodes. Data and Tool nodes can be added to the workflow singularly or in array form. A data array is an ordered collection of input/output data elements, while a tool array represents multiple instances of the same tool. Figure 2 shows an example of a data analysis workflow developed using the visual workflow formalism of DMCF [6].
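To make the dependency rule above concrete, here is a small conceptual sketch in plain C (the task names, data structures and scheduler loop are invented for illustration; this is not DMCF code): a task is spawned as soon as none of the data elements it reads is still owed by an unfinished task.

```c
/* Conceptual data-driven task spawning: task j depends on task i when
 * j reads a data element created by i; a task is ready as soon as all
 * such producers have completed. Illustration only, not DMCF internals. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *name;
    const char *reads[4];   /* data elements read (NULL-terminated)    */
    const char *writes[4];  /* data elements created (NULL-terminated) */
    bool done;
} Task;

/* Ready = every data element read by t is either an external input or
 * has been produced by an already-completed task. */
static bool is_ready(const Task *t, const Task *tasks, int n) {
    for (int r = 0; t->reads[r] != NULL; r++)
        for (int i = 0; i < n; i++) {
            if (tasks[i].done) continue;            /* finished producer is fine */
            for (int w = 0; tasks[i].writes[w] != NULL; w++)
                if (strcmp(tasks[i].writes[w], t->reads[r]) == 0)
                    return false;   /* t reads data created by an unfinished task */
        }
    return true;
}

int main(void) {
    Task tasks[] = {
        { "Partitioner", { "Dataset", NULL },       { "Train", "Test", NULL }, false },
        { "J48",         { "Train", NULL },         { "Model", NULL },         false },
        { "Predictor",   { "Test", "Model", NULL }, { "Labels", NULL },        false },
    };
    int n = (int)(sizeof(tasks) / sizeof(tasks[0])), remaining = n;

    /* Each pass spawns (here: just prints and completes) every ready task;
     * a real runtime would dispatch them asynchronously to idle servers. */
    while (remaining > 0)
        for (int i = 0; i < n; i++)
            if (!tasks[i].done && is_ready(&tasks[i], tasks, n)) {
                printf("spawning %s\n", tasks[i].name);
                tasks[i].done = true;   /* pretend the task finished */
                remaining--;
            }
    return 0;
}
```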

Figure 2. Example of data analysis application designed using VL4Cloud

In JS4Cloud, workflows are defined with JavaScript code that interacts with Data and Tool elements through three functions:

- Data Access, for accessing a Data element stored in the Cloud;
- Data Definition, for defining a new Data element that will be created at runtime as a result of a Tool execution;
- Tool Execution, for invoking the execution of a Tool available in the Cloud.

Once the JS4Cloud workflow code has been submitted, an interpreter translates the workflow into a set of concurrent tasks by analysing the existing dependencies in the code. The main benefits of JS4Cloud are: 1) it extends the well-known JavaScript language while using only its basic functions (arrays, functions, loops); 2) it implements both a data-driven task parallelism that automatically spawns ready-to-run tasks to the Cloud resources, and data parallelism through an array-based formalism; 3) these two types of parallelism are exploited implicitly, so that workflows can be programmed in a totally sequential way, which frees users from duties like work partitioning, synchronization and communication.

Figure 3 shows the script-based version of the visual workflow shown in Figure 2. In this example, parallelism is exploited in the for loop at line 7, where up to 16 instances of the J48 classifier are executed in parallel on 16 different partitions of the training sets, and in the for loop at line 10, where up to 16 instances of the Predictor tool are executed in parallel to classify the test set using 16 different classification models.

Figure 3. Example of data analysis application designed using JS4Cloud

Figure 3 shows a snapshot of the parallel classification workflow taken during its execution in the DMCF's user interface. Beside each code line number, a colored circle indicates the status of execution. This feature allows users to monitor the status of the workflow execution. Green circles at lines 3 and 5 indicate that the two partitioners have completed their execution; the blue circle at line 8 indicates that the J48 tasks are still running; the orange circles at lines 11 and 13 indicate that the corresponding tasks are waiting to be executed.


2. Extending VL4Cloud/JS4Cloud with MapReduce

In this section, we describe how the DMCF has been extended to include the execution of MapReduce tasks. In particular, we describe the MapReduce programming model, why it is widely used by data specialists, and how the DMCF's languages have been extended to support MapReduce applications.

2.1. Motivations

MapReduce [7] is a programming model developed by Google to process large amounts of data. It is inspired by the map and reduce primitives present in Lisp and other functional languages. A user defines a MapReduce application in terms of a map function that processes a (key, value) pair to generate a list of intermediate (key, value) pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Most MapReduce implementations are based on a master-slave architecture. A job is submitted by a user node to a master node that selects idle workers and assigns a map or reduce task to each one. When all the tasks have been completed, the master node returns the result to the user node.

MapReduce and its best-known implementation Hadoop (http://hadoop.apache.org) have become widely used by data specialists to develop parallel applications that analyze large amounts of data. Hadoop is designed to scale up from a single server to thousands of servers, and has become the focus of several other projects, including Spark (http://spark.apache.org) for in-memory machine learning and data analysis, Storm (http://storm.apache.org) for streaming data analysis, Hive (http://hive.apache.org) as data warehouse software to query and manage large datasets, and Pig (http://pig.apache.org) as a dataflow language for exploring large datasets.

Algorithms and applications written using MapReduce are automatically parallelized and executed on a large number of servers. Consequently, MapReduce has been widely used to implement data mining algorithms in parallel. Chu et al. [8] offer an overview of how several learning algorithms can be efficiently implemented using MapReduce. In more detail, the authors demonstrate that MapReduce achieves basically a linear speedup with an increasing number of processors on a variety of learning algorithms such as K-means, neural networks and Expectation-Maximization probabilistic clustering. Mahout (http://mahout.apache.org) is an Apache project built on Hadoop that provides scalable machine learning libraries. The Ricardo project [9] is a platform that integrates the R statistical tools (http://www.r-project.org) and Hadoop to support parallel data analysis. The use of MapReduce for data-intensive scientific analysis and bioinformatics is analyzed in depth in [10]. For the reasons discussed above, and given the large number of MapReduce algorithms and applications available online, we designed an extension of the DMCF's workflow formalism to also support the execution of MapReduce tools.
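To make the (key, value) contract above concrete, the following self-contained sketch mimics a word-count job sequentially in plain C. It only illustrates the map/shuffle/reduce flow described in this section, not Hadoop's API or the way DMCF invokes MapReduce Tools, and every name and input in it is made up for the example.

```c
/* Conceptual sketch of the MapReduce contract: map() emits intermediate
 * (key, value) pairs, the framework groups them by key, and reduce()
 * merges the values of each group. A real runtime would shuffle the
 * pairs across many workers; here everything stays in memory. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PAIRS 1024

typedef struct { char key[64]; int value; } Pair;

static Pair pairs[MAX_PAIRS];   /* intermediate (key, value) pairs */
static int  npairs = 0;

/* map: one input record (a line of text) -> a list of (word, 1) pairs */
static void map(const char *record) {
    char buf[256];
    strncpy(buf, record, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    for (char *tok = strtok(buf, " \t\n"); tok && npairs < MAX_PAIRS;
         tok = strtok(NULL, " \t\n")) {
        strncpy(pairs[npairs].key, tok, sizeof(pairs[npairs].key) - 1);
        pairs[npairs].key[sizeof(pairs[npairs].key) - 1] = '\0';
        pairs[npairs].value = 1;
        npairs++;
    }
}

/* reduce: one key and all its intermediate values -> one output value */
static void reduce(const char *key, const int *values, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) sum += values[i];
    printf("%s\t%d\n", key, sum);
}

static int cmp(const void *a, const void *b) {
    return strcmp(((const Pair *)a)->key, ((const Pair *)b)->key);
}

int main(void) {
    const char *records[] = { "big data on clouds", "big data analytics", "data mining" };

    /* map phase: applied independently to every input record */
    for (size_t i = 0; i < sizeof(records) / sizeof(records[0]); i++)
        map(records[i]);

    /* "shuffle": group intermediate pairs by key (here, by sorting) */
    qsort(pairs, npairs, sizeof(Pair), cmp);

    /* reduce phase: one call per distinct key */
    int i = 0;
    while (i < npairs) {
        int values[MAX_PAIRS], n = 0, j = i;
        while (j < npairs && strcmp(pairs[j].key, pairs[i].key) == 0)
            values[n++] = pairs[j++].value;
        reduce(pairs[i].key, values, n);
        i = j;
    }
    return 0;
}
```

Because map is applied independently to each record and reduce independently to each key group, a framework can distribute both phases across many machines without the programmer writing any communication code, which is the property the rest of this section relies on.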


2.2. Integration model

In DMCF, a Tool represents a software tool or service performing any kind of process that can be applied to a data element (data mining, filtering, partitioning, etc.).

Figure 4. Types of Tools available in DMCF

As shown in Figure 4, three different types of Tools can be used in a DMCF workflow:

- A Batch Tool is used to execute an algorithm or a software tool on a Virtual Compute Server without user interaction. All input parameters are passed as command-line arguments.
- A Web Service Tool is used to insert a Web service invocation into a workflow. It is possible to integrate both REST and SOAP-based Web services [11].
- A MapReduce Tool is used to insert into a workflow the execution of a MapReduce algorithm or application running on a cluster of virtual servers.

For each Tool in a workflow, a Tool descriptor includes a reference to its executable, the required libraries, and the list of input and output parameters. Each parameter is characterized by a name, a description, and a type, and can be mandatory or optional. In more detail, a MapReduce Tool descriptor is composed of two groups of parameters: generic parameters, which are used by the MapReduce runtime, and application parameters, which are associated with specific MapReduce applications. In the following, we list a few examples of generic parameters:

- mapreduce.job.reduces: the number of reduce tasks per job;
- mapreduce.job.maps: the number of map tasks per job;
- mapreduce.input.fileinputformat.split.minsize: the minimum size of the chunks that the map input is split into;
- mapreduce.input.fileinputformat.split.maxsize: the maximum size of the chunks that the map input is split into;
- mapreduce.map.output.compress: enables compression of the intermediate mapper outputs before they are sent to the reducers.

Figure 5 shows an example of a MapReduce Tool descriptor for an implementation of the Random Forest algorithm. As shown by the descriptor, the algorithm can be configured with the following parameters: a set of input files (dataInput), the number of trees that will be generated (nTrees), the minimum number of elements for a node split (minSplitNum), the column containing the class labels (classColumn), and the output models (dataOutput). DMCF uses this descriptor to allow the inclusion of a RandomForest algorithm in a workflow, and to execute it on a MapReduce cluster.

Figure 5. Example of MapReduce descriptor in JSON format

3. A Data Mining Application Case

In this section, we describe a DMCF data mining application whose workflow includes MapReduce computations. Through this example, we show how the MapReduce paradigm has been integrated into DMCF workflows, and how it can be used to exploit the inherent parallelism of the application.

The application deals with a significant economic problem: the prediction of flight delays. Every year approximately 20% of airline flights are delayed or canceled, mainly due to bad weather, carrier equipment, or technical airport problems. These delays result in significant costs to both airlines and passengers. In fact, the cost of flight delays for the US economy was estimated to be $32.9 billion in 2007, and more than half of it was charged to passengers [12].


The goal of this application is to implement a predictor of the arrival delay of scheduled flights due to weather conditions. The predicted arrival delay takes into consideration both implicit flight information (origin airport, destination airport, scheduled departure time, scheduled arrival time) and the weather forecast at the origin and destination airports. In particular, we consider the closest weather observation at the origin and destination airports based on the scheduled flight departure and arrival times. If the predicted arrival delay of a scheduled flight is less than a given threshold, it is classified as an on-time flight; otherwise, it is classified as a delayed flight.

Two open datasets of airline flights and weather observations have been collected, and exploratory data analysis has been performed to discover initial insights, evaluate the quality of data, and identify potentially interesting subsets. The first dataset is the Airline On-Time Performance (AOTP) dataset provided by RITA - Bureau of Transportation Statistics (http://www.transtats.bts.gov). The AOTP dataset contains data for domestic US flights by major air carriers, providing for each flight detailed information such as origin and destination airports, scheduled and actual departure and arrival times, air time, and non-stop distance. The second is the Quality Controlled Local Climatological Data (QCLCD) dataset available from the National Climatic Data Center (http://cdo.ncdc.noaa.gov/qclcd/QCLCD). The dataset contains hourly weather observations from about 1,600 U.S. stations. Each weather observation includes data about temperature, humidity, wind direction and speed, barometric pressure, sky condition, visibility, and a weather phenomena descriptor.

For data classification, a MapReduce version of the Random Forest (RF) algorithm has been used. RF is an ensemble learning method for classification [13]. It creates a collection of different decision trees called a forest. Once the forest trees are created, the classification of an unlabeled tuple is performed by aggregating the predictions of the different trees through majority voting.

The results presented for this application have been obtained using data for a five-year period beginning in January 2009 and ending in December 2013. The data are actually too large to be analyzed on a single server. In fact, the joint table for five years of data is larger than 120 GB, which cannot be analyzed on a single server due to memory limits. A cloud system makes the analysis possible by providing the necessary computing resources and scalability. In addition, the cloud makes the proposed process more general: if the amount of data increases (e.g., by extending the analysis to more years of flight and weather data), the cloud provides the required resources with a high level of elasticity, reliability, and scalability. More application details are described in [14].

Figure 6. Flight delay analysis workflow using DMCF with MapReduce



Using DMCF, we created a workflow for the whole data analysis process (see Figure 6). The workflow begins by pre-processing the AOTP and the QCLCD datasets using two instances of the PreProc Tool. These steps allow looking for possible wrong data, treating missing values, and filtering out diverted and cancelled flights and weather observations not related to airport locations. Then, a Joiner Tool executes a relational join between the Flights and Weather Observations data in parallel using a MapReduce algorithm. The result is a JointTable. Then, a PartitionerTT Tool creates five pairs of training and test sets using different delay threshold values. The five instances of training set and test set are represented in the workflow as two data array nodes, labelled as Trainset[5] and Testset[5]. Then, five instances of the RandomForest Tool analyze in parallel the five instances of Trainset to generate five models (Model[5]). For each model, an instance of the Evaluator Tool generates the confusion matrix (EvalModel), which is a commonly used method to measure the quality of classification. Starting from the set of confusion matrices obtained, these tools calculate some metrics, e.g., accuracy, precision, and recall, which can be used to select the best model.

For our experiments, we deployed a Hadoop cluster over the Virtual Compute Servers of DMCF. The cluster includes 1 head node having eight 2.2 GHz CPU cores and 14 GB of memory, and 8 worker nodes having four 2.2 GHz CPU cores and 7 GB of memory. Table 1 presents the workflow's turnaround times and speedup values obtained using up to 8 workers. Taking into account the whole workflow, the turnaround time decreases from about 7 hours using 2 workers to 1.7 hours using 8 workers, with a speedup that is very close to linear.

Table 1. Turnaround time and relative speedups (with respect to 2 workers)

                  2 workers (1x)           4 workers (2x)           8 workers (4x)
Tool              Turnaround    Speedup    Turnaround    Speedup    Turnaround    Speedup
                  (hh.mm.ss)               (hh.mm.ss)               (hh.mm.ss)
PreProc           00.08.34      -          00.04.13      2.0        00.02.31      3.4
Filter            03.00.21      -          01.36.39      1.9        00.46.45      3.9
PartitionerTT     02.14.06      -          01.06.59      2.0        00.33.19      4.0
RandomForest      00.30.27      -          00.15.22      2.0        00.07.51      3.9
Evaluator         01.00.44      -          00.31.26      1.9        00.16.07      3.8
Total             06.54.12      -          03.34.39      1.9        01.46.33      3.8
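As a worked example of what "relative speedup with respect to 2 workers" means in Table 1 (using the Total row, with the turnaround times converted to seconds; table values are reported to one decimal):

$$ S_p = \frac{T_{2\,\text{workers}}}{T_{p\,\text{workers}}}, \qquad S_4 = \frac{6\text{h}\,54\text{m}\,12\text{s}}{3\text{h}\,34\text{m}\,39\text{s}} = \frac{24\,852\ \text{s}}{12\,879\ \text{s}} \approx 1.9 . $$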

Scalability is obtained by exploiting the parallelism offered both by MapReduce Tools and by the DMCF workflow languages. In the first case, each MapReduce Tool is executed in parallel exploiting the cluster resources. The level of parallelism depends on the number of map and reduce tasks and on the resources available in the cluster. In the second case, the DMCF workflow languages allow creating parallel paths and arrays of tools that can be executed concurrently. In this case, the level of parallelism depends on the dependencies among tasks and on the resources available in the cluster.

4. Related work

Several systems have been proposed to design and execute workflow-based applications [15], but only some of them currently work on the Cloud and support visual or script-based workflow programming. The best-known systems are Taverna [16], Orange4WS [17][18], Kepler [19], E-Science Central (e-SC) [20], ClowdFlows [21], Pegasus [22][23], WS-PGRADE [24] and Swift [25]. In particular, Swift is a parallel scripting language that executes workflows across several distributed systems, like clusters, Clouds, grids, and supercomputers. It provides a functional language in which workflows are modelled as a set of program invocations with their associated command-line arguments, input and output files. Swift uses a C-like syntax consisting of function definitions and expressions that provide a data-driven task parallelism. The runtime includes a set of services that implement the parallel execution of Swift scripts, exploiting the maximal concurrency permitted by data dependencies within a script and by external resource availability. Swift users can use Galaxy [26] to provide a visual interface for Swift [27].

For comparison purposes, we distinguish two types of parallelism levels: workflow parallelism, which refers to the ability to execute multiple workflows concurrently; and task parallelism, which is the ability to execute multiple tasks of the same workflow concurrently. Most systems, including DMCF, support both workflow and task parallelism, except for ClowdFlows and E-Science Central, which focus on workflow parallelism only. Most systems are provided according to the SaaS model (e.g., E-Science Central, ClowdFlows, Pegasus, WS-PGRADE, Swift+Galaxy and DMCF), whereas Taverna, Kepler and Orange4WS are implemented as desktop applications that can invoke Cloud software exposed as Web Services. All the SaaS systems are implemented on top of Infrastructure-as-a-Service (IaaS) Clouds, except for DMCF, which is designed to run on top of Platform-as-a-Service (PaaS) Clouds.

DMCF is one of the few SaaS systems featuring both workflow/task parallelism and support for data/tool arrays. However, differently from the data/tool array formalisms provided by the other systems, DMCF's arrays make explicit the parallelism level of each workflow node, i.e., the number of input/output datasets (in the case of data arrays) and the number of tools to be concurrently executed (in the case of tool arrays). Furthermore, DMCF is the only system designed to run on top of a PaaS. A key advantage of this approach is independence from the infrastructure layer. In fact, the DMCF components are mapped onto PaaS services, which in turn are implemented on infrastructure components. Changes to the Cloud infrastructure affect only the infrastructure/platform interface, which is managed by the Cloud provider, and therefore DMCF's implementation and functionality are not affected. In addition, the PaaS approach facilitates the implementation of the system on a public Cloud, which frees final users and organizations from any hardware and OS management duties.

5. Conclusions

Data analysis applications often involve big data and complex software systems in which multiple data processing tools are executed in a coordinated way. Big data refers to massive, heterogeneous, and often unstructured digital content that is difficult to process using traditional data management tools and techniques. Cloud computing systems provide elastic services, high performance and scalable data storage, which can be used as large-scale computing infrastructures for complex high-performance data mining applications.

Data analysis workflows are effective in expressing task coordination and can be designed through visual and script-based formalisms. According to this approach, we described the Data Mining Cloud Framework (DMCF), a system supporting the scalable execution of data analysis computations on Cloud platforms. A workflow in DMCF can be defined using a visual or a script-based formalism, in both cases implementing a data-driven task parallelism that spawns ready-to-run tasks to Cloud resources. In this chapter, we presented how the DMCF workflow paradigm has been integrated with the MapReduce model. In particular, we described how VL4Cloud/JS4Cloud workflows can include MapReduce algorithms and tools, and how these workflows are executed in parallel on DMCF to enable scalable data processing on Clouds.

We described a workflow application that exploits the support for MapReduce provided by DMCF. The goal of this workflow is to implement a predictor of the arrival delay of scheduled flights due to weather conditions, taking into consideration both implicit flight information and the weather forecast at the origin and destination airports. By executing the workflow on an increasing number of workers, we were able to achieve a nearly linear speedup, thanks to the combined scalability provided by the DMCF workflow languages and by the MapReduce framework.

References

[1] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia. A view of cloud computing. Commun. ACM, 53(4), pp. 50–58, April 2010.
[2] D. Talia, P. Trunfio. Service-Oriented Distributed Knowledge Discovery, Chapman & Hall/CRC, USA, 2012.
[3] F. Marozzo, D. Talia, P. Trunfio. A Cloud Framework for Parameter Sweeping Data Mining Applications. Proc. of the 3rd International Conference on Cloud Computing Technology and Science (CloudCom 2011), Athens, Greece, pp. 367-374, 2011.
[4] F. Marozzo, D. Talia, P. Trunfio. Using clouds for scalable knowledge discovery applications. In Euro-Par 2012: Parallel Processing Workshops, pp. 220-227, 2013.
[5] F. Marozzo, D. Talia, and P. Trunfio. JS4Cloud: Script-based workflow programming for scalable data analysis on cloud platforms. Concurrency and Computation: Practice and Experience, 2015.
[6] F. Marozzo, D. Talia, and P. Trunfio. A cloud framework for big data analytics workflows on Azure. Proc. of the 2012 High Performance Computing Workshop (HPC 2012), 2012.
[7] J. Dean, S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1), pp. 107-113, 2008.
[8] C. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. Advances in Neural Information Processing Systems, 19(2007), pp. 281, 2007.
[9] S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: Integrating R and Hadoop. Proc. of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD '10). ACM, New York, NY, USA, pp. 987–998, 2010.
[10] J. Ekanayake, S. Pallickara, and G. Fox. MapReduce for Data Intensive Scientific Analyses. Proc. of the 2008 Fourth IEEE International Conference on eScience (ESCIENCE '08). IEEE Computer Society, Washington, DC, USA, pp. 277–284, 2008.
[11] C. Pautasso, O. Zimmermann, and F. Leymann. RESTful web services vs. "big" web services: making the right architectural decision. Proc. of the 17th International Conference on World Wide Web (WWW '08). ACM, New York, NY, USA, pp. 805-814, April 2008.
[12] M. Ball, C. Barnhart, M. Dresner, M. Hansen, K. Neels, A. Odoni, E. Peterson, L. Sherry, A. A. Trani, and B. Zou. Total delay impact study: a comprehensive assessment of the costs and impacts of flight delay in the United States, 2010.
[13] L. Breiman. Random forests. Machine Learning, Springer, 45(1), pp. 5–32, 2001.
[14] L. Belcastro, F. Marozzo, D. Talia, and P. Trunfio. Using Scalable Data Mining for Predicting Flight Delays. 2015. Under review.
[15] D. Talia. Workflow systems for science: Concepts and tools. ISRN Software Engineering, 2013.
[16] K. Wolstencroft, R. Haines, D. Fellows, A. Williams, D. Withers, S. Owen, S. Soiland-Reyes, I. Dunlop, A. Nenadic, P. Fisher, J. Bhagat, K. Belhajjame, F. Bacall, A. Hardisty, A. Nieva de la Hidalga, M. P. Balcazar Vargas, S. Sufi, and C. Goble. The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud. Nucleic Acids Research, 41(W1), pp. W557–W561, July 2013.
[17] V. Podpěcan, M. Zemenova, and N. Lavrač. Orange4WS environment for service-oriented data mining. Comput. J., 55(1), pp. 82–98, Jan. 2012.
[18] J. Demšar, T. Curk, A. Erjavec, Črt Gorup, T. Hočevar, M. Milutinovič, M. Možina, M. Polajnar, M. Toplak, A. Starič, M. Štajdohar, L. Umek, L. Žagar, J. Žbontar, M. Žitnik, and B. Zupan. Orange: Data mining toolbox in Python. Journal of Machine Learning Research, 14, pp. 2349–2353, 2013. Available: http://jmlr.org/papers/v14/demsar13a.html
[19] B. Ludscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E. A. Lee, J. Tao, and Y. Zhao. Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 18(10), pp. 1039–1065, 2006.
[20] H. Hiden, S. Woodman, P. Watson, and J. Cala. Developing cloud applications using the e-Science Central platform. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1983), January 2013.
[21] J. Kranjc, V. Podpečan, and N. Lavrač. ClowdFlows: A Cloud Based Scientific Workflow Platform. Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, vol. 7524, Springer, Heidelberg, Germany, 2012, pp. 816–819.
[22] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, G. B. Berriman, J. Good et al. Pegasus: A framework for mapping complex scientific workflows onto distributed systems. Scientific Programming, 13(3), pp. 219–237, 2005.
[23] G. Juve, E. Deelman, K. Vahi, G. Mehta, B. Berriman, B. P. Berman, and P. Maechling. Data Sharing Options for Scientific Workflows on Amazon EC2. Proc. Int. Conf. on High Performance Computing, Networking, Storage and Analysis (SC '10). IEEE, November 2010, pp. 1–9.
[24] P. Kacsuk, Z. Farkas, M. Kozlovszky, G. Hermann, A. Balasko, K. Karoczkai, and I. Marton. WS-PGRADE/gUSE generic DCI gateway framework for a large variety of user communities. J. Grid Comput., 10(4), pp. 601–630, Dec. 2012.
[25] M. Wilde, M. Hategan, J. M. Wozniak, B. Clifford, D. S. Katz, and I. Foster. Swift: A language for distributed parallel scripting. Parallel Computing, 37(9), pp. 633–652, September 2011.
[26] B. Giardine, C. Riemer, R. C. Hardison, R. Burhans, P. Shah, Y. Zhang, D. Blankenberg, I. Albert, W. Miller, W. J. Kent, and A. Nekrutenko. Galaxy: A platform for interactive large-scale genome analysis. Genome Res., 15, pp. 1451–1455, 2005.
[27] K. Maheshwari, A. Rodriguez, D. Kelly, R. Madduri, J. Wozniak, M. Wilde, and I. Foster. Enabling multi-task computation on Galaxy-based gateways using Swift. Proc. IEEE International Conference on Cluster Computing (CLUSTER 2013), Sept 2013, pp. 1–3.


Big Data and High Performance Computing
L. Grandinetti et al. (Eds.)
IOS Press, 2015
© 2015 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-61499-583-8-32

Big Data From Scientific Simulations

John Edwards, Sidharth Kumar, Valerio Pascucci
University of Utah
Corresponding author: [email protected]

Abstract. Scientific simulations often generate massive amounts of data used for debugging, restarts, and scientific analysis and discovery. Challenges that practitioners face using these types of big data are unique. Of primary importance is speed of writing data during a simulation, but this need for fast I/O is at odds with other priorities, such as data access time for visualization and analysis, efficient storage, and portability across a variety of supercomputer topologies, configurations, file systems, and storage devices. The computational power of high-performance computing systems continues to increase according to Moore's law, but the same is not true for I/O subsystems, creating a performance gap between computation and I/O. This chapter explores these issues, as well as possible optimization strategies, the use of in situ analytics, and a case study using the PIDX I/O library in a typical simulation.

Keywords. Big data, data storage, data visualization and analysis, parallel I/O

Introduction

Many scientific questions involve complex interactions between physical entities in complicated domains. Biology researchers might want to understand airflow patterns in mammalian lungs; car manufacturers need to understand how a vehicle will respond in different crash scenarios; climatologists predict the path and intensity of hurricanes; movie makers save money by rendering simulations of stormy seas. These, and many other important questions and tasks, involve complexities that can only rarely be solved using analytic, closed-form methods.

A simulation takes a number of inputs: the most important are equations and rules governing behaviors and interactions. The equations are most often in the form of partial differential equations, and rules range from collision detection codes to electron exchange mechanisms. Other inputs are the initial conditions (the state of the system at the beginning of the simulation), the resolution, and the number of time steps to take. Spatial resolution deals with how finely we want the state represented at each time step, and temporal resolution is the length of the time step. For example, a video game may run a real-time simulation of an exploding building. As the explosion must be simulated and rendered in real time, and since it will be running on commodity gaming hardware, the resolution is likely to be very low. That is, most fine details may be represented using other techniques (e.g., billboarding) while the simulation results are used only for the overall behavior. On the other hand, a simulation of a combustion engine requires extreme detail in order to design the engine for even small efficiency gains.

Generally, spatial and temporal resolution go hand-in-hand. But regardless of how resolution is balanced among the dimensions, the net effect on the data is the same: as resolution increases, the amount of data produced by the simulation increases. For example, if we store the entire system state at every time step, then halving the time step doubles the amount of data. If the spatial resolution is doubled in each dimension of a 3D simulation, then the amount of data jumps by a factor of 2^3 = 8.

Scientific simulations present particularly interesting challenges in terms of data. It is extremely rare for a scientific simulation to run in real time. As a result, simulations can be run at high resolutions, requiring days, weeks, and even months to complete. Further, these simulations are often not feasible on desktop hardware and require a computing cluster. Clusters present an additional level of complexity since the data, at simulation time, is distributed among compute nodes. Transferring the results to a storage medium, such as a hard drive, is, in general, a slow process. While computation speed is increasing as described by Moore's law, I/O subsystem speeds are not improving nearly as fast, so the gap between computation and I/O speeds continues to widen [1]. I/O library performance is affected by characteristics of the target machine, including network topology, memory, processors, file systems, and storage hardware. Performance is also affected by data characteristics, file formats, and tunable algorithmic parameters. Because the data may be extremely large, gigabytes and even terabytes at each timestep, the practitioner may choose to save the state of the system, or "dump the data," infrequently in order to avoid slowing the simulation down with costly I/O, as well as to save on storage space. In a simulation that dumps every n timesteps, we call n the "output interval".

It is important to understand what simulation data is used for. There are three general uses: scientific insight, restart, and debugging.

Scientific insight

The most obvious use of simulation data is to answer the scientific question at hand. For example, what did the universe look like 2 billion years ago, and what will it look like 2 billion years from now [2]? This question can be answered by running a large-scale simulation to completion and visualizing the results. However, to use only this result would be short-sighted. An additional answer that we can obtain essentially for free is "how could the universe evolve from now until 2 billion years in the future?" To answer this question, we simply dump state data as the simulation is running. The important concept here is that simulations not only answer questions, but they lead to additional questions to answer through visualization and analysis.

Restart

For any of a number of reasons, a simulation may fail prematurely. Maybe the code has a bug, or the supercomputer time allocation ran out, or a power failure occurred. Whatever the reason, the scientist probably will not want to restart the simulation from the beginning.
Most practitioners schedule frequent data dumps so that they can start where they left off if the simulation fails to complete. This is called checkpointing and restart.
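To get a feel for the storage cost that drives the choice of output interval and checkpoint frequency, consider a purely illustrative back-of-the-envelope estimate (all numbers here are assumptions, not taken from this chapter): a rectilinear run on a $1024^3$ grid storing 5 double-precision fields per cell, checkpointing every 10th of 10,000 timesteps, would write

$$ 1024^3 \times 5 \times 8\ \text{B} \approx 40\ \text{GiB per dump}, \qquad 40\ \text{GiB} \times \frac{10\,000}{10} \approx 40\ \text{TiB in total}, $$

which is why practitioners trade restart granularity against I/O time and storage space.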


Figure 1. Overview of the simulation/analysis pipeline.

Figure 2. Data may not be dumped at every simulation timestep, and different views of the data may be dumped at different times. Timesteps 0, 2, and 4 store full-resolution versions of the data. Timesteps 1 and 3 store decimated versions of the data.

Debugging

Scientific simulations usually comprise extremely complex code. As a result, the heaviest use of simulation output data is in developing and debugging the simulation itself. Simulation code may be written over a long period of time using well-understood data. Data dumps from test runs allow the developer to compare against expected results and debug the code. The ability to efficiently dump data at any given time step is critical in streamlining development.

Figure 2 shows an example of data dumps during a simulation. Dumps 1 and 5 are full-resolution state representations of the initial and final states of the system, respectively. Dump 3 is a checkpoint, or full-resolution data dump. Dumps 2 and 4 store only a subset of the data, either storing only certain regions, or storing at lower resolution, or a combination of the two.

Most simulations are done in one of three grid-based frameworks (Figure 3). The most common framework used in a simulation on a supercomputer is a Cartesian, regular, or rectilinear grid approach. The highly structured format of rectilinear simulations simplifies processor allocation and communication. However, it leads to some waste in resources, as less important areas of the simulation are treated with the same spatial resolution as more important areas. Adaptive Mesh Refinement (AMR) simulations attempt to solve this by simulating at higher resolutions in areas of interest, but still partitioning into rectangular cells. Mesh construction, processor allocation, and processor communication are necessarily more complex, but gains from adaptivity often justify the complexity. Unstructured grids use elements other than rectangles, often using simplices (triangles in 2D; tetrahedra in 3D). These types of simulations provide tremendous flexibility and are frequently used for many types of simulations, but present challenges when moving to a parallel environment, especially on a supercomputer.

Figure 3. Types of data structures used in simulation: (a) rectilinear, (b) AMR, (c) unstructured, (d) particle.

Rectilinear, AMR, and unstructured simulations typically solve a partial differential equation over a domain decomposition. Another type of simulation is particle simulation, where particles are followed in their paths timestep by timestep. These simulations require accurate computations of particle interactions and behaviors.

From a data management perspective, the type of decomposition of the domain is important both for how the data is stored on the storage device, and how the data is transferred to the storage device. For example, in a rectilinear simulation it may make sense to store some data dumps adaptively, while it makes little sense to store an unstructured grid as a Cartesian grid. As mentioned, the decomposition type also affects the size of the dumps; however, steps can be taken to reduce dump size in certain cases. If a suitable decimation strategy is possible, then some dumps for debugging or analysis can represent a low-resolution version of the data, or can represent a region of interest of the data.

1. Supercomputer Storage Infrastructure

Supercomputers are identified by their immense computational capability, stemming from their thousands to millions of parallel processing units. As an example, leadership-class machines such as the IBM Blue Gene/Q supercomputer at the Argonne National Laboratory [3] and the Cray XT system at the Oak Ridge National Laboratory [4] respectively consist of 750K and 225K processing units and have peak performances of 10 and 1.75 petaflops. Supercomputers often differ from each other with respect to architecture, interconnect network, operating system environment, file system and many other parameters. A common concept is that of processing units (cores) and nodes. Processors are grouped into nodes, which are linked together by the interconnect network. Compute nodes perform the simulation while I/O nodes provide I/O functionality. Compute nodes generally don't have access to the file system, so data is routed through I/O nodes, which interface with the file system software.


1.1. Parallel File System

The file system is the direct interface to the storage system of the supercomputer. Being able to use and access the file system optimally is key to I/O performance. File systems often have several tunable parameters, such as stripe count and size, that need to be set optimally. In the next subsections, we explore the two most commonly used file systems, Lustre [5] and GPFS [6], in more detail. Both Lustre and GPFS are scalable and can be part of multiple computer clusters with thousands of client nodes and several petabytes (PB) of storage. This makes these file systems a popular choice for supercomputing data centers, including those in industries such as meteorology, oil, and gas exploration.

1.1.1. Lustre

Lustre [5] is an open-source, high performance parallel file system that is currently used in over 60% of the Top 100 supercomputers in the world. It is designed for scalability and is capable of handling extremely large volumes of data and files with high availability and coordination of both data and metadata. For example, the Spider file system at Oak Ridge National Laboratory has 10.7 petabytes of disk space and moves data at 240 GB/second [7]. The architecture is based on storage of distributed objects. Lustre delegates block storage management to its back-end servers and eliminates significant scaling and performance issues associated with the consistent management of distributed block storage metadata.

1.1.2. GPFS

The General Parallel File System (GPFS) [6] is a parallel file system for high performance computing and data-intensive applications produced by IBM. It is based on a shared storage model. It automatically distributes and manages files while providing a single consistent view of the file system to every node in the cluster. The file system enables efficient data sharing between nodes within and across clusters using standard file system APIs and standard POSIX semantics. IBM is transitioning GPFS to a new product, Spectrum Scale [8], which uses GPFS technology and requires a Spectrum Scale server license from IBM for each node dedicated to the network file system.

1.2. Network Topology

Network topology describes how compute nodes in a supercomputer are connected, and plays an important role in data movement. Network topologies are designed to optimize internode communications for simulation, not I/O. Data movement for I/O typically is designed around the network topology. I/O libraries such as GLEAN [9] leverage the network topology in moving data between cores. Network topologies come in several varieties. The Mira supercomputer at Argonne National Laboratory is an IBM Blue Gene/Q system [10] that uses a 5D torus network for I/O and internode communication. The Edison machine employs the "Dragonfly" topology [11] for the interconnection network. This topology is a group of interconnected local routers connected to other similar router groups by high-speed global links. The groups are arranged such that data transfer from one group to another requires only one route through a global link. Processor counts in supercomputers continue to grow, and as a result it becomes ever more essential to exploit the structure and topology of the interconnect network.

2. Strategies in Storing Data

There are three considerations to be made in deciding on a file format for data storage. The first is I/O speed. Given a supercomputer configuration, we must consider how costly it is to transfer data from compute nodes to the storage medium in the storage format. The more closely a data format matches the data in memory, the more efficient the I/O will be. If the format doesn't match up well, then the data must be reformatted in memory, and the performance of this operation is subject to the network infrastructure. The second consideration is what the data will be used for. If the data is to be used for restarts, one format may be suitable, but if the data is to be visualized, an entirely different format may be best. The third consideration is size. The smallest size possible for uncompressed data is the size of the data itself. But this size may grow with metadata, replicated data, and unused data. Metadata is required in some form in all formats. It may simply describe the dimensions of the data, or more complex things like data hierarchy. Depending on the format, metadata size may be negligible, or it may rival the size of the data itself. Some file formats have replicated data. For example, a pyramidal scheme of a rectilinear grid may store pixel values at every level of the pyramid. Similarly, unused data, such as filler pixels in a data blocking scheme, can increase the storage footprint.

2.1. Number of Files

The number of actual files is also a consideration. In one approach, commonly called file-per-process I/O or N-N output, each processor writes to its own file; N is the number of processors. This approach limits our choice of file format. That is, many formats require global knowledge of the data, such as in the case of storing data hierarchically. If each processor writes its own file, then global knowledge of the data is not available and sophisticated formats cannot be used. Further, I/O nodes must write to a large number of files, creating a bottleneck. And finally, the number of files produced may also put a burden on downstream visualization and analysis software. Nevertheless, N-N strategies are popular due to their simplicity. N-1 approaches route all data to a single file. These approaches are more complex due to the required inter-node communication, but there is much more flexibility to optimize. A straightforward optimization is that of larger block writes to disk. That is, large chunks of data can be written at a time, making writes more efficient. Optimization in file structure is also possible. If the data is to be used for visualization, a hierarchical stream-optimized format such as IDX [12] may be used. Or, if only a subset of the data is needed, sampling or some other type of decimation may be done before the write. Subfiling approaches (N-M) provide ultimate flexibility, where the number of files is tuned for maximum performance. Subfiling is discussed in Section 4.3.

2.2. Formats and Libraries

The most popular file formats in large-scale simulation are of the N-1 variety, of which PnetCDF and HDF5 are the most common. Formats generally have an accompanying I/O library, easing use of a particular format. This section discusses both formats and libraries.

2.2.1. MPI-IO

MPI-IO is a standard, portable interface for parallel file I/O that was defined as part of the MPI-2 (Message Passing Interface) Standard in 1997. It can be used either directly by application programmers or by writers of high-level libraries as an interface for portable, high performance I/O in parallel programs. MPI-IO is an interface that sits above a parallel file system and below an application or high-level I/O library, as illustrated in Figure 6. Here it is often referred to as "middleware" for parallel I/O. MPI-IO is intended as an interface for multiple processes of a parallel program that are writing/reading parts of a single common file. MPI-IO can be used both for independent I/O, where all processes perform write operations independently of each other, and in collective I/O mode, where processes coordinate among each other and a few processes end up writing all the data. MPI-IO supports both blocking and nonblocking modes of I/O, with the latter providing the potential for overlapping I/O and computation.

2.2.2. HDF5

HDF5 [13], short for "Hierarchical Data Format, version 5", is designed at three levels: data model, file format and I/O library. The data model consists of abstract classes such as files, groups, datasets, and datatypes, which are instantiated in the form of a file format. The I/O library provides applications with an object-oriented programming interface that is powerful, flexible and highly performant. The data model allows storage of diverse data types, expressed in a customizable, hierarchical organization. The I/O library extracts performance by leveraging MPI-IO collective I/O operations for data aggregation. Owing to the very customizable nature of the format, HDF5 allows users to optimize for their data type, making performance tuning partially the responsibility of the user.

2.2.3. PnetCDF

Parallel NetCDF [14] is another popular high-level library with similar functionality to HDF5, but with a file format that is compatible with serial NetCDF from Unidata. PnetCDF is a single-shared-file approach (N-1), and is optimized for dense, regular datasets. It is inefficient for hierarchical or region of interest (ROI) data in both performance and storage, and so is used only in rectilinear simulation environments.
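To make the "middleware" role of MPI-IO described in Section 2.2.1 concrete, the following minimal sketch (not taken from this chapter; file name, block size, and hint values are assumptions) has every MPI process write its contiguous block of one shared N-1 file using a collective call. The commented hints show where file-system tuning such as Lustre stripe count and size can be passed through the MPI-IO layer; whether a given system honors them depends on its MPI implementation.

```c
/* Minimal N-1 (single shared file) collective write with MPI-IO.
 * Each rank owns BLOCK doubles and writes them at an offset derived
 * from its rank. Error checking omitted for brevity. */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK (1 << 20)   /* doubles per process (assumed size) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(BLOCK * sizeof(double));
    for (int i = 0; i < BLOCK; i++) buf[i] = (double)rank;

    /* Optional file-system hints (e.g., Lustre striping) passed to MPI-IO;
     * these are illustrative values, not a recommendation. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "8");     /* stripe count */
    MPI_Info_set(info, "striping_unit", "4194304"); /* 4 MB stripes */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "dump_t0042.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Collective write: all ranks participate, allowing the library to
     * aggregate small requests into large, well-aligned file accesses. */
    MPI_Offset offset = (MPI_Offset)rank * BLOCK * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, BLOCK, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Higher-level libraries such as HDF5 and PnetCDF ultimately issue calls of this kind underneath their own data models, which is why they can inherit the collective-I/O optimizations of the MPI-IO layer.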
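Likewise, here is a hedged sketch of how an application might use HDF5 on top of MPI-IO (Section 2.2.2). It assumes an HDF5 build with parallel support; the file and dataset names and sizes are invented for illustration, and error checking is omitted.

```c
/* Sketch: each rank writes one row of a shared 2D dataset in an HDF5
 * file through the MPI-IO driver, using a collective transfer property. */
#include <hdf5.h>
#include <mpi.h>

#define NCOLS 1024  /* assumed row length */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* File access property list: route HDF5 I/O through MPI-IO. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("state.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One dataset of nprocs x NCOLS doubles; rank r owns row r. */
    hsize_t dims[2] = { (hsize_t)nprocs, NCOLS };
    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate(file, "pressure", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Select this rank's row in the file and describe it in memory. */
    hsize_t start[2] = { (hsize_t)rank, 0 };
    hsize_t count[2] = { 1, NCOLS };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    double row[NCOLS];
    for (int i = 0; i < NCOLS; i++) row[i] = rank + 0.001 * i;

    /* Collective data transfer, so the MPI-IO layer can aggregate writes. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, row);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```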


2.2.4. ADIOS

ADIOS [15] is another popular library used to manage parallel I/O for scientific applications. One of the key features of ADIOS is that it decouples the description of the data, along with the transforms to be applied to that data, from the application itself. ADIOS supports a variety of back-end formats and plug-ins that can be selected at run time.

2.2.5. GLEAN

GLEAN [9], developed at Argonne National Laboratory, provides a topology-aware mechanism for improved data movement, compression, subfiling, and staging for I/O acceleration. It also provides interfaces for co-analysis and in-situ analysis, requiring little or no modification to the existing application code base.

2.2.6. PIDX

The PIDX I/O library [16,17] enables concurrent writes from multiple cores into the IDX format, a cache-oblivious multi-resolution data format inherently suitable for fast analytics and visualization. PIDX is an N − M approach, contributing to better performance. The number of files to generate can be adjusted based on the file system, so the approach extracts more performance out of parallel file systems and is customizable to specific file systems. Further, PIDX utilizes a customized aggregation phase, leveraging concurrency and leading to more optimized file access patterns. PIDX is naturally suited to multiresolution AMR datasets, adaptive region-of-interest (ROI) storage of rectilinear grids, and visualization and analysis. IDX does not need to store metadata associated with AMR levels or adaptive ROI; hierarchical and spatial layout characteristics are implicit.
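To make the middleware layer of Section 2.2.1 concrete, the following minimal C sketch (not taken from any of the libraries above; the file name, data size, and layout are illustrative assumptions) shows an N − 1 pattern in which every process writes its contiguous block of doubles into one shared file with a collective MPI-IO call.

```c
/* Minimal N-1 sketch: each rank writes its own contiguous block of doubles
 * into a single shared file using a collective MPI-IO write. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int local_count = 1 << 20;                 /* 1M doubles per process (assumed) */
    double *buf = malloc(local_count * sizeof(double));
    for (int i = 0; i < local_count; i++) buf[i] = rank;   /* dummy payload */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "dump.raw",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes at its own offset; the _all (collective) variant lets
     * the MPI-IO layer coordinate ranks and aggregate requests. */
    MPI_Offset offset = (MPI_Offset)rank * local_count * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, local_count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Switching MPI_File_write_at_all to MPI_File_write_at would turn this into independent I/O; the collective variant is what allows the MPI-IO layer to merge small, fragmented requests into larger ones.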

3. Visualization and Analysis

How data is stored has a direct impact on how it is visualized and analyzed. The vast majority of simulations undergo significant analysis after completion. Figure 4 shows examples of visualizations. We briefly describe three popular visualization packages.

3.1. VisIt and ParaView

VisIt [18] and ParaView [19] are popular distributed parallel visualization and analysis applications. They are typically executed in parallel, coordinating visualization and analysis tasks for massive simulation data. The data is typically loaded at full resolution, requiring large amounts of system memory. Both packages utilize a plugin-based architecture, so many formats are supported by both. They are open source and platform-independent and have been deployed on various supercomputers as well as on desktop operating systems such as Windows, Mac OS X, and Linux.


Figure 4. Visualizations using different software packages: (a) VisIt, (b) ParaView, (c) ViSUS.

3.2. ViSUS

ViSUS [20] is designed for streaming of massive data and uses hierarchical data representation technology, facilitating online visualization and analysis. Its design allows interactive exploration of massive datasets on commodity hardware, including desktops, laptops, and hand-held devices. Rather than running in a distributed environment, ViSUS supports thread-parallel operation. The IDX streaming data format is supported natively.

4. Optimizations

No single I/O strategy is suitable for all systems. Supercomputers come in all flavors of size, topology, and design. Ideal I/O libraries are flexible and customizable to the type of system they are running on.

4.1. Restructuring and Aggregation

Encoding is the process of organizing data in memory as it is going to be stored after the write. This simplifies the write to storage, but costs some memory, as the data will effectively be duplicated. Very often each node writes its data to storage directly. On systems with dedicated I/O nodes, data is transferred to the I/O nodes and then written. Two problems with these approaches are that, first, file format optimizations are challenging, and second, if a node has fragmented data, it will require many writes to get all the data written without overwriting data from other nodes. This becomes particularly expensive when the storage medium is a hard drive. Further, if the data is fragmented and the node encodes it into a single array, there will be many unused allocations. Two optimizations help alleviate these issues. The first is aggregation (see Fig. 5). After each node encodes its data, an inter-node communication phase called aggregation is executed, which reorganizes the data onto "aggregation nodes." The data is organized such that each aggregation node needs to perform only a single, large block write. Aggregation significantly speeds up the write to storage, but there still exists the problem of possible excessive memory usage when encoding. Restructuring is the process of reallocating spatial regions to nodes so that encoding arrays will have far fewer unused allocations, and so that aggregation


will require nodes to communicate with fewer aggregation nodes. For example, in Fig. 5, because restructuring was done, node P0 needs to send data only to aggregation node P4, instead of to both P4 and P5 as it would without restructuring. Further, much less memory needs to be allocated on P0 for encoding, since the data is not fragmented.

Figure 5. Restructuring and aggregation.
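The following C sketch is a simplified, assumed version of the aggregation phase of Fig. 5 (it is not PIDX code; the group size AGG and the equal-sized buffers are illustrative assumptions): every group of AGG ranks gathers its encoded buffers onto one aggregator rank, which then holds a single contiguous block ready for one large write.

```c
#include <mpi.h>
#include <stdlib.h>

#define AGG 8   /* ranks per aggregator; a tunable assumption */

/* Gather the encoded buffers of AGG consecutive ranks onto the first rank of
 * each group.  On return, that aggregator owns one contiguous block that it
 * can flush with a single large MPI-IO call (as in the N-1 sketch above). */
double *aggregate(const double *local, int local_count, MPI_Comm *agg_comm_out)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Group consecutive ranks; rank 0 of each group acts as the aggregator. */
    MPI_Comm agg_comm;
    MPI_Comm_split(MPI_COMM_WORLD, rank / AGG, rank, &agg_comm);

    int agg_rank, agg_size;
    MPI_Comm_rank(agg_comm, &agg_rank);
    MPI_Comm_size(agg_comm, &agg_size);

    double *gathered = NULL;
    if (agg_rank == 0)
        gathered = malloc((size_t)agg_size * local_count * sizeof(double));

    /* Equal-sized pieces for simplicity; uneven, restructured data would use
     * MPI_Gatherv with per-rank counts instead. */
    MPI_Gather(local, local_count, MPI_DOUBLE,
               gathered, local_count, MPI_DOUBLE, 0, agg_comm);

    *agg_comm_out = agg_comm;   /* caller frees with MPI_Comm_free */
    return gathered;            /* NULL on non-aggregator ranks */
}
```

Restructuring would happen before this step, ensuring each rank's local buffer is already a dense, contiguous region.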


4.2. I/O Forwarding

Previous sections have referred to I/O nodes, which perform I/O operations on behalf of compute nodes. I/O forwarding is the process of shipping I/O calls from compute nodes to I/O nodes in order to relieve compute nodes of potentially costly I/O operations (see Figure 6). I/O forwarding software is an interface between the file system and the remainder of the software stack, so any optimization made in the I/O forwarding layer is transparent to applications and high-level I/O libraries. The I/O forwarding subsystem optimizes by aggregating, rescheduling, and caching I/O requests.

Figure 6. I/O software stack with I/O forwarding.

4.3. Subfiling

The number of output files an application produces has a significant effect on I/O performance. This number can vary from a single shared file for all processes to one file per process. Too few files (N − 1, the shared-file case) or too many files (N − N, file per process) both result in I/O bottlenecks. Since I/O nodes manage the file metadata, too many files per I/O node or too many I/O nodes sharing a file both lead to a bottleneck in metadata management. As an example, on Mira, a Blue Gene/Q machine with a GPFS file system, Bui et al. [21] write one output file per I/O node to avoid inter-node communication and file metadata management overhead, resulting in better performance than the shared-file and file-per-process options (see the sketch below).

4.4. Parameter Learning

Realizing high I/O performance for a broad range of applications on all HPC platforms is a major challenge, in part because of complex inter-dependencies between I/O middleware and hardware (see Fig. 6). The parallel file system and I/O middleware layers all offer optimization parameters that in theory can result in optimal I/O performance. Unfortunately, it is not easy to derive a set of optimized parameters, as the best choice depends strongly on the application, HPC platform, problem size, and concurrency. In order to use HPC resources optimally, an auto-tuning system can hide the complexity of the I/O stack by automatically identifying parameters that accelerate I/O performance. Earlier work in autotuning I/O research has proposed analytical models, heuristic models, and trial-and-error based approaches, with the resulting models used to obtain optimal parameters. Unfortunately, all these methods have known limitations [22] and do not generalize well to a wide variety of settings. Modeling techniques based on machine learning overcome these limitations and build a knowledge-based model that is independent of the specific hardware, underlying file system, or custom library used. Because of this flexibility, machine learning techniques have been very successful at extracting complex relationships directly from training data. As an example, the PIDX I/O library builds a machine learning-based model using regression analysis [23] on data sets collected during previously conducted characterization studies. The model can predict performance and identify optimal tuning parameters for a given scenario. With approaches such as this, the quality of the performance model improves over time as more training data becomes available. Similarly, Parallel HDF5 [24] uses a genetic algorithm to search a large space of tunable parameters and to identify effective settings at all layers of the parallel I/O stack (see Figure 6).
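A minimal C sketch of the N − M subfiling pattern of Section 4.3 (the file naming and grouping are hypothetical, not taken from [21] or PIDX): the ranks are split into n_files groups with MPI_Comm_split, and each group performs a collective write into its own file. The number of subfiles is exactly the kind of parameter the auto-tuning approaches of Section 4.4 would select.

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical N-M layout: n_files groups of ranks, one shared file per group. */
void write_subfiles(const double *buf, int local_count, int n_files)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int color = rank % n_files;             /* which subfile this rank uses */
    MPI_Comm subcomm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &subcomm);

    int subrank;
    MPI_Comm_rank(subcomm, &subrank);

    char fname[64];
    snprintf(fname, sizeof fname, "dump.%04d.raw", color);

    /* Collective write within the group; offsets are per-group. */
    MPI_File fh;
    MPI_File_open(subcomm, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)subrank * local_count * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, local_count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Comm_free(&subcomm);
}
```

Setting n_files equal to the number of processes recovers file-per-process output, while n_files = 1 recovers the single shared file; the tuned value normally lies in between.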

5. Case Study

Traditional I/O libraries and corresponding data formats such as HDF5 and PnetCDF are general-purpose in nature and are generally not tailored to analysis and visualization. The chief advantage of the PIDX I/O library, on the other hand, is that it reorders individual samples in a manner that enables real-time multi-resolution visualization and analysis through use of the IDX data format. It is also tunable for different supercomputer configurations and has demonstrated excellent write performance [16].


This section follows the data from the setup of an S3D [25] uniform simulation of the lifted ethylene jet (one of the largest combustion simulations performed by S3D) through to visualization and analysis. This simulation has 16 fields of interest, including temperature, pressure, velocity and chemical species. The field we focus on is temperature. We choose the Hopper supercomputer [26] for the simulation. We set the simulation up to run for 2500 timesteps. Because full dumps are expensive, we choose to dump full-resolution checkpoint data (Fig. 7a) only every 100 timesteps, primarily for restart in case the simulation halts prematurely. Using an advanced I/O library like PIDX gives us options for intermediate dumps for monitoring and analysis. We set two thresholds on the temperature field, breaking the domain into three regions of high, medium, and low temperature. We are primarily interested in regions of high temperature, so we save those at full resolution, while medium temperature regions are saved at 1/64 resolution and low temperature regions are saved at 1/512 resolution (Fig. 7b). With this ability to store data adaptively at varying resolution, we save both storage space and compute time, making it possible to store intermediate snapshots every 10 timesteps. The two flame sheets most easily distinguished on the bottom of Fig. 7 burn very hot and thus get preserved at full resolution. The outside coflow on the left and right is heated by the central flame and thus resides in the medium temperature region. Finally, the channel in between the sheets contains the relatively cool fuel stream, which gets classified as low temperature. Together, this configuration creates many sharp resolution drops and isolated regions as well as a significantly uneven data distribution. The resulting adaptive resolution IDX output takes only 39% of the full resolution output time, while writing 30% of the 12.8 GB of full resolution data. As shown in Fig. 7c, the resulting volume rendering, using up-sampling to create a uniform resolution grid, preserves the regions of interest (ROI) almost perfectly while showing the expected artifacts, especially in the center of the flame. In order to monitor the simulation, we set up a server that updates the intermediate results in our visualization system. In this case, we use the ViSUS visualization framework (VisIt also has a plugin to support the IDX format) to see the updates as they come in. The data coming in is suitable for visualization as-is. However, we can upsample the data (Fig. 7c) to obtain a full-resolution version of the domain and run intermediate analyses, such as topological, statistical, or shape analyses. This allows us to start understanding the data even before the simulation has completed. If the simulation halts for some reason, we retrieve one of the checkpoint dumps, use that data as the initial conditions, and restart the simulation. Once the simulation is complete we transfer the intermediate dumps and final results from the supercomputer to our own storage.
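The threshold-based classification described above can be summarized in a few lines; the sketch below is purely illustrative (the function, thresholds, and level encoding are our own assumptions, not part of the S3D or PIDX code), mapping a block's maximum temperature to the down-sampling factor used for its dump.

```c
/* Hypothetical helper: pick a down-sampling factor for a block from its
 * maximum temperature, mirroring the three-region setup described above. */
enum { FULL_RES = 1, MEDIUM_RES = 64, LOW_RES = 512 };

int resolution_factor(double block_t_max, double t_high, double t_medium)
{
    if (block_t_max >= t_high)   return FULL_RES;    /* hot flame sheets: keep every sample */
    if (block_t_max >= t_medium) return MEDIUM_RES;  /* heated coflow: 1 sample in 64       */
    return LOW_RES;                                  /* cool fuel stream: 1 sample in 512   */
}
```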


Figure 7. Volume rendering of the temperature field of the lifted ethylene jet. (a) Full resolution data. (b) Adaptively sampled data. (c) Adaptively sampled data, up-sampled to create a uniform image.

6. In Situ Analytics

Many practitioners use data reduction, such as the IDX format, to handle massive amounts of data. Another approach is to perform analytics on the data during the simulation, with the analytics code running on the supercomputer nodes themselves; this is called in situ analytics. The big advantage of in situ analysis is that data I/O, typically a major bottleneck, can be reduced by transferring only analysis results in the timesteps between full checkpoint dumps. For example, one can use a parallel algorithm to compute merge trees [27], resulting in a vastly smaller version of the data. In situ analyses can include topological, decimated, shape-descriptive, statistical, and other summary views of the data. One shortcoming of in situ analytics is that the analysis code must be customized to the supercomputer running the simulation, and thus must address parallel programming, node communication, and scaling considerations. Decisions as to how much analysis to perform, and its tradeoffs with slowing down the simulation, must also be made.
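As a minimal illustration of shipping only analysis results (the statistic and bin count are assumptions; real in situ frameworks such as the merge-tree computation of [27] are far more sophisticated), the sketch below computes a global temperature histogram every analysis step and has rank 0 append just those few bytes to disk.

```c
#include <mpi.h>
#include <stdio.h>

#define NBINS 64

/* Reduce a per-node histogram of the temperature field to rank 0 and append
 * it to a small summary file; only NBINS counters leave the machine per step. */
void in_situ_histogram(const double *temp, int n, double t_min, double t_max,
                       int step)
{
    long local[NBINS] = {0}, global[NBINS];
    double width = (t_max - t_min) / NBINS;

    for (int i = 0; i < n; i++) {
        int b = (int)((temp[i] - t_min) / width);
        if (b < 0) b = 0;
        if (b >= NBINS) b = NBINS - 1;
        local[b]++;
    }

    MPI_Reduce(local, global, NBINS, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        FILE *f = fopen("temperature_hist.txt", "a");
        if (f) {
            fprintf(f, "step %d:", step);
            for (int b = 0; b < NBINS; b++) fprintf(f, " %ld", global[b]);
            fprintf(f, "\n");
            fclose(f);
        }
    }
}
```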

7. Conclusions

Big data from simulations is inextricably linked with the simulation itself. For example, if I/O is a major bottleneck in the simulation, only a small subset of state data may be stored and available for analysis. The data may be written in a format optimized for simulation I/O, and not for analysis, and some analysis may be better done in situ. In short, the entire simulation and analysis pipeline must be looked at holistically when addressing data considerations. Of course, data may be converted to desired formats, but this can come at considerable processing and storage expense, and is not a solution for recovering data lost to decimation. When planning a simulation run, data transfer and usage needs should be carefully weighed against I/O performance needs, and suitable I/O libraries, formats, and spatiotemporal resolution parameters should be chosen accordingly.


References

[1] Nawab Ali, Philip Carns, Kamil Iskra, Dries Kimpe, Samuel Lang, Robert Latham, Robert Ross, Lee Ward, and P. Sadayappan. Scalable I/O forwarding framework for high-performance computing systems. In Cluster Computing and Workshops, 2009. CLUSTER'09. IEEE International Conference on, pages 1–10. IEEE, 2009.
[2] Volker Springel, Simon D.M. White, Adrian Jenkins, Carlos S. Frenk, Naoki Yoshida, Liang Gao, Julio Navarro, Robert Thacker, Darren Croton, John Helly, et al. Simulating the joint evolution of quasars, galaxies and their large-scale distribution. arXiv preprint astro-ph/0504097, 2005.
[3] Preparing applications for Mira, a 10 PetaFLOPS IBM Blue Gene/Q system. http://www.alcf.anl.gov/files/PrepAppsForMira SC11 0.pdf.
[4] Cray XT5. http://en.wikipedia.org/wiki/Cray XT5.
[5] Lustre home page. http://lustre.org.
[6] Frank B. Schmuck and Roger L. Haskin. GPFS: A shared-disk file system for large computing clusters. In FAST, volume 2, page 19, 2002.
[7] Spider up and spinning connections to all computing platforms at ORNL. http://www.hpcwire.com/2009/07/09/spider up and spinning connections to all computing platforms at ornl/.
[8] IBM Spectrum Scale. http://public.dhe.ibm.com/common/ssi/ecm/dc/en/dcw03051usen/DCW03051USEN.PDF.
[9] Venkatram Vishwanath, Mark Hereld, Vitali Morozov, and Michael E. Papka. Topology-aware data movement and staging for I/O acceleration on Blue Gene/P supercomputing systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 19:1–19:11, New York, NY, USA, 2011. ACM.
[10] Dong Chen, Noel Eisley, Philip Heidelberger, Sameer Kumar, Amith Mamidala, Fabrizio Petrini, Robert Senger, Yutaka Sugawara, Robert Walkup, Burkhard Steinmacher-Burow, et al. Looking under the hood of the IBM Blue Gene/Q network. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 69. IEEE Computer Society Press, 2012.
[11] Edison Dragonfly topology. https://www.nersc.gov/users/computationalsystems/edison/configuration/interconnect/.
[12] Valerio Pascucci and Robert J. Frank. Global static indexing for real-time exploration of very large regular grids. In Conference on High Performance Networking and Computing, archive proceedings of the ACM/IEEE Conference on Supercomputing, 2001.
[13] HDF5 home page. http://www.hdfgroup.org/HDF5/.
[14] Jianwei Li, Wei-Keng Liao, Alok Choudhary, Robert Ross, Rajeev Thakur, William Gropp, Rob Latham, Andrew Siegel, Brad Gallagher, and Michael Zingale. Parallel netCDF: A high-performance scientific I/O interface. In Proceedings of SC2003: High Performance Networking and Computing, Phoenix, AZ, November 2003. IEEE Computer Society Press.
[15] J. Lofstead, S. Klasky, K. Schwan, N. Podhorszki, and C. Jin. Flexible IO and integration for scientific codes through the adaptable IO system (ADIOS). In Proceedings of the 6th International Workshop on Challenges of Large Applications in Distributed Environments, CLADE '08, pages 15–24, New York, June 2008. ACM.
[16] S. Kumar, V. Vishwanath, P. Carns, J.A. Levine, R. Latham, G. Scorzelli, H. Kolla, R. Grout, R. Ross, M.E. Papka, J. Chen, and V. Pascucci. Efficient data restructuring and aggregation for I/O acceleration in PIDX. In High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, pages 1–11, Nov 2012.
[17] Sidharth Kumar, John Edwards, Peer-Timo Bremer, Aaron Knoll, Cameron Christensen, Venkatram Vishwanath, Philip Carns, John A. Schmidt, and Valerio Pascucci. Efficient I/O and storage of adaptive-resolution data. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 413–423. IEEE Press, 2014.
[18] VisIt home page. https://wci.llnl.gov/codes/visit/.
[19] ParaView home page. http://www.paraview.org/.
[20] V. Pascucci, G. Scorzelli, B. Summa, P.-T. Bremer, A. Gyulassy, C. Christensen, S. Philip, and S. Kumar. The ViSUS visualization framework. In E. Wes Bethel, Hank Childs, and Charles Hansen, editors, High Performance Visualization: Enabling Extreme-Scale Scientific Insight. CRC Press, 2012.
[21] Huy Bui, Hal Finkel, Venkatram Vishwanath, Salma Habib, Katrin Heitmann, Jason Leigh, Michael Papka, and Kevin Harms. Scalable parallel I/O on a Blue Gene/Q supercomputer using compression, topology-aware data aggregation, and subfiling. In Parallel, Distributed and Network-Based Processing (PDP), 2014 22nd Euromicro International Conference on, pages 107–111. IEEE, 2014.
[22] Tyler Dwyer, Alexandra Fedorova, Sergey Blagodurov, Mark Roth, Fabien Gaud, and Jian Pei. A practical method for estimating performance degradation on multicore processors, and its application to HPC workloads. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, pages 83:1–83:11, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press.
[23] Sidharth Kumar, Avishek Saha, Venkatram Vishwanath, Philip Carns, John A. Schmidt, Giorgio Scorzelli, Hemanth Kolla, Ray Grout, Robert Latham, Robert Ross, et al. Characterization and modeling of PIDX parallel I/O for performance optimization. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, page 67. ACM, 2013.
[24] Babak Behzad, Huong Vu Thanh Luu, Joseph Huchette, Surendra Byna, Prabhat, Ruth Aydt, Quincey Koziol, and Marc Snir. Taming parallel I/O complexity with auto-tuning. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '13, pages 68:1–68:12, New York, NY, USA, 2013. ACM.
[25] J.H. Chen, A. Choudhary, B. de Supinski, M. DeVries, E.R. Hawkes, S. Klasky, W.K. Liao, K.L. Ma, J. Mellor-Crummey, N. Podhorszki, R. Sankaran, S. Shende, and C.S. Yoo. Terascale direct numerical simulations of turbulent combustion using S3D. In Computational Science and Discovery, Volume 2, January 2009.
[26] Hopper home page. https://www.nersc.gov/users/computational-systems/hopper/.
[27] Aaditya G. Landge, Valerio Pascucci, Attila Gyulassy, Janine C. Bennett, Hemanth Kolla, Jacqueline Chen, and Peer-Timo Bremer. In-situ feature extraction of large scale combustion simulations using segmented merge trees. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1020–1031. IEEE Press, 2014.

Big Data and High Performance Computing L. Grandinetti et al. (Eds.) IOS Press, 2015 © 2015 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-61499-583-8-47


Towards a Comprehensive Set of Big Data Benchmarks

Geoffrey C. FOX a,1, Shantenu JHA b, Judy QIU a, Saliya EKANAYAKE a and Andre LUCKOW b

a School of Informatics and Computing, Indiana University, Bloomington, IN 47408, USA
b RADICAL, Rutgers University, Piscataway, NJ 08854, USA
1 Corresponding Author.

Abstract. This paper reviews the Ogre classification of Big Data applications, with 50 facets divided into four groups or views. These four correspond to Problem Architecture, Execution mode, Data source and style, and the Processing model used. We then look at multiple existing or proposed benchmark suites and analyze their coverage of the different facets, suggesting a process to obtain a complete set. We illustrate this by looking at parallel data analytics benchmarked on multicore clusters.

Keywords. Big Data, Benchmarking, Analytics, Database

Introduction

We propose a systematic approach to Big Data benchmarking based on a recent classification (Ogres) of applications using a set of facets or features divided into 4 dimensions: Problem Architecture (or Structure), Execution mode, Data source, storage and access, and the Processing algorithms used. This is reviewed in Section 1 and summarized in a detailed Table given in the Appendix. We analyze many existing and proposed benchmark sets in Section 2 and show how they cover the set of facets. We give some examples of benchmarking data analytics on clusters in Section 3 and propose further steps in Section 4.

1. Overview of Ogres

1.1. What is an Ogre?

The Berkeley Dwarfs [1] were an important step towards defining an exemplar set of parallel (high performance computing) applications. The recent NRC report [2] gave Seven Computational Giants of Massive Data Analysis, which start to define critical types of data analytics problems. We propose Ogres [3-5] as an extension of these ideas, based on an analysis by NIST of 51 big data applications [6, 7]. Big Data Ogres provide a systematic approach to understanding applications, and as such they have facets which represent key characteristics defined both from our experience and from a bottom-up study of features from several individual applications. The facets capture common characteristics which are inevitably multi-dimensional and often overlapping. We note that in HPC, the Berkeley Dwarfs were very successful as patterns but were not adopted as a standard benchmark set. Rather, the NAS Parallel Benchmarks [8], Linpack [9], and (mini-)applications played this role. This suggests that benchmarks do not follow directly from patterns, but the latter can help by allowing one to understand the breadth of applications covered by a benchmark set.

Figure 1: The 4 Ogre Views and their Facets

1.2. Ogres have Facets

We suggest Ogres possess properties that we classify in four distinct dimensions or views. Each view consists of facets; when multiple facets are linked together, they describe classes of big data problems represented as an Ogre. One view of an Ogre is the overall problem architecture (labelled AV), which is naturally related to the machine architecture needed to support data-intensive applications. Then there is the execution (computational) features view (labelled EV), describing issues such as I/O versus compute rates, the iterative nature of the computation, and the classic V's of Big Data: problem size, rate of change, etc. The data source & style view (labelled DV) includes facets specifying how the data is collected, stored and accessed. The final processing view (labelled PV) has facets which describe classes of processing steps, including algorithms and kernels. Ogres are specified by the particular values of a set of facets linked from the different views. The views are illustrated in Figure 1 and listed in Table A in the Appendix. Table A also lists in its last 3 columns a measure of the coverage of two types of benchmarks – those of SPIDAL from Section 2.1 and those of the database survey given in Section 2.2 – with the rightmost column listing the application coverage of the NIST survey [6, 7]. While the ordering of facets along the four dimensions does not follow a strict rule, we tend to keep related facets nearby, such as EV4 to EV7. Also, AV1 to AV5 represent forms of MapReduce that can cover a broad range of applications. Loosely, the ordering (from lowest to highest label) goes from facets of general interest to those more specific to the application. For example, EV8 to EV14 are specific to the application while EV1 to EV7 are more general. In our language, instances of Ogres can form benchmarks. One can consider composite or atomic (simple, basic) benchmarks. For example, a clustering benchmark is an instance of an Ogre with a Map-Collective facet in the Problem Architecture view and the machine learning sub-facet in the Processing view. The Execution view describes properties that could vary for different clustering algorithms and would often be measured in a benchmarking process. Note a simple benchmark like this could ignore the Data Source & Style view and just be studied for in-memory data. Alternatively we can consider a composite benchmark linking clustering to different data storage mechanisms. A given benchmark can be associated with multiple facets in a single view, i.e. clustering has other problem architecture facets including SPMD, BSP, and Global Analytics.
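One way to read this classification operationally: an Ogre instance is a small record listing, for each view, which facets it exercises. The C sketch below is our own illustration of that idea (the enum values and the clustering example are hypothetical shorthand, not an official encoding from the Ogre papers).

```c
#include <stdio.h>

/* Bit flags for a few facets in two of the four views (illustrative subset). */
enum ArchFacet { AV_PLEASINGLY_PARALLEL = 1 << 0, AV_MAP_COLLECTIVE = 1 << 1,
                 AV_SPMD = 1 << 2, AV_BSP = 1 << 3, AV_GLOBAL_ANALYTICS = 1 << 4 };
enum ProcFacet { PV_MICRO_BENCHMARK = 1 << 0, PV_MACHINE_LEARNING = 1 << 1,
                 PV_GRAPH_ALGORITHM = 1 << 2 };

struct OgreInstance {
    const char *name;
    unsigned arch_facets;   /* Problem Architecture view (AV) */
    unsigned proc_facets;   /* Processing view (PV)           */
};

int main(void)
{
    /* A clustering benchmark as described in the text: Map-Collective, SPMD,
     * BSP and Global Analytics in the architecture view, machine learning in
     * the processing view. */
    struct OgreInstance clustering = {
        "clustering",
        AV_MAP_COLLECTIVE | AV_SPMD | AV_BSP | AV_GLOBAL_ANALYTICS,
        PV_MACHINE_LEARNING
    };
    printf("%s: arch facets 0x%x, processing facets 0x%x\n",
           clustering.name, clustering.arch_facets, clustering.proc_facets);
    return 0;
}
```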

2. Particular Benchmarks as instances of Ogres

Our approach suggests choosing benchmarks from Ogre instances that cover a diverse range of facets. Rather than trying to be comprehensive at this stage, we give some examples. Note that kernel benchmarks are instances of Ogre Processing facets classified as PV2 to PV13; this is where the NAS parallel benchmarks or TeraSort [10] would fit. Further, we have micro-benchmarks such as MPI ping-pong and SPEC [11] as facet PV1; these give measures of Ogre execution facets EV1 to EV5. In sections 2.1, 2.2 and 2.3 we go through three different sources of benchmarks, comparing them to facets.

2.1. SPIDAL Library as Ogre Benchmarks

We are part of a recently started NSF project from the DIBBs (Data Infrastructure Building Blocks) program where one can use Ogres to classify Building Blocks that are part of the SPIDAL Scalable Parallel Interoperable Data Analytics Library, which is the focus of this program. In Table 1, we list the proposed library members from this project, with more details available [12, 13]. Note each problem can provide benchmarks for many different execution view facets. For example, the first set (Graph Analytics) are all instances of the Graph Algorithm facet of the Processing view and either the Map Point-to-Point and/or Shared Memory facets in the Problem Architecture view. In the second group of spatial kernels, we find queries from the Search/Query/Index and MapReduce facets, Spatial Clustering from the Global Machine Learning, Map-Collective and Global Analytics facets, and Distance-based queries from the Pleasingly Parallel and Search/Query/Index facets. These benchmarks all have the spatial data abstraction facet and could be involved with the GIS facet. In the Machine Learning in general and image processing categories we find several clustering algorithms illustrating the O(N), O(N^2), and Metric (non-metric) space execution view facets; Levenberg-Marquardt Optimization and SMACOF Multi-Dimensional Scaling with Linear Algebra Kernels and the Expectation Maximization/Least Squares characteristic in the optimization methodology facet from the Processing view; and TFIDF Search and Random Forest with Pleasingly Parallel facets. All exhibit the machine learning sub-facet of the Processing view. In the last 3 columns of Table 1, we quantify this, listing facets for each SPIDAL library member for three of the four facet views. The analytics focus of this project implies little overlap with the Data Source & Style view, and some of the entries are preliminary estimates which need more study. Further, the SPIDAL project reviewed the Apache libraries MLlib and Mahout in choosing their library members.

Algorithm

Applications

Problem Execution Arch View View

Processing View

Graph Analytics GA 1 Community detection

Social networks, webgraph

2 Subgraph/motif finding 3 Finding diameter

Webgraph, biological/social networks Social networks, webgraph

4 Clustering coefficient

Social networks

5 Page rank

Webgraph

6 Maximal cliques 7 Connected component

Social networks, webgraph Social networks, webgraph

8 Betweenness centrality

Social networks

9 Shortest path

Social networks, webgraph

1 2 3 4 1 2 3

Spatial Queries and Analytics SQA GIS/social Spatial relationship based networks/pathology queries informatics (add GIS in data Distance based queries view) Spatial clustering Spatial modeling Core Image Processing IP Image preprocessing Computer vision/pathology Object detection & informatics segmentation Image/object feature

3, 4, 7 4, 7 4, 7 4, 7

9S, 10I, 11, 3, 9ML, 13 12G 9D, 10I, 12G 3, 9ML, 13 9D, 10I, 12G 9S, 10I, 11, 12G 9S, 10I, 11, 12V 9D, 10I, 12G 9D, 10I, 12G 9D, 10I, 12G, 13N 9D, 10I, 12G, 13N

3, 9ML, 13 3, 9ML, 13

2

12P

6

1 3, 7, 8 1

12P 12P 12P

2 3, 9ML,EM 2

1 1

13M 13M

2 2, 9ML

1

13M

2, 9ML

3, 4, 7 4, 7 4, 7 6 6

3, 9ML, 12, 13 3, 9ML, 13 3, 9ML, 13 9ML, 13 9ML, 13


computation 4 3D image registration 5 Object matching 6 3D feature extraction General (Core) Machine Learning 1 DA Vector Clustering

Accurate Clusters

2 DA Non-metric Clustering

Accurate Clusters, Biology, Web

1 1 1

13M 13N 13N

2, 9ML 2, 9ML 2, 9ML

3, 7, 8

9D, 10I, 11, 12V, 13M, 14N 9S, 10R, 11, 12BI, 13N, 14NN 9D, 10I(Elkan), 11, 12V, 13M, 14N 9D, 10R, 11, 12V, 14NN

9ML, 9EM, 12

3, 7, 8

3, 7, 8 3

Kmeans; Basic, Fuzzy and Elkan

Fast Clustering

4

Levenberg-Marquardt Optimization

Non-linear Gauss-Newton, use in MDS

5 DA, Weighted SMACOF

MDS with general weights

6 TFIDF Search

Find nearest neighbors in document corpus

3, 7, 8

3, 7, 8

1

Find pairs of documents with 3, 7, 8 TFIDF distance below a threshold 3, 7, 8 8 Support Vector Machine SVM Learn and Classify 7 All-pairs similarity search

1

9 Random Forest

Learn and Classify

10 Gibbs sampling (MCMC)

Solve global inference problems

3, 7, 8

3, 7, 8 Latent Dirichlet Allocation LDA Topic models (Latent factors) 11 with Gibbs sampling or Var. Bayes 3, 7, 8 Singular Value Decomposition Dimension Reduction and 12 PCA SVD 13 Hidden Markov Models (HMM)

Global inference on sequence 3, 7, 8 models

9ML, 9EM, 12 9ML, 9EM

9ML, 9NO, 9LS, 9EM, 12 9S, 10R, 11, 9ML, 9NO, 12BI, 13N, 9LS, 9EM, 12, 14 14NN 2, 9ML 9S, 10R, 12BI, 13N, 14N 9ML 9S, 10R, 12BI, 13N, 14NN 9S, 10R, 11, 7, 8, 9ML 12V, 13M, 14N 2, 7, 8, 9ML 9S, 10R, 12BI, 13N, 14N 9S, 10R, 11, 9ML, 9NO, 12BW, 13N, 9EM 14N 9S, 10R, 11, 9ML, 9EM 12BW, 13N, 14N 9S, 10R, 11, 9ML, 12 12V, 13M, 14NN 9S, 10R, 11, 2, 9ML, 12 12BI

Table 1: The proposed members of SPIDAL library [12] and the Ogre facets that they support. There are no SPIDAL library members directly addressing Data Source & Style View (except spatial analytics and GIS) and so that is omitted.

In some of the facets, we have used abbreviations to denote subfacets. For example in PV9, Optimization Methodology, we use ML = Machine Learning, NO = Nonlinear Optimization, LS = Least Squares, EM = expectation maximization, LQP = Linear/Quadratic Programming, CO = Combinatorial Optimization. The execution view EV uses abbreviations for EV12 Data Abstraction (K = key-value, BW = bag of words, BI = bag of items, P = pixel/spatial, V = vectors/matrices, S = sequence, G = graph), while EV facets 9, 10, 13 and 14 use simple abbreviations given in Figure 1. These are spelt out in the appendix Table A.

2.2. Well Established Data Systems and Database Benchmarks

Big Data has an excellent base set of benchmarks coming from the long-established efforts of the database community with important industry contributions. We build on Baru and Rabl's excellent tutorial [14], which has a thorough discussion of benchmarks including the TPC series [15], HiBench [16], the Yahoo Cloud Serving Benchmark [17], BigDataBench [18], BigBench [19] and the Berkeley Big Data Benchmark [20] that quantify the Ogre Data Source & Style facets. We summarize these and other important benchmarks from Europe [21], SPEC [22], other micro-benchmarks [23, 24], and analytics and social networking studies [25-28] in the following sections.

2.2.1. Micro Benchmarks

The SPEC benchmarks [22] are well known here and cover a variety of areas including CPU, HPC, servers, power, virtualization and the Web. SPEC has set up a Big Data working group [29] that will further improve SPEC benchmarks in this area. There are several studies of I/O performance, including use of flash storage [23] and HDFS [24]. These types of benchmarks correspond to Ogre facets that include DV3-4 and PV1-2.

2.2.2. Enterprise Database Benchmarks: TPC

The Transaction Processing Performance Council (TPC) [15] benchmarks are well-known and central to the database industry. TPC covers multiple areas, including OLTP (online transaction processing) with the TPC-C and TPC-E sets. Business intelligence is represented by the warehouse TPC-H benchmark, with a non-trivial fixed schema and an arbitrary scale factor from 1GB to 100TB. There is also the database-oriented TPC-DS benchmark featuring nontrivial processing. The TPCx-HS Benchmark [30] is aimed at Hadoop systems. These benchmarks correspond to Ogre facets AV2, EV10, DV1, 2, 4 and PV3, 6.

2.2.3. Enterprise Database Benchmarks: BigBench

BigBench [31, 32] is an industry-led effort to define a comprehensive Big Data benchmark that emerged with a proposal appearing in the first Workshop on Big Data Benchmarking (WBDB) [33]. It is a paper-and-pencil specification, but comes with a reference implementation to get started. BigBench models a retailer and benchmarks 30 queries around it covering 5 business categories depicted in the McKinsey report [34]. The retailer data model in BigBench addresses the three V's – volume, variety, and velocity – of Big Data systems. It covers variety by introducing structured, semi-structured, and unstructured data in the model. While the first is an adaptation of the TPC-DS benchmark's data model, the semi-structured data represents the click stream on the site, and the unstructured data denotes product reviews submitted by users [35]. Volume and velocity are covered with a scale factor in the specification. BigBench is aimed at modern parallel databases like Hive, Impala and Shark and covers Ogre facets AV2, EV4-6, 9, 10, DV1, 2, 4 and PV3, 6. This illustrates a movement of traditional


structured data benchmarks to semi-structured and unstructured data, although this area, with other aspects such as multimedia, needs more study.

2.2.4. Enterprise Database Benchmarks: Yahoo Cloud Serving Benchmark

The Yahoo Cloud Serving Benchmark [17] benchmarks the basic CRUD operations (insert, read, update, delete, scan) of major NoSQL key-value stores: Accumulo, Cassandra, Dynamo, HBase, HyperTable, JDBC, MongoDB, Redis and Voldemort. This exhibits Ogre facets DV1, 4 and PV6.

2.2.5. Enterprise Database Benchmarks: Berkeley Big Data Benchmark

The Berkeley Big Data Benchmark [20] investigates parallel SQL and Hadoop environments: Redshift, Hive, Shark, Impala and Stinger/Tez. It takes workloads and 4 distinct SQL-style queries from an earlier work from Brown University and collaborators (called CALDA) that produced a similar Hadoop benchmark [36, 37]. This shows Ogre facets AV2, 12, EV9, 10, DV1, 2, 4 and PV3, 6.

2.2.6. HiBench and Hadoop Oriented Benchmarks from Database to Analytics

Here we summarize several Hadoop-oriented benchmarks, with HiBench [16, 38] as the most comprehensive. It has five components:
1. Micro-benchmarks including Sort, WordCount, and TeraSort, which is a Hadoop-based sort benchmark [10] from Apache. Further, the HDFS DFSIO benchmark [39] is enhanced as EnhancedDFSIO.
2. A Web search benchmark built around Apache Nutch (web crawler) Indexing, and PageRank.
3. Some machine learning with Bayesian Classification (training) and Kmeans Clustering from Apache Mahout [40].
4. OLAP analytical queries with Join and Aggregation from the Hive performance benchmark [41].
5. An ETL-Recommendation Pipeline, recently added, which updates the structured web sales data and the unstructured web logs data, and then recalculates the up-to-date item-item similarity matrix, which is the core of online recommendation [42].
Other Hadoop benchmarks include one [43] from IBM that includes Terasort and the trace-based SWIM [44, 45] (Statistical Workload Injector for MapReduce), a benchmark representing a real-world big data workload developed by the University of California at Berkeley in close cooperation with Facebook. Gridmix [46] is another Hadoop trace-based benchmark, and Terasort is extended [47] with several related benchmarks testing I/O subsystems: GraySort, MinuteSort, and CloudSort. The work of [48] seems similar to the analytics and HDFS side of HiBench. Indiana University has several papers on benchmarks of iterative and classic MapReduce extending the analytics side of HiBench and merging with the current facet analysis [49-55]. The benchmarks in this subsection exhibit facets AV2, 3, 7, 8, 12, EV9, 10, DV1, 2, 4 and PV1, 3, 5, 6, 7, 8, 9, 12.


2.2.7. Integrated Datasystem Benchmarks: BigDataBench

The integrated BigDataBench suite from China [18] has a growing number of components, with version 3.1 covering the search, social network, e-commerce, multimedia and bioinformatics domains. Kernels and micro-benchmarks include database read, write, scan, sort, grep, wordcount, BFS (breadth first search), index, PageRank, Kmeans, connected components, collaborative filtering, naive Bayes and the bioinformatics codes SAND and BLAST. BigDataBench is hosted on HBase, MySQL, Nutch, MPI, Spark, Hadoop, Hive, Shark, and Impala. Facets probed are AV1, 2, 3, 4, 7, 8, 12, EV11, DV1, 2 and PV1, 2, 3, 5, 6, 7, 8, 9, 11, 12, 13.

2.2.8. Integrated Datasystem Benchmarks: CloudSuite

The CloudSuite [21, 56] benchmark collection from Europe has some distinctive features, including use of the Faban (from SPEC) [57] workload generator, a simulator, and provision of benchmarks as ready-to-go virtual machine images. It covers data analytics based on a standard Wikipedia Mahout+Hadoop benchmark with a Bayes classifier, plus graph analytics with TunkRank from GraphLab [58]. Data caching has a streaming simulated Twitter test using memcached but not Apache Storm. Data serving is based on the Yahoo work in Section 2.2.4, while other applications include media streaming, software testing, web search and web serving. Software covered includes Darwin, Cloud9, Nutch, Tomcat, Nginx, Olio and MySQL. Facets include AV1, 2, 4, 7, 8, 12, EV9, 10, 12, DV1, 2, 4 and PV1, 2, 3, 6, 7, 9, 13.

2.3. Machine Learning, Graph and Other Benchmarks

The processing view has the well-known Graph500 [25] benchmarks (and associated machine ranking), but of course libraries like R [59], Mahout [40] and MLlib [60] also include many candidates for analytics benchmarks. Section 2.1 covered a rich set of analytics and Section 2.2 largely database benchmarks (with some modest analytics); here we cover other analytics benchmarks. The benchmarks in Section 2.3 exhibit Ogre facets AV2, 3, 4, 6, 7, 8, EV12, DV2, 4 and PV2, 3, 7, 8, 9, 12, 13.

2.3.1. Graph500 Benchmarks

There are [25] currently two kernels and 6 problem sizes from 17GB to 1.1PB, which are used to produce the Graph 500 ranking of supercomputers. The first kernel constructs the graph and the second does a breadth-first search (BFS); a serial sketch of the BFS kernel is given after Section 2.3.2. This covers facets AV2, 4, 6, 7, 8, EV4 and PV1, 3, 13. Note that there are several excellent libraries with a rich set of graph algorithms, including Oracle PGX [61], GraphLab [58], Intel GraphBuilder [62], GraphX [63], CINET [64], and Pegasus [65].

2.3.2. Minebench

Minebench [26] is a comprehensive data-mining library with 15 members covering five categories: classification, clustering, association rule mining, structure learning and optimization. OpenMP implementations are given for many kernels. There are also specialized machine learning libraries such as Caffe [66], Torch [67] and Theano [68] for deep learning that can form the basis of benchmarks.
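The breadth-first search at the heart of the Graph500 kernel mentioned in Section 2.3.1 can be sketched serially in a few lines of C (the adjacency-list layout and function names are our own illustration; the real benchmark is a tuned, distributed-memory implementation).

```c
#include <stdlib.h>

/* Graph in compressed adjacency-list form: the neighbours of vertex v are
 * adj[xadj[v]] .. adj[xadj[v+1]-1].  Returns the BFS parent array. */
int *bfs(int nverts, const int *xadj, const int *adj, int root)
{
    int *parent = malloc(nverts * sizeof(int));
    int *queue  = malloc(nverts * sizeof(int));
    for (int v = 0; v < nverts; v++) parent[v] = -1;

    int head = 0, tail = 0;
    parent[root] = root;
    queue[tail++] = root;

    while (head < tail) {                      /* level-synchronous order */
        int v = queue[head++];
        for (int e = xadj[v]; e < xadj[v + 1]; e++) {
            int w = adj[e];
            if (parent[w] == -1) {             /* first visit: record parent */
                parent[w] = v;
                queue[tail++] = w;
            }
        }
    }
    free(queue);
    return parent;                             /* caller frees */
}
```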


2.3.3. BG and LinkBench Benchmarks

BG [27] emulates read and write actions performed on a social networking datastore, mimicking small transactions on Facebook, and benchmarks them against a given service level agreement. LinkBench [28], developed by Facebook, is intended to serve as a synthetic benchmark to predict the performance of a database system serving Facebook's production data.

3. Illustrating Ogres with Initial Benchmarking

Figure 2: Comparison of OpenMPI 1.8.1 C, OpenMPI 1.8.1 Java and FastMPJ 1.0_6 for four MPI collectives – SendReceive, Broadcast, AllReduce and AllGather – with message sizes ranging from 0 bytes (B) up to one megabyte (MB). These are averaged values over patterns 1x1x8, 1x2x8, and 1x4x8, where the pattern format is number of concurrent tasks (CT) per process x number of processes per node x number of nodes (i.e. TxPxN).

3.1. SPIDAL Codes

This section looks at two SPIDAL clustering codes, GML1 and GML2 of Table 2, corresponding to metric and non-metric space scenarios [69]. Both use deterministic annealing (DA) and are believed to be the best available codes for cases when accurate clusters are needed [70] – non-metric DA pairwise clustering (DA-PWC) [71] and the metric DA vector sponge (DA-VS) [72, 73]. Both were originally written in C# and built on MPI.NET and threads running on Windows HPC clusters. They have now been converted to Java and actually get better performance, sequentially and in parallel, than the original C# versions. More details are available online [74], and the parallel performance of DA-VS has been presented in detail for the C# version [75]. We use DA-VS to motivate micro-benchmarks but focus on DA-PWC here.


3.2. Benchmarking Environment

We used two IU School of Informatics and Computing clusters, Madrid and Tempest, and one FutureGrid [76] cluster, India, as described below.
• Tempest: 32 nodes, each with 4 Intel Xeon E7450 CPUs at 2.40GHz with 6 cores, totaling 24 cores per node; 48 GB node memory and a 20Gbps Infiniband (IB) network connection. It originally ran Windows Server 2008 R2 HPC Edition – version 6.1 (Build 7601: Service Pack 1). Currently it runs Red Hat Enterprise Linux release version 5.11 (Tikanga).
• Madrid: 8 nodes, each with 4 AMD Opteron 8356 CPUs at 2.30GHz with 4 cores, totaling 16 cores per node; 16GB node memory and a 1Gbps Ethernet network connection. It runs Red Hat Enterprise Linux Server release 6.5.
• India cluster on FutureGrid (FG): 128 nodes, each with 2 Intel Xeon X5550 CPUs at 2.66GHz with 4 cores, totaling 8 cores per node; 24GB node memory and a 20Gbps IB network connection. It runs Red Hat Enterprise Linux Server release 5.10.
Both our clustering codes are written to use a mix of MPI and thread parallelism and used Microsoft TPL for thread parallelism in a C# .NET 4.0 and MPI.NET 1.0.0 environment. There was no consensus Java OpenMP package we could use to replace TPL for the Java codes. We chose to use a novel Java parallel tasks library, the Habanero Java (HJ) library from Rice University [77, 78], which requires Java 8.

3.3. Micro-benchmarks


Figure 3: Comparison of MPI performance on machines with Infiniband (FG FutureGrid) and without an Infiniband Network (Madrid) for MPI a) AllReduce and b) SendReceive. These are averaged values over patterns 1x1x8, 1x2x8, and 1x4x8 where pattern format is number of concurrent tasks (CT) per process x number of processes per node x number of nodes (i.e. TxPxN)

One important issue for data analytics is that many important codes are not in the C++/Fortran ecosystem familiar from HPC. In particular, there are no well-established Java technologies for the core parallel computing technologies MPI and OpenMP. There have been several message passing frameworks for Java [79], but the choice is restricted if support for the Infiniband (IB) network is needed, as discussed in [80]. The situation has clarified recently as OpenMPI now has an excellent Java binding [81], which is an adaptation of the original mpiJava library [82]. This uses wrappers around the C MPI library, and we also evaluate FastMPJ 1.0_6, which is a pure Java implementation of the mpiJava 1.2 specification [83] and supports IB, as does OpenMPI, where we use version 1.8.1 unless specified otherwise. The SPIDAL codes DA-VS and DA-PWC rely heavily on a few MPI operations – AllReduce, SendReceive, Broadcast, and AllGather. We studied their Java performance against native implementations with micro-benchmarks adapted from the OSU micro-benchmark suite [84]. Initially C outperformed Java, but around November 2013 support was added (around OpenMPI trunk r30301) to represent data for communication as Java direct buffers (outside the managed heap), avoiding the earlier Java JNI and object serialization costs and resulting in similar performance between the Java and C versions of MPI, as reported in Figure 2. Note that FastMPJ has good performance for large messages but, especially for AllReduce, is significantly slower than OpenMPI for messages below 1 KB in size.
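The collective micro-benchmarks behave roughly like the C sketch below (our own simplified analogue of the OSU-style loop, not the actual benchmark code; the Java measurements use the corresponding calls in the OpenMPI Java binding): time many repetitions of MPI_Allreduce at a given message size and report the average.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Average time per MPI_Allreduce of 'count' doubles over 'iters' repetitions. */
double time_allreduce(int count, int iters)
{
    double *in  = calloc(count, sizeof(double));
    double *out = calloc(count, sizeof(double));

    /* Warm-up, then synchronize so every rank starts the timed loop together. */
    MPI_Allreduce(in, out, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(in, out, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double elapsed = (MPI_Wtime() - t0) / iters;

    free(in);
    free(out);
    return elapsed;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int count = 1; count <= (1 << 17); count *= 2) {   /* up to ~1 MB */
        double t = time_allreduce(count, 100);
        if (rank == 0)
            printf("%8zu bytes  %10.2f us\n", count * sizeof(double), t * 1e6);
    }
    MPI_Finalize();
    return 0;
}
```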

Figure 4: DA-PWC performance as a function of problem size for C# and Java. The Java results come from FutureGrid and the C# results from Tempest, with the C# results scaled by 0.7 to reflect the measured relative sequential performance of the machines. We used 32 nodes, each running with 8-way parallelism (MPI internally and between nodes), totaling 256-way parallelism.

Figure 5: Speedup of 40K-point DA-PWC on 8, 16 and 32 nodes for the case where 8 MPI processes run on each node. The Java results come from FutureGrid and the C# results from Tempest, with the C# results scaled by 0.7. Results are scaled to the performance of the 8-node run.

Figures 3 (a) and (b) show MPI AllReduce and SendReceive performance with and without Infiniband (IB). While the decrease in communication times with IB is as expected, the near-identical performance of Java and the native benchmark in both the IB and Ethernet cases is promising for the goal of high performance Java libraries. These figures use OMPI-trunk (r30301), which is older than OMPI 1.8.1 but has similar characteristics because it already used direct buffers, which are the main improvement over earlier versions of OpenMPI.

3.4. SPIDAL Clustering Benchmarks

Here we do not discuss FastMPJ, as our Java DA-PWC implementation gave frequent runtime errors with this MPI version. We have performed [74] an extensive performance analysis of the two clustering codes, comparing C# and Java and looking at different parallelism choices in the nodes: MPI or threading. We give some illustrative results here. DA-PWC should scale in execution time like the square of the problem size, and Figure 4 shows Java results consistent with this; the 40K problem runs in less than 16 times the 10K execution time because the smaller problem has a relatively larger communication overhead.
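For reference, the quadratic model behind this comparison (a back-of-the-envelope check, not a fitted result from the measurements):

```latex
T(N) \propto N^{2}
\quad\Longrightarrow\quad
\frac{T(40\mathrm{K})}{T(10\mathrm{K})} \approx \left(\frac{40{,}000}{10{,}000}\right)^{2} = 16 ,
```

so a measured ratio below 16 indicates that the smaller run carries a proportionally larger communication cost.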


The C# results are not consistent with the expected model and are illustrative of other anomalies we found with C#. We did not explore this further, as Microsoft has abandoned this platform. Figure 5 examines this for fixed problem size (strong scaling), increasing parallelism from 64 to 256.

Figure 6: Strong scaling expressed as speedup and efficiency on FutureGrid for 3 datasets (12K, 20K, 40K) as a function of parallelism from 1 to 256. Apart from the sequential case, all runs ran 8-way parallel on each node; results are averaged over the node-parallelism choices of 8 MPI processes, 2 threads x 4 MPI processes, or 4 threads x 2 MPI processes.

Figure 6 summarizes speedups and parallel efficiencies for all datasets across parallelisms from 1 to 256. The intention of this is to illustrate the behavior with increasing parallelism for different data sizes. It shows that DA-PWC scales well with the increase in parallelism up to the limit of 256 for the 20K and 40K data sets. The 12K data set shows reasonable scaling between 8- and 64-way parallelism, but not outside this range. This is the usual issue that small problems increase their communication fraction as parallelism increases under strong scaling.

Figures 7 and 8 use an existing clustering algorithm [73] and compare the use of threads and MPI processes, with runs labelled TxPxN where T threads and P MPI processes are run in each of N nodes. In Figure 7, we compare PxQx8 with QxPx8 for the (P,Q) choices (1,2), (1,4), (2,4) and (1,8), with the best performance occurring at 1x8x8 for C# and 1x4x8 for Java.

Figure 7: Comparison of C# and Java with MPI and Threads for DA-VS SPIDAL clustering of 240K points and about 25000 clusters. The Java results come from FutureGrid OpenMPI trunk r30301 and C# from Tempest, with C# results scaled by 0.7 to reflect measured relative sequential performance of machines. The runs are labelled TxPxN where T threads and P MPI processes are run in each of N nodes.

Figure 8 looks at DA-PWC with parallelisms from 1 to 32 realized in different choices between threads and MPI. There are a set of 6 speedup groups for the same parallelism – 1, 2, 4, 8, 16, and 32. These lead to plateaus in the plot corresponding to MPI and threads giving similar speedups, except for the 8x1xN cases, which show lower performance. This effect occurs because the machine (India) has 2 physical CPUs, each with 4 cores, so running 1 process with more than 4 concurrent tasks appears to introduce thread switches between CPUs, which is expensive. The best approach in this case is to restrict the number of threads to be