Software Ecosystems: Tooling and Analytics


Table of Contents:
Foreword
Preface
How This Book Originated
Who Contributed to This Book
What This Book Is Not About
Who This Book Is Intended For
How This Book Is Structured
Acknowledgments
Contents
Contributors
Acronyms
1 An Introduction to Software Ecosystems
1.1 The Origins of Software Ecosystems
1.2 Perspectives and Definitions of Software Ecosystems
1.3 Examples of Software Ecosystems
1.3.1 Digital Platform Ecosystems
1.3.2 Component-Based Software Ecosystems
1.3.3 Web-Based Code Hosting Platforms
1.3.4 Open-Source Software Communities
1.3.5 Communication-Oriented Ecosystems
1.3.6 Software Automation Ecosystems
1.4 Data Sources for Mining Software Ecosystems
1.4.1 Mining the GitHub Ecosystem
1.4.2 Mining the Java Ecosystem
1.4.3 Mining Software Library Ecosystems
1.4.4 Mining Other Software Ecosystems
1.5 The CHAOSS Project
1.6 Summary
References
Part I Software Ecosystem Representations
2 The Software Heritage Open Science Ecosystem
2.1 The Software Heritage Archive
2.1.1 Data Model
2.1.2 Software Heritage Persistent Identifiers (SWHIDs)
2.2 Large Open Datasets for Empirical Software Engineering
2.2.1 The Software Heritage Datasets
2.2.1.1 The Software Heritage Graph Dataset
2.2.1.2 Accessing Source Code Files
2.2.1.3 License Dataset
2.3 Research Highlights
2.3.1 Enabling Artifact Access and (Large-Scale) Analysis
2.3.2 Software Provenance and Evolution
2.3.3 Software Forks
2.3.4 Diversity, Equity, and Inclusion
2.4 Building the Software Pillar of Open Science
2.4.1 Software in the Scholarly Ecosystem
2.4.2 Extending the Scholarly Ecosystem Architecture to Software
2.4.3 Growing Technical and Policy Support
2.4.4 Supporting Researchers
2.5 Conclusions and Perspectives
References
3 Promises and Perils of Mining Software Package Ecosystem Data
3.1 Introduction
3.2 Software Package Ecosystem
3.3 Data Sources
3.4 Promises and Perils
3.4.1 Planning What Information to Mine
3.4.2 Defining Components and Their Dependencies
3.4.3 Defining Boundaries and Completeness
3.4.4 Analyzing and Visualizing the Data
3.5 Application: When to Apply Which Peril
3.5.1 Two Case Studies
3.5.2 Applying Perils and Their Mitigation Strategies
3.6 Chapter Summary
References
Part II Analyzing Software Ecosystems
4 Mining for Software Library Usage Patterns Within an Ecosystem: Are We There Yet?
4.1 Introduction
4.2 Example of API Usage Patterns in Software Libraries
4.3 Usages as Sets of Frequent Co-occurrences
4.4 Usages as Pairs or Subsequences of APIs via Software Mining
4.5 Graph Representation for Usage Patterns via Static Analysis
4.5.1 Object Usage Representation
4.5.2 Graph-Based API Usage Pattern Mining Algorithm
4.5.2.1 Important Concepts in Graph-Based Usage Pattern Mining
4.5.2.2 Overview of GrouMiner Algorithm
4.5.2.3 Detailed GrouMiner Algorithm
4.5.3 API Usage Graph Pattern Mining
4.5.3.1 Semantic-Aware API Usage Pattern Mining with MUDetect
4.5.4 Cooperative API Usage Pattern Mining Approach
4.5.5 Probabilistic API Usage Mining
4.5.6 API Usage Mining via Topic Modeling
4.5.7 Mining for Less Frequent API Usage Patterns
4.6 Applications of Usage Patterns
4.6.1 Graph-Based API Usage Anomaly Detection
4.6.2 Pattern-Oriented Code Completion
4.6.3 Integration of API Usage Patterns
4.7 Conclusion
References
5 Emotion Analysis in Software Ecosystems
5.1 What Is a Software Ecosystem?
5.2 What Is Emotion?
5.3 Why Would One Study Emotions in Software Engineering?
5.4 How to Measure Emotion?
5.4.1 Tools
5.4.2 Datasets
5.5 What Do We Know About Emotions and Software Ecosystems?
5.5.1 Ecosystems as Communication Platforms
5.5.1.1 Stack Overflow
5.5.1.2 GitHub
5.5.2 Ecosystems as Interrelated Projects
5.5.2.1 GitHub
5.5.2.2 Apache
5.5.2.3 Other Ecosystems
5.6 What Next?
5.7 What Have We Discussed in This Chapter?
References
Part III Evolution Within Software Ecosystems
6 Analyzing Variant Forks of Software Repositories from Social Coding Platforms
6.1 Introduction
6.2 State of the Art
6.3 Motivations for Variant Forking on Social Coding Platforms
6.3.1 Technical
6.3.2 Governance
6.3.3 Legal
6.3.4 Other Categories
6.4 Mining Variant Forks on GitHub
6.4.1 The Different Types of Variant Forks
6.4.2 How to Mine Variant Forks?
6.4.3 What Are Divergent Variants?
6.5 Challenges of Maintaining Variant Forks
6.6 Research Roadmap
6.6.1 Recommendation Tools
6.6.2 Shareable Updates Among Variants
6.6.3 Transplantation Tools
6.7 Conclusion
References
7 Supporting Collateral Evolution in Software Ecosystems
7.1 Introduction
7.2 Supporting Collateral Evolution in Linux Kernel
7.2.1 Recommending Code Changes for Automatic Backporting of Linux Device Drivers
7.2.2 Spinfer: Inferring Semantic Patches for the Linux Kernel
7.2.3 Other Studies
7.3 Supporting Collateral Evolution in Android
7.3.1 An Empirical Study on Deprecated-API Usage Update in Android
7.3.1.1 Datasets
7.3.1.2 Results
7.3.2 Example-Based Automatic Android Deprecated-API Usage Update
7.3.2.1 Design of CocciEvolve
7.3.2.2 Dataset and Evaluation Results
7.3.3 Data-Flow Analysis and Variable Denormalization-Based Automated Android API Update
7.3.3.1 AndroEvolve Architecture
7.3.3.2 Evaluation of AndroEvolve
7.3.4 Other Studies
7.4 Supporting Collateral Evolution in ML Libraries
7.4.1 Characterizing the Updates of Deprecated ML API Usages
7.4.1.1 Datasets
7.4.1.2 Update Operations to Migrate Deprecated API Usages
7.4.2 Automated Update of Deprecated Machine Learning APIs
7.4.2.1 Architecture of MLCatchUp
7.4.2.2 Evaluating MLCatchUp on Updating Deprecated APIs
7.4.3 Other Studies
7.5 Open Problems and Future Work
7.6 Conclusion
References
Part IV Software Automation Ecosystems
8 The GitHub Development Workflow Automation Ecosystems
8.1 Introduction
8.1.1 Collaborative Software Development and Social Coding
8.1.2 The GitHub Social Coding Platform
8.1.3 Continuous Integration and Deployment
8.1.4 The Workflow Automation Ecosystems of GitHub
8.2 Workflow Automation Through Development Bots
8.2.1 What Are Development Bots?
8.2.2 The Role of Bots in GitHub's Socio-technical Ecosystem
8.2.3 Advantages of Using Development Bots
8.2.4 Challenges of Using Development Bots
8.3 Workflow Automation Through GitHub Actions
8.3.1 What Is GitHub Actions?
8.3.2 Empirical Studies on GitHub Actions
8.3.3 The GitHub Actions Ecosystem
8.3.4 Challenges of the GitHub Actions Ecosystem
8.4 Discussion
References
9 Infrastructure-as-Code Ecosystems
9.1 Introduction
9.2 Docker and Its Docker Hub Ecosystem
9.2.1 Introduction to Containerization
9.2.2 The Docker Containerization Tool
9.2.3 The Docker Hub Ecosystem
9.2.3.1 Types of Images Collected on Docker Hub
9.2.3.2 Image Metadata Maintained on Docker Hub
9.2.4 Approaches to Analyzing Docker Hub Images
9.2.4.1 Docker Hub Metadata Analysis
9.2.4.2 Static Analysis of Dockerfiles and Docker Images
9.2.4.3 Dynamic Analysis of Dockerfiles and Docker Images
9.2.5 Empirical Insights from Analyzing Docker Hub Images
9.2.5.1 Technical Lag and Security in the Docker Hub Ecosystem
9.2.5.2 Technical Debt and Code Smells in Dockerfiles
9.2.5.3 Challenges in Maintaining and Evolving Dockerfiles
9.3 Ansible and Its Ansible Galaxy Ecosystem
9.3.1 Introduction to Configuration Management
9.3.2 The Ansible Configuration Management Tool
9.3.2.1 Ansible Plays and Playbooks
9.3.2.2 Ansible Roles
9.3.3 The Ansible Galaxy Ecosystem
9.3.3.1 Types of Ansible Galaxy Content
9.3.3.2 Types of Metadata Maintained by Ansible Galaxy
9.3.4 Approaches to Analyzing Ansible Galaxy
9.3.4.1 Ansible Galaxy Metadata Analysis
9.3.4.2 Static Analysis of Ansible Infrastructure Code
9.3.4.3 Dynamic Analysis of Ansible Infrastructure Code
9.3.5 Empirical Insights from Analyzing Ansible Infrastructure Code
9.3.5.1 Code Smells and Quality in the Ansible Galaxy Ecosystem
9.3.5.2 Defect Prediction for the Ansible Galaxy Ecosystem
9.3.5.3 Evolution Within the Ansible Galaxy Ecosystem
9.4 Conclusion
References
Part V Model-Centered Software Ecosystems
10 Machine Learning for Managing Modeling Ecosystems: Techniques, Applications, and a Research Vision
10.1 Introduction
10.2 Background in Machine Learning
10.2.1 Supervised Learning
10.2.2 Unsupervised Learning
10.2.3 Reinforcement Learning (RL)
10.3 Literature Review
10.3.1 Methodology
10.3.2 Query String
10.3.3 Inclusion and Exclusion Criteria
10.3.4 Manual Labelling
10.3.5 Results
10.4 Existing Machine Learning Applications in MDE
10.4.1 Model Assistants
10.4.2 Model Classification
10.4.3 Model Refactoring
10.4.4 Model Repair
10.4.5 Model Requirements
10.4.6 Model Search
10.4.7 Model Synthesis
10.4.8 Model Transformation Development
10.4.9 Others
10.5 A Roadmap for the Deployment of ML in MDE
10.5.1 Data Privacy Management
10.5.2 Detecting Technical Debt
10.5.3 Adversarial Machine Learning
10.5.4 Mining Time Series Data
10.6 Conclusion
References
11 Mining, Analyzing, and Evolving Data-Intensive Software Ecosystems
11.1 Introduction
11.2 Mining Techniques
11.2.1 Introduction
11.2.2 Static Analysis of Relational Database Accesses
11.2.3 Static Analysis of NoSQL Database Accesses
11.2.4 Reflections
11.3 Analysis Techniques
11.3.1 Introduction
11.3.2 Static Analysis Techniques
11.3.2.1 Example 1: SQLInspect—A Static Analyzer
11.3.2.2 Example 2: Preventing Program Inconsistencies
11.3.3 Visualization
11.3.3.1 Introduction
11.3.3.2 Example 1: DAHLIA
11.3.3.3 Example 2: m3triCity
11.3.4 Reflections
11.4 Empirical Studies
11.4.1 Introduction
11.4.2 The (Joint) Use of Data Models and Technologies
11.4.3 Prevalence, Impact, and Evolution of SQL Bad Smells
11.4.4 Self-Admitted Technical Debt in Database Access Code
11.4.5 Database Code Testing (Best) Practices
11.4.6 Reflections
11.5 Conclusion
References



Tom Mens • Coen De Roover • Anthony Cleve Editors

Software Ecosystems Tooling and Analytics

Editors
Tom Mens, Dept. Informatique, FS, University of Mons, Mons, Belgium
Coen De Roover, Software Languages Lab, Vrije Universiteit Brussel, Brussels, Belgium
Anthony Cleve, PReCISE research center, Namur Digital Institute (NaDI), University of Namur, Namur, Belgium

ISBN 978-3-031-36059-6
ISBN 978-3-031-36060-2 (eBook)
https://doi.org/10.1007/978-3-031-36060-2

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 Chapter 2 is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). For further details see license information in the chapter. This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable.

Foreword

Emacs or vi(m)? Even with the integration of Visual Studio Code (VSCode) inside GitHub, there is no end in sight for the quintessential editor wars. Since the mid-1980s, thousands of online, mostly futile, discussions and flamewars have focused on which editor is the best for coding and other text processing tasks. At the surface level, this long-standing debate seems to focus merely on factors like the graphical user interface of the editors, modal vs. chord-based keyboard handling, or availability of the editors on a variety of operating systems. However, to actual users of these tools the choice for a given editor really is based on two key criteria: (1) the ability to customize the editor to their specific workflow via third-party, open-source extensions (or "plugins") and (2) a sense of community and belonging among users and providers of third-party extensions. Extensibility is built into each of these editors via an underlying scripting language (from Emacs' underlying Lisp interpreter and vim's Vim script to VSCode's TypeScript) and a dedicated plugin API. These two elements aim to seamlessly blend default functionality shared by all users (basic text manipulation) with highly custom functionality of interest to a fraction of the userbase (support for different version control systems, advanced text completion mechanisms, etc.).

At the time of writing this foreword (March 2023), the popular MELPA package archive (https://melpa.org), one of several archives for Emacs, lists 5,391 plugins, vim.org 5,888 vim plugins, and the VSCode Marketplace (https://marketplace.visualstudio.com) even 44,330 extensions! These extensions do not come out of thin air but build on each other, both in terms of technical dependencies and underlying ideas, based on strong interactions between communities of extension developers, users, and enthusiasts. Dedicated fora on Reddit or Stack Overflow feature thousands of enthusiasts exchanging ideas, workflow suggestions, ad hoc customizations, bug fixes, and plans for future extensions. Furthermore, GitHub is rife with people sharing their personal configuration files ("emacs configuration" resulted in 6,480 hits), or even standardizing them into official, supported distributions or variants of their editor (e.g., Doom Emacs or Prelude Emacs).


Code contributed by such distributions as well as individual extensions can then be picked up by the developer community of the underlying editor, making its way into the upstream project. Apart from code contributions, artists also chime in to submit new themes or styles to tailor the visual aspects of their favorite editor. In other words, each editor has its own software ecosystem, essentially consisting of a base technology (the editor itself), technical components (the editor extensions) depending on that technology and each other, and social interactions between each component's communities. As such, the point of the editor wars is not about choosing the "best" base technology but about buying into the "best" editor ecosystem, where the definition of "best" refers to health-related measures such as the sustainability of an ecosystem, the absence of long-standing bugs, the rate of innovation, as well as the degree to which the values of the ecosystem's community match with those of a given user. In that respect, the so-called editor wars are actually not that different from "competition" between other ecosystems, such as mobile app frameworks and their respective app stores (e.g., iOS versus Android ecosystems), programming languages with their third-party library support (e.g., JavaScript's npm vs. Java's Maven Central ecosystems), open-source Linux distributions (e.g., Fedora vs. Ubuntu ecosystems), or even software infrastructure technologies (e.g., Docker Hub vs. Kubernetes). In all these cases, it is not about the base technology or product; it is all about the community, technical interactions, and value creation surrounding these.

This seamlessly brings us to the topic of this book: given a software ecosystem, how can one measure and monitor its health, innovation, value, etc. in a consistent and effective manner? What kinds of data sources are available to gauge an ecosystem's development practices, evolution, and internal conflicts, both among ecosystem contributors and between competing ecosystem projects? How can this data be obtained, cleaned, and preprocessed? What kinds of analyses should be performed, and which models could be used? How should the resulting findings be interpreted, and how can they impact the state of the practice of software ecosystems?

For one, these questions address the essential concerns practitioners have when trying to select the right ecosystem for their business, to plan maintenance activities, or to identify important health risks that might impact (parts of) their ecosystems. This is very clear from current initiatives like the Linux Foundation's CHAOSS project (https://chaoss.community), which stands for "Community Health Analytics in Open-Source Software." Together with industry, various CHAOSS working groups have developed a catalog of (thus far) 79 health metrics at the level of individual software projects, only 9 of which capture some aspect of health at the ecosystem level. Hence, in order to get insights into new ecosystem-level health metrics and to automate their measurement, this book is indispensable.
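To make the kind of automated health measurement hinted at above more concrete, the following sketch computes two simple project-level indicators from a locally cloned git repository: the number of distinct contributors per month and a naive "bus factor." It is a toy illustration rather than an implementation of any CHAOSS metric; the repository path is a placeholder and the 50% threshold is an arbitrary assumption.

# Toy health indicators mined from a local git clone (illustrative only).
# Assumes the `git` command-line tool is installed and `repo_path` exists.
import subprocess
from collections import Counter, defaultdict

def read_commits(repo_path):
    """Return (author, "YYYY-MM") pairs, one per commit in the clone."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:%an|%ad", "--date=format:%Y-%m"],
        capture_output=True, text=True, check=True).stdout
    return [tuple(line.split("|", 1)) for line in log.splitlines() if "|" in line]

def monthly_contributors(commits):
    """Distinct commit authors per month: a crude activity/sustainability signal."""
    per_month = defaultdict(set)
    for author, month in commits:
        per_month[month].add(author)
    return {month: len(authors) for month, authors in sorted(per_month.items())}

def bus_factor(commits, share=0.5):
    """Smallest number of authors that together account for `share` of all commits."""
    counts = Counter(author for author, _ in commits)
    total, covered, factor = sum(counts.values()), 0, 0
    for _, n in counts.most_common():
        covered, factor = covered + n, factor + 1
        if covered >= share * total:
            break
    return factor

if __name__ == "__main__":
    commits = read_commits("path/to/some/clone")  # placeholder path
    print(monthly_contributors(commits))
    print("naive bus factor:", bus_factor(commits))

Real health analytics would combine many more signals (issues, reviews, releases, downstream dependents), but even this minimal repository mining already illustrates the data-driven perspective this book takes.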


At the same time, researchers and students need to learn about the state of the art in this domain in order to study ever more challenging software ecosystem topics. What has attracted me to this research domain and also led me to participate in an international research project on ecosystem health with researchers of Wallonia (Belgium) and Quebec (Canada) is the need for interdisciplinary research. Research labs in sociology, biology, information sciences, etc. have developed highly sophisticated theories and metaphors that provide unique perspectives on ecosystem health. Yet, these labs lack the software engineering and software analytics backgrounds required to empirically validate such theories. Again, a book like this one prepares the reader for exactly this purpose.

Finally, what will the future bring to the domain of software ecosystems and its researchers? Similar to how software ecosystems are more than just a set of individual open-source projects, ecosystems of ecosystems are more than just an agglomeration of individual ecosystems. Going back to our example ecosystem of editors, the recent breakthrough of Microsoft's Language Server Protocol (LSP) has shown ways in which innovation not only propagates from one ecosystem (VSCode) to its competing ecosystems but also that it can spawn a new, interacting ecosystem of language servers as a side effect. Such meta-phenomena can have disruptive side effects on software ecosystems, hence necessitating thorough empirical research.

At the same time, it will be fascinating to understand how the role of AI will impact existing ecosystems. While, thus far, ecosystem communities consist of actual humans assisted by bots for rote automation of tedious tasks, the introduction of AI agents, whose contributions might be hard to distinguish from human contributions, has the potential to disrupt or even sabotage today's ecosystem dynamics. Once more, the advent of AI in software ecosystems provides another major opportunity to validate existing theories of ecosystem dynamics using the techniques presented by this book.

Kingston, ON, Canada
March 2023

Bram Adams (proud user of the Emacs ecosystem)

Preface

The discipline of software engineering emerged in 1968 as a result of the first international conference on software engineering that took place in Garmisch (Germany) and that was sponsored by the NATO Science Committee. Over a period spanning several decades, the discipline has given rise to increasingly advanced processes and tool support for maintaining and evolving ever more complex and interconnected software products. Software engineering tools offer support for a wide range of activities, including project management, version control, configuration management, collaborative coding, quality assurance, dependency management, continuous integration and deployment, containerization, and virtualization.

Since the seminal book "Software Ecosystem" by Messerschmitt and Szyperski in 2003, software ecosystems have become a very active topic of research in software engineering. As the different chapters of this book will reveal, software ecosystems exist in many different forms and flavors, so it is difficult to provide a unique encompassing definition. But a key aspect of such software ecosystems is that software products can no longer be considered or maintained in isolation, since they belong to ever more interconnected and interdependent networks of coevolving software components and systems. This was enabled by technological advances in various domains such as component-based software engineering, global software development, and cloud computing. The ever-increasing importance of collaborative online "social" coding platforms, aiming to develop software in a collaborative way, has made software ecosystems indispensable to software practitioners, in commercial as well as open-source settings. This has led to the widespread use and popularity of large registries of reusable software libraries for a wide variety of programming languages, operating systems, and project-specific software communities.


How This Book Originated

The idea to write this book originated from a large inter-university fundamental research project – called SECOAssist (see secoassist.github.io) – that studied the technical aspects of software ecosystems, in order to provide tools and techniques to assist contributors, maintainers, managers, and other ecosystem stakeholders in their daily activities. This project was financed by the Belgian regional science foundations F.R.S.-FNRS and FWO-Vlaanderen under the "Excellence of Science" seal. It took place from 2018 until 2023 and spawned many research results in the form of scientific publications, PhD dissertations, open-source tools, and datasets. The three co-editors of this book, Tom Mens, Coen De Roover, and Anthony Cleve, were principal investigators of the project together with Serge Demeyer, leading the research efforts of their respective teams.

The conducted research was quite diverse, covering a wide range of software ecosystems and focusing on challenging technical and sociotechnical aspects thereof. Among others, we empirically analyzed a wide range of maintenance issues related to software ecosystems of reusable software libraries. Example maintenance issues studied include outdatedness; security vulnerabilities; semantic versioning; how to select, replace, and migrate software libraries; and contributor abandonment. We also investigated advanced testing techniques such as test amplification and test transplantation and how to apply them at the ecosystem level. We studied sociotechnical aspects in the GitHub ecosystem, including the phenomenon of forking, the use of development bots, and the integrated CI/CD workflow infrastructure of GitHub Actions. We also analyzed issues around database usage and migration in ecosystems for data-intensive software. Finally, we explored maintenance issues in a range of emerging software ecosystems, such as the Docker Hub containerization ecosystem, the Infrastructure-as-Code ecosystem forming around Ansible Galaxy, the Q&A ecosystem of Stack Overflow, the OpenStack ecosystem, and the GitHub Actions ecosystem.
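To give a flavor of the library-level "outdatedness" and semantic-versioning analyses mentioned above, the following sketch checks whether a declared dependency constraint still admits the latest release published on PyPI. It is an illustrative example of our own rather than a SECOAssist tool; the package name and constraint are made up, and the third-party packaging library is assumed to be installed.

# Illustrative dependency "outdatedness" check against the PyPI registry.
# The package name and version constraint below are hypothetical examples.
import json
import urllib.request

from packaging.specifiers import SpecifierSet  # pip install packaging
from packaging.version import Version

def latest_pypi_version(package):
    """Fetch the latest released version of `package` via the PyPI JSON API."""
    url = f"https://pypi.org/pypi/{package}/json"
    with urllib.request.urlopen(url) as response:
        return Version(json.load(response)["info"]["version"])

def check_dependency(package, constraint):
    """Report whether `constraint` (a PEP 440 specifier) still admits the latest release."""
    latest = latest_pypi_version(package)
    admitted = SpecifierSet(constraint).contains(str(latest), prereleases=False)
    status = "still admits the latest release" if admitted else "excludes the latest release (outdated)"
    print(f"{package}: latest is {latest}; constraint '{constraint}' {status}")

if __name__ == "__main__":
    check_dependency("requests", ">=2.25,<3")  # hypothetical dependency declaration

Scaled up to the millions of packages and dependency constraints found in registries such as npm, CRAN, or PyPI, analyses of this kind are typical of the ecosystem-level studies discussed in later chapters.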

Who Contributed to This Book

This book aims to further expand upon and report about the software ecosystems research that has been conducted in recent years, going well beyond the results achieved by the SECOAssist research project. To this end, we invited some of the most renowned researchers worldwide who have made significant contributions to the field. Their names and affiliations can be found just before the book's table of contents. The chapters they have contributed focus on the nature of particular software ecosystems or on domain-specific tooling and analyses to understand, support, and improve those ecosystems.

Following the spirit of open science and collaborative development practices, we used a GitHub repository during the process of writing and reviewing the book chapters. Each contributor had access to the material of each chapter, and each chapter was peer-reviewed by at least three different contributors using the repository's discussion forum.

What This Book Is Not About

Previously published books on the topic of software ecosystems have primarily focused on the business aspects of software ecosystems:

• D.G. Messerschmitt, C. Szyperski (2003) Software Ecosystem: Understanding an Indispensable Technology and Industry. MIT Press
• K.M. Popp, R. Meyer (2010) Profit from Software Ecosystems: Business Models, Ecosystems and Partnerships in the Software Industry. Norderstedt, Germany
• S. Jansen, S. Brinkkemper, M.A. Cusumano (2013) Software Ecosystems: Analyzing and Managing Business Networks in the Software Industry. Edward Elgar Publishing

We fully agree these are very important aspects but regretted the absence of a complementary book focusing on the technical aspects related to empirically analyzing, supporting, and improving software ecosystems. This is the main reason why we decided to create this book. Even though we have tried to be as comprehensive as possible, we acknowledge that this book might not cover all important research topics related to software ecosystems. We apologize to the reader if specific technical or other aspects related to software ecosystems and their analysis have not been sufficiently addressed.

Who This Book Is Intended For

This book is intended for all those practitioners and researchers interested in developing tool support for or in the empirical analysis of software ecosystems. The reader will find the contributed chapters to cover a wide spectrum of social and technical aspects of software ecosystems, each including an overview of the state of the art. While this book has not been written as a classical textbook, we believe that it can be used as supplementary material to present software ecosystems research during advanced (graduate or postgraduate) software engineering lectures and capita selecta. This is exactly what we, the book editors, intend to do as well in our own university courses. For researchers, the book can be used as a starting point for exploring the wealth of software ecosystems results surveyed in each chapter.


How This Book Is Structured

This book starts with an introductory chapter (Chap. 1) that provides a historical account of the origins of software ecosystems. This chapter sets the necessary context about the domain of software ecosystems by highlighting its different perspectives, definitions, and representations. It also provides many concrete examples highlighting the variety of software ecosystems that have emerged during the previous decades. The book is composed of five parts, each containing two contributed chapters.

Part I contains two chapters on software ecosystem representations. Chapter 2 focuses on the Software Heritage open science ecosystem. This ecosystem has been recognized by UNESCO because of its ongoing effort to preserve and provide access to the digital heritage of free and open-source software. The chapter focuses on important aspects such as open science and research reproducibility and also discusses some of the techniques that are required to maintain and query this massive software ecosystem. Chapter 3 reflects on software ecosystems composed of many interdependent components and the challenges that researchers are facing to mine such ecosystems. The authors propose a list of promises and perils related to these challenges.

Part II of the book contains two chapters that focus on different ways and techniques of analyzing software ecosystems. Chapter 4 focuses on technical aspects of how to mine software library usage patterns in ecosystems of reusable software libraries. Chapter 5 focuses on social aspects in software ecosystems, by analyzing how emotions play a role in the context of developer communication and interaction. It presents a range of sentiment analysis techniques and tools that can be used to carry out such analyses.

Part III of the book contains two chapters that focus on aspects related to the evolution of software ecosystems. Chapter 6 focuses on the phenomenon of forking of software repositories in social coding ecosystems such as GitHub. The chapter studies the prevalent phenomenon of variant forks as a reuse mechanism to split off software projects and steer them into a new direction. The focus is on how such variant forks continue to be maintained over time and to which extent they coevolve with the main repository they originated from. Chapter 7 discusses the effect of collateral evolution in software ecosystems. Collateral evolution is a type of adaptive maintenance that is required to keep a software system functional when its surrounding technological environment is facing changes that are beyond the control of the system itself. A typical example of such collateral evolution is external software libraries that are frequently subject to changes that necessitate adaptations in the software systems making use of those libraries. This phenomenon is explored for three software ecosystems (the Linux kernel, Android apps, and machine learning software), and techniques are presented to support collateral evolution in those ecosystems.

Part IV of the book looks at what could be called software automation ecosystems. Such ecosystems are brought about by the increasing automation of software engineering processes.


Chapter 8 studies development workflow automation tools, considering the GitHub Actions ecosystem and the use of development bots. Chapter 9 looks at the ecosystems that have formed around containerization and configuration management tools, focusing specifically on the Docker Hub and Ansible Galaxy ecosystems.

Finally, Part V focuses on what could be called model-centered software ecosystems. Chapter 10 considers ecosystems stemming from model-driven software engineering. Such ecosystems contain software model artefacts and associated tools. The chapter focuses on how machine learning and deep learning techniques can be used to understand and analyze the artefacts contained in such ecosystems. Last but not least, Chap. 11 looks at data-intensive software ecosystems, which are ecosystems in which database models and their corresponding tools play a crucial role. The chapter focuses on techniques to mine, analyze, and visualize such ecosystems. It also reports on empirical analyses based on such techniques.

Acknowledgments

We thank the FWO-Vlaanderen and F.R.S.-FNRS Science Foundations in Belgium for the generous research funding they have provided us for our research on software ecosystems. We also thank our respective universities in Mons, Brussels, and Namur for having given us the opportunity, environment, and resources to carry out our research and to enable this research collaboration. We express our gratitude to all chapter authors of this book for having taken the time and effort to contribute high-quality chapters while respecting the imposed deadlines. Last but not least, we thank our loving families, friends, and colleagues for their support.

Mons, Belgium
Brussels, Belgium
Namur, Belgium
March 2023

Tom Mens Coen De Roover Anthony Cleve

Contents

1 An Introduction to Software Ecosystems (Tom Mens and Coen De Roover)

Part I Software Ecosystem Representations
2 The Software Heritage Open Science Ecosystem (Roberto Di Cosmo and Stefano Zacchiroli)
3 Promises and Perils of Mining Software Package Ecosystem Data (Raula Gaikovina Kula, Katsuro Inoue, and Christoph Treude)

Part II Analyzing Software Ecosystems
4 Mining for Software Library Usage Patterns Within an Ecosystem: Are We There Yet? (Tien N. Nguyen)
5 Emotion Analysis in Software Ecosystems (Nicole Novielli and Alexander Serebrenik)

Part III Evolution Within Software Ecosystems
6 Analyzing Variant Forks of Software Repositories from Social Coding Platforms (John Businge, Mehrdad Abdi, and Serge Demeyer)
7 Supporting Collateral Evolution in Software Ecosystems (Zhou Yang, Bowen Xu, and David Lo)

Part IV Software Automation Ecosystems
8 The GitHub Development Workflow Automation Ecosystems (Mairieli Wessel, Tom Mens, Alexandre Decan, and Pooya Rostami Mazrae)
9 Infrastructure-as-Code Ecosystems (Ruben Opdebeeck, Ahmed Zerouali, and Coen De Roover)

Part V Model-Centered Software Ecosystems
10 Machine Learning for Managing Modeling Ecosystems: Techniques, Applications, and a Research Vision (Davide Di Ruscio, Phuong T. Nguyen, and Alfonso Pierantonio)
11 Mining, Analyzing, and Evolving Data-Intensive Software Ecosystems (Csaba Nagy, Michele Lanza, and Anthony Cleve)

Contributors

Mehrdad Abdi, AnSyMo, Universiteit Antwerpen, Antwerpen, Belgium
John Businge, Evol, University of Nevada Las Vegas, Las Vegas, NV, USA
Anthony Cleve, Faculté d'informatique, Université de Namur, Namur, Belgium
Serge Demeyer, AnSyMo, Universiteit Antwerpen, Antwerpen, Belgium
Alexandre Decan, Software Engineering Lab, University of Mons, Mons, Belgium
Coen De Roover, Software Languages Lab, Vrije Universiteit Brussel, Elsene, Belgium
Roberto Di Cosmo, Software Heritage – INRIA and Université de Paris Cité, Paris, France
Davide Di Ruscio, University of L'Aquila, L'Aquila, Italy
Katsuro Inoue, Nanzan University, Nagoya, Japan
Raula Gaikovina Kula, Nara Institute of Science and Technology, Takayama, Nara, Japan
Michele Lanza, Software Institute, Università della Svizzera italiana, Lugano, Switzerland
David Lo, Singapore Management University, Singapore, Singapore
Tom Mens, Software Engineering Lab, University of Mons, Mons, Belgium
Csaba Nagy, Software Institute, Università della Svizzera italiana, Lugano, Switzerland
Phuong T. Nguyen, University of L'Aquila, L'Aquila, Italy
Tien N. Nguyen, Computer Science Department, University of Texas at Dallas, Richardson, TX, USA
Nicole Novielli, Dipartimento di Informatica, University of Bari "A. Moro", Bari, Italy
Ruben Opdebeeck, Software Languages Lab, Vrije Universiteit Brussel, Elsene, Belgium
Alfonso Pierantonio, University of L'Aquila, L'Aquila, Italy
Pooya Rostami Mazrae, Software Engineering Lab, University of Mons, Mons, Belgium
Alexander Serebrenik, Eindhoven University of Technology, Eindhoven, MB, The Netherlands
Christoph Treude, School of Computing and Information Systems, University of Melbourne, Carlton, VIC, Australia
Mairieli Wessel, Institute for Computing and Information Sciences, Radboud University, Nijmegen, The Netherlands
Bowen Xu, Singapore Management University, Singapore, Singapore
Zhou Yang, Singapore Management University, Singapore, Singapore
Stefano Zacchiroli, LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France
Ahmed Zerouali, Software Languages Lab, Vrije Universiteit Brussel, Elsene, Belgium

Acronyms

ACM: Association for Computing and Machinery—see www.acm.org
API: Application Programming Interface
AI: Artificial Intelligence
AST: Abstract Syntax Tree
AUG: API Usage Graph—see Chap. 4
AWS: Amazon Web Services—a cloud computing platform proposed by Amazon. See aws.amazon.com
BPMN: Business Process Modeling Notation
CE: Collateral Evolution—see Chap. 7
CHAOSS: Community Health Analytics in Open-Source Software—see chaoss.community
CI: Continuous Integration—see CI/CD
CI/CD: Continuous Integration, Deployment, and Delivery
CNN: Convolutional Neural Network—a specific kind of Neural Network
CPS: Cyber-Physical System
CSP: Constraint Satisfaction Problem
CSV: Comma Separated Values—a very simple data format
DAO: Data Access Object
DAG: Directed Acyclic Graph
DB: Database
DBMS: Database Management System
DE&I: Diversity, Equity and Inclusion—a set of principles to build inclusive and welcoming workspaces and environments for less privileged individuals or organizations
DL: Deep Learning—a branch of artificial intelligence focused on developing algorithms and applications based on Neural Networks
DSL: Domain-Specific Language
DT: Decision Tree—a specific kind of supervised machine learning where the classifier is modeled as a tree
EIC: Error-Inducing Change
EMF: Eclipse Modeling Framework
EMR: Electronic Medical Record
EOSC: European Open Science Cloud—a European Commission initiative aiming at developing an infrastructure providing its users with services promoting open science practices. See www.eosc-portal.eu
FAIR: Findable, Accessible, Interoperable, Reusable—four guiding principles for scientific data management
FFNN: Feed-Forward Neural Network—a specific kind of Neural Network
FK: Foreign Key—a kind of constraint that is frequently used in relational databases
FL: Federated Learning—a decentralized form of machine learning that can be used to train models on decentralized data
FOSS: Free Open-Source Software—see OSS
FP: False Positive
FSF: Free Software Foundation—see www.fsf.org
FUSE: Filesystem in User SpacE—see Chap. 2
GDBT: Gradient Boosted Decision Tree—a specific kind of Decision Tree
GDPR: General Data Protection Regulation—a data protection law imposed by the European Union
GK: Graph Kernel—a method to compute the internal weights of a graph neural network. See Chap. 10
GNN: Graph Neural Network—a specific kind of Neural Network
GNU: Recursive acronym for GNU's Not Unix
GraphQL: Graph Query Language—a query language for building flexible, efficient, and powerful APIs
HQL: Hibernate Query Language—an SQL-like query language for Hibernate
IaC: Infrastructure as Code
ICSME: International Conference on Software Maintenance and Evolution—the name of an annual scientific IEEE conference
IDE: Integrated Development Environment
IEEE: Institute of Electrical and Electronics Engineers—see www.ieee.org
INRIA: Institut National de Recherche en Informatique et en Automatique—French National Institute for Research in Computer Science and Control
IT: Information Technology
JDBC: Java Database Connectivity
JDK: Java Development Kit
JPA: Jakarta Persistence API (formerly known as Java Persistence API)—a Java specification to manage object-relational mappings (ORM) between Java applications and a relational database
JSON: JavaScript Object Notation—a human-readable data serialization language and file format, just like XML and YAML
LOC: Lines of Code
LR: Linear Regression
LSTM: Long Short-Term Memory—a specific kind of Recurrent Neural Network
MDE: Model-Driven (Software) Engineering
ML: Machine Learning—a branch of computer science that uses algorithms trained on data to produce models that can perform complex tasks such as analyzing, reasoning, and learning
MOF: Meta-Object Facility—a domain-specific modeling language proposed by the OMG for specifying metamodels for a variety of modeling languages (including UML) to be used in the context of software engineering (i.e., for specifying, analyzing, designing, verifying, and validating software systems)
MSR: Mining Software Repositories—the name of an annual scientific IEEE conference
NATO: North Atlantic Treaty Organization—see www.nato.int
NB: Naïve Bayesian—a specific kind of supervised machine learning technique
NIST: National Institute of Standards and Technology—part of the US Department of Commerce. See www.nist.gov
NLP: Natural Language Processing—a branch at the intersection of computer science and computational linguistics focused on understanding, interpreting, and manipulating natural (i.e., human) language
NN: Neural Network—a computational technique in the area of artificial intelligence (AI) that is at the heart of deep learning (DL) algorithms
NoSQL: Not Only SQL or non-SQL—a category of database management systems that enable storing and querying data outside the traditional relational DBMS that rely on SQL as a query language
NVD: National Vulnerability Database—a software security vulnerability database maintained by NIST. See nvd.nist.gov
OCI: Oracle Cloud Infrastructure
OCL: Object Constraint Language—a query and constraint language proposed by the OMG to be used to query or express constraints over models (e.g., UML and SysML) or metamodels (e.g., MOF)
OMG: Object Management Group—an international non-profit consortium focused on developing technology standards such as UML, SysML, and many more
ORM: Object Relational Mapping—a technique or framework used to simplify the mapping and translation between object-oriented programs and relational databases
OS: Operating System
OSI: Open-Source Initiative—see opensource.org
OSS: Open-Source Software
PDG: Program Dependency Graph
PNG: Portable Network Graphic—a data format for image files
PR: Pull Request
QA: Quality Assurance
Q&A: Question and Answer
RAM: Random Access Memory
REST: Representational State Transfer—an architectural style for allowing computer systems to communicate with each other
RF: Random Forest—a specific kind of supervised machine learning technique
RL: Reinforcement Learning—one of the three machine learning paradigms, alongside supervised and unsupervised learning
RNN: Recurrent Neural Network—a specific kind of Neural Network
ROS: Robot Operating System—see www.ros.org
RTM: Relational Topic Model
SATD: Self-Admitted Technical Debt—a specific kind of technical debt
SBSE: Search-Based Software Engineering
SE: Software Engineering
SemVer: Semantic Versioning—see semver.org
SHA1: Secure Hash Algorithm 1—a well-known cryptographic hash algorithm
SPDX: Software Package Data Exchange—an open standard for software bill of materials. See spdx.dev
SQL: Structured Query Language—a language used to query data stored in a relational database
SSD: Social Semantic Diversity
SVG: Scalable Vector Graphic—a data format for image files
SVM: Support Vector Machines—a specific kind of supervised machine learning technique
SWH: Software Heritage—see Chap. 2
SWHID: Software Heritage Persistent Identifier—see Chap. 2
SysML: Systems Modeling Language—a modeling language proposed by the OMG to be used in the context of systems engineering (i.e., specifying, analyzing, designing, verifying, and validating systems)
TD: Technical Debt
TP: True Positive
UML: Unified Modeling Language—a modeling language proposed by the OMG to be used in the context of software engineering (i.e., specifying, analyzing, designing, verifying, and validating software systems)
UNESCO: United Nations Educational, Scientific and Cultural Organization—see www.unesco.org
URI: Uniform Resource Identifier
URL: Uniform Resource Locator
VCS: Version Control System
XML: eXtensible Markup Language—a human-readable data serialization language and file format, just like YAML and JSON
XP: eXtreme Programming—one of the many agile software development methodologies
YAML: Recursive acronym for YAML Ain't Markup Language—a human-readable data serialization language and file format, just like XML and JSON

Chapter 1

An Introduction to Software Ecosystems
Tom Mens and Coen De Roover

Abstract This chapter defines and presents the kinds of software ecosystems that are targeted in this book. The focus is on the development, tooling, and analytics aspects of “software ecosystems,” i.e., communities of software developers and the interconnected software components (e.g., projects, libraries, packages, repositories, plug-ins, apps) they are developing and maintaining. The technical and social dependencies between these developers and software components form a sociotechnical dependency network, and the dynamics of this network change over time. We classify and provide several examples of such ecosystems, many of which will be explored in further detail in the subsequent chapters of the book. The chapter also introduces and clarifies the relevant terms needed to understand and analyze these ecosystems, as well as the techniques and research methods that can be used to analyze different aspects of these ecosystems.

1.1 The Origins of Software Ecosystems

Today, software ecosystems are considered an important domain of study within the general discipline of software engineering. This section describes its origins, by summarizing the important milestones that have led to its emergence. Figure 1.1 depicts these milestones chronologically.

The software engineering discipline emerged in 1968 as the result of a first international conference [126], sponsored by the NATO Science Committee, based on the realization that more disciplined techniques, engineering principles, and theoretical foundations were urgently needed to cope with the increasing complexity, importance, and impact of software systems in all sectors of economy and industry.


Fig. 1.1 Milestones that contributed to the domain of research (analytics) and development (tooling) of software ecosystems

Even the key idea of software reuse [61, 97], which suggests to reduce time-to-market, cost, and effort when building software while at the same time increasing reuse, productivity, and quality, is as old as the software engineering discipline itself. During the aforementioned conference, Malcolm Douglas McIlroy proposed to face increasing software complexity by building software through the reuse of high-quality software components [112].

In the late 1970s, awareness increased that the development of large-scale software needs to embrace change as a key aspect of the development process [186]. This has led Manny Lehman to propose the so-called laws of software evolution, focusing on how industrial software systems continue to evolve after their first deployment or public release [19, 101, 102]. The software evolution research domain is still thriving today [114, 116], with two dedicated annual conferences: the IEEE International Conference on Software Maintenance and Evolution (ICSME) and the IEEE Software Analysis, Evolution and Reengineering Conference (SANER).

Another important factor having contributed to the popularity of software ecosystems is the emergence and ever-increasing importance of free software and open-source software (OSS) since the early 1980s, partly through the creation of the GNU project (https://www.gnu.org) in 1983 and the Free Software Foundation (FSF) in 1985 by Richard Stallman, as well as the creation of the Linux operating system in 1991. Strong open-source advocates such as Eric Raymond [144] further contributed to the popularity through the creation of the Open-Source Initiative (OSI) in 1998, and by contrasting cathedral-style closed development process models with the bazaar-style open development process models for open-source and free software in which the code is publicly developed over the Internet. This bazaar-style model evolved into geographically distributed global software development [73, 78] models, supported by the immensely popular social coding platforms [43] such as GitHub, GitLab, Gitea, and BitBucket.


In parallel, the importance of software reuse in the late 1990s gave rise to additional subfields of software engineering such as the domain of component-based software engineering [96, 159], focusing on methods and principles for composing large systems from loosely coupled and independently evolving software components. Around the same time, it was joined by another subfield, called software product line engineering [179], which explicitly aims to enable developing closely related software products using a process modelled after product line manufacturing, separating the domain engineering phase of producing reusable software artefacts that are common to the product family, from the application engineering phase that focuses on developing concrete software applications that exploit the commonalities of the reusable artefacts created during the domain engineering phase. Software product lines have allowed many companies to reduce costs while at the same time increasing quality and time to market, by providing a product line platform and architecture that makes it possible to scale up from the development and maintenance of individual software products to the maintenance of entire families of software products. However, these product families still remain within the organizational boundaries of the company.

Around the same time, the lightweight and iterative process models known as agile software processes started to come to the forefront, with a user-centric vision requiring adaptive and continuous software change. Different variants, such as Scrum [150] and eXtreme Programming (XP) [17], led to the foundation of the Agile Alliance and the creation of the agile manifesto [18]. In support of agile software processes, various development practices and tools for continuous integration and delivery (CI/CD) emerged later on in the decade.

Since the 2003 seminal book by Messerschmitt and Szyperski [117], software ecosystems have become an active topic of research in software engineering. As argued by Jan Bosch [27, 28], software ecosystems expand upon software product lines by allowing companies to cross the organizational boundaries and make their software development platforms available to third parties that, in turn, can contribute to the popularity of the produced software through externally developed components and applications. The key point of software ecosystems is that software products can no longer be considered or maintained in isolation, since they have become heavily interconnected.

1.2 Perspectives and Definitions of Software Ecosystems

Messerschmitt and Szyperski [117] were arguably among the first to use the term software ecosystem and defined it rather generically as a collection of software products that have some given degree of symbiotic relationships. Since then, the research literature has provided different definitions of software ecosystems, from many different perspectives.


From an ecological perspective, several researchers have tried to exploit the analogy between software ecosystems and natural ecosystems. The term software ecosystem quite obviously originates from its ecological counterpart of biological ecosystems that can be found in nature, in a wide variety of forms (e.g., rainforests, coral reefs, deserts, mountain zones, and polar ecosystems). In 1930, Roy Clapham introduced the term ecosystem in an ecological context to denote the physical and biological components of an environment considered in relation to each other as a unit [184]. These components encompass all living organisms (e.g., plants, animals, microorganisms) and physical constituents (e.g., light, water, soil, rocks, minerals) that interact with one another in a given environment. Dhungana et al. [53] compared the characteristics of natural and software ecosystems. Mens [113] provided a high-level historical and ecological perspective on how software ecosystems evolve. Moore [123] and Iansiti and Levien [82] focused on the analogy between business ecosystems and ecology.

From an economic and business perspective, Jansen et al. [83] provide a more precise definition: a set of businesses functioning as a unit and interacting with a shared market for software and services, together with the relationships among them. In a similar vein, Bosch et al. [27] say that a software ecosystem consists of a software platform, a set of internal and external developers and a community of domain experts in service to a community of users that compose relevant solution elements to satisfy their needs. Hanssen [76] defines it as a networked community of organizations, which base their relations to each other on a common interest in a central software technology. An excellent entry point to this business-oriented viewpoint on software ecosystems is the book edited by Jansen et al. [84]. In contrast, the chapters in the current book focus mostly on the complementary technical and social perspectives.

From a more technical perspective, the focus is on technical aspects such as the software tools that are being used (e.g., version control systems, issue and bug trackers, social coding platforms, integrated development environments, programming languages) and the software artefacts that are being used and produced (e.g., source code, executable code, tests, databases, documentation, trace logs, bug and vulnerability reports). Within this technical perspective, Lungu [105] defined a software ecosystem as a collection of software projects that are developed and evolve together in the same environment. The notion of environment can be interpreted rather broadly. The environment can correspond to a software-producing organization, including the tools and libraries used by this organization for developing its software projects, as well as the clients using the developed software projects. It can correspond to an academic environment, composed of software projects developed and maintained by students and researchers in research units. It can also correspond to an entire OSS community consisting of geographically dispersed project collaborators focused around similar philosophies or goals.

From a social perspective, the focus is on the social context and network structure that emerges as a result of the collaboration dynamics and interaction between the different contributors to the projects that belong to the software ecosystem.


This social structure is at least as important as the technical aspects and includes the various stakeholders that participate in the software ecosystem, such as developers, end users, project managers, analysts, designers, software architects, security specialists, legal consultants, clients, QA teams, and many more. Chapter 5 focuses on these social aspects from an emotion analysis viewpoint.

Manikas [111] combined all these perspectives into a single all-encompassing definition of a software ecosystem as the interactions of a set of actors on top of a common technological platform that results in a number of software solutions or services. Each actor is motivated by a set of interests or business models and connected to the rest of the actors and the ecosystem as a whole with symbiotic relationships, while the technological platform is structured in a way that allows the involvement and contribution of the different actors.
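The socio-technical dependency network mentioned above can be made concrete by modeling developers, components, and their relations as a graph. The sketch below is a toy illustration (all names are made up, and the third-party networkx library is assumed to be installed), not a representation taken from the book.

# Toy socio-technical dependency network: packages, developers, and two kinds of
# edges (technical "depends_on" and social "contributes_to"). Names are made up.
import networkx as nx  # pip install networkx

G = nx.DiGraph()

# Technical layer: components and their dependencies.
for package in ("web-app", "http-client", "json-parser"):
    G.add_node(package, kind="package")
G.add_edge("web-app", "http-client", relation="depends_on")
G.add_edge("http-client", "json-parser", relation="depends_on")

# Social layer: developers and the components they contribute to.
for developer in ("alice", "bob"):
    G.add_node(developer, kind="developer")
G.add_edge("alice", "web-app", relation="contributes_to")
G.add_edge("alice", "http-client", relation="contributes_to")
G.add_edge("bob", "json-parser", relation="contributes_to")

# A simple socio-technical question: on which developers does "web-app"
# transitively depend through its (direct and indirect) dependencies?
packages = [n for n, data in G.nodes(data=True) if data["kind"] == "package"]
upstream = nx.descendants(G.subgraph(packages), "web-app") | {"web-app"}
maintainers = {contributor
               for pkg in upstream
               for contributor, _, data in G.in_edges(pkg, data=True)
               if data["relation"] == "contributes_to"}
print("web-app relies on contributions from:", sorted(maintainers))

The analyses discussed in later chapters operate on networks of this shape, but at the scale of entire package registries and developer communities, with the edges extracted from package manifests, version control histories, and issue trackers.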

1.3 Examples of Software Ecosystems

Following the wide diversity of definitions of software ecosystem, the kinds of software ecosystems that have been studied in recent research are equally diverse. An interesting entry point into how the research literature on software ecosystems has been evolving over the years is the many published systematic literature reviews, such as [14, 29, 110, 111, 151]. Without attempting to be complete, Table 1.1 groups into different categories some of the most popular examples of software ecosystems that have been studied in the research literature. These categories are not necessarily disjoint, since software ecosystems tend to contain different types of components that can be studied from different viewpoints. The remaining subsections provide more details for each category, illustrating the variety of software ecosystems that have been studied and providing examples of well-known ecosystems and empirical research that has been conducted on them.

1.3.1 Digital Platform Ecosystems

Hein et al. [77] define a digital platform ecosystem as a software ecosystem that comprises a platform owner that implements governance mechanisms to facilitate value-creating mechanisms on a digital platform between the platform owner and an ecosystem of autonomous complementors and consumers. This is in line with the previously mentioned definition by Bosch et al. [27] that a software ecosystem consists of a software platform, a set of internal and external developers and a community of domain experts in service to a community of users that compose relevant solution elements to satisfy their needs.

Table 1.1 Categories of software ecosystems

digital platforms
  Examples: mobile app stores, integrated development environments
  Components: mobile apps, software plug-ins, or extensions
  Contributors: third-party app or plug-in developers and their users

social coding platforms
  Examples: SourceForge, GitHub, GitLab, Gitea, Bitbucket
  Components: software project repositories
  Contributors: software project contributors

component-based software ecosystems
  Examples: software library registries (e.g., CRAN, npm, RubyGems, PyPI, Maven Central), OS package registries (e.g., Debian packages, Ubuntu package archive)
  Components: interdependent software packages
  Contributors: consumers and producers of software packages and libraries

software automation ecosystems
  Examples: Docker Hub, Kubernetes, Ansible Galaxy, Chef Supermarket, Puppet Forge
  Components: container images, configuration and orchestration scripts, CI/CD pipelines and workflows
  Contributors: creators and maintainers of workflow automation, containerization and orchestration solutions

communication-oriented ecosystems
  Examples: mailing lists, Stack Overflow, Slack
  Components: e-mail threads, questions, answers, messages, posts, etc.
  Contributors: programmers, developers, end users, researchers

OSS communities
  Examples: Apache Software Foundation, Linux Foundation
  Components: OSS projects
  Contributors: community members, code contributors, project maintainers, end users

Well-known examples of digital platform ecosystems are the mobile software ecosystems provided by companies such as Microsoft, Apple, and Google. The company owns and controls an app store as a central platform to which other companies or individuals can contribute apps, which in turn can be downloaded and installed by mobile device users. The systematic mapping studies by de Lima Fontao et al. [51] and others [156] report on the abundant research that has been conducted on these mobile software ecosystems.

Any software system that provides a mechanism for third parties to contribute plug-ins or extensions that enhance the functionality of the system can be considered a digital platform ecosystem. Examples of these are configurable text editors such as Emacs and Vim and integrated development environments (IDEs) such as IntelliJ IDEA, VS Code, NetBeans, and Eclipse. The latter ecosystem in particular has been the subject of considerable research on its evolutionary dynamics (e.g., [4, 30–33, 91, 115, 128, 166]). These examples show that digital platform ecosystems are not necessarily controlled by a single company. In many cases, they are managed by a consortium, foundation, or open-source community. For example, NetBeans is controlled by the Apache Software Foundation, and Eclipse is controlled by the Eclipse Foundation.

Another well-known digital platform ecosystem is WordPress, the most popular content management system in use today, which features a plug-in architecture and template system that enables third parties to publish themes and extend the core functionality. Um et al. [170] presented a recent study of this ecosystem. Yet another example is OpenStack, an open-source cloud computing platform involving more than 500 companies. This ecosystem has been studied by several researchers (e.g., [60, 162, 166, 193]).

1.3.2 Component-Based Software Ecosystems

A very important category of software ecosystems is that of so-called component-based software ecosystems. They constitute large collections of reusable software components, which often have many interdependencies among them [1]. Empirical studies on component-based software ecosystems tend to focus on the technicalities of dependency-based reuse, which differentiates them from studies on digital platform ecosystems, which have a more business-oriented and managerial focus.

As explained in Sect. 1.1, the idea of building software by reusing existing software components is as old as the software engineering discipline itself, since it was proposed by McIlroy in 1968 during the very first software engineering conference [112]. The goal was to reduce time-to-market, cost, and effort when building software, while at the same time increasing reuse, productivity, and quality. This has given rise to a very important and abundant subfield of software engineering that is commonly referred to as component-based software engineering. Despite the large body of research in this field (e.g., [34, 97, 159]), it was not able to live up to its promises due to the lack of a standard marketplace for software components, combined with a lack of proper component models, terminology, and scalable tooling [95]. All of this has changed nowadays, probably due to a combination of the increasing popularity of OSS and the emergence of affordable cloud computing solutions.

Among the most important success stories of component-based software ecosystems are undoubtedly the many interconnected software packages for OSS operating systems such as the GNU Project since 1983, Linux since 1991, Debian since 1993 (e.g., [1, 36, 39, 40, 69]), and Ubuntu since 2004. They come with associated package management systems (or package managers for short) such as DPKG (since 1994) and APT (since 1998), which automate the process of selecting, installing (or removing), upgrading, and configuring those packages. Package managers typically maintain a database of software dependencies and version information to prevent software incompatibilities.

Another popular type of ecosystem of reusable components is the software library ecosystem. Software developers, regardless of whether they are part of an OSS community or a software company, rely to a large extent on reusable third-party software libraries. These library ecosystems tend to come with their own specific package managers and package registries and are available for all major
programming languages. Examples include the CPAN archive network (created in 1995) for the Perl programming language, the CRAN archive network (created in 1997) and Bioconductor for the R statistical programming language [62, 138], npm and Bower for JavaScript [2, 41, 46, 47, 49, 190], PyPI for Python [171], Maven (Central) for JVM-based languages such as Java and Scala [22, 130, 155], Packagist for PHP, RubyGems for Ruby [49, 87, 190], NuGet for the .NET ecosystem [103], and the Cargo package manager and its associated crates.io registry for the Rust programming language [48, 149]. Another example is the Robot Operating System (ROS), the most popular middleware for robotics development, offering reusable libraries for building a robot, distributed through a dedicated package manager [59, 94, 136].

Decan et al. [48] studied and compared seven software library ecosystems for programming languages, focusing on the evolutionary characteristics of their package dependency networks. They observed that library dependency networks tend to grow over time, but that some packages are more impactful than others. A minority of packages are responsible for most of the package updates, a small proportion of packages accounts for most of the reverse dependencies, and there is a high proportion of fragile packages due to a high number of transitive dependencies. This makes software library ecosystems prone to a variety of technical, dependency-related issues [2, 40, 45, 155], licensing issues [108], security vulnerabilities [6, 47, 190], backward incompatibilities [26, 44, 49], reliance on deprecated components [41], and obsolete or outdated components [46, 100, 158, 191]. Versioning practices, such as the use of semantic versioning, can to a certain extent reduce some of these risks [44, 55, 98, 130]. Library ecosystems also face many social challenges, such as how to attract and retain contributors and how to avoid contributor abandonment [42].
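As a concrete illustration (not taken from the studies cited above), the following minimal Python sketch queries the public npm registry for the direct runtime dependencies declared by a package, together with their version constraints. The endpoint and JSON field names are assumed to match the npm registry's public API at the time of writing, and the chosen package is purely illustrative; a full dependency network analysis would additionally have to resolve transitive dependencies.

import requests

def npm_direct_dependencies(package, version="latest"):
    """Return the direct runtime dependencies declared by an npm package,
    as a mapping from dependency name to semver constraint."""
    url = f"https://registry.npmjs.org/{package}/{version}"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    manifest = resp.json()
    return manifest.get("dependencies", {})

if __name__ == "__main__":
    # 'express' is used here only as an example of a widely depended-upon package.
    for dependency, constraint in npm_direct_dependencies("express").items():
        print(f"{dependency}: {constraint}")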

1.3.3 Web-Based Code Hosting Platforms

The landscape of web-based code hosting platforms has seen many important changes over the last two decades, as can be seen in Fig. 1.2. SourceForge was created in 1999 as a centralized web-based platform for hosting and managing the version history of free and OSS projects. It used to be a very popular data source for empirical research (e.g., [79, 93, 129, 147]). This is no longer the case today, as the majority of OSS projects have migrated to other hosting platforms. Google started running a similar open-source project hosting service, called Google Code, in 2006 but shut it down in January 2016. The same happened to Gitorious, which ran from 2008 to 2015. GitHub replaced Google Code as the most popular and largest hosting platform for open-source (and commercial) software projects that use the git version control system. Other alternatives such as Bitbucket (also created in 2008), GitLab (created in 2014), and the likes are much less popular for hosting OSS projects. Gitee (created in 2013) is an online git forge mainly used in China for hosting open-source software.
Fig. 1.2 Historical overview of source code hosting platforms

A relatively new contender in the field is Gitea, created in 2016 and funded by the Open Source Collective.

GitHub maintains historical information about hundreds of millions of OSS repositories and has been the subject of many empirical studies focusing on different aspects [88]. GitHub is claimed to be the first social coding platform [43], as it was the first hosting platform to provide a wide range of mechanisms and associated visualizations to increase collaboration by making socially significant information visible: watching, starring, commits, issues, pull requests, and commenting. Being an enabler of social coding, the social aspects around GitHub projects have been studied extensively [135, 168], including communication patterns [134], collaboration through pull requests [72, 142, 187], variation in contributor workload [172], gender and tenure diversity [173, 174], geographically distributed development [143, 160, 177], socio-technical alignment between projects [25], the impact of gamification on the behavior of software developers [120], and sentiment and emotion analysis [74, 86, 139, 153, 180, 185]. The latter will be presented in more detail in Chap. 5. The phenomenon of project forking has also been actively studied in the context of GitHub [23, 85, 194], as will be discussed in more detail in Chap. 6. The automation of development activities in GitHub projects has also been studied, such as the use of CI/CD tools [20, 67, 174] and the use of development bots [3, 66, 178, 181]. The latter perspective on the ecosystem will be discussed in Chap. 8. The same chapter also explains how GitHub can be studied from the point of view of a digital platform ecosystem (cf. Sect. 1.3.1), as it offers a marketplace of apps and Actions that can be provided by third parties.

1.3.4 Open-Source Software Communities

Quite some research on software ecosystems has focused on collections of OSS projects maintained by decentralized communities of software developers. Such OSS ecosystems have clear advantages over closed, proprietary software ecosystems. For example, their openness guarantees accessibility to all. Following the adage that “given enough eyeballs, all bugs are shallow” [145], OSS ecosystems benefit from a potentially very large number of people that can report bugs, review the code, and identify potential security issues. Provided that the software licenses being used are compatible, organizations and companies can save money by relying on OSS components rather than reinventing the wheel and developing those components themselves. On the downside, OSS ecosystems and their constituent components are frequently maintained on a volunteer basis by unpaid developers. This imposes an increased risk of unmaintained components or slow response times. Organizations that rely on OSS ecosystems could significantly reduce these risks by financially sponsoring the respective communities of OSS developers. Many fiscal and legal initiatives for doing so exist, such as the Open Collective, the Open Source Collective, and the Open Collective Foundation.

OSS ecosystems are often controlled, maintained, and hosted by a nonprofit software foundation. A well-known example is the Apache Software Foundation (www.apache.org). It hosts several hundred OSS projects, involving tens of thousands of code contributors. This ecosystem has been a popular subject of research (e.g., [16, 35, 38, 119, 161]). Another example is the Linux Foundation (www.linuxfoundation.org), whose initial goal was to support the development and evolution of the Linux operating system, but which nowadays hosts hundreds of OSS projects with hundreds of thousands of code contributors. As can be expected, the OSS project communities of third-party components that surround a digital platform ecosystem (cf. Sect. 1.3.1) also tend to be managed by nonprofit foundations. For example, the Eclipse Foundation controls the Eclipse plug-ins, the WordPress Foundation controls the WordPress plug-ins, and the Open Infrastructure Foundation manages the OpenStack projects.

Much in the same way as public OSS ecosystems, there exists a multitude of entirely private and company-controlled software ecosystems. We refer to the book by Jansen et al. [84] that focuses on the business aspects of such commercial software ecosystems. Given their proprietary nature, they have been much less the subject of quantitative empirical research studies, but it is likely that such private ecosystems share many of the characteristics known to OSS ecosystems. As an illustration of such ecosystems, among the countless available examples, we mention Wolfram’s Mathematica and MathWorks’ MATLAB with their large collections of (often third-party) add-ons, and the ecosystem surrounding SAP, the world’s largest software vendor of enterprise resource planning solutions.

1.3.5 Communication-Oriented Ecosystems

The previous categories of software ecosystems have in common that the main components they focus on are technical code-related software artefacts (e.g., software library packages and their metadata, mobile software applications, software plug-ins, project repositories, application code and tests, software containers, configuration scripts). The current category focuses on what we will refer to as communication-oriented ecosystems, in which the main component is some social communication artefact that is shared among members of a software community through some communication channel. Examples of these are mailing lists, developer discussion fora, question-and-answer (Q&A) platforms such as Stack Overflow, and modern communication platforms such as Slack and Discord. Each of them constitutes a software ecosystem in its own right. A particularity of these ecosystems is that the main components they contain (e.g., questions, answers, posts, e-mail, and message threads) are mostly based on unstructured or semi-structured text. As a consequence, extracting and analyzing relevant information from them requires specific techniques based on Natural Language Processing (NLP). These “social programmer ecosystems” [127] have been analyzed by researchers for various reasons, mostly from a social viewpoint.

Mailing Lists Mailing lists are a common communication medium for software development teams, although they are gradually being replaced by more modern communication technologies. As the same person may have multiple email addresses, disambiguation techniques are often required to uniquely identify a given team member [183]. Mailing lists have been the subject of multiple empirical studies (e.g., [75, 188]). Some of these studies have tried to identify personality traits or emotions expressed through e-mails [99, 146, 167].

Discussion Fora Software development discussion fora support mass communication and coordination among distributed software development teams [157]. They are a considerable improvement over mailing lists in that they provide browse and search functions, as well as a platform for posting questions within a specific domain of interest and for receiving expert answers to these questions. A generic, and undoubtedly the most popular, discussion forum is Stack Overflow, dedicated to questions and answers related to computer programming and software development. It belongs to the Stack Exchange network, which provides a range of websites covering specific topics. Such Q&A platforms can be considered as a software ecosystem where the “components” are questions and their answers (including all the metadata that comes with them), and the contributor community consists of developers that ask questions and experts that provide answers to these questions. The Stack Overflow ecosystem has been studied for various purposes and in various ways [5, 8, 13, 15, 109, 124, 125, 175, 176, 188, 192]. The open dataset SOTorrent has been made available on top of the official Stack Overflow data dumps [10–12]. Some researchers [52, 127, 169] have applied sentiment
and emotion analysis techniques on data extracted from Stack Overflow. We refer to Chap. 5 for a more detailed account on the use of such techniques in software ecosystems. Next to generic discussion fora such as Stack Overflow, some software project communities prefer to use project-specific discussion fora. This is, for example, the case for Eclipse. Nugroho et al. [128] presented an empirical analysis of how this forum is being used by its participants.

Modern Communication Platforms Several kinds of modern communication platforms, such as Slack and Discord, are increasingly used by software development teams. Lin et al. [104] reported how Slack facilitates messaging and archiving, as well as the creation of automated integrations with external services and bots to support the work of software development teams.

1.3.6 Software Automation Ecosystems

Another category of software ecosystems is what we would refer to as software automation ecosystems. They revolve around technological solutions that aim to automate part of the management, development, packaging, deployment, delivery, configuration, and orchestration of software applications, often through cloud-based platforms. We can mention at least three categories: containerization solutions, orchestration tools based on Infrastructure as Code, and tools for automating DevOps and CI/CD.

Containerization Containerization allows developers to package all (software and data) components of their applications into so-called containers, which are lightweight, portable, and self-contained executable software environments that can run on any operating system or cloud platform. By isolating the software applications from the underlying hardware infrastructure, they become easier to manage and more resilient to change. Docker is the most popular containerization tool, and it comes with multiple online registries to store, manage, distribute, and share containers (e.g., Google Container Registry, Amazon ECR, JFrog Container Registry, RedHat’s Quay, and of course Docker Hub). While each of these registries comes with its own set of features and benefits, Docker Hub is by far the largest of them. The corresponding ecosystem is studied in Chap. 9 and more specifically in Sect. 9.2.

Infrastructure Management Through Infrastructure as Code (IaC), infrastructure management tools enable automating the provisioning, configuration, deployment, scaling, and load balancing of the machines used in a digital infrastructure. Different infrastructure management tools have been proposed, including Ansible, Chef, and Puppet. Each of them comes with its own platform or registry for sharing configuration scripts (Ansible Galaxy, Chef Supermarket, and Puppet Forge). Sharma et al. [152] studied best practices in Puppet configuration code, analyzing 4,621
Puppet repositories for the presence of implementation and design configuration smells. Opdebeeck et al. studied variable-related [131] and security-related [132] bad smells in Ansible files, respectively. The Ansible Galaxy ecosystem has been an active subject of study in general, as will be shown in Chap. 9 and more specifically in Sect. 9.3.

DevOps and CI/CD Collaborative distributed software development processes, especially for software projects hosted on social coding platforms, tend to be streamlined and automated using continuous integration, deployment, and delivery (CI/CD) tools, which are a key part of DevOps practices. CI/CD tools enable project maintainers to specify project-specific workflows or pipelines that automate many repetitive and error-prone human activities that are part of the development process. Examples are test automation, code quality analysis, dependency management, and vulnerability detection. A wide range of CI/CD tools exist (e.g., Jenkins, Travis, CircleCI, GitLab CI/CD, and GitHub Actions, to name just a few). Since many of these tools come with a registry or marketplace of reusable workflow components that facilitate the creation and evolution of workflows, an ecosystem has formed around them. In particular, Chap. 8 will focus on the ecosystem surrounding GitHub Actions, the integrated CI/CD service of GitHub. Since its introduction, the CI/CD landscape on GitHub has radically changed [50, 92, 182].
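As a small illustration of how such automation data can be mined, the Python sketch below uses GitHub's public REST API to retrieve the most recent GitHub Actions workflow runs of a repository and print their outcome. The repository name is only an example, unauthenticated requests are rate-limited, and the JSON field names are assumed to match the API at the time of writing.

import requests

def recent_workflow_runs(owner, repo, per_page=20):
    """Fetch the most recent GitHub Actions workflow runs of a public
    repository via GitHub's REST API (unauthenticated, hence rate-limited)."""
    url = f"https://api.github.com/repos/{owner}/{repo}/actions/runs"
    resp = requests.get(url,
                        params={"per_page": per_page},
                        headers={"Accept": "application/vnd.github+json"},
                        timeout=30)
    resp.raise_for_status()
    return resp.json().get("workflow_runs", [])

if __name__ == "__main__":
    # Example repository; replace with any public repository of interest.
    for run in recent_workflow_runs("tensorflow", "tensorflow"):
        # 'conclusion' may be None for runs that are still in progress.
        print(run["name"], run["event"], run["conclusion"])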

1.4 Data Sources for Mining Software Ecosystems

The Mining Software Repositories (MSR) research community relies on a wide variety of publicly accessible raw data, APIs or other data extraction tools, data dumps, curated datasets, and data processing tools (e.g., dedicated parsers), depending on the specific purpose and needs of the research being conducted.

The Pros These data sources and their associated tooling form a gold mine for empirical research in software engineering, and they have allowed the MSR field to thrive. Relying on existing publicly accessible data substantially reduces the laborious and error-prone effort of the data extraction and processing phases of empirical research. As such, it has allowed researchers and software practitioners to learn a great deal about software engineering practices in the field and how to improve these practices. Moreover, this allows multiple researchers to rely on the same data, facilitating comparison and reproducibility of research results [68].

The Cons At the same time, these data sources and tools come with a variety of negative consequences, such as:
• Existing data and tools can quickly become obsolete, as it is difficult and effort-intensive to keep up with changes in the original data source or in the APIs required to access them. Many initiatives to create and maintain data extraction tools or curated datasets have been discontinued, mainly due to a lack
of continued funding or because the original maintainers have abandoned the initiative due to career changes.
• Ethical, legal, or privacy reasons may prevent specific parts of the data of interest from being made available [65]. Examples are proprietary copyrighted source code or personal information that cannot be revealed due to GDPR regulations.
• Specific analyses may need specific types of data that are not readily available in existing datasets, requiring the creation of new datasets or the extension of existing ones. Speaking from personal experience, it often takes several months of effort to obtain, preprocess, validate, curate, and improve the quality of the obtained data. Not doing so may lead to results that are inaccurate, biased, not replicable, or not generalizable to other situations.
• Existing datasets may not be appropriate for specific analyses, because of how the data has been gathered or filtered. As an illustration of this problem, suppose, for example, that we want to analyze the effort spent by human contributors in some software ecosystem, based on an available dataset containing contributor accounts and their associated activities over time. If this dataset does not distinguish between human accounts and automated bots, then the results will be biased by bot activities being considered as human activities, calling for the use of bot identification approaches and associated datasets (e.g., [66]).
• Research that relies on raw data sources instead of curated datasets may reduce reproducibility since, unlike for a published dataset, there is no guarantee that the original data will remain the same after publication of the research results. For example, GitHub repositories may be deleted, and the history of a git repository may be changed at any time [24, 89].

The following subsections provide a list of data sources that have been used in empirical research on a wide variety of software ecosystems. This list is non-exhaustive, given the plethora of established and newly emerging ecosystems, data sources about them, and research studies on them.

1.4.1 Mining the GitHub Ecosystem

For git repositories hosted on the GitHub social coding platform, different ways have been proposed to source their data. GitHub provides public REST and GraphQL APIs to interact with its huge dataset of events and interactions with the hosted repositories. As an alternative, different attempts have been made to provide datasets and data dumps containing relevant data extracted from GitHub, with varying success:
• GHArchive2 records, archives, and makes available the public GitHub timeline for public consumption and analysis. It is available on Google BigQuery, and it contains datasets, aggregated into hourly archives, based on 20+ event types, ranging from new commits and fork events to opening new tickets, commenting, and adding members to a project (a small mining sketch is given after this list).
• In a similar way, GHTorrent aimed to obtain data from GitHub public repositories [70, 71], covering a large part of the activity from 2012 to 2019. The latest available data dump was created in March 2021,3 and the initiative has been discontinued altogether.
• TravisTorrent was a dataset created in 2017 based on Travis CI and GitHub. It provides access to over 2.6 million Travis builds from more than 1,000 GitHub projects [21].

2 https://www.gharchive.org.
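To illustrate the kind of lightweight mining that GHArchive enables, the sketch below downloads a single hourly archive and counts the GitHub event types it contains. The URL scheme (one gzip-compressed JSON-lines file per hour) follows GHArchive's documented conventions, but the chosen hour is arbitrary and the event field names are assumed to match GitHub's event payloads.

import gzip
import json
from collections import Counter
from urllib.request import urlopen

# One hour of the public GitHub timeline, as archived by GHArchive.
# The chosen hour is arbitrary and only serves as an example.
URL = "https://data.gharchive.org/2023-01-01-12.json.gz"

def count_event_types(url=URL):
    """Download one hourly GHArchive file and count events per type."""
    counts = Counter()
    with urlopen(url, timeout=60) as response:
        # Each line of the decompressed file is one JSON-encoded event.
        with gzip.open(response, mode="rt", encoding="utf-8") as lines:
            for line in lines:
                event = json.loads(line)
                counts[event["type"]] += 1
    return counts

if __name__ == "__main__":
    for event_type, count in count_event_types().most_common():
        print(f"{event_type}: {count}")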

1.4.2 Mining the Java Ecosystem

Multiple datasets have been produced for use in studies on the ecosystem surrounding the Java programming language. The Qualitas Corpus [163], a curated dataset of Java software systems, aimed to facilitate the reproduction of such studies. Only two data dumps have been released, in 2010 and in 2013. More recent datasets for Java focused on Apache’s Maven Central Repository, a software package registry maintaining a huge collection of libraries for the Java Virtual Machine. For example, Raemaekers et al. provided the Maven Dependency Dataset with metrics, changes, and a dependency graph for 148,253 jar files [140]. The dataset was used to study the phenomena of semantic versioning and breaking changes [141]. Mitropoulos et al. [118] provide a complementary dataset containing the FindBugs results for every project version included in the Maven Central Repository. More recently, Benelallam et al. [22] created the Maven Dependency Graph, an open-source dataset containing a snapshot of the whole Maven Central Repository taken in September 2018, stored in a temporal graph database modelling all dependencies. This dataset has been used for various purposes, such as the study of dependency bloat [155] and diversity [154].
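As an example of how Maven Central metadata can be queried programmatically, the following Python sketch uses the search.maven.org REST API to look up the latest released version and the number of published versions of an artifact. The endpoint and response field names are assumed to follow the API's current conventions, and the chosen coordinates are merely illustrative.

import requests

SEARCH_URL = "https://search.maven.org/solrsearch/select"

def maven_artifact_info(group_id, artifact_id):
    """Query the Maven Central search API for an artifact and return its
    latest released version together with the number of released versions."""
    query = f'g:"{group_id}" AND a:"{artifact_id}"'
    resp = requests.get(SEARCH_URL,
                        params={"q": query, "rows": 1, "wt": "json"},
                        timeout=30)
    resp.raise_for_status()
    docs = resp.json()["response"]["docs"]
    if not docs:
        return None
    return docs[0]["latestVersion"], docs[0]["versionCount"]

if __name__ == "__main__":
    # org.apache.commons:commons-lang3 is used purely as an example artifact.
    print(maven_artifact_info("org.apache.commons", "commons-lang3"))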

1.4.3 Mining Software Library Ecosystems

Beyond the Java ecosystem, many software library ecosystems have been studied for a wide range of programming languages. For the purpose of analysing the dependency networks of these ecosystems, Libraries.io [90] has been used by several researchers (e.g., [48, 108, 158, 189, 190]). Five successive data dumps have been made available from 2017 to 2020, containing metadata from a wide range of different package managers.

3 http://ghtorrent-downloads.ewi.tudelft.nl/mysql/.

No more recent data dumps have been released since Tidelift decided to discontinue active maintenance of the dataset.

As a kind of successor to Libraries.io, the Ecosyste.ms project4 was started in 2022. Currently sponsored by the Open Collective,5 it focuses on expanding the available data and APIs, as such providing a foundation for researchers to better analyze open-source software and for funders to better prioritize which projects most need funding. The Ecosyste.ms platform provides a shared collection of openly accessible services to support, sustain, and secure critical open-source software components. Each service comes with an openly accessible JSON API to facilitate the creation of new tools and services. The APIs and data structures are designed to be as generic as possible, to facilitate analyzing different data sources in an ecosystem-agnostic way. Some of the supported services include (a small usage sketch follows this list):
• An index of several million open-source packages from dozens of package registries (for programming languages and Linux distributions), with tens of thousands of new package versions being added on a daily basis.
• An index of the historical timeline of several billion events that occurred across public git repositories (hosted on GitHub, GitLab, or Gitea) over many years, with hundreds of thousands of events being added on an hourly basis.
• An index of tens of millions of open-source repositories and Docker projects and their dependencies, originating from a dozen different sources, with tens of thousands of new repositories being added on a daily basis.
• A range of services to provide software repository, contributor, and security vulnerability metadata, parse software dependency and licensing metadata, resolve software package dependency trees, generate diffs between package releases, and many more.
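The sketch below gives an impression of how such a JSON API can be used from Python to retrieve ecosystem-agnostic package metadata. The exact endpoint path, registry identifier, and field names are assumptions based on the Ecosyste.ms URL scheme as we understand it; the authoritative routes are documented on the Ecosyste.ms website, and the chosen package is only an example.

import requests

# Assumed Ecosyste.ms packages API route; consult https://packages.ecosyste.ms
# for the authoritative endpoint documentation.
BASE = "https://packages.ecosyste.ms/api/v1"

def package_summary(registry, name):
    """Fetch ecosystem-agnostic metadata about a package from Ecosyste.ms."""
    resp = requests.get(f"{BASE}/registries/{registry}/packages/{name}",
                        timeout=30)
    resp.raise_for_status()
    data = resp.json()
    # Field names are assumed; missing fields simply come back as None.
    return {key: data.get(key) for key in
            ("name", "latest_release_number", "dependent_packages_count")}

if __name__ == "__main__":
    # 'npmjs.org' and 'express' are illustrative registry and package identifiers.
    print(package_summary("npmjs.org", "express"))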

1.4.4 Mining Other Software Ecosystems

Beyond the data sources mentioned above, a wide variety of other initiatives to mine, analyze, or archive software ecosystems have been proposed, through a plethora of datasets and data sources that are (or have been) available to researchers and practitioners. Of particular relevance is the Software Heritage ecosystem [54]. It is the largest public software archive, containing the development history of billions of source code files from more than 180 million collaborative software development projects. Supported by a partnership with UNESCO, its long-term mission is to collect, preserve, and make easily accessible the source code of publicly available software.

4 https://ecosyste.ms. 5 https://opencollective.com.

It comes with its own filesystem [7] and graph dataset [137]. For more details, we refer to Chap. 2, which is entirely focused on this ecosystem.

World of Code (WoC) [106, 107] is another ambitious initiative to create a very large and frequently updated collection of historical data on OSS ecosystems. The provided infrastructure facilitates the analysis of technical dependencies, social networks, and their interrelationships. To this end, WoC provides tools for efficiently correcting, augmenting, querying, and analyzing that data, as a foundation for understanding the structure and evolution of the relationships that drive OSS activities.

Boa [57, 58, 81] is yet another initiative to support the efficient mining of large-scale datasets of software repository data. Boa provides a domain-specific language and a distributed computing infrastructure to facilitate this.

Many other attempts have been made in the past to create and support publicly available software datasets and platforms, but these are no longer actively maintained today. We mention some notable examples below. The PROMISE Software Engineering Repository is a collection of publicly available datasets to serve researchers in conducting predictive software engineering research in a repeatable, verifiable, and refutable way [148]. FLOSSmole is another collaborative collection of OSS project data [80]. Candoia is a platform and ecosystem for building and sharing software repository mining tools and applications [164, 165]. Sourcerer is a research project aimed at exploring open-source projects; it provided an open-source infrastructure and curated datasets for other researchers to use [9]. DebSources is a dataset containing source code and metadata spanning two decades of history of the Debian Linux distribution, until 2016 [37].

Jira is one of the most popular issue tracking systems (ITSs) in practice. A first Jira repository dataset was created in 2015, containing more than 700K issue reports and more than two million issue comments extracted from the Jira issue tracking systems of the Apache Software Foundation, Spring, JBoss, and CodeHaus OSS communities [133]. A more recent dataset created in 2022 gathers data from 16 public Jira repositories containing 1,822 projects and spanning 2.7 million issues, with a combined total of 32 million changes, nine million comments, and one million issue links [121, 122].

1.5 The CHAOSS Project

In an introductory chapter on software ecosystems, it is indispensable to also mention the CHAOSS initiative (which is an acronym for Community Health Analytics in Open Source Software) [64].6 It is a Linux Foundation project aimed at better understanding OSS community health on a global scale [63]. Unhealthy OSS projects can have a negative impact on the community involved in them, as well as on organizations that rely on them. CHAOSS therefore focuses on understanding and supporting health through the creation of metrics, metrics models, and software development analytics tools for measuring and visualizing community health in OSS projects.

Two main OSS tools are proposed by CHAOSS to do so: Augur and GrimoireLab [56]. The latter is an open-source toolkit with support for extracting, visualizing, and analyzing activity, community, and process data from 30+ data sources related to code management, issues, code reviewing, mailing lists, developer fora, and more. Perhaps one shortcoming of these tools is that they have not been designed to scale up to visualize or analyze health issues at the level of ecosystems containing thousands or even millions of interconnected projects.

6 https://chaoss.community.
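To give a small taste of what GrimoireLab offers, the sketch below uses Perceval, GrimoireLab's data retrieval component, to iterate over the commits of a git repository. This is only a minimal usage sketch: it assumes Perceval is installed (pip install perceval) and that the Git backend API has not changed since the time of writing; the repository URL and local mirror path are merely examples.

from perceval.backends.core.git import Git

# Retrieve commit metadata from a git repository using Perceval,
# the data retrieval component of the GrimoireLab toolkit.
REPO_URL = "https://github.com/chaoss/grimoirelab-perceval"

repo = Git(uri=REPO_URL, gitpath="/tmp/grimoirelab-perceval.git")

# Each fetched item is a dictionary wrapping one commit and its metadata.
for item in repo.fetch():
    commit = item["data"]
    print(commit["commit"], commit["Author"], commit["AuthorDate"])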

1.6 Summary

This introductory chapter served as a first stepping stone for newcomers to the research field of software ecosystems. We aimed to provide the necessary material to get up to speed in this domain. After a historical account of where software ecosystems originated from, we highlighted the different perspectives on software ecosystems and their accompanying definitions. We categorized the different kinds of software ecosystems, providing many examples for each category. Since the book to which this introductory chapter belongs focuses on software ecosystem tooling and analytics, we presented a rich set of data sources and datasets that have been or can be used for mining software ecosystems. Given that the field of software ecosystems is evolving at a rapid pace, it is difficult to predict the direction in which it is heading and the extent to which the current tools and data sources will evolve or be replaced in the future.

References

1. Abate, P., Di Cosmo, R., Boender, J., Zacchiroli, S.: Strong dependencies between software components. In: International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 89–99 (2009). https://doi.org/10.1109/ESEM.2009.5316017 2. Abdalkareem, R., Nourry, O., Wehaibi, S., Mujahid, S., Shihab, E.: Why do developers use trivial packages? An empirical case study on npm. In: Joint Meeting on Foundations of Software Engineering (FSE), pp. 385–395 (2017). https://doi.org/10.1145/3106237.3106267 3. Abdellatif, A., Wessel, M., Steinmacher, I., Gerosa, M.A., Shihab, E.: BotHunter: an approach to detect software bots in GitHub. In: International Conference on Mining Software Repositories (MSR), pp. 6–17. IEEE Computer Society, Washington (2022). https://doi.org/ 10.1145/3524842.3527959 4. Abou Khalil, Z., Constantinou, E., Mens, T., Duchien, L.: On the impact of release policies on bug handling activity: a case study of Eclipse. J. Syst. Software 173 (2021). https://doi. org/10.1016/j.jss.2020.110882

5. Ahasanuzzaman, M., Asaduzzaman, M., Roy, C.K., Schneider, K.A.: CAPS: a supervised technique for classifying Stack Overflow posts concerning API issues. Empir. Software Eng. 25(2), 1493–1532 (2020). https://doi.org/10.1007/s10664-019-09743-4 6. Alfadel, M., Costa, D.E., Shihab, E., Shihab, E.: Empirical analysis of security vulnerabilities in Python packages. In: International Conference on Software Analysis, Evolution and Reengineering (SANER) (2021). https://doi.org/10.1109/saner50967.2021.00048 7. Allançon, T., Pietri, A., Zacchiroli, S.: The software heritage filesystem (SwhFS): integrating source code archival with development. In: International Conference on Software Engineering (ICSE). IEEE, Piscataway (2021). https://doi.org/10.1109/ICSE-Companion52605.2021. 00032 8. Anderson, A., Huttenlocher, D., Kleinberg, J., Leskovec, J.: Discovering value from community activity on focused question answering sites: a case study of Stack Overflow. In: SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 850–858. ACM, New York (2012). https://doi.org/10.1145/2339530.2339665 9. Bajracharya, S., Ossher, J., Lopes, C.: Sourcerer: An infrastructure for large-scale collection and analysis of open-source code. Sci. Comput. Program. 79, 241–259 (2014). https://doi.org/ 10.1016/j.scico.2012.04.008 10. Baltes, S.: SOTorrent dataset (2021). https://doi.org/10.5281/zenodo.4415593 11. Baltes, S., Dumani, L., Treude, C., Diehl, S.: SOTorrent: reconstructing and analyzing the evolution of Stack Overflow posts. In: International Conference on Mining Software Repositories (MSR), pp. 319–330. ACM, New York (2018). https://doi.org/10.1145/3196398. 3196430 12. Baltes, S., Treude, C., Diehl, S.: SOTorrent: studying the origin, evolution, and usage of Stack Overflow code snippets. In: International Conference on Mining Software Repositories (MSR), pp. 191–194. IEEE, Piscataway/ACM, New York (2019). https://doi.org/10.1109/ MSR.2019.00038 13. Bangash, A.A., Sahar, H., Chowdhury, S., Wong, A.W., Hindle, A., Ali, K.: What do developers know about machine learning: a study of ML discussions on StackOverflow. In: International Conference on Mining Software Repositories (MSR), pp. 260–264 (2019). https://doi.org/10.1109/MSR.2019.00052 14. Barbosa, O., Alves, C.: A systematic mapping study on software ecosystems. In: International Workshop on Software Ecosystems (IWSECO), CEUR Workshop Proceedings, vol. 746, pp. 15–26 (2011) 15. Barua, A., Thomas, S.W., Hassan, A.E.: What are developers talking about? An analysis of topics and trends in Stack Overflow. Empir. Software Eng. 19(3), 619–654 (2014). https://doi. org/10.1007/s10664-012-9231-y 16. Bavota, G., Canfora, G., Di Penta, M., Oliveto, R., Panichella, S.: The evolution of project inter-dependencies in a software ecosystem: the case of Apache. In: International Conference on Software Maintenance (ICSM), pp. 280–289 (2013). https://doi.org/10.1109/ICSM.2013. 39 17. Beck, K.: Embracing change with extreme programming. Computer 32(10), 70–77 (1999). https://doi.org/10.1109/2.796139 18. Beck, K., Beedle, M., Van Bennekum, A., Cockburn, A., Cunningham, W., Fowler, M., Grenning, J., Highsmith, J., Hunt, A., Jeffries, R., et al.: Manifesto for agile software development. Technical report, Snowbird, UT (2001) 19. Belady, L.A., Lehman, M.M.: A model of large program development. IBM Syst. J. 15(3), 225–252 (1976). https://doi.org/10.1147/sj.153.0225 20. 
Beller, M., Gousios, G., Zaidman, A.: Oops, my tests broke the build: an explorative analysis of Travis CI with GitHub. In: International Conference on Mining Software Repositories (MSR), pp. 356–367. IEEE, Piscataway (2017). https://doi.org/10.1109/MSR.2017.62 21. Beller, M., Gousios, G., Zaidman, A.: TravisTorrent: synthesizing Travis CI and GitHub for full-stack research on continuous integration. In: International Conference on Mining Software Repositories (MSR), pp. 447–450 (2017). https://doi.org/10.1109/MSR.2017.24

22. Benelallam, A., Harrand, N., Soto-Valero, C., Baudry, B., Barais, O.: The Maven dependency graph: a temporal graph-based representation of Maven Central. In: International Conference on Mining Software Repositories (MSR), pp. 344–348 (2019). https://doi.org/10.1109/MSR. 2019.00060 23. Biazzini, M., Baudry, B.: May the fork be with you: novel metrics to analyze collaboration on GitHub. In: International Workshop on Emerging Trends in Software Metrics, pp. 37–43. ACM, New York (2014). https://doi.org/10.1145/2593868.2593875 24. Bird, C., Rigby, P.C., Barr, E.T., Hamilton, D.J., Germán, D.M., Devanbu, P.T.: The promises and perils of mining git. In: International Working Conference on Mining Software Repositories (MSR), pp. 1–10. IEEE, Piscataway (2009). https://doi.org/10.1109/MSR.2009. 5069475 25. Blincoe, K., Harrison, F., Kaur, N., Damian, D.: Reference coupling: An exploration of interproject technical dependencies and their characteristics within large software ecosystems. Inform. Software Technol. 110, 174–189 (2019). https://doi.org/10.1016/j.infsof.2019.03.005 26. Bogart, C., Kästner, C., Herbsleb, J., Thung, F.: When and how to make breaking changes: policies and practices in 18 open source software ecosystems. Trans. Software Eng. Methodol. 30(4) (2021). https://doi.org/10.1145/3447245 27. Bosch, J.: From software product lines to software ecosystems. In: International Software Product Line Conference (SPLC) (2009) 28. Bosch, J., Bosch-Sijtsema, P.: From integration to composition: on the impact of software product lines, global development and ecosystems. J. Syst. Software 83(1), 67–76 (2010). https://doi.org/10.1016/j.jss.2009.06.051 29. Burström, T., Lahti, T., Parida, V., Wartiovaara, M., Wincent, J.: Software ecosystems now and in the future: a definition, systematic literature review, and integration into the business and digital ecosystem literature. Trans. Eng. Manag., 1–16 (2022). https://doi.org/10.1109/ TEM.2022.3216633 30. Businge, J., Serebrenik, A., van den Brand, M.G.J.: Survival of Eclipse third-party plug-ins. In: International Conference on Software Maintenance (ICSM), pp. 368–377 (2012). https:// doi.org/10.1109/ICSM.2012.6405295 31. Businge, J., Serebrenik, A., van den Brand, M.G.J.: Analyzing the Eclipse API usage: putting the developer in the loop. In: European Conference on Software Maintenance and Reengineering (CSMR), pp. 37–46. IEEE Computer Society, Washington (2013). https://doi. org/10.1109/CSMR.2013.14 32. Businge, J., Serebrenik, A., Brand, M.G.: Eclipse API usage: the good and the bad. Software Qual. J. 23(1), 107–141 (2015). https://doi.org/10.1007/s11219-013-9221-3 33. Businge, J., Kawuma, S., Openja, M., Bainomugisha, E., Serebrenik, A.: How stable are Eclipse application framework internal interfaces? In: International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 117–127 (2019). https://doi.org/10. 1109/SANER.2019.8668018 34. Caldiera, G., Basili, V.: Identifying and qualifying reusable software components. Computer 24(2), 61–70 (1991). https://doi.org/10.1109/2.67210 35. Calefato, F., Lanubile, F., Vasilescu, B.: A large-scale, in-depth analysis of developers’ personalities in the Apache ecosystem. Inf. Software Technol 114, 1–20 (2019). https://doi. org/10.1016/j.infsof.2019.05.012 36. Caneill, M., Zacchiroli, S.: Debsources: live and historical views on macro-level software evolution. In: International Symposium on Empirical Software Engineering and Measurement (ESEM). ACM, New York (2014). 
https://doi.org/10.1145/2652524.2652528. http://sources. debian.net 37. Caneill, M., German, D.M., Zacchiroli, S.: The debsources dataset: two decades of free and open source software. Empir. Software Eng. 22, 1405–1437 (2017). https://doi.org/10.1007/ s10664-016-9461-5 38. Chen, B., (Jack) Jiang, Z.M.: Characterizing logging practices in Java-based open source software projects – a replication study in Apache software foundation. Empir. Software Eng. 22(1), 330–374 (2017). https://doi.org/10.1007/s10664-016-9429-5

39. Claes, M., Mens, T., Di Cosmo, R., Vouillon, J.: A historical analysis of Debian package incompatibilities. In: Working Conference on Mining Software Repositories (MSR), pp. 212– 223 (2015). https://doi.org/10.1109/MSR.2015.27 40. Claes, M., Decan, A., Mens, T.: Intercomponent dependency issues in software ecosystems. In: Software Technology: 10 Years of Innovation in IEEE Computer, chap. 3, pp. 35–57. Wiley, Hoboken (2018). https://doi.org/10.1002/9781119174240.ch3 41. Cogo, F.R., Oliva, G.A., Hassan, A.E.: Deprecation of packages and releases in software ecosystems: a case study on npm. Trans. Software Eng. (2021). https://doi.org/10.1109/TSE. 2021.3055123 42. Constantinou, E., Mens, T.: An empirical comparison of developer retention in the RubyGems and npm software ecosystems. Innovations Syst. Software Eng. 13(2), 101–115 (2017). https://doi.org/10.1007/s11334-017-0303-4 43. Dabbish, L., Stuart, C., Tsay, J., Herbsleb, J.: Social coding in GitHub: transparency and collaboration in an open software repository. In: International Conference on Computer Supported Cooperative Work (CSCW), pp. 1277–1286. ACM, New York (2012). https://doi. org/10.1145/2145204.2145396 44. Decan, A., Mens, T.: What do package dependencies tell us about semantic versioning? Trans. Software Eng. 47(6), 1226–1240 (2021). https://doi.org/10.1109/TSE.2019.2918315 45. Decan, A., Mens, T., Claes, M.: An empirical comparison of dependency issues in OSS packaging ecosystems. In: International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, Piscataway (2017). https://doi.org/10.1109/SANER.2017. 7884604 46. Decan, A., Mens, T., Constantinou, E.: On the evolution of technical lag in the npm package dependency network. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 404–414. IEEE, Piscataway (2018). https://doi.org/10.1109/ICSME. 2018.00050 47. Decan, A., Mens, T., Constantinou, E.: On the impact of security vulnerabilities in the npm package dependency network. In: International Conference on Mining Software Repositories (MSR), pp. 181–191 (2018). https://doi.org/10.1007/s10664-022-10154-1 48. Decan, A., Mens, T., Grosjean, P.: An empirical comparison of dependency network evolution in seven software packaging ecosystems. Empir. Software Eng. 24(1), 381–416 (2019). https://doi.org/10.1007/s10664-017-9589-y 49. Decan, A., Mens, T., Zerouali, A., De Roover, C.: Back to the past – analysing backporting practices in package dependency networks. Trans. Software Eng. (2021). https://doi.org/10. 1109/TSE.2021.3112204 50. Decan, A., Mens, T., Mazrae, P.R., Golzadeh, M.: On the use of GitHub Actions in software development repositories. In: International Conference on Software Maintenance and Evolution (ICSME). IEEE, Piscataway (2022). https://doi.org/10.1109/ICSME55016. 2022.00029 51. de Lima Fontao, A., Pereira dos Santos, R., Dias-Neto, A.C.: Mobile software ecosystem (MSECO): a systematic mapping study. In: Annual Computer Software and Applications Conference (COMPSAC), vol. 2, pp. 653–658. IEEE, Piscataway (2015). https://doi.org/10. 1109/COMPSAC.2015.121 52. de Lima Fontão, A., Ekwoge, O.M., dos Santos, R.P., Dias-Neto, A.C.: Facing up the primary emotions in mobile software ecosystems from developer experience. In: Workshop on Social, Human, and Economic Aspects of Software (WASHES), pp. 5–11. ACM, New York (2017). https://doi.org/10.1145/3098322.3098325 53. 
Dhungana, D., Groher, I., Schludermann, E., Biffl, S.: Guiding principles of natural ecosystems and their applicability to software ecosystems. In: Software Ecosystems: Analyzing and Managing Business Networks in the Software Industry, chap. 3, pp. 43–58. Edward Elgar, Cheltenham (2013). https://doi.org/10.4337/9781781955628.00010 54. Di Cosmo, R., Zacchiroli, S.: Software Heritage: why and how to preserve software source code. In: International Conference on Digital Preservation (iPRES) (2017)

55. Dietrich, J., Pearce, D., Stringer, J., Tahir, A., Blincoe, K.: Dependency versioning in the wild. In: International Conference on Mining Software Repositories (MSR), pp. 349–359. IEEE, Piscataway (2019). https://doi.org/10.1109/MSR.2019.00061 56. Dueñas, S., Cosentino, V., Gonzalez-Barahona, J.M., del Castillo San Felix, A., IzquierdoCortazar, D., Cañas-Díaz, L., Pérez García-Plaza, A.: GrimoireLab: a toolset for software development analytics. PeerJ Comput. Sci. (2021). https://doi.org/10.7717/peerj-cs.601 57. Dyer, R., Nguyen, H.A., Rajan, H., Nguyen, T.N.: Boa: a language and infrastructure for analyzing ultra-large-scale software repositories. In: International Conference on Software Engineering (ICSE), pp. 422–431. IEEE, Piscataway (2013). https://doi.org/10.1109/ICSE. 2013.6606588 58. Dyer, R., Nguyen, H.A., Rajan, H., Nguyen, T.N.: Boa: Ultra-large-scale software repository and source-code mining. Trans. Software Eng. Methodol. 25(1) (2015). https://doi.org/10. 1145/2803171 59. Estefo, P., Simmonds, J., Robbes, R., Fabry, J.: The Robot Operating System: package reuse and community dynamics. J. Syst. Software 151, 226–242 (2019). https://doi.org/10.1016/j. jss.2019.02.024 60. Foundjem, A., Constantinou, E., Mens, T., Adams, B.: A mixed-methods analysis of microcollaborative coding practices in OpenStack. Empir. Software Eng. 27(5), 120 (2022). https:// doi.org/10.1007/s10664-022-10167-w 61. Frakes, W., Kang, K.: Software reuse research: status and future. Trans. Software Eng. 31(7), 529–536 (2005). https://doi.org/10.1109/TSE.2005.85 62. German, D.M., Adams, B., Hassan, A.E.: The evolution of the R software ecosystem. In: European Conference on Software Maintenance and Reengineering (CSMR), pp. 243–252 (2013). https://doi.org/10.1109/CSMR.2013.33 63. Goggins, S., Lumbard, K., Germonprez, M.: Open source community health: analytical metrics and their corresponding narratives. In: International Workshop on Software Health in Projects, Ecosystems and Communities (SoHeal), pp. 25–33 (2021). https://doi.org/10.1109/ SoHeal52568.2021.00010 64. Goggins, S.P., Germonprez, M., Lumbard, K.: Making open source project health transparent. Computer 54(8), 104–111 (2021). https://doi.org/10.1109/MC.2021.3084015 65. Gold, N.E., Krinke, J.: Ethics in the mining of software repositories. Empir. Software Eng. 27(1), 17 (2022). https://doi.org/10.1007/s10664-021-10057-7 66. Golzadeh, M., Decan, A., Legay, D., Mens, T.: A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments. J. Syst. Software 175 (2021). https:// doi.org/10.1016/j.jss.2021.110911 67. Golzadeh, M., Decan, A., Mens, T.: On the rise and fall of CI services in GitHub. In: International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, Piscataway (2021). https://doi.org/10.1109/SANER53432.2022.00084 68. Gonzalez-Barahona, J.M., Robles, G.: On the reproducibility of empirical software engineering studies based on data retrieved from development repositories. Empir. Software Eng. 17(1), 75–89 (2012). https://doi.org/10.1007/s10664-011-9181-9 69. Gonzalez-Barahona, J.M., Robles, G., Michlmayr, M., Amor, J.J., German, D.M.: Macrolevel software evolution: a case study of a large software compilation. Empir. Software Eng. 14(3), 262–285 (2009). https://doi.org/10.1007/s10664-008-9100-x 70. Gousios, G., Spinellis, D.: GHTorrent: Github’s data from a firehose. In: Working Conference of Mining Software Repositories (MSR), pp. 12–21 (2012). https://doi.org/10.1109/MSR. 
2012.6224294 71. Gousios, G., Spinellis, D.: Mining software engineering data from GitHub. In: International Conference on Software Engineering (ICSE), pp. 501–502 (2017). https://doi.org/10.1109/ ICSE-C.2017.164 72. Gousios, G., Storey, M.A., Bacchelli, A.: Work practices and challenges in pull-based development: the contributor’s perspective. In: International Conference on Software Engineering (ICSE), pp. 285–296. ACM, New York (2016). https://doi.org/10.1145/2884781.2884826

73. Grinter, R.E., Herbsleb, J.D., Perry, D.E.: The geography of coordination: dealing with distance in R&D work. In: International ACM SIGGROUP conference on Supporting group work (GROUP), pp. 306–315 (1999). https://doi.org/10.1145/320297.320333 74. Guzman, E., Azócar, D., Li, Y.: Sentiment analysis of commit comments in GitHub: an empirical study. In: International Conference on Mining Software Repositories (MSR), pp. 352–355. ACM, New York (2014). https://doi.org/10.1145/2597073.2597118 75. Guzzi, A., Bacchelli, A., Lanza, M., Pinzger, M., van Deursen, A.: Communication in open source software development mailing lists. In: Working Conference on Mining Software Repositories (MSR), pp. 277–286. IEEE, Piscataway (2013) 76. Hanssen, G.K.: A longitudinal case study of an emerging software ecosystem: implications for practice and theory. J. Syst. Software 85(7), 1455–1466 (2012). https://doi.org/10.1016/j. jss.2011.04.020 77. Hein, A., Schreieck, M., Riasanow, T., Setzke, D.S., Wiesche, M., Böhm, M., Krcmar, H.: Digital platform ecosystems. Electron. Mark. 30(1), 87–98 (2020). https://doi.org/10.1007/ s12525-019-00377-4 78. Herbsleb, J.D., Moitra, D.: Global software development. IEEE Software 18(2), 16–20 (2001). https://doi.org/10.1109/52.914732 79. Howison, J., Crowston, K.: The perils and pitfalls of mining SourceForge. In: International Workshop on Mining Software Repositories (MSR), pp. 7–11 (2004). https://doi.org/10.1049/ ic:20040467 80. Howison, J., Conklin, M., Crowston, K.: Flossmole: a collaborative repository for FLOSS research data and analyses. IJITWE 1(3), 17–26 (2006). https://doi.org/10.4018/jitwe. 2006070102 81. Hung, C.S., Dyer, R.: Boa views: easy modularization and sharing of MSR analyses. In: International Conference on Mining Software Repositories (MSR), pp. 147–157. ACM, New York (2020). https://doi.org/10.1145/3379597.3387480 82. Iansiti, M., Levien, R.: Strategy as ecology. Harvard Bus. Rev. 82(3), 68–81 (2004) 83. Jansen, S., Finkelstein, A., Brinkkemper, S.: A sense of community: a research agenda for software ecosystems. In: International Conference on Software Engineering, pp. 187–190 (2009). https://doi.org/10.1109/ICSE-COMPANION.2009.5070978 84. Jansen, S., Brinkkemper, S., Cusumano, M.A.: Software Ecosystems: Analyzing and Managing Business Networks in the Software Industry. Edward Elgar, Cheltenham (2013) 85. Jiang, J., Lo, D., He, J., Xia, X., Kochhar, P.S., Zhang, L.: Why and how developers fork what from whom in GitHub. Empir. Software Eng. 22(1), 547–578 (2017). https://doi.org/10.1007/ s10664-016-9436-6 86. Jurado, F., Rodríguez Marín, P.: Sentiment analysis in monitoring software development processes: an exploratory case study on GitHub’s project issues. J. Syst. Software 104, 82–89 (2015). https://doi.org/10.1016/j.jss.2015.02.055 87. Kabbedijk, J., Jansen, S.: Steering insight: An exploration of the Ruby software ecosystem. In: Software Business, pp. 44–55. Springer, Berlin (2011). https://doi.org/10.1007/978-3-64221544-5%5C_5 88. Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., German, D.M., Damian, D.: The promises and perils of mining GitHub. In: Working Conference on Mining Software Repositories (MSR), MSR 2014, pp. 92–101. ACM, New York (2014). https://doi.org/10. 1145/2597073.2597074 89. Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., German, D., Damian, D.: An in-depth study of the promises and perils of mining GitHub. Empir. Software Eng. 21(5), 2035–2071 (2016). https://doi.org/10.1007/s10664-015-9393-5 90. 
Katz, J.: Libraries.io open source repository and dependency metadata (2020). https://doi.org/ 10.5281/zenodo.3626071 91. Kawuma, S., Businge, J., Bainomugisha, E.: Can we find stable alternatives for unstable Eclipse interfaces? In: International Conference on Program Comprehension (ICPC), pp. 1– 10 (2016). https://doi.org/10.1109/ICPC.2016.7503716

24

T. Mens and C. De Roover

92. Kinsman, T., Wessel, M., Gerosa, M.A., Treude, C.: How do software developers use GitHub Actions to automate their workflows? In: International Conference on Mining Software Repositories (MSR), pp. 420–431. IEEE, Piscataway (2021). https://doi.org/10. 1109/MSR52588.2021.00054 93. Koch, S.: Exploring the effects of SourceForge.net coordination and communication tools on the efficiency of open source projects using data envelopment analysis. Empir. Software Eng. 14(4), 397–417 (2009). https://doi.org/10.1007/s10664-008-9086-4 94. Kolak, S., Afzal, A., Le Goues, C., Hilton, M., Timperley, C.S.: It takes a village to build a robot: an empirical study of the ROS ecosystem. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 430–440 (2020). https://doi.org/10.1109/ ICSME46990.2020.00048 95. Kotovs, V.: Forty years of software reuse. Sci. J. Riga Tech. Univ. 38(38), 153–160 (2009). https://doi.org/10.2478/v10143-009-0013-y 96. Kozaczynski, W., Booch, G.: Component-based software engineering. IEEE Software 15(5), 34–36 (1998). https://doi.org/10.1109/MS.1998.714621 97. Krueger, C.W.: Software reuse. ACM Comput. Surv. 24(2), 131–183 (1992). https://doi.org/ 10.1145/130844.130856 98. Lam, P., Dietrich, J., Pearce, D.J.: Putting the semantics into semantic versioning. In: International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward!), pp. 157–179. ACM, New York (2020). https://doi.org/10.1145/ 3426428.3426922 99. Lanovaz, M.J., Adams, B.: Comparing the communication tone and responses of users and developers in two R mailing lists: measuring positive and negative emails. IEEE Software 36(5), 46–50 (2019). https://doi.org/10.1109/MS.2019.2922949 100. Lauinger, T., Chaabane, A., Wilson, C.B.: Thou shalt not depend on me. Commun. ACM 61(6), 41–47 (2018). https://doi.org/10.1145/3190562 101. Lehman, M.M.: On understanding laws, evolution and conservation in the large program life cycle. J. Syst. Software 1(3), 213–221 (1980). https://doi.org/10.1016/0164-1212(79)900220 102. Lehman, M.M.: Programs, life cycles, and laws of software evolution. Proc. IEEE 68(9), 1060–1076 (1980). https://doi.org/10.1109/PROC.1980.11805 103. Li, Z., Wang, Y., Lin, Z., Cheung, S.C., Lou, J.G.: Nufix: Escape from NuGet dependency maze. In: International Conference on Software Engineering (ICSE), pp. 1545–1557. ACM, New York (2022). https://doi.org/10.1145/3510003.3510118 104. Lin, B., Zagalsky, A., Storey, M.A., Serebrenik, A.: Why developers are slacking off: understanding how software teams use Slack. In: International Conference on Computer Supported Cooperative Work (CSCW), pp. 333–336. ACM, New York (2016). https://doi. org/10.1145/2818052.2869117 105. Lungu, M.: Towards reverse engineering software ecosystems. In: International Conference on Software Maintenance (ICSM), pp. 428–431. IEEE, Piscataway (2008). https://doi.org/10. 1109/ICSM.2008.4658096 106. Ma, Y., Bogart, C., Amreen, S., Zaretzki, R., Mockus, A.: World of code: an infrastructure for mining the universe of open source VCS data. In: International Conference on Mining Software Repositories (MSR), pp. 143–154. IEEE, Piscataway (2019). https://doi.org/10. 1109/MSR.2019.00031 107. Ma, Y., Dey, T., Bogart, C., Amreen, S., Valiev, M., Tutko, A., Kennard, D., Zaretzki, R., Mockus, A.: World of code: enabling a research workflow for mining and analyzing the universe of open source VCS data. Empir. Software Eng. 26(2) (2021). https://doi.org/10. 1007/s10664-020-09905-9 108. 
Makari, I.S., Zerouali, A., De Roover, C.: Prevalence and evolution of license violations in npm and RubyGems dependency networks. In: International Conference on Software and Systems Reuse (ICSR), pp. 85–100. Springer, Berlin (2022). https://doi.org/10.1007/978-3031-08129-3_6

1 An Introduction to Software Ecosystems

25

109. Manes, S.S., Baysal, O.: How often and what StackOverflow posts do developers reference in their GitHub projects? In: International Conference on Mining Software Repositories (MSR), pp. 235–239 (2019). https://doi.org/10.1109/MSR.2019.00047 110. Manikas, K.: Revisiting software ecosystems research: a longitudinal literature study. J. Syst. Software 117, 84–103 (2016). https://doi.org/10.1016/j.jss.2016.02.003 111. Manikas, K., Hansen, K.M.: Software ecosystems: a systematic literature review. J. Syst. Software 86(5), 1294–1306 (2013). https://doi.org/10.1016/j.jss.2012.12.026 112. McIlroy, M.D.: Mass produced software components. In: Software Engineering: Report of a Conference Sponsored by the NATO Science Committee. Garmisch, Germany (1969) 113. Mens, T.: Evolving software ecosystems: a historical and ecological perspective. NATO Sci. Peace Sec. Ser. D Inform. Commun. Sec. Volume 40: Dependable Software Systems Engineering, 170–192 (2015). https://doi.org/10.3233/978-1-61499-495-4-170 114. Mens, T., Demeyer, S. (eds.): Software Evolution. Springer, Berlin (2008) 115. Mens, T., Fernández-Ramil, J., Degrandsart, S.: The evolution of Eclipse. In: International Conference on Software Maintenance (ICSM). IEEE, Piscataway (2008). https://doi.org/10. 1109/ICSM.2008.4658087 116. Mens, T., Serebrenik, A., Cleve, A.: Evolving Software Systems. Springer, Berlin (2014) 117. Messerschmitt, D.G., Szyperski, C.: Software ecosystem: understanding an indispensable technology and industry. MIT Press, Cambridge (2003) 118. Mitropoulos, D., Karakoidas, V., Louridas, P., Gousios, G., Spinellis, D.: The bug catalog of the Maven ecosystem. In: Working Conference on Mining Software Repositories (MSR), pp. 372–375. ACM, New York (2014). https://doi.org/10.1145/2597073.2597123 119. Mockus, A., Fielding, R.T., Herbsleb, J.D.: Two case studies of open source software development: apache and mozilla. Trans. Software Eng. Methodol. 11(3), 309–346 (2002). https://doi.org/10.1145/567793.567795 120. Moldon, L., Strohmaier, M., Wachs, J.: How gamification affects software developers: cautionary evidence from a natural experiment on GitHub. In: International Conference on Software Engineering (ICSE), pp. 549–561 (2021). https://doi.org/10.1109/ICSE43902.2021. 00058 121. Montgomery, L., Lüders, C., Maalej, P.D.W.: The public Jira dataset (2022). https://doi.org/ 10.5281/zenodo.5901804 122. Montgomery, L., Lüders, C., Maalej, W.: An alternative issue tracking dataset of public Jira repositories. In: International Conference on Mining Software Repositories (MSR), pp. 73– 77. ACM, New York (2022). https://doi.org/10.1145/3524842.3528486 123. Moore, J.: Predators and prey: a new ecology of competition. Harvard Bus. Rev. 71(3), 75–83 (1993) 124. Nagy, C., Cleve, A.: Mining stack overflow for discovering error patterns in SQL queries. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 516–520. IEEE, Piscataway (2015). https://doi.org/10.1109/ICSM.2015.7332505 125. Nasehi, S.M., Sillito, J., Maurer, F., Burns, C.: What makes a good code example? A study of programming Q&A in StackOverflow. In: International Conference on Software Maintenance (ICSM), pp. 25–34. IEEE, Piscataway (2012). https://doi.org/10.1109/ICSM.2012.6405249 126. Naur, P., Randell, B.: Software Engineering: Report of a Conference Sponsored by the NATO Science Committee. NATO, Garmisch (1969) 127. Novielli, N., Calefato, F., Lanubile, F.: The challenges of sentiment detection in the social programmer ecosystem. 
In: International Workshop on Social Software Engineering (SSE), pp. 33–40. ACM, New York (2015). https://doi.org/10.1145/2804381.2804387 128. Nugroho, Y.S., Islam, S., Nakasai, K., Rehman, I., Hata, H., Kula, R.G., Nagappan, M., Matsumoto, K.: How are project-specific forums utilized? A study of participation, content, and sentiment in the Eclipse ecosystem. Empir. Software Eng. 26(6), 132 (2021). https://doi. org/10.1007/s10664-021-10032-2 129. Nyman, L., Mikkonen, T.: To fork or not to fork: Fork motivations in SourceForge projects. Int. J. Open Source Software Proces. 3(3) (2011). https://doi.org/10.4018/jossp.2011070101

26

T. Mens and C. De Roover

130. Ochoa, L., Degueule, T., Falleri, J.R., Vinju, J.: Breaking bad? Semantic versioning and impact of breaking changes in Maven Central. Empir. Software Eng. 27(3), 61 (2022). https:// doi.org/10.1007/s10664-021-10052-y 131. Opdebeeck, R., Zerouali, A., De Roover, C.: Smelly variables in Ansible infrastructure code: detection, prevalence, and lifetime. In: International Conference on Mining Software Repositories (MSR). ACM, New York (2022). https://doi.org/10.1145/3524842.3527964 132. Opdebeeck, R., Zerouali, A., De Roover, C.: Control and data flow in security smell detection for infrastructure as code: Is it worth the effort? In: International Conference on Mining Software Repositories (MSR). ACM, New York (2023) 133. Ortu, M., Destefanis, G., Adams, B., Murgia, A., Marchesi, M., Tonelli, R.: The JIRA repository dataset: understanding social aspects of software development. In: International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE). ACM, New York (2015). https://doi.org/10.1145/2810146.2810147 134. Ortu, M., Hall, T., Marchesi, M., Tonelli, R., Bowes, D., Destefanis, G.: Mining communication patterns in software development: a GitHub analysis. In: International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE), pp. 70–79. ACM, New York (2018). https://doi.org/10.1145/3273934.3273943 135. Padhye, R., Mani, S., Sinha, V.S.: A study of external community contribution to opensource projects on GitHub. In: Working Conference on Mining Software Repositories (MSR), pp. 332–335. ACM, New York (2014). https://doi.org/10.1145/2597073.2597113 136. Pichler, M., Dieber, B., Pinzger, M.: Can i depend on you? Mapping the dependency and quality landscape of ROS packages. In: International Conference on Robotic Computing (IRC), pp. 78–85. IEEE, Piscataway (2019). https://doi.org/10.1109/IRC.2019.00020 137. Pietri, A., Spinellis, D., Zacchiroli, S.: The software heritage graph dataset: large-scale analysis of public software development history. In: International Conference on Mining Software Repositories (MSR). IEEE, Piscataway (2020). https://doi.org/10.1145/3379597. 3387510 138. Plakidas, K., Schall, D., Zdun, U.: Evolution of the R software ecosystem: metrics, relationships, and their impact on qualities. J. Syst. Software 132, 119–146 (2017). https://doi.org/10. 1016/j.jss.2017.06.095 139. Pletea, D., Vasilescu, B., Serebrenik, A.: Security and emotion: sentiment analysis of security discussions on GitHub. In: Working Conference on Mining Software Repositories (MSR), pp. 348–351. ACM, New York (2014). https://doi.org/10.1145/2597073.2597117 140. Raemaekers, S., van Deursen, A., Visser, J.: The Maven repository dataset of metrics, changes, and dependencies. In: Working Conference on Mining Software Repositories (MSR), pp. 221–224 (2013). https://doi.org/10.1109/MSR.2013.6624031 141. Raemaekers, S., Van Deursen, A., Visser, J.: Semantic versioning versus breaking changes: a study of the Maven repository. In: International Working Conference on Source Code Analysis and Manipulation (SCAM), pp. 215–224. IEEE, Piscataway (2014). https://doi.org/ 10.1109/SCAM.2014.30 142. Rahman, M.M., Roy, C.K.: An insight into the pull requests of GitHub. In: Working Conference on Mining Software Repositories (MSR), pp. 364–367. ACM, New York (2014). https://doi.org/10.1145/2597073.2597121 143. 
Rastogi, A., Nagappan, N., Gousios, G., van der Hoek, A.: Relationship between geographical location and evaluation of developer contributions in GitHub. In: International Symposium on Empirical Software Engineering and Measurement (ESEM). ACM, New York (2018). https:// doi.org/10.1145/3239235.3240504 144. Raymond, E.: The cathedral and the bazaar. Knowl. Technol. Policy 12(3), 23–49 (1999). https://doi.org/10.1007/s12130-999-1026-0 145. Raymond, E.S.: The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary. O’Reilly, Sebastopol (1999) 146. Rigby, P.C., Hassan, A.E.: What can OSS mailing lists tell us? A preliminary psychometric text analysis of the Apache developer mailing list. In: International Workshop on Mining Software Repositories (MSR), pp. 23–23 (2007). https://doi.org/10.1109/MSR.2007.35

1 An Introduction to Software Ecosystems

27

147. Robles, G., Gonzalez-Barahona, J.M.: Geographic location of developers at SourceForge. In: International Workshop on Mining Software Repositories (MSR), pp. 144–150. ACM, New York (2006). https://doi.org/10.1145/1137983.1138017 148. Sayyad Shirabad, J., Menzies, T.: The PROMISE repository of software engineering databases. School of Information Technology and Engineering, University of Ottawa (2005). URL http://promise.site.uottawa.ca/SERepository 149. Schueller, W., Wachs, J., Servedio, V.D.P., Thurner, S., Loreto, V.: Evolving collaboration, dependencies, and use in the rust open source software ecosystem. Sci. Data 9(1), 703 (2022). https://doi.org/10.1038/s41597-022-01819-z 150. Schwaber, K.: SCRUM development process. In: Business Object Design and Implementation, pp. 117–134. Springer, Berlin (1997) 151. Seppänen, M., Hyrynsalmi, S., Manikas, K., Suominen, A.: Yet another ecosystem literature review: 10+1 research communities. In: European Technology and Engineering Management Summit (E-TEMS), pp. 1–8. IEEE, Piscataway (2017). https://doi.org/10.1109/E-TEMS. 2017.8244229 152. Sharma, T., Fragkoulis, M., Spinellis, D.: Does your configuration code smell? In: Working Conference on Mining Software Repositories (MSR), pp. 189–200 (2016). https://doi.org/10. 1145/2901739.2901761 153. Singh, N., Singh, P.: How do code refactoring activities impact software developers’ sentiments? An empirical investigation into GitHub commits. In: Asia-Pacific Software Engineering Conference (APSEC), pp. 648–653. IEEE, Piscataway (2017). https://doi.org/ 10.1109/APSEC.2017.79 154. Soto-Valero, C., Benelallam, A., Harrand, N., Barais, O., Baudry, B.: The emergence of software diversity in Maven Central. In: International Conference on Mining Software Repositories (MSR), pp. 333–343 (2019). https://doi.org/10.1109/MSR.2019.00059 155. Soto-Valero, C., Harrand, N., Monperrus, M., Baudry, B.: A comprehensive study of bloated dependencies in the Maven ecosystem. Empir. Software Eng. 26(3), 1–44 (2021). https://doi. org/10.1007/s10664-020-09914-8 156. Steglich, C., Marczak, S., Guerra, L.P., Mosmann, L.H., Perin, M., Figueira Filho, F., de Souza, C.: Revisiting the mobile software ecosystems literature. In: International Workshop on Software Engineering for Systems-of-Systems (SESoS) and Workshop on Distributed Software Development, Software Ecosystems and Systems-of-Systems (WDES), pp. 50–57 (2019). https://doi.org/10.1109/SESoS/WDES.2019.00015 157. Storey, M.A., Zagalsky, A., Filho, F.F., Singer, L., German, D.M.: How social and communication channels shape and challenge a participatory culture in software development. Trans. Software Eng. 43(2), 185–204 (2017). https://doi.org/10.1109/TSE.2016.2584053 158. Stringer, J., Tahir, A., Blincoe, K., Dietrich, J.: Technical lag of dependencies in major package managers. In: Asia-Pacific Software Engineering Conference (APSEC), pp. 228–237 (2020). https://doi.org/10.1109/APSEC51365.2020.00031 159. Szyperski, C., Gruntz, D., Murer, S.: Component Software: Beyond Object-Oriented Programming, 1st ed. Addison-Wesley, Boston (1997) 160. Takhteyev, Y., Hilts, A.: Investigating the geography of open source software through GitHub. https://flosshub.org/sites/flosshub.org/files/Takhteyev-Hilts-2010.pdf (2010) 161. Tan, J., Feitosa, D., Avgeriou, P., Lungu, M.: Evolution of technical debt remediation in Python: a case study on the Apache software ecosystem. J. Software Evol. Proces. 33(4) (2020). https://doi.org/10.1002/smr.2319 162. 
Teixeira, J., Hyrynsalmi, S.: How do software ecosystems co-evolve? A view from OpenStack and beyond. In: International Conference of Software Business (ICSOB), pp. 115–130. Springer, Berlin (2017). https://doi.org/10.1007/978-3-319-69191-6 163. Tempero, E., Anslow, C., Dietrich, J., Han, T., Li, J., Lumpe, M., Melton, H., Noble, J.: The Qualitas Corpus: a curated collection of Java code for empirical studies. In: Asia Pacific Software Engineering Conference (APSEC), pp. 336–345 (2010). https://doi.org/10.1109/ APSEC.2010.46

28

T. Mens and C. De Roover

164. Tiwari, N.M., Upadhyaya, G., Rajan, H.: Candoia: a platform and ecosystem for mining software repositories tools. In: International Conference on Software Engineering (ICSE), pp. 759–764 (2016). https://doi.org/10.1145/2889160.2892662 165. Tiwari, N.M., Upadhyaya, G., Nguyen, H.A., Rajan, H.: Candoia: A platform for building and sharing mining software repositories tools as apps. In: International Conference on Mining Software Repositories (MSR), pp. 53–63 (2017). https://doi.org/10.1109/MSR.2017.56 166. Tourani, P., Adams, B.: The impact of human discussions on just-in-time quality assurance: an empirical study on OpenStack and Eclipse. In: International Conference on Software Analysis, Evolution, and Reengineering (SANER), pp. 189–200. IEEE, Piscataway (2016). https://doi.org/10.1109/SANER.2016.113 167. Tourani, P., Jiang, Y., Adams, B.: Monitoring sentiment in open source mailing lists: exploratory study on the Apache ecosystem. In: International Conference on Computer Science and Software Engineering (CASCON), pp. 34–44. IBM, Armonk/ACM, New York (2014) 168. Tsay, J., Dabbish, L., Herbsleb, J.: Influence of social and technical factors for evaluating contribution in GitHub. In: International Conference on Software Engineering (ICSE), pp. 356–366. ACM, New York (2014). https://doi.org/10.1145/2568225.2568315 169. Uddin, G., Khomh, F.: Automatic mining of opinions expressed about APIs in Stack Overflow. Trans. Software Eng., 1–1 (2019). https://doi.org/10.1109/TSE.2019.2900245 170. Um, S., Zhang, B., Wattal, S., Yoo, Y.: Software components and product variety in a platform ecosystem: a dynamic network analysis of WordPress. Inform. Syst. Res. (2022). https://doi. org/10.1287/isre.2022.1172 171. Valiev, M., Vasilescu, B., Herbsleb, J.: Ecosystem-level determinants of sustained activity in open-source projects: a case study of the PyPI ecosystem. In: Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 644–655. ACM, New York (2018). https://doi.org/10.1145/3236024. 3236062 172. Vasilescu, B., Serebrenik, A., Goeminne, M., Mens, T.: On the variation and specialisation of workload: a case study of the Gnome ecosystem community. Empir. Software Eng. 19(4), 955–1008 (2014). https://doi.org/10.1007/s10664-013-9244-1 173. Vasilescu, B., Posnett, D., Ray, B., van den Brand, M.G., Serebrenik, A., Devanbu, P., Filkov, V.: Gender and tenure diversity in GitHub teams. In: Conference on Human Factors in Computing Systems (CHI), pp. 3789–3798. ACM, New York (2015). https://doi.org/10.1145/ 2702123.2702549 174. Vasilescu, B., Yu, Y., Wang, H., Devanbu, P., Filkov, V.: Quality and productivity outcomes relating to continuous integration in GitHub. In: Joint meeting on Foundations of Software Engineering (ESEC/FSE), pp. 805–816 (2015). https://doi.org/10.1145/2786805.2786850 175. Velázquez-Rodríguez, C., Constantinou, E., De Roover, C.: Uncovering library features from API usage on Stack Overflow. In: International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 207–217. IEEE, Piscataway (2022). https://doi.org/10. 1109/SANER53432.2022.00035 176. Velázquez-Rodríguez, C., Di Nucci, D., De Roover, C.: A text classification approach to API type resolution for incomplete code snippets. Sci. Comput. Programm. 227, 102941 (2023). https://doi.org/10.1016/j.scico.2023.102941 177. Wachs, J., Nitecki, M., Schueller, W., Polleres, A.: The geography of open source software: evidence from GitHub. Technol. 
Forecast. Soc. Change 176 (2021). https://doi.org/10.1016/j. techfore.2022.121478 178. Wang, Z., Wang, Y., Redmiles, D.: From specialized mechanics to project butlers: the usage of bots in OSS development. IEEE Software (2022). https://doi.org/10.1109/MS.2022.3180297 179. Weiss, D.M., Lai, C.T.R.: Software Product-Line Engineering: A Family-Based Software Development Process. Addison-Wesley (1999). ISBN 0201694387, 9780201694383 180. Werder, K., Brinkkemper, S.: MEME: toward a method for emotions extraction from GitHub. In: International Workshop on Emotion Awareness in Software Engineering (SEmotion), pp. 20–24. ACM, New York (2018). https://doi.org/10.1145/3194932.3194941

1 An Introduction to Software Ecosystems

29

181. Wessel, M., Serebrenik, A., Wiese, I., Steinmacher, I., Gerosa, M.A.: Quality gatekeepers: investigating the effects of code review bots on pull request activities. Empir. Software Eng. 27(5), 108 (2022). https://doi.org/10.1007/s10664-022-10130-9 182. Wessel, M., Vargovich, J., Gerosa, M.A., Treude, C.: Github actions: the impact on the pull request process. Preprint. arXiv:2206.14118 (2022) 183. Wiese, I.S., Da Silva, J.T., Steinmacher, I., Treude, C., Gerosa, M.A.: Who is who in the mailing list? Comparing six disambiguation heuristics to identify multiple addresses of a participant. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 345–355. IEEE, Piscataway (2016). https://doi.org/10.1109/ICSME.2016.13 184. Willis, A.: The ecosystem: an evolving concept viewed historically. Funct. Ecol. 11, 268–271 (1997) 185. Yang, B., Wei, X., Liu, C.: Sentiments analysis in GitHub repositories: an empirical study. In: Asia-Pacific Software Engineering Conference Workshops (APSEC Workshops), pp. 84–89. IEEE, Piscataway (2017). https://doi.org/10.1109/APSECW.2017.13 186. Yau, S., Collofello, J., MacGregor, T.: Ripple effect analysis of software maintenance. In: International Computer Software and Applications Conference (COMPSAC), pp. 60–65. IEEE, Piscataway (1978). https://doi.org/10.1109/CMPSAC.1978.810308 187. Yu, Y., Wang, H., Filkov, V., Devanbu, P., Vasilescu, B.: Wait for it: determinants of pull request evaluation latency on GitHub. In: Working Conference on Mining Software Repositories (MSR), pp. 367–371 (2015). https://doi.org/10.1109/MSR.2015.42 188. Zagalsky, A., German, D.M., Storey, M.A., Teshima, C.G., Poo-Caamaño, G.: How the R community creates and curates knowledge: an extended study of Stack Overflow and mailing lists. Empir. Software Eng. 23(2), 953–986 (2018). https://doi.org/10.1007/s10664017-9536-y 189. Zerouali, A., Constantinou, E., Mens, T., Robles, G., González-Barahona, J.: An empirical analysis of technical lag in npm package dependencies. In: International Conference on Software Reuse (ICSR). Lecture Notes in Computer Science, vol. 10826, pp. 95–110. Springer, Berlin (2018). https://doi.org/10.1007/978-3-319-90421-4_6 190. Zerouali, A., Mens, T., Decan, A., De Roover, C.: On the impact of security vulnerabilities in the npm and RubyGems dependency networks. Empir. Software Eng. 27(5), 1–45 (2022). https://doi.org/10.1007/s10664-022-10154-1 191. Zerouali, A., Mens, T., Gonzalez-Barahona, J., Decan, A., Constantinou, E., Robles, G.: A formal framework for measuring technical lag in component repositories—and its application to npm. J. Software: Evol. Process 31(8) (2019). https://doi.org/10.1002/smr.2157 192. Zerouali, A., Velázquez-Rodríguez, C., De Roover, C.: Identifying versions of libraries used in Stack Overflow code snippets. In: International Conference on Mining Software Repositories (MSR), pp. 341–345. IEEE, Piscataway (2021). https://doi.org/10.1109/MSR52588. 2021.00046 193. Zhang, Y., Liu, H., Tan, X., Zhou, M., Jin, Z., Zhu, J.: Turnover of companies in openstack: prevalence and rationale. Trans. Software Eng. Methodol. 31(4) (2022). https://doi.org/10. 1145/3510849 194. Zhou, S., Vasilescu, B., Kästner, C.: How has forking changed in the last 20 years? A study of hard forks on GitHub. In: International Conference on Software Engineering (ICSE), pp. 445– 456. ACM, New York (2020). https://doi.org/10.1145/3377811.3380412

Part I

Software Ecosystem Representations

Chapter 2

The Software Heritage Open Science Ecosystem

Roberto Di Cosmo and Stefano Zacchiroli

Abstract Software Heritage is the largest public archive of software source code and associated development history, as captured by modern version control systems. As of July 2023, it has archived more than 16 billion unique source code files coming from more than 250 million collaborative development projects. In this chapter, we describe the Software Heritage ecosystem, focusing on research and open science use cases. On the one hand, Software Heritage supports empirical research on software by materializing in a single Merkle directed acyclic graph the development history of public code. This giant graph of source code artifacts (files, directories, and commits) can be used, and has been used, to study repository forks, open source contributors, vulnerability propagation, software provenance tracking, source code indexing, and more. On the other hand, Software Heritage ensures availability and guarantees integrity of the source code of software artifacts used in any field that relies on software to conduct experiments, contributing to making research reproducible. The source code used in scientific experiments can be archived (e.g., via integration with open-access repositories), referenced using persistent identifiers that allow downstream integrity checks, and linked to/from other scholarly digital artifacts.



2.1 The Software Heritage Archive

Software Heritage [1, 12] is a nonprofit initiative started by Inria in partnership with UNESCO to build a long-term universal archive specifically designed for software source code, capable of storing source code files and directories, together with their full development histories. Software Heritage's mission is to collect, preserve, and make easily accessible the source code of all publicly available software, addressing the needs of a plurality of stakeholders, ranging from cultural heritage to public administrations and from research to industry.
The key principles that underpin this initiative are described in detail in two articles written for a broader audience in the early years of the project [1, 12]. One of these principles was to avoid any a priori selection of the contents of the archive, so as not to risk missing relevant source code whose value only becomes apparent later on. Hence, one of the strategies enacted for collecting source code to archive is the large-scale automated crawling of major software development forges and distributions, as shown in Fig. 2.1. As a consequence of this automated harvesting, there is no guarantee that the archive only contains quality source code or only code that builds properly: curation of the contents will need to happen at a later stage, via human or automated processes that build a view of the archive for specific needs. It may also happen that the archive ends up containing content that needs to be removed, which required the creation of a process to handle takedown requests following current legal regulations.1
The sustainability plan is based on several pillars. The first one is the support of Inria, a national research institution that is involved for the long term. A second one is the fact that Software Heritage provides a common infrastructure catering to the needs of a variety of stakeholders, ranging from industry to academia and from cultural heritage to public administrations. As a consequence, funding comes from a diverse group of sponsors, ranging from IT companies to public institutions. Finally, an extra layer of archival security is provided by a network of independent international mirrors that each maintain a full copy of the archive.2
We recall here a few key properties that set Software Heritage apart from other scholarly infrastructures:
• Software Heritage proactively archives all software, making it possible to store and reference any piece of publicly available software relevant to a research result, independently from any specific field of endeavor, and even when the author(s) did not take any step to have it archived [1, 12];

1 See https://www.softwareheritage.org/legal/content-policy/ for details.
2 More details can be found at https://www.softwareheritage.org/support/sponsors and https://www.softwareheritage.org/mirrors.

Fig. 2.1 Software Heritage data flow: crawling (on the left) and archival (right)


• Software Heritage stores source code with its development history in a uniform data structure, a Merkle Directed Acyclic Graph (DAG) [32], which makes it possible to provide uniform, intrinsic identifiers for the tens of billions of archived software artifacts, independently of the version control system (VCS) or package distribution technology used by software developers [15].

Relevance for Software Ecosystems Software Heritage relates to software ecosystems, according to the seminal definition of Messerschmitt et al. [33], in two main ways. On the one hand, software products are associated with source code artifacts that are versioned and stored in VCSs. For Free/Open Source Software (FOSS), and more generally public code, those artifacts are distributed publicly and can be mined to pursue various goals. Software Heritage collects and preserves observable artifacts that originate from open-source ecosystems, enabling others to access and exploit them in the foreseeable future. On the other hand, Software Heritage provides the means to foster the sharing of even more of those artifacts in the specific case of open scientific practices, which we refer to as the "open science ecosystem" in this chapter. Contrary to software-only ecosystems, the open science ecosystem encompasses a variety of software and non-software artifacts (e.g., data, publications); Software Heritage has contributed to this ecosystem the missing piece of long-term archival and referencing of scientifically relevant software source code artifacts.

2.1.1 Data Model

Modern software development produces multiple kinds of source code artifacts (e.g., source code files, directories, commits), which are usually stored and tracked in version control systems, distributed as packages in various formats, or otherwise. When designing a software source code archive that stores source code with its version control history coming from a disparate set of platforms, there are different design options available. One option is to keep a verbatim copy of all the harvested content, which makes it easy to immediately reuse the package or version control tool. However, this approach can result in storage explosion: as a consequence of both social coding practices on collaborative development platforms and the liberal licensing terms of open-source software, source code artifacts end up being massively duplicated across code hosting and distribution platforms. Choosing a data structure that minimizes duplication is better for long-term preservation and makes it easier to identify code reuse and duplication. This is the choice made by Software Heritage. Its data model is a Directed Acyclic Graph (DAG) that leverages classical ideas from content addressable storage and Merkle trees [32], which we recall briefly here. As shown in Fig. 2.2, the Software Heritage DAG is organized in five logical layers, which we describe below from bottom to top.


Fig. 2.2 Data model of the Software Heritage archive: a directed acyclic graph (DAG) linking together deduplicated software artifacts shared across the entire body of (archived) public code

Contents (or "blobs") form the graph's leaves and contain the raw content of source code files, not including their filenames (which are context-dependent and stored only as part of directory entries).
Directories are associative lists mapping names to directory entries and associated metadata (e.g., permissions). Each entry can point to content objects ("file entries"), revisions ("revision entries," e.g., to represent git submodules or subversion externals), or other directories ("directory entries").
Revisions (or "commits") are point-in-time representations of the entire source tree of a development project. Each revision points to the root directory of the project source tree and a list of its parent revisions (if any).
Releases (or "tags") are revisions that have been marked by developers as noteworthy with a specific, usually mnemonic, name (e.g., a version number like "4.2"). Each release points to a revision and might include additional metadata such as a changelog message, digital signature, etc.
Snapshots are point-in-time captures of the full state of a project development repository. While revisions capture the state of a single development line (or "branch"), snapshots capture the state of all branches in a repository and allow to reconstruct the full state of a repository that has been deleted or modified destructively (e.g., rewriting its history with tools like "git rebase").
Origins represent the places where artifacts have been encountered in the wild (e.g., a public Git repository) and link those places to snapshot nodes and associated metadata (e.g., the timestamp at which crawling happened), allowing to start archive traversals pointing into the Merkle DAG.
The Software Heritage archive is hence a giant graph containing nodes corresponding to all these artifacts and links between them as graph edges.
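To make this layering concrete, the following Python sketch models the node types described above as plain record types. It is only an illustration of the shape of the data model; the field names are ours, not those of the actual Software Heritage implementation.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Content:
    """Leaf node: raw file bytes, with no filename attached."""
    data: bytes

@dataclass
class DirectoryEntry:
    """Named, permissioned pointer to a content, directory, or revision node."""
    name: str
    target_id: str
    perms: int = 0o100644

@dataclass
class Directory:
    entries: List[DirectoryEntry] = field(default_factory=list)

@dataclass
class Revision:
    """A 'commit': root directory of the source tree plus parent revisions."""
    root_directory_id: str
    parent_ids: List[str] = field(default_factory=list)
    message: str = ""

@dataclass
class Release:
    """A 'tag': a mnemonic name attached to a revision."""
    name: str
    revision_id: str

@dataclass
class Snapshot:
    """Full state of a repository: branch name mapped to a revision or release id."""
    branches: Dict[str, str] = field(default_factory=dict)

@dataclass
class Origin:
    """Place where artifacts were observed, linked to the snapshot of each visit."""
    url: str
    visits: List[str] = field(default_factory=list)   # snapshot identifiers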


Fig. 2.3 Evolution of the Software Heritage archive over time (July 2023)

What makes this DAG capable of deduplicating identical content is the fact that each node is identified by a cryptographic hash that concisely represents its contents and that is used in the SWHID identifier detailed in the next section. For the blobs that are the leaves of the graph, this identifier is just a hash of the blob itself, so even if the same file content is present in multiple projects, its identifier will be the same, and it will be stored in the archive only once, as in classical content addressable storage [39]. For internal nodes, the identifier is computed from the aggregation of the identifiers of its children, following the construction originally introduced by Ralph Merkle [32]: as a consequence, if the same directory, possibly containing thousands of files, is duplicated across multiple projects, its identifier will stay the same, and it will be stored only once in the archive. The same goes for revisions, releases, and snapshots. In terms of size, the archive grows steadily over time as new source code artifacts get added to it, as shown in Fig. 2.3. As of July 2023, the Software Heritage archive contained over 16 billion unique source code files, harvested from more than 250 million software origins.3
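The deduplication mechanism just described can be illustrated with a few lines of Python. The hashing scheme below is deliberately simplified (actual Software Heritage identifiers are computed over a Git-compatible serialization of each object), but it shows why identical files and identical directories are stored only once:

import hashlib

store = {}   # content-addressed storage: identifier -> object

def blob_id(data: bytes) -> str:
    # leaf node: the identifier depends only on the file content
    return hashlib.sha1(data).hexdigest()

def directory_id(entries: dict) -> str:
    # internal node: the identifier aggregates (name, child identifier) pairs,
    # so two directories with identical contents get the same identifier
    serialized = "".join(f"{name} {target}\n" for name, target in sorted(entries.items()))
    return hashlib.sha1(serialized.encode()).hexdigest()

def archive_blob(data: bytes) -> str:
    identifier = blob_id(data)
    store.setdefault(identifier, data)   # a duplicate is detected and stored only once
    return identifier

# the same license file archived from two different projects yields a single copy
a = archive_blob(b"GNU GENERAL PUBLIC LICENSE\n")
b = archive_blob(b"GNU GENERAL PUBLIC LICENSE\n")
assert a == b and len(store) == 1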

2.1.2 Software Heritage Persistent Identifiers (SWHIDs)

As part of the archival process, a Software Heritage Persistent Identifier (SWHID) is computed for each source code artifact added to the archive and can be used later to reference, look up, and retrieve it from the archive. The general syntax of SWHIDs is shown in Fig. 2.4.4

3 See https://archive.softwareheritage.org for these and other up-to-date statistics.
4 See https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html for the full specification of SWHIDs.


Fig. 2.4 Schema of the Software Heritage identifiers (SWHID)

SWHIDs are URIs [5] with a simple syntax. Core SWHIDs start with the "swh" URI scheme; the colon (:) is used as separator between the logical parts of identifiers; the schema version (currently 1) is the current version of this identifier schema; then follows the type of source code artifact identified; and finally comes a hex-encoded (using lowercase ASCII characters) cryptographic signature of this object, computed in a standard way, as detailed in [13, 15].
Core SWHIDs can then be complemented by qualifiers that carry contextual extrinsic information about the referenced source code artifact:
origin: the software origin where an object has been found or observed in the wild, as a URI;
visit: persistent identifier of a snapshot corresponding to a specific visit of a repository containing the designated object;
anchor: a designated node in the Merkle DAG relative to which a path to the object is specified;
path: the absolute file path, from the root directory associated with the anchor node, to the object;
lines: line number(s) of interest, usually pointing within a source code file.
The combination of core SWHIDs and qualifiers provides a powerful means of referring, in a research article, to all source code artifacts of interest. By keeping all the development history in a single global Merkle DAG, Software Heritage offers unique opportunities for massive analysis of the software development landscape. By archiving and referencing all the publicly available source code, the archive also constitutes the ideal place to preserve research software artifacts and offers powerful mechanisms to enhance research articles with precise references to relevant fragments of source code, contributing an essential building block to the software pillar of Open Science.
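As an illustration of this syntax, the toy parser below takes a SWHID with qualifiers apart; the qualifier values in the example are made up for illustration, and production users should rely on the official Software Heritage tooling rather than on such a sketch:

import re

CORE_SWHID = re.compile(
    r"^swh:(?P<version>\d+):(?P<type>snp|rel|rev|dir|cnt):(?P<hash>[0-9a-f]{40})$"
)

def parse_swhid(swhid: str) -> dict:
    core, _, qualifier_part = swhid.partition(";")
    match = CORE_SWHID.match(core)
    if match is None:
        raise ValueError(f"not a valid core SWHID: {core}")
    parsed = match.groupdict()
    # qualifiers such as origin=..., visit=..., anchor=..., path=..., lines=...
    parsed["qualifiers"] = dict(
        q.split("=", 1) for q in qualifier_part.split(";") if q
    )
    return parsed

example = ("swh:1:cnt:c839dea9e8e6f0528b468214348fee8669b305b2"
           ";origin=https://example.org/repo.git;lines=1-3")
print(parse_swhid(example))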


2.2 Large Open Datasets for Empirical Software Engineering

The availability of large amounts of source code that came with the growing adoption of open source and collaborative development has attracted the interest of software engineering researchers since the beginning of the 2000s and opened the way to large-scale empirical software engineering studies and a dedicated conference, Mining Software Repositories. Several shared concerns emerged over time in this area, and we recall here some of the ones that are relevant for the discussion in this chapter.
One issue is the significant overhead involved in the systematic extraction of relevant data from publicly available repositories and their analysis for testing research hypotheses. Building a very large-scale dataset containing massive amounts of source code with its version control history is a complex undertaking and requires significant resources, as shown in seminal work by Mockus in 2009 [34]. The lack of a common infrastructure spawned a proliferation of ad hoc pipelines for collecting and organizing source code with its version control history, a duplication of effort that subtracted from the time available to perform the intended research and hindered reusability. A few initiatives were born with the intention of improving this unsatisfactory state of affairs: Boa [17] provides selected datasets (the largest and most recent one at the time of writing consists of about eight million GitHub repositories sampled in October 2019) and a dedicated domain-specific language to perform efficient queries on them, while World of Code [31] collects git repositories on a large scale and maintains dedicated data structures that ease their analysis.
The complexity of addressing the variety of existing code hosting platforms and version control systems resulted in focusing only on subsets of the most popular ones, in particular the GitHub forge and the git version control system, which raises another issue: the risk of introducing bias in the results. In empirical sciences, selection bias [24] is the bias that originates from performing an experiment on a non-representative subset of the entire population under study. It is a methodological issue that can lead to threats to the external validity of experiments, i.e., incorrectly concluding that the obtained results are valid for the entire population, whereas they might only apply to the selected subset. In empirical software engineering, a common pattern that could result in selection bias is performing experiments on software artifacts coming from a relatively small set of development projects. It can be mitigated by ensuring that the project set is representative of the larger set of projects of interest, but doing so can be challenging.
Finally, there is the issue of enabling reproducibility of large-scale experiments, i.e., the ability to replicate the findings of a previous scientific experiment, by the same or a different team of scientists, reusing varying amounts of the artifacts used


in the original experiment [29].5 Large-scale empirical experiments in software engineering might easily require shipping hundreds of GiB up to a few TiB of source code artifacts as part of replication packages, whereas current scientific platforms for data self-archival usually cap at tens of GiB.6
The comprehensiveness of the Software Heritage archive, which makes available the largest public corpus of source code artifacts in a single logical place, helps with all these issues:
• It reduces the opportunity cost of conducting large-scale experiments by offering, at regular intervals, full dumps of the archive content as open datasets.
• It contributes to mitigating selection bias and the associated external validity threats by providing a corpus that strives to be comprehensive, for researchers conducting empirical software engineering experiments targeting large project populations.
• The persistence offered by an independent digital archive, run by a nonprofit open organization, eases the process of ensuring the reproducibility of large-scale experiments, avoiding the need to re-archive the same open-source code artifacts in multiple papers, a wasteful practice that should be avoided if possible.
Using Software Heritage, it is enough to thoroughly document in replication packages the SWHIDs (see Sect. 2.1.2) of all source code artifacts7 used in an empirical experiment to enable other scientists to reproduce the experiments later on [11].
Table 2.1 summarizes the above points, comparing with a few other infrastructures designed specifically for software engineering studies. In the rest of this section, we briefly describe the datasets that Software Heritage curates and maintains to the benefit of other researchers in the field of empirical software engineering.
Before detailing the available datasets, we recall that building and maintaining the Software Heritage infrastructure that is instrumental to build them is a multimillion dollar undertaking. We are making significant efforts to reduce the burden on prospective users, by providing dumps at regular intervals that help with reproducibility and making them directly available on public clouds like AWS. Researchers can then either run their queries directly on the cloud, paying only the compute time, or download them for exploitation on their own infrastructure. To give an idea of the associated costs for researchers, SQL queries on the graph datasets described in Sect. 2.2.1.1 can be performed using Amazon Athena for approximately 5$ per Terabyte scanned at the time of writing.

5 For the sake of conciseness, we do not differentiate here between repeatability, reproducibility, and replicability; we refer instead the interested reader to the ACM terminology available at https://www.acm.org/publications/policies/artifact-review-and-badging-current. To varying degrees, Software Heritage helps with all of them, specifically when it comes to mitigating the risk of losing availability of source code artifacts.
6 For comparison: the total size of source code archived at Software Heritage is ≈1 PiB at the time of writing.
7 As it will become clear in Sect. 2.1.2, in most cases, it will be sufficient to list the SWHIDs of the releases or repository snapshots.


Table 2.1 Comparison of infrastructures for performing empirical software engineering research

Criteria                 | SWH (on S3)               | SWH graph (on premise)    | Boa                 | World of Code
host organisation        | non profit foundation     | non profit foundation     | research project    | research project
purpose                  | archival & research       | archival & research       |                     |
scope                    | all platforms             | all platforms             | GitHub, SourceForge | Git hosting
dataset access           | open, free                | open, free                | closed, on demand   | closed, on demand
query language           | SQL Athena                | graph API                 | custom DSL          | custom API
cost                     | 5$/TB                     | 10K$ setup                | free                | free
dataset update frequency | 6 months                  | 6 months                  | ≈ yearly            | ≈ yearly
reproducibility          | named dataset, SWHID list | named dataset, SWHID list | named dataset       | named dataset

For example, an SQL query to get the 4 topmost commit verb stems from over two billion revisions scans approximately 100 gigabytes of data and provides the user with the answer in less than a minute, for a total cost of approximately 50 cents, a small fraction of the cost one would incur to set up an on-premise solution. When SQL queries are not enough (typically when a graph traversal is needed), the cost of a cloud solution may quickly become significant, and it may become more interesting to set up an on-premise solution. The full compressed graph dataset can be exploited using medium-range server-grade machines that can be acquired for less than 10,000 dollars.
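For readers who prefer scripting such cloud-side queries, the following sketch submits the query of Listing 2.1 (shown below in Sect. 2.2.1.1) to Amazon Athena using boto3. The region, database name, and result bucket are placeholders that depend on how the ORC files have been registered in one's own AWS account:

import time
import boto3

# region, database name, and output bucket below are placeholders
athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT count(*) as c, word FROM (
  SELECT word_stem(lower(split_part(trim(from_utf8(message)), ' ', 1))) as word
  FROM revision WHERE length(message) < 1000000)
WHERE word != ''
GROUP BY word ORDER BY c DESC LIMIT 4
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "swh_graph"},                     # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},   # placeholder
)
query_id = execution["QueryExecutionId"]

# poll until the query terminates, then print the (small) result set
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:   # skip the header row
        print([col.get("VarCharValue") for col in row["Data"]])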

2.2.1 The Software Heritage Datasets The entire content of the Software Heritage archive is publicly available to researchers interested in conducting empirical experiments on it. At the simplest level, the content of the archive can be browsed interactively using the Web user interface at https://archive.softwareheritage.org/ and accessed programmatically using the Web API documented at https://archive.softwareheritage.org/api/. These access paths, however, are not really suitable for large-scale experiments due to protocol overheads and rate limitations enforced to avoid depleting archive resources. To address this, several curated datasets are regularly extracted from the archive and made available to researchers in ways suitable for mass analysis.

2 The Software Heritage Open Science Ecosystem
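Before turning to those datasets, note that for small-scale or exploratory access the Web API mentioned above can also be scripted directly. The sketch below looks up a single archived file by checksum (the checksum is the one reused in the S3 examples later in this section; the endpoint path should be checked against the API documentation linked above):

import requests

API = "https://archive.softwareheritage.org/api/1"
sha1 = "8624bcdae55baeef00cd11d5dfcfa60f68710a02"   # a GPL-3.0 license text

# look up the archival metadata of one content object; mind the rate limits
response = requests.get(f"{API}/content/sha1:{sha1}/", timeout=30)
response.raise_for_status()
print(response.json())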

2.2.1.1 The Software Heritage Graph Dataset

Consider the data model discussed in Sect. 2.1.1. The entire archive graph is exported periodically as the Software Heritage Graph Dataset [38]. Note the word "graph" in there, which characterizes this particular dataset and denotes that only the graph is included in the dataset, up to the content of its leaf nodes, excluded (for size reasons). This dataset is suitable for analyzing source code metadata, including commit information, filenames, software provenance, code reuse, etc., but not for textual analyses of archived source code, as that is stored in graph leaves (see the blob dataset below for how to analyze actual code).
The data model of the graph dataset is a relational representation of the archive Merkle DAG, with one "table" for each type of node: blobs, directories, commits, releases, and snapshots. Each table entry is associated with several attributes, such as multiple checksums for blobs, filenames and attributes for directories, commit messages and timestamps for commits, etc. The full schema is documented at https://docs.softwareheritage.org/devel/swh-dataset/graph/schema.html.
In practical terms, the dataset is distributed as a set of Apache ORC files for each table, suitable for loading into scale-out columnar-oriented data processing frameworks such as Spark and Hadoop. The ORC files can be downloaded from the public Amazon S3 bucket s3://softwareheritage/graph/. At the time of writing, the most recent dataset export has timestamp 2022-12-07, so, for example, the first ORC files of the commit table are:

$ aws s3 ls --no-sign-request s3://softwareheritage/graph/2022-12-07/orc/revision/
2022-12-13 17:41:44 3099338621 revision-[..]-f9492019c788.orc
2022-12-13 17:32:42 4714929458 revision-[..]-42da526d2964.orc
2022-12-13 17:57:00 3095895911 revision-[..]-9c46b558269d.orc
[..]
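Once the ORC files of interest have been downloaded (or made reachable through an s3a:// URL), they can be loaded directly into Spark. The sketch below is a minimal example assuming a local copy of the revision table; the name of the commit date column is an assumption to be checked against the published schema:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("swh-graph-dataset").getOrCreate()

# local (or s3a://) copy of the revision table of the graph dataset
revisions = spark.read.orc("graph/2022-12-07/orc/revision/")

# count archived revisions per year of authorship, as a simple sanity check
(revisions
 .withColumn("year", F.year(F.col("date")))   # "date" assumed to be the author date column
 .groupBy("year")
 .count()
 .orderBy("year")
 .show())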

The current version of the dataset contains metadata for 13 billion source code files, ten billion directories, 2.7 billion commits, 35 million releases, and 200 million VCS snapshots, coming from 189 M software origins. The total size of the dataset is 11 TiB, which makes it impractical for use on personal machines, as opposed to research clusters. For that reason, hosted versions of the dataset are also available on Amazon Athena and Azure Databricks. The former can be queried using the Presto distributed SQL engine without having to download the dataset locally. For example, the following query will return the most common first word stems used in commit messages across more than 2.7 billion commits in just a few seconds:

Listing 2.1 Simple SQL query to get the 4 topmost commit verb stems

SELECT count(*) as c, word FROM (
  SELECT word_stem(lower(split_part(trim(from_utf8(message)), ' ', 1))) as word
  FROM revision WHERE length(message) < 1000000)
WHERE word != ''
GROUP BY word ORDER BY c DESC LIMIT 4


For the curious reader, the (unsurprising) results of the query look like this:

Count        Word
294 369 196  updat
178 738 450  merg
152 441 261  add
113 924 516  fix

More complex queries and examples can be found in previous work [38]. For more details about using the graph dataset, we refer the reader to its technical documentation at https://docs.softwareheritage.org/devel/swh-dataset/graph/.
In addition to the research highlights presented later in this chapter, the Software Heritage graph dataset has been used as the subject of study for the 2020 edition of the MSR (Mining Software Repositories) mining challenge, where students and young researchers in software repository mining have used it to solve the most interesting mining problems they could think of. To facilitate their task, "teaser" datasets, data samples with exactly the same shape as the full dataset but much smaller, have also been produced and can be used by researchers to understand how the dataset works before attacking its full scale. For example, the popular-3k-python teaser contains a subset of 2,197 popular repositories tagged as implemented in Python and being popular according to various metrics (e.g., GitHub stars, PyPI download statistics, etc.). The gitlab-all teaser corresponds to all public repositories on gitlab.com (as of December 2020), an often neglected ecosystem of Git repositories, which is interesting to study to avoid (or compare against) GitHub-specific biases.

2.2.1.2 Accessing Source Code Files

All source code files archived by Software Heritage are spread across multiple copies and also mirrored to the public Amazon S3 bucket s3://softwareheritage/content/. From there, individual files can be retrieved, possibly massively and in parallel, based on their SHA1 checksums. Starting from SWHIDs, one can obtain SHA1 checksums using the content table of the graph dataset and then access the associated content as follows:

$ aws s3 cp s3://softwareheritage/content/8624bcdae55baeef00cd11d5dfcfa60f68710a02 .
download: s3://softwareheritage/content/8624b[..] to ./8624b[..]

$ zcat 8624bcdae55baeef00cd11d5dfcfa60f68710a02 | sha1sum
8624bcdae55baeef00cd11d5dfcfa60f68710a02  -

$ zcat 8624bcdae55baeef00cd11d5dfcfa60f68710a02 | head
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
[..]

Note that individual files are gzip-compressed to further reduce storage size. The general empirical analysis workflow involves three simple steps: identify the source code files of interest using the metadata available in the graph dataset, obtain their checksum identifiers, and then retrieve them in batch and in parallel from public cloud providers. This process scales well up to many millions of files to be analyzed. For even larger-scale experiments, e.g., analyzing all source code files archived at Software Heritage, research institutions may consider setting up a local mirror of the archive.8
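This workflow can be scripted in a few lines of Python. The sketch below takes a list of SHA1 checksums, assumed to have been extracted beforehand from the content table of the graph dataset, and fetches and decompresses the corresponding files in parallel from the public bucket used in the example above:

import gzip
from concurrent.futures import ThreadPoolExecutor

import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = "softwareheritage"
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))   # public bucket

def fetch(sha1: str) -> bytes:
    # step 3: retrieve one gzip-compressed file by checksum and decompress it
    obj = s3.get_object(Bucket=BUCKET, Key=f"content/{sha1}")
    return gzip.decompress(obj["Body"].read())

# steps 1 and 2 (not shown): select files of interest in the graph dataset
# and collect their SHA1 checksums
checksums = ["8624bcdae55baeef00cd11d5dfcfa60f68710a02"]

with ThreadPoolExecutor(max_workers=16) as pool:
    for sha1, data in zip(checksums, pool.map(fetch, checksums)):
        print(sha1, len(data), "bytes")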

2.2.1.3 License Dataset

In addition to datasets that correspond to the actual content of the archive, i.e., source code artifacts as encountered among public code, it is also possible to curate derived datasets extracted from Software Heritage for specific use cases or fields of endeavor. As of today, one notable example of such a derived dataset is the license blob dataset, available at https://annex.softwareheritage.org/public/dataset/licenseblobs/ and described in [51]. It consists of the largest known dataset of the complete texts of free/open-source software (FOSS) license variants. To assemble it, the authors collected from the Software Heritage archive all versions of files whose names are commonly used to convey licensing terms to software users and developers, e.g., COPYRIGHT, LICENSE, etc. (the exact pattern is documented as part of the dataset replication package). The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open-source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various empirical software engineering contexts. Metadata include file length measures, detected MIME type,

8 See https://www.softwareheritage.org/mirrors/ for details, including storage requirements. At the time of writing, a full mirror of the archive requires about 1 PiB of raw storage.


detected SPDX [45] license (using ScanCode [35], a state-of-the-art tool for license detection), example origin (e.g., GitHub repository), and oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license files, plus several portable CSV files for metadata, referencing files via cryptographic checksums.

2.3 Research Highlights

The datasets discussed in the previous section have been used to tackle research problems in empirical software engineering and neighboring fields. In this section, we provide brief highlights on the most interesting of them.

2.3.1 Enabling Artifact Access and (Large-Scale) Analysis

Applied research in various fields has been conducted to ease access to such a huge amount of data as the Software Heritage archive for empirical researchers. This kind of research is not, strictly speaking, research enabled by the availability of the archive to solve software engineering problems but rather research motivated by the practical need of empowering fellow scholars to do so empirically.
As a first example, SwhFS (Software Heritage File System) [2] is a virtual filesystem developed using the Linux FUSE (Filesystem in User SpacE) framework that can "mount," in the UNIX tradition, selected parts of the archive as if they were available locally as part of your filesystem. For example, starting from a known SWHID, one can do the following:

$ mkdir swhfs
$ swh fs mount swhfs/    # mount the archive
$ cd swhfs/

$ cat archive/swh:1:cnt:c839dea9e8e6f0528b468214348fee8669b305b2
#include <stdio.h>

int main(void) {
    printf("Hello, World!\n");
}

$ cd archive/swh:1:dir:1fee702c7e6d14395bbf5ac3598e73bcbf97b030
$ ls | wc -l
127
$ grep -i antenna THE_LUNAR_LANDING.s | cut -f 5
# IS THE LR ANTENNA IN POSITION 1 YET
# BRANCH IF ANTENNA ALREADY IN POSITION 1


In the second example, we are grepping through the Apollo 11 guidance computer source code, searching for references to antennas. SwhFS bridges the gap between classic UNIX-like mining tools, which are often relied upon in empirical software engineering and software repository mining, and the Software Heritage archive and its APIs. However, it is not suitable for very-large-scale mining, because seemingly local archive accesses pass through the public Internet (with caching, but still not adequate for large experiments). swh-graph [7] is a way to enable such large-scale experiments. The main idea behind its approach is to adapt and apply graph compression techniques, commonly used for graphs such as the Web or social networks, to the Merkle DAG that underpins the Software Heritage archive. The main research question addressed by swh-graph is: Is it possible to efficiently perform software development history analyses at ultra-large scale, on a single, relatively cheap machine?

The answer is affirmative. As of today, the entire structure of the Software Heritage graph (≈25 billion nodes + 350 billion edges) can be loaded in memory on a single machine equipped with ≈200 GiB of RAM (roughly: 100 GiB for the direct graph + 100 GiB for its transposed version, which is useful in many research use cases such as source code provenance analysis). While significant and not suitable for personal machines, such requirements are perfectly fine for server-grade hardware on the market, with an investment of a few thousand US dollars in RAM. Once loaded, the entire graph can be visited in full in just a few hours, and a single path visit from end to end can be performed in tens of nanoseconds per edge, close to the cost of a single memory access per edge. In practical terms, this makes it possible to answer queries such as "where does this file/directory/commit come from?" or "list the entire content of this repository" in fractions of a second (depending mostly on the size of the answer), fully in memory, without having to rely on a DBMS or even on disk accesses. The price to pay for this is that (1) the compressed graph representation loaded in memory is derived from the main archive and is not incremental (it must periodically be recreated) and (2) only the graph structure and selected metadata fit in RAM; other metadata reside on disk (also using compressed representations [37]) and need to be memory-mapped for efficient access to frequently accessed information. Finally, the archive also provides interesting use cases for database research. Recently, Wellenzohn et al. [48] have used it to develop a novel type of content-and-structure (CAS) index, capable of indexing over time the evolution of properties associated with specific graph nodes, e.g., a file content residing at a given place in a repository, changing over time together with its metadata (last modified timestamp, author, etc.). While such indexes existed before, their deployment and efficient pre-population were still unexplored at this scale.
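As a rough consistency check of these figures, the following back-of-the-envelope arithmetic assumes 25 ns per traversed edge as a stand-in for "tens of nanoseconds" (illustrative only):

edges = 350e9              # ~350 billion edges in the compressed graph
seconds = edges * 25e-9    # assumed cost of ~25 ns per traversed edge
print(seconds / 3600)      # ~2.4 hours, consistent with "a few hours" for a full visit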


2.3.2 Software Provenance and Evolution

The peculiar structure (a fully deduplicated Merkle DAG) and the comprehensiveness of the Software Heritage archive provide a powerful observation point and tool for studying the evolution and provenance of public source code artifacts. In particular, it is possible, on the one hand, to navigate the Merkle DAG backward, starting from any artifact of interest (source code file, directory, commit, etc.), to obtain the full list of all places (e.g., different repositories) where it has ever been distributed from. This area is referred to as software provenance and, in its simplest form, deals with determining the original (i.e., earliest) distribution place of a given artifact. More generally, being able to identify all places that have ever distributed it provides a way to measure software impact, track out-of-date copies or clones, and more. Rousseau et al. [42] used the Software Heritage archive in a study that made two relevant contributions in this area. First, exploiting the fact that commits are deduplicated and timestamped, they verified that the growth of public code as a whole, at least as observable through the lens of Software Heritage, is exponential: the number of original commits (i.e., commits never observed before throughout the archive, no matter the origin repository) in public source code doubles every ≈30 months and has been doing so for the past 20 years. If, on the other hand, we look at original source code blobs (i.e., files whose content has never been observed before throughout the archive, up to that point in time), the overall trend remains the same, and only the speed changes: the number of original public source code blobs doubles every ≈22 months. These are remarkable findings for software evolution, which had never been verified before at this macro level. Second, the authors showed how to model software provenance compactly, so that it can be represented (space-)efficiently at the scale of Software Heritage and can be used to address software audit use cases, which are commonplace in open-source compliance scenarios, merger and acquisition audits, etc.
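Expressed as annual growth factors, these doubling periods correspond roughly to the following (illustrative arithmetic only):

print(2 ** (12 / 30))   # ~1.32: original commits grow by ~32% per year (doubling every ~30 months)
print(2 ** (12 / 22))   # ~1.46: original blobs grow by ~46% per year (doubling every ~22 months)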

2.3.3 Software Forks

The same characteristics that enable studying the evolution and provenance of public code artifacts can be leveraged to study the global ecosystem of software forks. In particular, the fact that commits are fully deduplicated makes it possible to detect forks (both collaborative ones, such as those created on social coding platforms to submit pull requests, and hostile ones, used to take a project in a different direction) even when they are not created on the same platform. It is possible to detect the fork of a project originally created on GitHub and living on GitLab.com, or vice versa, based on the fact that the respective repositories share a common commit history. This is important as a methodological point for empirical researchers, because by relying only on platform metadata (e.g., the fact that a repository has been created by clicking on a "fork" button on the GitHub user interface), researchers


risk overlooking other relevant forks. In previous work, Zacchiroli [51] provided a classification of fork types based on whether repositories are explicitly tracked as being forks of one another on a coding platform (Type 1 forks), share at least one commit (Type 2), or share a common root directory at some point in their histories (Type 3). He empirically verified that between 3.8% and 16% of forks could be overlooked by considering only Type 1 forks, possibly inducing a significant threat to validity for empirical analyses of forks that strive to be comprehensive. Along the same lines, Bhattacharjee et al. [6] (participants in the MSR 2020 mining challenge) focused their analyses on "cross-platform" forks between GitHub and GitLab.com, identifying several cases in which interesting development activity can be found on GitLab even for projects initially mirrored from GitHub.
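As an illustration of the underlying idea, here is a minimal Python sketch of flagging Type 2 fork candidates by intersecting the commit histories of two repositories hosted on different platforms; the repository URLs are hypothetical, and the actual study operates on the deduplicated commits of the archive rather than on fresh clones.

import subprocess, tempfile

def commit_set(repo_url: str) -> set:
    """Bare-clone a repository and return the set of all reachable commit hashes."""
    gitdir = tempfile.mkdtemp()
    subprocess.run(["git", "clone", "--bare", "--quiet", repo_url, gitdir], check=True)
    out = subprocess.run(["git", "--git-dir", gitdir, "rev-list", "--all"],
                         capture_output=True, text=True, check=True)
    return set(out.stdout.split())

github = commit_set("https://github.com/example/project.git")   # hypothetical origin
gitlab = commit_set("https://gitlab.com/example/project.git")   # hypothetical origin
print("shared commits:", len(github & gitlab))  # > 0 suggests a Type 2 fork pair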

2.3.4 Diversity, Equity, and Inclusion

Diversity, equity, and inclusion (DE&I) studies are hot research topics in the area of human aspects of software engineering. Free/open-source software artifacts, as archived by Software Heritage, provide a wealth of data for analyzing evolutionary DE&I trends, in particular in the very long term and at the largest scale attempted thus far. A recent study by Zacchiroli [50] has used Software Heritage to explore the trend of gender diversity over a time period of 50 years. He conducted a longitudinal study of the population of contributors to publicly available software source code, analyzing 1.6 billion commits corresponding to the development history of 120 million projects, contributed by 33 million distinct authors over a period of 50 years. At this scale, authors cannot be interviewed to ask their gender, nor was cross-checking with a large-enough complementary dataset possible. Instead, automated detection based on census data from around the world and the gender-guesser tool (benchmarked for accuracy and popular in the field) was used. Results show that while the number of commits by female authors remains very low overall (male authors have contributed more than 92% of public code commits over the 50 years leading to 2019), there is evidence of a stable long-term increase in their proportion over all contributions (with the ratio of commits by female authors growing steadily over 15 years, reaching in 2019 for the first time 10% of all contributions to public code). Follow-up studies have added the spatial dimension, investigating the geographic gap in addition to the gender one. Rossi et al. [40] have developed techniques to detect the geographic origin of authors of Software Heritage commits, using as signals the time-zone offset and the author names (compared against census data from around the world). Results over 50 years of development history show evidence of the early dominance of North America in open-source software, later joined by Europe. After that period, the geographic diversity in public code has been constantly increasing, with more and more contributions coming from Central and South Asia (including India), Russia, Africa, and Central and South America.
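For illustration, a minimal sketch of the name-based detection idea using the open-source gender-guesser Python package follows; the author names are toy data, and the actual study combines this kind of detection with worldwide census data and accuracy benchmarking.

import gender_guesser.detector as gender   # pip install gender-guesser

detector = gender.Detector(case_sensitive=False)
for author in ["Ada Lovelace", "Linus Torvalds", "Sasha Koch"]:   # toy commit author names
    first_name = author.split()[0]
    print(first_name, "->", detector.get_gender(first_name))      # e.g., female / male / andy / unknown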


Finally, Rossi et al. [41] put together the temporal and spatial dimensions, using the Software Heritage archive to investigate whether the ratio of women's participation over time shows notable differences around the world, at the granularity of 20 macro-regions. The main result is that the increasing trend of women's participation is indeed a worldwide phenomenon, with the exception of specific regions of Asia where the increase is either slowed or completely flat. An incidental finding is also worth noting: the positive trend of increasing women's participation observed up to 2019 has been reversed by the COVID-19 pandemic, with both the ratio of contributions by female authors and the ratio of active female authors decreasing sharply starting at about that time. These studies show how social aspects of software engineering can benefit from large-scale empirical studies and how such studies can be enabled by comprehensive, public archives of public code artifacts.

2.4 Building the Software Pillar of Open Science

Software plays a key role in scientific research, and it can be a tool, a result, and a research object. [. . . ] France will support the development and preservation of source code – inseparable from the support of humanity's technical and scientific knowledge – and it will, from this position, continue its support for the Software Heritage universal archive. So as to create an ecosystem that connects code, data and publications, the collaboration between the national open archive HAL, the national research data platform Recherche Data Gouv, the scientific publishing sector and Software Heritage will be strengthened.
(Second French national plan for open science, July 2021 [22])

Software is an essential research output, and its source code implements and describes data generation and collection, data visualization, data analysis, data transformation, and data processing with a level of precision that is not met by scholarly articles alone. Publicly accessible software source code allows a better understanding of the process that leads to research results, and open-source software allows researchers to build upon the results obtained by others, provided proper mechanisms are put in place to make sure that software source code is preserved and that it is referenced in a persistent way. There is a growing general awareness of its importance for supporting the research process [9, 25, 46]. Many research communities focus on the issue of scientific reproducibility and strongly encourage making the source code of research artefacts available by archiving it in publicly accessible long-term archives; some have even put in place mechanisms to assess research software, like the Artefact Evaluation process introduced at the ESEC/FSE 2011 conference and now widely adopted by many computer science conferences [10], and the ACM Artifact Review and Badging program.9 Others raise the complementary issues of making

9 https://www.acm.org/publications/policies/artifact-review-badging.


it easier to discover existing research software and giving academic credit to its authors [26, 30, 44]. These important issues are similar in spirit to those that led to the now-popular FAIR data movement [49], and as a first step, it is important to clearly identify the different concerns that come into play when addressing software, and in particular its source code, as a research output. They can be classified as follows:
• Archival: software artifacts must be properly archived, to ensure we can retrieve them at a later time.
• Reference: software artifacts must be properly referenced to ensure we can identify the exact code, among many potentially archived copies, used for reproducing a specific experiment.
• Description: software artifacts must be equipped with proper metadata to make it easy to find them in a catalog or through a search engine.
• Citation: research software must be properly cited in research articles in order to give credit to the people who contributed to it.
These are not only different concerns but also separate ones. Establishing proper credit for contributors via citations or providing proper metadata to describe the artifacts requires a curation process [3, 8, 14] and is far more complex than simply providing stable, intrinsic identifiers to reference a precise version of a software source code for reproducibility purposes [4, 15, 26]. Also, as remarked in [4, 25], research software is often a thin layer on top of a large number of software dependencies that are developed and maintained outside of academia, so the usual approach based on institutional archives is not sufficient to cover all the software that is relevant for the reproducibility of research. In this section, we focus on the first two concerns, archival and reference, which can be addressed fully by leveraging the Software Heritage archive, but we also describe how Software Heritage contributes through its ecosystem to the other two concerns.
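As an illustration of the Reference concern, a content-level SWHID can pin down the exact artifact used in an experiment. The following example reuses the content identifier shown in the SwhFS session above, while the origin and lines qualifiers are hypothetical additions:

swh:1:cnt:c839dea9e8e6f0528b468214348fee8669b305b2;origin=https://github.com/example/hello;lines=1-5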

2.4.1 Software in the Scholarly Ecosystem

Presenting results in journal or conference articles has always been part of the research activity. The growing trend, however, is to include software to support or demonstrate such results. This activity can be a significant part of academic work and must be properly taken into account when researchers are evaluated [4, 44]. Software source code developed by researchers is only a thin layer on top of the complex web of software components, most of them developed outside of academia, which are necessary to produce scientific results: as an example, Fig. 2.5 shows the many components that are needed by the popular matplotlib library [27]. As a consequence, scholarly infrastructures that support software source code written in academia must go the extra mile to ensure they adopt standards and provide mechanisms that are compatible with the ones used by tens of millions

Fig. 2.5 Direct and indirect dependencies for a specific Python package (matplotlib). In blue are the Python dependencies; in red are the "true" system dependencies incurred by Python (e.g., the libc or libjpeg62); in green are some dependencies triggered by the package management system but which are very likely not used by Python (e.g., adduser or dpkg)

2 The Software Heritage Open Science Ecosystem

53

of non-academic software developers worldwide. They also need to ensure that the large number of software components that are developed outside academia, but are relevant for research activities, are properly taken into account. Over the recent years, there have been a number of initiatives to add support for software artifacts in the scholarly world, which fall short of satisfying these requirements. They can be roughly classified in two categories:
• Overlays on public forges provide links from articles to the source code repository of the associated software artifact as found on a public code hosting platform (forge); typical examples are websites like https://paperswithcode.com/ and http://www.replicabilitystamp.org/, and the Code and data links recently introduced in ArXiv.org.
• Deposits in academic repositories take snapshots of a given state of the source code, usually in the form of a .zip or .tar file, and store them in the repository exactly like an article or a dataset, with an associated publisher identifier; a typical example in computer science is the ACM Digital Library, but there are a number of general academic repositories where software artefacts have been deposited, like Figshare and Zenodo.
The approaches in the first category rely on code hosting platforms that do not guarantee persistence of the software artifact: the author of a project may alter, rename, or remove it, and we have seen that code hosting platforms can be discontinued or decide to remove large numbers of projects.10 The approaches in the second category do take persistence into account, as they archive software snapshots, but they lose the version control history and do not provide the granularity needed to reference the internal components of a software artifact (directories, files, snippets). And none of the initiatives in these categories provides a means to properly archive and reference the numerous external dependencies of software artefacts. This is where Software Heritage comes into play for Open Science, by providing an archive designed for software that provides persistence, preserves the version control history, supports granularity in the identification of software artefacts and their components, and harvests all publicly available source code. The differences described above are summarized in Table 2.2, where we only consider infrastructures in the second category, as they are the only ones assuming the mission to archive their contents. We also take into account additional features found in academic repositories, like the possibility of depositing content with an embargo period, which is not possible on Software Heritage, and the existence of a curation process to obtain qualified metadata, which is currently out of scope for Software Heritage.

10 Google Code and Gitorious.org were shut down in 2015, Bitbucket removed support for the Mercurial VCS in 2020, and in 2022, GitLab.com considered removing all projects inactive for more than a year.


Table 2.2 Comparison of infrastructures for archiving research software. The various granularities of identifiers are abbreviated with the same convention used in SWHIDs (snp for snapshot, etc.), plus the abbreviation frg that stands for the ability to identify a code fragment.
• Software Heritage: identifier = intrinsic; granularity = snp, rel, rev, dir, cnt, frg; archival = harvest, deposit, save code now; history = full VCS; browse code = yes; scope = universal; embargo = no; curation = no; integration = BitBucket, SourceForge, GitHub, Gitea, GitLab, HAL, etc.
• ACM DL: identifier = extrinsic; granularity = dir; archival = deposit; history = no; browse code = no; scope = discipline; embargo = no; curation = yes
• HAL: identifier = extrinsic + intrinsic (via SWH); granularity = dir; archival = deposit; history = no; browse code = no; scope = academic; embargo = yes; curation = yes; integration = SWH
• Figshare: identifier = extrinsic; granularity = dir; archival = deposit; history = no; browse code = no; scope = academic; embargo = yes; curation = no
• Zenodo: identifier = extrinsic; granularity = rel, dir; archival = deposit; history = releases; browse code = no; scope = academic; embargo = yes; curation = no; integration = GitHub

2.4.2 Extending the Scholarly Ecosystem Architecture to Software

In the framework of the European Open Science Cloud initiative (EOSC), a working group was tasked in 2019 to bring together representatives from a broad spectrum of scholarly infrastructures to study these issues and propose concrete ways to address them. The result, known as the EOSC Scholarly Infrastructures for Research Software (SIRS) report [16], was published in 2020 and provides a detailed analysis of the existing infrastructures, their relationships, and the workflows that are needed to properly support software as a research result on par with publications and data. Figure 2.6 presents the main categories of identified actors:
• Scholarly repositories: services that have as one of their primary goals the long-term preservation of the digital content that they collect.
• Academic publishers: organizations that prepare submitted research texts, possibly with associated source code and data, to produce a publication and manage the dissemination, promotion, and archival process. Software and data can be part of the main publication or assets given as supplementary materials, depending on the policy of the journal.


Fig. 2.6 Overview of the high-level architecture of scholarly infrastructures for research software, as described in the EOSC SIRS report

• Aggregators: services that collect information about digital content from a variety of sources with the primary goal of increasing its discoverability and possibly adding value to this information via processes like curation, abstraction, classification, and linking.
These actors have a long history of collaboration around research articles, with well-defined workflows and collaborations. The novelty here is that, to handle research software, it is no longer possible to work in isolation inside the academic world, for the reasons explained previously: one needs a means to share information and work with other ecosystems where software is present, such as industry and public administration. One key finding of the EOSC SIRS report is that Software Heritage provides the shared basic architectural layer that makes it possible to interconnect all these ecosystems, because of its unified approach to archiving and referencing all software artefacts, independently of the tools or platforms used to develop or distribute the software involved.

2.4.3 Growing Technical and Policy Support

In order to take advantage of the services provided by Software Heritage in this setting, a broad spectrum of actions has been started and is ongoing. We briefly survey here the ones that are most relevant at the time of writing.


Fig. 2.7 Overview of the interplay between HAL and Software Heritage for research software

At the national level, France has developed a multi-annual plan on Open Science that includes research software [21, 22] and consistently implemented this plan through a series of steps that range from technical development to policy measures. On the technical side, the French national open-access repository HAL [14] (analogous to the popular arXiv service11 ) has been integrated with the Software Heritage archive. The integration allows researchers to have their software projects archived and referenced in Software Heritage, while curated rich metadata and citation information are made available on HAL [14], with a streamlined process depicted in Fig. 2.7. On the policy side, the second French national plan for open science [22], published in July 2021, prescribes the use of Software Heritage and HAL for all the research software produced in France, and Software Heritage is now listed

11 https://arxiv.org.


in the official national roadmap of research infrastructures published in February 2022 [23]. This approach is now being pushed forward at the European level, through funding for consortia that will build the needed connectors between Software Heritage and several infrastructures and technologies used in academia, using the French experience as a reference. Most notably, the FAIRCORE4EOSC [19] European project includes plans to build connectors with scholarly repository systems like Dataverse [47] and InvenioRDM [28] (the white-label variant of Zenodo), publishers like Dagstuhl [43] and Episcience [18], and aggregators like swMath [20] and OpenAire [36].

2.4.4 Supporting Researchers

The growing awareness of the importance of software as a research output will inevitably bring new recommendations for research activity, which will eventually become obligations for researchers, as we have seen with publications and data. Through the collaboration with academic infrastructures, Software Heritage is striving to develop mechanisms that minimize the extra burden for researchers; we mention here a few examples. A newly released extension, codenamed updateswh, for the popular Web browsers Firefox and Google Chrome allows triggering archival in just one click for any public repository hosted on Bitbucket, GitLab (.com and any instance), GitHub, and any instance of Gitea. It also allows accessing in one click the archived version of the repository and obtaining the associated SWHID identifier. Integration with webhooks is available for a variety of code hosting platforms, including Bitbucket, GitHub, GitLab.com, and SourceForge, as well as for instances of GitLab and Gitea, which enables owners of projects hosted on those platforms to trigger archival automatically on any new release, reducing the burden on researchers even more. Software Heritage will try to detect and parse intrinsic metadata present in software projects independently of the format chosen, but we see the value of standardizing on a common format. This is why, with all the academic platforms we are working with, we are advocating the use of codemeta.json, a machine-readable file based on the CodeMeta extension of schema.org, to automatically retrieve metadata associated with software artifacts, in order to avoid the need for researchers to fill in forms when declaring software artifacts in academic catalogs, following the schema put in place with the HAL national open-access portal. Finally, we have released the biblatex-software bibliographic style extension to make it easy to cite software artefacts in publications written using the popular LaTeX framework.
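For illustration, a minimal codemeta.json sketch is shown below; all field values are hypothetical, and the CodeMeta vocabulary defines many more terms (contributors, funding, related publications, etc.).

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "name": "example-analysis-tool",
  "description": "Replication package for a hypothetical empirical study.",
  "codeRepository": "https://github.com/example/example-analysis-tool",
  "license": "https://spdx.org/licenses/MIT",
  "author": [{"@type": "Person", "givenName": "Jane", "familyName": "Doe"}]
}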


2.5 Conclusions and Perspectives

In conclusion, the Software Heritage ecosystem is a useful resource both for software engineering studies and for Open Science. As an infrastructure for research on software engineering, the archive provides numerous benefits. The SWHID intrinsic identifiers make it easier for researchers to identify and track software artifacts across different repositories and systems. The uniform data structure used by the archive abstracts away all the details of software forges and package managers, providing a standardized representation of software code that is easy to use and analyze. The availability of the open datasets makes it possible to tailor experiments to one's needs and improves their reproducibility. An obvious direction at the time of writing is to leverage Software Heritage's extensive source code corpus for pre-training large language models. Future collaborations may lead to integrating functionalities like the domain-specific language from the Boa project or the efficient data structures of the World of Code project, enabling researchers to run more specialized queries and achieve more detailed insights. Regarding the Open Science aspect, Software Heritage already offers the reference archive for all publicly available research software. The next step is to interconnect it with a growing number of scholarly infrastructures, which will increase the reproducibility of research in all fields and support software citation directly from the archive, contributing to increasing the visibility of research software. Going forward, we believe that Software Heritage will provide a unique observatory for the whole software development ecosystem, both in academia and outside of it. We hope that with growing adoption, it will play an increasingly valuable role in advancing the state of software engineering research and in supporting the software pillar of open science.

References 1. Abramatic, J.F., Di Cosmo, R., Zacchiroli, S.: Building the universal archive of source code. Commun. ACM 61(10), 29–31 (2018). https://doi.org/10.1145/3183558 2. Allançon, T., Pietri, A., Zacchiroli, S.: The software heritage filesystem (SwhFS): integrating source code archival with development. In: International Conference on Software Engineering (ICSE). IEEE, Piscataway (2021). https://doi.org/10.1109/ICSE-Companion52605.2021. 00032 3. Allen, A., Schmidt, J.: Looking before leaping: creating a software registry. J. Open Res. Softw. 3(e15) (2015). https://doi.org/10.5334/jors.bv 4. Alliez, P., Di Cosmo, R., Guedj, B., Girault, A., Hacid, M.S., Legrand, A., Rougier, N.: Attributing and referencing (research) software: best practices and outlook from INRIA. Comput. Sci. Eng. 22(1), 39–52 (2020). https://doi.org/10.1109/MCSE.2019.2949413. Available from https://hal.archives-ouvertes.fr/hal-02135891 5. Berners-Lee, T., Fielding, R., Masinter, L.: Uniform resource identifier (URI): Generic syntax. RFC 3986, RFC Editor (2005)


6. Bhattacharjee, A., Nath, S.S., Zhou, S., Chakroborti, D., Roy, B., Roy, C.K., Schneider, K.A.: An exploratory study to find motives behind cross-platform forks from software heritage dataset. In: International Conference on Mining Software Repositories (MSR), pp. 11–15. ACM, New York (2020). https://doi.org/10.1145/3379597.3387512 7. Boldi, P., Pietri, A., Vigna, S., Zacchiroli, S.: Ultra-large-scale repository analysis via graph compression. In: International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 184–194. IEEE, Piscataway (2020). https://doi.org/10.1109/SANER48275. 2020.9054827 8. Bönisch, S., Brickenstein, M., Chrapary, H., Greuel, G., Sperber, W.: swMATH - a new information service for mathematical software. In: MKM/Calculemus/DML. Lecture Notes in Computer Science, vol. 7961, pp. 369–373. Springer, Berlin (2013) 9. Borgman, C.L., Wallis, J.C., Mayernik, M.S.: Who’s got the data? Interdependencies in science and technology collaborations. In: Computer Supported Cooperative Work (CSCW), vol. 21, pp. 485–523 (2012). https://doi.org/10.1007/s10606-012-9169-z 10. Childers, B.R., Fursin, G., Krishnamurthi, S., Zeller, A.: Artifact evaluation for publications (Dagstuhl Perspectives Workshop 15452). Dagstuhl Rep. 5(11), 29–35 (2016). https://doi.org/ 10.4230/DagRep.5.11.29 11. Di Cosmo, R.: Archiving and referencing source code with software heritage. In: International Conference on Mathematical Software (ICMS). Lecture Notes in Computer Science, vol. 12097, pp. 362–373. Springer, Berlin (2020). https://doi.org/10.1007/978-3-030-522001_36 12. Di Cosmo, R., Zacchiroli, S.: Software Heritage: Why and how to preserve software source code. In: International Conference on Digital Preservation (iPRES) (2017) 13. Di Cosmo, R., Gruenpeter, M., Zacchiroli, S.: Identifiers for digital objects: the case of software source code preservation. In: International Conference on Digital Preservation (iPRES) (2018). https://doi.org/10.17605/OSF.IO/KDE56 14. Di Cosmo, R., Gruenpeter, M., Marmol, B.P., Monteil, A., Romary, L., Sadowska, J.: Curated Archiving of Research Software Artifacts: lessons learned from the French open archive (HAL) (2019). Presented at the International Digital Curation Conference. Submitted to IJDC 15. Di Cosmo, R., Gruenpeter, M., Zacchiroli, S.: Referencing source code artifacts: a separate concern in software citation. Comput. Sci. Eng. 22(2), 33–43 (2020). https://doi.org/10.1109/ MCSE.2019.2963148 16. Di Cosmo, R., Lopez, J.B.G., Abramatic, J.F., Graf, K., Colom, M., Manghi, P., Harrison, M., Barborini, Y., Tenhunen, V., Wagner, M., Dalitz, W., Maassen, J., Martinez-Ortiz, C., Ronchieri, E., Yates, S., Schubotz, M., Candela, L., Fenner, M., Jeangirard, E.: Scholarly Infrastructures for Research Software. European Commission. Directorate General for Research and Innovation (2020). https://doi.org/10.2777/28598 17. Dyer, R., Nguyen, H.A., Rajan, H., Nguyen, T.N.: Boa: a language and infrastructure for analyzing ultra-large-scale software repositories. In: International Conference on Software Engineering (ICSE), pp. 422–431 (2013) 18. Episciences. https://www.episciences.org. Accessed 15 April 2023 19. FAIRCORE4EOSC project. https://faircore4eosc.eu. Accessed 15 April 2023 20. FIZ Karlsruhe GmbH: swMATH mathematical software. https://swmath.org (2023). Accessed 15 April 2023 21. French Ministry of Research and Higher Education: French National Plan for Open Science. 
https://www.enseignementsup-recherche.gouv.fr/fr/le-plan-national-pour-la-science-ouverteles-resultats-de-la-recherche-scientifique-ouverts-tous-49241 (2018) 22. French Ministry of Research and Higher Education: French second national plan for open science: Support and opportunities for universities’ open infrastructures and practices. https://www.enseignementsup-recherche.gouv.fr/fr/le-plan-national-pour-lascience-ouverte-2021-2024-vers-une-generalisation-de-la-science-ouverte-en-48525 (2021) 23. French Ministry of Research and Higher Education: Feuille de route nationale des infrastructures de recherche. https://www.enseignementsup-recherche.gouv.fr/fr/feuille-de-routenationale-des-infrastructures-de-recherche (2022)


24. Heckman, J.: Varieties of selection bias. Am Eco Rev 80(2), 313–318 (1990) 25. Hinsen, K.: Software development for reproducible research. Comput. Sci. Eng. 15(4), 60–63 (2013). https://doi.org/10.1109/MCSE.2013.91 26. Howison, J., Bullard, J.: Software in the scientific literature: problems with seeing, finding, and using software mentioned in the biology literature. J. Assoc. Inf. Sci. Technol. 67(9), 2137– 2155 (2016). https://doi.org/10.1002/asi.23538 27. Hunter, J.D.: Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 9(3), 90–95 (2007). https://doi.org/10.1109/MCSE.2007.55 28. Invenio: InvenioRDM. https://inveniosoftware.org/products/rdm/. Accessed 15 April 2023 29. Ivie, P., Thain, D.: Reproducibility in scientific computing. ACM Comput. Surv. 51(3), 63:1– 63:36 (2018). https://doi.org/10.1145/3186266 30. Lamprecht, A.L., Garcia, L., Kuzak, M., Martinez, C., Arcila, R., Martin Del Pico, E., Dominguez Del Angel, V., van de Sandt, S., Ison, J., Martinez, P.A., McQuilton, P., Valencia, A., Harrow, J., Psomopoulos, F., Gelpi, J.L., Chue Hong, N., Goble, C., Capella-Gutierrez, S.: Towards FAIR principles for research software. Data Sci. 3(1), 37–59 (2020). https://doi.org/ 10.3233/DS-190026 31. Ma, Y., Bogart, C., Amreen, S., Zaretzki, R., Mockus, A.: World of code: an infrastructure for mining the universe of open source VCS data. In: International Conference on Mining Software Repositories (MSR), pp. 143–154. IEEE, Piscataway (2019). https://doi.org/10.1109/ MSR.2019.00031 32. Merkle, R.C.: A digital signature based on a conventional encryption function. In: Advances in Cryptology (CRYPTO), pp. 369–378 (1987). https://doi.org/10.1007/3-540-48184-2%5C_32 33. Messerschmitt, D.G., Szyperski, C.: Software Ecosystem: Understanding an Indispensable Technology and Industry. MIT Press, Cambridge (2003) 34. Mockus, A.: Amassing and indexing a large sample of version control systems: towards the census of public source code history. In: International Working Conference on Mining Software Repositories (MSR), pp. 11–20. IEEE, Piscataway (2009). https://doi.org/10.1109/MSR.2009. 5069476 35. nexB: ScanCode. https://www.aboutcode.org/projects/scancode.html. Accessed 15 April 2023 36. Openaire. https://www.openaire.eu. Accessed 15 April 2023 37. Pietri, A.: Organizing the graph of public software development for large-scale mining. (organisation du graphe de développement logiciel pour l’analyse à grande échelle). Ph.D. Thesis, University of Paris (2021) 38. Pietri, A., Spinellis, D., Zacchiroli, S.: The Software Heritage graph dataset: public software development under one roof. In: International Conference on Mining Software Repositories (MSR), pp. 138–142 (2019). https://doi.org/10.1109/MSR.2019.00030 39. Quinlan, S., Dorward, S.: Venti: a new approach to archival data storage. In: Conference on File and Storage Technologies (FAST). USENIX Association, Berkeley (2002). https://www. usenix.org/conference/fast-02/venti-new-approach-archival-data-storage 40. Rossi, D., Zacchiroli, S.: Geographic diversity in public code contributions: an exploratory large-scale study over 50 years. In: International Conference on Mining Software Repositories (MSR), pp. 80–85. ACM, New York (2022). https://doi.org/10.1145/3524842.3528471 41. Rossi, D., Zacchiroli, S.: Worldwide gender differences in public code contributions (and how they have been affected by the COVID-19 pandemic). In: International Conference on Software Engineering – Software Engineering in Society Track (ICSE-SEIS), pp. 172–183. 
ACM, New York (2022). https://doi.org/10.1109/ICSE-SEIS55304.2022.9794118 42. Rousseau, G., Di Cosmo, R., Zacchiroli, S.: Software provenance tracking at the scale of public source code. Empirical Software Eng. 25(4), 2930–2959 (2020). https://doi.org/10. 1007/s10664-020-09828-5 43. Schloss Dagstuhl. https://www.dagstuhl.de. Accessed 15 April 2023 44. Smith, A.M., Katz, D.S., Niemeyer, K.E.: Software citation principles. PeerJ Comput. Sci. 2, e86 (2016). https://doi.org/10.7717/peerj-cs.86 45. Stewart, K., Odence, P., Rockett, E.: Software package data exchange (SPDX) specification. IFOSS L. Rev. 2, 191 (2010)


46. Stodden, V., LeVeque, R.J., Mitchell, I.: Reproducible research for scientific computing: tools and strategies for changing the culture. Comput. Sci. Eng. 14(4), 13–17 (2012). https://doi.org/ 10.1109/MCSE.2012.38 47. The Dataverse Project. https://dataverse.org. Accessed 15 April 2023 48. Wellenzohn, K., Böhlen, M.H., Helmer, S., Pietri, A., Zacchiroli, S.: Robust and scalable content-and-structure indexing. VLDB J. (2022). https://doi.org/10.1007/s00778-022-00764-y 49. Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.W., da Silva Santos, L.B., Bourne, P.E., Bouwman, J., Brookes, A.J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C.T., Finkers, R., Gonzalez-Beltran, A., Gray, A.J., Groth, P., Goble, C., Grethe, J.S., Heringa, J., ’t Hoen, P.A., Hooft, R., Kuhn, T., Kok, R., Kok, J., Lusher, S.J., Martone, M.E., Mons, A., Packer, A.L., Persson, B., Rocca-Serra, P., Roos, M., van Schaik, R., Sansone, S.A., Schultes, E., Sengstag, T., Slater, T., Strawn, G., Swertz, M.A., Thompson, M., van der Lei, J., van Mulligen, E., Velterop, J., Waagmeester, A., Wittenburg, P., Wolstencroft, K., Zhao, J., Mons, B.: The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3(1), 160018 (2016). https://doi.org/10.1038/sdata.2016.18 50. Zacchiroli, S.: Gender differences in public code contributions: a 50-year perspective. IEEE Softw. 38(2), 45–50 (2021). https://doi.org/10.1109/MS.2020.3038765 51. Zacchiroli, S.: A large-scale dataset of (open source) license text variants. In: International Conference on Mining Software Repositories (MSR), pp. 757–761. ACM, New York (2022). https://doi.org/10.1145/3524842.3528491

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Chapter 3

Promises and Perils of Mining Software Package Ecosystem Data

Raula Gaikovina Kula, Katsuro Inoue, and Christoph Treude

Abstract The use of third-party packages is becoming increasingly popular and has led to the emergence of large software package ecosystems with a maze of interdependencies. Since the reliance on these ecosystems enables developers to reduce development effort and increase productivity, it has attracted the interest of researchers: understanding the infrastructure and dynamics of package ecosystems has given rise to approaches for better code reuse, automated updates, and the avoidance of vulnerabilities, to name a few examples. But the reality of these ecosystems also poses challenges to software engineering researchers, such as the following: How do we obtain the complete network of dependencies along with the corresponding versioning information? What are the boundaries of these package ecosystems? How do we consistently detect dependencies that are declared but not used? How do we consistently identify developers within a package ecosystem? How much of the ecosystem do we need to understand to analyze a single component? How well do our approaches generalize across different programming languages and package ecosystems? In this chapter, we review promises and perils of mining the rich data related to software package ecosystems available to software engineering researchers.

R. G. Kula () Nara Institute of Science and Technology, Nara, Japan e-mail: [email protected] K. Inoue Nanzan University, Nagoya, Japan e-mail: [email protected] C. Treude The University of Melbourne, Carlton, VIC, Australia e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 T. Mens et al. (eds.), Software Ecosystems, https://doi.org/10.1007/978-3-031-36060-2_3


3.1 Introduction

Third-party libraries are a great way for developers to incorporate code without having to write their own for every functionality required. By using these libraries, developers can save time and energy while still getting the functions they need. Using third-party libraries is becoming increasingly popular and has led to the emergence of large software package ecosystems such as npm. While these ecosystems offer many benefits, they also come with risks, such as software vulnerability attacks [5].

Large software package ecosystems are a treasure trove for researchers who can investigate a wide range of questions. For example, by studying activity in large ecosystems, researchers can identify which libraries are the most popular and learn what characteristics make them successful [8, 16]. Additionally, research on large ecosystems can help developers understand how to protect their code from malicious actors who may attempt to exploit vulnerabilities or insert malware into popular libraries. Studying large software package ecosystems can also help us better understand the dynamics of open-source development in general. Open-source development is a complex process that involves many different stakeholders working together (or sometimes competing) to create valuable code that anyone can use or improve upon. By understanding how these interactions play out in different types of ecosystem structures, including those with many small projects versus few very large ones, we can develop insights that might be applicable more broadly across other types of collaborative systems.

In this chapter, we identify and discuss promises and perils during the mining process, ranging from planning what information to mine from the ecosystem to analyzing and visualizing the mined data. The chapter is therefore broken down into these logical processes of mining ecosystem data: (1) planning what information to mine; (2) defining components and their dependencies; (3) defining boundaries and completeness; and (4) data analysis and visualization. This chapter is intended for researchers and practitioners who are interested in exploring and exploiting software package ecosystem information from a diverse range of publicly available sources. We also highlight the pitfalls to consider during the mining process, particularly when these pitfalls could lead to a misinterpretation of the analysis and results. The chapter is written in a manner that encourages newcomers who have little or no experience, or who are interested in utilizing ecosystem data across different disciplines outside of software engineering. Our goal is to get new researchers quickly accustomed to gathering ecosystem information for their research.


3.2 Software Package Ecosystem

Chapter 1 presented definitions of the different types of software ecosystems. The focus of the current chapter is on component-based software ecosystems, and we suggest using "software package ecosystem" as a suitable term for representing the symbiotic relationships among third-party library components (as software projects or repositories), as these libraries and their dependent clients coexist on the same technological platform, therefore sharing the same environment and other internal and external factors (e.g., security threats, sharing contributions, etc.). Our interpretation of such software package ecosystems originates from Kula et al. [17], where we formally defined them using a Software Universe Graph (SUG), modelling a structured abstraction of the evolution of software systems and their third-party library dependencies over time. Figure 3.1 provides an illustration of the different relationships within the SUG.

Definition 3.1 (SUG) A Software Universe Graph (SUG) is a graph G = (N, E). N is a set of nodes representing software units. We define a software unit as a version instance of any software program. E = Euse ∪ Eupdate is a set of edges, where Euse ⊆ N × N is the set of use relationships, and Eupdate ⊆ N × N is the set of update relationships that exist in the ecosystem, signaling that a newer release (i.e., an update) of a software unit is made available. Different types of edges between the same pair of nodes are not allowed, i.e., Euse ∩ Eupdate = ∅. We represent a use relationship from node a to b using a → b and an update relationship from node a to b using a ⇒ b. Use relations can be extracted from either the source code or configuration files. As shown in Fig. 3.1, among others, node q1 uses nodes a1 and x1 (i.e., q1 → a1 ∈ Euse and q1 → x1 ∈ Euse). Node q1 is also updated to q2, which is further updated to q3 (i.e., q1 ⇒ q2 ∈ Eupdate and q2 ⇒ q3 ∈ Eupdate), and similarly for nodes x1, x2, and x3, which represent successive updates.

Fig. 3.1 Conceptual example of the Software Universe Graph, depicting the use and update relationships between different software units


Note that an update should not be confused with forking. We distinguish a fork as a separate software unit.

Definition 3.2 Each node u ∈ N in the SUG G = (N, E) is characterized by three attributes:
• u.name: the string representation identifier of a software unit. We introduce the name axiom: for nodes u and v, if u ⇒ v, then u.name = v.name holds.
• u.release: the specific assigned change reference for a software unit. For nodes u and v, if u ⇒ v, then v is the immediate successor of u. Note that the versioning pattern may vary from project to project.
• u.time: the time stamp at which node u was released. For nodes u and v with u ⇒ v, u.time < v.time.

Definition 3.3 The SUG has temporal properties, describing simultaneity or ordering with respect to time. Let G = (N, E) be the SUG at time t. At a later time t′ > t, we observe an extension G′ of G, such that:

G′ = (N ∪ ΔN, E ∪ ΔE)    (3.1)

where ΔE ∩ (N × N) = ∅. Figure 3.2 illustrates the temporal properties of the SUG: here, G′ is composed of G augmented with the newly added node a3 and its corresponding a3 → x2 and a2 ⇒ a3 relations. A SUG grows monotonically over time, with only additions; we consider that modification or deletion changes on the SUG do not occur.

Definition 3.4 A timed SUG specifies the state of the SUG at any point in time. For an SUG G = (N, E), we represent the timed SUG Gt at time t as the sub-graph of G given by:

Gt ≡ (Nt, Et)    (3.2)

where Nt = {u | u ∈ N ∧ u.time ≤ t} and Et = {(u, v) | (u, v) ∈ E ∧ u, v ∈ Nt}.

Fig. 3.2 Temporal property of the Software Universe Graph
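A minimal Python sketch may help make Definitions 3.1-3.4 concrete; the software units, timestamps, and relations below are toy data, not part of the original formalization.

from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    name: str      # u.name
    release: str   # u.release
    time: int      # u.time

q1, q2 = Unit("q", "1.0", 1), Unit("q", "2.0", 5)
a1, x1 = Unit("a", "1.0", 0), Unit("x", "1.0", 0)
nodes = {q1, q2, a1, x1}
use = {(q1, a1), (q1, x1)}   # Euse: q1 -> a1, q1 -> x1
update = {(q1, q2)}          # Eupdate: q1 => q2

def timed_sug(nodes, use, update, t):
    """Return the timed SUG Gt: units released up to time t and the edges among them."""
    nt = {u for u in nodes if u.time <= t}
    keep = lambda edges: {(u, v) for (u, v) in edges if u in nt and v in nt}
    return nt, keep(use), keep(update)

print(timed_sug(nodes, use, update, t=1))   # excludes q2, which is released at t=5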


3.3 Data Sources

Researchers can use various datasets to model the ecosystem using the SUG model of usage and update relationships. The most obvious data source that has revolutionized data mining in the software engineering domain is the GitHub platform. Established in 2008 and acquired by Microsoft in 2018, GitHub is home to a wide range of popular open-source software projects. GitHub is built on the git version control system and stores all changes made to a repository. In the case of the SUG, a GitHub repository can represent one software unit, whose dependency relations can be extracted via a configuration file (such as the package.json file for JavaScript projects). The repository should also contain the release information that holds the update relations. Due to its large size, researchers and the GitHub team have made datasets available for researchers to mine, for example, through GitHub's REST or GraphQL APIs. These backend Application Programming Interfaces (APIs) can be used to query large amounts of data on GitHub. Most researchers use the API to download and mine information from the GitHub platform. It is important to note that while GitHub introduced a new feature of Dependency Graphs to map the depend relationship,1 most older projects do not have this feature. In this case, the researcher would need to manually extract and query the configuration files for dependency information. We refer to Chap. 1 for additional data sources for mining software ecosystems.
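For instance, a minimal Python sketch of recovering use relations from a JavaScript project's manifest follows; the file path and its contents are hypothetical, but the "dependencies" field is the standard place where npm projects declare the libraries they use.

import json

with open("package.json") as fh:            # hypothetical local checkout of a repository
    manifest = json.load(fh)

client = f'{manifest["name"]}@{manifest["version"]}'
for library, version_range in manifest.get("dependencies", {}).items():
    print(f"{client} -> {library} ({version_range})")   # one use relation per declared dependency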

3.4 Promises and Perils

Using the SUG model of depend and use relations and the available datasets, we present our promises and perils of mining ecosystem information.

3.4.1 Planning What Information to Mine

Promise 1. Researchers can access and link heterogeneous data related to software package ecosystems, e.g., package registries and bug trackers.

When planning what information to mine from the ecosystem, researchers do not need to limit themselves to the usage and update relationship information. Platforms that host software repositories include other software management systems such as bug trackers. For example, GitHub provides three management systems that are related to a software repository. More specifically, GitHub allows project

1 https://docs.github.com/en/code-security/supply-chain-security/understanding-your-software-supply-chain/about-the-dependency-graph.


contributors to manage Issues, Pull Requests, and Discussions not only for one project but for multiple projects:
• Issues are used to track ideas, feedback, tasks, or bugs for work on GitHub.
• Pull Requests allow other developers from an ecosystem to make a contribution to a software repository. Pull requests also allow maintainers to discuss and review potential changes with collaborators and add follow-up commits before changes are merged into the software.
• Discussions provide a collaborative communication forum for the community around an open source or internal project. Community members can ask and answer questions, share updates, have open-ended conversations, and follow along on decisions affecting the community's way of working.
These three systems exemplify how developers can contribute to both their own and other projects. Hence, to incorporate this information, we can extend the SUG model, creating a model that includes a contribution relationship [32].

Definition 3.5 A Dependency-Contribution graph incorporates contributions by developers whose libraries are involved in dependency relationships.

Wattanakriengkrai et al. [32] explore the congruence between dependency updates and developer contributions. This is based on the original concept of socio-technical congruence [4], which states that developers' contribution patterns are congruent with their coordination needs. Hence, the goal is to identify contributions that are congruent to dependency updates. As shown in Fig. 3.3, the authors extend the typical SUG graph model, where libi uses libk and libj, while libj also uses libk, to the example shown in Fig. 3.4. Different from the SUG, this new graph captures developers and their contributions (i.e., the square nodes devx and devy represent two different developers making contributions to the circle nodes representing software units). Here, contributions are defined as Pull Requests or Issues that were submitted to both a library and the client that depends on that library. Hence, the graph can show contributions that are congruent to dependency changes for a software unit. This is just one example of the type of research that is enabled by access to heterogeneous data related to software package ecosystems.

Fig. 3.3 Example of a dependency graph for a given time period

Fig. 3.4 Example Dependency-Contribution graph showing relationships between contributions and dependencies

Peril 1. Developers might use different identifiers when contributing to different parts of a software package ecosystem, e.g., when contributing to different libraries.

When modelling using such graphs, there is a threat that contributors may use multiple identifiers (i.e., devx and devy represent the same contributor). This is a well-known research problem, and there has been research to merge these accounts, such as the work by Wiese et al. [34]. GitHub has introduced mechanisms such as two-factor authentication2 to counteract the issue of multiple identifiers, since developers might be less likely to switch accounts if doing so requires cumbersome authentication.

Peril 2. Developers' contributions to software package ecosystems might be interspersed with bot contributions, e.g., automated dependency updates.

The rise of automation and artificial intelligence has led to much work on the integration of automated scheduling (i.e., bots) into software development workflows [10, 11, 27, 29, 33], to name a few examples. These bots are designed to perform specific tasks within a software package ecosystem. For example, a bot may be programmed to automatically update dependencies, test code changes, or deploy software to production. As an example, the Google APIs repo-automation-bots project lists bots for automated labelling of issues and pull requests, automated approval of pull requests, and triggering releases.3 Bots perform common maintenance tasks in many software projects and are now commonplace [2, 9, 14, 30]. Especially with bots such as dependabot (which opens automated pull requests to update configurations and reduce the risk of vulnerability threats),4 the growing amount of automation has introduced considerable noise into the contributions between projects. There are also bots for communication and documentation [18, 19, 30].
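To keep such bot noise out of contribution analyses, a common first step is a simple login-based filter, sketched below in Python; the "[bot]" suffix is how GitHub App accounts typically appear (e.g., dependabot[bot]), while the extra name list is a hypothetical, project-specific addition.

KNOWN_BOTS = {"dependabot[bot]", "github-actions[bot]", "renovate[bot]"}   # assumed examples

def is_bot(login: str) -> bool:
    return login.endswith("[bot]") or login in KNOWN_BOTS

contributors = ["alice", "dependabot[bot]", "bob", "github-actions[bot]"]   # toy data
print([c for c in contributors if not is_bot(c)])   # ['alice', 'bob']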

2 https://docs.github.com/en/authentication/securing-your-account-with-two-factor-authentication-2fa/configuring-two-factor-authentication.
3 https://github.com/googleapis/repo-automation-bots.
4 https://github.com/dependabot.
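As a rough illustration of how mined contribution records might be cleaned with respect to Perils 1 and 2 before analysis, the following sketch merges accounts that share an e-mail address and drops accounts with a bot-like login suffix. The record format and field names are assumptions; real studies rely on dedicated identity-merging and bot-detection techniques [12, 13, 34].

BOT_SUFFIXES = ("[bot]", "-bot", "_bot")

def is_probable_bot(login):
    # Heuristic only: flags logins with a bot-like suffix (e.g., "dependabot[bot]").
    return login.lower().endswith(BOT_SUFFIXES)

def canonical_author(record, email_to_id):
    # Very rough identity merging: treat accounts sharing an e-mail as one author.
    return email_to_id.setdefault(record["author_email"], record["author_login"])

def clean(contributions):
    email_to_id = {}
    for rec in contributions:
        if is_probable_bot(rec["author_login"]):
            continue  # Peril 2: drop bot contributions
        rec["author_id"] = canonical_author(rec, email_to_id)  # Peril 1: merge aliases
        yield rec

prs = [
    {"author_login": "dependabot[bot]", "author_email": "bot@example.com"},
    {"author_login": "alice", "author_email": "alice@example.org"},
    {"author_login": "alice-work", "author_email": "alice@example.org"},
]
print([r["author_id"] for r in clean(prs)])  # ['alice', 'alice']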


To be able to draw accurate conclusions about what humans are doing in software package ecosystems, researchers should consider distinguishing between bot and human contributions. It is also important to differentiate between different kinds of contributions, such as external pull requests [21]. The research community has responded well, with a wide range of techniques and tools to mitigate this peril [12, 13]. Chapter 8 elaborates on this research about development bots.
Peril 3. Not all developer activities in software package ecosystems are accessible to the public, e.g., library use in proprietary settings.
Not all developer activities in software package ecosystems are accessible to the public, e.g., when the boundary between open source and industry is blurred [28], which presents a challenge for researchers who aim to study the development process. This is particularly true in proprietary settings where software development happens behind closed doors, or where projects are open source only for a limited time period, so the artefacts are not permanently publicly available. This can make it difficult to understand the broader ecosystem in which a software project is developed. Proprietary settings may also lead to non-standardization in software development practices. Different software projects may use different management systems and tools, making it difficult to accurately compare and analyze software development activities across projects. For example, some projects may use communication, documentation, and other management tools that are not captured on the same platform [23]. For instance, some projects use Bugzilla instead of GitHub issues and pull requests for their bug and code review systems, while others use Discord, Slack channels, or email threads for their communication needs. This lack of standardization presents a challenge for researchers who study software package ecosystems and aim to understand the development process. To address this issue, researchers should strive to collect data from a diverse set of projects to gain a comprehensive understanding of the software package ecosystem. In addition, researchers may need to adjust their methodologies or data collection techniques to accommodate the different tools and practices used by different software projects.

3.4.2 Defining Components and Their Dependencies
Promise 2. Researchers can access a software package ecosystem's dependency network through package managers and registries, e.g., npm lists the dependencies and dependents for over a million libraries.
With the rise of curated datasets like libraries.io, researchers can now recover and model dependency relations between software units using pre-extracted datasets. Table 3.1 shows examples of popular package managers mined from the libraries.io dataset in 2020.


Table 3.1 Summary of 13 package managers from libraries.io as ranked by TIOBE in 2020

Package Ecosystem | Programming Language | TIOBE Rank | Environment | Dependency Tree | Package Archive
PyPI | Python | 2 | Python | Flat | pypi.org
Maven | Java | 3 | JVM | Flat | maven.org
Bower | JavaScript | 7 | Node.js | Flat | bower.io
Meteor | JavaScript | 7 | Node.js | Nested | atmospherejs.com
npm | JavaScript | 7 | Node.js | Nested (v2) | npmjs.com
Packagist | PHP | 8 | PHP | Flat | packagist.org
Puppet | Ruby | 13 | Ruby MRI | Flat | forge.puppet.com
RubyGems | Ruby | 13 | Ruby MRI | Flat | rubygems.org
CRAN | R | 14 | RStudio | Flat | cran.r-project.org
CPAN | Perl | 15 | Perl | Flat | metacpan.org
Go | Golang | 20 | Go | Flat | pkg.go.dev
NuGet | C#, VB | 5, 6 | .NET | Flat | nuget.org
Anaconda | Python, R, C# | 2, 14, 5 | Anaconda | Flat | anaconda.org
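To make Promise 2 concrete, the sketch below loads a hypothetical dependencies export in the style of the libraries.io data dump and builds one dependency graph per ecosystem. The file name and column names are assumptions for illustration; the actual dump contains more fields and requires more careful handling of versions.

import csv
from collections import defaultdict

def load_dependency_graph(path, ecosystem="NPM"):
    graph = defaultdict(set)  # package -> set of packages it depends on
    with open(path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            if row["Platform"] != ecosystem:
                continue
            graph[row["Project Name"]].add(row["Dependency Name"])
    return graph

def dependents_of(graph, package):
    # Dependents are obtained by inverting the dependency edges.
    return {client for client, deps in graph.items() if package in deps}

npm = load_dependency_graph("libraries_io_dependencies.csv", ecosystem="NPM")
print(len(dependents_of(npm, "lodash")))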

Peril 4. Different software package ecosystems define the concept of "dependency" differently, e.g., by allowing or not allowing different versions of a library on the same dependency tree.
Different software package ecosystems have varying definitions of what constitutes a dependency. For example, some ecosystems may allow multiple versions of a library to exist on the same dependency tree, while others restrict developers to a single version of a library [15]. These restrictions are often based on the programming language being used, as different languages have different approaches to managing dependencies. It is important to consider the restrictions on dependency relationships when studying software package ecosystems, as they can have a major impact on the development process. For example, the ability to use multiple versions of a library on the same dependency tree can greatly simplify the process of updating dependencies and can make it easier to resolve conflicts between libraries. One way to visualize the impact of these restrictions is to compare a nested dependency tree with a flat dependency tree, as shown in Fig. 3.5.5 This distinction is important because it highlights the different ways that a software unit can depend on different versions of the same library. In this example, npm v3 creates the dependency tree based on the installation order, thereby flattening unnecessary nested dependencies (i.e., B v1.0 in cyan). This reduces the complexity of a nested tree by resolving some of the transitive (nested) dependencies.

5 Taken from https://npm.github.io/how-npm-works-docs/npm3/how-npm3-works.html.

Fig. 3.5 Difference between flat (npm v3) and nested (npm v2) dependencies
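The contrast between the two strategies can be sketched with two hypothetical resolution results for the same application; the package names and versions mirror Fig. 3.5 and are purely illustrative.

# Hedged sketch: the same App resolved with a nested (npm v2-style) and a
# flat (npm v3-style) strategy. Names and versions are illustrative only.
nested_v2 = {                       # every dependency keeps its own subtree
    "App": {
        "A@1.0": {"B@1.0": {}},
        "C@1.0": {"B@2.0": {}},
    }
}

flat_v3 = {                         # the first-installed version is hoisted
    "App": {
        "A@1.0": {},                # resolves B@1.0 at the top level
        "B@1.0": {},
        "C@1.0": {"B@2.0": {}},     # the conflicting version stays nested
    }
}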

Peril 5. Developers might declare a dependency to other parts of a software package ecosystem but not use it, e.g., because they failed to remove the declaration when it was no longer needed.
It is common for developers to declare dependencies on other parts of the software package ecosystem but not always use them. This can happen for various reasons, such as forgetting to remove the dependency after it is no longer needed. This poses a challenge for researchers who are trying to extract dependencies from package managers, such as those listed in configuration files, as there may be inconsistencies between the listed dependencies and what is actually compiled and used by the code. This can lead to a biased understanding of the software package ecosystem and the relationships between software components. To address this issue, there have been numerous efforts to track the actual library dependencies compiled and executed in software systems. These efforts aim to provide a more accurate understanding of the dependencies and the relationships between software components. For example, research has been conducted on the use of dynamic analysis to track compiled dependencies in real time and on the development of tools to automatically detect and track executed dependencies [5, 26, 35].
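A rough first approximation of this peril for an npm package is to compare the dependencies declared in package.json with the modules actually imported in the source tree. The sketch below is a purely textual scan under an assumed file layout; it would miss dynamic requires, so it is illustrative rather than a substitute for the code-centric analyses cited above [5, 26, 35].

import json, re
from pathlib import Path

IMPORT_RE = re.compile(r"""(?:require\(|from\s+)['"]([^'"./][^'"]*)['"]""")

def declared_dependencies(package_dir):
    manifest = json.loads(Path(package_dir, "package.json").read_text())
    return set(manifest.get("dependencies", {}))

def imported_packages(package_dir):
    used = set()
    for src in Path(package_dir).rglob("*.js"):
        for name in IMPORT_RE.findall(src.read_text(errors="ignore")):
            used.add(name.split("/")[0])  # strip sub-paths such as "lodash/fp"
    return used

def possibly_unused(package_dir):
    return declared_dependencies(package_dir) - imported_packages(package_dir)

print(possibly_unused("some-npm-package"))  # hypothetical package directory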

3.4.3 Defining Boundaries and Completeness
Promise 3. Researchers can use the boundaries of software package ecosystems to study communities of developers, e.g., developers contributing to and/or benefiting from the npm ecosystem.
Following Promise 2, the emergence of package managers has also led to studies that approximate software communities. Using the libraries.io dataset, researchers were able to study projects that host libraries that use package managers. Researchers have used this dataset to compare different library ecosystems [7, 8, 16].


Peril 6. Package managers do not always represent software package ecosystems, their communities, or their sub-communities, e.g., in cases where multiple package managers exist.
Package managers are a fundamental aspect of software package ecosystems, but they do not always fully represent the complex relationships and interactions that occur within a community of developers and users, as shown in Table 3.1. In some cases, multiple package managers exist for the same programming language, creating a complex landscape of software libraries and dependencies that is not always easily understood. For instance, Bower and Meteor manage npm libraries, which can lead to confusion and overlap in the management of dependencies. Similarly, Java, Scala, Android, and other Java-based open-source communities all use the Maven package manager, but each of these communities has its own unique set of libraries, dependencies, and development practices. Researchers should be aware of the limitations of package managers when studying software package ecosystems and consider the broader context and relationships that exist within these communities.
Peril 7. Lack of activity in parts of a software package ecosystem does not necessarily indicate project failure, e.g., when highly depended-upon libraries are feature-complete.
It is important to note that lack of activity in a part of a software package ecosystem does not always mean project failure [6]. In some cases, highly relied-upon libraries that have reached feature-completeness may see little activity but continue to be used by the software community. However, it is still important to consider the long-term sustainability of these libraries, especially given the rate at which technology and software development practices change. This has become a topic of interest in recent years, and researchers have explored best practices for sustaining open-source projects and ensuring their continued success [1, 31]. Understanding the factors that contribute to project sustainability is important to ensure the longevity and continued growth of software package ecosystems.
Peril 8. Sampling from a software package ecosystem is challenging since subsetting might alter the dependency network, e.g., by breaking dependency chains.
Sampling from a package ecosystem is not straightforward, as the sample composition can be significantly affected by missing dependency links between libraries. For instance, a subset of the ecosystem might alter the dependencies between libraries, breaking dependency chains. This could give an incomplete picture of the software package ecosystem and lead to incorrect conclusions from a study. To minimize this risk, researchers should carefully consider the boundaries of their study and choose an appropriate sampling method based on the research questions and goals. For example, researchers could focus on popular, highly depended-upon, or vulnerability-prone aspects of the ecosystem as a


starting point. For some ecosystems, the number of downloads, stars, and watchers are other attributes for the researcher to utilize.
Peril 9. Sampling from a software package ecosystem is challenging since the dependency network changes over time, e.g., when dependencies are added, removed, upgraded, or downgraded.
The dynamic nature of package ecosystems and the constant changes to their dependencies can impact the generalizability of the results. Therefore, it is important to also consider the time granularity of the analysis. For example, if the goal is to understand the evolution of dependencies over time, a finer time granularity may be necessary to capture the smaller changes and trends. However, if the goal is to understand the overall structure and relationships within the ecosystem, a coarser time granularity may be sufficient. A three-month window seems appropriate for some studies [3, 22, 24, 31, 32]. Another level of granularity to consider is the size of the component. For instance, there are cases where a single package may contain more than one repository, especially for large library frameworks. The granularity also depends on the nature of the ecosystem itself. For instance, researchers should understand whether the ecosystem comprises library packages (e.g., PyPI), plugins (e.g., Eclipse), or a library distribution (e.g., Android).
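One way to mitigate Perils 8 and 9 together is to sample around seed packages while keeping their transitive dependencies intact, and to pin the dependency graph to a fixed observation date. The sketch below is illustrative only; the seed choice, cutoff date, and input format are assumptions.

from collections import deque
from datetime import date

def snapshot(releases, cutoff):
    # releases: package -> list of (release_date, dependency set);
    # keep the latest release on or before the cutoff date (Peril 9).
    graph = {}
    for pkg, history in releases.items():
        before = [(d, deps) for d, deps in history if d <= cutoff]
        if before:
            graph[pkg] = max(before, key=lambda item: item[0])[1]
    return graph

def dependency_closure(graph, seeds):
    # Include all transitive dependencies so chains are not broken (Peril 8).
    sample, queue = set(seeds), deque(seeds)
    while queue:
        for dep in graph.get(queue.popleft(), ()):
            if dep not in sample:
                sample.add(dep)
                queue.append(dep)
    return sample

releases = {  # toy data
    "app": [(date(2020, 1, 1), {"left-pad"})],
    "left-pad": [(date(2019, 6, 1), set()), (date(2020, 6, 1), {"chalk"})],
    "chalk": [(date(2018, 1, 1), set())],
}
g = snapshot(releases, cutoff=date(2020, 3, 31))
print(dependency_closure(g, {"app"}))  # {'app', 'left-pad'}; chalk only enters later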

3.4.4 Analyzing and Visualizing the Data
Peril 10. Analyzing and visualizing entire software package ecosystems is challenging due to their size, e.g., in terms of nodes and edges in the network.
The size of software package ecosystems implies large datasets, which can be overwhelming for tools and algorithms to analyze and display. Therefore, it may be necessary to make choices about the granularity of the data included in the analysis and visualization. Another alternative is to focus on the most critical parts of the software package ecosystem, such as the high-level structure, highly depended-upon packages, or parts of the system that pose a risk to security and reliability. The key is to strike a balance between detail and simplicity, providing a meaningful representation of the ecosystem while being able to handle the complexity of its size.
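A common way to keep such visualizations tractable is to draw only the most depended-upon part of the network. The following is a minimal sketch using networkx and matplotlib; the threshold, layout, and output path are arbitrary illustrative choices.

import networkx as nx
import matplotlib.pyplot as plt

def top_depended_upon(g, top_n=100):
    # In-degree = number of packages that depend on a node.
    top = sorted(g.nodes, key=g.in_degree, reverse=True)[:top_n]
    return g.subgraph(top).copy()

def draw_core(g, path="ecosystem-core.png"):
    core = top_depended_upon(g)
    nx.draw_spring(core, node_size=30, arrows=False, with_labels=False)
    plt.savefig(path, dpi=200)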

3.5 Application: When to Apply Which Peril
As a disclaimer, not all perils are applicable to every mining situation. To demonstrate the practical application of our perils and their mitigation, we present two case studies that involve mining the software package ecosystem.


Table 3.2 Description of the research objectives and datasets for the two considered case studies

Case Study | Research Objective | Datasets
Wattanakriengkrai et al. [32] | Explore code contributions between library and client (i.e., use-relations) | libraries.io, GitHub API
Nugroho et al. [25] | Explore discussion contributions between contributors (i.e., contributions) | Eclipse API

Each case study has a distinct research objective and focusses on a specific dataset to be mined.

3.5.1 Two Case Studies
Table 3.2 presents the two case studies we have selected for this analysis. The first case involves mining for contributions congruent to dependency updates [32]. In this work, the authors mine GitHub repositories for Pull Requests and Issues that were submitted and merged congruent to dependency updates within the npm ecosystem. The second case involves mining communication data for the Eclipse ecosystem [25]. Although the second case does not mine for dependency relations (i.e., use relations), we show that these perils still apply when mining for other relationships in an ecosystem. Moreover, the second case studies the Eclipse ecosystem, which is a different dataset compared to the more popular GitHub dataset.

3.5.2 Applying Perils and Their Mitigation Strategies
Table 3.3 provides a summary of the perils that can be applied to each of the case studies. We will now go into the details of mitigation strategies based on these perils. For better organization and understanding, we have grouped the perils according to the four logical processes for mining.
Information to Mine The first set of mitigation strategies, which addresses Perils 1–3, focusses on planning which information to mine. There are two primary strategies that researchers can employ:
1. Researchers should use research tools and techniques to remove noise and other biases in the dataset, such as bot detection and the handling of multiple identities. This strategy was implemented in both case studies, as contributions and discussions often have the potential to involve bots or developers with multiple identities.


Table 3.3 Application of each peril to the case studies (case 1: npm; case 2: Eclipse)

P1 Developers might use different identifiers when contributing to different parts of a software package ecosystem, e.g., when contributing to different libraries.
P2 Developers' contributions to software package ecosystems might be interspersed with bot contributions, e.g., automated dependency updates.
P3 Not all developer activities in software package ecosystems are accessible to the public, e.g., library use in proprietary settings.
P4 Different software package ecosystems define the concept of "dependency" differently, e.g., by allowing or not allowing different versions of a library on the same dependency tree.
P5 Developers might declare a dependency to other parts of a software package ecosystem but not use it, e.g., because they failed to remove the declaration when it was no longer needed.
P6 Package managers do not always represent software package ecosystems, their communities, or their sub-communities, e.g., in cases where multiple package managers exist.
P7 Lack of activity in parts of a software package ecosystem does not necessarily indicate project failure, e.g., when highly depended-upon libraries are feature-complete.
P8 Sampling from a software package ecosystem is challenging since sub-setting might alter the dependency network, e.g., by breaking dependency chains.
P9 Sampling from a software package ecosystem is challenging since the dependency network changes over time, e.g., when dependencies are added, removed, upgraded, or downgraded.
P10 Analyzing and visualizing entire software package ecosystems is challenging due to their size, e.g., in terms of nodes and edges in the network.

2. Depending on the research goals, researchers should recognize that not all contributions are equal and filter the dataset accordingly. We applied these two strategies to both cases. In the first case, the goal was to capture all congruent contributions, so we filtered out contributions made to libraries without dependencies. Since all npm packages are listed in the registry, Peril 3 (private activities) did not apply. In the second case, we addressed Peril 1 by conducting a qualitative analysis to ensure that the member identities were not duplicated, as Eclipse developers were known to change identities. To mitigate Peril 2, we removed bot responses. For the second case, since all forum data is made public, Peril 3 did not apply.


Defining Dependencies The second set of perils (Perils 4–5) relates to dependency relationships between software units, and only the first case study is applicable. To address these perils, researchers should adopt the following strategy:
1. Researchers should not rely solely on the dependencies listed in configuration files (e.g., pom.xml, package.json, etc.) as a measure of dependency between two components. Instead, code-centric approaches should be used to validate which libraries are actually depended upon.
For example, in the first case, in addition to mining the configuration information, the authors also analyzed the similarity of the source code contributions to address Peril 4. Regarding Peril 5, since the study's objective was to investigate changes to the configuration files, the risk of the update not being executed was deemed less important. It is important to note that the second case study did not include dependency analysis and, therefore, these perils did not apply.
Defining Boundaries The third set of perils (Perils 6–9) relates to the definition of boundaries and completeness and is relevant for both case studies. To mitigate these perils, we recommend the following strategies:
1. Researchers should recognize that a dormant project does not necessarily mean that it has failed. Instead, studies can use alternative heuristics, such as the number of dependents and dependencies, as better indicators of a project's importance in the ecosystem.
2. Researchers should not rely solely on the programming language to define sub-communities. Using a common package manager for the programming language is a more effective rule of thumb for distinguishing boundaries.
3. Researchers should avoid random sampling. Instead, sampling should be tailored to the research goals by considering factors such as an appropriate time window or focusing on specific attributes of components (e.g., most dependents, most popular, most contributors).
Peril 6 did not apply to either of the case studies. Particularly for the first case, since the goal was to explore the npm package ecosystem, we assumed that the boundaries were clearly defined by the npm registry. Similarly, the second case study used the generic Eclipse platform as the boundary. Peril 7 applied to the npm study, while Peril 8 applied to both case studies. As a result, both cases conducted a qualitative analysis of the dataset to gain deeper insights. In the first case study, a three-month time window was created to capture dependencies. For the second case study, forum contributors were sampled into three groups (i.e., junior, member, or senior) according to the sliding window of their contributions.
Visualization The final peril (Peril 10) relates to visualization, which can be challenging due to the vast size and complexity of software ecosystems. As it is not feasible to visualize every aspect of an ecosystem simultaneously, a focused approach is necessary. A mitigation strategy is to select specific attributes of the ecosystem (e.g., the most depended-upon, most popular, or most actively contributed-to components) that align with the research needs and objectives.


Fig. 3.6 Visualization examples for the two case studies. (a) Time analysis of Issues and Pull Requests (2014–2020) for 107,242 libraries. (b) A visual topology map for 832,058 threads

Figure 3.6 shows two cases where visualizations are employed to gain insights, especially for large datasets. In the first figure (a), we visualized the distributions of the dataset and applied the appropriate statistical tests, along with the effect size, to test our hypotheses and answer research questions. In the second example (b), although not directly related to package ecosystems, the authors utilized a


topological visualization [20] to gain insights into more than 800,000 forum threads of discussions.

3.6 Chapter Summary
This chapter explored the various aspects of mining information from software package ecosystems, presenting three promises and ten perils that researchers should be aware of when undertaking such tasks. The chapter was structured around four key processes for mining: (1) planning what information to mine; (2) defining components and their dependencies; (3) defining boundaries and completeness; and (4) data analysis and visualization. To help new and experienced researchers navigate these challenges, we introduced the SUG model, which can serve as a valuable tool to minimize threats to validity. Although some perils may be more relevant to specific research objectives, our aim is to equip researchers with the knowledge and resources needed to confidently gather and integrate software package ecosystem data into their work.

References 1. Ait, A., Izquierdo, J.L.C., Cabot, J.: An empirical study on the survival rate of GitHub projects. In: International Conference on Mining Software Repositories (MSR), pp. 365–375 (2022). https://doi.org/10.1145/3524842.3527941 2. Beschastnikh, I., Lungu, M.F., Zhuang, Y.: Accelerating software engineering research adoption with analysis bots. In: International Conference on Software Engineering: New Ideas and Emerging Results Track, pp. 35–38 (2017). https://doi.org/10.1109/ICSE-NIER.2017.17 3. Brindescu, C., Ahmed, I., Jensen, C., Sarma, A.: An empirical investigation into merge conflicts and their effect on software quality. Empirical Softw. Eng. 25(1), 562–590 (2020). https://doi.org/10.1007/s10664-019-09735-4 4. Cataldo, M., Herbsleb, J.D., Carley, K.M.: Socio-technical congruence: a framework for assessing the impact of technical and work dependencies on software development productivity. In: International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 2–11. ACM, New York (2008). https://doi.org/10.1145/1414004.1414008 5. Chinthanet, B., Ponta, S.E., Plate, H., Sabetta, A., Kula, R.G., Ishio, T., Matsumoto, K.: Code-based vulnerability detection in Node.Js applications: how far are we? In: International Conference on Automated Software Engineering (ASE), pp. 1199–1203 (2020). https://doi. org/10.1145/3324884.3421838 6. Coelho, J., Valente, M.T.: Why modern open source projects fail. In: Joint Meeting on Foundations of Software Engineering (FSE), pp. 186–196 (2017). https://doi.org/10.1145/ 3106237.3106246 7. Cogo, F.R., Oliva, G.A., Hassan, A.E.: An empirical study of dependency downgrades in the npm ecosystem. Trans. Softw. Eng. (2019). https://doi.org/10.1109/TSE.2019.2952130 8. Decan, A., Mens, T., Grosjean, P.: An empirical comparison of dependency network evolution in seven software packaging ecosystems. Empirical Softw. Eng. 24(1), 381–416 (2019). https:// doi.org/10.1007/s10664-017-9589-y


9. Dey, T., Mousavi, S., Ponce, E., Fry, T., Vasilescu, B., Filippova, A., Mockus, A.: Detecting and characterizing bots that commit code. In: International Conference on Mining Software Repositories (MSR), pp. 209–219. ACM, New York (2020). https://doi.org/10.1145/3379597. 3387478 10. Erlenhov, L., de Oliveira Neto, F.G., Scandariato, R., Leitner, P.: Current and future bots in software development. In: International Workshop on Bots in Software Engineering (BotSE), pp. 7–11. IEEE, Piscataway (2019). https://doi.org/10.1109/BotSE.2019.00009 11. Farooq, U., Grudin, J.: Human-computer integration. Interactions 23(6), 26–32 (2016). https:// doi.org/10.1145/3001896 12. Golzadeh, M., Decan, A., Chidambaram, N.: On the accuracy of bot detection techniques. In: International Workshop on Bots in Software Engineering (BotSE). IEEE, Piscataway (2022). https://doi.org/10.1145/3528228.3528406 13. Golzadeh, M., Decan, A., Legay, D., Mens, T.: A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments. J. Syst. Softw. 175 (2021). https://doi. org/10.1016/j.jss.2021.110911 14. Golzadeh, M., Legay, D., Decan, A., Mens, T.: Bot or not? Detecting bots in GitHub pull request activity based on comment similarity. In: International Workshop on Bots in Software Engineering (BotSE), pp. 31–35 (2020). https://doi.org/10.1145/3387940.3391503 15. Islam, S., Kula, R.G., Treude, C., Chinthanet, B., Ishio, T., Matsumoto, K.: Contrasting thirdparty package management user experience. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 664–668 (2021). https://doi.org/10.1109/ICSME52107. 2021.00077 16. Kikas, R., Gousios, G., Dumas, M., Pfahl, D.: Structure and evolution of package dependency networks. In: International Conference on Mining Software Repositories (MSR), pp. 102–112 (2017). https://doi.org/10.1109/MSR.2017.55 17. Kula, R.G., De Roover, C., German, D.M., Ishio, T., Inoue, K.: A generalized model for visualizing library popularity, adoption, and diffusion within a software ecosystem. In: International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 288–299 (2018). https://doi.org/10.1109/SANER.2018.8330217 18. Lebeuf, C., Storey, M.A., Zagalsky, A.: Software bots. IEEE Software 35(1), 18–23 (2017). https://doi.org/10.1109/MS.2017.4541027 19. Lin, B., Zagalsky, A., Storey, M.A., Serebrenik, A.: Why developers are slacking off: Understanding how software teams use Slack. In: International Conference on Computer Supported Cooperative Work (CSCW), pp. 333–336. ACM, New York (2016). https://doi.org/ 10.1145/2818052.2869117 20. Lum, P.Y., Singh, G., Lehman, A., Ishkanov, T., Vejdemo-Johansson, M., Alagappan, M., Carlsson, J., Carlsson, G.E.: Extracting insights from the shape of complex data using topology. Sci. Rep. 3 (2013). https://doi.org/10.1038/srep01236 21. Maeprasart, V., Wattanakriengkrai, S., Kula, R.G., Treude, C., Matsumoto, K.: Understanding the role of external pull requests in the npm ecosystem (2022). arXiv preprint arXiv:2207.04933 22. Mirsaeedi, E., Rigby, P.C.: Mitigating turnover with code review recommendation: balancing expertise, workload, and knowledge distribution. In: International Conference on Software Engineering (ICSE), pp. 1183–1195 (2020). https://doi.org/10.1145/3377811.3380335 23. Montgomery, L., Lüders, C., Maalej, W.: An alternative issue tracking dataset of public Jira repositories. In: International Conference on Mining Software Repositories (MSR), pp. 73–77. ACM, New York (2022). 
https://doi.org/10.1145/3524842.3528486 24. Nassif, M., Robillard, M.: Revisiting turnover-induced knowledge loss in software projects. In: 2017 IEEE International Conference on Software Maintenance and Evolution, pp. 261–272 (2017). https://doi.org/10.1109/ICSME.2017.64 25. Nugroho, Y.S., Islam, S., Nakasai, K., Rehman, I., Hata, H., Kula, R.G., Nagappan, M., Matsumoto, K.: How are project-specific forums utilized? A study of participation, content, and sentiment in the Eclipse ecosystem. Empirical Softw. Eng. 26(6), 132 (2021). https://doi. org/10.1007/s10664-021-10032-2


26. Ponta, S., Plate, H., Sabetta, A.: Beyond metadata: Code-centric and usage-based analysis of known vulnerabilities in open-source software. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 449–460. IEEE, Piscataway (2018). https://doi.org/ 10.1109/ICSME.2018.00054 27. Saadat, S., Colmenares, N., Sukthankar, G.: Do bots modify the workflow of GitHub teams? In: International Workshop on Bots in Software Engineering (BotSE) (2021). https://doi.org/ 10.1109/BotSE52550.2021.00008 28. Stol, K.J., Fitzgerald, B.: Inner source—adopting open source development practices in organizations: a tutorial. IEEE Softw. 32(4), 60–67 (2014). https://doi.org/10.1109/MS.2014. 77 29. Storey, M.A., Zagalsky, A.: Disrupting developer productivity one bot at a time. In: International Symposium on Foundations of Software Engineering (FSE), pp. 928–931 (2016). https:// doi.org/10.1145/2950290.2983989 30. Urli, S., Yu, Z., Seinturier, L., Monperrus, M.: How to design a program repair bot: Insights from the Repairnator project. International Conference on Software Engineering (ICSE) pp. 95–104 (2018). https://doi.org/10.1145/3183519.3183540 31. Valiev, M., Vasilescu, B., Herbsleb, J.: Ecosystem-level determinants of sustained activity in open-source projects: a case study of the PyPI ecosystem. In: Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 644–655. ACM, New York (2018). https://doi.org/10.1145/3236024.3236062 32. Wattanakriengkrai, S., Wang, D., Kula, R.G., Treude, C., Thongtanunam, P., Ishio, T., Matsumoto, K.: Giving back: Contributions congruent to library dependency changes in a software ecosystem. Trans. Softw. Eng. (2022). https://doi.org/10.1109/TSE.2022.3225197 33. Wessel, M., De Souza, B.M., Steinmacher, I., Wiese, I.S., Polato, I., Chaves, A.P., Gerosa, M.A.: The power of bots: understanding bots in OSS projects. In: The ACM International Conference on Human-Computer Interaction (2018). https://doi.org/10.1145/3274451 34. Wiese, I.S., Da Silva, J.T., Steinmacher, I., Treude, C., Gerosa, M.A.: Who is who in the mailing list? Comparing six disambiguation heuristics to identify multiple addresses of a participant. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 345–355. IEEE, Piscataway (2016). https://doi.org/10.1109/ICSME.2016.13 35. Zapata, R.E., Kula, R.G., Chinthanet, B., Ishio, T., Matsumoto, K., Ihara, A.: Towards smoother library migrations: a look at vulnerable dependency migrations at function level for npm JavaScript packages. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 559–563. IEEE, Piscataway (2018). https://doi.org/10.1109/ICSME.2018.00067

Part II

Analyzing Software Ecosystems

Chapter 4

Mining for Software Library Usage Patterns Within an Ecosystem: Are We There Yet?
Tien N. Nguyen

Abstract The use of software libraries is important in a software development ecosystem, in which systems interact and share several usages of Application Programming Interfaces (APIs). Library usage patterns have been shown to be useful not only for improving the productivity of the coding process via software reuse but also for improving code quality. In this chapter, we systematically evaluate the different approaches to library usage pattern mining. We also provide lessons learned from the history of those approaches and discuss potential future directions in the usage pattern mining area.

4.1 Introduction
Software libraries play important roles in software development, especially within an ecosystem. The ecosystem contains several software libraries that work together to provide different functionalities to client applications as well as to the other applications in the ecosystem. The functionality of a library in an ecosystem is often offered via its Application Programming Interface (API) elements, including classes, methods, and fields. According to the usage documentation of the libraries, one needs to use the API elements in a specific combination and/or order to accomplish certain programming tasks. Such usages are referred to as API usages. A frequent API usage is called an API usage pattern. Mining the API usage patterns for an ecosystem is crucial for several reasons: (1) it helps developers correctly use the APIs provided by the libraries, thus promoting the correct usage of the applications in the ecosystem; (2) the patterns can be used to detect anomalies and errors in the client applications; and (3) the

T. N. Nguyen
Computer Science Department, University of Texas at Dallas, Richardson, TX, USA
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
T. Mens et al. (eds.), Software Ecosystems, https://doi.org/10.1007/978-3-031-36060-2_4


tooling support for API usages such as code completion could be built in the IDE for an ecosystem. Next, we will explain the API elements, API usages, API usage patterns, and the approaches to represent and mine them from the source code. We then explain the applications of such patterns in software engineering.

4.2 Example of API Usage Patterns in Software Libraries

Listing 4.1 An API usage pattern with StringBuffer
1  StringBuffer strbuf = new StringBuffer();
2  BufferedReader in = new BufferedReader(new FileReader(file));
3  String str;
4  ...
5  while ((str = in.readLine()) != null) {
6    ...
7    strbuf.append(str);
8    ...
9  }
10 ...
11 if (strbuf.length() > 0) outputMessage(strbuf.toString());
12 in.close();

Listing 4.1 shows a real-world example in the open-source project Columba 1.4, containing a usage scenario for reading and printing a file. There are five objects: a StringBuffer (strbuf), a BufferedReader (in), a FileReader, and two Strings (str and file). First, strbuf and the FileReader are created. The latter is used in the creation of object in. in's readLine is called to read a text line into str, and str is added to the content of strbuf via its append. These two actions are carried out in a while loop until no line is left. After the loop, if the content of strbuf is not empty (checked via its length), it is output via toString. Finally, the BufferedReader in is closed. From the example, we can see that the following key information is important in characterizing an API usage:
1. What is the temporal order of the objects' actions (e.g., strbuf must be created before its append can be used)?
2. How do their actions appear in control structures (e.g., append could be used repeatedly in a while loop)?
3. How do multiple objects interact with one another (e.g., strbuf's append is used after in's readLine)?
That is, the information describes the usage orders of objects' actions, i.e., whether an action is used before another, involving the control structures and data dependencies among them. The usage order is not always exhibited by the textual order in the source code. For example, the creation of the FileReader object occurs before that of in, while the corresponding constructor appears later in the source code. It is not the order in execution traces either, where append could be executed before or after readLine.


Capturing the API usages and their patterns is crucial for several software engineering tasks. Let us explain various representation approaches for API usages and their applications in software engineering.

4.3 Usages as Sets of Frequent Co-occurrences
Early approaches for usage pattern mining represented a usage via the set of multiple API elements that frequently appear together in source code. A popular mining technique is association rule mining [8]. CodeWeb [8] generalizes association rules to take into account the inheritance hierarchy, e.g., the application classes inheriting from a particular library class often instantiate another class or one of its descendants. In the above example, approaches in this line would mine the following frequent sets:
1. StringBuffer.new, StringBuffer.append, StringBuffer.length, and StringBuffer.toString, and
2. BufferedReader.new, BufferedReader.readLine, and BufferedReader.close.
PR-Miner [7] uses the frequent itemset mining technique to find the functions, variables, and data types that frequently appear in the same methods. It extracts association rules from source code and detects their violations to find bugs. The key issue with this line of approaches is that they do not consider the order of the APIs in the usages, which is actually crucial, as explained in our example.
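The underlying idea can be sketched generically as follows (this is an illustration, not CodeWeb or PR-Miner themselves): treat each method body as the set of API elements it uses and count how often small subsets co-occur across methods. The method names and the support threshold below are invented for demonstration.

from itertools import combinations
from collections import Counter

methods = {  # method -> the set of API elements it uses (toy data)
    "readFile": {"BufferedReader.new", "BufferedReader.readLine", "BufferedReader.close"},
    "loadConfig": {"BufferedReader.new", "BufferedReader.readLine", "BufferedReader.close"},
    "copyText": {"BufferedReader.new", "BufferedReader.readLine", "StringBuffer.append"},
}

def frequent_itemsets(methods, size=2, min_support=2):
    counts = Counter()
    for apis in methods.values():
        for combo in combinations(sorted(apis), size):
            counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}

print(frequent_itemsets(methods))
# e.g., {('BufferedReader.new', 'BufferedReader.readLine'): 3, ...}

As noted above, such sets carry no ordering information, which motivates the pair- and sequence-based approaches in the next section.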

4.4 Usages as Pairs or Subsequences of APIs via Software Mining
To address the no-ordering issue of the aforementioned approaches, a common direction of mining usage patterns was to mine pairs of Application Programming Interfaces (APIs), such as pairs of method calls. For example, in the code in Listing 4.1, we could see the following patterns:
1. BufferedReader.new (line 2) → BufferedReader.readLine (line 5) (new represents the class instantiation and → means "appearing before"),
2. BufferedReader.readLine (line 5) → BufferedReader.close (line 12),
3. StringBuffer.new (line 1) → StringBuffer.append (line 7), etc.


In JADET [2], for each Java object in a method, JADET extracts a usage model in terms of a finite state automaton with anonymous states and transitions labeled with feasible method calls. Its model is built for a single object and does not contain control structures. From the finite-state automaton for an object in a method, JADET uses frequent itemset mining, based on the numbers of methods containing the pairs of method calls, to extract a pattern in terms of a set of pairs of those method calls.
Engler et al. [4] formulate usage patterns in terms of pairs as a concept called beliefs. For example, a call to spin_lock followed once by a call to spin_unlock implies that the programmer may have paired these calls by coincidence. If the pairing happens 999 out of 1000 times, though, then it is probably a valid belief and the sole deviation a probable error [4]. While the notion of beliefs goes beyond APIs in software libraries, the limitation is still the representation of usages in terms of pairs.
Some techniques rely on dynamic analysis to learn the pairs of methods that were executed. Perracotta [21] infers temporal properties in the form of pairs of API calls from execution traces. Weimer and Necula [19] mine temporal safety rules that involve pairs of API calls. Their algorithm's idea is based on the observation that programs often make mistakes along exceptional control-flow paths, even when they behave correctly on normal execution paths.
While the aforementioned approaches examine the pairs in individual versions of a project, other approaches mine the pairs of method calls from the changes. Williams and Hollingsworth [20] mine method usage patterns in which one function is directly called before another. They look for a source code change that takes a function return value, which was previously not tested before being used, and adds a test of the return value. For each such bug fix, they are interested in the function called to produce the return value. Such a bug fix indicates that the called function is likely to need its return value checked before being used elsewhere in the system. Their tool runs on both versions of the source code and determines whether a potential bug has been fixed by a change.
MAPO [1] mines API usage patterns from the source code in the form of frequent subsequences of API elements. An example of a frequent subsequence for our running code is: BufferedReader.new (line 2) → BufferedReader.readLine (line 5) → BufferedReader.close (line 12). Wang et al. [18] mine frequent closed sequential patterns. The authors opted for a two-step clustering strategy to identify patterns. The first clustering uses method call sequences as input, and the second one uses frequent closed sequences as input. This technique aims to reduce the number of redundant patterns and to detect more unpopular patterns.
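Under the same illustrative assumptions as the earlier sketch, ordered-pair mining replaces the per-method sets with call sequences, so that the left-to-right order is preserved; the sequences and the support threshold below are invented for demonstration.

from itertools import combinations
from collections import Counter

sequences = [  # per-method API call sequences (toy data)
    ["BufferedReader.new", "BufferedReader.readLine", "BufferedReader.close"],
    ["BufferedReader.new", "BufferedReader.readLine",
     "StringBuffer.append", "BufferedReader.close"],
]

def frequent_pairs(sequences, min_support=2):
    counts = Counter()
    for seq in sequences:
        # combinations() preserves the left-to-right order of the sequence,
        # so each pair (a, b) means "a appears before b" in that method.
        counts.update(set(combinations(seq, 2)))
    return [pair for pair, n in counts.items() if n >= min_support]

print(frequent_pairs(sequences))
# e.g., [('BufferedReader.new', 'BufferedReader.readLine'), ...]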


4.5 Graph Representation for Usage Patterns via Static Analysis

4.5.1 Object Usage Representation
This section explains the GROUMINER approach [13] that leverages graph-based representations for usage patterns. Object usages involve object instantiations, method calls, or data field accesses. Because creating an object is the invocation of its constructor and an access to a field can be considered equivalent to a call to a getter or a setter, the term action is used to denote all such operations. Nguyen et al. [13] introduce a graph-based representation for API usages, called GRaph-based Object Usage Model (GROUM). A GROUM is a labeled, directed acyclic graph (DAG) representing a usage scenario for single/multiple objects. Figure 4.1b shows the GROUM representing the usage of the objects in the illustrated example.
Fig. 4.1 GROUM: Graph-based object usage model for an API usage



Definition 4.1 (Groum [13]) A GROUM is a DAG such that:
1. Each node is an action node or a control node. An action node represents an invocation of a constructor or a method, or an access to a field of one object. A control node represents the branching point of a control structure. The label of an action node is "C.m", where C is the class name and m is the method (or field) name. The label of a control node is the name of its corresponding control structure.
2. A GROUM can involve multiple objects.
3. Each edge represents a usage order and a data dependency. An edge from node A to node B means that A is used before B, i.e., A is generated before B in executable code, and A and B have a data dependency. Edges have no label.
To represent the usage order of objects' actions, GROUMINER uses action nodes in a GROUM. Each of them represents an action of an object and is assigned a label C.m, in which C is the class name of the object and m is the name of a method or a field. (When the class name is clear from the context, we use just the method name to identify the action node.) The directed edges of a GROUM are used to represent the usage orders. An edge from an action node A to an action node B means that, in the usage scenario, A is used before B. This implies that B is used after A, i.e., there is no path from B to A. For example, in Fig. 4.1, the nodes labeled StringBuffer.<init> and StringBuffer.append represent the object instantiation and the invocation of method append of a StringBuffer object, respectively. The edge connecting them shows the usage order, i.e., <init> is used before append.
To represent how developers use the objects within control flow structures such as conditions, branches, or loop statements, GROUMINER uses control nodes in a GROUM. To conform to the use of edges for representing temporal orders, such control nodes are placed at their branching points (i.e., where the program selects an execution flow), rather than at the starting points of the corresponding statements. The edges between control nodes and the others, including action nodes, represent the usage orders as well. For example, in Fig. 4.1, the control node labeled WHILE represents the while statement in the code in Listing 4.1, and the edge from BufferedReader.readLine to WHILE indicates that the invocation of readLine is generated before the branching point of that while loop. To represent the scope of a control flow structure (e.g., the invocation of readLine is within the while loop), the list of all action nodes and control nodes within a control flow structure is stored as an attribute of its corresponding control node. In Fig. 4.1, such scope information is illustrated as the dashed rectangles. Note that there is no backward edge for a loop structure in a GROUM since it is a DAG. However, even without backward edges, the scope information is still sufficient to show that the actions in a loop could be invoked repeatedly. A GROUM is built for each individual method.
To represent the usage of multiple interacting objects, a GROUM contains action nodes of not only one object but also those of multiple objects. The edges


connecting action nodes of multiple objects represent the usage orders as well. Moreover, to make a GROUM have more semantic information, such edges connect only the nodes that have data dependencies, e.g., the ones involving the same object(s). In Fig. 4.1, action nodes of different objects are filled with different backgrounds.
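To make the representation concrete, the sketch below encodes part of the GROUM of Fig. 4.1 as a labelled DAG using networkx. This is not the GROUMINER tooling; the node identifiers and the subset of nodes and edges shown are illustrative assumptions based on the description above.

import networkx as nx

groum = nx.DiGraph()
nodes = {
    "n1": ("action", "StringBuffer.<init>"),
    "n2": ("action", "FileReader.<init>"),
    "n3": ("action", "BufferedReader.<init>"),
    "n4": ("action", "BufferedReader.readLine"),
    "n5": ("control", "WHILE"),
    "n6": ("action", "StringBuffer.append"),
    "n7": ("control", "IF"),
    "n8": ("action", "StringBuffer.toString"),
    "n9": ("action", "BufferedReader.close"),
}
for nid, (kind, label) in nodes.items():
    groum.add_node(nid, kind=kind, label=label)

# Each edge means "used before" between nodes with a data dependency
# (edges carry no label, per Definition 4.1).
groum.add_edges_from([
    ("n2", "n3"), ("n3", "n4"), ("n4", "n5"), ("n5", "n6"),
    ("n1", "n6"), ("n6", "n7"), ("n7", "n8"), ("n4", "n9"),
])
assert nx.is_directed_acyclic_graph(groum)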

4.5.2 Graph-Based API Usage Pattern Mining Algorithm
This section describes GROUMINER's graph-based pattern mining algorithm for multiple object usages. Intuitively, an object usage is considered a pattern if it frequently appears in source code. GROUMINER is interested only in the intraprocedural level of source code; therefore, GROUMs are extracted from all methods. Each method is represented by a GROUM. In many cases, the object usages involve only some, but not all, action and control nodes of the GROUM extracted from a method. In addition, the usages must include all temporal and data properties of those nodes, i.e., all of their incident edges. Therefore, in a GROUM representing a method, an object usage is an induced subgraph of that GROUM, i.e., it involves some of the nodes and all the edges among those nodes. Note that any induced subgraph of a GROUM is also a GROUM.

4.5.2.1 Important Concepts in Graph-Based Usage Pattern Mining

Definition 4.2 (Graph Dataset) A GROUM dataset is the set of all GROUMs extracted from the code base, denoted by D = {G1, G2, ..., Gn}.
Definition 4.3 (Occurrence) An induced subgraph X of a GROUM Gi is called an occurrence of a GROUM P if X is equivalent to P.
A usage could appear more than once in a portion of code and in the whole code base, i.e., a GROUM P could have multiple occurrences in each GROUM Gi. Gi(P) is used to denote the occurrence set of P in Gi, and D(P) = {G1(P), G2(P), ..., Gn(P)} denotes the set of all occurrences of P in the entire GROUM dataset. Gi(P) is empty if P does not occur in Gi. If P occurs many times, only the non-overlapping occurrences are considered as different or independent.
Definition 4.4 (Frequency) The frequency of P in Gi, denoted by fi(P), is the maximum number of independent (i.e., non-overlapping) occurrences of P in Gi. The frequency of P in the entire dataset, f(P), is the sum of the frequencies of P in all GROUMs in the dataset.
Definition 4.5 (Pattern) A GROUM P is called a pattern if f(P) ≥ σ, i.e., P has independently occurred at least σ times in the entire GROUM dataset, where σ is a chosen threshold.


Definition 4.6 (Pattern Mining) Given D and σ, find the list L of all patterns P.

Listing 4.2 GROUMINER algorithm
1  function GrouMiner(D)
2    L ← {all patterns of size one}
3    for each P ∈ L do Explore(P, L, D)
4    return L
5
6  function Explore(P, L, D)
7    for each pattern of size one U ∈ L do
8      C ← P ⊕ U
9      for each Q ∈ patterns(C)
10       if f(Q) ≥ σ then
11         L ← L ∪ {Q}
12         Explore(Q, L, D)

4.5.2.2 Overview of GROUMINER Algorithm

The pseudo-code of GROUMINER is displayed in Listing 4.2. First, the smallest patterns (i.e., patterns of size one) are collected into the list of patterns L (line 2). Then, each pattern is used as a starting point for GROUMINER to recursively discover larger patterns via the function Explore (line 3). The main steps of exploring a pattern P (lines 6–12) are:
1. generating from P the occurrences of candidate patterns (line 8);
2. grouping those occurrences into isomorphic groups (i.e., function patterns) and considering each group to represent a candidate pattern (line 9);
3. evaluating the frequency of each candidate pattern to find the true patterns and recursively discovering larger patterns from them (lines 10–12).

4.5.2.3 Detailed GROUMINER Algorithm

Step 1. Generating Occurrences of Candidate Patterns Each pattern P is represented by D(P), the set of its occurrences in the whole graph dataset. Each such occurrence X is a subgraph, and it might be extended into a larger subgraph by adding a new node Y and all edges connecting Y and the nodes of X. Let us denote that graph by X + Y. Since a large pattern must contain a smaller pattern, Y must be a frequent subgraph, i.e., an occurrence of a pattern U of size 1. This helps avoid generating non-pattern subgraphs (i.e., subgraphs that cannot belong to any larger pattern). The operation ⊕ denotes the process of extending and generating all occurrences of candidate patterns from all occurrences of two patterns P and U:
P ⊕ U = {X + Y | X ∈ Gi(P), Y ∈ Gi(U), i = 1..n}     (4.1)


Step 2. Finding Candidate Patterns To find candidate patterns, the function patterns is applied to C, the set of all generated occurrences. It groups them into sets of isomorphic subgraphs. The grouping criterion is based on Exas vectors [12]. All subgraphs having the same vector are considered isomorphic. Thus, they are the occurrences of the same candidate pattern and are collected into the same set. Then, for each such candidate Q, the corresponding subgraphs are grouped by the graph that they belong to, i.e., into G1(Q), G2(Q), ..., Gn(Q), to identify its occurrence set in the entire dataset, D(Q).
Step 3. Evaluating the Frequency The function fi(Q) evaluates the frequency of Q in each graph Gi. In general, such an evaluation is equivalent to the maximum independent set problem because it needs to identify the maximal set of non-overlapping subgraphs in Gi(Q). However, for efficiency, we use a greedy technique to find a non-overlapping subset of Gi(Q) with a size as large as possible. GROUMINER sorts the occurrences in Gi(Q) in descending order by the number of nodes that could be added to them. As an occurrence is chosen in that order, its overlapping occurrences are removed. Thus, the resulting set contains only non-overlapping occurrences. Its size is assigned to fi(Q). After all fi(Q) values are computed, the frequency of Q in the whole dataset is calculated as f(Q) = f1(Q) + f2(Q) + ... + fn(Q). If f(Q) ≥ σ, Q is considered a pattern and is used to recursively extend and discover larger patterns.
Step 4. Disregarding Occurrences of Discovered Patterns Since the discovery process is recursive, occurrences of a discovered pattern could be generated more than once. (In fact, a subgraph of size k+1 might be generated at most k+1 times from the subgraphs of size k it contains.) To avoid this redundancy, when generating the occurrences of candidate patterns, Explore checks whether a subgraph is an occurrence of a discovered pattern. It does this by comparing the Exas vector of the subgraph to those of the stored patterns in L. If the answer is true, the subgraph is disregarded in P ⊕ U.

4.5.3 API Usage Graph Pattern Mining
The key limitation of GROUMINER's representation is the lack of data nodes that represent the objects, variables, and literal values. Moreover, the mining algorithm does not consider any semantic meaning of the nodes and edges in a usage graph, which can lead to meaningless subgraphs. To address those issues, MUDetect [17] improves GROUM into the API Usage Graph (AUG).
Definition 4.7 (API Usage Graph) An AUG is a directed, connected graph with labelled nodes and edges. Nodes represent data entities (variables, values) and actions (e.g., method calls or operators). Edges represent the API usage relations among the entities and actions represented by nodes.


Fig. 4.2 An API usage and its API Usage Graph (AUG)

Figure 4.2 shows an example of an API usage and its AUG. The action nodes are displayed in the rectangles and the data nodes in the oval shapes. The action nodes represent constructor calls (init), method calls, field accesses, and operators. If the types are available, they will be resolved. However, in the figure, only the simple name is shown for clarity. The relational operators are also encoded as actions to capture conditions. The data nodes represent objects, values, and literals in an API usage. AUG encodes data entities as nodes to make explicit the data dependencies between actions, such as multiple calls on the same object to ensure we have a connected subgraph with all data-dependent parts of a usage. The usage relations are shown with their labels. Order edges are not shown for clarity.

4.5.3.1 Semantic-Aware API Usage Pattern Mining with MUDetect

Let us explain the details of the MUDetect algorithm for API usage pattern mining, which takes semantic edges into consideration. The principle of the mining algorithm is that a frequent graph must contain frequent subgraphs. Therefore, one could mine patterns of a smaller size k and then extend them to size k + 1 by adding suitable edges to each of the patterns of the smaller size. To do so, the algorithm explores each adjacent node that is connected to a node of the smaller pattern. When an adjacent node is added, all the edges that connect that node are also added. By extending via nodes, as opposed to via edges, the algorithm is more scalable. Note that an AUG often has many more edges than nodes.
When performing the extension, the algorithm considers different types of adjacent nodes. Basically, a node that is connected only by an order edge is considered unsuitable for extension. Otherwise, the algorithm applies the following rules, which are defined based on the semantics of the code structure, to obtain meaningful patterns:
1. A call to a pure method, which has no side effect, is suitable only if it has an outgoing edge to a node in i, i.e., if it defines a data node or controls an action node in i. Since pure methods have no side effects, they can impact a usage only through their return value. To avoid the complexity of interprocedural analysis, the algorithm identifies pure methods heuristically: it considers any method whose name starts with get as pure, since getters are mostly pure and very prevalent.
2. A non-pure method call is always suitable.
3. An operator is suitable only if it has at least one incoming and one outgoing edge to i, because operators are like pure methods whose result is based solely on their parameters, as opposed to parameters and state.
4. A data node is suitable only if it has an outgoing edge to i, i.e., when it is used in the usage.
During the extension, the algorithm follows a greedy exploration strategy. To identify the larger patterns in the set of all extensions, the algorithm groups the isomorphic extensions into pattern candidates. If any unexplored candidate remains, the algorithm selects the most frequent one and recursively searches for larger patterns. If there are no further frequent extensions, the algorithm adds the pattern to the final list. In addition to the possible extensions, the algorithm also keeps track of those instances that do not have any frequent extension. If these inextensible pattern instances are themselves frequent, it adds this pattern to the final list as well. The intuition is that an API might have a core pattern and additional alternative patterns containing it. In brief, this greedy strategy avoids the combinatorial explosion problem of an exhaustive search with backtracking. Note that, similar to GROUMINER [13], the algorithm uses a heuristic that combines graph vectorization and hashing [12] for graph isomorphism detection.
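The four suitability rules can be paraphrased in code. The sketch below is a loose reading of the description above, not MUDetect's implementation; the node structure and the edge flags relative to the current instance i are simplified assumptions.

from dataclasses import dataclass

@dataclass
class Node:
    kind: str                                   # "call", "operator", or "data"
    name: str = ""
    has_incoming_from_instance: bool = False    # an edge from a node in i to this node
    has_outgoing_to_instance: bool = False      # an edge from this node to a node in i

def looks_pure(node):
    # Heuristic from the text: methods named get* are treated as pure.
    return node.kind == "call" and node.name.startswith("get")

def is_suitable(node):
    if node.kind == "call":
        if looks_pure(node):
            return node.has_outgoing_to_instance      # rule 1
        return True                                    # rule 2
    if node.kind == "operator":                        # rule 3
        return node.has_incoming_from_instance and node.has_outgoing_to_instance
    if node.kind == "data":                            # rule 4
        return node.has_outgoing_to_instance
    return False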


4.5.4 Cooperative API Usage Pattern Mining Approach
All the aforementioned approaches for API usage pattern mining infer the patterns using only the client source code. However, these patterns are specific to the considered clients. COUPminer [15] is a cooperative usage pattern mining technique that combines client-based and library-based usage pattern mining. The authors specifically studied which form of combination is better suited to achieve the best trade-off between the generality and accuracy properties. The first combination is a sequential one: applying a first technique (client-based or non-client-based mining) to derive a set of patterns and then applying the second technique to refine these patterns. The second combination is to interleave the iterations of the two techniques (starting with one or the other) in a parallel and cooperative manner to solve a common goal. The parameters of the two techniques can be varied to improve the accuracy and to explore the search space more efficiently.

4.5.5 Probabilistic API Usage Mining

While the aforementioned approaches for API usage pattern mining are deterministic, they can be difficult to use because they require tuning a threshold on the frequency of API usage occurrences. This threshold is hard to set in practice and prone to exponential blow-up: setting it too low yields billions of patterns, whereas setting it too high yields no patterns at all. PAM (Probabilistic API Miner) [5] is a near-parameter-free probabilistic algorithm for mining API call patterns; its user-specified parameters are independent of the dataset. PAM uses a novel probabilistic model of sequences, based on generating a sequence by interleaving a group of subsequences. The list of component subsequences is then the set of mined API patterns. This is a fully probabilistic formulation of the frequent sequence mining problem: it correctly represents gaps in sequences and—unlike API mining approaches based on frequent sequence mining—largely avoids returning sequences of items that are individually frequent but uncorrelated [5]. The approach follows a generative process with two parameters: a set of API patterns and a set of probabilities. The hypothetical process is as follows. First, from the set of all interesting API patterns, one samples which patterns will appear in the client method about to be generated and how many times they will be used, which yields a multiset. Then, one randomly samples a way to interleave the sampled API patterns, resulting in a hypothetical API usage code.
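The sketch below illustrates this hypothetical generative process in Java: it samples which patterns occur and then interleaves their calls while preserving each pattern's internal order. It is a simplified illustration rather than PAM's actual inference code; the ApiPattern type, the fixed inclusion probabilities, and the example call names are assumptions.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hedged sketch of PAM's hypothetical generative process (not its inference code). */
final class PamGenerativeSketch {

    record ApiPattern(List<String> calls, double probability) {}

    /** Generates one hypothetical API call sequence from a set of candidate patterns. */
    static List<String> generate(List<ApiPattern> patterns, Random rng) {
        // Step 1: sample which patterns appear (a real model would also sample multiplicities).
        List<List<String>> sampled = new ArrayList<>();
        for (ApiPattern p : patterns) {
            if (rng.nextDouble() < p.probability()) {
                sampled.add(new ArrayList<>(p.calls()));
            }
        }
        // Step 2: randomly interleave the sampled patterns while preserving
        //         the internal order of each pattern's calls.
        List<String> sequence = new ArrayList<>();
        while (!sampled.isEmpty()) {
            List<String> chosen = sampled.get(rng.nextInt(sampled.size()));
            sequence.add(chosen.remove(0));
            if (chosen.isEmpty()) {
                sampled.remove(chosen);
            }
        }
        return sequence;
    }

    public static void main(String[] args) {
        List<ApiPattern> patterns = List.of(
                new ApiPattern(List.of("FileReader.new", "BufferedReader.new",
                                       "BufferedReader.readLine", "BufferedReader.close"), 0.8),
                new ApiPattern(List.of("Logger.getLogger", "Logger.info"), 0.5));
        System.out.println(generate(patterns, new Random(42)));
    }
}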

Fig. 4.3 ExPort: Browsing API usages in the callgraph view [10] (the figure shows call-graph nodes such as contentsChanged, getIndex0, getIndex1, selectedItemChanged, fireActionEvent, and fireItemStateChanged)

4.5.6 API Usage Mining via Topic Modeling

Moritz et al. [10] represent software as a relational topic model, in which API calls and the functions that use them are modeled as a document network. They integrated their approach in ExPort [10]: given a task, the tool presents the programmer with a list of API methods related to that task; once the programmer selects the API methods he or she wishes to use, it presents usage examples related to the task. The relational topic model allows ExPort to mine the usage relations among the API elements and method calls. Figure 4.3 displays an API usage in a call graph view. The branching points show the different alternatives for a method call at a certain point.

4.5.7 Mining for Less Frequent API Usage Patterns

A limitation of the aforementioned mining approaches is their focus on frequent patterns: they typically use frequent-item mining to extract API usage patterns. However, some patterns might not occur more often than the threshold predefined by the mining algorithm, so the mined patterns might not form a complete set. Niu et al. [14] address API usage pattern mining without relying on frequent-pattern mining. Their approach represents the source code as a network of object usages, where an object usage is a set of method calls invoked on a single API class. It automatically extracts usage patterns by clustering the data based on the coexistence relations between object usages, aiming at high completeness even for low-frequency patterns. Moreover, probabilistic API pattern mining approaches [5], such as PAM (Probabilistic API Miner), also serve this purpose well because they do not rely on a frequency threshold for API usages.


4.6 Applications of Usage Patterns

4.6.1 Graph-Based API Usage Anomaly Detection

The usage patterns can be used to automatically find anomalous usages, i.e., the locations in programs that deviate from the typical object usages. However, not all violations are defects. For example, there might exist occurrences of a usage that calls close() without readLine() that also violate P, yet are acceptable. A violation is considered an anomaly only when it is too rare. The rareness of the violations can be measured by the ratio v(P1, P)/f(P1), where v(P1, P) is the number of inextensible occurrences of P1 corresponding to P in the whole dataset. If the rareness is smaller than a threshold, the corresponding occurrences are considered anomalies. The lower the rareness value, the higher the anomaly is ranked.

Definition 4.8 (Anomaly) A GROUM H is considered a usage anomaly of a pattern P if H has an inextensible occurrence H1 of a sub-pattern P1 of P and the ratio v(P1, P)/f(P1) < δ, where v(P1, P) is the number of such inextensible occurrences in the whole GROUM dataset and δ is a chosen threshold.

GROUMINER provides anomaly detection in two cases: (1) detecting anomalies in the currently mined project (by using the mined GROUMs) and (2) detecting anomalies when the project changes, i.e., in a new revision. In both cases, the main task of anomaly detection is to find the inextensible occurrences of all sub-patterns P1 corresponding to the detected patterns. In the first case, because it stores the occurrence set D(P1), GROUMINER can check each occurrence of P1 in D(P1): if it is inextensible to any occurrence of a detected pattern P generated from P1, it is a violation. Those violations are counted in v(P1, P). After checking all occurrences of P1, the rareness value v(P1, P)/f(P1) is computed; if it is smaller than the threshold δ, the violation is reported as an anomaly. In the second case, GROUMINER must update the occurrence sets of the detected patterns before finding the anomalies in the new version.
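A minimal sketch of the rareness computation from Definition 4.8 is shown below. It is our simplification, not GROUMINER's implementation; the Occurrence type and its fields are hypothetical.

import java.util.List;

/** Minimal sketch of the rareness-based anomaly check in Definition 4.8. */
final class AnomalyCheck {

    /** Hypothetical occurrence of a sub-pattern P1, flagged if it cannot be
     *  extended to an occurrence of the full pattern P. */
    record Occurrence(String location, boolean inextensibleToFullPattern) {}

    /** rareness = v(P1, P) / f(P1); report anomalies when rareness < delta. */
    static List<String> detectAnomalies(List<Occurrence> occurrencesOfP1, double delta) {
        if (occurrencesOfP1.isEmpty()) {
            return List.of();                              // f(P1) = 0: nothing to check
        }
        long violations = occurrencesOfP1.stream()
                .filter(Occurrence::inextensibleToFullPattern)
                .count();                                  // v(P1, P)
        double rareness = (double) violations / occurrencesOfP1.size();  // f(P1) = |D(P1)|
        if (rareness >= delta) {
            return List.of();                              // violations too common: not anomalies
        }
        return occurrencesOfP1.stream()
                .filter(Occurrence::inextensibleToFullPattern)
                .map(Occurrence::location)
                .toList();
    }
}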

4.6.2 Pattern-Oriented Code Completion

This section explains an application of code completion that leverages the mined usage patterns. GraPacc [11] is a Graph-based Pattern-oriented, Context-sensitive tool for Code Completion. It takes as input a database of usage patterns and completes the code under editing based on its context and those patterns. Given an incomplete code fragment, a code completion tool needs to fill in the potential code tokens at the cursor. Listing 4.3 illustrates a code fragment as a query. The character _ denotes the editing cursor where a developer invokes the code


completion tool during programming. A query is generally incomplete (in terms of the task it is intended to achieve) and might not be parsable.

Listing 4.3 SWT query example
1 Display display = new Display();
2 Shell shell = new Shell(display);
3 ...
4 Button button = new Button(shell, SWT.PUSH);
5 FormData formData = new FormData();
6 button._

Definition 4.9 (Query) A query is a code fragment under editing, i.e., a sequence of textual tokens written in a programming language.

GraPacc analyzes the query Q (i.e., the code under editing) and extracts its context-sensitive features and weights in four main steps:

1. tokenizing the input Q to extract lexical tokens, which can be used as token-based features;
2. using a partial program analysis tool [3] to parse the input code into an AST;
3. building the corresponding GROUM from the AST;
4. extracting the graph-based features from that GROUM, collecting the token-based features from the unparsable tokens (i.e., the tokens without an associated AST node), and determining the context-sensitive weights for the extracted features.

GraPacc constructs the corresponding GROUM from the AST produced by partial program analysis using the construction algorithm from prior work [13]. Due to the incompleteness of the query code, the unresolved nodes in the AST are discarded; they are treated as tokens and used to extract token-based features. The data nodes corresponding to variables whose data types cannot be resolved to fully qualified names are kept with simple names only. Figure 4.4 shows the GROUM built for the query example in Listing 4.3. As can be seen, the objects shell, button, formData, and display are resolved to the data nodes labeled with their types Shell, Button, FormData, and Display, respectively. The node Button is denoted as the focus node, because the token closest to the editing cursor is button.

Fig. 4.4 Graph-based usage model of the query in Listing 4.3

If the user chooses a pattern P from the recommended list, GraPacc completes the code in the query Q according to pattern P. To do so, GraPacc first matches the code in P and Q to find the code in P that has not yet appeared in Q.


Then, it fills such code into Q in accordance with the context in Q, i.e., at the appropriate locations and with the proper names.

Let us explain the general idea via an example. Revisiting the query in Listing 4.3, assume that a user selects a pattern P. GraPacc first determines that the two Button.new nodes, the two FormData.new nodes, the two Button nodes, and the two FormData nodes in the two GROUMs are respectively matched. That is, the two object initializations and the assignments to the variables for Button and FormData already exist in the query. Compared with pattern P, the nodes that have not yet been used include Button.setText and Button.setLayoutData. Thus, GraPacc uses the code corresponding to those nodes to fill in Q. The code completion is done by creating the corresponding sub-trees in the AST of Q at the appropriate positions and with the proper names for the fields and variables. For example, to fill in Button.setLayoutData, GraPacc first needs to create that method call and find its position in the AST of Q (not shown); in this case, the position is next to the variable node button. Since in the pattern Button.setLayoutData has a parameter of type FormData, GraPacc must fill in that parameter with a proper name. From pattern P, that parameter must come from the FormData node, which is matched to FormData in Q; this in turn corresponds to the variable formData in Q. Thus, GraPacc chooses the name formData and fills in line 6 of Listing 4.3. A similar process is used for Button.setText, which is added between lines 4 and 5 of Listing 4.3. The final result is:

Button button = new Button(shell, SWT.PUSH);
button.setText(_);
FormData formData = new FormData();
button.setLayoutData(formData);
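The sketch below illustrates, in a heavily simplified form, the step of determining which pattern nodes are missing from the query. Real GraPacc matches GROUM nodes against the query AST rather than comparing plain strings, so the types and names here are assumptions for the example.

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hedged sketch of the "fill in what the pattern has and the query lacks" step. */
final class FillInSketch {

    /** Returns the pattern's action nodes (method calls) not yet present in the query,
     *  in the pattern's order, so they can be completed at the cursor. */
    static List<String> missingCalls(List<String> patternCalls, Set<String> queryCalls) {
        LinkedHashSet<String> missing = new LinkedHashSet<>(patternCalls);
        missing.removeAll(queryCalls);
        return List.copyOf(missing);
    }

    public static void main(String[] args) {
        // Calls present in the SWT query of Listing 4.3 vs. calls of the selected pattern P.
        Set<String> query = Set.of("Button.new", "FormData.new");
        List<String> pattern = List.of("Button.new", "Button.setText",
                                       "FormData.new", "Button.setLayoutData");
        System.out.println(missingCalls(pattern, query));
        // -> [Button.setText, Button.setLayoutData], matching the completion shown above
    }
}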

4.6.3 Integration of API Usage Patterns

Shen et al. [16] propose an approach for comprehensive integration of API usage patterns, which guides users to interactively fill in the missing variables. Their tool, CoDA, can update a list of recommended real-world code examples according to the user's current context. CoDA produces a comprehensive result as it classifies the recommended expressions according to their syntax, accompanied by a dynamically updated list of code examples. Examplore [6] predefines a synthetic skeleton for API usage, which includes seven API usage features such as preconditions, return value checks, and exception handling. Given a recommended API usage example, different features are marked with different colors to help developers quickly locate their desired component. MUSE [9] receives as input a given method and returns a list of usage examples from a large corpus. By applying static slicing and clone detection, it


clusters similar examples and recommends different, representative usages of the input method.

4.7 Conclusion

There is a rich literature on usage pattern mining approaches that differ from one another in the way they represent usage patterns. The approaches represent a pattern as (1) a set of frequently co-occurring API elements, (2) pairs of API elements, (3) frequent sub-sequences of API elements, (4) a graph structure among the API elements, (5) a probabilistic model, or (6) a topic model. We also presented applications of the mined usage patterns in (1) automated detection of anomalies in API usage, (2) pattern-based code completion, and (3) code integration for API usage patterns.

Despite the successes achieved over a decade of research in API pattern mining, as explained in this chapter, the key limitation of the mining approaches is the need for explicitly defined thresholds on the number of occurrences required for an item to be considered frequent, i.e., a pattern. This leads to another key challenge for this line of work: distinguishing between a rare usage and an incorrect one, as a rare API usage is not necessarily incorrect. These limitations have held back the success of API misuse detection approaches.

With the advances in artificial intelligence (AI) and deep learning (DL), future work on API usage patterns has several interesting new directions. First, library documentation could be leveraged together with the source code as input for a large language model (LLM) to learn usage specifications more precisely than the API usage patterns mined from source code alone. With the ability to connect natural-language descriptions of usages to concrete code, LLMs could learn the usage specifications and verify whether a particular usage satisfies those constraints. Second, LLMs can also learn to describe the specifications from the source code that uses the libraries (API specification mining). Some level of abstraction might be required to help a model learn the specifications better, because each scenario might contain project-specific code. The key advantage of using DL for such mining is that we no longer need to rely on thresholds for frequently occurring code patterns; instead, the DL model implicitly learns the patterns and derives the specifications. Third, SE applications such as API misuse detection, API code completion, and API specification are excellent downstream tasks for pre-trained language models, which can be fine-tuned to achieve the concrete goals of those applications. Fourth, advances in explainable AI can also contribute to SE research: they can provide important explanations for the results predicted by AI/ML models. For example, if a model provides only a positive result without an explanation, a developer would not know where to look to fix the violation(s); an explanation could provide excellent guidance for their next actions. Finally, the ability of


large language models to connect multiple sources of information will be crucial in providing holistic knowledge of the API usages in the libraries used within an ecosystem.

References

1. Acharya, M., Xie, T., Pei, J., Xu, J.: Mining API patterns as partial orders from source code: from usage scenarios to specifications. In: Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 25–34. ACM, New York (2007)
2. Wasylkowski, A., Zeller, A., Lindig, C.: Detecting object usage anomalies. In: Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering (ESEC/FSE), pp. 35–44. ACM, New York (2007). https://doi.org/10.1145/1287624.1287632
3. Dagenais, B., Hendren, L.: Enabling static analysis for partial Java programs. In: Conference on Object-Oriented Programming Systems Languages and Applications (OOPSLA), pp. 313–328. ACM, New York (2008). https://doi.org/10.1145/1449764.1449790
4. Engler, D., Chen, D.Y., Hallem, S., Chou, A., Chelf, B.: Bugs as deviant behavior: a general approach to inferring errors in systems code. In: Symposium on Operating Systems Principles (SOSP), pp. 57–72. ACM, New York (2001)
5. Fowkes, J., Sutton, C.: Parameter-free probabilistic API mining across GitHub. In: International Symposium on Foundations of Software Engineering (FSE), pp. 254–265. ACM, New York (2016). https://doi.org/10.1145/2950290.2950319
6. Glassman, E.L., Zhang, T., Hartmann, B., Kim, M.: Visualizing API usage examples at scale. In: Conference on Human Factors in Computing Systems (CHI). ACM (2018). https://doi.org/10.1145/3173574.3174154
7. Li, Z., Zhou, Y.: PR-Miner: automatically extracting implicit programming rules and detecting violations in large software code. In: Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 306–315. ACM, New York (2005)
8. Michail, A.: Data mining library reuse patterns using generalized association rules. In: International Conference on Software Engineering (ICSE), pp. 167–176. ACM, New York (2000). https://doi.org/10.1145/337180.337200
9. Moreno, L., Bavota, G., Di Penta, M., Oliveto, R., Marcus, A.: How can I use this method? In: International Conference on Software Engineering (ICSE), pp. 880–890. IEEE, Piscataway (2015)
10. Moritz, E., Linares-Vásquez, M., Poshyvanyk, D., Grechanik, M., McMillan, C., Gethers, M.: ExPort: detecting and visualizing API usages in large source code repositories. In: International Conference on Automated Software Engineering (ASE), pp. 646–651 (2013). https://doi.org/10.1109/ASE.2013.6693127
11. Nguyen, A.T., Nguyen, H.A., Nguyen, T.T., Nguyen, T.N.: GraPacc: a graph-based pattern-oriented, context-sensitive code completion tool. In: International Conference on Software Engineering (ICSE), pp. 1407–1410. IEEE, Piscataway (2012)
12. Nguyen, H.A., Nguyen, T.T., Pham, N.H., Al-Kofahi, J.M., Nguyen, T.N.: Accurate and efficient structural characteristic feature extraction for clone detection. In: Fundamental Approaches to Software Engineering (FASE), pp. 440–455. Springer, Berlin (2009)
13. Nguyen, T.T., Nguyen, H.A., Pham, N.H., Al-Kofahi, J.M., Nguyen, T.N.: Graph-based mining of multiple object usage patterns. In: Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering (ESEC/FSE), pp. 383–392. ACM, New York (2009). https://doi.org/10.1145/1595696.1595767


14. Niu, H., Keivanloo, I., Zou, Y.: API usage pattern recommendation for software development. J. Syst. Softw. 129, 127–139 (2017). https://doi.org/10.1016/j.jss.2016.07.026
15. Saied, M.A., Sahraoui, H.: A cooperative approach for combining client-based and library-based API usage pattern mining. In: International Conference on Program Comprehension (ICPC) (2016). https://doi.org/10.1109/ICPC.2016.7503717
16. Shen, Q., Wu, S., Zou, Y., Xie, B.: Comprehensive integration of API usage patterns. In: International Conference on Program Comprehension (ICPC), pp. 83–93 (2021). https://doi.org/10.1109/ICPC52881.2021.00017
17. Amann, S., Nguyen, H.A., Nadi, S., Nguyen, T.N., Mezini, M.: Investigating next steps in static API-misuse detection. In: International Conference on Mining Software Repositories (MSR), pp. 265–275 (2019). https://doi.org/10.1109/MSR.2019.00053
18. Wang, J., Dang, Y., Zhang, H., Chen, K., Xie, T., Zhang, D.: Mining succinct and high-coverage API usage patterns from source code. In: Working Conference on Mining Software Repositories (MSR), pp. 319–328. IEEE, Piscataway (2013)
19. Weimer, W., Necula, G.C.: Mining temporal specifications for error detection. In: International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), pp. 461–476. Springer, Berlin (2005)
20. Williams, C.C., Hollingsworth, J.K.: Automatic mining of source code repositories to improve bug finding techniques. Trans. Softw. Eng. 31(6), 466–480 (2005)
21. Yang, J., Evans, D., Bhardwaj, D., Bhat, T., Das, M.: Perracotta: mining temporal API rules from imperfect traces. In: International Conference on Software Engineering (ICSE), pp. 282–291. ACM, New York (2006)

Chapter 5

Emotion Analysis in Software Ecosystems

Nicole Novielli and Alexander Serebrenik

Abstract Software developers are known to experience a wide range of emotions while performing development tasks. Emotions expressed in developer communication might reflect the openness of the ecosystem to newcomers, the presence of conflicts, or problems in the software development process or in the source code itself. In this chapter, we present an overview of the state-of-the-art research on the analysis of emotions in software engineering, focusing on studies of emotion in the context of software ecosystems. To encourage further applications of emotion analysis in industry and research, we also include a table summarizing currently available emotion analysis tools and datasets, as well as an outline of directions for future research.

5.1 What Is a Software Ecosystem?

Several definitions of software ecosystems can be found in the literature [14, 64, 77, 78, 82, 83]. Rather than selecting one of these definitions a priori, we have decided to start by adopting a nominalistic approach, i.e., to state that an ecosystem is whatever is being called an ecosystem. Following this approach, we conduct a literature review of sentiment and emotion in software ecosystems, recording the definitions of ecosystems used in the primary studies, the ecosystems considered, and the insights obtained in the primary studies.

N. Novielli
University of Bari, Bari, Italy
e-mail: [email protected]

A. Serebrenik
Eindhoven University of Technology, Eindhoven, The Netherlands
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
T. Mens et al. (eds.), Software Ecosystems, https://doi.org/10.1007/978-3-031-36060-2_5



To conduct the literature study, we reuse the collection of 186 articles collected by Lin et al. as part of their systematic literature review of opinion mining1 for software development [75]. The authors used the ACM Digital Library, IEEE Xplore Digital Library, Springer Link Online Library, Wiley Online Library, Elsevier ScienceDirect, and Scopus. The following search query was used to locate primary studies in these online databases:

("opinion mining" OR "sentiment analysis" OR "emotion") AND ("software") AND ("developer" OR "development")

We perform a full-text search for the term ecosystem in the 186 articles. While additional articles in the collection of Lin et al. [75] might have studied ecosystems without using the term "ecosystem," the goal of this section is not to provide a comprehensive overview of emotion analysis in software ecosystems but rather to identify what kind of artefacts are usually being called "ecosystems" in the literature on opinion mining for software development. After excluding the articles that mention the term "ecosystem" only in the bibliography, we obtain 28 primary studies. None of them has provided a formal definition of an ecosystem.

Ten articles refer to the "software development ecosystem" or "social programmer ecosystem" as an entire collection of different channels and communication means available to a contemporary software developer, e.g., the Software Engineering Arousal lexicon (SEA) has been specifically designed to address the problem of detecting emotional arousal in the software developer ecosystem [80], and Novielli et al. applied sentiment analysis to such components of the ecosystem as GitHub and Stack Overflow [96]. Two articles explicitly talk about "a rich ecosystem of communication channels" [42, 97]. Four articles refer to the ecosystem of mobile apps: online reviews from an unnamed store [57] or the iTunes and Google Play marketplaces [41, 51, 79, 87] and Stack Overflow questions about Android, iOS, and Windows phone [74]. Similarly to the studies of the "software development ecosystem," this line of research seems to implicitly focus on the presence of a shared communication platform (e.g., app store, GitHub, or Stack Overflow) akin to the definition of Bosch and Bosch-Sijtsema [14]: "A software ecosystem consists of a software platform, a set of internal and external developers and a community of domain experts in service to a community of users that compose relevant solution elements to satisfy their needs." Differently from this definition, "a software platform" in this line of research is also conceptualized as a collection of interrelated technical platforms or communication channels, e.g., GitHub and Stack Overflow.

Seven studies have focused on open-source software communities: Tourani et al. [130] and Ortu et al. [104] considered Apache; in a different paper Ortu et al. [107] further extended the data to include Spring, JBoss, and CodeHaus. In addition, Ferreira et al. studied the Linux kernel [38], Umer et al. [133] the

1 "Opinion mining" is a broader area than sentiment and emotion, but the lion's share of the opinion mining studies in the software engineering context has been dedicated to sentiment and emotion.


reports from the Mozilla issue tracker collected by Nizamani et al. [92], Tourani and Adams [129] Eclipse and OpenStack, and finally Boudeffa et al. [15] OW2. These studies tend to focus on several projects within the chosen ecosystem, e.g., on ten OpenStack and five Eclipse projects [129] or on XWIKI, Sat4j, and asm from OW2 [15]. This focus on projects within the ecosystems is shared with the way Lungu has approached ecosystems, namely as collections of jointly developed projects [77]. Finally, several papers have used the word "ecosystem" in a very generic sense, not necessarily disclosing a particular meaning [4, 39].

Definition: In the context of sentiment and emotion studies in software engineering, "ecosystems" are often seen as:
• either platforms or collections of interrelated communication platforms supporting software development (e.g., GitHub, Stack Overflow, the Google Play app store)
• or collections of interrelated software projects (e.g., Apache, Eclipse, OW2).

5.2 What Is Emotion?

Emotions have always been at the center of human inquiry, with pre-Socratic philosophers being among the first to think about this topic [122]. Emotions have been studied by numerous philosophers [122], historians [109], sociologists [71], psychologists [91], biologists and neurophysiologists [1], economists [136], musicologists [68], literature scientists [53], and computing researchers [23]. Despite this, or maybe because of this, a definition of emotion has proved elusive [40, 122]: as aptly stated by Fehr and Russell, "everyone knows what an emotion is, until asked to give a definition. Then, it seems, no one knows" [37]. It should come as no surprise, then, that multiple theories of emotion have been proposed in the literature. Gross and Barrett [47] and Meiselman [81] propose to arrange emotion theories along a continuum ranging from theories of basic emotion through theories of appraisal to psychological construction theories. Among these theories, those at the extremes have found their way into studies of emotion in software engineering: theories of basic emotions, such as Ekman's [36], consider emotions to be universal and distinct from each other, while psychological construction theories tend to see emotions as a continuous space organized along several dimensions, e.g., Russell's circumplex model of affect [114]. Russell [114] has observed that the distinction between such emotions as sadness and anger, present in English, is absent from some African languages, while English lacks words for the Bengali obhiman, which refers to sorrow caused by the insensitivity of a loved one, or the German Schadenfreude, which refers to pleasure derived from another's displeasure. Based on these and similar arguments, he argued


that emotions can be operationalized along several dimensions: valence,2 arousal, and dominance. Valence expresses the degree of pleasantness of the emotion and is typically characterized on a scale from negative to positive. Arousal corresponds to the degree of activation in the emotion and can be characterized on a scale from low to high. Dominance relates to feeling in control versus feeling controlled. Using these dimensions, Russell states that both excitement and calmness can be characterized by positive valence and feeling in control, with arousal being high for excitement and low for calmness. Anger shares high arousal and dominance with excitement but differs from it by negative valence.

Ekman [36] believes emotions to be separate and distinct from each other. He associated emotions with facial expressions [35] and with distinctive patterns of activation of the autonomic nervous system [34], and connected emotions in humans to comparable expressions observed in other primates [34]. Ekman further argued that there are more emotional words than actual emotions and that only emotions satisfying specific criteria can be seen as basic emotions. These emotions are anger, surprise, disgust, enjoyment, fear, and sadness; later research suggests that contempt should be seen as a basic emotion too. Starting from a similar list of basic emotions, Shaver et al. [119] have proposed a tree-like structure gradually refining these emotions into emotion names. This hierarchy of emotion labels includes basic (primary) emotions, which are further refined into secondary and tertiary ones; e.g., anger is refined into such secondary emotions as envy, rage, and exasperation, with such tertiary subspecies of rage as outrage, hatred, or dislike. Plutchik's wheel3 of emotions [111] combines discrete and dimensional elements: while Plutchik argues that only a small number of basic emotions exist (and other emotions can be synthesized by combining the basic ones), he also recognizes that each emotion can exist at different levels of arousal, distinguishing between, e.g., "blues," sadness, and grief. Moreover, emotions on opposite sides of the wheel are opposing, e.g., joy and sadness, or expectancy and surprise.

The aforementioned models have been used when studying emotions in the context of software engineering: for example, Murgia et al. [89] have used the model by Shaver et al. [119], as presented in Parrott [108]; Khan and Saleh [72] chose Plutchik's wheel; and Girardi et al. [44] opted for Russell's circumplex model of emotion. Similarly to the latter work, many studies of emotion in software engineering implicitly adopt a dimensional model; as opposed to it, however, these studies focus only on the valence of emotion. Such studies tend to call valence "sentiment" and consider it to be negative, neutral, or positive [12, 110, 123].

2 Russell [114] used the term "pleasure" for "valence," but "valence" is more commonly used in subsequent publications.
3 "Wheel of emotions" is a later term; the original paper by Plutchik referred to a "three-dimensional emotion solid" with the degree of arousal providing the third dimension.
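As a small illustration of this dimensional operationalization (ours, not taken from Russell's work), the sketch below encodes the three emotions discussed above as points along the valence, arousal, and dominance dimensions; the numeric values are arbitrary, and only their signs and relative magnitudes matter.

/** Tiny illustration of operationalizing emotions along valence/arousal/dominance. */
final class DimensionalModelExample {

    /** Each dimension on a [-1, 1] scale: negative..positive valence,
     *  low..high arousal, feeling controlled..feeling in control (dominance). */
    record Emotion(String label, double valence, double arousal, double dominance) {}

    public static void main(String[] args) {
        Emotion excitement = new Emotion("excitement", +0.8, +0.9, +0.6);
        Emotion calmness   = new Emotion("calmness",   +0.7, -0.8, +0.6);
        Emotion anger      = new Emotion("anger",      -0.8, +0.9, +0.6);
        // Excitement and calmness share positive valence and dominance but differ in arousal;
        // anger shares excitement's high arousal and dominance but has negative valence.
        System.out.println(excitement + "\n" + calmness + "\n" + anger);
    }
}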


Theories: While multiple theories of emotion can be found in the literature, studies of emotion in the context of software engineering either use theories with a small number of distinct emotions or those positioning emotions in a continuous one- or multidimensional space.

5.3 Why Would One Study Emotions in Software Engineering?

Software development has often been stereotyped as a job with few interpersonal requirements [31], and, hence, one might doubt the importance of studying the emotions experienced and expressed by software developers. Our answer to the question in the section title is twofold. First, software development has long been recognized as a problem-solving activity [112], and emotions are known to impact problem-solving skills and creativity [5, 46]. Second, software development is a collaborative process [10], and sites such as GitHub and Stack Overflow further require communication to facilitate knowledge sharing and the co-creation of software [124].

Previous research in organizational psychology investigated the relation between emotions and knowledge sharing in a global IT organization. The study found that pride and empathy positively impact the willingness to share knowledge and are, in turn, influenced by knowledge-sharing intentions [54]. Wurzel Gonçalves et al. [141] investigated interpersonal conflicts in code review and found that they are common and often perceived as an opportunity to learn from disagreement, thus highlighting the need for developing strategies for the constructive resolution of conflicts. On the other hand, Murphy-Hill et al. [48] demonstrated the potential negative impact of receiving destructive criticism in code reviews on the motivation to continue working with colleagues. Indeed, the presence of negative emotions is one of the dimensions of an interpersonal conflict [7]. To illustrate the latter point, Wurzel Gonçalves et al. report the following comment made on a Linux mailing list in October 2015: "Christ people. This is just sh*t. The conflict I get is due to stupid new gcc header file crap. But what makes me upset is that the crap is for completely bogus reasons." Along the same line, anger in statements such as "Is there any progress on this issue??" has been considered by Gachechiladze et al. [42] for the identification of actionable insights in issue handling.

This is why substantial research effort has been dedicated to understanding the emotions experienced and expressed by developers, their triggers, and their consequences. However, in order to answer these questions, one has to measure emotions first.

Emotions in Software Engineering: Emotions influence both cognitive processes, such as problem-solving, and interpersonal interaction, both important elements of software engineering.


5.4 How to Measure Emotion?

Emotion measurement is an important topic in emotion research [27, 81]. In the following, we summarize recent advances in emotion recognition in software engineering. Specifically, we report on available tools and datasets specifically designed to support emotion recognition in the context of software development. For a further discussion of the ways the emotions of software developers are measured, we refer the reader to the recent article by Sánchez-Gordón and Colomo-Palacios [115], and for a discussion of software-engineering-specific sentiment analysis tools and datasets, to Lin et al. [75] and Obaidi and Klünder [102].

5.4.1 Tools

The scientific literature on emotion measurement covers a broad spectrum of techniques, including psychophysiological signals (e.g., electrodermal skin response, neuroendocrine factors, or heart rate) [93], observation of behavior (e.g., vocal and verbal characteristics or body expressions and postures) [63], measurement of facial expressions [56], self-reporting questionnaires [24], and text analysis [84]. Many of these techniques have been applied in the context of software engineering as well: e.g., Girardi et al. have analyzed psychophysiological signals to recognize emotions of developers in the lab [45, 86] and in the field [44, 135]; Novielli et al. [100] have argued that facial expressions should be used as a gold standard; and self-reporting questionnaires such as SAM [16] and PANAS [137] have been used by Çalikli et al. [22] and Schneider et al. [117], respectively. As the communication between software developers is to a large extent text-based, the use of vocal information for emotion detection has not been explored in the software engineering context. The work of Herrmann and Klünder [52] takes a first step in this direction: the authors advocate the use of audio to analyze emotions present in software project meetings. However, their approach starts by converting audio to text and hence ignores the tone and intonation that can reflect the emotions experienced by the meeting participants.

The same dominance of text-based communication has led the lion's share of emotion measurement techniques to focus on textual artefacts produced by software developers, e.g., code review comments, Stack Overflow questions, commit messages, or bug descriptions. Early studies of sentiment and emotion in software engineering used text analysis tools developed for very different kinds of text: e.g., a number of studies [43, 49, 104] have used SentiStrength [127], a tool originally designed for and evaluated on social Web datasets (Myspace, Twitter, YouTube, Digg, Runner's World, BBC Forums) [126]. However, as observed by Novielli et al. [96], when such tools are applied in the context of software engineering, they produce unreliable results, threatening the validity of previously published conclusions [66]. This observation has led to the emergence of a series of software-engineering-specific sentiment analysis [3, 11, 18, 25, 32, 62, 118] and emotion


detection tools [20, 59, 61]. As most of these tools are based on machine learning, retraining is recommended when applying them to a different kind of text than the one they have been designed for [94]; indeed, different software-engineering-specific sentiment analysis tools might lead to contradictory results at a fine-grained level when used off the shelf [99]. Further empirically driven recommendations include carefully choosing the emotion model in line with the research goals, as the operationalization of emotions adopted by the designer of an emotion detection tool might not necessarily match the focus and goal of a given empirical study. Furthermore, when a manually annotated gold standard is not available for retraining, lexicon-based tools such as SentiStrengthSE might represent a viable option [98]. To encourage further applications of emotion analysis in industry and research, we also include Table 5.1, which summarizes currently available emotion analysis tools.
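To illustrate what a lexicon-based approach entails, the sketch below scores the polarity of a text by summing per-word scores from a small, invented lexicon. It is a minimal illustration, not SentiStrengthSE or any other tool listed in Table 5.1; the lexicon entries are assumptions made for the example.

import java.util.Locale;
import java.util.Map;

/** Minimal illustration of lexicon-based polarity scoring (toy lexicon, toy rules). */
final class LexiconSentimentSketch {

    // Invented domain-flavored lexicon: positive words score > 0, negative words < 0.
    private static final Map<String, Integer> LEXICON = Map.of(
            "thanks", 2, "great", 2, "works", 1,
            "broken", -2, "crash", -2, "bug", -1);

    /** Returns a crude polarity: positive (1), neutral (0), or negative (-1). */
    static int polarity(String text) {
        int score = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            score += LEXICON.getOrDefault(token, 0);
        }
        return Integer.signum(score);
    }

    public static void main(String[] args) {
        System.out.println(polarity("Thanks, the patch works great!"));   // 1 (positive)
        System.out.println(polarity("This crash is caused by a bug."));   // -1 (negative)
    }
}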

5.4.2 Datasets

As an output of recent empirical research in this field, several annotated datasets have been released by SE researchers to further encourage the training and fine-tuning of software-engineering-specific sentiment analysis tools. Murgia et al. [89] released a dataset of Jira comments labeled according to Shaver's primary emotions, namely joy, love, surprise, anger, fear, and sadness. The dataset, initially composed of 400 text items, was further extended using the same annotation schema by Ortu et al. [107]. Using the same taxonomy, Calefato et al. annotated more than 4000 questions, answers, and comments from Stack Overflow [18]. Besides releasing the emotion labels, the authors also provide a mapping to the valence dimension, labeling each text item as either positive (joy, love), negative (anger, fear, sadness), or neutral (absence of an emotion label); surprise was mapped to either positive or negative valence depending on the context. Novielli et al. [94] adopted the same coding guidelines for labeling emotions and mapping them to positive, negative, and neutral when annotating about 7000 comments from GitHub projects. Stack Overflow posts (1500 overall) were also annotated by Lin et al. in the scope of an empirical study on mining positive and negative opinions of developers about software libraries [76]. Jira comments (500 overall) were also annotated by Kaur et al. [70] according to polarity labels, as well as by Islam and Zibran [61], who labeled 1800 Jira comments to identify the presence of excitement, stress, depression, and relaxation. Beyond Jira, Stack Overflow, and GitHub, other data sources were used, such as code review comments [3] and tweets [140]. In their survey, Lin et al. [75] report a detailed list of available datasets for sentiment polarity, emotion, and politeness detection, which can be used as a gold standard for training supervised classifiers. They also consider datasets annotated with a broader set of emotion-related mental states, such as confusion [33].


Table 5.1 Tools for sentiment/emotion detection in software engineering. Based on previous literature reviews [75, 102, 103] and updated

Tool | Methodology | Based on | Theoretical model

Sentiment detection
SentiStrengthSE [62] | Lexicon-based | issue reports | positive, neutral, negative
Senti4SD [18] | Traditional machine learning | Stack Overflow posts | positive, neutral, negative
SEntiMoji [25] | Deep learning | issue reports, Stack Overflow posts, code reviews | positive, neutral, negative
SentiSW [32] | Deep learning | issue reports | positive, neutral, negative
SentiCR [3] | Traditional machine learning | code reviews | non-negative, negative
SentiSE [118] | Traditional machine learning | code reviews | positive, neutral, negative
Unnamed classifier [143] | Transformer models | issue reports, Stack Overflow posts, code reviews, app reviews, GitHub pull-request and commit comments | positive, neutral, negative
Unnamed classifier [11] | BERT-based | Stack Overflow posts | positive, neutral, negative
EASTER [125] | Deep learning | Stack Overflow posts, app reviews, JIRA issues | positive, neutral, negative
Sentisead [131] | Ensemble | Stack Overflow posts, issue reports, app reviews | positive, neutral, negative

Emotion detection
SO BERT emotion classifier [13] | BERT-based | Stack Overflow posts | Distinct emotions: love, joy, surprise, anger, sadness, fear
DEVA [61] | Lexicon-based | issue reports | Discretization of the two-dimensional valence/arousal model: excitement, stress, depression, relaxation, neutral
MarValous [59] | Traditional machine learning | Stack Overflow posts, issue reports | Discretization of the two-dimensional valence/arousal model: excitement, stress, depression, relaxation, neutral
Unnamed classifier [88] | Traditional machine learning | issue reports | Distinct emotions: joy, love, sadness, neutral
EmoTxT [20] | Traditional machine learning | Stack Overflow posts, issue reports | Distinct emotions: joy, love, sadness, anger, surprise, fear; neutral is assigned in the absence of other emotions
Unnamed classifier [17] | Traditional machine learning | Stack Overflow posts | Distinct emotions: joy, love, sadness, anger, surprise, fear, objective


Measurement: While a broad spectrum of emotion measurement techniques can be found in the psychological literature, and many of them have been applied in the software engineering context, text-based techniques (sentiment analysis) remain dominant. Multiple sentiment analysis techniques have been designed specifically for software engineering.

5.5 What Do We Know About Emotions and Software Ecosystems?

Following our observations in Sect. 5.1, we organize this section into two subsections according to the two interpretations of the concept of an "ecosystem": as a (collection of interrelated) communication platform(s) or as a collection of interrelated projects. A word of caution is in order, though: due to the use of different datasets and tools, conclusions derived by similar studies might appear contradictory. Moreover, the validity of conclusions about texts created by software engineers that were derived using general-purpose sentiment analysis tools not adjusted to the software engineering domain should be reassessed, as those tools are known to be unreliable in the software engineering context [66, 96]. In particular, this is the case for all results published prior to 2017, as the first software-engineering-specific sentiment analysis tool was published in 2017. For example, such insights of Guzman et al. [49] as "Java GitHub projects tend to have a slightly more negative score than projects implemented in other languages," or that comments on Monday were more negative than comments on the other days, could not be confirmed when a different sentiment analysis tool was used [66].

5.5.1 Ecosystems as Communication Platforms

In this section, we discuss two popular developer communication platforms, Stack Overflow and GitHub.

5.5.1.1 Stack Overflow

Stack Overflow is a major Q&A platform that has been frequently considered in the research literature in general [2, 6, 9, 90] and through the lens of emotion analysis in particular [17–19, 21, 65, 66, 76, 85, 95, 97, 131, 132]. However, many papers have merely used data from Stack Overflow to evaluate sentiment analysis tools rather than to obtain insights into the development practices on Stack Overflow. We exclude these papers from the subsequent discussion.


Several studies have tried to relate the sentiment expressed in Stack Overflow posts to success (e.g., the ability to receive an answer) or quality (e.g., as expressed in terms of upvotes and downvotes). Mondal et al. [85] have observed that upvoted questions tend to be more positive than downvoted ones. Jiarpakdee et al. [65] have shown that the inclusion of sentiment-related variables improves the prediction of whether a Stack Overflow question will get an accepted answer. Refining this insight, Calefato et al. [21] recommend that users write questions using a neutral emotional style, as expressing emotions is associated with a lower probability of success, i.e., of receiving an answer that is accepted as a solution. Finally, Calefato et al. [19] observed that comments, rather than questions and answers, tend to express emotions and that this can be attributed to the fact that comments do not influence the reputation scores and hence can be seen as a kind of "lawless region" where anything goes.

Focusing on the Stack Overflow discussions about APIs, Uddin and Khomh [132] have observed that certain aspects of APIs, such as performance, trigger more opinionated statements than other aspects, such as security. Among these opinionated statements, security-related ones are predominantly positive, while the opinions related to performance and portability are much more mixed. Zooming in on specific domains, Uddin and Khomh observed that the distribution of opinions for a given aspect varies, e.g., the opinions about performance are mostly positive for API features related to serialization but mostly negative with regard to debugging the performance issues of the APIs.

Finally, Cagnoni et al. [17] have used sentiment information to complement indicators of the popularity of programming languages such as the one by TIOBE.4 They have observed that the programming languages associated with the highest share of positive posts on Stack Overflow are not necessarily the same as those developers indicate as the most loved languages in a survey: while MATLAB and R trigger the highest share of positive emotions in Stack Overflow posts, Rust and Kotlin are the languages indicated as being "loved" by the highest percentage of respondents to the Stack Overflow survey. In fact, Python is the only programming language shared by the top 10 of the most loved languages and the top 10 of the programming languages that have triggered the highest share of positive emotions. One might wonder what makes a language "loved": the insights of Cagnoni et al. [17] suggest that there is more to this than a positive atmosphere in the support community.

5.5.1.2 GitHub

Similarly to Stack Overflow, GitHub has been extensively studied in the research literature [28, 128, 134], even triggering methodological research on how GitHub-based studies should be conducted [69]. Emotion analysis has also been repeatedly conducted on GitHub data [29, 32, 49, 55, 58, 60, 67, 105, 106, 110, 116, 120, 121, 123, 138, 139, 142]. Also similarly to Stack Overflow, manually labeled datasets

4 https://www.tiobe.com/tiobe-index/.


derived from GitHub have been used to design and evaluate sentiment analysis tools; we do not discuss those papers below.

The first group of studies has considered the impact of negative or positive GitHub-related artefacts, such as issues and commits, on the software development process. Souza and Silva [123] have observed that commits with negative sentiment are slightly more likely to result in broken builds. In a similar vein, Huq et al. [55] have observed that negative emotions in contributor commits indicate that a fix-inducing change might be needed, while the statistical model of Ortu et al. [106] suggests that issues expressing dominance and sadness are less likely to be merged. Taken together, these studies [55, 106, 123] suggest that commits expressed more negatively deserve a more careful review, as they might have undesirable consequences. When it comes to positive emotions, Huq et al. [55] also claimed that "too much positive emotions in discussion may lead to buggy code" as positivity "can turn developers overconfident and careless, reducing their ability to scrutinize their own code" and potentially biasing other reviewers. The latter point might also be related to the observation that positive valence, and specifically the emotion of joy, is linked with a higher probability of merge [106]. The preference for a neutral emotional style stemming from these studies is reminiscent of the similar recommendation of Calefato et al. [21] to write Stack Overflow questions using a neutral emotional style.

The second group of studies has investigated the software engineering contexts in which specific emotions can be observed. For example, Souza and Silva [123] have shown that commits following a build breakage tend to be more negative. Singh and Singh [120] found that developers express more negative sentiments than positive sentiments when performing refactorings. This finding contradicts the observation of Islam and Zibran [60] that positive emotions are significantly higher than negative emotions for refactoring tasks, despite the fact that both studies [60, 120] have considered the general-purpose sentiment analysis tool SentiStrength [127] and in both cases the tool has been adjusted to the software engineering context. However, the adjustment has not necessarily been carried out in the same way, and the datasets considered have been different, which might explain the difference between the results. Furthermore, differences between the conclusions might be related to the different kinds of refactoring activities carried out by the developers: e.g., Singh and Singh [120] observed that high negativity could be particularly attributed to move refactorings, renamed classes, or attributes being pulled up, while no such information is available for the study of Islam and Zibran [60]. Rather than distinguishing between specific types of software development activities such as refactoring or bug fixing, Pletea et al. [110] have focused on the application domain these activities take place in and compared security-related GitHub entities with non-security-related ones. The authors have concluded that security-related entities are more negative than the rest of the entities and that this does not depend on whether one considers as "entities" commits or pull requests, and individual comments or entire discussions. This conclusion has been confirmed by subsequent replication studies [66, 99].


Several studies have focused on the influence of the day of the week on sentiment. Guzman et al. [49] have stated that comments on Monday were more negative than comments on the other days, but this finding could not be confirmed through replication [66]. Sinha et al. [121] reported that, overall, the most negative day was Tuesday, while for the projects with the highest number of commits in their dataset, the most negative days were Wednesday and Thursday, suggesting that the differences in the distribution of sentiment over the week might be project-specific. Islam and Zibran [60] observed negative emotions to be slightly higher in commit messages posted during the weekend than in those posted on weekdays, with little difference visible in the emotional scores across the five weekdays. Ultimately, evidence on the presence of day-related differences in developers' sentiment is inconclusive at the very least.

Ortu et al. [105] and Destefanis et al. [29] compared sentiment in the communication of developers and users. According to Ortu et al. [105], when commenting, users express more love, sadness, joy, and anger than developers; for replies, however, the situation is partially reversed: developers tend to express more positive emotions (love and joy) and fewer negative ones (sadness and anger). Using a complementary perspective on the theory of emotion, Destefanis et al. [29] observed that commenters expressed fewer emotions than users, while they communicated with higher levels of arousal, valence, and dominance.

Finally, Yang et al. [142] and Jurado and Rodríguez Marín [67] have conducted studies on collections of GitHub projects focusing on similarities and differences between these projects. We discuss these papers in Sect. 5.5.2.1.

5.5.2 Ecosystems as Interrelated Projects

5.5.2.1 GitHub

Yang et al. [142] and Jurado and Rodríguez Marín [67] have conducted studies on collections of GitHub projects focusing on similarities and differences between these projects, i.e., as opposed to the studies discussed in Sect. 5.5.1.2, they interpret the notion of an ecosystem as a collection of projects rather than as a shared communication space. Yang et al. [142] observed that bug-fixing speed increases with increasing emotional values for 13 projects of their dataset, while it decreases for seven projects. Unfortunately, no explanation has been provided for this phenomenon. Jurado and Rodríguez Marín [67] have studied the distribution of emotions across nine projects: they have observed that at least 80% of the communication does not express emotion and that the most expressed emotion is joy, accounting for 4.66–11.94%. Disgust is the least present emotion, barely found in the dataset. The authors have also observed differences between the projects: e.g., fear is overrepresented in Raspberry Pi compared to other projects. Pandas has shown two instances of fear; however, one of these instances merely reflected the presence of fear-related lexicon rather than actual fear experienced by developers:


"The terrible motivating example was this awful hack." These findings suggest that (a) different projects might have different project cultures affecting how frequently different emotions are expressed and (b) pure lexicon-based approaches cannot capture the complexity of software engineering communication.

5.5.2.2 Apache

Several studies have considered the Apache ecosystem [26, 70, 88, 89, 113, 130]. Based on Parrott's model [108], Murgia et al. [88, 89] have observed that developers express all emotions from the model. While some emotions can refer both to software artifacts and to coworkers (e.g., joy, anger, and sadness), others target only artifacts (e.g., surprise and fear) or only coworkers (e.g., love). One should keep in mind, though, that in this study love is mostly represented as gratitude ("Thanks very much! I appreciate your efforts"), joy as satisfaction with the development process or its results ("I'm happy with the approach and the code looks good"), and sadness as developers apologizing for their mistakes ("Sorry for the delay Stephen") or expressing their dissatisfaction ("Apache Harmony is no longer releasing. No need to fix this, as sad as it is").

Rigby and Hassan [113] describe "developer B," the top committer for 1999 and 2000, who later left the project. As developer B was preparing to leave, their language shifted from describing new insights to explaining previously taken decisions, and the number of positive emotions decreased as well. Focusing on emoticons, Claes et al. [26] have observed that in more than 90% of occurrences, emoticons are used to express joy. Moreover, emoticons are used more often in Apache projects during the weekend than during weekdays; the effect size was, however, small. Tourani et al. [130] have studied the Apache mailing lists and observed that almost 70% of the communication is neutral, about 20% is positive, and slightly more than 10% is negative. For emails with positive sentiment, user mailing lists contain substantially more curiosity than developer mailing lists, while for emails with negative sentiment, user mailing lists contain more sadness and less aggression than developer mailing lists.

5.5.2.3 Other Ecosystems

Next, we review studies of sentiment and emotion in other ecosystems. Two studies have targeted the Eclipse ecosystem [101, 129]. Using SentiStrength, Tourani and Adams observed that an increase in the lowest sentiment score, i.e., the most negative score becoming less negative, has a positive but small effect on defect proneness. By manually analyzing the initial posts of threads in the Eclipse forum and the corresponding first replies, Nugroho et al. [101] have observed that Junior contributors and Members post more positive messages than Senior contributors; moreover, Juniors both start positive interactions and receive positive responses, while Seniors initiate positive, neutral, and negative communication alike. These observations seem to concur with the idea that the relative power of the

118

N. Novielli and A. Serebrenik

the relative power of the actor and target affects the extent to which display rules require controlling one’s expressions [30, 50]. Two further studies have focused on Mozilla. Umer et al. [133] have shown that including sentiment improves the prediction of whether enhancements proposed by Mozilla contributors will be integrated. While, similarly to Apache, the most popular emoticon in Mozilla represents joy, the share of sad and surprised emoticons in Mozilla is more than twice as high [26]. Ferreira et al. [38] have studied sentiment expressed on the Linux Kernel mailing list. While no differences in sentiment across releases, months, and weeks have been observed, Sunday, Tuesday, Wednesday, and Thursday had more positive than negative sentiment. Referring to the specific event of Linus Torvalds taking a break from the community in September 2018, the authors investigated whether it affected sentiment within the community. While the difference in sentiment was not immediate, positive sentiment increased at the level of months and weeks after his break. Lanovaz and Adams [73] compared the sentiment in two mailing lists of the R community: one targeting developers and another one helping users. Developers showed marginally more positive and negative tones than users, and while negative messages by users did not receive a response, this was not the case for developers. One might wonder whether this difference could be attributed to developers seeing users as customers and hence neutralizing their emotions [30].
Sentiment in Ecosystems: When ecosystems are treated as communication platforms, sentiment is used to predict the outcome of developers’ activities on these platforms or to understand the contexts in which different sentiment is observed. When ecosystems are treated as collections of projects, the studies focus on differences between kinds of contributors, differences between projects, and the prevalence of different emotions.

5.6 What Next?

In this chapter, we have provided an overview of the state-of-the-art resources for sentiment analysis in software engineering, with a specific focus on their application to software ecosystems. The good performance achieved by the available SE-specific sentiment analysis tools provides evidence that reliable sentiment analysis in software development is possible, provided that SE-specific tools are used. Still, open challenges remain for sentiment analysis in general, and on developers’ communication traces in particular. Tools based on supervised machine learning might perform differently on different data sources due to platform-specific jargon and communication style [98]. As such, we recommend retraining supervised tools using a gold standard from the same domain and data source being targeted.
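As one concrete illustration of this retraining recommendation, the following is a minimal sketch that fits a polarity classifier on a platform-specific gold standard before applying it to unlabeled communication traces. The file name, column names, and the simple bag-of-words classifier are illustrative assumptions, not the setup of any specific tool or gold standard discussed in this chapter:

# Minimal sketch: retrain a supervised polarity classifier on a
# platform-specific gold standard (file and column names are hypothetical).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical gold standard: one developer message per row,
# manually labeled as "positive", "negative", or "neutral".
gold = pd.read_csv("gold_standard_stackoverflow.csv")  # columns: text, label

train_texts, test_texts, train_labels, test_labels = train_test_split(
    gold["text"], gold["label"], test_size=0.2, random_state=42,
    stratify=gold["label"])

# A simple lexical baseline; the SE-specific tools cited in this chapter
# typically rely on richer features or pre-trained transformer models.
classifier = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])
classifier.fit(train_texts, train_labels)

# Evaluate on held-out messages from the same platform before using the
# model on unlabeled data from that platform.
print(classification_report(test_labels, classifier.predict(test_texts)))

Whether such a simple baseline is sufficient depends on the platform; the studies discussed above generally report better results with SE-specific tools and transformer-based models trained on in-domain data.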

Model retraining or fine-tuning might also be required in a within-platform setting, as language constantly evolves, especially in the context of online interaction. This is the case for emoji, for example, which have recently emerged as a predominant way of conveying emotional content [25]. Traditionally, sentiment analysis research has predominantly focused on the English language, also due to the availability of resources for this language (e.g., sentiment lexicons and toolkits). We highlight the need for future research to focus on different languages, in order to support effective interaction in projects and communities that do not use English as the predominant language in their online communication. Finally, while most of the approaches used so far have focused on a single type of measurement, Lisa Feldman Barrett has recently developed a constructionist approach to the measurement of emotions, advocating a multimodal approach that goes beyond solely facial analysis, self-reporting, or psychophysiology [8]. Along the same lines, Novielli et al. advocate the design and implementation of tools combining multiple approaches to emotion assessment, to fully support emotion awareness during software development. Specifically, they envisage the emergence of tools and practices including both self-reporting of emotions through experience sampling and emotion detection using biometrics, as these might provide complementary information on the emotional status of an individual [100].

5.7 What Have We Discussed in This Chapter?

Software engineering processes depend on the emotions experienced and expressed by software developers. To gain insights into these emotions, psychological theories and automated tools have been developed. Using these theories and tools, multiple studies have investigated software ecosystems through the lens of emotion. Most such studies have considered ecosystems as interrelated communication platforms supporting software development, such as GitHub or Stack Overflow, e.g., recommending how developers should ask questions on Stack Overflow, or aiming to understand the impact of emotions on software engineering or the contexts in which emotions are likely to emerge. Other studies treat ecosystems as collections of projects, such as Apache or Eclipse, focusing on the experiences of developers in these communities.

References

1. Adolphs, R., Anderson, D.J.: The Neuroscience of Emotion: A New Synthesis. Princeton University Press, Princeton (2018) 2. Ahasanuzzaman, M., Asaduzzaman, M., Roy, C.K., Schneider, K.A.: CAPS: a supervised technique for classifying Stack Overflow posts concerning API issues. Empirical Softw. Eng. 25(2), 1493–1532 (2020). https://doi.org/10.1007/s10664-019-09743-4

3. Ahmed, T., Bosu, A., Iqbal, A., Rahimi, S.: SentiCR: a customized sentiment analysis tool for code review interactions. In: International Conference on Automated Software Engineering (ASE), pp. 106–111. IEEE, Piscataway (2017). https://doi.org/10.1109/ASE.2017.8115623 4. Ali, N., Hong, J.E.: Value-oriented requirements: eliciting domain requirements from social network services to evolve software product lines. Appl. Sci. 9(19), 3944 (2019) 5. Amabile, T.M., Barsade, S.G., Mueller, J.S., Staw, B.M.: Affect and creativity at work. Administrative Sci. Q. 50(3), 367–403 (2005). https://doi.org/10.2189/asqu.2005.50.3.367 6. Anderson, A., Huttenlocher, D., Kleinberg, J., Leskovec, J.: Discovering value from community activity on focused question answering sites: a case study of Stack Overflow. In: SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 850–858. ACM, New York (2012). https://doi.org/10.1145/2339530.2339665 7. Barki, H., Hartwick, J.: Interpersonal conflict and its management in information system development. MIS Q. 25(2), 195–228 (2001) 8. Barrett, L.F.: How Emotions Are Made: The Secret Life of the Brain. HarperCollins (2017) 9. Barua, A., Thomas, S.W., Hassan, A.E.: What are developers talking about? An analysis of topics and trends in Stack Overflow. Empirical Softw. Eng. 19(3), 619–654 (2014). https:// doi.org/10.1007/s10664-012-9231-y 10. Begel, A., Herbsleb, J.D., Storey, M.A.: The future of collaborative software development. In: Conference on Computer Supported Cooperative Work (CSCW), pp. 17–18. ACM, New York (2012). https://doi.org/10.1145/2141512.2141522 11. Biswas, E., Karabulut, M.E., Pollock, L., Vijay-Shanker, K.: Achieving reliable sentiment analysis in the software engineering domain using bert. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 162–173. IEEE, Piscataway (2020). https://doi.org/10.1109/ICSME46990.2020.00025 12. Biswas, E., Vijay-Shanker, K., Pollock, L.L.: Exploring word embedding techniques to improve sentiment analysis of software engineering texts. In: International Conference on Mining Software Repositories (MSR), pp. 68–78. IEEE / ACM (2019). https://doi.org/10. 1109/MSR.2019.00020 13. Bleyl, D., Buxton, E.K.: Emotion recognition on stackoverflow posts using bert. In: 2022 IEEE International Conference on Big Data (Big Data), pp. 5881–5885 (2022). https://doi. org/10.1109/BigData55660.2022.10020161 14. Bosch, J.: From software product lines to software ecosystems. In: International Software Product Line Conference (SPLC) (2009) 15. Boudeffa, A., Abherve, A., Bagnato, A., Thomas, C., Hamant, M., Montasser, A.: Application of computational linguistics techniques for improving software quality. In: International Conference on Product-Focused Software Process Improvement (PROFES). Lecture Notes in Computer Science, vol. 11915, pp. 577–582. Springer, Berlin (2019). https://doi.org/10. 1007/978-3-030-35333-9%5C_41 16. Bradley, M.M., Lang, P.J.: Measuring emotion: the self-assessment manikin and the semantic differential. J. Behav. Therapy Exp. Psychiatry 25(1), 49–59 (1994). https://doi.org/10.1016/ 0005-7916(94)90063-9 17. Cagnoni, S., Cozzini, L., Lombardo, G., Mordonini, M., Poggi, A., Tomaiuolo, M.: Emotionbased analysis of programming languages on Stack Overflow. ICT Express 6(3), 238–242 (2020). https://doi.org/10.1016/j.icte.2020.07.002 18. Calefato, F., Lanubile, F., Maiorano, F., Novielli, N.: Sentiment polarity detection for software development. Empirical Softw. Eng. 
23(3), 1352–1382 (2018). https://doi.org/10. 1007/s10664-017-9546-9 19. Calefato, F., Lanubile, F., Marasciulo, M.C., Novielli, N.: Mining successful answers in Stack Overflow. In: Working Conference on Mining Software Repositories (MSR), pp. 430–433. IEEE, Piscataway (2015). https://doi.org/10.1109/MSR.2015.56 20. Calefato, F., Lanubile, F., Novielli, N.: EmoTxt: a toolkit for emotion recognition from text. In: International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACII Workshops), pp. 79–80. IEEE, Piscataway (2017). https://doi.org/10.1109/ ACIIW.2017.8272591

21. Calefato, F., Lanubile, F., Novielli, N.: How to ask for technical help? Evidence-based guidelines for writing questions on Stack Overflow. Inf. Softw. Technol. 94, 186–207 (2018). https://doi.org/10.1016/j.infsof.2017.10.009 22. Çalikli, G., Al-Eryani, M., Baldebo, E., Horkoff, J., Ask, A.: Effects of automated competency evaluation on software engineers’ emotions and motivation: a case study. In: International Workshop on Emotion Awareness in Software Engineering (SEmotion), pp. 44–50. ACM, New York (2018). https://doi.org/10.1145/3194932.3194939 23. Calvo, R.A., D’Mello, S., Gratch, J., Kappas, A.: The Oxford Handbook of Affective Computing. Oxford University Press, Oxford (2014) 24. Cardello, A.V., Jaeger, S.R.: Measurement of consumer product emotions using questionnaires. In: Meiselman, H.L. (ed.) Emotion Measurement, pp. 165–200. Woodhead Publishing (2016). https://doi.org/10.1016/B978-0-08-100508-8.00008-4 25. Chen, Z., Cao, Y., Lu, X., Mei, Q., Liu, X.: SEntiMoji: an emoji-powered learning approach for sentiment analysis in software engineering. In: Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 841–852. ACM, New York (2019). https://doi.org/10.1145/3338906. 3338977 26. Claes, M., Mäntylä, M., Farooq, U.: On the use of emoticons in open source software development. In: International Symposium on Empirical Software Engineering and Measurement (ESEM). ACM, New York (2018). https://doi.org/10.1145/3239235.3267434 27. Coan, J.A. (ed.): Handbook of Emotion Elicitation and Assessment. Series in Affective Science. Oxford University Press, Oxford (2007) 28. Dabbish, L., Stuart, C., Tsay, J., Herbsleb, J.: Social coding in GitHub: transparency and collaboration in an open software repository. In: International Conference on Computer Supported Cooperative Work (CSCW), pp. 1277–1286. ACM, New York (2012). https://doi. org/10.1145/2145204.2145396 29. Destefanis, G., Ortu, M., Bowes, D., Marchesi, M., Tonelli, R.: On measuring affects of github issues’ commenters. In: International Workshop on Emotion Awareness in Software Engineering (SEmotion), pp. 14–19. ACM, New York (2018). https://doi.org/10.1145/3194932. 3194936 30. Diefendorff, J.M., Greguras, G.J.: Contextualizing emotional display rules: examining the roles of targets and discrete emotions in shaping display rule perceptions. J. Manag. 35, 880– 898 (2009). https://doi.org/10.1177/0149206308321548 31. Diefendorff, J.M., Richard, E.M.: Antecedents and consequences of emotional display rule perceptions. J. Appl. Psychol. 88(2), 284–294 (2003) 32. Ding, J., Sun, H., Wang, X., Liu, X.: Entity-level sentiment analysis of issue comments. In: International Workshop on Emotion Awareness in Software Engineering (SEmotion), pp. 7– 13. ACM, New York (2018). https://doi.org/10.1145/3194932.3194935 33. Ebert, F., Castor, F., Novielli, N., Serebrenik, A.: Confusion detection in code reviews. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 549–553. IEEE, Piscataway (2017). https://doi.org/10.1109/ICSME.2017.40 34. Ekman, P.: An argument for basic emotions. Cogn. Emotion 6(3-4), 169–200 (1992). https:// doi.org/10.1080/02699939208411068 35. Ekman, P.: Facial expression and emotion. Am. Psychol. 48, 384–392 (1993). https://doi.org/ 10.1037/0003-066X.48.4.384 36. Ekman, P.: Basic emotions. In: Dalgleish, T., Powers, M.J. (eds.) Handbook of Cognition and Emotion, pp. 45–60. Wiley, London (1999) 37. 
Fehr, B., Russell, J.A.: Concept of emotion viewed from a prototype perspective. J. Exp. Psychol.: Gener. 113, 464–486 (1984) 38. Ferreira, I., Stewart, K., Germán, D.M., Adams, B.: A longitudinal study on the maintainers’ sentiment of a large scale open source ecosystem. In: International Workshop on Emotion Awareness in Software Engineering (SEmotion), pp. 17–22. IEEE/ACM (2019). https://doi. org/10.1109/SEmotion.2019.00011

39. Ferreira, J., Dennehy, D., Babu, J., Conboy, K.: Winning of hearts and minds: integrating sentiment analytics into the analysis of contradictions. In: IFIP WG 6.11 Conference on eBusiness, e-Services, and e-Society (I3E). Lecture Notes in Computer Science, vol. 11701, pp. 392–403. Springer, Berlin (2019). https://doi.org/10.1007/978-3-030-29374-1%5C_32 40. Frijda, N.H.: The Psychologists’ Point of View, 3rd edn., pp. 68–87. Handbook of Emotions. The Guilford Press (2008) 41. Fu, B., Lin, J., Li, L., Faloutsos, C., Hong, J.I., Sadeh, N.M.: Why people hate your app: making sense of user feedback in a mobile app store. In: International Conference on Knowledge Discovery and Data Mining (KDD), pp. 1276–1284. ACM, New York (2013). https://doi.org/10.1145/2487575.2488202 42. Gachechiladze, D., Lanubile, F., Novielli, N., Serebrenik, A.: Anger and its direction in collaborative software development. In: International Conference on Software Engineering: New Ideas and Emerging Technologies Results Track ICSE-NIER, pp. 11–14. IEEE, Piscataway (2017). https://doi.org/10.1109/ICSE-NIER.2017.18 43. Garcia, D., Zanetti, M.S., Schweitzer, F.: The role of emotions in contributors activity: a case study on the Gentoo community. In: International Conference on Cloud and Green Computing, pp. 410–417 (2013) 44. Girardi, D., Lanubile, F., Novielli, N., Serebrenik, A.: Emotions and perceived productivity of software developers at the workplace. Trans. Softw. Eng. 1–53 (2022). https://doi.org/10. 1109/TSE.2021.3087906 45. Girardi, D., Novielli, N., Fucci, D., Lanubile, F.: Recognizing developers’ emotions while programming. In: International Conference on Software Engineering (ICSE), pp. 666–677. ACM, New York (2020). https://doi.org/10.1145/3377811.3380374 46. Graziotin, D., Wang, X., Abrahamsson, P.: Happy software developers solve problems better: psychological measurements in empirical software engineering. PeerJ 2, e289 (2014). https:// doi.org/10.7717/peerj.289 47. Gross, J.J., Barrett, L.F.: Emotion generation and emotion regulation: one or two depends on your point of view. Emot. Rev. 3(1), 8–16 (2011) 48. Gunawardena, S.D., Devine, P., Beaumont, I., Garden, L.P., Murphy-Hill, E., Blincoe, K.: Destructive criticism in software code review impacts inclusion. International Conference on Human-Computer Interaction (CSCW), vol. 6 (2022). https://doi.org/10.1145/3555183 49. Guzman, E., Azócar, D., Li, Y.: Sentiment analysis of commit comments in GitHub: an empirical study. In: International Conference on Mining Software Repositories (MSR), pp. 352–355. ACM, New York (2014). https://doi.org/10.1145/2597073.2597118 50. Hall, J.A., Coats, E.J., LeBeau, L.S.: Nonverbal behavior and the vertical dimension of social relations: a meta-analysis. Psychol. Bull. 131(6), 898–924 (2005) 51. Hatamian, M., Serna, J.M., Rannenberg, K.: Revealing the unrevealed: mining smartphone users privacy perception on app markets. Comput. Secur. 83, 332–353 (2019). https://doi.org/ 10.1016/j.cose.2019.02.010 52. Herrmann, M., Klünder, J.: From textual to verbal communication: towards applying sentiment analysis to a software project meeting. In: International Requirements Engineering Conference Workshops (RE), pp. 371–376. IEEE, Piscataway (2021). https://doi.org/10.1109/ REW53955.2021.00065 53. Hogan, P.C., Irish, B.J., Hogan, L.P. (eds.): The Routledge Companion to Literature and Emotion. Routledge, London (2022) 54. 
van den Hooff, B., Schouten, A.P., Simonovski, S.: What one feels and what one knows: the influence of emotions on attitudes and intentions towards knowledge sharing. J. Knowl. Manag. 16(1), 148–158 (2012). https://doi.org/10.1108/13673271211198990 55. Huq, S.F., Sadiq, A.Z., Sakib, K.: Understanding the effect of developer sentiment on fixinducing changes: an exploratory study on github pull requests. In: 26th Asia-Pacific Software Engineering Conference, APSEC 2019, Putrajaya, Malaysia, December 2-5, 2019, pp. 514– 521. IEEE, Piscataway (2019). https://doi.org/10.1109/APSEC48747.2019.00075

56. Hwang, H.C., Matsumoto, D.: Measuring emotions in the face. In: Meiselman, H.L. (ed.) Emotion Measurement, pp. 125–144. Woodhead Publishing (2016). https://doi.org/10.1016/ B978-0-08-100508-8.00006-0 57. Iacob, C., Faily, S., Harrison, R.: MARAM: tool support for mobile app review management. In: International Conference on Mobile Computing, Applications and Services (MobiCASE), pp. 42–50. ACM/ICST (2016). https://doi.org/10.4108/eai.30-11-2016.2266941 58. Imtiaz, N., Middleton, J., Girouard, P., Murphy-Hill, E.R.: Sentiment and politeness analysis tools on developer discussions are unreliable, but so are people. In: International Workshop on Emotion Awareness in Software Engineering (SEmotion), pp. 55–61. ACM, New York (2018). https://doi.org/10.1145/3194932.3194938 59. Islam, M.R., Ahmmed, M.K., Zibran, M.F.: Marvalous: machine learning based detection of emotions in the valence-arousal space in software engineering text. In: Symposium on Applied Computing (SAC), pp. 1786–1793. ACM (2019). https://doi.org/10.1145/3297280. 3297455 60. Islam, M.R., Zibran, M.F.: Towards understanding and exploiting developers’ emotional variations in software engineering. In: International Conference on Software Engineering Research, Management and Applications (SERA), pp. 185–192. IEEE Computer Society (2016). https://doi.org/10.1109/SERA.2016.7516145 61. Islam, M.R., Zibran, M.F.: DEVA: sensing emotions in the valence arousal space in software engineering text. In: Symposium on Applied Computing (SAC), pp. 1536–1543. ACM, New York (2018). https://doi.org/10.1145/3167132.3167296 62. Islam, M.R., Zibran, M.F.: SentiStrength-SE: exploiting domain specificity for improved sentiment analysis in software engineering text. J. Syst. Softw. 145, 125–146 (2018). https:// doi.org/10.1016/j.jss.2018.08.030 63. Jacob-Dazarola, R., Ortíz Nicolás, J.C., Cárdenas Bayona, L.: Behavioral measures of emotion. In: Meiselman, H.L. (ed.) Emotion Measurement, pp. 101–124. Woodhead Publishing (2016). https://doi.org/10.1016/B978-0-08-100508-8.00005-9 64. Jansen, S., Finkelstein, A., Brinkkemper, S.: A sense of community: a research agenda for software ecosystems. In: International Conference on Software Engineering, pp. 187–190 (2009). https://doi.org/10.1109/ICSE-COMPANION.2009.5070978 65. Jiarpakdee, J., Ihara, A., Matsumoto, K.i.: Understanding question quality through affective aspect in Q&A site. In: International Workshop on Emotion Awareness in Software Engineering (SEmotion), pp. 12–17. ACM, New York (2016). https://doi.org/10.1145/2897000. 2897006 66. Jongeling, R., Sarkar, P., Datta, S., Serebrenik, A.: On negative results when using sentiment analysis tools for software engineering research. Empirical Softw. Eng. 22(5), 2543–2584 (2017). https://doi.org/10.1007/s10664-016-9493-x 67. Jurado, F., Rodríguez Marín, P.: Sentiment analysis in monitoring software development processes: an exploratory case study on GitHub’s project issues. J. Syst. Softw. 104, 82–89 (2015). https://doi.org/10.1016/j.jss.2015.02.055 68. Juslin, P.N., Sloboda, J.A. (eds.): Handbook of Music and Emotion: Theory, Research, Applications. Series in Affective Science. Oxford University Press (2010) 69. Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., German, D.M., Damian, D.: The promises and perils of mining GitHub. In: Working Conference on Mining Software Repositories (MSR), MSR 2014, pp. 92–101. ACM, New York (2014). https://doi.org/10. 1145/2597073.2597074 70. 
Kaur, A., Singh, A.P., Dhillon, G.S., Bisht, D.: Emotion mining and sentiment analysis in software engineering domain. In: International Conference on Electronics, Communication and Aerospace Technology (ICECA), pp. 1170–1173 (2018). https://doi.org/10.1109/ICECA. 2018.8474619 71. Kemper, T.D.: A Social Interactional Theory of Emotions. Wiley, New York (1978) 72. Khan, K.M., Saleh, M.: Understanding the impact of emotions on the quality of software artifacts. IEEE Access 9, 110194–110208 (2021). https://doi.org/10.1109/ACCESS.2021. 3102663

73. Lanovaz, M.J., Adams, B.: Comparing the communication tone and responses of users and developers in two R mailing lists: measuring positive and negative emails. IEEE Software 36(5), 46–50 (2019). https://doi.org/10.1109/MS.2019.2922949 74. de Lima Fontão, A., Ekwoge, O.M., dos Santos, R.P., Dias-Neto, A.C.: Facing up the primary emotions in mobile software ecosystems from developer experience. In: Workshop on Social, Human, and Economic Aspects of Software (WASHES), pp. 5–11. ACM, New York (2017). https://doi.org/10.1145/3098322.3098325 75. Lin, B., Cassee, N., Serebrenik, A., Bavota, G., Novielli, N., Lanza, M.: Opinion mining for software development: a systematic literature review. Trans. Softw. Eng. Methodol. 31(3), 1–41 (2022) 76. Lin, B., Zampetti, F., Bavota, G., Di Penta, M., Lanza, M., Oliveto, R.: Sentiment analysis for software engineering: how far can we go? In: International Conference on Software Engineering (ICSE), pp. 94–104. ACM, New York (2018). https://doi.org/10.1145/3180155. 3180195 77. Lungu, M.: Towards reverse engineering software ecosystems. In: International Conference on Software Maintenance (ICSM), pp. 428–431. IEEE, Piscataway (2008). https://doi.org/10. 1109/ICSM.2008.4658096 78. Manikas, K., Hansen, K.M.: Software ecosystems: a systematic literature review. J. Syst. Softw. 86(5), 1294–1306 (2013). https://doi.org/10.1016/j.jss.2012.12.026 79. Mankad, S., Hu, S., Gopal, A.: Single stage prediction with embedded topic modeling of online reviews for mobile app management. Ann. Appl. Stat. 12(4), 2279–2311 (2018) 80. Mäntylä, M.V., Novielli, N., Lanubile, F., Claes, M., Kuutila, M.: Bootstrapping a lexicon for emotional arousal in software engineering. In: International Conference on Mining Software Repositories (MSR), pp. 198–202. IEEE, Piscataway (2017). https://doi.org/10.1109/MSR. 2017.47 81. Meiselman, H.L.: Emotion Measurement, 1st edn. Woodhead Publishing (2016) 82. Messerschmitt, D.G., Szyperski, C.: Software ecosystem: understanding an indispensable technology and industry. MIT Press, Cambridge (2003) 83. Mitleton-Kelly, E.: Ten Principles of Complexity and Enabling Infrastructures, pp. 23–50. Pergamon (2003) 84. Mohammad, S.M.: Sentiment analysis: detecting valence, emotions, and other affectual states from text. In: Meiselman, H.L. (ed.) Emotion Measurement, pp. 201–237. Woodhead Publishing (2016). https://doi.org/10.1016/B978-0-08-100508-8.00009-6 85. Mondal, A.K., Rahman, M.M., Roy, C.K.: Embedded emotion-based classification of stack overflow questions towards the question quality prediction. In: International Conference on Software Engineering and Knowledge Engineering (SEKE), pp. 521–526. KSI Research Inc. and Knowledge Systems Institute Graduate School (2016). https://doi.org/10.18293/ SEKE2016-146 86. Müller, S.C., Fritz, T.: Stuck and frustrated or in flow and happy: sensing developers’ emotions and progress. In: International Conference on Software Engineering, pp. 688–699 (2015) 87. Muñoz, S., Araque, O., Llamas, A.F., Iglesias, C.A.: A cognitive agent for mining bugs reports, feature suggestions and sentiment in a mobile application store. In: International Conference on Big Data Innovations and Applications, pp. 17–24. IEEE, Piscataway (2018). https://doi.org/10.1109/Innovate-Data.2018.00010 88. Murgia, A., Ortu, M., Tourani, P., Adams, B., Demeyer, S.: An exploratory qualitative and quantitative analysis of emotions in issue report comments of open source systems. Empirical Softw. Eng. 23(1), 521–564 (2018). 
https://doi.org/10.1007/s10664-017-9526-0 89. Murgia, A., Tourani, P., Adams, B., Ortu, M.: Do developers feel emotions? An exploratory analysis of emotions in software artifacts. In: Working Conference on Mining Software Repositories (MSR), pp. 262–271. ACM, New York (2014) 90. Nasehi, S.M., Sillito, J., Maurer, F., Burns, C.: What makes a good code example? A study of programming Q&A in StackOverflow. In: International Conference on Software Maintenance (ICSM), pp. 25–34. IEEE, Piscataway (2012). https://doi.org/10.1109/ICSM.2012.6405249

91. Niedenthal, P.M., Ric, F.: Psychology of Emotion. Psychology Press, New York (2017) 92. Nizamani, Z.A., Liu, H., Chen, D.M., Niu, Z.: Automatic approval prediction for software enhancement requests. Autom. Softw. Eng. 25(2), 347–381 (2018). https://doi.org/10.1007/ s10515-017-0229-y 93. Norman, G.J., Necka, E., Berntson, G.G.: The psychophysiology of emotions. In: Meiselman, H.L. (ed.) Emotion Measurement, pp. 83–98. Woodhead Publishing (2016). https://doi.org/ 10.1016/B978-0-08-100508-8.00004-7 94. Novielli, N., Calefato, F., Dongiovanni, D., Girardi, D., Lanubile, F.: Can we use SE-specific sentiment analysis tools in a cross-platform setting? In: International Conference on Mining Software Repositories (MSR), pp. 158–168. ACM, New York (2020). https://doi.org/10.1145/ 3379597.3387446 95. Novielli, N., Calefato, F., Lanubile, F.: Towards discovering the role of emotions in Stack Overflow. In: International Workshop on Social Software Engineering (SSE), pp. 33–36. ACM, New York (2014). https://doi.org/10.1145/2661685.2661689 96. Novielli, N., Calefato, F., Lanubile, F.: The challenges of sentiment detection in the social programmer ecosystem. In: International Workshop on Social Software Engineering (SSE), pp. 33–40. ACM, New York (2015). https://doi.org/10.1145/2804381.2804387 97. Novielli, N., Calefato, F., Lanubile, F.: A gold standard for emotion annotation in Stack Overflow. In: International Conference on Mining Software Repositories (MSR), pp. 14–17. ACM, New York (2018). https://doi.org/10.1145/3196398.3196453 98. Novielli, N., Calefato, F., Lanubile, F.: Love, joy, anger, sadness, fear, and surprise: SE needs special kinds of AI: a case study on text mining and SE. IEEE Softw. 37(3), 86–91 (2020). https://doi.org/10.1109/MS.2020.2968557 99. Novielli, N., Calefato, F., Lanubile, F., Serebrenik, A.: Assessment of off-the-shelf SEspecific sentiment analysis tools: an extended replication study. Empirical Softw. Eng. 26(4), 77 (2021). https://doi.org/10.1007/s10664-021-09960-w 100. Novielli, N., Grassi, D., Lanubile, F., Serebrenik, A.: Sensor-based emotion recognition in software development: facial expressions as gold standard. In: International Conference on Affective Computing and Intelligent Interaction (ACII) (2022). https://doi.org/10.1109/ ACII55700.2022.9953808 101. Nugroho, Y.S., Islam, S., Nakasai, K., Rehman, I., Hata, H., Kula, R.G., Nagappan, M., Matsumoto, K.: How are project-specific forums utilized? A study of participation, content, and sentiment in the Eclipse ecosystem. Empirical Softw. Eng. 26(6), 132 (2021). https://doi. org/10.1007/s10664-021-10032-2 102. Obaidi, M., Klünder, J.: Development and application of sentiment analysis tools in software engineering: a systematic literature review. In: Evaluation and Assessment in Software Engineering (EASE), pp. 80—89. ACM, New York (2021). https://doi.org/10.1145/3463274. 3463328 103. Obaidi, M., Nagel, L., Specht, A., Klünder, J.: Sentiment analysis tools in software engineering: a systematic mapping study. Inform. Softw. Technol. 151, 107018 (2022). https://doi.org/ 10.1016/j.infsof.2022.107018 104. Ortu, M., Adams, B., Destefanis, G., Tourani, P., Marchesi, M., Tonelli, R.: Are bullies more productive? Empirical study of affectiveness vs. issue fixing time. In: Working Conference on Mining Software Repositories (MSR), pp. 303–313. IEEE, Piscataway (2015). https://doi. org/10.1109/MSR.2015.35 105. 
Ortu, M., Hall, T., Marchesi, M., Tonelli, R., Bowes, D., Destefanis, G.: Mining communication patterns in software development: a GitHub analysis. In: International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE), pp. 70–79. ACM, New York (2018). https://doi.org/10.1145/3273934.3273943 106. Ortu, M., Marchesi, M., Tonelli, R.: Empirical analysis of affect of merged issues on GitHub. In: International Workshop on Emotion Awareness in Software Engineering (SEmotion), pp. 46–48. IEEE / ACM (2019). https://doi.org/10.1109/SEmotion.2019.00017 107. Ortu, M., Murgia, A., Destefanis, G., Tourani, P., Tonelli, R., Marchesi, M., Adams, B.: The emotional side of software developers in JIRA. In: International Conference on Mining

Software Repositories (MSR), pp. 480–483. ACM, New York (2016). https://doi.org/10.1145/ 2901739.2903505 108. Parrott, W.G.: Emotions in Social Psychology: Essential Readings. Psychology Press (2001) 109. Plamper, J.: The History of Emotions. Oxford University Press, Oxford (2012) 110. Pletea, D., Vasilescu, B., Serebrenik, A.: Security and emotion: sentiment analysis of security discussions on GitHub. In: Working Conference on Mining Software Repositories (MSR), pp. 348–351. ACM, New York (2014). https://doi.org/10.1145/2597073.2597117 111. Plutchik, R.: Outlines of a new theory of emotion. Trans. N.Y. Acad. Sci. 20(5), 394–403 (1958) 112. Powell, P.B.: Planning for software validation, verification, and testing. Tech. Rep. 98, US Department of Commerce, National Bureau of Standards (1982) 113. Rigby, P.C., Hassan, A.E.: What can OSS mailing lists tell us? A preliminary psychometric text analysis of the Apache developer mailing list. In: International Workshop on Mining Software Repositories (MSR), pp. 23–23 (2007). https://doi.org/10.1109/MSR.2007.35 114. Russell, J.: Culture and the categorization of emotions. Psychol. Bull. 110(3), 426–450 (1991) 115. Sánchez-Gordón, M., Colomo-Palacios, R.: Taking the emotional pulse of software engineering: a systematic literature review of empirical studies. Inform. Softw. Technol. 115, 23–43 (2019). https://doi.org/10.1016/j.infsof.2019.08.002 116. Santos, M.F., Caetano, J.A., Oliveira, J., Neto, H.T.M.: Analyzing the impact of feedback in GitHub on the software developer’s mood. In: International Conference on Software Engineering and Knowledge Engineering (SEKE), pp. 445–444 (2018). https://doi.org/10. 18293/SEKE2018-153 117. Schneider, K., Klünder, J., Kortum, F., Handke, L., Straube, J., Kauffeld, S.: Positive affect through interactions in meetings: the role of proactive and supportive statements. J. Syst. Softw. 143, 59–70 (2018). https://doi.org/10.1016/j.jss.2018.05.001 118. SentiSE. https://github.com/amiangshu/SentiSE 119. Shaver, P., Schwartz, J., Kirson, D., O’connor, C.: Emotion knowledge: further exploration of a prototype approach. J. Pers. Soc. Psychol. 52(6), 1061–1066 (1987) 120. Singh, N., Singh, P.: How do code refactoring activities impact software developers’ sentiments? An empirical investigation into GitHub commits. In: Asia-Pacific Software Engineering Conference (APSEC), pp. 648–653. IEEE, Piscataway (2017). https://doi.org/ 10.1109/APSEC.2017.79 121. Sinha, V., Lazar, A., Sharif, B.: Analyzing developer sentiment in commit logs. In: International Conference on Mining Software Repositories (MSR), pp. 520–523. ACM (2016). https://doi.org/10.1145/2901739.2903501 122. Solomon, R.C.: The Philosophy of Emotions, 3rd edn., pp. 3–16. Handbook of Emotions. The Guilford Press (2008) 123. Souza, R.R.G., Silva, B.: Sentiment analysis of Travis CI builds. In: International Conference on Mining Software Repositories (MSR), pp. 459–462. IEEE, Piscataway (2017). https://doi. org/10.1109/MSR.2017.27 124. Storey, M.A.: The evolution of the social programmer. In: Working Conference on Mining Software Repositories (MSR). IEEE (2012) 125. Sun, K., Shi, X., Gao, H., Kuang, H., Ma, X., Rong, G., Shao, D., Zhao, Z., Zhang, H.: Incorporating pre-trained transformer models into TextCNN for sentiment analysis on software engineering texts. In: Asia-Pacific Symposium on Internetware, pp. 127–136. ACM, New York (2022). https://doi.org/10.1145/3545258.3545273 126. 
Thelwall, M., Buckley, K., Paltoglou, G.: Sentiment strength detection for the social web. J. Am. Soc. Inf. Sci. Technol. 63(1), 163–173 (2012) 127. Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., Kappas, A.: Sentiment in short strength detection informal text. J. Am. Soc. Inf. Sci. Technol. 61(12), 2544–2558 (2010) 128. Thung, F., Bissyandé, T.F., Lo, D., Jiang, L.: Network structure of social coding in github. In: European Conference on Software Maintenance and Reengineering (CSMR), pp. 323–326 (2013). https://doi.org/10.1109/CSMR.2013.41

129. Tourani, P., Adams, B.: The impact of human discussions on just-in-time quality assurance: an empirical study on OpenStack and Eclipse. In: International Conference on Software Analysis, Evolution, and Reengineering (SANER), pp. 189–200. IEEE, Piscataway (2016). https://doi.org/10.1109/SANER.2016.113 130. Tourani, P., Jiang, Y., Adams, B.: Monitoring sentiment in open source mailing lists: exploratory study on the Apache ecosystem. In: International Conference on Computer Science and Software Engineering (CASCON), pp. 34–44. IBM / ACM (2014) 131. Uddin, G., Guéhénuc, Y.G., Khomh, F., Roy, C.K.: An empirical study of the effectiveness of an ensemble of stand-alone sentiment detection tools for software engineering datasets. Trans. Softw. Eng. Methodol. 31(3) (2022). https://doi.org/10.1145/3491211 132. Uddin, G., Khomh, F.: Automatic mining of opinions expressed about APIs in Stack Overflow. Trans. Softw. Eng. 1 (2019). https://doi.org/10.1109/TSE.2019.2900245 133. Umer, Q., Liu, H., Sultan, Y.: Sentiment based approval prediction for enhancement reports. J. Syst. Softw. 155, 57–69 (2019). https://doi.org/10.1016/j.jss.2019.05.026 134. Vasilescu, B., Posnett, D., Ray, B., van den Brand, M.G., Serebrenik, A., Devanbu, P., Filkov, V.: Gender and tenure diversity in GitHub teams. In: Conference on Human Factors in Computing Systems (CHI), pp. 3789–3798. ACM, New York (2015). https://doi.org/10.1145/ 2702123.2702549 135. Vrzakova, H., Begel, A., Mehtätalo, L., Bednarik, R.: Affect recognition in code review: an insitu biometric study of reviewer’s affect. J. Syst. Softw. 159 (2020). https://doi.org/10.1016/j. jss.2019.110434 136. Wälde, K., Moors, A.: Current emotion research in economics. Emot. Rev. 9(3), 271–278 (2017). https://doi.org/10.1177/1754073916665470 137. Watson, D., Clark, L.A., Tellegen, A.: Development and validation of brief measures of positive and negative affect: the PANAS scales. J. Pers. Soc. Psychol. 54(6), 1063–1070 (1988) 138. Werder, K.: The evolution of emotional displays in open source software development teams: an individual growth curve analysis. In: International Workshop on Emotion Awareness in Software Engineering (SEmotion), pp. 1–6. ACM, New York (2018). https://doi.org/10.1145/ 3194932.3194934 139. Werder, K., Brinkkemper, S.: MEME: toward a method for emotions extraction from GitHub. In: International Workshop on Emotion Awareness in Software Engineering (SEmotion), pp. 20–24. ACM, New York (2018). https://doi.org/10.1145/3194932.3194941 140. Williams, G., Mahmoud, A.: Analyzing, classifying, and interpreting emotions in software users’ tweets. In: International Workshop on Emotion Awareness in Software Engineering (SEmotion), pp. 2–7 (2017). https://doi.org/10.1109/SEmotion.2017.1 141. Wurzel Gonçalves, P., Çalikli, G., Bacchelli, A.: Interpersonal conflicts during code review: developers’ experience and practices. Proc. ACM Hum.-Comput. Interact. 6(CSCW1) (2022). https://doi.org/10.1145/3512945 142. Yang, B., Wei, X., Liu, C.: Sentiments analysis in GitHub repositories: an empirical study. In: Asia-Pacific Software Engineering Conference Workshops (APSEC Workshops), pp. 84–89. IEEE, Piscataway (2017). https://doi.org/10.1109/APSECW.2017.13 143. Zhang, T., Xu, B., Thung, F., Haryono, S.A., Lo, D., Jiang, L.: Sentiment analysis for software engineering: how far can pre-trained transformer models go? In: International Conference on Software Maintenance and Evolution (ICSME), pp. 70–80. IEEE, Piscataway (2020). 
https://doi.org/10.1109/ICSME46990.2020.00017

Part III

Evolution Within Software Ecosystems

Chapter 6

Analyzing Variant Forks of Software Repositories from Social Coding Platforms
John Businge, Mehrdad Abdi, and Serge Demeyer

Abstract With the rise of social coding platforms that rely on distributed version control systems, software reuse is also on the rise. Through the provision of explicit facilities to share code, like pull requests, cherry-picking, and traceability links, social coding platforms have popularized forking (also referred to as “clone-and-own”). Two types of forks exist: (i) social forks, which are created for isolated development to fix a bug, add a feature, or perform a refactoring, and are then merged back into the original project, and (ii) variant forks, which are created by splitting off a new development branch to steer development into a new direction while leveraging the code of the mainline project. The literature has extensively investigated social forks on social coding platforms, but there are limited studies on variant forks. However, a few studies have revealed that variant forking is quite prevalent on social coding platforms. Furthermore, these studies have revealed that with an increasing number of variants of the original project, development becomes redundant and maintenance efforts grow rapidly. For example, if a bug is discovered and fixed in one variant, it is often unclear which other variants in the same family are affected by the same bug and how it should be fixed in these variants. In this chapter, our focus is on variant forks in the social coding era. First, we discuss studies that have investigated variant forks both before and after the emergence of social coding platforms. Next, we identify challenges with the parallel maintenance of variant forks and research directions that can possibly provide support.

J. Businge ()
University of Nevada Las Vegas, Las Vegas, NV, USA
e-mail: [email protected]
M. Abdi · S. Demeyer
University of Antwerp, Antwerp, Belgium
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
T. Mens et al. (eds.), Software Ecosystems, https://doi.org/10.1007/978-3-031-36060-2_6

6.1 Introduction

With the rise of social coding platforms such as GitHub, Bitbucket, and GitLab, software reuse through forking has also become very popular. As of November 2022, GitHub, the most popular social coding platform, registered over 285M+ repositories with over 94M+ developers and 3.5B+ total contributions to all projects [28]. A developer interested in contributing to a project hosted on GitHub may participate by creating a fork. Forks created with the aim of contributing back are known as social forks [76]. In the literature, social forking has been studied extensively (e.g., [5, 19, 22, 29, 30, 41, 53, 61, 68, 70]). In contrast, variant forks are created by splitting off a new development branch to steer development into a new direction while leveraging the code of the mainline project [14]. Several studies have investigated variant forking in the context of open-source software projects [9, 11, 14, 16, 23, 26, 27, 43–46, 54–56, 64, 69]. However, most of these studies were carried out on SourceForge, pre-dating the advent of social coding platforms like GitHub, with very few carried out during the social coding era [11, 13–15, 54, 57, 64, 76]. Most of the open-source fork variants on social coding platforms are created through the ad hoc development paradigm of “clone-and-own” [15]. This development paradigm is a commonly adopted approach for developing multivariant software systems, where a new variant of a software system is created by copying and adapting an existing one, and the two continue to evolve in parallel [11, 14, 15, 34, 63]. As a result, two or more software projects will share a common code base as well as independent, project-specific code. With an increasing number of variants, clone-and-own has known drawbacks [8, 24, 48, 62, 73]. Most notably, if a bug is discovered and fixed in one variant, it is often unclear which other variants are affected by the same bug and how this bug should be fixed in these variants. In this book chapter, our focus is on variant forking in the social coding era. First, we provide an overview of the state of the art, before and after the emergence of social coding platforms. In particular, we investigate the motivations for adopting variant forks, despite the known drawbacks of the clone-and-own approach. Knowing that the phenomenon of variant forks will remain a prevalent practice on social coding platforms such as GitHub, Bitbucket, and GitLab, we explain how to mine such platforms to identify the relevant variant forks in the sea of social forks. We illustrate the practical use of such a mining algorithm with a study of diverging variants. We conclude with an overview of the challenges created by the parallel maintenance of variant forks and research directions that may result in tool support for managing the shared code inherent in variant forks.

6.2 State of the Art

Robles and González-Barahona [56] carried out a comprehensive study on a carefully filtered list of 220 potential forks referenced on Wikipedia. They report motivations and outcomes for forking on these 220 projects. We summarize the most interesting categories below.
• Technical (addition of functionality). Sometimes developers want to include new functionality into the project, but the main developer(s) do not accept the contribution. An example is Poppler, a fork of xpdf relying on the poppler library [56].
• Governance disputes. Some contributors from the community create a variant project because they feel that their feedback is not heard or because the maintainers of the mainline are unresponsive or too slow at accepting their patches. A well-known example is a fork of GNU Emacs (originally Lucid), which was created as a result of the significant delays in bringing out a new version to support the Energize C++ IDE [71].
• Legal issues. This includes disagreements on the license and trademarks and changes to conform to rules and regulations. An example is X.Org, which originated from XFree86 [56, 71]. XFree86 was originally distributed under the MIT/X open-source license, which is GPL compatible, and was then changed to one that was not GPL compatible. This caused many practical problems and a serious uproar in the community, resulting in the project fork X.Org.
• Personal reasons. In some situations, the developer team disagrees on fundamental issues (beyond mere technical matters) related to the software development process and the project. An example is the OpenBSD fork from NetBSD. One of the developers of NetBSD had a disagreement with the rest of the core developers and decided to fork and focus his efforts on OpenBSD [21].
However, most of these studies were carried out on SourceForge, which was the dominant platform for sharing code in open-source projects a decade ago. At that time, there was substantial controversy around variant forks, as they signified a split in the open-source community [9, 16, 23, 26, 43, 45, 55]. This has changed with the advent of social coding platforms centered around git, such as GitHub, Bitbucket, and GitLab. Indeed, Jiang et al. [31] argue that although forking may have been controversial in the open-source community, it is now encouraged as a built-in feature on GitHub. They conclude that developers create social forks of repositories to submit pull requests, fix bugs, and add new features. Businge et al. [11] focused on variant forks in the Android ecosystem and found that rebranding, simple customizations, feature extension, and implementation of different but related features are the main motivations to create forks of Android apps. These motivations can be seen as refinements of the framework of Robles and González-Barahona [56], fitting within the categories “Technical,” “Legal,” and “Personal.” Zhou et al. [76] (based on interviews with 18 developers of variant forks on GitHub) go one step further.

They confirm that the negative connotation concerning forks has completely changed with the advent of GitHub and conclude that most variant forks start as social forks. Based on the known problems with clone-and-own approaches, a handful of studies investigated the interaction between variant forks and mainlines. Stanciulescu et al. [62] investigated forking practices in Marlin, an open-source 3D printer firmware project hosted on GitHub. The authors report that most variant forks do not retrieve new updates from the main Marlin repository that is hosted on GitHub. Sung et al. [64] investigated variant forks in an industrial case study to uncover the implications of frequent merges from the mainline and the resulting merge conflicts in the variant forks. They implemented a tool that can automatically resolve up to 40% of eight types of mainline-induced build breaks. Many respondents indicated being interested in coordination across repositories, either for eventually merging changes back into the mainline or for monitoring activity in the mainline repository so that they can select and integrate interesting updates into their variant project. Businge et al. [15] also investigated the interaction between mainlines and variants. The authors quantitatively investigated code propagation among variants and their mainline in three software ecosystems. They found that only about 11% of the 10,979 mainline-variant pairs had integrated code between them. Since the mainlines and variants share a common code base, and given the collaborative maintenance facilities of git and the pull-based development model, one would expect more interactions between the mainline and its variants. Apparently, there are impediments (some of them social, some of them technical) that restrict such interactions.
Summary. Forking is a prominent feature of git and hence actively encouraged on social coding platforms. Most forks are created with the aim of creating pull requests to be merged back into the mainline and then dissolve; these are known as social forks. Nevertheless, many forks (named variant forks) actually split off a new development branch to steer development into a new direction while leveraging the code of the mainline. There are clear motivations for creating variant forks, so one would expect that developers exploit the opportunity of having a shared code base. However, there appear to be social as well as technical factors that hinder sharing code between variants.

6.3 Motivations for Variant Forking on Social Coding Platforms

Businge et al. [14] conducted a survey of 105 maintainers of different active open-source variant projects hosted on GitHub. The authors refined the categories of why variant forks are created on social coding platforms (technical, governance, legal, and others) and identified subcategories for each of them. Below we discuss these categories, providing concrete examples from variant forks identified on GitHub.

Throughout the chapter, we shall refer to a repository on GitHub as [owner]/[repoName] and omit the prefix https://www.github.com/.

6.3.1 Technical

The technical reasons for creating variants usually relate to fork variant developers wanting to include new functionality into the project, which may be uninteresting to the developers of the original project. The technical reasons can be subdivided into these categories:
• Different goal/content: Variants can be created to address a different goal from the original project. For example, the fork variant PIVX-Project/PIVX and the original variant dashpay/dash have different goals. They are both cryptocurrency applications, but the fork variant introduced a completely different monetary policy, called PIVX,1 from the original, called Dash.2
• New features: Variants are created to add new features that are not interesting to the original. For example, the fork variant ppy/osuTK adds the features of .NET Standard, iOS, and Android support, which are not supported in the original opentk/opentk.
• Customization: The variant developers perform some customizations that fit the purpose of their variant. For example, the fork variant facile-it/symfony-functional-testcase is a slimmed-down variant of liip/LiipFunctionalTestBundle. The variant developer wrote in the Readme.md file: Forked (and slimmed down) from liip/LiipFunctionalTestBundle. More information can be found in the Readme.md file of the fork variant.3
• Unmaintained/frozen feature: The fork variant takes over maintenance of a feature that the original has frozen. For example, the original project jazzband/pip-tools shifted from providing one set of features (pip-review + pip-dump) to a different, disjoint set of features (pip-compile + pip-sync). The fork variant lifted the pip-review feature into a separate project, jgonggrijp/pip-review. Both the original and the fork variant are complementary. More information about why the fork variant was created can be found in the issue description4 and the Readme.md file.5

1 https://pivx.org/.
2 https://www.dash.org/.
3 https://github.com/facile-it/symfony-functional-testcase/blob/1.x/README.md.
4 https://github.com/jazzband/pip-tools/issues/185.
5 https://github.com/jgonggrijp/pip-review#origins.

6.3.2 Governance

The governance reasons for creating variants usually relate to original developers being unresponsive to contributions or refusing to accept features from the public. Therefore, we further categorize these reasons into “feature acceptance” and “responsiveness.”
• Feature acceptance: The project starts as a social fork to develop a new feature or a bug fix. The developer proposes integrating the new feature or bug fix into the original project when it is fully developed. However, when the developer submits the contribution in the form of a pull request, for some reason, the original project owners are unwilling to accept it. As an example, a developer created the fork fqborges/react-native-maps in July 2017 from the original react-native-maps/react-native-maps and introduced new features. In June 2018, the fork developer submitted a pull request (number 2348) to integrate the changes [52]. Fourteen contributors discussed the pull request in 45 comments. On March 04, 2020, the maintainer of the original project closed the pull request with the message closing as being an old issue and not a clear reply. Thanks for all the effort. On March 13, 2020, the fork developer followed with the following message: For all those interested, I published this work as https://www.npmjs.com/package/react-native-maps-osmdroid. Still experimental though, it works perfectly to my use case, [. . . ]. From this communication, it is clear that the maintainers of the original are not ready to accept the feature, yet the feature is important for the fork developer.
• Responsiveness: The project also starts as a social fork to develop a feature or a bug fix. When the fork developer submits the pull request, the maintainers of the original project are not responsive for a lengthy period of time. An example is the fork variant jquast/blessed of the original erikrose/blessings. The fork developer submitted a pull request on April 15, 2015 [7]. The fork and original developers discussed the pull request in 11 comments. On August 18, 2015, the fork developer commented: I do not envision this pull request being merged at this rather dull cadence, and I have many plans to continue contributions. [. . . ] I will continue most efforts as the “blessed” fork, [. . . ]. Currently, the original seems obsolete, with major updates last made on October 24, 2018. The fork project was still being maintained and active at the time of writing.

6.3.3 Legal

The legal reasons for creating a forked variant include disagreements on the license and trademarks and changes to conform to rules and regulations. An example of fork variant creation as a result of legal reasons is a set of tools, wintercms,6 which were forked from the original, octobercms.7 A blog post by the fork variant maintainers reports that October CMS has moved to become a locked down paid platform and is no longer free or open source. Winter CMS will always be a free, open-source, community-driven content management framework.8 The fork variant maintainers further report that the fork is a result of a distinct lack of communication between the founders and the maintenance team [. . . ], as well as a general lack of engagement by the founders. Another blog post reports on the shift to the paid platform by the maintainers of the original project, October CMS.9

6 https://github.com/wintercms.

6.3.4 Other Categories

This category is further subdivided into two main groups: (1) supporting the original and (2) supporting personal projects.
• Supporting the original: The fork variant is created to support the activities of the original project. An example is the forked repository koppor/jabref, which is used to collect issues for the original project JabRef/jabref to avoid flooding the original with issues.10
• Supporting personal projects: A developer may fork the original project and extend it with new features. The fork is then reused in another project maintained by the fork developer. By creating the fork variant, the developer safeguards themselves from unfavorable actions by the maintainers of the original project. For example, if the original repository is deleted from the social coding platform, the forked repository will still be available [14].
Summary. Many variant forks start as social forks. The decisions to create and maintain variants include technical (concerning diverging features), governance (concerning diverging interests), and legal (concerning diverging licenses) reasons. In addition, there are other categories, such as supporting the original project and supporting personal projects. Figure 6.1 gives a quantitative overview of the different categorizations.

7 https://github.com/octobercms.
8 https://wintercms.com/blog/post/we-have-forked-october-cms.
9 https://octobercms.com/blog/post/october-cms-moves-become-paid-platform.
10 https://github.com/koppor/jabref/blob/about/README.md.

Fig. 6.1 Sankey diagram summarizing the detailed motivations behind creating variant forks

6.4 Mining Variant Forks on GitHub

This section explores possible definitions of what constitutes a variant fork and how one can identify variant forks on a social coding platform. Note that these definitions are orthogonal to the motivations for forking listed in Sect. 6.3. We also list the challenges faced in co-evolving divergent variants on a social coding platform. Finally, despite the shortcomings of the clone-and-own development paradigm, we discuss possible reasons why developers prefer maintaining variants using clone-and-own instead of the more systematic development paradigm of a software product line.

6.4.1 The Different Types of Variant Forks

Variant forks can be further subdivided into forge variant forks and intrinsic variant forks. We borrow the terms "forge" and "intrinsic" from the study of Pietri et al. [49] on repository forks.

• Forge variant forks: Two repositories, A (the original repository) and B (the forked repository), are forge variants if they are hosted on the same social coding platform P (e.g., GitHub, GitLab, or Bitbucket), where B has been created with an explicit "fork repository A" action on platform P. The two


variants A and B contain traceability links, and contributions may be sent from one variant to the other.
• Intrinsic variant forks: These are variants that do not have a traceability link with the original. They are created by manually cloning the original project and then pushing it as an independent repository. In some cases, variant forks may even be cloned from one social coding platform and pushed to another. For example, a variant might be cloned from GitHub and pushed to GitLab, with the two continuing to co-evolve in parallel.
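As a small illustration of the traceability link, the following Python sketch queries the GitHub REST API for a repository's metadata: forge forks carry a fork flag and a parent entry pointing to the original, whereas intrinsic forks do not. The repository names used in the comment are placeholders.

```python
import requests

def forge_fork_parent(owner: str, repo: str):
    """Return the original repository if owner/repo is a forge fork on GitHub,
    or None if no traceability link exists (e.g., an intrinsic fork)."""
    url = f"https://api.github.com/repos/{owner}/{repo}"
    data = requests.get(url, timeout=30).json()
    if data.get("fork"):
        return data["parent"]["full_name"]  # explicit link to the original repository
    return None

# Example (placeholder repository name):
# print(forge_fork_parent("some-user", "some-variant"))
```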

6.4.2 How to Mine Variant Forks?

Variant forks can be lost in a sea of social forks (in the case of forge variant forks) [10, 12] or in a pool of student or experimental projects (in the case of intrinsic variant forks). To precisely identify real variant forks, one has to employ some heuristics. Below we describe how forge variant forks and intrinsic variant forks may be identified precisely:

• Forge variant forks: Businge et al. [14, 15] collected forge variant forks from software ecosystems whose repositories are hosted on GitHub. Developers usually host their projects' source code on social coding platforms and distribute their package releases through package managers. For example, projects in the JavaScript, Java, and Python ecosystems have their package releases distributed on the npm, Maven, and pip package managers, respectively. The authors assume that if the original and the forked repository both have their package releases distributed on a package manager, then they are variants of each other. Figure 6.2 illustrates how a family of forge variants

Fig. 6.2 An illustration of a family of forge variants


can be identified. We can see that a repository written in the Java programming language is hosted on GitHub. The original and two of its forked repositories have their packages hosted on the Maven package manager. We are certain that the original and these two forked repositories are variants of each other, since they are maintained on GitHub and have their packages distributed on the Maven package manager. The three repositories form a software family of variants. Since Fork1, Fork3, and Fork4 do not have their package releases distributed on Maven, we cannot confirm that they are variant forks.
Variant forks and the original projects can be mined from Libraries.io. Libraries.io continuously monitors open-source packages distributed across numerous package managers. In addition to the metadata for a specific package on a given package manager, Libraries.io extends the package metadata with more information from GitHub. For example, Libraries.io stores a boolean Fork field, which indicates whether the corresponding repository of a package is a fork. Such a field can help one identify forked repositories that have published their packages (a mining sketch follows at the end of this subsection).
• Intrinsic variant forks: These are a subset of intrinsic forks. First, let us explain how one might identify intrinsic forks. Recall that intrinsic forks are cloned and manually pushed back into a social coding platform. Pietri et al. [49] report that intrinsic forks can be determined by identifying repositories with shared commits or root directories. As for forge variant forks, one can precisely identify intrinsic variant forks by ensuring that all the variants in a family have their packages distributed on a package manager. Since projects can be cloned from one social coding platform and manually pushed to another, studying how the different categories of intrinsic variants (hosted on the same or on different platforms) co-evolve would be interesting.
As of September 2022, Libraries.io logs over 32 package managers, 3 social coding platforms (GitHub, Bitbucket, and GitLab), 36 different programming languages, and over 2.7M unique open-source packages. As a result, researchers can mine different categories of variant families to study their co-evolution and learn about challenges and opportunities.
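As a rough illustration, the following Python sketch filters a Libraries.io open data export for packages whose repository is flagged as a fork and that have published at least one release, yielding candidate forge variant forks. The file name and the column labels ("Repository Fork?", "Versions Count", "Repository Name with Owner") are assumptions about the CSV layout and may differ between data dump versions.

```python
import pandas as pd

# Libraries.io open data export (file and column names are assumptions and may vary by dump version)
projects = pd.read_csv("projects_with_repository_fields.csv", low_memory=False)

# Candidate forge variant forks: the package's repository is a fork on the platform
# and the package has at least one release distributed on a package manager.
# (Depending on the dump, the fork column may hold booleans or "true"/"false" strings.)
candidates = projects[
    (projects["Repository Fork?"] == True)
    & (projects["Versions Count"] > 0)
]

print(candidates[["Platform", "Name", "Repository Name with Owner"]].head())
```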

6.4.3 What Are Divergent Variants?

Divergent variants are a subset of the variants in a family that no longer synchronize commits between themselves, yet continue to evolve [54]. Let us define the relevant terminology [54]:

• current_date. The date when the variant pairs are analyzed.
• divergence_date. The date after the last synchronization of the variants.
• hunk. A hunk is a grouping of differing lines between two versions of a file [17]. A hunk header is written in the format @@ -l,s +l,s @@, with l the starting line number and s the number of lines the change applies to for each respective file, - indicating the original file and + indicating the new (modified) file (see the sketch after this list).


Fig. 6.3 Illustration of the patch classification from source to target variant (Adapted from Ramkisoen et al. [54])

• buggy file. A file containing buggy lines before the pull request that fixes the bug is created.
• patched files. Files that are integrated back into the main development branch when the pull request is integrated, with the buggy lines removed and new ones added.
• diff_file. The file resulting from applying the diff tool [65] to the buggy and the patched file. It contains both the lines removed from the buggy file and the lines added in the patched file.
• patch. A patch is a collection of one or more diff_files. In this chapter, we specifically refer to a patch when this collection of diff_files stems from a pull request that was created to fix a bug.
• git_head file. The latest version of a file, retrieved at the git_head on the current_date in the main branch of the target variant.
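To make the hunk and diff_file notions concrete, the following minimal Python sketch uses the difflib module [65] to diff a hypothetical buggy file against its patched version; the file contents are invented for illustration.

```python
import difflib

# Hypothetical buggy and patched versions of a file "foo.py"
buggy = ["def connect(host):\n",
         "    return open_socket(host, 80)\n"]
patched = ["def connect(host, port=443):\n",
           "    return open_socket(host, port)\n"]

# The diff_file: removed lines (prefixed with -) and added lines (prefixed with +),
# grouped into hunks whose headers follow the @@ -l,s +l,s @@ format.
diff_file = list(difflib.unified_diff(buggy, patched,
                                      fromfile="buggy/foo.py",
                                      tofile="patched/foo.py"))
print("".join(diff_file))
# The single hunk header printed here is "@@ -1,2 +1,2 @@".
```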


Figure 6.3 illustrates a divergent variant pair: variant1 (source) and variant2 (target). variant2 was born at a point in the history of variant1, at the fork_date. This implies that variant2 inherits all the commits of variant1 up to that point. Between the fork_date and the divergence_date, variant1 and variant2 synchronized commits with each other and were even. After the divergence_date, neither variant1 nor variant2 synchronized commits up to the current_date. This implies that the commits in each branch after the divergence_date are unique to that branch. Let us assume that the developer of variant1 identified a bug at the second commit after the divergence_date. The developer then decided to create a social fork or a bug-fixing branch of the source repository, patched the buggy file foo, and thereafter integrated the patch back into the main branch of the source repository using a pull request. Four scenarios are possible for the commit at git_head in variant2:

1. The variant2 developer patched the buggy foo in one of the commits preceding the current commit. In this case, there is effort duplication (ED).
2. The git_head commit still contains the buggy foo. In this case, variant2 has a missed opportunity, and we can even compute for how long the target branch has missed the patch by calculating the distance between the patch integration date and the git_head date.
3. The git_head commit contains a version of foo that has both patched and buggy lines of code. In this case, there is both effort duplication and a missed opportunity. This may happen when a developer fixes one part of the bug while being unaware that the bug is spread across other parts of the project. This is a split case.
4. The variant2 developer has completely customized foo to suit her needs, to the extent that neither the buggy nor the patched foo in the upstream is comparable to foo at the git_head in variant2. This case is uninteresting.

Figure 6.4 presents a detailed illustration of a patch applied in the source variant1 and missed in the target variant2. The patch fixes a bug in three files, foo, bar, and lot, in variant1 after the divergence_date. In a social fork, the bug in the three files is fixed and merged back into the upstream and its history. When one checks the same three files in variant2, they are observed to be the same as the files that contained the bug at the first commit in variant1 after the divergence_date. This is a case of a missed opportunity, since the developers of variant2 still have the bug, which was already fixed in variant1 at an earlier date.

Summary. Identifying variant forks in a sea of student or experimental projects can be challenging. To precisely identify real variant forks, one has to employ some heuristics. One way to mine real variant forks is to ensure that both the original and the fork variant have their package releases distributed to package managers. Package release distribution indicates that the repositories are products ready to be used by other developers.

Fig. 6.4 Illustration of missed opportunity from source to target variant (Adapted from Ramkisoen et al. [54])


6.5 Challenges of Maintaining Variant Forks

From Sect. 6.3, we deduce that there is a diverse set of motivations for adopting variant forks, despite the known drawbacks of the clone-and-own approach. Knowing that the phenomenon of variant forks will remain a prevalent practice, Sect. 6.4 explains how to mine social coding platforms to identify the relevant variant forks in the sea of social forks. Based on these observations, we list the challenges that will continue to torment project families resulting from variant forks.

Challenge 1. Employing the ad hoc clone-and-own development paradigm to maintain variants. There are two well-known paradigms for developing multi-variant projects: software product lines and clone-and-own. A software product line consists of a set of similar software products (i.e., variants) with well-defined commonalities and variabilities [3, 18, 35, 51, 67]. A developer can create a new variant in a software product line by simply turning a set of features on or off. The software product line strategy scales easily to many variants but is often difficult to adopt, as it requires a significant upfront investment of time and money [36, 37]. With clone-and-own, a new variant of a software system is created by copying and adapting an existing one, and the two then continue to evolve independently in parallel [15, 76]. As a result, two or more software projects will share a common code base alongside independent, project-specific code. However, with an increasing number of variants in the family, development becomes redundant and maintenance efforts grow rapidly [8, 24, 48, 73]. For example, if a bug is discovered and fixed in one variant, it is often unclear which other variants in the family are affected by the same bug and how this bug should be fixed in those variants. Moreover, in a recent study [54], Ramkisoen et al. observed several variant pairs where a common bug was fixed in one variant but not in the other (a missed opportunity). The authors also report several variant pairs where two different developers fixed a bug in common code at different times (wasted effort). Studies report that developers frequently create open-source variants on social coding platforms using ad hoc clone-and-own instead of the more systematic development paradigm of software product lines [6, 14, 15, 24, 34, 37]. Reasons for choosing the ad hoc clone-and-own paradigm mainly relate to its inexpensiveness and developer independence. For example, a developer may clone-and-own a mainline repository into a new forked variant. While the two variants will evolve in parallel, often exchanging updates, the new developer has full governance over the variant. Businge et al. [15] performed a large-scale study of clone-and-own variant families on GitHub. The authors investigated variant families from three ecosystems (Android applications, JavaScript, and .NET applications) and discovered over 10K variant families. They also report that over 80% of variant families comprise two variants and that about 82% of these families are not governed by common developers. Since most families are small-sized, it is not cost-effective to integrate the variants into a software product line.


Moreover, integrating the variants into a software product line would deprive the diverse developers of their independence. The small size and diverse ownership of variants in families may further explain why developers of open-source variants opt for ad hoc clone-and-own over the more systematic development paradigm of software product lines.

Challenge 2. Understanding the social and technical factors affecting the evolution of variant families. Managing clone-and-own variant families on social coding platforms is a challenging task. To effectively support the families, one has to understand their variability in three dimensions: (i) space (concurrent variations of the system at a single point in time), (ii) time (sequential variations of the system due to its evolution), and (iii) the development team, in terms of diversity and size [2, 14, 15, 54]. The study of Robles et al. [56] on variant forks, carried out in the pre-GitHub days, reports five outcomes of forking: (i) discontinuation of the fork variant, (ii) discontinuation of the original, (iii) re-merging of the fork variant with the original, (iv) both the original and the fork variant abandoned, and (v) both the original and the fork variant continuing to co-evolve. One would expect similar evolutionary trends among variants in the current days of social coding. Studies need to investigate the possible causes of these evolutionary trends among the variants in a family.

Challenge 3. Suboptimal variant maintenance (missing or re-implementing important updates introduced in other variants of a family). Variant developers need to efficiently reuse updates within the family by avoiding redundancy and not missing essential updates. One way of not missing essential updates is to employ classical merging [40]. However, classical merging tries to eliminate the variant differences by integrating all the changes into a unified version [33]. An alternative that propagates only the desired changes from one variant to another is cherry-picking [50]. Unfortunately, cherry-picking is also problematic, since the desired updates are usually lost in a pool of other updates in the variants, as reported by Businge et al. [15]. One has to dig through the updates to find a specific update to cherry-pick into one's repository. Furthermore, Businge et al. [15] reveal that variants are usually buried in a pool of social forks, which makes it hard for developers of a given variant to know that other active variants exist in the family. One solution is to collect provenance data about the variants, such as unsynchronized bug fixes, tests, optimizations, features, refactoring operations, and configurations in shared files. Tools that automatically identify and recommend, or directly integrate, these interesting unsynchronized updates can help reduce the maintenance burden in these clone-and-own variant families.


Summary. The GitHub platform (with its pull-request-based model) actively encourages the forking of projects. As a consequence, clone-and-own approaches are frequently adopted for creating and maintaining product families, despite their known drawbacks. This results in a number of challenges and opportunities for the stakeholders surrounding the platform: (1) developers are forced to rely on ad hoc approaches for sharing code between variants; (2) researchers have an opportunity to study which factors (both social and technical) affect product families; and (3) tool builders may provide better tool support for sharing code.

6.6 Research Roadmap

Since the clone-and-own development paradigm is inevitable for a vast number of open-source variant families on social coding platforms, researchers have to come up with solutions that help support the maintenance of these variant families. Instead of forcing developers to employ a heavyweight software product line process, researchers should provide approaches for managing the maintenance of clone-and-own variants. Such approaches could combine the best practices of software product lines and version control systems. A few studies have proposed managed clone-and-own solutions for maintaining variants in larger-scale organizations building complex products with many variants [24, 32, 34, 54, 58]. To support the maintenance of clone-and-own variants, we discuss three directions researchers could explore.

6.6.1 Recommendation Tools

Ramkisoen et al. [54] presented a proof-of-concept patch recommender tool named PaReco. However, extra features are needed to turn the proof of concept into a practical recommender tool. Currently, PaReco classifies a patch as "possibly interesting" for the target variant (i.e., a missed opportunity, effort duplication, or split case) based on the common buggy and patched lines of code in the files shared by the source and the target variant. For example, assume we have two variants: V1 as the source variant and V2 as the target variant. V1 has a patch containing three files: lot, bar, and foo. By the definition of a patch [54], each of the three files will have at least one hunk (a code snippet with the modified lines of code). Using each file's absolute path, PaReco searches for the three files in the git_head of V2. The patch is classified as "possibly interesting" for V2 if at least one hunk is shared between V1 and V2. This implies that the patches shared between V1 and V2 are classified as "possibly interesting" if the similarity score S of the patches lies in the range 0 < S ≤ 1 (values close to 0 mean the patches are barely similar; 1 means they are very similar). Missing hunks/files in V2 will result in a lower value of the patch similarity


score S. The missing hunks/files may result from refactoring operations applied in V1 and/or V2. Refactoring operations are intended to improve the software's design, structure, or implementation while preserving its functionality [20]. Before recommending the patch to the developer, it is therefore crucial to determine whether any code restructuring has been performed in the patched files of the source and target variants. The state-of-the-art tool RefactoringMiner [66] can be used to identify the refactoring operations applied in the two branches. RefactoringMiner version 2.3.0 can identify over 80 refactoring operations with both precision and recall above 94%. It thus seems feasible to extract the refactoring operations applied on the source V1 and replay them on the target V2. Since they are refactoring operations, they preserve behavior, as opposed to simply replaying all commits. This would improve the similarity score and hence reduce the number of split cases. Furthermore, to turn PaReco into a recommender tool, one may need to conduct a user study with variant developers to learn what would be a reasonable threshold for the patch similarity score between V1 and V2.
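To make the classification and similarity score concrete, the following Python sketch shows one simplified way to compare a source patch against the target's git_head file. It works on whole lines rather than PaReco's hunk-based matching, and the function names are illustrative only, not the actual PaReco implementation.

```python
import difflib

def patch_similarity(patched_lines, head_lines):
    """Similarity score S in [0, 1] between the patched source file and the
    target's git_head file (a simplification of hunk-based matching)."""
    return difflib.SequenceMatcher(None, patched_lines, head_lines).ratio()

def classify(removed_lines, added_lines, head_lines):
    """Classify a shared file in the target variant with respect to a source patch."""
    head = set(head_lines)
    still_buggy = any(line in head for line in removed_lines)
    already_patched = any(line in head for line in added_lines)
    if still_buggy and already_patched:
        return "split case"          # partly fixed: duplicated effort and a missed opportunity
    if still_buggy:
        return "missed opportunity"  # the upstream fix has not reached the target variant
    if already_patched:
        return "effort duplication"  # the target independently applied (or received) the fix
    return "incomparable"            # the file has diverged too much to compare
```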

6.6.2 Shareable Updates Among Variants

The target variant developer is likely to use the "cherry-pick" facility to integrate a given patch into their variant. However, the files to be cherry-picked may have been restructured in both the source and the target variant. Indeed, Mahmoudi et al. [38] reported that refactoring operations are known to cause merge conflicts when merging development branches. To mitigate the integration conflicts caused by refactoring operations while cherry-picking commits, one could extend the Git cherry-pick tool with refactoring-aware functionality [25]. The PaReco tool prototype demonstrated how bug fixes can be shared between the source and target variants of a family [54]. However, there are many other items that can be shared among the variants in a family, including dependency upgrades, test cases, code optimizations, vulnerability fixes, and new features. Furthermore, social coding platforms have substantially improved support for code reuse and collaboration through complementary services whose data could also be reused by the variants in a family, for example, issue tracking systems (e.g., JIRA), source code review (e.g., Gerrit), Q&A (e.g., Stack Overflow), and continuous integration (e.g., GitHub Actions). Researchers can mine the changes from these various sources to see how they can be shared among the different members of a family.
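For reference, propagating a single fix from the source variant into a checked-out clone of the target variant boils down to a fetch plus cherry-pick; the remote URL and the commit SHA below are placeholders, and in practice they would come from a recommender such as PaReco.

```python
import subprocess

# Hypothetical repository location and commit identifier.
source_remote = "https://github.com/original-owner/project.git"
fix_commit = "abc1234"  # placeholder SHA of the patch in the source variant

# Run from within a clone of the target variant:
subprocess.run(["git", "remote", "add", "upstream", source_remote], check=True)
subprocess.run(["git", "fetch", "upstream"], check=True)
# -x records the original commit id in the message, keeping a traceability link.
subprocess.run(["git", "cherry-pick", "-x", fix_commit], check=True)
```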

6.6.3 Transplantation Tools

Identifying which updates from one project can be shared with another variant project is just the first step. The actual integration of a given piece of code into


the (possibly divergent) variant poses another series of challenges. This is captured by the term "software transplantation," coined by Barr et al. [4]. Following this metaphor, a developer must manually identify an organ in a donor project and then point to an implantation point in a host project where this piece of code must be inserted. The tool automatically extracts the organ and its veins by backward and forward slicing and then uses a search-based approach guided by testing to migrate the functionality from the donor to the host. In the context of variant forks, code transplantation may be more promising because the similarities between donor and host projects may lead to more successful transplantation attempts, and the development histories of the forks can help automate the identification of the feature and the implantation point. Two ideas seem particularly appealing from a variant family perspective.

• Test transplantation. Test code is a particularly interesting source for transplanting code from one variant to another. Grafter was a first tool prototype that illustrated the feasibility of transplanting tests to cloned parts of the system under test [74, 75]. Follow-up studies by Schittekat et al. [59] and Abdi and Demeyer [1] discuss how to transplant unit tests from one project to a different project based on their dependency relation.
• Automated program repair. The goal of automated program repair is to convert an existing program that nearly satisfies a specification into one that fully satisfies it [42]. Following this definition, the program that nearly satisfies the specification corresponds to the patch as it exists in the source variant, and the program that fully satisfies it corresponds to the patch after it has been transformed to fit the target variant. Automated program repair can be implemented using many types of specifications, such as contracts [47], a reference implementation [39], or tests [42]. Researchers can leverage automated program repair techniques to improve recommender tools for patch transplantation. Indeed, Shariffdeen et al. explored the feasibility of automatically adapting a patch (i.e., a bug fix) for an error in a donor program and inserting it into a "similar" target program with a tool named PatchWeave [60]. In a similar vein, Yang et al. explored the feasibility of searching for fix ingredients for a patch with a tool named TransplantFix [72].

Summary. Given that clone-and-own is commonly adopted on Git-based platforms and that this is likely to remain so in the foreseeable future, this opens up opportunities for future research. We listed three of them, pointing at preliminary work for each: recommendation tools, shareable updates among variants, and code transplantation.

6.7 Conclusion

This chapter discussed the evolution of variant forks on social coding platforms. We analyzed the state of the art on variant forking in both the pre-social-coding and the current social coding eras. We introduced the concept of variant forking on social coding


platforms and presented the categories of variant forks based on the developers' motivations for creating the variants. We described two types of variant forks (i.e., forge variant forks and intrinsic variant forks) and a mining approach for precisely extracting variant forks from the pool of other software systems hosted on social coding platforms. Finally, we presented some of the challenges developers face in co-evolving the variants and examined possible research directions that can help mitigate these challenges.

References 1. Abdi, M., Demeyer, S.: Test transplantation through dynamic test slicing. In: International Working Conference on Source Code Analysis and Manipulation (SCAM). IEEE, Piscataway (2022). https://doi.org/10.1109/SCAM55253.2022.00009 2. Ananieva, S., Greiner, S., Kühn, T., Krüger, J., Linsbauer, L., Grüner, S., Kehrer, T., Klare, H., Koziolek, A., Lönn, H., Krieter, S., Seidl, C., Ramesh, S., Reussner, R., Westfechtel, B.: A conceptual model for unifying variability in space and time. In: International Conference on Systems and Software Product Lines (SPLC). ACM. New York (2020). https://doi.org/10. 1145/3382025.3414955 3. Apel, S., Batory, D., Kästner, C., Saake, G.: Feature-Oriented Software Product Lines: Concepts and Implementation. Springer, Berlin (2013) 4. Barr, E.T., Harman, M., Jia, Y., Marginean, A., Petke, J.: Automated software transplantation. In: International Symposium on Software Testing and Analysis (ISSTA), pp. 257–269. ACM, New York (2015). https://doi.org/10.1145/2771783.2771796 5. Baysal, O., Kononenko, O., Holmes, R., Godfrey, M.W.: The secret life of patches: a Firefox case study. In: Working Conference on Reverse Engineering (WCRE), pp. 447–455 (2012). https://doi.org/10.1109/WCRE.2012.54 6. Berger, T., Rublack, R., Nair, D., Atlee, J.M., Becker, M., Czarnecki, K., Wasowski, A.: A survey of variability modeling in industrial practice. In: International Workshop on Variability Modelling of Software-Intensive Systems. ACM, New York (2013). https://doi.org/10.1145/ 2430502.2430513 7. Blessed-integration: merge to master and release to PyPI. https://github.com/erikrose/ blessings/pull/104 (2022). Accessed 15 April 2023 8. Borba, P., Teixeira, L., Gheyi, R.: A theory of software product line refinement. Theor. Comput. Sci. 455, 2–30 (2012). https://doi.org/10.1016/j.tcs.2012.01.031 9. Bratach, P.: Why do open source projects fork? https://thenewstack.io/open-source-projectsfork/ (2017). Accessed 15 April 2023 10. Businge, J., Kawuma, S., Bainomugisha, E., Khomh, F., Nabaasa, E.: Code authorship and fault-proneness of open-source android applications: an empirical study. In: International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE), pp. 33–42. ACM, New York (2017). https://doi.org/10.1145/3127005.3127009 11. Businge, J., Openja, M., Nadi, S., Bainomugisha, E., Berger, T.: Clone-based variability management in the Android ecosystem. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 625–634. IEEE, Piscataway (2018) 12. Businge, J., Openja, M., Kavaler, D., Bainomugisha, E., Khomh, F., Filkov, V.: Studying Android app popularity by cross-linking GitHub and Google Play store. In: International Conference on Software Analysis, Evolution and Reengineering, pp. 287–297 (2019) 13. Businge, J., Decan, A., Zerouali, A., Mens, T., Demeyer, S.: An empirical investigation of forks as variants in the npm package distribution. In: The Belgium-Netherlands Software Evolution Workshop, CEUR Workshop Proceedings, vol. 2912. CEUR-WS.org (2020)


14. Businge, J., Decan, A., Zerouali, A., Mens, T., Demeyer, S., De Roover, C.: Variant forks: motivations and impediments. In: International Conference on Software Analysis, Evolution and Reengineering (SANER) (2022) 15. Businge, J., Openja, M., Nadi, S., Berger, T.: Reuse and maintenance practices among divergent forks in three software ecosystems. Empir. Softw. Eng. 27(2), 54 (2022). https://doi.org/10. 1007/s10664-021-10078-2 16. Chua, B.B.: A survey paper on open source forking motivation reasons and challenges. In: Pacific Asia Conference on Information Systems (PACIS), p. 75 (2017) 17. Comparing and merging files - hunks. https://www.gnu.org/software/diffutils/manual/html_ node/Hunks.html (2021). Accessed 15 April 2023 18. Czarnecki, K., Ulrich, E.: Generative Programming: Methods, Tools, and Applications. O’Reilly Media, Inc., Reading (2000) 19. Dabbish, L., Stuart, C., Tsay, J., Herbsleb, J.: Social coding in GitHub: transparency and collaboration in an open software repository. In: International Conference on Computer Supported Cooperative Work (CSCW), pp. 1277–1286. ACM, New York (2012). https://doi. org/10.1145/2145204.2145396 20. Demeyer, S., Ducasse, S., Nierstrasz, O.: Finding refactorings via change metrics. In: Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA), pp. 166–177. ACM, New York (2000). https://doi.org/10.1145/353171.353183 21. de Raadt, T.: Theo de Raadt’s dispute w/ NetBSD. https://zeus.theos.com/deraadt/coremail. html (2006). Accessed 15 April 2023 22. Dey, T., Mockus, A.: Effect of technical and social factors on pull request quality for the npm ecosystem. In: International Symposium on Empirical Software Engineering and Measurement (ESEM). ACM, New York (2020). https://doi.org/10.1145/3382494.3410685 23. Dixion, J.: Different kinds of open source forks: Salad, dinner, and fish. https://jamesdixon. wordpress.com/2009/05/13/different-kinds-of-open-source-forks-salad-dinner-and-fish/ (2009). Accessed 15 April 2023 24. Dubinsky, Y., Rubin, J., Berger, T., Duszynski, S., Becker, M., Czarnecki, K.: An exploratory study of cloning in industrial software product lines. In: European Conference on Software Maintenance and Reengineering (CSMR), pp. 25–34 (2013). https://doi.org/10.1109/CSMR. 2013.13 25. Ellis, M., Nadi, S., Dig, D.: Operation-based refactoring-aware merging: an empirical evaluation. Trans. Softw. Eng. (2022). https://doi.org/10.48550/ARXIV.2112.10370 26. Ernst, N.A., Easterbrook, S.M., Mylopoulos, J.: Code forking in open-source software: a requirements perspective. ArXiv. abs/1004.2889 (2010) 27. Gamalielsson, J., Lundell, B.: Sustainability of open source software communities beyond a fork: how and why has the LibreOffice project evolved? J. Syst. Softw. 89, 128 – 145 (2014) 28. GitHub: The state of open source software 2022. octoverse.github.com (2022). Accessed 15 April 2023 29. Iyer, R.N., Yun, S.A., Nagappan, M., Hoey, J.: Effects of personality traits on pull request acceptance. Trans. Softw. Eng. 47(11), 2632–2643 (2021). https://doi.org/10.1109/TSE.2019. 2960357 30. Jiang, Y., Adams, B., German, D.M.: Will my patch make it? And how fast?: Case study on the Linux kernel. In: Working Conference on Mining Software Repositories (MSR), pp. 101–110. IEEE, Piscataway (2013) 31. Jiang, J., Lo, D., He, J., Xia, X., Kochhar, P.S., Zhang, L.: Why and how developers fork what from whom in GitHub. Empir. Softw. Eng. 22(1), 547–578 (2017). https://doi.org/10.1007/ s10664-016-9436-6 32. 
Kehrer, T., Kelter, U., Taentzer, G.: Propagation of software model changes in the context of industrial plant automation. Automatisierungstechnik 62(11), 803–814 (2014) 33. Kehrer, T., Thüm, T., Schultheiß, A., Bittner, P.M.: Bridging the gap between clone-and-own and software product lines. In: International Conference on Software Engineering – New Ideas and Emerging Results (ICSE-NIER), pp. 21–25 (2021). https://doi.org/10.1109/ICSENIER52604.2021.00013


34. Krüger, J., Berger, T.: An empirical analysis of the costs of clone- and platform-oriented software reuse. In: Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 432–444. ACM, New York (2020) 35. Lapeña, R., Ballarin, M., Cetina, C.: Towards clone-and-own support: locating relevant methods in legacy products. In: International Systems and Software Product Line Conference (SPLC), pp. 194–203. ACM, New York (2016). https://doi.org/10.1145/2934466.2934485 36. Linsbauer, L., Lopez-Herrejon, R.E., Egyed, A.: Variability extraction and modeling for product variants. In: International Systems and Software Product Line Conference (SPLC). ACM, New York (2018). https://doi.org/10.1145/3233027.3236396 37. Mahmood, W., Strüber, D., Berger, T., Lämmel, R., Mukelabai, M.: Seamless variability management with the virtual platform. In: International Conference on Software Engineering (ICSE), pp. 1658–1670. IEEE, Piscataway (2021). https://doi.org/10.1109/ICSE43902.2021. 00147 38. Mahmoudi, M., Nadi, S., Tsantalis, N.: Are refactorings to blame? An empirical study of refactorings in merge conflicts. In: International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 151–162 (2019). https://doi.org/10.1109/SANER.2019. 8668012 39. Mechtaev, S., Nguyen, M.D., Noller, Y., Grunske, L., Roychoudhury, A.: Semantic program repair using a reference implementation. In: International Conference on Software Engineering (ICSE), pp. 129–139. ACM, New York (2018). https://doi.org/10.1145/3180155.3180247 40. Mens, T.: A state-of-the-art survey on software merging. Trans. Softw. Eng. 28(5), 449–462 (2002). https://doi.org/10.1109/TSE.2002.1000449 41. Mockus, A., Fielding, R.T., Herbsleb, J.D.: Two case studies of open source software development: Apache and mozilla. Trans. Softw. Eng. Methodol. 11(3), 309–346 (2002). https://doi.org/10.1145/567793.567795 42. Motwani, M., Soto, M., Brun, Y., Just, R., Goues, C.L.: Quality of automated program repair on real-world defects. Trans. Softw. Eng. 48, 637–661 (2022) 43. Nyman, L.: Hackers on forking. In: International Symposium on Open Collaboration (OpenSym), pp. 1–10. ACM, New York (2014). https://doi.org/10.1145/2641580.2641590 44. Nyman, L., Lindman, J.: Code forking, governance, and sustainability in open source software. Technol. Innov. Manage. Rev. 3, 7–12 (2013) 45. Nyman, L., Mikkonen, T.: To fork or not to fork: Fork motivations in SourceForge projects. Int. J. Open Source Softw. Process. 3(3) (2011). https://doi.org/10.4018/jossp.2011070101 46. Nyman, L., Mikkonen, T., Lindman, J., Fougère, M.: Perspectives on code forking and sustainability in open source software. In: Open Source Systems: Long-Term Sustainability – 8th IFIP WG 2.13 International Conference (OSS), pp. 274–279 (2012). https://doi.org/10. 1007/978-3-642-33442-9%5C_21 47. Pei, Y., Furia, C.A., Nordio, M., Wei, Y., Meyer, B., Zeller, A.: Automated fixing of programs with contracts. Trans. Softw. Eng. 40(5), 427–449 (2014). https://doi.org/10.1109/TSE.2014. 2312918 48. Pfofe, T., Thüm, T., Schulze, S., Fenske, W., Schaefer, I.: Synchronizing software variants with variantsync. In: International Systems and Software Product Line Conference (SPLC), pp. 329–332. ACM, New York (2016) 49. Pietri, A., Rousseau, G., Zacchiroli, S.: Forking without clicking: on how to identify software repository forks. In: International Conference on Mining Software Repositories (MSR), pp. 277–287. ACM, New York (2020). 
https://doi.org/10.1145/3379597.3387450 50. Pilato, C.M., Collins-Sussman, B., Fitzpatrick, B.W.: Version Control with Subversion. Addison-Wesley, Boston (2008) 51. Pohl, K., Böckle, G., Linden, F.J.v.d.: Software Product Line Engineering: Foundations, Principles and Techniques. Springer, Berlin (2005) 52. [Proposal] Osmdroid alternative provider for Android. https://github.com/react-native-maps/ react-native-maps/pull/2348 (2022). Accessed 15 April 2023


53. Rahman, M.M., Roy, C.K.: An insight into the pull requests of GitHub. In: Working Conference on Mining Software Repositories (MSR), pp. 364–367. ACM, New York (2014). https://doi. org/10.1145/2597073.2597121 54. Ramkisoen, P.K., Businge, J., Bradel, v.B., Decan, A., Mens, T., Demeyer, S., De Roover, C., Khomh, F.: Pareco: Patched clones and missed patches among the divergent variants of a software family. In: Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). ACM, New York (2022). https://doi.org/ 10.1145/3540250.3549112 55. Raymond, E.S.: The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary. O’Reilly, Sebastopol (1999) 56. Robles, G., González-Barahona, J.M.: A comprehensive study of software forks: dates, reasons and outcomes. In: International Conference Open Source Systems (OSS), pp. 1–14 (2012). https://doi.org/10.1007/978-3-642-33442-9%5C_1 57. Rocha, H., Businge, J.: Blockchain-oriented software variant forks: a preliminary study. In: 5th International Workshop on Blockchain Oriented Software Engineering (2022) 58. Rubin, J., Czarnecki, K., Chechik, M.: Managing cloned variants: a framework and experience. In: International Software Product Line Conference (SPLC), pp. 101–110. ACM, New York (2013). https://doi.org/10.1145/2491627.2491644 59. Schittekat, I., Abdi, M., Demeyer, S.: Can we increase the test-coverage in libraries using dependent projects’ test-suites? In: International Conference on Evaluation and Assessment in Software Engineering (EASE), pp. 294–298. ACM, New York (2022). https://doi.org/10.1145/ 3530019.3535309 60. Shariffdeen, R.S., Tan, S.H., Gao, M., Roychoudhury, A.: Automated patch transplantation. Trans. Softw. Eng. Methodol. 30(1) (2021). https://doi.org/10.1145/3412376 61. Soares, D.M., de Lima Júnior, M.L., Murta, L., Plastino, A.: Acceptance factors of pull requests in open-source projects. In: Annual ACM Symposium on Applied Computing (SAC), pp. 1541–1546. ACM, New York (2015). https://doi.org/10.1145/2695664.2695856 62. Stanciulescu, S., Schulze, S., Wasowski, A.: Forked and integrated variants in an opensource firmware project. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 151–160 (2015). https://doi.org/10.1109/ICSM.2015.7332461 63. Stanciulescu, S., Berger, T., Walkingshaw, E., Wasowski, A.: Concepts, operations, and feasibility of a projection-based variation control system. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 323–333. IEEE Computer Society, Los Alamitos (2016) 64. Sung, C., Lahiri, S.K., Kaufman, M., Choudhury, P., Wang, C.: Towards understanding and fixing upstream merge induced conflicts in divergent forks: an industrial case study. In: International Conference on Software Engineering (ICSE), pp. 172–181. ACM, New York (2020) 65. The Python Software Foundation: difflib — helpers for computing deltas. https://docs.python. org/3/library/difflib.html (2022). Accessed 15 April 2023 66. Tsantalis, N., Ketkar, A., Dig, D.: Refactoringminer 2.0. Trans. Softw. Eng. 48(3), 930–950 (2022). https://doi.org/10.1109/TSE.2020.3007722 67. van der Linden, F.J., Schmid, K., Rommes, E.: Software Product Lines in Action: The Best Industrial Practice in Product Line Engineering. Springer, Berlin (2007) 68. van der Veen, E., Gousios, G., Zaidman, A.: Automatically prioritizing pull requests. In: Working Conference on Mining Software Repositories (MSR), pp. 357–361. 
IEEE, Piscataway (2015) 69. Viseur, R.: Forks impacts and motivations in free and open source projects. Int. J. Adv. Comput. Sci. Appl. 3(2), 117–122 (2012) 70. Weißgerber, P., Neu, D., Diehl, S.: Small patches get in! In: International Working Conference on Mining Software Repositories (MSR), pp. 67–76. ACM, New York (2008). https://doi.org/ 10.1145/1370750.1370767


71. Wheeler, D.A.: Why open source software/free software (OSS/FS, FLOSS, or FOSS)? Look at the numbers! Appendix A.6 Forking. https://dwheeler.com/oss_fs_why.html#forking (2015). Accessed 15 April 2023 72. Yang, D., Mao, X., Chen, L., Xu, X., Lei, Y., Lo, D., He, J.: Transplantfix: graph differencingbased code transplantation for automated program repair. In: International Conference on Automated Software Engineering (ASE). ACM, New York (2023). https://doi.org/10.1145/ 3551349.3556893 73. Yoshimura, K., Ganesan, D., Muthig, D.: Assessing merge potential of existing engine control systems into a product line. In: International Workshop on Software Engineering for Automotive Systems (SEAS), pp. 61–67. ACM, New York (2006) 74. Zhang, T., Kim, M.: Automated transplantation and differential testing for clones. In: International Conference on Software Engineering (ICSE), pp. 665 — 676. IEEE, Piscataway (2017). https://doi.org/10.1109/ICSE.2017.67 75. Zhang, T., Kim, M.: Grafter: Transplantation and differential testing for clones. In: International Conference on Software Engineering (ICSE), pp. 422–423. ACM, New York (2018). https://doi.org/10.1145/3183440.3195038 76. Zhou, S., Vasilescu, B., Kästner, C.: How has forking changed in the last 20 years? A study of hard forks on GitHub. In: International Conference on Software Engineering (ICSE), pp. 445– 456. ACM, New York (2020). https://doi.org/10.1145/3377811.3380412

Chapter 7

Supporting Collateral Evolution in Software Ecosystems
Zhou Yang, Bowen Xu, and David Lo

Abstract In modern software ecosystems, the source code implementing software features depends heavily on the functions and data structures defined in the codebases of supporting libraries. This phenomenon poses a significant problem for software evolution, as any change in the interfaces exported by the libraries can trigger a large number of adjustments in the dependent source code. These adjustments, which we refer to as collateral evolutions (CE), may be complex and may entail substantial code reorganizations. CE is thus time-consuming and error-prone if performed without supporting tools. In this chapter, we first introduce the motivation for and the challenges of handling CE. Then, we focus on three different software ecosystems (Linux, Android, and machine learning software) and elaborate on how recent technologies and tools could support CE in those ecosystems by different means. We also briefly highlight open problems and potential future work in this new and promising research area of supporting CE to improve software evolution activities, e.g., leveraging data from software Q&A sites (like Stack Overflow) to support the CE of software ecosystems.

7.1 Introduction

A software ecosystem is a collection of all the software components (e.g., projects, libraries, packages, repositories, plug-ins, apps) that developers are developing and maintaining. Over time, these software components are continually improved and

Much of the content of this chapter is a summary of, and is based on, the following articles that have at least one of the chapter authors as coauthor: [15–17, 36–38]. We would like to thank the other coauthors of these articles: Ferdian Thung, Xuan-Bach D. Le, Julia Lawall, Lucas Serrano, Van-Anh Nguyen, Lingxiao Jiang, Gilles Muller, Stefanus A. Haryono and Hong Jin Kang. Z. Yang () · B. Xu · D. Lo Singapore Management University, Singapore, Singapore e-mail: [email protected]; [email protected]; [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 T. Mens et al. (eds.), Software Ecosystems, https://doi.org/10.1007/978-3-031-36060-2_7


modified to meet the evolving needs and requirements of users. In the context of software evolution, collateral evolutions (CE) are a kind of software change known as adaptive changes, which are applied to software when its environment changes [8, 26]. CE occur when an evolution that affects the interface of a generic library entails modifications in all library clients. Such changes to a library cannot be made in isolation, as they are intricately coupled with client code. Performing these CE requires identifying the affected files and modifying all the code fragments in these files according to the changed interface.
CE are common in various software ecosystems, especially those with large-scale development, such as open-source operating systems. For example, in the Linux kernel, the device driver code needs to be kept up to date with the evolution of the rest of Linux, while the device driver code accounts for more than 70% of the operating system (OS) [9]. Any change in the interfaces of the Linux kernel or driver support libraries is likely to affect driver code, and restoring its correct behavior usually involves code modifications at many code sites. These modifications are referred to as collateral evolutions in Linux. Another example is the Android ecosystem: the Android OS is frequently updated to include new features or to fix bugs, and Android app developers need to update their apps accordingly to use the new application programming interfaces (APIs). Similarly, machine learning (ML) libraries, which are the foundations for building modern artificial intelligence (AI) models and applications, have been evolving rapidly to keep up with the latest advances in AI techniques. Consequently, there is a CE relationship between ML libraries and AI applications: periodically updated ML libraries may deprecate existing APIs, which forces developers to update the API usages in their applications accordingly.
CE have been a pervasive problem; more importantly, they are time-consuming and error-prone to resolve [29, 34, 40]. The modifications needed for CE may demand substantial code reorganizations [1]. With the rapid growth of software, the size of its collateral evolutions has also increased significantly. A single CE may affect hundreds of code sites spread across many different files [29], and such collateral evolutions can introduce bugs into previously mature code [24].
CE in different ecosystems involve different code change patterns; hence, different approaches have been proposed accordingly. In Linux, Serrano et al. formulate the task of handling CE as an example-based program transformation problem [36]. They first studied large-scale software changes in the Linux kernel and created a taxonomy of changes. Based on this taxonomy, Serrano et al. found that existing tools cannot handle software changes involving control-flow dependencies or multiple variants, which are common in the CE of the Linux kernel. To address this limitation, Serrano et al. proposed a new approach, SPINFER [36], that is capable of suggesting transformation rules with richer information to developers. In ML ecosystems, Python is widely used in both frameworks and applications. However, due to Python's dynamic typing, information on variable and function return types is usually unavailable. Previous work [15] on statically typed languages like Java is therefore not directly applicable to ML ecosystems. Haryono et al. [17] incorporate type inference to build MLCatchUp, which achieves a perfect


detection rate for deprecated API usages in ML ecosystems. Xavier et al. [40] studied API changes that break previously established contracts, resulting in compilation errors and behavioral changes in clients. Brito et al. found that 39% of the changes investigated in their study may have an impact on clients [5]. Furthermore, they revealed that the most common breaking changes are due to refactorings [4].
In the rest of this chapter, we first present existing work on supporting CE in different software ecosystems. We describe the work on CE in the Linux kernel, Android, and machine learning-based software ecosystems in Sects. 7.2, 7.3, and 7.4, respectively. We then describe open problems and future work in Sect. 7.5. Finally, we conclude in Sect. 7.6.

7.2 Supporting Collateral Evolution in Linux Kernel

In this section, we first present two of our previous studies, [38] and [36], that aim to address this issue, in Sects. 7.2.1 and 7.2.2, respectively. These studies were carried out by a subset of the authors of this chapter in collaboration with other researchers. We then highlight other studies in Sect. 7.2.3.

7.2.1 Recommending Code Changes for Automatic Backporting of Linux Device Drivers

Motivation. In Linux, the device driver code accounts for more than 70% of the operating system (OS), and it needs to be kept up to date with the evolution of the rest of the Linux code [9]. Any change in the interfaces of the Linux kernel or driver support libraries is likely to affect driver code, and restoring its correct behavior usually involves code modifications at many code sites. Device drivers are usually created for a specific OS version. However, Linux evolves quickly, with frequent kernel-level API changes. Thus, it is challenging for device driver manufacturers to choose a target kernel version that will be acceptable to the potential users of the device. Typically, the driver manufacturers target the mainline version of the Linux kernel so that the driver code can be integrated with the mainline distribution and maintained by the mainline kernel developers. However, users usually run stable versions, which are typically older kernel versions. Hence, for such users, the driver must be backported to older versions. In the past, backporting was typically done manually on a case-by-case basis. Alternatively, the Linux backports project1 provides a compatibility library that hides differences between the current mainline and a host of older versions, as well as patches that allow a set of drivers to target this compatibility library. These patches

156

Z. Yang et al.

are either created manually by the backports project maintainers or are created using manually written rewrite rules with the support of the transformation tool Coccinelle [32]. Either way, the backports project maintainer has to determine where changes are needed in the code to backport and how to implement these changes. Both of these operations are tedious and error-prone.

Approach. To aid backporting in the Linux kernel, we2 proposed a recommendation system that suggests code changes for backporting driver files, as a step toward automating the task. The input of our approach consists of a driver file in a given Linux version, the older Linux version to which the driver file needs to be backported (the target version), and the Git repository that stores the changes to the Linux source code. The output of our approach is a recommendation list containing possible changes that can be applied to the error line to make the driver compilable in the target version. The changes are ranked by the similarity between the error line and the result of applying the change to the error line. Our approach is divided into three phases: (1) (compile) error-inducing change (EIC) search, (2) code transformation extraction, and (3) recommendation ranking. We define an error-inducing change as a patch between two consecutive commits in which compiling a target backport file in the older commit version leads to a compile error (similar to Businge et al. [7] in the Eclipse ecosystem).

1. Error-inducing change search. The goal of the EIC search is to find the error-inducing change that helps to backport the driver to a target version in which the driver currently cannot compile. In this phase, our approach accepts as input (1) the input driver file, a driver file that needs to be backported; (2) the target Linux version, the Linux version to which the driver file needs to be backported; and (3) the version control system, the Git repository containing the change history between the target Linux version and the Linux version in which the input driver file currently exists. Our approach first searches for two consecutive commits such that compiling the input driver file results in a compilation error in the older commit version and no error in the newer commit version. The goals are twofold: (1) find the relevant change in the Linux kernel implementation that results in the input driver file not compiling in the target Linux version, and (2) find the changes that have been performed on existing Linux driver files to adapt them to this Linux kernel change. These adaptations are often committed at the same time as the relevant change to the underlying Linux kernel to prevent compilation errors. By reversing these adaptations, we can obtain the code changes needed to backport the input driver file. The EIC search engine uses a binary search, starting from the target version, to jump through the change history recorded in the version control system and compile the input driver file at each visited commit version. The search stops when it finds the commit version that successfully compiles the input driver file (i.e., without compilation error) in the new version but fails

2 In this chapter, we use first-person narration if the set of authors of a work being described overlaps with the authors of this chapter.


in the previous version (i.e., with a compilation error). The change between the two consecutive versions is considered the EIC and is then fed to the next phase. A similar technique has been used in [11, 18].

2. Code transformation extraction. In this phase, our approach takes the EIC obtained in the previous phase and the input driver file as input. The goal of this phase is to search for changes in the EIC that are relevant to the line in the input driver file that the compiler has marked as erroneous (which we refer to as the error line) and to generate candidate transformations to backport the input driver file. This phase has one processing component, the Code Transformation Extractor, which matches the error line against each deleted line in the EIC. It then generates candidate transformations based on how the deleted lines are changed into the corresponding added lines. These candidate transformations are the recommendations produced by our approach. A similar technique has been used in [30].

3. Recommendation ranking. In the third and final phase, our approach passes the candidate transformations to the Ranker, which ranks the transformations based on the similarity between the error line and the result of applying the transformation, favoring the minimal change between them. A developer who needs to backport the driver file can then examine the generated ranked recommendation list from top to bottom to find a suitable transformation.
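A minimal sketch of the EIC search phase is shown below. It assumes a compiles() helper that checks out a kernel commit and tries to build the driver's object file, and it assumes the driver fails to compile at the oldest commit in the list and compiles at the newest one; the real system additionally handles build configuration and multiple error lines.

```python
import subprocess

def compiles(driver_file: str, commit: str) -> bool:
    """Check out `commit` of the kernel tree and try to build the driver's object file.
    (Assumes the current directory is a configured kernel working tree.)"""
    subprocess.run(["git", "checkout", "-q", commit], check=True)
    result = subprocess.run(["make", driver_file.replace(".c", ".o")],
                            capture_output=True)
    return result.returncode == 0

def find_eic(driver_file: str, commits: list[str]) -> tuple[str, str]:
    """Binary search over `commits` (ordered from the target version to the version
    where the driver is known to compile) for two consecutive commits such that the
    driver fails to compile in the older one and compiles in the newer one."""
    lo, hi = 0, len(commits) - 1      # invariant: fails at commits[lo], compiles at commits[hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if compiles(driver_file, commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[lo], commits[hi]   # the change between these two commits is the EIC
```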

Table 7.1 Effectiveness of ranking approach

N    # Correct Code Changes    Average Hit@N
1    50                        0.735
2    58                        0.853
3    58                        0.853
4    58                        0.853
5    60                        0.882

Fig. 7.1 An API migration example in Linux

When the recommendation is increased to Top-5, there are two more such driver files. Thus, by recommending only the Top-5 candidate code changes, our approach can successfully find the correct code change 88.2% of the time, achieving an average Hit@5 of 0.882.
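To make the ranking and Hit@N steps more concrete, the following is a minimal Python sketch of how candidate transformations could be ranked by textual similarity to the error line (favoring the minimal change) and how Hit@N could be averaged over recommendation lists. It is not the actual implementation; the candidate transformations and driver data shown are hypothetical.

import difflib

def rank_candidates(error_line, candidates):
    """Rank candidate transformations by the similarity between the error line
    and the result of applying each transformation (most similar first),
    i.e., favoring the minimal change."""
    def similarity(candidate):
        old, new = candidate  # (deleted line, added line) extracted from the EIC
        patched = error_line.replace(old.strip(), new.strip())
        return difflib.SequenceMatcher(None, error_line, patched).ratio()
    return sorted(candidates, key=similarity, reverse=True)

def average_hit_at_n(recommendations, ground_truth, n):
    """Hit@N is 1 for a driver if its correct change appears in the top-N
    recommendations; the average is taken over all drivers."""
    hits = [1 if truth in recs[:n] else 0
            for recs, truth in zip(recommendations, ground_truth)]
    return sum(hits) / len(hits)

# Hypothetical example: one error line and two candidate transformations.
error = "retval = usb_submit_urb(urb);"
cands = [("usb_submit_urb(urb)", "usb_submit_urb(urb, GFP_KERNEL)"),
         ("usb_submit_urb(urb)", "usb_fill_bulk_urb(urb)")]
print(rank_candidates(error, cands)[0])

# Hit@2 over two hypothetical drivers, one of which has its correct change in the top-2.
print(average_hit_at_n([["a", "b"], ["c", "d"]], ["b", "x"], 2))  # 0.5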

7.2.2 SPINFER: Inferring Semantic Patches for the Linux Kernel

Motivation With the rapid growth of the Linux kernel, maintaining the OS is becoming more and more challenging, even for a simple API migration. Figure 7.1 presents an example concerning the initialization of the low-resolution timer structure; maintainers need to update calls from INIT_TIMER to SETUP_TIMER.3 Since the migration is not mandatory, the codebase contains a mixture of both functions. In 2018, these interfaces were considered insecure and were both replaced. However, at the time, the usage of the APIs was in an inconsistent state, where 60% were using

3 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b9eaf1872222.


SETUP_TIMER while 40% were using INIT_TIMER. Thus, an effective approach that can perform the transformation automatically is desirable. In the literature, there are many existing tools that can perform program transformation by learning from single or multiple examples. For example, Sydit [27], Meditor [41], AppEvolve [12], and Genpat [20] generate transformation rules from individual change examples. Lase [28], Refazer [33], and Spdiff [2] are designed to learn from multiple examples, identifying how to abstract over these examples based on the commonalities and differences between them. However, the types of transformation that can be handled by these tools remain unclear. To gain a deeper understanding of these tools, we created a taxonomy of transformation-rule inference based on studying large-scale changes in the Linux kernel. The taxonomy considers transformation inference from four perspectives: (1) the relationship between changed terms in a single change instance (including control-flow and data-flow dependencies), (2) the number of changed instances, (3) the number of change variants, and (4) unrelated changes. The taxonomy is then used to characterize particular change examples and to compare transformation-rule inference approaches. In particular, we found that existing tools cannot handle transformations that involve control-flow dependencies or contain multiple variants, even though both transformation types are common in the Linux kernel. Moreover, the transformation rules used by the existing tools are not exposed to developers. In other words, it is infeasible for developers to check whether the transformation is correct or not.

In practice, the automatic C code transformation tool Coccinelle [22] has been integrated into the Linux kernel developer toolbox since 2008 and is widely adopted by the Linux community for automating large-scale changes in kernel code. Coccinelle provides the notion of a semantic patch, allowing kernel developers to write transformation rules using a diff-like syntax, enhanced with metavariables to represent common but unspecified subterms and with notation for reasoning about control-flow paths. Given a semantic patch, Coccinelle applies the rules automatically across the codebase. Still, there remain large-scale changes in the Linux kernel commit history where Coccinelle has not been used. One of the main reasons is the usability of Coccinelle: Linux kernel developers need to be familiar with its dedicated syntax, which is hard to remember. Another reason is that some developers only realize that Coccinelle could have been used after performing a few manual changes. Furthermore, Coccinelle does not help developers understand existing large-scale changes if no semantic patch is provided.
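For illustration only, the Python sketch below mimics the flavor of such a mechanical, rule-based rewrite for the timer-initialization idiom, using the lowercase kernel function names init_timer and setup_timer. A real Coccinelle semantic patch expresses this far more robustly with metavariables and control-flow reasoning; the regular expression and the sample snippet here are assumptions made purely for illustration.

import re

# A naive, purely illustrative "transformation rule": collapse the three-statement
# init_timer initialization idiom into a single setup_timer call.
RULE = re.compile(
    r"init_timer\(&(?P<t>\w+)\);\s*"
    r"(?P=t)\.function\s*=\s*(?P<f>\w+);\s*"
    r"(?P=t)\.data\s*=\s*(?P<d>[^;]+);")

def apply_rule(c_source: str) -> str:
    """Apply the rewrite everywhere it matches in a C source string."""
    return RULE.sub(r"setup_timer(&\g<t>, \g<f>, \g<d>);", c_source)

before = ("init_timer(&poll_timer);\n"
          "poll_timer.function = poll_cb;\n"
          "poll_timer.data = (unsigned long)dev;")
print(apply_rule(before))
# -> setup_timer(&poll_timer, poll_cb, (unsigned long)dev);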

Approach To address the limitations of the existing tools, we proposed a tool named SPINFER. It is designed to be capable of (1) learning the transformation from multiple examples, i.e., a set of given change examples containing pairs of functions from before and after the change; (2) handling both control-flow dependencies and transformation variants; and (3) exposing Coccinelle transformation rules to Linux maintainers. As shown in Fig. 7.2, our approach focuses on finding abstract semantic


Fig. 7.2 The usage of SPINFER

patch fragments that will be assembled into one or more semantic patch rules. Each rule will match one of the variants illustrated by the examples. To achieve this aim, SPINFER first identifies sets of commonly removed or added terms across the given examples, then generalizes the terms in each set into a pattern that matches all of the terms in the set, and finally integrates these patterns into transformation rules that respect both the control-flow and data constraints exhibited by the examples, splitting the rules if necessary when inconsistencies appear. The framework of SPINFER consists of four phases:

1. Identification of abstract fragments. The goal of this phase is to cluster nodes sharing similar subterms to form abstract fragments. To this end, SPINFER clusters subterms from the examples that have a similar structure and generalizes each cluster into an abstract fragment that matches all the terms in the cluster.

2. Assembling the rule-graphs. To capture control-flow constraints, SPINFER combines the abstracted fragments into a semantic patch rule-graph, a representation of a transformation rule as a graph, where nodes represent fragments to add and remove and where edges are determined by control-flow dependencies exhibited in the associated examples.

3. Splitting. When assembling fails or when SPINFER detects data-flow inconsistencies, SPINFER splits existing rule-graphs into more specific ones.

4. Rule ordering. In the last step, SPINFER orders the generated rules, removing redundant ones, to maximize precision and recall while minimizing the number of rules in the final semantic patch.

Experiment We evaluated SPINFER on a dataset composed of 40 sets of changes randomly picked from changes to the Linux kernel in 2018. We compared the results


produced by SPINFER-generated semantic patches with the results produced by human-written semantic patches. In the experiment, SPINFER learns semantic patches from a reduced dataset composed of the first ten changed files, as indicated by the commit author date, or from half of the full dataset if the full dataset contains fewer than 20 files. Then, SPINFER is evaluated on the rest of the dataset, i.e., the files from which the rules were not learned. We used two metrics, Precision and Recall, to measure the performance. Precision refers to the fraction of produced changes that were correct, while Recall refers to the fraction of needed changes that were produced. SPINFER achieved 87% Precision and 62% Recall. Moreover, SPINFER managed to produce the same semantic patches as the human-written ones for eight cases.
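As a reminder of how these two metrics are computed, the small Python helper below illustrates the definitions given above; the change identifiers and counts are hypothetical.

def precision_recall(produced, needed):
    """Precision: fraction of produced changes that are correct.
    Recall: fraction of needed changes that were produced."""
    produced, needed = set(produced), set(needed)
    correct = produced & needed
    precision = len(correct) / len(produced) if produced else 0.0
    recall = len(correct) / len(needed) if needed else 0.0
    return precision, recall

# Hypothetical example: 8 produced changes, 7 of them correct, 10 changes needed in total.
produced = {f"change_{i}" for i in range(1, 8)} | {"spurious_change"}
needed = {f"change_{i}" for i in range(1, 11)}
print(precision_recall(produced, needed))  # (0.875, 0.7)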

7.2.3 Other Studies

There are a number of other studies that support collateral evolution in Linux. The pioneering work by Padioleau et al. [29] raised the issue of collateral evolution in the context of Linux. That work focused on the interfaces of the Linux kernel as well as driver support libraries and provided a taxonomy of the main evolutions that occur in these interfaces. Padioleau et al. found that the main evolutions entailed a variety of collateral evolutions (CE) and illustrated that performing CE can be complex. To deal with CE in Linux, Lawall et al. proposed an automatic transformation tool named Coccinelle [22], as introduced in Sect. 7.2.2. In addition, Lawall et al. proposed two tools, GCC-REDUCE and PREQUEL, to help developers deal with two key challenges of driver porting: (1) identifying where changes are needed and (2) determining what changes should be performed [23]. GCC-REDUCE translates error messages produced by compiling a driver with a target kernel into appropriate queries to PREQUEL, which is then used to query git commit histories.

7.3 Supporting Collateral Evolution in Android

Android has become one of the most widely used operating systems. Collateral evolution also exists in the Android ecosystem: the Android OS is frequently updated to include new features or to fix bugs, and app developers must update their apps to use the new APIs and to maintain their applications' backward compatibility. To support collateral evolution in Android, this section first discusses an empirical study on deprecated-API usage updates for Android apps in Sect. 7.3.1. Then, we introduce an example-based automatic Android deprecated-API usage update method in Sect. 7.3.2, after which we present an approach that relies on data-flow analysis and variable denormalization to perform the automated updates in Sect. 7.3.3. Finally, we briefly discuss some related work in Sect. 7.3.4.


7.3.1 An Empirical Study on Deprecated-API Usage Update in Android

To illustrate, let us consider the API change in Android API version 23 shown in Fig. 7.3: the method getCurrentMinute(), which returns the currently selected minute, was deprecated in version 23 of the Android API, and the method getMinute() was suggested as its replacement. The API usage change mapping is thus from method getCurrentMinute() to method getMinute(). We present an empirical study [37] on deprecated-API usage update that evaluates AppEvolve [12], a tool specifically designed to automatically update API usages for Android apps. AppEvolve takes as input a target app to update and a mapping from a deprecated API to its replacement. AppEvolve works in three phases: API-Usage Analysis, Update Example Search, and API-Usage Update, which are illustrated in Fig. 7.4.
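As a toy illustration of what such a one-to-one API usage change mapping means in practice, the Python sketch below rewrites the deprecated call name in a fragment of Java source text. AppEvolve itself learns statement-level edits from real update examples rather than performing a simple textual substitution, and the receiver variable timePicker is an assumption made here for illustration.

import re

# One-to-one mapping from a deprecated Android API method to its replacement.
API_MAPPING = {"getCurrentMinute": "getMinute"}

def apply_mapping(java_source: str) -> str:
    """Replace invocations of deprecated methods with their replacements."""
    for deprecated, replacement in API_MAPPING.items():
        java_source = re.sub(rf"\b{deprecated}\s*\(", f"{replacement}(", java_source)
    return java_source

before = "int minute = timePicker.getCurrentMinute();"
print(apply_mapping(before))
# -> int minute = timePicker.getMinute();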

Fig. 7.3 An example of a deprecated API getCurrentMinute() in Android 23

Fig. 7.4 The framework of AppEvolve

7.3.1.1 Datasets

From the F-Droid repository,4 we selected 15 real-world apps that cover 20 API usages in 41 locations in total. We selected five apps each for three Android API versions: 22, 23, and 25. For each API version, the API updates were manually generated by reading the API documentation. Beyond the apps used in the AppEvolve dataset [12], we included a larger set of apps that contain deprecated APIs, which were collected using GitHub Code Search5 by searching for the names of the deprecated APIs. The returned results were manually reviewed to eliminate false positives, that is, files that did not actually contain the deprecated APIs we were looking for. In total, 54 API usages were randomly selected from 54 apps obtained by querying GitHub Code Search. Note that our dataset and the AppEvolve dataset are mutually exclusive. As GitHub Code Search only returns the latest version of a repository, the apps in our dataset still use the deprecated APIs. In contrast, the AppEvolve dataset consists of apps that have already updated the deprecated API usages.

7.3.1.2 Results

On the AppEvolve dataset, AppEvolve can successfully update 37 out of 41 deprecated API usages in the 15 investigated apps. On the newly collected dataset, AppEvolve only generates ten applicable updates, and the remaining 44 updates are failures. Combined with the four failed updates on the AppEvolve dataset, this amounts to 48 failed updates in total. Table 7.2 shows the reasons leading to failed updates; the details are explained as follows.

1. Statements in the examples and at the target location are structurally different. AppEvolve infers the update operations at the statement level; thus, it fails to apply changes inferred from API usage found in one kind of statement to API usage found in another kind of statement. This often occurs because, in the examples, the invocation of the deprecated API is used as the right-hand side of an assignment, while in the new dataset the API invocations appear in various other expression contexts.

2. Object and arguments of the deprecated API method are in the form of complex expressions. AppEvolve only abstracts over variables to create update operations. As a result, when the examples contain variables for the object or the arguments, the generated edits are not sufficient to update code in which these subterms are expressed as more complex expressions.

3. Edits beyond method boundaries. AppEvolve fails to learn updates that modify program elements that reside outside of the method containing the API usage to be updated. Such updates include operations like importing additional packages and adding fields to a class.

4 https://fdroid.org. 5 https://github.com/search.


Table 7.2 Statistics of reasons for failed updates (number of cases)

Statements in the examples and at the target location are structurally different:
  Return statement: 6, If statement: 1, Method argument: 4, Arithmetic operand: 3, Declared variable: 12
Object and arguments of the deprecated API method are in the form of complex expressions: 20
Edits beyond method boundaries: 3
Incomplete support of programming language features:
  Inheritance: 2, Static modifier: 1, Final modifier: 1
No examples: 1
Others: 4
Total: 48

4. Incomplete support of programming language features. AppEvolve fails to update cases in which the update involves programming language features such as the static modifier, the final modifier, or inheritance.

5. No examples. No example of the required API update could be found on GitHub.

6. Others. These include cases that cannot be put into any of the categories above.

7.3.2 Example-Based Automatic Android Deprecated-API Usage Update

7.3.2.1 Design of CocciEvolve

We present a method called CocciEvolve [15] that performs deprecated API updates using only a single after-update example. Figure 7.5 illustrates an overview of CocciEvolve and the relevant pipelines. CocciEvolve consists of three key components: (1) source file normalization, (2) updated API block detection, and (3) API-update semantic patch creation. These components form a pipeline to create the update semantic patch and a pipeline to apply the update to a target file. The first pipeline takes as inputs the API usage change and a source file containing updated API calls. The second pipeline takes as inputs the API usage change, target source file, and update semantic patch file generated by the first pipeline.

Fig. 7.5 Overview of CocciEvolve

7.3.2.2 Dataset and Evaluation Results

A dataset of real-world Android projects obtained from GitHub is used to evaluate how CocciEvolve performs in practice. We leveraged AUSearch [3] to find Android API usages in public GitHub repositories. To create an update semantic patch, we use the existing after-update examples provided in the AppEvolve dataset [12]. For each API, only one after-update example is used. In total, 112 target source files from GitHub are obtained for evaluating CocciEvolve, covering the ten most commonly used APIs that were adopted in the original evaluation of AppEvolve [12]. These selected target files have no overlap with the target files used in the AppEvolve paper, which makes them suitable for evaluating the generalizability of AppEvolve in updating other target files. The evaluation results show that CocciEvolve performs better than AppEvolve. A user study is conducted by asking a software engineer with 3 years of experience in Android to validate the correctness of the updates by verifying that the updates introduce no semantic changes. The results show that CocciEvolve achieves an almost perfect result (near 100% success rate) for each investigated API. In contrast, AppEvolve is not able to produce any code update in most cases. We point out that AppEvolve requires some manual code refactoring and modifications to be able to perform the automated update. The few failures of CocciEvolve mainly concern two APIs: getAllNetworkInfo() and requestAudioFocus(). The reason is that, to update these two APIs, the tool has to create new objects that are used as arguments in the updated APIs. Usually, these new objects are created outside the updated API block, which requires program analysis (e.g., data-flow analysis) to detect and construct the update successfully. However, CocciEvolve currently has


no such sophisticated data-flow analysis capabilities. We summarize the features of CocciEvolve as follows:

• CocciEvolve does not need extensive setup or configuration.
• CocciEvolve is capable of updating multiple API calls in the same file without additional configuration.
• CocciEvolve provides an easily readable and understandable semantic patch as a by-product.
• CocciEvolve only needs a single updated example.

7.3.3 Data-Flow Analysis and Variable Denormalization-Based Automated Android API Update

7.3.3.1 AndroEvolve Architecture

Haryono et al. present another approach named AndroEvolve [16], which enhances CocciEvolve by incorporating data-flow analysis and variable name denormalization techniques. AndroEvolve leverages data-flow analysis to resolve the value of any variable within the file scope. Variable name denormalization substitutes the temporary variables that may be present in the CocciEvolve update with appropriate values from the target file. Figure 7.6 illustrates the overall architecture of AndroEvolve. The workflow of AndroEvolve has two main parts: update-script creation and update-script application.

Fig. 7.6 Summary of the AndroEvolve workflow


• Update-script creation takes as input the API update mapping and an after-update example of the API. The former is composed of the API signatures of the deprecated API and the corresponding updated API, which are used to identify the deprecated and the updated API in the code example and the target file. AndroEvolve then employs data-flow analysis to resolve the values of out-of-method variables in the update example that are used as arguments of the updated API invocation but are not used in the deprecated API invocation. Variable normalization aims to minimize the syntactic differences between the update example and the target code to be updated. More specifically, it substitutes all complex subexpressions with temporary variables in the part of the code containing the deprecated and updated API invocations to facilitate the API update (a simplified sketch of this idea is shown after this list). The update script, expressed using the Semantic Patch Language (SmPL), is then created from the example with normalized variables.

• Update-script application takes three inputs: (1) the target code to be updated, (2) the update script from the previous update-script creation step, and (3) the API update mapping. The update-script application proceeds as follows. First, we apply variable normalization to the target code. Then, Coccinelle4J [21] is used to apply the update script to the normalized target code to produce the updated code. Following the update-script application, we copy the method and class definitions used in the updated API arguments to the updated code. Finally, we replace the temporary variables with their original expressions in the updated code.
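The Python sketch below is a highly simplified illustration of the normalization and denormalization idea: complex argument expressions are pulled out into temporary variables before matching, and the temporaries are substituted back afterwards. The real AndroEvolve works on Java code via Coccinelle4J and data-flow analysis; the call and variable names used here are assumptions, and the naive comma-based argument split would not handle nested calls with multiple arguments.

import re

def normalize(stmt):
    """Replace complex argument expressions of a call with temporary variables.
    Returns the normalized statement and the temp-variable mapping."""
    call = re.search(r"(\w+)\((.*)\)", stmt)
    name, args = call.group(1), call.group(2)
    mapping, normalized_args = {}, []
    for i, arg in enumerate(a.strip() for a in args.split(",")):
        if re.fullmatch(r"\w+", arg):          # already a simple variable
            normalized_args.append(arg)
        else:                                   # complex subexpression
            tmp = f"tmp{i}"
            mapping[tmp] = arg
            normalized_args.append(tmp)
    return f"{name}({', '.join(normalized_args)})", mapping

def denormalize(stmt, mapping):
    """Substitute the temporary variables back with their original expressions."""
    for tmp, original in mapping.items():
        stmt = re.sub(rf"\b{tmp}\b", original, stmt)
    return stmt

norm, tmps = normalize("requestAudioFocus(listener, stream.getType(), AUDIOFOCUS_GAIN)")
print(norm)                      # requestAudioFocus(listener, tmp1, AUDIOFOCUS_GAIN)
print(denormalize(norm, tmps))   # back to the original call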

7.3.3.2 Evaluation of AndroEvolve

The dataset for evaluating AndroEvolve has three components: after-update examples used for the update-script creation, one-to-one API mappings from the deprecated APIs to the replacement APIs, and the target files to be updated. The after-update examples that were used in the evaluation of AppEvolve [12] are also included. For the target files to be updated, we extended the target file dataset from CocciEvolve [15]. We used AUSearch [3] to search GitHub repositories for deprecated API usages and randomly sampled public GitHub projects from the returned results. In the end, 360 target files containing 20 deprecated Android APIs were obtained and used as the target files. We compared the update accuracy (i.e., the percentage of correct updates) of AndroEvolve and CocciEvolve by manually checking the correctness of the updated code produced by the two tools. The results show that AndroEvolve and CocciEvolve provide correct updates for 316 and 249 target files, respectively. An ablation study shows that data-flow analysis contributes to updating more deprecated APIs successfully. This is especially the case when an after-update example contains out-of-method variables. To summarize, AndroEvolve achieves a significant improvement, with an overall update accuracy of 87.78%. As another important dimension of practical usage, we also compare the readability of the update results of the two tools. Both an automated and a manual approach are used. A state-of-the-art automated readability measuring tool from


Scalabrino et al. [35] is used as the automated evaluation method, giving a score between 0.0 and 1.0 (a higher score indicates better readability). In the manual assessment, two experienced Android developers scored 60 successfully updated examples, with 30 examples from each of CocciEvolve and AndroEvolve. For each updated example, the developers were asked to give a score on a Likert scale of 1–5 for the readability of the code. A higher score means higher readability and higher confidence that the code resembles code produced by a human developer. The automated evaluation shows that AndroEvolve's updated code achieves higher readability than CocciEvolve's for all of the evaluated deprecated API migrations. This finding is further confirmed by the manual evaluation: the average readability score given by the developers for AndroEvolve is 4.817, while CocciEvolve only obtains an average score of 2.633. These evaluations highlight that AndroEvolve offers a significant improvement in code readability compared to CocciEvolve.

7.3.4 Other Studies

A large body of work has been done on the collateral evolution between Android APIs and their usages [6, 12, 15, 19, 25, 31, 34, 42, 45]. Li et al. [25] characterized deprecated Android APIs and identified that the annotation and documentation of deprecated APIs are often inconsistent. They also uncovered that most deprecated APIs are used in popular libraries. Zhou et al. [45] analyzed deprecated APIs in 26 open-source Java frameworks and libraries and found that many of these APIs were never updated. To uncover deprecated API usages with ease, they designed a tool that can detect deprecated API usages in code examples on the web. Yang et al. [42] investigated how Android OS updates impact Android apps and presented an automatic approach to detect the affected parts in Android apps. Some studies explore the effect of API deprecation [19, 31, 34] on ecosystems. Robbes et al. [31] conducted a case study on the Smalltalk ecosystem and found that API deprecation messages are not always helpful. Sawant et al. [34] extended Robbes et al.'s work by conducting an empirical study on the effect of the deprecation of Java API artifacts on their clients. They found that few API clients update the API version that they use. In addition, they concluded that deprecation mechanisms as implemented in Java do not provide the right incentives for most developers to migrate away from deprecated API elements, even with the downsides that using deprecated entities entails. Hora et al. [19] conducted another case study, on the Pharo ecosystem, on the impact of API evolution. They found that API changes can have a large impact on client systems, methods, and developers. Sawant et al. [34] replicated similar studies on Java APIs and the JDK, finding that only a small fraction of developers react to API deprecation and that most developers tend to remove usages of deprecated APIs rather than migrate them to the updated APIs.


7.4 Supporting Collateral Evolution in ML Libraries

Machine learning (ML) libraries, which are the foundations for building modern AI models and applications, have been evolving rapidly to keep up with the latest advances in AI techniques. Consequently, there is a collateral evolution relationship between ML libraries and AI applications: ML libraries may deprecate existing APIs, and developers of AI applications need to update the API usages in their applications accordingly. In this section, we first present an empirical study that characterizes deprecated ML API usages in Sect. 7.4.1. Then, we present MLCatchUp, a tool that updates deprecated ML APIs automatically, in Sect. 7.4.2. We also present other relevant work in Sect. 7.4.3.

7.4.1 Characterizing the Updates of Deprecated ML API Usages

7.4.1.1 Datasets

We conducted an empirical study on three popular Python ML libraries: Scikit-Learn, TensorFlow, and PyTorch. Using the GitHub dependency graph, we calculated the number of GitHub repositories depending on different ML libraries. As of August 2020, more than 120,000 repositories utilized Scikit-Learn, while more than 82,000 repositories used TensorFlow. Keras and PyTorch were used by at least 52,000 and 34,000 repositories, respectively. As Keras has been integrated into TensorFlow since January 2017, this study chose Scikit-Learn, TensorFlow, and PyTorch as the research subjects. Deprecated API usages were collected by manually reading the change log of each library. More specifically, we searched for text or API methods marked with the word "deprecat-" or "replace-" in the change log, which indicates API deprecation or replacement. The change log of Scikit-Learn can be found on its official documentation page, while the change logs of PyTorch and TensorFlow are available in their GitHub release notes. The manual collection was limited to analyzing the change logs of major and minor release versions (from July 2018 to August 2020). In total, we found 112 pairs of deprecated APIs and the corresponding updates.

7.4.1.2 Update Operations to Migrate Deprecated API Usages

Considering the popularity of Python in developing AI models, this empirical study focuses on the deprecated API updates of Python ML libraries and their prevalence in these libraries. We analyze the update operation for migrating the deprecated


API usages. The update operation patterns are identified by analyzing the common patterns in how different APIs are updated. Each API is unique, but API developers often use a similar approach to update a deprecated API usage. We compare the differences between the deprecated and updated API signatures to collect the update operations. These update operations can be categorized as follows:

• Remove parameter: This operation removes one or more parameters from the deprecated API usage. If the deprecated parameter is not removed, a TypeError will result, as the parameter no longer exists.

• Rename parameter: This operation replaces one or more keyword parameter names with a new name. If the parameter name is not changed, an error will result. Similar to the remove-parameter operation, this update operation only affects API usages where the deprecated keyword parameter is explicitly declared.

• Convert positional parameter to keyword parameter: This operation first removes one or more positional parameters and then uses their values to create new keyword parameters.

• Rename a method: This operation renames the function or module of the deprecated API to the new, updated name. If the module name is changed, an import statement that imports the updated API is also added. Notice that there is no change to the API invocation parameters or return values.

• Add a parameter: This operation adds a new keyword or positional parameter to an API invocation.

• Change a parameter type: This operation changes the type of a parameter of the deprecated API usage. It first checks whether the type of the current API argument matches the new type in the updated API. If the type does not match, the deprecated API argument needs to be changed to follow the new type specification, which may require the creation of new objects or values, since a previously valid type may no longer be usable for the parameter. Figure 7.7 provides an example of this operation for the usage of the sklearn.utils.estimator_checks.check_estimator deprecated API.

Fig. 7.7 An example of checking types to handle CE


Fig. 7.8 The documentation of the updated API

According to the newer API version (i.e., v0.23.0 in this case, as presented in Fig. 7.8), the allowed parameter type in the updated API is restricted to the Estimator type.

• Add a constraint to a parameter value: This operation adds a constraint to the value of an API parameter due to a change in the permitted values of that parameter. The API argument's value is checked to ensure that it fits the newly permitted range of values. If the argument's value does not fit, we need to modify its value accordingly.

• Remove API: This operation removes the deprecated API without replacing its usage with any updated API. This is typically used when the deprecated API is no longer needed or no longer has any effect when invoked.

The distribution of these update operations is reported in Table 7.3, where we can observe that the most common operations differ across libraries (a small sketch of two of these operations is given after the table). The most common deprecated API migrations in Scikit-Learn involve removing parameters. The deprecated API migrations in PyTorch and TensorFlow mainly involve renaming methods; TensorFlow API migrations also often rename parameters. Among the update operations, changing a parameter type, adding constraints to parameter values, and removing an API are the least common, amounting to five or fewer APIs for each update operation.


Table 7.3 Distribution of the update operations to perform the migration in different libraries

Update operation              Scikit-learn   PyTorch   TensorFlow
Remove Param                  12             1         0
Rename Param                  0              1         25
PosToKey Param                0              8         0
Rename Method                 4              18        17
Add Parameter                 2              1         9
Change Param Type             2              3         0
Add Param Value Constraint    1              4         0
Remove API                    1              0         3
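To make two of these operations concrete, the sketch below applies a "rename parameter" and a "remove parameter" operation to the keyword arguments of a single call. The API name and the parameter names are hypothetical; real tools such as MLCatchUp operate on the AST of whole files rather than on an argument dictionary.

def rename_parameter(kwargs, old_name, new_name):
    """Rename parameter: keep the value, change the keyword name."""
    if old_name in kwargs:
        kwargs[new_name] = kwargs.pop(old_name)
    return kwargs

def remove_parameter(kwargs, name):
    """Remove parameter: drop it so the call no longer raises a TypeError."""
    kwargs.pop(name, None)
    return kwargs

# Hypothetical deprecated call: some_api(data, size=10, verbose_level=2)
kwargs = {"size": 10, "verbose_level": 2}
kwargs = rename_parameter(kwargs, "size", "n_samples")   # rename parameter
kwargs = remove_parameter(kwargs, "verbose_level")        # remove parameter
print(kwargs)  # {'n_samples': 10}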

A deprecated API may need to be updated with several new APIs. The API mapping represents the ratio between the number of deprecated and updated APIs involved in one migration. This pattern is adapted from the work of Cossette and Walker [10] on transformation classification in Java. Three types of mappings are found in the investigated ML libraries:

• 1:1 API mapping: A deprecated API is modified or replaced by a single updated API.
• 1:N API mapping: A deprecated API is modified or replaced by at least two updated APIs.
• 1:0 API mapping: A deprecated API is removed without any suggested replacement.

The context, i.e., the value and type of the arguments in the deprecated API usage, may affect the migration of deprecated APIs; this is called context dependency. The two categories of context dependency are as follows:

• Context-dependent update: An API migration that depends on the value of the arguments of an API invocation. Take the sklearn.model_selection.KFold API as an example (see also the sketch after this list). It is only deprecated if the random_state argument is not None, in which case the shuffle argument must be set to True. Otherwise, there is no change to the API.
• Context-independent update: An API migration that is not affected by the value of the API invocation arguments.
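The sketch below illustrates the context-dependent check for the KFold example above; it only inspects the keyword arguments of a single call and is not how MLCatchUp actually detects the condition.

def migrate_kfold_kwargs(kwargs):
    """Context-dependent update: the call only needs to change when
    random_state is not None, in which case shuffle must be set to True."""
    if kwargs.get("random_state") is not None:
        kwargs["shuffle"] = True
    return kwargs

print(migrate_kfold_kwargs({"n_splits": 5, "random_state": 42}))
# {'n_splits': 5, 'random_state': 42, 'shuffle': True}
print(migrate_kfold_kwargs({"n_splits": 5}))   # unchanged: context-independent case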

7.4.2 Automated Update of Deprecated Machine Learning APIs

7.4.2.1 Architecture of MLCatchUp

Based on the findings of the previous empirical study and the fact that deprecated Python APIs are still widely used by developers, Haryono et al. [17] created an automated approach to update the usages of deprecated Python APIs.

Fig. 7.9 Architecture of MLCatchUp

The resulting tool is called MLCatchUp. It compares the difference between the deprecated API signature and the updated API signature to infer the transformations that are required to update API usages. Taking as input the API signatures and code containing deprecated API usages, MLCatchUp can automatically produce a new version of the code in which all usages of the specified deprecated APIs are replaced with the corresponding updated APIs. MLCatchUp also leverages a domain-specific language (DSL) to enhance the readability of the generated transformations. Figure 7.9 shows the architecture and pipeline of MLCatchUp. MLCatchUp takes two inputs: (1) the input Python file and (2) the input API signatures. The input Python file is the file to be updated. The input API signatures are the deprecated API signature and the updated API signature. MLCatchUp first transforms the input Python file into an Abstract Syntax Tree (AST) using the module provided by Python,6 creating the input file AST. The transformation inference process infers the transformations necessary for the API migration by analyzing the input API signatures. These transformations are represented as DSL commands, a series of atomic operations required for the update (e.g., rename method, rename parameter) that are to be performed sequentially. The DSL parser parses these commands into a list of operations that MLCatchUp can execute. These operations are applied to the input file AST, producing the updated AST. The code diff checker then compares the updated AST with the input file AST, listing all the code differences between the two ASTs, and outputs the update diff. According to the code differences in the update diff, the update is applied to the input Python file by making only the necessary changes to the API usages, without any modification to the original code comments and spacing. This returns the updated input file.

6 https://docs.python.org/3/library/ast.html.
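The sketch below illustrates the AST-based flavor of such a transformation using Python's ast module: it renames calls to a hypothetical deprecated function and unparses the result. MLCatchUp's actual pipeline additionally infers the operations from API signatures, uses a DSL, and preserves the original comments and spacing, which ast.unparse does not.

import ast

class RenameCall(ast.NodeTransformer):
    """Rename direct calls and attribute calls from a deprecated name to its
    replacement (a tiny subset of a 'rename method' update operation)."""
    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Call(self, node):
        self.generic_visit(node)               # handle nested calls first
        func = node.func
        if isinstance(func, ast.Name) and func.id == self.old:
            func.id = self.new
        elif isinstance(func, ast.Attribute) and func.attr == self.old:
            func.attr = self.new
        return node

source = "result = model.old_predict(x)\n"      # hypothetical deprecated API
tree = ast.parse(source)
tree = RenameCall("old_predict", "predict").visit(tree)
print(ast.unparse(tree))                        # result = model.predict(x)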

7.4.2.2 Evaluating MLCatchUp on Updating Deprecated APIs

Beyond the 112 identified deprecated APIs (from the Scikit-Learn, TensorFlow, and PyTorch libraries) used in the empirical study of Sect. 7.4.1, we additionally consider eight deprecated APIs from Spacy and Numpy, which are popular libraries for natural language processing and array processing. To provide a realistic evaluation, we collected code containing deprecated API usages from public GitHub repositories for these 120 APIs (112 APIs from the Scikit-Learn, TensorFlow, and PyTorch libraries + 8 APIs from Spacy and Numpy). Out of the 120 APIs, we found public code usages for only 68 APIs. For each of these 68 APIs, we collected at most five files containing deprecated API usages for the evaluation dataset. In total, 267 files containing 551 API usages of 68 different APIs were collected: 44 files contain Scikit-Learn API usages, 82 files contain TensorFlow API usages, 110 files contain PyTorch API usages, 22 files contain Spacy API usages, and 9 files contain Numpy API usages. A key feature of MLCatchUp is its use of type inference. Two different measurements are performed on the three ML libraries: (1) the detection rate without type inference and (2) the detection rate with type inference. The results show that, without type inference, MLCatchUp is unable to detect all of the deprecated API usages, achieving a detection rate of 93.7%. After integrating type inference into MLCatchUp, the detection rate is boosted to 100.0%. The effect is even more pronounced for the Spacy and Numpy libraries: without type inference, MLCatchUp only detects 43.9% of the deprecated API usages, and the integration of type inference again leads to a 100.0% detection rate.
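To see why type inference matters for detection, note that a deprecated method is often invoked on a variable whose library type is not visible at the call site. The toy sketch below infers receiver types only from simple constructor assignments in the same file, an assumption far simpler than the actual type inference used in MLCatchUp; the deprecated method name is fictional.

import ast

DEPRECATED = {("KFold", "get_n_splits_old")}   # hypothetical (class, method) pair

def detect_deprecated_usages(source):
    """Infer receiver types from simple 'var = ClassName(...)' assignments,
    then flag calls var.method() whose (type, method) pair is deprecated."""
    tree = ast.parse(source)
    var_types = {}
    for node in ast.walk(tree):
        if (isinstance(node, ast.Assign) and isinstance(node.value, ast.Call)
                and isinstance(node.value.func, ast.Name)
                and len(node.targets) == 1 and isinstance(node.targets[0], ast.Name)):
            var_types[node.targets[0].id] = node.value.func.id
    hits = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)):
            receiver_type = var_types.get(node.func.value.id)
            if (receiver_type, node.func.attr) in DEPRECATED:
                hits.append(node.lineno)
    return hits

code = "cv = KFold(5)\nn = cv.get_n_splits_old(X)\n"
print(detect_deprecated_usages(code))   # [2]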

7.4.3 Other Studies

As discussed in Sect. 7.3.4, a large body of work has been done on analyzing and updating deprecated Android APIs and other software ecosystems. However, the collateral evolution of ML APIs and AI applications is still an under-explored research area. Zhang et al. [44] studied how Python framework APIs evolve and the characteristics of the compatibility issues induced by such evolution. They found that the API evolution patterns in Python frameworks (including TensorFlow) largely differ from those in other libraries such as Java APIs. Zhang et al. [43] conducted a large-scale study on 6329 API changes mined from the TensorFlow 2 framework. These API changes were classified into six functional categories, and ten reasons for the changes were identified using a card sorting approach.


7.5 Open Problems and Future Work

In the previous sections, we have highlighted a number of studies that analyze how CE has become a challenging problem with the growth of modern software. We have also detailed several existing solutions specifically designed to handle CE in different software ecosystems. Despite the many existing studies in this area, we believe much more work can be done to handle the challenge better. We highlight some of the open problems and potential future work in this section.

An Interesting Avenue for Future Work Is to Combine Many Different Sources of Information (such as Crowdsourced Data) and Leverage Them to Deal with CE Since CE is common in many software ecosystems and developers find it challenging to handle well, they may seek potentially better solutions on crowdsourced platforms, such as Stack Overflow7 and GitHub.8 There are many questions on Stack Overflow related to properly handling CE. In the example shown in Fig. 7.10, the question is about how to handle the change of the API window.ActiveXObject in IE11. An answer, upvoted multiple times by other developers in the community, was accepted by the person who asked the question. Such high-quality information from crowdsourced platforms is valuable for handling CE, especially in complex cases.

Automatically Characterizing CE Software Changes at a Large Scale The patterns of software changes that handle CE vary across software ecosystems. At the same time, they are valuable for gaining a deeper understanding of CE and for providing insights into solution design. Thus, researchers usually build a taxonomy of the changes as a first step (e.g., [36]). However, existing work on characterizing CE software changes usually involves extensive manual annotation, since the studies are carried out on a large number of changes. This becomes a barrier, especially for understanding the state of CE in different software ecosystems at a large scale. We believe that automated methods to mine and characterize these changes, coupled with how they can support automatic updates, are promising research directions.

Leveraging Large Pre-trained Models to Support CE The existing work on supporting CE in various ecosystems is usually guided by limited information, e.g., an updated example of some specific API. Recently, large pre-trained models of code (e.g., CodeBERT [13], CodeT5 [39], and GraphCodeBERT [14]) have achieved great success in many software engineering tasks, such as code search, defect prediction, and automatic code review. However, utilizing the knowledge embedded in large-scale open-source data learned by these models to support CE remains unexplored.

7 https://stackoverflow.com. 8 https://github.com.


Fig. 7.10 An example question with answer on handling CE

7.6 Conclusion

In this chapter, we discussed the phenomenon of collateral evolution (CE) in software ecosystems. We introduced works that support CE in three important software ecosystems: Linux, Android, and machine learning frameworks. For the Linux ecosystem, we discussed how to recommend code changes for automatically backporting Linux device drivers and a tool, SPINFER, that infers semantic patches for the Linux kernel. For the Android ecosystem, we first introduced an empirical study on deprecated-API usage updates in Android and then presented two tools that can automatically update such deprecated APIs. Similarly, we characterized the updates of deprecated ML API usages and discussed an automated tool that supports making such updates. We also covered other studies related to CE in the three ecosystems and discussed some open problems and future work.


References 1. Andersen, J., Nguyen, A.C., Lo, D., Lawall, J.L., Khoo, S.C.: Semantic patch inference. In: International Conference on Automated Software Engineering (ASE), pp. 382–385 (2012) 2. Andersen, J., Nguyen, A.C., Lo, D., Lawall, J.L., Khoo, S.C.: Semantic patch inference. In: International Conference on Automated Software Engineering (ASE), pp. 382–385. ACM, New York (2012). https://doi.org/10.1145/2351676.2351753 3. Asyrofi, M.H., Thung, F., Lo, D., Jiang, L.: AUSearch: accurate API usage search in GitHub repositories with type resolution. In: International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, Piscataway (2020) 4. Brito, A., Valente, M.T., Xavier, L., Hora, A.: You broke my code: understanding the motivations for breaking changes in APIs. Empirical Softw. Eng. 25, 1458–1492 (2020) 5. Brito, A., Xavier, L., Hora, A., Valente, M.T.: Why and how java developers break APIs. In: International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 255–265. IEEE, Piscataway (2018) 6. Brito, G., Hora, A., Valente, M.T., Robbes, R.: Do developers deprecate APIs with replacement messages? A large-scale analysis on Java systems. In: International Conference on Software Analysis, Evolution and Reengineering (SANER), vol. 1, pp. 360–369. IEEE, Piscataway (2016) 7. Businge, J., Serebrenik, A., Brand, M.G.: Eclipse API usage: the good and the bad. Softw. Quality J. 23(1), 107–141 (2015). https://doi.org/10.1007/s11219-013-9221-3 8. Chapin, N., Hale, J., Khan, K., Ramil, J., Than, W.G.: Types of software evolution and software maintenance. J. Softw. Maint. Evol. 13, 3–30 (2001) 9. Chou, A., Yang, J., Chelf, B., Hallem, S., Engler, D.: An empirical study of operating systems errors. In: ACM Symposium on Operating Systems Principles, pp. 73–88 (2001) 10. Cossette, B., Walker, R.J.: Seeking the ground truth: a retroactive study on the evolution and migration of software libraries. In: International Symposium on Foundations of Software Engineering (FSE) (2012) 11. Dig, D., Johnson, R.: How do APIs evolve? A story of refactoring. J. Softw. Maint. Evol. Res. Pract. 18(2), 83–107 (2006) 12. Fazzini, M., Xin, Q., Orso, A.: Automated API-usage update for Android apps. In: International Symposium on Software Testing and Analysis (ISSTA), pp. 204–215. ACM, New York (2019) 13. Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., Zhou, M.: CodeBERT: a pre-trained model for programming and natural languages. In: Findings of the Association for Computational Linguistics: EMNLP, pp. 1536–1547. Association for Computational Linguistics (2020) 14. Guo, D., Ren, S., Lu, S., Feng, Z., Tang, D., Shujie, L., Zhou, L., Duan, N., Svyatkovskiy, A., Fu, S., Tufano, M., Deng, S.K., Clement, C., Drain, D., Sundaresan, N., Yin, J., Jiang, D., Zhou, M.: GraphCodeBERT: pre-training code representations with data flow. In: International Conference on Learning Representations (ICLR 2021), Poster (2021) 15. Haryono, S.A., Thung, F., Kang, H.J., Serrano, L., Muller, G., Lawall, J., Lo, D., Jiang, L.: Automatic Android deprecated-API usage update by learning from single updated example. In: International Conference on Program Comprehension (ICPC). IEEE (2020) 16. Haryono, S.A., Thung, F., Lo, D., Jiang, L., Lawall, J., Kang, H.J., Serrano, L., Muller, G.: AndroEvolve: automated android API update with data flow analysis and variable denormalization. Empirical Softw. Eng. 27(3), 73 (2022). 
https://doi.org/10.1007/s10664-02110096-0 17. Haryono, S.A., Thung, F., Lo, D., Lawall, J., Jiang, L.: Characterization and automatic updates of deprecated machine-learning api usages. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 137–147 (2021). https://doi.org/10.1109/ICSME52107. 2021.00019


18. Henkel, J., Diwan, A.: CatchUp! capturing and replaying refactorings to support API evolution. In: International Conference on Software Engineering (ICSE), pp. 274–283. IEEE, Piscataway (2005) 19. Hora, A., Robbes, R., Anquetil, N., Etien, A., Ducasse, S., Valente, M.T.: How do developers react to API evolution? The Pharo ecosystem case. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 251–260. IEEE, Piscataway (2015) 20. Jiang, J., Ren, L., Xiong, Y., Zhang, L.: Inferring program transformations from singular examples via big code. In: International Conference on Automated Software Engineering (ASE), pp. 255–266. IEEE, Piscataway (2019) 21. Kang, H.J., Thung, F., Lawall, J., Muller, G., Jiang, L., Lo, D.: Automating program transformation for Java using semantic patches. In: European Conference on Object-Oriented Programming (ECOOP) (2019) 22. Lawall, J., Muller, G.: Coccinelle: 10 years of automated evolution in the Linux kernel. In: USENIX Annual Technical Conference, pp. 601–614 (2018) 23. Lawall, J., Palinski, D., Gnirke, L., Muller, G.: Fast and precise retrieval of forward and back porting information for Linux device drivers. In: USENIX Annual Technical Conference, pp. 15–26 (2017) 24. Li, D., Li, L., Kim, D., Bissyandé, T.F., Lo, D., Le Traon, Y.: Watch out for this commit! A study of influential software changes. J. Softw. Evol. Process 31(12), e2181 (2019) 25. Li, L., Gao, J., Bissyandé, T.F., Ma, L., Xia, X., Klein, J.: Characterising deprecated Android APIs. In: International Conference on Mining Software Repositories (MSR), pp. 254–264. ACM, New York (2018) 26. Lientz, B.P., Swanson, E.B.: Software maintenance management: a study of the maintenance of computer application software in 487 data processing organizations. Addison-Wesley (1980) 27. Meng, N., Kim, M., McKinley, K.S.: Sydit: creating and applying a program transformation from an example. In: Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 440–443 (2011) 28. Meng, N., Kim, M., McKinley, K.S.: Lase: locating and applying systematic edits by learning from examples. In: International Conference on Software Engineering (ICSE), pp. 502–511. IEEE, Piscataway (2013). https://doi.org/http://dl.acm.org/citation.cfm?id=2486788.2486855 29. Padioleau, Y., Lawall, J., Muller, G.: Understanding collateral evolution in Linux device drivers. In: SIGOPS/EuroSys European Conference on Computer Systems (ECCS), pp. 59– 71. ACM, New York (2006) 30. Ramkisoen, P.K., Businge, J., van Bladel, B., Decan, A., Demeyer, S., De Roover, C., Khomh, F.: Pareco: patched clones and missed patches among the divergent variants of a software family. In: Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 646–658 (2022) 31. Robbes, R., Lungu, M., Röthlisberger, D.: How do developers react to API deprecation? The case of a Smalltalk ecosystem. In: International Symposium on Foundations of Software Engineering (FSE), p. 56. ACM, New York (2012) 32. Rodriguez, L.R., Lawall, J.: Increasing automation in the backporting of Linux drivers using Coccinelle. In: European Dependable Computing Conference (EDCC), pp. 132–143. IEEE, Piscataway (2015) 33. Rolim, R., Soares, G., D’Antoni, L., Polozov, O., Gulwani, S., Gheyi, R., Suzuki, R., Hartmann, B.: Learning syntactic program transformations from examples. In: International Conference on Software Engineering (ICSE), pp. 404–415. 
IEEE, Piscataway (2017). https://doi.org/10. 1109/ICSE.2017.44 34. Sawant, A.A., Robbes, R., Bacchelli, A.: On the reaction to deprecation of clients of 4+ 1 popular Java APIs and the JDK. EMSE 23(4), 2158–2197 (2018) 35. Scalabrino, S., Linares-Vásquez, M., Oliveto, R., Poshyvanyk, D.: A comprehensive model for code readability. J. Softw. Evol. Process 30 (2018). https://doi.org/10.1002/smr.1958 36. Serrano, L., Nguyen, V.A., Thung, F., Jiang, L., Lo, D., Lawall, J., Muller, G.: SPINFER: inferring semantic patches for the Linux kernel. In: USENIX Annual Technical Conference, pp. 235–248 (2020)


37. Thung, F., Haryono, S.A., Serrano, L., Muller, G., Lawall, J., Lo, D., Jiang, L.: Automated deprecated-API usage update for Android apps: how far are we? In: International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 602–611. IEEE, Piscataway (2020) 38. Thung, F., Le, X.B.D., Lo, D., Lawall, J.: Recommending code changes for automatic backporting of Linux device drivers. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 222–232. IEEE, Piscataway (2016) 39. Wang, Y., Wang, W., Joty, S., Hoi, S.C.: CodeT5: identifier-aware unified pre-trained encoderdecoder models for code understanding and generation. In: Conference on Empirical Methods in Natural Language Processing (EMNLP) (2021) 40. Xavier, L., Brito, A., Hora, A., Valente, M.T.: Historical and impact analysis of API breaking changes: a large-scale study. In: International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 138–147. IEEE, Piscataway (2017) 41. Xu, S., Dong, Z., Meng, N.: Meditor: inference and application of API migration edits. In: International Conference on Program Comprehension (ICPC), pp. 335–346. IEEE, Piscataway (2019) 42. Yang, G., Jones, J., Moninger, A., Che, M.: How do Android operating system updates impact apps? In: International Conference on Mobile Software Engineering and Systems, pp. 156– 160. ACM, New York (2018) 43. Zhang, Z., Yang, Y., Xia, X., Lo, D., Ren, X., Grundy, J.: Unveiling the mystery of API evolution in deep learning frameworks: a case study of Tensorflow 2. In: International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 238– 247 (2021). https://doi.org/10.1109/ICSE-SEIP52600.2021.00033 44. Zhang, Z., Zhu, H., Wen, M., Tao, Y., Liu, Y., Xiong, Y.: How do Python framework APIs evolve? An exploratory study. In: International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 81–92 (2020). https://doi.org/10.1109/SANER48275.2020. 9054800 45. Zhou, J., Walker, R.J.: API deprecation: a retrospective analysis and detection method for code examples on the web. In: International Conference on Software Engineering (ICSE), pp. 266– 277. ACM, New York (2016)

Part IV

Software Automation Ecosystems

Chapter 8

The GitHub Development Workflow Automation Ecosystems

Mairieli Wessel, Tom Mens, Alexandre Decan, and Pooya Rostami Mazrae

Abstract Large-scale software development has become a highly collaborative and geographically distributed endeavor, especially in open-source software development ecosystems and their associated developer communities. It has given rise to modern development processes (e.g., pull-based development) that involve a wide range of activities such as issue and bug handling, code reviewing, coding, testing, and deployment. These often very effort-intensive activities are supported by a wide variety of tools such as version control systems, bug and issue trackers, code reviewing systems, code quality analysis tools, test automation, dependency management, and vulnerability detection tools. To reduce the complexity of the collaborative development process, many of the repetitive human activities that are part of the development workflow are being automated by CI/CD tools that help to increase the productivity and quality of software projects. Social coding platforms aim to integrate all this tooling and workflow automation in a single encompassing environment. These social coding platforms gave rise to the emergence of development bots, facilitating the integration with external CI/CD tools and enabling the automation of many other development-related tasks. GitHub, the most popular social coding platform, has introduced GitHub Actions to automate workflows in its hosted software development repositories since November 2019. This chapter explores the ecosystems of development bots and GitHub Actions and their interconnection. It provides an extensive survey of the state of the art in this domain, discusses the opportunities and threats that these ecosystems entail, and

reports on the challenges and future perspectives for researchers as well as software practitioners.

M. Wessel
Radboud University, Nijmegen, Netherlands
e-mail: [email protected]

T. Mens · P. Rostami Mazrae
University of Mons, Mons, Belgium
e-mail: [email protected]; [email protected]

A. Decan
Alexandre Decan is a Research Associate of the Fonds de la Recherche Scientifique - FNRS, University of Mons, Mons, Belgium
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
T. Mens et al. (eds.), Software Ecosystems, https://doi.org/10.1007/978-3-031-36060-2_8

8.1 Introduction

This introductory section presents the necessary context to set the scene. We start by introducing collaborative software development and social coding (Sect. 8.1.1). Next, we report on the emergence and dominance of GitHub as the most popular social coding platform (Sect. 8.1.2). We continue with a discussion of the practices of continuous integration, deployment, and delivery (Sect. 8.1.3). Finally, we explain the workflow automation solutions of development bots and GitHub Actions that have emerged as highly interconnected ecosystems to support these practices and that have become omnipresent in GitHub (Sect. 8.1.4). We argue that these workflow automation solutions in GitHub constitute novel software ecosystems that are worthy of being studied in their own right. More specifically, Sect. 8.2 focuses on how development bots should be considered as an integral and important part of the fabric of GitHub's socio-technical ecosystem. Section 8.3 focuses on GitHub Actions and how this forms an automation workflow dependency network bearing many similarities with the ones that have been studied abundantly for packaging ecosystems of reusable software libraries. Section 8.4 wraps up with a discussion about how both types of automation solutions are interrelated and how they are drastically changing the larger GitHub ecosystem of which they are part.

8.1.1 Collaborative Software Development and Social Coding

The large majority of today's software is either open source or depends on it to a large extent. In response to a demand for higher-quality software products and faster time to market, open-source software (OSS) development has become a continuous, highly distributed, and collaborative endeavor [15]. In such a setting, development teams often collaborate on these projects without geographical boundaries [37]. It is no longer expected for software projects to have all their developers working in the same location during the same office hours. To achieve this new way of software development, specific collaboration mechanisms have been devised, such as issue and bug tracking, pull-based development [34], code reviews, commenting mechanisms, and the use of social communication channels to interact with other project contributors. Collaboration extends distributed software development from a primarily technical activity to an increasingly social phenomenon [55]. Social activities play an essential role in collaborative development and sometimes become as critical as technical activities. They also come with their own challenges, for


example, because of cultural differences, language barriers, or social conflicts [10, 38]. A multitude and variety of development-related activities need to be carried out during collaborative software development: developing, debugging, testing, and reviewing code; quality and security analysis; packaging, releasing, and deploying software distributions; and so on. This makes it increasingly challenging for contributor communities to keep up with the rapid pace of producing and maintaining high-quality software releases. It requires the orchestrated use of a wide range of tools such as version control systems, software distribution managers, bug and issue trackers, and vulnerability and dependency analyzers. These tools therefore tend to be integrated into so-called social coding platforms (e.g., GitLab, GitHub, Bitbucket) that have revolutionized collaborative software development practices in the last decade because they provide a high degree of social transparency to all aspects of the development process [16]. Social coding platforms aim to reconcile the technical and social aspects of software development in a single environment. They offer project contributors a seamless interface and experience to collaborate with their peers in an open and fully transparent workflow: users can report bugs and request features through an issue tracking system; external contributors can propose code changes through a pull request mechanism; core software developers can push (i.e., commit) their own code changes directly and can accept and integrate the changes proposed by external contributors; and code review mechanisms allow code changes to be reviewed by other developers before they are accepted [34].

8.1.2 The GitHub Social Coding Platform GitHub has revolutionized software development because it was the first platform to propose a pull-based software development process [3, 34]. The pull-based model distinguishes between direct contributions from a typically small group of core developers with commit access to the main code repository and indirect contributions from external contributors that do not have direct commit access. This allows external contributors to propose code changes and code additions through so-called pull requests (PRs). To do so, these contributors have to fork the main repository, update their local copies with code changes, and submit PRs requesting that these changes be pulled into the main code repository [35]. This indirect contribution method enables the project's maintainers to review the code submitted through each PR, test it, request changes from the submitter of the PR if needed, and finally integrate the PR into the codebase without getting involved in code development [35]. A pull-based development process also comes at a certain cost, since it raises the need for integrators—specialized project members responsible for managing others' contributions who act as guardians of the projects' quality [36]. The focus of this chapter will be on GitHub, since it is the largest and most popular social coding platform by far, especially for open-source projects, and as a


consequence, it has been the focus of a significant amount of empirical research. It is a web-based platform on the cloud, based on the git version control system, that hosts the development history of millions of collaborative software repositories and accommodates over 94 million users in 2022 [29]. GitHub continues to include more and more support for collaborative software development such as a web-based interface on top of the git version control system, an issue tracker, the ability to manage project collaborators, the ability to have a discussion forum for each git repository, an easy way to manage PRs or even to submit new PRs directly from within the GitHub interface, a mechanism to create project releases, the ability to create and host project websites, the ability to plan and track projects, support for analyzing outdated dependencies and security vulnerabilities, and metrics and visualizations that provide insights in how the project and its community are evolving over time. GitHub also comes with a REST and GraphQL API to query and retrieve data from GitHub or to integrate GitHub repositories with external tools. By late 2022, GitHub added a range of new features including (i) github.dev, a web-based code editor that runs entirely in the Internet browser to navigate, edit, and commit code changes directly from within the browser; (ii) GitHub CodeSpaces, a more complete development environment that is hosted in the cloud; (iii) GitHub Packages to create, publish, view, and install new packages directly from one’s code repository; (iv) GitHub CoPilot, an AI-based tool that provides smart code auto-completion; and (v) GitHub Actions, a workflow automation tool fully integrated into GitHub.

8.1.3 Continuous Integration and Deployment Continuous integration (CI), continuous deployment, and continuous delivery (CD) have become the cornerstone of collaborative software development practices. CI practices were introduced in the late 1990s in the context of agile development and extreme programming methodologies. According to the Agile Manifesto principles, “our highest priority is to satisfy the customer through early and continuous delivery of valuable software” [5]. In their seminal blog [28], Fowler and Foemmel presented CI as a way to increase the speed of software development while at the same time improving software quality and reducing the cost and risk of work integration among distributed teams. They outlined core CI practices to do so, including frequent code commits, automated tests that run several times a day, frequent and fully reproducible builds, immediately fixing broken builds, and so on. CD practices, on the other hand, aim at automating the delivery and deployment of software products, following any changes to their code [12]. Key elements of continuous deployment are the creation of feasible, small, and isolated software updates that are automatically deployed immediately after completion of the development and testing [49]. Many self-hosted CI/CD tools and cloud-based CI/CD services automate the integration of code changes from multiple contributors into a centralized repository


where automated builds, tests, quality checks, and deployments are run. Popular examples of such CI/CD solutions are Jenkins, Travis, CircleCI, and Azure DevOps. They have been the subject of much empirical research over the last decades. An excellent starting point is the systematic literature review by Soares et al. [50], covering 106 research publications reporting on the use of CI/CD. This review aimed at identifying and interpreting empirical evidence regarding how CI/CD impacts software development. It revealed that CI/CD has many benefits for software projects. Besides the aforementioned cost reduction and quality and productivity improvement, it also comes with a reduction of security risks, increased project transparency and predictability, greater confidence in the software product, easiness to locate and fix bugs, and improved team communication. CI can also be beneficial to pull-based development by improving and accelerating the integration process. CI/CD services have also been built into social coding platforms. With GitLab CI/CD, GitLab has already featured CI/CD capabilities since November 2012. Bitbucket has supported Pipelines since May 2016. Based on popular demand, in response to this support for CI/CD in competing social coding platforms, GitHub officially began supporting CI/CD through GitHub Actions in August 2019, and the product was released publicly in November 2019. Before the release of GitHub Actions, Travis used to be the most popular CI/CD cloud service for GitHub repositories [6]. However, quantitative evidence has revealed that Travis is getting replaced by GitHub Actions at a rapid pace [32]. Additional qualitative evidence has revealed the reasons behind this replacement and the added value that GitHub Actions is bringing in comparison to Travis [43].

8.1.4 The Workflow Automation Ecosystems of GitHub The previous sections have highlighted that global software development, especially for OSS projects, is a continuous, highly distributed, and collaborative endeavor. The diversity in skills and interests of the projects’ contributors and the wide diversity of activities that need to be supported (e.g., coding, debugging, testing, documenting, packaging, deploying, quality analysis, security analysis, and dependency analysis) make it very challenging for project communities to keep up with the rapid pace of producing and maintaining high-quality software releases. Solutions to automate part of the software development workflow, such as the aforementioned CI/CD tools and services, have been successfully used to reduce this maintenance burden (see Sect. 8.1.3). However, these tools do not support the entire range of project-related activities for which automation could come to the rescue. There are many repetitive and time-consuming social and technical activities for which, traditionally, CI/CD tools did not provide any support. Some examples of these are welcoming newcomers, keeping dependencies up to date, detecting and resolving security vulnerabilities, triaging issues, closing stale issues, finding and assigning code reviewers, encouraging contributors to remain active,


and software licensing. To help project contributors in carrying out these activities, CI/CD solutions have been complemented by novel workflow automation solutions: Development Bots A well-known and very popular example of such workflow automation solutions is what we will refer to as development bots. Erlenhov et al. [27] consider development bots to be artificial software developers who are autonomous and adaptive and have technical as well as social competence. Such automated software development agents have become a widely accepted interface for interacting with human contributors and automating some of their tasks. A study by Wang et al. revealed that bots are frequently used in the most popular OSS projects on GitHub [56]. These bots tend to be specialized in specific activities, belonging to the following main categories: CI/CD assistance, issue and PR management, code review support, dependency and security management, community support, and documentation generation. Section 8.2 will discuss in detail how such bots are used on GitHub to automate part of the software development workflow and how these bots form an integral part of the socio-technical ecosystem of software contributors and software projects. More specifically, bots affect the social interaction within a software project, as they influence how human contributors communicate and collaborate and may even change the collaboration patterns, habits, and productivity of project contributors [42]. GitHub Actions Another popular mechanism to automate development activities in GitHub repositories is using GitHub Actions, a workflow automation service officially released in November 2019. Its deep integration into GitHub implies that GitHub Actions can be used not only for automating traditional CI/CD services such as executing test suites or deploying new releases, but also to facilitate other activities such as code reviews, communicating with developers, and monitoring and fixing dependencies and security vulnerabilities. GitHub Actions allows project maintainers to define automated workflows for such activities. These workflows can be triggered in a variety of ways such as commits, issues, pull requests, comments, schedules, and many more [11]. GitHub Actions also promotes the use and sharing of reusable components, called Actions, in workflows. These Actions are distributed in public GitHub repositories and on the GitHub Marketplace.1 They allow developers to automate their workflows by easily integrating specific tasks (e.g., set up a specific programming language environment, publish a release on a package registry, run tests, and check code quality) without having to write the corresponding code. Only 18 months after its introduction, GitHub Actions has become the most dominant CI/CD service on GitHub [32]. Section 8.3 presents this ecosystem of reusable Actions in more detail. This ecosystem forms a technical dependency network that bears many similarities with traditional package dependency networks of reusable software libraries (such as npm for JavaScript, RubyGems for Ruby, NuGet for .NET, Packagist for PHP, CRAN for R, Maven for Java) that have been the subject of many past empirical studies (e.g., [21]).

1 https://github.com/marketplace?type=actions.


The two aforementioned workflow automation solutions are increasingly used in OSS projects on GitHub, partly because of their tight integration into the social coding platform, thereby effectively transforming the software development automation landscape. It therefore seems fair to claim that they form new development workflow automation ecosystems that are worthy of being investigated in their own right. Research on these ecosystems is still in its infancy, given the relative novelty of the proposed automation solutions. Development bots and workflows that rely on GitHub Actions are already used in hundreds of thousands of GitHub repositories, and their usage continues to increase (the Marketplace of GitHub Actions has been growing exponentially since its introduction), justifying the need for further studies on the evolution of these ecosystems and their impact on collaborative software development practices.

8.2 Workflow Automation Through Development Bots As explained in Sect. 8.1.4, development bots emerged on social coding platforms such as GitHub to enable the automation of various routines and time-consuming tasks previously assigned only to human developers. This section explores how bots are an integral part of GitHub’s socio-technical collaborative development ecosystem. Considering the workflow automation provided by development bots, we focus on the various usage scenarios, advantages, shortcomings, challenges, and opportunities of using them.

8.2.1 What Are Development Bots? Development bots that reside on social coding platforms such as GitHub are often seen as workflow automation providers due to their ability to react to certain stimuli, such as events triggered by human developers or other tools, and automate routine development-related tasks in response. To a certain extent, bots may act autonomously [27, 64]. In open-source repositories, bots can leverage the public availability of software assets, including source code, discussions, issues, and comments. Besides automatically executing activities, development bots may also exhibit human-like traits. Erlenhov et al. [27] describe bots based on their social competence, which varies from very simple identity characteristics (e.g., a human-like name or profile picture) to more sophisticated ones such as artificial intelligence and the ability to adapt to distinct scenarios. In practice, bots that are active in GitHub repositories are automated agents that interact with the GitHub platform in essentially the same way as a typical human developer would be expected to: they possess a GitHub account, commit code, open or close issues or PRs, and comment on all of the above. Some bots have an official integration with GitHub and


are publicly available as Apps in the GitHub marketplace.2 These official bots are properly tagged as such in the various activities they perform on the GitHub platform. Bots can also be used as an interface between human developers and other software services, such as external CI/CD tools or other third-party applications. Such bots provide additional value on top of the services they offer an interface for, by providing new forms of interaction with these services, or by combining multiple services. One particularly interesting example is Dependabot, a dependency management tool responsible for creating PRs in GitHub repositories to propose dependency upgrades in order to resolve or reduce the risk of security vulnerabilities or bugs. Dependabot acts as an interface between the project maintainer, who is responsible for keeping the project dependencies up to date, and the package managers (such as npm for JavaScript) that expose the reusable packages that the project depends on. While it originally started as a third-party service, Dependabot is now deeply integrated into the GitHub platform and has become one of the most popular dependency management bots, accounting for more than 7.7 million dependency updates in OSS projects [65]. A well-known alternative is renovatebot.3 Bots can even create, review, and decide whether to integrate the changes made in a PR into the repository by themselves in complete autonomy. Figure 8.1 provides an example of multiple bots interacting as part of a single PR. There is not a single human contributor involved in this interaction. The PR is triggered by a recommendation by Dependabot to update a dependency. The mergify bot reacts to this by verifying that the proposed change passes all checks, and accepts and merges the PR. Finally, nemobot reacts with a visual comment applauding the merged PR. From a research viewpoint, the increasing use of bots raises the need for large-scale empirical studies on bot usage in social coding platforms such as GitHub. Such studies enable us to assess whether bots serve their intended purpose and whether their introduction has any positive or negative side effects on the socio-technical fabric of the project or ecosystem in which they are used. To enable such empirical studies, it is necessary to determine which projects rely on bots and which user accounts actually correspond to bots. Several bot detection heuristics have been proposed to automatically identify bot contributions [1, 23, 30]. BIMAN [23] relies on bot naming conventions, repetitiveness in commit messages, and features related to files changed in commits. BoDeGHa [30] relies on comment patterns in issue and PR comments in GitHub repositories, based on the assumption that bots tend to use different and fewer comment patterns than humans. BotHunter [1] additionally relies on features corresponding to profile information (e.g., account name) and account activity (e.g., median daily activity) to identify bot accounts

2 https://github.com/marketplace?type=apps. 3 https://github.com/renovatebot/renovate.


Fig. 8.1 Example of multiple bots interacting within the same PR


more accurately. BoDeGiC [31] can detect bots in git repositories based on commit messages and has been trained using the classification model of BoDeGHa. An important challenge when identifying automated contributions by bots is the presence of so-called mixed accounts—accounts used by a human developer and a bot in parallel—exhibiting both human-like and bot-like behavior [30]. Not properly detecting such cases is likely to lead to false positives and false negatives during bot detection, which may affect the outcome of empirical analyses. Cassee et al. [9] have shown that existing classification models are not suitable to reliably detect mixed accounts.
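To make this more tangible, the following minimal configuration sketch shows how a dependency management bot such as Dependabot (discussed earlier in this section) is typically enabled: a .github/dependabot.yml file asks it to check the npm dependencies of a repository once a week and to open update PRs. The values shown are illustrative assumptions and are not taken from a specific project.

    version: 2
    updates:
      - package-ecosystem: "npm"     # package manager whose manifests are monitored
        directory: "/"               # location of the package.json file
        schedule:
          interval: "weekly"         # how often to look for new releases
        open-pull-requests-limit: 5  # cap on simultaneously open update PRs

Once such a file is committed, the bot account starts opening and updating PRs autonomously; these automated contributions are precisely what the bot detection techniques discussed above need to distinguish from human activity.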

8.2.2 The Role of Bots in GitHub's Socio-technical Ecosystem An important characteristic of bots is that they form an integral part of GitHub's socio-technical ecosystem of collaborative software development. To consider them as such, we adopt an ecosystemic and socio-technical viewpoint, similar to Constantinou and Mens [14] who viewed a software ecosystem as a socio-technical network that is composed of a combination of technical components (e.g., software projects and their source code history) and social components (e.g., contributors and communities involved in the development and maintenance of the software). An interesting novelty of bots is that while they are technical components themselves (since they are executable software artifacts), they should also be considered as being social components, since they play a crucial role in the social aspects of the ecosystem. The assistance provided by bots, as new voices in the development conversation [45], has the potential to smooth and improve the efficiency of developers' communication. Wessel et al. [60] have shown that the number of human comments decreases when using bots, which usually implies that the number of trivial discussions decreases. Indeed, bots are meant to relieve, augment, and support the collaborative software development activities that are carried out by the human contributors that jointly develop and maintain large software projects. Moreover, bots often interact with human collaborators (and with other bots) using the same interface as humans do. Figure 8.2 illustrates an exemplary case of the role that bots play in this socio-technical ecosystem. A human contributor submits a PR to add tests to a particular project module. The first bot to react to the PR, changeset-bot, verifies whether the changeset file was updated so that the proposed change will be released into a specific version of the packages implemented in the repository. Then, vercel bot deploys the code to the third-party application Vercel and provides a URL for the developers to inspect a deployment preview in the PR. Next, the codesandbox-ci bot provides the URL of an isolated test environment to validate the changes made in the PR. Finally, the human project maintainer approves the changes, reacts with a comment, and merges the PR. Like the many roles human software developers can fulfill, a variety of bots have become highly active actors in every phase of the development automation


Fig. 8.2 Example of an interaction between two humans and three bots within a single PR


workflow. Thanks to the continuous efforts of practitioners and researchers, a wide range of development bots are available for use by developers [42, 56]. Wang et al. [56] have shown that bot usage is common practice in OSS development. Through repository mining of 613 GitHub repositories, they found 201 different bots. Similar to prior research by Wessel et al. [61], the authors provided a classification of bots according to their main role in the repository. These categories include CI/CD assistance, issue and PR management, code review support, dependency and security management, community support, and documentation generation. In addition to the aforementioned examples of bots, other sophisticated bots have been proposed in the literature. Wyrich and Bogner [64], for example, proposed a bot that automatically refactors the source code of a project. Their goal was to eliminate the need for developers to manually find and correct code smells, as this task can be very time-consuming and may require certain expertise. Therefore, the bot was designed to act autonomously, integrating into the natural workflow of the development team on GitHub. The bot makes code changes corresponding to proposed code refactorings and submits a PR with these changes. Project maintainers can review these changes and decide to integrate them into the codebase.

8.2.3 Advantages of Using Development Bots Development bots generally execute tasks that would otherwise have to be performed manually by humans. Through interviews with industry practitioners, Erlenhov et al. [26] found that bots are used either because they improve productivity or enable activities that are not realistically feasible for humans [26]. Some software practitioners stress that bots are able to carry out certain tasks better than humans due to their availability, scalability, and capacity to process large amounts of data [26]. For example, bots can handle tasks continuously 24/7 without ever needing to take a break. Song and Chaparro [51] designed BEE, a bot that automatically analyzes incoming issues on GitHub repositories and immediately provides feedback on them. Due to BEE’s prompt reaction, issue reporters can more quickly gain a general idea of what is missing without waiting for the project maintainers’ feedback. Bots also scale, increase consistency, and mitigate human errors. In terms of productivity increase, bot usage is frequently motivated by the necessity of spending less time on routine, time-consuming, or tedious tasks. Automating such activities through bots allows developers to focus on their core code development and review tasks [26, 53]. Mirhosseini and Parnin [44] analyzed automated PRs created by greenkeeper, a bot to update dependencies, similar to the ones created by dependabot. Such a bot avoids manually monitoring for new releases in the packages. The results show that OSS repositories that use the bot upgraded the dependencies 1.6 times more regularly than repositories that did not use any other bots or tools.


Specifically in the context of code reviews, Wessel et al. [59] carried out a survey with 127 software project developers to investigate the advantages of adopting bots to support code review activities. Their study confirmed the results of Erlenhov et al. [26]. The main reasons for adopting bots are related to improving developer feedback, automating routine tasks, and ensuring high-quality standards. Interestingly, developers also report benefits related to interpersonal relationships. According to the surveyed developers, negative feedback in an automatic bot report feels less rude or intimidating than if a human would provide the same feedback. They also report that by providing quick and constant feedback, bots reduce the chance that a PR gets abandoned by its author. Bots can also help to support developers unfamiliar with a software project or with specific software engineering practices and technologies. For example, Brown and Parnin [8] propose a bot to nudge students toward applying better software engineering practices. They designed a bot that provides daily updates on software development processes based on students’ code contributions on GitHub. They show that such a bot can improve development practices and increase code quality and productivity. The use of bots to automate development workflows can also result in a change in the habits of project contributors. Wessel et al. [58] investigated how activity traces change after the adoption of bots. They observed that after bot adoption, projects have more merged PRs, fewer comments, fewer rejected PRs, and faster PR rejections. Developers explain that some of these observed effects are caused by increased visibility of code quality metrics, immediate feedback, test automation, the increased confidence in the process, change in the discussion focus, and the fact that bot feedback pushes contributors to take action. In summary, the literature suggests that developers who employ bots primarily expect improved productivity [26, 59]. This, however, surfaces in different ways depending on the context and the tasks the bot performs. Automating time-consuming or tedious tasks and collecting dispersed information (i.e., information gathering) are some ways to improve productivity. Developers also emphasize that bots may perform some tasks better than humans (e.g., handling tasks 24/7 and at scale, increasing consistency, and mitigating human error).

8.2.4 Challenges of Using Development Bots Despite the numerous benefits leveraged by using development bots, several challenges have been reported concerning the workflow automation provided by them [26, 63]. Some bots have been studied in detail, revealing the challenges and limitations of their PR interventions [7, 44, 46]. Trust Trusting a bot to act appropriately and reliably is challenging [26]. A side effect of overly relying on bots is that humans no longer question whether these bots are taking the correct actions since they assume bots to be experts in their tasks.


Therefore, developers can be caught off guard by excessive incorrect outcomes from bots [26]. A key solution to increase trust is building a reliable testing environment that allows developers to try out bots and avoid unanticipated problems. Discoverability and Configuration To confirm the challenges caused by development bots in PR interactions, Wessel et al. [63] interviewed 21 practitioners. Their study revealed several challenges raised by bot usage, such as discoverability and configuration issues. Developers complained about the lack of contextualized actions, limited and burdensome configuration options, and technical overhead to host and deploy their own bot. Moreover, the overload of information generated by bots when interacting on PRs has appeared as the most prominent challenge. Interruption and Noise Developers constantly struggle with interruptions and noise produced by bots [26]. For instance, Brown and Parnin [7] analyzed tool-recommender-bot, a bot that automatically configures a project to use an open-source static analysis tool for Java code and then submits a PR with a generic message explaining how the proposed tool works. They reported that this bot still needs to overcome problems such as notification workload. They applied tool-recommender-bot in real projects for evaluation purposes. Only two PRs out of 52 proposed recommendations were accepted. Peng and Ma [46] studied how developers perceive and work with mention bot, a reviewer recommendation bot created by Facebook. It automatically tags a potential reviewer for a PR depending on the files changed. Project maintainers with higher expertise (i.e., maintainers who contributed more frequently) in a particular file are more likely to be suggested as reviewers by the bot. The study found that mention bot reduced contributors’ effort in identifying proper reviewers. As a negative side effect, however, developers were bothered by frequent review notifications when dealing with a heavy workload. Wessel et al. [63] introduced a theory about how certain bot behaviors can be perceived as noisy. Indeed, many bots provide several comments when an issue or PR is opened by a contributor, with dense information and frequently overusing visual elements. Similarly, bots perform repetitive actions such as creating numerous PRs (e.g., to update the many dependencies a project can have) and leaving dozens of comments in a row (e.g., to report on test coverage each time a new commit is added to the PR). These situations can lead to information and notification overload, disrupting developers’ communication. Oftentimes, the problem is not a singular bot that is too verbose, but a combination of multiple bots that are simultaneously active and, together, lead to information overload [26]. Researchers have attempted to create solutions to reduce the information overload created by bots. Wessel et al. [57] suggested creating better ways to represent the information of bots, such as clearer summaries of pull requests. Ribeiro et al. [47] implemented FunnelBot that integrated these suggestions. Figure 8.3 shows an example of a PR comment posted by FunnelBot. The comment shows (A) an introductory message, (B) a list with all groups of bot messages collapsed, and (C) one expanded example where we can see the CodesandBox comment.


Fig. 8.3 Example of PR comment created by FunnelBot

8.3 Workflow Automation Through GitHub Actions As explained in Sect. 8.1.4, software development workflows can be automated using different techniques, including CI/CD solutions (presented in Sect. 8.1.3) and development bots (presented in Sect. 8.2). The third way is GitHub Actions, which is the focus of the current section. We explain what GitHub Actions are, how prevalent they are, and how they constitute an ecosystem of their own. We also discuss the potential challenges this novel ecosystem is confronted with.

8.3.1 What Is GitHub Actions? The GitHub social coding platform introduced GitHub Actions as a way to enable the specification and execution of automated workflows. It started as a beta product in 2018, providing the possibility to create Actions inside containers to augment and connect software development workflows. When the product was officially released to the public in November 2019, GitHub Actions also integrated a fully featured CI/CD service, answering the high demand of GitHub users for CI/CD support similar to what was already available in competing social coding platforms such as GitLab and Bitbucket [16]. Since its introduction, GitHub Actions has become the dominant CI/CD service on GitHub, according to a quantitative study by Golzadeh et al. [32] covering more than 90K GitHub repositories. Figure 8.4 provides a historical overview of CI/CD usage in those repositories, starting from the first observation of Travis usage in June 2011. Initially, GitHub repositories primarily used Travis as a CI/CD service. Over


Fig. 8.4 Evolution of the proportion of GitHub repositories using a specific CI/CD solution

time, other CI/CD solutions were used, but Travis remained the dominant CI/CD service. When GitHub Actions entered the CI/CD landscape, it overtook the other CI/CD solutions in popularity within less than 18 months after its introduction. Mazrae et al. [43] complemented this quantitative analysis with qualitative interviews to understand the reasons behind GitHub Actions becoming the dominant CI/CD tool in GitHub, as well as why project maintainers decided to migrate primarily to GitHub Actions. The main reported reasons were the seamless integration into GitHub, the ease of use, and great support for its reusable Actions. GitHub Actions allows repository maintainers to automate a wide range of tasks. In addition to providing typical CI/CD services such as building code, executing test suites, and deploying new releases, the tight integration of GitHub Actions into GitHub enables better support for third-party tools, build support for well-known operating systems and hardware architectures, and more scalable cloud-based hardware to produce results faster. GitHub Actions also facilitates the communication between the project and external tools (such as third-party CI/CD services) and eases dependency and security monitoring and management [22]. Specifying Executable Workflows GitHub Actions is based on the concept of so-called executable workflows that can be defined by maintainers of GitHub repositories. The structure of a workflow is schematically presented in Fig. 8.5 and explained below. A workflow constitutes a configurable automated process that is defined by a YAML file added to the .github/workflows directory of the GitHub repository. A workflow can be executed based on events specified in the workflow description that act as a trigger for running the workflow. Examples of such triggers are commits, issues, PRs, comments, schedules, or even manual invocation [11]. The example workflow in Listing 8.1 (lines 3–6) defines three possible triggers: upon committing (push:) or receiving a PR (pull_request:), or based on a specified time schedule (cron: "0 6 * * 1").
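Beyond these three triggers, the same on: section can react to many other repository events. As a small illustrative sketch (not taken from an actual project), a workflow could also be started manually or in response to issue activity:

    on:
      workflow_dispatch:          # manual invocation from the GitHub web interface or API
      issues:
        types: [opened, labeled]  # react to issues being opened or labeled
      issue_comment:
        types: [created]          # react to new comments on issues and PRs

Such event-driven triggers are what allow workflows to take over tasks that were traditionally delegated to development bots, such as triaging newly opened issues.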


Fig. 8.5 Schematic representation of the structure of a GitHub workflow description

A workflow typically runs one job in some virtual environment that is created to execute the job (e.g., an instance of some specific version of Ubuntu, macOS, or Microsoft Windows). A workflow can also execute multiple jobs, in parallel (by default) or sequentially. Workflows can define a matrix strategy to automatically create and run parallel jobs based on the combination of variable values defined by the matrix. This is, for example, useful if one would like to build and test source code in multiple versions of a programming language and/or on multiple operating systems. In the example of Listing 8.1, the matrix strategy (lines 10–13) specifies that the job will be run on five different versions of Python for two different operating systems. To run a workflow specified in a GitHub repository, developers can use the infrastructure provided by GitHub, or rely on self-hosted runners if more specific hardware or operating systems are needed. Each job is composed of a series of steps that specify the tasks to be executed sequentially by the job. These steps can be simple shell commands to be run within the virtual environment (such as lines 22–24 in Listing 8.1). Alternatively, steps can use and execute predefined reusable Actions, which will be discussed below.


Listing 8.1 Example of a YAML workflow file

 1  name: Test project
 2  on:
 3    push:
 4    pull_request:
 5    schedule:
 6      - cron: "0 6 * * 1"
 7
 8  jobs:
 9    build-and-test:
10      strategy:
11        matrix:
12          os: [ubuntu-22.04, windows-latest]
13          python: ["3.6", "3.7", "3.8", "3.9", "3.10"]
14      runs-on: ${{ matrix.os }}
15      steps:
16        - uses: actions/checkout@v2
17        - name: Set up Python
18          uses: actions/setup-python@v2
19          with:
20            python-version: ${{ matrix.python }}
21        - name: Install dependencies
22          run: |
23            pip install -r requirements.txt
24            pip install pytest
25        - name: Execute tests
26          run: pytest

Reusable Actions Actions provide a reuse mechanism for GitHub workflow maintainers to avoid reinventing the wheel when automating repetitive activities [12]. Rather than manually defining the sequence of commands to execute as part of a step (such as lines 22–24 in Listing 8.1), it suffices to use a specific (version of a) reusable Action. For example, line 16 in Listing 8.1 (re)uses version 2 of actions/checkout, and line 18 (re)uses version 2 of actions/setup-python. Actions are themselves developed through GitHub repositories.4 Workflows can reuse any Action shared in a public repository. To facilitate finding such Actions, the GitHub Marketplace provides an interface for providers to promote their Actions and for consumers to easily search for suitable Actions.5 The number of Actions promoted on the Marketplace has been growing exponentially. By December 2022, the Marketplace listed over 16K reusable Actions falling

4 The GitHub repositories for the Actions reused in Listing 8.1 are https://github.com/actions/checkout and https://github.com/actions/setup-python.
5 See https://github.com/marketplace. In addition to Actions, the marketplace also promotes Apps, which are applications that can contain multiple scripts or an entire application.


under 19 different categories. These categories contain a wide diversity of Actions, covering tasks such as setting up a specific programming language environment, publishing a release on a package registry, running tests, or checking the code quality [22].
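To illustrate how such a reusable Action is itself defined, the following sketch shows a hypothetical composite Action, i.e., an action.yml file at the root of its own GitHub repository (the Action name and its input are invented for illustration purposes):

    name: "Greet contributor"
    description: "Posts a greeting message (illustrative example)"
    inputs:
      who-to-greet:
        description: "Name of the person to greet"
        required: true
        default: "contributor"
    runs:
      using: "composite"
      steps:
        - run: echo "Hello, ${{ inputs.who-to-greet }}!"
          shell: bash

Once published in a public repository (and optionally promoted on the Marketplace), such an Action can be referenced from any workflow in a uses: step, in the same way as actions/checkout and actions/setup-python are referenced in Listing 8.1.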

8.3.2 Empirical Studies on GitHub Actions Given that GitHub Actions was publicly introduced in 2019, and despite the fact that GitHub Actions has become the dominating CI/CD solution on GitHub (according to Golzadeh et al. [32]), very few empirical studies have focused on GitHub Actions at the time of writing this chapter. An early quantitative study by Kinsman et al. [40] in 2021 reported that in a dataset of 416,266 GitHub repositories, only as little as 3190 repositories (i.e., less than 1%) had been using GitHub Actions. In 2022, Wessel et al. [62] studied a dataset composed of the 5000 most-starred GitHub repositories and observed that 1489 projects (i.e., 29.8%) had been using GitHub Actions. Also in 2022, Decan et al. [22] reported on a dataset of 67,870 active GitHub repositories in which 29,778 repositories (i.e., 43.9%) had been using GitHub Actions. These quantitative results reveal that GitHub Actions is prevalent in software development repositories on GitHub. To complement these quantitative findings, in 2023, Saroar and Nayebi [48] carried out a survey with 90 GitHub developers about the best practices and perception in using and maintaining GitHub Actions. Table 8.1 reports the top six programming languages that most frequently coincide with GitHub Actions usage according to Decan et al. [22]. They observed that some programming languages are more likely to coincide with GitHub Actions usage than others: TypeScript and Go have a higher proportion of repositories resorting to GitHub Actions usage (58.5% and 57.2%, respectively) compared to JavaScript (34.9%). It is worth noting that the percentages of repositories using GitHub Actions are reported with respect to the language itself. For example, the number of Python repositories using GitHub Actions is 1.52 times higher

Table 8.1 Top six languages with the highest proportion of GitHub repositories using GitHub Actions according to [22]

programming language    all GitHub repositories    repositories using GitHub Actions
JavaScript              13,542 (19.6%)             4730 (34.9%)
Python                  12,319 (17.8%)             5654 (45.9%)
TypeScript              6362 (9.2%)                3722 (58.5%)
Java                    6105 (8.8%)                2390 (39.2%)
C++                     5701 (8.2%)                2331 (40.9%)
Go                      4988 (7.2%)                2854 (57.2%)


(5654) than the TypeScript repositories using GitHub Actions (3722). One can also observe the number of repositories for a specific language and its proportion to all the repositories in the dataset. For example, the 13,542 JavaScript repositories correspond to 19.6% of all the repositories in the dataset. The same study also analyzed which event types are most frequently used for triggering workflows, reporting that push: and pull_request: are the most frequent events triggering workflows, both used by more than half (63.4% and 56.3%, respectively) of all considered GitHub repositories relying on workflows. This is not surprising since commits and PRs are the most important activities in collaborative coding on GitHub. The most frequently used Action is actions/checkout (used by 35.5% of all steps and 97.8% of all repositories). Other frequently used Actions are related to the deployment of a specific programming language environment (e.g., setup-node, setup-python, setup-java). Overall, 24.2% of all steps use an Action of the form setup-*. Finally, they observe that it is common practice to depend on reusable Actions, given that nearly all repositories (>99%) that use workflows have at least one step referring to an Action. More than half of the steps in all analyzed workflows (51.1%) use an Action. However, this reuse is concentrated toward a limited set of Actions. For example, the Actions that are officially provided by GitHub (i.e., those actions of the form actions/*) account for 71.7% of all steps that reuse an Action. In addition to this, the Actions being reused tend to be concentrated in a few categories. Table 8.2 provides the top five categories of Actions used by GitHub repositories, as reported by two independently conducted empirical studies [22, 62]. Most of the reused actions belong to the "Utilities" and "Continuous Integration" categories, followed by "Deployment." This suggests that GitHub Actions is being used mostly to automate the same kinds of activities as what traditional CI/CD tools are being used for. Wessel et al. [62] statistically studied the impact of using workflows on different aspects of software development like PRs, commit frequency, and issue resolution efficiency. By comparing the activities in projects using GitHub Actions, during one full year before the use of GitHub Actions in the project and one full year after its usage, they used the technique of regression discontinuity analysis to provide statistical evidence and showed that after adding GitHub Actions to projects, there tend to be fewer accepted PRs, with more discussion comments and fewer commits,

Table 8.2 Top five most frequent Action categories according to [22, 62]

Action category          % reported by [62]    % reported by [22]
utility                  24.9%                 23.9%
continuous integration   24.7%                 17.3%
deployment               9.6%                  7.2%
publishing               8.4%                  6.9%
code quality             7.7%                  6.1%


which take more time to merge. On the other hand, there are more rejected PRs, which contain fewer comments and more commits. Wessel et al. [62] studied discussions between developers about the usage of GitHub Actions in their software projects. Out of the 5000 analyzed GitHub repositories, only 897 (18%) had the Discussions feature enabled at the time of data collection, and 830 of those (17%) contained at least one discussion thread. Focusing on this subset of repositories, they filtered the discussions containing the string “GitHub Actions,” resulting in 573 posts in 458 distinct threads of 148 repositories. The most discussed material about GitHub Actions, found in 28.8% of all considered posts, was the need for help with GitHub Actions. This reveals that developers actively sought to learn more about how to use workflows effectively. A second popular category of discussion in the context of GitHub Actions, found in 19.0% of all considered posts, was error messages or debug messages. Developers were trying to solve issues related to using workflows and applications invoked via these workflows, such as linters or code review bots. A third popular category, accounting for 14.6% of all considered posts, involved discussions around reusing Actions. This is expected, given that Actions are a relatively new concept that many developers are not familiar with.

8.3.3 The GitHub Actions Ecosystem As mentioned in Sect. 8.1.4, GitHub Actions is part of the larger workflow automation ecosystem of GitHub that also includes bots and CI/CD solutions for automating development workflows in collaborative software projects. Decan et al. [22] suggested that GitHub Actions can and should be considered as a new emerging ecosystem in its own right. Indeed, the GitHub Actions technology exhibits many similarities with more traditional software packaging ecosystems such as npm (for JavaScript), Cargo (for Rust), Maven (for Java), or PyPI (for Python), to name but a few. Just as software development repositories on GitHub tend to depend on external packages distributed through the above package managers—mainly to avoid the effort-intensive and error-prone practices of copy-paste reuse—the same is valid for development workflows. Maintainers of GitHub repositories can specify their workflows to directly depend on reusable Actions. As such, GitHub Actions forms a kind of dependency network that bears many similarities with the ones of software packaging ecosystems [18]. The parallel with packaging ecosystems is quite obvious: automated workflows, as software clients, express dependencies toward Actions (being the equivalent of reusable packages) that can exist in different versions or releases. Section 8.3.2 reported on quantitative evidence that resorting to reusable Actions in workflows has become a common practice. Continuous Growth Decan et al. [18] carried out a quantitative empirical analysis of the similarities and differences in the evolution of the dependency networks for seven different packaging ecosystems of varying sizes and ages, including Cargo,


Fig. 8.6 Evolution of the number of GitHub repositories using workflows (blue line) and the number of Actions used by these repositories (orange line, scaled by a factor of 10 for ease of comparison)

CPAN, CRAN, npm, NuGet, Packagist, and RubyGems. They observed that these dependency networks tend to grow over time, both in size and in number of package updates. While the vast majority of packages depend on other packages, only a small proportion of these packages account for most of the reuse (i.e., they are targeted by most of the reverse dependencies). Decan et al. [22] conducted a quantitative analysis of GitHub Actions and observed similar characteristics for the GitHub Actions dependency network: nearly all the repositories with GitHub Actions workflows depend on reusable Actions, and most of the reuse is concentrated in a limited number of Actions. They analyzed the evolution of the number of repositories using GitHub Actions workflows and the number of Actions being used by these repositories. Figure 8.6 shows this evolution for the period 2020–2021, revealing a continuous growth of the GitHub Actions ecosystem, in terms of the number of consumers (repositories using GitHub Actions workflows) as well as producers (Actions being reused by GitHub repositories).

8.3.4 Challenges of the GitHub Actions Ecosystem While packaging ecosystems are extremely useful for their respective communities of software developers, they have been shown to face numerous challenges related to dependency management [21, 41, 52], outdatedness [19], security vulnerabilities [20, 69], breaking changes [17, 24], deprecation [13], and abandonment of package maintainers [4, 14]. We posit that GitHub Actions will suffer (and likely suffers already) from similar issues. Outdatedness Software developers are continuously confronted with the difficult choice of whether, when, and how to keep their dependencies up to date. On the one hand, updating a dependency to a more recent version enables them to benefit from


the latest bug and security fixes. On the other hand, doing so exposes the dependent project to an increased risk of breaking changes, as well as to new bugs or security issues that may not even have been discovered yet. The concept of technical lag was proposed to measure the extent to which a software project has outdated dependencies [33]. This lag can be quantified along different dimensions: as a function of time (how long has a dependency been outdated), version (how many versions is a dependency behind), stability (how many known bugs could have been fixed by updating the dependency), and security (how many security vulnerabilities could have been addressed by updating the dependency). Zerouali et al. [67] formalized this concept in a measurement framework that can be applied at the level of packaging ecosystems. In particular, they analyzed the technical lag of the npm packaging ecosystem, observing that around 26% of the dependencies expressed by npm packages are outdated and that half of these outdated dependencies target a version that is 270+ days older than the newer one. Other researchers have also applied technical lag to quantify outdatedness in software package dependency networks [19, 54]. The technical lag framework was also applied to the ecosystem of Docker containers distributed through Docker Hub [68]. Chapter 9 provides more details on this matter. In a similar vein, applying the technical lag framework to the GitHub Actions ecosystem would allow workflow developers to detect and quantify the presence of outdated Actions in workflows and help in updating them. It is important to do so since, despite the recency of GitHub Actions, according to Decan et al. [22], at least 16% of the dependencies in workflows are targeting an old major version of an Action. Adherence to Semantic Versioning Semantic Versioning (abbreviated to SemVer hereafter) is another mechanism that has been proposed to assist software developers with the delicate trade-off between benefiting from security or bug fixes and being exposed to breaking changes in dependencies. SemVer introduces a set of simple rules that suggest how to assign version numbers in packages to inform developers of dependent software about potentially breaking changes. In a nutshell, SemVer proposes a three-component version scheme major.minor.patch to specify the type of changes that have been made in a new package release. Many software packaging ecosystems (such as npm, Cargo, and Packagist) are mostly SemVer compliant, in that most of their package producers adhere to the SemVer convention [17]. Backward-incompatible changes are signalled by an update of the major component, while supposedly compatible changes come with an update of either the minor or patch component. This allows dependent packages to use so-called dependency constraints to define the range of acceptable versions for a dependency (e.g., it would be safe to accept all dependency updates within the same major version range if the dependency is trusted to be SemVer compliant). Maintainers of GitHub Actions workflows are exposed to a similar risk of incompatible changes in the Actions they use, whether these are logical changes (affecting the behavior of the Actions) or structural changes (affecting the parameters or return values). Therefore, knowing whether an Action adheres to SemVer is helpful


for maintainers of workflows depending on these actions, since they can assume minor and patch updates to be backward compatible and, therefore, free of breaking changes. GitHub recommends reusing Actions in workflows by specifying only the major component of the Action's version, allowing workflow maintainers to receive critical fixes and security patches while maintaining compatibility. However, little is known about the actual versioning practices followed by producers and consumers of Actions. Preliminary results suggest that GitHub's recommendation is widely followed since nearly 90% of the version tags used to refer to an Action include only a major component [22]. However, unlike package managers, GitHub Actions offers no support for dependency constraints, implying that producers of Actions are required to move these major version tags each time a new version of the Action is released. Unless automated, this requirement introduces an additional burden on the Action producers [22] and calls for a more profound analysis of the kind of changes made in Action updates and of the versioning practices they follow. Security Vulnerabilities Another issue is that any software project is subject to security vulnerabilities. Package dependency networks have made the attack surface of such vulnerabilities several orders of magnitude higher due to the widespread dependence on reusable software libraries that tend to have deep transitive dependency chains [2, 20, 25, 69]. For example, through a study of 2.8K vulnerabilities in the npm and RubyGems packaging ecosystems, Zerouali et al. [66] found around 40% of the packages to be exposed to a vulnerability due to their (direct or transitive) dependencies, and it often took months to fix them. They also observed that a single vulnerable package could expose up to two-thirds of all the packages depending on it. We see no reason why the GitHub Actions ecosystem would be immune to this phenomenon. Indeed, relying on reusable Actions from third-party repositories or even from the Marketplace further increases the vulnerability attack surface. Since a job in a workflow executes its commands within a runner shared with other jobs from the same workflow, individual jobs in the workflow can compromise other jobs they interact with. For example, a job could query the environment variables used by a later job, write files to a shared directory that a later job processes, or even more directly interact with the Docker socket and inspect other running containers and execute commands in them.6 Multiple examples of security issues in workflows have been reported, sometimes with potentially disastrous consequences, such as manipulating pull requests to steal arbitrary secrets,7 injecting arbitrary code with workflow commands,8 or bypassing code reviews to push unreviewed code.9 Unfortunately, we are not aware of any publicly available quantitative analysis on the impact of reusable Actions on security vulnerabilities in software projects.

6 https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#using-third-party-actions.
7 https://blog.teddykatz.com/2021/03/17/github-actions-write-access.html.
8 https://packetstormsecurity.com/files/159794/GitHub-Widespread-Injection.html.
9 https://medium.com/cider-sec/bypassing-required-reviews-6e1b29135cc7.


This shows that there is an urgent need for further research as well as appropriate tooling to support developers of reusable Actions and workflows in assessing and hardening their security. A first step in this direction is GitHub's built-in dependency monitoring service Dependabot, which has started to support GitHub Actions workflows in January 2022 and reusable Actions in August 2022.10 Abandonment and Deprecation Another important challenge that packaging ecosystems face is the risk of packages becoming unmaintained or deprecated [13] when some or all of their core contributors have abandoned the package development [4, 14, 39]. If this happens, the packages may become inactive, implying that bugs and security vulnerabilities will no longer be fixed. This will propagate to dependent packages that rely on such packages. Cogo et al. [13] have studied the phenomenon of package deprecation in the npm packaging ecosystem, observing that 3.2% of all releases are deprecated, 3.7% of the packages have at least one deprecated release, and 66% of the packages with deprecated releases are fully deprecated. Constantinou et al. [14] studied the phenomena of developer abandonment in the RubyGems and npm packaging ecosystems to determine the characteristics that lead to a higher probability of abandoning the ecosystem. Developers were found to present such a higher risk if they do not engage in discussions with other developers, do not have strong social and technical activity intensity, communicate or commit less frequently, and do not participate in both technical and social activities for long periods of time. Avelino et al. [4] carried out a mixed-methods study to investigate project abandonment in popular GitHub projects, revealing that some projects recovered from the abandonment of key developers because they were taken over by new core maintainers that were aware of the project abandonment risks and had a clear incentive for the project to survive. Since Actions are reusable software components being developed in GitHub repositories, the GitHub Actions ecosystem is likely to suffer from this risk of abandoning developers and the presence of unmaintained or obsolete Actions. This calls for studies to quantify this phenomenon and mechanisms to avoid abandonment or to provide solutions to overcome the negative effects of such abandonment. Examples of such solutions could be finding the right replacement for abandoning developers in Action repositories or suggesting consumers of unmaintained Actions to migrate to alternative Actions. Beyond GitHub Actions The exposure of GitHub Actions to the well-known issues that packaging ecosystems face is all the more worrying because they are not limited to the GitHub Actions ecosystem but may also affect other packaging ecosystems. Conversely, the GitHub Actions ecosystem may be affected by issues coming from packaging ecosystems. This situation is depicted in Fig. 8.7: GitHub hosts the development repositories of many software projects distributed in packaging ecosystems. These development repositories may define automated

10 https://github.blog/2022-08-09-dependabot-now-alerts-for-vulnerable-github-actions/.


Fig. 8.7 Interweaving of the GitHub Actions ecosystem and software packaging ecosystems

workflows relying on reusable Actions. The Actions themselves are also developed in (and directly accessed through) GitHub repositories. Since Actions are software components developed in some programming language (mostly in TypeScript currently), they may depend on reusable packages or libraries distributed in package registries such as npm. This potentially strong interconnection between GitHub Actions and packaging ecosystems is not without practical consequences given the issues that these ecosystems may face. Instead of being mostly limited to their own ecosystem, issues affecting either packages or Actions may cross the boundaries and propagate to the other software ecosystems they are interwoven with. Consider, for example, a reusable Action affected by a security vulnerability. To start with, this vulnerability may compromise all the workflows relying on the affected Action. Next, it may also compromise the development repositories in which these workflows are executed. By extension, it may also affect all the software projects developed in these repositories. In turn, these projects may affect all the dependent packages that use them and so on. For example, the action-download-artifact Action, used by several thousands of repositories, was found to expose workflows using it to code injection attacks.11 Conversely, Actions may depend on vulnerable packages distributed in a packaging ecosystem such as npm. As a consequence, issues affecting these packages may propagate to the Actions using them and may in turn propagate to the workflows and development repositories relying on these Actions. In summary, many of the issues that software packaging ecosystems have been shown to face also apply directly or indirectly to the GitHub Actions ecosystem.

11 https://www.legitsecurity.com/blog/github-actions-that-open-the-door-to-cicd-pipeline-attacks.


Even worse, given that both kinds of ecosystems are tightly interwoven, issues in either ecosystem can and will propagate across ecosystem boundaries, which may lead to a significantly increased exposure to vulnerabilities and other socio-technical health issues. This raises the urgent need for empirical research to understand the extent of these issues, to analyze their impact and propagation, and to provide tool support for repository, package, and workflow maintainers.

8.4 Discussion

This chapter focused on the emerging ecosystems of development workflow automation in the GitHub social coding platform, consisting of the socio-technical interaction with bots (automated software development agents) and the workflow automation offered through GitHub Actions. Taken together, GitHub’s socio-technical ecosystem comprises human contributors, bots, workflows and reusable Actions, GitHub Apps,12 and all of the GitHub repositories in which these technologies are being developed and used. It also comprises external CI/CD services and other development automation tools that may be used by these GitHub repositories. In addition, there is a tight interweaving with software packaging ecosystems, since software packages may be developed using bots and GitHub Actions, and the development of bots and Actions may in turn depend on software packages. We have argued that this intricate combination of workflow automation solutions constitutes an important and increasing risk that exposes the involved repositories—and, by extension, the software products they generate or that depend on them—to vulnerabilities and other socio-technical issues. Similar issues are likely to apply to other social coding platforms (e.g., GitLab and Bitbucket) for the same reasons as in GitHub, even though the workflow automation solutions and technologies on those platforms may differ.

We also argued that bots play an important role in the social fabric of the GitHub ecosystem, since bots interact and communicate with human contributors through a similar interface to the one used by humans (e.g., posting and reacting to comments on issues, PRs, code reviews, and commits in repositories). Actions, on the other hand, are more commonly used to automate technical tasks such as executing test suites and deploying packages, as quantitatively observed by Decan et al. [22]. However, the boundaries between the bot ecosystem and the GitHub Actions ecosystem are becoming more and more diffuse. For instance, nothing prevents bots from directly using the functionality offered by Actions (e.g., a bot could trigger the execution of a workflow using Actions that run test suites). Similarly, an Action may instruct a bot to interact with developers and users (e.g., a code coverage Action may report its results through a GitHub badge, issue, or PR comment).

12 https://docs.github.com/en/developers/apps.


Workflow automation solutions were already offered through a wide variety of channels for GitHub, for example, through CI/CD services, external bots, dedicated web interfaces, or GitHub Apps. The introduction of GitHub Actions has further increased the overlap between these automation services. For instance, some automation services that used to be offered through bots or GitHub Apps have now become available as Actions as well. An example is the GitHub App the-welcome-bot for welcoming newcomers, a task for which the GitHub Action wow-actions/welcome has more recently become available. Two other examples are the renovate dependency update service and the codecov code coverage analysis, which used to be available through web services and GitHub Apps; codecov has more recently also been offered as a GitHub Action. Going one step further, dependabot, which used to be an independent bot service, has now become fully integrated into the GitHub platform. All these examples illustrate that bots, Apps, Actions, and external services will continue to coexist side by side as part of the development workflow ecosystem. It is as yet unclear to what extent GitHub repositories use a combination of workflow automation solutions, or to what extent they tend to migrate from one solution to another. Hence, empirical studies that provide deeper insight into this rapidly expanding ecosystem are urgently needed.

Acknowledgments This work is supported by the ARC-21/25 UMONS3 Action de Recherche Concertée financée par le Ministère de la Communauté française - Direction générale de l’Enseignement non obligatoire et de la Recherche scientifique, as well as by the Fonds de la Recherche Scientifique - FNRS under grant numbers O.0157.18F-RG43, T.0149.22, and F.4515.23.

References

1. Abdellatif, A., Wessel, M., Steinmacher, I., Gerosa, M.A., Shihab, E.: BotHunter: an approach to detect software bots in GitHub. In: International Conference on Mining Software Repositories (MSR), pp. 6–17. IEEE Computer Society (2022). https://doi.org/10.1145/3524842.3527959 2. Alfadel, M., Costa, D.E., Shihab, E.: Empirical analysis of security vulnerabilities in Python packages. In: International Conference on Software Analysis, Evolution and Reengineering (SANER) (2021). https://doi.org/10.1109/saner50967.2021.00048 3. Arora, R., Goel, S., Mittal, R.: Supporting collaborative software development over GitHub. Softw. Pract. Exper. 47 (2016). https://doi.org/10.1002/spe.2468 4. Avelino, G., Constantinou, E., Valente, M.T., Serebrenik, A.: On the abandonment and survival of open source projects: an empirical investigation. In: International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 1–12 (2019). https://doi.org/10.1109/ESEM.2019.8870181 5. Beck, K., Beedle, M., Van Bennekum, A., Cockburn, A., Cunningham, W., Fowler, M., Grenning, J., Highsmith, J., Hunt, A., Jeffries, R., et al.: Manifesto for agile software development. Tech. rep., Snowbird, UT (2001) 6. Beller, M., Gousios, G., Zaidman, A.: Oops, my tests broke the build: an explorative analysis of Travis CI with GitHub. In: International Conference on Mining Software Repositories (MSR), pp. 356–367. IEEE, Piscataway (2017). https://doi.org/10.1109/MSR.2017.62


7. Brown, C., Parnin, C.: Sorry to bother you: designing bots for effective recommendations. In: International Workshop on Bots in Software Engineering (BotSE). IEEE, Piscataway (2019). https://doi.org/10.1109/BotSE.2019.00021 8. Brown, C., Parnin, C.: Nudging students toward better software engineering behaviors. In: International Workshop on Bots in Software Engineering (BotSE), pp. 11–15. IEEE, Piscataway (2021). https://doi.org/10.1109/BotSE52550.2021.00010 9. Cassee, N., Kitsanelis, C., Constantinou, E., Serebrenik, A.: Human, bot or both? A study on the capabilities of classification models on mixed accounts. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 654–658. IEEE, Piscataway (2021). https://doi.org/10.1109/ICSME52107.2021.00075 10. Catolino, G., Palomba, F., Tamburri, D.A., Serebrenik, A.: Understanding community smells variability: a statistical approach. In: International Conference on Software Engineering (ICSE), pp. 77–86 (2021). https://doi.org/10.1109/ICSE-SEIS52602.2021.00017 11. Chandrasekara, C., Herath, P.: Hands-on GitHub Actions: Implement CI/CD with GitHub Action Workflows for Your Applications. Apress (2021). https://doi.org/10.1007/978-1-48426464-5 12. Chen, T., Zhang, Y., Chen, S., Wang, T., Wu, Y.: Let’s supercharge the workflows: an empirical study of GitHub Actions. In: International Conference on Software Quality, Reliability and Security Companion (QRS-C). IEEE, Piscataway (2021). https://doi.org/10.1109/QRSC55045.2021.00163 13. Cogo, F.R., Oliva, G.A., Hassan, A.E.: Deprecation of packages and releases in software ecosystems: a case study on npm. Transactions on Software Engineering (2021). https://doi. org/10.1109/TSE.2021.3055123 14. Constantinou, E., Mens, T.: An empirical comparison of developer retention in the RubyGems and npm software ecosystems. Innovations Syst. Softw. Eng. 13(2), 101–115 (2017). https:// doi.org/10.1007/s11334-017-0303-4 15. Costa, J.M., Cataldo, M., de Souza, C.R.: The scale and evolution of coordination needs in large-scale distributed projects: implications for the future generation of collaborative tools. In: SIGCHI Conference on Human Factors in Computing Systems, pp. 3151–3160 (2011). https://doi.org/10.1145/1978942.1979409 16. Dabbish, L., Stuart, C., Tsay, J., Herbsleb, J.: Social coding in GitHub: transparency and collaboration in an open software repository. In: International Conference on Computer Supported Cooperative Work (CSCW), pp. 1277–1286. ACM (2012). https://doi.org/10.1145/ 2145204.2145396 17. Decan, A., Mens, T.: What do package dependencies tell us about semantic versioning? Trans. Softw. Eng. 47(6), 1226–1240 (2021). https://doi.org/10.1109/TSE.2019.2918315 18. Decan, A., Mens, T., Claes, M.: An empirical comparison of dependency issues in OSS packaging ecosystems. In: International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, Piscataway (2017). https://doi.org/10.1109/SANER.2017. 7884604 19. Decan, A., Mens, T., Constantinou, E.: On the evolution of technical lag in the npm package dependency network. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 404–414. IEEE, Piscataway (2018). https://doi.org/10.1109/ICSME.2018.00050 20. Decan, A., Mens, T., Constantinou, E.: On the impact of security vulnerabilities in the npm package dependency network. In: International Conference on Mining Software Repositories (MSR), pp. 181–191 (2018). https://doi.org/10.1007/s10664-022-10154-1 21. 
Decan, A., Mens, T., Grosjean, P.: An empirical comparison of dependency network evolution in seven software packaging ecosystems. Empirical Softw. Eng. 24(1), 381–416 (2019). https:// doi.org/10.1007/s10664-017-9589-y 22. Decan, A., Mens, T., Mazrae, P.R., Golzadeh, M.: On the use of GitHub Actions in software development repositories. In: International Conference on Software Maintenance and Evolution (ICSME). IEEE, Piscataway (2022). https://doi.org/10.1109/ICSME55016.2022.00029


23. Dey, T., Mousavi, S., Ponce, E., Fry, T., Vasilescu, B., Filippova, A., Mockus, A.: Detecting and characterizing bots that commit code. In: International Conference on Mining Software Repositories (MSR), pp. 209–219. ACM (2020). https://doi.org/10.1145/3379597.3387478 24. Dietrich, J., Pearce, D., Stringer, J., Tahir, A., Blincoe, K.: Dependency versioning in the wild. In: International Conference on Mining Software Repositories (MSR), pp. 349–359. IEEE, Piscataway (2019). https://doi.org/10.1109/MSR.2019.00061 25. Düsing, J., Hermann, B.: Analyzing the direct and transitive impact of vulnerabilities onto different artifact repositories. Digit. Threats Res. Pract. (2021). https://doi.org/10.1145/3472811 26. Erlenhov, L., Neto, F.G.d.O., Leitner, P.: An empirical study of bots in software development: characteristics and challenges from a practitioner’s perspective. In: Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 445–455. ACM (2020). https://doi.org/10.1145/3368089.3409680 27. Erlenhov, L., de Oliveira Neto, F.G., Scandariato, R., Leitner, P.: Current and future bots in software development. In: International Workshop on Bots in Software Engineering (BotSE), pp. 7–11. IEEE, Piscataway (2019). https://doi.org/10.1109/BotSE.2019.00009 28. Fowler, M., Foemmel, M.: Continuous Integration (original version) (2000). https:// martinfowler.com/articles/originalContinuousIntegration.html. Accessed 15 Apr 2023 29. GitHub: The state of open source software 2022 (2022). octoverse.github.com. Accessed 15 Apr 2023 30. Golzadeh, M., Decan, A., Legay, D., Mens, T.: A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments. J. Syst. Softw. 175 (2021). https://doi. org/10.1016/j.jss.2021.110911 31. Golzadeh, M., Decan, A., Mens, T.: Evaluating a bot detection model on git commit messages. In: CEUR Workshop Proceedings, vol. 2912 (2021) 32. Golzadeh, M., Decan, A., Mens, T.: On the rise and fall of CI services in GitHub. In: International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, Piscataway (2021). https://doi.org/10.1109/SANER53432.2022.00084 33. Gonzalez-Barahona, J.M., Sherwood, P., Robles, G., Izquierdo, D.: Technical lag in software compilations: Measuring how outdated a software deployment is. In: IFIP International Conference on Open Source Systems, pp. 182–192. Springer, Berlin (2017). https://doi.org/ 10.1007/978-3-319-57735-7_17 34. Gousios, G., Pinzger, M., van Deursen, A.: An exploratory study of the pull-based software development model. In: International Conference on Software Engineering (ICSE), pp. 345– 355. ACM (2014). https://doi.org/10.1145/2568225.2568260 35. Gousios, G., Storey, M.A., Bacchelli, A.: Work practices and challenges in pull-based development: the contributor’s perspective. In: International Conference on Software Engineering (ICSE), pp. 285–296. ACM (2016). https://doi.org/10.1145/2884781.2884826 36. Gousios, G., Zaidman, A., Storey, M.A., van Deursen, A.: Work practices and challenges in pull-based development: the integrator’s perspective. In: International Conference on Software Engineering (ICSE), pp. 358–368. IEEE, Piscataway (2015). https://doi.org/10.1109/ICSE. 2015.55 37. Herbsleb, J.D.: Global software engineering: the future of socio-technical coordination. In: International Conference on Software Engineering (ISCE)—Workshop on the Future of Software Engineering, pp. 188–198. IEEE, Piscataway (2007). 
https://doi.org/10.1109/FOSE. 2007.11 38. Holmström, H., Conchúir, E.Ó., Ågerfalk, P.J., Fitzgerald, B.: Global software development challenges: a case study on temporal, geographical and socio-cultural distance. In: International Conference on Global Software Engineering (ICGSE), pp. 3–11. IEEE, Piscataway (2006). https://doi.org/10.1109/ICGSE.2006.261210 39. Kaur, R., Kaur, K.: Insights into developers’ abandonment in FLOSS projects. In: Intelligent Sustainable Systems. Lecture Notes in Networks and Systems, vol. 333. Springer, Berlin (2022). https://doi.org/10.1007/978-981-16-6309-3_69 40. Kinsman, T., Wessel, M., Gerosa, M.A., Treude, C.: How do software developers use GitHub Actions to automate their workflows? In: International Conference on Mining Software Repos-


itories (MSR), pp. 420–431. IEEE, Piscataway (2021). https://doi.org/10.1109/MSR52588. 2021.00054 41. Kula, R.G., German, D.M., Ouni, A., Ishio, T., Inoue, K.: Do developers update their library dependencies? Empirical Softw. Eng. 23(1), 384–417 (2018). https://doi.org/10.1007/s10664017-9521-5 42. Lebeuf, C., Storey, M.A., Zagalsky, A.: Software bots. IEEE Softw. 35(1), 18–23 (2017). https://doi.org/10.1109/MS.2017.4541027 43. Mazrae, P.R., Mens, T., Golzadeh, M., Decan, A.: On the usage, co-usage and migration of CI/CD tools: a qualitative analysis. Empirical Softw. Eng. (2023). https://doi.org/10.1007/ s10664-022-10285-5 44. Mirhosseini, S., Parnin, C.: Can automated pull requests encourage software developers to upgrade out-of-date dependencies? In: International Conference on Automated Software Engineering (ASE), pp. 84–94. IEEE, Piscataway (2017). https://doi.org/10.1109/ASE.2017. 8115621 45. Monperrus, M.: Explainable software bot contributions: case study of automated bug fixes. In: International Workshop on Bots in Software Engineering (BotSE), pp. 12–15. IEEE, Piscataway (2019). https://doi.org/10.1109/BotSE.2019.00010 46. Peng, Z., Ma, X.: Exploring how software developers work with mention bot in GitHub. CCF Trans. Pervasive Comput. Interaction 1(3), 190–203 (2019). https://doi.org/10.1007/s42486019-00013-2 47. Ribeiro, E., Nascimento, R., Steinmacher, I., Xavier, L., Gerosa, M., De Paula, H., Wessel, M.: Together or apart? Investigating a mediator bot to aggregate bot’s comments on pull requests. In: International Conference on Software Maintenance and Evolution—New Ideas and Emerging Results Track (ICSME-NIER). IEEE, Piscataway (2022). https://doi.org/10. 1109/ICSME55016.2022.00054 48. Saroar, S.G., Nayebi, M.: Developers’ perception of GitHub Actions: a survey analysis. In: International Conference on Evaluation and Assessment in Software Engineering (EASE) (2023) 49. Savor, T., Douglas, M., Gentili, M., Williams, L., Beck, K., Stumm, M.: Continuous deployment at Facebook and OANDA. In: International Conference on Software Engineering Companion (ICSE), pp. 21–30. IEEE, Piscataway (2016). https://doi.org/10.1145/2889160. 2889223 50. Soares, E., Sizilio, G., Santos, J., da Costa, D.A., Kulesza, U.: The effects of continuous integration on software development: a systematic literature review. Empirical Softw. Eng. 27(3), 1–61 (2022). https://doi.org/10.1007/s10664-021-10114-1 51. Song, Y., Chaparro, O.: BEE: A tool for structuring and analyzing bug reports. In: Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 1551–1555. ACM (2020). https://doi.org/10.1145/ 3368089.3417928 52. Soto-Valero, C., Harrand, N., Monperrus, M., Baudry, B.: A comprehensive study of bloated dependencies in the Maven ecosystem. Empirical Softw. Eng. 26(3), 1–44 (2021). https://doi. org/10.1007/s10664-020-09914-8 53. Storey, M.A., Zagalsky, A.: Disrupting developer productivity one bot at a time. In: International Symposium on Foundations of Software Engineering (FSE), pp. 928–931 (2016). https:// doi.org/10.1145/2950290.2983989 54. Stringer, J., Tahir, A., Blincoe, K., Dietrich, J.: Technical lag of dependencies in major package managers. In: Asia-Pacific Software Engineering Conference (APSEC), pp. 228–237 (2020). https://doi.org/10.1109/APSEC51365.2020.00031 55. Tsay, J., Dabbish, L., Herbsleb, J.: Influence of social and technical factors for evaluating contribution in GitHub. 
In: International Conference on Software Engineering (ICSE), pp. 356–366. ACM (2014). https://doi.org/10.1145/2568225.2568315 56. Wang, Z., Wang, Y., Redmiles, D.: From specialized mechanics to project butlers: the usage of bots in OSS development. IEEE Software (2022). https://doi.org/10.1109/MS.2022.3180297


57. Wessel, M., Abdellatif, A., Wiese, I., Conte, T., Shihab, E., Gerosa, M.A., Steinmacher, I.: Bots for pull requests: the good, the bad, and the promising. In: International Conference on Software Engineering (ICSE), pp. 274–286 (2022). https://doi.org/10.1145/3510003.3512765 58. Wessel, M., Serebrenik, A., Wiese, I., Steinmacher, I., Gerosa, M.A.: Effects of adopting code review bots on pull requests to OSS projects. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 1–11. IEEE, Piscataway (2020). https://doi.org/10. 1109/ICSME46990.2020.00011 59. Wessel, M., Serebrenik, A., Wiese, I., Steinmacher, I., Gerosa, M.A.: What to expect from code review bots on GitHub? A survey with OSS maintainers. In: Brazilian Symposium on Software Engineering (SBES), pp. 457–462 (2020). https://doi.org/10.1145/3422392.3422459 60. Wessel, M., Serebrenik, A., Wiese, I., Steinmacher, I., Gerosa, M.A.: Quality gatekeepers: investigating the effects of code review bots on pull request activities. Empirical Softw. Eng. 27(5), 108 (2022). https://doi.org/10.1007/s10664-022-10130-9 61. Wessel, M., de Souza, B.M., Steinmacher, I., Wiese, I.S., Polato, I., Chaves, A.P., Gerosa, M.A.: The power of bots: characterizing and understanding bots in OSS projects. Proc. ACM Hum.-Comput. Interact. 2(CSCW) (2018). https://doi.org/10.1145/3274451 62. Wessel, M., Vargovich, J., Gerosa, M.A., Treude, C.: Github actions: the impact on the pull request process (2022). arXiv preprint arXiv:2206.14118 63. Wessel, M., Wiese, I., Steinmacher, I., Gerosa, M.A.: Don’t disturb me: challenges of interacting with software bots on open source software projects. In: ACM Hum.-Comput. Interact. (CHI). ACM (2021). https://doi.org/10.1145/3476042 64. Wyrich, M., Bogner, J.: Towards an autonomous bot for automatic source code refactoring. In: International Workshop on Bots in Software Engineering (BotSE), pp. 24–28 (2019). https:// doi.org/10.1109/BotSE.2019.00015 65. Wyrich, M., Ghit, R., Haller, T., Müller, C.: Bots don’t mind waiting, do they? Comparing the interaction with automatically and manually created pull requests. In: International Workshop on Bots in Software Engineering (BotSE), pp. 6–10. IEEE, Piscataway (2021). https://doi.org/ 10.1109/BotSE52550.2021.00009 66. Zerouali, A., Mens, T., Decan, A., De Roover, C.: On the impact of security vulnerabilities in the npm and RubyGems dependency networks. Empirical Softw. Eng. 27(5), 1–45 (2022). https://doi.org/10.1007/s10664-022-10154-1 67. Zerouali, A., Mens, T., Gonzalez-Barahona, J., Decan, A., Constantinou, E., Robles, G.: A formal framework for measuring technical lag in component repositories—and its application to npm. J. Softw. Evol. Process 31(8) (2019). https://doi.org/10.1002/smr.2157 68. Zerouali, A., Mens, T., Robles, G., Gonzalez-Barahona, J.M.: On the relation between outdated docker containers, severity vulnerabilities, and bugs. In: International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 491–501. IEEE, Piscataway (2019). https://doi.org/10.1109/SANER.2019.8668013 69. Zimmermann, M., Staicu, C.A., Tenny, C., Pradel, M.: Small world with high risks: a study of security threats in the npm ecosystem. In: USENIX Security Symposium, pp. 995–1010 (2019)

Chapter 9

Infrastructure-as-Code Ecosystems

Ruben Opdebeeck, Ahmed Zerouali, and Coen De Roover

Abstract Infrastructure as Code (IaC) is the practice of automating the provisioning, configuration, and orchestration of systems onto which software is deployed through scripts in domain-specific languages. With the increasing importance of reliable and repeatable deployments, ecosystems are emerging around online repositories of reusable IaC assets. In this chapter, we study two such ecosystems in detail: the one forming around the Docker Hub repository of reusable Docker images and the one forming around the Ansible Galaxy repository of reusable Ansible roles. We start with an introduction to Docker, the most popular container management tool, and Ansible, the most popular configuration management tool. Although both tools are used to configure machines onto which applications are deployed, they differ fundamentally in the means through which this is achieved. Next, we discuss the Docker Hub and Ansible Galaxy online repositories for reusable Docker images and Ansible roles. Having introduced these emerging ecosystems, we highlight a number of approaches taken by researchers studying them. Subsequently, we survey the state of the art in research on the practices followed by their contributors and users, ranging from the versioning of releases and keeping dependencies up to date to detecting bugs. We conclude with the challenges that researchers face when analyzing these ecosystems.

9.1 Introduction

A key activity in information technology (IT) operations is managing and configuring the digital infrastructure upon which software systems are deployed. In the broadest sense, infrastructure configuration encompasses account management, setting up firewall filters, the installation of software packages, and their configuration. As infrastructures grow to tens or even hundreds of machines, managing these


configurations by hand becomes impractical. Moreover, infrastructures that need to be scalable and elastic require the ability for new machines to be spun up or down at a moment’s notice. Thus, automation becomes a necessity. Infrastructure as Code (IaC) is the practice of automatically provisioning, configuring, managing, and orchestrating the machines in a digital infrastructure through source code, which can be read by humans and machines alike. This enables applying best practices already established for application code to infrastructure code, such as change management through version control systems or quality assurance through testing and code review. Broadly speaking, two paradigms can be discerned in IaC practice. In the paradigm of immutable infrastructures, existing configurations cannot be changed. It requires technologies such as virtualization and containerization to replace the infrastructure with a newly configured one. In the paradigm of mutable infrastructures, existing configurations can be changed. Doing so reliably and in a repeatable manner requires technologies such as configuration management languages that automate the changes to preexisting infrastructures. Accompanying the rise in IaC technologies are emerging software ecosystems wherein practitioners share open-source artifacts for reuse by others. In this chapter, we discuss two of the most popular of such technologies and their emerging ecosystems, one for each IaC paradigm. We start our discourse with Docker in Sect. 9.2, the leading containerization technology powering immutable infrastructures, and its Docker Hub ecosystem of reusable Docker images. Next, we continue with Ansible in Sect. 9.3, a leading configuration management language powering mutable infrastructures, and its Ansible Galaxy ecosystem of reusable configuration management code. For both technologies, we describe their ecosystem and its participants, metadata, and content. Moreover, we summarize a number of approaches to analyzing the ecosystem, in terms of its metadata as well as its artifacts. We conclude with a number of promising avenues of research into these technologies and their ecosystems.

9.2 Docker and Its Docker Hub Ecosystem

9.2.1 Introduction to Containerization

Virtual machines are software-based simulations of the hardware upon which software is deployed. Deploying applications on a virtual machine facilitates porting them across hardware platforms and operating systems while also isolating them from other applications that share the same physical host. A closely related technology is containerization, a lightweight version of virtualization [33]. Containerization enables developers to package all components required to run an application into a “container.” These components include the executable and all dependencies such as web or database servers and their configuration files. Importantly, the container


does not include the operating system or kernel itself. These are shared with the host machine, along with its physical non-virtualized hardware. This allows containers to remain lightweight yet portable to other hosts with similar hardware and operating system. The application deployed within the container is ensured a consistent infrastructure across different hosts. Containers are an enabling technology of continuous integration and continuous deployment pipelines (see Sect. 8.1.3) in which applications are put through a series of tests, each test running the application in an environment of which the last resembles or is the actual production environment. Containerization is also a popular choice for isolating the micro-services of cloud-native applications [13] from each other, exposing only their functionality through a well-defined interface. Packaging micro-services into containers also facilitates implementing horizontal scaling and elasticity. It suffices to spin up new container instances as the load on the micro-service increases and to spin down redundant containers as the load decreases.

9.2.2 The Docker Containerization Tool

Docker [48], a platform designed to build, share, and run containerized applications, standardized the use of containers with easy-to-use tools for developers and established a universal packaging approach, which subsequently accelerated the adoption of container technologies. In 2013, the Docker Engine was launched, and in 2015, Docker launched the Linux Foundation project “Open Container Initiative (OCI)” to design open standards for operating-system-level virtualization, most importantly Linux containers. Today, various containerization tools support the OCI standards established by Docker (e.g., containerd, runc, CRI-O, etc.). According to the 2022 Stack Overflow Developer Survey [46], Docker is the most loved containerization tool. Docker containers are created as instances of Docker images. Each image can be built from a blueprint named “Dockerfile,”1 a simple text-based script with instructions to build a Docker image. Starting from a base image, each instruction creates a new “layer” for the image, which represents the changes to the previous layer caused by executing the instruction. Each layer has an associated digest, i.e., a unique hash signature of the changes, and is immutable. Thus, a Dockerfile produces a Docker image, which is a stack of layers. Once the image is built, it can be used to instantiate multiple containers, each originating from the same Docker image. A Dockerfile can be based on another Dockerfile, thereby inheriting the latter’s image and its layers as the base image. Although it is possible to create Docker images from scratch,2 most images are based on other images. This leads to a hierarchy of images, with a small number of base images forming the foundation.

1 https://docs.docker.com/engine/reference/builder.
2 Using the Docker-reserved minimal image named “scratch,” https://hub.docker.com/_/scratch.


As images are used to create runnable containers, they contain operating system packages that will complement the kernel provided by the host. Typically, these form the lowest layers of an image, originating from one of the aforementioned base images. Images including Linux-based distributions provide access to that distribution’s toolset, such as their package manager (e.g., aptitude on Debian). Subsequent layers may use these package managers to add third-party packages, e.g., language runtimes and language-specific package managers (e.g., JavaScript and npm, or Python and pip). These can then in turn be used to add third-party language-specific libraries. Finally, the topmost layers are typically used for first-party code, binaries, and assets. For example, Listing 9.1 depicts the Dockerfile that builds the image redis:7-bullseye. This Dockerfile includes different Docker commands. FROM specifies the parent image that is being built upon. RUN executes a shell command in a new layer of the image being built. For example, the shell command groupadd will create a new group on the container with the specified name. ENV defines environment variables for the container. WORKDIR sets the current working directory. COPY copies files into the image, etc. This Dockerfile will build an image that inherits the layers of the debian:bullseye-slim image, amended with layers containing the changes caused by the other commands. Dockerfiles can contain other commands, such as VOLUME to declare a mount point for data volumes and CMD to set the default command to run when a container is started.

Listing 9.1 Dockerfile of the image redis:7-bullseye
 1  FROM debian:bullseye-slim
 2  RUN groupadd -r -g 999 redis && useradd -r -g redis -u 999 redis
 3  ENV GOSU_VERSION 1.14
 4  RUN set -eux; \
 5      savedAptMark="$(apt-mark showmanual)"; \
 6      apt-get update; \
 7      apt-get install -y --no-install-recommends ca-certificates dirmngr gnupg wget; \
 8      ...
 9  WORKDIR /data
10  COPY docker-entrypoint.sh /usr/local/bin/
11  ENTRYPOINT ["docker-entrypoint.sh"]

Running containers are provided read-only access to the image’s layers to form the container’s file system. Containers can thus access all files created while the image was built. In addition, the container is provided with a new writable layer on top of the image’s layers, called the container layer, which holds all file system modifications made while the container is running.


Listing 9.2 Excerpt of the Dockerfile of debian:bullseye-slim
1  FROM scratch
2  ADD rootfs.tar.xz /
3  CMD ["bash"]

9.2.3 The Docker Hub Ecosystem

Next to being instantiated into containers, Docker images can be shipped to online registries to be reused by third parties. Software ecosystems form around such online registries, which provide a common place to build, update, and share images. With more than 9.4M images (as of August 2022), Docker Hub is the largest registry for Docker images. It has served billions of image downloads, with images such as Ubuntu, Redis, Node.js, Alpine, and MySQL each having more than a billion downloads.

9.2.3.1 Types of Images Collected on Docker Hub

Images in Docker Hub are distributed as repositories, allowing contributors to group several variants of images (e.g., for different architectures). Repositories may be public or private, where public repositories are categorized as either official or community repositories. The official status is considered a quality label, hinting that the repository contains secure and well-maintained images produced by well-known organizations (e.g., MySQL or Debian). Official images are therefore often used as the base image for other images. Community repositories, in contrast, can be created by any user or organization [4]. To facilitate the search for images in Docker Hub, images are labelled with the name of the repository (e.g., debian) and a tag (e.g., buster). An image can have multiple tags and thus multiple labels (e.g., both debian:bullseye and debian:11 refer to the same image). The labels of community images usually start with the name of the organization or user producing the images, followed by the image name and tag (e.g., grimoirelab/full:0.2.26). Figure 9.1 illustrates the workflow of creating Docker containers from Docker images pulled from Docker Hub.3 Pulled images can also be used as the base image in a Dockerfile for a new image. For example, Listing 9.2 shows the Dockerfile of the image debian:bullseye-slim. This image is built from scratch, and it was used as the base image in the Dockerfile in Listing 9.1. Therefore, the layers of this image are all included in the list of layers of the redis:7-bullseye image.

3 For more details about the Docker architecture, we refer the reader to https://docs.docker.com/get-started/overview/.

Fig. 9.1 Process of creating a Docker container: a Docker image is built from a Dockerfile, pushed to Docker Hub, and pulled and run as a Docker container
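The workflow depicted in Fig. 9.1 corresponds to a handful of Docker CLI invocations. The following sketch illustrates them; the image name myorg/myapp and its tag are placeholders rather than an actual Docker Hub repository:

    docker build -t myorg/myapp:1.0 .   # build an image from the Dockerfile in the current directory
    docker push myorg/myapp:1.0         # publish the image to Docker Hub
    docker pull myorg/myapp:1.0         # download the image on another host
    docker run --rm myorg/myapp:1.0     # instantiate and run a container from the image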

9.2.3.2 Image Metadata Maintained on Docker Hub

The Docker Hub registry maintains several types of metadata about its images. Basic information, like the repository and image name, its type (official or community), tags, number of pulls, number of stars given by other users, and the size of images, can be found on the Docker Hub homepage for each image (e.g., https://hub.docker.com/_/debian). This information is also provided by Docker Hub’s API and augmented with the image’s creation date, home repository in GitHub (or another hosting service), last pull date, supported architectures (e.g., AMD64, ARM64), a unique SHA digest for the image, etc. Most importantly, the API provides a manifest4 for each Docker image, which is a JSON file with metadata about the layers within the image and their size. When trying to run an image with a specified name and tag from Docker Hub, the Docker engine contacts the registry, requesting the manifest for that image. Before downloading the image layers, the engine verifies the manifest’s signature, ensuring that the content was produced by a trusted source. Once downloaded, the engine verifies that the digest of each layer matches the one specified in the manifest. Layer digests are also used to identify layers that have already been downloaded.
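To illustrate how such metadata can be retrieved programmatically, the following Python sketch queries Docker Hub’s public v2 API for the official debian repository and one of its tags. The endpoint layout and field names are assumptions based on the publicly documented API and may evolve over time:

    import requests

    BASE = "https://hub.docker.com/v2/repositories"

    # Repository-level metadata (pull count, star count, description, ...)
    repo = requests.get(f"{BASE}/library/debian").json()
    print(repo["pull_count"], repo["star_count"])

    # Tag-level metadata: the per-architecture images with their digests and sizes
    tag = requests.get(f"{BASE}/library/debian/tags/bullseye").json()
    for image in tag["images"]:
        print(image["architecture"], image["digest"], image["size"])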

9.2.4 Approaches to Analyzing Docker Hub Images

Images within the Docker Hub ecosystem have enjoyed attention in empirical software engineering research. Before surveying the findings of this research, we

4 https://docs.docker.com/registry/spec/manifest-v2-2/.


discuss the different forms of research methods used to analyze Docker Hub images and their ecosystem.

9.2.4.1 Docker Hub Metadata Analysis

In a pure metadata analysis, researchers analyze information that can be gathered about Docker Hub images without running them as containers and without inspecting the source code of their Dockerfiles. Various sources of such metadata exist. Zerouali et al. [57], for instance, mapped the network of popular images derived from the Debian base image without inspecting any Dockerfile. This first required extracting image repository, name, and tags of 27,760 official and 5,842,567 community images through the Docker Hub API. Using the image manifests, they could then identify 9581 official and 924,139 community images built on top of the Debian base image. The same approach was followed later by the same authors [56] to identify Node.js, Python, and Ruby-based community images. Most of the images shared in Docker Hub belong to open-source projects, of which the version control repositories are hosted on social coding platforms like GitHub, GitLab, or Bitbucket. By extracting the link to these repositories from Docker Hub, researchers can obtain more metadata about the repositories that version an image’s Dockerfile, for instance, information about the age of the repository, its contributors, number of stars, watchers, forks, etc. Commit logs and change statistics from version repositories can also be used in metadata analyses. For example, Lin et al. [26] constructed a dataset [27] that contains information about 3,364,529 Docker images and the 378,615 git repositories behind them. The dataset’s information from Docker Hub includes the image description, tags, number of pulls, publisher username, etc. The dataset’s information from GitHub includes the repository’s branches, releases, commit logs, Dockerfile commit history, etc.

9.2.4.2 Static Analysis of Dockerfiles and Docker Images

There are two main ways in which static analysis can be used to study the Docker Hub ecosystem. The first is to statically analyze the Dockerfiles used to build Docker images. The second is to analyze the content of an image’s layers without running the image as a container. For Dockerfiles, the canonical technique is to parse the instructions of the file into an Abstract Syntax Tree (AST). Each instruction is represented as a node in this AST, with the operands of the instruction (e.g., the shell command of a RUN instruction or the image name of a FROM instruction) appearing as child nodes. Xu et al. [52] further classify these nodes into one of four categories: operator (or Docker command) nodes, resource nodes, shell command nodes, and parameter nodes.
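The following Python fragment gives a minimal sketch of such a top-level parse. It merely splits a Dockerfile into (instruction, arguments) nodes while handling line continuations, and is not a substitute for the full parsers used in the studies cited here:

    from typing import List, Tuple

    def parse_dockerfile(text: str) -> List[Tuple[str, str]]:
        """Naive top-level parse: one (INSTRUCTION, arguments) node per logical line."""
        nodes, logical = [], ""
        for line in text.splitlines():
            stripped = line.strip()
            if not stripped or stripped.startswith("#"):
                continue  # skip blank lines and comments
            logical += stripped
            if logical.endswith("\\"):        # line continuation: keep accumulating
                logical = logical[:-1] + " "
                continue
            instruction, _, args = logical.partition(" ")
            nodes.append((instruction.upper(), args.strip()))
            logical = ""
        return nodes

    # Each returned node would become a child of the AST root; RUN arguments
    # could be parsed further into shell-command and parameter subtrees.
    print(parse_dockerfile("FROM debian:bullseye-slim\nRUN apt-get update && \\\n  apt-get install -y wget"))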


As each RUN instruction contains one or several shell commands, some of them may even include other bash scripts embedded as arguments (e.g., line 5 in Listing 9.1). This adds more levels of nesting to AST representations. Henkel et al. [18] presented an AST representation that tackles this challenge of nesting in Dockerfiles. They employed phased parsing, wherein they first perform a simple top-level parse, resulting in an AST as described above. Then, they refine the tree by parsing RUN instructions into separate commands (e.g., line 3 in Listing 9.3) and parsing the options of popular command-line tools.

Listing 9.3 Excerpt of the Dockerfile of shogun/shogun-dev:latest
1  FROM debian:buster-backports
2  MAINTAINER shogun[at]shogun-toolbox.org
3  RUN apt-get update -qq && apt-get upgrade -y && apt-get install -qq --force-yes --no-install-recommends make gcc g++ libc6-dev libbz2-dev ccache libarpack2-dev ...

Figure 9.2 shows the AST representation for the Dockerfile in Listing 9.3 using Henkel et al.’s approach [18]. The same technique was later used to characterize Dockerfile function code, which is the code that comes after the Docker command (e.g., line 3 in Listing 9.3), through AST paths [60].

Several tools have become available to statically analyze Dockerfiles. The DeepSource Docker analyzer5 scans for potential bugs, anti-patterns, security vulnerabilities, and performance issues. The VS Code Docker extension6 provides basic linting, whereas Hadolint7 detects code smells in an AST of the Dockerfile using a rule-based approach, supporting issues that affect commands as well as the shell code in their operands. It is worth mentioning that Hadolint rules can be customized. For example, rule DL4000 was used to check for the usage of the MAINTAINER command, which was considered a best practice. When this command became deprecated, DL4000 was updated to indicate that the LABEL command should be used instead.

By downloading the layers of an image and analyzing their content, images can also be analyzed statically without the need to run them. To download the layers comprising an image, researchers use layer identifiers extracted from the manifest and then download the blobs from the registry using Docker Hub’s API. This technique was used by Henriksson et al. [20] to extract the list of packages installed in an image and to scan the packages for vulnerabilities.
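A minimal sketch of this layer-download step is shown below. It uses the registry’s HTTP API to fetch an image manifest and then retrieves each layer blob by its digest; the token endpoint, media type, and JSON fields are assumptions based on the public Docker registry protocol, and multi-architecture images may first return a manifest list from which a platform-specific manifest must be selected:

    import requests

    IMAGE, TAG = "library/debian", "bullseye"

    # Obtain an anonymous pull token for the registry API (assumed endpoint)
    token = requests.get(
        "https://auth.docker.io/token",
        params={"service": "registry.docker.io", "scope": f"repository:{IMAGE}:pull"},
    ).json()["token"]
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.docker.distribution.manifest.v2+json",
    }

    # The manifest lists the layers of the image together with their digests
    manifest = requests.get(
        f"https://registry-1.docker.io/v2/{IMAGE}/manifests/{TAG}", headers=headers
    ).json()

    for layer in manifest["layers"]:
        blob = requests.get(
            f"https://registry-1.docker.io/v2/{IMAGE}/blobs/{layer['digest']}", headers=headers
        )
        # Each blob is a (compressed) tar archive whose files can now be listed or scanned
        print(layer["digest"], len(blob.content))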

5 https://deepsource.io/docs/analyzer/docker/.
6 https://github.com/microsoft/vscode-docker.
7 https://hadolint.github.io/hadolint/.

Fig. 9.2 AST of the Dockerfile in Listing 9.3 constructed using the technique of Henkel et al. [18]

9.2.4.3 Dynamic Analysis of Dockerfiles and Docker Images

Two types of dynamic analysis can be discerned in the literature: (1) analysis of the build output of Dockerfiles and (2) analysis of running containers of built images. Xu et al. [52] used dynamic analysis to detect temporary file smells in Dockerfiles, i.e., files that are created and deleted in separate layers while the image is built, which leads to a redundant increase in image size. To this end, they instrumented the host kernel to log file creation and deletion. The resulting traces enable identifying temporary files across multiple layers. Henkel et al. [19] combined static and dynamic analysis to warn about Dockerfile breakage. Their approach looks for common error patterns in build logs and associated Dockerfiles. For instance, they find that when Docker reports an error message “Unable to locate package python-pip” while building Dockerfiles containing the instructions FROM ubuntu:latest and RUN apt-get -y install python-pip, this error is due to the use of the undefined, latest, or 20.04 tags. Dynamic analysis of running Docker containers can be costly due to the resources required. Nonetheless, it is an effective method for scanning Docker images for security vulnerabilities and bugs. For example, Zerouali et al. [55] developed ConPan, a tool that pulls and runs a Docker image from Docker Hub


to perform software composition analysis. Once the container is running, the tool executes commands to extract the list of packages available at runtime and compares them to package registries. The resulting list is inspected to analyze the packages’ outdatedness (i.e., technical lag [53]) and to identify vulnerable and buggy packages. Similarly, Shu et al. [44] proposed the Docker Image Vulnerability Analysis (DIVA) framework to automatically discover, download, and analyze Docker images for security vulnerabilities.
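In its simplest form, such a dynamic package extraction boils down to running a container from the image and querying its package manager. The following sketch assumes a Debian-based image and a locally available docker CLI; it illustrates the general idea rather than the cited tools themselves:

    import subprocess

    def installed_packages(image: str) -> dict:
        """Run a container from `image` and list its installed Debian packages."""
        output = subprocess.run(
            ["docker", "run", "--rm", image,
             "dpkg-query", "-W", "-f=${Package} ${Version}\n"],
            capture_output=True, text=True, check=True,
        ).stdout
        return dict(line.split(" ", 1) for line in output.splitlines() if line)

    packages = installed_packages("debian:bullseye")
    # Each (package, version) pair can now be compared against the distribution's
    # package repository to measure outdatedness or to look up known vulnerabilities.
    print(len(packages), packages.get("bash"))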

9.2.5 Empirical Insights from Analyzing Docker Hub Images

Having discussed the analysis methods through which the Docker Hub ecosystem and its assets can be studied, we turn to the latest insights uncovered in this manner.

9.2.5.1 Technical Lag and Security in the Docker Hub Ecosystem

A survey among Docker users revealed that the absence of security vulnerabilities is a top concern in the decision to adopt Docker images [3]. Another survey showed that Docker users are also concerned about the presence of bugs in third-party software loaded within images and about outdated versions of this software [1]. In contrast, another survey showed that only 19% of developers claim to test their Docker images for vulnerabilities during development [50]. This signals a tendency to produce and consume Docker images without inspecting their software in detail. Through an in-depth investigation of the attack surface and the definition of an adversary model, Combe et al. [8] provided a comprehensive overview of the vulnerabilities brought about by the use of Docker. Shu et al. [44] analyzed Docker images for security vulnerabilities. On a set of 356,218 images, they observed that both official and community images contain an average of 180 vulnerabilities. Many images had not been updated for hundreds of days, calling for a more systematic analysis of the content of Docker containers.

Package changes within Docker images can lead to broken functionality, poor performance, or security issues in the applications that depend on the packages. Gholami et al. [15] studied how packages change in official Docker images. After analyzing 37k images from official repositories, they found that 50% of the images underwent at least eight package upgrades. To shed more light on this problem, Sabuhi et al. [42] proposed a method to assess the impact of upgrading packages of Docker images on application performance. Zerouali et al. [59] studied the relationship between outdated system packages in Debian-based images, their security vulnerabilities, and their bugs. They computed the difference between the outdated system packages and their latest available releases in terms of versions, vulnerabilities, and bugs. They found that no Debian-based image is free of vulnerabilities or bugs, so deployers cannot avoid them


even if they deploy the most recent packages in these images. To ensure that they only consider Debian-based images, they relied on Docker’s inheritance mechanism previously explained in Sect. 9.2.3. Later, the same authors extended this study by instantiating the formal technical lag [58] framework along five different dimensions: package lag, time lag, version lag, bug lag, and vulnerability lag [57]. The technical lag refers to the difference between deployed software package releases and the ideal (e.g., most fresh, secure, or stable) available releases. Then, they carried out an empirical study on 140,498 popular Debian-based images from official and community Docker Hub repositories. For each dimension, they found that community images have higher lag than official images. Depending on the lag dimension, images from specific Debian distributions were found to have a higher lag than those coming from others. For example, version lag was highest for images relying on Debian Testing, while vulnerability lag was highest for OldStable images. They also found that in some cases, the lag increases over time, for example, for package and version lag in Debian Testing images. In a similar study, Zerouali et al. [54] focused on npm third-party packages and evaluated their outdatedness and vulnerabilities using 961 official node-based images coming from three Docker Hub repositories, namely, node, ghost, and mongo-express. They found that the presence of outdated npm packages in official node images increases the risk of security vulnerabilities. Later, the same authors extended this study to include Ruby and Python packages [56]. They found that the last time community images were updated, they had more outdated and vulnerable core packages than non-core ones. After some time, these packages missed more updates leading to more vulnerabilities present in Docker Hub community images. They also reported that the presence of vulnerable packages is considerably more pronounced for Node.js and Ruby images, which tend to be more outdated and more vulnerable than Python images. Moreover, node images tend to have the highest proportion of packages missing major updates, as well as a high number of duplicate package releases. Figure 9.3 shows the process and pipeline followed in the work of Zerouali et al. [56] to construct a representative dataset of community Docker Hub images and their installed third-party packages.

Fig. 9.3 Overview of the data collection pipeline used by Zerouali et al. [56]: identifying candidate images, extracting image layers from Docker Hub, running images, extracting installed packages, collecting package releases and security vulnerabilities, and data analysis

9.2.5.2 Technical Debt and Code Smells in Dockerfiles

Docker documents several best practices for writing Dockerfiles,8 but developers do not always follow them [18, 41, 51]. This may lead to technical debt and smells in Dockerfiles with a negative impact on the reliability and performance of Docker images. Azuma et al. [2] studied self-admitted technical debt (SATD) in Dockerfiles. SATD are comments left by developers as a reminder about code manifesting technical debt. They manually classified all comments found in 3149 Dockerfiles coming from 462 GitHub repositories. They found that 3% of the comments are SATD and that the three most common SATD classes concern maintainability, testing, and defects. In a large-scale study of 6334 Docker projects, Wu et al. [51] categorized Dockerfile smells into two categories: DL-smells (i.e., violations of the official Dockerfile best practices) and SC-smells (i.e., violations of shell script best practices). They found that nearly 84% of the analyzed projects have smells in their Dockerfiles. Furthermore, they found that DL-smells appear more often than SC-smells. Ksontni et al. [24] manually analyzed the Dockerfile and Docker Compose files of 68 projects for technical debt and refactoring opportunities. Docker Compose is a tool that allows the creation and operation of multi-container applications on a single host using YAML files that include configurations such as services, networks, and volumes for each container. They found six Dockerfile technical debt categories related to Image size, Build Time, Duplication, Maintainability, Understandability, and Modularity. For Docker Compose files, they found four: Duplication, Maintainability, Understandability, and Extensibility. As a remedy, they propose 14 refactorings for Dockerfiles and 12 for Docker Compose files. They conclude that these smells are widespread and that there is a lack of automatic tools that support developers in fixing them.
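To make these DL-smells concrete, the following contrived Dockerfile fragment violates several of the officially documented best practices (an unpinned base image, split update/install instructions, and a package index left inside the image), followed by a possible refactoring; it is an illustrative sketch rather than an excerpt from the studied projects:

    # Smelly version: unpinned base image, separate RUN instructions, apt cache kept in the image
    FROM ubuntu:latest
    RUN apt-get update
    RUN apt-get install -y curl

    # Refactored version following the documented best practices
    FROM ubuntu:22.04
    RUN apt-get update && \
        apt-get install -y --no-install-recommends curl && \
        rm -rf /var/lib/apt/lists/*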

9.2.5.3 Challenges in Maintaining and Evolving Dockerfiles

Next to the freshness, security, and quality of Docker images, other socio-technical aspects such as their evolution, reproducibility, and adoption have been studied. Cito et al. [7] characterized the Docker ecosystem by discovering prevalent quality issues and studying the evolution of Docker images. Using a dataset of over 70,000 Dockerfiles, they contrasted the general population with samples containing the top 100 and top 1000 most popular projects using Docker. They observed that the most popular projects change more often than the rest of the Docker population with an average of 5.81 revisions per year and five lines of changed code. Most importantly, based on a representative sample of 560 projects, they observed that one out of three Docker images could not be built from their Dockerfiles.

8 https://docs.docker.com/develop/develop-images/dockerfile_best-practices/.


Oumaziz et al. [32] carried out a study on duplicates in Dockerfile families (e.g., Python Dockerfiles). After inspecting Dockerfiles from 99 official Docker repositories, they found that duplicates in Dockerfiles are frequent and that maintainers are aware of their existence, although they have mixed opinions regarding them. Tsuru et al. [47] proposed a method to detect Type-2 code clones in Dockerfiles. Henkel et al. [18] found that Dockerfiles on GitHub have on average nearly five times more best practice violations than those written by Docker experts. They argue for more effective tooling in the realm of Dockerfiles. Eng et al. [14] revisited the findings of previous studies about Dockerfiles. After inspecting a large set of 9.4M unique Dockerfiles spanning from 2013 to 2020, they could confirm previous findings of a downward trend in using open-source images and an upward trend in using language images. They also confirmed that the number of Dockerfile smells is slightly decreasing. They concluded that their results support previous studies’ recommendations for improving tools for creating Docker images.

9.3 Ansible and Its Ansible Galaxy Ecosystem

Having discussed the leading containerization technology powering immutable infrastructures, we turn our attention to the leading technology for configuration management of mutable infrastructures.

9.3.1 Introduction to Configuration Management

Configuration management tools such as Ansible, Chef, and Puppet provide automation and replicability to infrastructure deployments using domain-specific languages. Practitioners can use these languages to write configuration management scripts wherein they declaratively specify the steps required to configure the machines in a digital infrastructure. The tools then run these scripts by applying each step to each individual infrastructure machine automatically. These domain-specific tools often provide built-in abstractions to perform common configuration actions, such as managing users or installing software packages. These abstractions also take care of ensuring idempotence to prevent making changes to a machine’s configuration if it is already configured correctly. Such abstractions enable practitioners to evolve their infrastructure mutably by re-executing changed scripts on machines that were configured with earlier versions of the script. Nonetheless, tools cannot offer built-in abstractions for each potential scenario and thus provide mechanisms for extension by means of modules and plug-ins. They also offer mechanisms for the reuse of such modules and plug-ins. Some tools even allow reusing whole configuration scripts. Consequently, the more popular of

Consequently, the more popular of these tools are surrounded by sizable communities of open-source developers who contribute their own solutions to infrastructure configuration tasks. This has led to a number of new Infrastructure-as-Code software ecosystems, such as the ones surrounding the Ansible Galaxy, Chef Supermarket, and Puppet Forge platforms. In this section, we focus on the former, Ansible and the Ansible Galaxy ecosystem, since it is the most popular tool of its kind [46].

9.3.2 The Ansible Configuration Management Tool

Ansible is an automation platform offering solutions for configuration management and provisioning of cloud machines. As such, it has become one of the most popular Infrastructure-as-Code tools today [16]. To configure groups of remote machines, the Ansible tool pushes configuration changes over the network based on tasks written in a YAML-based domain-specific language. This language offers many of the programming constructs found in general-purpose languages, such as variables, expressions, conditionals, simple loops, and exception handling. A complete Ansible program is called a playbook, which is further subdivided into one or more plays. Ansible code can also be modularized into roles, which can be reused in multiple plays or playbooks.

9.3.2.1 Ansible Plays and Playbooks

A playbook contains all code necessary to configure a complete infrastructure. Playbooks consist of several plays, each configuring a group of machines with the same responsibilities. For example, the playbook depicted in Listing 9.4 contains one play to configure database servers (lines 1–16) and another to configure web servers (lines 18–24).

Listing 9.4 Contrived example of an Ansible playbook to configure database and web servers

 1  - hosts: databases
 2    vars:
 3      psql_db_name: prod
 4      psql_db_user: app
 5    tasks:
 6      - name: Ensure database exists
 7        postgresql_db:
 8          name: "{{ psql_db_name }}"
 9          state: present
10
11      - name: Ensure user exists and has access to database
12        postgresql_user:
13          db: "{{ psql_db_name }}"
14          name: "{{ psql_db_user }}"
15          priv: ALL
16          state: present
17
18  - hosts: webservers
19    tasks:
20      - name: Ensure apache is installed
21        apt:
22          name: apache2
23          state: present
24        # ...

Each play consists of a sequence of tasks, which Ansible executes in sequential order on each machine individually. Each task performs one action corresponding to one step in the configuration of a machine. For instance, the first task of the first play in Listing 9.4 (lines 6–9) uses the postgresql_db action to create a database. Tasks can also be executed conditionally or in a loop over each item in a list. Ansible offers a number of built-in actions, such as user to manage user accounts, or apt to install software packages (line 21 in Listing 9.4). Practitioners can also implement their own actions in Python through plug-ins.

Within tasks, practitioners can write expressions in the Jinja2 templating language. Naturally, expressions can refer to variables, which can be defined on many different levels, e.g., play-level variables, variables local to a task, or variables whose value is specific to individual machines. Our example's first play defines two play-level variables (lines 3–4), storing the name of the created database and user. These variables are referenced in three expressions, demarcated by double braces (lines 8, 13, and 14). Ansible evaluates the part between double braces and substitutes the result of this evaluation into the string in which the expression is embedded. For instance, the result of evaluating the expression on line 8, which merely refers to a variable, will be "prod". Expressions can also manipulate data through filters, perform tests for conditional logic, or use lookups to produce data from external sources. As with actions, users can also define their own filters, tests, and lookups using plug-ins.
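
Because Ansible's expression syntax is based on the Jinja2 template engine, the substitution behavior described above can be illustrated outside of Ansible with the jinja2 Python package. The sketch below only approximates Ansible's behavior: Ansible layers its own filters, tests, lookups, and variable precedence rules on top of plain Jinja2.

from jinja2 import Environment

env = Environment()
variables = {"psql_db_name": "prod", "psql_db_user": "app"}

# Expression evaluation and substitution into the surrounding string
print(env.from_string("{{ psql_db_name }}").render(**variables))          # prod
print(env.from_string("db is {{ psql_db_name }}").render(**variables))    # db is prod

# Filters manipulate data within an expression
print(env.from_string("{{ psql_db_user | upper }}").render(**variables))  # APP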

9.3.2.2 Ansible Roles

It is common for parts of different plays to perform similar configuration tasks. For instance, both web servers and database servers may require network interfaces to be configured and certain firewall rules to be set up. Similarly, one may want to set up a test environment with a database server that uses the same configuration as the production environment. Rather than duplicating the tasks across the different plays, it is possible to modularize and reuse them with a role. An example of a role for the latter situation is depicted in Listings 9.5 and 9.6, which is adapted from the play in Listing 9.4.

Listing 9.5 The postgres role's tasks/main.yml file

- name: Ensure database exists
  postgresql_db:
    name: "{{ psql_db_name }}"
    state: present

- name: Ensure user exists
  postgresql_user:
    db: "{{ psql_db_name }}"
    name: "{{ psql_db_user }}"
    priv: ALL
    state: present

Listing 9.6 The postgres role's defaults/main.yml file

psql_db_name: prod
psql_db_user: app

Listing 9.7 Playbook using the role

# Configure production DB
- hosts: databases
  roles:
    - role: postgres

# Configure test DB
# with separate parameters
- hosts: test-database
  roles:
    - role: postgres
      vars:
        psql_db_name: test
        psql_db_user: test

Roles follow a strict directory structure. Like a play, a role contains a sequence of tasks, listed in the tasks/main.yml file. In our example, this file contains the same tasks as the play it is adapted from (Listing 9.5). Roles commonly have a set of parameters, called default variables, listed in defaults/main.yml, that can be used to customize the role’s behavior. The example role lists as its default variables the same variables that were present in the original play (Listing 9.6). When a role is included in a play, these variables can be overridden with values specific for that play. This facilitates reuse within an infrastructure configuration project. Listing 9.7 exemplifies this, where we override the variables specifically for the test environment. Moreover, roles can also be shared across different playbooks and can thus be used for separate infrastructures. Therefore, it is possible to construct roles and share them with other practitioners for use in their infrastructure projects. This forms the foundation of the Ansible Galaxy ecosystem, which we cover next.
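
For reference, a conventional role layout (as scaffolded by the ansible-galaxy role init command) looks roughly as follows; only the most common directories are shown, and the role name is hypothetical.

postgres/
├── defaults/main.yml     # default variables (the role's parameters)
├── vars/main.yml         # internal, higher-precedence variables
├── tasks/main.yml        # the role's task list
├── handlers/main.yml     # handlers notified by tasks
├── templates/            # Jinja2 templates used by tasks
├── files/                # static files copied to managed machines
└── meta/main.yml         # role metadata and dependencies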

9.3.3 The Ansible Galaxy Ecosystem

Ansible Galaxy is a registry operated by Ansible to facilitate reusing open-source Ansible code. It collects and displays metadata about Ansible roles and plug-ins written by open-source developers.

Ansible practitioners can discover the reusable content via Ansible Galaxy's web interface, while a command-line utility can be used to install, update, and manage the content in a project. As of January 2023, Ansible Galaxy indexes over 31,500 roles, the most popular of which have been downloaded millions of times. For instance, the most downloaded role is geerlingguy.java,9 with over 15M downloads, closely followed by geerlingguy.docker10 (13.5M downloads). These roles install Java and Docker, respectively, on various Linux systems and offer many customization options for their users. Ansible Galaxy is merely an indexer and does not store the content itself. Instead, the code is stored in GitHub repositories, and installing consists of cloning the repository. To add (or update) roles or plug-ins to Ansible Galaxy, an open-source developer must import their repository. Ansible Galaxy then populates (or updates) various pieces of metadata according to the information found in that repository.

9.3.3.1 Types of Ansible Galaxy Content

Ansible Galaxy aggregates two types of content, namely, roles and collections. As described above, roles contain reusable tasks that are made generic through the use of parameters. When such roles are committed to open-source repositories, they can be imported into Ansible Galaxy to be reused by others. Such roles generally aim to configure one facet of an infrastructure. For instance, the geerlingguy.docker role mentioned above installs and configures the Docker driver on various Linux platforms. Not only does it install all necessary software packages, but it can also set various configuration options, configure the Docker software to start automatically when the system boots, etc.

Ansible Galaxy collections are, as the name implies, collections of related Ansible content. Although collections can include roles, their primary use case is to extend the Ansible language by means of plug-ins for actions, filters, lookups, etc. As such, they intend to facilitate writing configuration tasks, rather than bundling configuration tasks for a specific purpose. A collection's content commonly shares a single theme. For instance, the community.dns collection aggregates plug-ins to manage DNS configurations, and the amazon.aws collection contains content related to Amazon AWS. Collections are backed by GitHub mono-repositories that contain all of the code for the collection's constituents.

9.3.3.2 Types of Metadata Maintained by Ansible Galaxy

Three types of metadata can be retrieved from Ansible Galaxy, namely, role and collection metadata, repository metadata, and usage and quality metadata.

9 https://galaxy.ansible.com/geerlingguy/java.
10 https://galaxy.ansible.com/geerlingguy/docker.

Role and collection metadata comprises information that is extracted from a role's meta/main.yml file or a collection's manifest file. For roles, this includes its name and author, a description, its license, the platforms and Ansible versions it supports, any dependencies on other roles or collections, tags submitted by the author, etc. For collections, this also includes basic data such as the name and author but further includes all constituents of the collection and their information.

Repository metadata comprises information gathered from the GitHub repository. This includes the GitHub URL, repository description, and information about the owner of the repository. Ansible Galaxy also stores the README file of the repository, URLs to Travis CI build information if available, and metrics such as the number of stars, watchers, forks, and open issues at the time the content was imported. It also lists the known versions of the repository, which Ansible Galaxy identifies as git tags that adhere to the semantic versioning format (x.y.z). Finally, the metadata includes information about the latest commit at the time of import and timestamps related to the last import that was performed on the repository.

Usage and quality metadata includes a role or collection's download count and quality information submitted by users and generated by tools. Users of the ecosystem can rate content through community surveys, which inquire about the quality of the role or collection's documentation, its ease of use, readiness for production, etc. These survey responses can be retrieved individually or through an aggregate "community score." Roles have an additional scoring metric, the "quality score," which is an average of a "syntax score" and a "metadata score." Ansible Galaxy applies linters and additional checks during import time, whose warnings are used to calculate these scores and can also be retrieved individually.
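
The role metadata mentioned above lives in the role's meta/main.yml file, which can be read with any YAML parser. The following sketch parses a small, made-up example; the field names follow the role metadata format, but the concrete values are purely illustrative.

import yaml

META_MAIN_YML = """
galaxy_info:
  author: jane_doe
  description: Install and configure PostgreSQL
  license: MIT
  min_ansible_version: "2.9"
  platforms:
    - name: Ubuntu
      versions: [focal, jammy]
  galaxy_tags: [database, postgresql]
dependencies: []
"""

meta = yaml.safe_load(META_MAIN_YML)
info = meta["galaxy_info"]
print(info["author"], "-", info["description"])
print("platforms:", [p["name"] for p in info["platforms"]])
print("dependencies:", meta["dependencies"])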

9.3.4 Approaches to Analyzing Ansible Galaxy

The Ansible Galaxy ecosystem has been the subject of empirical software engineering research in recent years. This section surveys the analyses that form the foundation for the research methods that have been applied to Ansible Galaxy and Ansible client code. We also highlight some key challenges addressed in such research and point the reader toward datasets.

We distinguish between three types of analyses on Ansible Galaxy. Metadata analysis concerns any analysis of role metadata extracted from Ansible Galaxy and related services, such as GitHub repositories. Static and dynamic analysis concerns analyses that operate on the source code of Ansible projects. The distinguishing factor is whether or not Ansible code is executed. In static analysis, the Ansible code is never executed throughout the analysis, whereas in dynamic analysis, (parts of) the code may be evaluated by specialized interpreters.

9.3.4.1 Ansible Galaxy Metadata Analysis

The Ansible Galaxy indexer can be used as a valuable source of information for ecosystem metadata analysis. Metadata extracted from a role’s git repository, such as commits, tags, contributors, issues, and pull requests, can further enrich this metadata. Finally, roles and repositories are often linked to additional sources of metadata, such as CI services like Travis CI, CircleCI, and GitHub Actions, which can provide information about a project’s build and test statuses. Much of this information has been aggregated in Opdebeeck et al.’s [28] “Andromeda” dataset. It represents a full snapshot of Ansible Galaxy’s content, collected in January 2021. It contains the aforementioned Galaxy metadata of over 25,000 Ansible roles, as well as metadata about their repository’s git commits and tags. The authors also published Voyager [30], the tool used to collect this dataset. It implements the pipeline depicted in Fig. 9.4. It starts by querying Ansible Galaxy’s JSON-based API to discover roles and collect their metadata. Subsequently, it clones the roles’ repositories and collects git metadata and versions. It later analyzes the code in these repositories to create a structural representation and to categorize code changes between two consecutive role versions. Importantly, since the dataset mainly contains raw data, it may not be immediately usable for analysis. For instance, one should likely filter out low-quality repositories using standard metrics such as number of downloads and repository activity. Moreover, Ansible Galaxy supports monolithic repositories, or “monorepos,” which store multiple roles in a single repository. Such monorepos need to be handled with care when cross-referencing with git data, e.g., not every commit in the repository will apply to every role in the repository. Finally, Ansible Galaxy may serve outdated information, such as repository metrics and role versions, since this information is not continually updated in the index. However, since all Galaxy repositories are linked to GitHub repositories, standard tools can be used to compute up-to-date information.

Fig. 9.4 Overview of the data collection pipeline of Voyager [31] (pipeline stages: crawl and normalise Ansible Galaxy role metadata, clone repositories, extract commits and tags, construct structural models, and distill structural changes between consecutive role versions)
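
The first step of such a pipeline, querying the registry's JSON API, can be sketched as follows. The endpoint and field names below are assumptions based on the legacy v1 Galaxy API that was available around the time Andromeda was collected; the current Galaxy deployment exposes a different API, so treat this purely as an illustration.

import requests

BASE = "https://galaxy.ansible.com/api/v1/roles/"   # assumed legacy endpoint

def crawl_roles(max_pages=2):
    roles, url, params = [], BASE, {"page_size": 100}
    for _ in range(max_pages):
        data = requests.get(url, params=params, timeout=30).json()
        for entry in data.get("results", []):
            roles.append({
                "name": entry.get("name"),               # assumed field names
                "downloads": entry.get("download_count"),
                "github_repo": entry.get("github_repo"),
            })
        url, params = data.get("next_link") or data.get("next"), {}
        if not url:
            break
        if url.startswith("/"):
            url = "https://galaxy.ansible.com" + url
    return roles

print(len(crawl_roles()), "roles collected")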

9.3.4.2 Static Analysis of Ansible Infrastructure Code

Prior work has proposed various approaches to static analysis of Ansible code, ranging from code smell detection and defect prediction, over change distilling, to model checking of behavioral properties. Many of these approaches have been summarized for IaC in general by Chiari et al. [6]. Static analyses share a need to represent the code they reason about. In this section, we describe a range of representations that have been used in prior work, starting with the simplest and later delving into more advanced representations.

Lexical representations are the simplest code representations. They often take the form of token streams, wherein the original source code is split based on the positions of certain character classes, e.g., whitespace, or separators such as colons. Such representations do not distinguish between different program elements, which makes it difficult to perform an in-depth analysis of the code. However, token stream representations can be used for tasks such as code smell detection, wherein the detector uses matching rules corresponding to sequences of tokens to highlight potential problems in code. Such approaches have been proposed for other IaC languages, such as Puppet [37]. Nonetheless, research on Ansible has skipped such representations, as it is trivial to obtain syntactical, tree-based representations, as described below.

Syntactical representations are the result of transforming token streams into richer, often tree-based structures by assigning syntactical classes to tokens and reassembling them to represent program elements. For instance, in a syntactical representation, a task appears as a single element with sub-elements for the task's constituents, rather than a sequence of consecutive tokens in a flat token stream. Such representations therefore encode the structure of the program and elide lexical details such as whitespace and separators. This substantially facilitates the analysis of the code. For example, counting the number of tasks that appear in a role requires little more than a simple tree traversal. Since Ansible code is written in YAML, one can easily obtain a tree-based syntactical representation merely by parsing the YAML file. The resulting representation is a data structure consisting of lists and key-value dictionaries, closely matching the structure of the source code.

However, this representation exhibits a number of shortcomings. First, the elements of the data structure are unlabeled; that is, without context, one cannot determine whether a key-value mapping is a task, a block of tasks, a collection of variables, or a variable value. This burdens the designer of the analysis, who has to reconstruct this context themselves. Second, YAML parsing does not take Ansible-specific syntax into account, such as the task argument shorthand that allows one to write an action and its arguments on a single line. For example, the shorthand apt: name=apache2 state=present is equivalent to the action of the last task in Listing 9.4. Such shorthands further burden the designer of the analysis, who may need to perform additional transformations on the representation. Nonetheless, many static analysis approaches for Ansible merely parse the YAML code, since it is often sufficient for their use cases, such as code linting [38] or defect prediction [11].
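
Both the convenience and the shortcomings of this purely syntactical representation can be seen in a few lines of Python. The play below is made up for illustration; counting tasks is trivial, but the shorthand value arrives as an unlabeled string that the analysis must normalize itself (real tooling would reuse Ansible's own parser for this).

import yaml

PLAY = """
- hosts: webservers
  tasks:
    - name: Ensure apache is installed
      apt:
        name: apache2
        state: present
    - name: Restart apache
      service: name=apache2 state=restarted
"""

plays = yaml.safe_load(PLAY)
tasks = [task for play in plays for task in play.get("tasks", [])]
print("number of tasks:", len(tasks))   # 2 -- a simple tree traversal

# The task argument shorthand is parsed as one opaque string...
raw = tasks[1]["service"]
# ...so the analysis must normalize it into a dict itself (naive split here).
print(dict(kv.split("=", 1) for kv in raw.split()))   # {'name': 'apache2', 'state': 'restarted'}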


Structural representations build upon syntactical representations to capture more of the structure of an Ansible program. Opdebeeck et al.’s [31] structural model for Ansible roles is one such representation that addresses the limitations of the previously described syntactical representations. It abstracts over the parsed YAML data structures such that each program element is tagged with its type, e.g., tasks, variables, or blocks. Moreover, the representation is normalized such that syntactical variants (e.g., task argument shorthand) map to the same representation. Figure 9.5 depicts the structural model for the example role in Listings 9.5 and 9.6. One can see that the model represents relevant files in the role, including files containing tasks or default variables. Each unit of the script, including variables and tasks, is modelled as a separate node as a child of the component in which it is contained. For instance, the two variables are part of the default variable file component. The two tasks are part of an implicit top-level block, which in turn is part of the task file. Opdebeeck et al. used this model to distill fine-grained changes to role source code, which were used to compare different role versions. The structural models of over 125,000 role versions and the distilled changes are also contained within the aforementioned Andromeda dataset [28]. Both the structural model builder and change distiller have been made open source in their Voyager data collection tool [30].

Fig. 9.5 Structural model representation of Listings 9.5 and 9.6


Behavioral representations can store behavioral information derived by prior static analyses alongside the representation of the source code. Static analysis tools can then employ the already-derived information in their own analyses. Several kinds of behavioral information can be considered. Control and data flow information are among the most common requirements for in-depth static analyses. Control flow information describes the possible paths that a program may take and includes information about control order, branching points, and loops. Data flow information describes how and where data is defined and used in a program. The Structured Resource Tree representation by Dai et al. [9] is a tree representation whose nodes represent components (files, blocks), units (tasks), variables, and operations (expressions). The tree is structured according to a structural containment relationship, similar to the aforementioned structural model; that is, a component node’s children are those nodes (other components or units) that are contained within it. Contrary to purely structural representations, Structured Resource Trees contain some behavioral information, such as data flow dependencies between variables and expressions, and a partial execution order relationship. However, since the representation relies on “define before use” heuristics, it cannot account for the intricacies of Ansible’s semantics, such as lazy evaluation and complex variable precedence rules. Moreover, the representation does not relate control flow to data flow and can therefore not capture the influence of values on which control flow branch is taken. The latter is traditionally captured in Program Dependence Graphs (PDGs), a representation that joins control and data flow information into a single graph. Opdebeeck et al.’s [29] PDG representation implements this concept for Ansible roles. It consists of control nodes interlinked through control flow order edges and data nodes representing unique abstract values present in the program. The latter are linked to control and data nodes alike via data flow (data definition and data usage) edges. For data nodes, the representation further distinguishes between different types of data, namely, literals, expressions, named value nodes (variables), and unnamed value nodes (results of expressions). The PDG builder also accounts for the aforementioned intricacies of Ansible’s semantics and ensures that no two different concrete values are ever represented by the same abstract data node. To this end, when a variable is used multiple times, the builder ensures that the node for its value is linked to both usages if it can be statically proven that the usages will always receive the exact same value. Otherwise, the builder will use multiple data nodes to represent the different usages of a variable, since its value may have changed in between two usages. An example of a PDG for the role in Listings 9.5 and 9.6 is depicted in Fig. 9.6. From the representation, one can immediately see that both tasks use a common value, as well as how these values are defined and used.


Fig. 9.6 Program dependence graph representation of Listings 9.5 and 9.6
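
A heavily simplified flavor of the data-flow part of such representations can be obtained by linking each default variable to the tasks whose expressions reference it. The sketch below does exactly that for the role of Listings 9.5 and 9.6; unlike a real PDG builder, it ignores control flow, variable precedence, and lazy evaluation entirely.

import re
import yaml

DEFAULTS = yaml.safe_load("psql_db_name: prod\npsql_db_user: app")
TASKS = yaml.safe_load("""
- name: Ensure database exists
  postgresql_db: { name: "{{ psql_db_name }}", state: present }
- name: Ensure user exists
  postgresql_user: { db: "{{ psql_db_name }}", name: "{{ psql_db_user }}", state: present }
""")

EXPR = re.compile(r"\{\{\s*(\w+)\s*\}\}")

def used_variables(value):
    # Recursively collect variable names referenced in "{{ ... }}" expressions.
    if isinstance(value, str):
        return set(EXPR.findall(value))
    if isinstance(value, dict):
        value = list(value.values())
    if isinstance(value, list):
        return set().union(set(), *(used_variables(v) for v in value))
    return set()

for task in TASKS:
    print(task["name"], "->", sorted(used_variables(task) & set(DEFAULTS)))
# Ensure database exists -> ['psql_db_name']
# Ensure user exists -> ['psql_db_name', 'psql_db_user']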

9.3.4.3 Dynamic Analysis of Ansible Infrastructure Code

Dynamic analysis of configuration management languages in general and Ansible in particular is largely unexplored. Nonetheless, we discern three dynamic analysis approaches that can be applied to Ansible roles.

First, since configuration management tools need to interact with a machine's operating system to configure the machine, a dynamic analysis could collect system call traces while the code is being executed. Such traces provide insights into the behavior of a configuration script at runtime. The collected traces can then be subjected to further analysis to find faults and defects in the code, as has been applied to find missing control order dependencies in Puppet programs [45]. Moreover, traces for different programs can be matched to one another, e.g., to suggest migrations of imperative shell scripts to Ansible tasks [21].

Second, test cases and their outcomes can serve as a source of information for empirical studies. To test their Ansible roles, many developers use molecule [40], a test framework designed for Ansible. Behind the scenes, molecule uses Docker to set up a virtual infrastructure on which it executes an Ansible role. The role developer can subsequently test the final infrastructure state through assertions, also
written in the Ansible language. Test outcomes could provide interesting insights for developers. Moreover, these test cases provide an automated means to execute Ansible roles and could thus be used to generate the aforementioned system call traces. Finally, rather than relying on test cases written by developers, some approaches generate their own test cases to check behavioral properties of infrastructure code. Hummer et al. [22] used model-based testing to find idempotence issues in Chef code. Their approach uses a system model and state transition graph representations (cf. Sect. 9.3.4.2) to generate a sequence of test cases. Subsequently, it executes these test cases, monitors the actions undertaken by the IaC tool, and analyzes them to detect idempotence issues.
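
As an illustration of the first of these approaches (system call tracing), the snippet below runs a playbook under strace to record the file-related system calls made by Ansible and the programs it spawns. The playbook name and inventory are hypothetical, and real studies post-process such traces far more carefully.

import subprocess

# -f follows forked processes; -e trace=file restricts tracing to file syscalls.
subprocess.run(
    ["strace", "-f", "-e", "trace=file", "-o", "syscalls.log",
     "ansible-playbook", "-i", "localhost,", "-c", "local", "site.yml"],
    check=True,
)

with open("syscalls.log") as log:
    opens = [line for line in log if "openat(" in line]
print(len(opens), "file-open system calls recorded")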

9.3.5 Empirical Insights from Analyzing Ansible Infrastructure Code

We now discuss the most recent insights from empirical software engineering research into the Ansible ecosystem gained through the analysis methods discussed above. For a more general overview of prior work on Infrastructure as Code, we refer the reader to the systematic mapping study by Rahman et al. [36].

9.3.5.1 Code Smells and Quality in the Ansible Galaxy Ecosystem

Code smells are recurring coding patterns indicating flaws in a program's design or implementation. Importantly, code smells themselves are not necessarily defects; their presence merely indicates a potential defect as a suggestion for a developer to investigate in more depth. Moreover, code smells may highlight potential maintenance issues and areas worthy of refactoring. A related concept is that of code quality metrics, which quantify various quality-related aspects of source code.

Initially, much of the work on quality metrics and code smells in Infrastructure as Code focused on investigating whether metrics for general-purpose languages (e.g., number of statements, inconsistent naming conventions) were applicable to IaC languages. For instance, both Sharma et al. [43] and van der Bent et al. [49] transposed code smells and quality metrics, respectively, from general-purpose languages to Puppet code. Dalla Palma et al. [10] extended these efforts to propose a set of 46 quality metrics for Ansible code, such as the number of tasks, loops, and conditionals. These metrics are computed from simple syntactical source code representations (cf. Sect. 9.3.4.2), mainly by counting the number of occurrences of keywords in the source code.

Rahman et al. [38] defined seven security-related code smells for Ansible, such as the use of hard-coded passwords or a lack of integrity checking on files downloaded by a task. They implement a rule-based detection tool, called SLAC, to identify these
smells using a syntactical code representation. They subsequently used this tool to investigate the prevalence of security smells in nearly 15,000 open-source Ansible files, belonging to over 350 repositories. They observed that 25.3% of the studied Ansible repositories found on GitHub contain at least one of the security smells. They furthermore found a total of more than 17,000 security smells throughout these repositories, over 80% of which are uses of hard-coded secrets. The authors further conducted a qualitative analysis by submitting issue reports to GitHub repositories, whose results suggest that most Ansible practitioners agree with the reports. Opdebeeck et al. [29] hypothesize that some of Ansible’s semantics, such as lazy evaluation of expressions and a complicated variable precedence system, may lead to defects in Ansible code. The authors therefore propose six code smells related to the usage of variables and expressions, such as multiple usages of a variable whose value may have changed between the usages, variables that have been defined through unnecessarily complicated mechanisms, and potentially accidental redefinitions of variables. They developed an approach that operates on their Ansible program dependence graph representation and detects the proposed smells through graph matching. The authors apply this approach to study the prevalence of the proposed smells in Ansible roles indexed by Ansible Galaxy. They observed that these code smells could be detected in the development history of over 4200 roles, comprising nearly 20% of the studied dataset with a total of more than 31,000 unique smell instances, roughly 22,000 of which are still present in the role’s latest version. They also found that although some of the smells are getting fixed by developers, the rate at which the smells are introduced outpaces that of the fixes. However, they observed that it may take multiple years before a smell is fixed. Kokuryo et al. [23] investigated the usage of imperative actions that can run arbitrary shell commands. To this end, they investigated the tasks of 864 Ansible roles discovered through the Ansible Galaxy ecosystem. They found that nearly half of these roles use at least one imperative action, mainly to perform operations for which there is no dedicated Ansible action, but also to perform filesystem operations. They further found that many of these imperative actions can be replaced by dedicated Ansible actions, which are preferred over arbitrary shell commands. Hassan and Rahman [17] empirically studied the quality of 4831 test files in 104 open-source Ansible repositories. They found that 1.8% of these files contain test bugs, which they categorize into seven types of test defects, such as excessive amounts of logs being generated, or software artifacts that are unavailable when a test script runs. They further identified a number of testing patterns that correlate with the identified bugs, such as assertion roulette and testing playbooks solely in a local environment that may not accurately represent real-world situations. Many opportunities for code smell and code quality research on Ansible still exist. One particular aspect that is worthy of investigation is a potential connection between Ansible Galaxy’s quality survey scores and the detected code smells or proposed code quality metrics. 
Such an empirical study could be beneficial to the ecosystem and its practitioners, by providing more insights into automated tools and how their results correlate with user-provided qualitative data.
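
To make the rule-based style of smell detection discussed in this subsection concrete, the toy sketch below flags tasks that pass a literal value to a suspicious-looking argument. It is in the spirit of, but far simpler than, SLAC; the rule set and module arguments are illustrative only.

import yaml

SUSPICIOUS_KEYS = {"password", "passwd", "secret", "api_key", "token"}

def hard_coded_secrets(tasks):
    findings = []
    for task in tasks:
        for module, args in task.items():
            if not isinstance(args, dict):
                continue   # skip 'name', 'when', and other non-module keys
            for key, value in args.items():
                if key.lower() in SUSPICIOUS_KEYS and isinstance(value, str) and "{{" not in value:
                    findings.append((task.get("name", "<unnamed>"), module, key))
    return findings

tasks = yaml.safe_load("""
- name: Create application database user
  postgresql_user: { name: app, password: s3cr3t }
- name: Create admin database user
  postgresql_user: { name: admin, password: "{{ vault_admin_pw }}" }
""")
print(hard_coded_secrets(tasks))
# [('Create application database user', 'postgresql_user', 'password')]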

9.3.5.2 Defect Prediction for the Ansible Galaxy Ecosystem

Defect prediction is the use of machine learning models to predict whether code or code changes are defective. Such models are trained to distinguish between defective and defect-free code using a set of code metrics as features. They are then tasked with predicting whether previously unseen code is defective, based on the model’s inferences from the training phase. For Ansible, Dalla Palma et al. [11] trained and compared five different machine learning models using 104 features and a dataset of more than 100 open-source Ansible repositories. They classify their metrics into three categories. First, IaC-oriented metrics relate to structural properties of the Ansible code and include the 46 aforementioned metrics proposed by Dalla Palma et al. [10] as well as 14 metrics transposed from prior work by Rahman and Williams [39]. Second, delta metrics capture the amount of change in these IaC-oriented metrics between successive releases of a script. Finally, process metrics capture information regarding the development process rather than the code, e.g., the number of developers that worked on a file. Their empirical analysis uncovered that a Random Forest model provides the best defect prediction performance for their features. They also found that IaC-oriented metrics and specifically the number of tokens and lines of code in a file tend to maximize the prediction performance. They further observed that, contrary to traditional defect prediction, process metrics do not contribute a significant performance improvement for Ansible code. This could be due to infrastructure code being changed less often than traditional application code. Borovits et al. [5] used a machine learning approach to detect linguistic inconsistencies in Ansible tasks. To this end, they built a synthetic dataset by mutating tasks found in open-source repositories, thereby creating inconsistencies between their description and actual behavior. They then use this dataset to train and evaluate multiple machine learning classifiers that predict such inconsistencies. Their results suggest that both classical machine learning techniques and deep learning techniques are effective at finding these inconsistencies.
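
The overall shape of such an experiment can be sketched with scikit-learn. The feature values and labels below are invented solely to show the workflow (compute metrics per script release, train a Random Forest, evaluate with cross-validation, inspect feature importances); it is not the cited study's setup.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Rows: one script release each; columns: [num_tasks, lines_of_code, num_conditions, num_developers]
X = [
    [ 4,  35, 1, 1],
    [12, 180, 6, 3],
    [ 7,  90, 2, 2],
    [20, 310, 9, 4],
    [ 3,  25, 0, 1],
    [15, 240, 7, 2],
]
y = [0, 1, 0, 1, 0, 1]   # 1 = defective release (made-up labels)

model = RandomForestClassifier(n_estimators=100, random_state=0)
print("cross-validated F1:", cross_val_score(model, X, y, cv=3, scoring="f1").mean())

model.fit(X, y)
print("feature importances:", model.feature_importances_)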

9.3.5.3 Evolution Within the Ansible Galaxy Ecosystem

Many software ecosystems recommend using the semantic versioning (SemVer) standard [34] (x.y.z) to denote software release versions [12, 25, 35]. Ansible Galaxy is no different and recommends that its contributors use SemVer to denote their role release versions. However, it provides no guidelines with regard to interpreting the semantics of semantic versioning, such as selecting a release type (i.e., major, minor, or patch version increments). To address this gap, Opdebeeck et al. [31] conducted a large-scale empirical study into the versioning practices employed in the Ansible Galaxy ecosystem. Their findings indicate that although a majority of Ansible Galaxy roles adhere to the semantic versioning syntax, a substantial portion of developers may choose their version increment arbitrarily.


They further conducted a qualitative survey with role developers, querying them about their interpretation of semantic versioning for Ansible roles. The survey reveals that many developers consider a role's default variables to be its main interface. Changes to these default variables, such as removing or renaming them, are considered backward incompatible and therefore lead to a major version bump. Adding new variables is often seen as the addition of functionality and constitutes a minor version increment.

With these insights, the authors extract a set of structural change metrics by performing change distilling on their structural model representation (cf. Sect. 9.3.4.2). These metrics are used as the features to train a Random Forest machine learning model tasked with predicting the appropriate version bump for a set of changes. The model indicates that the most important features are generally aligned with practitioners' responses, such as changes to default variables or additions of tasks. Nonetheless, their classifier experiences difficulty in correctly predicting the version increment. A subsequent manual analysis uncovered that in many cases, the model was making the correct prediction, yet the role developer chose a wrong version increment in practice.

Finally, the authors synthesized a set of recommendations for Ansible practitioners and the ecosystem as a whole. For instance, they recommended a set of versioning guidelines based on the practitioner responses and recommended that role users thoroughly test their Ansible code after updating a dependency.
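
The survey findings above translate naturally into a simple rule of thumb that compares the default variables of two role versions. The function below is a deliberately naive illustration of that idea, not the change-classification model of the study; real changes (task additions, renamed variables, etc.) require proper change distilling.

import yaml

def suggest_bump(old_defaults_yaml, new_defaults_yaml, new_functionality=False):
    old = set(yaml.safe_load(old_defaults_yaml) or {})
    new = set(yaml.safe_load(new_defaults_yaml) or {})
    if old - new:                          # removed (or renamed) default variables
        return "major"
    if (new - old) or new_functionality:   # added variables or new functionality
        return "minor"
    return "patch"

print(suggest_bump("psql_db_name: prod\npsql_db_user: app",
                   "psql_db_name: prod\npsql_db_user: app\npsql_port: 5432"))
# minor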

9.4 Conclusion

This chapter discussed the emerging Infrastructure-as-Code ecosystems forming around Ansible and Docker, the most popular embodiments of the mutable and immutable IaC paradigms, respectively, the different approaches to analyze them, and the latest empirical insights. Both of these ecosystems pivot around registries in which software artifacts are created and shared: Ansible has Ansible Galaxy, while Docker has Docker Hub. Artifacts in these registries are similar to general-purpose libraries, in the sense that they can be reused by others in creating and configuring clusters of (virtual) machines, clouds, and containers. Ansible Galaxy hosts Ansible roles and collections, while Docker Hub hosts Docker images.

Previous studies [1, 16, 18] have shown that developers and users of these artifacts face many challenges when developing or reusing them, including their understandability and testability. Developers complain that testing is difficult and code review is too basic; that is, no clear reviewing guidelines are available. In addition, because these ecosystems are relatively young and growing rapidly, their reusable artifacts suffer from inconsistency; that is, their versions are often backward incompatible. Moreover, many artifacts suffer from security vulnerabilities [38, 57], which may jeopardize dependent artifacts and millions of their users.


Next to the developers, researchers also face different challenges when analyzing these ecosystems. First, a lack of tooling requires researchers to create new tools and analyses from scratch that can cope with the specifics of IaC artifacts. Moreover, in many cases, they need to understand the use and development of these artifacts in the wild in order to come up with the appropriate rules to follow when creating data collection and analysis pipelines. This is due to the lack of clear coding conventions and standards such as semantic versioning for Ansible roles and Dockerfile naming. They are also required to choose the right sample of artifacts to analyze. For example, inheritance in Docker can lead to duplicate images being studied multiple times. Similarly, Ansible scripts can be cloned and distributed across different projects. Biased datasets do not accurately represent reality, which may lead to misinterpretations of empirical results. Furthermore, the vast scale of available data forces researchers to consider frozen snapshots of an IaC ecosystem, which may impede gaining insights into the ecosystem’s evolution. Finally, since most ecosystem research on Ansible and Docker focuses solely on a single ecosystem, the generalizability of their findings is hampered. It is often unclear whether similar observations can be made for other IaC ecosystems, such as the ones formed around Chef and Puppet. We conclude that there are ample opportunities for further research into the domain.

References 1. Anchore.io: Snapshot of the container ecosystem (2017). https://anchore.com/wp-content/ uploads/2017/04/Anchore-Container-Survey-5.pdf. Accessed 15 Apr 2023 2. Azuma, H., Matsumoto, S., Kamei, Y., Kusumoto, S.: An empirical study on self-admitted technical debt in dockerfiles. Empirical Softw. Eng. 27(2), 1–26 (2022) 3. Bettini, A.: Vulnerability exploitation in Docker container environments. FlawCheck, Black Hat Europe (2015) 4. Boettiger, C.: An introduction to Docker for reproducible research. ACM SIGOPS Oper. Syst. Rev. 49(1), 71–79 (2015). https://doi.org/10.1145/2723872.2723882 5. Borovits, N., Kumara, I., Di Nucci, D., Krishnan, P., Dalla Palma, S., Palomba, F., Tamburri, D.A., van den Heuvel, W.J.: FindICI: using machine learning to detect linguistic inconsistencies between code and natural language descriptions in infrastructure-as-code. Empirical Softw. Eng. 27(178) (2022). https://doi.org/10.1007/s10664-022-10215-5 6. Chiari, M., De Pascalis, M., Pradella, M.: Static analysis of infrastructure as code: a survey. In: International Conference on Software Architecture (ICSA), pp. 218–225 (2022). https://doi. org/10.1109/ICSA-C54293.2022.00049 7. Cito, J., Schermann, G., Wittern, J.E., Leitner, P., Zumberi, S., Gall, H.C.: An empirical analysis of the Docker container ecosystem on GitHub. In: International Conference on Mining Software Repositories (MSR), pp. 323–333. IEEE, Piscataway (2017). https://doi.org/10.1109/ MSR.2017.67 8. Combe, T., Martin, A., Di Pietro, R.: To Docker or not to Docker: a security perspective. IEEE Cloud Comput. 3(5), 54–62 (2016). https://doi.org/10.1109/MCC.2016.100 9. Dai, T., Karve, A., Koper, G., Zeng, S.: Automatically detecting risky scripts in infrastructure code. In: Symposium on Cloud Computing (SoCC), pp. 358–371. ACM (2020). https://doi. org/10.1145/3419111.3421303


10. Dalla Palma, S., Di Nucci, D., Palomba, F., Tamburri, D.A.: Toward a catalog of software quality metrics for infrastructure code. J. Syst. Softw. 170 (2020). https://doi.org/10.1016/j.jss. 2020.110726 11. Dalla Palma, S., Di Nucci, D., Palomba, F., Tamburri, D.A.: Within-project defect prediction of infrastructure-as-code using product and process metrics. Trans. Softw. Eng. 48(6), 2086–2104 (2022). https://doi.org/10.1109/TSE.2021.3051492 12. Decan, A., Mens, T.: What do package dependencies tell us about semantic versioning? Trans. Softw. Eng. 47(6), 1226–1240 (2021). https://doi.org/10.1109/TSE.2019.2918315 13. Dragoni, N., Giallorenzo, S., Lafuente, A.L., Mazzara, M., Montesi, F., Mustafin, R., Safina, L.: Microservices: yesterday, today, and tomorrow. In: Present and Ulterior Software Engineering, pp. 195–216 (2017) 14. Eng, K., Hindle, A.: Revisiting Dockerfiles in open source software over time. In: 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), pp. 449– 459. IEEE, Piscataway (2021) 15. Gholami, S., Khazaei, H., Bezemer, C.P.: Should you upgrade official Docker Hub images in production environments? In: International Conference on Software Engineering—New Ideas and Emerging Results (ICSE-NIER), pp. 101–105. IEEE, Piscataway (2021) 16. Guerriero, M., Garriga, M., Tamburri, D.A., Palomba, F.: Adoption, support, and challenges of infrastructure-as-code: insights from industry. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 580–589. IEEE, Piscataway (2019) 17. Hassan, M.M., Rahman, A.: As code testing: characterizing test quality in open source Ansible development. In: International Conference on Software Testing, Verification and Validation (ICST), pp. 208–219 (2022). https://doi.org/10.1109/ICST53961.2022.00031 18. Henkel, J., Bird, C., Lahiri, S.K., Reps, T.: Learning from, understanding, and supporting DevOps artifacts for docker. In: International Conference on Software Engineering (ICSE), pp. 38–49. IEEE, Piscataway (2020) 19. Henkel, J., Silva, D., Teixeira, L., d’Amorim, M., Reps, T.: Shipwright: a human-in-the-loop system for Dockerfile repair. In: International Conference on Software Engineering (ICSE), pp. 1148–1160. IEEE, Piscataway (2021). https://doi.org/10.1109/ICSE43902.2021.00106 20. Henriksson, O., Falk, M.: Static vulnerability analysis of Docker images (2017) 21. Horton, E., Parnin, C.: Dozer: migrating shell commands to Ansible modules via execution profiling and synthesis. In: International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 147–148 (2022). https://doi.org/10.1145/3510457. 3513060 22. Hummer, W., Rosenberg, F., Oliveira, F., Eilam, T.: Testing idempotence for infrastructure as code. In: ACM/IFIP/USENIX International Middleware Conference, pp. 368–388 (2013). https://doi.org/10.1007/978-3-642-45065-5%5C_19 23. Kokuryo, S., Kondo, M., Mizuno, O.: An empirical study of utilization of imperative modules in Ansible. In: International Conference on Software Quality, Reliability and Security (QRS), pp. 442–449 (2020). https://doi.org/10.1109/QRS51102.2020.00063 24. Ksontini, E., Kessentini, M., Ferreira, T.d.N., Hassan, F.: Refactorings and technical debt in docker projects: an empirical study. In: International Conference on Automated Software Engineering (ASE), pp. 781–791. IEEE, Piscataway (2021). https://doi.org/10.1109/ASE51524. 2021.9678585 25. Lam, P., Dietrich, J., Pearce, D.J.: Putting the semantics into semantic versioning. 
In: International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward!), pp. 157–179. ACM (2020). https://doi.org/10.1145/3426428.3426922 26. Lin, C., Nadi, S., Khazaei, H.: A large-scale data set and an empirical study of Docker images hosted on Docker Hub. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 371–381. IEEE, Piscataway (2020). https://doi.org/10.1109/ICSME46990.2020. 00043 27. Lin, C., Nadi, S., Khazaei, H.: A large-scale data set of Docker images hosted on Docker Hub (2020). https://doi.org/10.5281/zenodo.3862987


28. Opdebeeck, R., Zerouali, A., De Roover, C.: Andromeda: a dataset of Ansible Galaxy roles and their evolution. In: International Conference on Mining Software Repositories (MSR), pp. 580–584 (2021). https://doi.org/10.1109/MSR52588.2021.00078 29. Opdebeeck, R., Zerouali, A., De Roover, C.: Smelly variables in Ansible infrastructure code: detection, prevalence, and lifetime. In: International Conference on Mining Software Repositories (MSR). ACM (2022). https://doi.org/10.1145/3524842.3527964 30. Opdebeeck, R., Zerouali, A., Velázquez-Rodríguez, C., De Roover, C.: Replication package of SCAM 2020 Ansible role semantic versioning empirical study (2020). https://doi.org/10.5281/ zenodo.4041169 31. Opdebeeck, R., Zerouali, A., Velázquez-Rodríguez, C., De Roover, C.: On the practice of semantic versioning for Ansible Galaxy roles: an empirical study and a change classification model. J. Syst. Softw. 182 (2021). https://doi.org/10.1016/j.jss.2021.111059 32. Oumaziz, M.A., Falleri, J.R., Blanc, X., Bissyandé, T.F., Klein, J.: Handling duplicates in Dockerfiles families: learning from experts. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 524–535. IEEE, Piscataway (2019) 33. Pahl, C.: Containerization and the PaaS cloud. IEEE Cloud Comput. 2(3), 24–31 (2015) 34. Preston-Werner, T.: Semantic versioning 2.0.0 (2013). https://semver.org/. Accessed 15 Apr 2023 35. Raemaekers, S., van Deursen, A., Visser, J.: Semantic versioning and impact of breaking changes in the maven repository. J. Syst. Softw. 129, 140–158 (2017). https://doi.org/10.1016/ j.jss.2016.04.008 36. Rahman, A., Mahdavi-Hezaveh, R., Williams, L.: A systematic mapping study of infrastructure as code research. Inform. Softw. Technol. 108, 65–77 (2019). https://doi.org/10.1016/j.infsof. 2018.12.004 37. Rahman, A., Parnin, C., Williams, L.: The seven sins: security smells in infrastructure as code scripts. In: International Conference on Software Engineering (ICSE), ICSE ’19, pp. 164–175 (2019). https://doi.org/10.1109/ICSE.2019.00033 38. Rahman, A., Rahman, M.R., Parnin, C., Williams, L.: Security smells in Ansible and Chef scripts: a replication study. Trans. Softw. Eng. Methodol. 30(1) (2021). https://doi.org/10.1145/ 3408897 39. Rahman, A., Williams, L.: Source code properties of defective infrastructure as code scripts. Inform. Softw. Technol. 112, 148–163 (2019). https://doi.org/10.1016/j.infsof.2019.04.013 40. Red Hat, Inc.: Ansible Molecule (2023). https://molecule.readthedocs.io/en/latest/. Accessed 15 Apr 2023 41. Rosa, G., Scalabrino, S., Oliveto, R.: Fixing dockerfile smells: an empirical study. International Conference on Software Maintenance and Evolution (ICSME) (2022) 42. Sabuhi, M., Musilek, P., Bezemer, C.P.: Studying the performance risks of upgrading Docker Hub images: a case study of WordPress. In: International Conference on Performance Engineering, pp. 97–104. ACM (2022) 43. Sharma, T., Fragkoulis, M., Spinellis, D.: Does your configuration code smell? In: Working Conference on Mining Software Repositories (MSR), pp. 189–200 (2016). https://doi.org/10. 1145/2901739.2901761 44. Shu, R., Gu, X., Enck, W.: A study of security vulnerabilities on Docker Hub. In: International Conference on Data and Application Security and Privacy, pp. 269–280. ACM (2017). https:// doi.org/10.1145/3029806.3029832 45. Sotiropoulos, T., Mitropoulos, D., Spinellis, D.: Practical fault detection in Puppet programs. In: International Conference on Software Engineering (ICSE), pp. 26–37 (2020). https://doi. 
org/10.1145/3377811.3380384 46. Stack Overflow: 2022 stack overflow developer survey (2022). https://survey.stackoverflow.co/ 2022. Accessed 15 Apr 2023 47. Tsuru, T., Nakagawa, T., Matsumoto, S., Higo, Y., Kusumoto, S.: Type-2 code clone detection for Dockerfiles. In: International Workshop on Software Clones (IWSC). IEEE, Piscataway (2021)


48. Turnbull, J.: The Docker Book: Containerization is the New Virtualization. James Turnbull (2014) 49. van der Bent, E., Hage, J., Visser, J., Gousios, G.: How good is your Puppet? An empirically defined and validated quality model for Puppet. In: International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 164–174 (2018). https://doi.org/10. 1109/SANER.2018.8330206 50. Vermeer, B., Henry, W.: Shifting Docker security left (2019). https://snyk.io/blog/shiftingdocker-security-left/. Accessed 15 Apr 2023 51. Wu, Y., Zhang, Y., Wang, T., Wang, H.: Characterizing the occurrence of dockerfile smells in open-source software: an empirical study. IEEE Access 8, 34127–34139 (2020) 52. Xu, J., Wu, Y., Lu, Z., Wang, T.: Dockerfile TF smell detection based on dynamic and static analysis methods. In: Annual Computer Software and Applications Conference (COMPSAC), vol. 1, pp. 185–190. IEEE, Piscataway (2019). https://doi.org/10.1109/COMPSAC.2019.00033 53. Zerouali, A., Constantinou, E., Mens, T., Robles, G., González-Barahona, J.: An empirical analysis of technical lag in npm package dependencies. In: International Conference on Software Reuse (ICSR). Lecture Notes in Computer Science, vol. 10826, pp. 95–110. Springer, Berlin (2018). https://doi.org/10.1007/978-3-319-90421-4_6 54. Zerouali, A., Cosentino, V., Mens, T., Robles, G., Gonzalez-Barahona, J.M.: On the impact of outdated and vulnerable JavaScript packages in Docker images. In: International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 619–623. IEEE, Piscataway (2019) 55. Zerouali, A., Cosentino, V., Robles, G., Gonzalez-Barahona, J.M., Mens, T.: Conpan: a tool to analyze packages in software containers. In: International Conference on Mining Software Repositories (MSR), pp. 592–596. IEEE, Piscataway (2019) 56. Zerouali, A., Mens, T., De Roover, C.: On the usage of JavaScript, Python and Ruby packages in Docker Hub images. Sci. Comput. Program. 207, 102653 (2021) 57. Zerouali, A., Mens, T., Decan, A., Gonzalez-Barahona, J., Robles, G.: A multi-dimensional analysis of technical lag in Debian-based Docker images. Empirical Softw. Eng. 26(2), 1–45 (2021) 58. Zerouali, A., Mens, T., Gonzalez-Barahona, J., Decan, A., Constantinou, E., Robles, G.: A formal framework for measuring technical lag in component repositories—and its application to npm. J. Softw. Evol. Process 31(8) (2019). https://doi.org/10.1002/smr.2157 59. Zerouali, A., Mens, T., Robles, G., Gonzalez-Barahona, J.M.: On the relation between outdated docker containers, severity vulnerabilities, and bugs. In: International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 491–501. IEEE, Piscataway (2019). https://doi.org/10.1109/SANER.2019.8668013 60. Zhang, Y., Zhang, Y., Mao, X., Wu, Y., Lin, B., Wang, S.: Recommending base image for docker containers based on deep configuration comprehension. In: International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 449–453. IEEE, Piscataway (2022)

Part V Model-Centered Software Ecosystems

Chapter 10

Machine Learning for Managing Modeling Ecosystems: Techniques, Applications, and a Research Vision

Davide Di Ruscio, Phuong T. Nguyen, and Alfonso Pierantonio

Abstract Model-driven engineering (MDE) is a software discipline that promotes the adoption of models to support the specification, analysis, and development of complex systems. Available models, transformations, code generators, and a wide range of corresponding software tools constitute the definition of modeling ecosystems. Automatically understanding and analyzing the modeling artifacts available in the modeling ecosystem of interest are preparatory for simplifying their management, including their exploration and evolution. Machine learning (ML) has garnered attention from the scientific community due to its ability to obtain unprecedented results and performance compared with conventional approaches. ML algorithms have been conceived to simulate the learning activities of humans, being capable of autonomously extracting meaningful patterns from data. Deep learning (DL) algorithms go one step further to learn from highly complex data, exploiting deep neural networks with several layers. Over the last few years, the MDE community investigated the adoption of ML/DL techniques to support the management of modeling ecosystems, e.g., by conceiving different types of neural networks to facilitate the automated classification of model repositories or develop recommender systems. This chapter summarizes the recent applications of machine learning approaches to support modeling ecosystems. Furthermore, it identifies possible lines of research that can be followed to further explore to what extent the management of modeling ecosystems can be enhanced by adopting existing machine learning techniques. In this respect, our work envisions a roadmap for the deployment of various ML/DL techniques in the MDE domain.



10.1 Introduction

Model-driven engineering (MDE) [79] provides an effective means of synchronizing among stakeholders, thus playing a crucial role in the software development life cycle. Models are typically specified in terms of concepts formalized in metamodels, and they are manipulated via automated transformations to achieve superior automation by means of refactoring [36, 53, 80] or transformation [44]. The application of MDE enables developers to properly manage the complexity of large-scale software development, exploiting models to capture relevant knowledge of the considered problem domain.

As shown in Chaps. 1 and 5, software ecosystems are often seen as either platforms or collections of interrelated communication platforms supporting software development, e.g., GitHub, Stack Overflow, or Google Play, or as collections of interrelated software projects, e.g., Apache, Eclipse, or OW2. Essentially, they are "a collection of software products that have some given degree of symbiotic relationships" [54]. In MDE, modelers create and share reusable artifacts in public stores, such as GitHub or Eclipse. The interaction between modelers and the hosting platforms produces artifacts that evolve over the course of time; among others, models are transformed and evolve throughout their life cycle. Altogether, this makes modeling a vibrant ecosystem, which necessitates proper techniques to manage the flow and transition of data.

A modeling repository is a component-based software ecosystem that typically includes metamodels, models, and transformations with structural interdependencies (see Chap. 1). For example, many engineering domains have families of related modeling notations. The goal of producing new software components more efficiently by reducing the time to market relies on the ability to locate and reuse relevant artifacts in the repository. To this end, classification mechanisms are essential for clustering similar artifacts and making it easier for the designer to locate potential candidates on which to base the development of new components. Interestingly, the role of a repository is twofold: on the one hand, it is the subject of our classification and clustering techniques [7, 58, 59, 74]; on the other hand, the repository represents a useful dataset that can be used for creating complex neural networks. These networks can not only be used to classify new incoming artifacts but also to develop, for instance, recommender systems [18, 21]. These systems can use the collective knowledge of the repository population to provide the designer with useful suggestions on how to develop new artifacts or refactor existing ones.

In recent years, several model repositories have been populated by academia and industry [35, 40, 42, 43], allowing researchers and developers to share their experiences via the curated artifacts. While the research community is well aware of the benefit of such repositories, there are still difficulties in building effective, scalable, and valuable tools to deduce useful knowledge [90]. Essentially, leveraging the informative structure within a repository, with its rich source of insights and accumulated and distributed knowledge, necessitates the ability to locate and expose existing artifacts to reuse, analysis, and documentation, among others [20].


Nevertheless, unless repositories provide effective means to allow users to discover and retrieve the available artifacts, the potential benefits related to the availability of these resources might be significantly jeopardized. Thus, there is an urgent need to conceptualize techniques and tools to mine the existing platforms and provide developers with recommendations, helping them complete their pending tasks.

Machine learning (ML) has garnered attention from the scientific community due to its ability to obtain unprecedented results and performance [32]. Being conceived to simulate the learning activities of humans [67], ML algorithms are capable of autonomously extracting meaningful patterns from data without being hard coded [23, 91]. Deep learning (DL) is a branch of machine learning whose algorithms go one step further, learning from highly complex data by employing neural networks with several layers. Each layer in a deep neural network captures a specific set of features from the input data. Altogether, the ability to learn from complex data is considerably improved, enabling DL algorithms to obtain superior performance compared to conventional approaches [59]. Over the last few years, the MDE community has been investigating the adoption of ML/DL techniques to support the management of modeling ecosystems.

The purpose of this chapter is threefold. First, it reviews recent applications of such methods in model-driven engineering. Second, the chapter recalls the most notable machine learning techniques that have been widely used in various application domains. Third, we identify possible lines of research that can be followed to further explore to what extent the management of modeling ecosystems can be enhanced by adopting existing machine learning techniques. In this respect, our work envisions a roadmap for deploying various ML/DL techniques in the MDE domain. The main contributions of this chapter are summarized as follows:

• We review background related to machine learning techniques widely adopted in MDE.
• By conducting a literature review, we survey the existing applications of ML in MDE.
• We propose a roadmap covering various research issues related to the application of ML, which we expect to remain relevant in the forthcoming years.

Outline The chapter is organized as follows. Section 10.2 recalls some of the most notable machine learning techniques, including supervised learning, unsupervised learning, and reinforcement learning. Section 10.3 presents the methodology and results of a lightweight literature review conducted on major venues in software engineering with the aim of collecting relevant applications of ML techniques to support MDE tasks. Afterward, in Sect. 10.4, we review and analyze existing applications in MDE equipped with ML algorithms. Section 10.5 sketches a roadmap for addressing various machine learning issues in MDE, whereas Sect. 10.6 concludes the chapter.


10.2 Background in Machine Learning

This section presents background on machine learning algorithms that have been applied across different domains in recent years. We follow a typical taxonomy of machine learning techniques [3] with three main categories, i.e., supervised learning, unsupervised learning, and reinforcement learning, described in the following subsections.

10.2.1 Supervised Learning

Supervised learning techniques attempt to mimic human learning by generalizing from labeled data and deducing conclusions for new incoming data. The ability to learn from data gives neural networks a wide range of applications, e.g., pattern recognition [10] or forecasting [95]. This section presents four among the most popular supervised learning techniques, i.e., feed-forward neural networks (FFNNs), convolutional neural networks (CNNs), graph neural networks (GNNs), and long short-term memory neural networks (LSTMs), to explain how they extract patterns from the input data; it then recalls other widely used supervised techniques such as decision trees, random forests, graph kernels, linear regression, naïve Bayesian classifiers, and support vector machines.

Feed-Forward Neural Networks (FFNNs)
Figure 10.1 depicts a simple example of a neural network with one input layer, one output layer, and one hidden layer. For the sake of presentation, most of the weights, biases, and activation functions are removed from the figure, and only some weights are kept.

Fig. 10.1 A feed-forward neural network
The input layer consists of L neurons, each corresponding to an input feature, i.e., $I = (i_1, i_2, \ldots, i_L)$. The hidden layer consists of M perceptrons, i.e., $H = (h_1, h_2, \ldots, h_M)$. There are N perceptrons in the output layer, representing N output categories, i.e., $\hat{o} = (\hat{o}_1, \hat{o}_2, \ldots, \hat{o}_N)$. The number of neurons in a hidden layer, as well as the number of hidden layers, depends on various factors, including the classification purpose and the input data.

Convolutional Neural Networks (CNNs)
While FFNNs work well with sparse data, they are less effective, and even less efficient, on large datasets [24]; moreover, they are not suitable for learning from images. CNNs were originally conceived to overcome this limitation and extract patterns from images. Such neural networks use filters and convolutions to capture features from images. A CNN consists of three types of layers whose functionalities are explained as follows:

1. The convolution layer performs most of the computation within a CNN to extract important features of an input image.
2. The pooling layer downsamples a feature map by taking the maximum value within a window, usually a square one, to reduce the number of parameters [70]; it also extracts rotation- and position-invariant features from a feature map.
3. The fully connected layer works as a normal perceptron layer; each of its neurons is fully connected to the previous layer.

The ultimate aim of pooling, convolution, and fully connected perceptrons is to retain the most distinguishable features and discard useless ones while considerably reducing the number of parameters used to learn from data. Figure 10.2 shows an excerpt of a CNN. A tensor of size 96 × 96 × 3 represents an input image, where 3 corresponds to the three color channels, i.e., Red (R), Green (G), and Blue (B). Filters are matrices used to extract intrinsic characteristics from images. In this example, a 4D filter of size 5 × 5 × 3 × 32 is convolved with the input feature map to produce an output feature map of size 96 × 96 × 1 × 32. Eventually, a 2 × 2 max pooling is performed to halve the resulting feature map's width and height, yielding a new feature map of size 48 × 48 × 1 × 32. The two fully connected layers work exactly like those in FFNNs, and the learning process is conducted to find a function that best fits the input data to the output data.

Fig. 10.2 A simple convolutional neural network
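The layer arrangement of Fig. 10.2 can be sketched in a few lines of PyTorch. This is an illustrative sketch, not code from the chapter; the layer sizes follow the dimensions mentioned above, while the number of hidden units in the fully connected layers (128) and the number of output classes (10) are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Minimal CNN mirroring the shapes discussed for Fig. 10.2."""
    def __init__(self, num_classes: int = 10):  # num_classes is an assumption
        super().__init__()
        # 5x5 filters over 3 input channels producing 32 feature maps;
        # padding=2 keeps the 96x96 spatial size (96x96x3 -> 96x96x32).
        self.conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=5, padding=2)
        # 2x2 max pooling halves width and height (96x96x32 -> 48x48x32).
        self.pool = nn.MaxPool2d(kernel_size=2)
        # Two fully connected layers acting as a standard perceptron on top.
        self.fc1 = nn.Linear(32 * 48 * 48, 128)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.conv(x))
        x = self.pool(x)
        x = x.flatten(start_dim=1)     # keep the batch dimension
        x = torch.relu(self.fc1(x))
        return self.fc2(x)             # raw scores; softmax is applied in the loss

# A batch of four 96x96 RGB images produces four score vectors.
logits = SimpleCNN()(torch.randn(4, 3, 96, 96))
print(logits.shape)  # torch.Size([4, 10])
```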


Graph Neural Networks (GNNs)
A graph is often represented as an adjacency matrix A of size N × N, where N is the number of nodes. If each node is characterized by a set of M features, then the feature matrix X has size N × M. Graph-structured data is complex and thus poses many challenges for existing machine learning algorithms. Hence, graph neural networks (GNNs) have been designed to perform inference on data described by graphs [78]. By transforming the problems into simpler representations, or rendering them into representations from different perspectives, GNNs are able to solve complex problems. Such networks rely on an information diffusion mechanism in which each node of the graph is processed as a single unit according to the graph connectivity. During the learning phase, the GNN model updates the state of each unit to reach a unique stable equilibrium, i.e., a unique solution given the input. Working on top of graphs, GNNs provide a convenient method for link-level, node-level, and graph-level classification tasks, as well as graph visualization and graph clustering. In link-level prediction, the goal is to represent the relationships among nodes in a graph and predict whether there is a connection between two entities. In node-level classification, the task is to learn the embedding of every node in a graph and predict its label by looking at the labels of its neighbors. In graph-level classification, the entire graph needs to be classified into a suitable category.

Long Short-Term Memory Neural Networks (LSTMs)
Recurrent neural networks (RNNs) [2] cope with sequences of data by keeping information about past activities to predict future events. Long short-term memory neural networks (LSTMs) were later developed to learn long-term dependencies [37] by memorizing the input sequence of data. Figure 10.3 illustrates how an LSTM works. Given that $i_t = [h_{t-1}, x_t]$ is the concatenation of the hidden state vector $h_{t-1}$ from the previous time step and the current input vector $x_t$, two states are propagated to the next cell, i.e., the cell state $c_t$ and the hidden state $h_t$. The output of the previous unit, together with the current input, is fed as the input data of a cell. The sigmoid function is applied to keep useful information and remove the useless one. Softmax is used as the activation function of the output, and it computes probabilities that sum to 1 [70]. Given C classes, where $y_k$ is the output of the k-th neuron, the class that yields the maximum probability is the final classification, i.e., $\hat{y} = \arg\max_k p_k$, $k \in \{1, 2, \ldots, C\}$, where $p_k = \exp(y_k) / \sum_{j=1}^{C} \exp(y_j)$.

Fig. 10.3 An LSTM cell

Encoder-decoder architectures and the Transformer [87] are follow-ups of LSTMs, and they better capture the dependencies among the terms of a sentence. In this respect, they have been widely used in machine translation tasks.

Decision Tree (DT) and Random Forest (RF)
DT is a supervised learning technique where the classifier is modeled as a tree with two types of nodes: decision and leaf nodes. The former specifies some tests on a single attribute, while the latter corresponds to a specific class. The paths within a tree are mutually exclusive, and each data instance is identified by a single rule.


Given a test instance, the algorithm traverses the tree from top to bottom, following the attribute values, until it reaches a leaf node. Starting from a set of input data, it is important to find an optimized tree, small but accurate. Nevertheless, finding the best tree is a nondeterministic polynomial time (NP)-complete problem, and thus different heuristic algorithms have been proposed to populate decision trees, aiming to choose the best attributes to partition the data while preserving timing efficiency. Among others, functions such as entropy and impurity are used as an effective means of selecting a suitable set of attributes for classification trees.

Gradient Boosted Decision Trees (GBDT) [28] is an ensemble model of decision trees, each acting as a weak classifier. A classifier is deliberately altered, e.g., by restraining it from using all the available information or by simplifying its internal structure, so that the resulting ensemble can enhance the overall prediction capability [46]. This equips GBDT with the ability to learn from different parts of a training set, thus reducing overfitting. During the learning phase, GBDT builds a series of weak decision trees, each of which attempts to correct the errors produced by the previous one: observations that the current weak decision tree can already handle are left aside, while the remaining, difficult observations are used to develop new weak decision trees.

Random forest (RF) is an ensemble learning technique, i.e., it combines different classifiers to perform a classification task. In particular, RF is a combination of several decision trees (hence the name "forest"), and it employs bagging to reach a final decision. Depending on the input data, each tree in the forest may produce a different outcome, and the results obtained by all the trees are aggregated, with the most frequent outcome chosen as the final decision.
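As a concrete illustration of bagging and majority voting, the following sketch trains a small random forest with scikit-learn; the dataset, the forest size, and the tree depth are arbitrary assumptions made only for demonstration purposes.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy dataset used only for illustration.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 50 decision trees trained on bootstrap samples of the training data;
# the final class is the one voted for by most trees.
forest = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)
forest.fit(X_train, y_train)
print("accuracy:", forest.score(X_test, y_test))
```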


Graph Kernel (GK)
To compute the internal weights, a graph neural network can employ two different methods, i.e., vector-space embedding [71] and graph kernels [88]. The former involves the annotation of nodes and edges with vectors that describe a set of features. Thus, the prediction capabilities of the model strongly depend on encoding salient characteristics into numerical vectors. The main limitation is that the vectors must be of a fixed size, which negatively impacts the handling of more complex data. In contrast, the latter works mostly on graph structures by considering three main concepts, i.e., paths, subgraphs, and subtrees. A graph kernel is a symmetric positive semidefinite function k defined for a pair of graphs $G_i$ and $G_j$ such that the following equation is satisfied:

$$k(G_i, G_j) = \langle \phi(G_i), \phi(G_j) \rangle_{\mathcal{H}} \qquad (10.1)$$

where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is the inner product defined in a Hilbert space $\mathcal{H}$. Roughly speaking, a graph kernel computes the similarity between two graphs following one of the aforementioned strategies.

Linear Regression (LR)
Regression belongs to the supervised learning category, and it is used to predict a numerical value, or a continuous quantity, from a set of input features. A linear relation between the input and output is assumed, and it can be roughly modeled as a linear combination of the explanatory variables using the following formula:

$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n \qquad (10.2)$$

where $\hat{Y}$ is the dependent variable to be predicted, $b_1, b_2, \ldots, b_n$ are coefficients, and $X_1, X_2, \ldots, X_n$ are independent variables. As in neural networks, a loss function is used to quantify the difference between the real and predicted values: the smaller the error, the better the model. The final aim of the training process is to compute the weights that minimize the cost function.

Naïve Bayesian (NB)
Given a set of input instances represented as vectors, NB attempts to infer class labels from the features without paying attention to the relationships among them. NB algorithms work under the assumption that each constituent feature is independent of the others. In other words, even if there exists correlation among the features, the algorithm assumes that each feature contributes independently to the final classification, without being affected by the remaining features. Concretely, the probability that an event A happens given evidence B is modeled using the following formula:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \qquad (10.3)$$


where B is the evidence and A is the hypothesis. Bayes' theorem is used to find the probability of A occurring, given that B has already happened. The assumption made here is that the predictors/features are independent, that is, the occurrence of a specific feature does not influence that of the others.

Support Vector Machines (SVM)
SVM [83] is a supervised learning technique widely used in classification and regression tasks. SVM represents data instances as labeled points in an N-dimensional space. To extract patterns from data, it selects hyperplanes that best split the points among the categories. A hyperplane can be perceived as a border between two classes, and its dimension varies according to the number of input features. In this way, the learning phase is an optimization process trying to find the hyperplanes that maximize the distance between points in different classes. Whenever an unlabeled data instance is fed into an SVM, it is transformed into a data point; then the generated hyperplanes are used to assign a label to the instance.
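A minimal usage sketch with scikit-learn, assuming a synthetic two-class dataset; the kernel and regularization settings are arbitrary choices made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data in a 4-dimensional feature space.
X, y = make_classification(n_samples=200, n_features=4, n_classes=2, random_state=0)

# A linear kernel looks for the separating hyperplane with the largest margin.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# An unlabeled instance is mapped to a point and classified by the hyperplane.
print(clf.predict(X[:1]))
```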

10.2.2 Unsupervised Learning

In contrast to supervised learning techniques, unsupervised learning ones do not need labeled data to extract meaningful patterns. Such techniques work on the basis of a similarity function, grouping various instances according to their common features. In this category, clustering is the dominant technique applied in knowledge mining and information retrieval [52, 82], and it is "the process of organizing objects into groups of similar objects" [9]. A cluster is a collection of objects in which the similarity between any pair of objects in the cluster is higher than the similarity between any of these objects and an object belonging to a different cluster [38]. Clustering is seen as unsupervised learning since we know a priori neither the number of classes nor their attributes. Clustering techniques have been applied in a wide spectrum of areas, including biology, to classify plants and animals according to their properties, and geology, to classify observed earthquake epicenters and identify dangerous zones [92].

Over recent years, several clustering methods have been developed, such as hierarchical and partitional ones [38]. Partitional clustering algorithms use a criterion function to identify partitions, whereas hierarchical clustering algorithms try to group similar partitions. Examples of partitioning-based algorithms include K-Means, K-Medoids, CLARA, and CLARANS [56, 73]. Meanwhile, typical examples of hierarchical algorithms are BIRCH [97], CURE [33], ROCK [34], and Chameleon [41]. Existing clustering techniques share the property that they can be applied whenever it is possible to specify a proximity (or distance) measure that allows one to assess whether the elements to be clustered are mutually similar or dissimilar. The basic idea is that the similarity level of two elements is inversely proportional to their distance.


The definition of the proximity measure plays an important role in almost all clustering methods, and it depends on many factors, including the considered application domain, the available data, and the goals. Once the proximity measure has been defined, it is possible to produce a proximity matrix for the related objects.
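For illustration, the sketch below computes a proximity matrix and runs K-Means with scikit-learn; the synthetic data and the choice of three clusters are assumptions made only for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

# Synthetic 2D points grouped around three centers.
X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

# Proximity matrix: pairwise Euclidean distances between all objects.
proximity = pairwise_distances(X, metric="euclidean")
print(proximity.shape)  # (30, 30)

# Partitional clustering: K-Means assigns each object to the nearest centroid.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))  # cluster sizes
```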

10.2.3 Reinforcement Learning (RL)

RL [85] is a branch of ML that is neither supervised nor unsupervised learning. In this category of algorithms, intelligent actors interactively learn from the surrounding environment by means of feedback obtained from their experiences and the corresponding activities. To this end, rewards and penalties are used as an effective means of learning positive and negative behaviors. Participating actors attempt to find a suitable action model that maximizes the total cumulative reward while minimizing the penalties. RL has been widely applied in robotic systems to equip them with the ability to autonomously learn by interacting with the surrounding environment and other objects.
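A minimal sketch of tabular Q-learning on a toy one-dimensional corridor illustrates the reward-driven update loop described above; the environment, learning rate, and discount factor are all assumptions chosen for the example and are not part of the chapter.

```python
import random

N_STATES, GOAL = 5, 4          # corridor cells 0..4, reward when reaching cell 4
ACTIONS = [-1, +1]             # move left or right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

# Q-table: expected cumulative reward of taking an action in a state.
Q = [[0.0, 0.0] for _ in range(N_STATES)]

for _ in range(500):                       # training episodes
    state = 0
    while state != GOAL:
        # epsilon-greedy choice between exploration and exploitation
        a = random.randrange(2) if random.random() < EPSILON else Q[state].index(max(Q[state]))
        nxt = min(max(state + ACTIONS[a], 0), N_STATES - 1)
        reward = 1.0 if nxt == GOAL else 0.0
        # Q-learning update: move the estimate toward reward + discounted best future value
        Q[state][a] += ALPHA * (reward + GAMMA * max(Q[nxt]) - Q[state][a])
        state = nxt

print([q.index(max(q)) for q in Q])  # best action per state (1 = move right); the goal state is never updated
```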

10.3 Literature Review

This section presents a literature review conducted on major venues in software engineering to study the main applications of ML in MDE.

10.3.1 Methodology

This section details the methodology employed to collect and analyze relevant publications on machine learning (ML) techniques used in model-driven engineering. As shown in Fig. 10.4, the process consists of the following steps: (i) definition and execution of a query string to collect papers of interest based on a set of keywords; (ii) manual filtering of the retrieved papers with respect to defined inclusion and exclusion criteria; and (iii) manual labeling of the selected publications regarding the adopted ML techniques and the accomplished MDE tasks. In order to achieve a good trade-off between the coverage of proposed approaches on ML for MDE and timing efficiency, we defined the search and filter strategy by answering the following four W-questions [96]:

Fig. 10.4 Overview of the performed process: query string definition/refinement and execution on Scopus (221 papers), application of inclusion and exclusion criteria (34 papers), and manual labeling (13 tasks, 17 techniques)

• Which? Both automatic and manual operations were performed to collect relevant papers from different venues, including conferences and journals.
• Where? We defined a query string executed on the Scopus database.1 We fetched all the papers published in a given time period using the advanced search and export features.
• What? For each collected paper, its content was manually processed with the aim of keeping only papers presenting ML-based technologies for achieving some MDE goals.
• When? Since the application of ML techniques in the MDE field is a recent topic, we limited the search to the last ten years, i.e., from 2012 to 2022.

10.3.2 Query String After a series of test executions and refinements, the resulting search string is shown in Listing 10.1. It has been formed to cover all the papers contained in the title or in the abstract relevant keywords related to both MDE and ML/artificial intelligence (AI). The search string has been conceived by means of an iterative process by looking in the obtained results for the papers that are relevant for this study, e.g., papers recently published in the theme issue on “AI-enhanced modeldriven engineering” of the Springer Journal on Software and Systems Modeling [15].

1 https://www.scopus.com.


Listing 10.1 Fragment of the Scopus query string

    TITLE-ABS(
      ({model-driven engineering} OR {model-driven development} OR
       {model-based development} OR {model-based software engineering} OR
       "metamodel*" OR "modeling environment?")
      AND
      ({artificial intelligence} OR {machine learning} OR {deep learning} OR
       "bot" OR "*neural network*")
    ) [...] AND (LIMIT-TO (PUBYEAR,2022) OR [...] OR LIMIT-TO (PUBYEAR,2012))

The execution of the conceived query string on the Scopus platform2 produced 221 articles. Then, we performed different filtering steps to narrow down the search as discussed in the next section.

10.3.3 Inclusion and Exclusion Criteria

We decided the selection criteria of this study before the query definition phase to reduce the likelihood of bias. In the following, we describe the inclusion (I) and exclusion (E) criteria of our study:

I1 Studies presenting novel AI/ML approaches for supporting model management operations in the MDE field
I2 Studies in which modeling artifacts are the primary elements managed by the proposed approach
I3 Studies subject to peer review (e.g., papers published as part of conference proceedings or journal publications are considered, whereas white papers are discarded)
I4 Studies available in full-text
E1 Studies presenting the application of MDE to manage some problems in the AI/ML domain
E2 Secondary studies (e.g., systematic literature reviews, surveys)

Applying these inclusion and exclusion criteria to the 221 papers produced by the query execution resulted in 34 documents. It is important to remark that the coauthors performed this phase manually. Many potentially relevant studies were excluded because of E1: in fact, ≈70 papers present model-based solutions to support the development of ML systems (e.g., the design of neural networks or complex data pipelines). More than 80 papers were excluded because the terms metamodel and model used therein are adopted with their mathematical meaning and not as MDE-related concepts.

2 https://www.scopus.com/.


10.3.4 Manual Labeling

As previously mentioned, in this study, we want to take a snapshot of the recent approaches based on ML techniques that have been conceived to address relevant problems in the MDE field. To this end, we manually labeled the obtained 34 papers along two dimensions, i.e., Task and Technique. In particular, each paper has been analyzed by reading the title, abstract, and introduction to classify it depending on the addressed MDE problem and the employed ML techniques. The other sections of the papers under analysis have also been considered when needed. To mitigate possible biases, the labeling operation has been performed by one of the authors and subsequently checked with all the coauthors. Overall, we identified 13 tasks and 17 techniques, as summarized in the next section.

10.3.5 Results

As previously described, the process shown in Fig. 10.4 resulted in a corpus of 34 relevant papers in the considered period. Journal, conference, and book chapter publications have been produced throughout this period, even though the topic has become more popular over the last four years. Although we included 2012 in our period of interest, none of the collected articles was published that year. As expected, journal papers were published later in the period, as shown in Fig. 10.5. For instance, all six articles published in 2022 appeared in journal venues.

As shown in Fig. 10.6, the analyzed papers focus on different MDE tasks, ranging from model assistance to model refactoring activities. In particular, most of the analyzed papers propose algorithms and tools underpinning recommender systems to assist modeling activities, including the definition of metamodels. Figure 10.7 reports the different ML techniques employed by the analyzed papers to address the considered MDE tasks. According to the results, clustering, LSTMs, and natural language processing techniques are among the most used approaches to support MDE tasks. Besides long short-term memory neural networks, convolutional neural networks and feed-forward neural networks are among the most used neural networks. Interestingly, logic-based techniques such as answer set programming and integer linear programming are also employed. Table 10.1 gives an overview of the obtained results by showing the different ML techniques that each of the analyzed papers has employed.


Fig. 10.5 Publications on AI/ML techniques for MDE during the period 2012–2022 distinguished by venue type

10.4 Existing Machine Learning Applications in MDE

In this section, we review the most notable applications of machine learning techniques in MDE, according to the results of the analysis overviewed in Sect. 10.3.

10.4.1 Model Assistants

The prevalence of MDE has promoted the need for automatic modeling assistants to support modelers during their daily activities [16, 29, 55]. Among others, it is of paramount importance to help modelers increase their efficiency and effectiveness by providing them with suitable components while working on new (meta)models. Existing tools, like those based on Eclipse EMF,3 normally provide only canonical functionalities, e.g., drag-and-drop and the specification of graphical components, and they do not support context-related recommendations.

3 https://www.eclipse.org/modeling/emf/.



Fig. 10.6 MDE tasks distribution over the analyzed papers

Several modeling assistants have been conceptualized, e.g., to suggest model elements that might be useful for modelers to complete their tasks, as described below.

While specifying a metamodel, modelers can avoid mistakes if they are provided with useful recommendations such as metaclasses and structural features. Di Rocco et al. [18] developed MemoRec, an approach that employs a collaborative filtering technique to recommend valuable entities related to the metamodel under construction. The system recommends both metaclasses and structural features that should be added to the metamodel under definition. An empirical evaluation of the conceived tool shows that MemoRec is able to suggest relevant items given a partial metamodel, supporting modelers in their task. The same authors presented MORGAN [21], a recommender system built on top of kernel similarity to assist modelers in specifying metamodels and models. MORGAN has been evaluated using two real-world datasets composed of models and metamodels. The experimental results showed that the tool can provide relevant recommendations related to model classes and class members, as well as relevant metaclasses and structural features.
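To give an idea of how collaborative filtering can drive such recommendations, the sketch below scores candidate metaclasses for a partial metamodel based on their co-occurrence with already-present metaclasses across a repository; the tiny co-occurrence data and the scoring scheme are illustrative assumptions and do not reproduce the actual MemoRec algorithm.

```python
from collections import Counter

# Illustrative repository: each entry lists the metaclasses of one metamodel.
repository = [
    {"StateMachine", "State", "Transition"},
    {"StateMachine", "State", "Transition", "Guard"},
    {"Package", "Class", "Attribute"},
]

def recommend(partial: set, k: int = 3) -> list:
    """Rank metaclasses that co-occur most often with the partial metamodel."""
    scores = Counter()
    for metamodel in repository:
        overlap = len(partial & metamodel)       # similarity to the current context
        for item in metamodel - partial:
            scores[item] += overlap              # weight candidates by that similarity
    return [item for item, _ in scores.most_common(k)]

print(recommend({"StateMachine", "State"}))  # ['Transition', 'Guard']
```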


Fig. 10.7 ML techniques distribution over the analyzed papers

Similarly, Weyssow et al. [89] proposed an approach based on LSTMs to recommend metamodel concepts during the specification of domain models.

Stephan [84] proposed an approach based on model clones to assist modelers by providing them with existing models that are similar to the one under definition. The approach was developed and evaluated for MATLAB Simulink models. Saini et al. [75, 77] addressed the problem of model assistance for teaching purposes. In particular, they proposed an automated approach to extract domain models from textual descriptions of the problem at hand. Thus, by exploiting natural language processing (NLP) techniques, the proposed system generates recommendations for teaching modeling literacy to novice modelers. With a similar goal, Abid et al. [1] presented MAGNET, an LSTM-based recommender system to aid beginner and intermediate users of AutoFOCUS3 in learning the tool. Batot et al. [8] proposed a multi-objective genetic programming technique to support learning model well-formedness rules from model examples and counterexamples. They introduced the social semantic diversity (SSD) concept and integrated it into a genetic programming version of the NSGA-II algorithm.


Table 10.1 An overview of ML techniques and their applications in MDE. Studies are grouped by application (Assistant, Classification, Refactoring, Repair, Requirements, Search, Synthesis, MT Dev., Others): Saini et al. [75], Batot et al. [8], Stephan [84], Boubekeur et al. [11], Weyssow et al. [89], Saini et al. [76], Abid et al. [1], Saini et al. [77], Di Rocco et al. [21], Di Rocco et al. [18], Burattin et al. [13], Clarisó et al. [17], Nguyen et al. [57], Rubei et al. [74], Nguyen et al. [59], Nguyen et al. [58], Basciani et al. [7], Sidhu et al. [80], Pinna Puissant et al. [66], Iovino et al. [50], Padget et al. [64], Lano et al. [45], Rigou et al. [72], López et al. [49], Eisenberg et al. [25], Ferdjoukh et al. [27], Bao et al. [5], Burgueño et al. [14], Lano et al. [44], Boubekeur et al. [12], Tang et al. [86], Parra-Ullauri et al. [65], Babur [4], and Rasiman et al. [69]. For each study, the columns mark the employed techniques: DT: Decision Tree, RF: Random Forest, LR: Linear Regression, SVM: Support Vector Machines, CL: Clustering, NB: Naïve Bayesian, FFNN: Feed-forward Neural Networks, LSTM: Long Short-Term Memory Neural Networks, CNN: Convolutional Neural Networks, GNN: Graph Neural Networks, RL: Reinforcement Learning, and Others.

According to the performed experiments, injecting SSD into the learning process improves the convergence and the quality of the learned artifacts.

Saini et al. [76] extended their DoMoBOT architecture to develop a bot-based system that can interact with modelers while they work on the models at hand. During the interactions with modelers, the proposed bot exploits clustering and LSTM techniques to find alternative configurations and then presents them as suggestions to modelers. The bot automatically updates the domain model under definition when modelers accept the given suggestions.


Interactions with modeling assistants were also the subject of the work by Boubekeur et al. [11]. This study examines the interactions between a student modeler and an interactive domain modeling assistant to better understand the necessary interactions. With the help of three examples, the authors formalized the intended interactions into a metamodel. Finally, they described how to create a corpus of learning materials, based on the metamodel, to support the assistant interactions.

10.4.2 Model Classification

Model repositories contain unstructured resources that allow for storing experience (in terms of developed artifacts) and making it available for public use. While the importance of such repositories is well perceived, the difficulties in building effective, scalable, and useful tools are often neglected [90]. The classification of metamodels into independent categories fosters personalized searches by boosting the visibility of metamodels. However, the manual classification of metamodels is both an arduous and error-prone task. In fact, misclassification is the norm, resulting in a reduction of the reachability as well as the reusability of metamodels. Handling such complexity requires suitable tooling to turn raw data into practical knowledge that can help modelers with their daily tasks. To this end, clustering and classification approaches have already been proposed to categorize metamodels automatically [7, 57].

Basciani et al. [7] introduced an approach to grouping similar metamodels by exploiting unsupervised learning. However, as the performance of clustering techniques depends on the associated similarity measures, two main issues affect the overall effectiveness: (i) timing performance and (ii) the identification of the appropriate number of clusters [57]. To tackle these issues, AURORA [57, 58] is among the first approaches that employ neural networks to classify metamodels. In the proposed approach, metamodels are first transformed into vectors of terms, and then a feed-forward neural network is used to classify the input data. Afterward, memoCNN [59] goes one step further to improve the prediction performance by exploiting deep learning. The tool first renders metamodels in an image-like format and applies convolutional neural networks to learn and classify unknown input data. Both tools achieve a good prediction performance, demonstrating the feasibility of different types of neural networks in the MDE domain.

Rubei et al. [74] made use of Apache Lucene, a general-purpose indexing and search engine, to realize a tool for the classification and clustering of metamodel repositories. The authors demonstrated that the proposed tool can provide accurate results that conventional approaches, such as hierarchical clustering or neural networks, cannot obtain. Clarisó et al. [17] propose the adoption of graph kernels for clustering modeling artifacts. The adoption of graph kernels is preferred and considered more suitable than vector-space embedding techniques.


In the former, the similarity of modeling artifacts is calculated as a distance metric between each pair of graphs; the latter is based on encoding each modeling artifact into a vector of features of fixed size.

Burattin et al. [13] propose a machine learning approach to automatically classify modelers with respect to their expertise during modeling sessions. The approach is based on a feed-forward neural network with one hidden layer consisting of 50 neurons. The input layer consists of ten neurons, whereas the output layer contains two neurons, whose values distinguish between the two classes of modelers, i.e., novice or expert.
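An illustrative sketch of a network with this shape (ten inputs, one hidden layer of 50 neurons, two outputs) in PyTorch; it only mirrors the reported architecture and is not the classifier of Burattin et al. [13].

```python
import torch
import torch.nn as nn

# Ten session-derived input features -> 50 hidden neurons -> 2 classes (novice/expert).
classifier = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Linear(50, 2),
)

features = torch.randn(1, 10)          # one modeling session, ten features
scores = classifier(features)          # raw class scores
print(scores.softmax(dim=1))           # probabilities for novice vs. expert
```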

10.4.3 Model Refactoring

Besides the modeling assistants focused on the recommendation of artifacts, many approaches use formal constraints to drive model refactorings. For example, Janota et al. [39] proposed a system to guide the model specification phase using an interactive model derivation technique. Given an input model, users query the underlying algorithm for possible editing operations. Such transformations adhere to a well-defined set of both syntactic and semantic constraints. The proposed strategy was evaluated on feature models by developing an Eclipse plug-in. Sidhu et al. [80] employed deep neural networks to detect the presence of functional decomposition in unified modeling language (UML) models of object-oriented software. Moreover, data science methods were utilized to gain insight into multidimensional software design features, and the gained experience was used to generalize subtle relationships among architectural components.

A formal approach based on a set of laws of object-oriented programming has been proposed to support model refactorings [53]. Given an Alloy partial model, the system employs the mentioned rules to apply semantics-preserving primitive transformations that guarantee that the target model behaves the same as the input one. Such operations are recorded in a well-structured sequence, called a strategy, that is used to check the conformance of the previous rules. The SmartEMF plug-in [36] offers model checking and editing by relying on the Prolog language. Starting from the initial specification, the tool traverses the Ecore model and maps each element to a corresponding set of Prolog rules used to validate model constraints. Afterward, SmartEMF infers a set of candidate editings and checks their consistency based on preconditions. The valid operations are eventually presented to the user in a reflective editor embedded in the Eclipse IDE. All the presented approaches drive model refactoring using formal rules that have to be consistent to avoid constraint violations.


10.4.4 Model Repair

Detecting and resolving inconsistencies is considered an important phase of the development process in MDE [51]. Nevertheless, there might be many possible updates that resolve a given set of inconsistencies. As a result, choosing suitable repair actions turns out to be a daunting task. Barriga et al. [6] identified various issues and challenges in AI-powered model repair by reviewing several approaches. Moreover, the authors also discussed how the current approaches and opportunities relate to the identified challenges. Pinna Puissant et al. [66] developed Badger, an approach to resolving inconsistencies by generating one or more resolution plans. Badger is a regression planner, and it can be used to resolve different types of structural inconsistencies in UML models using both generated and reverse-engineered models of varying sizes. Iovino et al. [50] went further into the problem of model repair by introducing quality assessment support to identify the best actions to be applied to repair a given broken model. The proposed approach uses reinforcement learning to find the best sequence of repair steps. It also automatically evaluates domain models based on a given quality model for personalized and automatic model repair.

10.4.5 Model Requirements

Machine learning techniques are also employed to support the formalization of software requirements. For instance, Lano et al. [44] proposed an approach based on natural language processing and decision trees to recognize the required data and behavior elements of systems from textual and graphical documents. From such input elements, the corresponding formal specification models are automatically created. The resulting models are then used as a basis for subsequent software development activities. Padget et al. [64] proposed the adoption of answer set programming for the formal specification of system requirements, which are thus amenable to evolution and synchronization with the system architecture, while being able to deal with uncertainty and to support multi-objective decision-making and self-explanation.

10.4.6 Model Search

The proliferation of MDE in recent years has culminated in numerous open-source model and metamodel repositories on public code sharing systems, such as GitHub. These platforms host well-defined projects, paving the way for reusability, that is, they provide developers with an effective means of developing new software by leveraging existing components [19].


However, metamodel repositories usually contain unstructured resources that cannot be easily unearthed without proper tools. Public models are typically stored in various locations, including model repositories, regular source code repositories, and web pages. To profit from them, developers need effective search mechanisms to locate the models relevant to their tasks.

Rigou et al. [72] outlined a deep learning approach to discover UML class diagrams starting from a textual specification of the functionalities that the desired system should provide. Eisenberg et al. [25] formulated the problem of finding the best models that match user requirements as an optimization task. The proposed framework, Marrying Optimization and Model Transformations (MOMoT), is built atop the Eclipse Modeling Framework (EMF) and exploits Henshin as a model transformation tool. Different evolutionary algorithms are provided to perform the search process.

López et al. [49] presented MAR, a search engine for models. It uses a query-by-example approach: models are encoded and indexed using the notion of bag of paths. MAR is built on top of HBase, and it has been evaluated on different benchmarks, showing that the engine is efficient and has fast response times in most cases. More than 50,000 models of different kinds, including Ecore metamodels, business process model and notation (BPMN) diagrams, and UML models, have been indexed.

10.4.7 Model Synthesis

In MDE, modeling artifacts are typically taken as the input of model management operations, which can have different goals, including generating target artifacts, like source code or different views of the considered system. The development of model management operations can be a difficult task that may require extensive sets of input models to test the operation under development. In this respect, Ferdjoukh et al. [27] proposed an approach based on constraint satisfaction problems (CSP) to automatically synthesize models conforming to a metamodel given as input. Generating large datasets of relevant models can be handy to test model transformations or to properly design new metamodels. Thus, the authors defined an original constraint model of the problem of generating models conforming to a metamodel, also considering additional OCL (Object Constraint Language) constraints. Bao et al. [5] use deep neural networks to automatically generate SysML models by mining input requirements specified in restricted natural language. The approach has been conceived mainly to support the development of safety-critical cyber-physical systems.


10.4.8 Model Transformation Development

Model transformation concerns the activities related to the generation of target artifacts (including source code) from models, or the migration, merging, and analysis of models, to name a few. Nevertheless, the manual development of transformations requires domain knowledge, programming skills, time, and resources. For example, brute-force search for matching metamodels is computationally infeasible even for average-sized metamodels [44]. In this respect, various approaches have been proposed to assist modelers in automatically performing transformations. Lano et al. [44] conceived an approach combining natural language processing, machine learning, and inductive logic programming techniques. Moreover, the authors also used search-based software engineering (SBSE) to find matching metamodels.

In recent work [14], long short-term memory neural networks (LSTMs) have been used to generate model transformations autonomously, given pairs of input and output models. Once the system has been trained, it can automatically convert an input model into the corresponding output without writing any transformation-specific code. The proposed approach has been evaluated using real-world datasets and shown to obtain a promising prediction performance.

10.4.9 Others

Beyond the main MDE tasks supported by the previously presented techniques, the MDE community is investigating the application of ML techniques for many other purposes, including teaching activities and automated model assessment. For instance, the AIDOaRT research project [26] aims at applying MDE techniques and tools to provide a framework offering proper AI-enhanced methods and related tooling for building trustable cyber-physical systems. The framework is intended to work within DevOps practices, combining software development and information technology (IT) operations. To this end, the project leverages AI for IT operations (AIOps) to automate the decision-making process and complete system development tasks.

Boubekeur et al. [12] investigated the adoption of a tool combining heuristics with machine learning techniques to help assess student submissions in model-driven engineering courses. The proposed method is applied first to identify submissions of high quality and second to predict approximate letter grades. According to the performed evaluation, the obtained results are promising and comparable to human grading for the first goal and surprisingly accurate for the second one.

Tang et al. [86] investigated the combined adoption of deep neural networks and syntax rules to generate source code from descriptive text.


According to the presented evaluation, there is a trade-off between the speed and the accuracy of the approach, which still outperforms existing methods for the code generation task.

Reinforcement learning is also among the ML techniques employed in MDE, where autonomous systems learn through trial and error how to find good solutions to a problem. In such a setting, Parra-Ullauri et al. [65] propose ETeMoX (Event-driven Temporal Models for eXplanations), a configurable architecture based on temporal models to keep track of the decision-making processes of RL systems.

Babur [4] developed an approach to comparing and analyzing large model datasets. The author proposes combining information retrieval, natural language processing, and machine learning in an open framework providing users with such analysis functionalities. The approach was evaluated on public datasets involving domain analysis, repository management, and model searching scenarios. Rasiman et al. [69] deal with the problem of requirements traceability, which is typically addressed manually even though this is error-prone, time-consuming, and challenging to maintain. The authors designed an automated tool based on machine learning classifiers to recover traces between JIRA issues (user stories and bugs) and revisions in a model-driven development context.

10.5 A Roadmap for the Deployment of ML in MDE

The rise of ML algorithms in recent years paves the way for applications in different domains, including MDE. Nevertheless, while MDE systems have been empowered with ML algorithms, there is still room for improvement. On the one hand, as ML algorithms evolve very quickly [3], there is the need to adopt newly conceived ML techniques in MDE, with the ultimate aim of bringing the latest developments into the domain. On the other hand, adopting such techniques also poses various issues related to the robustness and soundness of the hosting MDE systems. Altogether, this necessitates further work to increase the synergy between machine learning and model-driven engineering. This section presents a research vision for MDE by forecasting future applications of ML as well as possible pitfalls. Among others, we anticipate that the following research topics need to be investigated in the near future: (i) data privacy management, (ii) detecting technical debt, (iii) adversarial machine learning, and (iv) mining time series data, as we explain below.

10.5.1 Data Privacy Management

Conventional ML algorithms are usually trained in a centralized fashion, that is, all the data is uploaded to a single platform, and the training process takes place there.


However, this raises concerns over data privacy and may discourage individuals and companies from contributing their private data to a shared machine learning model. Recently, the federated learning (FL) paradigm [93] has been proposed as an effective means of building a central machine learning model in a distributed manner. In particular, FL algorithms allow multiple platforms to take part in the training process without burdening a centralized server with raw data. More importantly, this allows all the participating clients to avoid disclosing data to each other. At each training cycle, a subset of clients runs the learning locally using its snapshot of user data and transmits the resulting trained weights to the centralized server, which in turn combines the weights to populate a new global model [47]. Such a model is then distributed to another subset of clients, and the training process continues in the same manner until specific convergence requirements are met. Due to its nature, FL helps to solve critical issues related to data privacy and access rights to private data. In this way, FL has gained increasing attention lately, and it finds applications in various domains [48].

In MDE, the growing need for data to train ML models certainly triggers privacy concerns. Still, according to the conducted literature review (see Sect. 10.3), we have not found any work dealing with privacy and security for ML-based systems in MDE. In this sense, we believe that FL can come in handy, equipping systems with the ability to obtain data from various sources while still preserving privacy. Furthermore, referring back to the various applications presented in Sect. 10.4, we can see that FL can be applied to support the training involved in at least model refactoring, model search, model classification, and model transformation development.
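The server-side weight combination described above can be illustrated with a minimal federated-averaging sketch; the client weights, dataset sizes, and the plain weighted average are assumptions made for illustration and do not correspond to any specific FL framework.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Combine locally trained weight vectors into a new global model (FedAvg-style)."""
    total = sum(client_sizes)
    # Each client's contribution is weighted by the amount of local data it holds.
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Three clients trained locally and only share their resulting weight vectors.
weights = [np.array([0.10, 0.50]), np.array([0.20, 0.40]), np.array([0.30, 0.30])]
sizes = [100, 50, 50]            # local dataset sizes (raw data never leaves the clients)

global_model = federated_average(weights, sizes)
print(global_model)              # [0.175 0.425]
```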

10.5.2 Detecting Technical Debt

While aiming for a trade-off between short-term efficiency and long-term stability, software teams often resort to sub-optimal solutions that do not adhere to best practices during the development process. This may induce technical debt (TD), thereby triggering maintenance issues [68, 94]. Discovering TD hidden in source code is crucial in the software development cycle since it enables programmers to locate potentially erroneous snippets, thus allowing for suitable interventions and improving code quality. While detecting TD has been well studied in other branches of software engineering [81], according to our investigation, the issue of technical debt and its impact on MDE has remained largely unexplored. Although a few studies [30, 31] have started to discuss the issue, most of them consider only theoretical aspects, e.g., why and how TD is formed, rather than proposing a workable solution for the detection and resolution of technical debt. However, in recent work [22], we successfully employed neural networks to detect self-admitted technical debt from textual comments. Thus, we believe that various neural networks can be further used to detect and classify technical debt in MDE. Among others, CNNs, GNNs, and NLP techniques are considered suitable for this purpose.
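As a rough illustration of comment-based self-admitted technical debt detection, the sketch below trains a small text classifier with scikit-learn; the toy comments, labels, and the choice of a simple multi-layer perceptron are assumptions made for the example and do not reproduce the neural model used in [22].

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Tiny illustrative set of code comments labeled as SATD (1) or not (0).
comments = [
    "TODO: refactor this hack before release",
    "FIXME temporary workaround, remove later",
    "compute the average of the input values",
    "returns the name of the current user",
]
labels = [1, 1, 0, 0]

# Comments are turned into TF-IDF vectors and fed to a small neural classifier.
model = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
)
model.fit(comments, labels)

print(model.predict(["quick and dirty fix, needs cleanup"]))
```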


10.5.3 Adversarial Machine Learning

Recommender systems in software engineering provide a wide range of valuable items, facilitating reusability, efficiency, and effectiveness [19]. However, while most of the effort has been devoted to improving prediction accuracy, less attention has been paid to securing the entire system and protecting it against malicious training data. According to recent studies [61, 62], many recommender systems in software engineering are not immune to malicious attempts. By falsifying the learning material, e.g., data from large open-source software repositories, adversarial users may succeed in injecting malicious data, attempting to compromise the targeted systems. Empirical investigations of adversarial machine learning techniques and their possible influence on recommender systems demonstrated that various state-of-the-art recommender systems are indeed prone to malicious intent. This result triggers the need for effective countermeasures to protect recommender systems against hostile attacks disguised in training data.

Interestingly, our observation reveals that the issue of adversarial attacks on recommender systems in MDE has not been studied yet. As shown in Sect. 10.4, most of the existing studies employ conventional ML algorithms to address a specific issue, e.g., model transformation, model refactoring, or model repair, but do not deal with ill-intentioned, crafted data. Furthermore, it is noteworthy that none of the reviewed papers discusses potential threats and how an ML model in MDE can be shielded from such threats. Thus, we anticipate two major research topics: (i) investigating the possibilities of poisoning ML-based systems in MDE with artificially manipulated data and (ii) finding adequate countermeasures.
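A toy data-poisoning experiment can make the threat concrete: a fraction of training labels is flipped and the drop in accuracy is measured. The dataset, the attacked classifier, and the 30% flipping rate are arbitrary assumptions made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def accuracy(train_labels):
    clf = LogisticRegression(max_iter=1000).fit(X_tr, train_labels)
    return clf.score(X_te, y_te)

# Poisoning: flip the labels of 30% of the training instances.
rng = np.random.default_rng(0)
poisoned = y_tr.copy()
idx = rng.choice(len(poisoned), size=int(0.3 * len(poisoned)), replace=False)
poisoned[idx] = 1 - poisoned[idx]

print("clean training accuracy:   ", accuracy(y_tr))
print("poisoned training accuracy:", accuracy(poisoned))
```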

10.5.4 Mining Time Series Data

In open-source software repositories, there exist time series artifacts resulting from the interaction between developers and hosting platforms, e.g., the evolution of a software project on GitHub over time. Once properly mined, such data can provide developers and modelers with useful recommendations. We assume that deploying machine learning techniques dedicated to sequential data, such as long short-term memory neural networks (LSTMs), allows us to mine the existing data and provide support to developers. In our recent work, we conceptualized recommender systems that provide upgrade plans for third-party libraries [60] and API function calls [63]. The migration history of projects is mined to build matrices and train deep neural networks, whose weights and biases are used to forecast the subsequent versions of the related libraries. This showcases the potential of ML algorithms in mining time series data in OSS ecosystems such as GitHub.
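A minimal next-step prediction sketch with an LSTM in PyTorch illustrates the kind of sequence learning discussed here; the vocabulary of library versions, the embedding size, and the training loop are assumptions made for the example and are unrelated to the systems described in [60, 63].

```python
import torch
import torch.nn as nn

VOCAB = 20  # e.g., identifiers of library versions observed in migration histories

class NextStepLSTM(nn.Module):
    """Predict the next element of a sequence (e.g., the next library version)."""
    def __init__(self, vocab: int = VOCAB, dim: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(self.embed(seq))
        return self.head(out[:, -1])       # score of every candidate next element

model = NextStepLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
history = torch.tensor([[1, 2, 3, 4]])     # a toy upgrade history
target = torch.tensor([5])                 # the version that actually followed

for _ in range(100):                       # overfit the single toy sequence
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(history), target)
    loss.backward()
    optimizer.step()

print(model(history).argmax(dim=1))        # tensor([5]) once trained
```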


Similarly, in MDE, various types of interactions end up producing time series data. Among others, models evolve during their life cycle, undergoing repeated transformation and refinement over time. In this respect, ML techniques specialized in dealing with sequential data are of great use, that is, they learn from time series data to perform predictions for unknown sequences of events. Some initial attempts have already been made following this paradigm, achieving a promising performance. For instance, Burgueño et al. [14] proposed an automatic approach to model transformations built on top of an LSTM neural network. Once trained, the system can automatically transform an input model into the corresponding output without needing any transformation-specific code. Furthermore, we anticipate that the application of cutting-edge neural architectures, such as encoder-decoder models or the Transformer [87], can help tackle various issues, including model transformation and model evolution, further boosting the prediction capability.

10.6 Conclusion

In this chapter, we introduced fundamental concepts, notable machine learning techniques, and their existing applications in MDE. Through a literature review of existing studies, we summarized the recent applications of machine learning approaches to support modeling ecosystems. More importantly, we identified possible lines of research that can be followed to further explore to what extent the management of modeling ecosystems can be enhanced by adopting existing machine learning techniques. Finally, the chapter proposed a vision for developing ML-based systems and future research issues in the MDE domain. We believe that the MDE community should pay attention to these issues in the near future.

References

1. Abid, S., Mahajan, V., Lucio, L.: Towards machine learning for learnability of MDD tools. In: International Conference on Software Engineering and Knowledge Engineering (SEKE), pp. 355–360 (2019). https://doi.org/10.18293/SEKE2019-050
2. Alemany, S., Beltran, J., Pérez, A., Ganzfried, S.: Predicting hurricane trajectories using a recurrent neural network. In: The Thirty-Third Conference on Artificial Intelligence (AAAI), The Ninth Symposium on Educational Advances in Artificial Intelligence (EAAI), pp. 468–475. AAAI Press (2019). https://doi.org/10.1609/aaai.v33i01.3301468
3. Alzubaidi, L., Zhang, J., Humaidi, A.J., Al-Dujaili, A.Q., Duan, Y., Al-Shamma, O., Santamaría, J., Fadhel, M.A., Al-Amidie, M., Farhan, L.: Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J. Big Data 8(1) (2021). https://doi.org/10.1186/s40537-021-00444-8
4. Babur, Ö.: Statistical analysis of large sets of models. In: International Conference on Automated Software Engineering (ASE), pp. 888–891. ACM, New York (2016). https://doi.org/10.1145/2970276.2975938
5. Bao, Y., Yang, Z., Yang, Y., Xie, J., Zhou, Y., Yue, T., Huang, Z., Guo, P.: An automated approach to generate SysML models from restricted natural language requirements (in Chinese). Jisuanji Yanjiu yu Fazhan/Comput. Res. Dev. 58(4), 706–730 (2021). https://doi.org/10.7544/issn1000-1239.2021.20200757
6. Barriga, A., Rutle, A., Heldal, R.: AI-powered model repair: an experience report—lessons learned, challenges, and opportunities. Softw. Syst. Model. 21(3), 1135–1157 (2022). https://doi.org/10.1007/s10270-022-00983-5
7. Basciani, F., Di Rocco, J., Di Ruscio, D., Iovino, L., Pierantonio, A.: Automated clustering of metamodel repositories. In: Advanced Information Systems Engineering, vol. 9694, pp. 342–358. Springer, Berlin (2016). https://doi.org/10.1007/978-3-319-39696-5_21
8. Batot, E.R., Sahraoui, H.: Promoting social diversity for the automated learning of complex MDE artifacts. Softw. Syst. Model. 21(3), 1159–1178 (2022). https://doi.org/10.1007/s10270-021-00969-9
9. Berkhin, P.: A survey of clustering data mining techniques. In: J. Kogan, C. Nicholas, M. Teboulle (eds.) Grouping Multidimensional Data: Recent Advances in Clustering, pp. 25–71. Springer, Berlin (2006). https://doi.org/10.1007/3-540-28349-8_2
10. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford (1995)
11. Boubekeur, Y., Mussbacher, G.: Towards a better understanding of interactions with a domain modeling assistant. In: International Conference on Model Driven Engineering Languages and Systems (MODELS), pp. 94–103. ACM, New York (2020). https://doi.org/10.1145/3417990.3418742
12. Boubekeur, Y., Mussbacher, G., McIntosh, S.: Automatic assessment of students' software models using a simple heuristic and machine learning. In: International Conference on Model Driven Engineering Languages and Systems (MoDELS), pp. 84–93. ACM, New York (2020). https://doi.org/10.1145/3417990.3418741
13. Burattin, A., Soffer, P., Fahland, D., Mendling, J., Reijers, H., Vanderfeesten, I., Weidlich, M., Weber, B.: Who is behind the model? Classifying modelers based on pragmatic model features. In: International Conference on Business Process Management. Lecture Notes in Computer Science, vol. 11080, pp. 322–338. Springer, Berlin (2018). https://doi.org/10.1007/978-3-319-98648-7_19
14. Burgueño, L., Cabot, J., Gérard, S.: An LSTM-based neural network architecture for model transformations. In: International Conference on Model Driven Engineering Languages and Systems (MODELS), pp. 294–299. IEEE, Piscataway (2019). https://doi.org/10.1109/MODELS.2019.00013
15. Burgueño, L., Cabot, J., Wimmer, M., Zschaler, S.: Guest editorial to the theme section on AI-enhanced model-driven engineering. Softw. Syst. Model. 21(3), 963–965 (2022). https://doi.org/10.1007/s10270-022-00988-0
16. Cabot, J., Clarisó, R., Brambilla, M., Gérard, S.: Cognifying model-driven software engineering. In: Federation of International Conferences on Software Technologies: Applications and Foundations, pp. 154–160. Springer, Berlin (2017)
17. Clarisó, R., Cabot, J.: Applying graph kernels to model-driven engineering problems. In: International Workshop on Machine Learning and Software Engineering in Symbiosis (MASES), pp. 1–5. ACM, New York (2018). https://doi.org/10.1145/3243127.3243128
18. Di Rocco, J., Di Ruscio, D., Di Sipio, C., Nguyen, P.T., Pierantonio, A.: MemoRec: a recommender system for assisting modelers in specifying metamodels. Softw. Syst. Model. (2022). https://doi.org/10.1007/s10270-022-00994-2
19. Di Rocco, J., Di Ruscio, D., Di Sipio, C., Nguyen, P.T., Rubei, R.: Development of recommendation systems for software engineering: the CROSSMINER experience. Empirical Softw. Eng. 26(4), 69 (2021). https://doi.org/10.1007/s10664-021-09963-7
20. Di Rocco, J., Di Ruscio, D., Iovino, L., Pierantonio, A.: Collaborative repositories in model-driven engineering [software technology]. IEEE Softw. 32(3), 28–34 (2015)
21. Di Rocco, J., Di Sipio, C., Di Ruscio, D., Nguyen, P.: A GNN-based recommender system to assist the specification of metamodels and models. In: International Conference on Model-Driven Engineering Languages and Systems (MODELS), pp. 70–81. IEEE, Piscataway (2021). https://doi.org/10.1109/MODELS50736.2021.00016
22. Di Salle, A., Rota, A., Nguyen, P.T., Di Ruscio, D., Fontana, F.A., Sala, I.: PILOT: synergy between text processing and neural networks to detect self-admitted technical debt. In: International Conference on Technical Debt (TechDebt), pp. 41–45 (2022). https://doi.org/10.1145/3524843.3528093
23. Domingos, P.: A few useful things to know about machine learning. Commun. ACM 55(10), 78–87 (2012). https://doi.org/10.1145/2347736.2347755
24. Driss, S.B., Soua, M., Kachouri, R., Akil, M.: A comparison study between MLP and convolutional neural network models for character recognition. In: Real-Time Image and Video Processing, vol. 10223, pp. 32–42. International Society for Optics and Photonics, SPIE (2017). https://doi.org/10.1117/12.2262589
25. Eisenberg, M., Pichler, H.P., Garmendia, A.: Searching for models with hybrid AI techniques. In: International Workshop on Conceptual Modeling Meets Artificial Intelligence (CMAI), p. 2 (2021)
26. Eramo, R., Muttillo, V., Berardinelli, L., Bruneliere, H., Gomez, A., Bagnato, A., Sadovykh, A., Cicchetti, A.: AIdoArt: AI-augmented automation for DevOps, a model-based framework for continuous development in cyber-physical systems. In: Euromicro Conference on Digital System Design (DSD), pp. 303–310. IEEE, Piscataway (2021). https://doi.org/10.1109/DSD53832.2021.00053
27. Ferdjoukh, A., Baert, A.E., Chateau, A., Coletta, R., Nebut, C.: A CSP approach for metamodel instantiation. In: International Conference on Tools with Artificial Intelligence (ICTAI), pp. 1044–1051 (2013). https://doi.org/10.1109/ICTAI.2013.156
28. Friedman, J.H.: Greedy function approximation: a gradient boosting machine. Ann. Stat. 29, 1189–1232 (2000)
29. Gamboa, M.A., Syriani, E.: Automating activities in MDE tools. In: International Conference on Model-Driven Engineering and Software Development (MODELSWARD), pp. 123–133 (2016)
30. Giraldo, F.D., España, S., Pineda, M.A., Giraldo, W.J., Pastor, O.: Conciliating model-driven engineering with technical debt using a quality framework. In: CAiSE Forum (Selected Extended Papers). Lecture Notes in Business Information Processing, vol. 204, pp. 199–214. Springer, Berlin (2014)
31. Gomes, R.A., Pinheiro, L.B.L., Maciel, R.S.P.: Anticipating identification of technical debt items in model-driven software projects. In: Brazilian Symposium on Software Engineering (SBES), pp. 740–749. ACM, New York (2020). https://doi.org/10.1145/3422392.3422434
32. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. The MIT Press, Cambridge (2016)
33. Guha, S., Rastogi, R., Shim, K.: CURE: an efficient clustering algorithm for large databases. SIGMOD Rec. 27(2), 73–84 (1998). https://doi.org/10.1145/276305.276312
34. Guha, S., Rastogi, R., Shim, K.: ROCK: a robust clustering algorithm for categorical attributes. Inform. Syst. 25(5), 345–366 (2000). https://doi.org/10.1016/S0306-4379(00)00022-3
35. Hein, C., Ritter, T., Wagner, M.: Model-driven tool integration with ModelBus. In: Workshop Future Trends of Model-Driven Development, pp. 50–52 (2009)
36. Hessellund, A., Czarnecki, K., Wasowski, A.: Guided development with multiple domain-specific languages. In: International Conference on Model Driven Engineering Languages and Systems (MoDELS), pp. 46–60. Springer, Berlin (2007)
37. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
38. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Comput. Surv. 31(3), 264–323 (1999)
39. Janota, M., Kuzina, V., Wasowski, A.: Model construction with external constraints: an interactive journey from semantics to syntax. In: International Conference on Model Driven Engineering Languages and Systems (MoDELS), pp. 431–445. Springer, Berlin (2008)
40. Karasneh, B., Chaudron, M.R.: Online Img2UML repository: an online repository for UML models. In: EESSMOD@MoDELS, pp. 61–66 (2013)
41. Karypis, G., Han, E.H.S., Kumar, V.: Chameleon: hierarchical clustering using dynamic modeling. Computer 32(8), 68–75 (1999). https://doi.org/10.1109/2.781637
42. Koegel, M., Helming, J.: EMFStore: a model repository for EMF models. In: 2010 ACM/IEEE 32nd International Conference on Software Engineering, vol. 2, pp. 307–308. IEEE, Piscataway (2010)
43. Kutsche, R., Milanovic, N., Bauhoff, G., Baum, T., Cartsburg, M., Kumpe, D., Widiker, J.: Bizycle: model-based interoperability platform for software and data integration. Proceedings of the MDTPI at ECMDA 430 (2008)
44. Lano, K., Fang, S., Umar, M., Yassipour-Tehrani, S.: Enhancing model transformation synthesis using natural language processing. In: International Conference on Model Driven Engineering Languages and Systems (MODELS), pp. 277–286. ACM, New York (2020). https://doi.org/10.1145/3417990.3421386
45. Lano, K., Yassipour-Tehrani, S., Umar, M.: Automated requirements formalisation for agile MDE. In: International Conference on Model-Driven Engineering Languages and Systems (MODELS), pp. 173–180. IEEE, Piscataway (2021). https://doi.org/10.1109/MODELS-C53483.2021.00030
46. Latinne, P., Debeir, O., Decaestecker, C.: Combining different methods and numbers of weak decision trees. Pattern Anal. Appl. 5(2), 201–209 (2002). https://doi.org/10.1007/s100440200018
47. Li, T., Sahu, A.K., Talwalkar, A., Smith, V.: Federated learning: challenges, methods, and future directions. IEEE Signal Process. Mag. 37(3), 50–60 (2020). https://doi.org/10.1109/MSP.2020.2975749
48. Lim, W.Y.B., Luong, N.C., Hoang, D.T., Jiao, Y., Liang, Y.C., Yang, Q., Niyato, D., Miao, C.: Federated learning in mobile edge networks: a comprehensive survey. IEEE Commun. Surv. Tutorials 22(3), 2031–2063 (2020). https://doi.org/10.1109/COMST.2020.2986024
49. López, J.A.H., Cuadrado, J.S.: MAR: a structure-based search engine for models. In: International Conference on Model Driven Engineering Languages and Systems (MODELS), pp. 57–67. ACM, New York (2020). https://doi.org/10.1145/3365438.3410947
50. Ludovico, I., Barriga, A., Rutle, A., Heldal, R.: Model repair with quality-based reinforcement learning. J. Object Technol. 19(2), 17:1 (2020). https://doi.org/10.5381/jot.2020.19.2.a17
51. Macedo, N., Jorge, T., Cunha, A.: A feature-based classification of model repair approaches. IEEE Trans. Softw. Eng. 43(7), 615–640 (2017). https://doi.org/10.1109/TSE.2016.2620145
52. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
53. Massoni, T.L., Gheyi, R., Borba, P.: Formal model-driven program refactoring. In: FASE (2008)
54. Messerschmitt, D.G., Szyperski, C.: Software Ecosystem: Understanding an Indispensable Technology and Industry. MIT Press, Cambridge (2003)
55. Mussbacher, G., Combemale, B., Kienzle, J., Abrahão, S., Ali, H., Bencomo, N., Búr, M., Burgueño, L., Engels, G., Jeanjean, P., Jézéquel, J.M., Kühn, T., Mosser, S., Sahraoui, H., Syriani, E., Varró, D., Weyssow, M.: Opportunities in intelligent modeling assistance. Softw. Syst. Model. 19(5), 1045–1053 (2020). https://doi.org/10.1007/s10270-020-00814-5
56. Ng, R.T., Han, J.: CLARANS: a method for clustering objects for spatial data mining. Trans. Knowl. Data Eng. 14(5), 1003–1016 (2002). https://doi.org/10.1109/TKDE.2002.1033770
57. Nguyen, P., Di Rocco, J., Di Ruscio, D., Pierantonio, A., Iovino, L.: Automated classification of metamodel repositories: a machine learning approach. In: International Conference on Model Driven Engineering Languages and Systems (MoDELS), pp. 272–282. IEEE, Piscataway (2019). https://doi.org/10.1109/MODELS.2019.00011
58. Nguyen, P., Di Rocco, J., Iovino, L., Di Ruscio, D., Pierantonio, A.: Evaluation of a machine learning classifier for metamodels. Softw. Syst. Model. 20(6), 1797–1821 (2021). https://doi.org/10.1007/s10270-021-00913-x
59. Nguyen, P., Di Ruscio, D., Pierantonio, A., Di Rocco, J., Iovino, L.: Convolutional neural networks for enhanced classification mechanisms of metamodels. J. Syst. Softw. 172 (2021). https://doi.org/10.1016/j.jss.2020.110860
60. Nguyen, P.T., Di Rocco, J., Rubei, R., Di Sipio, C., Di Ruscio, D.: DeepLib: machine translation techniques to recommend upgrades for third-party libraries. Expert Syst. Appl. 202, 117267 (2022). https://doi.org/10.1016/j.eswa.2022.117267
61. Nguyen, P.T., Di Ruscio, D., Di Rocco, J., Di Sipio, C., Di Penta, M.: Adversarial machine learning: on the resilience of third-party library recommender systems. In: Evaluation and Assessment in Software Engineering (EASE), pp. 247–253. ACM, New York (2021). https://doi.org/10.1145/3463274.3463809
62. Nguyen, P.T., Di Sipio, C., Di Rocco, J., Di Penta, M., Di Ruscio, D.: Adversarial attacks to API recommender systems: time to wake up and smell the coffee? In: International Conference on Automated Software Engineering (ASE), pp. 253–265 (2021). https://doi.org/10.1109/ASE51524.2021.9678946
63. Nguyen, P.T., Di Sipio, C., Di Rocco, J., Di Ruscio, D., Di Penta, M.: Fitting missing API puzzles with machine translation techniques. Expert Syst. Appl. 216, 119477 (2023). https://doi.org/10.1016/j.eswa.2022.119477
64. Padget, J., Elakehal, E., Satoh, K., Ishikawa, F.: On requirements representation and reasoning using answer set programming. In: International Workshop on Artificial Intelligence for Requirements Engineering (AIRE), pp. 35–42. IEEE, Piscataway (2014). https://doi.org/10.1109/AIRE.2014.6894854
65. Parra-Ullauri, J.M., García-Domínguez, A., Bencomo, N., Zheng, C., Zhen, C., Boubeta-Puig, J., Ortiz, G., Yang, S.: Event-driven temporal models for explanations—ETeMoX: explaining reinforcement learning. Softw. Syst. Model. 21(3), 1091–1113 (2022). https://doi.org/10.1007/s10270-021-00952-4
66. Pinna Puissant, J., Van Der Straeten, R., Mens, T.: Resolving model inconsistencies using automated regression planning. Softw. Syst. Model. 14(1), 461–481 (2015). https://doi.org/10.1007/s10270-013-0317-9
67. Portugal, I., Alencar, P., Cowan, D.: The use of machine learning algorithms in recommender systems: a systematic review. Expert Syst. Appl. 97, 205–227 (2018). https://doi.org/10.1016/j.eswa.2017.12.020
68. Potdar, A., Shihab, E.: An exploratory study on self-admitted technical debt. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 91–100. IEEE, Piscataway (2014). https://doi.org/10.1109/ICSME.2014.31
69. Rasiman, R., Dalpiaz, F., España, S.: How effective is automated trace link recovery in model-driven development? Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 13216, pp. 35–51 (2022). https://doi.org/10.1007/978-3-030-98464-9_4
70. Rawat, W., Wang, Z.: Deep convolutional neural networks for image classification: a comprehensive review. Neural Comput. 29(9), 2352–2449 (2017). https://doi.org/10.1162/neco_a_00990
71. Riesen, K., Bunke, H.: Graph Classification and Clustering Based on Vector Space Embedding. World Scientific Publishing, USA (2010)
72. Rigou, Y., Lamontagne, D., Khriss, I.: A sketch of a deep learning approach for discovering UML class diagrams from system's textual specification. In: International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET). IEEE, Piscataway (2020). https://doi.org/10.1109/IRASET48871.2020.9092144
73. Rokach, L., Maimon, O.: Clustering Methods, pp. 321–352. Springer, Boston (2005). https://doi.org/10.1007/0-387-25465-X_15
74. Rubei, R., Di Rocco, J., Di Ruscio, D., Nguyen, P., Pierantonio, A.: A lightweight approach for the automated classification and clustering of metamodels. In: International Conference on Model Driven Engineering Languages and Systems (MoDELS), pp. 477–482. IEEE, Piscataway (2021). https://doi.org/10.1109/MODELS-C53483.2021.00074
75. Saini, R.: Artificial intelligence empowered domain modelling bot. In: International Conference on Model Driven Engineering Languages and Systems (MoDELS), pp. 1–6. ACM, New York (2020). https://doi.org/10.1145/3417990.3419486
76. Saini, R., Mussbacher, G., Guo, J., Kienzle, J.: Automated, interactive, and traceable domain modelling empowered by artificial intelligence. Softw. Syst. Model. 21(3), 1015–1045 (2022). https://doi.org/10.1007/s10270-021-00942-6
77. Saini, R., Mussbacher, G., Guo, J.L., Kienzle, J.: Teaching modelling literacy: an artificial intelligence approach. In: International Conference on Model Driven Engineering Languages and Systems (MoDELS), pp. 714–719. IEEE, Piscataway (2019). https://doi.org/10.1109/MODELS-C.2019.00108
78. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. Trans. Neural Netw. 20(1), 61–80 (2008)
79. Schmidt, D.C.: Model-driven engineering. Comput.-IEEE Comput. Soc. 39(2), 25 (2006)
80. Sidhu, B., Singh, K., Sharma, N.: A machine learning approach to software model refactoring. Int. J. Comput. Appl. 44(2), 166–177 (2022). https://doi.org/10.1080/1206212X.2020.1711616
81. Sierra, G., Shihab, E., Kamei, Y.: A survey of self-admitted technical debt. J. Syst. Softw. 152, 70–82 (2019). https://doi.org/10.1016/j.jss.2019.02.056
82. Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. In: ACM SIGKDD World Text Mining Conference (2000)
83. Steinwart, I., Christmann, A.: Support Vector Machines, 1st edn. Springer, Berlin (2008)
84. Stephan, M.: Towards a cognizant virtual software modeling assistant using model clones. In: International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER), pp. 21–24. IEEE, Piscataway (2019). https://doi.org/10.1109/ICSE-NIER.2019.00014
85. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn. The MIT Press, Cambridge (2018)
86. Tang, X., Wang, Z., Qi, J., Li, Z.: Improving code generation from descriptive text by combining deep learning and syntax rules. In: International Conference on Software Engineering and Knowledge Engineering (SEKE), pp. 385–390 (2019). https://doi.org/10.18293/SEKE2019-170
87. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30. Curran Associates (2017)
88. Vishwanathan, S., Schraudolph, N.N., Kondor, R., Borgwardt, K.M.: Graph kernels. J. Mach. Learn. Res. 11(40), 1201–1242 (2010)
89. Weyssow, M., Sahraoui, H., Syriani, E.: Recommending metamodel concepts during modeling activities with pre-trained language models. Softw. Syst. Model. 21(3), 1071–1089 (2022). https://doi.org/10.1007/s10270-022-00975-5
90. White, M., Vendome, C., Linares-Vásquez, M., Poshyvanyk, D.: Toward deep learning software repositories. In: Working Conference on Mining Software Repositories (MSR), pp. 334–345. IEEE, Piscataway (2015)
91. Wuest, T., Weimer, D., Irgens, C., Thoben, K.D.: Machine learning in manufacturing: advantages, challenges, and applications. Prod. Manuf. Res. 4(1), 23–45 (2016). https://doi.org/10.1080/21693277.2016.1192517
92. Xu, R., Wunsch II, D.C.: Survey of clustering algorithms. Trans. Neural Netw. 16(3), 645–678 (2005). https://doi.org/10.1109/TNN.2005.845141
93. Yang, Q., Liu, Y., Chen, T., Tong, Y.: Federated machine learning: concept and applications. Trans. Intell. Syst. Technol. 10(2), 12:1–12:19 (2019). https://doi.org/10.1145/3298981
94. Zampetti, F., Fucci, G., Serebrenik, A., Di Penta, M.: Self-admitted technical debt practices: a comparison between industry and open-source. Empirical Softw. Eng. 26(6), 131 (2021). https://doi.org/10.1007/s10664-021-10031-3
95. Zhang, G., Patuwo, B.E., Hu, M.Y.: Forecasting with artificial neural networks: the state of the art. Int. J. Forecasting 14(1), 35–62 (1998). https://doi.org/10.1016/S0169-2070(97)00044-7
96. Zhang, H., Babar, M.A., Tell, P.: Identifying relevant studies in software engineering. Inform. Softw. Technol. 53(6), 625–637 (2011). https://doi.org/10.1016/j.infsof.2010.12.010
97. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: a new data clustering algorithm and its applications. Data Min. Knowl. Discov. 1(2), 141–182 (1997). https://doi.org/10.1023/A:1009783824328

Chapter 11

Mining, Analyzing, and Evolving Data-Intensive Software Ecosystems

Csaba Nagy, Michele Lanza, and Anthony Cleve

Abstract Managing data-intensive software ecosystems has long been considered an expensive and error-prone process. This is mainly due to the often implicit consistency relationships between applications and their database(s). In addition, as new technologies emerged for specialized purposes (e.g., key-value stores, document stores, graph databases), the use of multiple database models within the same software (eco)system has become more common. Such multi-database models, in which developers use and combine several technologies, have undeniable benefits. However, their side effects on database design, querying, and maintenance are not well known. This chapter elaborates on the recent research effort devoted to mining, analyzing, and evolving data-intensive software ecosystems. It focuses on methods, techniques, and tools providing developers with automated support, covering different processes such as automatic database query extraction, bad smell detection, self-admitted technical debt analysis, and evolution history visualization.

11.1 Introduction

Data-intensive software ecosystems comprise one or several databases and a collection of applications connected with the former. They constitute critical assets in most enterprises since they support business activities in all production and management domains. They are typically old, large, heterogeneous, and highly complex. Database interactions play a crucial role in data-intensive applications, as they determine how the system communicates with its database(s).


When the application sends a query to its database, it is the database's responsibility to handle it with the best possible performance. The developer has limited control: if the query is not well-formed or not handled correctly in the program code, it generates an extra load on the database side that affects the application's performance [52, 81]. In the worst case, it can lead to errors, bugs, or security vulnerabilities such as code injection.

In the last decade, the emergence of novel database technologies and the increasing use of dynamic data manipulation frameworks have made data-intensive ecosystem analysis and evolution even more challenging. In particular, the increasing use of NoSQL databases poses new challenges for developers and researchers. A prominent feature of such databases is that they are schema-less, offering greater flexibility in handling data. This freedom strikes back when it comes to maintaining and evolving data-intensive applications [71, 92]. Another challenging trend is the development of hybrid multi-database architectures [16], called hybrid polystores, where relational and NoSQL databases coexist within the same system and must be queried and evolved jointly and consistently.

We present recent research initiatives aiming to address those challenges. In Sect. 11.2, we discuss mining techniques to determine how data is stored and managed in a data-intensive ecosystem. Those techniques can automatically identify and extract the database interactions from the system source code. Section 11.3 elaborates on static analysis and visualization techniques that exploit the mined information about storing and manipulating the ecosystem's data. In Sect. 11.4, we summarize the findings of empirical studies related to data-intensive software ecosystems. We provide concluding remarks and anticipate future directions in Sect. 11.5.
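To make the query-handling risks mentioned at the beginning of this introduction more tangible, here is a minimal, hypothetical JDBC fragment (not taken from any of the systems studied in this chapter) that contrasts a concatenated query, which is vulnerable to injection and harder to recover statically, with a parameterized one.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Illustrative contrast between concatenated and parameterized JDBC queries.
public class LoginDao {

    // Fragile: the user-controlled value is concatenated into the SQL string.
    ResultSet findUserUnsafe(Connection con, String email) throws SQLException {
        Statement st = con.createStatement();
        return st.executeQuery("SELECT * FROM users WHERE email = '" + email + "'");
    }

    // Safer: the value is bound as a parameter, the query text stays constant,
    // and static analyzers can recover the full statement more easily.
    ResultSet findUserSafe(Connection con, String email) throws SQLException {
        PreparedStatement ps = con.prepareStatement("SELECT * FROM users WHERE email = ?");
        ps.setString(1, email);
        return ps.executeQuery();
    }
}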

11.2 Mining Techniques

11.2.1 Introduction

Managing a data-intensive software ecosystem requires a deep understanding of its architecture since it consists of many subsystems that depend on one another. They use each other's public services and APIs—they communicate. A subsystem may rely on one or more databases, and their data will likely travel through the entire ecosystem. Maintaining and evolving this interconnected network of systems requires a fundamental understanding of how data is handled all over it.

In this section, we present approaches to mining how data is stored and managed in a data-intensive ecosystem. Such knowledge can serve various purposes, e.g., reverse engineering, re-documentation, visualization, or quality assurance. We present two techniques to study the interaction points where applications communicate with databases. Both techniques are based on static analysis and, hence, do not require the application to execute but only its source code.


This can be particularly important for an ecosystem where dynamic analysis is even more challenging, if not impossible, in most situations. In Sect. 11.2.2, we present a static approach to identifying, extracting, and analyzing database accesses in Java applications. This technique can be used for applications with libraries that communicate through SQL statements (e.g., JDBC or Hibernate). It locates the database interaction points and traces back the potential SQL strings dynamically constructed at those locations. In Sect. 11.2.3, we show a similar static approach for NoSQL databases. This technique was developed to analyze MongoDB usage in Java and JavaScript. JavaScript is a highly dynamic language, and the approach tries to alleviate the limitations of static analysis by using heuristics, e.g., when types are not available explicitly.

11.2.2 Static Analysis of Relational Database Accesses

Database manipulation is central in the source code of a data-intensive system. It serves data to all its other parts, enabling the application to query the information needed for its operations and to persist changes to its actual state. The database manipulation code is often separated in the codebase. For example, object-oriented languages usually follow the DAO (Data Access Object) design pattern to isolate the application/business layer from the persistence layer. A DAO class implements all the functionality required for fetching, updating, and removing its domain objects. For example, a UserDao would be responsible for handling User objects mapped to user entities in the database.

The complexity of the manipulation code depends on the APIs or libraries used for database communication, and many libraries are available depending on the developers' needs. Several factors may determine the choice of library, such as the programming language, the database, and the required abstraction level. A large-scale empirical study by Goeminne et al. [43] found JPA, Hibernate, and JDBC among the most popular ones in 3707 GitHub Java projects. Moreover, many systems use combinations of multiple libraries. From the mining point of view (and also from the developer's perspective), these libraries can partly or entirely hide the actual SQL queries executed by the programs, generating queries at runtime before sending them to the database server.

Listing 11.1 shows an example code snippet executing an SQL query through the JDBC API. Statement.execute(...) sends the query to the database (line 10). It is part of the standard java.sql API that provides classes for "accessing and processing data stored in a data source (usually a relational database)."1 The final query results from string operations (e.g., lines 9 and 14) depending on conditions (e.g., lines 8 and 19).

1 https://docs.oracle.com/javase/8/docs/api/java/sql/package-summary.html.


Listing 11.1 Java code example executing an SQL query

 1  public class ProviderMgr {
 2      private Statement st;
 3      private ResultSet rs;
 4      private boolean ordering;
 5
 6      public void executeQuery(String x, String y) {
 7          String sql = getQueryStr(x);
 8          if (ordering)
 9              sql += "order by " + y;
10          rs = st.execute(sql);
11      }
12
13      public String getQueryStr(String str) {
14          return "select * from " + str;
15      }
16
17      public Provider[] getAllProviders() {
18          String tableName = "Provider";
19          String columnName = (...) ? "provider-id" : "provider_name";
20          executeQuery(tableName, columnName);
21          // ...
22      }
23  }

Listing 11.2 presents a similar example usage of Hibernate to send an HQL query to the database. HQL (Hibernate Query Language) is the SQL-like query language of Hibernate. It is fully object oriented and understands inheritance, polymorphism, and association. The example shows a snippet using the SessionFactory API. The HQL statement on line 10 queries products belonging to a given category. Hibernate transforms this query to SQL and sends it to the database when the list() method is invoked on line 12.

Listing 11.2 Java code snippet executing an HQL query

 1  public class ProductDaoImpl implements ProductDao {
 2      private SessionFactory sessionFactory;
 3
 4      public void setSessionFactory(SessionFactory sessionFactory) {
 5          this.sessionFactory = sessionFactory;
 6      }
 7
 8      public Collection loadProductsByCategory(String category) {
 9          return this.sessionFactory.getCurrentSession()
10              .createQuery("from Product product where category=?")
11              .setParameter(0, category)
12              .list();
13      }
14  }


Fig. 11.1 Overview of the query extraction approach (inputs: the Java source code, accessing the database through JDBC or an ORM, and the database schema; a static analyzer performs DB access detection with JDBC, Hibernate, and JPA analyses and outputs the database accesses)

Meurice et al. addressed the problem of recovering traceability links between Java programs and their databases [69]. They proposed a static analysis approach to identify the source code locations where database queries are executed, extracting the SQL queries for each location. The approach is based on algorithms that operate on the application’s call graph and the methods’ intra-procedural control flow. It considers three of the most popular database access technologies used in Java systems, according to Goeminne et al. [43], namely, JDBC, Hibernate, and JPA. An overview of Meurice et al.’s approach can be seen in Fig. 11.1. First, it takes the application source code and database schema as input. It parses the schema and analyzes the source code to identify the locations where the application interacts with the database. Then, it extracts the SQL queries sent to the database at these locations and parses the queries. The final output is a set of database access locations, their queries, and the database objects (tables and columns) impacted/accessed at these locations. Different technologies (i.e., JDBC, Hibernate, or JPA) require different analysis approaches, like SQL dialects requiring specific parsers. A static approach would require interprocedural data- and control-flow analyses, as SQL queries can be constructed through deeply embedded string operations. In some cases, even such techniques cannot extract the entire query. Thus, a query might be incomplete. For example, a user may enter credentials in a login form to be validated in the database. The user input is known at runtime, but the static analyzer only sees that the email and password variables are used to construct the query. The parser must tolerate such incomplete statements, and in the end, the extraction process balances precision and the computation overhead of in-depth static analyses.
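As a rough illustration of the first step of such an extraction pipeline, the sketch below uses the JavaParser library to flag calls to typical JDBC execution methods in a single Java file. It is only a simplification under that assumption, not Meurice et al.'s implementation: it omits the call-graph construction, the type resolution needed to rule out unrelated execute() methods, and the string analysis that reconstructs the SQL text.

import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.expr.MethodCallExpr;

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;

// Illustrative sketch: flag calls to common JDBC execution methods in one Java file.
public class DbAccessLocator {

    private static final Set<String> JDBC_EXEC = Set.of("executeQuery", "executeUpdate", "execute");

    public static void main(String[] args) throws Exception {
        CompilationUnit cu = StaticJavaParser.parse(Files.readString(Path.of(args[0])));
        cu.findAll(MethodCallExpr.class).stream()
          .filter(call -> JDBC_EXEC.contains(call.getNameAsString()))
          .forEach(call -> System.out.printf("Possible database access at line %d: %s%n",
                  call.getBegin().map(p -> p.line).orElse(-1),
                  call));
    }
}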


Meurice et al. evaluated their approach on three open-source systems (OSCAR,2 OpenMRS,3 and Broadleaf4) with sizes ranging from 250 kLOC to 2054 kLOC and 88–480 tables in their databases. The first two are popular electronic medical record (EMR) systems, and the third is an e-commerce framework. They could extract queries for 71.5–99% of the database accesses, with 87.9–100% of valid queries.

In their follow-up work [72], they analyzed the evolution of the same three systems, which have been developed for more than 7 years. They jointly analyzed the changes in the database schema and the related changes in the source code by focusing on the database access locations. They made several interesting observations. For example, different data manipulation technologies could access the same tables within the programs. Database schemas also expanded significantly over time; most schema changes consisted of adding new tables and columns. A significant subset of database tables and columns were not accessed (any longer) by the application programs, resulting in "dead" schema elements. Co-evolving schema and programs is not always a trivial process for developers. Developers seem to refrain from evolving a table in the database schema since this may make related queries invalid in the programs. Instead, they probably prefer to add a new table by duplicating its data and incrementally updating the programs to use the new table instead of the old one. Sometimes, the old table version is never deleted, even when not accessed anymore.

11.2.3 Static Analysis of NoSQL Database Accesses

NoSQL ("Not Only SQL") technologies emerged to tackle the limitations of relational databases. They offer attractive features such as scale-out scalability, cloud readiness, and schema-less data models [68]. New features come at a price, however. For example, schema-less storage allows faster data structure changes, but the absence of an explicit schema results in multiple coexisting implicit schemas. The increased complexity adds to developers' operational and maintenance burden [5, 91]. Efforts have been made to address the challenges of NoSQL systems. A popular direction is to support schema evolution in the schema-less NoSQL environment [105]. For example, researchers study automatic schema extraction [1], schema generation [45], optimization [76], and schema suggestions [49]. Behind the scenes, such approaches mainly rely on a static analysis of the source code or the data, operating on the part of the source code that implements the database communication.

2 https://www.oscar-emr.com. 3 https://www.openmrs.org. 4 https://www.broadleafcommerce.org.


Cherry et al. addressed the problem of retrieving database accesses from the source code of JavaScript applications that use MongoDB [24]. Static analysis of JavaScript is known to be extremely difficult. Existing techniques [9, 53] usually struggle to handle the excessively dynamic features of the language [106], and approaches with type inference [51], data flow [62], or call graphs [37] need to balance between scalability and soundness. Cherry et al. alleviate the limitations of static analysis by using heuristics.

Listing 11.3 presents a typical schema definition in Mongoose,5 a popular object-modeling library to facilitate working with MongoDB in JavaScript. First, the mongoose module is included using the built-in require function. Then a schema is created through the mongoose.Schema(...) API call. In Mongoose, a Schema is mapped to a MongoDB collection and defines the structure of the documents within that collection. To work with a Schema, a Model is needed in Mongoose. Finally, line 9 creates a Model that is exported to be used externally.

Listing 11.3 Mongoose schema definition example

1  const mongoose = require("mongoose");
2
3  let CarSchema = new mongoose.Schema({
4    brand: String,
5    model: String,
6    price: Number
7  });
8
9  module.exports = mongoose.model("cars", CarSchema);

Listing 11.4 shows an example usage of the model in Listing 11.3. The model is imported using the require function (line 1). An instance of a model is a Document in Mongoose that can be created and saved in various ways. The example uses the Document.save() method of the tesla instance (line 5). Finally, line 8 shows a simple query to find documents.

Listing 11.4 Mongoose query example

1  Car = require("./cars.js");
2
3  // ...
4  tesla = new Car("Tesla", "Model S", 95000);
5  await tesla.save();
6
7  // ...
8  tesla = await Car.find({name: /Tesla/});

5 https://mongoosejs.com/.


Fig. 11.2 Overview of the MongoDB data access analysis by Cherry et al. [24] (steps: (1) CodeQL analysis of the JavaScript project, (2) database access method extraction, and (3) filtering heuristics, producing the database access locations)

Cherry et al. look for similar MongoDB interactions in JavaScript applications. The aim is to identify every statement operating on the database. For this purpose, they gathered method signatures from the reference guides of MongoDB Node Driver 3.6 and Mongoose 5.12.8. They selected the methods that access the database for one of the following operations: (1) creating a new collection/document; (2) updating the content of documents or a collection; (3) deleting documents from a collection; (4) accessing the content of documents. Overall, they identified 179 methods, 74 from MongoDB Node Driver and 105 from Mongoose.

Figure 11.2 presents an overview of the main steps of the approach. First, it analyzes a JavaScript project with CodeQL,6 a code analysis engine developed by GitHub. Code can be queried in CodeQL as if it were data, using an SQL-like query language. Accordingly, the approach runs queries to find the database access methods. The next step applies filtering heuristics to improve precision by eliminating method calls in potential conflict with other APIs. They defined seven heuristics. For example, the "the receiver should not be '_'" heuristic avoids potential collisions with the frequently used Lodash7 library. The outcome is a list of source code locations accessing the database, with details of each access (e.g., API used, receiver, context).

An example use case of the approach is analyzing the evolution of a system's database usage. Cherry et al. presented two case studies on Bitcore8 and Overleaf.9 Figure 11.3 shows the database access methods in the different releases of Bitcore, a project with 4.2K stars and 2K forks on GitHub.

6 https://codeql.github.com/. 7 https://lodash.com/. 8 https://github.com/bitpay/bitcore. 9 https://github.com/overleaf/web.
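The flavour of such post-filtering can be sketched in a few lines. The fragment below is hypothetical (the record fields, the method list, and the heuristic are illustrative, and the actual approach expresses these checks as CodeQL queries rather than Java code): it discards candidate call sites whose receiver is '_', which almost always denotes Lodash rather than a MongoDB collection or model.

import java.util.List;
import java.util.Set;

// Illustrative post-filtering of candidate MongoDB access call sites.
public class AccessFilter {

    // Hypothetical representation of a call site reported by the extraction step.
    record CallSite(String file, int line, String receiver, String method) {}

    private static final Set<String> ACCESS_METHODS = Set.of("find", "save", "insertOne", "updateOne");

    static boolean looksLikeDatabaseAccess(CallSite c) {
        if (c.receiver().equals("_")) return false;   // heuristic: '_' is the Lodash object
        return ACCESS_METHODS.contains(c.method());
    }

    public static void main(String[] args) {
        List<CallSite> candidates = List.of(
                new CallSite("app.js", 12, "Car", "find"),
                new CallSite("util.js", 40, "_", "find"));   // Lodash call, filtered out
        candidates.stream()
                  .filter(AccessFilter::looksLikeDatabaseAccess)
                  .forEach(c -> System.out.println(c.file() + ":" + c.line() + "  " + c.receiver() + "." + c.method()));
    }
}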

Fig. 11.3 Evolution of Bitcore: number of database accesses per release, broken down by operation (create, delete, generic, insert, select, update)

It has a multi-project infrastructure with a MongoDB database at its core. The most represented database operation is select, with 170 distinct method calls. One can also see a major change in the number of database accesses around v8.16.2. A closer look reveals a commit10 that adds numerous models and methods interacting with the database. It introduces a new feature, as the commit message says: "[Bitcore] can now sync ETH and get wallet history for ERC20 tokens."

Figure 11.4 shows the evolution of Overleaf, a popular online, collaborative LaTeX editor. Overleaf's database usage differs from Bitcore's, with more prominent data modifications. Indeed, there are more updates (34%, 108) than selects (32%, 103). There was also an abrupt change in database accesses between September and October 2020, when Overleaf was migrated from MongoJS to MongoDB Node Driver.

Cherry et al. evaluated the accuracy of their approach on a manually validated oracle of 307 open-source projects. They reached promising results, achieving a precision of 78%. Such an approach is the first step toward further database access API usage analyses in JavaScript applications. It is required, for example, to analyze the evolution of systems [71], help their developers propagate schema changes [2], or identify antipatterns [81].

10 https://github.com/bitpay/bitcore/commit/d08ea9.


Fig. 11.4 Evolution of Overleaf: number of database accesses per commit date, from July 2019 to July 2021, broken down by operation (create, delete, generic, insert, select, update)

11.2.4 Reflections

We presented two static analysis approaches to study how applications communicate with their databases. We first discussed communication with relational databases, where the programming context was Java, a statically typed language. Then we presented a technique for MongoDB, as an example of a popular NoSQL database, in the dynamically typed context of JavaScript applications.

To extract SQL queries, pioneering work was published by Christensen et al. [25]. They propose a static string analysis technique that translates a given Java program into a flow graph and then generates a finite-state automaton. Gould et al. propose a method based on an interprocedural data-flow analysis [47, 116]. Maule et al. use a similar k-CFA algorithm and a system dependence graph to identify the impact of relational database schema changes upon object-oriented applications [67]. Brink et al. present a quality assessment approach for SQL statements embedded in PL/SQL, COBOL, and Visual Basic code [110]. They extract the SQL statements from the source code using control- and data-flow analysis techniques. Annamaa et al. presented Alvor, a tool that statically analyzes SQL queries embedded into Java programs. Their approach is based on an interprocedural path-insensitive constant propagation analysis [10] similar to the one presented by Meurice et al. [74]. Ngo and Tan use symbolic execution to extract database interaction points from web applications [82]. They work with PHP applications of sizes ranging from 2 to 584 kLOC, and their method can extract about 80% of database interactions. PHP applications were also studied by Anderson et al., who proposed program analyses for extracting models of database queries [6–8].


They implement their approach in Rascal as part of the PHP AiR framework. A similar approach was also presented by Manousis et al. [63], who describe a language-independent abstraction of the problem and demonstrate it with a tool implementation for C++ and PHP. Recent research also targeted Android, where SQL is preferred over higher-level abstractions (e.g., an ORM (Object-Relational Mapping)) that may affect performance. Lyu et al. studied local database usage in Android applications [61]: they look for invocations of SQLite APIs and their queries. Li et al. presented a general string analysis approach for Android [59]. They define an intermediate representation (IR) of the string operations performed on string variables. This representation captures data-flow dependencies in loops and context-sensitive call site information.

There are also dynamic approaches. Cleve et al. explored aspect-based tracing and SQL trace analysis for extracting implicit information about program behavior and database structure [28]. Noughi et al. mined SQL execution traces for data manipulation behavior recovery and conceptual interpretation [77, 83]. Oh et al. proposed a technique to extract dependencies between web components (i.e., Java Server Pages) and database resources. Using the proxy pattern, they dynamically observe the database-related objects in the Java standard library.

Some recent approaches also targeted NoSQL databases, e.g., to extract models from JSON document databases [1, 14, 19, 39]. Some approaches also deal with schema generation [45], optimization [76], and schema suggestions [49]. Also interesting to note is the work of Störl et al., who studied schema evolution and data migration in a NoSQL environment [105]. Similar to the approach of Cherry et al. [24], Meurice et al. implemented an approach to extract the database schema of MongoDB applications written in Java [71]. They applied their method to analyze the evolution of Java systems.

11.3 Analysis Techniques

11.3.1 Introduction

Once we have mined information on how data is stored and managed in the ecosystem, we can analyze it for various purposes. In this section, we present two kinds of techniques. First, we show static analysis approaches in Sect. 11.3.2, which rely on the mining techniques introduced in the previous section. Then we present visualization methods in Sect. 11.3.3 to analyze the dependencies between the database and the different components of an ecosystem.


11.3.2 Static Analysis Techniques

A database is a critical component of a data-intensive ecosystem. It has to be readily available, and its response time influences the usability of the entire ecosystem. It has been shown that the structure of a database can evolve rapidly, reaching hundreds of tables or thousands of database objects [74]. Moreover, because the application code and the database depend on each other, they evolve in parallel [72], resulting in an increasingly complex database communication layer. This layer must remain reliable, robust, and efficient. Here, we show example approaches that help maintain the database interactions between the systems of an ecosystem and their databases.

11.3.2.1 Example 1: SQLINSPECT—A Static Analyzer

Static analyzers11 can help detect fault-prone and inefficient database usage, i.e., code smells, in early development phases. A lightweight analyzer can pinpoint a mistake already in the IDE before the developer commits it. Nagy et al. presented SQLINSPECT,12 a tool to identify code smells in SQL queries embedded in Java code [80, 81]. SQLINSPECT implements a combined static analysis of the SQL statements in the source code, the database schema, and the data in the database. Its static analysis relies on the approach of Meurice et al. presented in Sect. 11.2. It uses a path-sensitive string analysis algorithm to extract SQL queries from the Java code and implements smell detectors on the abstract semantic graph of a fault-tolerant SQL parser. The supported SQL smells are based on the SQL Antipatterns book of Karwin [52]. SQLINSPECT can also perform additional analyses: (1) it supports inspecting the interprocedural slice of the statements involved in the query construction; (2) it can perform a table/column access analysis to determine which Java methods access specific tables or columns; (3) it calculates embedded SQL metrics (e.g., number of joins, nested select statements) to identify problematic or poorly designed classes and SQL statements. The tool is also available as an Eclipse plug-in. A screenshot of a query slice in the Eclipse plug-in can be seen in Fig. 11.5. SQLINSPECT has been used in various studies. Muse et al. relied on it to study SQL code smells [79] and the prevalence, composition, and evolution of self-admitted technical debt in data-intensive systems [78]. Gobert et al. employed SQLINSPECT to study developers’ testing practices of database access code [40, 41]. Ardigò et al. also relied on it to visualize database accesses of a system [11, 12].

11 The term “static analysis” is conflated. In this section, we call “static analyzer” a tool implementing source code analysis algorithms and techniques to find bugs automatically. The more general term “static analysis” (or static program analysis) is the analysis of a program performed without executing it. 12 https://bitbucket.org/csnagy/sqlinspect/.
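To give an idea of what such a detector operates on once queries have been extracted, the fragment below flags one of Karwin's antipatterns (implicit columns, i.e., SELECT *) and computes a simple query metric (the number of joins). It is only an illustrative sketch: SQLINSPECT works on the abstract semantic graph produced by a fault-tolerant SQL parser, not on regular expressions.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative, regex-based approximations of one SQL smell and one SQL metric.
public class SqlSmellSketch {

    private static final Pattern SELECT_STAR = Pattern.compile("\\bselect\\s+\\*", Pattern.CASE_INSENSITIVE);
    private static final Pattern JOIN = Pattern.compile("\\bjoin\\b", Pattern.CASE_INSENSITIVE);

    // "Implicit columns" smell: the query selects all columns with a wildcard.
    static boolean hasImplicitColumns(String sql) {
        return SELECT_STAR.matcher(sql).find();
    }

    // Simple embedded-SQL metric: number of joins in the statement.
    static int countJoins(String sql) {
        Matcher m = JOIN.matcher(sql);
        int count = 0;
        while (m.find()) count++;
        return count;
    }

    public static void main(String[] args) {
        String sql = "SELECT * FROM provider p JOIN contract c ON p.id = c.provider_id";
        System.out.println("Implicit columns: " + hasImplicitColumns(sql) + ", joins: " + countJoins(sql));
    }
}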


Fig. 11.5 A query slice in the Eclipse plug-in of SQLINSPECT

11.3.2.2 Example 2: Preventing Program Inconsistencies

Any software system is subject to frequent changes [57], which often hit the database schema [29, 95, 112]. When the schema evolves, developers need to adapt the applications where they access the changed schema elements [23, 67, 87]. This adaptation process is usually performed manually. Thus, it can be error-prone, resulting in database or application decay [101, 102].

Meurice et al. proposed an approach to detect and prevent program inconsistencies under database schema changes [73]. Their what-if analysis simulates future database schema modifications to estimate how they affect the application code. It recommends to developers where and how they should propagate the schema changes to the source code. The goal is to ensure that the programs' consistency is preserved. The core idea of the approach is first to analyze the evolution of the system, focusing on its schema changes. They collect metrics to estimate the effort that was required in the past for adapting the applications to database schema changes. For example, they look for renamed or deleted tables and columns and estimate from the commits the time needed to solve them in the code.


To analyze the codebase and the schema, they rely on the previous analysis approach we presented in Sect. 11.2. They run the analysis on each earlier system version and build a historical dataset. This dataset is designed to replay database schema modifications and estimate their impact on the source code. It describes all versions of the columns of database tables and links them to source code entities where they are accessed in the application code.

Meurice et al. demonstrated their what-if analysis on the three open-source Java systems (OpenMRS, Broadleaf Commerce, and OSCAR) used for the query extraction approach in Sect. 11.2.2. Recall that OpenMRS and OSCAR are electronic medical record systems, and Broadleaf is an e-commerce framework. The largest one, OSCAR, has 2054 kLOC and 480 tables. They collected 323 database schema changes and randomly selected 130 for manual evaluation. They compared the tool's recommendations to the developers' actual modifications. The approach made 204 suggestions for the 130 cases: 99% were correct, and only 1% were wrong. The tool missed recommendations for 5% (6/130) of the changes. The results show impressive potential in detecting and preventing program inconsistencies.
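At its core, such a what-if analysis needs a mapping from schema elements to the code locations that access them. The following sketch is a strong simplification with invented types and data (it is not Meurice et al.'s tool): it simulates dropping a column and lists the access locations that would need adaptation.

import java.util.List;
import java.util.Map;

// Illustrative what-if simulation: which code locations break if a column is removed?
public class SchemaWhatIf {

    record Access(String file, int line) {}

    // Hypothetical historical dataset: schema element ("table.column") -> accessing code locations.
    static final Map<String, List<Access>> ACCESSES = Map.of(
            "provider.name", List.of(new Access("ProviderDao.java", 42), new Access("ReportService.java", 130)),
            "provider.id", List.of(new Access("ProviderDao.java", 18)));

    static List<Access> impactOfDroppingColumn(String tableDotColumn) {
        return ACCESSES.getOrDefault(tableDotColumn, List.of());
    }

    public static void main(String[] args) {
        impactOfDroppingColumn("provider.name").forEach(a ->
                System.out.println("Would need adaptation: " + a.file() + ":" + a.line()));
    }
}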

11.3.3 Visualization

11.3.3.1 Introduction

Software visualization is "the use of [...] computer graphics technology to facilitate both the human understanding and effective use of computer software" [99]. It is a specialization of information visualization [22]. In the eighteenth century, starting with Playfair, the classical methods of plotting data were developed. In 1967, Bertin published "Semiology of Graphics" [17], where he identified the basic elements of diagrams. Later, Tufte published a theory of data graphics that emphasized maximizing the density of useful information [107–109]. Bertin's and Tufte's theories led to the development of the information visualization field, which mainly addresses the issues of how certain types of data should be depicted. The goal of information visualization is to visualize any kind of data. According to Ware [115], visualization is the preferred way of getting acquainted with large data.

Software visualization deals with software in terms of runtime behavior (dynamic visualization) and structure (static visualization). It has been widely used by the reverse engineering and program comprehension research community [18, 36, 65, 103, 104] to uncover and navigate information about software systems. In the more specific field of software evolution, mining software repositories, and software ecosystems, visualization has also proven to be a key technique due to the sheer amount of information that needs to be processed and understood.

Here, we show two foundational approaches in the scientific literature to visualize applications, databases, and their interactions.

11.3.3.2 Example 1: DAHLIA

Data access APIs enable applications to interact with databases. For example, JDBC provides Java APIs for accessing relational databases from Java programs. It allows applications to execute SQL statements and interact with an SQL-compliant database. JDBC is considered a lower-level API with its advantages and disadvantages, for example, clean and simple SQL processing and good performance vs. complexity and Database Management System (DBMS)-specific queries. Higher-level APIs such as Hibernate ORM (Object-Relational Mapping) try to tackle the object-relational impedance mismatch [50]. In return, such mechanisms partially or wholly hide the database interactions and the executed SQL queries. In this context, manually recovering links between the source code and the databases may prove complicated, hindering program comprehension, debugging, and maintenance tasks.


Fig. 11.6 A 3D database city in DAHLIA

Fig. 11.7 Database (left) and code (right) cities side by side in DAHLIA


Overall, DAHLIA is a visualization tool to analyze the database usage of dynamic and heterogeneous systems by showing the links between the source code and the database. It was designed to deal with systems using multiple database access technologies, aiming to support database program co-evolution.

11.3.3.3 Example 2: M3TRICITY

As we saw in the previous example, the city metaphor for visualizing software systems in 3D has been widely explored and has led to various implementations and approaches. We now look at M3TRICITY,13 a code city visualization focusing on data interactions [11, 12]. Data is usually managed using databases, but it is often also simply stored in files of various formats, such as CSV, XML, and JSON. Data files are part of a project's file system and can thus be easily retrieved. However, a database is usually not contained in the file system, and its presence can only be inferred from the source code, which implements the database accesses.

M3TRICITY represents data files in the city and maps simple metrics on their meshes (i.e., their shapes rendered in the visualization). It adds the database to the visualization using the free space of the sky above or the underground below the city. M3TRICITY infers the database schema using SQLInspect [81]. It also collects metrics of the database entities, such as the number of columns of tables or the number of classes accessing them. The city layout uses a history-resistant arrangement proposed by Pfahler et al. [86]; that is, new entities remain at their reserved place throughout the evolution. The resulting view seamlessly integrates data sources into a software city and enables a comprehensive understanding of a system's source code and data.

Figure 11.8 shows a screenshot of MoneyWallet,14 an expense manager Android app with an SQLite database, visualized in M3TRICITY. The software city is in the center, with the database cloud above the city. Information panels present the repository name (top left), the system metrics (top right), the current commit (bottom left), and its commit message (bottom right). The timeline at the bottom depicts the evolution of the project, where one can spot significant changes in the metrics. The evolution can be controlled with the buttons below the city.

Figure 11.9 presents the evolution of the official SwissCovid Android App15 and highlights three revisions. The timeline at the bottom shows regular contributions to data files (blue bubbles in the middle). Indeed, the XML files grow from an initial 10k to 25k lines. Interesting districts can be spotted in the evolution. The Java classes are primarily located on the bottom-left side of the city (1). The robust SecureStorage.java class (2) stands out.

13 M3TRICITY is a web application available online at https://metricity.si.usi.ch/v2.

14 https://github.com/AndreAle94/moneywallet.

15 https://github.com/SwissCovid/swisscovid-app-android.

Fig. 11.8 The main page of M3TRICITY (showing Code-Buildings, DataFile-Cylinders, Binary-Hemispheres, and the database with Table-Cylinders)

Fig. 11.9 The evolution of the SwissCovid Android App in M3TRICITY at three revisions: r3 (f038042, 4 May 2020), r373 (6675702, 30 Jul 2020), and r725 (9c5f1c1, 27 Apr 2021)

This is the encrypted implementation of android.content.SharedPreferences,16 the primary storage implementation with a critical role in the contact tracing app of Switzerland. We can also see the neighborhoods of resource files (3), with tiny PNG and SVG files and folders of smaller layout XMLs. Another noticeable district is a folder with strings.xml files for the various languages (4). The initial version supported three official Swiss languages (Italian, French, and German). As the app evolves, the XMLs grow, and the number of languages increases to 12. Overall, M3TRICITY is an example of adding "data" as a first-class citizen to a widely used software visualization approach, the city metaphor.

16 https://developer.android.com/reference/android/content/SharedPreferences.


11.3.4 Reflections

In this section, we learned about a static analysis approach to identify SQL code smells in the database communication layer of data-intensive applications. We also saw a what-if analysis technique to detect and prevent program inconsistencies under database schema changes. We then discussed two visualization methods to analyze dependencies between the database and the different components of an ecosystem.

Static Analysis  Common mistakes in SQL have attracted the interest of many researchers. Brass et al. worked on automatically detecting logical errors in SQL queries [20] and then extended their work by recognizing common semantic mistakes [21]. Their SQLLint tool automatically identifies such errors in (syntactically correct) SQL statements [44]. There are also books in this area, e.g., The Art of SQL [35], Refactoring SQL Applications [34], and SQL Antipatterns [52]. In the realm of embedded SQL, Christensen et al. proposed a tool (JSA, the Java String Analyzer) to statically extract string expressions from Java code [25]. They also check the syntax of the extracted SQL strings. Wassermann et al. proposed a static string analysis technique to identify possible errors in dynamically generated SQL code [116]. They detect type errors (e.g., concatenating a character to an integer value) in extracted query strings of valid SQL syntax. In a tool demo paper, they present their prototype tool called JDBC Checker [46]. Anderson and Hills studied query construction patterns in PHP [7]. They analyzed query strings embedded in PHP code with the help of the PHP AiR framework. Van den Brink et al. proposed a quality assessment of embedded SQL [110]. They analyze query strings embedded in PL/SQL, COBOL, and Visual Basic programs [111]. Many static techniques deal with embedded queries for SQL injection detection [97]. Their goal is to determine whether a query could be affected by user input. Yeole and Meshram published a survey of these techniques [120]. Marashdeh et al. also surveyed the challenges of detecting SQL injection vulnerabilities [64]. Some papers also tackle SQL fault localization. Clark et al. proposed a dynamic approach to localize SQL faults in database applications [26]. They provide command-SQL tuples to show the SQL statements executed at database interaction points. Delplanque et al. assessed schema quality and detected design smells [32]. Their tool, DBCritics, analyzes PostgreSQL schema dumps and identifies design smells such as missing primary keys or foreign key references. Alvor, by Annamaa et al., can analyze string expressions in Java code [10]. It checks syntactic correctness, semantic correctness, and object availability by comparing the extracted queries against its internal SQL grammar and by checking SQL statements against an actual database.

Visualization  Since the seminal works of Reiss [89] and Young and Munro [121], many have studied 3D approaches to visualize software systems. The software-as-a-city metaphor has been widely explored and led to diverse implementations, such as the Software World approach by Knight et al. [55], the visualization of communicating architectures by Panas et al. [84, 85], Verso by Langelier et al. [56],


CodeCity by Wettel et al. [118, 119], EvoStreets by Steinbrückner and Lewerentz [100], CodeMetropolis by Balogh and Beszedes [15], and VR City by Vincur et al. [114]. Some approaches considered presenting the databases together with the source code, and interestingly, most of them use the city metaphor, such as DAHLIA [69, 70] and M3TRICITY [11, 12]. Zirkelbach and Hasselbring presented RACCOON [122], a visualization approach for database behavior that uses the 3D city metaphor to show the structure of a database based on the concepts of entity-relationship diagrams. Marinescu presented a meta-model for enterprise systems containing object-oriented entities, relational entities, and object-relational interactions [66].

11.4 Empirical Studies

11.4.1 Introduction

In this section, we discuss recent empirical studies relying on large collections of data-intensive software systems. The discussed studies cover three main aspects: (1) the (joint) use of database models and access technologies (Sect. 11.4.2), (2) the quality of the database manipulation code (Sects. 11.4.3 and 11.4.4), and (3) the way this part of the code is tested (Sect. 11.4.5).

11.4.2 The (Joint) Use of Data Models and Technologies

In the last decade, non-relational database technologies (e.g., graph databases, document stores, key-value stores, column-oriented databases) have emerged for specialized purposes. The joint use of database models (i.e., using different database models for various purposes in the same system, such as a key-value store for caching and a relational database for persistence) has increased in popularity, since such multi-database architectures allow developers to combine the benefits of various technologies. However, the side effects on design, querying, and maintenance are not well known.

Benats et al. [16] conducted an empirical study of (multi-)database models in open-source database-dependent projects. They mined 4 years of development history (2017–2020) of 33 million projects by leveraging Libraries.io.17 They identified projects relying on one or several databases and written in popular programming languages (1.3 million projects). After applying filters to eliminate "low-quality" repositories and remove project duplicates, they gathered a dataset of 40,609 projects. They analyzed the dependencies of those projects to assess (1) the popularity of different database models, (2) the extent to which they are combined within the same systems, and (3) how their usage evolved.

17 https://libraries.io/.

Fig. 11.10 Usage of database management systems (2020): number of projects per DBMS (MySQL, PostgreSQL, SQLite, MongoDB, Redis, Memcached, Cassandra, Neo4j, HBase, Couchbase), grouped by database model (Relational, Document, Key-Value, Wide-Column, Graph)

They found that most current database-dependent projects (54.72%) rely on a relational database model, while NoSQL-dependent systems represent 45.28% of the projects. However, the popularity of SQL technologies has recently decreased compared to NoSQL data stores (Fig. 11.10). Regarding programming languages, the authors noticed that Ruby and Python systems are often paired with a PostgreSQL database, while Java and C# projects typically rely on a MySQL database. Data-intensive systems in JavaScript/TypeScript are essentially paired with document-oriented or key-value databases.

The study results confirm the emergence of hybrid data-intensive systems. The authors found the joint use of different database models (e.g., relational and non-relational) in 16% of all database-dependent projects. In particular, they found that more than 56% of systems relying on a key-value database also use another technology, typically relational or document-oriented. Wide-column-dependent systems follow the same pattern, with over 47% being hybrid. This shows the complementary usage of SQL and NoSQL in practice (Fig. 11.11).

The authors then examined the evolution of these systems to identify typical transitions in terms of database models or technologies. They observed that only 1% of the database-dependent projects evolved their data model over time. The majority of those projects (62%) were not hybrid from the start but initially relied on a single database model. In contrast, 19% of those projects became "mono-database" after initially using multiple database models.
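To make the notion of a hybrid data-intensive system more tangible, the sketch below combines a relational database accessed through JDBC with a key-value store used as a cache. It is an invented example: the Jedis client, the products table, and the key naming scheme are assumptions for illustration only. Dependency-based mining, as performed in the study, would classify such a project as hybrid because both a JDBC driver and a Redis client appear among its dependencies.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import redis.clients.jedis.Jedis;

// Hypothetical hybrid data access: a key-value store (Redis) caches values
// that are persisted in a relational database (accessed via JDBC).
class ProductCatalog {
    private final Connection relational; // e.g., a PostgreSQL or MySQL connection
    private final Jedis keyValue;        // e.g., a Redis connection used as a cache

    ProductCatalog(Connection relational, Jedis keyValue) {
        this.relational = relational;
        this.keyValue = keyValue;
    }

    String findProductName(long id) throws SQLException {
        String cached = keyValue.get("product:" + id); // try the cache first
        if (cached != null) {
            return cached;
        }
        try (PreparedStatement ps =
                 relational.prepareStatement("SELECT name FROM products WHERE id = ?")) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return null;
                }
                String name = rs.getString("name");
                keyValue.set("product:" + id, name); // populate the cache
                return name;
            }
        }
    }
}
```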


Fig. 11.11 Distribution of hybrid database-dependent projects

11.4.3 Prevalence, Impact, and Evolution of SQL Bad Smells

Muse et al. [79] investigated the prevalence and evolution of SQL code smells in data-intensive open-source systems. Their study relies on the analysis of 150 open-source software systems that manipulate their databases through popular database access APIs (Android Database API, JDBC, JPA, and Hibernate). The authors analyzed the source code of each project and studied 19 traditional code smells using the DECOR tool [48] and four SQL code smells using SQLInspect [81]. They also collected bug-fixing and bug-inducing commits from each project using PyDriller [98].

They first studied the prevalence of SQL code smells in the selected software systems by categorizing them into four application domains: Business, Library, Multimedia, and Utility. They found that SQL code smells are prevalent in all four domains, with some SQL code smells being more prevalent than others. Then, they investigated the co-occurrence of SQL code smells and traditional code smells using association rule mining. The results show that while some SQL code smells have a statistically significant co-occurrence with traditional code smells, the degree of association is low. Third, they investigated the potential impact of SQL code smells on software bugs by analyzing their co-occurrences within the bug-inducing commits. They performed Cramer's V test of association and built a random forest model to study the impact of the smells on bugs. The analysis results indicate a weak association between SQL code smells and software bugs, with some SQL code smells showing a higher association with bugs than others.

Finally, the authors performed a survival analysis of SQL and traditional code smells using Kaplan-Meier survival curves to compare their survival time. They found that SQL code smells survive longer than traditional code smells. A large fraction of the source files affected by SQL code smells (80.5%) persist throughout all the analyzed snapshots, and they hardly


get any attention from the developers during refactoring. Furthermore, significant portions of the SQL code smells are created at the very beginning and persist in all subsequent versions of the systems. The study shows that SQL code smells persist in the studied data-intensive software systems. Developers should be aware of these smells and consider detecting and refactoring SQL and traditional code smells separately, using dedicated tools.
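As a concrete illustration, the hypothetical data access method below exhibits the implicit columns antipattern from Karwin's catalogue [52] (using SELECT * instead of naming the required columns). The class, table, and column names are invented for the example.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical example of the "implicit columns" SQL smell: SELECT * couples
// the code to the full table schema, so adding, removing, or renaming columns
// in the orders table can silently change what this query transfers and reads.
class OrderReport {
    double totalAmount(Connection con) throws SQLException {
        double total = 0;
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM orders")) { // smell
            while (rs.next()) {
                total += rs.getDouble("amount"); // only one column is actually needed
            }
        }
        // A smell-free variant would request only what is used:
        // SELECT amount FROM orders
        return total;
    }
}
```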

11.4.4 Self-Admitted Technical Debt in Database Access Code

Developers sometimes choose design and implementation shortcuts due to the pressure of tight release schedules. However, shortcuts introduce technical debt that increases as the software evolves. This debt needs to be repaid as quickly as possible to minimize its impact on software development and quality. Sometimes, technical debt is admitted by developers in comments and commit messages; such debt is known as self-admitted technical debt (SATD). In data-intensive systems, where data manipulation is a critical functionality, the presence of SATD in the data access logic can seriously harm performance and maintainability. Understanding the composition and distribution of SATD across software systems, as well as its evolution, can provide insights into managing technical debt efficiently.

Muse et al. [78] conducted a large-scale empirical study on the composition, distribution, and evolution of SATD across data-intensive software systems. The authors analyzed 83 open-source systems relying on relational databases and 19 systems relying on NoSQL databases. They detected SATD in source code comments obtained from different snapshots of the subject systems and conducted a survival analysis to understand the evolutionary dynamics of SATD. They also manually analyzed 361 sample data access SATDs, investigating their composition and the reasons behind their introduction and removal. They identified 15 new SATD categories, out of which 11 are specific to database access operations. They found that most of the data access SATDs are introduced in the later stages of the change history rather than at the beginning. They also discovered that bug fixing and refactoring are the main reasons behind the introduction of data access SATDs.
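To illustrate what a data access SATD instance can look like, the hypothetical method below contains the kind of comment such studies mine from the source code; the wording, class, and table names are invented, but the FIXME/TODO style mirrors the admissions found in real projects.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical self-admitted technical debt in data access code: the comments
// explicitly admit a shortcut taken in the persistence logic.
class AccountDao {
    // FIXME: fetches the whole accounts table and filters in memory as a quick fix.
    // TODO: push the filtering into the SQL query and add an index on status.
    List<String> activeAccountIds(Connection con) throws SQLException {
        List<String> ids = new ArrayList<>();
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, status FROM accounts")) {
            while (rs.next()) {
                if ("ACTIVE".equals(rs.getString("status"))) {
                    ids.add(rs.getString("id"));
                }
            }
        }
        return ids;
    }
}
```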

11.4.5 Database Code Testing (Best) Practices

Software testing allows developers to maintain the quality of a software system over time. The database access code fragments are often neglected in this context, although they require specific attention.

Fig. 11.12 Test coverage rates of non-DB access methods vs. DB access methods (scatter plot per project; R = 0.47, p = 6.1e−08)

Gobert et al. [40] empirically analyzed the challenges and perils of database access code testing. They first mined open-source systems from Libraries.io to find projects relying on database manipulation technologies. They analyzed 6622 projects and found automated tests and database manipulation code in only 332 projects. They further examined the 72 projects for which the tests could be executed and analyzed the corresponding coverage reports. Figure 11.12 shows a scatter plot of the analyzed projects and their respective test coverage rates.

The results show that the database manipulation code was poorly tested: 33% of the projects did not test the DB communication, and 46% did not cover half of their DB methods. Many of the projects with the highest coverage rates in fact reached full coverage; however, such projects typically have few database methods (a mean of 2.8 database methods for projects with full coverage). Slightly fewer than half of the projects in the figure (48.6%) had lower coverage for database methods than for regular methods. However, considering only the projects with at least five database methods (the median value), the difference is more significant: 59% have a smaller coverage for database methods than for regular methods. Similarly, while 46% of all projects cover less than half of their database methods, this number increases to 53% for the projects above the median.

This poor test coverage motivated the authors to understand why developers were holding back from testing DB manipulation code. They conducted a qualitative analysis of 532 questions from popular Stack Exchange websites and identified the problems that hamper developers in writing tests. They built a taxonomy of issues with 83 different problems classified into seven main categories. They found that developers mostly look for insights on general best practices to test database manipulation code. They also ask more technical questions related to DB handling, mocking, parallelization, or framework/tool usage.


In a follow-up study, the same authors analyzed the answers to these questions [41]. They manually labelled the top three highest-ranked answers to each question and built a taxonomy of best practices. Overall, they examined 598 answers to 255 questions, leading to 363 different practices in the taxonomy. The category with the highest number of tags and questions relates to the testing environment, e.g., proposing various tools and configurations. The second most important category is database management, mainly addressing database initialization and clean-up between tests. Other categories include code structure or design guidelines, concepts, performance, processes, test characteristics, test code, and mocking.
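A minimal sketch of the two largest categories, assuming JUnit 5 and an embedded H2 database (common choices for Java projects, though not prescribed by the study): the testing environment is an in-memory database, and the database is created and cleaned up around every test so that tests remain independent.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

// Hypothetical test of database manipulation code. The H2 JDBC driver and
// JUnit 5 are assumed to be on the test classpath.
class UserDaoTest {
    private Connection con;

    @BeforeEach
    void setUp() throws SQLException {
        // Testing environment: an embedded in-memory database per test.
        con = DriverManager.getConnection("jdbc:h2:mem:testdb");
        try (Statement st = con.createStatement()) {
            st.execute("CREATE TABLE users (id BIGINT PRIMARY KEY, name VARCHAR(100))");
            st.execute("INSERT INTO users VALUES (1, 'Ada')");
        }
    }

    @AfterEach
    void tearDown() throws SQLException {
        // Database management: clean up the schema so tests stay independent.
        try (Statement st = con.createStatement()) {
            st.execute("DROP ALL OBJECTS");
        }
        con.close();
    }

    @Test
    void findsUserIdByName() throws SQLException {
        try (PreparedStatement ps =
                 con.prepareStatement("SELECT id FROM users WHERE name = ?")) {
            ps.setString(1, "Ada");
            try (ResultSet rs = ps.executeQuery()) {
                assertTrue(rs.next());
                assertEquals(1L, rs.getLong("id"));
            }
        }
    }
}
```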

11.4.6 Reflections

Studies on Code Quality  Other researchers studied frequent errors and antipatterns in SQL queries. The book of Karwin [52] is the first to present SQL antipatterns in a comprehensive catalogue. Khumnin et al. [54] present a tool for detecting logical database design antipatterns in Transact-SQL queries. Another tool, DbDeo [94], implements the detection of database schema smells. DbDeo has been evaluated on 2925 open-source repositories; its authors identified 13 different types of smells, among which "index abuse" was the most prevalent. De Almeida Filho et al. [30] investigate the prevalence and co-occurrence of SQL code smells in PL/SQL projects. Arzamasova et al. propose to detect antipatterns in SQL logs [13] and demonstrate their approach by refactoring a project containing more than 40 million queries. Shao et al. [93] identified a list of database-access performance antipatterns, mainly in PHP web applications. Integrity violations were addressed by Li et al. [58], who identified constraints from source code and related them to database attributes.

Studies on Evolution  Several researchers studied how data-intensive systems relying on a relational database evolve. Curino et al. [29] analyze the evolution history of the Wikipedia database schema to extract both a micro-classification and a macro-classification of schema changes. An evolution period of 4 years was considered, corresponding to 171 successive schema versions. In addition to a schema evolution statistics extractor, the authors propose a tool operating on the differences between subsequent schema versions to semi-automatically extract the schema changes that have possibly been applied. The results motivate the need for automated schema evolution support. Vassiliadis et al. [112, 113] study the evolution of individual database tables over time in eight different software


systems. They observe that evolution-related properties, such as the possibility of deletion or the updates a table undergoes, are related to observable table properties, such as the number of attributes or the time of birth of a table. Through a large-scale study on the evolution of databases, they also show that the essence of Lehman's laws of software evolution remains valid in the context of database schema evolution [96], but that specific mechanics significantly differ from source code evolution. Dimolikas et al. [33] analyze the evolution history of six database schemas and reveal that the update behavior of tables depends on their topological complexity. Cleve et al. [27] show that mining database schema evolution can have a significant informative value in reverse engineering. They introduce the concept of a global historical schema, an aggregated schema of all successive schema versions. They then analyze this global schema to better understand the current version of the database structure, intending to facilitate its evolution. Lin et al. [60] study the collateral evolution of applications and databases, in which the evolution of an application is separated from the evolution of its persistent data, or the database. They investigated how application programs and database management systems in popular open-source systems (Mozilla, Monotone) cope with database schema and format changes. They observed that collateral evolution could lead to potential problems. The number of schema changes reported is minimal: in Mozilla, 20 table creations and four table deletions are reported in 4 years, and during 6 years of Monotone schema evolution, only nine tables were added, while eight were deleted. Qiu et al. [87] present a large-scale empirical study on ten popular database applications from various domains to analyze how schemas and application code co-evolve. They study the evolution histories of the repositories to understand whether database schemas evolve frequently and significantly, and how schemas evolve and impact the application code. Their analysis estimates the impact of a database schema change on the code, using a simple difference calculation of the source lines changed between two versions. Goeminne et al. [42] study the co-evolution between code-related and database-related activities in data-intensive systems combining several ways to access the database (native SQL queries and Object-Relational Mapping). They empirically analyze the evolution of SQL, Hibernate, and JPA usage in a large and complex open-source information system. They observed that using embedded SQL queries was still a common practice.

Other studies exclusively focus on NoSQL applications. Störl et al. [105] investigated the advantages of using object mapper libraries when accessing NoSQL data stores. They give an overview of Object-NoSQL Mappers (ONMs) and Object-Relational Mappers with NoSQL support. As they point out, building applications against the native interfaces of NoSQL data stores creates technical lock-in due to the lack of standardized query languages. Therefore, developers often turn to object mapper libraries as an extra level of abstraction. Scherzinger et al. [92] studied how software engineers design and evolve their domain model when building NoSQL applications by analyzing the denormalized character of ten open-source Java applications relying on object mappers. They observed the growth in complexity of the NoSQL


schemas and common evolution operations across the projects. The study also shows that software releases include considerably more schema-relevant changes: over 30%, compared to 2% with relational databases. Ringlstetter et al. [90] examined how evolution annotations of NoSQL object mappers are used. They found that only 5.6% of 900 open-source Java projects using Morphia or Objectify used such annotations to evolve the data model or migrate the data.

Studies on Technical Debt  Several studies are related to technical debt in data-intensive systems. Albarak and Bahsoon [3] propose a taxonomy of debts related to the conceptual, logical, and physical design of a database. For example, they claim that ill-normalized databases (i.e., databases with tables below the fourth normal form) can also be considered technical debt [4]. To tackle this type of debt, they propose to prioritize the tables that should be normalized. Foidl et al. propose a conceptual model to outline where technical debt can emerge in data-intensive systems by separating them into three parts: software systems, data storage systems, and data [38]. They show that those three parts can further affect each other, and they present two smells as illustrations: missing constraints, when referential integrity constraints are not declared in a database schema, and metadata as data, when an entity-attribute-value pattern is used to store metadata (attributes) as data. Weber et al. [117] identified relational database schemas as potential sources of technical debt. They provide a first attempt at utilizing the technical debt analogy for describing the missing implementation of implicit foreign key (FK) constraints. They discuss the detection of missing FKs, propose a measurement for the associated technical debt (TD), and outline a process for reducing FK-related TD. As an illustrative case study, they consider OSCAR, which was also used to demonstrate the static analysis approach in Sect. 11.2.2. Ramasubbu and Kemerer [88] empirically analyze the impact of technical debt on system reliability by observing a 10-year life cycle of a commercial enterprise system. They examine the relative effects of modular and architectural maintenance activities in clients and conclude that technical debt decreases the reliability of enterprise systems. They also add that modular maintenance targeted to reduce technical debt is about 53% more effective than architectural maintenance in reducing the probability of a system failure due to client errors.

11.5 Conclusion

This chapter summarized the recent research efforts devoted to mining, analyzing, and evolving data-intensive software ecosystems. We have argued that:

1. Both the databases and the programs are essential ecosystem artifacts.
2. Mining, analyzing, and visualizing what the programs are doing on the data may considerably help in understanding the system in general and the databases in particular.


3. Database interactions may suffer from quality problems and technical debt and should be better tested.
4. Software evolution methods should devote more attention to the program-database co-evolution problem.

The research community will face many challenges and open questions in the near future, given the increasing complexity and heterogeneity of data-intensive software ecosystems. For instance, to fully embrace the DevOps movement, developers need better support for database-related analyses and evolutions at runtime [31]. This is the case, in particular, when developing microservice applications deployed on distributed computing architectures such as the cloud-edge continuum [75]. Furthermore, machine learning techniques, such as those described in Chap. 10, open the door to novel recommenders helping developers to design, understand, evolve, test, and improve the performance of modern data-intensive systems.

References 1. Abdelhedi, F., Brahim, A., Rajhi, H., Ferhat, R., Zurfluh, G.: Automatic extraction of a document-oriented NoSQL schema. In: Int. Conf. Enterprise Information Systems (2021) 2. Afonso, A., da Silva, A., Conte, T., Martins, P., Cavalcanti, J., Garcia, A.: LESSQL: dealing with database schema changes in continuous deployment. In: IEEE 27th Int. Conf. Software Analysis, Evolution and Reengineering (SANER 2020), pp. 138–148 (2020). https://doi.org/ 10.1109/SANER48275.2020.9054796 3. Albarak, M., Bahsoon, R.: Database design debts through examining schema evolution. In: International Workshop on Managing Technical Debt (MTD), pp. 17–23 (2016). https://doi. org/10.1109/MTD.2016.9 4. Albarak, M., Bahsoon, R.: Prioritizing technical debt in database normalization using portfolio theory and data quality metrics. In: International Conference on Technical Debt (TechDebt), pp. 31–40. ACM, New York (2018). https://doi.org/10.1145/3194164.3194170 5. Alger, K.W., Daniel Coupal, D.: Building with patterns: the polymorphic pattern (2022). https://www.mongodb.com/developer/products/mongodb/polymorphic-pattern/. Accessed 15 Apr 2023 6. Anderson, D.: Modeling and analysis of SQL queries in PHP systems. Master’s thesis, East Carolina University (2018) 7. Anderson, D., Hills, M.: Query construction patterns in PHP. In: International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 452–456 (2017). https://doi. org/10.1109/SANER.2017.7884652 8. Anderson, D., Hills, M.: Supporting analysis of SQL queries in PHP AiR. In: International Working Conference on Source Code Analysis and Manipulation (SCAM), pp. 153–158 (2017). https://doi.org/10.1109/SCAM.2017.23 9. Andreasen, E., Møller, A.: Determinacy in static analysis for jQuery. In: Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), pp. 17– 31 (2014). https://doi.org/10.1145/2660193.2660214 10. Annamaa, A., Breslav, A., Kabanov, J., Vene, V.: An interactive tool for analyzing embedded SQL queries. In: Asian Symposium on Programming Languages and Systems (APLAS). Lecture Notes in Computer Science, vol. 6461, pp. 131–138. Springer, Berlin (2010)


11. Ardigò, S., Nagy, C., Minelli, R., Lanza, M.: Visualizing data in software cities. In: Working Conference on Software Visualization (VISSOFT), NIER/TD, pp. 145–149. IEEE, Piscataway (2021). https://doi.org/10.1109/VISSOFT52517.2021.00028 12. Ardigò, S., Nagy, C., Minelli, R., Lanza, M.: M3triCity: visualizing evolving software & data cities. In: International Conference on Software Engineering (ICSE), pp. 130–133. IEEE, Piscataway (2022). https://doi.org/10.1145/3510454.3516831 13. Arzamasova, N., Schäler, M., Böhm, K.: Cleaning antipatterns in an SQL query log. Trans. Knowl. Data Eng. 30(3), 421–434 (2018) 14. Baazizi, M.A., Colazzo, D., Ghelli, G., Sartiani, C.: Parametric schema inference for massive JSON datasets. VLDB J. 28(4), 497–521 (2019). https://doi.org/10.1007/s00778-018-0532-7 15. Balogh, G., Beszedes, A.: CodeMetropolis—code visualisation in MineCraft. In: International Working Conference on Source Code Analysis and Manipulation (SCAM), pp. 136–141. IEEE, Piscataway (2013) 16. Benats, P., Gobert, M., Meurice, L., Nagy, C., Cleve, A.: An empirical study of (multi) database models in open-source projects. In: International Conference on Conceptual Modeling (ER), pp. 87–101. Springer, Berlin (2021) 17. Bertin, J.: Graphische Semiologie, 2nd edn. Walter de Gruyter (1974) 18. Beyer, D., Lewerentz, C.: CrocoPat: a tool for efficient pattern recognition in large objectoriented programs. Tech. Rep. I-04/2003, Institute of Computer Science, Brandenburgische Technische Universität Cottbus (2003) 19. Brahim, A.A., Ferhat, R.T., Zurfluh, G.: Model driven extraction of NoSQL databases schema: Case of MongodB. In: Int. Joint Conf. on Knowledge Discovery, Knowledge Engineering and Knowledge Management, pp. 145–154 (2019). https://doi.org/10.5220/ 0008176201450154 20. Brass, S., Goldberg, C.: Detecting logical errors in SQL queries. In: Workshop on Foundations of Databases (2004) 21. Brass, S., Goldberg, C.: Semantic errors in SQL queries: A quite complete list. J. Syst. Softw. 79(5), 630–644 (2006). https://doi.org/10.1016/j.jss.2005.06.028 22. Card, S.K., Mackinlay, J.D., Shneiderman, B. (eds.): Readings in Information Visualization— Using Vision to Think. Morgan Kaufmann (1999) 23. Chen, T.H., Shang, W., Jiang, Z.M., Hassan, A.E., Nasser, M., Flora, P.: Detecting performance anti-patterns for applications developed using object-relational mapping. In: International Conference on Software Engineering (ICSE), pp. 1001–1012. ACM, New York (2014). https://doi.org/10.1145/2568225.2568259 24. Cherry, B., Benats, P., Gobert, M., Meurice, L., Nagy, C., Cleve, A.: Static analysis of database accesses in mongodb applications. In: International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 930–934. IEEE, Piscataway (2022). https://doi. org/10.1109/SANER2022.2022.00111 25. Christensen, A.S., Møller, A., Schwartzbach, M.I.: Precise analysis of string expressions. In: International Conference on Static Analysis (SAS), pp. 1–18. Springer, Berlin (2003) 26. Clark, S.R., Cobb, J., Kapfhammer, G.M., Jones, J.A., Harrold, M.J.: Localizing SQL faults in database applications. In: International Conference on Automated Software Engineering (ASE), pp. 213–222. ACM, New York (2011). https://doi.org/10.1109/ASE.2011.6100056 27. Cleve, A., Gobert, M., Meurice, L., Maes, J., Weber, J.: Understanding database schema evolution: a case study. Sci. Comput. Program. 97, 113–121 (2015). https://doi.org/10.1016/ j.scico.2013.11.025. 
Special Issue on New Ideas and Emerging Results in Understanding Software 28. Cleve, A., Hainaut, J.L.: Dynamic analysis of SQL statements for data-intensive applications reverse engineering. In: Working Conference on Reverse Engineering (WCRE), pp. 192–196 (2008). https://doi.org/10.1109/WCRE.2008.38 29. Curino, C.A., Tanca, L., Moon, H.J., Zaniolo, C.: Schema evolution in Wikipedia: toward a web information system benchmark. In: International Conference on Enterprise Information Systems (ICEIS) (2008)


30. de Almeida Filho, F.G., Martins, A.D.F., Vinuto, T.d.S., Monteiro, J.M., de Sousa, Í.P., de Castro Machado, J., Rocha, L.S.: Prevalence of bad smells in PL/SQL projects. In: International Conference on Program Comprehension (ICPC), pp. 116–121. IEEE, Piscataway (2019) 31. de Jong, M., van Deursen, A., Cleve, A.: Zero-downtime sql database schema evolution for continuous deployment. In: International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP), pp. 143–152. IEEE, Piscataway (2017). https:// doi.org/10.1109/ICSE-SEIP.2017.5 32. Delplanque, J., Etien, A., Auverlot, O., Mens, T., Anquetil, N., Ducasse, S.: Codecritics applied to database schema: challenges and first results. In: International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 432–436 (2017). https://doi. org/10.1109/SANER.2017.7884648 33. Dimolikas, K., Zarras, A.V., Vassiliadis, P.: A study on the effect of a table’s involvement in foreign keys to its schema evolution. In: International Conference on Conceptual Modeling (ER), pp. 456–470. Springer, Berlin (2020) 34. Faroult, S., L’Hermite, P.: Refactoring SQL Applications. O’Reilly Media (2008) 35. Faroult, S., Robson, P.: The Art of SQL. O’Reilly Media (2006) 36. Favre, J.M.: GSEE: a generic software exploration environment. In: International Workshop on Program Comprehension (ICPC), pp. 233–244. IEEE, Piscataway (2001) 37. Feldthaus, A., Schäfer, M., Sridharan, M., Dolby, J., Tip, F.: Efficient construction of approximate call graphs for JavaScript IDE services. In: International Conference on Software Engineering (ICSE), pp. 752–761. IEEE, Piscataway (2013) 38. Foidl, H., Felderer, M., Biffl, S.: Technical debt in data-intensive software systems. In: 45th Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2019), pp. 338–341 (2019). https://doi.org/10.1109/SEAA.2019.00058 39. Gallinucci, E., Golfarelli, M., Rizzi, S.: Schema profiling of document-oriented databases. Inf. Syst. 75, 13–25 (2018). https://doi.org/10.1016/j.is.2018.02.007 40. Gobert, M., Nagy, C., Rocha, H., Demeyer, S., Cleve, A.: Challenges and perils of testing database manipulation code. In: International Conference on Advanced Information Systems Engineering (CAiSE), pp. 229–245. Springer, Berlin (2021) 41. Gobert, M., Nagy, C., Rocha, H., Demeyer, S., Cleve, A.: Best practices of testing database manipulation code. Inform. Syst. 111, 102105 (2023). https://doi.org/10.1016/j.is.2022. 102105 42. Goeminne, M., Decan, A., Mens, T.: Co-evolving code-related and database-related changes in a data-intensive software system. In: Software Evolution Week (CSMR/WCRE) (2014) 43. Goeminne, M., Mens, T.: Towards a survival analysis of database framework usage in Java projects. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 551–555. IEEE, Piscataway (2015). https://doi.org/10.1109/ICSM.2015.7332512 44. Goldberg, C.: Do you know SQL? About semantic errors in database queries. Tech. rep., Higher Education Academy (2008) 45. Gómez, P., Casallas, R., Roncancio, C.: Automatic schema generation for document-oriented systems. In: Database and Expert Systems Applications, pp. 152–163. Springer, Berlin (2020) 46. Gould, C., Su, Z., Devanbu, P.: JDBC Checker: A static analysis tool for SQL/JDBC applications. In: International Conference on Software Engineering (ICSE), pp. 697–698 (2004). https://doi.org/10.1109/ICSE.2004.1317494 47. 
Gould, C., Su, Z., Devanbu, P.: Static checking of dynamically generated queries in database applications. In: International Conference on Software Engineering (ICSE), pp. 645–654 (2004). https://doi.org/10.1109/ICSE.2004.1317486 48. Guéhéneuc, Y.G.: Ptidej: A flexible reverse engineering tool suite. In: 2007 IEEE International Conference on Software Maintenance, pp. 529–530. IEEE, Piscataway (2007) 49. Imam, A.A., Basri, S., Ahmad, R., Watada, J., González-Aparicio, M.T.: Automatic schema suggestion model for NoSQL document-stores databases. J. Big Data 5 (2018) 50. Ireland, C., Bowers, D., Newton, M., Waugh, K.: A classification of object-relational impedance mismatch. In: International Conference on Advances in Databases, Knowledge, and Data Applications, pp. 36–43 (2009). https://doi.org/10.1109/DBKDA.2009.11


51. Jensen, S.H., Møller, A., Thiemann, P.: Type analysis for JavaScript. In: Static Analysis, pp. 238–255. Springer, Berlin (2009) 52. Karwin, B.: SQL Antipatterns: Avoiding the Pitfalls of Database Programming. Pragmatic Programmers (2010) 53. Kashyap, V., Dewey, K., Kuefner, E.A., Wagner, J., Gibbons, K., Sarracino, J., Wiedermann, B., Hardekopf, B.: JSAI: a static analysis platform for JavaScript. In: International Symposium on Foundations of Software Engineering (FSE), pp. 121–132. ACM, New York (2014). https://doi.org/10.1145/2635868.2635904 54. Khumnin, P., Senivongse, T.: SQL antipatterns detection and database refactoring process. In: ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD), pp. 199–205 (2017). https://doi.org/10.1109/ SNPD.2017.8022723 55. Knight, C., Munro, M.C.: Virtual but visible software. In: International Conference on Information Visualization (IV), pp. 198–205. IEEE, Piscataway (2000) 56. Langelier, G., Sahraoui, H., Poulin, P.: Visualization-based analysis of quality for large-scale software systems. In: International Conference on Automated Software Engineering (ASE), pp. 214–223. ACM, New York (2005) 57. Lehman, M.M.: Laws of software evolution revisited. In: C. Montangero (ed.) Software Process Technology, pp. 108–124. Springer, Berlin (1996) 58. Li, B., Poshyvanyk, D., Grechanik, M.: Automatically detecting integrity violations in database-centric applications. In: International Conference on Program Comprehension (ICPC), pp. 251–262 (2017). https://doi.org/10.1109/ICPC.2017.37 59. Li, D., Lyu, Y., Wan, M., Halfond, W.G.J.: String analysis for java and android applications. In: Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 661–672. ACM, New York (2015). https://doi.org/10. 1145/2786805.2786879 60. Lin, D.Y., Neamtiu, I.: Collateral evolution of applications and databases. In: Joint Int’l Workshop on Principles of software evolution and ERCIM software evolution workshop, pp. 31–40. ACM, New York (2009). https://doi.org/10.1145/1595808.1595817 61. Lyu, Y., Gui, J., Wan, M., Halfond, W.G.J.: An empirical study of local database usage in Android applications. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 444–455 (2017). https://doi.org/10.1109/ICSME.2017.75 62. Madsen, M., Møller, A.: Sparse dataflow analysis with pointers and reachability. In: Static Analysis, pp. 201–218. Springer, Berlin (2014) 63. Manousis, P., Zarras, A., Vassiliadis, P., Papastefanatos, G.: Extraction of embedded queries via static analysis of host code. In: Advanced Information Systems Engineering (CAiSE), pp. 511–526. Springer, Berlin (2017) 64. Marashdeh, Z., Suwais, K., Alia, M.: A survey on SQl injection attack: detection and challenges. In: International Conference on Information Technology (ICIT), pp. 957–962 (2021). https://doi.org/10.1109/ICIT52682.2021.9491117 65. Marcus, A., Feng, L., Maletic, J.I.: 3D representations for software visualization. In: ACM Symposium on Software Visualization, p. 27. IEEE, Piscataway (2003) 66. Marinescu, C.: Applications of automated model’s extraction in enterprise systems. In: International Conference on Software Technologies (ICSOFT), pp. 254–261. SCITEPRESS (2019) 67. Maule, A., Emmerich, W., Rosenblum, D.: Impact analysis of database schema changes. In: International Conference on Software Engineering (ICSE), pp. 451–460 (2008). https://doi. 
org/10.1145/1368088.1368150 68. McKnight: NoSQL Evaluator’s Guide (2014) 69. Meurice, L., Cleve, A.: DAHLIA: a visual analyzer of database schema evolution. In: Software Evolution Week—IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE), pp. 464–468. IEEE, Piscataway (2014). https://doi. org/10.1109/CSMR-WCRE.2014.6747219


70. Meurice, L., Cleve, A.: DAHLIA 2.0: A visual analyzer of database usage in dynamic and heterogeneous systems. In: Working Conference on Software Visualization (VISSOFT), pp. 76–80. IEEE, Piscataway (2016). https://doi.org/10.1109/VISSOFT.2016.15 71. Meurice, L., Cleve, A.: Supporting schema evolution in schema-less NoSQL data stores. In: International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 457–461 (2017). https://doi.org/10.1109/SANER.2017.7884653 72. Meurice, L., Goeminne, M., Mens, T., Nagy, C., Decan, A., Cleve, A.: Analyzing the evolution of database usage in data-intensive software systems. In: Software Technology: 10 Years of Innovation, pp. 208–240. Wiley, London (2018). https://doi.org/10.1002/9781119174240. ch12 73. Meurice, L., Nagy, C., Cleve, A.: Detecting and preventing program inconsistencies under database schema evolution. In: International Conference on Software Quality, Reliability & Security (QRS), pp. 262–273. IEEE, Piscataway (2016). https://doi.org/10.1109/QRS.2016. 38 74. Meurice, L., Nagy, C., Cleve, A.: Static analysis of dynamic database usage in Java systems. In: International Conference on Advanced Information Systems Engineering (CAiSE), pp. 491–506. Springer, Berlin (2016). https://doi.org/10.1007/978-3-319-39696-5%5C_30 75. Milojicic, D.: The edge-to-cloud continuum. IEEE Ann. History Comput. 53, 16–25 (2020) 76. Mior, M.J.: Automated schema design for NoSQL databases. In: 2014 SIGMOD PhD Symposium, pp. 41–45. ACM, New York (2014). https://doi.org/10.1145/2602622.2602624 77. Mori, M., Noughi, N., Cleve, A.: Mining SQL execution traces for data manipulation behavior recovery. In: International Conference on Advanced Information Systems Engineering (CAiSE) (2014) 78. Muse, B.A., Nagy, C., Cleve, A., Khomh, F., Antoniol, G.: FIXME: synchronize with database an empirical study of data access self-admitted technical debt. Empirical Softw. Eng. 27(6) (2022). https://doi.org/10.1007/s10664-022-10119-4 79. Muse, B.A., Rahman, M., Nagy, C., Cleve, A., Khomh, F., Antoniol, G.: On the prevalence, impact, and evolution of SQL code smells in data-intensive systems. In: International Conference on Mining Software Repositories (MSR), pp. 327–338. ACM, New York (2020). https://doi.org/10.1145/3379597.3387467 80. Nagy, C., Cleve, A.: Static code smell detection in SQL queries embedded in Java code. In: International Working Conference on Source Code Analysis and Manipulation (SCAM), pp. 147–152. IEEE, Piscataway (2017). https://doi.org/10.1109/SCAM.2017.19 81. Nagy, C., Cleve, A.: SQLInspect: A static analyzer to inspect database usage in Java applications. In: International Conference on Software Engineering (ICSE), pp. 93–96 (2018). https://doi.org/10.1145/3183440.3183496 82. Ngo, M.N., Tan, H.B.K.: Applying static analysis for automated extraction of database interactions in web applications. Inform. Softw. Technol. 50(3), 160–175 (2008). https://doi. org/10.1016/j.infsof.2006.11.005 83. Noughi, N., Mori, M., Meurice, L., Cleve, A.: Understanding the database manipulation behavior of programs. In: International Conference on Program Comprehension (ICPC), pp. 64–67. ACM, New York (2014). https://doi.org/10.1145/2597008.2597790 84. Panas, T., Berrigan, R., Grundy, J.: A 3D metaphor for software production visualization. In: IV 2003, p. 314. IEEE Computer Society (2003) 85. Panas, T., Epperly, T., Quinlan, D., Saebjornsen, A., Vuduc, R.: Communicating software architecture using a unified single-view visualization. 
In: International Conference on Engineering Complex Computer Systems (ECCS), pp. 217–228. IEEE, Piscataway (2007) 86. Pfahler, F., Minelli, R., Nagy, C., Lanza, M.: Visualizing evolving software cities. In: Working Conference on Software Visualization (VISSOFT), pp. 22–26 (2020). https://doi.org/10.1109/ VISSOFT51673.2020.00007 87. Qiu, D., Li, B., Su, Z.: An empirical analysis of the co-evolution of schema and code in database applications. In: Joint Meeting on Foundations of Software Engineering (ESEC/FSE), pp. 125–135. ACM, New York (2013). https://doi.org/10.1145/2491411. 2491431


88. Ramasubbu, N., Kemerer, C.F.: Technical debt and the reliability of enterprise software systems: a competing risks analysis. Manag. Sci. 62(5), 1487–1510 (2016). https://doi.org/ 10.1287/mnsc.2015.2196 89. Reiss, S.P.: An engine for the 3D visualization of program information. J. Vis. Lang. Comput. 6(3), 299–323 (1995) 90. Ringlstetter, A., Scherzinger, S., Bissyandé, T.F.: Data model evolution using object-NoSQL mappers: folklore or state-of-the-art? In: International Workshop on BIG Data Software Engineering, pp. 33–36 (2016) 91. Scherzinger, S., De Almeida, E.C., Ickert, F., Del Fabro, M.D.: On the necessity of model checking NoSQL database schemas when building SaaS applications. In: International Workshop on Testing the Cloud (TTC). ACM, New York (2013) 92. Scherzinger, S., Sidortschuck, S.: An empirical study on the design and evolution of NoSQL database schemas. In: International Conference on Conceptual Modeling (ER), pp. 441–455. Springer, Berlin (2020) 93. Shao, S., Qiu, Z., Yu, X., Yang, W., Jin, G., Xie, T., Wu, X.: Database-access performance antipatterns in database-backed web applications. In: International Conference on Software Maintenance and Evolution (ICSME), pp. 58–69. IEEE, Piscataway (2020). https://doi.org/ 10.1109/ICSME46990.2020.00016 94. Sharma, T., Fragkoulis, M., Rizou, S., Bruntink, M., Spinellis, D.: Smelly relations: measuring and understanding database schema quality. In: International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 55–64. ACM, New York (2018). https://doi.org/10.1145/3183519.3183529 95. Sjøberg, D.: Quantifying schema evolution. Inform. Softw. Technol. 35(1), 35–44 (1993). https://doi.org/10.1016/0950-5849(93)90027-Z 96. Skoulis, I., Vassiliadis, P., Zarras, A.: Open-source databases: within, outside, or beyond Lehman’s laws of software evolution? In: International Conference on Advanced Information Systems Engineering (CAiSE). LNCS, vol. 8484, pp. 379–393. Springer, Berlin (2014). https://doi.org/10.1007/978-3-319-07881-6%5C_26 97. Sonoda, M., Matsuda, T., Koizumi, D., Hirasawa, S.: On automatic detection of SQL injection attacks by the feature extraction of the single character. In: International Conference on Security of Information and Networks (SIN), pp. 81–86. ACM, New York (2011). https:// doi.org/10.1145/2070425.2070440 98. Spadini, D., Aniche, M., Bacchelli, A.: PyDriller: Python framework for mining software repositories. In: Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 908–911. ACM (2018). https://doi. org/10.1145/3236024.3264598 99. Stasko, J.T., Brown, M.H., Domingue, J.B., Price, B.A.: Software Visualization— Programming as a Multimedia Experience. MIT Press, Cambridge (1998) 100. Steinbrückner, F., Lewerentz, C.: Representing development history in software cities. In: International Symposium on Software Visualization, pp. 193–202. ACM, New York (2010). https://doi.org/10.1145/1879211.1879239 101. Stonebraker, M., Deng, D., Brodie, M.L.: Database decay and how to avoid it. In: Proc. Big Data, pp. 7–16 (2016). https://doi.org/10.1109/BigData.2016.7840584 102. Stonebraker, M., Deng, D., Brodie, M.L.: Application-database co-evolution: a new design and development paradigm. In: New England Database Day (2017) 103. Storey, M.A., Best, C., Michaud, J.: SHriMP views: an interactive and customizable environment for software exploration. In: International Workshop on Program Comprehension (IWPC) (2001) 104. 
Storey, M.A., Wong, K., Müller, H.: How do program understanding tools affect how programmers understand programs? In: Working Conference on Reverse Engineering (WCRE), pp. 12–21. IEEE, Piscataway (1997) 105. Störl, U., Klettke, M., Scherzinger, S.: NoSQL schema evolution and data migration: Stateof-the-art and opportunities. In: International Conference on Extending Database Technology (EDBT), pp. 655–658 (2020). https://doi.org/10.5441/002/edbt.2020.87


106. Sun, K., Ryu, S.: Analysis of JavaScript programs: challenges and research trends. ACM Comput. Surv. 50(4) (2017) 107. Tufte, E.: Envisioning Information. Graphics Press (1990) 108. Tufte, E.: Visual Explanations. Graphics Press (1997) 109. Tufte, E.: The Visual Display of Quantitative Information, 2nd edn. Graphics Press (2001) 110. van den Brink, H., van der Leek, R., Visser, J.: Quality assessment for embedded SQL. In: International Working Conference on Source Code Analysis and Manipulation (SCAM), pp. 163–170 (2007). https://doi.org/10.1109/SCAM.2007.23 111. Van Den Brink, H.J., van der Leek, R.: Quality metrics for SQL queries embedded in host languages. In: European Conference on Software Maintenance and Reengineering (CSMR) (2007) 112. Vassiliadis, P., Zarras, A.V., Skoulis, I.: How is life for a table in an evolving relational schema? Birth, death and everything in between. In: International Conference on Conceptual Modeling (ER), pp. 453–466. Springer, Berlin (2015) 113. Vassiliadis, P., Zarras, A.V., Skoulis, I.: Gravitating to rigidity: Patterns of schema evolution, and its absence in the lives of tables. Inform. Syst. 63, 24–46 (2017). https://doi.org/10.1016/ j.is.2016.06.010 114. Vincur, J., Navrat, P., Polasek, I.: VR City: Software analysis in virtual reality environment. In: Int. Conf. Software Quality, Reliability and Security, pp. 509–516. IEEE, Piscataway (2017). https://doi.org/10.1109/QRS-C.2017.88 115. Ware, C.: Information Visualization: Perception for Design, 2nd edn. Morgan Kaufmann (2004) 116. Wassermann, G., Gould, C., Su, Z., Devanbu, P.: Static checking of dynamically generated queries in database applications. Trans. Softw. Eng. Methodol. 16(4), 14 (2007). https://doi. org/10.1145/1276933.1276935 117. Weber, J.H., Cleve, A., Meurice, L., Bermudez Ruiz, F.J.: Managing technical debt in database schemas of critical software. In: International Workshop on Managing Technical Debt, pp. 43–46 (2014). https://doi.org/10.1109/MTD.2014.17 118. Wettel, R., Lanza, M.: Visualizing software systems as cities. In: International Workshop on Visualizing Software for Understanding and Analysis (VISSOFT), pp. 92–99. IEEE, Piscataway (2007) 119. Wettel, R., Lanza, M.: CodeCity: 3D visualization of large-scale software. In: International Conference on Software Engineering (ICSE), pp. 921–922. ACM, New York (2008) 120. Yeole, A.S., Meshram, B.B.: Analysis of different technique for detection of SQL injection. In: International Conference & Workshop on Emerging Trends in Technology (ICWET), pp. 963–966. ACM, New York (2011). https://doi.org/10.1145/1980022.1980229 121. Young, P., Munro, M.: Visualising software in virtual reality. In: International Workshop on Program Comprehension (IWPC), pp. 19–26. IEEE, Piscataway (1998) 122. Zirkelbach, C., Hasselbring, W.: Live visualization of database behavior for large software landscapes: the RACCOON approach. Tech. rep., Department of Computer Science, Kiel University (2019)