
Studies in Computational Intelligence 1099

Frederik Rehbach

Enhancing Surrogate-Based Optimization Through Parallelization

Studies in Computational Intelligence Volume 1099

Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

Frederik Rehbach

Enhancing Surrogate-Based Optimization Through Parallelization

Frederik Rehbach
Fakultät Informatik
TU Dortmund University
Köln, Germany

ISSN 1860-949X ISSN 1860-9503 (electronic)
Studies in Computational Intelligence
ISBN 978-3-031-30608-2 ISBN 978-3-031-30609-9 (eBook)
https://doi.org/10.1007/978-3-031-30609-9

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Acknowledgments

This work would not have been possible without the help of many others. First of all, I would like to express many thanks to my supervisor Prof. Dr. Thomas Bartz-Beielstein for his limitless support over many years. He first introduced me to my later passion in the field of optimization and without him I would have never taken up the chance to do a doctorate in this field. I would like to thank my supervisor in Dortmund, Prof. Dr. Günter Rudolph, for many fruitful discussions, for valuable feedback on my work, and for pointing my research in the correct direction whenever needed. A special thanks to my colleagues who guided and supported me so much, especially in the first years: Andreas, Boris, Jörg, and Martin. Additionally, I would like to thank all of my great colleagues from the institute for five amazing years of teamwork, research, and lots of fun: Aleksandr, Alexander, Beate, Dani, Dietlind, Dimitri, Elmar, Julia, Jürgen, Konstantin, Lorenzo, Margarita, Martina, Olaf, Patrick, Pham, Sebastian, Sowmya, Steffen, and Svetlana. I would like to give a very big thanks to my family for all their support: my wife Abir, my parents Stephanie and Winfried, and my brothers Christian and Julian. Last but for sure not least, I would like to thank my friend Tobias for quite a few motivational talks and hikes.

Contents

1 Introduction
  1.1 Motivation
  1.2 Research Questions
  1.3 Outline
  1.4 Publications
  References

2 Background
  2.1 Surrogate-Based Optimization
  2.2 Evolutionary Algorithms
  2.3 A Taxonomy for Parallel SBO
    2.3.1 Parallel Objective Function Evaluation (L1)
    2.3.2 Parallel Model Building (L2)
    2.3.3 Parallel Evaluation Proposals (L3)
    2.3.4 Multi-algorithm Approaches (L4)
    2.3.5 Recommendations for Practitioners
  2.4 Parallel SBO—A Literature Review
  References

3 Methods/Contributions
  3.1 Benchmarking Parallel SBO Algorithms
    3.1.1 A Framework for Parallel SBO Algorithms
    3.1.2 Conclusions
  3.2 Test Problems
    3.2.1 Simulation Based Functions
    3.2.2 Experiments
    3.2.3 Results
    3.2.4 Conclusions
  3.3 Why Not Other Parallel Algorithms?
    3.3.1 A Hybrid SBO Baseline
    3.3.2 Experiments
    3.3.3 Results
    3.3.4 Conclusions
  3.4 Parallelization as Hyper-Parameter Selection
    3.4.1 Multi-config SBO
    3.4.2 Experiments
    3.4.3 Results
    3.4.4 Conclusions
  3.5 Exploration Versus Exploitation
    3.5.1 Experiments
    3.5.2 Results
    3.5.3 Conclusions
  3.6 Multi-local Expected Improvement
    3.6.1 Batched Multi-local Expected Improvement
    3.6.2 Experiments
    3.6.3 Results
    3.6.4 Conclusions
  3.7 Adaptive Parameter Selection
    3.7.1 Benchmark Based Algorithm Configuration
    3.7.2 Experiments
    3.7.3 Results
    3.7.4 Conclusions
  References

4 Application
  4.1 Electrostatic Precipitators
    4.1.1 Problem Description
    4.1.2 Methods
    4.1.3 Results
    4.1.4 Conclusions
  4.2 Application Case Studies
  References

5 Final Evaluation
  5.1 Define
  5.2 Analyze
  5.3 Enhance
  5.4 Final Evaluation
  References

Glossary

Symbols

ŷ: Model prediction or estimate.
λ: Parameter for the nugget effect in Kriging models.
K: Correlation matrix for a Kriging model.
X: Matrix of candidate solutions that can be evaluated on the expensive objective function.
χ_c: Set of candidate solutions that can be evaluated on the expensive objective function.
f_s: Test function generated by Kriging simulation.
f_sc: Conditional test function generated by Kriging simulation.
μ̂: Mean value.
σ̂: Used in Kriging simulations; a model parameter, determined by MLE.
k: Kernel for a Kriging model.
A_inner: Inner optimization algorithm, searching on the surrogate in SBO.
C_T: Total evaluation cost of χ_c on the expensive objective function.
C_i: Evaluation cost for the i-th candidate on the expensive objective function.
M: Surrogate model in model-based optimization.
ω_υ: Used in Kriging simulations; i.i.d. random samples from the distribution given by the spectral density function of the Kriging model's kernel.
φ_υ: Used in Kriging simulations; an i.i.d. uniform random sample from the interval [−π, π].
θ: Parameter of an exponential kernel function for Kriging.
k: Vector of correlations between the training data and a new data point, used for the Kriging model.
y_sc: Unconditionally simulated values at known training samples.
b: Batch size of a parallel optimization algorithm.
f_c: Function describing the evaluation cost spent per objective function evaluation, given a certain batch size.
f_e: Expensive-to-evaluate objective function.
f_t: Function describing the time required per objective function evaluation, given a certain number of CPU cores.
f_infill: Infill criterion, such as expected improvement.
n_T: Total number of CPU cores in a system.
n_L1: Number of CPU cores per objective function evaluation.

Chapter 1

Introduction

1.1 Motivation

The practical optimization of many applications is accompanied by a significant cost for each conducted experiment. Such costs might stem from the time an employee requires to perform experiments, the materials used in the experiments, the machines involved, or even the computational cost of computer experiments such as Computational Fluid Dynamics (CFD) simulations. In such a scenario, often only a few tens or hundreds of expensive function evaluations are feasible due to a strict budget of time or money that has to be met. These hard limits require the use of very sample-efficient optimizers.

One commonly applied sample-efficient technique is Surrogate-Based Optimization (SBO) [1–3]. In SBO, a data-driven model, the so-called surrogate, is fitted to a few initially evaluated sample points. The surrogate mimics the expensive objective function while being cheap to evaluate. This enables an extensive search for an optimum on the cheap surrogate instead of the expensive function. In the classical approach of SBO, only the best point found on the surrogate is expensively evaluated on the objective function. The additional data point is then used to build a new, hopefully more accurate model. This process of fitting and searching for an optimum on a surrogate model is usually iterated until the evaluation budget of the expensive function is depleted or a sample point of satisfactory quality is found.

Classical SBO has been applied successfully in numerous industrial applications. Consider, for example, the work by Bartz-Beielstein [4], listing over 100 distinct application cases. Yet, there is a subset of functions to which this method cannot be applied directly in an efficient manner. Classical SBO usually generates a single candidate solution per iteration. Thus, it cannot efficiently be applied to problems that allow for the parallel evaluation of a set of candidates. This work specializes in this specific subset of problems: problems where the cost of evaluating an experiment is high, but where multiple experiments can be conducted at the same time with comparatively little or no additional cost.


Such a scenario arises in many real-world settings. For example, consider computer simulations such as CFD and Finite Element Method (FEM) models, which are applied in a variety of industrial optimization applications: e.g., optimizing aerofoils [5–7], heat and mass transfer simulations [8–10], stress and stiffness simulations in construction work [11], and many more. Furthermore, most parameter tuning tasks for optimization and machine learning algorithms can be parallelized to some degree. All these problems can take advantage of today's age of computing, where parallelism is a crucial component in large computation clusters with millions of cores to tackle complex optimization tasks efficiently.

Formally, we aim to optimize a given expensive-to-evaluate function f_e : X → R with X ⊆ R^d. We refer to f_e as a ‘black-box’ function because we assume that no knowledge about f_e is available other than the knowledge obtained by directly evaluating f_e(x) for a given candidate solution x ∈ X. Furthermore, f_e is required to allow for the parallel evaluation of a set of candidates X_c = {x_1, ..., x_n}. If the total cost C_T of evaluating the candidates X_c in parallel is less than the sum of their individual costs C_i when evaluating them one after another, then applying a parallel optimizer to the problem might be more efficient:

$C_T < \sum_{i=1}^{n} C_i$

Note that the total cost C_T includes any communication overhead that could lead to additional computation time. Even though the results presented in this thesis might be partially applicable to other research domains, we restrict our work here to single-objective deterministic problems.
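To make the condition above concrete, the following minimal Python sketch checks whether a batch evaluation would pay off; the cost values are purely hypothetical and only illustrate the inequality.

    def parallel_pays_off(c_parallel_total, c_individual):
        # True if evaluating the whole batch in parallel is cheaper than
        # evaluating its members one after another, i.e., C_T < sum of C_i.
        return c_parallel_total < sum(c_individual)

    # Hypothetical costs in hours: four candidates, 1 h each when run sequentially,
    # 1.3 h for the whole batch when run in parallel.
    print(parallel_pays_off(1.3, [1.0, 1.0, 1.0, 1.0]))  # True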

1.2 Research Questions

Despite the sheer number of possible use cases for parallel SBO, essential questions remain unanswered even in fundamental areas. For example, we see a clear lack of benchmarking tools and test functions for parallel SBO. Existing benchmarks only allow for algorithm comparisons in limited test scenarios, for example with fixed parameter settings [12]. A quantitative measure for optimization efficiency needs to be defined for parallel SBO algorithms. An additional problem lies within the expensive nature of the functions we are trying to research: if already a few evaluations on such problems incur a high cost, then how can SBO algorithms be benchmarked rigorously, allowing for statistical comparisons? From these questions, we summarize:

RQ-1 How to benchmark and analyze parallel SBO algorithms in a fair, rigorous, and statistically sound way?

Given the ability to benchmark and analyze SBO algorithms, we want to investigate existing methods. For example, why not just use already intrinsically parallel optimization algorithms, such as population-based Evolutionary Algorithms (EAs)? Are model-based methods better suited for expensive problems, even in a parallel scenario? Are there specific scenarios in which existing parallel SBO performs well or poorly? We ask:

RQ-2 How well do existing parallel SBO algorithms perform on expensive-to-evaluate problems? What are their benefits and drawbacks?

Understanding the performance of parallel SBO algorithms is essential to improve their efficiency further. Possible drawbacks need to be analyzed and, where possible, fixed to advance the state of the art in parallel optimization. How can multiple candidate solutions be chosen on a single surrogate more efficiently? Therefore:

RQ-3 How can existing drawbacks of parallel SBO algorithms be avoided to enable more efficient optimizations?

1.3 Outline

The following sections of this thesis are, to a large extent, written independently of one another. This means that readers who are only interested in parts of this thesis, or only in a single specific topic, can skip ahead and read only the section of their interest. Figure 1.1 gives a visual overview of the thesis structure.

Fig. 1.1 Graphical summary of the thesis outline. Numbers represent chapters/sections


Chapter 2 gives a brief introduction to SBO and EAs and refers to introductory literature. Readers who want to refresh their knowledge in this field might start there. A taxonomy and overview of parallel SBO are presented, with recommendations for practitioners, in Sect. 2.3. Existing methods in the field of parallel SBO are reviewed in Sect. 2.4.

The central part of this thesis is concerned with answering the presented research questions RQ-1 to RQ-3. Before parallel SBO algorithms can be analyzed, we need to DEFINE a fair performance measure by which these algorithms can be benchmarked. Practitioners who are interested in how to benchmark parallel SBO algorithms are referred to Sect. 3.1. Section 3.2 covers two distinct approaches for the selection and generation of benchmark test functions. We apply the benchmarking framework to these functions, hereby answering what to benchmark.

Based on the benchmark tool set from Sects. 3.1 and 3.2, we can ANALYZE existing SBO algorithms, their performance, their benefits, and their shortfalls. In-depth analyses of parallel SBO are presented in three investigative studies. Section 3.3 analyzes multiple state-of-the-art parallel SBO optimizers and compares them to non-model-based optimization algorithms. The importance of a proper hyper-parameter setup is evaluated in Sect. 3.4. Lastly, we research the impact of landscape exploration in SBO algorithms and discuss the trade-off between exploration and exploitation in Sect. 3.5.

The findings from Sects. 3.3, 3.4, and 3.5 are used to ENHANCE parallel SBO. Section 3.6 introduces a novel SBO algorithm and empirically evaluates its effectiveness on a large set of test functions. Section 3.7 further improves this approach by adding an adaptive parameter selection technique to the algorithm.

Lastly, we APPLY the novel techniques to various industrial and research problems in Chap. 4. Section 4.1 presents the optimization of large-scale, industrial electrostatic precipitators as a detailed case study. The outcome and significant improvements in the project are discussed. Multiple additional application cases are discussed in summarized form in Sect. 4.2. A final evaluation of this thesis is given in Chap. 5.

1.4 Publications

Parts of this thesis are based on previous, already published research. This section lists these research items and describes the thesis author's contributions to them. Where the author of this thesis is also the first author and main contributor of the published work, a roughly estimated percentage is given to show the author's share of the scientific contribution. In all other cases, the particular contribution of the author is stated. Parts of the listed research items were extended and rewritten to give more detailed information or to extend experiments where necessary. Others are used verbatim in this thesis. The listed work is presented based on its publication date, and not necessarily in the order it is referred to throughout this thesis.


Benchmark-Driven Algorithm Configuration Applied to Parallel Model-Based Optimization, F. Rehbach, M. Zaefferer, A. Fischbach, G. Rudolph, and T. Bartz-Beielstein, IEEE Transactions on Evolutionary Computation, 2022 [12].
• The author of this thesis is also the main author of this research item and estimates to have contributed roughly 70% of the scientific work.

Continuous optimization benchmarks by simulation, M. Zaefferer and F. Rehbach, in Parallel Problem Solving from Nature—PPSN XVI, 2020 [13].
• The author contributed the design and implementation of benchmarks for simulated functions, as well as to the development of the employed quality measures. The spectral method for simulation was developed by M. Zaefferer.

Variable reduction for surrogate-based optimization, F. Rehbach, L. Gentile, and T. Bartz-Beielstein, in Proceedings of the 2020 Genetic and Evolutionary Computation Conference, GECCO '20, 2020 [14].
• The author of this thesis is also the main author of this research item and estimates to have contributed roughly 60% of the scientific work.

Parallelized Bayesian optimization for expensive robot controller evolution, M. Rebolledo, F. Rehbach, A. E. Eiben, and T. Bartz-Beielstein, in Parallel Problem Solving from Nature—PPSN XVI, 2020 [15].
• The author contributed to the development of a parallel optimizer for robot controller parameters and the design and implementation of the benchmarks.

Expected improvement versus predicted value in surrogate-based optimization, F. Rehbach, M. Zaefferer, B. Naujoks, and T. Bartz-Beielstein, in Proceedings of the 2020 Genetic and Evolutionary Computation Conference, GECCO '20, 2020 [16].
• The author of this thesis is also the main author of this research item and estimates to have contributed roughly 75% of the scientific work.

Parallelized Bayesian optimization for problems with expensive evaluation functions, M. Rebolledo, F. Rehbach, A. E. Eiben, and T. Bartz-Beielstein, in GECCO '20: Proceedings of the Genetic and Evolutionary Computation Conference Companion, 2020 [17].
• The author contributed to the development of a parallel optimizer for expensive-to-evaluate functions and the executed benchmarks.

Feature selection for surrogate model-based optimization, F. Rehbach, L. Gentile, and T. Bartz-Beielstein, in Proceedings of the Genetic and Evolutionary Computation Conference Companion, 2019 [18].
• The author of this thesis is also the main author of this research item and estimates to have contributed roughly 60% of the scientific work.

Model-based evolutionary algorithm for optimization of gas distribution systems in power plant electrostatic precipitators, A. Schagen, F. Rehbach, and T. Bartz-Beielstein, International Journal for Generation and Storage of Electricity and Heat, VGB Powertech, vol. 9, 2018 [19].
• The research done in this work is based on the author's master thesis. The author developed the described surrogate-based optimization algorithm and the automated evaluation system for the parallel optimization of computational fluid dynamics simulations.

Comparison of parallel surrogate-assisted optimization approaches, F. Rehbach, M. Zaefferer, J. Stork, and T. Bartz-Beielstein, in Proceedings of the Genetic and Evolutionary Computation Conference - GECCO '18, 2018 [20].
• The author of this thesis is also the main author of this research item and estimates to have contributed roughly 75% of the scientific work.

References

1. A. Forrester, A. Keane et al., Engineering Design via Surrogate Modelling: A Practical Guide (Wiley, New York, 2008)
2. Y. Jin, Surrogate-assisted evolutionary computation: recent advances and future challenges. Swarm Evol. Comput. 1(2), 61–70 (2011)
3. A.I. Forrester, A.J. Keane, Recent advances in surrogate-based optimization. Prog. Aerosp. Sci. 45(1), 50–79 (2009)
4. T. Bartz-Beielstein, Sequential parameter optimization—an annotated bibliography. CIOP Technical Report 04/10, Research Center CIOP (Computational Intelligence, Optimization and Data Mining), Cologne University of Applied Science, Faculty of Computer Science and Engineering Science, April 2010
5. B. Naujoks, L. Willmes, T. Bäck, W. Haase, Evaluating multi-criteria evolutionary algorithms for airfoil optimisation, in International Conference on Parallel Problem Solving from Nature (Springer, 2002), pp. 841–850
6. A.J. Keane, Statistical improvement criteria for use in multiobjective design optimization. AIAA J. 44(4), 879–891 (2006)
7. A.I. Forrester, N.W. Bressloff, A.J. Keane, Optimization using surrogate models and partially converged computational fluid dynamics simulations. Proc. R. Soc. A: Math. Phys. Eng. Sci. 462(2071), 2177–2204 (2006)
8. K. Foli, T. Okabe, M. Olhofer, Y. Jin, B. Sendhoff, Optimization of micro heat exchanger: CFD, analytical approach and multi-objective evolutionary algorithms. Int. J. Heat Mass Transf. 49(5–6), 1090–1099 (2006)
9. S. Daniels, A. Rahat, G. Tabor, J. Fieldsend, R. Everson, Shape optimisation using computational fluid dynamics and evolutionary algorithms (2016)
10. S. Daniels, A. Rahat, J. Fieldsend, R. Everson, et al., Automatic shape optimisation of the turbine-99 draft tube (2017)
11. S.J. Leary, A. Bhaskar, A.J. Keane, A derivative based surrogate model for approximating and optimizing the output of an expensive computer simulation. J. Glob. Optim. 30(1), 39–58 (2004)
12. F. Rehbach, M. Zaefferer, A. Fischbach, G. Rudolph, T. Bartz-Beielstein, Benchmark-driven configuration of a parallel model-based optimization algorithm. IEEE Trans. Evol. Comput. 26(6), 1365–1379 (2022)
13. M. Zaefferer, F. Rehbach, Continuous optimization benchmarks by simulation, in Parallel Problem Solving from Nature – PPSN XVI: 16th International Conference, 2020, accepted for publication
14. F. Rehbach, L. Gentile, T. Bartz-Beielstein, Variable reduction for surrogate-based optimization, in Proceedings of the 2020 Genetic and Evolutionary Computation Conference, GECCO '20 (Association for Computing Machinery, 2020), pp. 1177–1185
15. M. Rebolledo, F. Rehbach, A.E. Eiben, T. Bartz-Beielstein, Parallelized Bayesian optimization for expensive robot controller evolution, in Parallel Problem Solving from Nature – PPSN XVI: 16th International Conference, 2020, accepted for publication
16. F. Rehbach, M. Zaefferer, B. Naujoks, T. Bartz-Beielstein, Expected improvement versus predicted value in surrogate-based optimization, in Proceedings of the 2020 Genetic and Evolutionary Computation Conference, GECCO '20 (Association for Computing Machinery, 2020), pp. 868–876
17. M. Rebolledo, F. Rehbach, A.E. Eiben, T. Bartz-Beielstein, Parallelized Bayesian optimization for problems with expensive evaluation functions, in GECCO '20: Proceedings of the Genetic and Evolutionary Computation Conference Companion (Association for Computing Machinery, 2020), pp. 231–232
18. F. Rehbach, L. Gentile, T. Bartz-Beielstein, Feature selection for surrogate model-based optimization, in Proceedings of the Genetic and Evolutionary Computation Conference Companion (ACM, 2019), pp. 399–400
19. A. Schagen, F. Rehbach, T. Bartz-Beielstein, Model-based evolutionary algorithm for optimization of gas distribution systems in power plant electrostatic precipitators. Int. J. Gener. Storage Electr. Heat VGB Powertech 9, 65–72 (2018)
20. F. Rehbach, M. Zaefferer, J. Stork, T. Bartz-Beielstein, Comparison of parallel surrogate-assisted optimization approaches, in Proceedings of the Genetic and Evolutionary Computation Conference - GECCO '18 (ACM Press, 2018), pp. 1348–1355

Chapter 2

Background

The following chapter briefly presents background knowledge on SBO (Sect. 2.1), infill criteria (Sect. 2.1), and EAs (Sect. 2.2). We refer to more in-depth literature where needed. The chapter continues by presenting a taxonomy for parallel SBO (Sect. 2.3) and giving recommendations for practitioners in the field of SBO. The chapter is concluded with a literature review covering existing SBO methods (Sect. 2.4).

2.1 Surrogate-Based Optimization

SBO is a commonly applied technique for expensive-to-evaluate problems. It is used in scenarios where only a severely limited number of evaluations is possible, such as a few tens or hundreds of evaluations. Surrogate models are data-driven models that try to mimic the original, expensive objective function f_e. Throughout this thesis, we refer to models and surrogates interchangeably. The model is built on available data from previous evaluations of f_e and is then used to predict the value of f_e for new observations X_c.

A graphical overview of the typical SBO approach is presented in Fig. 2.1. If no previous evaluations of f_e exist, a common step is to generate an initial design X with a space-filling technique such as Latin Hypercube Sampling (LHS) [1]. X is expensively evaluated on f_e, and a surrogate M can be built on the data. The model replaces the expensive evaluation on f_e, and an in-depth search for good candidate solutions becomes feasible. An inner optimizer A_inner searches for the best candidate solution on the surrogate. The search is led by a so-called infill criterion f_infill that judges the quality of any candidates proposed by A_inner. In the most basic scenario, f_infill is directly the predicted value of the surrogate. However, infill criteria commonly consider not only the predicted function value for a given candidate solution but also the surrogate's uncertainty at that point.


Fig. 2.1 Graphical overview of the typical SBO approach

The surrogate combined with the infill criterion is usually much cheaper to evaluate than f_e. Therefore, an extensive search for good candidate solutions becomes feasible. In the standard (non-parallel) SBO scenario, a single best candidate solution is selected by the search with A_inner. This candidate is then expensively evaluated on f_e, and the surrogate M is updated with the additional information from the evaluation. One iteration in SBO denotes one cycle of fitting a surrogate on the available data, searching for the best candidate solution on the surrogate with A_inner, and lastly, evaluating the candidate on f_e. This process is repeated until no more evaluation budget is available or other stopping criteria are met. For a detailed explanation of SBO covering different sampling strategies for the initial design, different surrogate models, and also infill criteria, we highly recommend the work by Forrester et al. [2].
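As an illustration of this loop, the following minimal sketch implements one non-parallel SBO cycle in Python. It is not the implementation used in this thesis; it merely assumes a Gaussian process from scikit-learn as the surrogate, a crude random search as the inner optimizer A_inner, and the predicted value as infill criterion.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def f_expensive(x):                                   # stand-in for the costly objective f_e
        return float(np.sum(x ** 2))

    rng = np.random.default_rng(1)
    dim, n_init, budget = 2, 8, 25
    X = rng.uniform(-5, 5, size=(n_init, dim))            # initial design (random here, LHS in practice)
    y = np.array([f_expensive(x) for x in X])

    while len(y) < budget:
        model = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
        cand = rng.uniform(-5, 5, size=(2000, dim))       # A_inner: random search on the surrogate
        infill = model.predict(cand)                      # infill criterion: predicted value (PV)
        x_new = cand[np.argmin(infill)]                   # best candidate according to the surrogate
        X = np.vstack([X, x_new])
        y = np.append(y, f_expensive(x_new))              # one expensive evaluation per iteration

    print("best observed value:", y.min())

Replacing the infill computation with a criterion such as expected improvement, or proposing several points per iteration, leads to the parallel variants discussed later in this thesis.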

Kriging

One commonly used surrogate in SBO is Kriging (often also referred to as Gaussian Process Regression). In Kriging, the quality of new candidate solutions is estimated by building a Gaussian process model. Kriging is based on a correlation structure derived from the observed training data. For a detailed introduction to the Kriging methodology, we refer to the work of Forrester et al. [2]. Kriging conveniently supplies an uncertainty estimate for its own predictions. This uncertainty estimate is the basis for most methods of generating multiple candidate solutions on a single surrogate in parallel SBO. Furthermore, Kriging is argued to deliver a high prediction accuracy even if only a few evaluations of f_e are available.


For these reasons, this thesis specializes solely in Kriging models, even though many of the described methods can be employed for other kinds of surrogate models as well.
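A minimal sketch of fitting a Kriging-like model and querying both its prediction and its uncertainty estimate is shown below. It assumes scikit-learn's GaussianProcessRegressor as a convenient stand-in for a dedicated Kriging implementation and is not code from this thesis.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    rng = np.random.default_rng(0)
    X_train = rng.uniform(-5, 5, size=(12, 2))            # a few expensively evaluated samples
    y_train = np.sum(X_train ** 2, axis=1)

    model = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X_train, y_train)
    mu, sigma = model.predict(np.array([[0.5, -1.0]]), return_std=True)  # prediction and uncertainty
    print(mu, sigma)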

Infill Criteria

Predicted Value (PV)

The most straightforward infill criterion for Kriging is the Predicted Value (PV) criterion. The PV of the Kriging surrogate is the direct prediction of the surrogate model, i.e., the mean of the Gaussian process prediction for a given candidate solution. If a surrogate were to exactly reproduce the expensive function f_e, then optimizing the PV criterion would yield the optimum of f_e. However, this is rarely the case, as all surrogates are fitted on a limited amount of data and will deviate somewhat from the original function. A significant problem with the PV criterion is that it can quickly get stuck in local optima. If the currently best-known point lies within a local optimum, it is possible and even likely that the surrogate also predicts the global optimum to be close to that point. This can lead to an endless loop where the PV search on the surrogate leads, again and again, to points in a possibly very bad local optimum.

Expected Improvement (EI)

Probably the most commonly applied infill criterion in Kriging-based SBO is the Expected Improvement (EI) criterion [3]. EI is meant to balance exploration and exploitation in SBO and was famously implemented in the Efficient Global Optimization (EGO) algorithm [4]. As the name suggests, the EI criterion calculates the expected improvement of a given candidate solution, based on the candidate's predicted value and the estimated uncertainty. We refer to Forrester et al. [2] for the exact calculation and a more detailed description. Importantly, since EI assists global exploration, it will eventually sample all areas of the search space. Given an infinite budget, EI is guaranteed to converge to the global optimum, and convergence rates can be analyzed analytically [5, 6].
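For reference, the standard EI formula for minimization can be written down in a few lines, based on the Kriging prediction mu and uncertainty sigma. This is a generic textbook sketch, not code from the thesis.

    import numpy as np
    from scipy.stats import norm

    def expected_improvement(mu, sigma, y_best):
        # EI for minimization: (y_best - mu) * Phi(z) + sigma * phi(z), with z = (y_best - mu) / sigma
        sigma = np.maximum(sigma, 1e-12)                  # guard against zero uncertainty
        z = (y_best - mu) / sigma
        return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)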

2.2 Evolutionary Algorithms

One well-established class of optimizers is Evolutionary Algorithms (EAs). EAs are stochastic population-based search algorithms that mimic natural evolution. Candidate solutions that are evaluated in the search space are referred to as individuals.


Fig. 2.2 Graphical overview of the typical EA approach

Multiple individuals together form a population. The objective function is usually referred to as a fitness function [7]. The term EA refers to a whole class of optimizers that, for example, includes evolutionary programming [8], genetic algorithms [9], evolution strategies [10, 11], and genetic programming [12]. The typical approach of an EA is presented in Fig. 2.2. First, an initial population is created (usually randomly sampled or by means of some design) and evaluated on the objective (fitness) function. The search space is explored by two main operators: mutation and recombination. Depending on the specific algorithm implementation, these behave differently. Yet, most commonly, a mutation refers to a rather small random change in an individual, e.g., a small normally distributed error can be added to an individual. Recombination describes the combination of two or more parent individuals into new child individuals, also called the offspring. The new and changed individuals are evaluated on the objective function, and a subset of the total population is chosen to survive and continue into the next iteration. This process of applying mutation and recombination to the population and selecting the next population with regard to their fitness is repeated until a stopping criterion is met or the evaluation budget is depleted. For a detailed overview of EAs, we refer to [7]. In the following, we briefly introduce two specific EA variants that were applied in this thesis.

(1+1)-Evolution Strategy

The (1+1)-Evolution Strategy ((1+1)-ES) is a special case of an evolution strategy that has only one parent individual and one offspring. At each iteration, a new individual (offspring) is created by applying a random mutation to the parent. If the offspring has a better fitness (objective function value) than the parent, the offspring replaces the parent and continues to the next iteration. Otherwise, the parent is kept [7]. The size of the random changes is adapted over time. Famously, Rechenberg showed, for a set of handpicked functions in his analysis, that the optimal success probability of a mutation should be roughly 1/5 [10], coining the term "1/5th rule". If the success rate (the share of offspring outperforming their parent) in the last iterations was larger than 1/5, the step size is increased. If the success rate is lower than 1/5, the step size is decreased.
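A compact sketch of a (1+1)-ES with the 1/5th success rule on a simple sphere function is given below. The adaptation factors and the length of the success window are illustrative choices, not Rechenberg's exact constants.

    import numpy as np

    def sphere(x):
        return float(np.sum(x ** 2))

    rng = np.random.default_rng(0)
    dim, sigma, window = 5, 1.0, 20
    parent = rng.uniform(-5, 5, size=dim)
    f_parent = sphere(parent)
    successes = 0

    for it in range(1, 2001):
        child = parent + sigma * rng.standard_normal(dim)   # mutation: small normally distributed change
        f_child = sphere(child)
        if f_child < f_parent:                              # offspring replaces parent only if better
            parent, f_parent = child, f_child
            successes += 1
        if it % window == 0:                                # 1/5th success rule: adapt the step size
            rate = successes / window
            sigma *= 1.22 if rate > 0.2 else 0.82
            successes = 0

    print(f_parent)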

CMA-ES

The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) belongs to the group of evolution strategies. Hansen and Ostermeier introduced the CMA-ES to solve continuous nonlinear optimization problems [13]. Similar to the described (1+1)-ES, the CMA-ES relies on step size adaptation to apply mutations. Instead of adapting a single step size, the CMA-ES adapts a covariance matrix. The covariance matrix allows for different step sizes depending on the direction in the search space. This specific adaptation makes the algorithm invariant to search space rotations.
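In practice, the CMA-ES is usually used through an existing library. The following usage sketch assumes the pycma package ('cma' on PyPI); this is merely one possible implementation and not necessarily the one used in this thesis.

    import cma                                    # assumes the pycma package is installed

    def sphere(x):
        return sum(xi ** 2 for xi in x)

    es = cma.CMAEvolutionStrategy([5.0] * 10, 0.5)        # start point and initial step size sigma0
    while not es.stop():
        solutions = es.ask()                              # sample a population from the current distribution
        es.tell(solutions, [sphere(x) for x in solutions])  # update mean, step size, and covariance matrix
    print(es.result.xbest)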

2.3 A Taxonomy for Parallel SBO

Parallelization in the field of optimization is usually concerned with generating time savings in the optimization process. Different aspects of the optimization process can be adapted and parallelized. Yet, to the best of our knowledge, there is no unified taxonomy for parallel SBO that clearly separates all aspects of parallelization that can be applied in a typical application problem.

In the field of optimization, parallelization is often separated into two categories: coarse- and fine-grained parallelization [14, 15]. This classification depends on the computation/communication ratio of the optimization algorithm. For a high ratio of computation to communication, the algorithm is said to be coarse-grained; with a low ratio, it is termed fine-grained [14]. In order to cover the whole spectrum of possible parallelization in industrial applications, we would like to move away from this two-class system. The existing system only considers the optimization algorithms themselves instead of the whole optimization process, including the objective function.

The taxonomy which we introduce in the following is meant to give an overview of which parts of SBO can be parallelized. Instead of the two groups (coarse- and fine-grained parallelization), these parts will be addressed in levels to emphasize a hierarchical structure, meaning some kinds of parallelization have prerequisites from lower levels. The taxonomy aims to clearly define each parallelizable part and its distinctions from the other levels. On the one hand, this is done in order to give an overview and to be able to refer clearly to each different aspect of parallelization throughout this thesis. On the other hand, it is used to provide decision support for practitioners. This decision support will be given by pointing out the advantages and disadvantages of each applicable level. Additionally, after the introduction of the taxonomy, the levels will be summarized in a guide on when to prioritize which level of parallelization.


Fig. 2.3 Visualization of the four levels of parallelization exemplified on a surrogate-based optimization loop. An initial design is expensively evaluated on the objective function (parallelized on L1). A surrogate model is fitted, and an optimization algorithm is used to search for the optimum on the model. Both can be parallelized (L2). Depending on the infill criterion that is used to guide the optimization, one or multiple sample points are found as a result of the surrogate-based optimizer. They are evaluated in parallel (L3). Additional samples can be generated by applying multiple SBO algorithms or even mixtures of different algorithms in parallel (L4)

We define four levels (L1–L4) of parallelization, covering the whole optimization loop from an application viewpoint. We further argue that usually each level can only be parallelized to a certain extent before saturation effects occur and further parallelization becomes less and less efficient. Mixing parallelization on multiple levels can circumvent these saturation effects. L1 to L4 and the described saturation effects are explained in more detail in the following subsections. In principle, they apply to various classes of optimization algorithms and problems. However, they are specifically designed for parallel SBO applied to problems with very high evaluation costs. Figure 2.3 illustrates L1 to L4.

2.3.1 Parallel Objective Function Evaluation (L1)

The first level of parallelization, L1, is in many cases also the most sensible starting point for parallelizing an expensive industrial problem. Usually, if an objective function is costly to evaluate, then this cost will dominate the overall cost of the optimization.


Fig. 2.4 Graphical representation of running one or multiple simulations (denoted S) on an example system with 16 CPU cores. The simulations are either not parallelized or parallelized with L1-a and L1-b as labeled on the left side of the figure

For example, a computer simulation might run for hours on a compute cluster, while the surrogate-based optimizer only requires seconds or minutes to propose the next candidate solution that should be evaluated. L1 defines the parallelization of the objective function and thus usually tackles the largest share of the total cost of optimization. We distinguish between L1-a and L1-b as two independent types of parallelization of the objective function. L1-a and L1-b are explained graphically in Fig. 2.4.

L1-a could also be referred to as an internal parallelization of the objective function evaluation. It describes a reduction of the time needed to evaluate one sample point on the objective function. Therefore, parallelization is used to reduce the required wall-clock time for that single evaluation.

Example: Consider a computer simulation that requires a significant amount of run time and that can be run using a single CPU. Using additional CPU cores for the same simulation can reduce the total runtime. Similarly, L1-a could describe a physical experiment run by some technician. Asking for help from a colleague and doubling the number of technicians on the task might reduce the time required to finish the given experiment.

L1-b describes an external parallelization of the objective function, that is, the simultaneous evaluation of several samples on the objective function. Here, each individual evaluation is not sped up. Instead, multiple evaluations are done at the same time. Let us reconsider our two previous examples for this. Instead of speeding up a single computer simulation, we could run two simulations at the same time. Instead of the technician helping a colleague with an experiment, one could choose to work on a second experiment to finish in parallel with the colleague. This second type of parallelization does not function by itself, though. It is defined in L1 because it acts directly upon the objective function's evaluation. Yet, the additional sample points that are evaluated in parallel need to be proposed by an optimizer. This is covered in the later-described levels L3 and L4.
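The two types can be combined in code. The sketch below uses a hypothetical run_simulation stand-in; it evaluates a batch of candidates simultaneously with a process pool (L1-b) while handing each evaluation a share of the core budget for its internal parallelization (L1-a).

    from concurrent.futures import ProcessPoolExecutor

    def run_simulation(x, cores):
        # Stand-in for an expensive simulation; 'cores' would set its internal
        # (L1-a) parallelization, e.g., the solver's thread or MPI process count.
        return sum(xi ** 2 for xi in x)

    def evaluate_batch(candidates, total_cores):
        cores_per_eval = max(1, total_cores // len(candidates))        # split the core budget (L1-a)
        with ProcessPoolExecutor(max_workers=len(candidates)) as pool:  # several evaluations at once (L1-b)
            futures = [pool.submit(run_simulation, x, cores_per_eval) for x in candidates]
            return [f.result() for f in futures]

    if __name__ == "__main__":
        print(evaluate_batch([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [2.0, 2.0]], total_cores=16))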


Fig. 2.5 Example of measured speedup (black curve) versus relative parallel efficiency (red curve) for an increasing number of CPU cores (efficiency is defined as speedup divided by the number of used cores). Numbers are taken from example measurements of a small CFD simulation and will largely differ for other problems. Figure based on [16, p. 7]

While L1-a and L1-b are independent of each other, both can (and in the best case should) be used simultaneously. That is, each evaluation of the objective function is accelerated by parallelization, and several evaluations are performed simultaneously. Balancing both of these types of parallelization is important in many industrial applications to enable an efficient optimization. The main reason for this is illustrated with the example of a CFD simulation in Fig. 2.5. There are processes where parallelization scales ideally; doubling the available computation resources will halve the required runtime. Other processes cannot be parallelized at all, and additional resources do not yield a benefit in computation time. Figure 2.5 shows a scaling behavior that is common across many kinds of computer experiments. Initially, increasing the resources yields nearly linear speed-ups for the simulation runtime. However, with each additional core, communication overhead is added. This results in less and less speed-up gained by each additional core.

This allows for a synergetic effect created by parallelizing on multiple different levels of this taxonomy at the same time. Each level can show these saturation effects, where additional resource spending yields diminishing returns. While these effects occur fundamentally differently for each level (as described in the respective sections of the other levels), each level can still show the same outcome: additional resources are no longer of much help. Thus, in cases where many resources are available for parallelization, spending the resources across multiple levels will likely yield the best overall performance.

Example: Consider an optimization problem that involves running a computer simulation on a computation cluster. The cluster has a total number of available cores, say 128 for now. If all of these 128 cores are assigned to speed up a single simulation, it will likely run faster than executing it with only a single CPU core. Yet, considering Fig. 2.5, it will probably not run anywhere near 128 times faster. It might even take roughly as long as, or longer than, running it with, e.g., 64 cores instead. This happens if the computational overhead of the communication between the CPUs grows larger than the speed-up the additional CPUs provide. Instead of using all 128 cores to parallelize one simulation, a suitable parallel SBO algorithm could propose multiple candidate solutions in parallel. For example, four candidates could be proposed per iteration, and each candidate is evaluated with 128/4 = 32 cores.

The described saturation effect does not only occur in the parallelization of the objective function but also in optimization algorithms. This deficit depends on the optimizer and the way new samples are generated. While a purely random search could be scaled infinitely to generate millions of sample points, other optimization algorithms cannot be scaled in a similar way. For example, an EA might work well with a population size of four or eight but be less efficient with a population size of 128. In such a scenario, a mixture of both, using multiple CPU cores to speed up a single simulation and proposing multiple simulation candidates that can run in parallel, might yield the best performance. Depending on the optimization problem at hand and the available computational resources, different mixtures between L1-a and L1-b become more or less efficient.

To summarize the described points, the benefits and drawbacks of parallelization on L1 are covered in Table 2.1.

Table 2.1 Summary of benefits and drawbacks of L1 parallelization

Benefits:
• Large speed-ups: If the given objective function allows for parallelization, very large speed-ups are possible. Dependent on the given problem, this level can often yield the highest speed-up across all other forms of parallelization.
• No expert optimization knowledge: The parallelization of the objective function is problem dependent, but does generally not require expert knowledge in the field of optimization.

Drawbacks:
• Applicability: Not all objective functions can be parallelized; some others may only be parallelized either on L1-a or L1-b.
• Problem dependent: While expert knowledge in the field of optimization might not be required, L1 is very problem dependent and no single standard solution exists. Expert knowledge in the respective field might be required.
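The core-budget trade-off from the 128-core example can be played through numerically. The sketch below assumes a simple Amdahl-style speed-up model with a hypothetical parallel fraction; all numbers are illustrative and not measurements from this thesis.

    def speedup(cores, parallel_fraction=0.95):
        # Assumed Amdahl-style model: only 'parallel_fraction' of the work scales with the cores.
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

    total_cores, n_evals, t_single_core = 128, 16, 3600.0   # hypothetical cluster and simulation
    for batch in (1, 2, 4, 8, 16):                          # candidates evaluated in parallel (L1-b)
        cores_per_eval = total_cores // batch               # cores per simulation (L1-a)
        t_per_eval = t_single_core / speedup(cores_per_eval)
        n_batches = -(-n_evals // batch)                    # ceiling division: sequential rounds
        print(batch, round(n_batches * t_per_eval / 3600.0, 2), "hours")

With these particular made-up numbers, spreading the cores over more simultaneous evaluations pays off; note, however, that the model ignores the optimizer-side saturation discussed above (whether the algorithm can actually propose that many reasonable candidates), which is why the mixture has to be chosen per problem.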


2.3.2 Parallel Model Building (L2)

Level 2 (L2) of parallelization describes any implementation speed-up that affects the runtime of the inner optimizer A_inner. L2 is not concerned with the quality of newly proposed sample points, only with the amount of time the optimizer requires to create these new sample points. For surrogate-based algorithms, this includes the time it takes to build a data-driven model and the time it takes for the inner optimizer to search on the model for new candidate solutions. L2 does not change or affect the evaluation of the expensive objective function f_e. Yet, the expected increase in optimization efficiency of L2 largely depends on the evaluation cost of f_e. If the problem at hand is particularly expensive and makes up the largest part of the total optimization time, then efficiency improvements on L2 will only yield small returns.

For example, let us again assume that the problem at hand is to tune the parameters of a computationally expensive CFD simulation. First measurements have shown that the simulation needs about one hour of runtime per evaluation. The inner optimizer might only need seconds or very few minutes to choose new evaluation points for the simulation. Even if a significant speed-up is achieved on L2 and the runtime of the inner optimizer can be shortened by a factor of ten, the whole process, which is dominated by the one-hour-long simulation, is not completed much faster. Therefore, in such cases where the cost of the objective function dominates the total runtime (or cost) of the optimization, increasing the performance of L2 should often be a lower priority.

On the other hand, a speed-up on L2 can be the key to success on optimization problems where the objective function does not entirely dominate the cost of the optimization. To see this, we reconsider the same CFD example, but this time assume that the simulation only takes 30 s to run instead of the original hour. In this case, only roughly one half of the optimization runtime would be occupied by the CFD simulation. The other half is caused by the runtime of the inner optimizer. In this scenario, speeding up the inner optimizer could shorten the overall optimization runtime by a significant amount. This circumstance makes L2 parallelization particularly interesting for problems where SBO was previously considered too computationally expensive, i.e., problems where the loss due to the high inner optimizer runtime would overshadow the efficiency gained by requiring fewer expensive evaluations of f_e. While it is often argued that surrogate-based methods are more sample efficient than other, non-surrogate-based algorithms (e.g., evolutionary algorithms), this benefit is quickly outweighed if the runtime of the surrogate-based algorithm is too long.

Example: If a computationally cheap optimization algorithm like a (1+1)-ES can do significantly more objective function evaluations due to its lower algorithm runtime, then this will likely make up for any possible lack in sampling efficiency. In that case, applying SBO would be inefficient due to the higher runtime caused by model fitting and the inner optimizer. Yet, a speed-up on L2 could be the decisive point that justifies the use of SBO again.
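The effect of the evaluation cost on the value of an L2 speed-up can be illustrated with a small back-of-the-envelope calculation; all numbers are hypothetical and only chosen to mirror the example above.

    def total_runtime(n_iter, t_eval, t_inner):
        # One expensive evaluation plus one model fit / inner search per SBO iteration.
        return n_iter * (t_eval + t_inner)

    for t_eval in (3600.0, 30.0):                    # one-hour vs. 30-second simulation
        base = total_runtime(100, t_eval, 30.0)      # assume 30 s of inner-optimizer time per iteration
        fast = total_runtime(100, t_eval, 3.0)       # inner optimizer sped up by a factor of ten (L2)
        print(t_eval, round(base / fast, 2))         # overall speed-up of the whole optimization

With these made-up numbers, the one-hour simulation gains only about one percent overall, whereas the 30-second simulation finishes almost twice as fast.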


Table 2.2 Summary of benefits and drawbacks of L2 parallelization

Benefits:
• Reduced inner algorithm runtime: Can largely reduce the required runtime of the inner optimizer. Less runtime overhead on top of the expensive function.
• Enables applicability: Makes computationally expensive optimizers like surrogate-based optimization feasible for more optimization tasks due to lower runtime.

Drawbacks:
• Problem dependent: Efficiency benefit largely depends on the cost of the objective function. Might yield close to no benefit on a very expensive function.
• Expert knowledge: Expert knowledge might be required to parallelize the respective inner optimization algorithm.
• Algorithm dependent: Some algorithms cannot be parallelized.

Existing literature on L2 parallelization includes the following. Emmerich [17] shows the computational complexity of building local Kriging models for local predictions. The evaluated sample points are clustered using the k-nearest-neighbors algorithm. Each local model is then only fitted on the cluster of the k nearest neighbors instead of all samples, largely reducing the runtime required to fit a model. Nguyen-Tuong et al. [18] propose a localized Kriging approach in a robotics application. They partition the evaluated data points into separate models, which are faster to fit. The model prediction is then recombined from the separate models based on a distance metric. Van Stein et al. [19] employ a cheap clustering algorithm to split the available data and build models on each subsample. They further calculate the optimal linear combination of the models' predictions, coining the name Optimally Weighted Cluster Kriging. Gramacy et al. [20] use tree-based methods to partition the search space. Again, local models are then built for each partition of the search space. In summary, one can argue that L2 parallelization is particularly efficient if the runtime of the inner optimization algorithm represents a significant portion of the total optimization runtime. Applying complex parallelization methods on L2 where the optimizer only contributes a negligible amount of runtime is likely not worth the effort. Again, the benefits and drawbacks of parallelization on L2 are covered in Table 2.2.
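As a rough illustration of the local-modeling idea behind these works (a hedged sketch, not the exact method of [17-19]), the following Python code fits a Gaussian process only on the k nearest neighbors of a query point; scikit-learn and NumPy are assumed to be available, and all data shown is synthetic.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.neighbors import NearestNeighbors

def local_gp_predict(X, y, x_query, k=20):
    """Predict f(x_query) with a GP fitted only on the k nearest
    evaluated samples -- much cheaper than fitting on all of X."""
    k = min(k, len(X))
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    _, idx = nn.kneighbors(x_query.reshape(1, -1))
    gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
    gp.fit(X[idx[0]], y[idx[0]])          # fit on the local neighborhood only
    return gp.predict(x_query.reshape(1, -1))[0]

# Toy usage on random data (illustrative only).
rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 3))
y = np.sum(X**2, axis=1)
print(local_gp_predict(X, y, np.array([0.5, 0.5, 0.5])))
```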

2.3.3 Parallel Evaluation Proposals (L3)

Level 3 (L3) focuses on infill criteria that propose more than one candidate solution per iteration. We refer to these multiple candidates as a batch, and the number of candidates generated per iteration defines the batch size of the parallel optimizer. For some inherently parallel optimization algorithms, like a population-based EA, this happens automatically. Here, increasing the batch size is as easy as increasing the population size of the algorithm. Yet, since this thesis is focused on applying SBO methods, in the following we will only consider how L3 can be applied in SBO. For most surrogate-based methods, this step is not as straightforward. The default assumption for SBO is that an infill criterion leads the search to a single global optimum on the model. This point can then be expensively evaluated. L3 parallelization for SBO, on the other hand, requires an infill criterion that selects multiple sample points from the same model, all of which are then expensively evaluated at the same time. However, selecting several reasonable evaluation candidates on a single model is not straightforward, and efficient methods for doing so are still part of ongoing research. In particular, scaling this process to a high number of parallel sample points is complex. The proposed points either quickly become too similar (meaning they are all placed in a local area) or are distributed widely over the search space [16]. Each sample point then only has a minuscule chance of actually delivering an improvement over the current best-known solution. A multitude of methods for this process exists. For Kriging models, these usually employ the intrinsic uncertainty estimate of the Kriging model. By placing points in well-known and very good regions as well as in not yet well-known and possibly worse regions of the search space, the SBO algorithms try to balance exploration and exploitation. Since these methods are within the special focus of this thesis, we dedicate a separate overview to them in Sect. 2.4. Furthermore, novel methods for the efficient parallelization of SBO are introduced in Chap. 3. L3 parallelization can be applied in a multitude of scenarios. First and foremost, it can be used in any situation where multiple evaluations of the expensive objective function are possible in parallel. Depending on how efficiently an optimization algorithm can create many reasonable sample points, large speed-ups can be achieved. Furthermore, there are scenarios in which L3 can actually be required to use all available resources.

Example Consider experiments that are usually only evaluated in batches with a fixed size, e.g., at each iteration, four sample points should be evaluated because a machine in a lab takes four probes simultaneously. Filling the machine with fewer probes does not lower the experiment cost and is thus a waste of efficiency. In a scenario with fixed batch sizes, parallelization on L1 or L2 alone cannot ensure efficient resource usage. L3 can be used to generate as many sample points as required for the given batch size of an experiment. Similar to the last sections, the benefits and drawbacks of parallelization on L3 are covered in Table 2.3.
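The following Python sketch outlines a generic synchronous L3 loop under simplifying assumptions: the surrogate fitting is omitted and the batch-proposal step is a random-sampling stand-in for a real parallel infill criterion (see Sect. 2.4). Only the structure of "propose a batch, evaluate it in parallel, update the data" is meant to be illustrative.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def propose_batch(model, bounds, batch_size, rng):
    """Stand-in for a parallel infill criterion: here we simply sample
    random points; a real L3 criterion would select the batch based on
    the surrogate `model`."""
    lo, hi = bounds
    return [rng.uniform(lo, hi) for _ in range(batch_size)]

def expensive_f(x):
    return float(np.sum(np.asarray(x) ** 2))  # cheap placeholder objective

def synchronous_sbo(n_iter=5, batch_size=4, dim=3, seed=0):
    rng = np.random.default_rng(seed)
    X, y = [], []
    for _ in range(n_iter):
        model = None  # a surrogate would be (re)fitted on (X, y) here
        batch = propose_batch(model, (np.zeros(dim), np.ones(dim)), batch_size, rng)
        with ProcessPoolExecutor(max_workers=batch_size) as pool:
            y_batch = list(pool.map(expensive_f, batch))  # evaluate batch in parallel
        X.extend(batch)
        y.extend(y_batch)
    return min(y)

if __name__ == "__main__":
    print(synchronous_sbo())
```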


Table 2.3 Summary of benefits and drawbacks of L3 parallelization

Benefits:
• Large speed-ups: Enables large speed-ups, especially in scenarios where fixed batch sizes in experiments demand parallelization
• Synergy with L1: L1 and L3 can often be applied together to the same problem

Drawbacks:
• Implementation: Implementation depends on the model and infill criterion. The field is still under development in current research; a good parallel method is not necessarily available for every model

2.3.4 Multi-algorithm Approaches (L4)

The final level of this taxonomy, L4, is similar to L3 in that it is used to propose additional sample points to be evaluated on the expensive objective function. While L3 creates multiple sample points with one optimization algorithm, L4 describes parallelization across multiple algorithms running in parallel, for example, running an EA in parallel to an SBO algorithm. L4 can be applied in cases where a single optimizer is not able to create enough points in parallel, or where the optimizer is deemed to generate inefficient samples (in terms of objective function quality) if too many samples are required. We draw the dividing line between L3 and L4 parallelization based on the information exchanged between optimization algorithms. For example, hybrids where two algorithms work together and have access to each other's already evaluated points are not considered L4 parallelization. Since both parts of such a hybrid access the same information, we consider them as one structure in this taxonomy. However, the same two algorithms running independently without this shared knowledge would, in fact, be parallelized on L4. When tackling a new optimization problem, a suitable optimization algorithm or a good parameter configuration is often unknown. In such scenarios, parallelization can also be used to try multiple algorithms, or the same algorithm multiple times, each with a different configuration. A similar approach using parallelization as hyper-parameter selection is covered in more detail in Sect. 3.4. In general, even running exactly the same optimizer multiple times in parallel with new random starting conditions can significantly improve the best found solution. This can also be combined with stratified splits of the search space, e.g., starting each replicate of the optimizer in a different area of the search space to improve global exploration. Yet, all of these approaches require considerable amounts of parallelization resources. Just like in the last sections, the benefits and drawbacks of parallelization on L4 are summarized in Table 2.4.
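A minimal sketch of L4 parallelization, assuming SciPy is available: several independent local optimizers (different methods and seeds, chosen purely for illustration) are run in separate processes on a toy objective without exchanging any information, and the best result is kept.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from scipy.optimize import minimize

def objective(x):
    # Rastrigin-like toy objective standing in for an expensive function.
    return float(np.sum(x**2) + 10 * np.sum(1 - np.cos(2 * np.pi * x)))

def run_one(config):
    method, seed = config
    x0 = np.random.default_rng(seed).uniform(-5, 5, size=5)  # independent random start
    res = minimize(objective, x0, method=method)
    return method, seed, res.fun

if __name__ == "__main__":
    # Independent optimizers with no information exchange -- this is what
    # the taxonomy calls L4 parallelization.
    configs = [("Nelder-Mead", 1), ("Nelder-Mead", 2), ("Powell", 3), ("Powell", 4)]
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_one, configs))
    print(min(results, key=lambda r: r[2]))
```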


Table 2.4 Summary of benefits and drawbacks of L4 parallelization

Benefits:
• Additional sample points: Possibility to generate a nearly unlimited amount of sample points
• Tuning: Can replace tuning and complex decisions on selecting the best optimization algorithm

Drawbacks:
• Efficiency: Depending on the problem, running the same optimizer multiple times is often less efficient than a single optimizer that has access to all available evaluations
• Implementation: If multiple different optimizers are used, each one needs to be implemented, leading to additional work

2.3.5 Recommendations for Practitioners

We would like to end this section with some practical tips on how to prioritize parallelization efforts. After reading this taxonomy, one might ask where to start or which level of parallelization to use for a given task. For this purpose, we have created a decision tree (see Fig. 2.6) that can be followed in order to gain a basic idea of where to start. The presented rules were derived from empirical knowledge gained while working on several industrial applications. An overview of some of these applications is given in Chap. 4. While these rules should usually give a solid starting point, they should in no way be seen as unchangeable truth across all kinds of optimization problems. There might always be special cases in which, against previous assumptions, one level of parallelization ends up more efficient than another. A final answer can only be achieved by measuring. How to identify and measure such a scenario is covered in Sect. 3.1. The clearest separation can be made based on whether the cost of the objective function dominates the total optimization cost or not. If the answer is not obvious straight away, both runtimes can be measured. The objective function runtime can be measured by evaluating one or more sample points, preferably some that were part of an initial design anyway, so that the execution time is not wasted. The inner runtime (model fitting and inner optimizer) can be measured on a cheap-to-evaluate test function. Keep in mind that it is not the total runtime of the inner algorithm that matters but only the time required for a single iteration, as this is the time that needs to be compared to the evaluation cost of f_e. If the inner runtime contributes a significant portion to the total runtime, a good starting point will be L2. If the optimizer's runtime is negligible compared to the cost of the objective function, then L2 can be safely disregarded. Next, the objective function itself should be considered. If it can be parallelized, doing so is usually a good starting point. By increasing the L1 parallelization, for example, by increasing the number of cores assigned to a computer simulation, one can directly measure the resulting speed-up of the function. Hence, a well-informed decision can be made on how many more cores still improve the function's runtime. Once the saturation effects described in Sect. 2.3.1 show up, one might consider spending additional resources on L3 instead of L1.


[Fig. 2.6 is a decision tree: the questions "Does f_e dominate the optimization runtime / cost?" and "Can f_e be sped up by parallelization?" lead to the starting points L1, L2, L3, and L4, with dashed "more resources" arrows pointing towards the next level.]

Fig. 2.6 Empirical guide for prioritizing different levels of parallelization. Starting at the top of the figure, solid arrows should be followed. Dashed arrows are optional and can be followed if additional resources for parallelization are left

Lastly, if resources for further parallelization remain after parallelizing on L3, L4 should be considered. Between L3 and L4, it is much harder to tell exactly when one kind of parallelization is more efficient than the other. An exact measurement would require benchmarking both methods on the expensive objective function. Yet, this benchmark would require many expensive evaluations, defeating the purpose of efficiency, namely reducing the evaluation cost on f_e. Ideas on how to circumvent this by benchmarking on cheap-to-evaluate functions are presented in Sect. 3.1.
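The decision logic of Fig. 2.6 can be summarized in a few lines of code. This is only a sketch of the empirical guide; the 20% threshold used to decide whether the inner runtime is "significant" is an illustrative assumption, not a value given in the text.

```python
def recommend_levels(inner_iter_time_s, eval_time_s, eval_parallelizable,
                     significant_fraction=0.2):
    """Return a priority-ordered list of parallelization levels to try,
    loosely following the empirical guide of Fig. 2.6."""
    levels = []
    total = inner_iter_time_s + eval_time_s
    if inner_iter_time_s / total >= significant_fraction:
        levels.append("L2")      # inner optimizer runtime matters enough
    if eval_parallelizable:
        levels.append("L1")      # speed up each expensive evaluation first
    levels += ["L3", "L4"]       # spend remaining resources on batches,
                                 # then on multiple independent algorithms
    return levels

print(recommend_levels(inner_iter_time_s=60, eval_time_s=3600,
                       eval_parallelizable=True))
```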

2.4 Parallel SBO—A Literature Review

In the following section, we give a brief overview of existing parallel SBO approaches. A good survey of the field of parallel SBO is given by Haftka et al. [15]. In general, the main task in parallel SBO is to generate not just one candidate solution per iteration but instead to define multiple feasible candidates on the surrogate that can then be expensively evaluated on f_e in parallel. We follow a distinction into two groups as posed by Forrester et al. [2]: Firstly, we will discuss approaches where no modification of an infill criterion or special parallel infill criterion is needed. Secondly, we will discuss those parallel infill criteria that are specifically meant for generating multiple candidate solutions at the same time.

Non-modified Infill Criteria

Parallel SBO can be pursued without the need for a special parallel infill criterion. In theory, any infill criterion can be used in a parallel application, too. For this purpose, multiple points must be selected on that criterion's landscape. For example, Sóbester et al. start multiple optimization runs with the inner optimizer A_inner on the surrogate [21], each time with different random starting conditions. Ideally, each run results in a candidate solution in a distinct local optimum of the infill criterion landscape. A similar approach based on internal restarts is pursued by Hutter et al. [22] in the SMAC (Sequential Model-Based Optimization for General Algorithm Configuration) algorithm. In general, any search for multiple optima of the infill criterion landscape can be used to generate candidate solutions for parallel SBO. However, these approaches have some drawbacks. Firstly, the number of local optima that can be found largely varies based on the underlying function f_e, but also on the infill criterion and the number of already evaluated points on f_e. Or, to quote from the work of Forrester et al.: “A problem with the above technique is that we never know how many points we are going to get—there could be any number of local optima—thus making it difficult to align this technique with parallel computing capabilities” [2]. Furthermore, the chosen local optima could be of arbitrarily low quality, because they (1) represent a local optimum in a bad area of the search space, or (2) might lie arbitrarily close to each other, for example, if two restarts of A_inner converge into the same local optimum. Then, essentially the same candidate solution would be expensively evaluated twice. For the mentioned reasons, most of the recent research focuses on the second group of parallel SBO: specifically designed parallel infill criteria that allow the search for a set of points instead of a single candidate solution. These methods usually rely on different compromises between exploration and exploitation of the search space by using the intrinsic uncertainty estimate of the Kriging surrogate. Landscape exploration is done when points are evaluated in areas where the surrogate shows high uncertainty in its prediction. Landscape exploitation is done by searching in areas that are already known to yield good objective function values but also have low uncertainties, e.g., to slightly improve the solution quality by converging further into a local or global optimum.
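The multi-restart idea can be sketched as follows (a simplified illustration, not the exact scheme of Sóbester et al. [21]): a local optimizer is restarted on the infill landscape from random starting points, and near-duplicate optima are filtered out. Note that, exactly as criticized above, the number of returned candidates varies from run to run. SciPy and NumPy are assumed; the toy infill landscape is a stand-in for, e.g., a negative expected improvement surface.

```python
import numpy as np
from scipy.optimize import minimize

def multi_restart_candidates(infill, bounds, n_restarts=10, min_dist=0.05, seed=0):
    """Collect candidate points by restarting a local optimizer on the
    infill landscape; near-duplicate optima are filtered out."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds[0]), np.asarray(bounds[1])
    candidates = []
    for _ in range(n_restarts):
        x0 = rng.uniform(lo, hi)
        res = minimize(infill, x0, method="L-BFGS-B", bounds=list(zip(lo, hi)))
        if all(np.linalg.norm(res.x - c) > min_dist for c in candidates):
            candidates.append(res.x)      # keep only sufficiently distinct optima
    return candidates

# Toy infill landscape (stand-in for, e.g., negative expected improvement).
infill = lambda x: float(np.sin(5 * x[0]) + np.cos(7 * x[1]))
print(multi_restart_candidates(infill, (np.zeros(2), np.ones(2))))
```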


Investment Portfolio Improvement

One method that creates sample points with different compromises between exploration and exploitation is Investment Portfolio Improvement (IPI) [23]. Ursem first introduced this as a sequential strategy that creates a specific number of candidate solutions per iteration. However, the strategy is nicely applicable to parallel optimization. In IPI, each candidate solution is judged from the viewpoint of an investment portfolio. Candidate solutions with a high probability of improvement are low-risk investments but will likely only yield minor improvements. Candidates in areas with high uncertainty are much less likely to yield an improvement at all, but if one is found, it could be better by a significant margin. By searching for multiple candidates, each with a different risk level, one can generate multiple points for parallel evaluation.

Q-Point Expected Improvement (Q-EI)

The direct adaptation of the EI criterion to a parallel infill criterion is the Q-Point Expected Improvement (Q-EI). Q-EI was first proposed by Schonlau et al. [24]. It gained significant popularity after further improvements by Ginsbourger et al. [25]. Instead of calculating the EI for a single point, the Q-EI criterion defines the expected improvement of a set of points. Thus, in the optimization of the Q-EI, the inner optimizer A_inner searches for a list of candidate solutions, and the total EI of that set of points is computed by Q-EI. It is important to note that this approach yields different results than searching all local optima of the EI landscape as, for example, done by Sóbester et al. [21]. This is because if two candidate solutions lie close to each other, they reduce each other's EI. Evaluating one reduces the uncertainty in the area where the other candidate would be evaluated, and an additional improvement over said candidate in an area very close to it becomes extremely unlikely.
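For intuition, Q-EI can be approximated by Monte Carlo sampling from the joint Gaussian posterior of the q candidate points. The sketch below uses assumed posterior means and covariances and is a simplified illustration, not the closed-form computation of Ginsbourger et al. [25].

```python
import numpy as np

def q_ei_monte_carlo(mu, cov, y_best, n_samples=10_000, seed=0):
    """Monte Carlo estimate of the q-point expected improvement for a
    batch whose GP posterior has mean `mu` and covariance `cov`
    (minimization): E[max(y_best - min_j Y_j, 0)]."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mu, cov, size=n_samples)
    improvement = np.maximum(y_best - samples.min(axis=1), 0.0)
    return improvement.mean()

# Assumed posterior for a batch of q = 3 candidates (illustrative numbers).
mu = np.array([1.0, 1.2, 0.9])
cov = np.array([[0.20, 0.05, 0.01],
                [0.05, 0.25, 0.02],
                [0.01, 0.02, 0.15]])
print(q_ei_monte_carlo(mu, cov, y_best=1.1))
```

Note that the joint covariance is what lets the criterion penalize candidates that lie close together, in line with the argument above.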

Multi-objective Infill Model-Based Optimization

In their Multi-objective Infill Model-Based Optimization (MOI), Bischl et al. [26] do not define specific trade-offs between exploration and exploitation like Ursem did with IPI. Instead, they treat the selection of multiple parallel candidate solutions as a multi-objective optimization problem. The predicted value (the quality) of a candidate is just one of multiple independent objectives in that optimization. The Kriging surrogate uncertainty, as well as different distance measures between the proposed candidates, are included as objectives to increase the diversity of the candidate solutions. The final proposed candidates for the expensive evaluation on f_e are chosen from the Pareto front found by the inner optimizer A_inner on the MOI criterion.

ε-Shotgun

Wessing et al. argue that often more exploitation, rather than prolonged exploration, is beneficial for efficient SBO. De Ath et al. [27] follow up on this line of argumentation by creating a mostly exploitative parallel SBO infill criterion: the ε-shotgun. The search is started by selecting a single candidate solution on the surrogate. For this purpose, they optimize the PV (predicted value) criterion, or, with a small probability ε, they select a candidate meant to explore the search space. The second, so-called shotgun stage involves selecting additional sample points around the single chosen candidate. With the argument that the surrogate is slightly inaccurate at any point in the optimization, random samples are drawn in the local vicinity of the candidate. The deviation of the random samples is adapted based on the activity that the Kriging surrogate estimates for that region.
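A loose, heavily simplified approximation of this idea (not the authors' implementation) is sketched below: one anchor point is chosen on a stand-in predicted-value landscape, or, with probability ε, at random, and the remaining batch members are scattered around it with a fixed spread, whereas De Ath et al. adapt the spread to the local model activity.

```python
import numpy as np

def eps_shotgun_batch(predicted_value, bounds, batch_size, eps=0.1,
                      spread=0.05, seed=0):
    """Simplified epsilon-shotgun-style batch: pick one anchor point
    (best predicted value, or a random explorative point with probability
    eps), then scatter the remaining points around it."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds[0]), np.asarray(bounds[1])
    # Crude anchor search: best of many random probes on the surrogate.
    probes = rng.uniform(lo, hi, size=(2048, len(lo)))
    anchor = (rng.uniform(lo, hi) if rng.random() < eps
              else probes[np.argmin([predicted_value(p) for p in probes])])
    shots = anchor + rng.normal(0.0, spread, size=(batch_size - 1, len(lo)))
    return np.clip(np.vstack([anchor, shots]), lo, hi)

pv = lambda x: float(np.sum((x - 0.3) ** 2))  # stand-in for a surrogate prediction
print(eps_shotgun_batch(pv, (np.zeros(2), np.ones(2)), batch_size=4))
```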

Synchronous Versus Asynchronous Optimization

All the previously covered infill criteria are examples of synchronous parallel optimization. That is, we assume that each expensive evaluation of f_e takes roughly the same amount of time. A set of candidate solutions is generated by some infill criterion and evaluated in parallel. The optimization algorithm waits until all evaluations are finished (synchronizes) and then proceeds to generate and evaluate the next set of candidate solutions. This thesis specializes only in these synchronous scenarios, i.e., we assume a roughly similar evaluation time for all candidates of a given batch. If there are significant differences in the evaluation times, e.g., one evaluation takes multiple times longer than another one, then synchronous approaches can be very inefficient. A synchronous approach would waste a considerable amount of runtime waiting for maybe just a single evaluation that took longer than the others of the same batch. However, there are approaches that specialize in asynchronous parallel SBO, such as the work by Richter et al. [28]. They use surrogates not only to estimate the quality of a candidate solution but also to estimate its evaluation time. This is then used in a scheduling approach to efficiently use all computational resources at all times.

References


1. M.D. McKay, R.J. Beckman, W.J. Conover, Comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21(2), 239–245 (1979)
2. A. Forrester, A. Keane et al., Engineering Design via Surrogate Modelling: A Practical Guide (Wiley, New York, 2008)
3. F. Rehbach, M. Zaefferer, B. Naujoks, T. Bartz-Beielstein, Expected improvement versus predicted value in surrogate-based optimization, in Proceedings of the 2020 Genetic and Evolutionary Computation Conference, GECCO '20 (Association for Computing Machinery, 2020), pp. 868–876
4. D.R. Jones, M. Schonlau, W.J. Welch, Efficient global optimization of expensive black-box functions. J. Glob. Optim. 13(4), 455–492 (1998)
5. A.D. Bull, Convergence rates of efficient global optimization algorithms. J. Mach. Learn. Res. 12, 2879–2904 (2011)
6. S. Wessing, M. Preuss, The true destination of EGO is multi-local optimization, in 2017 IEEE Latin American Conference on Computational Intelligence (LA-CCI) (IEEE, 2017)
7. T. Bartz-Beielstein, J. Branke, J. Mehnen, O. Mersmann, Evolutionary algorithms. Wiley Interdiscip. Rev.: Data Mining Knowl. Discov. 4(3), 178–195 (2014)
8. L.J. Fogel, Artificial intelligence through a simulation of evolution, in Proceedings of the 2nd Cybernetics Science Symposium (1965)
9. J.H. Holland, Genetic algorithms and the optimal allocation of trials. SIAM J. Comput. 2(2), 88–105 (1973)
10. I. Rechenberg, Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. PhD thesis, Technische Universität, Fakultät für Maschinenwissenschaft (1970)
11. H.-P. Schwefel, Evolutionsstrategien für die numerische Optimierung, in Numerische Optimierung von Computer-Modellen mittels der Evolutionsstrategie (Springer, 1977), pp. 123–176
12. J.R. Koza, Genetic programming as a means for programming computers by natural selection. Stat. Comput. 4(2), 87–112 (1994)
13. N. Hansen, A. Ostermeier, Adapting arbitrary normal mutation distributions in evolution strategies: the covariance matrix adaptation, in Proceedings of IEEE International Conference on Evolutionary Computation (IEEE, 1996), pp. 312–317
14. E. Alba, J.M. Troya et al., A survey of parallel distributed genetic algorithms. Complexity 4(4), 31–52 (1999)
15. R.T. Haftka, D. Villanueva, A. Chaudhuri, Parallel surrogate-assisted global optimization with expensive functions: a survey. Struct. Multidiscip. Optim. 54(1), 3–13 (2016)
16. F. Rehbach, M. Zaefferer, A. Fischbach, G. Rudolph, T. Bartz-Beielstein, Benchmark-driven configuration of a parallel model-based optimization algorithm. IEEE Trans. Evol. Comput. 26(6), 1365–1379 (2022)
17. M. Emmerich, Single- and multi-objective evolutionary design optimization assisted by Gaussian random field metamodels. Ph.D. thesis, University of Dortmund (2005)
18. D. Nguyen-Tuong, M. Seeger, J. Peters, Model learning with local Gaussian process regression. Adv. Robot. 23(15), 2015–2034 (2009)
19. B. van Stein, H. Wang, W. Kowalczyk, T. Bäck, M. Emmerich, Optimally weighted cluster Kriging for big data regression, in International Symposium on Intelligent Data Analysis (Springer, 2015), pp. 310–321
20. R.B. Gramacy, H.K.H. Lee, Bayesian treed Gaussian process models with an application to computer modeling. J. Amer. Stat. Assoc. 103(483), 1119–1130 (2008)
21. A. Sóbester, S.J. Leary, A.J. Keane, A parallel updating scheme for approximating and optimizing high fidelity computer simulations. Struct. Multidiscip. Optim. 27(5), 371–383 (2004)
22. F. Hutter, H.H. Hoos, K. Leyton-Brown, Sequential model-based optimization for general algorithm configuration, in International Conference on Learning and Intelligent Optimization (Springer, 2011), pp. 507–523
23. R.K. Ursem, From Expected Improvement to Investment Portfolio Improvement: Spreading the Risk in Kriging-Based Optimization (Springer International Publishing, Cham, 2014), pp. 362–372
24. M. Schonlau, Computer experiments and global optimization (1997)
25. D. Ginsbourger, R. Le Riche, L. Carraro, Kriging is well-suited to parallelize optimization, in Computational Intelligence in Expensive Optimization Problems (Springer, 2010), pp. 131–162
26. B. Bischl, S. Wessing, N. Bauer, K. Friedrichs, C. Weihs, MOI-MBO: multiobjective infill for parallel model-based optimization, in International Conference on Learning and Intelligent Optimization (Springer, 2014), pp. 173–186
27. G. De Ath, R.M. Everson, A.A. Rahat, J.E. Fieldsend, Greed is good: exploration and exploitation trade-offs in Bayesian optimisation. ACM Trans. Evol. Learn. Optim. 1(1), 1–22 (2021)
28. J. Richter, H. Kotthaus, B. Bischl, P. Marwedel, J. Rahnenführer, M. Lang, Faster model-based optimization through resource-aware scheduling strategies, in Learning and Intelligent Optimization, ed. by P. Festa, M. Sellmann, J. Vanschoren (Springer International Publishing, Cham, 2016), pp. 267–273

Chapter 3

Methods/Contributions

The following chapter presents the research contributions of this work. It focuses on advancing techniques for the parallel optimization of expensive-to-evaluate functions. Rigorous methods for analyzing and assessing algorithm performance are required before any improvements can be made to optimization algorithms. Benchmarks and well-chosen test functions are essential to gain an unbiased insight into optimization algorithms. Therefore, Sect. 3.1 introduces a benchmarking framework that allows for an in-depth performance analysis of parallel SBO algorithms. Methods for choosing and generating suitable test problems complement this framework in Sect. 3.2. In Sects. 3.3 and 3.4, we analyze the state of the art in parallel SBO. Apart from general algorithm performance, Sect. 3.3 investigates how well existing parallel SBO algorithms scale to larger batch sizes by comparing them to evolutionary algorithms. Section 3.4 analyzes the effect of hyper-parameter configurations on the performance of SBO algorithms. A method for running multiple algorithms with different hyper-parameters in parallel is introduced. In both of these sections, interesting findings regarding the effects of global exploration in SBO are discussed. An in-depth study further investigates these findings by comparing explorative and exploitative infill criteria for SBO in Sect. 3.5. The combined knowledge gained from the presented research leads to a novel model-based hybrid algorithm for expensive optimization problems, which is presented in Sect. 3.6. Finally, Sect. 3.7 researches possibilities for an automated batch size adaptation for parallel algorithms.



3.1 Benchmarking Parallel SBO Algorithms

Parts of the following section are based on “Benchmark-Driven Configuration of a Parallel Model-Based Optimization Algorithm” by Rehbach et al. [1]. Parts of the paper were taken verbatim and included in this section. Major parts were largely rewritten and extended with recommendations and more detailed explanations.

One important tool for developing and understanding optimization algorithms is benchmarking. Benchmarks are studies in which the performance of optimization algorithms is measured. Usually, this is done across various test functions with different properties to judge an algorithm's ability to cope with different kinds of problems. Measuring the exact performance of algorithms in different scenarios is a key aspect of developing new methods and improving the state of the art in optimization. Without proper benchmarking, new algorithms cannot be tested objectively. Ideally, algorithms should be benchmarked on test functions with similar characteristics to those of the application problem that the algorithm is meant to solve [2]. Parallel SBO methods are aimed at solving problems with high costs associated with each evaluation of the objective function f_e. Yet, one problem in developing such parallel SBO algorithms is that even though they are meant to optimize expensive problems, they usually cannot be benchmarked on them, because the high cost of each function evaluation renders extensive studies infeasible. A statistically sound analysis of algorithm results would require too many costly objective function evaluations. Thus, while the benchmarked algorithms aim to solve expensive-to-evaluate objective functions, cheap-to-evaluate functions are necessary to enable in-depth benchmarks. A common strategy to judge an algorithm's performance on expensive problems is to compare each algorithm's best found objective function value, given a certain number of function evaluations (see e.g., [3–5]). The maximum number of allowed function evaluations can be derived from the evaluation cost of the real-world function and the total budget one is willing to spend on the optimization. The expensiveness of the benchmark functions is then emulated by limiting the number of objective function evaluations for each optimizer to the same total budget. For a more detailed discussion on possible quality measures for judging algorithm performance, we recommend the recent survey on benchmarking optimization algorithms by Bartz-Beielstein et al. [6]. However, in many scenarios, the approach of assigning a total budget of objective function evaluations does not yield a fair comparison of parallel SBO algorithms. First of all, parallel SBO algorithms may have different batch sizes (numbers of points proposed per iteration → L3/L4 parallelization). Thus, some algorithms might only evaluate one point per iteration, while others propose many more. Other parallel algorithms might even adapt their batch size during the optimization, i.e., some metric decides at every iteration how many sample points are proposed by the algorithm.


Counting the total number of evaluations is therefore no longer fair, as some algorithms do more evaluations simultaneously than others. Another problem that is special to parallel optimization is that the evaluation cost of the objective function can differ based on the batch size of the optimizer. This happens whenever the cost of an experiment (whether a physical or a computational experiment) changes depending on the number of samples evaluated in parallel. For computational experiments, such as CFD and FEM simulations, this is a typical scenario: L1 parallelization is combined with L3 or L4 parallelization. Thus, the scenario occurs if the objective function is sped up by parallelization and, at the same time, multiple evaluations are done in parallel. Both the speed-up of a computational experiment and running multiple experiments in parallel require CPU resources, which are usually limited by an available machine or cluster with a given number of CPU cores. I.e., if more resources are spent on one level, fewer resources are available for another level. Given a total number of CPU cores in a system (n_T), a number of CPU cores per objective function evaluation (n_L1), and the batch size (b) of an optimizer, the following inequality must hold: n_T ≥ n_L1 × b. If each evaluation is assigned more cores, then the time required to finish each evaluation will be lower. However, at the same time, not as many evaluations can be run in parallel, reducing the batch size of the optimizer. Or, more importantly, the other way around: Increasing the batch size of an optimizer gives fewer resources to the speed-up of every single evaluation. Hence, increasing the batch size of an algorithm can also increase the time required to evaluate the objective function (and thus its cost). In summary, benchmarks for parallel SBO algorithms require cheap-to-evaluate test functions in order to allow for rigorous testing. Yet, these cheap functions do not have this batch-size-dependent cost. A metric that is used to benchmark parallel algorithms needs to incorporate this batch-size-dependent cost into cheap functions. Furthermore, the metric needs to cope with possible changes in batch size during the optimization run. Ideally, we would want to map the changing cost of an expensive parallel problem to cheap-to-evaluate test functions. Only then would detailed benchmarks for parallel SBO become feasible. This leads to the following research question:

RQ-3.1 How to benchmark parallel SBO algorithms for expensive problems in a fair, rigorous, and statistically sound way?

Measuring the efficiency of different batch sizes for the same algorithm would allow for a tuning procedure in which the ideal batch size and thus the ideal balance between L1 and L3 parallelization can be analyzed. The significance of performance changes due to different batch size configurations is shown experimentally in Sect. 3.6.
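The resource constraint can be made tangible with a few lines of Python; for an assumed machine with 32 cores, the snippet enumerates all (n_L1, b) pairs that use every core.

```python
n_T = 32  # total CPU cores available (assumed example value)
# All (cores per evaluation, batch size) pairs that satisfy n_T >= n_L1 * b
# while leaving no core idle.
feasible = [(n_L1, n_T // n_L1) for n_L1 in range(1, n_T + 1) if n_T % n_L1 == 0]
print(feasible)  # [(1, 32), (2, 16), (4, 8), (8, 4), (16, 2), (32, 1)]
```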


3.1.1 A Framework for Parallel SBO Algorithms

In the following, we propose a benchmarking framework that enables a rigorous performance evaluation of parallel model-based optimizers for expensive functions. The framework establishes a relationship between the estimated costs of parallel function evaluations (on real-world problems) and known sets of cheap-to-evaluate test functions. The presented framework is specialized for scenarios in which the cost of each objective function evaluation is affected by the configuration of the applied optimizer. We mainly consider a case where a changed batch size, and thus changed L3 parallelization, leads to changes in the evaluation cost. Yet, the framework can also be applied to problems where other configurations change the cost, or even to problems where the cost of the objective function is static.

Example 3.1 Consider a system with 32 cores for our optimization. If we were to choose a batch size of eight, then this would directly translate into 32/8 = 4 cores to be used per evaluation. Whereas a smaller batch size of, let's say, four would lead to 32/4 = 8 cores that can be used per evaluation. Thus, a smaller batch size is associated with a lower wall-clock cost for every single evaluation since more resources can be used to speed up said evaluation.

In summary, we do not aim to only map the evaluation cost of an expensive function, e.g., by assigning a fixed cost to each evaluation or limiting the total evaluation budget. Rather, we want to reproduce a function's characteristic behavior of how this cost scales, dependent on how many evaluations are done in parallel. We seek a method that allows us to rigorously benchmark an algorithm on cheap functions while estimating the cost that the same evaluations would have created on an expensive parallelized problem. We employ a two-step approach to fairly judge the parallel efficiency of an optimizer together with a specific objective function. Step one requires a runtime measurement of the original expensive function. These measurements can be collected by evaluating a few samples of the problem, for example a simulation, and measuring their evaluation runtimes. Each simulation is run with a different number of cores to observe the runtime scaling. Based on these evaluations, it is possible to estimate a function f_t(n_L1) describing the amount of time needed per evaluation, given a certain number of cores for L1 parallelization: n_L1. The trial simulations used for the measurements can also already be part of the initial design. In that case, they would not require any additional budget taken away from the optimization run itself. To measure the efficiency of all possible trade-offs between L1 and L3 parallelization, essentially all possible numbers of CPU cores per objective function evaluation could be recorded. However, many combinations are much less efficient and can already be ruled out by process knowledge. Therefore, only a few measurements on the expensive function should be necessary. To give just one example: If the total number of available CPU cores in a system is not a multiple of the number of cores used per simulation, then some cores are left unused, and thus the overall efficiency drops.


If the runtime between repeats of the same evaluation varies slightly, then the measurements can be reported as the mean runtime of the repeats. However, this approach is not meant to cope with large runtime deviations in repeated function evaluations with the same number of CPU cores. If the evaluation time varies largely, an asynchronous evaluation scheme such as the one presented by Kotthaus et al. [7] is likely better suited. Knowing the execution times f_t(n_L1) of an expensive objective function allows for step two: scaling the measured costs and applying them to cheap functions. The empirically determined runtimes can be translated into 'costs' for cheap-to-evaluate functions. For this purpose, we again first have to consider the link between the batch size and the number of CPU cores used per evaluation. Finally, the translation from an execution time f_t to an evaluation cost f_c, given a certain batch size b, is described in Eq. 3.1. First, the execution time with n_L1 = n_T / b cores is observed. This time is then scaled by the base evaluation time that an evaluation with a single core would take, f_t(1):

f_c(b) = f_t(n_T / b) / f_t(1)     (3.1)

To give just one example of this scaling procedure, consider the following scenario.

Example 3.2 Assume that an objective function evaluation with a single core takes five hours of computation time. Given a total of 32 cores in the system, the algorithm could be configured with a batch size of 32 and would use exactly one unit of budget per iteration. For this purpose, reconsider Fig. 3.1, which was first shown in Sect. 2.3.1. Following the graph of Fig. 3.1, if we were to use four cores per evaluation, we would gain a factor-3.5 evaluation speedup. The objective function evaluation would then only require roughly 1.4 h and, therefore, only 0.28 “units of budget” per iteration. Yet, we would not be able to evaluate 32 points in parallel, but instead only 8.

In summary, algorithms are no longer given an iteration budget in the proposed framework but instead a budget relative to a base evaluation time. This allows for easy back-and-forth calculations to estimate real-time performance, even for cheap-to-evaluate functions. Furthermore, it allows plotting the performance of multiple algorithms with different batch sizes on a unified axis, comparing their estimated total runtimes.
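A minimal Python sketch of this budget accounting, following Eq. (3.1) and Example 3.2: the measured runtimes f_t are illustrative assumptions (the 32-core entry in particular is invented for the example) rather than values reported in this work.

```python
# Measured runtimes f_t(n_L1) in hours for the expensive simulation.
# Values follow Example 3.2 / Fig. 3.1: 1 core takes 5 h, 4 cores give a
# factor-3.5 speedup; the 32-core speedup of 12 is an assumption.
f_t = {1: 5.0, 4: 5.0 / 3.5, 32: 5.0 / 12.0}

def evaluation_cost(batch_size, n_total=32):
    """Eq. (3.1): cost of one parallel evaluation relative to the
    single-core runtime, given the cores left per evaluation."""
    n_l1 = n_total // batch_size
    return f_t[n_l1] / f_t[1]

print(evaluation_cost(batch_size=8))   # ~0.286 budget units per iteration
print(evaluation_cost(batch_size=1))   # full 32 cores per evaluation
```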


Fig. 3.1 Example decrease in measured speedup (black curve) versus relative parallel efficiency (red curve) for an increasing number of CPU cores (efficiency is defined as speedup/used number of cores). Numbers are taken from example measurements of a small CFD simulation and will largely differ for other problems. Figure based on [1, p. 7]

3.1.2 Conclusions

To conclude this section, we would like to reconsider the initially posed research question:

RQ-3.1 How to benchmark parallel SBO algorithms for expensive problems in a fair, rigorous, and statistically sound way?

The discussed framework maps measured evaluation costs of expensive functions to cheap test functions. It is suited to deliver exact performance measurements for a given combination of available computational resources and the objective function at hand. Instead of only benchmarking algorithms with single, specific batch sizes, the framework allows directly measuring the combined system performance of an algorithm and its function-evaluation back end. The ability to rigorously judge the performance of an algorithm on a variety of landscapes before applying it to an expensive-to-evaluate function gives practitioners more informed choices without wasting costly evaluations. The procedure allows for a rigorous performance analysis of algorithms as close to the actual evaluation case as possible. Furthermore, the described cheap benchmarks allow for parameter tuning of parallel SBO algorithms. For example, changing the batch size of an algorithm can drastically change the overall optimization efficiency. A proper balance between the multiple levels of parallelization (L1, L3, L4) is crucial for performant optimization. Yet, tuning the batch size of an algorithm for a specific objective function was previously infeasible due to the high cost per evaluation. Estimating total optimization runtimes for parallel optimizers on cheap functions is an important milestone, as it allows for realistic parameter tuning.


The described methodology is used experimentally in Sects. 3.6 and 3.7.

3.2 Test Problems

Parts of the following section are based on “Continuous optimization benchmarks by simulation” by Zaefferer and Rehbach [8]. The content of the paper was extended with an additional overview of existing suites of test functions and an introduction to a suite of real-world test problems [9], to which the author contributed the electrostatic precipitator problem, which is also covered as a case study in Sect. 4.1. For the introduction of simulation-based test functions, major parts of the paper were taken verbatim and included in this section. Parts of it were rewritten, extended, and restructured.

In the same way that benchmarks are essential to the development and analysis of optimization algorithms, test problems are essential to any benchmark. The chosen test problems can drastically change the results of a benchmark. Each function can add bias towards some algorithms if the function is not representative of the real-world task at hand. Ideally, a 'representative' test function shares as many landscape properties, e.g., number of local optima, separability, global structure, etc., with the real optimization problem as possible. That is why, in recent years, much research has gone into devising suitable strategies and suites of functions for benchmarking. Many well-documented function sets exist, each including a vast set of functions with distinct landscape properties. To give just two examples, consider the Black-Box Optimization Benchmarking (BBOB) suite [10] and the CEC competition test function set [11]. The noiseless single-objective BBOB suite contains 24 test functions. From each of these functions, random problem instances can be created. Instances are shifted, stretched, or rotated versions of the original function class. Furthermore, the problems can be scaled in dimensionality, and all optima of the functions are known. A more detailed description can be found in [12]. Similar to the BBOB suite, the CEC function set contains shiftable and scalable test functions with known global optima. The total of 28 functions is regularly updated in the form of the yearly CEC benchmark competition [11]. A very detailed list of additional benchmark suites is given by Bartz-Beielstein et al. [6]. Most of these available suites have in common that they contain manually crafted and hand-picked functions that are supposed to represent specific artificial landscape criteria. Benchmarking across such a set of functions has many benefits. For example, algorithm performance can be analyzed when faced with certain landscape properties like multi-modal search spaces. In-depth benchmarks become possible because experiment results can be statistically analyzed across many repeats.


Comparisons are easier because optima are known. However, it is hard to argue that results measured on these artificial functions can directly be translated to results on real-world problems. In the end, the chosen functions might or might not correctly represent the behavior of the real problem. Algorithm choices or even parameter tuning done on a set of artificial functions is likely not going to perform as well on the real-world problem. Therefore, we pose:

RQ-3.2 How to create or choose more representative test functions?

Real-World Benchmarks

Note 1 The expensiveness (cost) of a function is not a necessary or even a desired problem property that test functions need to imitate. Test functions are not more or less similar to a real-world problem because of their cost. Important in choosing a test problem are only the problem's landscape characteristics. Any cheap artificial function could be coupled with a piece of code that waits for, e.g., an hour, making it an expensive problem. Yet, that would not change the results of any optimizer. It would just hinder proper benchmarking. Therefore, the search for representative test problems should not consider the cost, apart from it being low enough to enable in-depth studies.

One idea for choosing representative test problems is finding problems in the same industrial or application domain. For example, if the expensive problem at hand is a CFD simulation, it makes sense to try and find a similar but possibly cheap(er)-to-evaluate simulation. Often, simulations can also be run in a less accurate configuration, which largely reduces the runtime but might still yield a very similar problem landscape. Sadly, finding a set of openly accessible real-world problems for benchmarking is not easy. Many applications such as simulators are not published as they contain proprietary company knowledge. Other applications are public but require paid software, complicated setups, or specific operating systems. Testing algorithms across multiple problems is rendered infeasible by such overhead. An easily applicable suite of relevant problems is required to enable benchmarking in the field of parallel SBO. One notable attempt at creating such a suite was published by Daniels et al. [9]. They published a set of continuous shape optimization problems using CFD simulations. All problems run in the same open-source software stack under Linux. The author of this thesis made two contributions to the test problem suite. Firstly, the so-called electrostatic precipitator problem [3] was added to the problem suite. The electrostatic precipitator problem is covered in more detail as a case study in Sect. 4.1. Furthermore, a Docker [13, 14] based interface was developed. The interface allows the execution of the published test problems on the most commonly used operating systems without the installation or setup of any CFD software.


Yet, all this does not fully solve the main issues faced. More test problems from different fields are required. Additionally, even the relatively fast-to-execute CFD problems in the suite of Daniels et al. have runtimes that make most in-depth benchmarks infeasible.

3.2.1 Simulation Based Functions

Sometimes, real-world problems may not be available in terms of functions but only as data (i.e., observations from previous experiments). This can be due to real objective function evaluations being too costly or not accessible (in terms of software or equipment). In those cases, using a data-driven approach may be a viable alternative: surrogate models can be trained and subsequently used to benchmark algorithms. The intent is not to replace artificial benchmarks such as BBOB (which have their own advantages), but rather to augment them with problems that have a closer connection to real-world problems. This approach has been considered in previous studies [15–20]. As pointed out by Zaefferer et al. [21], surrogate-based benchmarks face a crucial issue: the employed machine learning models may smooth the training data, especially if the training data is sparse. Hence, these models are prone to produce optimization problems that lack the ruggedness and difficulty of the underlying real-world problems. Thus, algorithm performances may be overrated, and comparisons become biased. Focusing on a discrete optimization problem from the field of computational biology, Zaefferer et al. proposed to address this issue via simulation with Kriging models. Usually, the task of a surrogate is to give an estimate of a function's value at an unknown location. We refer to these Kriging models as estimation (or prediction) models. In contrast to estimation with Kriging models, the response of a simulation retains the covariance properties determined by the model [22]. A simulation does not aim to exactly predict the function value at a certain location; rather, the goal of a simulation is to produce values whose moments are as close to those of the real data as possible. This is achieved by creating realizations of a Gaussian process with the same mean and covariances as the modeled process [21]. The approach used by Zaefferer et al. relies on the selection of a set of simulation samples [21]. The simulation is evaluated only at these sample locations. The simulation samples are distinct from, and less sparse than, the observed training samples of the expensive function. If a candidate solution outside of the set of simulation samples needs to be evaluated, interpolation needs to be used, which might introduce undesirable smoothness. Yet, in discrete search spaces, often all samples in the search space can be simulated. This made the approach well suited for combinatorial optimization problems but not for continuous problems. A simulation method for continuous problems is required to generate adequate test functions in that domain.


Simulation by the Spectral Method

Following up on [21], we investigate a different simulation approach that is well suited for continuous optimization problems: the spectral method [23]. This approach directly generates a function that can be evaluated at arbitrary locations, without interpolation. It yields a superposition of cosine functions [23],

f_s(x) = σ̂ √(2/N) Σ_{v=1}^{N} cos(ω_v · x + φ_v),

with φ_v being i.i.d. uniform random samples from the interval [−π, π]. Here, σ̂ is a model parameter determined by Maximum Likelihood Estimation (MLE). The sampling of ω_v requires the spectral density function of the Kriging model's kernel [22, 23]. That is, ω_v ∈ R^n are i.i.d. random samples from a distribution with that same density. A frequently chosen kernel for Kriging is

k(x, x′) = exp(−Σ_{i=1}^{n} θ_i |x_i − x′_i|²).     (3.2)

The respective distribution for the i-th dimension of the kernel in Eq. (3.2) is the normal distribution with zero mean and variance 2θ_i. A simulation conditioned on the training data can be generated with f_sc(x) = f_s(x) + ŷ*(x) [23], where ŷ*(x) = μ̂ + k^T K^{−1}(y_sc − 1μ̂) is an adaptation of the standard Kriging predictor

ŷ(x) = μ̂ + k^T K^{−1}(y − 1μ̂),     (3.3)

with the training observations y replaced by y_sc, and μ̂ = 0. Here, y_sc are the unconditionally simulated values at the training samples, that is, y_sc,j = f_s(x_j). To employ these simulations in a benchmarking context, we follow the approach by Zaefferer et al. [21]. First, a data set is created by evaluating the true underlying problem (if not already available in the form of historical data). Then, a Kriging model is trained with that data. Afterwards, the spectral method is used to generate unconditional simulations. Unconditional simulations do not necessarily reproduce the values of the original training data, even at the same sample locations. In an additional conditioning step, conditional simulations can be generated out of unconditional ones; these then reproduce the same function values at already evaluated locations. These simulations are finally used as test functions for optimization algorithms. Here, the advantage of simulation over estimation is the ability to reproduce the topology of functions, rather than predicting a single, isolated value. As an illustration, let us consider an example for n = 1, where the ground-truth is f(x) = sin(33x) + sin(49x − 0.5) + x. A Kriging model is trained with the samples X = {0.13, 0.6, 0.62, 0.67, 0.75, 0.79, 0.8, 0.86, 0.9, 0.95, 0.98}. The resulting estimation and conditional simulation of the Kriging model are presented in Fig. 3.2.



Fig. 3.2 Example illustration of creating a simulation-based test function. Top: Ground-truth f(x) (dashed line), training data (circles), and Kriging model estimation (gray solid line). Bottom: Three instances of a conditional simulation (same model). The three different instances are generated by re-sampling of ω_v and φ_v. Different from the estimation model, the simulations show function activity in the unsampled regions. Image taken from [8, p. 5]

This example shows how estimation might be unsuited to reproduce an optimization algorithm's behavior. In the sparsely sampled region, the Kriging estimation is close to constant, considerably reducing the number of local optima in the (estimated) search landscape. The number of optima in the simulated search landscapes is considerably larger.
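Under the stated assumptions (Gaussian kernel of Eq. (3.2) with the spectral densities given above), an unconditional spectral simulation can be sketched in a few lines of Python. The kernel parameters below are placeholders that would normally come from a Kriging model fitted by MLE, and the conditioning step based on Eq. (3.3) is omitted.

```python
import numpy as np

def spectral_simulation(theta, sigma2, n_cos, dim, seed=0):
    """Unconditional Kriging simulation via the spectral method for the
    Gaussian kernel of Eq. (3.2): returns f_s(x) as a superposition of
    n_cos cosine terms with omega_v ~ N(0, 2*theta_i) and phi_v ~ U(-pi, pi)."""
    rng = np.random.default_rng(seed)
    omega = rng.normal(0.0, np.sqrt(2.0 * np.asarray(theta)), size=(n_cos, dim))
    phi = rng.uniform(-np.pi, np.pi, size=n_cos)

    def f_s(x):
        return np.sqrt(sigma2) * np.sqrt(2.0 / n_cos) * np.sum(
            np.cos(omega @ np.asarray(x) + phi))
    return f_s

# One simulated 2D test function (theta and sigma^2 are assumed placeholder
# values, not fitted model parameters).
f_sim = spectral_simulation(theta=[5.0, 5.0], sigma2=1.0, n_cos=200, dim=2)
print(f_sim([0.4, 0.7]))
```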

3.2.2 Experiments

In the following, we describe an experiment that compares test functions produced by estimation and simulation with Kriging. A set of objective functions is required as a ground-truth for our experiments. In practice, the ground-truth would be a real-world optimization problem. Yet, a real-world case would limit the comparability, understandability, and extent of the experimental investigation. We want to understand where and why our emulation deviates from the ground-truth. This situation reflects the need for model-based benchmarks. We chose the well-established single-objective, noiseless BBOB suite from the COCO framework [10, 24]. Important landscape features of the BBOB suite are known (e.g., modality, symmetry/periodicity, separability), which enables us to understand and explain where Kriging models fail. For each function, usually 15 randomized instances are produced. We followed the same convention. In addition, all test functions are scalable in terms of search space dimensionality n. We performed our experiments with n = 2, 3, 5, 10, 20.


Generating the Training Data

We generated training data by running an optimization algorithm on the original problem. The data observed during the optimization run was used to train the Kriging model. This imitates a common scenario that occurs in real-world applications: Some algorithm has already been run on the real-world problem, and the data from that preliminary experiment provides the basis for benchmarking. Moreover, this approach allows us to determine the behavior of the problem on a local and global scale. An optimization algorithm (especially a population-based evolutionary algorithm) will explore the objective function globally as well as perform smaller, local search steps. Specifically, we generated our training data as follows: For each BBOB function (1, ..., 24) and each function instance (1, ..., 15), we ran a variant of Differential Evolution (DE) [25] with 50n function evaluations and a population size of 20n. We used the implementation from the DEoptim R-package, with default configuration [26]. This choice is arbitrary. Other population-based algorithms would be equally suited for our purposes.

Generating the Model

We selected the 50n data samples provided by the DE runs. Based on that data, we trained a Kriging model using the SPOT R-package [27]. Three non-default parameters of the model were specified: useLambda=FALSE (no nugget effect, see also [28]), thetaLower=1e-6 (lower bound on θ_i), and thetaUpper=1e12 (upper bound on θ_i). For the spectral simulation, we used N = 100n cosine functions. We only created conditional simulations, to reflect each BBOB instance as closely as possible. Scenarios where an unconditional simulation is preferable have been discussed by Zaefferer et al. [21]. Notably, in all experiments, the algorithms were run with different seeds than the initial DE runs with which the model was built. This ensures that the DE algorithm cannot reproduce the same search path along already known points from the training data.

Testing the Algorithms

We tested three algorithms:
• DE: As a global search strategy, we selected DE. We tested the same DE variant as mentioned in Sect. 3.2.2, but with a population size of 10n. All other algorithm parameters remained at default values.
• NM: As a classical local search strategy, the Nelder-Mead (NM) simplex algorithm was selected [29]. We employed the implementation from the R-package nloptr [30]. All algorithm parameters remained at default values.


• RS: We also employ a simple uniform Random Search (RS) algorithm, which evaluates the objective function with i.i.d. samples from a uniform random distribution.
This selection was to some extent arbitrary. The intent was not to investigate these specific algorithms. Rather, we selected these algorithms to observe a range of different convergence behaviors. Essentially, we made a selection that scales from very explorative (RS), to balanced exploration/exploitation (DE), to very exploitative (NM). All three algorithms receive the same test instances and initial random number generator seeds. For each test instance, each algorithm uses 1000n function evaluations. Overall, each algorithm was run on each instance, function, and dimension of the BBOB test suite (24 × 15 × 5 = 1800 runs, each run with 1000n evaluations). Additionally, each of these runs was repeated with an estimation-based test function, and with a simulation-based test function.
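As a rough illustration of this setup, the following R sketch runs the three algorithm types on a cheap stand-in objective. It is not the original benchmark code; the sphere function stands in for a ground-truth, estimation-based, or simulation-based test function.

library(DEoptim)
library(nloptr)

n <- 2
budget <- 1000 * n
f <- function(x) sum(x^2)                # stand-in for one test function instance
lower <- rep(-5, n); upper <- rep(5, n)

# DE with population size 10n
de <- DEoptim(f, lower, upper,
              DEoptim.control(NP = 10 * n, itermax = budget / (10 * n), trace = FALSE))

# Nelder-Mead local search from a random start point
nm <- neldermead(x0 = runif(n, lower, upper), fn = f, lower = lower, upper = upper)

# uniform random search
rs <- matrix(runif(budget * n, lower, upper), ncol = n, byrow = TRUE)
rsBest <- min(apply(rs, 1, f))

For the performance curves used below, the best-so-far objective value would additionally be recorded after every single evaluation of each run.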

3.2.3 Results

Quality Measure Our aim is to measure how well algorithm performance is reproduced by the test functions. One option would be to measure the error as the difference of observed values along the search path compared to the ground-truth values at those same locations. But this error is actually irrelevant for quantifying a good test function. To explain this, let us assume that the ground-truth is f(x) = x², and two test functions are f_t1(x) = (x − 1)² and f_t2(x) = 0.5, with x ∈ [0, 1]. Clearly, f_t1 is a reasonable oracle for most algorithms' performance (e.g., in terms of convergence speed), while f_t2 is not. Yet, the mean squared error of f_t1 would usually be larger than the error of f_t2. The error on f_t1 even increases when an algorithm approaches the optimum. Instead, we measured the error on what we refer to as performance curves: For each algorithm run, the best-observed function values after each objective function evaluation were recorded (on all test instances, including estimation, simulation, and ground-truth). The resulting performance values were scaled to values between zero and one, for each problem instance (i.e., each BBOB function, each BBOB instance, each dimension, and also separately for the ground-truth, estimation, and simulation variants). This yielded what we term the scaled performance. Each algorithm run is started with the same random starting conditions, making sure that each scaled value starts off from the same baseline. The error of the scaled performance was then calculated as the absolute deviation of the performance on the model-based functions, compared to the performance on the ground-truth problem.
Example 3.3 Let us assume that DE achieved a (scaled) function value of 0.25 on the ground-truth after 200 objective function evaluations, but the same algorithm only achieved 0.34 on the estimation-based test function after 200 evaluations. Then, the error of the estimation-based run is |0.34 − 0.25| = 0.09 (after 200 evaluations).
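A compact R illustration of this quality measure, using made-up best-so-far traces rather than real experiment data:

# best-so-far traces of one algorithm on the ground-truth and on a model-based test function
bestGroundTruth <- c(10, 6, 4, 4, 3, 2.5)
bestModelBased  <- c(12, 9, 9, 7, 5, 5)

# scale each trace to [0, 1] separately, as done for the scaled performance
scale01 <- function(v) (v - min(v)) / (max(v) - min(v))

# error = absolute deviation between the scaled curves, per evaluation
err <- abs(scale01(bestModelBased) - scale01(bestGroundTruth))
round(err, 2)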


Observations In Fig. 3.3, we show the resulting errors over run time for a subset of the 24 BBOB functions. The full set of all plots is available through the supplementary material of [8]. Here, we only show the error for the DE and NM algorithms. Similar patterns

Fig. 3.3 The error of algorithm performance with simulation- and estimation-based test functions. The curves are based on the performance of DE runs (left) and NM runs (right) on test functions, compared against the performance values on the ground-truth, i.e., the respective BBOB functions. Labels on the right-hand side specify respective IDs of the BBOB functions. Top-side labels indicate dimensionality. Lines indicate the median, the colored areas indicate the first and third quartile. Statistics are calculated over the 15 instances for each BBOB function. Image based on [8, p. 9]


are visible in the omitted curves for RS. We also omit the curves for n = 3, which closely mirror those for n = 2. For the simulation, we mostly observe decreasing or constant errors over time. The decrease can be explained by the algorithms' tendency to find the best values for the respective problem instance later on in the run, regardless of the objective function. Earlier, the difference is usually larger. For the estimation-based test functions, the error often increases, and sometimes decreases again during the later stages of a run. When comparing estimation and simulation, the modality and the dimensionality n are important. For low-dimensional unimodal BBOB functions (n = 2, 3, 5, and function IDs: 1, 2, 5–7, 10–14), the simulation yields larger errors than estimation. In most of the multimodal cases, the simulation seems to perform equally well or better. This can be explained: The larger activity of the simulation-based functions may occasionally introduce additional local optima (turning unimodal into multimodal problems). The estimation is more likely to reproduce the single valley of the ground-truth. Conversely, the simulation excels in the multimodal cases because it does not remove optima by interpolation. For higher-dimensional cases (n = 10, 20), this situation changes: the simulation produces lower errors, regardless of modality. This is explained by the increasing sparseness of the training data, which, in the case of estimation, will frequently lead to landscapes of mostly constant values, with the exception of very small areas close to the training data. As noted earlier, results between DE, RS, and NM exhibit similar patterns. There is an exception, where the results between DE, NM, and RS differ more strongly: BBOB functions 21 and 22. Hence, Fig. 3.4 shows results for n = 20 with function 21 (22 is nearly identical). Here, each plot shows the error for a different algorithm. Coincidentally, this includes the only case where estimation performs considerably better than simulation for large n (with algorithm RS only). The reason is


Fig. 3.4 This is the same plot-type as presented in Fig. 3.3, but limited to BBOB function 21 and n = 20. While Fig. 3.3 only shows results for DE, this figure compares results for each tested algorithm (DE, NM, RS) for this specific function and dimensionality. Image taken from [8, p. 11]


not perfectly clear. One possibility is a particularly poor model quality. BBOB functions 21 and 22 are both based on a mixture of Gaussian components. They exhibit a peculiar, localized non-stationarity that is problematic for Kriging. The activity of the function may abruptly change direction, depending on the closest Gaussian component.

3.2.4 Conclusions

The results show that the model-based test functions will occasionally deviate considerably from the ground-truth. This has various reasons.
• Dimensionality: Clearly, all models are negatively affected by a rising dimensionality in the data. With 10 or more variables, it becomes increasingly difficult to learn the shape of the real function with a limited number of training samples. Necessarily, this limits how much we can achieve. Despite its more robust performance, simulation also relies on a well-trained model.
• Continuity: Our Kriging model, or more specifically its kernel, works best if the ground-truth is continuous, i.e., lim_{x→x'} f(x) = f(x'). Otherwise, model quality decreases. One example for this is the step ellipsoidal function (BBOB function 7).
• Regularity/Periodicity: Several functions in the BBOB set have some form of regular, symmetric, or periodic behavior. One classical example is the Rastrigin function (e.g., BBOB functions 3, 4, and 15). While our models seemed to work well for these functions, their regularity, symmetry, or periodicity is not reproduced. With the given models, this would require a much larger number of training samples. If such behavior is important (and known a priori), a solution may be to choose a new kernel that, e.g., is itself periodic. This requires that the respective spectral measure of the new kernel is known, to enable the spectral simulation method [23].
• Extremely fine-grained local structure: Some functions, such as Schaffer's F7 (BBOB functions 17, 18), have an extremely fine-grained local structure. This structure will quickly go beyond even a good model's ability to reproduce accurately. This will be true even for fairly low-dimensional cases. While there is no easy way out, our results at least suggest one compensation: Many optimization algorithms will not notice this kind of fine-grained ruggedness. For instance, a mutation operator might easily jump across these local bumps and rather follow the global structure of the function. Hence, an accurate representation of such structures may not be that important in practice, depending on the tested algorithms.
• Number of samples: The number of training data samples is one main driver of the complexity for Kriging, affecting computational time and memory requirements. The mentioned cluster-Kriging approach is one remedy [31].
To draw a final conclusion on this section, we would first like to refer back to the initial research question:
RQ-3.2 How to create or choose more representative test functions?


For the generation of continuous model-based test functions, we investigate the spectral method for Kriging simulation [23]. This method results in a superposition of cosine functions. Our results show that the approach is well suited to generate continuous test problems, where a previous approach by Zaefferer et al. [21] is infeasible due to computational constraints. In cases where previous evaluation data from an expensive problem is available, surrogate models, especially simulations, can generate good alternative test functions. Our experiments provide evidence that simulation-based benchmarks perform considerably better than estimation-based benchmarks. Only for low-dimensional (n ≤ 5), unimodal problems did we observe an advantage for estimation. In practice, if the objective function's modality (and dimensionality) is known, this may help select the appropriate approach. The simulation approach seems to be the more promising choice in a black-box case. Additionally, we described a scenario in which cheaper-to-evaluate real-world problems from a similar domain, or even the same problems at a possibly lowered accuracy, are available; these could also be considered as possible test functions. This scenario is experimentally investigated in Sect. 4.1.

3.3 Why Not Other Parallel Algorithms?

Parts of the following section are based on "Comparison of parallel surrogate-assisted optimization approaches" by Rehbach et al. [3]. The paper was largely restructured for this section. The introduction and conclusions were substantially rewritten. Parts of the paper were taken verbatim and included in this section.

We have initially motivated this thesis in Sect. 1.1 with the efficiency of the well-known and often applied sequential SBO. In many application cases where only a severely limited evaluation budget is available, SBO is superior to other algorithms due to its high sampling efficiency. Yet, before we go into any deeper analysis on how to improve parallel SBO algorithms, we should first and, more importantly, ask: Are SBO algorithms even that efficient in a parallel scenario? There are plenty of other algorithm types that are inherently already parallel. So why not just use them instead of developing new methods for parallelizing a naturally sequential algorithm? To give just one example, evolutionary algorithms are readily available in all kinds of variations. Through their population-based approach, they are easily adaptable to however much parallelization is desired. We, therefore, ask:
RQ-3.3 Why parallel surrogate-based algorithms? What are their benefits and drawbacks?
This section aims to show that parallel SBO algorithms are, in fact, efficient in low-budget scenarios and well suited to parallel optimization. We compare existing SBO


techniques with non-model-based ones. Existing benefits and drawbacks of model-based techniques are analyzed with special regard to how well they scale to larger batch sizes. To show the remaining limitations of existing parallel SBO approaches, we propose a simple hybrid approach combining surrogate-based techniques with an evolutionary algorithm. The algorithm is used as a baseline in a detailed benchmark study to identify existing shortcomings of parallel SBO.

3.3.1 A Hybrid SBO Baseline

In the following, we propose a hybrid algorithm composed of an SBO and an EA. The approach is meant to produce new candidate solutions in two different ways: (1) via standard infill criteria such as the Predicted Value (PV) and Expected Improvement (EI), and (2) via evolutionary operators. Both the PV and the EI infill criterion are well established and yield good results in single-threaded applications [32]. Thus, the idea is that the hybrid approach will employ these two infill criteria and utilize any remaining computational resources for optimization with an EA. The basic concept of this algorithm is visualized in Fig. 3.5. Algorithm 1 presents a pseudo-code for the proposed hybrid algorithm. An initial design is evaluated on the objective function. After that, at each iteration, a surrogate is fitted on all evaluated data. The surrogate is used to propose candidate solutions based on the PV and EI infill criteria. Additional candidate solutions are generated by evolutionary mutation and recombination operators. The combined candidate solutions are evaluated in parallel. The results are collected in a synchronization step, and a new iteration begins. This process is repeated until the evaluation budget is depleted.

Fig. 3.5 A surrogate is run in parallel to a model-free optimization algorithm, here an EA. Both the SBO and the EA propose candidate solutions, which are evaluated on the expensive objective function through a scheduler. The surrogate is updated and fitted on the data from both the EA-proposed and the SBO-proposed candidates. Image based on [3, p. 3]


3.3.2 Experiments

As described in Sect. 3.1, a straightforward way to benchmark parallel algorithms is to measure the best objective function value obtained after a fixed number of algorithm iterations. For each iteration, the number of evaluations possible in parallel is fixed. For this case study, this simple measure is sufficient. The framework for adaptive batch sizes and evaluation costs, which was introduced in Sect. 3.1, is not required here. In this study, we do not intend to measure the best possible batch size or compare algorithms across batch sizes. Each optimization algorithm starts with an initial design of size n, where n is the number of objective function evaluations that are possible in parallel. After the initialization, each algorithm performs 50 iterations. Depending on the value of n, which is varied between 3 and 15, the algorithms can perform a maximum of 150-750 function evaluations. With this algorithm setup, 12 different methods were compared, including a set of baseline comparisons, variations of SBO+EA, IPI, and two versions of q-EI. An overview of all methods and their composition is given in Table 3.1. All algorithms were implemented in R. Each time an EA was applied, the optimEA method of the CEGO [33, 34] package was used. The population size and mutation rate were set to 20 and 0.05, respectively. They were determined to work best in preliminary runs. The other parameters were kept at the package defaults.

Algorithm 1 SBO+EA hybrid. Here, n is the number of initial candidate solutions, design() is a function that produces an initial set of candidate solutions, train() is a procedure to train an adequate surrogate model, and optimize() is an evolutionary algorithm implementation with variation operators eaMutation() and eaRecombination(). The function evalParallel() represents the potentially expensive objective function, which allows for evaluating n solutions simultaneously.
1: function SBO+EA-Sync
2:   X = {x1, x2, ..., xn} = design(n)
3:   y = evalParallel(X)
4:   while budget not exhausted do
5:     model = train(X, y)
6:     xs1 = optimize(PV(model))          ⊳ optimize PV
7:     xs2 = optimize(EI(model))          ⊳ maximize EI
8:     Xea = eaMutation(X) + eaRecombination(X)
9:     Xnew = {Xea, xs1, xs2}
10:    ynew = evalParallel(Xnew)
11:    X = {X, Xnew}
12:    y = {y, ynew}
13:  end while
14: end function
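A minimal, self-contained R sketch of the synchronous loop in Algorithm 1 is given below. It is not the original implementation: the Kriging surrogate, the infill optimizers, and the EA operators are replaced by simple stand-ins (a quadratic regression model, random-candidate screening, and Gaussian mutation), so only the control flow is illustrated, and the model formula is hard-coded for d = 2.

set.seed(1)
d <- 2; nParallel <- 5; budget <- 60
lower <- rep(-5, d); upper <- rep(5, d)
f <- function(x) sum(x^2)                    # stand-in for the expensive objective
evalParallel <- function(X) apply(X, 1, f)   # sequential here; parallel in practice

X <- matrix(runif(nParallel * d, lower, upper), ncol = d, byrow = TRUE)
colnames(X) <- paste0("x", 1:d)
y <- evalParallel(X)

while (length(y) < budget) {
  # train a simple surrogate (stand-in for the Kriging model)
  model <- lm(y ~ poly(x1, 2) + poly(x2, 2), data = data.frame(y = y, X))
  cand <- matrix(runif(2000 * d, lower, upper), ncol = d, byrow = TRUE)
  colnames(cand) <- colnames(X)
  pred <- predict(model, newdata = as.data.frame(cand))
  xs1 <- cand[which.min(pred), ]             # PV-like proposal: best predicted value
  distToData <- apply(cand, 1, function(v) min(colSums((t(X) - v)^2)))
  xs2 <- cand[which.max(distToData), ]       # explorative proposal (stand-in for EI)
  # EA-like proposals: Gaussian mutation of the best points found so far
  parents <- X[order(y)[1:(nParallel - 2)], , drop = FALSE]
  Xea <- parents + matrix(rnorm(length(parents), sd = 0.3), nrow = nrow(parents))
  Xea <- pmin(pmax(Xea, matrix(lower, nrow(Xea), d, byrow = TRUE)),
              matrix(upper, nrow(Xea), d, byrow = TRUE))
  Xnew <- rbind(Xea, xs1, xs2)
  ynew <- evalParallel(Xnew)                 # synchronization step
  X <- rbind(X, Xnew); y <- c(y, ynew)
}
min(y)

In the actual study, train() is a Kriging model and the proposals xs1 and xs2 are obtained by optimizing PV and EI with an EA.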


In the following, the implementation of each of the 12 methods is briefly described. Two variants of single-threaded standard SBO were implemented as a baseline for


Table 3.1 Optimization methods. Some of the methods are hybrids; the table shows their composition by detailing the number of objective function evaluations allotted to each component per iteration. Here, n is the total number of evaluations possible in parallel, m = n − 3 for all cases where n > 3 (otherwise m = n − 2), and p = 1 for n > 3 and p = 0 for n = 3.

Index                 EA     PV  EI  Rnd.Samp.  Space filling  IPI  q-EI
EA singleCore         1      -   -   -          -              -    -
Krig-PV               -      1   -   -          -              -    -
Krig-EI               -      -   1   -          -              -    -
K.PV+K.EI             -      1   1   -          -              -    -
EA-nCores             n      -   -   -          -              -    -
K.PV+K.EI+Rnd.Samp.   -      1   1   n − 2      -              -    -
K.PV+K.EI+LHS         -      1   1   -          n − 2          -    -
SBO+EA-sync           n − 2  1   1   -          -              -    -
SBO+EA-sync+SF        m      1   1   -          p              -    -
IPI-n-cores           -      -   -   -          -              n    -
q-EI                  -      -   -   -          -              -    n
q-EI-Bounded          -      -   -   -          -              -    n

the comparison. Here, single-threaded indicates that no parallel evaluations will be performed. Each variant is based on a different infill criterion (PV and EI), and both use Kriging. Hence, they will be denoted as Krig-PV and Krig-EI. The optimum of the infill criteria was determined with a simple EA from the same package. To judge the performance of both of these single-threaded SBO implementations, a single-threaded and model-free EA is used (denoted as EA singleCore). As another baseline, we test a model-free EA that generates as many individuals per iteration as there are slots for parallel function evaluation (EA-nCores). The proposed SBO+EA algorithm uses both PV and EI together. Thus, another experiment was added to judge the EA's impact on the hybrid algorithm, where both EI and PV are implemented without further algorithms. Methods like LHS are often argued to increase the model quality of surrogates by generating space-filling (SF) designs in the search space. This methodology for model quality improvements was also implemented in a CFD airfoil optimization by Marsden et al. [35]. To test this hypothesis, two further methods were tested: K.PV+K.EI+Rnd.Samp and K.PV+K.EI+LHS. Both generate two candidate solutions per iteration, one via PV and one via EI. In K.PV+K.EI+LHS, the remaining n − 2 available evaluation slots are populated with LHS. In addition to the Latin hypercube property, the points are determined to maximize the distance to their nearest neighbor. In K.PV+K.EI+Rnd.Samp, the slots are populated with random samples. In addition, the benefits of a space-filling (SF) infill point were investigated in the SBO+EA structure. To that end, the SBO+EA was extended as follows. Firstly, one


less solution (thus a total of n − 3) is generated by the EA's variation operators. This slot is filled with a candidate solution that maximizes the minimal distance to the other candidate points. This approach is similar to the distance-to-nearest-neighbor objective in the study by Bischl et al. [4]. The respective algorithm is denoted with SBO+EA-sync+SF. Since IPI was originally implemented to generate multiple candidate solutions that are sequentially evaluated [36], the methodology was slightly altered. In our implementation, the CEGO buildKriging function is used to build a Kriging model. Then, an EA is used to optimize each of IPI's infill criteria so that n candidate solutions are generated in each iteration. They are evaluated in parallel, and the model is trained with the updated data set. For q-EI, we employ the suggested implementation from Ginsbourger et al. [37], which is available in the DiceKriging R-package. Their implementation of the q-EI infill criterion was optimized by CEGO's optimEA. However, the optimization task was not to search for a single point but for the best set of points that optimizes the q-EI criterion. Due to the behavior of the q-EI optimization in the first tests, a second variant of q-EI-based SBO was implemented, where bound constraints are respected. Note that otherwise, all methods only respect bound constraints where applicable and only during design creation. This issue will be discussed in more detail in Sect. 3.3.3. The described methods were applied to the following test functions: Rosenbrock 2D, Rastrigin 5D, Rastrigin 10D, Branin 2D, Hartmann 6D, and Colville 4D. The experiments yield useful information on the surrogate's performance due to the various landscapes and features provided by the test functions. For each test function and method, 60 repeated optimization experiments were performed to account for the stochastic behavior of the algorithms.

3.3.3 Results

Our presentation of the results is based on a statistical analysis. For each test problem, we performed a statistical multiple-comparisons test. Differences are judged significant if the corresponding p-values are smaller than α = 0.05. We computed a ranking based on the derived pair-wise comparisons. Here, the ranking is performed as follows. Any algorithm that is never significantly worse than any other algorithm receives rank one and is removed from the list. Of the remaining algorithms, the ones that are not worse than any other receive rank two and are also removed. This procedure repeats until all algorithms are ranked. We chose to use the Kruskal-Wallis test [38] (to check whether any significant differences are present) and a corresponding post-hoc test based on Conover (to determine which algorithm pairs are actually different) [39, 40]. We use the implementations from the PMCMR R package [41]. These tests were chosen as the data is not normally distributed and is also heteroscedastic (i.e., group variances are not equal). Hence, parametric test procedures that assume homoscedastic (equal variance), normally distributed data are unsuited. Note that non-parametric tests are not


Table 3.2 Overview of optimization performances on all problems, for n = 3, 5, 10, and 15 parallel evaluations. For each problem (Rosenbrock 2D, Rastrigin 5D, Rastrigin 10D, Branin 2D, Hartmann 6D, Colville 4D), the twelve methods are ranked based on pairwise multiple-comparison tests, and the best method is marked in bold font. The last column gives the mean rank. The best mean rank is underlined and printed in bold font.

free of assumptions either. In fact, the Kruskal-Wallis test assumes that the data are random samples, statistically independent within each group and between groups, and have an ordinal measurement scale [40, p. 289]. These assumptions should hold for the optimization performance results we consider. An overview of the analysis results can be found in Table 3.2. It is clearly visible that, as expected, the single-threaded baseline implementations perform quite badly on the test functions. This confirms that parallelization is necessary to efficiently optimize in an environment where more than one function evaluation is possible simultaneously. In the first set of results with n = 3 parallel evaluations, the bounded q-EI approach performed best for many functions. Yet, the fairness of a comparison to this bounded approach is arguable. As first tests showed


Fig. 3.6 Boxplot showing the optimization results on the 10-dimensional Rastrigin function. In these experiments, 15 function evaluations were possible in parallel. Red numbers at the bottom indicate the given rank based on pairwise multiple-comparison tests. Image based on [3, p. 7]

that the original implementation of the EA-optimized q-EI criterion yielded unsatisfactory results, the problem was further investigated. In many of the experiments, it was observed that q-EI only positioned one or two design points in a reasonable search area. The other design points were rather far spread out into regions where the kriging model estimates maximum uncertainty. Thus, q-EI often yields points far beyond the previously sampled regions. This was the main reason for introducing bound constraints to the q-EI implementation. In this algorithm variant, the EA, which is used to optimize the q-EI criterion, is limited to a search in a bounded area. This solved the basic issues q-EI was facing and yielded much better results. However, it has to be considered that problem bounds are not always known a priori and that the other SBO implementations were not subject to bounds (except for the initial design generation that is shared by all methods). Therefore, the striking performance of q-EI with bounds and n = 3 should be considered with care. Given five or more evaluations in parallel, scaling seems to become an issue in IPI and q-EI as their mean rank drops. Here, SBO seems to deliver the best results for Rastrigin, Branin, and Hartmann. On the Rosenbrock and Colville functions, q-EI consistently delivers the best results. Figure 3.6 was chosen as one example to present results on test functions when more parallel evaluations are possible. The figure shows that the simple hybrid approach significantly outperforms existing parallel SBO methods, given a large enough batch size. The same is true for some functions evaluated with n = 10, but


not for smaller n. This behavior may be explained by the nature of the SBO+EA hybrid. The number of solutions suggested by the model is constant (here: two), whereas the number of solutions suggested by the EA operators increases with n. Hence, the hybrid will tend to behave more similarly to a model-free EA for larger n. This indicates that it may be necessary for the model-based part to scale with n, too, to provide better performance. Hence, combining it with the q-EI or IPI approaches may be profitable.

3.3.4 Conclusions

To draw a final conclusion about the presented analysis, we would first like to reconsider the initial research question:
RQ-3.3 Why parallel surrogate-based algorithms? What are their benefits and drawbacks?
The given results show that SBO is in fact very well suited to parallelized environments. The model-free EA was outperformed by the parallel SBO techniques in nearly all test cases. The sampling efficiency of model-based methods seems to outweigh the scaling capabilities of EAs. Yet, the performance of the pure model-based methods deteriorated further the larger the configured batch size became. We were able to clearly observe that scaling to larger batch sizes is a problem for existing parallel SBO algorithms. It is hard to select many reasonable points from a single model. The results especially showed that q-EI tends to end up being too explorative in its search. At the same time, even a very simple hybrid approach that combines a few model-based evaluations with an evolutionary algorithm can perform very well on larger batch sizes. This leads us to the conclusion that SBO should be applied in parallel low-budget optimization cases but that additional research is required in two areas. Firstly, the study indicated that very explorative search strategies are not well suited for low-budget scenarios. Yet, some degree of exploration is mandatory for model-based search strategies. Exploration covering the whole search space can help move out of local optima and improve the global model quality. Further investigations should be made into when and how much exploration can help a model-based search. In-depth research was done in this area and will be presented in Sect. 3.5. Secondly, this investigation shows that hybrid algorithms are a valid alternative to pure model-based approaches for parallel optimization. Again, further research is necessary to advance on this hypothesis. This research is presented in Sect. 3.6.


3.4 Parallelization as Hyper-Parameter Selection

Parts of the following section are based on "Parallelized Bayesian Optimization for Problems with Expensive Evaluation Functions" by Rebolledo et al. [42] and "Parallelized Bayesian Optimization for Expensive Robot Controller Evolution" by Rebolledo et al. [43]. The covered information and experiments in the papers were largely rewritten and extended, especially concerning the introductions and conclusions, and the papers were largely restructured. Parts of the papers were taken verbatim and included in this section.

A common issue with real-world optimization problems is that little or no knowledge of their landscape is available before the optimization is started. As described earlier, such problems are termed black-box problems. When facing black-box problems, one cannot know which configuration or hyper-parameters are reasonable choices for an optimizer. In SBO, these hyper-parameters include, for example, the many different kernels and infill criteria that can be used. Configurations that work well on one problem might not work on another one. Additionally, tuning an SBO algorithm for an expensive black-box problem is infeasible. The parameter tuning would simply require too many resources on the expensive-to-evaluate function. We therefore ask:
RQ-3.4 Can parallelization support in selecting feasible hyper-parameters?
In this section, we show that the performance of sequential SBO significantly depends on a good choice of kernel and infill criterion. Six common hyper-parameter configurations for SBO are benchmarked, and their impact on optimization efficiency is discussed. We further show that if a parallel evaluation of the objective function is feasible, parallelization can be used to circumvent the problems associated with bad hyper-parameter setups in SBO. For this purpose, an approach that fits multiple distinct models with different hyper-parameter setups in parallel, Multi-Config Surrogate-Based Optimization (MC-SBO), is investigated as a more robust approach that does not depend as strongly on parameter choices. We discuss whether this is a relevant alternative to applying more specialized parallel SBO implementations.

3.4.1 Multi-config SBO

Multi-Config Surrogate-Based Optimization (MC-SBO) uses parallelization to circumvent the possibility of badly configured kernel parameters or wrongly chosen infill criteria. Instead of choosing a single setup for SBO, multiple configurations of interest are set up and applied in parallel. In this study, we apply six different configurations in parallel. We chose three commonly applied infill criteria:


Fig. 3.7 Graphical representation of the Multi-Config Surrogate-Based Optimization (MC-SBO) algorithm. In this application n = 6 surrogates are built in parallel

EI, Lower Confidence Bound (LCB), and PV. Each of these is furthermore run with two different kernel settings. For this, reconsider the following kernel function from Eq. 3.2:

k(x, x′) = exp( − ∑_{i=1}^{n} θ_i |x_i − x′_i|^{p_i} ).

For this kernel, a common question is whether the correlation function smoothness parameter p = (p_1, ..., p_n) should be estimated automatically through Maximum Likelihood Estimation (MLE) or whether it should stay fixed. We apply both variants. In the first variant, p is fitted automatically. In the second variant, it is fixed to p = 2. Figure 3.7 gives a graphical representation of the Multi-Config Surrogate-Based Optimization (MC-SBO) algorithm. After sampling an initial design, the algorithm builds the six distinctly configured GP models in parallel at each iteration. Their respective infill criteria are optimized, and the candidates are evaluated. Importantly, the six distinctly configured algorithms do not behave like algorithm restarts, where no knowledge from one run is given to the other runs. Instead, the algorithm waits for all instances to complete their evaluations in a synchronization step. All evaluation knowledge is shared. Thus, at each iteration, the surrogates are built on all evaluated points, including those proposed by the other surrogates. This process of fitting independent models and then re-synchronizing their data is repeated until the iteration budget is used up.
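The parallel multi-configuration step can be sketched in R as follows. The helper run_sbo_iteration() is hypothetical and stands for one SBO step (a Kriging fit with the given kernel setting plus optimization of the given infill criterion); here a random point stands in for its proposal, and parallel::mclapply only indicates that the six proposals can be computed concurrently.

library(parallel)

# the six MC-SBO configurations: three infill criteria x two kernel settings
configs <- expand.grid(infill  = c("EI", "LCB", "PV"),
                       kernelP = c("fitted", "fixed2"),
                       stringsAsFactors = FALSE)

# hypothetical helper: in MC-SBO this would fit a Kriging model with the settings
# in cfg and return the point proposed by optimizing cfg$infill on that model;
# a random point within the bounds stands in for that proposal here
run_sbo_iteration <- function(cfg, X, y, lower, upper) {
  runif(length(lower), lower, upper)
}

f <- function(x) sum(x^2)                       # stand-in for the expensive objective
lower <- rep(-5, 3); upper <- rep(5, 3)
X <- matrix(runif(10 * 3, lower, upper), ncol = 3, byrow = TRUE)
y <- apply(X, 1, f)

# one synchronous MC-SBO iteration (note: mc.cores > 1 is not available on Windows)
proposals <- do.call(rbind, mclapply(seq_len(nrow(configs)), function(i)
  run_sbo_iteration(configs[i, ], X, y, lower, upper), mc.cores = 2))
ynew <- apply(proposals, 1, f)                  # synchronized parallel evaluation
X <- rbind(X, proposals); y <- c(y, ynew)       # shared archive for all six models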

3.4.2 Experiments

In this section, two benchmark scenarios are covered. The first is a sequential, and therefore non-parallel, setup. It is used to measure the impact of the hyper-parameter setup on the optimization performance of SBO algorithms. It compares the six


described configurations that were discussed in Sect. 3.4.1. Additionally, they are compared to a uniform random sampling search strategy and a CMA-ES (refer back to Sect. 2.2). The Python 'cma' implementation by Hansen et al. [44] was used in our experiments. To fit the general R framework that was used for experimenting, the Python code is called via the R-to-Python interface 'reticulate' [45]. Random search is used to judge the behavior of the algorithms compared to a lower baseline. The CMA-ES is used as a model-free baseline, being a well-known, efficient optimizer. In the first set of experiments, each algorithm is given a maximum of 100 function evaluations, and the best-found values are reported. This set of experiments is not meant to judge which SBO configuration works best on which kind of function but rather to show the significance of performance differences depending on which kernel or infill criterion is chosen. The second set of experiments compares the described parallel Multi-Config Surrogate-Based Optimization (MC-SBO) approach to other parallel algorithms. It is compared with three parallel model-based approaches, namely Q-EI, MOI-MBO, and IPI (for detailed explanations of each of these SBO approaches, please refer back to Sect. 2.4). Again, CMA-ES and random search are added as baselines for comparison. The algorithms in both experiments are extensively benchmarked on the previously covered BBOB function suite in order to assess their performance on varying landscapes and dimensionalities. For more detail on the BBOB functions, please refer back to Sect. 3.2, or directly to [10, 12]. For the parallel experiment, we judge the algorithms similarly to the experiments in Sect. 3.3. We again limit the total budget of the algorithms by assigning a maximum number of iterations that can be performed. The described Multi-Config Surrogate-Based Optimization (MC-SBO) method consists of a total of six different configurations that can each generate a sample point. Therefore, the batch size of all algorithms in this comparison is fixed to six. Latin hypercube sampling is used to select an initial design of ten points for each of the SBO algorithms. Q-EI and MOI-MBO are based on the DiceKriging R implementation. This implementation requires the number of initial samples to be larger than the input dimensionality. Therefore, the number of initial samples was adapted for both algorithms based on the given problem. It was set to the next multiple of the six parallel evaluations that is greater than the respective problem dimensionality. The population size for CMA-ES is set to six to match the number of available processors. Each algorithm is again given a maximum budget of 100 function evaluations. This very restricted budget of only 100 evaluations was chosen due to a very costly real-world application to which the algorithms are compared. This robotic application is covered in Sect. 4.2. Each experiment is repeated 30 times for statistical analysis.
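As an illustration, calling the Python 'cma' package from R via reticulate could look roughly as follows; the option names (popsize, maxfevals) are assumptions based on the cma package documentation, and the sphere function stands in for a BBOB test problem. This is not the original experiment code.

library(reticulate)
cma <- import("cma")

n <- 5
objective <- function(x) sum(x^2)      # stand-in for a BBOB test function

# CMA-ES with population size 6 and a total budget of 100 evaluations
res <- cma$fmin(objective, x0 = rep(0, n), sigma0 = 2,
                options = dict(popsize = 6L, maxfevals = 100L))
# res[[1]] is expected to hold the best solution found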


3.4.3 Results

The following discussion is based on a statistical analysis of the algorithm results. We do not assume that the collected results are normally distributed. For example, we know that we have a fixed lower bound on our performance values. Furthermore, our data is likely to be heteroscedastic (i.e., group variances are not equal). Therefore, parametric test procedures that assume homoscedastic (equal variance) and normally distributed data may be unsuited. Hence, we apply non-parametric tests that make fewer assumptions about the underlying data, as suggested by Derrac et al. [46]. We accept results to be statistically significant if the corresponding p-values are smaller than α = 0.05. The Kruskal-Wallis rank sum test (base-R package 'stats') is used to determine whether a significant difference is present in the set of applied algorithms or not. If this test is positive, a post-hoc test according to Conover (PMCMR R package [41]) for multiple pairwise comparisons is used to check for differences in each algorithm pair. The pairwise comparisons are further used to rank the algorithms on each problem class as follows: The set of algorithms that are never outperformed with statistical significance is considered rank one and removed from the list. Of the remaining algorithms, the ones which are again never outperformed receive rank two and are also removed. This process is repeated until all algorithms are ranked.
First, we compare the set of sequential experiments. The combination of different kernel configurations and infill criteria explained in Sect. 3.4.1 makes up the six tested variants of single-core SBO. We denote these as follows. The name of each algorithm is based on the applied infill criterion (EI, PV, and Lower Confidence Bound (LCB)), as well as the kernel parameter p. The variant with fixed p = 2 is denoted P2; the variant where p is estimated is denoted FitP. Using this notation, an experiment denoted as SBO-PV-P2 refers to a single-core SBO using the predicted value as infill criterion and having kernel parameter p = 2. As expected, we observed that SBO outperforms the CMA-ES on most problems and dimensionalities. A CMA-ES usually requires more evaluations to correctly adapt the covariance matrix before efficient progress can be made on the objective function. Figure 3.8 illustrates this behavior on one example function. The illustration of the Rastrigin function was chosen rather arbitrarily. Other functions showed very similar results. The results and plots of all BBOB functions are published in the supplementary material of [43]. Here, all SBO approaches ranked better than CMA-ES and random search. The different kernel configurations and infill criteria show a statistically significant difference between each other. Therefore, it is crucial to carefully choose the kernel and infill criteria according to the application characteristics. Interestingly, greedy infill criteria, like PV, show better performance than the more popular EI in 16 out of the 24 tested functions. This agrees with results by Rehbach et al. in [32].
Switching to the parallel experiments, we can consider the convergence plots for the Rastrigin and Sharp Ridge functions in Fig. 3.9. Again, these were chosen as a representative set; other functions showed similar results. For additional


Fig. 3.8 Boxplots of the non-parallel experiments on the Rastrigin function. Red numbers denote the ranking as determined by the statistical tests; higher numbers are worse. The header of each graph represents the problem dimensionality: 5, 10, and 20

plots we again refer to the supplementary material of [43]. Firstly, we observed that MOI-MBO usually performed rather well across many functions. Especially in high-dimensional applications, MOI-MBO performs best. We suspect that this behavior can be explained by other parallel methods, such as Q-EI and IPI, searching too exploratively. The performance of Q-EI, especially, dropped significantly for higher dimensionalities. This was also reported by Rehbach et al. in [3]. Similar behavior can be observed for the CMA-ES. It shows poor performance in higher dimensions, where it requires longer to adapt the covariance matrix. A clear advantage for the model-based methods could not always be observed for functions with high conditioning or weak global structure. However, only in functions


Fig. 3.9 Parallel algorithm performance on two BBOB functions, Rastrigin (top) and Sharp Ridge (bottom), with different dimensionalities (top of each plot: 5, 10, 20). Plots show the median (solid line) and the upper and lower quartiles (transparent area). Image taken from [43, p. 274]

Weierstrass and Katsura did SBO not achieve equal or better performance than the CMA-ES and random search. Both functions are highly rugged and repetitive. The Katsura function is a special case where all three algorithms ranked equally on the multi-comparison test. Surprisingly, the relatively simple Multi-Config Surrogate-Based Optimization (MC-SBO) approach consistently showed rather good behavior. On a few functions, it works best, outperforming all other algorithms with statistical significance. On most functions, it at least delivers moderate results, performing similar to or only slightly worse than MOI-MBO. The MOI-MBO algorithm favors exploitative behavior [4]. If the suggested points tend toward exploitation, the performance of MOI-MBO is in accordance with the results from the single-core experiments, where greedy approaches achieved better values.

3.4.4 Conclusions

In order to conclude this section, we would first like to reconsider the initially posed research question:
RQ-3.4 Can parallelization support in selecting feasible hyper-parameters?
Firstly, we compared six commonly used configurations for SBO across a set of 24 benchmark functions with various landscape features. The results showed significant performance differences based on the configured kernel and infill criterion.


Furthermore, the results showed that no single configuration works best, as the best-performing configurations changed depending on the test function at hand. Thus, in a black-box scenario, a correct choice of the best-performing hyper-parameters is near impossible. At the same time, however, the results show that a good configuration is crucial for good optimization performance. We have shown that parallelization can mitigate this problem of selecting a good hyper-parameter configuration quite effectively. Running multiple configured kernels and infill criteria in parallel ensures robust performance, as at least some of the candidate solutions are evaluated with the best possible hyper-parameter setup. This simple parallel strategy even repeatedly outperformed the parallel infill criteria IPI and Q-EI. Yet, further research is required to advance parallel SBO. One of the strongest drawbacks of Multi-Config Surrogate-Based Optimization (MC-SBO) is that it does not scale well to existing parallelization systems. That is, if a system works well with a batch size of eight, then eight suitable configurations would be required for the SBO approach. Each of these has to be implemented in code. Furthermore, we assume that the additional performance gain with added configurations will decline if the added configurations are less distinct. Further research could investigate a switching criterion that allows the algorithm to spend more budget on some of the configured kernels or infill criteria. For example, if one kernel shows better performance in the first half of an optimization, it might be good to give it more budget for the latter half. Lastly, as an addition to the experiments in Sect. 3.3, we demonstrated that SBO approaches can outperform state-of-the-art techniques such as a CMA-ES in problems with a very limited number of function evaluations. This observation was made both for the sequential and the parallel experiments. Additionally, a clearer difference in results was found on higher-dimensional functions. The higher the dimensionality, the lower the performance of the CMA-ES compared to the SBO approaches. This runs somewhat counter to a commonly stated rule of thumb that the quality of Kriging models degrades with dimensionality and that Kriging is not really useful for problems with more than 20 dimensions [31].

3.5 Exploration Versus Exploitation

Parts of the following section are based on "Expected Improvement versus Predicted Value in Surrogate-Based Optimization" by Rehbach et al. [32]. Major parts of the paper were taken verbatim and included in this section. Parts of it were rewritten, extended, and restructured.

The balance between exploration and exploitation plays an important role in many parallel SBO methods. Usually, it is attempted to reach a good balance with the help of the intrinsic uncertainty estimate of Kriging surrogates. For example, methods


like Q-EI (c.f. Sect. 2.4) try to balance exploration and exploitation by not only searching in areas where the model predicts good function quality but also in areas where the model is still uncertain about the true landscape. The EI (c.f. Sect. 2.1) of a candidate solution increases if the predicted quality of the solution improves or if the estimated uncertainty of the model rises. The experiments discussed in Sects. 3.3 and 3.4 showed that too much exploration leads to worsened performance in SBO. With this section, we would like to analyze why this happens and how much exploration or exploitation is required for efficient SBO.
The worsened performance of explorative methods, like EI, should come as a surprise considering how often EI is applied. When Kriging is used as the surrogate model of choice, EI is a very commonly applied infill criterion. We briefly examined 24 software frameworks for SBO. We found 15 that use EI as the default infill criterion, three that use PV, five that use criteria based on lower or upper confidence bounds [47], and a single one that uses a portfolio strategy (which includes confidence bounds and EI, but not PV). An overview of the surveyed frameworks is included in the supplementary material of [32]. Seemingly, EI is firmly established. EI can yield a guarantee for global convergence, and convergence rates can be analyzed analytically [48, 49]. However, we argue that the popularity of expected improvement largely relies on its theoretical properties rather than empirically validated performance. Only a few results from the literature show evidence that, under certain conditions, expected improvement may perform worse than something as simple as the PV (c.f. Sect. 2.1) of the surrogate model.
Nevertheless, some criticism can be found in the literature. Wessing and Preuss point out that the convergence proofs are of theoretical interest but have no relevance in practice [49]. The most notable issue in this context is the limited evaluation budget that is often imposed on SBO algorithms. Under these limited budgets, the asymptotic behavior of the algorithm can become meaningless. Wessing and Preuss show results where a model-free CMA-ES often outperforms EI-based EGO within a few hundred function evaluations [49]. A closely related issue is that many infill criteria, including EI, define a static trade-off between exploration and exploitation [37, 50–52]. This may not be optimal. Intuitively, explorative behavior may be of more interest in the beginning, rather than close to the end of an optimization run.
At the same time, benchmarks that compare EI and PV are few and far between. While some benchmarks investigate the performance of EI-based algorithms, they rarely compare to a variant based on PV (e.g., [53–56]). Other authors suggest portfolio strategies that combine multiple criteria. Often, tests of these strategies do not include the PV in the portfolio (e.g., [57, 58]). We would like to discuss two exceptions. Firstly, Noe and Husmeier [59] not only suggest a new infill criterion and compare it to EI, but also compare it to a broad set of other criteria, importantly including PV. In several of their experiments, PV seems to perform better than EI, depending on the test function and the number of evaluations spent. Secondly, a more recent work by De Ath et al. [60] reports similar observations. They observe that "Interestingly, Exploit, which always samples from the best mean surrogate prediction is competitive for most of the high dimensional problems" [60]. Conversely, they observe better performances of explorative approaches on a few low-dimensional problems.
In summary, we want to analyze the pros and cons of exploration in SBO. Is it justified that EI is applied so often? We ask:
RQ-3.5 How important is exploration in SBO?
In the following, this section describes extensive experiments that investigate the importance of exploration and exploitation in SBO. We discuss the results based on our observations and a statistical analysis.
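For reference, the EI criterion discussed throughout this section can be written in terms of the Kriging prediction, its uncertainty estimate, and the best observed value. The following small R function is a generic textbook-style implementation for minimization, not code taken from any of the surveyed frameworks.

# Expected Improvement for minimization, given predicted means mu, predicted
# standard deviations s, and the best objective value observed so far yMin
expectedImprovement <- function(mu, s, yMin) {
  ei <- rep(0, length(mu))
  pos <- s > 0                         # EI is zero where the model is certain
  z <- (yMin - mu[pos]) / s[pos]
  ei[pos] <- (yMin - mu[pos]) * pnorm(z) + s[pos] * dnorm(z)
  ei
}

# The PV criterion simply ranks candidates by mu; EI additionally rewards
# candidates with large s, which is what drives its explorative behavior.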

3.5.1 Experiments

Judging the performance of an algorithm requires an appropriate performance measure. The choice of this measure, as usual, depends on the context of the investigation. Similar to the investigations that were covered in previous sections, we are concerned with expensive optimization problems where only a severely limited budget of objective function evaluations is available. For practical reasons, we have to specify an upper limit of function evaluations for our experiments. Yet, in our analysis, we evaluate algorithm performance for multiple lower budgets as well. Readers who are only interested in a scenario with a specific fixed budget can, therefore, just consider the results for that budget. One critical contribution to an algorithm's performance is its configuration and setup. This includes parameters as well as the chosen implementation. We chose the R-package 'SPOT' [61, 62] as an implementation for SBO. It provides various types of surrogate models, infill criteria, and optimizers. To keep the comparison between the EI and PV infill criteria as fair as possible, they will be used with the same configuration in 'SPOT'. This configuration concerns the optimizer used on the model, the corresponding budgets, and lastly the kernel parameters for the Kriging surrogate. The configuration is explained in more detail in the following. We allow each 'SPOT' run to expend 300 evaluations of the objective function. This accounts for the assumption that SBO is usually applied in scenarios with severely limited evaluation budgets. The first ten of the 300 function evaluations are spent on an initial set of candidate solutions. This set is created with Latin hypercube sampling [63]. In the next 290 evaluations, a Kriging surrogate model is trained at each iteration. The model is trained with the 'buildKriging' method from the 'SPOT' package. During the model building, the nugget effect [28] is activated for numerical stability, and the respective parameter λ is set to a maximum of 10⁻⁴. The kernel parameters θ_i, p_i, and λ are determined by Maximum Likelihood Estimation (MLE) via differential evolution [25] ('optimDE'). The budget for the MLE is set to 500 × t likelihood evaluations, where t = 2d + 1 is the number of model parameters optimized by MLE (θ, p, λ). The respective infill criterion is optimized after the model training. Again, differential evolution is used for this purpose. In this case, we use a budget of 1000 × d evaluations of the model (in each iteration of the SBO algorithm). Generally, these


settings were chosen rather generously to ensure high model quality and a well-optimized infill criterion. To rule out side effects caused by potential peculiarities of the implementation, independent tests were performed with a different implementation based on the R-package 'mlrmbo' [64, 65]. Notably, the tests with 'mlrmbo' employed a different optimizer (focus search instead of differential evolution) and a different Kriging implementation, based on DiceKriging [66]. To answer the presented research question, we want to investigate how different infill criteria for SBO behave on a broad set of test functions. We chose one of the most well-known benchmark suites in the evolutionary computation community: the Black-Box Optimization Benchmarking (BBOB) suite [10]. While the test functions in the BBOB suite themselves are not expensive-to-evaluate, we emulate expensive optimization by limiting the algorithms to only a few hundred evaluations. The global optima and landscape features of all functions are known. Hansen et al. present a detailed description [12]. All described experiments were run with a recent GitHub version, v2.3.1 [24]. Each algorithm is tested on the 15 available standard instances of the BBOB function set.
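A rough sketch of such a SPOT setup is shown below. It is not the exact experiment code, and the control field names (funEvals, designControl, model, modelControl, optimizer, optimizerControl) are assumptions based on the SPOT interface; they may need to be adapted to the installed SPOT version.

library(SPOT)

d <- 5
objective <- function(x) sum(x^2)     # stand-in for a BBOB instance
lower <- rep(-5, d); upper <- rep(5, d)

res <- spot(fun = function(x) matrix(apply(x, 1, objective), ncol = 1),
            lower = lower, upper = upper,
            control = list(
              funEvals = 300,                        # total evaluation budget
              designControl = list(size = 10),       # initial LHS design of ten points
              model = buildKriging,                  # Kriging surrogate
              modelControl = list(useLambda = TRUE), # nugget effect for stability
              optimizer = optimDE,                   # DE for the infill optimization
              optimizerControl = list(funEvals = 1000 * d)))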

3.5.2 Results

Convergence Plots Our first set of results is presented in the form of convergence plots shown in Fig. 3.10. The figure shows the convergence of the SBO-PV and SBO-EI algorithms on two of the 24 BBOB functions. These two functions (Rastrigin, Sharp Ridge) were chosen because they nicely represent the main characteristics and problems that can be observed for both infill criteria. These characteristics will be explained in detail in the following paragraphs. A simple random search algorithm was included as a baseline comparison. The convergence plots of all remaining functions are available as supplementary material in [32]; also included are the results with the R-package 'mlrmbo'. As expected, the random search is largely outperformed on the three- to ten-dimensional function instances. In the two-dimensional instances, the random search approaches the performance of SBO-PV, at least for longer algorithm runtimes. For both functions shown in Fig. 3.10, it can be observed that SBO-EI works as well as or even better than SBO-PV on the two- and three-dimensional function instances. As dimensionality increases, SBO-PV gradually starts to overtake and then outperform SBO-EI. Initially, SBO-PV seems to have a faster convergence speed on both functions. Yet, this speedup comes at the cost of being more prone to getting stuck in local optima or sub-optimal regions. This problem is most obvious in the two- and three-dimensional instances, especially on the two-dimensional instance of BBOB function 3 (Separable Rastrigin). Here, SBO-EI shows a similar convergence to SBO-PV on roughly the


Fig. 3.10 Convergence plots of SBO-EI and SBO-PV on two of the 24 BBOB functions: (a) BBOB function 3 - Separable Rastrigin; (b) BBOB function 13 - Sharp Ridge. Random search is added as a baseline. The experiments are run (from left to right) on 2, 3, 5, and 10 input dimensions. The y-axis shows the best-so-far achieved objective function value on a logarithmic scale. The center line indicates the median of the algorithm repeats. The surrounding ribbon marks the lower and upper quartiles. Y-values measure differences from the global function optimum. Image taken from [32, p. 5]

However, after just 100 iterations, the algorithm appears to get stuck in local optima, reporting only minimal or no progress at all. As Rastrigin is a highly multi-modal function, the results for SBO-PV are not too surprising. At the same time, SBO-EI yields steady progress over all 300 iterations, exceeding the performance of SBO-PV. Yet, this promising performance seems to change on higher-dimensional functions. On the five- and ten-dimensional function sets, the performance of SBO-PV is close to that of SBO-EI early on. Later in the run, SBO-PV outperforms SBO-EI noticeably. Neither algorithm shows a similar form of stagnation as was visible for SBO-PV in the lower-dimensional test scenarios. This behavior indicates that with increasing dimensionality, it is less likely for SBO to get stuck in local optima. Therefore, the importance of exploration diminishes with increasing problem dimension. De Ath et al. reach a similar conclusion for the higher-dimensional functions they considered [60]. They argue that the comparatively low accuracy that the surrogates can achieve on high-dimensional functions results in some internal exploration even for the strictly exploitative SBO-PV. This is due to the fact that the estimated location of the function optimum might be far away from the true optimum. The case study at the end of this section covers this in more detail; there, we investigate exactly how much exploration is done by each infill criterion.

Statistical Analysis

To provide the reader with as much information as possible in a brief format, the rest of this section presents data that was aggregated via a statistical analysis. We do not expect our data to follow a normal distribution. For example, we know that we have a fixed lower bound on our performance values. Also, our data is likely heteroscedastic (i.e., group variances are not equal). Hence, common parametric test procedures that assume homoscedastic (equal variance), normally distributed data may be unsuited. Therefore, we apply non-parametric tests that make fewer assumptions about the underlying data, as suggested by Derrac et al. [46]. We chose the Wilcoxon test, also known as the Mann-Whitney test [67], and use the test implementation from the base-R package 'stats'. Statistical significance is accepted if the corresponding p-values are smaller than α = 0.05. The statistical test is applied to the results of each iteration of the given algorithms. As the BBOB suite reports results in exponentially increasing intervals, the plot in Fig. 3.11 follows the same exponential scale on the x-axis.

The figure shows the aggregated results of all tests on all functions and iterations. Blue cells indicate that SBO-EI significantly outperformed SBO-PV, while red indicates the opposite result. Uncolored cells indicate that there was no evidence for a statistically significant difference between the two competing algorithms. The figure is further split by the input dimensionality (2, 3, 5, 10) of the respective function, which is indicated on the right-hand side of each subplot.

We start with an overview of the two-dimensional results. Initially, SBO-PV seems to perform slightly better than its competitor. Until roughly iteration 60, it shows statistical dominance on more functions than EI. Yet, given more iterations, EI performs well on more functions than PV. The performance of SBO-PV with fewer than 60 iterations, together with the convergence plots, indicates that SBO-PV achieves faster convergence rates. Thus, SBO-PV is considered the greedier algorithm. This same greediness of SBO-PV may increasingly lead to stagnation when more than 60 iterations are performed. Here, the SBO-PV algorithm gets stuck at solutions that are sub-optimal or only locally optimal. SBO-EI starts to outperform SBO-PV, overtaking it at roughly 70 iterations. At the maximum budget of 300 function evaluations, SBO-EI outperforms SBO-PV on 8 of the 24 test functions. SBO-PV only works best on 3 out of the 24 functions for the two-dimensional instances (Sphere, Linear Slope, Rosenbrock). Notably, those three functions are unimodal (at least for the two-dimensional case discussed so far). It is not surprising to see the potentially greedier SBO-PV perform well on unimodal functions.
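The per-iteration comparison can be reproduced with base R alone. The following is a minimal sketch under the assumption that the repeated runs are stored in two hypothetical matrices, resultsEI and resultsPV (rows = repetitions, columns = iterations), holding the best objective value found so far.

# Hedged sketch: compare SBO-EI and SBO-PV per iteration with the Wilcoxon
# rank-sum test from base R. 'resultsEI' and 'resultsPV' are hypothetical
# matrices of best-so-far objective values (rows = repeats, columns = iterations).
compareCriteria <- function(resultsEI, resultsPV, alpha = 0.05) {
  sapply(seq_len(ncol(resultsEI)), function(i) {
    pLess    <- wilcox.test(resultsEI[, i], resultsPV[, i],
                            alternative = "less")$p.value
    pGreater <- wilcox.test(resultsEI[, i], resultsPV[, i],
                            alternative = "greater")$p.value
    if (pLess < alpha)         "EI"    # EI yields significantly lower values
    else if (pGreater < alpha) "PV"    # PV yields significantly lower values
    else                       "None"  # no significant difference
  })
}

The returned vector of labels ("EI", "PV", "None") per iteration corresponds directly to the coloring scheme used in Fig. 3.11.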

Fig. 3.11 Statistical analysis of the difference in performance between SBO-EI and SBO-PV. The y-axis shows the BBOB function index, the x-axis the iteration. Colors indicate which infill criterion (PV, EI, or none) performed better in each case, based on the statistical test procedure. Blank boxes indicate that no significant difference was found. The results are presented for the 2, 3, 5, and 10 dimensional BBOB function sets as indicated in the gray bar to the right of each plot segment. The BBOB suite reports results in exponentially increasing intervals, thus the figure follows the same exponential scale on the x-axis. Image taken from [32, p. 6]

A similar behavior is observed on the three-dimensional function set. Initially, SBO-PV performs well on up to five functions, while SBO-EI only performs better on up to two functions. At around 85 iterations, SBO-EI again overtakes SBO-PV and then continues to perform well on more than eight functions up to the maximum budget of 300 iterations. Based on the observations for the two- and three-dimensional scenarios, one could assume that a similar pattern would be observed for higher-dimensional functions. However, as previously discussed, SBO-PV's convergence is less likely to stagnate with increasing problem dimensionality. On the five-dimensional functions, only three functions remain on which SBO-EI outperforms SBO-PV at the maximum given budget. On the other hand, SBO-PV performs better on up to seven functions. This now also includes multimodal functions, hence functions that are usually not considered promising candidates for SBO-PV. On the ten-dimensional function set, SBO-EI is outperformed on nearly all functions, with only two temporary exceptions: only on the Sphere function and the Katsuura function can a statistically significant difference be measured for a few iterations in favor of SBO-EI. SBO-PV performs significantly better on 9 out of the 24 functions. Only on the function group with weak global structure does SBO-PV fail to produce significantly better performance.

Summarizing these results for separate function groups, it is noticeable that SBO-EI tends to work better on the multimodal functions. Here, SBO-EI clearly excels on the two- and three-dimensional instances. On the functions with weak global structure, it continues to excel on the five-dimensional functions and is at least not outperformed on the ten-dimensional ones. SBO-EI performs especially poorly on 'functions with low or moderate conditioning'. Here, SBO-PV outperforms SBO-EI on at least as many functions as vice versa, independent of the budget and the problem dimensionality. Generally, multimodality does not seem to require an explorative search methodology as long as the input dimensionality is high.

Case Study: Measuring Exploration

The performance discussion in Sect. 3.5.2 largely relies on the idea that SBO-PV generally does less exploration than SBO-EI. Multiple functions were observed on which the search with SBO-PV stagnates too early. On the other hand, we argue that as long as SBO-PV does not get stuck in sub-optimal regions, its greedy behavior leads to faster convergence and, thus, a more efficient search process. This may especially be true if the budget is small in relation to the problem dimension. In other words, if time is short, greedy behavior may be preferable. To support this hypothesis, additional experiments were carried out to determine the amount of "exploration" each criterion does in different stages of the optimization runs. For this purpose, we assume that exploration can be estimated as the Euclidean distance of a proposed candidate to its closest neighbor in the already evaluated set of candidate solutions. Placing a new point close to a known point is therefore regarded as an exploitative step. Conversely, placing a new point far away from known points is considered an explorative step.

This measure was applied to the previously discussed optimization runs on the BBOB function set. The results for the two functions are shown in Fig. 3.12. For comparability, the same functions as in the previous section are represented. Results of the case study on all 24 BBOB functions are included in the supplementary material. As discussed earlier, the performance of SBO-PV stagnated on the two- and three-dimensional instances of the Rastrigin function. Hence, it is interesting to see that the measured distance becomes fairly small in those same cases. While SBO-EI seems to make comparatively large search steps, SBO-PV seems to propose candidate solutions in the close vicinity of already known solutions.
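The exploration measure itself is simple to compute. The following minimal R sketch assumes a hypothetical matrix 'archive' of already evaluated points (one row per point) and a numeric vector 'candidate' of the same dimension.

# Exploration measure: Euclidean distance of a newly proposed candidate to its
# nearest neighbor among the already evaluated points. Small values indicate
# exploitation (sampling close to known points), large values exploration.
explorationDistance <- function(candidate, archive) {
  distances <- sqrt(rowSums(sweep(archive, 2, candidate)^2))  # distance to each known point
  min(distances)                                              # distance to the nearest neighbor
}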

Fig. 3.12 The proposed candidate solution's distance to its nearest neighbor in the set of known solutions ((a) BBOB Function 3 - Separable Rastrigin, (b) BBOB Function 13 - Sharp Ridge). The y-axis shows the measured Euclidean distance. The experiments are run (from left to right) on 2, 3, 5, and 10 input dimensions. The colored lines indicate the median of the distance, computed over the repeated runs of each algorithm. The surrounding ribbon marks the lower and upper quartiles. Image taken from [32, p. 7]

On the higher-dimensional functions, the difference between SBO-EI and SBO-PV decreases. On the Rastrigin function, both infill criteria reach similar levels in the ten-dimensional case. On the Sharp Ridge problem, a significant difference between the criteria remains, but it decreases with increasing problem dimension. This supports the idea that the convergence speed of SBO-PV is generally higher due to less budget being spent on exploring the search space.

3.5.3 Conclusions

The presented results clearly showed that there is no single ideal balance between exploration and exploitation. Before giving a final conclusion on the importance of exploration and exploitation in SBO, we can identify certain scenarios which tend to prefer more explorative or more exploitative search schemes. We summarize these in a short decision guideline for practitioners.

• Problem dimension: Even when optimizing a black-box function, the input dimensionality of a problem should be known. The results show that SBO-EI performs better on lower-dimensional functions, whereas SBO-PV excels on higher-dimensional functions. If no further knowledge about a given optimization problem is available, then the benchmark results indicate that it is best to apply SBO-EI to functions with up to three dimensions. Problems with five or more input dimensions are likely best solved by SBO-PV.

• Budget: The main drawback of an exploitative search (SBO-PV) is the likelihood of prematurely converging into a local optimum. The larger the available budget of function evaluations, the more likely it is that SBO-PV will converge to a local optimum. Yet, as long as SBO-PV is not stuck, it is more efficient in finding the overall best objective function value. Therefore, for many functions, there should be a certain critical budget up to which SBO-PV outperforms SBO-EI. If the budget is larger, SBO-EI performs better. The results indicate that this critical budget increases with the input dimensionality. In the discussed benchmarks, SBO-EI started to excel after roughly 70 iterations on the two- and three-dimensional functions. On the five-dimensional set, SBO-EI started excelling on some functions at around 200 iterations. On the ten-dimensional set, no signs of approaching such a critical budget could be observed, indicating that it might lie far beyond reasonable budgets for SBO (i.e., well above 300 evaluations). In summary, we can say: the smaller the budget, the more exploitative the search should be.

• Modality: Lastly, if any a priori knowledge regarding the landscape of the given function is available, then it can be used to make a more informed decision. The greatest weakness of SBO-PV is getting stuck in local optima (or flat, sub-optimal regions). Hence, if it is known that a given function is fairly multimodal, then EI may be a good choice. For simpler, potentially unimodal functions, we recommend using PV.

Given these guidelines, we would like to reconsider research question RQ-3.5: How important is exploration in SBO? We observe that explorative methods such as SBO-EI perform significantly differently from what is often assumed. Especially on higher-dimensional functions, exploitative techniques seem to be the better choice, despite (or because of) their greedy search strategy. Only on lower-dimensional, multimodal functions does exploration seem to largely benefit SBO. Given the limited budgets under which SBO is usually applied, the common application of EI in the community might need to be reconsidered. As summarized in the in-depth literature review of 24 published SBO software frameworks, most used EI as their default infill criterion. Only three employ an exploitative strategy as their default infill criterion, even though these criteria are arguably easier to optimize and cheaper to compute. The importance of explorative SBO in practical applications should be reassessed. In this context, the sub-optimal performance of SBO-EI for dimensionalities of five or more is troubling and needs to be further investigated.

Finally, we would like to discuss possibilities for future research in this area. First and foremost, the experiments clearly show that the PV is a surprisingly competitive infill criterion. Future work on new infill criteria should include PV in their benchmarks (e.g., as a baseline). Portfolio methods as well as parallel SBO techniques that cover multiple infill criteria would likely profit from considering PV in their framework. Furthermore, based on the presented results, static trade-offs between exploration and exploitation should be reconsidered. If exploration is more important in the early stage of an optimization but later hinders quick convergence, then a switch between the methods during an optimization run might improve the overall optimization performance. A consideration that was not covered in this work is the global model quality. A globally accurate model is not required for an optimization task that only searches for a single optimum. However, there may be additional requirements in practice. For instance, the learned model may have to be used after the optimization run to provide additional understanding of the real-world process to practitioners or operators. A model that is trained with data generated by a purely exploitative search might fall short of this requirement, despite its ability to find a good solution.

3.6 Multi-local Expected Improvement

Parts of the following section are based on "Benchmark-Driven Configuration of a Parallel Model-Based Optimization Algorithm" by Rehbach et al. [1]. Significant parts of the paper were taken verbatim and included in this section. Parts of it, especially the introduction and conclusion, were rewritten and extended to fit the scope of this thesis.

The research of the previous sections has yielded multiple important observations that could lead to advances in parallel SBO. In Sect. 3.3, it was shown that hybrid algorithms scale better to larger batch sizes than classical SBO approaches. Sections 3.3, 3.4, and 3.5 showed that existing parallel SBO approaches often spend too much budget on exploration. Furthermore, the trade-off between exploration and exploitation is usually defined in a static manner, i.e., independent of the remaining budget, exploration stays a driving factor for these infill criteria. We argue that the need for exploration in SBO largely changes throughout the optimization (c.f. Sect. 3.5). In the very beginning, the search space is unknown. More exploration is necessary to improve the global model quality and discover good areas of the search space. Exploitation in the early stages of the optimization will likely only lead to premature convergence into a local optimum (c.f. Sect. 3.5, [32]).

The opposite can be argued for the very last iterations of an optimization. The search space should be roughly known by then, and discovering a new and better area in the search space becomes less and less likely. Even if a new area were found, likely not enough budget would remain to exploit deeply into it. Therefore, further exploration is less beneficial. On the other hand, local exploitation around the currently best-known solution can still significantly improve the best-found solution in the last iterations. We ask: RQ-3.6 How to combine the gathered knowledge into an efficient parallel SBO algorithm? The remainder of this section introduces Multi-Local Expected Improvement (MLEI), an efficient parallel SBO algorithm. The performance of the algorithm is shown experimentally across a broad set of benchmark scenarios.

3.6.1 Batched Multi-local Expected Improvement

Q-EI is the theoretical extension of EI from a single point to the expected improvement of a set of points. Yet, as discussed before, recent research shows that parallel infill criteria like Q-EI tend to do too much search space exploration and thus do not make efficient use of the limited evaluation budget [3, 49]. Wessing and Preuss argue that the theoretical convergence proof of EI has little influence on practical applications. Under a severely limited evaluation budget, the asymptotic convergence behavior of EI is rendered meaningless [49]. Importantly, they instead argue that EI is better suited for niching applications where the goal is not to find one global optimum but rather to identify as many local basins as possible [49].

In order to circumvent the presented problems of a too explorative parallel search, we propose the Multi-Local Expected Improvement (MLEI) algorithm. MLEI combines the explorative search of Q-EI with an exploitative search in the form of multiple parallel local optimizers. As proposed by Wessing et al. [49], we employ expected improvement to identify multiple interesting search basins. Here, the goal for Q-EI is specifically only to identify good regions, not to find the optima of those regions. We then make use of these regions by starting multiple parallel local searches. Each local search aims to exploit a newly discovered region, possibly finding local optima or even the global optimum.

Furthermore, an adaptive search scheme is used. As discussed previously, search space exploration is required in the early iterations of SBO. Placing points in unknown regions of the search space can be useful for increasing the global model quality [32]. As the search progresses and less budget remains, exploitation becomes more important. Therefore, our search scheme traverses multiple phases. The goal is to gradually adapt from a very explorative search to finally a purely exploitative search, based on the remaining budget. The structure of the MLEI algorithm is shown in Fig. 3.13. The budget-based phases are explained with an example batch size of four in Fig. 3.14. Starting in phase (1), an initial design for the optimization is created by Latin Hypercube Sampling [63]. The design of size k is evaluated in parallel, based on the configured batch size.

Fig. 3.13 Multi-Local Expected Improvement (MLEI) algorithm: points are generated in an initial design (white), as well as later by the surrogate optimizer (yellow) and multiple local searches (green). They are expensively evaluated on the objective function (red) until a stopping criterion is met. The starting point archive contains the initial design and all points proposed by the surrogate optimizer. Image taken from [1, p. 3]

Fig. 3.14 Graphical example of the budget-based phases of the Multi-Local Expected Improvement (MLEI) algorithm for a batch size of four, resulting in a total of n = 5 phases: after the evaluation of the initial design, the later stages represent how many points are generated with which method (Q-EI, 1+1 ES, and PV) per iteration. Image taken from [1, p. 3]

To create a linear transition from exploration to exploitation, one less sample is proposed for exploration at each new phase, and one more is proposed for exploitation. Therefore, there are batch size switches to make from a fully explorative to a fully exploitative search. The budget that remains after evaluating the initial design is split into batch size phases of equal budget, resulting in a total of n = batch size + 1 phases. Initially, the goal of the optimization is to search very exploratively. Phase (2) only employs the Q-EI algorithm. Therefore, at each iteration, batch size points are generated with the Q-EI criterion. One iteration specifies one parallel evaluation of the batch size proposed points. There are usually multiple iterations per phase in the algorithm. To give an example, consider Fig. 3.14: the algorithm uses a batch size of four. This means that there are a total of n = 4 + 1 = 5 phases, and in each iteration of the algorithm, four points are generated. Throughout the phases, the number of points generated per iteration does not change. Only the method with which these points are generated changes, from more points invested into exploration to more points invested into exploitation.

In each following phase (3)...(n), one point less is proposed by the Q-EI algorithm, and one more local optimizer is started. The local optimizers run in parallel to the model-based algorithm. We employ a simple 1+1 ES as our local optimizer [68]. The 1+1 ES benefits us in this framework, as it only requires small budgets while still being able to improve from any starting point. Using the standard 1/5th rule for step size adaptation, it can quickly adapt to a given search space to perform a simple local search. The 1+1 ES is just one possible choice of many and could, in principle, be replaced with any other local optimizer. The only requirement for the local optimizer is to be efficient in low-budget scenarios. It does not require, nor should it have, any properties aimed at global optimization, as its only task is to descend into the nearest basin. The local optimizer directly evaluates the expensive function. It does not use any knowledge of the surrogate model, i.e., it does not use predictions of the model to guide the local search. The starting point of each ES stems from an archive that includes the initial design and is filled at every iteration with the newly evaluated Q-EI points. A new ES is started on the point in the archive with the best objective function value. If a point was already used to start another ES, it is skipped, and the next best point is chosen. Points that the ESs propose are explicitly not included in this archive, even though they might include the best point seen so far. This is done to reduce the chances of two or more local searches running into the same basin.

Contrary to the ES, which does not use any information from the surrogate model or other ES evaluations, the surrogate model is fitted on all available data in each iteration. Thus, the model includes the initial design, all Q-EI points, and all ES evaluations. The predictive quality of the model benefits from each evaluation. Additionally, the many sample points that an ES places into one local area largely reduce the expected improvement for that same area. This significantly reduces the chances for the model to place additional samples into an area that is already taken care of by an ES.

In the last phase of the optimization, batch size - 1 ES instances are running in parallel, and the model proposes one additional evaluation point. Our choices are based on results and arguments presented in [32]. The budget spent on exploration to improve model quality becomes less and less beneficial the less budget remains. Therefore, the last phase does not employ EI anymore. Instead, the infill criterion is replaced with the predicted value of the surrogate model itself. This essentially creates another pure exploitation step: only the point that the model believes to have the best objective function value is evaluated, without any consideration of the model uncertainty.
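The phase logic can be summarized in a few lines of code. The following is a minimal, hypothetical R sketch of how the per-iteration split between Q-EI proposals and parallel 1+1 ES instances could be derived from the current iteration; function and variable names are illustrative and not taken from the actual implementation.

# Hedged sketch of the MLEI phase schedule: the budget remaining after the
# initial design is split into 'batchSize' phases of equal length. In phase p,
# batchSize - (p - 1) points come from the model-based search and p - 1 points
# from parallel 1+1 ES local searches; the last phase uses PV instead of Q-EI.
mleiPhaseSchedule <- function(iteration, totalIterations, batchSize) {
  phaseLength <- totalIterations / batchSize               # iterations per phase
  phase   <- max(1, min(batchSize, ceiling(iteration / phaseLength)))
  nLocal  <- phase - 1                                     # parallel 1+1 ES instances
  nGlobal <- batchSize - nLocal                            # model-based proposals
  list(phase     = phase,
       nES       = nLocal,
       nModel    = nGlobal,
       criterion = if (phase == batchSize) "PV" else "Q-EI")
}

# Example: batch size 4, 40 iterations after the initial design, iteration 35
# falls into the last phase (3 local searches, 1 PV proposal).
mleiPhaseSchedule(iteration = 35, totalIterations = 40, batchSize = 4)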


3.6.2 Experiments

We employed two sets of benchmarks to analyze the described algorithm. Similar to the previous sections, the first set of experiments again compares the best found results of the algorithms after a fixed number of iterations. In this scenario, a fair comparison of an algorithm's performance can only be drawn by comparing it to other algorithms using the same batch size. The second set of experiments is done using the benchmarking framework introduced in Sect. 3.1. Therefore, the algorithms can be compared across batch sizes, and we can determine which batch size is best for which algorithm. While the benchmarked algorithms aim to solve expensive-to-evaluate objective functions, cheap-to-evaluate functions are necessary to enable in-depth benchmarks without requiring large computational budgets. Therefore, we employ a commonly used benchmark suite: the Black-Box Optimization Benchmarking (BBOB) suite [10]. The expensiveness of the benchmark functions is emulated by limiting the number of objective function evaluations for each optimizer. All experiments are implemented with a recent GitHub version of the benchmarking suite (v2.3.1).

To judge the performance of the model-based algorithms, we first benchmark three baseline algorithms. Firstly, a uniform sampling random search serves as a lower baseline for performance. Next, we chose the standard Python CMAES implementation by Hansen [44] as a well-known efficient optimizer on the BBOB function set. The CMAES uses a population of candidates at each iteration. It can, therefore, efficiently make use of multiple objective function evaluations in parallel. The population size of the CMAES was set to 4 × batch size. Since MLEI uses multiple 1+1 ESs for local optimization, the performance of the same 1+1 ES algorithm alone is taken as the third baseline. Yet, a 1+1 ES only proposes one new candidate solution per iteration. To judge its performance compared to parallel algorithms, as many ESs are started in parallel as the batch size allows. The 1+1 ES implementation is based on the 'optimES' implementation of the R-package SPOT [69]. Both the CMAES and the 1+1 ESs are started with random uniformly sampled starting points. Each algorithm is given 50 iterations and is run with batch sizes 2, 4, 8, and 16, resulting in 100 to 800 objective function evaluations per run. The algorithms are run on all 24 BBOB functions, on two- and five-dimensional problems. For statistical analysis, 15 repeats are done per run. Each repetition is evaluated on a new instance of the given function class. However, each optimization algorithm sees the same 15 instances to make the comparison as fair as possible.

Next, multiple parallel model-based algorithms are benchmarked. We compare Q-EI, MOI-MBO, IPI, and our proposed MLEI. Each model-based algorithm is initiated with a Latin hypercube sampled initial design [63]. The design size for each algorithm is 2 × problem dimension × batch size. Thus, the first 2 × problem dimension iterations of each algorithm are used for the initial design.
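For illustration, the following is a minimal R sketch of a (1+1)-ES with 1/5th success rule step-size adaptation, roughly mirroring the 1+1 ES baseline and the local-search component described above; the actual experiments use SPOT's 'optimES', so this stand-in only conveys the general scheme.

# Minimal sketch of a (1+1)-ES with 1/5th success rule (illustrative stand-in
# for SPOT's 'optimES'). On success the step size grows by a factor 'fac', on
# failure it shrinks by fac^(-1/4), which equilibrates at a success rate of 1/5.
onePlusOneES <- function(fun, x0, lower, upper, budget = 50,
                         sigma = 0.1, fac = 1.5) {
  x  <- x0
  fx <- fun(x)
  for (i in seq_len(budget - 1)) {
    y  <- pmin(pmax(x + sigma * rnorm(length(x)), lower), upper)  # mutate, clip to bounds
    fy <- fun(y)
    if (fy < fx) {                 # success: accept offspring, increase step size
      x <- y; fx <- fy
      sigma <- sigma * fac
    } else {                       # failure: keep parent, decrease step size
      sigma <- sigma * fac^(-1 / 4)
    }
  }
  list(xbest = x, ybest = fx)
}

# Example: local descent on the sphere function in five dimensions
onePlusOneES(function(x) sum(x^2), x0 = runif(5, -5, 5),
             lower = rep(-5, 5), upper = rep(5, 5), budget = 50)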


Q-EI is implemented in R through the DiceKriging package, which is based on Ginsbourger et al. [37, 66]. The package method qEI accepts multiple candidate solutions and returns their combined expected improvement. For the optimization of this infill criterion, we employ the differential evolution [70] implementation of the package 'SPOT' [61, 71] (optimDE). This internal optimizer is given 1000 × log(problem dimension) evaluations of the Q-EI criterion. The IPI criterion is implemented based on [36]. Internally, we use the same SPOT Kriging model and optimDE with the same budgets as for Q-EI. The PV criterion in MLEI uses the same SPOT Kriging model as the IPI implementation. Yet, this time the target option is set to the default: 'y'. This yields an optimization for the best-predicted objective function value of the model. The optimization of the criterion is again handled by optimDE with the same budgets as for IPI. The implementation for MOI-MBO is taken directly from the authors' R implementation: 'mlrMBO' [64]. More specifically, the package is configured to use the MOI-MBO setup that performed best in their original experiments. The configurations are reported in [4]. According to their best-performing configuration, we implemented ID 10 for our experiments. Apart from that, the same initial design sizes and internal optimizer budgets as for IPI and Q-EI are assigned to MOI-MBO.
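As a small illustration of how the combined expected improvement of a batch can be queried, consider the following hedged sketch. It assumes a qEI function as described above (shipped with the DiceKriging/DiceOptim family of packages) and uses a cheap stand-in objective; it is not the exact setup used in the experiments.

library(DiceKriging)
library(DiceOptim)   # assumed to provide the qEI method described in the text

# 20 evaluated points of a cheap stand-in objective in two dimensions
X <- data.frame(x1 = runif(20, -5, 5), x2 = runif(20, -5, 5))
y <- rowSums(X^2)

model <- km(design = X, response = y)            # Kriging model on the data

# a candidate batch of q = 4 points; qEI returns their combined expected improvement
batch <- cbind(x1 = runif(4, -5, 5), x2 = runif(4, -5, 5))
qEI(batch, model)

In the actual algorithm, an external optimizer (here optimDE) would search over such batches to maximize this criterion.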

3.6.3 Results

The first set of results is reported for static batch sizes. We compare the results of each algorithm given a certain fixed batch size. Figure 3.15 shows a set of convergence plots on different BBOB functions. The specific plots were handpicked as they represent interesting features that are discussed in the following. A detailed set of all convergence plots, as well as boxplots with statistical results for each optimization run, is available in the supplementary material of [1].

The left plot in Fig. 3.15 shows results on the sphere function with a batch size of two. This result best represents the desired behavior of the MLEI algorithm. At the beginning of the optimization, MLEI behaves exactly like Q-EI. Thus, they also show the exact same performance for the first part of the optimization. Yet, after switching to the exploitative search step, quick progress is made. The combination of the 1+1 ES and the PV criterion works well on the simple landscape of the sphere function, quickly outperforming the other algorithms. It is also worth noting that after good initial progress, the model-based methods all stagnate in their search. No further progress is made, although the problem at hand should be easy to solve. This is likely due to the fact that parallel infill criteria largely rely on Kriging's uncertainty estimate and avoid placing points in already sampled regions. While this aids the exploration of the search space at the beginning of an optimization, it renders local exploitation impossible.

Fig. 3.15 Convergence plots showing optimization performance over algorithm runtime (panels: BBOB Function 1 Sphere with batch size 2, BBOB Function 2 Ellipsoid with batch size 4, BBOB Function 15 Rastrigin with batch size 8). The y-axis shows the best so far found objective function value, the x-axis the current evaluation. All plots show experiments on 5-dimensional functions. The solid line in the plot represents the median of all repeats, the transparent colored areas represent the upper and lower quartiles. The convergence behavior shown in the plots is explained in detail in Sect. 3.6.3. IPI results were removed for better readability as they did not show noteworthy behavior. Image taken from [1, p. 10]

The second plot in Fig. 3.15 shows results on the ellipsoid function. While the ellipsoid function still features a smooth, unimodal landscape, its key distinguishing property is the strong ill-conditioning of the problem landscape. It was picked for discussion here since ill-conditioned landscapes (which are present in many BBOB functions) were the strongest differentiator in algorithm performance throughout the benchmarks. Due to the strong ill-conditioning, results with high function values of up to 10^6 are to be expected. All algorithms deliver considerably worse results compared to the results on the sphere function. The CMAES is much closer to the model-based methods in this comparison. Yet, the largest difference in results can be observed for the MLEI algorithm. The previously observed quick progression of the exploitation phase is largely missing on the ellipsoid function. This is likely because the 1+1 ES step-size adaptation does not work well on ill-conditioned functions. MOI-MBO works best in this scenario, significantly outperforming the other algorithms.

The third set of results in Fig. 3.15 was picked to represent a common behavior of MLEI on multi-modal functions. Similar behavior was observed on other functions as well but showed up most clearly on the Rastrigin function. Firstly, it is clear that the pure 1+1 ES does not cope well with the many local optima, even being outperformed by random search. Yet, the combination of Q-EI and the same local search in the MLEI algorithm delivers continuous progress on the search space. Thus, it can be argued that the ES strongly synergizes with the explorative global search of Q-EI.

The next set of results is aggregated in a dense format in Figs. 3.16 and 3.17. Both figures show the results of all algorithms on all functions and dimensionalities. Figure 3.16 shows the processed results of a statistical analysis. As we cannot expect our data to follow a normal distribution, we decided to apply non-parametric tests as suggested by Derrac et al. [46], as these make fewer assumptions about the underlying data.

Fig. 3.16 Aggregated results of the static batch size experiments. Summarized statistical analysis over all algorithms and functions after using their complete budget (100 × batch size). If a box is gray, the specific algorithm was outperformed with statistical significance. Colored boxes show that the algorithm is NOT outperformed in that setup. For example, in the top left corner of the plot (batch size 2, dimension 2, function 24), no algorithm was able to outperform another one with statistical significance; hence, all fields are colored. Image taken from [1, p. 10]

Fig. 3.17 Best performing algorithm on a given function. Each cell of the plot represents the repeats of multiple algorithms on a function (y-axis), batch size (x-axis), and dimensionality (vertical split). The color indicates the algorithm with the best median objective function value after the optimization budget is depleted. Image taken from [1, p. 11]

We apply the base-R package 'stats' to implement the Wilcoxon test, also known as the Mann-Whitney test [67]. Results are accepted as statistically significant if p-values smaller than 0.05 are reported. For each combination of function and dimension, the colored boxes in Fig. 3.16 represent the set of best algorithms for that combination. If only one algorithm is marked (colored differently than gray) for a certain combination, then it outperformed all other algorithms with statistical significance. If multiple algorithms are marked, this indicates that there was no statistically significant difference between them.

Starting off, it is noticeable that the harder a function or optimization task is, the smaller the differences between the algorithms competing on it. In the two-dimensional case with a batch size of two, no algorithm outperforms any other algorithm on functions 23 (Katsuura) and 24 (Lunacek bi-Rastrigin). Furthermore, the random search is not outperformed on functions 21 (Gallagher's Gaussian 101-me Peaks) and 22 (Gallagher's Gaussian 21-hi Peaks), even outperforming the 1+1 ES, CMAES, IPI, and MOI-MBO. The hard combination of a low evaluation budget (due to the batch size of two) and the weak global structure and high modality of these functions leads to poor performance of all applied algorithms. With increasing problem dimension and batch size, the random search baseline is less often on par with the other optimizers. With a batch size of 16, random search is always outperformed.

Interestingly, the parallel 1+1 ES cannot compete on the sphere function. This is arguably due to the fact that the parallel restarts are not helpful on the unimodal sphere function, thus wasting evaluations. The CMAES performs increasingly well with larger evaluation budgets. While it is only rarely able to compete in a scenario with a batch size of two, it can compete on a total of ten function/dimension combinations with a batch size of 16. This coincides with other experiments that have shown that a CMAES starts outperforming model-based methods when it is given enough budget [43]. This observation is further supported by Fig. 3.17. While the CMAES nearly never outperforms on any function for the lower batch sizes (once, with a batch size of four, on the Bent Cigar function, ID 12), it has the best median performance in five cases for a batch size of 16.

Overall, the model-based algorithms perform relatively well across most setups. Both IPI and Q-EI seem to do better for batch sizes two and four and are often outperformed on batch sizes eight and 16. This seems to validate the previously made argument that these algorithms tend to do too much exploration with an increasing batch size. More exploitation is required to refine the results given the small number of iterations that the algorithms can do. MOI-MBO and MLEI deliver more robust results across the board. In the five-dimensional cases, MOI-MBO is often outperformed on functions 15-18, which exactly coincides with the group of BBOB functions with adequate global structure. Furthermore, MOI-MBO is outperformed on some functions with weak global structure. MLEI delivers good and especially robust results. For almost all function classes, it resides in the set of top-performing algorithms, making it well applicable in scenarios where the function landscape is previously unknown. It is outperformed on strongly ill-conditioned functions like the two ellipsoid functions (BBOB ID 2 and 10).

Fig. 3.18 Aggregated results showing the best median batch size for a given algorithm on a given function. Results taken from an exemplary point in optimization time after ten time units. Image taken from [1, p. 10]

Considering Fig. 3.17, a broader look at the results can be taken, showing which algorithm actually performed best on which function. Yet, here the best-performing algorithm is chosen directly by the best median performance and thus does not necessitate a statistically significant outperformance. Still, this view can give interesting insights into which algorithm performs well under which circumstances. While MLEI was nearly always in the set of well-performing algorithms in Fig. 3.16, Fig. 3.17 shows that it is often not the single best-performing algorithm. In many cases, MOI-MBO delivered the best median performance. In some cases, MLEI is outperformed by the CMAES, and on rare occasions even by Q-EI.

Figure 3.18 analyzes the algorithms in the newly introduced format of unit time. The results shown are not meant to actually judge these algorithms, as they only represent one moment in time for one specific set of costs of a computer simulation. Yet, we want to use them to present how the introduced benchmarking framework, which assigns costs in unit time, can be applied and what knowledge can be gained from the results.


Firstly, it is easy to notice that for both the 1+1 ES and the random search, batch size 16 is always the best. With a batch size of 16, only a few iterations can be made, yet the total number of evaluations is the largest. The random search does not lose efficiency when scaled/parallelized. Its expected final result should only depend on the number of evaluations done, not on the number of iterations, which is correctly represented in the plot. For the 1+1 ES, this behavior is more interesting. As the plot was generated for a rather early point in optimization time, the step-size adaptation of the 1+1 ES did not have enough time to adjust to the different function landscapes. Thus, up to that point, having more starting points seems more beneficial than having more iterations of local search. For the other algorithms, a clear difference is visible. The CMA-ES seems to prefer smaller population sizes with more iterations on most function classes. The same goes for the model-based algorithms, as they report the most progress especially in the early iterations of their optimization. Having a few more iterations in such an early stage of the optimization helps on most functions. On problem 23 (Katsuura), where most algorithms were not able to outperform random search, batch size 16 performs best for nearly all algorithms, again indicating that a rather random search strategy with more sample points wins.

3.6.4 Conclusions

In order to conclude this section, we would like to start by reconsidering the posed research question: RQ-3.6 How to combine the gathered knowledge into an efficient parallel SBO algorithm? To answer this question, we proposed the adaptive Multi-Local Expected Improvement (MLEI) algorithm. The algorithm goes through multiple phases, which are based on the remaining optimization budget. Considering that early exploration is desired and later exploitation is beneficial, the algorithm enacts precisely this behavior. Starting with a purely explorative search, the algorithm becomes more and more exploitative until the last stage does no exploration anymore. The presented algorithm shows great overall performance, especially on multi-modal functions with adequate global structure. More importantly, it delivered robust results over a large set of different landscapes. This makes the algorithm well applicable to unknown black-box problems. We assume that this is a positive outcome of the synergy between the explorative and the exploitative parts of the algorithm. While the exploitative steps (1+1 ES and PV infill criterion) result in quick convergence on easier problem landscapes, the Q-EI phases help explore more complex or multi-modal landscapes.

Previous analysis (c.f. Sect. 3.3) showed that existing parallel SBO algorithms do not scale well to larger batch sizes. The performance of the Q-EI and IPI methods dropped significantly for larger batch sizes. For Q-EI, large batch sizes result in points mainly being placed in untouched search regions. With IPI, many similar points might be proposed if the batch size is too large. Both of these behaviors prohibit an efficient convergence to the optimum. In contrast to that, the MLEI parallelization scaled well even with a batch size of 16. There, it was never outperformed on any two-dimensional function and only on two of the five-dimensional functions. Similarly good scaling was only observed for MOI-MBO.

In summary, we have introduced a novel parallel SBO algorithm. MLEI scales well to large batch sizes and shows very good performance across test functions with various properties. Future research should be done on a range of topics regarding the MLEI algorithm. Firstly, there is the choice of the 1+1 ES as the local optimizer. Generally, any local optimizer could be used to replace the ES. Thus, additional research might lead to better alternatives. Secondly, the rate at which the search should switch from exploration to exploitation needs further investigation. Is the currently applied linear transition based on the remaining budget a good choice? Would an earlier or later transition to exploitation result in better performance? Possibly, the transition could even be made based on observed landscape features of a function. To give just one example, when one notices that one is faced with a relatively simple unimodal problem, a quick transition to pure exploitation should yield the best performance. In contrast to that, on a very multi-modal problem, additional exploration should be beneficial.

3.7 Adaptive Parameter Selection

Parts of the following section are based on "Benchmark-Driven Configuration of a Parallel Model-Based Optimization Algorithm" by Rehbach et al. [1]. Significant parts of the paper were taken verbatim and included in this section. Parts of it, especially the introduction and conclusion, were rewritten and extended to fit the scope of this thesis.

After the introduction of Multi-Local Expected Improvement (MLEI) in Sect. 3.6, one remaining problem is that in many application cases for parallel optimization algorithms, the best possible batch size is not directly obvious. Increasing the batch size of an algorithm might yield good results faster but also increases resource usage. As covered in detail in Sect. 3.1, changing the batch size usually incurs an indirect cost: the cost of each objective function evaluation can change when more or fewer evaluations are done in parallel. This is especially true for computer experiments, where the resources for level one parallelization (speeding up each objective function evaluation) and level three and four parallelization (speeding up the search by doing multiple evaluations in parallel) are shared. If more cores are assigned to each objective function evaluation, then the evaluation time per evaluation will be lower, yet fewer proposal points of an algorithm can be evaluated in parallel. Assigning more or fewer cores to one of the levels changes the overall system's efficiency.


Section 3.6 introduced the Multi-Local Expected Improvement (MLEI) algorithm and analyzed its behavior across different batch sizes. This showed that there are significant performance differences between the configured batch sizes. Furthermore, the best possible batch size changed depending on the test problem to which the algorithm was applied. A proper configuration of the batch size is crucial for any parallel optimization. Yet, tuning the batch size of an algorithm for a specific objective function is usually infeasible due to the high cost per evaluation. We therefore ask the following research question: RQ-3.7.1 How to choose a batch size for a new unknown problem? We investigate approaches to automatically configure the batch size of parallel model-based algorithms by extracting specific properties of the function landscape.

The idea of strategically choosing an algorithm for the optimization problem at hand goes far back in the history of evolutionary computation. Consider, for example, the work of Rice in 1976 [72]. More recently, Mersmann et al. introduced Exploratory Landscape Analysis (ELA) as a method to match optimization algorithms to problems efficiently, i.e., by extracting properties from problems with few evaluations [73]. The work defines eight high-level expert features to classify the BBOB test set. These features describe, e.g., the level of multi-modality, the global structure, the separability, and the homogeneity in a rather subjective manner. Mersmann et al. suggested a set of low-level features, e.g., convexity, curvature, local search, and y-distribution, with a total of 50 numerical sub-features to characterize the problem structure [74]. Based on these features, a new field of research in algorithm selection arose. Multiple successful applications of automated feature computation employing Exploratory Landscape Analysis (ELA) to address the algorithm selection problem can be found, e.g., [75, 76]. They successfully train algorithm selection models for the BBOB test set. A good overview of the algorithm selection problem is given by Muñoz et al. [77].

Similar approaches can be applied to automatically determine a suitable batch size for parallel SBO algorithms. For this purpose, we reconsider the efficiency of specific batch size configurations as analyzed in Sect. 3.6.3. The goal is to determine combinations of function landscapes and dimensionality that advocate for a certain batch size. We investigate the possibility of predicting efficient batch size configurations for parallel SBO algorithms based on landscape features in a benchmark-based dynamic algorithm configuration approach. The dynamic algorithm configuration measures certain landscape features at each iteration in order to predict the most effective batch size for the next iteration. Similar to the work of Biedenkapp et al. [78], we aim to adapt algorithm parameters during a running optimization. The described approach is applied to predict algorithm configurations on unseen instances of BBOB test functions. Furthermore, the prediction is then applied to function instances generated by a data-driven approach to evaluate the prediction quality on new functions.


3.7.1 Benchmark Based Algorithm Configuration

Facing an expensive-to-evaluate optimization problem, any existing knowledge about the problem landscape can help to configure an algorithm properly. Yet, knowledge about the problem can often only be gained by evaluating the problem itself. When such a problem has to be approached, parameter or algorithm choices can hardly be made. Given that in most cases at least the dimensionality and the cost of the objective function are known, we consider a scenario where the choice was already made to use SBO to handle the high evaluation cost. However, as was shown in Sect. 3.4, even then the algorithm performance crucially depends on a correct hyperparameter setup. Tuning the algorithm on the expensive problem is infeasible, as there is no budget for running multiple configurations and observing which parameters work well.

We consider two options to define a suitable batch size for parallel algorithms. The first approach is based on an offline benchmarking study. The task is to select a good algorithm configuration for the unknown expensive-to-evaluate objective function. In this first approach, the algorithm configuration is chosen before any evaluations are made on the expensive objective function. Furthermore, the configuration stays fixed throughout the optimization. In order to choose a proper configuration for an algorithm, the given algorithm can be benchmarked over a wide variety of cheap functions with different landscapes. It will be possible to observe which parameter settings work well for which kinds of problem landscapes. Furthermore, robust parameters that work across multiple landscapes, as well as 'niche' parameters that work only in very specific setups, can be identified. Running these benchmarks on cheap-to-evaluate functions is likely less costly and less computationally intensive than a single or just a few evaluations of the expensive problem. The gained knowledge can, for example, be applied to the algorithm by choosing a robust setting for the unknown landscape of the expensive problem. The described approach could also be applied in situations with many possible batch sizes to configure, or even in situations including other (possibly continuous) algorithm parameters. However, in this case, a reasonable subset of parameter configurations needs to be chosen for evaluation in the benchmarks. For example, a subset could be chosen by employing a space-filling design such as Latin Hypercube Sampling. Yet, for now, we will assume that only a limited number of algorithm configurations need to be benchmarked, e.g., a set of fewer than 20 different batch sizes.

The second approach aims to change the algorithm configuration during the optimization runtime (dynamic algorithm configuration). More specifically, the task is to choose the best batch size after every iteration of the algorithm. This approach still involves the offline benchmarking phase. Yet, the results of these benchmarks are then used to fit a prediction model. The Exploratory Landscape Analysis (ELA) features of each of the benchmarked functions can be measured, as well as which algorithm configuration worked well for which function. The task of the model is then to accept such a set of landscape features of a new problem and to predict a well-working algorithm configuration based on it. The landscape features can be computed after each iteration of the algorithm, giving the algorithm the most accurate data to predict the batch size for the next iteration. If such a model could make correct predictions during the algorithm runtime, then algorithms would always be correctly configured, even for new and unseen problems. This relies on correct predictions of the model, which are likely hard to achieve. Mersmann et al. argue that reliable predictions require functions with at least similar features to be known to the predicting model [73].

In the following, we briefly introduce the implementation of the proposed approach, illustrating it on the Multi-Local Expected Improvement (MLEI) algorithm, which was introduced in Sect. 3.6. As explained earlier, an initial benchmarking phase on cheap-to-evaluate functions is required. For this purpose, we reuse the experiments of the MLEI algorithm on the 24 BBOB functions, which were presented in Sect. 3.6. All evaluated points are stored. We use the R-package flacco [79] to generate features based on these sample points. In flacco itself, the measurable features are arranged in sets. We exclude all feature sets that require additional function evaluations, as we do not want to spend any budget on the feature measurements but rather only reuse the already available evaluations. Furthermore, the distribution features ela_distr were excluded. This feature set often yielded numerical problems when it was calculated on the ES sample points, which are locally close to each other. The final prediction model uses the following feature sets as input parameters: problem dimensionality (ela_basic), meta model features (ela_meta), and information content features (ela_ic). These three feature sets contain a total of 14 features. These were furthermore reduced to increase the predictive quality of our model. The selection was made by recursive feature elimination while monitoring the cross-validated prediction accuracy of the model. The complete list of the eight chosen features which were used in our experiments is given in the following: basic.dim, ela_meta.lin_simple.coef.max, ela_meta.quad_simple.adj_r2, ela_meta.quad_w_interact.adj_r2, ic.eps.s, ic.eps.max, ic.eps.ratio, ic.m0.

Since the model is supposed to predict which algorithm configuration works well on which kind of function, the performance of the MLEI algorithm is measured on each available test function and with each available configuration. The winning configuration is determined to be the algorithm configuration that delivers the best median objective function value on a given problem. Depending on the optimization demands, the median might also be replaced with other measures, for example, the worst observed result if looking for a robust approach. The described mapping from landscape features to the best algorithm configuration can be used to fit a prediction model. We choose a random forest [80] model for its robust properties. The model was implemented using the 'randomForest' R-package [81] with the package's default parameters. Replacing the model in this framework is easily possible. The fitted model can be incorporated into the optimization loop of any parallel optimizer to suggest the batch size for upcoming iterations. We exemplify this procedure by adding the batch size prediction to the MLEI algorithm that was introduced in Sect. 3.6.
Each time new sample points are expensively evaluated on the objective function, flacco is used to estimate the characteristics of the given objective function. The measured features are
fed into the random forest model, which was trained on the benchmark data. Based on the features, it will give an estimate of which algorithm configuration should work best for the problem at hand. Therefore, before the start of each new iteration of the Multi-Local Expected Improvement (MLEI) algorithm, the algorithm is configured with a new batch size. At each iteration, the number of proposed points that shall be evaluated in parallel can change based on the current assumptions of the optimized problem landscape.
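
To make the loop just described more concrete, the following minimal R sketch shows how the ELA features could be computed with flacco [79] and fed into a random forest from the 'randomForest' package [81]. The helper name, the data objects benchmarkFeatures and bestBatchSize, and the surrounding optimizer loop are illustrative assumptions, not the exact implementation used here.

```r
library(flacco)        # ELA feature computation
library(randomForest)  # batch size classifier (2 vs. 8)

# Compute the ELA features described above from all points evaluated so far.
# X: matrix of evaluated points, y: vector of corresponding objective values.
elaFeatures <- function(X, y) {
  featObj <- createFeatureObject(X = X, y = y)
  feats <- c(calculateFeatureSet(featObj, set = "basic"),
             calculateFeatureSet(featObj, set = "ela_meta"),
             calculateFeatureSet(featObj, set = "ic"))
  keep <- c("basic.dim", "ela_meta.lin_simple.coef.max",
            "ela_meta.quad_simple.adj_r2", "ela_meta.quad_w_interact.adj_r2",
            "ic.eps.s", "ic.eps.max", "ic.eps.ratio", "ic.m0")
  unlist(feats)[keep]   # the eight features kept after feature elimination
}

# Offline phase (assumed data): 'benchmarkFeatures' holds one feature vector per
# benchmarked run, 'bestBatchSize' the winning configuration (2 or 8) per run.
# rfModel <- randomForest(x = benchmarkFeatures, y = as.factor(bestBatchSize))

# Online phase, called before each new iteration of the parallel optimizer:
# batchSize <- as.numeric(as.character(
#   predict(rfModel, newdata = t(elaFeatures(X, y)))))
```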

Simulation-Based Benchmarks

The discussed approach can be used to predict batch sizes for new, unseen instances of BBOB functions, as, for example, done by Mersmann et al. [74]. Yet, we are interested in analyzing the predictive quality on unseen black-box functions. Even though new instances of the BBOB functions are shifted and scaled, they largely remain similar and are arguably easily recognized by the features given to a predictive model. To measure the real performance of this algorithm configuration approach for unseen functions, we need truly new and unseen functions. We try to alleviate this problem by employing Kriging simulations as test functions. Kriging simulations were previously covered in Sect. 3.2. Details of Kriging prediction and simulation are, e.g., provided by Cressie [23] and Zaefferer et al. [8, 21]. Further details on how Kriging prediction is usually used in expensive optimization are, e.g., given by Forrester et al. [28]. The core idea of Continuous Optimization Benchmarks by Simulation (COBBS) is to generate new problem instances from existing data, especially in contexts where 'real' problem instances (i.e., the corresponding objective functions) are unavailable or too expensive for an in-detail experimental investigation such as a benchmarking study [8]. When these simulations are not conditioned by the training data (unconditional simulation), the resulting test instances may have similar landscapes (in terms of activity or number of local optima) but do not reproduce the data that the model was trained with (see also Fig. 1 in [8]). As a result, prediction errors are not meaningful in evaluating their quality. We use unconditional simulation in this study.

We add an extension to the Kriging simulation in COBBS. The Kriging model used by Zaefferer and Rehbach [8] is a stationary one with a constant trend. This will often limit the quality of the model and hence the quality of the produced test instances. A single stationary model will either be able to fit the global features of the input data well or excel in reproducing the finer, local problem structures. As only one activity parameter is fitted per problem dimension, it is hard to represent the data's local and global features at the same time. We use a two-stage approach to address this issue, which is to some extent similar to a single round of boosting [82]. When training the Kriging model, a first-stage model is generated with the nugget effect enabled. The nugget effect allows the Kriging model to regress and smoothen the data [28]. This enables the model to fit the global trend of the data but will ignore (i.e., smoothen) finer structures of a landscape.
Once this model is trained, a secondary Kriging model is fitted to the first-stage model's residuals. The second-stage model does not employ the nugget effect and will attempt to interpolate, i.e., fit the training data exactly. Thus, the second-stage model can account for any local structures that are not well represented by the first-stage model. The combination of both models (i.e., the addition of the predictions or simulations from each model) produces a model with a flexible, non-stationary trend. Importantly, the non-stationary trend itself can be varied for the simulation. In general, we determine the activity parameters θi by maximum likelihood estimation. Here, optimizing the likelihood involves numerical optimization via Differential Evolution (DE) [25, 70]. Each time a model is fitted, 200 × problem dimension likelihood evaluations are performed by DE. The parameters of our model are determined with relatively generous lower and upper bounds for the search (here: lower bound 10^-6, upper bound 10^12). These bounds are identical for each dimension i of θi and allow us to deal with the fairly pathological behavior of the BBOB function set (low as well as extremely high activity). We aim to build a first-stage model which fits slower changes in the data (i.e., the global trend). Hence, we expect that the activity parameters θ of the Gaussian kernel that we use (cf. Eq. 3.2) are much smaller (i.e., correlations decay more slowly) than in the second-stage model. To exploit that assumption, we specify different lower bounds for the second-stage model: for each dimension i, the lower bound of θi in the second-stage model is set to the θi determined when fitting the first-stage model. This reduces the size of the likelihood search space and hence simplifies the search.
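
The following R sketch illustrates the two-stage idea. It uses the DiceKriging package [66] purely for illustration; the actual implementation may differ, and the coupling of the second-stage lower bounds to the first-stage activity estimates described above is omitted here for brevity.

```r
library(DiceKriging)  # Kriging models via km()

# X: design matrix (n x d), y: vector of observed objective values
twoStageKriging <- function(X, y) {
  d <- data.frame(X)
  # Stage 1: regressing model with nugget, captures the smooth global trend
  m1 <- km(design = d, response = y, covtype = "gauss",
           nugget.estim = TRUE, control = list(trace = FALSE))
  res <- y - predict(m1, newdata = d, type = "UK")$mean  # stage-1 residuals
  # Stage 2: interpolating model (no nugget) on the residuals,
  # accounts for local structure missed by the first stage
  m2 <- km(design = d, response = res, covtype = "gauss",
           control = list(trace = FALSE))
  list(m1 = m1, m2 = m2)
}

# Combined prediction: flexible trend (stage 1) plus local residual model (stage 2)
predictTwoStage <- function(models, Xnew) {
  dn <- data.frame(Xnew)
  predict(models$m1, newdata = dn, type = "UK")$mean +
    predict(models$m2, newdata = dn, type = "UK")$mean
}
```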

3.7.2 Experiments

Benchmarking on BBOB

In order to predict the performance of algorithm configurations on specific functions, the results of the Multi-Local Expected Improvement (MLEI) runs with static batch sizes from Sect. 3.6.2 were analyzed. To create a simple proof of concept, we only compare two configurations, low (batch size of 2) and high (batch size of 8), in a binary classification problem. The results on each BBOB function and each dimensionality are treated separately. The batch size with the lowest median across 15 repeats on such a function-dimension pair is reported as the best-performing configuration for that pair. The ELA features are measured separately for every function instance. Thus, the collected training data consists of 24 functions × 2 dimensions × 15 instances = 720 samples. Each sample represents one pair of landscape features together with a recorded best batch size. The random forest prediction model is built using these 720 samples. The prediction model is then used in new experiments. It continuously proposes new batch sizes for the Multi-Local Expected Improvement (MLEI) algorithm in additional optimization runs. For this, at each iteration, the ELA features of the
function instance at hand are calculated and used for prediction. This means that at the beginning of an optimization, these predictions are only based on very few sampled points. The longer the optimization is running, the more samples are available and the better the ELA feature estimate, as well as the model prediction, should get. For the first set of the dynamic adaptation experiments, the static algorithms were run on the BBOB instances 1–15 of each function to generate the training data. Then, the model was tasked to predict during the optimization on instances 16–30. Thus, we employ a similar train/test split as Mersmann et al. [74]. Therefore, if the model is capable of correctly predicting the batch size for those instances, it can be argued that the model can accurately generalize to new, unseen function instances. However, remember that the ultimate goal of such a dynamic adaptation would be to correctly predict the batch size of a function class that was never seen before. If this goal could be met, then such an approach would be perfectly suited for costly real-world problems where no a priori knowledge is available. As the BBOB instances only represent changes through translation and stretching, the resulting instances still have many similarities. This makes it easier for the described model to correctly classify which function class is represented and, therefore, which algorithm configuration should be chosen.
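
As a small illustration of how the training data described above could be assembled, the following base-R sketch labels each function-dimension pair with its winning batch size and restricts the training set to instances 1–15. The data frame 'results' and its column names are assumptions for the example, not the data structures used in the actual experiments.

```r
# 'results' is assumed to hold one row per optimization run with the columns:
# funId, dim, instance, batchSize (2 or 8), and bestY (best value reached).
winningConfig <- function(results) {
  # median best value per (function, dimension, batch size)
  agg <- aggregate(bestY ~ funId + dim + batchSize, data = results, FUN = median)
  # per (function, dimension): keep the batch size with the lowest median
  do.call(rbind, lapply(split(agg, list(agg$funId, agg$dim), drop = TRUE),
                        function(g) g[which.min(g$bestY), ]))
}

# Train on instances 1-15; the adaptive runs are later evaluated on 16-30.
# labels <- winningConfig(subset(results, instance <= 15))
```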

Benchmarking on Simulated Functions

We circumvent the described problem by measuring the predictive power of the adaptation approach on a more diverse set of new objective functions. To create more independent test functions, we employ Kriging simulations with the methodology described in Sect. 3.7.1. One test function is created for each of the 24 BBOB function classes in their two- and five-dimensional forms. The simulations are fitted by training the Kriging model with 400 sample points each. The samples stem from the previous optimization runs on those functions. We use the same approach of generating training data as described in [8] because the samples (a) cover the whole search space in a reasonable manner and (b) cover potentially optimal areas more densely, thus being able to represent both local and global patterns in the data. The respective results were taken from previous runs of the static Multi-Local Expected Improvement (MLEI) algorithm with a configured batch size of eight. The first-stage model is trained with the nugget effect to account for a rough global trend (see Sect. 3.7.1). The lower and upper bounds on the kernel parameters θ are 10^-6 and 10^12, respectively, accounting for a wide range of activity in the search landscapes. The Kriging simulations are derived from these fitted models, using the spectral method [23]. Importantly, this method produces the simulation by superimposing many cosine functions. Compared to other simulation methods (i.e., those based on matrix decomposition), it is beneficial that this method inherently creates a continuous function. Other methods might require interpolating simulated data, which can limit the quality and/or computational efficiency of the simulation [8]. We configured our

3.7 Adaptive Parameter Selection

87

simulation to use a specific number of superimposed cosine functions: 100 times the dimension of the underlying data. We perform unconditional simulations. Unconditional simulations imply that the simulation is not conditioned by the training data. The training data will only influence how the model parameters (e.g., activity θ, nugget) are fitted. Thus, the simulation will only attempt to reproduce the covariance structure of the training data but not the actual training data itself. This results in a greater variety of different functions because the simulated landscapes are not directly constrained by the training data (see also Fig. 1 in [8]). The adaptive algorithm is then run on all simulated functions trying to predict the best batch size configuration. The prediction model is still only fitted to the data observed on BBOB functions. In order to measure the true best batch size, the static Multi-Local Expected Improvement (MLEI) algorithm is also applied to each function. Both algorithms are again run for a total of 15 repeats.
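
A generic sketch of this spectral construction is given below: an approximate sample path of a stationary Gaussian-kernel Gaussian process is built by superimposing N random cosine functions, with N set to 100 times the problem dimension as configured above. This is an illustration of the general technique only; the exact COBBS implementation, including the combination of the two model stages, may differ in its details.

```r
# theta: fitted activity parameters of the Gaussian kernel exp(-sum(theta_i * d_i^2)),
# sigma2: process variance. Returns one simulated test function.
spectralSimulation <- function(theta, nCosines = 100 * length(theta), sigma2 = 1) {
  d <- length(theta)
  # frequencies follow the spectral density of the Gaussian kernel: N(0, 2 * theta_i)
  W <- matrix(rnorm(nCosines * d, sd = rep(sqrt(2 * theta), each = nCosines)),
              nrow = nCosines, ncol = d)
  b <- runif(nCosines, 0, 2 * pi)   # random phases
  a <- rnorm(nCosines)              # random amplitudes
  function(x) {                     # continuous, unconditionally simulated landscape
    sqrt(2 * sigma2 / nCosines) * sum(a * cos(drop(W %*% x) + b))
  }
}

# f <- spectralSimulation(theta = c(5, 0.1))  # one two-dimensional test instance
# f(c(0.3, -1.2))
```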

3.7.3 Results

Results on BBOB

Looking at the results of our experiments, we would first like to stress that the goal of these experiments was not to compare the Multi-Local Expected Improvement (MLEI) algorithm to other algorithms, nor to measure its absolute performance. We only aim to analyze whether the best-performing batch size for a given algorithm can be correctly predicted based on landscape features. Figure 3.19 shows the achieved batch size predictions in a summarized manner. Let us first consider the left-hand side of the figure, representing the results on BBOB functions. The plot is separated into the 24 BBOB function IDs (y-axis) as well as by dimensionality (two and five). The single separated column of blue and red rectangles shows the ground truth that should be predicted. It was determined via experiments run with the static batch size version of Multi-Local Expected Improvement (MLEI) as described in Sect. 3.7.2. The cells to the left of that column show the predictions of the Multi-Local Expected Improvement (MLEI) runs with adaptive batch size. The color of each rectangle is averaged over the 15 repeats done on distinct BBOB function instances. Thus, a strongly colored red or blue rectangle represents that the model delivered the same predictions across all instances, whereas a white rectangle represents an iteration where different predictions were made across instances.

Notably, the plot shows that for most two-dimensional functions, a large batch size performs better than a small batch size. The opposite is the case for the five-dimensional functions. At first, this seemed counter-intuitive. One might expect that more complex higher-dimensional problems might require a larger batch size for a more explorative search. However, the experiments conducted by Rehbach et al. [32] also indicated that exploitation is increasingly important for functions with higher dimensionality.

Fig. 3.19 Batch size predictions during the adaptive Multi-Local Expected Improvement (MLEI) runs. Left panel: BBOB functions; right panel: simulated functions; panel rows: two- and five-dimensional problems. Y-axis: BBOB function ID 1–24. X-axis: scaled optimization runtime ('Time', refer back to Eq. 3.1). The model only predicts batch sizes two or eight (binary classification task). The color of a cell shows the predicted batch size averaged over multiple repeats. The separated colors on the right of each plot represent the measured ground truth of the static Multi-Local Expected Improvement (MLEI) variant that should be predicted. Image taken from [1, p. 12]

Overall, the results on the BBOB functions look promising. In most cases, the model correctly predicts the best batch size, even early in the optimization where only a few samples are available to calculate the ELA features. In the final iteration, the model correctly predicts all 24 function classes, both on the two- as well as on the five-dimensional problems. However, in seven out of the 48 cases (24 functions × 2 dimensions), the correct prediction is not made for all 15 instances. In these seven cases, we can also observe the largest variation in the predictions. Consider, for example, the five-dimensional version of function 14 (different powers function). For roughly three-quarters of the optimization runtime, a small batch size was predicted for most instances. Only in the final evaluations was the model able to correct this. This might be due to the fact that the landscape of the different powers function is very similar to other functions until the singular spike in objective function value is found. Based on the results that were observed on the BBOB functions, we can argue that even a fairly simple model, e.g., our random forest setup, can easily predict a well-performing algorithm configuration for new instances of BBOB functions. It is
surprising to see how few samples are necessary to calculate a set of ELA features that can be used for a correct prediction. Yet, these results only represent the ability to give predictions for instantiated functions of BBOB classes. As previously discussed, these cannot be argued to be entirely new and unseen functions.

Results on Simulated Functions

The right-hand side of Fig. 3.19 shows the results on the test functions generated via Kriging simulation. Again, the single separated column of colored rectangles represents the ground truth that should be predicted by the model. The ground truth is measured by running the static version of the Multi-Local Expected Improvement (MLEI) algorithm on the newly generated functions and recording the best median performing configuration. As expected, the predictions on the simulated functions are less accurate than on the BBOB functions. Roughly counted, the model correctly predicts only 14 out of the 24 functions for each of the two- and five-dimensional cases. This result is only marginally better than random guessing. The prediction is clearly not working well on the unseen simulated functions. Interestingly, the model's predictions actually lie closer to the ground-truth column of the BBOB functions than to that of the simulated functions. This might indicate that the simulated functions were able to recreate landscapes with similar landscape features (hence the model prediction still appears to work) but which are still so different that the algorithm performance was far off. Further analysis of the actual quality of the created test functions will be necessary to draw a final conclusion on that argument.

3.7.4 Conclusions

In order to conclude this section, we would once again like to reconsider the initially posed research question:

RQ-3.7 How to choose a batch size for a new unknown problem?

We presented an automatic algorithm configuration approach based on features measured by exploratory landscape analysis. Additionally, we presented an adaptive variant of the Multi-Local Expected Improvement (MLEI) algorithm that changes the configured batch size after each iteration based on a prediction model. The model uses ELA features which are estimated on all the points that have been evaluated so far during an optimization run. The described approach worked well for predicting new BBOB instances in a train and test split similar to the one employed by Mersmann et al. [73]. This is likely due to the fact that the instances in the BBOB suite are still rather similar to one another. As expected, the results on new, unseen function classes (the simulated functions) showed much less predictive accuracy. Hence, good predictive quality for truly new functions could not be achieved. Future work is required to see how results on new functions can be improved. Topics of interest might focus on choosing a different modeling technique or
providing a larger and more diverse training data set. A promising idea to increase the model's predictive power is introduced in recent work by Muñoz and Smith-Miles [83]. They argue that each function can be represented in a multi-dimensional ELA feature space. Then, genetic programming is used to sample new functions that fill gaps in that space. Having a sufficiently filled function space as the training data for the batch size prediction model might yield better results, as less abstraction to unseen functions would be required from the prediction model.

References 1. F. Rehbach, M. Zaefferer, A. Fischbach, G. Rudolph, T. Bartz-Beielstein, Benchmark-driven configuration of a parallel model-based optimization algorithm. IEEE Trans. Evolut. Comput. 1 (2022) 2. D. Whitley, S. Rana, J. Dzubera, K.E. Mathias, Evaluating evolutionary algorithms. Artif. Intell. 85(1–2), 245–276 (1996) 3. F. Rehbach, M. Zaefferer, J. Stork, T. Bartz-Beielstein, Comparison of parallel surrogateassisted optimization approaches, in Proceedings of the Genetic and Evolutionary Computation Conference - GECCO 18 (ACM Press, 2018), pp. 1348–1355 4. B. Bischl, S. Wessing, N. Bauer, K. Friedrichs, C. Weihs, Moi-mbo: multiobjective infill for parallel model-based optimization, in International Conference on Learning and Intelligent Optimization (Springer, 2014), pp. 173–186 5. G. De Ath, R.M. Everson, J.E. Fieldsend, A.A.M. Rahat, -shotgun: -greedy batch bayesian optimisation, in Proceedings of the 2020 Genetic and Evolutionary Computation Conference, GECCO ’20 (Association for Computing Machinery, New York, 2020), pp. 787–795 6. T. Bartz-Beielstein, C. Doerr, D.V.D. Berg, J. Bossek, S. Chandrasekaran, T. Eftimov, A. Fischbach, P. Kerschke, W. La Cava, M. Lopez-Ibanez et al., Benchmarking in optimization: best practice and open issues (2020). arXiv preprint arXiv:2007.03488 7. H. Kotthaus, J. Richter, A. Lang, J. Thomas, B. Bischl, P. Marwedel, J. Rahnenführer, M. Lang, Rambo: Resource-aware model-based optimization with scheduling for heterogeneous runtimes and a comparison with asynchronous model-based optimization, in International Conference on Learning and Intelligent Optimization (Springer, 2017), pp. 180–195 8. M. Zaefferer, F. Rehbach. Continuous optimization benchmarks by simulation, in Parallel Problem Solving From Nature – PPSN XVI: 16th International Conference (2020). Accepted for Publication 9. S.J. Daniels, A.A. Rahat, R.M. Everson, G.R. Tabor, J.E. Fieldsend, A suite of computationally expensive shape optimisation problems using computational fluid dynamics, in International Conference on Parallel Problem Solving from Nature (Springer, 2018), pp. 296–307 10. N. Hansen, A. Auger, O. Mersmann, T. Tusar, D. Brockhoff, COCO: a platform for comparing continuous optimizers in a black-box setting. ArXiv e-prints (2016). arXiv:1603.08785v3 11. J.J. Liang, B. Qu, P.N. Suganthan, A.G. Hernández-Díaz, Problem definitions and evaluation criteria for the cec 2013 special session on real-parameter optimization. Comput. Intell. Labor., Zhengzhou University, Zhengzhou, China and Nanyang Technological University, Singapore, Tech. Rep. 201212(34), 281–295 (2013) 12. N. Hansen, S. Finck, R. Ros, A. Auger, Real-parameter black-box optimization benchmarking 2009: noiseless functions definitions. Technical Report RR-6829, INRIA (2009) 13. D. Merkel, Docker: lightweight linux containers for consistent development and deployment. Linux J. 2014(239), 2 (2014) 14. C. Boettiger, An introduction to docker for reproducible research. ACM SIGOPS Operat. Syst. Rev. 49(1), 71–79 (2015)

15. G. Rudolph, M. Preuss, J. Quadflieg, Two-layered surrogate modeling for tuning optimization metaheuristics. Technical Report TR09-2-005 (TU Dortmund, Dortmund, 2009). Algorithm Engineering Report 16. M. Preuss, G. Rudolph, S. Wessing, Tuning optimization algorithms for real-world problems by means of surrogate modeling, in Genetic and Evolutionary Computation Conference (GECCO’10) (ACM, Portland, 2010), pp.401–408 17. T. Bartz-Beielstein, How to create generalizable results, in Springer Handbook of Computational Intelligence. ed. by J. Kacprzyk, W. Pedrycz (Springer, Berlin, 2015), pp.1127–1142 18. O. Flasch, A modular genetic programming system. Ph.D. thesis, Technische Universität Dortmund, Dortmund, Germany (2015) 19. A. Fischbach, M. Zaefferer, J. Stork, M. Friese, T. Bartz-Beielstein, From real world data to test functions, in 26. Workshop Computational Intelligence (KIT Scientific Publishing, Dortmund, 2016), pp. 159–177 20. N. Dang, L. Pérez Cáceres, P. De Causmaecker, and T. Stützle. Configuring irace using surrogate configuration benchmarks, in Genetic and Evolutionary Computation Conference (GECCO’17) (ACM, Berlin, 2017), pp. 243–250 21. M. Zaefferer, A. Fischbach, B. Naujoks, T. Bartz-Beielstein, Simulation-based test functions for optimization algorithms, in Genetic and Evolutionary Computation Conference (GECCO’17) (ACM, Berlin, 2017), pp.905–912 22. C. Lantuéjoul, Geostatistical Simulation: Models and Algorithms (Springer, Berlin, 2002) 23. N.A. Cressie, Statistics for Spatial Data (Wiley, New York, 1993) 24. N. Hansen, D. Brockhoff, O. Mersmann, T. Tusar, D. Tusar, O.A. ElHara, P.R. Sampaio, A. Atamna, K. Varelas, U. Batu, D.M. Nguyen, F. Matzner, A. Auger, Comparing Continuous Optimizers: numbbo/COCO on Github (2019) 25. R. Storn, K. Price, Differential evolution-a simple and efficient heuristic for global optimization over continuous spaces. J. Global Optim. 11(4), 341–359 (1997) 26. D. Ardia, K.M. Mullen, B.G. Peterson, J. Ulrich, DEoptim: Differential evolution in R. https:// CRAN.R-project.org/package=DEoptim (2020). Version 2.2-5. Accessed 25 Feb 2020 27. T. Bartz-Beielstein, J. Stork, M. Zaefferer, C. Lasarczyk, M. Rebolledo, J. Ziegenhirt, W. Konen, O. Flasch, P. Koch, M. Friese, L. Gentile, F. Rehbach, SPOT - sequential parameter optimization toolbox - v20200429. https://github.com/bartzbeielstein/SPOT/releases/tag/v20200429 (2020). Accessed 29 April 2020 28. A. Forrester, A. Keane, et al., Engineering Design via Surrogate Modelling: A Practical Guide (Wiley, 2008) 29. J.A. Nelder, R. Mead, A simplex method for function minimization. Comput. J. 7(4), 308–313 (1965) 30. J. Ypma, H.W. Borchers, D. Eddelbuettel, nloptr vers-1.2.1: R interface to nlopt. http://cran.rproject.org/package=nloptr (2019). Accessed 20 Nov 2019 31. H. Wang, B. van Stein, M. Emmerich, T. Bäck, Time complexity reduction in efficient global optimization using cluster Kriging, in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO’17) (ACM, Berlin, 2017), pp. 889–896 32. F. Rehbach, M. Zaefferer, B. Naujoks, T. Bartz-Beielstein, Expected improvement versus predicted value in surrogate-based optimization, in Proceedings of the 2020 Genetic and Evolutionary Computation Conference, GECCO ’20 (Association for Computing Machinery, 2020), pp. 868–876 33. M. Zaefferer, J. Stork, M. Friese, A. Fischbach, B. Naujoks, T. 
Bartz-Beielstein, Efficient global optimization for combinatorial problems, in Proceedings of the 2014 Conference on Genetic and Evolutionary Computation (GECCO’14) (ACM, New York, 2014), pp. 871–878 34. M. Zaefferer, Combinatorial efficient global optimization in R - CEGO v2.2.0 (2017), https:// cran.r-project.org/package=CEGO. Accessed 07 Nov 2018 35. A.L. Marsden, M. Wang, J.E. Dennis, P. Moin, Optimal aeroacoustic shape design using the surrogate management framework. Optim. Eng. 5(2), 235–262 (2004) 36. R.K. Ursem, From Expected Improvement to Investment Portfolio Improvement: Spreading the Risk in Kriging-Based Optimization (Springer International Publishing, Cham, 2014), pp.362– 372

37. D. Ginsbourger, R. Le Riche, L. Carraro, Kriging is well-suited to parallelize optimization, in Computational Intelligence in Expensive Optimization Problems (Springer, 2010), pp. 131–162 38. W.H. Kruskal, W.A. Wallis, Use of ranks in one-criterion variance analysis. J. Am. Stat. Assoc. 47(260), 583–621 (1952) 39. W.J. Conover, R.L. Iman, On multiple-comparisons procedures, in Technical Report LA-7677MS (Los Alamos Science Laboratory, 1979) 40. W.J. Conover, Practical Nonparametric Statistics, 3rd edn. (Wiley, 1999) 41. T. Pohlert, The pairwise multiple comparison of mean ranks package (pmcmr) (2014). Accessed 12 Jan 2016 42. M. Rebolledo, F. Rehbach, A.E. Eiben, T. Bartz-Beielstein, Parallelized bayesian optimization for problems with expensive evaluation functions, in GECCO ’20: Proceedings of the Genetic and Evolutionary Computation Conference Companion (Association for Computing Machinery, 2020), pp. 231–232 43. M. Rebolledo, F. Rehbach, A.E. Eiben, T. Bartz-Beielstein, Parallelized bayesian optimization for expensive robot controller evolution, in Parallel Problem Solving from Nature – PPSN XVI: 16th International Conference (2020) Accepted for Publication 44. N. Hansen, Y. Akimoto, P. Baudis, Cma-es/pycma on github (2019) 45. K. Ushey, J. Allaire, Y. Tang. Reticulate: Interface to ‘Python’ (2019). R package version 1.13 46. J. Derrac, S. García, D. Molina, F. Herrera, A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evol. Comput. 1(1), 3–18 (2011) 47. P. Auer, Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res. 3, 397–422 (2002) 48. A.D. Bull, Convergence rates of efficient global optimization algorithms. J. Mach. Learn. Res. 12, 2879–2904 (2011) 49. S. Wessing, M. Preuss, The true destination of EGO is multi-local optimization, in 2017 IEEE Latin American Conference on Computational Intelligence (LA-CCI) (IEEE, 2017) 50. R. Lam, K. Willcox, D.H. Wolpert, Bayesian optimization with a finite budget: an approximate dynamic programming approach, in Advances in Neural Information Processing Systems 29 (NIPS 2016), ed. by D.D. Lee, M. Sugiyama, U.V. Luxburg, I. Guyon, R. Garnett (Curran Associates, Inc., 2016), pp. 883–891 51. H. Wang, M. Emmerich, T. Back, Cooling strategies for the moment-generating function in bayesian global optimization, in 2018 IEEE Congress on Evolutionary Computation (CEC) (IEEE, Rio de Janeiro, 2018), pp.1–8 52. H. Wang, M. Emmerich, T. Bäck, Towards self-adaptive efficient global optimization, in Proceedings LeGO – 14th International Global Optimization Workshop. AIP Conference Proceedings, vol. 2070 (Leiden, 2019) 53. M. Osborne, Bayesian Gaussian processes for sequential prediction, optimisation and quadrature. Ph.D. thesis (University of Oxford, 2010) 54. J. Snoek, H. Larochelle, R.P. Adams, Practical Bayesian optimization of machine learning algorithms, in Advances in Neural Information Processing Systems 25 (NIPS 2012). ed. by F. Pereira, C.J.C. Burges, L. Bottou, K.Q. Weinberger (Curran Associates Inc, Lake Tahoe, NV, USA, 2012), pp.2951–2959 55. J.M. Hernández-Lobato, M.W. Hoffman, Z. Ghahramani, Predictive entropy search for efficient global optimization of black-box functions, in Proceedings of the 27th International Conference on Neural Information Processing Systems NIPS’14, vol. 1 (MIT Press, Cambridge, 2014), pp. 918–926 56. V. Nguyen, S. Rana, S.K. Gupta, C. Li, S. 
Venkatesh, Budgeted batch bayesian optimization, in 2016 IEEE 16th International Conference on Data Mining (ICDM) (IEEE, 2016) 57. M. Hoffman, E. Brochu, N. de Freitas, Portfolio allocation for bayesian optimization, in Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, UAI’11 (AUAI Press, Arlington, 2011), pp. 327–336 58. T. de P. Vasconcelos, D.A.R.M.A. de Souza, C.L.C. Mattos, J.P.P. Gomes, No-past-bo: normalized portfolio allocation strategy for bayesian optimization

59. U. Noé, D. Husmeier, On a new improvement-based acquisition function for bayesian optimization (2018) 60. G. De Ath, R.M. Everson, A.A.M. Rahat, J.E. Fieldsend, Greed is good: exploration and exploitation trade-offs in bayesian optimisation. ArXiv e-prints (2019) 61. T. Bartz-Beielstein, C. Lasarczyk, M. Preuss, Sequential parameter optimization, in Proceedings Congress on Evolutionary Computation, (CEC’05) (Scotland, Edinburgh, 2005), p.1553 62. T. Bartz-Beielstein, J. Stork, M. Zaefferer, M. Rebolledo, C. Lasarczyk, J. Ziegenhirt, W. Konen, O. Flasch, P. Koch, M. Friese, L. Gentile, F. Rehbach, Spot: sequential parameter optimization toolbox – version 2.0.4 (2019), https://cran.r-project.org/package=SPOT. Accessed 15 Nov 2019 63. M.D. McKay, R.J. Beckman, W.J. Conover, Comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21(2), 239–245 (1979) 64. B. Bischl, J. Richter, J. Bossek, D. Horn, J. Thomas, M. Lang, mlrMBO: a modular framework for model-based optimization of expensive black-box functions (2017) 65. B. Bischl, J. Richter, J. Bossek, D. Horn, M. Lang, J. Thomas, mlrmbo: Bayesian optimization and model-based optimization of expensive black-box functions – version 1.1.2 (2019), https:// cran.r-project.org/package=mlrMBO. Accessed 15 Nov 2019 66. O. Roustant, D. Ginsbourger, Y. Deville, DiceKriging, DiceOptim: two R packages for the analysis of computer experiments by kriging-based metamodeling and optimization. J. Stat. Softw. 51(1), 1–55 (2012) 67. M. Hollander, D.A. Wolfe, E. Chicken, Nonparametric Statistical Methods, 3rd edn. (Wiley, New York, 2014) 68. I. Rechenberg, Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution (Frommann-Holzboog, 1973) 69. T. Bartz-Beielstein, M. Zaefferer, Model-based methods for continuous and discrete global optimization. Appl. Soft Comput. 55, 154–167 (2017) 70. K. Mullen, D. Ardia, D. Gil, D. Windover, J. Cline, DEoptim: an R package for global optimization by differential evolution. J. Stat. Softw. 40(6), 1–26 (2011) 71. T. Bartz-Beielstein, J. Stork, M. Zaefferer, M. Rebolledo, C. Lasarczyk, J. Ziegenhirt, W. Konen, O. Flasch, P. Koch, M. Friese, L. Gentile, F. Rehbach, Spot: sequential parameter optimization toolbox – version 2.0.4 (2019), https://cran.r-project.org/package=SPOT. Accessed 15 Nov 2019 72. J.R. Rice, The algorithm selection problem, in Advances in Computers, vol. 15 (Elsevier, 1976), pp. 65–118 73. O. Mersmann, M. Preuss, H. Trautmann, Benchmarking evolutionary algorithms: towards exploratory landscape analysis, in Parallel Problem Solving from Nature, PPSN XI (2010), pp.73–82 74. O. Mersmann, B. Bischl, H. Trautmann, M. Preuss, C. Weihs, G. Rudolph, Exploratory landscape analysis, in Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation (2011), pp. 829–836 75. B. Bischl, O. Mersmann, H. Trautmann, M. Preuß, Algorithm selection based on exploratory landscape analysis and cost-sensitive learning, in Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation (2012) 76. P. Kerschke, H. Trautmann, Automated algorithm selection on continuous black-box problems by combining exploratory landscape analysis and machine learning. Evol. Comput. 27(1), 99–127 (2019) 77. M.A. Muñoz, Y. Sun, M. Kirley, S.K. Halgamuge, Algorithm selection for black-box continuous optimization problems: A survey on methods and challenges. Inf. Sci. 317, 224–245 (2015) 78. A. 
Biedenkapp, H.F. Bozkurt, T. Eimer, F. Hutter, M. Lindauer, Dynamic algorithm configuration: foundation of a new meta-algorithmic framework, in ECAI 2020 (IOS Press, 2020), pp. 427–434 79. P. Kerschke, H. Trautmann, Comprehensive feature-based landscape analysis of continuous and constrained optimization problems using the r-package flacco, in Applications in Statistical Computing (Springer, 2019), pp. 93–123

80. L. Breiman, Random forests. Mach. Learn. 45(1), 5–32 (2001) 81. A. Liaw, M. Wiener, Classification and regression by randomforest. R News 2(3), 18–22 (2002) 82. G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning (Springer, 2017) 83. M.A. Muñoz, K. Smith-Miles, Generating new space-filling test instances for continuous blackbox optimization. Evol. Comput. 28(3), 379–404 (2020)

Chapter 4

Application

4.1 Electrostatic Precipitators

Parts of the following section are based on "Model-based evolutionary algorithm for optimization of gas distribution systems in power plant electrostatic precipitators" by Schagen et al. [1] and "Comparison of Parallel Surrogate-Assisted Optimization Approaches" by Rehbach et al. [2]. Parts of the papers were taken verbatim and included in this section. Major parts of the papers were rearranged, rewritten, and restructured to fit into the scope of this thesis.

4.1.1 Problem Description

The methods from Chap. 3 were applied to various real-world optimization problems from different disciplines. In the following section, we present the work done on one of these industrial applications. We aim to show the significant impact and potential benefits that a single such optimization study can have in industry. The project was researched in cooperation between TH Köln and Steinmüller Babcock Environment GmbH. We first aim to introduce and motivate the industrial application.

Electrostatic Precipitators (ESPs) are used for dust separation in various industrial processes, e.g., to clean flue gas streams of coal-fired power plants. They commonly reach sizes of up to 50 m × 30 m × 30 m and thus result in millions of euros in steel building costs alone. Additionally, these large devices result in high maintenance and operation costs. All these costs depend on the size of an ESP. Smaller ESPs result in lower building and operation costs as well as less maintenance. Yet, the large size of an ESP is required because specific standards have to be met regarding the quality of


Fig. 4.1 Schematic of an ESP with 3 separation zones. This figure was kindly provided by Steinmüller Babcock Environment GmbH

the cleaned gas leaving the separator. If the efficiency of an ESP can be increased, then the same level of particle separation from flue gases could be achieved with a smaller building size. Thus, increasing the separation efficiency of an ESP yields huge potential for reduced building and operating costs. An illustration of such an ESP is given in Fig. 4.1. Flue gas streams into the separator through an inlet hood and enters the first separation zone. Discharge electrodes negatively charge the dust particles, which are then attracted to collecting electrodes. The same process is repeated iteratively with multiple separation zones until the desired level of particle separation is reached, and "clean" gas leaves the separator. The efficiency of an ESP mainly depends on the gas velocity distribution within the separation zones. To ensure a sufficient flow field, a Gas Distribution System (GDS) is required in the inlet hood of the ESP to control and guide the gas flow through the separation zones in which particles are removed from the exhaust gases. If no GDS is used, or if the system is configured poorly, the fast inlet gas stream will rush through the separation zones of the ESP. This results in very low separation efficiencies. In the case of a well-configured GDS, the inflowing gas is evenly distributed across the whole surface of the separation zones, resulting in high efficiency. Hence, a good GDS configuration is crucial to reducing costs and the environmental impact of power plants.

Fig. 4.2 Figures kindly provided by Steinmüller Babcock Environment GmbH. (a) GDS mounted in the inlet hood of an ESP. Flue gas streams in from the right and is distributed by the GDS. The GDS consists of three layers. The first layer only contains blocking and perforated plates. The second and third layers also contain baffles (darker green). (b) Desired gas velocity distribution at the inlet of the first separation field in the ESP

Such a GDS (as depicted in Fig. 4.2a) usually consists of several hundred elements. The specific GDS that this project operated on has a total of 334 configurable slots. Each of these slots can be configured with different metal elements such as baffles, blocking plates, and perforated plates. Baffles are metal plates that are mounted at an angle to the incoming gas flow. They are used to redirect a gas stream in a new direction. Blocking plates completely block a gas stream. Perforated plates are used to slow down and only partially block gas streams. They are created by punching a grid of holes into metal plates. Smaller holes lead to higher pressure drops and thus a slower gas stream. Larger holes allow for a nearly free gas flow. Table 4.1 summarizes the possible choices from which an optimizer can choose for the 334 slots.

Table 4.1 Summary of possible configurations for the GDS. The GDS of this project has 334 configurable slots. 200 of these are for perforated and blocking plates, the other 134 slots are filled with baffles. The algorithm has to propose a configuration 1–6 for each slot that takes perforated plates and a configuration 0/1 for each slot that takes baffles

Configuration proposed by algorithm        0    1    2    3    4    5    6
Perforated plates: % free cross-section    -    0    40   50   60   70   100
Baffles: configured on/off                 Off  On   -    -    -    -    -

Depending on the slots, either six types of perforated and blocking plates are available or two distinct states on how to configure baffles. Selecting the best baffle or plate for each configurable slot in the GDS is a very complex discrete optimization problem with roughly 10^195 possible solutions. Usually, a heuristic optimization approach with only a few GDS configurations (chosen by an engineer), in conjunction with highly detailed CFD simulations and cold flow measurements by hand, is used to find a suitable GDS configuration. This results in an acceptable but not optimal gas velocity distribution, especially at the inlet area of the first separation field. Each run of the CFD simulation results in multiple hours of computation time. Thus, only a very limited number of evaluations is available to any optimizer. The optimization aims to establish a desired gas velocity profile at the inlet of the first separation zone. Figure 4.2b depicts the desired velocity distribution. This profile shows higher velocities than the mean velocity in the upper part and smaller velocities in the lower part. This velocity distribution shall be met as closely as possible by configuring the plates and baffles in the GDS.
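
To make the encoding in Table 4.1 concrete, the following R sketch decodes a candidate solution, i.e., one integer per slot, into the corresponding plate and baffle settings. The slot layout, the function name, and the candidate format are illustrative assumptions; the project-specific encoding is not published in this form. Note that 6^200 × 2^134 ≈ 10^196 combinations, which is consistent with the roughly 10^195 possible solutions stated above.

```r
freeCrossSection <- c(0, 40, 50, 60, 70, 100)  # plate configurations 1-6 (Table 4.1)

# candidate: integer vector with one entry per slot (1-6 for plates, 0/1 for baffles)
# slotType:  character vector marking each slot as "plate" or "baffle"
decodeGDS <- function(candidate, slotType) {
  stopifnot(length(candidate) == length(slotType))
  lapply(seq_along(candidate), function(i) {
    if (slotType[i] == "plate") {
      list(slot = i, element = "plate",
           freeCrossSection = freeCrossSection[candidate[i]])
    } else {
      list(slot = i, element = "baffle", mounted = candidate[i] == 1)
    }
  })
}

# Example with the slot counts given above: 200 plate slots and 134 baffle slots
# slotType  <- c(rep("plate", 200), rep("baffle", 134))
# candidate <- c(sample(1:6, 200, replace = TRUE), sample(0:1, 134, replace = TRUE))
# config    <- decodeGDS(candidate, slotType)
```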

4.1.2 Methods

Before the start of this project, GDSs were optimized manually by engineers. That means an experienced engineer is required to configure the 334 baffles and metal plates in a simulation of the ESP. The configuration is mostly based on the engineer's personal experience and minor adaptations of previous simulations run for the same ESP. The configuration process can take multiple hours, depending on the number of changes that are required. Once a simulation is prepared, it is sent to run on the computation cluster. The simulation is usually run overnight, so that new results can be evaluated in the morning. This evaluation is again a time-consuming manual step. The engineer generates multiple images representing the gas flow through the separation zones and then visually compares them with the ideal velocity distributions to examine the quality of the simulation. Each such iteration of
creation, evaluation, and adaptation of a single configuration for the gas distribution system roughly takes one working day. Therefore, before any optimization task in this project could be approached, an automation system for the previously manual evaluation of the CFD simulations was developed and implemented. Every time the algorithm proposes a new GDS configuration, a new CFD simulation has to be compiled by generating a new mesh. The simulations are created and meshed with the open-source CFD framework OpenFOAM V.4.0 [3]. Computational resources are assigned to the simulation on a cluster. Lastly, each simulation goes through a post-processing phase in which the quality of the configuration is determined by measuring the simulated gas velocity distribution. Covering all these automation steps in detail lies outside of the scope of this thesis. For more details, we refer to Rehbach [4].

One crucial consideration for the optimization is the correct setup of the algorithm's batch size and the number of cores with which each simulation shall be evaluated on the computation cluster. For this purpose, we refer back to Sect. 2.3.1 for the pros and cons of larger or smaller batch sizes. Methods discussed in Sect. 3.1 are used to determine a proper batch size for this project. The computation cluster for this project had 128 cores available for the optimization task. The final choice was made to use a batch size of eight, with each CFD simulation evaluated on 16 cores of the cluster. This yielded the best trade-off between the speed-up of the individual simulations and the number of evaluations that can be run in parallel. With the assigned 16 cores per simulation, each simulation still runs for more than two hours. Therefore, the decision was made not to start the optimization with a random or stratified sampled (e.g., Latin hypercube sampling) design. Instead, two CFD simulations with reduced accuracy were created. Since these simulations do not deliver the same quality of results, their runtime is largely reduced. The simulations with reduced accuracy take roughly 2 and 20 minutes each to compute. The idea is to create a multi-level approach in which the search on each level gives good starting conditions for the upcoming levels. Thus, running a relatively detailed search on the reduced models is feasible and only requires the runtime of very few accurate (long-running) simulations. For more details of the multi-model approach, we again refer to Rehbach [4].

Finally, the discrete optimization problem is approached with algorithms that are covered in Sect. 3.3. More specifically, we apply the synchronous version of SBO+EA. The hybrid algorithm proposes six points with evolutionary operators (mutation/recombination). The remaining points are generated with a Kriging model: in each iteration, one point with the expected improvement infill criterion and another point with the predicted value infill criterion. We refer to Schagen et al. [1] for a more detailed description of the experimental setup.
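
The multi-level idea can be summarized in a few lines of R. In the sketch below, plain random search stands in for the actual SBO+EA optimizer and two toy functions stand in for the reduced- and full-accuracy CFD simulations; all names and budgets are illustrative assumptions only. The essential point is that the best configurations of each level seed the initial design of the next, more accurate level.

```r
# levels:  list of objective functions ordered from cheapest to most accurate
# budgets: number of evaluations allowed per level
multiLevelOptimization <- function(levels, budgets, d, nSeeds = 8) {
  seeds <- matrix(runif(nSeeds * d), ncol = d)  # initial design on the cheapest level
  for (k in seq_along(levels)) {
    newX <- matrix(runif((budgets[k] - nrow(seeds)) * d), ncol = d)
    X <- rbind(seeds, newX)                     # seeds plus new candidates
    y <- apply(X, 1, levels[[k]])               # evaluate on the current level
    seeds <- X[order(y)[seq_len(nSeeds)], , drop = FALSE]  # carry the best forward
  }
  seeds[1, ]                                    # best configuration of the last level
}

# Toy stand-ins for a fast, inaccurate and a slow, accurate simulation:
# cheap <- function(x) sum((x - 0.3)^2) + rnorm(1, sd = 0.05)
# full  <- function(x) sum((x - 0.3)^2)
# multiLevelOptimization(levels = list(cheap, full), budgets = c(700, 50), d = 5)
```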

4.1.3 Results

Automation and Multi-Fidelity Optimization

The first major improvement in the project is the complete automation of the generation and evaluation of CFD simulations. A framework was developed that accepts a CFD simulation with a particular naming scheme for all plates and baffles. The framework is capable of automatically generating new, changed simulation models based on a proposed configuration (coming from an algorithm) and the existing simulation case. The exact setup and details of this framework are covered in [4]. With the given framework, the creation of a new simulation case only requires computation time on a cluster and depends on the desired accuracy of the simulation (the accuracy strongly depends on the number of cells and thus determines the runtime of the simulation). This framework is also what first enabled applying different accuracy levels in the simulations. In principle, an engineer could always have created a simulation with any desired runtime and accuracy, and thus also faster-running simulations. However, the manual changes and the evaluation of a simulation require nearly a full day of manual work. Furthermore, the simulations are mostly run overnight, outside of the engineers' working hours. Thus, speed-ups to the simulation would previously only result in a loss of accuracy but not in a speed-up of the overall optimization. Yet, with the automated approach, for the very first time many faster-running simulations can be configured and evaluated to quickly gain an understanding of the search space. The required runtime for each simulation can be reduced from multiple hours to a few minutes.

Figure 4.3 shows the results of one full optimization run on the ESP problem. The switches from the fast-running simulation model to the longer-running ones are marked with vertical lines in the plot. It is clear that after each change in simulation accuracy, a drop in solution quality (fitness) occurs. More importantly though, even though a drop in quality occurs, each following level starts with much better-configured starting points. So much better, in fact, that the starting configuration of the last (high-accuracy) optimization already yielded better performance than the best-found manual results of the engineers. Furthermore, the additional runtime overhead is arguably minimal. The accuracy levels are each roughly separated by an order of magnitude in runtime. Thus, the 700 cheap evaluation runs done in the first phase only required the runtime of seven 'full' evaluations.

Recreation Case Study

One problem with judging the quality of any configuration for the GDS is that the true optimum is unknown. Even if the best configuration for the GDS were known, it is unlikely that it would, in fact, exactly recreate the desired distribution shown in Fig. 4.2b. Some delta from the optimal gas distribution will likely always remain, as only a discrete set of choices can be made for the configuration of the GDS.

Fig. 4.3 Best-so-far measured fitness versus the number of simulations evaluated. Vertical lines represent the switches in simulation accuracy. The color of the line represents the method with which the given sample point was found. Image based on [4, p. 71]

Therefore, it is hard to judge the quality of the optimization results solely on the best-found configuration. To truly judge the algorithm's quality on the ESP problem, we require an objective function with a known optimum. A test scenario was created that fulfills this requirement. Instead of reducing the difference to the desired gas velocity distribution, the algorithm was applied to reduce the difference to the velocity distribution of a known GDS configuration. For this, an existing GDS configuration was simulated, and the observed velocities were defined as the optimization goal for a new set of experiments. This effectively places the optimum of the optimization task at a known location and quality. If an algorithm found the optimal GDS configuration in this scenario, the error in the velocity distribution would become zero. Any difference from the optimum can then be calculated as the distance from zero. Figure 4.4 presents the results of this reconstruction case study. A random starting point for the algorithm is shown on the left-hand side. Significant differences to the known velocity distribution (shown on the right) are visible. Finally, the best-found reconstruction is shown in the middle. While differences in the velocity distributions are still visible, the reconstructed case is surprisingly close to the goal velocity distribution. The results are especially impressive considering that a total of 334 discrete variables (roughly 10^195 possible solutions) have to be configured in only 1000–2000 evaluations, dependent on the allowed runtime on the company's computation clusters.

Fig. 4.4 Images taken from [4, p. 76]. Colors represent velocity; the scale is given on the right-hand side in m/s. (a) Velocity distribution of a randomly created individual. White areas are above or below the scaling range and are thus not plotted. (b) Best algorithmically found reconstruction of the given velocity field. (c) Goal velocity distribution that has to be reproduced

The results show that the designed algorithm is well suited to find a close-to-optimal configuration for a gas distribution system.
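
As a simple illustration of the reconstruction error discussed above, the deviation between a simulated and a target velocity field can be expressed as an aggregate distance measure. The thesis does not state the exact objective formulation at this point, so a root-mean-square deviation is used here purely as an assumed example.

```r
# simulatedField, targetField: vectors of gas velocities sampled at the same
# measurement points at the inlet of the first separation zone
velocityError <- function(simulatedField, targetField) {
  stopifnot(length(simulatedField) == length(targetField))
  sqrt(mean((simulatedField - targetField)^2))
}

# In the reconstruction study, the target field is itself the simulation result
# of a known GDS configuration, so an error of zero marks the known optimum.
```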

Results on the ESP

Finally, to discuss the main results of the project, we consider the plots in Figs. 4.5a and b. On the left-hand side, the best solution as found by an engineer is shown. This exact design was built into an existing ESP before the project was reopened for further optimization.

Fig. 4.5 Image taken from [4, p. 78]. (a) Heatmap plot of the velocity distribution of the best manually achieved individual. Colors represent velocity in x-direction in m/s according to the scale on the right-hand side. (b) Heatmap plot of the velocity distribution achieved through optimization with the evolutionary algorithm

Figure 4.5b presents the measured velocity distribution obtained with the best-found GDS configuration of the SBO+EA algorithm. The manually engineered solution still shows the remnants of the fast inlet gas stream in the center of the inlet hood. The gas passes through the GDS and cannot be sufficiently directed away from this central area. Furthermore, this solution shows a large area of negative gas velocity in the separator's lower parts. This means gas is pulled back into the GDS area. This so-called backflow is especially bad for the ESP's separation efficiency, as already separated particles can be pulled back out of the collection hoppers and reenter the main gas flow. In contrast to that, the SBO+EA optimized solution reproduces the desired velocity distribution for the ESP with high quality. Large separator areas show gas velocities that are optimal for a high separation efficiency (bright green areas in the plot). Furthermore, the fast inlet gas stream and the backflow area at the bottom of the
separator are resolved. In total, the optimized configuration yields a 30% better velocity distribution than the originally engineered one, resulting in significant improvements to the ESP's separation efficiency.

4.1.4 Conclusions

In summary, this section showed that parallel SBO methods are not only very efficient in optimizing costly-to-evaluate functions, but are also well applicable to industrial applications. Firstly, we have shown that a complex industrial task can be effectively automated, resulting in a largely reduced workload for project engineers. Furthermore, this automation allowed for efficiency gains achieved by running faster yet lower-accuracy models to refine the search space. Promising regions in the search space can then be checked efficiently with higher-accuracy models. The reconstruction case study proves that the implemented algorithm can find high-quality GDS configurations even with very small computation budgets. The algorithmic approach delivers a solution quality far superior to that previously found in manual optimization approaches by engineers. Furthermore, the parallel optimization approach delivers these high-quality results in a shorter time frame while reducing the workload for the engineers. Previously, an engineer could only set up, run, and evaluate a simulation roughly once a day. This resulted in a maximum of about five evaluations per week over the course of multiple weeks. The fully automated optimization process runs for only two weeks without human interaction to find the presented results of 30% higher quality.

4.2 Application Case Studies

We would like to complete this chapter by providing a brief overview of additional applications to which the methods from Chap. 3 were applied, showing the broad applicability of SBO to various real-world problems.

Hospital Capacity Planning

The BaBsim.Hospital resource planning tool was developed under special consideration of the Covid-19 pandemic [5]. The tool is meant for crisis teams to efficiently plan the capacities of hospital beds, ventilation units, and employees. The data-driven approach requires computationally expensive tuning of the simulation parameters. In order to achieve the most accurate capacity forecasts, the parameters are refitted every day to match the quickly changing pandemic situations that the crisis teams are confronted with. Applying parallel SBO resolved the problem of limited computational resources under the significant load of daily model refitting. Without the
significant efficiency gains introduced by parallel SBO, daily refitting would simply not have been an option. Among other use cases, the final simulation and tuning concept was applied in cooperation with the local government of the German city of Gummersbach and a UK hospital [6–8].

Self-adapting Robots

A further application stems from a research cooperation with the Vrije Universiteit Amsterdam. The project develops modular robots with 3D-printed building blocks (Fig. 4.6a). The robots are supposed to adapt to given tasks by means of evolution: they can change their morphology (i.e., add, remove, or change building blocks). An optimization approach drives these modular adaptations. First, the robot morphology is changed; then the robot requires an optimized controller for its newly possible movements. The robots are simulated with the robot evolve toolkit 'revolve' [9]. The resulting expensive-to-evaluate simulation problem was again tackled and improved by parallel SBO techniques to efficiently tune the robot controllers [10]. In this way, the robot controllers can be optimized in a time frame similar to the 3D printing and assembly of the robots, enabling a fluent test procedure.

Material Design

A task in material design is to create sandwich composite structures that consist of two thin laminate outer skins and a lightweight (e.g., honeycomb) core structure. The goal is to design a lightweight material with high flexural stiffness. A sandwich material consists of multiple of these layers, which are stacked upon each other. Figure 4.6b shows a simulation of such a layered material.

Fig. 4.6 Sample images from the robotics application (a) and the material design application (b). (a) Screenshot of a modular robot in the simulator 'revolve'. Image taken from [93, p. 6]. (b) Simulation model of a sandwich composite material. Image taken from [98, p. 4]

In the given project task, a sandwich plate is loaded with lifting and torquing loads, linearly distributed along the length of the plate. The objective is to minimize the displacement after applying the loads to the material. The sandwich material is modeled with 19 layers of equal thickness but different plate materials in the finite element analysis solver Abaqus [11]. The optimization task lies in selecting the best plate materials and production parameters (lamination angles) to design the optimal sandwich material. In this specific application project, a speedup through parallel SBO was not required; however, the resulting simulations are computationally expensive and could be run in parallel on a computation cluster [12, 13]. They are much cheaper than equivalent material studies that would require the manufacturing and testing of the material in a lab. We expect that efficiency gains similar to those observed in the other case studies would be possible for this project as well.

References

1. A. Schagen, F. Rehbach, T. Bartz-Beielstein, Model-based evolutionary algorithm for optimization of gas distribution systems in power plant electrostatic precipitators. Int. J. Gener. Storage Electr. Heat VGB Powertech. 9, 65–72 (2018)
2. F. Rehbach, M. Zaefferer, J. Stork, T. Bartz-Beielstein, Comparison of parallel surrogate-assisted optimization approaches, in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO '18) (ACM Press, 2018), pp. 1348–1355
3. H.G. Weller, G. Tabor, H. Jasak, C. Fureby, A tensorial approach to computational continuum mechanics using object-oriented techniques. Comput. Phys. 12(6), 620–631 (1998)
4. F. Rehbach, Model-assisted evolutionary algorithms for the optimization of gas distribution systems in electrostatic precipitators (TH Köln, 2017)
5. T. Bartz-Beielstein, E. Bartz, F. Rehbach, O. Mersmann, Optimization of high-dimensional simulation models using synthetic data (2020). arXiv:2009.02781
6. E. Bartz, T. Bartz-Beielstein, F. Rehbach, O. Mersmann, K. Elvermann, R. Schmallenbach, F. Ortlieb, S. Leisner, N. Hahn, R. Mühlenhaus, Einsatz künstlicher Intelligenz in der Bedarfsplanung im Gesundheitswesen, hier in der Bedarfsplanung von Intensivbetten im Pandemiefall. Akzeptiert als Poster für den DIVI Kongress 2020, 12 (2020)
7. T. Bartz-Beielstein, M. Dröscher, A. Gür, A. Hinterleitner, O. Mersmann, D. Peeva, L. Reese, N. Rehbach, F. Rehbach, A. Sen, A. Subbotin, M. Zaefferer, Resource planning for hospitals under special consideration of the Covid-19 pandemic: optimization and sensitivity analysis, in Proceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO '21) (Association for Computing Machinery, New York, NY, USA, 2021), pp. 293–294
8. T. Bartz-Beielstein, M. Dröscher, A. Gür, A. Hinterleitner, T. Lawton, O. Mersmann, D. Peeva, L. Reese, N. Rehbach, F. Rehbach, A. Sen, A. Subbotin, M. Zaefferer, Optimization and adaptation of a resource planning tool for hospitals under special consideration of the Covid-19 pandemic, in 2021 IEEE Congress on Evolutionary Computation (CEC) (2021), pp. 728–735
9. E. Hupkes, M. Jelisavcic, A.E. Eiben, Revolve: a versatile simulator for online robot evolution, in Applications of Evolutionary Computation, ed. by K. Sim, P. Kaufmann (Springer International Publishing, Cham, 2018), pp. 687–702
10. M. Rebolledo, F. Rehbach, A.E. Eiben, T. Bartz-Beielstein, Parallelized Bayesian optimization for expensive robot controller evolution, in Parallel Problem Solving from Nature – PPSN XVI: 16th International Conference (2020)
11. K. Hibbitt, Abaqus: User's Manual, Version 6.13 (Hibbitt, Karlsson & Sorensen, Incorporated, 2013)
12. F. Rehbach, L. Gentile, T. Bartz-Beielstein, Feature selection for surrogate model-based optimization, in Proceedings of the Genetic and Evolutionary Computation Conference Companion (ACM, 2019), pp. 399–400
13. F. Rehbach, L. Gentile, T. Bartz-Beielstein, Variable reduction for surrogate-based optimization, in Proceedings of the 2020 Genetic and Evolutionary Computation Conference (GECCO '20) (Association for Computing Machinery, 2020), pp. 1177–1185

Chapter 5

Final Evaluation

Discussion and Conclusion

At the beginning of this document, we outlined existing challenges in the field of parallel SBO. To conclude, we discuss our contributions to these challenges. For this purpose, we reconsider the research questions posed throughout this thesis and structure the presentation of our contributions along the central research questions RQ-1 to RQ-3.

5.1 Define

Benchmarks are crucial tools for the performance analysis of optimization algorithms. Yet, for many application scenarios there were no techniques to benchmark parallel SBO algorithms. Existing techniques only allowed comparing algorithms in specific, fixed scenarios, for example, by configuring one batch size that all algorithms have to stick to. Comparative studies with batch-size-dependent evaluation costs, or even with adaptive parameters that change during the optimization, were not possible. We initially posed RQ-1: How to benchmark and analyze parallel SBO algorithms in a fair, rigorous, and statistically sound way? This led us to our first contribution, C-1:

We introduced a rigorous benchmarking framework that allows for an in-depth comparison of parallel algorithms. The framework enables benchmarks across different batch sizes and even batch size adaptations during algorithm runtime. We can measure important problem characteristics, such as the changing or parameter-dependent runtime of the objective function. The framework then allows transferring the measured behavior to cheap-to-evaluate test functions. This procedure enables benchmarking on various cheap functions as if the optimization were run on the expensive original function, where any benchmarking was previously infeasible due to evaluation cost. Experimentally validating runtimes and algorithm behavior with different batch sizes, as if applied to the real problem, enables tuning approaches for good batch size configurations. This is crucial for an efficient optimization and yields significant performance improvements.

Another issue in benchmarking algorithms for expensive industrial problems is the choice of relevant test functions. We asked RQ-3.2: How to create or choose more representative test functions? C-2:

We covered two distinct approaches for test function selection and generation. Firstly, we discussed the possibility of employing real-world functions from the same domain as test functions. These can, for example, be simulations with reduced accuracy or other approximations of the original problem. To give just one example, we published a reduced-scale version of the original ESP application from Chap. 4 in "A Suite of Computationally Expensive Shape Optimisation Problems Using Computational Fluid Dynamics" [1]. The original simulation required hours to run and needed specialized software that is only available under Linux. The newly developed and publicly available small-scale version can be run in minutes on a single CPU core and can be evaluated on all common operating systems. Since then, the problem has been part of the international "GECCO Industrial Challenge" twice and was used as a test problem in various other studies [2–6]. The ESP simulations with lowered accuracy (Chap. 4) proved to have problem landscapes very similar to the original full-scale model. They are well suited to serve as benchmark functions for the industrial application. Secondly, we investigated test function generation based on Kriging simulations. Our novel approach is capable of generating continuous test problems and is therefore much more widely applicable than the previous solution for discrete problems [7]. These continuous functions can be generated directly from the evaluation data of any real-world problem. The simulations aim to replicate the function's landscape characteristics as closely as possible, making them well suited for benchmarking scenarios. The analysis in Sect. 3.2 showed that this novel approach performs considerably better than estimation-based test functions for most types of underlying real-world functions.
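To illustrate the simulation-based idea at a very small scale, the following minimal sketch builds a cheap, deterministic test function from the evaluation data of an expensive problem. It is a simplified, one-dimensional illustration under several assumptions: scikit-learn Gaussian processes, a dense-grid realization with cubic interpolation, and the helper name simulation_test_function are chosen here for demonstration only and do not reflect the Kriging simulation implementation evaluated in Sect. 3.2.

import numpy as np
from scipy.interpolate import interp1d
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def simulation_test_function(X, y, lower, upper, seed=0, grid_size=2000):
    """Turn evaluation data (X, y) of an expensive 1-D problem into a cheap,
    deterministic test function with a similar landscape."""
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.asarray(X).reshape(-1, 1), np.asarray(y))
    grid = np.linspace(lower, upper, grid_size).reshape(-1, 1)
    # draw one realization of the fitted model on a dense grid ...
    sample = gp.sample_y(grid, n_samples=1, random_state=seed).ravel()
    # ... and interpolate it, so repeated calls are cheap and consistent
    return interp1d(grid.ravel(), sample, kind="cubic")

# Usage: f_cheap = simulation_test_function(X_expensive, y_expensive, 0.0, 1.0)
#        f_cheap(0.37)  # evaluates in microseconds instead of hours

The returned function replicates the estimated landscape characteristics (activity, ruggedness) of the original data rather than just its mean prediction, which is the key difference to estimation-based test functions.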


5.2 Analyze

Given these benchmarking and analysis tools, we aimed to investigate existing shortcomings of parallel SBO. For example, why not just use algorithms that are intrinsically parallel, such as population-based evolutionary algorithms? We asked RQ-2: How well do existing parallel SBO algorithms perform on expensive-to-evaluate problems? What are their benefits and drawbacks? C-3:

We showed in detailed benchmark studies that existing parallel SBO methods outperform classical evolutionary algorithms and sophisticated algorithms such as modern CMA-ES implementations in low-budget optimization. At the same time, we could show that hybrid approaches, which combine surrogate-based methods with classical EAs, outperform parallel SBO methods in many scenarios. This is primarily because existing parallel SBO approaches do not scale well to larger batch sizes.

In Sect. 3.4, we analyzed how strongly different hyper-parameter configurations affect the performance of SBO. We asked RQ-3.4: Can parallelization support the selection of feasible hyper-parameters? C-4:

We showed that the performance of SBO algorithms varies largely depending on the configured kernel and infill criterion. Knowing the best possible configuration for the algorithm a priori is not possible for black-box functions. At the same time, we were able to show that a simple parallel approach that evaluates multiple kernel setups and infill criteria simultaneously delivers robust results and often performs as well as or better than state-of-the-art parallel SBO algorithms.
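The multi-config idea can be sketched in a few lines. The snippet below is a self-contained illustration only and makes several assumptions that are not part of the thesis: scikit-learn Gaussian processes stand in for the Kriging models, the infill criteria are maximized by simple random candidate search, and the function names are invented for this example. Each (kernel, infill) configuration contributes one proposal, and the resulting batch is then evaluated in parallel on the expensive objective.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern

def acquisition(mu, s, y_best, kind):
    if kind == "pv":                    # predicted value: purely exploitative
        return -mu
    if kind == "lcb":                   # lower confidence bound
        return -(mu - 2.0 * s)
    s = np.maximum(s, 1e-12)            # "ei": expected improvement (minimization)
    z = (y_best - mu) / s
    return (y_best - mu) * norm.cdf(z) + s * norm.pdf(z)

def multi_config_batch(X, y, bounds, n_candidates=2000, rng=None):
    """Return one proposal per (kernel, infill) configuration."""
    rng = rng if rng is not None else np.random.default_rng(0)
    cands = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_candidates, X.shape[1]))
    proposals = []
    for kernel in (RBF(), Matern(nu=1.5), Matern(nu=2.5)):
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
        mu, s = gp.predict(cands, return_std=True)
        for infill in ("ei", "pv", "lcb"):
            proposals.append(cands[np.argmax(acquisition(mu, s, y.min(), infill))])
    return np.array(proposals)          # nine points, to be evaluated in parallel

In practice, the returned batch would be dispatched to the available workers, and all results would be written back into a shared archive before the next iteration, so that every configuration benefits from the evaluations proposed by the others.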

The experiments in Sects. 3.3 and 3.4 indicated that one of the largest shortcomings of current parallel SBO algorithms is the tendency towards too much search space exploration. At the same time, too few evaluations are spent exploiting promising regions of the search space, which would yield quick improvements. We examined 24 SBO software frameworks in an extensive literature study. We found 15 frameworks that use EI as the default infill criterion, three that use PV, five that use criteria based on lower or upper confidence bounds [8], and a single one that uses a portfolio strategy (which includes confidence bounds and EI, but not PV). This showed that, to the best of our knowledge, EI is the most commonly applied infill criterion in Kriging-based SBO. Exploitative strategies such as the PV infill criterion are rarely applied. C-5:

Yet, we were able to show that explorative methods such as SBO-EI perform significantly differently from what is often assumed. Especially on higher-dimensional functions (already from d > 3), exploitative techniques seem to be the better choice, despite (or because of) their greedy search strategy. Only on lower-dimensional, multi-modal functions does exploration seem to benefit SBO substantially. Importantly, the study showed that pure exploitation (the PV infill criterion) often gets stuck in multi-modal search spaces. At the same time, EI does not converge fast enough into an optimum given the typical SBO evaluation budgets of a few tens to a few thousand evaluations. The results hint that an initially explorative search that switches to exploitation later on might combine the best of both strategies.

5.3 Enhance

Given this broad set of insights into parallel SBO algorithms and their shortcomings, the sensible next step is to develop new methods that avoid or compensate for them. We asked RQ-3: How can existing drawbacks of parallel SBO algorithms be avoided to enable more efficient optimizations? The in-depth analysis covered in Sects. 3.3, 3.4, and 3.5 revealed multiple such drawbacks that we aimed to resolve. C-6:

We introduced the novel Multi-Local Expected Improvement (ML-EI) algorithm. The algorithm combines explorative and exploitative search steps. The amount of exploration or exploitation per iteration is decided based on the remaining evaluation budget. Starting with a purely explorative search, the algorithm becomes more and more exploitative until the final stage performs no exploration at all. The presented algorithm shows great overall performance, especially on multi-modal functions with adequate global structure. More importantly, it delivered robust results over a large set of different landscapes, which makes the algorithm well applicable to unknown black-box problems. We assume this is a positive outcome of the synergy between the explorative and exploitative parts of the algorithm. While the exploitative steps result in quick convergence on easier problem landscapes, the Q-point Expected Improvement (Q-EI) phases help explore more complex or multi-modal landscapes. Importantly, the algorithm scales well to large batch sizes, making it well suited for large-scale parallelization.
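The budget-dependent balance between exploration and exploitation can be illustrated with a small scheduling sketch. The linear schedule and the function name split_batch below are assumptions made purely for illustration; they show the principle of shifting a fixed-size batch from explorative (Q-EI style) proposals towards exploitative (local, predicted-value style) proposals as the budget is consumed, not the exact rule used by ML-EI.

def split_batch(batch_size, spent_evals, total_budget):
    """Return (n_explore, n_exploit) for the next batch of proposals."""
    remaining = max(total_budget - spent_evals, 0)
    explore_fraction = remaining / total_budget   # 1.0 at the start, 0.0 at the end
    n_explore = round(batch_size * explore_fraction)
    return n_explore, batch_size - n_explore

# Example with batch size 8 and a budget of 200 evaluations:
# split_batch(8, 0, 200)    -> (8, 0)  purely explorative first batch
# split_batch(8, 100, 200)  -> (4, 4)  balanced halfway through
# split_batch(8, 192, 200)  -> (0, 8)  purely exploitative towards the end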

The methods discussed in C-1 enable a manual choice of robust batch size settings for an optimization problem. Yet, it would be more desirable if such a procedure could be automated and delivered not only a robust setup but an optimal one. We asked RQ-3.7: How to choose a batch size for a new, unknown problem? C-7:

We presented an automatic algorithm configuration approach based on ELA. Additionally, we introduced an adaptive variant of the ML-EI algorithm that changes the configured batch size after each iteration based on a prediction model. The model uses ELA features that are estimated from the points already evaluated during an optimization run. The described approach works well for predicting batch sizes for unseen BBOB instances in a train and test split similar to the one employed by Mersmann et al. [9]. Yet, future work is required to extend these findings to truly unseen black-box functions.
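A hypothetical sketch of this adaptive loop is given below. The simplified landscape features, the random forest regressor, and all names (cheap_landscape_features, next_batch_size, the bounds b_min and b_max) are assumptions for illustration only; the actual approach relies on established ELA feature sets and on a model trained offline on benchmark runs for which good batch sizes are known.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def cheap_landscape_features(X, y):
    """A few simplified, ELA-style features computed from the evaluated points."""
    x_best = X[np.argmin(y)]
    dist_to_best = np.linalg.norm(X - x_best, axis=1)
    fdc = np.corrcoef(dist_to_best, y)[0, 1]        # fitness-distance correlation
    return np.array([
        X.shape[1],                                  # problem dimensionality
        np.std(y) / (np.abs(np.mean(y)) + 1e-12),    # spread of the observed values
        fdc,
    ])

# Offline: train on benchmark runs for which good batch sizes are known.
# `feature_matrix` and `best_batch_sizes` are placeholders for such data.
batch_size_model = RandomForestRegressor(n_estimators=200, random_state=1)
# batch_size_model.fit(feature_matrix, best_batch_sizes)

def next_batch_size(X_evaluated, y_evaluated, model, b_min=2, b_max=32):
    """Predict the batch size for the next iteration from the current archive."""
    feats = cheap_landscape_features(X_evaluated, y_evaluated).reshape(1, -1)
    return int(np.clip(np.rint(model.predict(feats)[0]), b_min, b_max))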


5.4 Final Evaluation

In summary, we have contributed significantly to multiple research areas to advance parallel SBO. For the first time, benchmarks of parallel algorithms are feasible with varying batch sizes and changing evaluation costs. The greater availability of real-world relevant test functions and simulation-based test function generation synergizes with the presented benchmarking framework. The conducted analysis gives deep insights into parallel and non-parallel SBO, further uncovering that expected improvement, one of the most accepted and most commonly applied infill criteria, is significantly less efficient than previously assumed. The gathered insights were used to develop the ML-EI algorithm, an efficient hybrid SBO approach. Extensive empirical studies showed that ML-EI performs well and robustly across different kinds of test problems, making it especially suitable for black-box optimization problems. The algorithm showed high evaluation efficiency, outperforming existing parallel approaches on many different problem landscapes. The presented methods not only had a significant impact on the research community but were also applied in multiple industrial applications, proving their practicality. To give just one example, the presented results showed that the model-based approach efficiently optimized the 334-dimensional industrial electrostatic precipitator design problem, which has a very complex, nonlinear problem landscape. The acquired design for the gas distribution system was far superior to the previously industry-applied heuristic approach. The best-found configuration for the gas distribution system yielded a 30% better velocity distribution than the best solution found by experienced engineers. This allowed for substantial operating cost reductions in the project. Furthermore, these higher-quality results were achieved in significantly less time and with considerably fewer human resources.

References

1. S.J. Daniels, A.A. Rahat, R.M. Everson, G.R. Tabor, J.E. Fieldsend, A suite of computationally expensive shape optimisation problems using computational fluid dynamics, in International Conference on Parallel Problem Solving from Nature (Springer, 2018), pp. 296–307
2. R. Karlsson, L. Bliek, S. Verwer, M. de Weerdt, Continuous surrogate-based optimization algorithms are well-suited for expensive discrete problems, in Benelux Conference on Artificial Intelligence (Springer, 2020), pp. 48–63
3. L. Bliek, A. Guijt, R. Karlsson, S. Verwer, M. de Weerdt, Expobench: benchmarking surrogate-based optimisation algorithms on expensive black-box functions (2021). arXiv preprint arXiv:2106.04618
4. F. Rehbach, M. Zaefferer, J. Stork, T. Bartz-Beielstein, Comparison of parallel surrogate-assisted optimization approaches, in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO '18) (ACM Press, 2018), pp. 1348–1355
5. F. Rehbach, L. Gentile, T. Bartz-Beielstein, Feature selection for surrogate model-based optimization, in Proceedings of the Genetic and Evolutionary Computation Conference Companion (ACM, 2019), pp. 399–400
6. F. Rehbach, L. Gentile, T. Bartz-Beielstein, Variable reduction for surrogate-based optimization, in Proceedings of the 2020 Genetic and Evolutionary Computation Conference (GECCO '20) (Association for Computing Machinery, 2020), pp. 1177–1185
7. M. Zaefferer, A. Fischbach, B. Naujoks, T. Bartz-Beielstein, Simulation-based test functions for optimization algorithms, in Genetic and Evolutionary Computation Conference (GECCO '17) (ACM, Berlin, Germany, 2017), pp. 905–912
8. P. Auer, Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res. 3(Nov), 397–422 (2002)
9. O. Mersmann, M. Preuss, H. Trautmann, Benchmarking evolutionary algorithms: towards exploratory landscape analysis, in Parallel Problem Solving from Nature (PPSN XI) (2010), pp. 73–82

Glossary

(1+1)-ES  (1+1)-Evolution Strategy.
CFD  Computational Fluid Dynamics.
CMA-ES  Covariance Matrix Adaptation Evolution Strategy.
EA  Evolutionary Algorithm.
EGO  Efficient Global Optimization.
EI  Expected Improvement.
ELA  Exploratory Landscape Analysis.
ES  Evolution Strategy.
ESP  Electrostatic Precipitator.
FEM  Finite Element Method.
GDS  Gas Distribution System.
IPI  Investment Portfolio Improvement.
LCB  Lower Confidence Bound.
LHS  Latin Hypercube Sampling.
MC-SBO  Multi-Config Surrogate-Based Optimization.
ML-EI  Multi-Local Expected Improvement.
MLE  Maximum Likelihood Estimation.
MOI-MBO  Multi-Objective Infill Model-Based Optimization.
PV  Predicted Value.
Q-EI  Q-point Expected Improvement.
SBO  Surrogate-Based Optimization.
