Impact of Scientific Computing on Science and Society 303129081X, 9783031290817

This book analyzes the impact of scientific computing in science and society over the coming decades. It presents advanc


English Pages 450 [451] Year 2023

Table of contents :
Mathematical and Numerical Modeling
On Extreme Computational Complexity of the Einstein Equations
1 Historical Facts of Importance
2 Exterior and Interior Schwarzschild Solution
3 Einstein Equations of General Relativity
4 Explicit Form of the First Einstein Equation
5 Computational Complexity of the Einstein Equations
6 Concluding Remarks
Systematic Imposition of Partial Differential Equations in Boundary Value Problems
1 Introduction
2 Formal Sum of p-Forms
3 Differential Operator for General Fields
3.1 Hodge Operator and Wedge Product
3.2 Action
3.3 Differential Operator in Space-time
3.4 Differential Operator in Space and Time
4 Instantiation of Particular Models
4.1 Electromagnetism
4.2 Schrödinger Equation
4.3 Elasticity
4.4 Yang-Mills Equation
5 Approximations in Finite Dimensional Spaces
6 Conclusions
Curiously Empty Intersection of Proof Engineering and Computational Sciences
1 Introduction
2 Unfit for Computational Sciences?
3 Closer Investigation
3.1 Architecture of Interactive Theorem Provers
3.2 Theoretical Background
3.3 Role of Elaboration
3.4 Proof Engineering by Example
4 Unexploited Opportunities?
4.1 Reasons to Get Excited
4.2 Reasons to Remain Skeptical
5 Conclusions
Challenges for Mantle Convection Simulations at the Exa-Scale: Numerics, Algorithmics and Software
1 Introduction
2 Geophysics
2.1 Significance and Physics of Mantle Convection
2.2 Challenges and Future Directions
3 Mathematics
3.1 Basic Components of Finite Element Simulations
3.2 Two-Scale Surrogate
3.3 All-at-Once Multigrid Method
4 High Performance Computing
4.1 Grand Challenge Problems
4.2 Scalability Taken Seriously
4.3 Asymptotically Optimal Not Necessarily Good Enough
4.4 With Matrix-Free Multigrid Methods Towards Extreme Scale
4.5 Quest for Textbook Efficiency
5 Conclusions
Remarks on the Radiative Transfer Equations for Climatology
1 Introduction
2 Radiative Transfer Equations for a Stratified Atmosphere
3 Effect of Absorption Coefficient on Atmospheric Temperature
4 Unaffected Sunlight?
5 Earth Albedo
6 General Statement About Earth Albedo
6.1 Energy Type Estimate for the Grey Problem
6.2 Frequency Dependent Case
7 Calculus of Variations for the ν-dependent Case
7.1 Support of the Conjecture
7.2 Numerical Simulations
Simulation and Its Use in Additive Manufacturing
1 General Background
2 Additive Manufacturing in 3D Printing
2.1 L-PBF and DED Techniques
2.2 DfAM
3 Incorporation of Simulation in DfAM
3.1 Simulation Process
3.2 Different Simulation Approaches
4 State-of-the-Art in AM Simulation
5 Speculations About the Future Usability of the Simulation
A Posteriori Error Estimates for Domain Decomposition Methods
1 Introduction
2 Fully Guaranteed a Posteriori Error Estimates
3 Domain Decomposition Methods for the Basic Elliptic Problem
3.1 Problem Formulation
3.2 Schwarz Alternating Method
4 Guaranteed Bounds of Errors
4.1 A Posteriori Error Estimate Adapted to DDM
4.2 Computation of Admissible Flux Fields
5 Numerical Evidence
Optimization and Control
Multi-criteria Problems and Incompleteness of Information in Structural Optimization
1 Introduction
2 Prototype Problem
3 Multi-criteria Optimization of a Beam with a Crack System
4 Optimization of Shell with Uncertainties in Damage Characteristics
5 Conclusion
Stability in Discrete Games with Perturbed Payoffs
1 Introduction
2 Basic Definitions and Notations
3 Auxiliary Lemma and Statements
4 Main Result
5 Other Results
6 Conclusion
Optimal Control Problems in Nonsmooth Solid and Fluid Mechanics: Computational Aspects
1 Introduction
2 Identification of the Slip Bound in the Stokes System with Nonsmooth Slip Conditions
3 Optimal Surface Coating Governed by the Stokes System with Threshold Slip Boundary Conditions
4 Conclusions
Optimal Control Approaches in Shape Optimization
1 Introduction
2 Distributed Penalization of the State System
2.1 Dirichlet Conditions
2.2 Neumann Conditions
3 Hamiltonian Approach in Dimension Two
Optimal Factor Taxation with Credibility
1 Introduction
2 Households and Public Sector
3 Production and Investment
4 Non-Credible Public Policy
5 Savers
6 Government
7 Policy Rules
8 Conclusions and Extensions
8.1 Endogenous Fertility
8.2 Endogenous Mortality
Transformative Direction of R&D into Neo Open Innovation
1 Introduction
2 Development of the Digital Economy
2.1 R&D-Driven Growth
2.2 Dilemma Between R&D Expansion and Declining Productivity
3 Neo Open Innovation
3.1 Self-Propagating Function
3.2 Spinoff to Uncaptured GDP
3.3 Soft Innovation Resources
3.4 Concept of Neo Open Innovation
4 Conclusion
On Tetrahedral and Square Pyramidal Numbers
1 Introduction
2 Main Theorem
3 Proof of the Main Theorem
3.1 Proof with Even n
3.2 Proof with Odd n
4 Weird Numbers
AI and Applications in Health Care
Randomized Constructive Neural Network Based on Regularized Minimum Error Entropy
1 Introduction
2 Incremental Extreme Learning Machine
3 Proposed Method
3.1 Problem Formulation
3.2 Problem Solving
4 Experimental Results
4.1 Comparison of RMSE-MEE-RCNN with IELM
4.2 Comparison of RMSE-MEE-RCNN with ELM, B-ELM And MLP-MEE
5 Conclusion
On the Role of Taylor's Formula in Machine Learning
1 Introduction
2 Methods
2.1 Basic Formulations
2.2 Taylor's Formula
2.3 Autoencoder Inspired by Taylor's Formula
2.4 Feature Selection Method Based on Taylor's Formula
3 Computational Experiments
3.1 Experimental Setting
3.2 Additive Autoencoder Inspired by Taylor's Formula
3.3 Feature Selection for a Distance-Based Classifier
4 Conclusions
Computational Methods in Spectral Imaging
1 Introduction
2 Noise Reduction
3 Model-Based Analysis
4 Machine Learning
5 Applications
6 Concluding Remarks
Method for Radiance Approximation of Hyperspectral Data Using Deep Neural Network
1 Introduction
2 Methods
2.1 Dataset and Network Architecture
2.2 Training
3 Results
4 Discussion
4.1 Findings
4.2 Related Work
5 Conclusions
Directional Wavelet Packets for Image Processing
1 Introduction
2 Quasi-analytic Directional Wavelet Packets
2.1 Properties of qWPs
2.2 Implementation Scheme for 2D qWP Transforms
2.3 Summary of 2D qWPs
3 Numerical Examples
3.1 Image Denoising
3.2 Image Inpainting
4 Discussion
Trends in Scientific Computing
Fifty Years of High-Performance Computing in Finland
1 Introduction
2 History
2.1 From Centralized Systems to Computing Ecosystem
2.2 Development of CSC
2.3 Towards the European HPC Ecosystem
3 Future Developments
3.1 Prerequisites for Success
3.2 LUMI in Kajaani
3.3 ELIXIR and National Nodes
3.4 LUMI's Eco-efficiency
4 Conclusions
Quantum Scientific Computing
1 Introduction
2 Quantum Computing in a Nutshell
3 Quantum Algorithms for Scientific Computing
3.1 Quantum Linear Solver Algorithms
3.2 Hybrid Quantum-Classical Algorithms
3.3 Quantum Algorithms for Partial Differential Equations
4 Opportunities for Scientific Quantum Computing
Quantum Computing at IQM
1 Introduction to Quantum Technology
2 Overview of Quantum Computing
2.1 History of Quantum Computing
2.2 Differences Between Quantum and Classical Computing
3 Basics of Quantum Computing Theory
3.1 Qubit
3.2 Quantum Information Processing
3.3 Quantum Algorithms
4 Quantum Computing at IQM
4.1 IQM Technology
4.2 Co-design Quantum Computers and Their Components
5 Future Outlook at IQM
Contribution of Scientific Computing in European Research and Innovation for Greening Aviation
1 Introduction
2 Technologies to Reduce Aircraft Drag
2.1 Drag Reduction by Laminar Flow and Flow Control
2.2 High Lift Design for Transonic Wings
2.3 Research on Shock Wave Boundary Layer Interaction
3 Aeroacoustic Research to Reduce Aircraft Noise
3.1 Low-Noise Technologies for Propulsion Systems
3.2 Airframe Noise Reduction Research
4 Computational and Numerical Methods in Aeronautics
4.1 Development of Reynolds Averaged Navier-Stokes (RANS) Flow Solvers
4.2 Numerical Methods for Industrial Needs
5 European Scientific Networks in Flow Physics and Design Methods
6 Conclusions
Thirty Years of Progress in Single/Multi-disciplinary Design Optimization with Evolutionary Algorithms and Game Strategies in Aeronautics and Civil Engineering
1 Introduction and Motivation
2 Methods of Design Optimization
2.1 Evolutionary Methods Genetic Algorithms and Evolutionary Algorithms
2.2 Multi-objective Optimization and Game Theory
2.3 Advanced Methods and Tools for EAs
3 Progress of the Use of Optimization Methods in Aeronautical Engineering Design
3.1 Period 1994–2004
3.2 Period 2004–2014
3.3 Period 2014–2022
4 Applications to Civil Engineering Design
5 Conclusions and Perspectives


Computational Methods in Applied Sciences

Pekka Neittaanmäki Marja-Leena Rantalainen   Editors

Impact of Scientific Computing on Science and Society

Computational Methods in Applied Sciences Volume 58

Series Editor Eugenio Oñate, Universitat Politècnica de Catalunya, Barcelona, Spain

This series publishes monographs and carefully edited books inspired by the thematic conferences of ECCOMAS, the European Committee on Computational Methods in Applied Sciences. As a consequence, these volumes cover the fields of Mathematical and Computational Methods and Modelling and their applications to major areas such as Fluid Dynamics, Structural Mechanics, Semiconductor Modelling, Electromagnetics and CAD/CAM. Multidisciplinary applications of these fields to critical societal and technological problems encountered in sectors like Aerospace, Car and Ship Industry, Electronics, Energy, Finance, Chemistry, Medicine, Biosciences, and Environmental Sciences are of particular interest. The intent is to exchange information and to promote the transfer between the research community and industry consistent with the development and applications of computational methods in science and technology. Book proposals are welcome at Eugenio Oñate, International Center for Numerical Methods in Engineering (CIMNE), Technical University of Catalunya (UPC), Edificio C-1, Campus Norte UPC, Gran Capitán s/n, 08034 Barcelona, Spain, [email protected] or contact the publisher, Dr. Mayra Castro, [email protected]. Indexed in SCOPUS, Google Scholar and SpringerLink.


Editors Pekka Neittaanmäki Faculty of Information Technology University of Jyväskylä Jyväskylä, Finland

Marja-Leena Rantalainen Faculty of Information Technology University of Jyväskylä Jyväskylä, Finland

ISSN 1871-3033 ISSN 2543-0203 (electronic)
Computational Methods in Applied Sciences
ISBN 978-3-031-29081-7 ISBN 978-3-031-29082-4 (eBook)

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland


The quote from Hamming, “The purpose of computing is insight, not numbers”, comes from the era of the transition from human ’computers’ to digital ones and the birth of computing that we now call scientific. To capture the original idea today, we may have to explicate the quote a bit—the purpose of [scientific] computing is [scientific] insight. This invites us to stop for a while to reflect on the themes of this treatise: Scientific Computing as a paradigm and influential tool in other sciences, the leading edges of scientific computing and the future visions toward the decades to come. Whose insight will we produce with scientific computing in the future—computational scientists, professionals of other fields, or perhaps even AI? In the past and even during the first decades of digital computing, computing was used solely by scientists and seasoned professionals who had a good insight into their subject matter and built the critical parts of their computational methods and tools themselves. Computing was, to a great extent, an art utilized to operationalize the current understanding and to provide hints to building new insight into the regimes that had been outside the reach of theoretical or experimental approaches. Later, when computers became widely available (for academia and R&D units) and gained computational power, new challenges emerged. The method development started to form independent research lines: numerical analysis (of various approximation techniques), numerical linear algebra, and numerical optimization. New methods were developed, analyzed rigorously, and provided as tools for the scientific community. Still, anyone wishing to construct a working solution would need some insight into the domain and proper knowledge about the methods and libraries. Gradually the paths of development and use of scientific computing methods have separated more and more. 
The methods of scientific computing have been encapsulated in versatile toolboxes and modeling tools that enable at least formally complete modeling of complex phenomena (setting up the problem, solving it, presenting the solution in various forms, automated optimization of design variables) without expert knowledge from the (often multi-physical) problem domain, not to mention the different computational approximations and methods that are utilized behind the scenes.




We have to remember that the infamous “Garbage in, garbage out” principle is still valid, and that the “in” includes more than just the visible input parameters to our problem. It contains all the modeling choices, all applied numerical methods, and their zones of validity. The challenge is that we can no longer assume that the user recognizes when the output is garbage, as the tools are available for (and used by) individuals with very diverse backgrounds.

One emerging solution to the above challenge has been the approach of reliable computing, which provides insight into the actual error in the final solution, whatever its source, assuming that the model used as the reference is valid. This approach opens at least two kinds of challenges. First, working and reliable a posteriori estimates are so far available only for a relatively limited set of problems. Second, the art of selecting valid models for a given problem setting requires a good understanding of the phenomena and a good command of various models and their ranges of validity.

This is the point where we had better stop to think about the relationship between scientific computing and artificial intelligence. Starting with a somewhat cynical view, the relationship may today resemble an arranged marriage: the research funding agencies seem to require that research topics and plans contain “and AI” to be funded. An easy way to satisfy this constraint is to use a well-elaborated computational model to train a black-box neural network or to replace an elaborate optimization algorithm with an evolutionary algorithm requiring more model evaluations. Scientific computing can, of course, also contribute to AI. The various learning techniques are based mainly on optimization and numerical linear algebra, and they pose challenges in model selection and model reduction: finding models that express the essentials with a reasonable number of parameters and a reasonable amount of learning data.
The models learned from data by selecting some general class of approximating functions and optimizing an error function may give phenomenal results in selected cases, but do they provide new insight, or do they merely use the existing insight? The approach where one forgets all the scientific knowledge from the last millennia and starts with a tabula rasa and data only sounds suboptimal. Indeed, it is straightforward and hence relatively easy to implement, but at the same time, it is bound to reinvent the wheel. We must be able to do better.

So, let us take another look at Hamming’s quote. I am tempted to expand it toward AI, to something like “The purpose of artificial scientific computing is artificial insight, not numbers”. Let me elaborate on this. Why “artificial scientific computing”? So far, we have been at the steering wheel. The scientists have formulated problems, proposed methods, done experiments, and validated results, to the level where we can provide, for many common problems and phenomena, a set of ready-made and tested tools that a layperson can use with relatively good confidence. However, for new types of problems (new combinations of phenomena, new kinds of setups), the layperson may be lost with the selection of models and methods and the validation of results. What is more, she or he is out there without the academic community and without the possibility to consult an experienced computational scientist when in need. So, can we devise an artificial expert (expert system) that knows about the models and methods and can both build feasible solutions and validate them on demand?

Or, to take this a bit further: one of the current catchwords around computational modeling is the “digital twin”, a computational model of some real-life entity to be used for design, planning, and decision-making about the real thing. To what extent can the creation of such twins be automated? Can we have a portfolio of twins and a method to select the most appropriate one for any given situation? In fact, we would also need such twins ourselves, as the future challenges of computational science often span several scales, and we need to communicate information between the models of different scales with an accuracy that goes beyond simple closed-form functional dependencies. The complete models on a shorter scale are too complicated and costly to be used billions of times for the needs of a long-scale simulation. So, we have to construct effective microscale models that are sufficiently valid in the operating window defined by the system on the macroscale. To what extent can we automate this using the tools from machine learning/AI, and can we keep control of the overall reliability and accuracy of the models?

Scientific computing has changed in the last decades, and it will continue changing in the future. What trends and developments can we identify and highlight in the collection at hand? The first section discusses the challenges in trying to match the theoretical/mathematical models, their numerical simulations, and the needs of the real world. Křížek points out that we have fundamental and well-known models that we are by far not able to simulate computationally in their full generality. Kettunen et al. envision a generic framework capable of unifying a wide variety of models and, importantly, their simulators. Kiiskinen augments that approach with contributions from automated proof engines and their application to scientific computing and numerical models.
Whereas the previous papers focus in some sense on the pure models and their systematic simulation, Mohr et al. focus on practical (literally down-to-earth) settings and reveal the need to model across the different scales to understand our environment. Bardos et al. open similar questions concerning atmospheric modeling and the challenges of considering radiative transfer in climate models. The last two papers of this section extend the discussion outwards from the numerical simulation of individual models. Ghafouri et al. elaborate on the challenges of linking the numerical simulation/design models to actual physical artifacts resulting from dynamic production processes. Kraus and Repin address the ultimate and absolute error bounds for the numerical approximation against the chosen ground truth.

Optimization and computer-supported design/decision-making are major use cases of computational models. At the same time, they are model cases of settings where a computer program, instead of a human expert, is the primary user of computational simulations. This means, among other things, that all aspects of decision-making, including the uncertainties and lacking information, should be included in the definition of the optimization tasks and communicated to the corresponding optimizer. These aspects are discussed by Banichuk and Ivanova, as well as by Emelichev et al., in different contexts. Haslinger and Mäkinen discuss optimization and control for non-smooth systems, whereas Tiba focuses on optimization with respect to the system geometry. Modeling and optimization are by no means limited to systems modeled by the laws of physics. Palokangas discusses taxation from the point of view of optimal control. Watanabe and Tou take an even broader look, modeling economies and the role of transformative innovations in steering the economy. Finally, Somer and Křížek consider the solvability of a Diophantine equation.

Data-centric computational modeling, or data science, poses its own challenges to practitioners. Having a lot of data and a universal approximator is only a part of the solution. How should one select the right class and structure of the approximator, train the model efficiently, and arrange ample amounts of training data? Nayyeri et al. approach the problem of selecting the network structure by minimizing the error entropy. Kärkkäinen, on the other hand, elaborates machine learning from the point of view of optimization. Pölönen discusses the computational challenges in very data-intensive spectral imaging. Rahkonen and Pölönen continue with a spectral imaging case where numerical simulation models can be used to train neural networks to handle measurement data, reducing the costs of expensive labeling of actual measurements. Finally, Averbuch and Zheludev elaborate on the recent advances in exploiting wavelets in the processing of high-dimensional images.

The last section contains a perspective on the past and probably also on the future of the tools and applications of scientific computing. Koski et al. give an account of the history of supercomputing and its development in Finland. Möller discusses how different types of computational problems may benefit from the emerging quantum computers and their capabilities, and Heimonen et al. open the questions related to the practical development of quantum computers. Finally, two papers discuss the role of scientific computing in the development of the aviation industry.
Knoerzer takes the standpoint of international research collaboration in developing greener aviation while Periaux and Tuovinen summarize how the developments in multi-disciplinary optimization and computational fluid mechanics have pushed forward the boundaries of industrial use of scientific computing. Jyväskylä, Finland June 2022

Timo Tiihonen


Scientific computing is the third paradigm of science, complementing the traditional experimental and theoretical approaches. Its importance in society is growing because of digitalization and the Fourth Industrial Revolution (Industry 4.0) as well as new computing resources. Almost 30 leading experts from academia and industry analyze the impact of scientific computing on science and society over the coming decades. Advanced methods provide new possibilities to solve scientific problems and study important phenomena in society. The book is intended for researchers. It covers the following topics: Mathematical and Numerical Modeling, Optimization and Control, AI and Applications in Health Care, and Trends in Scientific Computing. We would like to thank the authors for their contributions and Springer Science Business Media for their flexible collaboration. Jyväskylä, Finland June 2022

Pekka Neittaanmäki Marja-Leena Rantalainen



Mathematical and Numerical Modeling

On Extreme Computational Complexity of the Einstein Equations
Michal Křížek

Systematic Imposition of Partial Differential Equations in Boundary Value Problems
Lauri Kettunen, Sanna Mönkölä, Jouni Parkkonen, and Tuomo Rossi

Curiously Empty Intersection of Proof Engineering and Computational Sciences
Sampsa Kiiskinen

Challenges for Mantle Convection Simulations at the Exa-Scale: Numerics, Algorithmics and Software
Marcus Mohr, Ulrich Rüde, Barbara Wohlmuth, and Hans-Peter Bunge

Remarks on the Radiative Transfer Equations for Climatology
Claude Bardos, François Golse, and Olivier Pironneau

Simulation and Its Use in Additive Manufacturing
Mehran Ghafouri, Mohsen Amraei, Aditya Gopaluni, Heidi Piili, Timo Björk, and Jari Hämäläinen

A Posteriori Error Estimates for Domain Decomposition Methods
Johannes Kraus and Sergey Repin

Optimization and Control

Multi-criteria Problems and Incompleteness of Information in Structural Optimization
Nikolay Banichuk and Svetlana Ivanova

Stability in Discrete Games with Perturbed Payoffs
Vladimir A. Emelichev, Marko M. Mäkelä, and Yury V. Nikulin

Optimal Control Problems in Nonsmooth Solid and Fluid Mechanics: Computational Aspects
Jaroslav Haslinger and Raino A. E. Mäkinen

Optimal Control Approaches in Shape Optimization
Dan Tiba

Optimal Factor Taxation with Credibility
Tapio Palokangas

Transformative Direction of R&D into Neo Open Innovation
Chihiro Watanabe and Yuji Tou

On Tetrahedral and Square Pyramidal Numbers
Lawrence Somer and Michal Křížek

AI and Applications in Health Care

Randomized Constructive Neural Network Based on Regularized Minimum Error Entropy
Mojtaba Nayyeri, Hadi Sadoghi Yazdi, Modjtaba Rouhani, Alaleh Maskooki, and Marko M. Mäkelä

On the Role of Taylor’s Formula in Machine Learning
Tommi Kärkkäinen

Computational Methods in Spectral Imaging
Ilkka Pölönen

Method for Radiance Approximation of Hyperspectral Data Using Deep Neural Network
Samuli Rahkonen and Ilkka Pölönen

Directional Wavelet Packets for Image Processing
Amir Averbuch and Valery Zheludev

Trends in Scientific Computing

Fifty Years of High-Performance Computing in Finland
Kimmo Koski, Pekka Manninen, and Tommi Nyrönen

Quantum Scientific Computing
Matthias Möller

Quantum Computing at IQM
Hermanni Heimonen, Adrian Auer, Ville Bergholm, Inés de Vega, and Mikko Möttönen

Contribution of Scientific Computing in European Research and Innovation for Greening Aviation
Dietrich Knoerzer

Thirty Years of Progress in Single/Multi-disciplinary Design Optimization with Evolutionary Algorithms and Game Strategies in Aeronautics and Civil Engineering
Jacques Periaux and Tero Tuovinen


Mohsen Amraei: Lappeenranta-Lahti University of Technology, Lappeenranta, Finland; University of Turku, Turku, Finland
Adrian Auer: IQM, Munich, Germany
Amir Averbuch: School of Computer Science, Tel Aviv University, Tel Aviv, Israel
Nikolay Banichuk: Ishlinsky Institute for Problems in Mechanics RAS, Moscow, Russia
Claude Bardos: Mathematics Department, University of Paris Denis Diderot, Paris, France
Ville Bergholm: IQM, Espoo, Finland
Timo Björk: Lappeenranta-Lahti University of Technology, Lappeenranta, Finland
Hans-Peter Bunge: Department of Earth and Environmental Sciences, Ludwig-Maximilians-Universität München, München, Germany
Inés de Vega: IQM, Munich, Germany
Vladimir A. Emelichev: Belarusian State University, Minsk, Belarus
Mehran Ghafouri: Lappeenranta-Lahti University of Technology, Lappeenranta, Finland
François Golse: Ecole Polytechnique, Palaiseau, France
Aditya Gopaluni: Lappeenranta-Lahti University of Technology, Lappeenranta, Finland
Jari Hämäläinen: Lappeenranta-Lahti University of Technology, Lappeenranta, Finland
Jaroslav Haslinger: Department of Mathematical Analysis and Applications of Mathematics, Palacký University Olomouc, Olomouc, Czech Republic
Hermanni Heimonen: IQM, Espoo, Finland
Svetlana Ivanova: Ishlinsky Institute for Problems in Mechanics RAS, Moscow, Russia
Tommi Kärkkäinen: Faculty of Information Technology, University of Jyväskylä, Jyväskylä, Finland
Lauri Kettunen: Faculty of Information Technology, University of Jyväskylä, Jyväskylä, Finland
Sampsa Kiiskinen: Faculty of Information Technology, University of Jyväskylä, Jyväskylä, Finland
Dietrich Knoerzer: Independent Aeronautics Consultant, Brussels, Belgium
Kimmo Koski: CSC – IT Center for Science, Espoo, Finland
Johannes Kraus: Faculty of Mathematics, University of Duisburg-Essen, Essen, Germany
Michal Křížek: Institute of Mathematics, Czech Academy of Sciences, Prague, Czech Republic
Marko M. Mäkelä: University of Turku, Turku, Finland
Raino A. E. Mäkinen: Faculty of Information Technology, University of Jyväskylä, Jyväskylä, Finland
Alaleh Maskooki: University of Turku, Turku, Finland
Pekka Manninen: CSC – IT Center for Science, Espoo, Finland
Marcus Mohr: Department of Earth and Environmental Sciences, Ludwig-Maximilians-Universität München, München, Germany
Matthias Möller: Delft University of Technology, Delft Institute of Applied Mathematics, Delft, The Netherlands
Sanna Mönkölä: Faculty of Information Technology, University of Jyväskylä, Jyväskylä, Finland
Mikko Möttönen: IQM, Espoo, Finland
Mojtaba Nayyeri: University of Bonn, Bonn, Germany
Yury V. Nikulin: University of Turku, Turku, Finland
Tommi Nyrönen: CSC – IT Center for Science, Espoo, Finland
Tapio Palokangas: Helsinki Graduate School of Economics, University of Helsinki, Helsinki, Finland
Jouni Parkkonen: Department of Mathematics and Statistics, University of Jyväskylä, Jyväskylä, Finland
Jacques Periaux: CIMNE (Centre Internacional de Mètodes Numèrics a l’Enginyeria), Campus Nord UPC, Barcelona, Spain
Heidi Piili: Lappeenranta-Lahti University of Technology, Lappeenranta, Finland; University of Turku, Turku, Finland
Olivier Pironneau: Applied Mathematics, Jacques-Louis Lions Lab, Sorbonne Université, Paris cedex 5, France
Ilkka Pölönen: Faculty of Information Technology, University of Jyväskylä, Jyväskylä, Finland
Samuli Rahkonen: Faculty of Information Technology, University of Jyväskylä, Jyväskylä, Finland
Sergey Repin: St. Petersburg Department of V.A. Steklov Institute of Mathematics of Russian Academy of Sciences, St. Petersburg, Russia
Tuomo Rossi: Faculty of Information Technology, University of Jyväskylä, Jyväskylä, Finland
Modjtaba Rouhani: Ferdowsi University of Mashhad, Mashhad, Iran
Ulrich Rüde: Department of Computer Science 10, Friedrich-Alexander-University Erlangen-Nuremberg, Erlangen, Germany; CERFACS, Toulouse, France
Lawrence Somer: Department of Mathematics, Catholic University of America, Washington, D.C., USA
Dan Tiba: Simion Stoilow Institute of Mathematics of Romanian Academy, Bucharest, Romania
Yuji Tou: Department of Industrial Engineering and Management, Tokyo Institute of Technology, Tokyo, Japan
Tero Tuovinen: Jyväskylä University of Applied Sciences, Jyväskylä, Finland
Chihiro Watanabe: Faculty of Information Technology, University of Jyväskylä, Jyväskylä, Finland; International Institute for Applied Systems Analysis, Laxenburg, Austria
Barbara Wohlmuth: Institute for Numerical Mathematics, Technische Universität München, München, Germany
Hadi Sadoghi Yazdi: Ferdowsi University of Mashhad, Mashhad, Iran
Valery Zheludev: School of Computer Science, Tel Aviv University, Tel Aviv, Israel

Mathematical and Numerical Modeling

On Extreme Computational Complexity of the Einstein Equations

Michal Křížek

Abstract We show how to express explicitly the first of the 10 Einstein partial differential equations in order to demonstrate their extreme complexity. Consequently, it is very difficult to use them, for example, to model the evolution of the Solar system realistically, since no analytical solution is known even for two massive bodies. Significant computational problems associated with their numerical solution are illustrated as well. Thus, we cannot verify whether the Einstein equations describe the motion of two or more bodies sufficiently accurately.

Keywords Einstein equations · Schwarzschild solution · Finite difference method · n-body simulations

Mathematics subject classification: 65M06 · 35L70

1 Historical Facts of Importance

Karl Schwarzschild was probably the first scientist to realize that our universe at any fixed time might be non-Euclidean and that it can be modeled by the three-dimensional hypersphere $\mathbb{S}^3$ or the three-dimensional hyperbolic pseudosphere $\mathbb{H}^3$; see his paper Über das zulässige Krümmungsmaaß des Raumes, Schwarzschild (1900). In 1915, he became famous, since he calculated the first nontrivial solution of the Einstein vacuum equations (see (6) below). On November 18, 1915, Albert Einstein submitted his famous paper Einstein (1915b) about Mercury’s perihelion shift. Here the gravitational field is described by the following equations using the present standard notation and the Einstein summation convention:

M. Křížek (✉) Institute of Mathematics, Czech Academy of Sciences, Žitná 25, 115 67, Prague 1, Czech Republic e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 P. Neittaanmäki and M.-L. Rantalainen (eds.), Impact of Scientific Computing on Science and Society, Computational Methods in Applied Sciences 58,



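$$\Gamma^{\kappa}_{\mu\nu,\kappa} - \Gamma^{\lambda}_{\mu\kappa}\Gamma^{\kappa}_{\lambda\nu} = 0, \qquad \kappa, \lambda, \mu, \nu = 0, 1, 2, 3, \tag{1}$$

where

$$\Gamma^{\kappa}_{\mu\nu} := \frac{1}{2}\, g^{\kappa\lambda}\,(g_{\nu\lambda,\mu} + g_{\lambda\mu,\nu} - g_{\mu\nu,\lambda}) \tag{2}$$
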

are the Christoffel symbols of the second kind (sometimes also called the connection coefficients), $g_{\mu\nu} = g_{\mu\nu}(x^0, x^1, x^2, x^3)$ are components of the unknown $4 \times 4$ twice differentiable symmetric metric tensor of the spacetime of one time variable $x^0$ and three space variables $x^1, x^2, x^3$, and

$$\det(g_{\mu\nu}) = -1. \tag{3}$$


Furthermore, $g^{\mu\nu}$ is the $4 \times 4$ symmetric inverse tensor to $g_{\mu\nu}$. For brevity, the first classical derivatives of a function $f = f(x^0, x^1, x^2, x^3)$ are denoted as $f_{,\kappa} := \partial f/\partial x^{\kappa}$. For simplicity, the dependence of all functions on the spacetime coordinates will often not be indicated. Note that the infinitesimally small spacetime distance $ds$ is usually expressed by physicists as follows: $ds^2 = g_{\mu\nu}\,dx^{\mu}dx^{\nu}$.

On November 20, 1915, David Hilbert submitted the paper Hilbert (1915), which was published on March 31, 1916. He did not require the validity of the restrictive algebraic constraint (3). Using a variational principle, he derived the following complete equations for the gravitational field (see Fig. 1):

$$R_{\mu\nu} := \Gamma^{\kappa}_{\mu\nu,\kappa}\ \underline{\underline{{}- \Gamma^{\kappa}_{\mu\kappa,\nu} + \Gamma^{\lambda}_{\mu\nu}\Gamma^{\kappa}_{\lambda\kappa}}}\ - \Gamma^{\lambda}_{\mu\kappa}\Gamma^{\kappa}_{\lambda\nu} = 0. \tag{4}$$


The doubly underlined terms do not appear in (1). The number of the Christoffel symbols in (1) is 10(4 + 4 × 4) = 200. From (2) we observe that (4) is a system of nonlinear second-order partial differential equations. Its left-hand side Rμν is at present called the Ricci tensor and the equations Rμν = 0 are called the Einstein vacuum equations.

Fig. 1 Hilbert’s original paper Hilbert (1915, p. 402)



By Sauer (1999, p. 569), Hilbert’s knowledge and understanding of the calculus of variations and of invariant theory readily put him in a position to fully grasp the Einstein gravitational theory. Hilbert’s original paper (see Fig. 1) contains the Ricci tensor $R_{\mu\nu}$ from (4) denoted by $K_{\mu\nu}$ ($K$ is the Ricci scalar) and the Christoffel symbol $\Gamma^{\kappa}_{\mu\nu}$ denoted by $-\begin{Bmatrix}\mu\nu\\ \kappa\end{Bmatrix}$. Hilbert’s paper was submitted for publication five days earlier than Einstein’s note Einstein (1915a).

On November 25, 1915, Albert Einstein submitted a three and a half page note Einstein (1915a) which contains the same expression of the Ricci tensor as in (4). The way in which the Einstein equations (see (16) below) were derived is not described there. Einstein’s short note Einstein (1915a) was published already on December 2, 1915, i.e., within one week. One can easily see that the classical Minkowski metric

$$g_{\mu\nu} = \mathrm{diag}(-1, 1, 1, 1), \tag{5}$$


where all non-diagonal entries are zeros, is a trivial solution to the Einstein vacuum equations (4), since all the Christoffel symbols (2) are zeros.

On January 13, 1916, Karl Schwarzschild submitted the paper Schwarzschild (1916b) containing the first nontrivial static spherically symmetric solution of the Einstein vacuum equations. At present his solution is usually written as follows:

$$g_{\mu\nu} = \mathrm{diag}\left(-\frac{r-S}{r},\ \frac{r}{r-S},\ r^2,\ r^2\sin^2\theta\right), \tag{6}$$

where all non-diagonal entries are zeros, $S \ge 0$ is a fixed real constant, and the standard spherical coordinates $x^1 = r\cos\varphi\cos\theta$, $x^2 = r\sin\varphi\cos\theta$, $x^3 = r\sin\theta$ are employed, $r > S$, $\varphi \in [0, 2\pi)$, and $\theta \in [0, \pi]$. The metric tensor (6) is called the exterior Schwarzschild solution of the Einstein equations (4). We see that for $S = 0$ the formula (6) reduces to the Minkowski metric in the spherical coordinates. Note that already on December 22, 1915, Karl Schwarzschild wrote a letter to Einstein announcing that he was reading Einstein’s paper on Mercury and had found a solution to the field equations (see Vankov 2011 for this letter). This solution is very similar to (6).

Finally, on March 20, 1916, Albert Einstein submitted his fundamental work Einstein (1916), where the general theory of relativity was established. One year later, Einstein (1952) added the term $\Lambda g_{\mu\nu}$ to the right-hand side of his field equations (see (25) below) and Willem de Sitter (1917) found their vacuum solution which describes the behavior of the expansion function of the entire universe.

The outline of this paper is as follows. In Sect. 2, we prove that the Schwarzschild solution (6) satisfies (4), but does not satisfy (1). In Sect. 3, we introduce the Einstein equations and in Sect. 4, we show that they have an extremely complicated explicit expression. In Sect. 5, we investigate their enormous computational complexity when they are solved numerically. Finally, in Sect. 6, we present some concluding remarks.

2 Exterior and Interior Schwarzschild Solution

Classical relativistic tests McCausland (1999), Misner et al. (1997), Will (2014) (such as bending of light, Mercury’s perihelion shift, gravitational redshift, Shapiro’s fourth test of general relativity) are based on verification of very simple algebraic formulae derived by various simplifications and approximations of the Schwarzschild solution (6), which is very special and corresponds only to one spherically symmetric nonrotating body. Therefore, in this section we present several theorems on the Schwarzschild solution.

Einstein and also Schwarzschild assumed that the gravitational field has the following properties:
1. It is static, i.e., all components $g_{\mu\nu}$ are independent of the time variable $x^0$.
2. It is spherically symmetric with respect to the coordinate origin, i.e., the same solutions will be obtained after a linear orthogonal transformation.
3. The equations $g_{\rho 0} = g_{0\rho} = 0$ hold for any $\rho = 1, 2, 3$.
4. The spacetime is asymptotically flat, i.e., the metric tensor $g_{\mu\nu}$ tends to the Minkowski metric at infinity.

Theorem 1 The exterior Schwarzschild solution (6) with $S > 0$ satisfies Eqs. (4), but does not satisfy (1).

Proof We will proceed in three steps:

1. At first, we recall the following definition of the Christoffel symbols of the first kind:

$$\Gamma_{\lambda\mu\nu} := \frac{1}{2}\,(g_{\lambda\mu,\nu} + g_{\nu\lambda,\mu} - g_{\mu\nu,\lambda}). \tag{7}$$

From this and the relation $g_{\mu\nu} = g_{\nu\mu}$ we find that the Christoffel symbols are, in general, symmetric with respect to the second and third subscript:

$$\Gamma_{\lambda\mu\nu} = \frac{1}{2}\,(g_{\lambda\nu,\mu} + g_{\mu\lambda,\nu} - g_{\nu\mu,\lambda}) = \Gamma_{\lambda\nu\mu}. \tag{8}$$

Using (7) and (6), we obtain

$$\begin{aligned}
\Gamma_{001} &= \tfrac{1}{2}\, g_{00,1} = -\frac{r-(r-S)}{2r^2} = -\frac{S}{2r^2}, \\
\Gamma_{100} &= -\tfrac{1}{2}\, g_{00,1} = \frac{S}{2r^2}, \\
\Gamma_{111} &= \tfrac{1}{2}\, g_{11,1} = -\frac{S}{2(r-S)^2}, \\
\Gamma_{122} &= -\tfrac{1}{2}\, g_{22,1} = -r, \\
\Gamma_{133} &= -\tfrac{1}{2}\, g_{33,1} = -r\sin^2\theta, \\
\Gamma_{221} &= \tfrac{1}{2}\, g_{22,1} = r, \\
\Gamma_{233} &= -\tfrac{1}{2}\, g_{33,2} = -r^2\sin\theta\cos\theta, \\
\Gamma_{331} &= \tfrac{1}{2}\, g_{33,1} = r\sin^2\theta, \\
\Gamma_{332} &= \tfrac{1}{2}\, g_{33,2} = r^2\sin\theta\cos\theta
\end{aligned}$$

and the other components $\Gamma_{\lambda\mu\nu} = \Gamma_{\lambda\nu\mu}$ are zeros.

2. Similarly to (8) we can find that the Christoffel symbols of the second kind are symmetric with respect to the second and third (lower) indices, and by (2) and (7) we have $\Gamma^{\kappa}_{\mu\nu} = g^{\kappa\lambda}\Gamma_{\lambda\mu\nu}$. Since the metric tensor (6) is diagonal, its inverse reads

$$g^{\mu\nu} = \mathrm{diag}\left(-\frac{r}{r-S},\ \frac{r-S}{r},\ \frac{1}{r^2},\ \frac{1}{r^2\sin^2\theta}\right) \tag{9}$$

and all non-diagonal entries are zeros. From this and Step 1 we get

$$\begin{aligned}
\Gamma^{0}_{01} &= g^{00}\Gamma_{001} = \frac{S}{2r(r-S)}, \\
\Gamma^{1}_{00} &= g^{11}\Gamma_{100} = \frac{S(r-S)}{2r^3}, \\
\Gamma^{1}_{11} &= g^{11}\Gamma_{111} = -\frac{S}{2r(r-S)}, \\
\Gamma^{1}_{22} &= g^{11}\Gamma_{122} = -(r-S), \\
\Gamma^{1}_{33} &= g^{11}\Gamma_{133} = -(r-S)\sin^2\theta, \\
\Gamma^{2}_{12} &= g^{22}\Gamma_{221} = \frac{1}{r^2}\, r = \frac{1}{r}, \\
\Gamma^{2}_{33} &= g^{22}\Gamma_{233} = -\sin\theta\cos\theta, \\
\Gamma^{3}_{31} &= g^{33}\Gamma_{331} = \frac{1}{r^2\sin^2\theta}\, r\sin^2\theta = \frac{1}{r}, \\
\Gamma^{3}_{32} &= g^{33}\Gamma_{332} = \frac{1}{r^2\sin^2\theta}\, r^2\sin\theta\cos\theta = \operatorname{cotan}\theta
\end{aligned}$$

and the other components $\Gamma^{\kappa}_{\mu\nu} = \Gamma^{\kappa}_{\nu\mu}$ are zeros.

3. Finally, we will evaluate all entries of the Ricci tensor defined in (4). In particular, by (9) and Step 2 we obtain



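$$\begin{aligned}
R_{00} &= \Gamma^{\kappa}_{00,\kappa} - \Gamma^{\kappa}_{0\kappa,0} + \Gamma^{\lambda}_{00}\Gamma^{\kappa}_{\lambda\kappa} - \Gamma^{\lambda}_{0\kappa}\Gamma^{\kappa}_{\lambda 0}
= \Gamma^{1}_{00,1} + \underline{\underline{\Gamma^{1}_{00}(\Gamma^{2}_{12} + \Gamma^{3}_{13})}} - 2\,\Gamma^{1}_{00}\Gamma^{0}_{01} \\
&= \frac{S\,2r^3 - S(r-S)\,6r^2}{4r^6} + \underline{\underline{\frac{S(r-S)}{2r^3}\left(\frac{1}{r} + \frac{1}{r}\right)}} - 2\,\frac{S(r-S)}{2r^3}\,\frac{S}{2r(r-S)} \\
&= \frac{rS}{2r^4} - \frac{3S(r-S)}{2r^4} + \underline{\underline{\frac{S(r-S)}{2r^3}\,\frac{2}{r}}} - \frac{S^2}{2r^4} = 0,
\end{aligned}$$

where the doubly underlined terms correspond to the doubly underlined terms in (4). From this it is obvious that the Schwarzschild solution (6) does not satisfy (1), since the doubly underlined contribution $S(r-S)/r^4 \neq 0$ for $S > 0$. (Note that (3) is not valid.) Similarly, we find that

$$\begin{aligned}
R_{11} &= \Gamma^{\kappa}_{11,\kappa} - \Gamma^{\kappa}_{1\kappa,1} + \Gamma^{\lambda}_{11}\Gamma^{\kappa}_{\lambda\kappa} - \Gamma^{\lambda}_{1\kappa}\Gamma^{\kappa}_{\lambda 1} \\
&= \Gamma^{1}_{11,1} - \Gamma^{0}_{10,1} - \Gamma^{1}_{11,1} - \Gamma^{2}_{12,1} - \Gamma^{3}_{13,1} + \Gamma^{1}_{11}(\Gamma^{0}_{01} + \Gamma^{2}_{12} + \Gamma^{3}_{13}) - \Gamma^{0}_{01}\Gamma^{0}_{01} - \Gamma^{2}_{12}\Gamma^{2}_{12} - \Gamma^{3}_{13}\Gamma^{3}_{13} \\
&= \frac{S(2r-S)}{2r^2(r-S)^2} + \frac{2}{r^2} - \frac{S}{2r(r-S)}\left(\frac{S}{2r(r-S)} + \frac{2}{r}\right) - \frac{S^2}{4r^2(r-S)^2} - \frac{2}{r^2} = 0,
\end{aligned}$$

$$\begin{aligned}
R_{22} &= \Gamma^{\kappa}_{22,\kappa} - \Gamma^{\kappa}_{2\kappa,2} + \Gamma^{\lambda}_{22}\Gamma^{\kappa}_{\lambda\kappa} - \Gamma^{\lambda}_{2\kappa}\Gamma^{\kappa}_{\lambda 2} \\
&= \Gamma^{1}_{22,1} - \Gamma^{3}_{23,2} + \Gamma^{1}_{22}(\Gamma^{2}_{12} + \Gamma^{3}_{13}) - 2\,\Gamma^{2}_{21}\Gamma^{1}_{22} - \Gamma^{3}_{23}\Gamma^{3}_{23} \\
&= -1 + \frac{1}{\sin^2\theta} - (r-S)\,\frac{2}{r} + \frac{2}{r}\,(r-S) - \operatorname{cotan}^2\theta = 0,
\end{aligned}$$

$$\begin{aligned}
R_{33} &= \Gamma^{\kappa}_{33,\kappa} - \Gamma^{\kappa}_{3\kappa,3} + \Gamma^{\lambda}_{33}\Gamma^{\kappa}_{\lambda\kappa} - \Gamma^{\lambda}_{3\kappa}\Gamma^{\kappa}_{\lambda 3} \\
&= \Gamma^{1}_{33,1} + \Gamma^{2}_{33,2} + \Gamma^{1}_{33}(\Gamma^{2}_{12} + \Gamma^{3}_{13}) + \Gamma^{2}_{33}\Gamma^{3}_{23} - 2\,\Gamma^{3}_{31}\Gamma^{1}_{33} - 2\,\Gamma^{3}_{32}\Gamma^{2}_{33} \\
&= -\sin^2\theta + (\sin^2\theta - \cos^2\theta) - \frac{2}{r}(r-S)\sin^2\theta - \cos^2\theta + \frac{2}{r}(r-S)\sin^2\theta + 2\cos^2\theta = 0,
\end{aligned}$$

and the non-diagonal entries of $R_{\mu\nu}$ are also zeros. Therefore, (4) is satisfied. □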
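The cancellation in $R_{00}$ can be checked with exact rational arithmetic. The following sketch is an illustration only, not part of the original proof: it assumes the closed forms of $\Gamma^{1}_{00}$, $\Gamma^{0}_{01}$, $\Gamma^{2}_{12}$, $\Gamma^{3}_{13}$ derived in Step 2, and the helper name `r00_contributions` is hypothetical. It separates the terms kept in (1) from the doubly underlined contribution present only in (4).

```python
from fractions import Fraction


def r00_contributions(r, S):
    """Split R_00 of the exterior Schwarzschild metric into the part kept in (1)
    and the doubly underlined part that appears only in (4)."""
    r, S = Fraction(r), Fraction(S)
    gamma1_00 = S * (r - S) / (2 * r**3)
    gamma0_01 = S / (2 * r * (r - S))
    gamma2_12 = gamma3_13 = 1 / r
    # Derivative of Gamma^1_00 = S/(2 r^2) - S^2/(2 r^3) with respect to r, done by hand.
    d_gamma1_00 = S * (3 * S - 2 * r) / (2 * r**4)
    kept_in_1 = d_gamma1_00 - 2 * gamma1_00 * gamma0_01
    underlined = gamma1_00 * (gamma2_12 + gamma3_13)
    return kept_in_1, underlined


kept, underlined = r00_contributions(3, 1)
print(kept, underlined, kept + underlined)  # -2/81 2/81 0
```

The sum vanishes, as (4) requires, while the part that survives in (1) equals $-S(r-S)/r^4$ and is nonzero for $S > 0$.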

Theorem 1 nicely demonstrates Schwarzschild’s ingenuity in finding a nontrivial solution to a very complicated system of partial differential equations (4) for an arbitrary constant $S > 0$ (see also Křížek 2019a, Appendix). For a spherically symmetric object of mass $M > 0$, the constant

$$S = \frac{2MG}{c^2} \tag{10}$$




Fig. 2 Spherical shell between two concentric spheres

is called its Schwarzschild gravitational radius, where $G$ is the gravitational constant and $c$ is the speed of light in a vacuum.

Denote by $S_{\mu\nu} = -\Gamma^{\kappa}_{\mu\kappa,\nu} + \Gamma^{\lambda}_{\mu\nu}\Gamma^{\kappa}_{\lambda\kappa}$ the doubly underlined terms in (4). By Einstein (1915a), one additional algebraic condition (3) surprisingly guarantees that the 10 differential operators $S_{\mu\nu} = S_{\nu\mu}$ simultaneously vanish when (4) is valid. This implies (1).

For positive numbers $r_0 < r_1$ consider the spherical shell $\{(x^1, x^2, x^3) \in \mathbb{E}^3 \mid r_0^2 \le (x^1)^2 + (x^2)^2 + (x^3)^2 \le r_1^2\}$ with interior radius $r_0$ and exterior radius $r_1$ (see Fig. 2). It is a region between two concentric spheres. Its volume in the Euclidean space $\mathbb{E}^3$ around the mass $M$ is equal to

$$V = \frac{4}{3}\pi\left(r_1^3 - r_0^3\right).$$

However, the spacetime around the mass $M$ is curved. Therefore, we have to consider the proper (relativistic) volume defined as

$$\tilde{V} := \int_{r_0}^{r_1} r^2\sqrt{\frac{r}{r-S}}\,dr \times \int_0^{\pi}\!\!\int_0^{2\pi} \sin\theta\,d\varphi\,d\theta = 4\pi\int_{r_0}^{r_1} r^2\sqrt{\frac{r}{r-S}}\,dr.$$

In Křížek and Křížek (2018), we prove the following astonishing theorem.

Theorem 2 If $M > 0$ and $r_0 > S$ are any fixed numbers satisfying (10), then $\tilde{V} - V \to \infty$ as $r_1 \to \infty$.

We observe that the difference of volumes $\tilde{V} - V$ grows without bound as $r_1 \to \infty$, which is a quite surprising property. Namely, Theorem 2 can be applied, for instance, to a small imperceptible pinhead, since the fixed mass $M > 0$ can be arbitrarily small. Consequently, a natural question arises whether the exterior Schwarzschild solution (6) approximates reality well.
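The growth stated in Theorem 2 can be illustrated numerically. The sketch below is not taken from the cited paper: it assumes the proper volume element $r^2\sqrt{r/(r-S)}\,\sin\theta$ implied by the metric (6), so that $\tilde{V} - V = 4\pi\int_{r_0}^{r_1} r^2\big(\sqrt{r/(r-S)} - 1\big)\,dr$, and it uses a plain midpoint rule. The function name and the sample values $S = 1$, $r_0 = 2$ are hypothetical.

```python
import math


def volume_difference(r0, r1, S, steps=20000):
    """Midpoint-rule estimate of V~ - V = 4*pi * int_{r0}^{r1} r^2 (sqrt(r/(r-S)) - 1) dr."""
    h = (r1 - r0) / steps
    total = 0.0
    for i in range(steps):
        r = r0 + (i + 0.5) * h  # midpoint of the i-th subinterval
        total += r * r * (math.sqrt(r / (r - S)) - 1.0)
    return 4.0 * math.pi * total * h


for r1 in (10.0, 100.0, 1000.0):
    print(r1, volume_difference(2.0, r1, 1.0))
```

Since the integrand behaves like $Sr/2$ for large $r$, the difference grows roughly like $\pi S r_1^2$ under these assumptions, whatever the (positive) value of $S$.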



In 1916, Karl Schwarzschild (1916a) found the first nonvacuum solution of the Einstein equations, cf. (12) below. He assumed that the ball with coordinate radius $r_0 > 0$ is formed by an ideal incompressible nonrotating fluid with constant density to avoid a possible internal mechanical stress in the solid that may have a nonnegligible influence on the resulting gravitational field. Then by Ellis (2012) (see also Florides 1974, p. 529; Stephani 2004, p. 213; Interior 2020) the corresponding metric is given by

$$g_{\mu\nu} = \mathrm{diag}\left(-\frac{1}{4}\left(3\sqrt{1 - \frac{S}{r_0}} - \sqrt{1 - \frac{Sr^2}{r_0^3}}\right)^{2},\ \frac{r_0^3}{r_0^3 - Sr^2},\ r^2,\ r^2\sin^2\theta\right), \tag{11}$$

where $r \in [0, r_0]$. The metric tensor (11) is called the interior Schwarzschild solution, see Stephani (2004, p. 213). It is again a static solution, meaning that it is independent of time. To avoid division by zero in the component $g_{11}$, we require

$$\frac{r_0^3}{r_0^3 - Sr^2} = \left(1 - \frac{Sr^2}{r_0^3}\right)^{-1} > 0 \quad \text{for all } r \in [0, r_0],$$

which leads to the inequality $r_0 > S$. Hence, we can define the composite Schwarzschild metric $g_{\mu\nu}$ by (6) for $r > r_0$ and by (11) for $r \in [0, r_0]$. It is easy to check that $g_{\mu\nu}$ is continuous everywhere.

Theorem 3 The composite Schwarzschild metric $g_{\mu\nu}$ is not differentiable for $r = r_0$.

Proof We will show that the component $g_{11}$ is not differentiable. From (6) and (11) we observe that the first classical derivative does not exist at $r = r_0$ (see Fig. 3). Namely, the component $g_{11}(r)$ of the interior solution (11) is an increasing function on $[0, r_0]$, whereas from (6) we see that the one-sided limit of the first derivative of the component $g_{11}(r) = r/(r - S)$ of the exterior solution is negative:

$$\lim_{r \to r_0^{+}} \frac{\partial g_{11}}{\partial r}(r) < 0.$$

All Riemannian spacetime manifolds have to be locally flat, which is not true in this case. □

The piecewise rational function $g_{11}$ cannot be smoothed near $r_0$, since then the Einstein equations would not be valid in a close neighborhood of $r_0$. Thus we observe that the composite Schwarzschild solution (6)+(11) is not a global solution of an idealized spherically symmetric star and its close neighborhood (see Fig. 2).
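The kink of $g_{11}$ at $r = r_0$ can be made concrete with exact fractions. The following sketch is only an illustration under the stated formulas: it takes $g_{11} = r/(r-S)$ outside and $g_{11} = r_0^3/(r_0^3 - Sr^2)$ inside, differentiates both by hand, and compares the values and slopes at $r_0$. The function names and the sample values $S = 1$, $r_0 = 3$ are hypothetical.

```python
from fractions import Fraction


def g11_exterior(r, S):
    return r / (r - S)


def g11_interior(r, S, r0):
    return r0**3 / (r0**3 - S * r**2)


def dg11_exterior(r, S):
    # d/dr [r / (r - S)] = -S / (r - S)^2
    return -S / (r - S) ** 2


def dg11_interior(r, S, r0):
    # d/dr [r0^3 / (r0^3 - S r^2)] = 2 S r r0^3 / (r0^3 - S r^2)^2
    return 2 * S * r * r0**3 / (r0**3 - S * r**2) ** 2


S, r0 = Fraction(1), Fraction(3)
print(g11_interior(r0, S, r0), g11_exterior(r0, S))    # 3/2 3/2
print(dg11_interior(r0, S, r0), dg11_exterior(r0, S))  # 1/2 -1/4
```

Both branches agree at $r_0$ (continuity), but the left slope $2S/(r_0-S)^2$ is positive and the right slope $-S/(r_0-S)^2$ is negative, in line with Theorem 3.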



3 Einstein Equations of General Relativity

The Einstein field equations consist of 10 equations (cf. Einstein 1916) for 10 components of the unknown twice differentiable symmetric metric tensor $g_{\mu\nu}$:

$$R_{\mu\nu} - \frac{1}{2} R\, g_{\mu\nu} = \frac{8\pi G}{c^4}\, T_{\mu\nu}, \qquad \mu, \nu = 0, 1, 2, 3, \tag{12}$$

where $R_{\mu\nu}$ is the symmetric Ricci tensor defined by (4), the contraction

$$R = g^{\mu\nu} R_{\mu\nu} \tag{13}$$

is the Ricci scalar (i.e., the scalar curvature), and $T_{\mu\nu}$ is the symmetric tensor of density of energy and momentum. Let us emphasize that the 10 Einstein equations (12) are not independent, since the covariant divergence of the right-hand side is supposed to be zero (see, e.g., Misner et al. 1997, p. 146), i.e.,

$$T^{\mu\nu}{}_{;\nu} := T^{\mu\nu}{}_{,\nu} + \Gamma^{\mu}_{\lambda\nu} T^{\lambda\nu} + \Gamma^{\nu}_{\lambda\nu} T^{\mu\lambda} = 0, \tag{14}$$


where $T^{\mu\nu} = g^{\kappa\mu} g^{\lambda\nu} T_{\kappa\lambda}$. The covariant divergence of the Ricci tensor is nonzero, in general, but the covariant divergence of the whole left-hand side of (12) is zero automatically for $g_{\mu\nu}$ smooth enough, e.g., if the third derivatives of $g_{\mu\nu}$ are continuous (which is not the case sketched in Fig. 3). Therefore, we have only six independent equations in (12). The number of independent components of the metric tensor is also six, since we have four possibilities of choosing four coordinates.

Finally, the contravariant symmetric $4 \times 4$ metric tensor $g^{\mu\nu}$, which is inverse to the covariant metric tensor $g_{\mu\nu}$, satisfies

$$g^{\mu\nu} = \frac{g^{*}_{\mu\nu}}{\det(g_{\mu\nu})}, \qquad \det(g_{\mu\nu}) := \sum_{\pi \in S_4} (-1)^{\operatorname{sgn}\pi}\, g_{0\nu_0}\, g_{1\nu_1}\, g_{2\nu_2}\, g_{3\nu_3}, \tag{15}$$

where the entries $g^{*}_{\mu\nu}$ form the $4 \times 4$ matrix of $3 \times 3$ algebraic adjoints of $g_{\mu\nu}$, $S_4$ is the symmetric group of 24 permutations $\pi$ of indices $(\nu_0, \nu_1, \nu_2, \nu_3)$, $\operatorname{sgn}\pi = 0$ for

Fig. 3 Behavior of the non-differentiable component g11 = g11 (r ) of the composite metric tensor from (6) and (11)



an even permutation and $\operatorname{sgn}\pi = 1$ for an odd permutation.

Notice that the constant on the right-hand side of (12) is very small in the SI base units, implying that the Ricci curvature tensor $R_{\mu\nu}$ is also very small (if components of $T_{\mu\nu}$ are not too large). Einstein (1915a) presented the field equations (12) in an equivalent form which is at present written as follows:

$$R_{\mu\nu} = \frac{8\pi G}{c^4}\left(T_{\mu\nu} - \frac{1}{2}\, T g_{\mu\nu}\right), \tag{16}$$

where $T := T_{\mu\nu} g^{\mu\nu} = T^{\mu}{}_{\mu}$ denotes the trace of $T_{\mu\nu}$. To see that (16) is equivalent to (12), we multiply (12) by $g^{\mu\nu}$. Then the traces of the corresponding tensors satisfy

$$-R = R - 2R = \frac{8\pi G}{c^4}\, T.$$

Theorem 4 If $g_{\mu\nu}$ is a solution to (12), then $(-g_{\mu\nu})$ also solves (12).

Proof From (2) we find that the Christoffel symbols remain the same if we replace $g_{\mu\nu}$ by $(-g_{\mu\nu})$, namely,

$$\Gamma^{\kappa}_{\mu\nu} = \frac{1}{2}\,(-g^{\kappa\lambda})(-g_{\nu\lambda,\mu} - g_{\lambda\mu,\nu} + g_{\mu\nu,\lambda}).$$

Using (4), we find that the Ricci tensor $R_{\mu\nu}$ in (12) does not change either. Concerning the second term on the left-hand side of (12), we observe from (13) that $-\frac{1}{2} R g_{\mu\nu}$ also remains unchanged if we replace $g_{\mu\nu}$ by $(-g_{\mu\nu})$. □

Example 1 For comparison, we also note that the first-order classical derivatives of the Newton potential $u$ for the situation sketched in Fig. 2 are continuous. It is described by the Poisson equation

$$\Delta u = 4\pi G \rho, \tag{17}$$


where $\rho$ is the mass density. Let the right-hand side $f = 4\pi G\rho$ be spherically symmetric and such that $f(r) = 1$ for $r \in [0, 1]$ and $f(r) = 0$ otherwise. The Laplace operator in spherical coordinates reads

$$\Delta u = \frac{\partial^2 u}{\partial r^2} + \frac{2}{r}\frac{\partial u}{\partial r} + \frac{1}{r^2}\left(\frac{\partial^2 u}{\partial\theta^2} + \operatorname{cotan}\theta\,\frac{\partial u}{\partial\theta} + \frac{1}{\sin^2\theta}\frac{\partial^2 u}{\partial\varphi^2}\right).$$

The term in parentheses on the right-hand side is zero for the spherically symmetric case. By the well-known method of variation of constants, we find the following solution to the Poisson equation (17):

$$u(r, \varphi, \theta) = \begin{cases} \dfrac{1}{6}\, r^2 - \dfrac{1}{2} & \text{for } r \in [0, 1], \\[2mm] -\dfrac{1}{3r} & \text{otherwise.} \end{cases}$$

Hence, both $u$ and $\partial u/\partial r$ are continuous at $r_0 = 1$.
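The claims of Example 1 are easy to check numerically. The sketch below is only an illustration: it takes the piecewise solution $u(r) = r^2/6 - 1/2$ for $r \le 1$ and $u(r) = -1/(3r)$ otherwise, approximates the radial part $u'' + (2/r)u'$ of the Laplace operator by central differences, and compares the one-sided slopes at $r_0 = 1$. The function names and step sizes are hypothetical choices.

```python
def u(r):
    """Spherically symmetric Newton potential of Example 1."""
    return r * r / 6.0 - 0.5 if r <= 1.0 else -1.0 / (3.0 * r)


def radial_laplacian(f, r, h=1e-4):
    """Central-difference approximation of f'' + (2/r) f' (spherically symmetric case)."""
    d2 = (f(r + h) - 2.0 * f(r) + f(r - h)) / (h * h)
    d1 = (f(r + h) - f(r - h)) / (2.0 * h)
    return d2 + 2.0 * d1 / r


def one_sided_slopes(f, r, h=1e-6):
    """Backward and forward difference quotients of f at r."""
    left = (f(r) - f(r - h)) / h
    right = (f(r + h) - f(r)) / h
    return left, right


print(radial_laplacian(u, 0.5))  # close to 1 (inside the unit ball)
print(radial_laplacian(u, 2.0))  # close to 0 (outside)
print(one_sided_slopes(u, 1.0))  # both close to 1/3
```

In contrast to the component $g_{11}$ of the composite Schwarzschild metric, both one-sided slopes of $u$ agree at the interface.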

4 Explicit Form of the First Einstein Equation

In this section, we want to point out the extreme complexity of the Einstein equations. In (12), the dependence of the Ricci scalar $R$ and the Ricci tensor $R_{\mu\nu}$ on the metric tensor $g_{\mu\nu}$ is not indicated. Therefore, the Einstein equations (12) seem to be quite simple (see Misner et al. 1997, p. 42). To dispel this deceptive impression, we will now show how to derive an explicit form of the first Einstein equation.

First, we shall consider only the case when $T_{\mu\nu} = 0$ (and without the cosmological constant, Einstein 1952). Multiplying (12) by $g^{\mu\nu}$ and summing over all $\mu$ and $\nu$, we obtain by (13) that

$$0 = g^{\mu\nu} R_{\mu\nu} - \frac{1}{2} R\, g^{\mu\nu} g_{\mu\nu} = R - \frac{1}{2} R\, \delta^{\mu}{}_{\mu} = R - \frac{1}{2}\, 4R,$$

where $\delta^{\mu}{}_{\nu}$ is the Kronecker delta. Thus, $R = 0$ and the Einstein vacuum equations can be rewritten in the well-known form

$$R_{\mu\nu} = 0. \tag{18}$$


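Concerning the nonuniqueness expressed by Theorem 4, we observe from (2) that we can add any constant to any component $g_{\mu\nu} = g_{\nu\mu}$ and the Einstein equations $R_{\mu\nu} = 0$ will still be valid. Equation (18) looks deceptively simple, since the unknown metric tensor $g_{\mu\nu}$ is not indicated there. So now we will rewrite it so that this metric tensor appears explicitly. Using (4), we can express the first Einstein equation of (18) as follows:

$$\begin{aligned}
0 = R_{00} &= \Gamma^{\kappa}_{00,\kappa} - \Gamma^{\kappa}_{0\kappa,0} + \Gamma^{\lambda}_{00}\Gamma^{\kappa}_{\lambda\kappa} - \Gamma^{\lambda}_{0\kappa}\Gamma^{\kappa}_{0\lambda} \\
&= \underline{\Gamma^{0}_{00,0}} + \Gamma^{1}_{00,1} + \Gamma^{2}_{00,2} + \Gamma^{3}_{00,3} - \underline{\Gamma^{0}_{00,0}} - \Gamma^{1}_{01,0} - \Gamma^{2}_{02,0} - \Gamma^{3}_{03,0} \\
&\quad + \Gamma^{0}_{00}\big(\underline{\Gamma^{0}_{00}} + \Gamma^{1}_{01} + \Gamma^{2}_{02} + \Gamma^{3}_{03}\big) + \Gamma^{1}_{00}\big(\Gamma^{0}_{10} + \Gamma^{1}_{11} + \Gamma^{2}_{12} + \Gamma^{3}_{13}\big) \\
&\quad + \Gamma^{2}_{00}\big(\Gamma^{0}_{20} + \Gamma^{1}_{21} + \Gamma^{2}_{22} + \Gamma^{3}_{23}\big) + \Gamma^{3}_{00}\big(\Gamma^{0}_{30} + \Gamma^{1}_{31} + \Gamma^{2}_{32} + \Gamma^{3}_{33}\big) \\
&\quad - \underline{\Gamma^{0}_{00}\Gamma^{0}_{00}} - \Gamma^{0}_{01}\Gamma^{1}_{00} - \Gamma^{0}_{02}\Gamma^{2}_{00} - \Gamma^{0}_{03}\Gamma^{3}_{00} \\
&\quad - \Gamma^{1}_{00}\Gamma^{0}_{01} - \Gamma^{1}_{01}\Gamma^{1}_{01} - \Gamma^{1}_{02}\Gamma^{2}_{01} - \Gamma^{1}_{03}\Gamma^{3}_{01} \\
&\quad - \Gamma^{2}_{00}\Gamma^{0}_{02} - \Gamma^{2}_{01}\Gamma^{1}_{02} - \Gamma^{2}_{02}\Gamma^{2}_{02} - \Gamma^{2}_{03}\Gamma^{3}_{02} \\
&\quad - \Gamma^{3}_{00}\Gamma^{0}_{03} - \Gamma^{3}_{01}\Gamma^{1}_{03} - \Gamma^{3}_{02}\Gamma^{2}_{03} - \Gamma^{3}_{03}\Gamma^{3}_{03},
\end{aligned}$$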



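where the underlined terms cancel. Hence, the first Einstein equation can be rewritten by means of the Christoffel symbols in the following way:

$$\begin{aligned}
0 = R_{00} &= \Gamma^{1}_{00,1} + \Gamma^{2}_{00,2} + \Gamma^{3}_{00,3} - \Gamma^{1}_{01,0} - \Gamma^{2}_{02,0} - \Gamma^{3}_{03,0} \\
&\quad + \Gamma^{0}_{00}\big(\Gamma^{1}_{01} + \Gamma^{2}_{02} + \Gamma^{3}_{03}\big) + \Gamma^{1}_{00}\big(-\Gamma^{0}_{10} + \Gamma^{1}_{11} + \Gamma^{2}_{12} + \Gamma^{3}_{13}\big) \\
&\quad + \Gamma^{2}_{00}\big(-\Gamma^{0}_{20} + \Gamma^{1}_{21} + \Gamma^{2}_{22} + \Gamma^{3}_{23}\big) + \Gamma^{3}_{00}\big(-\Gamma^{0}_{30} + \Gamma^{1}_{31} + \Gamma^{2}_{32} + \Gamma^{3}_{33}\big) \\
&\quad - 2\,\Gamma^{1}_{02}\Gamma^{2}_{01} - 2\,\Gamma^{1}_{03}\Gamma^{3}_{01} - 2\,\Gamma^{2}_{03}\Gamma^{3}_{02} - (\Gamma^{1}_{01})^2 - (\Gamma^{2}_{02})^2 - (\Gamma^{3}_{03})^2.
\end{aligned} \tag{19}$$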
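The bookkeeping behind (19) can be reproduced mechanically. The following sketch is an illustration only and does not come from the original text: it expands $R_{00}$ from (4) term by term, identifies Christoffel symbols that differ only by the order of their lower indices, and counts the monomials that survive after cancellation. The helper names `sym` and `expand_r00` are hypothetical.

```python
from collections import Counter
from itertools import product


def sym(upper, a, b):
    """Christoffel symbol Gamma^upper_{ab}, normalized by symmetry in the lower indices."""
    return (upper, min(a, b), max(a, b))


def expand_r00():
    """Expand R_00 from (4) and collect equal monomials with integer coefficients."""
    terms = Counter()
    raw = 0
    for k in range(4):
        terms[("d", sym(k, 0, 0), k)] += 1  # + Gamma^k_{00,k}
        terms[("d", sym(k, 0, k), 0)] -= 1  # - Gamma^k_{0k,0}
        raw += 2
    for lam, k in product(range(4), repeat=2):
        pair = tuple(sorted((sym(lam, 0, 0), sym(k, lam, k))))
        terms[("p",) + pair] += 1           # + Gamma^lam_{00} Gamma^k_{lam k}
        pair = tuple(sorted((sym(lam, 0, k), sym(k, lam, 0))))
        terms[("p",) + pair] -= 1           # - Gamma^lam_{0k} Gamma^k_{lam 0}
        raw += 2
    surviving = {t: c for t, c in terms.items() if c != 0}
    return raw, surviving


raw, surviving = expand_r00()
print(raw, len(surviving))  # 40 27
```

Of the 40 monomials written out in the full expansion, 27 distinct ones remain, which matches a direct count of the terms in (19); for example, the product $\Gamma^{1}_{02}\Gamma^{2}_{01}$ ends up with coefficient $-2$.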


Using (2) and the symmetry of $g_{\mu\nu}$, we obtain

$$\begin{aligned}
2\,\Gamma^{\kappa}_{\mu\nu} &= g^{\kappa 0}(g_{\mu 0,\nu} + g_{\nu 0,\mu} - g_{\mu\nu,0}) + g^{\kappa 1}(g_{\mu 1,\nu} + g_{\nu 1,\mu} - g_{\mu\nu,1}) \\
&\quad + g^{\kappa 2}(g_{\mu 2,\nu} + g_{\nu 2,\mu} - g_{\mu\nu,2}) + g^{\kappa 3}(g_{\mu 3,\nu} + g_{\nu 3,\mu} - g_{\mu\nu,3}).
\end{aligned}$$

Thus, by (19), we can express the first Einstein equation $R_{00} = 0$ by means of the metric coefficients and their first- and second-order derivatives as follows:

$$\begin{aligned}
0 = 4R_{00} = 2\Big[\, & g^{10}_{,1}\, g_{00,0} + g^{11}_{,1}(2g_{01,0} - g_{00,1}) + g^{12}_{,1}(2g_{02,0} - g_{00,2}) + g^{13}_{,1}(2g_{03,0} - g_{00,3}) \\
& + g^{10} g_{00,01} + g^{11}(2g_{01,01} - g_{00,11}) + g^{12}(2g_{02,01} - g_{00,21}) + g^{13}(2g_{03,01} - g_{00,31}) \\
& + g^{20}_{,2}\, g_{00,0} + g^{21}_{,2}(2g_{01,0} - g_{00,1}) + g^{22}_{,2}(2g_{02,0} - g_{00,2}) + g^{23}_{,2}(2g_{03,0} - g_{00,3}) \\
& + g^{20} g_{00,02} + g^{21}(2g_{01,02} - g_{00,12}) + g^{22}(2g_{02,02} - g_{00,22}) + g^{23}(2g_{03,02} - g_{00,32}) \\
& + g^{30}_{,3}\, g_{00,0} + g^{31}_{,3}(2g_{01,0} - g_{00,1}) + g^{32}_{,3}(2g_{02,0} - g_{00,2}) + g^{33}_{,3}(2g_{03,0} - g_{00,3}) \\
& + g^{30} g_{00,03} + g^{31}(2g_{01,03} - g_{00,13}) + g^{32}(2g_{02,03} - g_{00,23}) + g^{33}(2g_{03,03} - g_{00,33}) \\
& - g^{10}_{,0}\, g_{00,1} - g^{11}_{,0}\, g_{11,0} - g^{12}_{,0}(g_{02,1} + g_{12,0} - g_{01,2}) - g^{13}_{,0}(g_{03,1} + g_{13,0} - g_{01,3}) \\
& - g^{10} g_{00,10} - g^{11} g_{11,00} - g^{12}(g_{02,10} + g_{12,00} - g_{01,20}) - g^{13}(g_{03,10} + g_{13,00} - g_{01,30}) \\
& - g^{20}_{,0}\, g_{00,2} - g^{21}_{,0}(g_{01,2} + g_{21,0} - g_{02,1}) - g^{22}_{,0}\, g_{22,0} - g^{23}_{,0}(g_{03,2} + g_{23,0} - g_{02,3}) \\
& - g^{20} g_{00,20} - g^{21}(g_{01,20} + g_{21,00} - g_{02,10}) - g^{22} g_{22,00} - g^{23}(g_{03,20} + g_{23,00} - g_{02,30}) \\
& - g^{30}_{,0}\, g_{00,3} - g^{31}_{,0}(g_{01,3} + g_{31,0} - g_{03,1}) - g^{32}_{,0}(g_{02,3} + g_{32,0} - g_{03,2}) - g^{33}_{,0}\, g_{33,0} \\
& - g^{30} g_{00,30} - g^{31}(g_{01,30} + g_{31,00} - g_{03,10}) - g^{32}(g_{02,30} + g_{32,00} - g_{03,20}) - g^{33} g_{33,00} \Big] \\
+ \Big[\, & g^{00} g_{00,0} + g^{01}(2g_{01,0} - g_{00,1}) + g^{02}(2g_{02,0} - g_{00,2}) + g^{03}(2g_{03,0} - g_{00,3}) \Big] \\
\times \Big[\, & g^{10} g_{00,1} + g^{11} g_{11,0} + g^{12}(g_{02,1} + g_{12,0} - g_{01,2}) + g^{13}(g_{03,1} + g_{13,0} - g_{01,3}) \\
& + g^{20} g_{00,2} + g^{21}(g_{01,2} + g_{21,0} - g_{02,1}) + g^{22} g_{22,0} + g^{23}(g_{03,2} + g_{23,0} - g_{02,3}) \\
& + g^{30} g_{00,3} + g^{31}(g_{01,3} + g_{31,0} - g_{03,1}) + g^{32}(g_{02,3} + g_{32,0} - g_{03,2}) + g^{33} g_{33,0} \Big] \\
+ \Big[\, & g^{10} g_{00,0} + g^{11}(2g_{01,0} - g_{00,1}) + g^{12}(2g_{02,0} - g_{00,2}) + g^{13}(2g_{03,0} - g_{00,3}) \Big] \\
\times \Big[ & -g^{00} g_{00,1} - g^{01} g_{11,0} - g^{02}(g_{12,0} + g_{02,1} - g_{10,2}) - g^{03}(g_{13,0} + g_{03,1} - g_{10,3}) \\
& + g^{10}(2g_{10,1} - g_{11,0}) + g^{11} g_{11,1} + g^{12}(2g_{12,1} - g_{11,2}) + g^{13}(2g_{13,1} - g_{11,3}) \\
& + g^{20}(g_{10,2} + g_{20,1} - g_{12,0}) + g^{21} g_{11,2} + g^{22} g_{22,1} + g^{23}(g_{13,2} + g_{23,1} - g_{12,3}) \\
& + g^{30}(g_{10,3} + g_{30,1} - g_{13,0}) + g^{31} g_{11,3} + g^{32}(g_{12,3} + g_{32,1} - g_{13,2}) + g^{33} g_{33,1} \Big] \\
+ \Big[\, & g^{20} g_{00,0} + g^{21}(2g_{01,0} - g_{00,1}) + g^{22}(2g_{02,0} - g_{00,2}) + g^{23}(2g_{03,0} - g_{00,3}) \Big] \\
\times \Big[ & -g^{00} g_{00,2} - g^{01}(g_{21,0} + g_{01,2} - g_{20,1}) - g^{02} g_{22,0} - g^{03}(g_{23,0} + g_{03,2} - g_{20,3}) \\
& + g^{10}(g_{20,1} + g_{10,2} - g_{21,0}) + g^{11} g_{11,2} + g^{12} g_{22,1} + g^{13}(g_{23,1} + g_{13,2} - g_{21,3}) \\
& + g^{20}(2g_{20,2} - g_{22,0}) + g^{21}(2g_{21,2} - g_{22,1}) + g^{22} g_{22,2} + g^{23}(2g_{23,2} - g_{22,3}) \\
& + g^{30}(g_{20,3} + g_{30,2} - g_{23,0}) + g^{31}(g_{21,3} + g_{31,2} - g_{23,1}) + g^{32} g_{22,3} + g^{33} g_{33,2} \Big] \\
+ \Big[\, & g^{30} g_{00,0} + g^{31}(2g_{01,0} - g_{00,1}) + g^{32}(2g_{02,0} - g_{00,2}) + g^{33}(2g_{03,0} - g_{00,3}) \Big] \\
\times \Big[ & -g^{00} g_{00,3} - g^{01}(g_{31,0} + g_{01,3} - g_{30,1}) - g^{02}(g_{32,0} + g_{02,3} - g_{30,2}) - g^{03} g_{33,0} \\
& + g^{10}(g_{30,1} + g_{10,3} - g_{31,0}) + g^{11} g_{11,3} + g^{12}(g_{32,1} + g_{12,3} - g_{31,2}) + g^{13} g_{33,1} \\
& + g^{20}(g_{30,2} + g_{20,3} - g_{32,0}) + g^{21}(g_{31,2} + g_{21,3} - g_{32,1}) + g^{22} g_{22,3} + g^{23} g_{33,2} \\
& + g^{30}(2g_{30,3} - g_{33,0}) + g^{31}(2g_{31,3} - g_{33,1}) + g^{32}(2g_{32,3} - g_{33,2}) + g^{33} g_{33,3} \Big] \\
- 2\Big[\, & g^{10} g_{00,2} + g^{11}(g_{01,2} + g_{21,0} - g_{02,1}) + g^{12} g_{22,0} + g^{13}(g_{03,2} + g_{23,0} - g_{02,3}) \Big] \\
\times \Big[\, & g^{20} g_{00,1} + g^{21} g_{11,0} + g^{22}(g_{02,1} + g_{12,0} - g_{01,2}) + g^{23}(g_{03,1} + g_{13,0} - g_{01,3}) \Big] \\
- 2\Big[\, & g^{10} g_{00,3} + g^{11}(g_{01,3} + g_{31,0} - g_{03,1}) + g^{12}(g_{02,3} + g_{32,0} - g_{03,2}) + g^{13} g_{33,0} \Big] \\
\times \Big[\, & g^{30} g_{00,1} + g^{31} g_{11,0} + g^{32}(g_{02,1} + g_{12,0} - g_{01,2}) + g^{33}(g_{03,1} + g_{13,0} - g_{01,3}) \Big] \\
- 2\Big[\, & g^{20} g_{00,3} + g^{21}(g_{01,3} + g_{31,0} - g_{03,1}) + g^{22}(g_{02,3} + g_{32,0} - g_{03,2}) + g^{23} g_{33,0} \Big] \\
\times \Big[\, & g^{30} g_{00,2} + g^{31}(g_{01,2} + g_{21,0} - g_{02,1}) + g^{32} g_{22,0} + g^{33}(g_{03,2} + g_{23,0} - g_{02,3}) \Big] \\
- \Big[\, & g^{10} g_{00,1} + g^{11} g_{11,0} + g^{12}(g_{02,1} + g_{12,0} - g_{01,2}) + g^{13}(g_{03,1} + g_{13,0} - g_{01,3}) \Big]^2 \\
- \Big[\, & g^{20} g_{00,2} + g^{21}(g_{01,2} + g_{21,0} - g_{02,1}) + g^{22} g_{22,0} + g^{23}(g_{03,2} + g_{23,0} - g_{02,3}) \Big]^2 \\
- \Big[\, & g^{30} g_{00,3} + g^{31}(g_{01,3} + g_{31,0} - g_{03,1}) + g^{32}(g_{02,3} + g_{32,0} - g_{03,2}) + g^{33} g_{33,0} \Big]^2.
\end{aligned} \tag{20}$$


Now we should substitute (15) for all entries of (20) with double upper indices. For instance, the entry $g^{11}$ in the second line of (20) could be rewritten by means of the Sarrus rule for the $3 \times 3$ symmetric matrix defining $g^{*}_{11}$:

$$g^{11} = \frac{g^{*}_{11}}{\det(g_{\mu\nu})} = \frac{g_{00} g_{22} g_{33} + 2 g_{02} g_{03} g_{23} - g_{00}(g_{23})^2 - g_{22}(g_{03})^2 - g_{33}(g_{02})^2}{\sum_{\pi \in S_4} (-1)^{\operatorname{sgn}\pi}\, g_{0\nu_0}\, g_{1\nu_1}\, g_{2\nu_2}\, g_{3\nu_3}}, \tag{21}$$

where the sum in the denominator contains $4! = 24$ terms. Note that the minimum number of arithmetic operations needed to calculate the inverse of a $4 \times 4$ matrix is not yet known. The other nine entries $g^{00}, g^{01}, g^{02}, g^{03}, g^{12}, g^{13}, g^{22}, g^{23}$, and $g^{33}$ can be expressed similarly.
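Formulas (15) and (21) translate directly into a few lines of exact arithmetic. The sketch below is an illustration only: it evaluates the determinant by the 24-term permutation sum, builds the inverse from the algebraic adjoints, and compares the entry $g^{11}$ with the Sarrus-type expression (21). The function names and the sample symmetric matrix are hypothetical and carry no physical meaning.

```python
from fractions import Fraction
from itertools import permutations


def parity(perm):
    """Return 0 for an even permutation and 1 for an odd one (inversions mod 2)."""
    n = len(perm)
    return sum(1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j]) % 2


def det(m):
    """Permutation-sum determinant as in (15); 24 terms for a 4 x 4 matrix."""
    total = Fraction(0)
    for perm in permutations(range(len(m))):
        term = Fraction((-1) ** parity(perm))
        for row, col in enumerate(perm):
            term *= m[row][col]
        total += term
    return total


def minor(m, i, j):
    """Matrix m with row i and column j removed."""
    n = len(m)
    return [[m[a][b] for b in range(n) if b != j] for a in range(n) if a != i]


def inverse(m):
    """Inverse via algebraic adjoints divided by the determinant, cf. (15)."""
    d = det(m)
    n = len(m)
    # Entry (i, j) of the inverse is the cofactor of entry (j, i) divided by det.
    return [[(-1) ** (i + j) * det(minor(m, j, i)) / d for j in range(n)] for i in range(n)]


def g11_sarrus(g):
    """Entry g^{11} written out as in (21)."""
    num = (g[0][0] * g[2][2] * g[3][3] + 2 * g[0][2] * g[0][3] * g[2][3]
           - g[0][0] * g[2][3] ** 2 - g[2][2] * g[0][3] ** 2 - g[3][3] * g[0][2] ** 2)
    return num / det(g)


g = [[-2, 1, 0, 1],
     [1, 3, 1, 0],
     [0, 1, 4, 1],
     [1, 0, 1, 5]]
print(det(g), inverse(g)[1][1], g11_sarrus(g))  # -136 21/68 21/68
```

Even for a single numerical matrix, each entry of the inverse already requires the full 24-term sum in the denominator, which is the source of the growth in size when (15) is substituted into (20).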



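However, we also have to evaluate the first derivatives of $g^{\mu\nu}$. Consider, for instance, the entry $g^{11}_{,1}$ in the first line of (20). Then by (21) we get

$$\begin{aligned}
g^{11}_{,1} &= \frac{\partial}{\partial x^1}\left(\frac{g^{*}_{11}}{\det(g_{\mu\nu})}\right) \\
&= \frac{\partial}{\partial x^1}\left[\frac{1}{\sum_{\pi \in S_4}(-1)^{\operatorname{sgn}\pi} g_{0\nu_0} g_{1\nu_1} g_{2\nu_2} g_{3\nu_3}}\Big(g_{00} g_{22} g_{33} + 2 g_{02} g_{03} g_{23} - g_{00}(g_{23})^2 - g_{22}(g_{03})^2 - g_{33}(g_{02})^2\Big)\right] \\
&= \bigg[\Big(g_{00,1} g_{22} g_{33} + 2 g_{02,1} g_{03} g_{23} - g_{00,1}(g_{23})^2 - g_{22,1}(g_{03})^2 - g_{33,1}(g_{02})^2 \\
&\qquad + g_{00} g_{22,1} g_{33} + 2 g_{02} g_{03,1} g_{23} + g_{00} g_{22} g_{33,1} + 2 g_{02} g_{03} g_{23,1} - 2 g_{00} g_{23} g_{23,1} \\
&\qquad - 2 g_{22} g_{03} g_{03,1} - 2 g_{33} g_{02} g_{02,1}\Big) \sum_{\pi \in S_4} (-1)^{\operatorname{sgn}\pi} g_{0\nu_0} g_{1\nu_1} g_{2\nu_2} g_{3\nu_3} \\
&\qquad - \Big(g_{00} g_{22} g_{33} + 2 g_{02} g_{03} g_{23} - g_{00}(g_{23})^2 - g_{22}(g_{03})^2 - g_{33}(g_{02})^2\Big) \\
&\qquad \times \sum_{\pi \in S_4} (-1)^{\operatorname{sgn}\pi}\Big(g_{0\nu_0,1} g_{1\nu_1} g_{2\nu_2} g_{3\nu_3} + g_{0\nu_0} g_{1\nu_1,1} g_{2\nu_2} g_{3\nu_3} \\
&\qquad\qquad + g_{0\nu_0} g_{1\nu_1} g_{2\nu_2,1} g_{3\nu_3} + g_{0\nu_0} g_{1\nu_1} g_{2\nu_2} g_{3\nu_3,1}\Big)\bigg] \\
&\qquad \times \bigg(\sum_{\pi \in S_4} (-1)^{\operatorname{sgn}\pi} g_{0\nu_0} g_{1\nu_1} g_{2\nu_2} g_{3\nu_3}\bigg)^{-2}.
\end{aligned} \tag{22}$$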


Substituting all $g^{\mu\nu}$ and also their first derivatives into (20), we get the explicit form of the first Einstein equation $R_{00} = 0$ of the second order for the 10 unknowns $g_{00}, g_{01}, g_{02}, \ldots, g_{33}$. It is evident that such an equation is extremely complicated. Relation (19) takes only four lines, relation (20) takes 40 lines, and after substitution of all entries with determinants given by (21), (22), etc., into (20), Eq. (18) for the component $R_{00}$ of the Ricci tensor will occupy at least 10 pages. The other nine equations $R_{\mu\nu} = 0$ can be expressed similarly.

The explicit expression of the left-hand side of (12) for a given covariant divergence-free $T_{\mu\nu} \neq 0$ in terms of the unknown components of $g_{\mu\nu}$ is even more complicated. Using (13) and (20)–(22), we still have to express the term $-\frac{1}{2} R g_{\mu\nu}$ similarly. Up to now, nobody has calculated how many terms the Einstein equations really contain, in general.



5 Computational Complexity of the Einstein Equations

According to (20)–(22), we observe that the Einstein equations are highly nonlinear. From the end of Sect. 4, we find that the explicit form of all 10 equations (12) will occupy at least one hundred pages. For comparison, note that the Laplace equation $\Delta u = 0$ has only three terms $\partial^2 u/\partial x_i^2$, $i = 1, 2, 3$, on its left-hand side and the famous Navier-Stokes equations have 24 terms.

Let $n$ denote the number of massive bodies. If $n = 0$, then the simplest solution to the Einstein equations is the Minkowski metric (5). If $n = 1$, then there are several other simple solutions to (12) that use spherical or axial symmetry of one body, e.g., the Schwarzschild metrics (6) and (11), or the Kerr metric, Misner et al. (1997, p. 878). However, these solutions are local, not global (cf. Theorem 3). Moreover, the analytical solution of (12) is not known for two or more massive bodies. Thus, we have a serious problem in verifying whether the Einstein equations describe the $n$-body problem well for $n > 1$ (e.g., in the Solar system).

There are many numerical methods for solving partial differential equations, such as the finite difference method, the finite volume method, the boundary element method, the finite element method Brandts et al. (2020), etc. For the numerical solution of the Einstein equations, we have to include back all arguments of the functions

$$g_{\mu\nu} = g_{\mu\nu}(x^0, x^1, x^2, x^3), \qquad g_{\mu\nu,\kappa} = g_{\mu\nu,\kappa}(x^0, x^1, x^2, x^3), \qquad g_{\mu\nu,\kappa\lambda} = g_{\mu\nu,\kappa\lambda}(x^0, x^1, x^2, x^3)$$

appearing in (20)–(22) for all $\mu, \nu, \kappa, \lambda = 0, 1, 2, 3$. For example, in the simplest setting of the finite difference method one has to establish a four-dimensional regular space-time mesh, e.g., with $N^4$ mesh points $(x^0_i, x^1_j, x^2_k, x^3_l)$ for $i, j, k, l = 1, 2, \ldots, N$.

Then the 10 values $g_{\mu\nu}(x^0, x^1, x^2, x^3)$, their $40 = 10 \times 4$ first derivatives and $100 = 10 \times (1 + 2 + 3 + 4)$ second derivatives (of the Hessian) appearing in (20)–(22) have to be replaced by finite differences at all mesh points. For instance, the second derivative $g_{00,11}$ appearing in the second line of (20) can be approximated by the standard central difference as

$$g_{00,11}(x^0_i, x^1_j, x^2_k, x^3_l) \approx \frac{g_{00}(x^0_i, x^1_j + h, x^2_k, x^3_l) - 2 g_{00}(x^0_i, x^1_j, x^2_k, x^3_l) + g_{00}(x^0_i, x^1_j - h, x^2_k, x^3_l)}{h^2},$$

where $h = N^{-1}$ is the discretization parameter. The huge system of nonlinear partial differential equations described in Sect. 4 would then be replaced by a much larger system of nonlinear algebraic equations for approximate values of the metric tensor



at all mesh points. In particular, at each mesh point, the corresponding discrete Einstein equations will be much longer than the Einstein equations themselves written explicitly. Hence, for example, if $N \approx 100$, the discrete system on each time level will occupy millions of pages of extremely complicated and highly nonlinear algebraic equations. It is well known that explicit numerical methods for solving evolution problems are, in general, only conditionally stable. Therefore, one should rather apply implicit methods. Nevertheless, up to now, we do not know any convergent and stable method that would yield a realistic numerical solution of the above system with guaranteed bounds on the discretization, iteration, and rounding errors.

Moreover, there are large problems with initial conditions. Since (12) is a second-order hyperbolic system of equations, one should prescribe initial conditions for all 10 components $g_{\mu\nu}$ and all their 40 first derivatives. However, this is almost impossible if all data are not spherically symmetric. The main reason is that spacetime tells matter how to move and matter tells spacetime how to curve, Misner et al. (1997). So the initial space manifold is a priori not known, in general. Thus we also have serious problems in proving the existence and uniqueness of the solution of the Einstein equations and in comparing their solution with reality. There are similar large problems with boundary conditions and the divergence-free right-hand side (14) of the Einstein equations (12).

Another non-negligible problem lies in the nonuniqueness of topology. The reason is that the knowledge of the metric tensor $g_{\mu\nu}$ does not determine uniquely the topology of the corresponding space-time manifold. For instance, the Euclidean space $\mathbb{E}^3$ obviously has the same metric $g_{\mu\nu} = \delta_{\mu\nu}$, $\mu, \nu = 1, 2, 3$, as $\mathbb{S}^1 \times \mathbb{E}^2$ but a different topology for a time-independent case with $T_{\mu\nu} = 0$ in (12). Here $\mathbb{S}^1$ stands for the unit circle. Hence, solving the Einstein equations does not mean that we obtain the shape of the associated space-time manifold. Other examples can be found in Misner et al. (1997, p. 725).
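The finite difference ingredients mentioned in this section can be sketched in a few lines. The following illustration is not a solver for the Einstein equations: it only shows the standard central second difference on a scalar test function and counts the quantities that would have to be discretized on an $N^4$ mesh, using the numbers 10, 40 and 100 quoted above. The function names and the test function $x^4$ are hypothetical.

```python
from fractions import Fraction


def central_second_difference(f, x, h):
    """Standard central difference (f(x+h) - 2 f(x) + f(x-h)) / h^2."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)


def unknown_counts(N):
    """Mesh points and numbers of values and derivatives to be approximated on an N^4 mesh."""
    mesh_points = N**4
    values = 10 * mesh_points                     # 10 metric components per mesh point
    first = 10 * 4 * mesh_points                  # 40 first derivatives per mesh point
    second = 10 * (1 + 2 + 3 + 4) * mesh_points   # 100 second derivatives per mesh point
    return mesh_points, values, first, second


h = Fraction(1, 100)
x = Fraction(1, 2)
approx = central_second_difference(lambda t: t**4, x, h)
print(approx - 12 * x**2)   # discretization error 2*h^2 = 1/5000
print(unknown_counts(100))  # (100000000, 1000000000, 4000000000, 10000000000)
```

For $N = 100$ there are already $10^8$ mesh points and $10^9$ unknown metric values, before a single term of (20)–(22) has been written down in discrete form.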

6 Concluding Remarks

Validation and verification of problems of mathematical physics and their computer implementation is a very important part of numerical analysis. We always encounter two basic types of errors: modeling error and numerical errors (such as discretization error, iteration error, round-off errors, and also undiscovered programming bugs). Validation tries to estimate the modeling error and to answer the question: Do we solve the correct equations? On the other hand, verification tries to quantify the numerical errors and to answer the question: Do we solve the equations correctly? There is a general belief in the current astrophysical community that the Einstein equations best describe gravity (Křížek 2019a). However, their extreme complexity prevents us from verifying whether they model, for instance, the Solar system better than Newtonian n-body simulations with n > 1. Hence, in this case the Einstein equations are, in fact, non-computable by present computer facilities, and thus non-testable in their general form. Moreover, from the previous exposition it is obvious that by (12) we are unable to calculate trajectories of the Jupiter-Sun system, even 1 mm of Jupiter's trajectory, for example. The reason is that the mass of Jupiter is not negligible with respect to the Sun's mass. On the other hand, such trajectories can be calculated numerically very precisely by n-body simulations (e.g., with n = 8 planets) even though their analytical solution is not known. Thus the modeling error e0, which is the difference between observed trajectories and the analytical solution, is also not known. However, the modeling error e0 can be easily estimated by the triangle inequality |e0| ≤ |e1| + |e2|, where e1 is the numerical error and e2 is the total error, which is the difference between observed and numerically calculated trajectories.

Classical relativistic tests are based on verification of very simple algebraic formulae (see, e.g., (23) below) derived by various simplifications and approximations of the Schwarzschild solution (6) of the Einstein equations (12), which is very special and corresponds only to the exterior of one spherically symmetric body, that is, n = 1. However, we cannot test good approximation properties of the Einstein equations (12) by means of one particular exterior Schwarzschild solution. Such an approach could be used only to disprove their good modeling properties of reality. Analogously, good modeling properties of the Laplace equation Δu = 0 cannot be verified by testing some of its trivial linear solutions, since there exist infinitely many other nontrivial solutions. In Einstein (1915b), Einstein replaced Mercury by a massless point, the position of the Sun was fixed, and the other planets were not taken into account (see Křížek (2017) for many other simplifications that were made). To express the gravitational field, Einstein used Eqs. (1) instead of (4) for μ = ν = 0. This important fact is suppressed (cf.
Theorem 1). In this way, Einstein derived under various further approximations the following formula for the relativistic perihelion shift of Mercury (Einstein 1915b, p. 839):

$$\varepsilon = 24\pi^3 \frac{a^2}{T^2 c^2 (1 - e^2)} = 5.012 \times 10^{-7}\ \text{rad}, \tag{23}$$

where T = 7.6005 × 10⁶ s is the orbital period, e = 0.2056 is the eccentricity of its elliptic orbit, and a = 57.909 × 10⁹ m is the length of its semimajor axis. From this he got an idealized value of the perihelion shift of 43″ per century. However, this number does not imply that the system (12) describes trajectories of planets better than Newtonian mechanics, as demonstrated in Sects. 4 and 5. Note that Paul Gerber in 1898 derived the following formula for the speed of light by means of retarded potentials (see Gerber 1898):

$$c^2 = 24\pi^3 \frac{a^2}{\Psi\, T^2 (1 - e^2)},$$

where Ψ is the perihelion shift of Mercury during one orbital period. We see that this formula is the same as (23). So the corresponding tests of the general theory of relativity based on (4) or on the Einstein system (1) yield the same values as tests of the Gerber theory of retarded potentials. So which theory is correct?

A shift (advance) of the line of apsides of binary pulsars cannot be derived similarly to Mercury's perihelion shift, since the analytical solution of the corresponding two-body problem with nonzero masses is not known. Thus, only some heuristic formulae can be applied to this highly nonlinear problem.

Recent observations of gravitational waves also do not confirm that (12) models reality well, since these waves are described by simplified linearized equations with the D'Alembert operator. For a collision of two black holes a post-Newtonian approach was employed (see, e.g., Mroué et al. 2013). Moreover, a large gravitational redshift was not taken into account (Křížek and Somer 2018). Abbott et al. (2016) considered only the cosmological gravitational redshift z = 0.09 of a binary black hole merger, but they forgot that the redshift of any black hole is z = ∞. In Křížek and Somer (2018), we demonstrate that the resulting black hole masses were overestimated by a factor of approximately two. Note also that Fig. 4, which should illustrate the propagation of gravitational waves, contradicts the general theory of relativity. To see this, denote by d the coordinate distance of two black holes and by T their orbital period. Multiply the inequality π > 2 by d/T. Then we immediately get a contradiction

$$v = \frac{\pi d}{T} > \frac{2d}{T} = \frac{|AB|}{T} = c, \tag{24}$$


where v is the orbital velocity, c the speed of gravitational waves (equal to the speed of light), and |AB| is the distance of two consecutive maximum amplitudes of the right black hole as indicated in Fig. 4. However, $v \le \tfrac{1}{3}c$ by Abbott et al. (2016). Figure 4 shows only a dipole character of gravitational waves and not their proclaimed quadrupole character. Furthermore, the double Archimedean spiral illustrating gravitational waves has by definition a different shape near the center.

Finally, we would like to emphasize that no equation of mathematical physics describes reality absolutely exactly on any scale. Therefore, each mathematical model has only a limited scope of application. In particular, the Einstein equations with cosmological constant Λ ≠ 0 (see Einstein 1952)

$$R_{\mu\nu} - \frac{1}{2} R g_{\mu\nu} + \Lambda g_{\mu\nu} = \frac{8\pi G}{c^4} T_{\mu\nu} \tag{25}$$

should also not be applied to the entire universe, as is often done, since they are nonlinear and thus not scale invariant. Note that the observable universe is at least 15 orders of magnitude larger than one astronomical unit.
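The numbers quoted with formula (23) earlier in this section can be cross-checked with a few lines of arithmetic. The following is only a minimal sketch: the function names, the value of the speed of light, and the choice of a Julian century of 36525 days are assumptions made here for illustration and do not come from the text. The last function inverts the same expression in the way Gerber's formula does, which makes the remark that both formulae coincide easy to verify.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s (assumed value, not stated in the text)

def perihelion_shift_per_orbit(a, T, e, c=SPEED_OF_LIGHT):
    """Formula (23): perihelion shift in radians per orbital period."""
    return 24.0 * math.pi**3 * a**2 / (T**2 * c**2 * (1.0 - e**2))

def shift_arcsec_per_century(shift, T):
    """Convert a per-orbit shift (rad) to arcseconds per Julian century."""
    seconds_per_century = 36525.0 * 86400.0  # assumption: Julian century
    orbits = seconds_per_century / T
    return math.degrees(shift * orbits) * 3600.0

def gerber_speed(shift, a, T, e):
    """Gerber-type formula: solve the same relation for the speed c."""
    return math.sqrt(24.0 * math.pi**3 * a**2 / (shift * T**2 * (1.0 - e**2)))

# Values for Mercury as quoted in the text
A_MERCURY = 57.909e9  # semimajor axis in m
T_MERCURY = 7.6005e6  # orbital period in s
E_MERCURY = 0.2056    # eccentricity

eps = perihelion_shift_per_orbit(A_MERCURY, T_MERCURY, E_MERCURY)
print(f"shift per orbit:   {eps:.4e} rad")
print(f"shift per century: {shift_arcsec_per_century(eps, T_MERCURY):.1f} arcsec")
print(f"speed recovered from the Gerber-type formula: "
      f"{gerber_speed(eps, A_MERCURY, T_MERCURY, E_MERCURY):.0f} m/s")
```

Because `gerber_speed` is the algebraic inverse of `perihelion_shift_per_orbit`, feeding it the computed shift returns the assumed speed of light up to rounding. The check therefore says nothing about which of the two theories is physically correct.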



Fig. 4 Popular illustration implying that the orbital velocity v of binary black holes is larger than the speed of light c, see (24)

There are only three maximally symmetric three-dimensional manifolds S³, E³, H³ that are used to model a homogeneous and isotropic universe for a fixed time. In this case, the Einstein equations lead to the famous Friedmann ordinary differential equation (Křížek and Somer 2016), i.e.

$$\text{Einstein equations} + \text{maximum symmetry} \implies \text{Friedmann equation}. \tag{26}$$


The Friedmann equation is applied to calculate luminosity distances of type Ia supernovae. On the basis of these distances, it is claimed that the Einstein equations describe the entire universe well. This is a typical circular argument. The current cosmological model, which is based on the Friedmann equation, possesses over 20 paradoxes (see, e.g., Křížek 2019a; Křížek and Somer 2016; Vavryčuk 2018). From this and implication (26), it is evident that the Einstein equations should not be applied to modeling the entire universe. During its expansion, the topology cannot change. The most probable model is S³, whose present radius is very roughly R = 10²⁶ m, whose volume is 2π²R³, and whose total mass is estimated at 2 × 10⁵³ kg. However, the radius increases with time. So during the Big Bang, the topology of the universe should also be S³. By Křížek (2019b) the maximum mass density is about 10¹⁸ kg/m³, which would correspond to the radius R = 10⁹ m.

Acknowledgements The author is indebted to J. Brandts, A. Mészáros, L. Somer, and A. Ženíšek for inspiration and valuable suggestions. Supported by grant no. 23-06159S of the Grant Agency of the Czech Republic and RVO 67985840 of the Czech Republic.



References

Abbott BP, Abbott R, Abbott TD, Abernathy MR et al (2016) Observation of gravitational waves from a binary black hole merger. Phys Rev Lett 116:061102
Brandts J, Korotov S, Křížek M (2020) Simplicial partitions with applications to the finite element method. Springer, Cham
de Sitter W (1917) On the relativity of inertia: remarks concerning Einstein's latest hypothesis. Proc Kon Ned Acad Wet 19(2):1217–1225
Einstein A (1915a) Die Feldgleichungen der Gravitation. Sitzungsberichte der Königlich Preussischen Akademie der Wissenschaften zu Berlin, Jan–Dec:844–847. https://www.
Einstein A (1915b) Erklärung der Perihelbewegung des Merkur aus der allgemeinen Relativitätstheorie. Sitzungsberichte der Königlich Preussischen Akademie der Wissenschaften zu Berlin, Jan–Dec:831–839
Einstein A (1916) Die Grundlage der allgemeinen Relativitätstheorie. Ann. der Phys. 354(7):769–822
Einstein A (1917) Kosmologische Betrachtungen zur allgemeinen Relativitätstheorie. Sitzungsberichte der Königlich Preuss. Akad. Wiss. zu Berlin, pp 142–152. English transl. In: The principle of relativity. Dover, New York, 1952
Ellis HG (2012) Gravity inside a nonrotating, homogeneous, spherical body. arXiv:1203.4750v2
Florides PS (1974) A new interior Schwarzschild solution. Proc Roy Soc Lond A 337(1611):529–535
Gerber P (1898) The spatial and temporal propagation of gravity. J Math Phys 43:93–104
Hilbert D (1915) Die Grundlagen der Physik (Erste Mitteilung). Nachr. Ges. Wissen. Göttingen, Math-Phys Klasse 1915:395–408
Interior (2020) Wikipedia,
Křížek M (2017) Influence of celestial parameters on Mercury's perihelion shift. Bulg Astron J 27:41–56
Křížek M (2019a) Do Einstein's equations describe reality well? Neural Netw World 29(4):255–283
Křížek M (2019b) Possible distribution of mass inside a black hole. Is there any upper limit on mass density? Astrophys Space Sci 364:Article 188, 1–5
Křížek M, Křížek F (2018) Quantitative properties of the Schwarzschild metric. Publ Astron Soc Bulg 2018:1–10
Křížek M, Somer L (2016) Excessive extrapolations in cosmology. Gravit Cosmol 22(3):270–280
Křížek M, Somer L (2018) Neglected gravitational redshift in detections of gravitational waves. In: Křížek M, Dumin YV (eds) Proceedings of the international conference cosmology on small scales 2018: dark matter problem and selected controversies in cosmology. Institute of Mathematics, Czech Academy of Sciences, Prague, pp 173–179
McCausland I (1999) Anomalies in the history of relativity. J Sci Explor 13(2):271–290
Misner CW, Thorne KS, Wheeler JA (1997) Gravitation, 20th edn. W.H. Freeman, New York
Mroué AH, Scheel MA, Szilágyi B, Pfeiffer HP, Boyle M, Hemberger DA, Kidder LE, Lovelace G, Ossokine S, Taylor NW, Zenginoglu A, Buchman LT, Chu T, Foley E, Giesler M, Owen R, Teukolsky SA (2013) Catalog of 174 binary black hole simulations for gravitational wave astronomy. Phys Rev Lett 111:241104
Sauer T (1999) The relativity of discovery: Hilbert's first note on the foundations of physics. Arch Hist Exact Sci 53:529–575
Schwarzschild K (1900) Über das zulässige Krümmungsmaaß des Raumes. Vierteljahrsschrift der Astronomischen Gesellschaft 35:337–347. English translation: On the permissible numerical value of the curvature of space. Abraham Zelmanov J 1:64–73, 2008
Schwarzschild K (1916a) Über das Gravitationsfeld einer Kugel aus inkompressibler Flüssigkeit nach der Einsteinschen Theorie. Sitzungsberichte der Königlich Preussischen Akademie der Wissenschaften zu Berlin, Jan–Juni:424–434. English transl.: On the gravitational field of a sphere of incompressible liquid, according to Einstein's theory. Abraham Zelmanov J 1:20–32
Schwarzschild K (1916b) Über das Gravitationsfeld eines Massenpunktes nach der Einsteinschen Theorie. Sitzungsberichte der Königlich Preussischen Akademie der Wissenschaften zu Berlin, Jan–Juni:189–196. English transl.: On the gravitational field of a point-mass, according to Einstein's theory. Abraham Zelmanov J 1:10–19, 2008
Stephani H (2004) Relativity: an introduction to special and general relativity, 3rd edn. Cambridge University Press, Cambridge
Vankov AA (2011) Einstein's paper: "Explanation of the perihelion motion of Mercury from general relativity theory". General Sci J. Translation of the paper (along with Schwarzschild's letter to Einstein) by R.A. Rydin with comments by A.A. Vankov
Vavryčuk V (2018) Universe opacity and CMB. Mon Not R Astron Soc 478(1):283–301
Will CM (2014) The confrontation between general relativity and experiment. Living Rev Relativ 17:4

Systematic Imposition of Partial Differential Equations in Boundary Value Problems

Lauri Kettunen, Sanna Mönkölä, Jouni Parkkonen, and Tuomo Rossi

Abstract We present a systematic approach to impose the partial differential equations of second-order boundary value problems. This is motivated by the needs of scientific computing and software development. The contemporary need is for software which does not restrict end users to some pre-given equations. For this we first generalize the idea of a field as a formal sum of differential forms on a Minkowski manifold. Thereafter, for these formal sums we write in space-time first-order partial differential equations that stem from the action principle. Since boundary value problems are solved with respect to some spatial and temporal chart, i.e., coordinate system, the corresponding differential operator is written explicitly in space and time. By construction, particular fields, such as those of electromagnetism or elasticity, become instances of the formal sums, and the differential equations that constitute the underlying field theories become instances of the corresponding differential operator. Consequently, the approach covers a wide class of field models and yields a systematic approach to impose specific partial differential equations. As it is straightforward to convert the underlying operators to pieces of software, the approach yields solid foundations to develop software systems that are not restricted to pre-given equations.

Keywords Systematic approach · Minkowski manifold · Electromagnetism · Elasticity · Field model

L. Kettunen (B) · S. Mönkölä · T. Rossi
Faculty of Information Technology, University of Jyväskylä, P.O. Box 35, 40014 Jyväskylä, Finland

J. Parkkonen
Department of Mathematics and Statistics, University of Jyväskylä, P.O. Box 35, 40014 Jyväskylä, Finland

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
P. Neittaanmäki and M.-L. Rantalainen (eds.), Impact of Scientific Computing on Science and Society, Computational Methods in Applied Sciences 58



L. Kettunen et al.

1 Introduction

Scientific computing has arisen from the analysis needs in the major classical field theories, such as elasticity, thermal analysis, fluid dynamics and electromagnetism. Over the years the focus has expanded towards multi-physical problems and to boundary value problems that couple microscopic and macroscopic physics. Furthermore, nowadays second-order boundary value problems are exploited not only in physics but also in other fields of science such as image processing, optimal transport, mathematical finance, and optimal control.

This trend raises a new challenge to software development. The need is for systems that are not restricted to some a priori given partial differential equations. Since there is virtually no limit to the number of feasible problems, constructing a finite list of specific problems, even a long one, is not a satisfactory starting point for contemporary software development. In this paper we introduce a framework from which particular equations can be systematically derived for each specific purpose. This makes it possible to construct software systems that enable end users to specify and solve their own problems.

Elementary physics textbooks already suggest such an approach is achievable. Textbooks recognize analogies between field theories. Mere analogies, however, do not yield a view accurate enough for the needs of scientific computing. For this reason, in this paper we focus on the abstract logical structures from which field theories are composed. In more precise terms, there exists a class of field theories that corresponds (1) to a pair of first-order differential equations equipped with a constitutive relation in space-time, and where (2) the differential equations follow from the action principle (Feynman et al. 1963; Bleecker 1981; Baez and Muniain 1994). The action principle is a generalization of the familiar minimum energy and power principle involved in static and quasi-static problems.
At first, we generalize the idea of a field to a formal sum $F$ of fields so that particular fields become instances of $F$. Then, we write the pair of differential equations $DF = 0$, $D\star F = G$ in Minkowski space and assume they stem from the action principle. Here $G$ is another formal sum, $\star$ is the Hodge operator, and $D$ is a differential operator. Thereafter, we assume a decomposition of space-time and work out operator $D$ explicitly as the sum of its time-like and space-like parts. The construction guarantees that any pair of differential equations that follows from the action principle can also be instantiated in space and time from $D$ and $F$. This provides us with an approach to systematize the imposition of partial differential equations in boundary value problems. Parabolic and elliptic problems are obtained in the usual way as special cases of hyperbolic ones.

The paper is organized as follows. First we give a short background explaining the employed concepts. Then we define the formal sum of fields in Sect. 2. In Sect. 3 we specify operator $D$ in space-time and derive explicitly its counterparts in (1 + 3)-dimensions in time and space in Corollaries 1 and 2. These corollaries are the main results of this paper. For convenience, the same is also translated into terms of classical vector analysis in Remark 9. To demonstrate the usage of the main result, in Sect. 4 we show how some well-known field models are obtained from Corollary 2. The main principles of approximations in finite-dimensional spaces for the needs of numerical computing are briefly explained in Sect. 5. Finally, the conclusions follow in Sect. 6.

2 Formal Sum of p-Forms

To get started we assume space-time is modeled as an n-dimensional smooth and oriented Minkowski manifold $\Omega$ equipped with a metric tensor and signature (−, +, +, . . . , +) (Frankel 2012). Furthermore, at will, we may assume that $\Omega$ is topologically trivial, as the differential equations describe field properties locally in the virtual neighborhoods of points of $\Omega$. Hence, the differential equations do not depend on the topology of $\Omega$.

The space-like part of $\Omega$ is a Riemannian manifold (Frankel 2012; Petersen 2016) denoted by $\Omega_s$, and the boundary of $\Omega_s$ is $\partial\Omega_s$. A bounded domain in the Euclidean 3-space is a simple and common example of $\Omega_s$. The time-like part of $\Omega$ is $\Omega_t$. When $\Omega$ is decomposed into space and time, we write $\Omega = \Omega_t \times \Omega_s$.

There are many physical field theories that cannot be covered with ordinary differential forms. Therefore, we also introduce so-called E-valued and End(E)-valued forms (Frölicher and Nijenhuis 1956; Baez and Muniain 1994). The rest of this section gives a background to E-valued and End(E)-valued forms. If one is mainly interested in ordinary forms and classical vector analysis, this part can be bypassed, and one may jump directly to Definition 1 and Remark 2.

We denote the tangent and cotangent bundles over $\Omega$ by $T\Omega$ and $T^*\Omega$. We assume $E$ is a smooth real vector bundle over $\Omega$ and $\bigwedge^p T^*\Omega$ is a p-form bundle over $\Omega$. An E-valued differential form of degree p is a section of $E \otimes \bigwedge^p T^*\Omega$ (Frölicher and Nijenhuis 1956; Baez and Muniain 1994; Frankel 2012).

Linear maps from a vector space back to itself are mathematically known as endomorphisms. Let $E^*$ be the dual space of $E$. The endomorphism bundle End(E) of $E$ over $\Omega$ is isomorphic to $E \otimes E^*$ (Baez and Muniain 1994, p. 221), and for the needs of this paper it is legitimate and sufficient to interpret End(E) as $E \otimes E^*$. An End(E)-valued differential form of degree p is a section of $\mathrm{End}(E) \otimes \bigwedge^p T^*\Omega$ (Baez and Muniain 1994).

To exemplify E-valued and End(E)-valued forms, let us express them in local coordinates. For this, let $f$ be an ordinary p-form, $s$ a section of the tangent bundle $E = T\Omega$, and $\Omega$ a simple bounded domain in Euclidean 3-space. The wedge product between differential forms is denoted by $\wedge$. In this case a section corresponds to a vector field, and in local coordinates $s$ can be written as

$$s = s^x e_x + s^y e_y + s^z e_z.$$



A 1-, 2-, or 3-form $f$ can be given by

$$f = f_x\,dx + f_y\,dy + f_z\,dz,$$
$$f = f_{yz}\,dy \wedge dz + f_{zx}\,dz \wedge dx + f_{xy}\,dx \wedge dy,$$
$$f = f_{xyz}\,dx \wedge dy \wedge dz,$$

respectively. In this case the E-valued p-form is a so-called vector-valued p-form

$$s \otimes f = [s^x f,\ s^y f,\ s^z f]^T.$$

When $s \otimes f$ operates on a p-vector (or on a p-tuple of vectors) $w$, $f(w)$ becomes a real number, and the value of the map is the vector

$$s \otimes f(w) = [f(w)s^x,\ f(w)s^y,\ f(w)s^z]^T \in \mathbb{R}^3.$$

This motivates the name "vector-valued p-forms".

Remark 1 Not all tensors are simple, and, consequently, not all E-valued differential forms can be given in the form $s \otimes f$. However, they can still be written as a sum of such forms.

The local expression of $E^*$-valued, that is, of co-vector-valued p-forms is constructed similarly. Let $s^*$ be a section of $E^*$. In Euclidean spaces $s^*$ is then just a co-vector field. In local coordinates $s^*$ is given by

$$s^* = s^*_x\,dx + s^*_y\,dy + s^*_z\,dz.$$

Now, a co-vector-valued p-form $s^* \otimes f$ is

$$s^* \otimes f = s^*_x\,dx \otimes f + s^*_y\,dy \otimes f + s^*_z\,dz \otimes f.$$

When $s^* \otimes f$ operates on a p-vector (or on a p-tuple of vectors) $w$, $f(w)$ becomes a real number, the tensor products with $f(w)$ reduce to multiplication by a real number, and consequently the value of the map becomes a co-vector

$$s^* \otimes f(w) = f(w)\,s^*_x\,dx + f(w)\,s^*_y\,dy + f(w)\,s^*_z\,dz = f(w)\,s^*,$$

justifying the name "co-vector-valued p-form".

Finally, for End(E)-valued forms we first interpret End(E) as $E \otimes E^*$, and then $s \otimes s^*$ can be understood as a section of End(E). In local coordinates, $(s \otimes s^*) \otimes f$ is written

$$(s \otimes s^*) \otimes f =
\begin{bmatrix}
s^x s^*_x\, e_x \otimes dx & s^x s^*_y\, e_x \otimes dy & s^x s^*_z\, e_x \otimes dz \\
s^y s^*_x\, e_y \otimes dx & s^y s^*_y\, e_y \otimes dy & s^y s^*_z\, e_y \otimes dz \\
s^z s^*_x\, e_z \otimes dx & s^z s^*_y\, e_z \otimes dy & s^z s^*_z\, e_z \otimes dz
\end{bmatrix} \otimes f,$$

and when $f$ operates on a p-vector or on a p-tuple of vectors $w$, the value is

$$(s \otimes s^*) \otimes f(w) = f(w)\,(s \otimes s^*),$$

and hence the name "End(E)-valued p-form".

Now, we call the formal sums of differential forms by the name general field. These are defined as follows:

Definition 1 $F$ is a general field, if it is a section of

$$\bigoplus_{p=0}^{n} \bigwedge\nolimits^{p} T^*\Omega, \qquad \bigoplus_{p=0}^{n} E_p \otimes \bigwedge\nolimits^{p} T^*\Omega, \qquad \text{or} \qquad \bigoplus_{p=0}^{n} \mathrm{End}(E) \otimes \bigwedge\nolimits^{p} T^*\Omega,$$

where $E_p$ is either $E$ or $E^*$.

Remark 2 The general field of ordinary differential forms is of the type $F = \alpha_0 f^0 + \alpha_1 f^1 + \cdots + \alpha_n f^n$, in case of E-valued forms $F = \alpha_0 s_0 \otimes f^0 + \alpha_1 s_1 \otimes f^1 + \cdots + \alpha_n s_n \otimes f^n$, and so on. Here $\alpha_p \in \mathbb{R}$, $p = 0, \ldots, n$, and the $s_p$ are sections of $E$. Consequently, it is plain that any ordinary, E-valued, or End(E)-valued differential form is an instance of the general field.

This construction guarantees that any field expressible as a differential form can also be given as an instance of $F$. Consequently, by operating with general fields we can cover at once a class of fields from physical field theories without focusing on any particular one.
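The local-coordinate expressions of this section can be mimicked with plain lists. The sketch below is purely illustrative and makes several assumptions that are not part of the paper: the forms are 1-forms with constant coefficients on Euclidean 3-space, vectors and co-vectors are stored as lists of three numbers, and all function names are hypothetical.

```python
def one_form(fx, fy, fz):
    """Ordinary 1-form f = fx dx + fy dy + fz dz (constant coefficients)."""
    return lambda w: fx * w[0] + fy * w[1] + fz * w[2]

def vector_valued(s, f):
    """Vector-valued form s (x) f: maps w to the vector f(w) * s."""
    return lambda w: [f(w) * s_i for s_i in s]

def covector_valued(s_star, f):
    """Co-vector-valued form s* (x) f: maps w to the co-vector f(w) * s*."""
    return lambda w: [f(w) * c_i for c_i in s_star]

def end_valued(s, s_star, f):
    """End(E)-valued form (s (x) s*) (x) f: maps w to f(w) * (s (x) s*),
    stored as a 3x3 nested list whose (i, j) entry is f(w) * s[i] * s_star[j]."""
    return lambda w: [[f(w) * s_i * c_j for c_j in s_star] for s_i in s]

f = one_form(1, 2, 3)   # f = dx + 2 dy + 3 dz
s = [2, 0, 1]           # components of a vector field at one point
s_star = [1, 1, 0]      # components of a co-vector field at one point
w = [1, 0, -1]          # the vector the forms act on

print(f(w))                           # the real number f(w)
print(vector_valued(s, f)(w))         # a vector
print(covector_valued(s_star, f)(w))  # a co-vector
print(end_valued(s, s_star, f)(w))    # a 3x3 array representing an endomorphism
```

In each case the ordinary form `f` is evaluated first and the result scales the fixed algebraic part, which is the pattern behind the names "vector-valued", "co-vector-valued", and "End(E)-valued" above.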

3 Differential Operator for General Fields

In this section we will introduce the first-order differential equation the general field $F$ should fulfil. For this, we need some further tools. At first we denote the exterior derivative and the covariant exterior derivative by $d$ and $d_\nabla$, respectively, where $\nabla$ is a connection (Bleecker 1981; Crampin and Pirani 1986; Frankel 2012; Baez and Muniain 1994).



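3.1 Hodge Operator and Wedge Product

The Hodge operator (Hodge 1941; Flanders 1989; Bleecker 1981; Frankel 2012) mapping ordinary p-forms to (n − p)-forms is denoted by $\star$. The rest of this subsection extends the Hodge operator to E-valued and End(E)-valued forms. If one's main interest lies in ordinary forms, this can be bypassed.

We assume a metric on $E$. Then, let $\sigma$ be an ordinary p-form and $f = s \otimes \sigma$ an E-valued p-form. Furthermore, we denote the set of sections of bundle $E$ by $\Gamma(E)$, and $\sharp : \Gamma(E^*) \to \Gamma(E)$ is the sharp map, and $\flat : \Gamma(E) \to \Gamma(E^*)$ is the flat map. For our needs the Hodge operator should be defined such that the integral of the product $f \wedge \star f$ over $\Omega$ becomes a real-valued action. For this reason we adopt the following definitions:

Definition 2 The $\star$ is an extension of the Hodge operator, if it operates on E-valued and End(E)-valued forms as follows:

$$\star : \Gamma\big(E \otimes \bigwedge\nolimits^{p} T^*\Omega\big) \to \Gamma\big(E^* \otimes \bigwedge\nolimits^{n-p} T^*\Omega\big), \qquad s \otimes \sigma \mapsto \flat s \otimes \star\sigma,$$
$$\star : \Gamma\big(E^* \otimes \bigwedge\nolimits^{p} T^*\Omega\big) \to \Gamma\big(E \otimes \bigwedge\nolimits^{n-p} T^*\Omega\big), \qquad s^* \otimes \sigma \mapsto \sharp s^* \otimes \star\sigma,$$
$$\star : \Gamma\big(\mathrm{End}(E) \otimes \bigwedge\nolimits^{p} T^*\Omega\big) \to \Gamma\big(\mathrm{End}(E) \otimes \bigwedge\nolimits^{n-p} T^*\Omega\big), \qquad S \otimes \sigma \mapsto S \otimes \star\sigma.$$

Definition 3 The $\wedge$ is an extension of the wedge product between ordinary differential forms, if it operates on E-valued and End(E)-valued forms as follows:

$$\wedge : \Gamma\big(E \otimes \bigwedge\nolimits^{p} T^*\Omega\big) \times \Gamma\big(E^* \otimes \bigwedge\nolimits^{r} T^*\Omega\big) \to \Gamma\big((E \otimes E^*) \otimes \bigwedge\nolimits^{p+r} T^*\Omega\big), \qquad (s \otimes \sigma,\ s^* \otimes \sigma') \mapsto (s \otimes s^*) \otimes (\sigma \wedge \sigma'),$$

and

$$\wedge : \Gamma\big(\mathrm{End}(E) \otimes \bigwedge\nolimits^{p} T^*\Omega\big) \times \Gamma\big(\mathrm{End}(E) \otimes \bigwedge\nolimits^{r} T^*\Omega\big) \to \Gamma\big(\mathrm{End}(E) \otimes \bigwedge\nolimits^{p+r} T^*\Omega\big), \qquad (S \otimes \sigma,\ S' \otimes \sigma') \mapsto SS' \otimes (\sigma \wedge \sigma').$$

Remark 3 The wedge product of $s \otimes \sigma$ and $\star(s \otimes \sigma)$ yields an $E \otimes E^*$-valued form $(s \otimes \flat s) \otimes (\sigma \wedge \star\sigma)$.

Remark 4 The wedge product of $S \otimes \sigma$ and $\star(S \otimes \sigma)$ is an End(E)-valued form $S \otimes \sigma \wedge \star(S \otimes \sigma) = SS \otimes (\sigma \wedge \star\sigma)$.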



3.2 Action

Hereinafter we need not emphasize explicitly whether we talk of ordinary, E-valued, or End(E)-valued p-forms. Consequently, symbols $f^p$, $g^p$, $h^p$, etc., may denote p-forms of any type.

Physical field theories are typically built around the idea of conserving some fundamental notion, such as energy, power, or probability, and the very idea is to express such a notion as an $L^2$-norm of a product of fields integrated over the whole manifold. To introduce this type of real-valued action, we first define:

Definition 4 Given a vector space $V$, a linear map $V \otimes V^* \to \mathbb{R}$ is the trace tr, if $v \otimes \nu \mapsto \nu(v)$.

Remark 5 The trace is independent of the choice of basis.

With the aid of the trace tr we are able to introduce a function $\mathcal{A}$ mapping $f^p$ to real numbers:

$$\mathcal{A}(f^p) = \frac{1}{2} \int_\Omega \mathrm{tr}(f^p \wedge \star f^p).$$

This provides us with an action of the desired type. The integral of the product between the elements of the pair $\{f^p, g^{n-p}\}$ over $\Omega$, where the elements $f^p$ and $g^{n-p}$ are related by a constitutive relation $g^{n-p} = \star f^p$, yields a real number.

Many physical field theories also call for source terms. To append them to action $\mathcal{A}$, we need to add another term. Let the sources be given as a general field $G$:

$$G = \sum_{p} g^p.$$

Complementing action $\mathcal{A}$ with the source terms results in

$$\mathcal{A}^p = \frac{1}{2} \int_\Omega \mathrm{tr}(f^p \wedge \star f^p) + \int_\Omega \mathrm{tr}(a^{p-1} \wedge g^{p+1}),$$

where $a^p$ is the potential.

Remark 6 Notice, however, that the potential does not have the same interpretation in classical field theories and in gauge theories. For example, in classical theories one typically writes $f^p = d_\nabla a^{p-1}$, whereas in the case of the Yang-Mills theory (Yang and Mills 1954; Yang 2014; Baez and Muniain 1994) the End(E)-valued curvature 2-form $f^2$ is $f^2 = f_0 + d_\nabla a^1 + a^1 \wedge a^1$, where $a^1$ is the vector potential and $f_0$ is the curvature of the connection, see Baez and Muniain (1994, p. 274–275). The wedge product of an End(E)-valued form with itself need not vanish. This is due to the non-commutativity between the two products involved. For more detail, see Crampin and Pirani (1986, p. 280–281).



Requiring the variation $\delta\mathcal{A}^p$ to vanish yields differential equations. This is the so-called action principle. In case of $\mathcal{A}^p$ the potential $a^p$ is varied by $a^p_\alpha = a^p + \alpha\,\delta a^p$, $\alpha \in \mathbb{R}$, and then the variation $\delta\mathcal{A}^p$ is given by

$$\delta\mathcal{A}^p = \frac{d}{d\alpha}\, \mathcal{A}^p(a^{p-1}_\alpha) \Big|_{\alpha = 0}.$$

Thereafter, differential equations are obtained with an integration by parts process; for details and examples see Baez and Muniain (1994), Bleecker (1981), Frankel (2012). All the examples we will show later on in Sect. 4 rely on the action $\mathcal{A}^p$ given above.
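The variation described above can be illustrated on a small discrete analogue. The following sketch is not the authors' formulation. It assumes a one-dimensional grid with uniform spacing `h`, a potential `a` stored on the nodes, the field `f = D a` given by edge differences, a diagonal stand-in `1/h` for the Hodge operator, and a source `g` on the nodes. All names are hypothetical. Under these assumptions the derivative with respect to α at α = 0 equals the pairing of the variation with the residual of the discrete equation, which is the discrete counterpart of the integration by parts step.

```python
def edge_differences(a):
    """Discrete exterior derivative D: node values -> edge values."""
    return [a[i + 1] - a[i] for i in range(len(a) - 1)]

def action(a, g, h):
    """A(a) = 1/2 * sum_e f_e^2 / h + sum_i a_i * g_i, with f = D a."""
    f = edge_differences(a)
    quadratic = 0.5 * sum(f_e * f_e / h for f_e in f)
    source = sum(a_i * g_i for a_i, g_i in zip(a, g))
    return quadratic + source

def variation(a, da, g, h, step=1e-4):
    """delta A = d/d(alpha) A(a + alpha * da) at alpha = 0 (central difference)."""
    plus = [a_i + step * d_i for a_i, d_i in zip(a, da)]
    minus = [a_i - step * d_i for a_i, d_i in zip(a, da)]
    return (action(plus, g, h) - action(minus, g, h)) / (2.0 * step)

def residual(a, g, h):
    """D^T (1/h) D a + g: delta A vanishes for every da exactly when this is zero."""
    r = list(g)
    for i, f_e in enumerate(edge_differences(a)):
        r[i] -= f_e / h
        r[i + 1] += f_e / h
    return r

h = 0.25
a = [0.0, 0.3, -0.2, 0.5, 0.1]     # hypothetical potential on five nodes
g = [0.0, 1.0, 2.0, -1.0, 0.0]     # hypothetical source
da = [0.0, 1.0, -0.5, 0.25, 0.0]   # variation vanishing at the end nodes

lhs = variation(a, da, g, h)
rhs = sum(r_i * d_i for r_i, d_i in zip(residual(a, g, h), da))
print(lhs, rhs)  # the two numbers agree up to rounding
```

Since the variation is arbitrary at the interior nodes, demanding that it vanish forces the residual to be zero there. In this toy setting that is how the action principle produces the difference equation.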

3.3 Differential Operator in Space-time

We now have all that is needed to define the class of field theoretical problems we are interested in:

Definition 5 On manifold $\Omega$ the differential equations $d_\nabla F = 0$ and $d_\nabla \star F = G$ form the general conservation law for the pair $\{F, G\}$, if the equations are derived from a real-valued action.

If the connection is trivial and in case of ordinary differential forms, $d_\nabla$ reduces to the exterior derivative $d$.

Remark 7 The idea is that small changes in the solutions of the differential equations do not change the underlying action up to the first order (Baez and Muniain 1994). This is the link to conservation of energy, probability or of some other significant notion.

Remark 8 Operator $\star$ specifies the sign, $\star\star = (-1)^{p(n-p)+1}$.

Lemma 1 In dimension $n = 4$ with the given signature the Hodge operator $\star$ maps p-forms to (4 − p)-forms, and it fulfils $\star^2 = (-1)^{p(4-p)+1}$. Consequently, the general conservation law for the pair $\{F, G\}$, $F = \sum_{p=0}^{4} f^p$, and $G = \sum_{p=0}^{4} g^p$, can be equivalently given by

$$\begin{bmatrix}
 & -\star d_\nabla \star & & & \\
d_\nabla & & \star d_\nabla \star & & \\
 & d_\nabla & & -\star d_\nabla \star & \\
 & & d_\nabla & & \star d_\nabla \star \\
 & & & d_\nabla &
\end{bmatrix}
\begin{bmatrix} f^0 \\ f^1 \\ f^2 \\ f^3 \\ f^4 \end{bmatrix}
=
\begin{bmatrix} g^0 \\ g^1 \\ g^2 \\ g^3 \\ g^4 \end{bmatrix}.$$
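The sign rule used in Lemma 1 can be checked on basis forms. The sketch below is an illustration under stated assumptions rather than the paper's construction: it uses constant basis forms $dx^I$ on flat space with metric diag(−1, 1, 1, 1), and one common convention for the Hodge operator, $\star\,dx^I = \mathrm{sgn}(I, I^c)\,\big(\prod_{i \in I} \eta_{ii}\big)\,dx^{I^c}$. Other conventions differ by an overall sign that cancels when the operator is applied twice. The function names are hypothetical.

```python
from itertools import combinations

ETA = (-1, 1, 1, 1)  # diagonal of the metric, signature (-, +, +, +)
N = 4                # dimension of space-time

def perm_sign(seq):
    """Sign of the permutation that sorts seq, found by counting inversions."""
    inversions = sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )
    return -1 if inversions % 2 else 1

def hodge_basis(idx):
    """Return (coeff, complement) such that star(dx^idx) = coeff * dx^complement."""
    complement = tuple(k for k in range(N) if k not in idx)
    coeff = perm_sign(idx + complement)
    for k in idx:
        coeff *= ETA[k]
    return coeff, complement

def star_star_sign(idx):
    """Sign picked up when the Hodge operator is applied twice to dx^idx."""
    c1, complement = hodge_basis(idx)
    c2, back = hodge_basis(complement)
    assert back == idx  # applying star twice returns to the same basis form
    return c1 * c2

for p in range(N + 1):
    signs = {star_star_sign(idx) for idx in combinations(range(N), p)}
    print(p, signs, (-1) ** (p * (N - p) + 1))
```

For each degree p the set of signs contains a single value, and it can be compared with $(-1)^{p(4-p)+1}$, the alternating pattern that fixes the signs of the $\star d_\nabla \star$ entries in the lemma.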



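3.4 Differential Operator in Space and Time

Let us next assume a decomposition of $\Omega$ into space and time, $\Omega = \Omega_s \times \Omega_t$. The space-like part of the exterior covariant derivative $d_\nabla$ is denoted by $d_\nabla^s$. Here we state our main result as corollaries following from Lemma 1.

Corollary 1 In $\Omega_t \times \Omega_s$ and in dimension $n = 1 + 3$, the conservation law is given by

$$\begin{bmatrix}
dt \wedge \partial_t & & & & & d_\nabla^s & & \\
 & \star(dt \wedge \partial_t)\star & & & d_\nabla^s & & & \\
 & & dt \wedge \partial_t & & & -\star d_\nabla^s \star & & d_\nabla^s \\
 & & & \star(dt \wedge \partial_t)\star & \star d_\nabla^s \star & & d_\nabla^s & \\
 & \star d_\nabla^s \star & & d_\nabla^s & dt \wedge \partial_t & & & \\
-\star d_\nabla^s \star & & d_\nabla^s & & & -\star(dt \wedge \partial_t)\star & & \\
 & & & \star d_\nabla^s \star & & & dt \wedge \partial_t & \\
 & & -\star d_\nabla^s \star & & & & & -\star(dt \wedge \partial_t)\star
\end{bmatrix}
\begin{bmatrix} f_s^3 \\ f_t^4 \\ f_s^1 \\ f_t^2 \\ f_s^2 \\ f_t^3 \\ f^0 \\ f_t^1 \end{bmatrix}
=
\begin{bmatrix} g_t^4 \\ g_s^3 \\ g_t^2 \\ g_s^1 \\ g_t^3 \\ g_s^2 \\ g_t^1 \\ g^0 \end{bmatrix}.$$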

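Proof The corollary follows from the following two properties:

1. The covariant exterior derivative can be decomposed into a time-like and a space-like part by writing $d_\nabla = dt \wedge \partial_t + d_\nabla^s$.
2. For any $p > 0$, any p-form $f^p$ can be decomposed into a sum of a time-like and a space-like component:

$$f^p = f_t^p + f_s^p,$$

where $f_s^p$ involves only space-like components of $f^p$, and $f_t^p$ contains the remaining components with a time-like part.

These two properties imply

$$d_\nabla f^p = dt \wedge \partial_t (f_t^p + f_s^p) + d_\nabla^s (f_t^p + f_s^p) \quad \forall p > 0.$$

Since the wedge product between $dt$ and any $f_t^p$ vanishes, one may equivalently write

$$d_\nabla f^p = dt \wedge \partial_t f_s^p + d_\nabla^s f_t^p + d_\nabla^s f_s^p, \quad \forall p > 0.$$

In case of $p = 0$ we have

$$d_\nabla f^0 = dt \wedge \partial_t f^0 + d_\nabla^s f^0.$$

Next, a decomposition of $d_\nabla \star F$ into time-like and space-like parts yields

$$d_\nabla \star F = (dt \wedge \partial_t + d_\nabla^s) \star F = dt \wedge \partial_t \star F + d_\nabla^s \star F.$$

Consequently, in case of $p = n$ we have

$$d_\nabla \star f^n = dt \wedge \partial_t \star f_t^n + d_\nabla^s \star f_t^n.$$

For $p < n$ notice first that any (n − p)-form $\star f_t^p$ is of the type $f_s^{n-p}$, and $\star f_s^p$ is of the type $f_t^{n-p}$. Accordingly, the wedge product between $dt$ and any $\star f_s^p$ has to vanish, and the derivative is given by

$$d_\nabla \star f^p = dt \wedge \partial_t \star (f_t^p + f_s^p) + d_\nabla^s \star (f_t^p + f_s^p) = dt \wedge \partial_t \star f_t^p + d_\nabla^s \star f_t^p + d_\nabla^s \star f_s^p.$$

The proof follows now from $n = 4$, $F = \sum_{p=0}^{4} f^p$,

$$d_\nabla f^0 = dt \wedge \partial_t f^0 + d_\nabla^s f^0,$$
$$d_\nabla f^p = dt \wedge \partial_t f_s^p + d_\nabla^s f_t^p + d_\nabla^s f_s^p, \quad \forall p > 0,$$
$$d_\nabla \star f^n = dt \wedge \partial_t \star f_t^n + d_\nabla^s \star f_t^n,$$
$$d_\nabla \star f^p = dt \wedge \partial_t \star f_t^p + d_\nabla^s \star f_t^p + d_\nabla^s \star f_s^p, \quad \forall p < n,$$

and furthermore, from $d_\nabla^s f_s^3 \equiv 0$ and $d_\nabla^s \star f_t^1 \equiv 0$. Substituting these back to Lemma 1 yields

$$\begin{bmatrix}
 & -\star(dt \wedge \partial_t)\star & -\star d_\nabla^s \star & & & & & \\
dt \wedge \partial_t & & & \star d_\nabla^s \star & & & & \\
d_\nabla^s & & & \star(dt \wedge \partial_t)\star & \star d_\nabla^s \star & & & \\
 & d_\nabla^s & dt \wedge \partial_t & & & -\star d_\nabla^s \star & & \\
 & & d_\nabla^s & & & -\star(dt \wedge \partial_t)\star & -\star d_\nabla^s \star & \\
 & & & d_\nabla^s & dt \wedge \partial_t & & & \star d_\nabla^s \star \\
 & & & & d_\nabla^s & & & \star(dt \wedge \partial_t)\star \\
 & & & & & d_\nabla^s & dt \wedge \partial_t &
\end{bmatrix}
\begin{bmatrix} f^0 \\ f_t^1 \\ f_s^1 \\ f_t^2 \\ f_s^2 \\ f_t^3 \\ f_s^3 \\ f_t^4 \end{bmatrix}
=
\begin{bmatrix} g^0 \\ g_t^1 \\ g_s^1 \\ g_t^2 \\ g_s^2 \\ g_t^3 \\ g_s^3 \\ g_t^4 \end{bmatrix},$$

and the desired result is obtained by row and column swapping. □

Let $f_s^p$, $g_s^p$, $F_s^p$, and $G_s^p$ denote p-forms on $\Omega_s$, and let $\star_s$ be the Hodge operator in the space-like component $\Omega_s$.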



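Corollary 2 Corollary 1 is equivalent to

$$\begin{bmatrix}
\partial_t & & & & & -d_\nabla^s & & \\
 & \star_s \partial_t \star_s & & & d_\nabla^s & & & \\
 & & \partial_t & & & \star_s d_\nabla^s \star_s & & -d_\nabla^s \\
 & & & \star_s \partial_t \star_s & \star_s d_\nabla^s \star_s & & d_\nabla^s & \\
 & \star_s d_\nabla^s \star_s & & -d_\nabla^s & \partial_t & & & \\
\star_s d_\nabla^s \star_s & & d_\nabla^s & & & -\star_s \partial_t \star_s & & \\
 & & & \star_s d_\nabla^s \star_s & & & \partial_t & \\
 & & \star_s d_\nabla^s \star_s & & & & & -\star_s \partial_t \star_s
\end{bmatrix}
\begin{bmatrix} f_s^3 \\ F_s^3 \\ f_s^1 \\ F_s^1 \\ f_s^2 \\ F_s^2 \\ f^0 \\ F^0 \end{bmatrix}
=
\begin{bmatrix} G_s^3 \\ g_s^3 \\ G_s^1 \\ g_s^1 \\ G_s^2 \\ g_s^2 \\ G^0 \\ g^0 \end{bmatrix}.$$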


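Proof Let the $p$-form $f^p$ be decomposed into $f^p = f_t^p + f_s^p$, as in the proof of Corollary 1. The exterior derivative satisfies

$$d(f^p \wedge g^r) = df^p \wedge g^r + (-1)^p f^p \wedge dg^r.$$

Hence, the space-like exterior derivative $d_\nabla^s f^p$ can be given by

$$d_\nabla^s f^p = d_\nabla^s f_t^p + d_\nabla^s f_s^p = d_\nabla^s (dt \wedge F_s^{p-1}) + d_\nabla^s f_s^p = -dt \wedge d_\nabla^s F_s^{p-1} + d_\nabla^s f_s^p,$$

where $F_s^{p-1}$ is a space-like $(p-1)$-form. The $\star$ operator in $\Omega$ can be given in terms of $\star_s$ as follows:

$$\star f^p = \star(f_t^p + f_s^p) = \star(dt \wedge F_s^{p-1} + f_s^p) = -\star_s F_s^{p-1} + (-1)^p\, dt \wedge \star_s f_s^p.$$

Consequently, we also have

$$d_\nabla^s \star f^p = -d_\nabla^s \star_s F_s^{p-1} + (-1)^{p+1}\, dt \wedge d_\nabla^s \star_s f_s^p,$$

and recursively

$$\star d_\nabla^s \star f^p = (-1)^p\, dt \wedge \star_s d_\nabla^s \star_s F_s^{p-1} + (-1)^p \star_s d_\nabla^s \star_s f_s^p.$$

The time derivatives satisfy

$$dt \wedge \partial_t f^p = dt \wedge \partial_t (dt \wedge F_s^{p-1} + f_s^p) = dt \wedge \partial_t f_s^p,$$

and

$$\star(dt \wedge \partial_t)\star f^p = -\star(dt \wedge \partial_t)\star_s F_s^{p-1} = \star_s \partial_t \star_s F_s^{p-1}.$$

Summing up, we have the following:

$$
\begin{aligned}
d_\nabla^s f_t^p &= -dt \wedge d_\nabla^s F_s^{p-1}, &
\star d_\nabla^s \star f_s^p &= (-1)^p \star_s d_\nabla^s \star_s f_s^p, \\
\star d_\nabla^s \star f_t^p &= (-1)^p\, dt \wedge \star_s d_\nabla^s \star_s F_s^{p-1}, &
dt \wedge \partial_t f^p &= dt \wedge \partial_t f_s^p, \\
\star(dt \wedge \partial_t)\star f_t^p &= \star_s \partial_t \star_s F_s^{p-1}.
\end{aligned}
$$

Substitution of this back to Corollary 1 yields

$$
\begin{bmatrix}
dt \wedge \partial_t & & & & & -dt \wedge d_\nabla^s & & \\
 & \star_s \partial_t \star_s & & & d_\nabla^s & & & \\
 & & dt \wedge \partial_t & & & dt \wedge \star_s d_\nabla^s \star_s & & -dt \wedge d_\nabla^s \\
 & & & \star_s \partial_t \star_s & \star_s d_\nabla^s \star_s & & d_\nabla^s & \\
 & dt \wedge \star_s d_\nabla^s \star_s & & -dt \wedge d_\nabla^s & dt \wedge \partial_t & & & \\
\star_s d_\nabla^s \star_s & & d_\nabla^s & & & -\star_s \partial_t \star_s & & \\
 & & & dt \wedge \star_s d_\nabla^s \star_s & & & dt \wedge \partial_t & \\
 & & \star_s d_\nabla^s \star_s & & & & & -\star_s \partial_t \star_s
\end{bmatrix}
\begin{bmatrix} f_s^3 \\ F_s^3 \\ f_s^1 \\ F_s^1 \\ f_s^2 \\ F_s^2 \\ f^0 \\ F^0 \end{bmatrix}
=
\begin{bmatrix} dt \wedge G_s^3 \\ g_s^3 \\ dt \wedge G_s^1 \\ g_s^1 \\ dt \wedge G_s^2 \\ g_s^2 \\ dt \wedge G^0 \\ g^0 \end{bmatrix}.
$$

All the entries of the odd rows have a wedge product with $dt$ from the left, and the proof follows by taking this wedge product as a common factor. □

Remark 9 Let $\Omega_s$ be a three-dimensional Euclidean space. In terms of classical vector analysis the general conservation law of Corollary 2 corresponds with

$$
\begin{bmatrix}
\partial_t & & & & & -\operatorname{div} & & \\
 & \partial_t \alpha_\omega & & & \operatorname{div} & & & \\
 & & \partial_t & & & \operatorname{curl}\,\alpha_v & & -\operatorname{grad} \\
 & & & \partial_t \alpha_s & \operatorname{curl}\,\alpha_u & & \operatorname{grad} & \\
 & \operatorname{grad}\,\alpha_\omega & & -\operatorname{curl} & \partial_t & & & \\
\operatorname{grad}\,\alpha_\varphi & & \operatorname{curl} & & & -\partial_t \alpha_v & & \\
 & & & \operatorname{div}\,\alpha_s & & & \partial_t & \\
 & & \operatorname{div}\,\alpha_r & & & & & -\partial_t \alpha_\psi
\end{bmatrix}
\begin{bmatrix} \varphi \\ \omega \\ r \\ s \\ u \\ v \\ \phi \\ \psi \end{bmatrix}
=
\begin{bmatrix} \theta \\ \lambda \\ a \\ c \\ l \\ m \\ \kappa \\ \beta \end{bmatrix},
$$

where $\alpha_\varphi, \alpha_\omega, \ldots, \alpha_\psi$ are parameters that follow from the possible change of type when working out the (1-)vectors corresponding with $\star_s f_s^p$ and $\star_s F_s^p$ in Euclidean space. Functions $\beta$, $\kappa$, $\lambda$, $\theta$, $\phi$, $\psi$, $\varphi$, and $\omega$ are scalar fields, and $a$, $c$, $l$, $m$, $r$, $s$, $u$, and $v$ are vector fields. (In case of E-valued forms, the corresponding entries of the column vectors $[\varphi, \omega, \ldots, \psi]^T$ and $[\theta, \lambda, \ldots, \beta]^T$ should be considered as vector-valued.)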



Proof In an $n$-dimensional Euclidean space any $p$-form $f^p$ can be identified with a $p$-vector $f^p$. For all $p$-vectors $u^p$, $(f^p, u^p) = f^p(u^p)$ should hold. Now, in particular, in dimension three the operators grad, curl, and div are defined such that for all $u^1$, $u^2$, and $u^3$

$$(\operatorname{grad} f^0, u^1) = df^0(u^1), \quad (\operatorname{curl} f^1, \star_s u^2) = df^1(u^2), \quad (\operatorname{div}(\star_s f^2), \star_s u^3) = df^2(u^3)$$

hold, and the proof follows from this identification. □

4 Instantiation of Particular Models

A large class of particular field models from physics can now be given as instances of the differential operator given in Corollary 2, as demonstrated in the next subsections.

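4.1 Electromagnetism

We start with electromagnetism. In $\Omega = \Omega_t \times \Omega_s$, the Faraday field becomes $b + e \wedge dt$, where $b$ represents magnetic flux density and $e$ represents the electric field strength. Both $b$ and $e \wedge dt$ are 2-forms. The source term of electromagnetism is $(-j \wedge dt + q)$, where the 2-form $j$ is current density and the 3-form $q$ is charge density. To substitute

$$F = f_s^2 + f_t^2 = b + e \wedge dt = b + dt \wedge (-e)$$

and

$$G = g_s^1 + g_t^1 = \star(-j \wedge dt + q) = \star_s j - dt \wedge \star_s q$$

to Corollary 2, we set $F_s^1 = -e$, $f_s^2 = b$, $g_s^1 = \star_s j$, and $G^0 = -\star_s q$, and then the general conservation law yields

$$
\begin{aligned}
d^s b &= 0, && \text{2nd row,} \\
d^s e + \partial_t b &= 0, && \text{5th row,} \\
-\star_s \partial_t \star_s e + \star_s d^s \star_s b &= \star_s j, && \text{4th row,} \\
-\star_s d^s \star_s e &= -\star_s q, && \text{7th row.}
\end{aligned}
$$

This is an equivalent expression for the Maxwell equations

$$
\begin{aligned}
d^s b &= 0, & d^s e + \partial_t b &= 0, \\
d^s h &= j + \partial_t d, & d^s d &= q,
\end{aligned}
$$

together with the constitutive laws $h = \star_\nu b$ and $d = \star_\varepsilon e$, where $\star_\nu = \nu \star_s$ and $\star_\varepsilon = \varepsilon \star_s$. (The material parameters can also be embedded into the Hodge operator, see Bossavit 2001b.)

To give the same in terms of classical vector analysis, one first selects $u = b$, $s = -e$, $c = j$, $\kappa = -q$, $\alpha_s = \varepsilon$, and $\alpha_u = \nu$. Then, the general conservation law in Remark 9 yields

$$
\begin{aligned}
\operatorname{div} b &= 0, && \text{2nd row,} \\
\operatorname{curl} e + \partial_t b &= 0, && \text{5th row,} \\
-\partial_t \varepsilon e + \operatorname{curl} \nu b &= j, && \text{4th row,} \\
-\operatorname{div} \varepsilon e &= -q, && \text{7th row.}
\end{aligned}
$$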
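As a worked step, offered only as a sketch: assume that $\Omega_s$ is three-dimensional and Euclidean, so that $\star_s \star_s = 1$ on every $p$-form, and that the material parameters are absorbed into the Hodge operator as the text allows. Under these assumptions, applying $\star_s$ to the fourth and seventh rows recovers the Ampère and Gauss laws. Writing $h = \star_s b$ and $d = \star_s e$ is shorthand for this illustration only.

```latex
\begin{aligned}
\text{4th row:}\quad
  & -\star_s \partial_t \star_s e + \star_s d^s \star_s b = \star_s j \\
  & \;\Longrightarrow\; -\partial_t (\star_s e) + d^s (\star_s b) = j
    && \text{(apply } \star_s \text{, use } \star_s\star_s = 1\text{)} \\
  & \;\Longrightarrow\; d^s h = j + \partial_t d, \\[4pt]
\text{7th row:}\quad
  & -\star_s d^s \star_s e = -\star_s q \\
  & \;\Longrightarrow\; d^s (\star_s e) = q
  \;\Longrightarrow\; d^s d = q .
\end{aligned}
```

The second and fifth rows involve no Hodge operator, which matches the remark in Sect. 5 that the metric enters only through the constitutive relations.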

4.2 Schrödinger Equation

The non-relativistic Schrödinger equation can also be derived from the general conservation law as a coupled pair of differential equations. The components of the wave function are denoted by $\phi_R$ and $\phi_I$. In addition, we introduce a pair $\{q_R, q_I\}$ of auxiliary variables. Then we write

$$
\begin{aligned}
F^0 &= \phi_R, & f_s^3 &= \phi_I, \\
f_s^1 &= \frac{\hbar}{2m} q_R, & F_s^2 &= \frac{\hbar}{2m} q_I, \\
G_s^1 &= q_R, & g_s^2 &= q_I, \\
g^0 &= -V \phi_R, & G_s^3 &= -V \phi_I,
\end{aligned}
$$

and $-\star_s \phi_R = \phi_I$. Be aware that although the subscripts $R$ and $I$ correspond with what are conventionally expressed as the real and imaginary components, respectively, they are here just labels. Substituting the $R$-labeled terms $F^0$, $f_s^1$, $G_s^1$, and $g^0$ back into Corollary 2 results in

$$
\begin{aligned}
\frac{\hbar}{2m} \partial_t q_R - d^s \phi_R &= q_R, && \text{3rd row,} \\
\frac{\hbar}{2m} d^s q_R &= 0, && \text{6th row,} \\
\frac{\hbar}{2m} \star_s d^s \star_s q_R - \star_s \partial_t \star_s \phi_R &= -V \phi_R, && \text{8th row.}
\end{aligned}
$$

In the non-relativistic case, $\partial_t q_R$ is neglected by a modelling decision, implying that the third row can be written as $q_R = -d^s \phi_R$. When this is substituted into the equation from the sixth row, that equation becomes redundant since $d^s d^s \equiv 0$. What remains is the eighth row, and substituting $q_R = -d^s \phi_R$ into it results in

$$-\star_s \partial_t \star_s \phi_R - \frac{\hbar^2}{2m} \star_s d^s \star_s d^s \phi_R = -V \phi_R.$$

Now, as we have $\star_s \phi_R = -\phi_I$, the equation arising from the eighth row is equivalent to

$$\star_s \partial_t \phi_I - \frac{\hbar^2}{2m} \star_s d^s \star_s d^s \phi_R = -V \phi_R.$$

Notice that $\star_s d^s \star_s d^s$ is the Laplace operator. Symmetrically, substituting the $I$-labeled terms $f_s^3$, $F_s^2$, $g_s^2$, and $G_s^3$ back into Corollary 2 yields

$$
\begin{aligned}
\star_s d^s \star_s \phi_I - \frac{\hbar}{2m} \star_s \partial_t \star_s q_I &= q_I, && \text{6th row,} \\
\frac{\hbar}{2m} \star_s d^s \star_s q_I &= 0, && \text{3rd row,} \\
\partial_t \phi_I - \frac{\hbar}{2m} d^s q_I &= -V \phi_I, && \text{1st row.}
\end{aligned}
$$

Since $\partial_t q_R$ is neglected, so is also $-\partial_t q_I = \star_s \partial_t \phi_R$, and this implies that the equation of the third row of the conservation law is a tautology. Then, we have $q_I = \star_s d^s \star_s \phi_I$, and when this and $-\star_s \phi_R = \phi_I$ are substituted into the equation of the first row one gets

$$-\star_s \partial_t \phi_R - \frac{\hbar^2}{2m} \star_s d^s \star_s d^s \phi_I = -V \phi_I.$$

The Schrödinger equation is now the following pair of equations:

$$
\begin{aligned}
\star_s \partial_t \phi_R + \frac{\hbar^2}{2m} \star_s d^s \star_s d^s \phi_I &= V \phi_I, \\
\star_s \partial_t \phi_I - \frac{\hbar^2}{2m} \star_s d^s \star_s d^s \phi_R &= -V \phi_R.
\end{aligned}
$$

This yields the same solutions as the textbook expression of the Schrödinger equation

$$\partial_t \phi - i \frac{\hbar}{2m} \operatorname{div} \operatorname{grad} \phi = -iV\phi,$$

where $\phi$ is a complex-valued scalar function.
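The equivalence between the real pair and the complex textbook form can be checked numerically on a plane wave. The following is a minimal sketch under assumptions of our own: one space dimension, units with $\hbar = m = 1$ (so both coefficients reduce to 1/2), a constant potential `v0`, and the Hodge operators dropped by identifying forms with scalar functions. The names `plane_wave`, `d_t`, `d_xx` and `residuals` are hypothetical helpers, not part of the chapter.

```python
import math

HBAR_OVER_2M = 0.5  # assumption: units with hbar = m = 1


def plane_wave(k, v0):
    """Real and imaginary parts of exp(i(k x - w t)) for a constant potential v0."""
    w = HBAR_OVER_2M * k * k + v0  # dispersion relation of the textbook equation
    phi_r = lambda x, t: math.cos(k * x - w * t)
    phi_i = lambda x, t: math.sin(k * x - w * t)
    return phi_r, phi_i


def d_t(f, x, t, h=1e-3):
    """Central difference approximation of the time derivative."""
    return (f(x, t + h) - f(x, t - h)) / (2 * h)


def d_xx(f, x, t, h=1e-3):
    """Central difference approximation of the 1D Laplacian."""
    return (f(x + h, t) - 2 * f(x, t) + f(x - h, t)) / (h * h)


def residuals(k, v0, x, t):
    """Residuals of the coupled real pair evaluated at the point (x, t)."""
    phi_r, phi_i = plane_wave(k, v0)
    # d_t phi_R + (hbar/2m) Lap phi_I = V phi_I
    r1 = d_t(phi_r, x, t) + HBAR_OVER_2M * d_xx(phi_i, x, t) - v0 * phi_i(x, t)
    # d_t phi_I - (hbar/2m) Lap phi_R = -V phi_R
    r2 = d_t(phi_i, x, t) - HBAR_OVER_2M * d_xx(phi_r, x, t) + v0 * phi_r(x, t)
    return r1, r2
```

Both residuals vanish up to the finite-difference error, which is what one expects if the pair is just the real and imaginary part of the complex equation.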



The Dirac equation and the Gross-Pitaevskii equation (Gross 1961; Pitaevskii 1961) are concretized in a similar fashion (Räbinä et al. 2018). The Klein-Gordon equation is akin to the Schrödinger equation, but second-order in time.

4.3 Elasticity

To demonstrate deriving the basic equations of small-strain elasticity (Abraham and Marsden 1987; Segev and Rodnay 1999; Kanso et al. 2007; Yavari 2008) from the general conservation law we denote velocity by $u = \partial_t \nu$, where displacement $\nu$ is a vector-valued 0-form. In other words, $\nu$ is locally a vector of displacements in as many directions as the dimension of $\Omega_s$. The displacement is a diffeomorphism of the material particles (or material points) from a reference configuration to a deformed body (Kanso et al. 2007). In particular, this map exists without the metric structure.

When expressed without metric, stress $\sigma$ is a co-vector-valued 2-form. Informally, as explained in Feynman et al. (1963, Chap. 38), stress boils down to the idea of “force per unit area”. Stress maps 2-vectors (or ordered pairs of ordinary 1-vectors) representing “virtual oriented areas” to a co-vector that corresponds with virtual work. The virtual work is the metric-independent counterpart to the force (Bossavit 1998) interpreted as the “work per length”. (This interpretation requires the metric structure.)

Following a similar kind of reasoning, linearized strain $\varepsilon$ becomes a vector-valued 1-form. As explained in Feynman et al. (1963, Chap. 38), an informal interpretation of strain boils down to the idea of “stretch per unit length”. Consequently, in a more general sense, the idea of the metric-independent linearized $\varepsilon$ is to map ordinary 1-vectors to displacement $\nu$. The source term is the body force $f_V$, which is a vector-valued 3-form.

The stress-strain relation is established with the Hodge operator as a map from vector-valued 1-forms to co-vector-valued 2-forms (Yavari 2008; Kovanen et al. 2016). Informally, this is the map between the “stretches per unit length” and the “forces per unit areas”. Here, we denote such a Hodge operator by $\star_s^C$, where the superscript $C$ is employed to denote that all the parameters of the stress-strain relation are incorporated into the Hodge operator.

Notice that strain is of the form $\varepsilon = s \otimes f$, and in dimension $n = 3$ both $f$ and $s$ have three components. Consequently, strain $\varepsilon$ involves $3 \times 3 = 9$ elements. Stress $\sigma$ has similarly nine elements, and hence, the $\star_s^C$ operator contains $9 \times 9 = 81$ elements (which can be further reduced by symmetry considerations). Notice also that the product $\varepsilon \wedge \star_s^C \varepsilon = \varepsilon \wedge \sigma$ yields an End(E)-valued form whose degree is $n$. When the trace of the product is integrated over space, one gets the strain energy

$$\frac{1}{2} \int_{\Omega_s} \operatorname{tr}(\varepsilon \wedge \star_s^C \varepsilon) = \frac{1}{2} \int_{\Omega_s} \operatorname{tr}(\varepsilon \wedge \sigma).$$

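Now, for the conservation law, we set $F^0 = u$, $f_s^1 = \varepsilon$, $g^0 = -\star_s f_V$ and denote mass density by $\rho$. When these are substituted back to Corollary 2 and the constitutive relations are taken into account, one gets

$$
\begin{aligned}
\partial_t \varepsilon - d_\nabla^s u &= 0, && \text{3rd row,} \\
d_\nabla^s \varepsilon &= 0, && \text{6th row,} \\
\star_s d_\nabla^s \star_s^C \varepsilon - \star_s \partial_t \star_s^\rho u &= -\star_s f_V, && \text{8th row.}
\end{aligned}
$$

Since we have $u = \partial_t \nu$, the equation from the third row is equivalent to $\varepsilon = d_\nabla^s \nu$. Thereafter the second equation following from the sixth row becomes a tautology. The last equation is equivalent to $d_\nabla^s \sigma - \star_s^\rho \partial_t u = -f_V$. Summing up, the conservation law for small-strain elasticity can be written as

$$
\begin{aligned}
-\partial_t \varepsilon + d_\nabla^s u &= 0, & \sigma &= \star_s^C \varepsilon, \\
\star_s^\rho \partial_t u - d_\nabla^s \sigma &= f_V, & u &= \partial_t \nu.
\end{aligned}
$$

The textbook counterpart of the conservation law is obtained from Remark 9 by choosing $\psi = \bar{u}$, $r = \bar{\varepsilon}$, $\beta = -\bar{f}_V$, $\alpha_r = C$, and $\alpha_\psi = \rho$. (The symbols with an overline indicate vector-valued functions.) As a result one gets

$$
\begin{aligned}
-\partial_t \bar{\varepsilon} + \operatorname{grad} \bar{u} &= 0, & \bar{\sigma} &= C\bar{\varepsilon}, \\
\rho\, \partial_t \bar{u} - \operatorname{div} \bar{\sigma} &= \bar{f}_V, & \bar{u} &= \partial_t \bar{\nu}.
\end{aligned}
$$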
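As a brief worked step of our own (not taken from the chapter), the four textbook relations can be combined into a single second-order equation for the displacement. The sketch assumes that $\rho$ and $C$ do not depend on time and that the time-independent integration term in the strain-displacement relation is zero, in line with the identification $\varepsilon = d_\nabla^s \nu$ made above.

```latex
\begin{aligned}
-\partial_t \bar{\varepsilon} + \operatorname{grad} \partial_t \bar{\nu} = 0
  \;&\Longrightarrow\; \bar{\varepsilon} = \operatorname{grad} \bar{\nu}, \\
\bar{\sigma} = C \bar{\varepsilon}
  \;&\Longrightarrow\; \bar{\sigma} = C \operatorname{grad} \bar{\nu}, \\
\rho\, \partial_t \bar{u} - \operatorname{div} \bar{\sigma} = \bar{f}_V
  \;&\Longrightarrow\; \rho\, \partial_t^2 \bar{\nu}
     - \operatorname{div}\!\left( C \operatorname{grad} \bar{\nu} \right) = \bar{f}_V .
\end{aligned}
```

The last line is the familiar elastodynamic wave equation, and setting $\partial_t^2 \bar{\nu} = 0$ gives the static case.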

4.4 Yang-Mills Equation

To demonstrate that the differential operator D need not be restricted to ordinary and E-valued forms, our last example is the Yang-Mills equation (Yang and Mills 1954; Yang 2014). It calls for End(E)-valued forms. As explained by Baez and Muniain (1994), the Yang-Mills equation is structurally an immediate extension of the Maxwell equations. Following their work, the counterparts to the electric and magnetic fields are the End(E)-valued 1-form $e$ and the End(E)-valued 2-form $b$, respectively. The source term is $j - \rho\, dt$. In this case the decomposition of the exterior covariant derivative into time-like and space-like parts is $d_\nabla = dt \wedge \nabla_t + d_\nabla^s$, see Baez and Muniain (1994, p. 262). The Yang-Mills equation can then be concretized from the general conservation law in the same manner as the Maxwell equations above. This results in

$$d_\nabla^s b = 0, \quad d_\nabla^s e + \nabla_t b = 0, \quad -\nabla_t e + \star_s d_\nabla^s \star_s b = j, \quad \star_s d_\nabla^s \star_s e = \rho.$$

For further details, see Baez and Muniain (1994).



5 Approximations in Finite Dimensional Spaces

Our three building blocks are a pair of fields $\{F, F^*\}$, a pair of differential equations $dF = 0$, $dF^* = G$, and the constitutive relation $F^* = \star F$ (where $d$ is now short for both the covariant exterior derivative and the exterior derivative). The techniques to implement differential forms, the differential operator $d$, and the Hodge operator as pieces of software are well known (Bossavit 1997).

The main principle behind the commonly employed numerical approaches is to maintain two of the equations exact in finite-dimensional spaces and to approximate the remaining third one. For example, the finite element method (Ciarlet 1978) satisfies one of the differential equations and the constitutive relation exactly in finite-dimensional spaces, while the remaining differential equation is approximated in the “weak sense” employing variational techniques that follow from the action principle. In Yee-like schemes (Yee 1966; Bossavit and Kettunen 1999), known also as the finite difference method or generalized finite differences (Bossavit 2001a), discrete exterior calculus (Hirani 2003), finite integration technique (Weiland 1984), etc., the strategy is to fulfil the pair of differential equations exactly in finite-dimensional spaces, while the constitutive relation is approximated only on a finite subset of points of the manifold.

The “generalized view” of finite differences (Yee 1966; Bossavit and Kettunen 1999; Bossavit 2001a; Hirani 2003) emphasizes the importance of recognizing that the differential equations call only for a differentiable structure. This then implies that the metric structure is only needed to express the constitutive relations. That is to say, the approximation of the Hodge operator in finite-dimensional spaces, “the discrete Hodge”, becomes of special interest (Tarhasaari et al. 1999). Examples of numerics can be found in Räbinä et al. (2015, 2018).
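The Yee-like strategy of keeping the differential equations exact can be illustrated on a tiny grid. The following is a minimal sketch, not the authors' implementation: the discrete exterior derivatives are signed incidence matrices, built here for a rectangular 2D grid of nodes, and their product vanishes identically in integer arithmetic. The function names `build_grid` and `matmul` and the edge orientation convention are assumptions made for this illustration.

```python
def build_grid(nx, ny):
    """Return (d0, d1) for an nx-by-ny grid of nodes.

    d0 maps node values (0-cochains) to edge values (1-cochains);
    d1 maps edge values to face values (2-cochains).
    Edges point in the +x or +y direction.
    """
    node = lambda i, j: i * ny + j
    edges = []   # each edge is stored as (tail node, head node)
    h_edge = {}  # index of the +x edge starting at node (i, j)
    v_edge = {}  # index of the +y edge starting at node (i, j)
    for i in range(nx):
        for j in range(ny):
            if i + 1 < nx:
                h_edge[(i, j)] = len(edges)
                edges.append((node(i, j), node(i + 1, j)))
            if j + 1 < ny:
                v_edge[(i, j)] = len(edges)
                edges.append((node(i, j), node(i, j + 1)))

    # d0: value at the head minus value at the tail
    d0 = [[0] * (nx * ny) for _ in edges]
    for e, (tail, head) in enumerate(edges):
        d0[e][tail] = -1
        d0[e][head] = 1

    # d1: counter-clockwise circulation around each face
    d1 = []
    for i in range(nx - 1):
        for j in range(ny - 1):
            row = [0] * len(edges)
            row[h_edge[(i, j)]] = 1        # bottom edge, along +x
            row[v_edge[(i + 1, j)]] = 1    # right edge, along +y
            row[h_edge[(i, j + 1)]] = -1   # top edge, traversed against +x
            row[v_edge[(i, j)]] = -1       # left edge, traversed against +y
            d1.append(row)
    return d0, d1


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]
```

Because the entries are integers, `matmul(d1, d0)` is exactly the zero matrix and not merely small. Any discretization error therefore has to enter through the discrete Hodge operator, which is left out of this sketch.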

6 Conclusions

Scientific computing has arisen from the needs of physics analysis. While the numerical techniques and methods employed to solve second-order boundary value problems are shared by many software systems, the systems themselves are typically restricted to some specific set of partial differential equations. However, by starting in space-time from the differential equations obtained from the action principle, one finds at a stroke a systematic approach to recognizing all the corresponding instances of partial differential equations eligible in a certain class of boundary value problems. This class covers a wide range of boundary value problems from statics to wave propagation.

As the derivation of the results calls for differential geometry and cannot be expressed in terms of classical vector analysis, the formalism is mathematically somewhat involved. However, the main result can still be written in the classical language and is also easy to grasp. The work implies that a software system need not be built on an a priori decided set of problems. Instead, by implementing the differential operator obtained with the approach, end users are enabled to impose any partial differential equations, or their coupled combinations, from the corresponding class.
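The idea of imposing equations by selecting unknowns can be sketched as a small table lookup. The following is an illustration under stated assumptions: the table `OPERATOR` encodes the nonzero entries of the operator in Remark 9 as we read them from the text, with `a_x` standing for the parameter $\alpha_x$, and the names `OPERATOR`, `SOURCES` and `impose` are hypothetical. It is not the authors' software.

```python
# Right-hand sides of rows 1..8, in the order of Remark 9 (assumed reading).
SOURCES = ["theta", "lambda", "a", "c", "l", "m", "kappa", "beta"]

# Nonzero entries per row: unknown -> operator template (assumed reading).
OPERATOR = {
    1: {"phi": "d_t({})", "v": "-div({})"},
    2: {"omega": "d_t(a_omega*{})", "u": "div({})"},
    3: {"r": "d_t({})", "v": "curl(a_v*{})", "psi": "-grad({})"},
    4: {"s": "d_t(a_s*{})", "u": "curl(a_u*{})", "varphi": "grad({})"},
    5: {"omega": "grad(a_omega*{})", "s": "-curl({})", "u": "d_t({})"},
    6: {"phi": "grad(a_phi*{})", "r": "curl({})", "v": "-d_t(a_v*{})"},
    7: {"s": "div(a_s*{})", "varphi": "d_t({})"},
    8: {"r": "div(a_r*{})", "psi": "-d_t(a_psi*{})"},
}


def impose(active):
    """Return {row: equation} for every row touched by the active unknowns."""
    equations = {}
    for row, entries in OPERATOR.items():
        terms = [op.format(name) for name, op in entries.items() if name in active]
        if terms:
            lhs = " + ".join(terms).replace("+ -", "- ")
            equations[row] = lhs + " = " + SOURCES[row - 1]
    return equations
```

Selecting `{"s", "u"}` activates rows 2, 4, 5 and 7, the rows quoted for electromagnetism in Sect. 4.1. Selecting `{"r", "psi"}` activates rows 3, 6 and 8, as in the elasticity example.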

References

Abraham R, Marsden JE (1987) Foundations of mechanics, 2nd edn. Addison-Wesley, Redwood City, CA
Baez J, Muniain JP (1994) Gauge fields, knots and gravity. Series on knots and everything, vol 4. World Scientific
Bleecker D (1981) Gauge theory and variational principles. Addison-Wesley
Bossavit A (1997) Computational electromagnetism: variational formulations, complementarity, edge elements. Academic
Bossavit A (1998) On the geometry of electromagnetism. (2): Geometrical objects. J Jpn Soc Appl Electromag Mech 6:114–123
Bossavit A (2001a) ‘Generalized Finite Differences’ in computational electromagnetics. Prog Electromag Res 32:45–64
Bossavit A (2001b) On the notion of anisotropy of constitutive laws: some implications of the “Hodge implies metric” result. COMPEL 20(1):233–239
Bossavit A, Kettunen L (1999) Yee-like schemes on a tetrahedral mesh, with diagonal lumping. Int J Numer Modell 12(1–2):129–142
Ciarlet PG (1978) The finite element method for elliptic problems. North-Holland
Crampin M, Pirani FAE (1986) Applicable differential geometry. London mathematical society lecture note series, vol 59. Cambridge University Press
Feynman RP, Leighton RB, Sands ML (1963) The Feynman lectures on physics. Vol II: electromagnetism and matter. Addison-Wesley, Reading, MA
Flanders H (1989) Differential forms with applications to the physical sciences. Dover
Frankel T (2012) The geometry of physics: an introduction, 3rd edn. Cambridge University Press, Cambridge
Frölicher A, Nijenhuis A (1956) Theory of vector-valued differential forms: part I. Derivations in the graded ring of differential forms. Indagationes Math (Proc) 59:338–350. 1016/S1385-7258(56)50046-7
Gross EP (1961) Structure of a quantized vortex in boson systems. Nuovo Cimento 20:454–477
Hirani AN (2003) Discrete exterior calculus. PhD thesis, California Institute of Technology, Pasadena, CA
Hodge WVD (1941) The theory and applications of harmonic integrals. Cambridge University Press, Cambridge
Kanso E, Arroyo M, Tong Y, Yavari A, Marsden JE, Desbrun M (2007) On the geometric character of stress in continuum mechanics. Z Angew Math Phys 58:1–14
Kovanen T, Tarhasaari T, Kettunen L (2016) Formulation of small-strain magneto-elastic problems. arXiv:1602.04966
Petersen P (2016) Riemannian geometry, 3rd edn. Springer
Pitaevskii LP (1961) Vortex lines in an imperfect Bose gas. Sov Phys JETP 13(2):451–454
Räbinä J, Kettunen L, Mönkölä S, Rossi T (2018) Generalized wave propagation problems and discrete exterior calculus. ESAIM: M2AN 52(3):1195–1218. 2018017
Räbinä J, Kuopanportti P, Kivioja MI, Möttönen M, Rossi T (2018) Three-dimensional splitting dynamics of giant vortices in Bose-Einstein condensates. Phys Rev A 98(2):023624. https://doi.org/10.1103/PhysRevA.98.023624
Räbinä J, Mönkölä S, Rossi T (2015) Efficient time integration of Maxwell’s equations with generalized finite differences. SIAM J Sci Comput 37(6):B834–B854. 140988759
Segev R, Rodnay G (1999) Cauchy’s theorem on manifolds. J Elast 56:129–144. 10.1023/A:1007651917362
Tarhasaari T, Kettunen L, Bossavit A (1999) Some realizations of a discrete Hodge operator: a reinterpretation of finite element techniques [for EM field analysis]. IEEE Trans Magn 35(3):1494–1497
Weiland T (1984) On the numerical solution of Maxwell’s equations and applications in the field of accelerator physics. Particle Accel 15:245–292
Yang CN (2014) The conceptual origins of Maxwell’s equations and gauge theory. Phys Today 67(11):45–51
Yang CN, Mills RL (1954) Conservation of isotopic spin and isotopic gauge invariance. Phys Rev 96(1):191–195
Yavari A (2008) On geometric discretization of elasticity. J Math Phys 49:022901. 10.1063/1.2830977
Yee K (1966) Numerical solution of initial boundary value problems involving Maxwell’s equations in isotropic media. IEEE Trans Antennas Propag 14(3):302–307. 1966.1138693

Curiously Empty Intersection of Proof Engineering and Computational Sciences

Sampsa Kiiskinen

Abstract The tools and techniques of proof engineering have not yet been applied to the computational sciences. We try to explain why and investigate their potential to advance the field. More concretely, we formalize elementary group theory in an interactive theorem prover and discuss how the same technique could be applied to formalize general computational methods, such as discrete exterior calculus. We note that such formalizations could reveal interesting insights into the mathematical structure of the methods and help us implement them with stronger guarantees of correctness. We also postulate that working in this way could dramatically change the way we study and communicate computational sciences.

Keywords Proof engineering · Type theory · Software engineering · Formal verification · Interactive theorem provers · Abstract algebra · Functional programming · Coq · OCaml

S. Kiiskinen (B)
Faculty of Information Technology, University of Jyväskylä, P.O. Box 35, 40014 Jyväskylä, Finland
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
P. Neittaanmäki and M.-L. Rantalainen (eds.), Impact of Scientific Computing on Science and Society, Computational Methods in Applied Sciences 58,

1 Introduction

Proof engineering is a subfield of software engineering, where the objective is to develop theories, techniques and tools for writing proofs of program correctness (Ringer et al. 2019). Even though proof engineering has many parallels with traditional software engineering, there are several concepts that do not translate directly, giving rise to unique challenges and approaches. Also, despite its practical roots as an engineering discipline, proof engineering is surprisingly closely associated with the foundations and philosophy of mathematics.

If you look for domains where recent advances in proof engineering have yielded practical applications, you will find various subfields of computer science and mathematics. On the engineering side of things, the domains include model checkers (Tourret and Blanchette 2021; Blanchette et al. 2018), interactive theorem provers (Sozeau et al. 2020), instruction set architectures (Armstrong et al. 2019), programming languages (Watt 2018), network stacks (Bishop et al. 2018), file systems (Chen et al. 2017), concurrent and distributed systems (Sergey et al. 2018; Woos et al. 2016), operating system kernels (Klein et al. 2014), security policies (Dam et al. 2013) and certified compilers (Leroy 2009). On the more mathematical side, the domains span category theory (Hu and Carette 2021), type theory (Coquand et al. 2018), machine learning (Selsam et al. 2017), homotopy theory (Bauer et al. 2017), group theory (Gonthier et al. 2013), foundations of mathematics (Voevodsky 2011), geometry (Hales 2011), graph theory (Gonthier 2008), mathematical logic (O’Connor 2005) and abstract algebra (Geuvers et al. 2002).

The list goes on, but computational sciences are nowhere to be found on it. Why? While we cannot answer the question outright, we can still pose the following dichotomy: either proof engineering is unfit for computational sciences in some intricate way, which is interesting, or there is an unexploited opportunity to advance the field, which is also interesting. Let us investigate these options further!

2 Unfit for Computational Sciences?

There are a few heuristic arguments that spring to mind when you first try to explain why proof engineering might be unfit for computational sciences. While these arguments are way too simplistic to offer a satisfying explanation, they are still worth discussing, because they help us address common misconceptions surrounding the subject.

Importance. In computational sciences, program correctness does not matter enough to justify formal specification and proof.

Considerable parts of modern infrastructure rely on software packages developed by computational scientists. Every engineering discipline that involves modeling real-world phenomena needs tools for solving linear systems of equations, finding eigenvalues, evaluating integrals or performing Fourier transforms. This is why foundational numerical libraries, such as BLAS, LAPACK, QUADPACK and FFTW, in addition to higher-level software packages based on them, such as NumPy (van der Walt et al. 2011) and MATLAB (Moler 2000), are almost omnipresent in the industry.

To suggest that the correctness of these software packages or programs using them does not matter is not only patently false, but also paints quite a depressing picture of the field. If practitioners care so little about the correctness of their work that they consider stronger guarantees to be unnecessary, there are way more serious issues to be addressed than the missed opportunities afforded by proof engineering. We can therefore dismiss this argument as needlessly pessimistic.

Relevance. In computational sciences, correctness is better established by some other method than formal specification and proof.



Testing and debugging are popular and effective ways to improve the correctness of programs, which is why they are prevalent in the industry. However, as effective as testing may be, it can only ever show the presence of bugs and never their absence (Dijkstra 1971). Formal proofs are the only way to demonstrate the absence of bugs,1 given that the proofs actually cover the whole specification.

Even when total absence of bugs is not important, there are still cases where formal proofs are worth considering. As the generality of a program increases, so does the dimensionality of the space its tests need to cover. If the program keeps growing, testing will eventually become infeasible, because high-dimensional spaces simply have too much room for bugs to hide in. Throwing more computational resources at the problem can offer some relief, but it is not a sustainable solution, because there is always a limit to how much we can compute. So, while this argument may apply to some cases, it is not universally true.

Insight. In computational sciences, no new insights can be gained from formal specification or proof.

Even though writing formal proofs involves a decent amount of tedious busywork, it is not a clerical activity completely devoid of creative insight. Organizing existing knowledge into a form that is as beautiful and understandable as it deserves to be is both challenging and valuable. If this were not the case, category theory and reverse mathematics2 would not exist and detailed formalizations of familiar mathematical concepts would not frequently yield unexpected insights (Univalent Foundations Program 2013).

It seems dubious that computational sciences could be so well-understood that its formalizations would be fruitless.3 You might personally feel this way if you are only interested in the immediate results and applications of a particular theory, but otherwise this argument is needlessly narrow-minded.

Performance. In computational sciences, performance matters most and is at odds with formal specification and proof.

One traditional technique for improving the correctness of programs is to litter them with assertions. Assertions are logical statements of invariants that should always hold. They are checked during the execution of a program and, if they are actually found to not hold, the violation is reported and the program is terminated immediately.4

1 The sole exception is a proof by exhaustion, which is a way to test things conclusively. Aside from model checking, it is rarely used in practice, because interesting things tend to be infinite or at least very large.
2 Reverse mathematics refers to the investigation of what assumptions are needed to prove known theorems. In more poetic terms, it is the study of the necessary and the sufficient.
3 Quantum computing seems like a particularly good candidate for formalization due to its counterintuitive nature and close ties with computer science (Liu et al. 2019).
4 There are also static assertions that are checked during the compilation of a program, but they are usually much less flexible.



Assertions are commonly employed in array libraries, where accessing an element at some index is preceded by a check ensuring that the index is in bounds. They are also often used to uphold preconditions of numerical routines, such as checking the condition number of a matrix before performing linear regression or checking the Courant–Friedrichs–Lewy condition of a discretization scheme before solving a partial differential equation.

Another traditional technique for improving the correctness of programs is to write them in a high-level language with a dedicated runtime system. The motivation for having a runtime system is to automate some of the most tedious and error-prone programming tasks, so that you can focus your attention on more important matters. Functional programming languages typically employ runtime systems with garbage collectors, whose purpose is to take care of allocating and releasing memory. Meanwhile, runtime systems of scripting languages almost always incorporate dynamic type systems, whose operating principle is to tag every object with a reference to its type.

Since assertions, garbage collectors and dynamic type systems all incur performance penalties, it is easy to assume that any method for improving the correctness of programs has similar drawbacks, but this is not always the case. There exist modern programming languages with so-called zero-cost abstractions that have no negative impact on performance. Typestate analysis (Bierhoff and Aldrich 2007) and proof erasure (Sozeau et al. 2020) are examples of such abstractions. Not only do they not incur performance penalties, but sometimes they even help the compiler find new performance optimizations. So, while this argument may have been true in the past, it is gradually becoming less and less convincing.

Resource. In computational sciences, time and money should be invested in something other than formal specification and proof.

Even though proof engineering tools are constantly becoming more accessible, writing formal specifications and proofs is still quite difficult and laborious. This is especially true in domains where prerequisite definitions and lemmas have not yet been laid out in full. To give a concrete example, it took several months to establish enough prerequisites in topology, algebra and geometry before perfectoid spaces could be defined (Buzzard et al. 2020). Luckily, formalization is not a package deal. There is nothing wrong with formalizing only the most salient parts of a theory and leaving the rest for later.

If you count humans as a part of the system, this thought also relates to the performance argument. The less time you and your colleagues spend programming, testing, debugging and reviewing code, the better the overall performance of the system will be. From the collective standpoint, this argument could go either way.

Feasibility. In computational sciences, it would be too difficult or limiting to incorporate formal specification and proof.

It is a reasonable fear that formalization could increase the barrier of entry and make adopting new ideas more difficult. However, it has been reported that proof engineering tools make for excellent teaching aids (Pierce 2009) and that the inflexibility towards changing definitions can be mitigated through the principle of representation independence (Angiuli et al. 2021).

On the flip side, you could argue that it would be too difficult or limiting to not incorporate formal specification and proof. Whether or not you subscribe to the intuitionistic philosophy of defining mathematics as a mental process, it is indisputable that humans can only hold so much information in their heads at once. As the generality of a program increases, so does the level of abstraction needed to keep it manageable. Not everything can or even should be understood as arrays of floating-point numbers.

In terms of rigor, computational sciences sit somewhere between mathematics and software engineering, so it seems unlikely that we could not overcome the same challenges as everyone else. Thus, this argument seems like a pointless worry.
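The runtime assertions discussed under the performance argument can be made concrete with a small example. The following is a minimal sketch, not taken from the chapter: a first-order upwind step for the advection equation $u_t + c\,u_x = 0$ on a periodic grid, guarded by an assertion on the Courant–Friedrichs–Lewy condition. The function name `advect_step` and the choice of scheme are assumptions made for illustration.

```python
def advect_step(u, c, dt, dx):
    """One upwind step of u_t + c u_x = 0 on a periodic grid, for c >= 0."""
    # Precondition checks: these run on every call and cost time at runtime.
    assert dx > 0 and dt > 0, "grid spacing and time step must be positive"
    courant = c * dt / dx
    assert 0 <= courant <= 1, f"CFL condition violated: {courant}"
    # u[i - 1] wraps around for i == 0, which gives the periodic boundary.
    return [u[i] - courant * (u[i] - u[i - 1]) for i in range(len(u))]
```

The check is repeated at every step and only reports a violation once the program is already running, which is the cost the text refers to. In Python the checks disappear when the interpreter is started with `-O`, so they trade safety for speed rather than offering both, unlike the zero-cost abstractions mentioned above.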

3 Closer Investigation

In terms of tools, proof engineering deals with the development and use of interactive theorem provers (itps), automated theorem provers (atps), program verifiers and constraint solvers (css). While all of these tools could be useful in various corners of computational sciences, we want to focus our attention on itps, because they offer the highest level of generality. This generality is not only apparent in the problems itps have been used to solve, but also in the way itps can leverage other atps and css as their subsystems (Czajka and Kaliszyk 2018).

3.1 Architecture of Interactive Theorem Provers

In the early days, programming languages, such as FORTRAN⁵ (from 1957) and C (from 1972), had very simple type systems. Their primary purpose was to tell the compiler what storage size, alignment and supported operations any variable was supposed to have. The compiler tracked this information and ensured it remained consistent throughout the program (Brainerd 1978; Kernighan and Ritchie 1988).

Programming languages have evolved a lot since those days. Nowadays, there exist languages with type systems that are strong enough to express arbitrarily complicated properties. If arbitrariness does not immediately spark your imagination, you may consider such properties as “this differential equation has a unique solution”, “this numerical method never diverges” or “this program always halts”. Indeed, dependently-typed total languages can express the entirety of constructive mathematics due to a principle known as the Curry–Howard correspondence (Wadler 2015).

⁵ You can infer the era from the name, as lowercase letters were not invented until FORTRAN 77 was superseded by Fortran 90.


S. Kiiskinen



Fig. 1 Architecture of interactive theorem provers with nodes representing system components and arrows the data flow mechanisms between them

The only caveat is that the compiler cannot necessarily find a proof of a given proposition or tell if the proposition is meaningful; it can only decide whether a given proof is correct or not.

Some notable itps include Isabelle (from 1986), Coq (from 1989), Agda (from 2007) and Lean (from 2013). They are classified as itps, because they satisfy the de Bruijn criterion, which requires being able to produce elaborated proof objects that a small proof-checking kernel can verify (Ringer et al. 2019). The architecture of such itps roughly follows the diagram in Fig. 1.

The vernacular is the surface language of the system and therefore appears in all itps. Users mostly write their programs in the vernacular, which is why it has to be quite a rich language. The metalanguage is an additional support language that only appears in some itps. The purpose of the metalanguage is to help write more complicated vernacular programs and it often comes with more unprincipled features, such as nontermination, unhygienic macros and reflection. The kernel, as the name suggests, is at the core of all itps. Vernacular programs are compiled into kernel programs through a process called elaboration and then checked for correctness. The resulting kernel programs tend to be large, uninformative and may only ever exist ephemerally. Their sole purpose is to be type checked and possibly compiled into another language. The remaining components are common to most ordinary programming languages.

We shall use Coq as an example of an itp, because it possibly follows this architecture most closely. In Coq, the vernacular is a dependently-typed total purely functional language called Gallina, which looks a lot like ML. Any program written in Gallina is guaranteed to terminate, because its only control flow abstractions are structurally recursive functions.⁶ In Coq, the metalanguage is an untyped imperative language called Ltac,⁷ which bears some resemblance to Bourne shell scripts. Working with Ltac is similar to using a reversible debugger, in the sense that you change the state of your program by repeatedly doing and undoing commands. In Coq, the kernel is an implementation of the calculus of inductive constructions with some small modifications (Pfenning and Paulin-Mohring 1990; Coquand and Paulin 1990). It is intended to be as small and simple as possible, so that the risk of having bugs in the kernel is minimized. This is very important, because all of the correctness guarantees afforded by the system hinge on the kernel being bug-free.

⁶ Even well-founded recursion is not primitive, as it is translated into structural recursion through a clever use of the accessibility predicate (Bove and Capretta 2005).

3.2 Theoretical Background

The kernels of all prominent itps to date are based on some flavors of type theory. In the more conservative camp, Coq and Lean both implement variations of the calculus of constructions, which is also occasionally called Coquand–Huet type theory (Coquand and Huet 1988). In the more experimental camp, Agda implements both intuitionistic type theory, which is frequently referred to as just Martin-Löf type theory (Martin-Löf 1998), and cubical type theory, which is also known as Cohen–Coquand–Huber–Mörtberg type theory (Cohen et al. 2016). Each of these type theories has enough expressive power to serve as a constructive foundation of mathematics, even though they have notable differences in the features they provide. For example, cubical type theory comes with higher-inductive types (Coquand et al. 2018; Vezzosi et al. 2021), while the predicative calculus of inductive constructions supports a universe of proof-irrelevant propositions (Gilbert et al. 2019).

If you wonder why none of the itps implement Zermelo–Fraenkel set theory, which is widely considered to be the canonical foundation of mathematics, the reason is mostly coincidental. Nobody has found success with it and the consensus in the community seems to be that set theory only works in principle, not in practice. This attitude is also apparent in statements written by experts.

    You may consider the development of $V$ to be outside [the] scope of mathematics and logic, but then do not complain when computer scientist[s] fashion it after their technology. I have never seen any serious proposals for a vernacular⁸ based on set theory. Or[,] to put it another way, as soon as we start expanding and transforming set theory to fit the requirements for $V$, we end up with a theoretical framework that looks a lot like type theory.

    Andrej Bauer (Bauer 2020)

While understanding type theory in depth is way beyond the scope of this text, it is still helpful to be familiar with its basic parlance. Type theories are conventionally presented using inference rules.⁹ Each rule is written $\frac{\_}{\_}$ with a hole for the premises and a hole for the conclusion. Rules involve two kinds of judgements that can be placed in a context. Typing judgements are written $\_ : \_$ with a hole for the term and a hole for its type. Equality judgements are written $\_ \equiv \_$ with two holes for the terms that are supposed to be equal. Contexts are written $\_, \_, \ldots, \_$ with holes for judgements or other contexts. Hypothetical judgements are written $\_ \vdash \_$ with a hole for the context and a hole for the judgement.

Even though the typing judgement $x : A$ resembles the set-theoretical membership proposition $x \in A$ and the equality judgement $x \equiv y$ resembles the set-theoretical equality proposition $x = y$, they are not semantically equivalent. The first big difference is that types are not sets. While some types, such as that of the integers $\mathbb{Z}$ and that of infinite bit strings $\mathbb{N} \to 2$, can be shown to form sets, many others, such as that of types $\mathcal{U}$ and that of the circle $S^1$, carry more structure. The second big difference is that judgements are not propositions. While propositions are internal to the theory, judgements are not, so it is not possible to do such things as form the negation of a judgement.

Inference rules are quite an abstract concept, so let us illustrate them through a concrete example. We shall use natural numbers as an example, because they are both familiar and ubiquitous. Natural numbers are a type with the formation rule

$$\Gamma \vdash \mathbb{N} : \mathcal{U},$$

the introduction rules

$$\Gamma \vdash O : \mathbb{N} \qquad \frac{\Gamma \vdash n : \mathbb{N}}{\Gamma \vdash S(n) : \mathbb{N}},$$

the elimination rule

$$\frac{\Gamma, m : \mathbb{N} \vdash P(m) : \mathcal{U} \qquad \Gamma \vdash x : P(O) \qquad \Gamma, m : \mathbb{N}, y : P(m) \vdash f(m, y) : P(S(m)) \qquad \Gamma \vdash n : \mathbb{N}}{\Gamma \vdash \mathrm{ind}_P(x, f, n) : P(n)}$$

and the computation rules

$$\frac{\Gamma, m : \mathbb{N} \vdash P(m) : \mathcal{U} \qquad \Gamma \vdash x : P(O) \qquad \Gamma, m : \mathbb{N}, y : P(m) \vdash f(m, y) : P(S(m))}{\Gamma \vdash \mathrm{ind}_P(x, f, O) \equiv x}$$

$$\frac{\Gamma, m : \mathbb{N} \vdash P(m) : \mathcal{U} \qquad \Gamma \vdash x : P(O) \qquad \Gamma, m : \mathbb{N}, y : P(m) \vdash f(m, y) : P(S(m)) \qquad \Gamma \vdash n : \mathbb{N}}{\Gamma \vdash \mathrm{ind}_P(x, f, S(n)) \equiv f(n, \mathrm{ind}_P(x, f, n))}.$$

The introduction rules represent zero and the successor of another natural number, so any particular natural number can be constructed by applying them repeatedly.

⁷ The name is sometimes typeset as $L_{\mathrm{tac}}$ and stands for the obvious “language for tactics”. As you can see, working in this field takes a lot of imagination.

⁸ The variable $V$ refers to the formal language used for the vernacular.

⁹ This presentation uses a proof calculus called natural deduction. It is popular, but far from the only option.

We have

$$\Gamma \vdash 2 : \mathbb{N} \qquad \Gamma \vdash 2 \equiv S(S(O)) \qquad \Gamma \vdash 4 \equiv S(S(S(S(O))))$$

and so on. The elimination rule represents mathematical induction, so the addition of natural numbers can be defined by applying it to either parameter. If we pick the first one, as is convention, we get

$$\Gamma, n : \mathbb{N}, p : \mathbb{N} \vdash n + p : \mathbb{N}$$

$$\Gamma, n : \mathbb{N}, p : \mathbb{N} \vdash n + p \equiv \mathrm{ind}_{m \mapsto \mathbb{N}}(p, (m, q) \mapsto S(q), n),$$

which reduces according to the computation rules.¹⁰

Informal presentations are rarely as detailed as we have been so far. It usually suffices to say that natural numbers are an inductive type freely generated by

$$O : \mathbb{N} \qquad S : \mathbb{N} \to \mathbb{N},$$

because everything else can be inferred from these generators (Awodey et al. 2012). It is also sufficient to define addition by

$$\_ + \_ : \mathbb{N} \times \mathbb{N} \to \mathbb{N} \qquad O + p \equiv p \qquad S(n) + p \equiv S(n + p),$$

because pattern matching over structural recursion is equivalent to induction via elimination rules (Coquand 1992; Cockx and Abel 2018). We will follow a more informal style like this from here on out.
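The two presentations of addition can be compared directly in Coq. The following is a minimal sketch rather than a part of the original development: the names `MyN`, `Zero`, `Succ`, `add_ind` and `add_rec` are hypothetical, and `MyN_rect` is the eliminator that Coq generates automatically for the inductive type. The examples also carry out the deduction of 2 + 2 ≡ 4 by computation.

```coq
(* Natural numbers as an inductive type with two generators.
   All names in this block are illustrative. *)
Inductive MyN : Type :=
| Zero : MyN
| Succ : MyN -> MyN.

(* Addition through the generated eliminator, which plays the role
   of ind_P with the constant family P := fun _ => MyN. *)
Definition add_ind (n p : MyN) : MyN :=
  MyN_rect (fun _ => MyN) p (fun _ q => Succ q) n.

(* Addition through pattern matching and structural recursion
   on the first parameter. *)
Fixpoint add_rec (n p : MyN) : MyN :=
  match n with
  | Zero => p
  | Succ m => Succ (add_rec m p)
  end.

Definition two : MyN := Succ (Succ Zero).
Definition four : MyN := Succ (Succ (Succ (Succ Zero))).

(* Both definitions reduce 2 + 2 to 4 by the computation rules. *)
Example two_plus_two_ind : add_ind two two = four.
Proof. reflexivity. Qed.

Example two_plus_two_rec : add_rec two two = four.
Proof. reflexivity. Qed.

(* The two definitions agree on all inputs. *)
Lemma add_ind_rec (n p : MyN) : add_ind n p = add_rec n p.
Proof.
  induction n as [| m IH].
  - reflexivity.
  - change (Succ (add_ind m p) = Succ (add_rec m p)).
    rewrite IH. reflexivity.
Qed.
```

The `change` step only succeeds because both sides are convertible to their unfolded forms, which is the computation rule for the successor case at work.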

3.3 Role of Elaboration

Now that we know how itps are designed, we can discuss the role of elaboration in them. The purpose of elaboration is to convert mathematical statements expressed in the vernacular into equivalent statements that can be handled by the kernel. As before, we shall illustrate an abstract concept through a concrete example. In the vein of choosing something that is both familiar and ubiquitous, we shall use elementary group theory as an example.

¹⁰ It is a good exercise for the reader to follow the given rules to deduce $2 + 2 \equiv 4$.

Consider the following excerpt from a typical textbook on abstract algebra:

    If $h : G \to H$ is a group homomorphism, then $h(a + 2 \times b) = h(a) + 2 \times h(b)$.

If we have done our homework properly and wish to be as terse as possible,¹¹ we can feed this excerpt into Coq almost exactly as it is stated here. This is possible due to the following mechanisms supported by its vernacular:

Existential variables. The constituents of the groups are not introduced explicitly. We need to assume the existence of the carriers $A : \mathcal{U}$ and $B : \mathcal{U}$, the relations $X : A \times A \to \mathcal{U}$ and $Y : B \times B \to \mathcal{U}$, the identities $x : A$ and $y : B$, the inverses $f : A \to A$ and $g : B \to B$ and the operations $k : A \times A \to A$ and $m : B \times B \to B$ before we can even talk about the homomorphism $h : G \to H$ between the groups $G : \mathrm{IsGrp}_A(X, x, f, k)$ and $H : \mathrm{IsGrp}_B(Y, y, g, m)$.

Implicit coercions. The typing judgement $h : G \to H$ does not make sense, because $G$ and $H$ are groups instead of types. The function in question is actually defined between the carriers, as in $h : A \to B$, with the additional knowledge that it preserves group structure, as evidenced by $M : \mathrm{IsGrpHom}_{A,B}(X, x, f, k, Y, y, g, m, h)$. We must insert projections from $G$ to $A$ and $H$ to $B$ before we can make sense of the typing judgement.

Type inference. The types of the variables $a$ and $b$ are not specified. However, from the presence of $h(a)$ and $h(b)$, we can infer that $a : A$ and $b : A$.

Implicit generalization. The variables $a$ and $b$ are introduced without quantifiers.¹² We should implicitly generalize the equation as if it was stated under the universal quantifier $\prod_{a,b:A}$.

Notation scopes. The notation $2$ may not refer to any particular number, because the interpretations of notations depend on their scope. We should interpret $2$ as the iterated sum $1 + (1 + 0)$, where $0$, $1$ and $\_ + \_$ refer to the constituents of an arbitrary semiring.

Type classes. The notations referring to the constituents of various algebraic structures are ambiguous. We employ operational type classes, so that we can reuse the relation $\_ = \_$ for $X$ and $Y$, the identity $0$ for $x$, $y$ and the zero integer, the inverse $-\_$ for $f$ and $g$, the operation $\_ + \_$ for $k$, $m$ and integer addition, the identity $1$ for the unit integer and the action $\_ \times \_$ for an integer repetition of the operation.

Implicit arguments. The operations $\_ = \_$, $0$, $1$, $\_ + \_$ and $\_ \times \_$ are not fully applied. There are implicit arguments that we can unambiguously infer to be $\_ =_B \_$, $0_{\mathbb{Z}}$, $1_{\mathbb{Z}}$, $\_ +_{\mathbb{Z}} \_$, $\_ +_A \_$, $\_ +_B \_$, $\_ \times_{\mathbb{Z},A} \_$ and $\_ \times_{\mathbb{Z},B} \_$.

Universe polymorphism. The universe levels of any of the sorts $\mathcal{U}$ are not given. We should be able to choose them in such a way that there will be no inconsistencies, such as Girard’s paradox.

¹¹ Being as terse as possible is generally not a good guideline for writing maintainable code, but that is beside the point.

¹² In type theory, universal and existential quantifiers are proof-relevant, which is why they are written as products and sums, respectively.
After elaboration, the excerpt will have taken roughly the following form. Let

$$\begin{aligned}
& A : \mathcal{U}_0, \quad B : \mathcal{U}_0 \\
& X : A \times A \to \mathcal{U}_{-1}, \quad Y : B \times B \to \mathcal{U}_{-1} \\
& x : A, \quad y : B \\
& f : A \to A, \quad g : B \to B \\
& k : A \times A \to A, \quad m : B \times B \to B \\
& G : \mathrm{IsGrp}_A(X, x, f, k) \\
& H : \mathrm{IsGrp}_B(Y, y, g, m) \\
& h : A \to B \\
& M : \mathrm{IsGrpHom}_{A,B}(X, x, f, k, Y, y, g, m, h).
\end{aligned}$$

We can construct the term

$$s : \prod_{a,b:A} Y(h(k(a, k(b, k(b, x)))), m(h(a), m(h(b), m(h(b), y))))$$

or at least refute its nonexistence.¹³ This form is quite verbose, but consists entirely of pieces we can understand and the kernel can check.
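Some of these mechanisms can be observed on a much smaller statement. The following sketch is not the excerpt itself: it replaces the groups with the integers and natural numbers of the standard library, and the name `excerpt_in_Z` is hypothetical. It only assumes the standard `ZArith` and `Lia` libraries.

```coq
Require Import ZArith Lia.

(* Type inference and notation scopes: the binders a and b carry no
   annotations, and the scope delimiter decides what 2, + and * mean. *)
Check (forall a b, a + 2 * b = a + (b + (b + 0)))%Z.
Check (forall a b, a + 2 * b = a + (b + (b + 0)))%nat.

(* Implicit arguments: printing everything reveals the type argument
   of equality and the functions hiding behind the notations. *)
Set Printing All.
Check (forall a b, a + 2 * b = a + (b + (b + 0)))%Z.
Unset Printing All.

(* The integer reading of the statement is provable by linear
   arithmetic, which mirrors interpreting 2 as 1 + (1 + 0). *)
Example excerpt_in_Z :
  forall a b : Z, (a + 2 * b = a + (b + (b + 0)))%Z.
Proof. intros a b. lia. Qed.
```

Comparing the output of the two `Check` commands in the integer scope gives a rough impression of how much the elaborator fills in, even when no type classes or coercions are involved.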

3.4 Proof Engineering by Example

To show what proof engineering looks like in practice, we shall formalize enough elementary group theory to state and prove the previously discussed excerpt and extract useful computational code from it. Even though groups are an important concept in mathematics and appear in various corners of computational sciences, formalizing them is not particularly novel (Garillot 2011; Gonthier et al. 2013). Instead, the point of this adventure is to demonstrate a proof engineering technique that could be applied to other interesting concepts, such as exterior algebras and differential equations.


Type-Theoretical Model

We start by building a type-theoretical model of groups. The model has two levels that are open to extension and enjoy representation independence principles. The first level is for abstract specifications, describing what it means to be a group. If there are many such specifications, they must all be equivalent. The second level is for concrete representations, exhibiting constructions of particular groups. If any particular group has several different representations, they must all be isomorphic. The model and its levels are sketched in Fig. 2.

¹³ We make a distinction between being able to construct a witness and being able to derive a contradiction from the lack of a witness, because there is no inference rule that would unify these notions in the classical way.

Fig. 2 Type-theoretical model of groups with nodes representing types and arrows functions between them

The design of the model is motivated by category theory. The first level corresponds to the category of type classes (Wadler and Blott 1989), which is just a subcategory of the category of types. The second level corresponds to the category of groups. The levels are related by the instance resolution relation, which you can imagine to be a functor, even though it is not.¹⁴

On the first level of the model, we have various types of classes, including one for Abelian groups, one for commutativity, and two for groups. If we start from the type class for Abelian groups¹⁵ and fully expand all of its constituents, we get


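$$\mathrm{IsAbGrp}_A : (A \times A \to \mathcal{U}) \times A \times (A \to A) \times (A \times A \to A) \to \mathcal{U}$$

$$\begin{aligned}
\mathrm{IsAbGrp}_A(X, x, f, k) \equiv{}
& \underbrace{\Bigl(\prod_{y:A} X(y, y)\Bigr)}_{\mathrm{IsRefl}_A(X)} \times
  \underbrace{\Bigl(\prod_{y,z:A} X(y, z) \to X(z, y)\Bigr)}_{\mathrm{IsSym}_A(X)} \times {} \\
& \Bigl(\prod_{y,z,w:A} X(y, z) \times X(z, w) \to X(y, w)\Bigr) \times {} \\
& \Bigl(\prod_{y,z,w:A} X(k(y, k(z, w)), k(k(y, z), w))\Bigr) \times {} \\
& \Bigl(\prod_{y,z,w,v:A} X(y, z) \to X(w, v) \to X(k(y, w), k(z, v))\Bigr) \times {} \\
& \Bigl(\prod_{y:A} X(k(x, y), y)\Bigr) \times \Bigl(\prod_{y:A} X(k(y, x), y)\Bigr) \times {} \\
& \Bigl(\prod_{y:A} X(k(f(y), y), x)\Bigr) \times \Bigl(\prod_{y:A} X(k(y, f(y)), x)\Bigr) \times {} \\
& \underbrace{\Bigl(\prod_{y,z:A} X(y, z) \to X(f(y), f(z))\Bigr)}_{\mathrm{IsProper}_{1,A}(X, f)} \times
  \underbrace{\Bigl(\prod_{y,z:A} X(k(y, z), k(z, y))\Bigr)}_{\mathrm{IsComm}_A(X, k)}.
\end{aligned}$$

¹⁴ It is not a good exercise for the reader to ponder why the instance resolution relation is not a functor.

¹⁵ In classical mathematics, there is the additional requirement that the carrier must be a set, but we do not include it here.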

As you can see, this type class is just a product of the type class for groups and the type class for commutativity, so we can project either one out with

$$\mathrm{fst} : \prod_{A,B:\mathcal{U}} A \times B \to A \qquad \mathrm{snd} : \prod_{A,B:\mathcal{U}} A \times B \to B.$$


We have written the type class for groups as it appears in universal algebra, stating that there exists such a choice function $f$ that any element $y$ has the unique inverse $f(y)$. If instead we wish to write it as it appears in abstract algebra, stating that any element $y$ has the unique inverse $y^{-1}$, we can transport the type class along¹⁶

$$\mathrm{choice} : \prod_{A,B:\mathcal{U}} \prod_{X : A \times B \to \mathcal{U}} \Bigl(\prod_{y:A} \sum_{z:B} X(y, z)\Bigr) \simeq \Bigl(\sum_{f:A \to B} \prod_{y:A} X(y, f(y))\Bigr).$$

This gives us representation independence with respect to specifications.

¹⁶ Unlike classical choice, which is an axiom, constructive choice is just a theorem.
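That constructive choice is a theorem can be seen by writing both directions of the correspondence as ordinary terms. The following is a minimal sketch under a few assumptions: the names `choice` and `unchoice` are chosen for illustration, the relation is curried as in the Coq listings of this chapter, and the proof that the two maps are mutually inverse is left out.

```coq
(* From a pointwise existence statement to a choice function. *)
Definition choice (A B : Type) (X : A -> B -> Type)
  (H : forall y : A, {z : B & X y z}) :
  {f : A -> B & forall y : A, X y (f y)} :=
  existT (fun f : A -> B => forall y : A, X y (f y))
    (fun y => projT1 (H y))
    (fun y => projT2 (H y)).

(* From a choice function back to a pointwise existence statement. *)
Definition unchoice (A B : Type) (X : A -> B -> Type)
  (H : {f : A -> B & forall y : A, X y (f y)}) :
  forall y : A, {z : B & X y z} :=
  fun y => existT (fun z : B => X y z) (projT1 H y) (projT2 H y).
```

No axioms are needed, because the dependent sum already carries its witness, so the choice function is obtained by projection.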

On the second level of the model, we have representations for various groups, including one for the free group, one for the trivial group and two for the group of integers. One of the integer representations is established over unary integers, which is an inductive type freely generated by

$$0 : \mathbb{Z} \qquad {+}\_ : \mathbb{P} \to \mathbb{Z} \qquad {-}\_ : \mathbb{P} \to \mathbb{Z}$$

$$1 : \mathbb{P} \qquad 1 + \_ : \mathbb{P} \to \mathbb{P}.$$

The other integer representation is established over binary integers, which is another inductive type freely generated by

$$0 : \mathbb{Z}_b \qquad {+}\_ : \mathbb{P}_b \to \mathbb{Z}_b \qquad {-}\_ : \mathbb{P}_b \to \mathbb{Z}_b$$

$$1 : \mathbb{P}_b \qquad 2 \times \_ : \mathbb{P}_b \to \mathbb{P}_b \qquad 1 + 2 \times \_ : \mathbb{P}_b \to \mathbb{P}_b.$$

The representations are isomorphic, as witnessed by encoding and decoding. If we have written a group over $\mathbb{Z}$ and instead wish to have the equivalent group over $\mathbb{Z}_b$, we can transport the structure along $\mathrm{code} : \mathbb{Z} \simeq \mathbb{Z}_b$. This gives us representation independence with respect to carriers.

The trivial group is defined over the unit type, which is an inductive type freely generated by $0 : \mathbf{1}$. The trivial group is the terminal object in the category of groups, so any group can be projected into it with a specialization of the constant map

$$\mathrm{const}(0) : \prod_{A:\mathcal{U}} A \to \mathbf{1}.$$
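The relationship between the two integer representations can be sketched in Coq, where the standard library type `Z` already is the binary representation. In the following, the unary types `P1` and `Z1`, the maps `encode` and `decode` and the transported operation `add1` are hypothetical names, and the round-trip properties are only checked on examples instead of being proven in general.

```coq
Require Import ZArith.

(* Unary positives and integers, following the generators above. *)
Inductive P1 : Type :=
| one : P1
| succ : P1 -> P1.

Inductive Z1 : Type :=
| zero : Z1
| pos : P1 -> Z1
| neg : P1 -> Z1.

Fixpoint encode_pos (p : P1) : positive :=
  match p with
  | one => xH
  | succ q => Pos.succ (encode_pos q)
  end.

(* Peano-style recursion over binary positives. *)
Definition decode_pos (p : positive) : P1 :=
  Pos.peano_rect (fun _ => P1) one (fun _ q => succ q) p.

Definition encode (z : Z1) : Z :=
  match z with
  | zero => Z0
  | pos p => Zpos (encode_pos p)
  | neg p => Zneg (encode_pos p)
  end.

Definition decode (z : Z) : Z1 :=
  match z with
  | Z0 => zero
  | Zpos p => pos (decode_pos p)
  | Zneg p => neg (decode_pos p)
  end.

(* Addition on the unary side, transported from the binary side. *)
Definition add1 (a b : Z1) : Z1 := decode (Z.add (encode a) (encode b)).

Example encode_minus_three : encode (neg (succ (succ one))) = (-3)%Z.
Proof. reflexivity. Qed.

Example decode_encode : decode (encode (pos (succ one))) = pos (succ one).
Proof. reflexivity. Qed.

Example one_minus_two : add1 (pos one) (neg (succ one)) = neg one.
Proof. reflexivity. Qed.
```

Defining `add1` by conjugating with the encoding is a crude form of transport: the operation is written once over binary integers and reused over unary ones.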


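The free group is defined over finite sequences with some restrictions on adjacent elements, making it a bit more elaborate to construct. We shall use lists as our finite sequences, because they are the simplest structure that works.¹⁷ Lists are defined as an inductive type family freely generated by the empty sequence $\varnothing : \prod_{A:\mathcal{U}} A^*$ and a prepending constructor of type $\prod_{A:\mathcal{U}} A \times A^* \to A^*$.

¹⁷ Finger trees would have asymptotically better performance characteristics, but they would also be more complicated to define and reason about (Hinze and Paterson 2006).

It is also customary to define the functions

$$\begin{aligned}
\mathrm{fold} &: \prod_{A:\mathcal{U}} A \times (A \times A \to A) \to (A^* \to A) &
\mathrm{app} &: \prod_{A:\mathcal{U}} A^* \times A^* \to A^* \\
\mathrm{map} &: \prod_{A,B:\mathcal{U}} (A \to B) \to (A^* \to B^*) &
\mathrm{rev} &: \prod_{A:\mathcal{U}} A^* \to A^* \\
\mathrm{zip} &: \prod_{A,B:\mathcal{U}} A^* \times B^* \to (A \times B)^* &
\mathrm{drop} &: \mathbb{N} \to \prod_{A:\mathcal{U}} A^* \to A^*
\end{aligned}$$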


to fold a sequence with a monoid, map a function over a sequence, zip two sequences together, append two sequences, reverse a sequence and drop a finite number of elements from a sequence.¹⁸

Now, suppose $A$ is a discrete type. The free group generated by $A$ is the group defined over $F(A)$, which we decree to be

$$\phi_A : (2 \times A) \times (2 \times A) \to 2 \qquad \phi((i, x), (j, y)) \equiv \neg(i \neq j \wedge x = y)$$

$$F(A) : \mathcal{U} \qquad F(A) \equiv \sum_{s:(2 \times A)^*} \mathrm{fold}(1, \_ \wedge \_, \mathrm{map}(\phi, \mathrm{zip}(s, \mathrm{drop}(1)(s)))).$$

The idea is that $(2 \times A)^*$ is a finite sequence that is well-formed according to the adjacency relation $\phi$. Each $2 \times A$ in the sequence records an element of type $A$ with a flag of type $2$ indicating whether the element is inverted, so two adjacent entries are rejected exactly when they record the same element with opposite flags. We require $A$ to be discrete, so that we can use Hedberg’s theorem¹⁹ to prove that $A$ is a set and $F(A)$ is a subset of $(2 \times A)^*$.

The free group is the initial object in the category of groups, so it can be projected into any other group with the appropriate evaluation map. Suppose $f : A \to \mathbb{Z}$ assigns an integer value to every inhabitant of some discrete type $A$. The free group over $A$ can be projected into the group over $\mathbb{Z}$ with the evaluation map $e(f)$, which we deem

$$\varepsilon(f) : 2 \times A \to \mathbb{Z} \qquad \varepsilon(f)(0, x) \equiv f(x) \qquad \varepsilon(f)(1, x) \equiv -f(x)$$

$$e(f) : F(A) \to \mathbb{Z} \qquad e(f)(s) \equiv \underbrace{\mathrm{fold}(0, \_ + \_, \mathrm{map}(\varepsilon(f), s))}_{\mathrm{sum}}.$$

This gives us representation independence with respect to structures.
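The adjacency check and the evaluation map can be tried out on plain lists before any dependent pairs enter the picture. The following sketch uses only the standard library and differs from the model in two labeled ways: decidable equality is passed in as a boolean function `eqb`, and well-formedness is a boolean test instead of a component of the carrier. The names `phi`, `well_formed`, `eps` and `evaluate` are hypothetical.

```coq
Require Import ZArith List Bool.
Import ListNotations.

(* Adjacent letters are acceptable unless they record the same
   element with opposite inversion flags. *)
Definition phi {A : Type} (eqb : A -> A -> bool) (a b : bool * A) : bool :=
  negb (andb (negb (Bool.eqb (fst a) (fst b))) (eqb (snd a) (snd b))).

(* A word is well-formed when every adjacent pair is acceptable. *)
Definition well_formed {A : Type} (eqb : A -> A -> bool)
  (s : list (bool * A)) : bool :=
  forallb (fun ab => phi eqb (fst ab) (snd ab)) (combine s (skipn 1 s)).

(* A letter evaluates to f x, negated when the flag is set. *)
Definition eps {A : Type} (f : A -> Z) (a : bool * A) : Z :=
  if fst a then Z.opp (f (snd a)) else f (snd a).

(* A word evaluates to the sum of its letters. *)
Definition evaluate {A : Type} (f : A -> Z) (s : list (bool * A)) : Z :=
  fold_right Z.add Z0 (map (eps f) s).

Example reduced_word :
  well_formed Nat.eqb [(false, 1); (false, 2); (true, 1)] = true.
Proof. reflexivity. Qed.

Example unreduced_word :
  well_formed Nat.eqb [(false, 1); (true, 1)] = false.
Proof. reflexivity. Qed.

Example evaluation :
  evaluate Z.of_nat [(false, 42); (true, 13)] = 29%Z.
Proof. reflexivity. Qed.
```

The last example evaluates a two-letter word consisting of one generator and the inverse of another, with the generators mapped to 42 and 13.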


¹⁸ It is a good exercise for the reader to write the definitions, because they are standard in literature on functional programming (Meijer et al. 1991).

¹⁹ The theorem states that every discrete type is a set or, in other words, that every type with decidable equality has unique identity proofs (Kraus et al. 2013). The converse is not true, as the type of infinite bit strings $\mathbb{N} \to 2$ forms a set, but does not admit decidable equality.


Implementation in an Interactive Theorem Prover

Having built a model of groups, we can implement it in an itp. Our implementation is written in Coq and inspired by the operative–predicative type class design (Sozeau and Oury 2008; Spitters and van der Weegen 2011). We mainly diverge from the original design in the way we decouple operational classes from predicative ones, export instances and bind notations. While it would be nice to paint an honest picture of the implementation by showing every little detail, we can only fit so much on this napkin. Despite our best efforts to be terse, we have to omit some parts, as indicated by the discontinuous line numbers and admitted proofs. The curious reader can find the full implementation in the project repository (Kiiskinen 2022).

As we saw on the first level of the model, groups are just monoids with inverses that are proper with respect to the underlying equivalence relation. We can formalize this notion of groups by combining IsMon, IsInvLR and IsProper:

    16 Class IsGrp (A : Type) (X : A -> A -> Prop)
    17   (x : A) (f : A -> A) (k : A -> A -> A) : Prop := {
    18   is_mon :> IsMon X x k;
    19   is_inv_l_r :> IsInvLR X x f k;
    20   is_proper :> IsProper (X ==> X) f;
    21 }.

We could further formalize the notion of Abelian groups by combining IsGrp and IsComm:

    23 Class IsComm (A : Type) (X : A -> A -> Prop)
    24   (k : A -> A -> A) : Prop := comm (x y : A) : X (k x y) (k y x).

These type classes make up the predicative part of the design, because they map types and terms into propositions with no computational content. Other type classes that map types and terms into other types make up the operative part.

Groups have several other properties, each of which can be proven by instantiating the appropriate type class. We set up a context, where the group structure and the notations for its constituents are established through operational classes²⁰:

    32 Section Context.
    33
    34 Context (A : Type) (X : A -> A -> Prop)
    35   (x : A) (f : A -> A) (k : A -> A -> A)
    36   `{!IsGrp X x f k}.
    37
    38 #[local] Instance has_eq_rel : HasEqRel A := X.
    39 #[local] Instance has_null_op : HasNullOp A := x.
    40 #[local] Instance has_un_op : HasUnOp A := f.
    41 #[local] Instance has_bin_op : HasBinOp A := k.

²⁰ The expert reader might wonder why we go through the extra ceremony to declare the operational instances local. We do this, because it makes the parameters of predicative classes independent of operational classes and makes the distinction between projected and derived instances of predicative classes almost opaque to the user.
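The split between operational and predicative classes can be illustrated without any of the project's definitions. The following sketch is a simplification and not the author's code: it declares its own two-class hierarchy under the hypothetical names `HasBinOp` and `IsComm`, uses Leibniz equality in place of an arbitrary relation, and instantiates both classes for the natural numbers of the standard library.

```coq
Require Import Arith.

(* Operational class: carries data, namely a binary operation. *)
Class HasBinOp (A : Type) : Type := bin_op : A -> A -> A.

(* Predicative class: carries a proposition about that data. *)
Class IsComm (A : Type) {k : HasBinOp A} : Prop :=
  comm : forall x y : A, bin_op x y = bin_op y x.

#[export] Instance nat_has_bin_op : HasBinOp nat := Nat.add.

#[export] Instance nat_is_comm : IsComm nat.
Proof. intros x y. exact (Nat.add_comm x y). Qed.

(* The operational instance computes. *)
Example three_plus_four : bin_op 3 4 = 7.
Proof. reflexivity. Qed.

(* The predicative instance is found by instance resolution. *)
Example swap_is_fine (x y : nat) : bin_op x y = bin_op y x.
Proof. exact (comm x y). Qed.
```

In the last example, neither the carrier nor the operation nor the commutativity proof is mentioned explicitly, since all three are recovered from the types of `x` and `y`.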



In this context, we can use the tactic facilities of the metalanguage to show that $-\_$ is injective:

    49 #[export] Instance is_inj : IsInj X f.
    50 Proof.
    51   note.
    52   intros y z a.
    53   rewrite …

    …
    k : A -> A -> A
    H : IsGrp _==_ 0 -_ _+_
    y, z : A
    a : - y == - z
    --------------------------------------
    y == (y + (- y)) + z

Much like inference rules, the proof state is written $\frac{\_}{\_}$ with a hole for the hypotheses and a hole for the goal. It may look daunting, but that is just an illusion caused by its verbosity. We do not need to care about the hypotheses before H, because they are contextual information we can access through the operational classes. We cannot refer to H itself either, because we introduced it without a name on line 36. We could also infer y and z ourselves, because they are mentioned in the next hypothesis. Thus, the hypothesis a and the goal are the only things that really matter.

Moving on, group homomorphisms are functions between groups that preserve their operations and equivalences. We can formalize this notion of group homomorphisms by combining IsGrp, IsBinPres and IsProper:

    59 Class IsGrpHom (A B : Type)
    60   (X : A -> A -> Prop) (x : A) (f : A -> A) (k : A -> A -> A)
    61   (Y : B -> B -> Prop) (y : B) (g : B -> B) (m : B -> B -> B)
    62   (h : A -> B) : Prop := {
    63   dom_is_grp :> IsGrp X x f k;
    64   codom_is_grp :> IsGrp Y y g m;
    65   hom_is_bin_pres :> IsBinPres Y k m h;
    66   hom_is_proper :> IsProper (X ==> Y) h;
    67 }.

We could once again set up a context and show that the function preserves both 0 and −.



Before turning our attention to the second level of the model, we define a tactic called ecrush, which repeatedly tries to solve a goal using artificial intelligence²¹ or split the remaining goals into smaller subgoals that can be dispatched using simple case analyses or known arithmetic lemmas:

    84 Ltac ecrush :=
    85   repeat (try typeclasses eauto; esplit);
    86   hnf in *; repeat match goal with
    87   | |- exists _ : unit, _ => exists tt
    88   | |- forall _ : unit, _ => intros ?
    89   | x : unit |- _ => destruct x
    90   end; eauto with zarith.

This tactic can automatically construct the group of integers from the appropriate constituents:

    148 Module BinaryIntegers.
    149
    150 Module Additive.
    151
    152 #[export] Instance has_eq_rel : HasEqRel Z := eq.
    153 #[export] Instance has_null_op : HasNullOp Z := ….
    154 #[export] Instance has_un_op : HasUnOp Z := Z.opp.
    155 #[export] Instance has_bin_op : HasBinOp Z := Z.add.
    156
    157 #[export] Instance is_grp : IsGrp eq … Z.opp Z.add.
    158 Proof. ecrush. Qed.
    159
    160 End Additive.
    161
    162 End BinaryIntegers.

The same applies to the trivial group. While formalizing the trivial group is completely straightforward, doing the same to the group of integers comes with one notable design consideration. We need to put the operational instances into their own little module, so that we can avoid the ambiguity between the additive monoid making up the group and the multiplicative monoid that could be defined otherwise. With this module structure, either monoid can be imported into a context or both can be brought together to form a semiring. Besides forming a group, integers can act on other groups through repetition. These repetitions appear in additive structures as multiples and in multiplicative structures as powers. We can formalize this notion of repetition as rep:


    164 Section Context.
    165
    166 Context (A : Type)
    167   {X : HasEqRel A} {x : HasNullOp A} {f : HasUnOp A} {k : HasBinOp A}
    168   `{!IsGrp X x f k}.
    169
    170 Equations rep (n : Z) (y : A) : A :=
    171   rep Z0 y := 0;
    172   rep (Zpos n) y := Pos.iter_op _+_ n y;
    173   rep (Zneg n) y := - Pos.iter_op _+_ n y.
    174
    175 End Context.

²¹ When we say artificial intelligence, we mean logic programming, although machine learning can also be used to guide proof search.
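To see what repetition computes, it can be specialized to a group whose operations are already available. The following sketch is an assumption-laden simplification: it fixes the group to the additive integers of the standard library, replaces the Equations definition with a plain match, and uses the hypothetical name `rep_Z`.

```coq
Require Import ZArith.

(* Repetition of integer addition, by cases on the sign of n. *)
Definition rep_Z (n y : Z) : Z :=
  match n with
  | Z0 => Z0
  | Zpos p => Pos.iter_op Z.add p y
  | Zneg p => Z.opp (Pos.iter_op Z.add p y)
  end.

Example rep_three : rep_Z 3%Z 5%Z = 15%Z.
Proof. reflexivity. Qed.

Example rep_minus_two : rep_Z (-2)%Z 7%Z = (-14)%Z.
Proof. reflexivity. Qed.

(* One instance of the homomorphism property of the map rep_Z 3. *)
Example rep_hom_instance :
  rep_Z 3%Z (4 + 6)%Z = (rep_Z 3%Z 4%Z + rep_Z 3%Z 6%Z)%Z.
Proof. reflexivity. Qed.
```

In this additive reading, repetition is just multiplication, while the same definition over a multiplicative structure would compute powers.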

Even though we use additive structures as a motif for the notations, the resulting definition is actually more general, because the notations are forgotten at the end of the context. Now, while repetition is not a group action, its fibers are group homomorphisms, so we could set up a context and show that $n \times \_$ is indeed a group homomorphism for every $n : \mathbb{Z}$.

The free group is constructed in three steps. First, we define the carrier and prove that it behaves as expected. Then, we implement the operations and verify that they preserve the expected behavior. Finally, we show that the carrier and the operations form a group. On every step, we employ an equational reasoning plugin (Sozeau and Mangin 2019) to simplify dealing with partial and recursive definitions. We can formalize the carrier of the free group as the dependent pair free:

    92 Module Free.
    93
    94 Section Context.
    95
    96 Context (A : Type) {e : HasEqDec A}.
    97
    98 Equations wfb_def (a b : bool * A) : bool :=
    99   wfb_def (i, x) (j, y) := decide (~ (i <> j /\ x = y)).
    100
    101 Equations wfb (s : list (bool * A)) : bool :=
    102   wfb s := decide (Forall (prod_uncurry wfb_def) (combine s (skipn 1 s))).
    103
    104 Equations free : Type :=
    105   free := {s : list (bool * A) | wfb s}.

These definitions relate to the model in that Forall is the composition of and and map, prod_uncurry wfb_def is $\phi$, combine is zip and skipn is drop. We can further formalize the operations as the functions null, un and bin.

    113 Equations null : free :=
    114   null := ([]; _).
    115 Next Obligation. ecrush. Qed.
    116
    117 Equations un (x : free) : free :=
    118   un (s; _) := (map (prod_bimap negb id) (rev s); _).
    119 Next Obligation. Admitted.
    120
    121 Equations bin_fix (s t : list (bool * A)) :
    122   list (bool * A) * list (bool * A) by struct t :=
    123   bin_fix [] t := ([], t);
    124   bin_fix s [] := (s, []);
    125   bin_fix (a :: s) (b :: t) with wfb_def a b :=
    126     | true => (a :: s, b :: t)
    127     | false => bin_fix s t.
    128
    129 Equations bin (x y : free) : free :=
    130   bin (s; _) (t; _) with bin_fix (rev s) t :=
    131     | (u, v) => (app (rev u) v; _).
    132 Next Obligation. Admitted.

These definitions relate to the model in that [] is $\varnothing$, _ :: _ is the prepending constructor and prod_bimap is another variation of map. The only thing left to do is to show that we have a group on our hands:

    134 #[export] Instance has_eq_rel : HasEqRel free := eq.
    135 #[export] Instance has_null_op : HasNullOp free := null.
    136 #[export] Instance has_un_op : HasUnOp free := un.
    137 #[export] Instance has_bin_op : HasBinOp free := bin.
    138
    139 #[export] Instance is_grp : IsGrp eq null un bin.
    140 Proof. Admitted.
    141
    142 End Context.
    145
    146 End Free.

Except, we cheat a little bit and admit the proofs on lines 119, 132 and 140 due to their length. Even though the admissions are ordinary proofs by induction, they cannot be automatically discovered by the compiler and, thus, require user interaction. If we wanted to show them here, we would have to expend anything from ten to a hundred lines to write each one.²²

We can now define the projections between our groups and show that they are truly structure-preserving. This is perhaps the most important part of the implementation in terms of practical applications, because it gives us the tools to embed meaningful concepts into meaningless data structures. We can formalize the evaluation map from the free group into the group of integers as eval_Z_add:

    212 Module Initial.
    213
    214 Export Free BinaryIntegers.Additive.
    215
    216 Section Context.
    217
    218 Context (A : Type) {e : HasEqDec A} (f : A -> Z).
    219
    220 Equations eval_Z_add_def (a : bool * A) : Z :=
    221   eval_Z_add_def (false, x) := f x;
    222   eval_Z_add_def (true, x) := - f x.
    223
    224 Equations eval_Z_add (x : free A) : Z :=
    225   eval_Z_add (s; _) := fold_right _+_ 0 (map eval_Z_add_def s).

²² If you really wanted to prove the obligation on line 132, you would have to restate the definition using the inspect pattern, because otherwise your hypotheses would be too weak.

    130   bin (s; _) (t; _) with (bin_fix (rev s) t; eq_refl) :=
    131     | ((u, v); _) => (app (rev u) v; _).

These definitions relate to the model in that e witnesses that $A$ is discrete, eval_Z_add_def is $\varepsilon(f)$ and eval_Z_add is $e(f)$. We can finally prove that the evaluation map is truly a group homomorphism:

    227 #[local] Instance is_grp_hom :
    228   IsGrpHom eq null un bin eq … Z.opp Z.add eval_Z_add.
    229 Proof. Admitted.
    230
    231 End Context.
    232
    233 End Initial.

Once again, we admit the proof on line 229 for the same reason as before. If we wanted to give the same treatment to the trivial group, we would not have to admit anything. We could simply formalize the constant map from any group into the trivial group as const tt and prove its properties with ecrush. This concludes our implementation of the model. Without any of its dependencies, it is 256 lines long with about as much reserved for the admitted proofs. At this point, we could easily prove the previously discussed excerpt in just a few lines by setting up the appropriate context and invoking four well-chosen tactics.²³


²³ It is a good exercise for the reader to daydream about the way to prove the excerpt.

Extraction and Compilation into Machine Code

Once the implementation has passed type checking, we can carry on to extract code from it. The purpose of extraction is to erase all the proofs and translate the remaining pieces into a computationally usable form. This form is typically a module in another programming language, which can be interpreted or compiled into machine code. We can ask Coq to extract an OCaml module from our implementation by using the rules listed in ExtrOcamlBasic and ExtrOcamlZBigInt (lines 254–256):

From Coq Require Import Extraction ExtrOcamlBasic ExtrOcamlZBigInt.

Extraction "" Groups.

We choose OCaml as the extraction language, because Coq itself is implemented in it and, being a dialect of ML, it is familiar and has pleasantly predictable performance characteristics. We can further compile the extracted module into less than 28 KiB of machine code and link it with the standard library to produce an executable less than 1.2 MiB in size. While the size itself is not indicative of good performance, it at least suggests that the result is not bloated.

As a case study, consider taking the abstract group expression ((x × y) × 1) × (y × y)⁻¹, simplifying it into x × y⁻¹, projecting it into the group of integers and evaluating it in a context where x ≡ 42 and y ≡ 13. We can leverage the extracted module to do this as follows:

let main () =
  let open Groups in
  (* e is the expression ((x × y) × 1) × (y × y)⁻¹ in the free group. *)
  let e = Free.bin (=)
    (Free.bin (=) [(false, 'x'); (false, 'y')] (Free.null (=)))
    (Free.un (=) [(false, 'y'); (false, 'y')]) in
  (* g is the context that assigns integers to the generators. *)
  let g = function
    | 'x' -> Big.of_int 42
    | 'y' -> Big.of_int 13
    | _ -> raise Not_found in
  Printf.printf "%s\n" (Big.to_string (Initial.eval_Z_add (=) g e))

This program prints 29 and exits, just as you would expect. It should not surprise anybody that we can do simple arithmetic on small integers, but it might raise some eyebrows that we can bring symbolic manipulation and numerical computation together in such an elegant way. We can simultaneously minimize the burden of numerical computations by doing as many symbolic manipulations as possible and guarantee that the result is correct by writing proofs that are erased during compilation.

Even though the result is guaranteed to be correct, the old adage “garbage in, garbage out” still applies. Since extraction erases all the proofs and we did not explicitly add any assertions, the user is solely responsible for providing valid inputs to the extracted module. In our case study of the most roundabout way to print 29, the expressions x × y and y × y given by the user had to be well-formed or the behavior of the program would have been undefined.²⁴ While it would have been possible to validate inputs before using them, we deliberately avoided doing so, because it would have been unnecessary and potentially detrimental to performance.
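The case study can be cross-checked by hand with a small standalone sketch. It does not use the extracted Groups module. It assumes that a well-formed input is a reduced word (no generator next to its own inverse), uses machine integers instead of big integers, and introduces the hypothetical names is_reduced, mul, inv, eval and checked_mul. The last one shows the kind of input validation that the extracted code deliberately omits.

```ocaml
(* A word in the free group; the flag is true for an inverted generator. *)
type 'a word = (bool * 'a) list

(* A word is reduced when no generator sits next to its own inverse. *)
let rec is_reduced (w : 'a word) : bool =
  match w with
  | (i, x) :: ((j, y) :: _ as rest) ->
    not (x = y && i <> j) && is_reduced rest
  | _ -> true

(* Product of two reduced words: concatenate them and cancel inverse
   pairs where the end of s meets the start of t. *)
let mul (s : 'a word) (t : 'a word) : 'a word =
  let rec go rev_s t =
    match rev_s, t with
    | (i, x) :: rs, (j, y) :: ts when x = y && i <> j -> go rs ts
    | _ -> List.rev_append rev_s t
  in
  go (List.rev s) t

(* Inverse of a word: reverse it and flip every flag. *)
let inv (w : 'a word) : 'a word =
  List.rev_map (fun (i, x) -> (not i, x)) w

(* Evaluate a word in the additive group of integers. *)
let eval (f : 'a -> int) (w : 'a word) : int =
  List.fold_left (fun acc (i, x) -> if i then acc - f x else acc + f x) 0 w

(* Validating variant of mul that rejects inputs that are not reduced. *)
let checked_mul (s : 'a word) (t : 'a word) : ('a word, string) result =
  if is_reduced s && is_reduced t then Ok (mul s t)
  else Error "input word is not reduced"

let () =
  let xy = [ (false, 'x'); (false, 'y') ] in
  let yy = [ (false, 'y'); (false, 'y') ] in
  (* ((x × y) × 1) × (y × y)⁻¹ simplifies to x × y⁻¹. *)
  let e = mul (mul xy []) (inv yy) in
  let g = function 'x' -> 42 | 'y' -> 13 | _ -> raise Not_found in
  assert (e = [ (false, 'x'); (true, 'y') ]);
  Printf.printf "%d\n" (eval g e) (* prints 29 *)
```

The check in checked_mul costs a pass over both inputs on every call, which illustrates why one might prefer to discharge it once as a proof obligation instead.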

²⁴ Undefined behavior is bad, because it means that the program could do literally anything, including quietly producing a wrong result, crashing and setting a nearby printer on fire.

4 Unexploited Opportunities?

We believe proof engineering tools and techniques could truly change the way computational sciences are studied and communicated. In order to explore what this change could look like, we shall compare a traditional way to conduct a research project with a new way we consider worth trying. The comparison is, of course, exaggerated, but that is unavoidable, because nuance would only serve to make it incomprehensible.

If you want to bake a traditional research project in computational sciences, you can follow this recipe:

1. Choose an interesting problem.
2. Use pen and paper to derive an algorithm for solving your problem.
3. Write a FORTRAN or C program²⁵ that implements your algorithm efficiently.
4. Test your program thoroughly.
5. If defects are found, fix them and go back to step 2 or 3. Otherwise, assume that no significant defects remain.
6. Run your program and gather the results.
7. Publish your results and hopefully the source code of your program as well.
8. If your publication is not conclusive, go back to step 2. Otherwise, move on to the next problem.

However, if you want to be more radical, we suggest trying this new recipe instead:

1. Choose an interesting problem.
2. Use Coq to specify an algorithm for solving your problem.
3. Write proofs in Coq to implement and verify the correctness of your algorithm.
4. Mechanically extract OCaml code²⁶ from your implementation and choose one of the following options:
   a. Tune the extraction mechanism of Coq and the optimizer of the OCaml compiler until you can do all your computations using the extracted code.
   b. Link the extracted code with a C or FORTRAN program that can handle the most demanding computations, but do the rest of your computations using the extracted code.
   c. Use the extracted code as a testing oracle (Staats et al. 2011) for another C or FORTRAN program that can do all your computations.
5. Run your program and gather the results.
6. Publish your results and the source code of your program, because you cannot meaningfully separate them anymore.
7. If your publication is not conclusive, go back to step 2. Otherwise, move on to the next problem.

There are good reasons for and against the new recipe, just like there once were about the traditional one.
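Option c of the new recipe can be sketched as follows. This is only an illustration under stated assumptions: the function eval_reference stands in for code extracted from a verified development, eval_fast stands in for the optimized C or FORTRAN program, and polynomial evaluation is a made-up example problem. All names are hypothetical.

```ocaml
(* Integer power by repeated multiplication; slow but obviously right. *)
let rec pow (x : int) (n : int) : int =
  if n = 0 then 1 else x * pow x (n - 1)

(* Oracle: evaluate c0 + c1 * x + c2 * x^2 + ... term by term. *)
let eval_reference (coeffs : int list) (x : int) : int =
  List.fold_left ( + ) 0 (List.mapi (fun i c -> c * pow x i) coeffs)

(* Program under test: the same polynomial by Horner's scheme. *)
let eval_fast (coeffs : int list) (x : int) : int =
  List.fold_right (fun c acc -> c + x * acc) coeffs 0

(* Compare both programs on pseudo-random inputs and return the first
   input on which they disagree, if there is one. *)
let find_counterexample ~(trials : int) : (int list * int) option =
  let state = Random.State.make [| 2023 |] in
  let rec loop k =
    if k = 0 then None
    else begin
      let degree = Random.State.int state 6 in
      let coeffs = List.init degree (fun _ -> Random.State.int state 21 - 10) in
      let x = Random.State.int state 21 - 10 in
      if eval_reference coeffs x = eval_fast coeffs x then loop (k - 1)
      else Some (coeffs, x)
    end
  in
  loop trials

let () =
  match find_counterexample ~trials:1000 with
  | None -> print_endline "no disagreement found"
  | Some (_, x) -> Printf.printf "disagreement at x = %d\n" x
```

The design choice is that only the oracle needs to be trusted. The fast program can be arbitrarily clever, because every disagreement with the oracle points at a defect in it.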

²⁵ When we say FORTRAN or C, we refer to any of their dialects, derivatives or wrappers, such as Fortran 95, C++, Julia or Python.

²⁶ We use Coq to produce OCaml, but you could just as well use Isabelle to produce Scala, Agda to produce Haskell or what have you.

4.1 Reasons to Get Excited

Consider formalizing a general computational method, such as discrete exterior calculus (DEC) (Hirani 2003), the finite element method (FEM) (Thomée 2001) or the discrete element method (DEM) (Williams et al. 1985). At best, such a formalization could allow us to state a boundary value problem as it is written on paper, simplify the statement in a way that is guaranteed to be correct and solve the simplified form either by realizing it through extraction or by feeding it into an existing numerical solver. Any assumptions about the method would be explicit in its specification and proven to be upheld by its implementation, making accidental misuse of the method nearly impossible.²⁷

Besides giving us more correctness guarantees, formalizing a general method would also help us develop it further. The foundations and possible generalizations of DEC are still vague and poorly understood (Kettunen et al. 2021), while FEM and DEM have so many variations that it is difficult to keep track of them all (Belytschko et al. 2009). With luck, a proper formalization could unify some of these methods by revealing them to be special cases of a more general theory.

Beauty and elegance are not the only reasons to pursue the unification of theories. A recurring problem in mathematics is that seemingly novel ideas are found to be old ideas recast (Tai 1994) or even plain wrong (Voevodsky 2014). To make matters worse, these problems become progressively harder to notice as the level of abstraction increases or the clarity of communication deteriorates. Formalization helps us avoid these problems by forcing everyone to uphold a certain standard of rigor and openness in their communications. The emphasis on constructive foundations also pushes everyone towards being explicit about the computational content of their proofs. This philosophy is known to annoy some mathematicians (Bishop 1975), but fits computational scientists like a glove.

If you are still unconvinced that formalizing anything could be a good idea, consider the personal anecdote that, in our experience, it is really fun. Getting an ITP to accept your proof feels like playing a game and truly caresses your sensibilities if you enjoy logic puzzles and functional programming.

²⁷ Unlike people working in software security, we do not need to worry about deliberate misuse, because our only adversary is reality.

4.2 Reasons to Remain Skeptical

Since all prominent ITPs have constructive foundations and it is only possible to extract code from definitions that do not use classical axioms, learning to work with them can be difficult for people who are used to classical objects, such as real numbers or smooth manifolds. It is almost a rite of passage for newcomers to realize how useless real analysis is for computing anything. Computational methods always deal with rational approximations, while real analysis can only tell us about the limit. Books (Bishop and Bridges 1985) and theses (Cruz-Filipe 2004) have been written on constructive analysis, but they are much more intricate and less popular than their classical counterparts.

Even in the realm of constructive mathematics, some things can be tricky to define inside type theory. Equality is one of them (Univalent Foundations Program 2013) and this issue also came up in our investigation of groups. We defined groups with respect to an arbitrary equivalence relation, which made the definition twice as large as it really needed to be, because we had to explicitly establish that − and + indeed preserve equivalences. The proliferation of these kinds of properties eventually leads to a problem that is colloquially known as setoid hell (Barthe et al. 2003); it is a place where you have to manually witness the respectfulness of every operation with respect to every other relation. Univalent type theories, such as homotopy type theory and cubical type theory, alleviate the setoid problem by strengthening the concept of equality, but they are still otherwise quite immature.

Regardless of the type theory that is used, writing formal proofs in an ITP is hard work. While this is unlikely to change in the foreseeable future, more and more common tasks can be automated as time goes on. There are decision procedures for intuitionistic propositional calculus (Dyckhoff 1992, 2018) and quantifier-free fragments of integer arithmetic (Besson 2007) as well as integrated ATPs for Boolean satisfiability problems (Czajka and Kaliszyk 2018). Artificial intelligence also plays a dual role: logic programming can already be used to guide heuristic reasoning (Selsam et al. 2020) and the use of machine learning is being investigated for more intelligent proof search (Komendantskaya et al. 2013).

While performance is not a theoretical obstacle for formalizing computational methods, it is not a trivial concern either. It can be challenging to keep the performance of an ITP at a usable level, as the ambitions of its users and the scale of their developments grow (Gross 2021). Attempts to solve differential equations without relying on code extraction have not been able to reach satisfactory precision (Makarov and Spitters 2013). Extraction is not a silver bullet either, as it takes skill to manage the performance characteristics of extracted code (Paulin-Mohring and Werner 1993; Letouzey 2003, 2008). However, as difficult as software engineering may be, it has been demonstrated that deeply-embedded domain-specific languages are excellent tools for abstract computational problems, such as performing automatic differentiation (Elliott 2018) or designing electronic circuits (Wester 2015). Even in general, these difficulties do not seem to be stopping people from bringing functional programming into computational sciences (Wang and Zhao 2022; Henriksen et al. 2017).
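To give a flavor of what a deeply-embedded domain-specific language looks like, here is a minimal sketch of a one-variable expression language with evaluation and differentiation. It performs plain symbolic differentiation over a syntax tree and is not the categorical construction of Elliott (2018). The type expr and the functions eval and diff are hypothetical names chosen for this illustration.

```ocaml
(* Deep embedding: a program of the language is a syntax tree, so other
   functions can inspect and transform it. *)
type expr =
  | Const of float
  | Var
  | Add of expr * expr
  | Mul of expr * expr

(* One interpretation of the syntax: evaluate at a point x. *)
let rec eval (e : expr) (x : float) : float =
  match e with
  | Const c -> c
  | Var -> x
  | Add (a, b) -> eval a x +. eval b x
  | Mul (a, b) -> eval a x *. eval b x

(* Another interpretation: the derivative with respect to Var, built
   with the sum and product rules. *)
let rec diff (e : expr) : expr =
  match e with
  | Const _ -> Const 0.0
  | Var -> Const 1.0
  | Add (a, b) -> Add (diff a, diff b)
  | Mul (a, b) -> Add (Mul (diff a, b), Mul (a, diff b))

let () =
  (* p(x) = x * x + 3 * x, so p(2) = 10 and p'(2) = 2 * 2 + 3 = 7. *)
  let p = Add (Mul (Var, Var), Mul (Const 3.0, Var)) in
  assert (eval p 2.0 = 10.0);
  assert (eval (diff p) 2.0 = 7.0)
```

Because the derivative is itself an expression of the same language, it can be simplified, differentiated again or handed to a different backend, which is the main attraction of a deep embedding over calling opaque functions.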

5 Conclusions

Proof engineering does not seem to be unfit for computational sciences in any outstanding way. On the contrary, their intersection could actually be quite fruitful. Computational scientists would get new tools and techniques for improving the quality of their programs and proof engineers would benefit from testing their developments in a domain that aligns so well with their philosophy. Collectively, we could get improved confidence in our results, reducing anxiety and shifting some of the burden from reviewers to computers. There are challenges to be overcome, but they all seem to be tractable with the current state of the art. Ideally, we want to see a world where the computer is not just a worker, but a companion.



Acknowledgements

I would like to thank my friends and colleagues Paolo Giarrusso, Suvi Lahtinen and Fabi Prezja for encouragement and minor feedback on this text. I must also acknowledge Pekka Neittaanmäki for showing interest in my research and letting me crash his parties.

References

Angiuli C, Cavallo E, Mörtberg A, Zeuner M (2021) Internalizing representation independence with univalence. Proc ACM Program Lang 5(POPL), Article 12:1–30
Armstrong A, Bauereiss T, Campbell B, Reid A, Gray KE, Norton RM, Mundkur P, Wassell M, French J, Pulte C, Flur S, Stark I, Krishnaswami N, Sewell P (2019) ISA semantics for ARMv8-a, RISC-v, and CHERI-MIPS. Proc ACM Program Lang 3(POPL), Article 71:1–31
Awodey S, Gambino N, Sojakova K (2012) Inductive types in homotopy type theory. In: LICS ’12: proceedings of the 2012 27th annual IEEE/ACM symposium on logic in computer science. IEEE, pp 95–104
Barthe G, Capretta V, Pons O (2003) Setoids in type theory. J Funct Program 13(2):261–293
Bauer A (2020) Answer to the question: What makes dependent type theory more suitable than set theory for proof assistants? MathOverflow.
Bauer A, Gross J, Lumsdaine PL, Shulman M, Sozeau M, Spitters B (2017) The HoTT library: a formalization of homotopy type theory in Coq. In: CPP 2017: proceedings of the 6th ACM SIGPLAN conference on certified programs and proofs. ACM, pp 164–172
Belytschko T, Gracie R, Ventura G (2009) A review of extended/generalized finite element methods for material modeling. Modell Simul Mater Sci Eng 17(4):043001
Besson F (2007) Fast reflexive arithmetic tactics the linear case and beyond. In: Altenkirch T, McBride C (eds) Types for proofs and programs—TYPES 2006. Lecture notes in computer science, vol 4502. Springer, pp 48–62
Bierhoff K, Aldrich J (2007) Modular typestate checking of aliased objects. ACM SIGPLAN Not 42(10):301–320
Bishop E (1975) The crisis in contemporary mathematics. Hist Math 2(4):507–517
Bishop E, Bridges D (1985) Constructive analysis. Springer, Berlin
Bishop S, Fairbairn M, Mehnert H, Norrish M, Ridge T, Sewell P, Smith M, Wansbrough K (2018) Engineering with logic: rigorous test-oracle specification and validation for TCP/IP and the sockets API. J ACM 66(1), Article 1:1–77
Blanchette JC, Fleury M, Lammich P, Weidenbach C (2018) A verified SAT solver framework with learn, forget, restart, and incrementality. J Autom Reason 61(1):333–365
Bove A, Capretta V (2005) Modelling general recursion in type theory. Math Struct Comput Sci 15(4):671–708
Brainerd W (1978) Fortran 77. Commun ACM 21(10):806–820
Buzzard K, Commelin J, Massot P (2020) Formalising perfectoid spaces. In: CPP 2020: proceedings of the 9th ACM SIGPLAN international conference on certified programs and proofs. ACM, pp 299–312
Chen H, Chajed T, Konradi A, Wang S, İleri A, Chlipala A, Kaashoek F, Zeldovich N (2017) Verifying a high-performance crash-safe file system using a tree specification. In: SOSP ’17: proceedings of the 26th symposium on operating systems principles. ACM, pp 270–286
Cockx J, Abel A (2018) Elaborating dependent (co)pattern matching. Proc ACM Program Lang 2(ICFP), Article 75:1–30
Cohen C, Coquand T, Huber S, Mörtberg A (2016) Cubical type theory: a constructive interpretation of the univalence axiom. arXiv:1611.02108
Coquand T (1992) Pattern matching with dependent types. In: Nordström B, Petersson K, Plotkin G (eds) Proceedings of the 1992 workshop on types for proofs and programs. Springer, Berlin, pp 71–84



Coquand T, Huber S, Mörtberg A (2018) On higher inductive types in cubical type theory. In: LICS ’18: proceedings of the 33rd annual ACM/IEEE symposium on logic in computer science. ACM, pp 255–264
Coquand T, Huet G (1988) The calculus of constructions. Inf Comput 76(2–3):95–120
Coquand T, Paulin C (1990) Inductively defined types. In: Martin-Löf P, Mints G (eds) COLOG88: proceedings of the international conference on computer logic. Lecture notes in computer science, vol 417. Springer, Berlin, pp 50–66
Cruz-Filipe L (2004) Constructive real analysis: a type-theoretical formalization and applications. PhD thesis, University of Nijmegen
Czajka Ł, Kaliszyk C (2018) Hammer for Coq: automation for dependent type theory. J Autom Reason 61(1–4):423–453
Dam M, Guanciale R, Khakpour N, Nemati H, Schwarz O (2013) Formal verification of information flow security for a simple ARM-based separation kernel. In: CCS ’13: proceedings of the 2013 ACM SIGSAC conference on computer & communications security. ACM, pp 223–234
Dijkstra EW (1971) On the reliability of programs. EWD 303
Dyckhoff R (1992) Contraction-free sequent calculi for intuitionistic logic. J Symb Log 57(3):795–807
Dyckhoff R (2018) Contraction-free sequent calculi for intuitionistic logic: a correction. J Symb Log 83(4):1680–1682
Elliott C (2018) The simple essence of automatic differentiation. Proc ACM Program Lang 2(ICFP), Article 70:1–29
Garillot F (2011) Generic proof tools and finite group theory. PhD thesis, Ecole Polytechnique X
Geuvers H, Wiedijk F, Zwanenburg J (2002) A constructive proof of the fundamental theorem of algebra without using the rationals. In: Callaghan P, Luo Z, McKinna J, Pollack R (eds) Types for proofs and programs, TYPES 2000, selected papers. Lecture notes in computer science. Springer, Berlin, pp 96–111
Gilbert G, Cockx J, Sozeau M, Tabareau N (2019) Definitional proof-irrelevance without K. Proc ACM Program Lang 3(POPL), Article 3:1–28
Gonthier G (2008) Formal proof—the four-color theorem. Not AMS 55(11):1382–1393
Gonthier G, Asperti A, Avigad J, Bertot Y, Cohen C, Garillot F, Le Roux S, Mahboubi A, O’Connor R, Ould Biha S, Pasca I, Rideau L, Solovyev A, Tassi E, Théry L (2013) A machine-checked proof of the odd order theorem. In: Blazy S, Paulin-Mohring C, Pichardie D (eds) Interactive theorem proving. Lecture notes in computer science, vol 7998. Springer, pp 163–179
Gross JS (2021) Performance engineering of proof-based software systems at scale. PhD thesis, Massachusetts Institute of Technology
Hales TC, Harrison J, McLaughlin S, Nipkow T, Obua S, Zumkeller R (2011) A revision of the proof of the Kepler conjecture. In: Lagarias JC (ed) The Kepler conjecture: the Hales-Ferguson proof. Springer, Berlin, pp 341–376
Henriksen T, Serup NGW, Elsman M, Henglein F, Oancea CE (2017) Futhark: purely functional GPU-programming with nested parallelism and in-place array updates. In: PLDI 2017: proceedings of the 38th ACM SIGPLAN conference on programming language design and implementation. ACM, pp 556–571
Hinze R, Paterson R (2006) Finger trees: a simple general-purpose data structure. J Funct Program 16(2):197–217
Hirani AN (2003) Discrete exterior calculus. PhD thesis, California Institute of Technology
Hu JZS, Carette J (2021) Formalizing category theory in Agda. In: CPP 2021: proceedings of the 10th ACM SIGPLAN international conference on certified programs and proofs. ACM, pp 327–342
Kernighan BW, Ritchie DM (1988) The C programming language, 2nd edn. Prentice Hall
Kettunen L, Lohi J, Räbinä J, Mönkölä S, Rossi T (2021) Generalized finite difference schemes with higher order Whitney forms. ESAIM: Math Model Numer Anal 55(4):1439–1460
Kiiskinen S (2022) Discrete exterior zoo. GitHub.



Klein G, Andronick J, Elphinstone K, Murray T, Sewell T, Kolanski R, Heiser G (2014) Comprehensive formal verification of an OS microkernel. ACM Trans Comput Syst 32(1), Article 2:1–70
Komendantskaya E, Heras J, Grov G (2013) Machine learning in proof general: interfacing interfaces. In: Kaliszyk C, Lüth C (eds) Proceedings 10th international workshop on user interfaces for theorem provers, UITP 2012. EPTCS, vol 118, pp 15–41
Kraus N, Escardó M, Coquand T, Altenkirch T (2013) Generalizations of Hedberg’s theorem. In: Hasegawa M (ed) Typed lambda calculi and applications: 11th international conference TLCA 2013, proceedings. Lecture notes in computer science, vol 7941. Springer, Berlin, pp 173–188
Leroy X (2009) Formal verification of a realistic compiler. Commun ACM 52(7):107–115
Letouzey P (2003) A new extraction for Coq. In: Geuvers H, Wiedijk F (eds) Types for proofs and programs: international workshop TYPES 2002, selected papers. Lecture notes in computer science. Springer, Berlin, pp 200–219
Letouzey P (2008) Extraction in Coq: an overview. In: Beckmann A, Dimitracopoulos C, Löwe B (eds) Logic and theory of algorithms: 4th conference on computability in Europe CiE 2008, proceedings. Lecture notes in computer science. Springer, Berlin, pp 359–369
Liu J, Zhan B, Wang S, Ying S, Liu T, Li Y, Ying M, Zhan N (2019) Formal verification of quantum algorithms using quantum Hoare logic. In: Dillig I, Tasiran S (eds) Computer aided verification: CAV 2019. Lecture notes in computer science. Springer, Berlin, pp 187–207
Makarov E, Spitters B (2013) The Picard algorithm for ordinary differential equations in Coq. In: Blazy S, Paulin-Mohring C, Pichardie D (eds) Interactive theorem proving: ITP 2013. Lecture notes in computer science, vol 7998. Springer, Berlin, pp 463–468
Martin-Löf P (1998) An intuitionistic theory of types. In: Sambin G, Smith J (eds) Twenty-five years of constructive type theory. Oxford University Press, pp 127–172
Meijer E, Fokkinga M, Paterson R (1991) Functional programming with bananas, lenses, envelopes and barbed wire. In: Hughes J (ed) Functional programming languages and computer architecture: 5th ACM conference, proceedings. Lecture notes in computer science. Springer, Berlin, pp 124–144
Moler C (2000) MATLAB incorporates LAPACK. Technical article, MathWorks
O’Connor R (2005) Essential incompleteness of arithmetic verified by Coq. In: Hurd J, Melham T (eds) Theorem proving in higher order logics, TPHOLs 2005. Lecture notes in computer science, vol 3603. Springer, Berlin, pp 245–260
Paulin-Mohring C, Werner B (1993) Synthesis of ML programs in the system Coq. J Symb Comput 15(5–6):607–640
Pfenning F, Paulin-Mohring C (1990) Inductively defined types in the calculus of constructions. In: Main M, Melton A, Mislove M, Schmidt D (eds) Mathematical foundations of programming semantics: 5th international conference, proceedings. Lecture notes in computer science, vol 442. Springer, Berlin, pp 209–228
Pierce BC (2009) Lambda, the ultimate TA: using a proof assistant to teach programming language foundations. ACM SIGPLAN Not 44(9):121–122
Ringer T, Palmskog K, Sergey I, Gligoric M, Tatlock Z (2019) QED at large: a survey of engineering of formally verified software. Found Trends Program Lang 5(2–3):102–281
Selsam D, Liang P, Dill DL (2017) Developing bug-free machine learning systems with formal mathematics. In: Precup D, Teh YW (eds) Proceedings of the 34th international conference on machine learning. Proceedings of Machine Learning Research, vol 70. PMLR, pp 3047–3056
Selsam D, Ullrich S, de Moura L (2020) Tabled type class resolution. arXiv:2001.04301
Sergey I, Wilcox JR, Tatlock Z (2018) Programming and proving with distributed protocols. Proc ACM Program Lang 2(POPL), Article 28:1–30
Sozeau M, Boulier S, Forster Y, Tabareau N, Winterhalter T (2020) Coq Coq correct! verification of type checking and erasure for Coq, in Coq. Proc ACM Program Lang 4(POPL), Article 8:1–28
Sozeau M, Mangin C (2019) Equations reloaded: high-level dependently-typed functional programming and proving in Coq. Proc ACM Program Lang 3(ICFP), Article 86:1–29



Sozeau M, Oury N (2008) First-class type classes. In: Mohamed OA, Munoz C, Tahar S (eds) Theorem proving in higher order logics: 21st international conference TPHOLs 2008, proceedings. Lecture notes in computer science, vol 5170. Springer, Berlin, pp 278–293
Spitters B, van der Weegen E (2011) Type classes for mathematics in type theory. Math Struct Comput Sci 21(4):795–825
Staats M, Whalen MW, Heimdahl MPE (2011) Programs, tests, and oracles: the foundations of testing revisited. In: Taylor RN (ed) ICSE ’11: proceedings of the 33rd international conference on software engineering. ACM, pp 391–400
Tai MM (1994) A mathematical model for the determination of total area under glucose tolerance and other metabolic curves. Diabetes Care 17(2):152–154
Thomée V (2001) From finite differences to finite elements: a short history of numerical analysis of partial differential equations. J Comput Appl Math 128(1–2):1–54
Tourret S, Blanchette J (2021) A modular Isabelle framework for verifying saturation provers. In: CPP 2021: proceedings of the 10th ACM SIGPLAN international conference on certified programs and proofs. ACM, pp 224–237
Univalent Foundations Program (2013) Homotopy type theory: univalent foundations of mathematics. Univalent Foundations
van der Walt S, Colbert SC, Varoquaux G (2011) The NumPy array: a structure for efficient numerical computation. Comput Sci Eng 13(2):22–30
Vezzosi A, Mörtberg A, Abel A (2021) Cubical Agda: a dependently typed programming language with univalence and higher inductive types. J Funct Program 31(e8):1–47
Voevodsky V (2011) Univalent foundations of mathematics. In: Beklemishev LD, de Queiroz R (eds) Logic, language, information and computation: 18th international workshop WoLLIC 2011, proceedings. Lecture notes in computer science, vol 6642. Springer, Berlin
Voevodsky V (2014) The origins and motivations of univalent foundations: a personal mission to develop computer proof verification to avoid mathematical mistakes. Inst Lett Summer 8–9
Wadler P (2015) Propositions as types. Commun ACM 58(12):75–84
Wadler P, Blott S (1989) How to make ad-hoc polymorphism less ad hoc. In: POPL ’89: proceedings of the 16th ACM SIGPLAN-SIGACT symposium on principles of programming languages. ACM, pp 60–76
Wang L, Zhao J (2022) OCaml scientific computing: functional programming meets data science. University of Cambridge. In progress
Watt C (2018) Mechanising and verifying the web assembly specification. In: CPP 2018: proceedings of the 7th ACM SIGPLAN international conference on certified programs and proofs. ACM, pp 53–65
Wester R (2015) A transformation-based approach to hardware design using higher-order functions. PhD thesis, University of Twente
Williams J, Hocking G, Mustoe G (1985) The theoretical basis of the discrete element method. In: Middleton J, Pande GN (eds) NUMETA 85: proceedings of the international conference on numerical methods in engineering. Theory and applications. Balkema, pp 897–906
Woos D, Wilcox JR, Anton S, Tatlock Z, Ernst MD, Anderson T (2016) Planning for change in a formal verification of the Raft consensus protocol. In: CPP 2016: proceedings of the 5th ACM SIGPLAN conference on certified programs and proofs. ACM, pp 154–165

Challenges for Mantle Convection Simulations at the Exa-Scale: Numerics, Algorithmics and Software

Marcus Mohr, Ulrich Rüde, Barbara Wohlmuth, and Hans-Peter Bunge

Abstract Upcoming supercomputers provide theoretical compute performance that is hard to imagine, but even harder to exploit to its full potential. The potential of extreme-scale computing forces us to reconsider classical approaches for finite-element-based scientific computing. In this contribution, we provide some ideas in that direction using the problem of modeling the dynamics of the Earth’s mantle as an example. Resolving the mantle with a global resolution of 1 km results in more than 10¹² mesh cells. Thus, an associated convection model will require solvers capable of handling correspondingly large systems resulting from the discretization of the underlying Stokes problem. Multigrid algorithms are asymptotically optimal so that they can deliver the scalability needed for systems of such size. Here a monolithic multigrid method is employed based on computationally inexpensive Uzawa-type smoothers. Furthermore, advanced techniques must be employed to reduce memory consumption, data transport, and parallel communication overhead. This can be achieved by matrix-free methods that employ polynomial surrogates to compute approximations for the matrix coefficients on the fly. The combination of these techniques leads to highly efficient methods, as can be shown by a suitable performance analysis that considers both the algorithmic efficiency and the specific parallel efficiency of a given implementation.

Keywords Mantle convection simulation · Surrogate techniques · HPC · Exa-scale PDE solvers

M. Mohr (✉) · H.-P. Bunge
Department of Earth and Environmental Sciences, Ludwig-Maximilians-Universität München, München, Germany

U. Rüde
Department of Computer Science 10, Friedrich-Alexander-University Erlangen-Nuremberg, Erlangen, Germany
CERFACS, Toulouse, France

B. Wohlmuth
Institute for Numerical Mathematics, Technische Universität München, München, Germany

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
P. Neittaanmäki and M.-L. Rantalainen (eds.), Impact of Scientific Computing on Science and Society, Computational Methods in Applied Sciences 58,



M. Mohr et al.

1 Introduction

Earth mantle convection is a fundamental geophysical process responsible for many phenomena shaping our planet. To study its dynamics, computational models are mandatory. However, a numerical simulation of the planet requires spatial and temporal scales that classify the computational task as a grand challenge problem, for which even leading-edge supercomputers reach their limits. Progress in this field of computational science depends on sophisticated physical models, advanced mathematical insight, and exa-scale techniques. These fields must be brought together in an interdisciplinary co-design effort. This article reports on a few of the resulting challenges. As examples, we address matrix-free approaches, the scalability of multigrid solvers and different levels of parallelism. Exploiting structure whenever possible motivates the use of hybrid meshes, which retain the flexibility of unstructured meshes and allow building uniform substructures that can then be addressed more efficiently. Only if all these levels of concurrency interoperate efficiently can sufficient performance be achieved.

2 Geophysics

We start our contribution with a brief introduction to the importance and challenges of simulating convection inside the Earth’s mantle, see, e.g., Bauer et al. (2020). However, the numerical techniques and thoughts on software and algorithm design we present in Sects. 3 and 4 are relevant for any grand challenge application.

2.1 Significance and Physics of Mantle Convection

The Earth’s interior to a first approximation has a simple shell structure. The outermost shell is the crust, followed by the mantle. Below lies the liquid outer core and at the center the solid inner core. Of these shells the mantle occupies by far the largest volume. It is also the source of all large-scale tectonic activity of our planet. The rise and descent of continents above and below sea level, the upthrust of large mountain belts such as the Himalayas, the Andes, and the Alps, the formation of the Hawaiian Islands chain or the Mariana Trench, the devastating 2011 Tōhoku earthquake, and the volcanic activity of the Pacific Ring of Fire are but a few examples of natural phenomena that originate from dynamic processes within the Earth’s mantle. For a review on how the mantle drives tectonic Earth processes, see Davies and Richards (1992).

Challenges for Mantle Convection Simulations at the Exa-Scale: …


Fig. 1 Four view angles of a simple mantle convection model (see, e.g., Schuberth et al. 2009); the temperature is 600 K lower in cold regions (blue isosurfaces) and 450 K higher in hot regions (red isosurfaces) compared to the layer average

The mantle is composed of solid rocks, yet it is not completely rigid. Solid state processes at the atomic scale, such as dislocation and diffusion creep, allow its material to move. The driving forces behind this motion are thermo-chemical density differences. Specifically, heat producing elements within the mantle and the core, together with primordial heat stored from the time of the accretion of the planet, act as heat sources. Consequently, mantle material grows warm, becomes buoyant and rises to the surface. There it cools and sinks again, allowing for the evacuation of heat deeply buried within the planet. A 3-D spherical computer model of mantle convection is presented in Fig. 1. The structure of the model is dominated by local hot upwellings that originate from the core-mantle boundary and spread out laterally beneath the Earth’s surface. Cold downwellings are prevalent in the Western Pacific due to the numerous subduction zones located in that region. The convection process happens on geologic time-scales. An overturn of the mantle requires about 100 million years. The most commonly known expression of this process is the movement of the tectonic plates, which occurs at a speed on the order of 1 cm/year. Given the vital importance of mantle convection to our planet, it is essential to gain a detailed understanding of the convection process. As a fluid dynamics problem, the mathematical-physical model to describe mantle convection is well established. But key material parameters remain poorly known. Two crucial ones are the rheology of the mantle, that is, the precise stress-strain rate response of the material, and aspects of its composition. A formidable hindrance to improving our knowledge of mantle dynamics is that we cannot observe the convection process directly. While the mantle extends to a depth of about 2871 km, the deepest boreholes barely reach the Mohorovičić discontinuity.
The latter separates the mantle from the Earth’s crust at a depth of about 30 km. At the same time, it is exceedingly difficult to reproduce the enormous pressures and temperatures of the mantle by laboratory experiments. Crucially, the slow mantle strain rates ($10^{-12}$–$10^{-15}$ s$^{-1}$) remain outside of any feasible laboratory experiments. This is where simulations and computer experiments come into play. The mathematical model for mantle convection consists of conservation equations for momentum


M. Mohr et al.

Table 1 Physical quantities and their representing symbols

Symbol   Quantity
ρ        Density
u        Velocity
p        Pressure
g        Gravitational acceleration
η        Dynamic viscosity
σ        Cauchy stress tensor
ε̇        Rate of strain tensor
T        Temperature
e        Internal energy
q        Heat flux per unit area
H        Heat production rate
α        Thermal expansivity
c_p      Specific heat capacity
κ        Heat conductivity
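
$$0 = \operatorname{div}\left[\eta\left(\nabla u + \nabla u^{\top} - \tfrac{2}{3}(\operatorname{div} u)\,I\right)\right] - \nabla p - \rho g, \tag{1}$$

mass

$$0 = \operatorname{div}(\rho u), \tag{2}$$

and energy

$$0 = c_p \rho \frac{\partial T}{\partial t} - \alpha T \frac{\partial p}{\partial t} + c_p \rho\, u \cdot \nabla T - \alpha T\, u \cdot \nabla p + \operatorname{div}(\kappa \nabla T) - H - \sigma : \dot{\varepsilon}, \tag{3}$$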

which are complemented by an Equation of State (EOS) connecting density to pressure and temperature (for an extended derivation of the conservation equations and the EOS, see Jarvis and McKenzie 1980). Physical quantities and their representing symbols are given in Table 1. Note that (3), as usual, is given in terms of temperature and that (2) neglects temporal changes in density. This is a justifiable modeling assumption and avoids problems with stability. A crucial point is that in reality the dynamic viscosity η depends on temperature and pressure, but also on the velocity itself. If incorporated, the latter aspect strongly enhances the non-linearity of the problem. The first simulation codes date back to the 1980s, see, e.g., Baumgardner (1985), and over the last three decades numerical experiments have greatly enhanced our qualitative understanding of mantle convection, see, e.g., Zhong et al. (2015) and references therein. However, the interest in ever more highly resolved and physically advanced models led to the development of a new generation of codes based on state-of-the-art numerical techniques suitable for massively parallel architectures, see, e.g., Burstedde et al. (2012), Heister et al. (2017), Kronbichler et al. (2012), Rudi et al. (2015).



2.2 Challenges and Future Directions Earth scientists possess a rich collection of geologic data from the past of our planet, such as indicators of vertical surface motions, so-called dynamic topography (Hager et al. 1985). This proxy record of past mantle dynamics has motivated the development of adjoint-based approaches for validation. Such techniques were pioneered for numerical weather prediction by Le Dimet and Talagrand (1986) and make it possible to employ gradient-based optimization techniques. Taking estimates of the current mantle state as a terminal condition, one solves an inverse problem to construct an optimal flow history going back millions of years (Horbach et al. 2014; Ismail-Zadeh et al. 2004). Results of forward simulations started from this optimized starting condition can then be validated against geologic data not included in the inversion, typically the dynamic topography. One might argue that this inverse problem is very ill-posed, since the mathematical-physical model of convection includes an equation for conservation of energy, which in turn has a diffusive component, a physically irreversible process. However, advection is the dominant form of energy transport in most parts of the mantle, as expressed by a large value of the Peclet number, with diffusion being of significantly less importance. One might also argue that mantle convection is a chaotic process. This would seemingly rule out any construction of robust flow time trajectories (Bello et al. 2014). But this issue is overcome if one accounts for the history of the horizontal surface velocity field in the form of Dirichlet boundary conditions (Colli et al. 2015; Horbach et al. 2014). The latter is known from geophysical observations. However, two major challenges remain. The first is that the current state of the mantle is not precisely known either. The thermal state required for the inversion is only available as a result of seismic tomography.
The latter is in itself the solution to an inverse problem, and only gives a blurred picture of the internal Earth structure. Additionally, seismic tomography does not provide temperature directly, but yields results for proxy quantities such as seismic velocities of compressional and shear waves, which must be converted using mineralogical models. The second challenge is a computational one. Already the forward problem of mantle convection can become extremely costly. As an example, compare the thickness of the mantle of around 3000 km to that of dynamically relevant features such as the thermal front associated with a rising plume, which is on the order of a single kilometer. Resolving such features requires highly resolved meshes, see also Sect. 4. Additionally, fine meshes (at least locally) necessitate short time-step sizes. Furthermore, for a retrodiction this problem must be solved not just once, but several times in a forward-backward iteration for the adjoint method (Horbach et al. 2014). Finally, the systematic treatment of uncertainties may further increase the computational effort (Colli et al. 2020). In order to ‘sate this hunger’ for computational performance one can resort to ever-larger supercomputers. With the first exa-scale systems anticipated for the immediate future, this seems feasible. However, standard techniques for a parallel solution of Partial Differential Equations (PDEs) cannot simply be transferred and scaled.
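The claim that advection dominates diffusion can be made concrete with a rough Peclet number estimate. The following is a minimal sketch: the plate speed (1 cm/year) and mantle depth (2871 km) are taken from Sect. 2.1, while the thermal diffusivity of 1e-6 m²/s is an assumed, commonly quoted order of magnitude that does not appear in the text. The function name is illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def peclet_number(velocity, length, diffusivity):
    """Ratio of advective to diffusive transport, Pe = u * L / kappa."""
    return velocity * length / diffusivity


# Values from Sect. 2.1: plate speed of about 1 cm/year, mantle depth 2871 km.
plate_speed = 0.01 / SECONDS_PER_YEAR   # m/s
mantle_depth = 2871e3                   # m

# Assumption (not from the text): thermal diffusivity of order 1e-6 m^2/s.
thermal_diffusivity = 1e-6              # m^2/s

pe = peclet_number(plate_speed, mantle_depth, thermal_diffusivity)
print(f"Peclet number estimate: {pe:.0f}")
```

Under these assumptions the estimate is of order $10^{3}$, i.e. much larger than one, which is consistent with the statement that diffusion plays a minor role in most parts of the mantle.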



3 Mathematics Legacy finite element software is typically quite versatile with respect to complex non-linear applications, but often falls short with respect to its performance on modern supercomputer systems. In the following subsections, we briefly address some numerical approaches and discuss algorithmic challenges in the co-design of exascale enabled PDE solvers.

3.1 Basic Components of Finite Element Simulations The concept of finite elements (FE), including discontinuous Galerkin (DG) and isogeometric analysis (IGA) approaches, is extremely flexible with respect to variable coefficient PDEs and geometries. From the early days of finite elements until today, much work has been devoted to the definition of basis functions, e.g., $C^k$- or $H(\mathrm{div})$-conforming elements, to fast iterative solvers, e.g., multigrid and domain decomposition, and to error analysis, e.g., a priori or a posteriori error estimates. Of crucial importance in the a priori analysis is the influence of so-called variational crimes. These typically result from applying a quadrature formula to evaluate the bilinear form or from an inexact representation of the geometry. It is well known that the total error can be decomposed into different components such as best approximation error, consistency error and the error in the iterative solver. The first one is mainly controlled by our choice of the FE space and the mesh resolution, and the third one depends on the stopping criteria in the iteration. In the traditional FE approach, the possible consistency error is mainly reduced to local considerations on the element level or on jumps across the faces. It contributes to the total error, in most cases, with at least the same order and by the same mesh size as the best approximation error. Very often the different error components are addressed individually. However, high-performance modern simulation technologies require that all these components are closely interlinked. The classical performance metric of error per degree of freedom is too simplistic and loses its relevance. This can be clearly seen, e.g., in the IGA context, where the smaller number of degrees of freedom required for some given tolerance comes at the price of larger supports of the basis functions and therefore higher cost in assembly and solver.
The performance metric of run-time over accuracy, which is more relevant from a user perspective, is extremely delicate to apply and depends tremendously on the implementation itself. Floating-point performance, parallel efficiency, memory consumption and memory traffic, including both cheap and expensive data access, are gaining more and more relevance also in finite element simulations. The so-called roofline analysis and execution-cache-memory performance models (Stengel et al. 2015) give deeper insight, see also Sect. 4. Matrix-free realizations of finite elements come at the price of a reassembly step each time access to the matrix is required. This is needed, e.g., when a matrix-vector multiplication must be executed as a step within an iterative solver. Thus, it is not surprising that stored sparse matrix/stencil approaches show a lower arithmetic intensity (FLOPs/Byte) than matrix-free realizations. However, the former typically also exhibit a lower performance (GFLOPs/s), see the discussions in Kronbichler and Kormann (2012), Drzisga et al. (2020).
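The link between arithmetic intensity and attainable performance can be sketched with the basic roofline bound, which limits performance by the smaller of peak compute rate and memory bandwidth times arithmetic intensity. The machine parameters and the two arithmetic intensities below are hypothetical values chosen only for illustration; they are not measurements from the cited works.

```python
def roofline(arithmetic_intensity, peak_gflops, bandwidth_gbytes):
    """Upper bound on performance (GFLOP/s) in the basic roofline model."""
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbytes)


# Hypothetical compute node: 1000 GFLOP/s peak, 100 GByte/s memory bandwidth.
PEAK = 1000.0
BANDWIDTH = 100.0

# Hypothetical arithmetic intensities in FLOPs/Byte (illustrative only).
kernels = {"stored sparse matrix": 0.125, "matrix-free": 4.0}

attainable = {name: roofline(ai, PEAK, BANDWIDTH) for name, ai in kernels.items()}
for name, gflops in attainable.items():
    print(f"{name}: at most {gflops} GFLOP/s")
```

In this simplified picture a kernel with low arithmetic intensity is bound by memory bandwidth, so a higher intensity raises the attainable ceiling. Whether it also shortens the run-time depends on how many additional operations the matrix-free variant has to perform.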

3.2 Two-Scale Surrogate In most traditional software frameworks, the entries of the stiffness matrix are built by local assembly steps over the individual elements of the mesh. To do so, each element is locally mapped onto a reference element and then suitable quadrature formulas are applied. This allows for completely unstructured meshes and imposes only minimal restrictions on the mesh type. However, within large scale applications in a matrix-free context, it comes at the price of high computational cost as no structure can, in general, be exploited. This observation motivates the use of so-called hybrid grids (Bergen et al. 2007; Kohl et al. 2019; Gmeiner et al. 2015b; Mayr et al. 2021) in combination with stencil paradigms well-established in the finite difference context for large scale simulations. Hybrid grids use a fine and a coarse mesh, whose elements are denoted by $t$ and $T$, respectively. The coarse mesh can be completely unstructured but conforming and offers all the flexibility of modern finite element approaches. The fine grid is then a block-structured mesh. Typically it is obtained by successive uniform refinement from the coarse one, where different levels of refinement can be used on the individual coarse mesh elements. Let the fine mesh element $t$ be a subset of the coarse mesh element $T$; then its mesh size is $h_t = 2^{-\ell_T} H_T$, where $\ell_T$ is the refinement depth of $T$. The coarse mesh elements are also called macro elements. If all the macro-elementwise defined refinement levels are the same, we obtain a globally conforming mesh. Otherwise we can use ideas from mortar finite elements or from DG to realize the coupling. What is important for what follows is that the local fine mesh size $h_t$ is the same for all $t \subset T$ and that $h_t$ is considerably smaller than $H_T$. Modern parallel software frameworks, as discussed in Sect. 4, make it possible to run simulations with billions and even trillions of unknowns.
This offers the option to exploit the attractive features of hybrid meshes. A coarse mesh with a million elements can often represent even complex geometries and can resolve discontinuities in the PDE coefficients. Since we have a completely uniform refinement structure within each macro element, the most efficient way to represent the submatrix associated with the inner degrees of freedom is a stencil data structure. If the coefficients of the PDE restricted to a macro element are constant and the macro element is not further mapped, all stencils within the macro element associated with the same node type are the same. However, in the more general case of variable coefficients, or if the coarse mesh only approximates the geometry, we have to work with variable stencil values.
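The relation between refinement depth, fine mesh size and element count can be sketched as follows. The sketch assumes tetrahedral macro elements, each of which is split into 8 children per uniform refinement step, and a macro mesh of one million elements with unit macro mesh size; the function name and these input values are illustrative.

```python
def fine_mesh(n_macro, macro_size, level, children_per_refinement=8):
    """Fine mesh size and element count after `level` uniform refinements.

    Assumes every macro element is refined to the same depth and that one
    refinement step splits a tetrahedron into 8 children.
    """
    h = macro_size * 2.0 ** (-level)
    n_fine = n_macro * children_per_refinement ** level
    return h, n_fine


# Illustrative input: one million macro elements of unit size.
for level in (6, 7, 8):
    h, n_fine = fine_mesh(10**6, 1.0, level)
    print(f"level {level}: H/h = {1.0 / h:.0f}, fine elements = {n_fine:.2e}")
```

Under these assumptions, a macro mesh of $10^{6}$ elements reaches more than $10^{12}$ fine elements at refinement depth 7, where the macro mesh size is 128 times the fine mesh size.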



Fig. 2 a Diagonal entry of the stiffness matrix for α = 1/2 as a discrete function over the fine mesh nodes; b–d approximation of this function by a polynomial of degree 0 (b), 3 (c) and 7 (d)

Table 2 Maximal relative error between diagonal entries of the stiffness matrix computed by standard FEM assembly and their approximations by a surrogate polynomial on ℓ = 6 for the variable diffusion coefficient (4) with α = 1/2 (left) and α = 3/2 (right)

α = 1/2
Polynomial degree   Maximal rel. error
                    Sample set ℓ = 4
0                   1.764178 × 10^0
1                   3.423482 × 10^-1
2                   2.862021 × 10^-1
3                   3.421032 × 10^-2
4                   2.511660 × 10^-2
5                   1.749228 × 10^-3

α = 3/2
Polynomial degree   Maximal rel. error
                    Sample set ℓ = 4     Sample set ℓ = 5
4                   1.676971 × 10^0      1.104045 × 10^0
5                   6.636050 × 10^-1     3.073813 × 10^-1
6                   6.237671 × 10^-1     2.769425 × 10^-1
7                   1.312685 × 10^-1     3.655171 × 10^-2
8                   1.366308 × 10^-1     3.633225 × 10^-2
9                   1.890241 × 10^-2     3.011967 × 10^-3

In Fig. 2a, we illustrate the effect of a variable diffusion coefficient, given by    1 y 1 1 6 + sin π + + cos (αxπ ) k(x, y) = 50 10 2 10 10


on the diagonal entry of the stiffness matrix over a macro element $T$. The diagonal entry associated with each fine mesh node is plotted as a discrete function over $T$; we can clearly see that it is non-constant but varies smoothly over the nodes. The approximation of the discrete stencil function (in red) by a polynomial (in blue) of increasing order is shown from left to right. While a constant stencil introduces quite a large consistency error, see Fig. 2b, a polynomial of degree 7, see Fig. 2d, yields an excellent approximation. In Table 2, we report the maximal relative error between the diagonal matrix entries and their approximations by surrogate polynomials for two different choices of α. The coefficients of the polynomials are determined by least-squares fitting against data obtained from sampling the diagonal matrix entries at the points of a coarser mesh. Again the results demonstrate the convergence of the surrogates towards the discrete stencil function. These observations motivate us to replace the traditional assembly strategy by a two-step procedure. Firstly, on each macro element $T$, we evaluate the stencil at a small number of fine mesh nodes $x_n \in S_T$ in the traditional way by element-wise assembly. Secondly, we approximate the discrete stencil function by a surrogate which is defined on each macro element separately. Here, we select surrogates which are, in each stencil function component, a polynomial of the same degree. For low order conforming finite elements in a three-dimensional setting, the stencil function has 15 components (weights) and, thus, 15 surrogate polynomials per macro element have to be computed and their coefficients stored. These surrogate polynomials can be computed cheaply, e.g., by standard nodal Lagrange interpolation or by a linear regression exploiting information on a third mesh of considerably lower complexity than the fine mesh. We note that the choice of $S_T$ depends on how we define our surrogate polynomial. We refer to Bauer et al. (2019) for an introduction of the two-scale surrogate in the context of large scale simulations. In such a setting, surrogate polynomials of order 2 or 3 are already sufficient in combination with low-order finite elements to result in a total error that is not dominated by the consistency error. This procedure introduces a two-scale finite element approach with a consistency error that scales with the macro mesh size as $H^{q+1}$, where $q \ge 1$ is the polynomial order of the surrogate and the constant depends on the higher derivatives of the diffusion coefficient. We recall that the best approximation error in the $H^1$-norm typically scales with $h^p$, where $h$ is the fine mesh resolution and $p$ is the polynomial order of the finite element space. It can be shown (Drzisga et al. 2019) that for the total error in the $H^1$-norm an a priori bound of the form

$$c\left(h^{p}\,\|u\|_{p+1} + H^{q+1}\,\|K\|_{q+1,\infty}\,\|u\|_{1}\right) \tag{5}$$


can be established for the variable coefficient Poisson equation and standard conforming finite elements. The $W^{q+1,\infty}$-norm has to be understood macro-element-wise. Here $u$ denotes the weak solution and

$$K = \frac{DF^{-1}\, a\, DF^{-\top}}{|\det DF^{-1}|}$$

takes into account the variable coefficient $a$ of the Poisson equation and a possible non-trivial geometry mapping $F$ defined macro-element-wise. We note that with this approach we can also compensate for geometry approximations and allow blending mappings $F$ from a macro element to a physical element that represents the geometry or internal layers more accurately. Such a mapping then influences the diffusion coefficient through its Jacobian $DF$, and thus the stencil values, but does not distort the data structure of the fine mesh. To be more precise, the elements of the macro mesh define a reference domain $\Omega_H$ which, in general, does not resolve the physical domain $\Omega$. Without a mapping of $\Omega_H$, the approximation error of the geometry would easily dominate the total error. A two-scale bound such as (5) is based on the first Strang lemma and can be generalized to other equations or finite elements. Although $H$ is typically larger than $h$ by a factor of 64–256, both mesh sizes are small in large-scale simulations. The two error terms can be balanced by controlling the order of the surrogate. For stencil entries that only couple nodes placed on macro faces, macro edges, or macro vertices, we have to apply special techniques, e.g., averaging surrogates from the attached macro elements, computing surrogate polynomials on the lower-dimensional structure, or assembling in the traditional way. Recall that the number of nodes associated with the macro mesh wire basket, $O(h^{1-d})$, is of lower order than the number of nodes in the interior of the macro elements, $O(h^{-d})$. Here $d$ stands for the space dimension. For matrix-free frameworks this can be split into an off-line and an on-line procedure. In the off-line phase, we compute a few stencil values exactly and then the coefficients of the surrogates. In the on-line phase, we simply evaluate the surrogates whenever the matrix entries are needed. The uniform structure of the fine mesh lattice associated with one macro element allows an extremely efficient $O(q)$-evaluation along one-dimensional lattice lines. We point out that, in contrast to many other reduction techniques, the off-line phase can also be of significantly lower cost than standard assembly techniques, since the cardinality of $S_T$ is, in general, much lower than the number of fine mesh nodes per macro element $T$. In such a case the approach also pays off in a matrix-oriented framework, where all matrix entries can now be assembled by the computationally cheap evaluation of the surrogates, see Drzisga et al. (2020). Finally, we note that the two-level surrogate approach is quite general and can be applied not only to all types of finite elements, but also to finite volume and finite difference schemes. It gains its performance from a higher-order consistency error on a coarser mesh.
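The off-line/on-line idea can be illustrated with a one-dimensional toy problem. The sketch below is not the authors' implementation. It assumes a hypothetical coefficient k(x) = 2 + sin(πx), which is not the coefficient of the text, linear elements for −(k u′)′ on a single macro element [0, 1] at refinement level 6, midpoint quadrature, and nodal Lagrange interpolation at a few fine mesh nodes instead of the least-squares fit used for Table 2. All function names are illustrative.

```python
import math


def coefficient(x):
    """Hypothetical smooth diffusion coefficient (an assumption for this sketch)."""
    return 2.0 + math.sin(math.pi * x)


def diagonal_entry(i, h):
    """Diagonal stiffness entry at fine node i (linear elements, midpoint rule)."""
    x = i * h
    return (coefficient(x - 0.5 * h) + coefficient(x + 0.5 * h)) / h


def lagrange_eval(xs, ys, x):
    """Evaluate the Lagrange interpolation polynomial through (xs, ys) at x."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        weight = 1.0
        for m, xm in enumerate(xs):
            if m != j:
                weight *= (x - xm) / (xj - xm)
        total += yj * weight
    return total


def max_relative_error(level, degree):
    """Off-line: sample degree+1 nodes. On-line: evaluate surrogate at all nodes."""
    n = 2 ** level
    h = 1.0 / n
    if degree == 0:
        sample = [n // 2]
    else:
        sample = [round(1 + j * (n - 2) / degree) for j in range(degree + 1)]
    xs = [i * h for i in sample]
    ys = [diagonal_entry(i, h) for i in sample]   # exact assembly at few nodes
    worst = 0.0
    for i in range(1, n):                          # all interior fine nodes
        exact = diagonal_entry(i, h)
        surrogate = lagrange_eval(xs, ys, i * h)
        worst = max(worst, abs(surrogate - exact) / abs(exact))
    return worst


errors = {q: max_relative_error(6, q) for q in (0, 2, 4, 6)}
for q, err in errors.items():
    print(f"degree {q}: max relative error {err:.3e}")
```

For this toy setting the maximal relative error shrinks as the polynomial degree grows, which mirrors the qualitative trend of Table 2. Only `degree + 1` entries are assembled exactly, while all 63 interior entries are obtained from the surrogate.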

3.3 All-at-Once Multigrid Method Mathematical models in geodynamics typically consist of two components: a Stokes-type problem for the velocity and pressure, coupled to a dynamic convection-diffusion equation for the temperature, see (1)–(3). Many time integration schemes then exploit the natural block structure and iteratively treat the velocity and pressure block separately from the energy balance equation for the temperature. Here we only discuss a possible solution strategy for the Stokes-type system, since this is the most time-consuming part. There are many options to solve a saddle-point problem iteratively, ranging from preconditioned Krylov methods for the indefinite saddle-point system to inner and outer solvers applied to the Schur complement. Typically, a Krylov space method is used as outer solver and a multigrid scheme as the velocity-based inner one. As all these Krylov variants require some extra storage in the form of vectors and global parallel reductions for inner products, all-at-once multigrid techniques, i.e. monolithic multigrid methods directly applied to the saddle-point system, are an attractive alternative. Let us focus on a system matrix of the form



$$\begin{pmatrix} A & B \\ B^{\top} & -C \end{pmatrix},$$

where $A$ is a symmetric positive definite matrix, and $C$ is symmetric and positive semi-definite, e.g., resulting from some pressure-based stabilization term. The all-at-once multigrid operates on velocity and pressure unknowns and requires the standard ingredients such as smoother, restriction and prolongation. Our choice of smoother belongs to the Uzawa class. In a symmetric version, it requires first an application of a suitable smoother $\hat{A}$ for $A$ on the velocity component, e.g., a forward Gauss–Seidel type method, then a suitable symmetric smoother $\hat{S}$ for the Schur complement $S = C + B^{\top} A^{-1} B$ on the pressure, and finally the application of the adjoint smoother $\hat{A}^{\top}$ of the first step on the velocity, e.g., a backward Gauss–Seidel. For the choice of the Schur complement smoother there are many options. From a theoretical point of view, it is only required that $\hat{A} + \hat{A}^{\top} > A$ and $\hat{S} \ge S$, see Drzisga et al. (2018) for more details. One option for $\hat{S}$ is a diagonal smoother based on the damped mass matrix. A quasi-optimal damping parameter is determined by some power iteration steps applied to a generalized eigenvalue problem on a coarse mesh. Alternatively, we use a symmetric Gauss–Seidel applied to the matrix $C + \omega M$, where $M$ is the mass matrix and $\omega$ has to be selected such that the assumptions on $\hat{S}$ are satisfied. We refer to Gmeiner et al. (2016) for a comparison of the performance of different solver types. Finally, we note that no V-cycle convergence can be established and that numerical examples confirm this negative theoretical result. However, level-independent W-cycle or variable V-cycle convergence rates can be shown. Due to the increased number of coarse level solves in a W-cycle compared to a V-cycle, we use the variable V-cycle with a reduced, i.e., additive, increase in the number of smoothing steps.
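The trade-off between a W-cycle and a variable V-cycle can be sketched by counting level visits and smoothing steps. In the sketch below, level index 0 denotes the finest level; the number of levels, the base number of smoothing steps and the additive increment are hypothetical values, since the text does not state them.

```python
def cycle_visits(levels, cycle_index):
    """Visits per level in one multigrid cycle (index 0 = finest level).

    cycle_index = 1 gives a V-cycle, cycle_index = 2 a W-cycle.
    """
    return [cycle_index ** j for j in range(levels)]


def additive_smoothing_steps(levels, steps_on_finest, increment):
    """Variable V-cycle: smoothing steps grow additively towards coarser levels."""
    return [steps_on_finest + increment * j for j in range(levels)]


LEVELS = 6          # hypothetical depth of the multigrid hierarchy
BASE_STEPS = 3      # hypothetical number of smoothing steps on the finest level
INCREMENT = 2       # hypothetical additive increase per coarser level

w_visits = cycle_visits(LEVELS, 2)
v_visits = cycle_visits(LEVELS, 1)
v_steps = additive_smoothing_steps(LEVELS, BASE_STEPS, INCREMENT)

print("W-cycle coarsest-level solves per cycle:", w_visits[-1])
print("variable V-cycle coarsest-level solves per cycle:", v_visits[-1])
print("variable V-cycle smoothing steps per level:", v_steps)
```

With six levels the W-cycle reaches the coarsest level 32 times per cycle, whereas the variable V-cycle reaches it once and compensates with a few extra smoothing steps on the coarser, cheaper levels.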

Fig. 3 a Temperature profile and b streamlines obtained by an all-at-once multigrid solver



Figure 3a illustrates a given temperature profile defining the right-hand side of the stationary Stokes equation in the Boussinesq approximation, see Grand et al. (1997), Stixrude and Lithgow-Bertelloni (2005). In Fig. 3b, the computed velocity streamlines below Iceland are shown. As boundary condition, we assume free slip at the inner boundary and a velocity profile given by plate tectonic data, see Williams et al. (2012), Müller et al. (2008), on the outer surface.

4 High Performance Computing The evolution of semiconductor technology has led to an explosive growth of available compute power in the past five decades. However, computational science is also confronted with ever more complex supercomputer architectures, such as, e.g., multicore processors, deeply hierarchical memory systems, or accelerator devices. Therefore efficiently programming modern supercomputer systems has become increasingly difficult.

4.1 Grand Challenge Problems Simulating Earth mantle convection belongs to the class of grand challenge problems, since it will inevitably bring any existing computer system to its limits. To see this, it suffices to realize that the Earth mantle has a volume of approximately $10^{12}$ cubic kilometers and that one kilometer is a typical spatial resolution needed to resolve geological features, as argued in Sect. 2. Thus, a naïvely constructed mesh might have on the order of $N = 10^{12}$ finite elements, and after assembly the resulting linear system could have on the order of $10^{13}$ unknowns. Note that such a vector of double precision floating point numbers occupies 80 TByte (about 73 TiByte) of memory. Thus, even the largest supercomputers in existence today can only store a few such vectors, and storing the stiffness matrix is out of reach. The sheer size of this computational task is illustrated with a back-of-the-envelope calculation as follows. Assume an $O(N^2)$ algorithm is applied to $N = 10^{12}$ cells, where each operation consumes an energy of 1 nJ.¹ Note that computing something as simple and fundamental as the domain diameter would often be realized as an $O(N^2)$ algorithm. With $N^2 = 10^{24}$ operations, however, the total energy consumption would amount to $10^{15}$ J or, equivalently, 278 GWh, the energy output of a large nuclear reactor running for two weeks.


¹ Note that the cost of 1 nJ per operation has a realistic magnitude when considering, e.g., Fugaku, the number one supercomputer in the TOP500 list of June 2021 (https://www.top500.org). It is quoted as executing approximately $0.5 \times 10^{18}$ floating point operations per second at a power intake of approximately 30 MW.



This rough estimate illustrates that even algorithms for seemingly trivial problems need to be designed with great care when progressing to extreme scale. In effect all steps and all algorithmic components of the computation must be scalable. Note here that our definition of scalability goes beyond a technical property of the implementation. For extreme problem sizes, every step in the solution process must scale. In particular, and most importantly, the algorithmic complexity must not grow any faster than linearly (or polylog-linearly) with problem size.
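The back-of-the-envelope numbers of this section can be reproduced in a few lines. The sketch uses only figures quoted in the text (problem sizes, 1 nJ per operation, and the Fugaku figures from the footnote); the variable names are illustrative.

```python
# Problem sizes quoted in the text.
n_cells = 10**12
n_unknowns = 10**13
bytes_per_double = 8

# Memory for one solution vector in double precision.
vector_bytes = n_unknowns * bytes_per_double
vector_tbyte = vector_bytes / 1e12        # decimal terabytes
vector_tibyte = vector_bytes / 2**40      # binary tebibytes

# Energy of an O(N^2) algorithm at 1 nJ per operation.
operations = n_cells**2
energy_joule = operations * 1e-9
energy_gwh = energy_joule / 3.6e12        # 1 GWh = 3.6e12 J

# Energy per floating point operation from the footnote figures:
# about 0.5e18 FLOP/s at about 30 MW.
joule_per_flop = 30e6 / 0.5e18

print(f"vector size: {vector_tbyte:.0f} TByte ({vector_tibyte:.1f} TiByte)")
print(f"O(N^2) energy: {energy_joule:.1e} J = {energy_gwh:.0f} GWh")
print(f"energy per FLOP from the footnote figures: {joule_per_flop * 1e9:.2f} nJ")
```

The footnote figures give an energy of order 0.1 nJ per floating point operation, so the assumed 1 nJ per operation is within roughly one order of magnitude of that value.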

4.2 Scalability Taken Seriously This limits the choices severely. No technical improvement of computer systems in the foreseeable future will make algorithms of quadratic complexity feasible for solving extreme scale problems. Therefore, it is imperative that we devise scalable algorithms, i.e. algorithms with linear (or almost linear) complexity. Often these are algorithms that rely on tree structures, divide-and-conquer strategies, and recursion over a hierarchy. In our case, hierarchical mesh structures will be essential to implement algorithms with scalable complexity. In particular, it is not feasible to generate a mesh with $10^{12}$ elements and then use a partitioning tool to distribute it onto a parallel computer system. Standard mesh generation software and partitioning algorithms are in this sense not scalable and cannot reach extreme scale. Consequently, we must use hierarchical structures as the primary design principle even for constructing the mesh. Such considerations leave only a few algorithmic choices for the solution process itself. Memory limitations also constrain the options. Again, only hierarchical structures and algorithms such as the multigrid methods described in Sect. 3.3 deliver the near-optimal complexity needed.

4.3 Asymptotically Optimal Not Necessarily Good Enough Algorithms that scale, i.e. algorithms whose cost is linear in problem size, are often called asymptotically optimal. Clearly, the leading constants in the asymptotic estimate are then also essential to determine the cost in practice. Determining the constants is therefore an essential step in a cost analysis. Unfortunately, such concise cost estimates for PDE solvers have rarely been published, and thus it often remains unclear which method is preferable for which kind of application. In our context we note that substeps of the solution process, such as constructing the local stiffness matrices and global matrix assembly, must also be reviewed critically in terms of the number of operations and other costs that they incur. These algorithms are inherently scalable and parallelizable, but this does not mean that a particular choice leads to a method that is cost efficient, even if it is asymptotically optimal. For a cost analysis in extreme scale computing it is not enough to prove that computations have linear complexity and that they can be parallelized. Exact cost assessments are needed. If these are not available analytically, then we must resort to systematic numerical experimentation.

4.4 With Matrix-Free Multigrid Methods Towards Extreme Scale In particular, we note that computing and storing the stiffness and mass matrices in explicit form will become a severe cost factor, even if we employ sparse matrix formats. Significantly larger problems can be solved when using matrix-free methods. However, it is again essential that the re-computation of matrix entries on-the-fly is performed efficiently. Here the surrogate techniques of Sect. 3.2 come into play. They exploit novel approximation strategies that in turn rely on the hierarchical block structure of the meshes. They trade a small loss of accuracy for better performance and lower memory consumption. In Gmeiner et al. (2015a), it is shown that an all-at-once parallel multigrid method can indeed solve the Stokes equations up to sizes of $10^{13}$ unknowns. The largest computation for this fundamental building block of Earth mantle convection to date is summarized in Table 3. The computations were executed on Juqueen,² the supercomputer with the largest core count available to us to date. This computational experiment shows the scalability of both the solver algorithm and the implementation of the finite element framework. The slight degradation in compute time for higher core counts is due to the solution on the coarsest grid, which is here performed with a simple Krylov method. A more careful discussion of alternative methods to deal with the coarse grids in a multigrid method is presented in Buttari et al. (2021). In particular, this article explores so-called agglomeration strategies, i.e. techniques to use fewer processors when handling coarser grids in the multigrid hierarchy. This reduces latency effects in the iterations on the coarse grid at the cost of overhead for transferring the problem to and from a small subset of processors.
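The weak scaling data reported in Table 3 can be condensed into a parallel efficiency, defined here as the run-time on the smallest configuration divided by the run-time on the larger one. The sketch below copies the table values as given; the efficiency definition and the variable names are choices made for this illustration.

```python
# Rows of Table 3: (nodes, cores, degrees of freedom, iterations, time in s).
table3 = [
    (5, 80, 2.7e9, 10, 685.88),
    (40, 640, 2.1e10, 10, 703.69),
    (320, 5120, 1.2e11, 10, 741.86),
    (2560, 40960, 1.7e12, 9, 720.24),
    (20480, 327680, 1.1e13, 9, 776.09),
]

base_time = table3[0][4]

# Weak scaling efficiency relative to the smallest run.
efficiency = [base_time / time for (_, _, _, _, time) in table3]

# Problem size per core, which weak scaling keeps roughly constant.
dofs_per_core = [dofs / cores for (_, cores, dofs, _, _) in table3]

for (nodes, cores, _, _, _), eff, load in zip(table3, efficiency, dofs_per_core):
    print(f"{nodes:6d} nodes, {cores:7d} cores: "
          f"efficiency {eff:.3f}, {load:.2e} DoFs per core")
```

With this definition the largest run retains roughly 88% efficiency relative to the smallest one, which quantifies the slight degradation in compute time mentioned above.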

² Accessed May 17, 2021.

Table 3 Weak scaling results for monolithic multigrid in a spherical shell representing Earth’s mantle

Nodes     Cores      DoFs          Iterations   Time (sec)
5         80         2.7 × 10^9    10           685.88
40        640        2.1 × 10^10   10           703.69
320       5 120      1.2 × 10^11   10           741.86
2 560     40 960     1.7 × 10^12   9            720.24
20 480    327 680    1.1 × 10^13   9            776.09

4.5 Quest for Textbook Efficiency

For our analysis of efficient large scale solvers we follow the textbook paradigm of Brandt (Brandt and Livne 2011). The idea here is that the cost of solving a system is evaluated in the metric of Work Units (WU), where a WU is defined as the cost of an elementary relaxation step of the discrete system, or as a multiplication with the system matrix. As claimed by Brandt, textbook multigrid efficiency is then reached when the cost of solving the system does not exceed 10 WU. This paradigm has been extended to parallel systems in Gmeiner et al. (2015b) and Kohl and Rüde (2020) by techniques to quantify the cost of a WU in the context of large scale parallel computing. This in turn can be based on analytic models of computer performance, such as the roofline model (Williams et al. 2009), or on more advanced techniques for performance analysis, such as the execution-cache-memory model (Stengel et al. 2015).

Very large computations on grand challenge problems are often executed as isolated singular efforts. They are thus sometimes designated as heroic supercomputing (Da Costa et al. 2015), meaning in particular that they cannot easily be repeated or re-run with different parameter settings. Such computational experiments, which primarily demonstrate the scalability of a prototype implementation of the methods described above, are documented, e.g., in Gmeiner et al. (2015a). However, these computations are still significantly beyond the level of everyday computational practice. This is not only because access to supercomputers of such size is limited, but also because setting up such a computation requires expertise in configuring and tuning specialized software.

In order for scalability demonstrations as in Table 3 to bear fruit for a target science (here geophysics), the software must additionally satisfy other quality criteria. In particular, it must be flexible enough to be extended to more complex physical models, and it must be user-friendly so that it also becomes attractive to non-HPC experts. Therefore the original prototype software HHG (Bergen et al. 2007) is currently being refactored into a new finite-element multigrid framework HyTeG (Kohl et al. 2019) with the goal of improving its usability and extensibility. HyTeG builds on the concepts of HHG, using a hierarchical construction of the meshes to induce a nested sequence of finite element spaces.
However, it relaxes the restriction of HHG to low-order conforming finite elements (Kohl and Rüde 2020). Additionally, the HyTeG software makes use of program generation in an effort to achieve performance portability by automating application-specific code transformations. Note that these techniques can be combined with the performance analysis techniques described above. The development of scientific software can thus not only be organized in a continuous integration process, as in modern software engineering, but the essential performance features can also be enforced in a continuous benchmarking process. In essence this means that clear performance targets are set for every component of a large simulation software system. These analytically derived performance targets can then be automatically checked for the full range of supported computer systems. Thus, performance degradations can be detected and identified immediately, either when the programs are modified or when the software is ported to new hardware.
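As an illustration of such a continuous benchmarking check, the sketch below derives a performance target from the roofline model (attainable performance is the minimum of peak performance and memory bandwidth times arithmetic intensity) and compares a measured value against it. All kernel names, machine parameters, measured values and the 80% threshold are hypothetical placeholders; this is not how HyTeG or HHG implement their checks.

```python
def roofline_bound(peak_gflops, bandwidth_gb_s, intensity_flop_per_byte):
    """Attainable performance (GFLOP/s) under the roofline model."""
    return min(peak_gflops, bandwidth_gb_s * intensity_flop_per_byte)


def meets_target(measured_gflops, peak_gflops, bandwidth_gb_s,
                 intensity_flop_per_byte, min_fraction=0.8):
    """True if the measurement reaches a given fraction of the roofline bound."""
    bound = roofline_bound(peak_gflops, bandwidth_gb_s, intensity_flop_per_byte)
    return measured_gflops >= min_fraction * bound


# Hypothetical machine: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
PEAK, BANDWIDTH = 1000.0, 100.0

# Hypothetical kernels: (arithmetic intensity in FLOP/byte, measured GFLOP/s).
kernels = {"smoother": (0.25, 22.0), "residual": (0.20, 9.0)}

report = {
    name: meets_target(measured, PEAK, BANDWIDTH, intensity)
    for name, (intensity, measured) in kernels.items()
}
print(report)
```

Run as part of an automated test suite, a check of this kind would flag the second kernel as a performance regression to be investigated.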

5 Conclusions

Future extreme scale computing will require fundamental innovations in algorithms and data structures. In this article, we have illustrated recent research in model development and mathematics required to simulate Earth mantle convection with high resolution. However, the challenges of extreme scale computing force us to rethink the traditional processes of finite-element-based scientific computing even more fundamentally. Efficiency must be treated as a primary design objective. Complex modern computer architectures affect which mathematical models and which algorithms can be considered. In particular, we find that the usual workflow with the steps (1) meshing, (2) matrix assembly, (3) solving will not be suitable for extreme scale computing. Instead a co-design of all steps is necessary, since only a hierarchical design of all stages of the scientific computing pipeline can break the scalability barrier.

References Bauer S, Bunge H-P, Drzisga D, Ghelichkhan S, Huber M, Kohl N, Mohr M, Rüde U, Thönnes D, Wohlmuth B (2020) TerraNeo: Mantle convection beyond a trillion degrees of freedom. In: Bungartz H-J, Reiz S, Uekermann B, Neumann P, Nagel WE (eds), Software for Exascale Computing - SPPEXA 2016-2019, Lecture Notes in Computational Science and Engineering, vol 136, pp 569–610. Springer Bauer S, Huber M, Ghelichkhan S, Mohr M, Rüde U, Wohlmuth B (2019) Large-scale simulation of mantle convection based on a new matrix-free approach. J Comput Sci 31:60–76 Baumgardner JR (1985) Three-dimensional treatment of convective flow in the earth’s mantle. J Stat Phys 39(5–6):501–511 Bello L, Coltice N, Rolf T, Tackley PJ (2014) On the predictability limit of convection models of the Earth’s mantle. Geochem Geophys Geosys 15(6):2319–2328 Bergen B, Wellein G, Hülsemann F, Rüde U (2007) Hierarchical hybrid grids: achieving TERAFLOP performance on large scale finite element simulations. Int J Par Emerg Distrib Syst 22(4):311–329



Brandt A, Livne OE (2011) Multigrid techniques: 1984 guide with applications to fluid dynamics, vol 67. Classics in Applied Mathematics. SIAM, Philadelphia, PA Burstedde C, Stadler G, Alisic L, Wilcox LC, Tan E, Gurnis M, Ghattas O (2012) Large-scale adaptive mantle convection simulation. Geophys J Int 192(3):889–906 Buttari A, Huber M, Leleux P, Mary T, Rüde U, Wohlmuth B (2021) Block low-rank single precision coarse grid solvers for extreme scale multigrid methods. Num Lin Alg Appl. e2407 Colli L, Bunge H-P, Oeser J (2020) Impact of model inconsistencies on reconstructions of past mantle flow obtained using the adjoint method. Geophys J Int 221(1):617–639 Colli L, Bunge H-P, Schuberth BSA (2015) On retrodictions of global mantle flow with assimilated surface velocities. Geophys Res Lett 42(20):8341–8348 Da Costa G, Fahringer T, Rico-Gallego JA, Grasso I, Hristov A, Karatza H, Lastovetsky A, Marozzo F, Petcu D, Stavrinides G, Talia D, Trunfio P, Astsatryan H (2015) Exascale machines require new programming paradigms and runtimes. Supercomput Front Innov 2(2):6–27 Davies GF, Richards MA (1992) Mantle convection. J Geol 100(2):151–206 Drzisga D, John L, Rüde U, Wohlmuth B, Zulehner W (2018) On the analysis of block smoothers for saddle point problems. SIAM J Matrix Anal Appl 39(2):932–960 Drzisga D, Keith B, Wohlmuth B (2019) The surrogate matrix methodology: A priori error estimation. SIAM J Sci Comput 41(6):A3806–A3838 Drzisga D, Keith B, Wohlmuth B (2020) The surrogate matrix methodology: Low-cost assembly for isogeometric analysis. Comput Methods Appl Mech Eng 361:112776 Drzisga D, Rüde U, Wohlmuth B (2020) Stencil scaling for vector-valued PDEs on hybrid grids with applications to generalized Newtonian fluids. SIAM J Sci Comput 42(6):B1429–B1461 Gmeiner B, Huber M, John L, Rüde U, Wohlmuth B (2016) A quantitative performance study for Stokes solvers at the extreme scale. 
J Comput Sci 17(3):509–521 Gmeiner B, Rüde U, Stengel H, Waluga C, Wohlmuth B (2015) Performance and scalability of hierarchical hybrid multigrid solvers for Stokes systems. SIAM J Sci Comput 37(2):C143–C168 Gmeiner B, Rüde U, Stengel H, Waluga C, Wohlmuth B (2015) Towards textbook efficiency for parallel multigrid. Numer Math Theory Methods Appl 8(1):22–46 Grand SP, van der Hilst RD, Widiyantoro S (1997) Global seismic tomography: a snapshot of convection in the Earth. GSA Today 7(4):1–7 Hager BH, Clayton RW, Richards MA, Comer RP, Dziewonski AM (1985) Lower mantle heterogeneity, dynamic topography and the geoid. Nature 313(6003):541–545 Heister T, Dannberg J, Gassmöller R, Bangerth W (2017) High accuracy mantle convection simulation through modern numerical methods. II: realistic models and problems. Geophys J Int 210(2):833–851 Horbach A, Bunge H-P, Oeser J (2014) The adjoint method in geodynamics: derivation from a general operator formulation and application to the initial condition problem in a high resolution mantle circulation model. GEM Int J Geomath 5(2):163–194 Ismail-Zadeh A, Schubert G, Tsepelev I, Korotkii A (2004) Inverse problem of thermal convection: numerical approach and application to mantle plume restoration. Phys Earth Planet Inter 145(1– 4):99–114 Jarvis GT, McKenzie DP (1980) Convection in a compressible fluid with infinite Prandtl number. J Fluid Mech 96(3):515–583 Kohl N, Rüde U (2020) Textbook efficiency: massively parallel matrix-free multigrid for the Stokes system. arXiv:2010.13513. Submitted Kohl N, Thönnes D, Drzisga D, Bartuschat D, Rüde U (2019) The HyTeG finite-element software framework for scalable multigrid solvers. Int J Par Emerg Distrib Sys 34(5):477–496 Kronbichler M, Heister T, Bangerth W (2012) High accuracy mantle convection simulation through modern numerical methods. Geophys J Int 191(1):12–29 Kronbichler M, Kormann K (2012) A generic interface for parallel cell-based finite element operator application. 
Comput Fluids 63:135–147 Le Dimet F-X, Talagrand O (1986) Variational algorithms for analysis and assimilation of meteorological observations: Theoretical aspects. Tellus A Dyn Meteor Oceanogr 38(2):97–110


M. Mohr et al.

Mayr M, Berger-Vergiat L, Ohm P, Tuminaro RS (2021) Non-invasive multigrid for semi-structured grids. arXiv:2103.11962 Müller RD, Sdrolias M, Gaina C, Roest WR (2008) Age, spreading rates, and spreading asymmetry of the world’s ocean crust. Geochem Geophys Geosyst 9(4):1–19 Rudi J, Malossi ACI, Isaac T, Stadler G, Gurnis M, Staar PWJ, Ineichen Y, Bekas C, Curioni A, Ghattas O (2015) An extreme-scale implicit solver for complex PDEs: Highly heterogeneous flow in earth’s mantle. In: SC ’15: Proceedings of the international conference for high performance computing, networking, storage and analysis. ACM, New York, pp 5:1–12 Schuberth BSA, Bunge H-P, Steinle-Neumann G, Moder C, Oeser J (2009) Thermal versus elastic heterogeneity in high-resolution mantle circulation models with pyrolite composition: high plume excess temperatures in the lowermost mantle. Geochem Geophys Geosys 10(1):1–24 Stengel H, Treibig J, Hager G, Wellein G (2015) Quantifying performance bottlenecks of stencil computations using the execution-cache-memory model. In: ICS ’15: proceedings of the 29th ACM on international conference on supercomputing. ACM, New York, pp 207–216 Stixrude L, Lithgow-Bertelloni C (2005) Thermodynamics of mantle minerals. I: physical properties. Geophys J Int 162(2):610–632 Williams S, Waterman A, Patterson D (2009) Roofline: an insightful visual performance model for multicore architectures. Commun ACM 52(4):65–76 Williams SE, Müller RD, Landgrebe TCW, Whittaker JM (2012) An open-source software environment for visualizing and refining plate tectonic reconstructions using high-resolution geological and geophysical data sets. GSA Today 22(4):4–9 Zhong SJ, Yuen DA, Moresi LN, Knepley MG (2015) Numerical methods for mantle convection. In Bercovici D (ed) Treatise on geophysics. Vol. 7: mantle dynamics, 2nd edn. Elsevier, pp 197–222

Remarks on the Radiative Transfer Equations for Climatology Claude Bardos, François Golse, and Olivier Pironneau

Abstract In this paper, we discuss some generally accepted approximations of climatology. Using both theoretical and numerical arguments, we prove, disprove or modify these approximations. We apply the general one-dimensional radiative transfer equations for the arguments. Keywords Radiative transfer · Climatology · Integral equation · Finite element methods

C. Bardos
Mathematics Department, University of Paris Denis Diderot, 5 Rue Thomas Mann, 75013 Paris, France

F. Golse
Ecole Polytechnique, Palaiseau 91128, France

O. Pironneau (B)
Applied Mathematics, Jacques-Louis Lions Lab, Sorbonne Université, 75252 Paris cedex 5, France

1 Introduction

Numerous satellite, atmospheric and terrestrial measurements support global warming. Various hypotheses are made to explain the disastrous effect of the greenhouse gases (GHG). However, a rigorous proof derived from the fundamental equations of physics is not available, and one of the problems is the complexity of the physical system of planet Earth in its astronomical setting around the Sun. In many references such as Kaper and Engler (2013), Modest (2013), Chandrasekhar (1950), Fowler (2011), it is explained that the Sun radiates light with a heat flux Q = 1370 W/m², in the frequency range (0.5, 20) × 10^14 Hz, corresponding approximately to a black body at a temperature of 5800 K. 70% of this light intensity

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 P. Neittaanmäki and M.-L. Rantalainen (eds.), Impact of Scientific Computing on Science and Society, Computational Methods in Applied Sciences 58,




reaches the ground because the atmosphere is almost transparent to this spectrum, and about 30% is reflected back by the clouds or the ocean, snow, etc. (albedo). The Earth behaves almost like a black body at temperature Te = 288 K and as such radiates rays of frequencies ν in the infrared spectrum (0.03, 0.6) × 10^14 Hz:

    The absorption coefficient variation with frequencies is such that the atmosphere is essentially transparent to solar radiation. Fowler (2011, p. 65)

This is also stated by the World Meteorological Organization (WMO):

    Solar radiation that is not absorbed or reflected by the atmosphere (for example, by clouds) reaches the surface of the Earth. The Earth absorbs most of the energy reaching its surface, a small fraction is reflected. The sun’s impact on the earth (2023)

Carbon dioxide renders the atmosphere opaque to infrared radiation around 20 THz. Hence increasing its proportion in air increases the absorption coefficient in that range. This is believed to be one of the causes of global warming. However, the mechanism must be more complex, because some experimental measurements show that increasing opacity decreases the temperature at high altitude, above the clouds (Dufresne et al. 2020). In the framework of the radiative transfer equations, one can study the conjectures cited above. To support theoretical claims, the numerical solver developed in Bardos and Pironneau (2021), Pironneau (2021), Golse and Pironneau (2022) is used. We will address four questions:

1. What is the effect of increasing the altitude-dependent absorption coefficient?
2. Is it true that sunlight crosses the Earth’s atmosphere unaffected?
3. Is 30% of albedo equivalent to a 30% reduction in solar radiation energy?
4. How does increasing the absorption coefficient in a specific frequency range affect the temperature?

2 Radiative Transfer Equations for a Stratified Atmosphere

Finding the temperature $T$ in a fluid heated by electromagnetic radiation is a complex problem because interactions of photons with atoms in the medium involve rather intricate quantum phenomena. Assuming local thermodynamic equilibrium leads to a well-defined electronic temperature. In that case, one can write a kinetic equation for the radiative intensity $I_\nu(x,\omega,t)$ at time $t$, at position $x$ and in the direction $\omega$ for photons of frequency $\nu$, in terms of the temperature field $T(x,t)$:

\[
\frac{1}{c}\,\partial_t I_\nu + \omega\cdot\nabla I_\nu + \rho\bar\kappa_\nu a_\nu\Big[I_\nu - \frac{1}{4\pi}\int_{\mathbb{S}^2} p(\omega,\omega')\,I_\nu(\omega')\,d\omega'\Big] = \rho\bar\kappa_\nu(1-a_\nu)\,[B_\nu(T) - I_\nu]. \tag{1}
\]


Here $c$ is the speed of light, $\rho$ is the density of the fluid, $\nabla$ designates the gradient with respect to the position $x$, while

\[
B_\nu(T) = \frac{2h\nu^3}{c^2\,[e^{h\nu/(kT)} - 1]}
\]

is the Planck function at temperature $T$ with the Planck constant $h$ and the Boltzmann constant $k$. Recall the Stefan-Boltzmann identity,

\[
\int_0^\infty B_\nu(T)\,d\nu = \bar\sigma T^4, \qquad \bar\sigma = \frac{2\pi^4 k^4}{15\,c^2 h^3},
\]


where $\pi\bar\sigma$ is the Stefan-Boltzmann constant. The intricacy of the interaction of photons with the atoms of the medium is contained in the mass-absorption $\bar\kappa_\nu$, which is theoretically the result of vibrational and rotational atomic resonances, but for practical purposes is the fraction of radiative intensity at frequency $\nu$ that is absorbed by the fluid per unit length. The coefficient $a_\nu \in (0,1)$ is the scattering albedo, and $(1/4\pi)\,p(\omega,\omega')\,d\omega$ is the probability that an incident ray of light with direction $\omega'$ scatters in the infinitesimal element of solid angle $d\omega$ centered at $\omega$. The kinetic equation (1) is coupled to the temperature (or energy conservation) equations of the fluid. When thermal diffusion is small, the system decouples and the energy balance becomes:

\[
\int_0^\infty \rho\bar\kappa_\nu(1-a_\nu)\Big(\int_{\mathbb{S}^2} I_\nu(\omega)\,d\omega - 4\pi B_\nu(T)\Big)\,d\nu = 0. \tag{2}
\]
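The Stefan-Boltzmann identity recalled above can be checked numerically. The sketch below integrates the Planck function with a plain trapezoidal rule and compares the result with $\bar\sigma T^4$; it assumes that the Planck constant in $B_\nu$ is $h$ (not $h/2\pi$) and uses SI values of the physical constants. The truncation frequency and the number of quadrature points are arbitrary illustrative choices.

```python
import math

H_PLANCK = 6.62607015e-34  # Planck constant h in J s (assumed to be h, not h/(2 pi))
K_BOLTZ = 1.380649e-23     # Boltzmann constant in J/K
C_LIGHT = 299792458.0      # speed of light in m/s


def planck(nu, temperature):
    """Planck function B_nu(T) = 2 h nu^3 / (c^2 (exp(h nu / (k T)) - 1))."""
    x = H_PLANCK * nu / (K_BOLTZ * temperature)
    return 2.0 * H_PLANCK * nu**3 / (C_LIGHT**2 * math.expm1(x))


def integrated_planck(temperature, n=20000, x_max=50.0):
    """Trapezoidal value of int_0^inf B_nu(T) dnu, truncated at h nu = x_max k T."""
    nu_max = x_max * K_BOLTZ * temperature / H_PLANCK
    step = nu_max / n
    # B_nu vanishes at nu = 0, so the left end point contributes nothing.
    total = sum(planck(i * step, temperature) for i in range(1, n))
    total += 0.5 * planck(nu_max, temperature)
    return total * step


SIGMA_BAR = 2.0 * math.pi**4 * K_BOLTZ**4 / (15.0 * C_LIGHT**2 * H_PLANCK**3)

T_EARTH = 288.0  # effective Earth temperature used in the text, in K
lhs = integrated_planck(T_EARTH)
rhs = SIGMA_BAR * T_EARTH**4
print(f"integral = {lhs:.6e}, sigma_bar T^4 = {rhs:.6e}")
print(f"pi * sigma_bar = {math.pi * SIGMA_BAR:.6e} W m^-2 K^-4")
```

The last line recovers the familiar value of the Stefan-Boltzmann constant, consistent with the statement that $\pi\bar\sigma$ is that constant.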

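When the wave source is far in the direction $z$, the problem becomes one-dimensional in $z$ and the radiative intensity scattered in the direction $\omega$ depends only on $\mu$, which is the cosine of the angle between $\omega$ and $Oz$. Consequently the system becomes

\[
\begin{aligned}
&\mu\,\partial_z I_\nu + \rho\bar\kappa_\nu I_\nu = \rho\bar\kappa_\nu(1-a_\nu)B_\nu(T) + \frac{\rho\bar\kappa_\nu a_\nu}{2}\int_{-1}^{1} p(\mu,\mu')\,I_\nu(z,\mu')\,d\mu',\\
&\int_0^\infty \rho\bar\kappa_\nu(1-a_\nu)\Big(B_\nu(T) - \frac12\int_{-1}^{1} I_\nu\,d\mu\Big)\,d\nu = 0, \qquad z\in(0,H),\ |\mu|<1,\ \nu\in\mathbb{R}_+.
\end{aligned}
\tag{3}
\]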

For this system to be mathematically well posed, the radiation intensity $I_\nu$ must be given on the domain boundary where radiation enters. For example,

\[
I_\nu(H,-\mu) = Q^-_\nu(\mu), \qquad I_\nu(0,\mu) = Q^+_\nu(\mu), \qquad 0<\mu<1. \tag{4}
\]


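Isotropic scattering is modelled by taking $p \equiv 1$. By introducing an optical depth (in Fowler 2011 and others, the signs are changed),

\[
\tau = \int_0^z \rho(\eta)\,d\eta, \qquad \rho \text{ becomes } 1 \text{ and } z \text{ becomes } \tau\in(0,Z), \qquad Z = \int_0^H \rho(\eta)\,d\eta. \tag{5}
\]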



Let us denote the exponential integral by

\[
E_p(X) := \int_0^1 e^{-X/\mu}\,\mu^{p-2}\,d\mu.
\]

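Define

\[
J_\nu(\tau) = \frac12\int_{-1}^{1} I_\nu\,d\mu, \qquad
S_\nu(\tau) = \frac12\int_0^1 \Big(e^{-\kappa_\nu\tau/\mu}\,Q^+_\nu(\mu) + e^{-\kappa_\nu(Z-\tau)/\mu}\,Q^-_\nu(\mu)\Big)\,d\mu.
\]

The problem is equivalent to a functional integral equation:

\[
\begin{aligned}
&J_\nu(\tau) = S_\nu(\tau) + \frac{\kappa_\nu}{2}\int_0^Z E_1(\kappa_\nu|\tau-t|)\,\big((1-a_\nu)B_\nu(T(t)) + a_\nu J_\nu(t)\big)\,dt,\\
&\int_0^\infty (1-a_\nu)\kappa_\nu B_\nu(T(\tau))\,d\nu = \int_0^\infty (1-a_\nu)\kappa_\nu J_\nu(\tau)\,d\nu.
\end{aligned}
\]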


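Once $J_\nu$ and $T$ are known, $I_\nu$ can be recovered by

\[
\begin{aligned}
I_\nu(\tau,\mu) ={}& e^{-\kappa_\nu\tau/\mu}\,I_\nu(0,\mu)\,\mathbf{1}_{\mu>0} + e^{\kappa_\nu(Z-\tau)/\mu}\,I_\nu(Z,\mu)\,\mathbf{1}_{\mu<0}\\
&+ \mathbf{1}_{\mu>0}\int_0^\tau \frac{e^{-\kappa_\nu|\tau-t|/|\mu|}}{|\mu|}\,\big((1-a_\nu)\kappa_\nu B_\nu(T(t)) + a_\nu\kappa_\nu J_\nu(t)\big)\,dt\\
&+ \mathbf{1}_{\mu<0}\int_\tau^Z \frac{\kappa_\nu\,e^{-\kappa_\nu|\tau-t|/|\mu|}}{|\mu|}\,\big((1-a_\nu)B_\nu(T(t)) + a_\nu J_\nu(t)\big)\,dt.
\end{aligned}
\]

[…]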

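Let $I_\nu, T$ and $I'_\nu, T'$ be solutions to (3) and (4), respectively, with

\[
\begin{aligned}
&Q^+_\nu(\mu) = 0, && Q^-_\nu(\mu) = \mu\,Q^0_\nu, && 0<\mu<1,\\
&Q'^+_\nu(\mu) = \mu\,e^{-\kappa_\nu Z/\mu}\,Q^0_\nu, && Q'^-_\nu(\mu) = 0, && 0<\mu<1.
\end{aligned}
\]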

Then T  (τ ) = T (τ ) + O(ε) and Iν − Iν = μe−

κν Z μ

  κν κν Q 0ν 1μ>0 e− μ τ + 1μ0 = 0, Iν (Z )μν ,

⎫ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎬

⎪ Iν∗ |μ>0 = 0, Iν∗ |μ ν∗, ⎪ ν (−μ), ⎪ ⎪  ν∗  ∞  ∞  1  0 ⎪ ⎪ κν κν ⎪ − ⎭ κν Bν (T )dν ≈ Iν dμdν + Q ν (−μ) dμdν.⎪ 2 −1 2 −1 0 0 ν∗

Now Iν is defined by (3) but with +

Iν (0)|μ>0 = Q  ν (μ),

Iν (Z )|μ ν ∗ , Iν ≈ I  ∗ν (μ), independent of τ , and μ∂τ Iν + κν (Iν − Bν (T  )) = 0, +

ν < ν∗,

⎫ ⎪ ⎪ ⎪ ⎬

ν < ν∗, Iν (0)|μ>0 = Q  ν (μ), Iν (Z )|μ0 = −Q  ν (μ),

Iν (Z )|μ0 e Qν μ + e ((1 − aν )κν Bν (T (t)) + aν κν Jν (t)) dt |μ| 0  Z κν |τ −t| κν e− |μ| + 1μ0


κν |τ −t| |μ|

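which implies

\[
\begin{aligned}
J_\nu(\tau) ={}& \frac12 Q^0_\nu E_3(\kappa_\nu\tau) + \frac{\kappa_\nu}{2}\int_0^Z E_1(\kappa_\nu|\tau-t|)\,\big[(1-a_\nu)B_\nu(T(t)) + a_\nu J_\nu(t)\big]\,dt\\
&+ \frac{\kappa_\nu}{2}\sum_k \alpha_k\int_{\tau_k}^Z E_1(\kappa_\nu(t+\tau-\tau_k))\,\big[(1-a_\nu)B_\nu(T(t)) + a_\nu J_\nu(t)\big]\,dt.
\end{aligned}
\]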

Consider the case Iν (Z , μ)|μ0 = Q 0ν + α Iν (0, −μ) .

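Assume no scattering, $\tau_1 = 0$, and $K = 1$. As in Golse and Pironneau (2022), the following iterative scheme is considered:

\[
\begin{aligned}
&J^{n+1}_\nu(\tau) = \frac12 Q^0_\nu E_3(\kappa_\nu\tau) + \frac{\kappa_\nu}{2}\int_0^Z \big(E_1(\kappa_\nu|\tau-t|) + \alpha E_1(\kappa_\nu(t+\tau))\big)\,B_\nu(T^n(t))\,dt,\\
&\int_0^\infty \kappa_\nu B_\nu(T^{n+1}_\tau)\,d\nu = \int_0^\infty \kappa_\nu J^{n+1}_\nu(\tau)\,d\nu.
\end{aligned}
\]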
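To make the structure of this iteration concrete, the following sketch implements it under a strong simplifying assumption that is not made in the text: the grey case, with $\kappa_\nu = \kappa$ independent of $\nu$. The second equation of the scheme then reduces to $B^{n+1}(\tau) = J^{n+1}(\tau)$, where $B = \bar\sigma T^4$ and $J$, $Q^0$ denote frequency-integrated quantities. The integral is discretized on uniform cells with $B$ piecewise constant, and the kernel is integrated exactly over each cell using $\frac{d}{dx}E_2(x) = -E_1(x)$. All parameter values and function names are illustrative; this is not the solver of Golse and Pironneau (2022).

```python
import math


def exp_integral(p, x, m=400):
    """Midpoint-rule value of E_p(x) = int_0^1 exp(-x/mu) mu^(p-2) dmu (p >= 2, x >= 0)."""
    h = 1.0 / m
    return h * sum(math.exp(-x / ((i + 0.5) * h)) * ((i + 0.5) * h) ** (p - 2)
                   for i in range(m))


def grey_fixed_point(kappa=0.5, Z=1.0, Q0=1.0, alpha=0.3, n_cells=20,
                     tol=1e-12, max_iter=1000):
    """Iterate B <- (1/2) Q0 E_3(kappa tau) + W B for the grey, non-scattering case."""
    h = Z / n_cells
    tau = [(i + 0.5) * h for i in range(n_cells)]  # cell midpoints

    def e2(x):
        return exp_integral(2, x)

    # weights[i][j] = (kappa/2) * integral over cell j of
    #   E_1(kappa |tau_i - t|) + alpha E_1(kappa (t + tau_i)),
    # evaluated exactly through differences of E_2.
    weights = []
    for i in range(n_cells):
        row = []
        for j in range(n_cells):
            if i == j:
                direct = 2.0 * (1.0 - e2(0.5 * kappa * h))
            else:
                d = abs(tau[i] - tau[j]) - 0.5 * h
                direct = e2(kappa * d) - e2(kappa * (d + h))
            mirror = e2(kappa * (j * h + tau[i])) - e2(kappa * ((j + 1) * h + tau[i]))
            row.append(0.5 * (direct + alpha * mirror))
        weights.append(row)

    source = [0.5 * Q0 * exp_integral(3, kappa * t) for t in tau]
    B = [0.0] * n_cells
    change, iterations = float("inf"), 0
    while change >= tol and iterations < max_iter:
        J = [s + sum(w * b for w, b in zip(row, B)) for s, row in zip(source, weights)]
        change = max(abs(x - y) for x, y in zip(J, B))
        B = J
        iterations += 1
    return tau, B, iterations, change


tau, B, iterations, change = grey_fixed_point()
print(f"converged after {iterations} iterations, last update {change:.1e}")
print(f"B at first and last cell: {B[0]:.4f}, {B[-1]:.4f}")
```

In this discretization every row of the weight matrix sums to less than one, so the iteration is a contraction in the maximum norm; the temperature would follow from $T = (B/\bar\sigma)^{1/4}$.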


Example 1 Assume aν = 0. Figure 2 compares the numerical solutions of (3) with κν = 0.5[1ν0 = 0.3Iν (0, −μ) + μe−κν Z /μ Q 0ν /0.7 (solid curve), T  : Iν (0, μ)|μ>0 = 0.3Iν (0, −μ) + μQ 0ν (dashed line)

This numerical simulation is an attempt to replace the radiative source at τ = H by a radiative source at τ = 0 which gives the same result. Notice alongside that adding 0.3 albedo induces a heating of the atmosphere (i.e. comparing both computations with λ = 1, β = 1 above).

6 General Statement About Earth Albedo

As above, one considers a stratified atmosphere (3) with $(z,\mu,\nu)$ in $(0,H)\times(-1,1)\times\mathbb{R}_+$. The radiation intensity $I_\nu(z,\mu)$ satisfies at $z = H$ an incoming condition $I_\nu(H,-\mu) = Q^-_\nu(\mu)$, $0<\mu<1$, and at $z = 0$ on Earth some “albedo” condition, denoted by $I_\nu(0,\mu) = A_\nu(I_\nu(0,-\mu))$, $0<\mu<1$. Here $A$ represents the albedo effect of the Earth and is thus an operator from the space of the outgoing intensities into the space of the incoming intensities. Based on energy estimates, some sufficient conditions on $A$ are given below that guarantee the existence and stability (i.e. uniqueness) of the solution, beginning with the grey model. A natural operator with an effective Earth temperature is then considered (see Fowler 2011, p. 66).



6.1 Energy Type Estimate for the Grey Problem

For clarity, we present the case without scattering ($a_\nu = 0$) and with a constant $\kappa_\nu$ independent of the frequency $\nu$. By rescaling $z$ with $\kappa$, the problem can be formulated in terms of

\[
I(z,\mu) = \int_0^\infty I_\nu(z,\mu)\,d\nu
\]

as

\[
\mu\,\partial_z I(z,\mu) + I(z,\mu) - \frac12\int_{-1}^{1} I(z,\mu')\,d\mu' = 0, \tag{9}
\]

\[
I(H,-\mu) = Q(\mu) := Q^-_\nu(\mu), \qquad I(0,\mu) = A(I(0,-\mu)), \qquad 0<\mu<1. \tag{10}
\]

Note that we have assumed that $A$ commutes with the integration in $\nu$. First observe that the operator

\[
I \mapsto I - \frac12\int_{-1}^{1} I(\mu')\,d\mu'
\]

is, in the space $L^2(-1,1)$, the orthogonal projection on functions of mean value 0, because

\[
\int_{-1}^{1}\Big(\frac12\int_{-1}^{1} I(\mu')\,d\mu'\Big)\,I(\mu)\,d\mu = \int_{-1}^{1}\Big(\frac12\int_{-1}^{1} I(\mu')\,d\mu'\Big)^2 d\mu. \tag{11}
\]

This implies that the left-hand side of (11) is nonnegative and equal to 0 if and only if

\[
I = \frac12\int_{-1}^{1} I(\mu')\,d\mu'
\]

is independent of $\mu$. Multiplying Eq. (9) by $I(z,\mu)$, integrating over $(0,H)\times(-1,1)$, using the Green formula and inserting in the computations the conditions (10), one obtains:

\[
\begin{aligned}
&\int_0^H dz\int_{-1}^{1}\Big(I - \frac12\int_{-1}^{1} I(\mu')\,d\mu'\Big)^2 d\mu + \frac12\int_0^1 \mu\,(I(H,\mu))^2\,d\mu\\
&\qquad + \frac12\int_0^1 \mu\,\big[(I(0,-\mu))^2 - (A(I(0,-\mu)))^2\big]\,d\mu = \frac12\int_0^1 \mu\,(Q(\mu))^2\,d\mu.
\end{aligned}
\]

This estimate indicates that any property which implies the positivity of

\[
\int_0^1 \mu\,\big[(I(0,-\mu))^2 - (A(I(0,-\mu)))^2\big]\,d\mu
\]


will lead to a well-posed problem with a unique nonnegative solution. For instance, Theorem 1 Assume that the operator I → A(I ) is restricted to the positive cone C + (I (μ)) ≥ 0 on the space (L 2 (−1, 1), μ, dμ). Then for any positive μ → Q(μ) there is a unique nonnegative solution to Eqs. (9) and (10). Remark 3 Making use of the maximum principle one could prove similar results when the albedo operator is a contraction in the positive cone of L ∞ (0, 1). Remark 4 The albedo condition which is a relation, for 0 < μ < 1, between I (0, μ)|μ>0 and I (0, μ)|μ0 only. Indeed, define the operator T : I + (μ) → T (I + (μ)) = I (0, −μ)|μ>0 (defined, for instance, in L ∞ (0, 1)) by solving (9) with I (H, −μ) = Q(μ),

I (0, μ) = I + (μ), 0 < μ < 1 .

Then one has: I (0, μ)|μ>0 = A(I (0, −μ))


I (0, μ)|μ>0 = A(T (I (0, μ)|μ>0 )) .

With $Q(\mu) > 0$ in $L^\infty(0,1)$, $T$ is an affine operator which preserves positivity and monotonicity. This leads to the following simple but useful proposition.

Proposition 3 Consider two solutions $\{I_i\}_{i=1,2}$ to (9) with the same $Q(\mu)$ but with two different albedo operators $A_i$ in (10). Assume that both $A_i$ are linear contractions which preserve positivity and that one of them is a strict contraction. If

\[
\forall f \ge 0 \text{ in } L^\infty(0,1),\ \forall\mu\in(0,1), \qquad A_2(f)(\mu) \le A_1(f)(\mu), \tag{12}
\]

then one has the same ordering for the corresponding solutions:

\[
\forall (z,\mu)\in(0,H)\times(0,1), \qquad I_1(z,\mu) \le I_2(z,\mu).
\]

Proof Using the linearity of the operators $A_i$ one observes that $R = I_1 - I_2$ solves (9) with $R(H,-\mu) = 0$ and, for all $0<\mu<1$,

\[
R(0,\mu) - \frac12(A_1 + A_2)(T(R(0,\mu))) = \frac12(A_1 - A_2)(T(I_1(0,\mu) + I_2(0,\mu))).
\]

Moreover, restricted to solutions which satisfy $R(H,-\mu) = 0$, $\mu > 0$, the operator $T$ is a strict, monotonicity preserving linear contraction. The same observation holds for $R \mapsto \frac12(A_1 + A_2)(T(R))$. As a consequence the operator $R \mapsto R - \frac12(A_1 + A_2)(T(R))$ is invertible with inverse given by the Neumann series (also preserving the positivity):

\[
R = \sum_{k\ge0}\Big(\frac12(A_1 + A_2)(T)\Big)^k\,\frac12(A_1 - A_2)(T(I_1 + I_2))\big|_{z=0,\,\mu>0}. \tag{13}
\]

Since $I_1$ and $I_2$ are positive intensities, (12) applies to $\frac12(A_1 - A_2)(T(I_1 + I_2))|_{z=0,\,\mu>0}$ and hence to the right-hand side of (13). □
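The Neumann series argument used in the proof has a simple finite-dimensional analogue, sketched below: for a matrix $M$ with nonnegative entries and norm below one, $\sum_{k\ge0} M^k g$ converges to the solution of $(\mathrm{Id} - M)R = g$ and is nonnegative whenever $g$ is. The matrix and right-hand side are arbitrary illustrative values standing in for $\frac12(A_1 + A_2)T$ and the right-hand side of (13); they are not derived from the radiative transfer problem.

```python
def mat_vec(matrix, vector):
    """Product of a dense matrix (list of rows) with a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def neumann_solve(matrix, rhs, tol=1e-14, max_terms=10000):
    """Sum the Neumann series sum_{k>=0} M^k g, which solves (Id - M) R = g if ||M|| < 1."""
    term = list(rhs)
    total = list(rhs)
    for _ in range(max_terms):
        term = mat_vec(matrix, term)
        total = [t + s for t, s in zip(total, term)]
        if max(abs(x) for x in term) < tol:
            break
    return total


# Illustrative contraction with nonnegative entries (row sums below one).
M = [[0.2, 0.3],
     [0.1, 0.4]]
g = [1.0, 2.0]

R = neumann_solve(M, g)
residual = [r - mr - gi for r, mr, gi in zip(R, mat_vec(M, R), g)]
print("R =", R)
print("residual of (Id - M) R = g:", residual)
```

Every term of the series is nonnegative here, which is the discrete counterpart of the positivity-preservation property invoked in the proof.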

6.2 Frequency Dependent Case

Let us come back to (3) with (10). Let us denote $f_+ = \max(f, 0)$. When not ambiguous, let us write $I_\nu(\mu)$ instead of $I_\nu(0,\mu)$.

Definition 1 An operator $A$ defined in $L^1_\mu(0,1)$ is non-accretive if

\[
\forall I^1, I^2 \in L^1_\mu(0,1), \qquad \int_0^1 \mu\,\big[(I^2 - I^1)_+ - (A(I^2) - A(I^1))_+\big]\,d\mu \ge 0. \tag{14}
\]



Remark 5 Obviously (14) follows from the same point-wise property: ∀(I 1 , I 2 ) (A(I 2 ) − A(I 1 ))+ ≤ (I 2 − I 1 )+ , μ > 0 . Then consider the difference of Eq. (3) with arguments (Iνi , T i ) i = 1, 2, multiplied by (μ2 /κ¯ ν )1 Iν2 >Iν1 . Using the same argument as in Golse (1987), one obtains the essential estimate:   d μ2 2 (Iν − Iν1 )+ = 0. ∀z, 0 < z < H, dz κ¯ ν Eventually one has: Theorem 2 Consider (3) with incoming data 0 ≤ Iν (H, −μ) ≤ Bν (TM ), depending on a given (but possibly large) temperature TM , and with outgoing data given by an accretive operator of the form Iν (0, μ)|μ>0 = A(Iν (0, −μ))|μ>0 . The problem has a unique well-defined non-negative solution. Proof The non-accretivity of the operator A is used to prove that  μ(Iν2 − Iν1 )+ z=0 ≤ 0.



Then since this quantity is constant, non-increasing and non-negative at z = H it must be equal to 0. The rest of the proof is independent of the albedo operator and follows the lines of Theorem 4.1 in Golse and Pironneau (2022).  Below are given two examples of albedo operators that combine an accommodation parameter α and the Earth’s reflection and thermalisation effect. Example 2 The simplest example would be given (cf. Remark 3) with 0 ≤ α ≤ 1 by the formula 


A(I (0, μ)) = α I (0, −μ) + (1 − α)

μ p I (0, −μ )dμ p > 0.


Example 3 In more realistic frequency-dependent cases, one should compare the reflection with the emission from the Earth as a global black body at an effective temperature $T_e \approx 288$ K. This suggests trying the albedo operator $A(I_\nu(0,-\mu)) = \alpha I_\nu(0,-\mu) + (1-\alpha)B_\nu(T_e)$, $\mu > 0$. This obviously satisfies the hypothesis of Theorem 2.

7 Calculus of Variations for the ν-dependent Case

The following argument sheds some light on the conditions needed to obtain cooling (resp. heating) from a local increase in the absorption coefficient.

Conjecture 1 Let $I^\varepsilon_\nu(\tau,\mu)$, $T^\varepsilon(\tau)$ be the solution to (3) when

\[
\kappa^\varepsilon_\nu := \kappa + \varepsilon\,\delta\kappa_\nu. \tag{15}
\]

Let $I^0_\nu(\tau,\mu)$, $T^0(\tau)$ be the solution to (3) with $\kappa_\nu = \kappa$ constant. When $\delta\kappa_\nu \approx \delta_{\nu^*}$, the Dirac mass at $\nu^*$, and $\kappa$ is small, then the sign of $dT/d\varepsilon(\tau)|_{\varepsilon=0}$ is governed by the sign of

\[
\frac12\int_{-1}^{1} I^0_{\nu^*}(\tau)\,d\mu - B_{\nu^*}(T^0(\tau)). \tag{16}
\]

7.1 Support of the Conjecture It will be convenient to define the Planck function in terms of the quantity given by the Stefan-Boltzmann law  := σ T 4 . Henceforth, we denote Bν (T ) = bν ().



We seek to study how the average radiation intensity or the temperature is altered if the absorption is modified on various intervals in the frequency variable. We expect that the simplest situation corresponds to absorption of the form (15) with $\kappa > 0$ independent of $\nu$, while $0 < \varepsilon \ll 1$. This is of course an extremely general formulation, but in practice one could think of δκν := κ1ν1 […]

H       M₁²       M₂²       M₃²       M⊕²       Ieff
1/4     2.75e-3   4.62e-1   1.21e-3   4.66e-1   6.53e1
1/8     1.74e-3   4.06e-1   3.07e-4   4.08e-1   6.11e1
1/16    1.09e-3   3.17e-1   7.61e-5   3.18e-1   5.40e1
1/32    7.17e-4   1.95e-1   1.88e-5   1.96e-1   4.23e1

Table 3 Contributions to error majorant (31): increasing number of iterations of Algorithm 1, corrector computed on fine mesh, i.e., H = h

n    M₁²       M₂²       M₃²       M⊕²       Ieff
2    1.26e0    4.27e-2   1.56e-1   1.46e0    1.70
4    5.91e-2   1.99e-3   8.63e-3   6.97e-2   1.72
6    3.57e-3   6.59e-4   4.65e-4   4.70e-3   1.86
8    6.55e-4   6.01e-4   2.57e-5   1.28e-3   2.66

\[
I_{\mathrm{eff}} := \frac{M_\oplus(y, v, f; 1, 1, 1)}{\|\nabla(u - v)\|_A},
\]

in the former case is very good even without optimization of the parameters $\varepsilon_i$, $i = 1, 2, 3$, and gets much worse in the latter case. This suggests the strategy of improving a flux corrector that has been computed from a global problem on a coarse mesh with mesh size $H$ by solving local subdomain problems on a fine mesh with mesh size $h$.

Next, we tested how the parts $M_1^2$, $M_2^2$, and $M_3^2$ of the majorant $M_\oplus^2$ change throughout the Schwarz alternating iterative process, i.e., for increasing iteration count $n$. The results are presented in Table 3.

Finally, we tested how the parts $M_1^2$ and $M_2^2$, which can be evaluated on different basic subdomains independently from each other, change as the number of iterations $n$ increases. As one can observe in Table 4, $M_1$ in particular indicates well which basic subdomains should be processed next by the Schwarz method in order to reduce the global error majorant effectively.

These experiments confirm that the new localized a posteriori error majorant provided by Theorem 2 (and estimate (31)) has great potential to become a powerful tool for reliable and cost-efficient iterative solution methods for (elliptic) PDEs by domain decomposition methods. Future investigations will deal with refining the proposed approach and adapting it to various classes of problems.


J. Kraus and S. Repin

Table 4 Contributions to volume terms of the error majorant (31) from different basic subdomains: increasing number of iterations of Algorithm 1, corrector computed on fine mesh, i.e., H = h

n    M²(1,ω₁)   M²(1,ω₂)   M²(1,ω₃)   M²(2,ω₁)   M²(2,ω₂)   M²(2,ω₃)
2    7.94e-1    3.10e-1    1.60e-1    1.80e-2    2.22e-2    2.50e-3
3    3.59e-2    6.18e-2    1.71e-1    7.60e-4    3.99e-3    3.10e-3
4    3.76e-2    1.31e-2    8.42e-3    8.78e-4    8.89e-4    2.23e-4
7    3.01e-4    3.67e-4    5.54e-4    3.14e-4    1.58e-4    1.39e-4
8    3.03e-4    2.55e-4    9.66e-5    3.15e-4    1.52e-4    1.34e-4

Acknowledgements The first author would like to thank Philip Lederer for his support with the high performance multiphysics finite element software Netgen/NGSolve that served as a platform to conduct the numerical tests.


Optimization and Control

Multi-criteria Problems and Incompleteness of Information in Structural Optimization Nikolay Banichuk and Svetlana Ivanova

Abstract In this paper, we consider the formulation of multi-criteria optimization and optimal structural design problems with incomplete information. We present both concrete mechanical statements of the considered problems and analytical solution methods. All investigations are carried out within the framework of problems with vector functionals and with uncertainty included in the important data. The problem formulations and the developed methods are based on the concept of Pareto optimality and on the worst-case scenario (minimax guaranteed approach). Some concrete examples of finding optimal structural shapes are presented, with certain details of the search methods used. Special attention is devoted to transforming optimal design problems with incomplete information and multi-criteria formulation into conventional structural optimization problems.

Keywords Multi-criteria optimization · Incomplete information · Uncertainties · Optimal structural design

1 Introduction

Optimal structural design and control are widely studied in mechanics. At present, most criteria and functionals are treated within a deterministic framework, under the assumption that the input data are complete. In these problems, the Eulerian formulation is used: one of the considered functionals is taken as the optimized criterion, and the remaining functionals are used as constraints (in the form of equalities and inequalities) imposed on the design variables. To solve such problems, we can apply conventional variational techniques and methods developed for the design optimization of systems with distributed parameters.

N. Banichuk · S. Ivanova, Ishlinsky Institute for Problems in Mechanics RAS, Prospekt Vernadskogo 101-1, Moscow 119526, Russia
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. P. Neittaanmäki and M.-L. Rantalainen (eds.), Impact of Scientific Computing on Science and Society, Computational Methods in Applied Sciences 58




A more general formulation reduces to the so-called "multi-criteria" or "vectorial" optimization problem. In this context, the concept of optimality changes. Consider the problem

$$J(h^*) = \min_{h \in \Lambda_h} J(h), \qquad J(h) = \{J_1(h), J_2(h), \ldots, J_n(h)\}^T,$$

which is characterized by $n$ quality criteria that depend on the design variable $h \in \Lambda_h$, where $\Lambda_h$ denotes the set of admissible designs. Optimality in the Pareto sense (Stadler 1988) for this problem is defined as follows. The design variable $h^* \in \Lambda_h$ is optimal in the Pareto sense if and only if there is no $h \in \Lambda_h$ such that

$$J_i(h) \le J_i(h^*) \quad \text{for } i = 1, 2, \ldots, n,$$

where

$$J_j(h) < J_j(h^*) \quad \text{for at least one number } j = 1, 2, \ldots, n.$$
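The Pareto definition above can be illustrated on a finite list of candidate designs. The following is a minimal sketch, not the authors' method: the function names `dominates` and `pareto_optimal` and the sample criteria vectors are hypothetical, and all criteria are assumed to be minimized.

```python
def dominates(a, b):
    """Return True if criteria vector `a` Pareto-dominates `b` (all criteria minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_optimal(designs, criteria):
    """Return the designs that no other admissible design dominates."""
    values = [criteria(h) for h in designs]
    return [
        h for h, v in zip(designs, values)
        if not any(dominates(w, v) for w in values)
    ]


# Hypothetical two-criteria example: each design is identified with its criteria vector.
candidates = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = pareto_optimal(candidates, criteria=lambda h: h)
print(front)  # (3.0, 3.0) is dropped because (2.0, 2.0) is better in both criteria
```

The remaining three designs are mutually non-dominated, which reflects the fact that a Pareto-optimal design is generally not unique.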

In other words, if the design variable $h^*$ is Pareto-optimal, it is impossible to further decrease the value of any of the criteria $J_i$ without simultaneously increasing the value of at least one other functional $J_j$ ($j \neq i$).

It took quite a long time for multi-criteria optimization to be applied to problems of structural mechanics and of technology in general. W. Stadler was the first to refer to the scientific application of the Pareto optimization concept, in the 1970s, and he published several papers, especially on natural shapes (Stadler 1988). Since the late 1970s, vector optimization has been increasingly integrated into problems of optimal design in papers by several scientists (Stadler 1986; Eschenauer et al. 1990; Banichuk et al. 2014; Banichuk and Ivanova 2017).

It is important to note that in many applications, structural design must take into account uncertainties in the problem data, such as the modulus of elasticity, the position and values of external forces, and other factors with incomplete information. Problems in the design of optimal structural systems with incomplete information are essentially different in both their formulation and research technique. In this paper, we use the "guaranteed" or, in other words, the "minimax" approach to optimization problems with incomplete information. The guaranteed approach is one of the feasible approaches to formulating and solving various problems with incomplete information (Banichuk 1976; Banichuk and Neittaanmäki 2008, 2010). In this approach, we assume that the set $\Lambda_\omega$, which contains all possible realizations of undesirable factors, structural imperfections, and external actions, is given. We then only need to determine a structural design (sizes, shape, topology) with the minimal value of the optimized functional (mass, cost) that satisfies all strength, geometric, performance, and reliability constraints for all possible values in the realizable set. Such a structural design is called optimal (in the guaranteed sense) if, for any structure with a smaller value of the optimized functional, it is possible to select a system of input parameters belonging to the admissible set ($\omega \in \Lambda_\omega$) such that some of the mentioned constraints are violated.

Note also that if we minimize some quality functional within the "worst-case scenario" (WCS) framework under given constraints and sets containing all admissible realizations of the uncertain parameters, then according to the WCS we must first maximize the functional with respect to the uncertain parameters, i.e. determine the worst realization of the input data that are unknown beforehand, and only then minimize the functional with respect to the design variable.

To be specific, in this paper we consider structural optimization problems. In Sects. 2 and 3, we use multi-criteria approaches, while in Sect. 4 we deal with the worst-case scenario in the presence of uncertainties. The problems presented in Sects. 3 and 4 are considered within the framework of the fracture mechanics of crack-containing structural elements.
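The two-stage order of the worst-case scenario (maximize over the uncertain parameters first, then minimize over the design) can be sketched for finite candidate sets. This is only an illustration under stated assumptions: the names `worst_case` and `guaranteed_design`, the functional `cost`, and all numbers are hypothetical and do not come from the chapter.

```python
def worst_case(functional, design, scenarios):
    """Inner step: the largest value of the functional over all admissible scenarios."""
    return max(functional(design, w) for w in scenarios)


def guaranteed_design(functional, designs, scenarios):
    """Outer step: the design whose worst-case value is smallest (minimax)."""
    return min(designs, key=lambda h: worst_case(functional, h, scenarios))


def cost(h, w):
    # Hypothetical functional: a load-dependent term w / h**2 plus a mass-like term 0.5 * h.
    return w / h**2 + 0.5 * h


designs = [1.0, 2.0, 3.0, 4.0]   # candidate values of the design variable
scenarios = [1.0, 4.0]           # admissible realizations of the uncertain parameter
best = guaranteed_design(cost, designs, scenarios)
print(best, worst_case(cost, best, scenarios))
```

In this toy setting the worst scenario is always the larger load, so the minimax design balances the load term against the mass-like term evaluated at that load.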

2 Prototype Problem

Consider the multi-criteria optimization of an elastic rod of given length $L$ and variable circular cross-section of area $S = S(x)$, which is taken as the design variable. The rod is clamped at the end $x = 0$ and free at the end $x = L$. It is made of a homogeneous elastic material with Young's modulus $E$ and density $\rho$. Various transverse static loads $q_i(x)$, $0 \le x \le L$, are applied to the rod in the considered problem. The loads are unilateral and belong to the given set $\Lambda_q$, i.e.

$$q \in \Lambda_q = \{q_1(x), q_2(x), \ldots, q_n(x) \mid q_i(x) \ge 0,\ 0 \le x \le L,\ i = 1, 2, \ldots, n\}. \tag{1}$$

The stiffness criterion $J_i$ of the rod is evaluated as the deflection $H_i$ of the rod at the free end $x = L$, which is a functional of the cross-section area distribution $S(x)$, $0 \le x \le L$:

$$J_i(S) = H_i(S) = \int_0^L \frac{\psi_i(x)}{D(S(x))}\,dx, \qquad D = EI = E\,\frac{S^2(x)}{4\pi},$$

$$\psi_i(x) = (L - x)\int_x^L (\xi - x)\,q_i(\xi)\,d\xi, \qquad i = 1, 2, \ldots, n.$$

The rod mass

$$J_M(S) = M(S) = \rho \int_0^L S(x)\,dx$$

is also taken as a component of the vector criterion. The cross-section area $S(x)$, $0 \le x \le L$, is considered to be a positive continuous function, i.e.

$$S \in \Lambda_S = \{S(x) \in C \mid S(x) \ge 0,\ x \in [0, L]\}. \tag{2}$$




Thus, the constructed vector functional $J$ consists of rigidity and mass components:

$$J = \{J_1(S), J_2(S), \ldots, J_n(S), J_M(S)\}^T. \tag{3}$$

The considered multi-criteria problem of rigidity and mass optimization for the loadings in (1) consists in finding the admissible distribution of the area $S(x)$ that minimizes the vector functional (3):

$$J(S) \to \min_{S(x) \in \Lambda_S}.$$

The optimal solution is written as

$$S^*(x) = \arg\min_{S(x)} J(S)$$

and is understood in the Pareto sense (Stadler 1988; Eschenauer et al. 1990), which means that there is no other cross-section area distribution $\tilde{S}(x)$ from (2) such that

$$J_1(\tilde{S}) \le J_1(S^*),\ \ldots,\ J_n(\tilde{S}) \le J_n(S^*), \qquad J_M(\tilde{S}) \le J_M(S^*) \tag{4}$$

and at least one of the inequalities in (4) is strict, i.e. $J_j(\tilde{S}) < J_j(S^*)$ for some $j = 1, \ldots, n$ or $J_M(\tilde{S}) < J_M(S^*)$.

As is known (see, e.g., Stadler 1988; Eschenauer et al. 1990; Banichuk et al. 2014; Banichuk and Ivanova 2017), the Pareto-optimal solution is not unique, and all the Pareto points that satisfy the considered constraints belong to a set called the Pareto front. Using the weighting method with coefficients $C_i$, $i = 1, 2, \ldots, n$, and $C_M$, we construct the weighted functional to be minimized:

$$J_C(S) = \sum_{i=1}^{n} C_i J_i(S) + C_M J_M(S) = \sum_{i=1}^{n} C_i \int_0^L \frac{\psi_i(x)}{D(S(x))}\,dx + C_M\,\rho \int_0^L S(x)\,dx.$$

The weighting coefficients in $J_C$ must satisfy the conditions

$$C_i \ge 0,\ i = 1, 2, \ldots, n, \qquad C_M \ge 0, \qquad C_M + \sum_{i=1}^{n} C_i = 1.$$

The necessary optimality condition $\delta J_C = 0$ is written in the form

$$\rho\,C_M - \frac{2\kappa_0}{S^3(x)} \sum_{i=1}^{n} C_i\,\psi_i(x) = 0, \tag{5}$$




where $\kappa_0 = 4\pi/E$. Using Eq. (5) with the assumption $C_M \neq 0$, we find the optimal multi-criteria cross-section area distribution

$$S^*(x) = \left[\frac{2\kappa_0}{\rho} \sum_{i=1}^{n} C_i^0\,\psi_i(x)\right]^{1/3}, \qquad C_i^0 = \frac{C_i}{C_M}.$$
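The closed-form distribution $S^*(x)$ can be evaluated numerically for given loads. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function names, the midpoint quadrature, and the sample data are hypothetical, and $\psi_i$ is taken as $(L - x)\int_x^L (\xi - x)\,q_i(\xi)\,d\xi$.

```python
import math


def psi(q, x, L, steps=2000):
    """psi(x) = (L - x) * integral from x to L of (xi - x) * q(xi) d xi (midpoint rule)."""
    if x >= L:
        return 0.0
    d = (L - x) / steps
    integral = 0.0
    for k in range(steps):
        xi = x + (k + 0.5) * d          # midpoint of the k-th subinterval
        integral += (xi - x) * q(xi) * d
    return (L - x) * integral


def optimal_area(x, loads, weights, C_M, E, rho, L):
    """S*(x) = [(2*kappa0/rho) * sum_i (C_i/C_M) * psi_i(x)]**(1/3), requires C_M != 0."""
    kappa0 = 4.0 * math.pi / E
    total = sum((C_i / C_M) * psi(q, x, L) for C_i, q in zip(weights, loads))
    return (2.0 * kappa0 / rho * total) ** (1.0 / 3.0)


# Hypothetical data: one uniform unit load, equal weights C_1 = C_M = 0.5,
# and E chosen so that kappa0 = 1. Then psi(x) = (L - x)**3 / 2 and S*(x) = L - x.
area = optimal_area(0.25, loads=[lambda xi: 1.0], weights=[0.5], C_M=0.5,
                    E=4.0 * math.pi, rho=1.0, L=1.0)
print(area)
```

For this uniform load the optimal area decreases linearly toward the free end, which gives a simple analytical check of the quadrature.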

3 Multi-criteria Optimization of a Beam with a Crack System

Consider a statically determinate beam with a rectangular cross-section of thickness $h$ and width $b$. The material of the beam is quasi-brittle. The applied transverse loads induce distributed bending moments $m = m(x)$, $0 \le x \le L$, which are considered given. The beam is supposed to contain surface cracks. These cracks are rectilinear and have a small length

$$l_i = l(x_i) < \frac{h}{2}.$$

In addition, the cracks are normal to the surface $y = -h/2$, as shown in Fig. 1.

Fig. 1 System of cracks on the surface of the beam

The stress intensity factors $K_{1i}$ in the vicinity of the crack tips and the mass $M$ of the beam are taken as components of the considered vector functional $J$. We have

$$J_i(h) = K_{1i}(h) = 1.12\sqrt{\pi l_i}\,(\sigma_x)_{x = x_i} = \kappa_1\,\frac{m(x_i)\,l^{1/2}(x_i)\,(h/2 - l(x_i))}{b h^3},$$

$$J_M(h) = M(h) = \rho b L h,$$

where $\kappa_1 = 13.44\sqrt{\pi}$, $\sigma_x(x_i) = 12\,m_i\,(h/2 - l_i)/(b h^3)$, and $\rho$ is the material density of the beam. The multi-criteria problem of optimizing the mass and the stress intensity in the vicinity of the crack tips consists in searching for the thickness $h$ that minimizes the vector functional

$$J = \{J_1(h), J_2(h), \ldots, J_n(h), J_M(h)\}^T \to \min_h.$$

To find the Pareto-optimal beam thickness, we apply the objective weighting method. Objective weighting is one of the most efficient methods for vector optimization problems. It permits a preference formulation that is independent of the individual minima for positive weights, and it also guarantees that all points lie on the efficient boundary for convex problems. The preference or utility function $J_C(h)$ is determined here by a linear combination of the criteria $J_1, J_2, \ldots, J_n$ and $J_M$ with the corresponding weighting factors $C_i$, $i = 1, 2, \ldots, n$, and $C_M$:

$$J_C(h) = \sum_{i=1}^{n} C_i J_i(h) + C_M J_M(h) = \frac{\kappa_1}{b h^3} \sum_{i=1}^{n} C_i\,m_i\,l_i^{1/2}\,(h/2 - l_i) + C_M\,\rho b L h.$$

It is usually assumed that

$$C_M \ge 0, \qquad C_i \ge 0,\ i = 1, 2, \ldots, n; \qquad C_M + \sum_{i=1}^{n} C_i = 1.$$

Using the stationarity condition $\delta J_C(h) = 0$, we obtain

$$C_M\,\rho b L = \frac{3\kappa_1}{b h^4} \sum_{i=1}^{n} C_i\,m_i\,l_i^{1/2}\,(h/2 - l_i).$$

Consequently, we have

$$h = \left[\frac{3\kappa_1}{\rho b^2 L} \sum_{i=1}^{n} \frac{C_i}{C_M}\,m_i\,l_i^{1/2}\,(h/2 - l_i)\right]^{1/4}$$

as the Pareto-optimal value of the thickness $h$.




4 Optimization of Shell with Uncertainties in Damage Characteristics

In this section, we consider an elastic shell whose middle surface is a surface of revolution. The position of the meridian plane is defined by the angle $\theta_C$, measured from the datum meridian plane, and the position of the parallel circle is defined by the angle $\varphi_C$ (Fig. 2) between the normal to the surface and the axis of rotation. The position of the parallel circle is also defined by the coordinate $x$ measured along the axis of rotation: $0 \le x \le L$, where $L$ is a given value. The meridian plane and the plane perpendicular to the meridian are the planes of principal curvature. The geometry of the shell is defined by the shape of the middle surface. We restrict our consideration to an axially symmetric middle surface (the profile of each cross-section of the shell is a circle) and use the distance $r(x)$ from the axis of rotation to a point on the middle surface as the variable describing the shape of the middle surface. The variable $r(x)$ and the shell thickness $h(x)$ can be considered as design variables.

Fig. 2 Position and orientation of crack on the shell

The shell is loaded by axisymmetric forces: the internal pressure $q$ acting in the direction of the normal to the meridian, and longitudinal loads with the resultant $R$. It is assumed that a penetrating crack is formed in the shell during its manufacturing or operation and that the shell material is quasi-brittle. The crack is supposed to be rectilinear, and its length $l$ is very small with respect to the geometrical dimensions of the shell, with $l \le l_m$, where $l_m > 0$ is a given parameter. The safety condition, as is well known from quasi-brittle fracture mechanics, is written in the form $K_1 < K_{1C}$ (Kanninen and Popelar 1985; Griffith 1921; Irwin and Paris 1971; Hutchinson 1983). Here, $K_1$ is the stress intensity factor that occurs in the asymptotic representations of the stresses near the crack tip (in the opening mode), and $K_{1C}$ is a given quasi-brittle strength constant (toughness of the material). Note that the stress intensity factor expression

$$K_1 = \begin{cases} \sigma_n \sqrt{\pi l/2}, & \sigma_n > 0, \\ 0, & \sigma_n \le 0, \end{cases}$$

can be used for the crack under consideration when the penetrating crack is small enough and far from the boundaries of the shell. Here $\sigma_n$ is the normal stress of the uncracked shell at the location of the crack. The subscript $n$ means that the stress acts in the direction of the normal to the crack faces.

It is assumed that the possible location of cracks caused by manufacturing or operation is unknown beforehand. In the context of structural design, this leads to essential complications in the computation of $K_1$, due to the need to analyze a variety of crack locations and directions. Let us consider only the "internal" (not close to the shell boundary) penetrating cracks and characterize the crack with the vector

$$\omega = (l, x_C, \alpha)$$

containing the length of the crack $l$, the coordinate $x_C$ of the center of the crack, and the angle $\alpha$ that sets the inclination of the crack with respect to the meridian. The coordinate $\theta_C$ of the center of the crack is nonessential and is omitted because we consider axisymmetric problems and take into account all the locations of the crack in the parallel direction ($0 \le \theta_C \le 2\pi$). If $\alpha = 0$, the crack is parallel to the meridian (axial crack), and if $\alpha = \pi/2$, the crack is parallel to the perimeter (peripheral crack). The crack can be characterized by any vector $\omega$ from a given set $\Lambda_\omega$, i.e.

$$\omega \in \Lambda_\omega = \left\{(l, x_C, \alpha) \;\middle|\; 0 \le l \le l_m,\ 0 < \delta \le x_C \le L - \delta,\ 0 \le \alpha \le \frac{\pi}{2}\right\},$$

where $\delta$ is a given small value. Given the incompleteness of the information on the possible location of the cracks, we can rewrite the safety constraint $K_1 < K_{1C}$ as follows:

$$\max_{\omega \in \Lambda_\omega} K_1 < K_{1C}.$$

Taking into account (6)–(8), we obtain that the maximum of $K_1$ is reached when

$$l = l_m, \qquad \sigma_n = \max_{\alpha} \sigma_n(\alpha, x).$$

The following property holds. The maximum of $K_1$ with respect to $\alpha$ is attained when $\alpha$ takes one of two values: $\alpha = 0$ (axial crack) or $\alpha = \pi/2$ (peripheral crack). To prove this statement, we note that in the membrane theory of thin shells of revolution, there are three non-zero stress tensor components: two normal stresses $\sigma_\varphi$, $\sigma_\theta$ and one shear stress $\sigma_{\varphi\theta}$. Considering that the shape of the shell and the applied external loads are axially symmetric, and using the equilibrium equation, we see that the shear stress $\sigma_{\varphi\theta}$ is zero and the only non-zero stresses are the normal stresses $\sigma_\varphi$, $\sigma_\theta$ (Timoshenko 1956; Timoshenko and Woinowsky-Krieger 1959). This means that the normal membrane stresses $\sigma_\varphi$, $\sigma_\theta$ are the principal ones and, consequently, we have

$$\sigma_\varphi \le \sigma_n \le \sigma_\theta \quad \text{if } \sigma_\varphi \le \sigma_\theta, \qquad \sigma_\theta \le \sigma_n \le \sigma_\varphi \quad \text{if } \sigma_\theta \le \sigma_\varphi.$$

Thus, we have

$$\max_{\omega \in \Lambda_\omega} K_1 = \max_{x} \sqrt{\pi l_m/2}\;\sigma_m(x),$$

$$\sigma_m(x) = \max\{\sigma_\theta(x), \sigma_\varphi(x)\}, \qquad \sigma_\theta(x) = (\sigma_n(\alpha, x))_{\alpha = 0}, \qquad \sigma_\varphi(x) = (\sigma_n(\alpha, x))_{\alpha = \pi/2}.$$

We then take as a functional the value

$$J_m = J_m(h) = \max_{0 \le x \le L} \sigma_m(x),$$

which characterizes the maximum of the stress intensity of the shell under consideration. The material mass of the shell

$$J_M = J_M(h) = 2\pi\rho \int_0^L r h \left[1 + \left(\frac{dr}{dx}\right)^2\right]^{1/2} dx$$

is also taken into account as a component of the optimized vector functional

$$J = \{J_m(h), J_M(h)\}^T \to \min_h.$$
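The property that the worst crack orientation is either axial or peripheral can be checked numerically. The sketch below assumes the standard stress rotation for a principal stress state with zero shear, $\sigma_n(\alpha) = \sigma_\theta \cos^2\alpha + \sigma_\varphi \sin^2\alpha$, which is consistent with $\sigma_n = \sigma_\theta$ at $\alpha = 0$ and $\sigma_n = \sigma_\varphi$ at $\alpha = \pi/2$ but is not written out in the text. The function names and the numerical values are hypothetical.

```python
import math


def normal_stress(alpha, sigma_theta, sigma_phi):
    """Assumed rotation formula: stress normal to a crack inclined at angle alpha."""
    return sigma_theta * math.cos(alpha) ** 2 + sigma_phi * math.sin(alpha) ** 2


def stress_intensity(sigma_n, l):
    """K1 = sigma_n * sqrt(pi * l / 2) for tensile sigma_n, and 0 otherwise."""
    return sigma_n * math.sqrt(math.pi * l / 2.0) if sigma_n > 0 else 0.0


def worst_case_K1(sigma_theta, sigma_phi, l_m, samples=181):
    """Scan alpha over [0, pi/2] at the maximal crack length l_m."""
    alphas = [0.5 * math.pi * k / (samples - 1) for k in range(samples)]
    return max(
        stress_intensity(normal_stress(a, sigma_theta, sigma_phi), l_m)
        for a in alphas
    )


# Hypothetical stresses (in consistent units) and maximal crack length.
sigma_theta, sigma_phi, l_m = 50.0, 120.0, 0.002
scanned = worst_case_K1(sigma_theta, sigma_phi, l_m)
endpoint = math.sqrt(math.pi * l_m / 2.0) * max(sigma_theta, sigma_phi)
print(scanned, endpoint)
```

The scan over all orientations returns the same value as the endpoint formula $\sqrt{\pi l_m/2}\,\max\{\sigma_\theta, \sigma_\varphi\}$, so only the two extreme orientations need to be examined in the design problem.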

The problem of multi-criteria optimization is reduced to the conventional problem of mass minimization

$$J_M(h^*) = \min_h J_M(h)$$

under the additional geometric constraint $h \ge h_0$ (where $h_0$ is a given positive value) and the strength constraint

$$J_m \le \sigma_* = \frac{K_{1\varepsilon}}{\sqrt{\pi l_m/2}}, \qquad K_{1\varepsilon} = K_{1C} - \varepsilon,$$

where $\varepsilon > 0$ is small. In what follows, we suppose that the internal pressure is $q = 0$ and that the shell is loaded with distributed axisymmetric forces applied to its ends $x = 0$ and $x = L$. The resultant forces $R_1$ (at $x = 0$) and $R_2$ (at $x = L$) are directed along the axis of the optimized shell. We have the following expressions for the normal membrane stresses $\sigma_\varphi$, $\sigma_\theta$:


$$\sigma_\varphi(x) = \frac{R_1}{2\pi r(x) h(x)} \left[1 + \left(\frac{dr}{dx}\right)^2\right]^{1/2}, \qquad \sigma_\theta(x) = \frac{R_1}{2\pi h(x)}\,\frac{d^2 r}{dx^2} \left[1 + \left(\frac{dr}{dx}\right)^2\right]^{-1/2}.$$

Using the strength conditions, which can be written as

$$\max_x \sigma_\varphi \le \sigma_*, \qquad \max_x \sigma_\theta \le \sigma_*,$$

and the geometric constraint $h \ge h_0$, we find the optimal design in the form

$$h = \max\left\{ h_0,\ \frac{R_1}{2\pi\sigma_* r} \left[1 + \left(\frac{dr}{dx}\right)^2\right]^{1/2},\ \frac{R_1}{2\pi\sigma_*}\,\frac{d^2 r}{dx^2} \left[1 + \left(\frac{dr}{dx}\right)^2\right]^{-1/2} \right\}. \tag{9}$$

The max operation in (9) takes the maximum of three values for any fixed $x \in [0, L]$. To illustrate the solution (9), we consider the case where the middle surface of the shell is conical (see Fig. 3) and $r(x)$ is a linear function: $r(x) = a + kx$, $0 \le x \le L$. In this case, the third value in (9) vanishes, and the optimal thickness distribution is given by the formula

$$h(x) = \max\left\{ h_0,\ \frac{R_1 \sqrt{1 + k^2}}{2\pi\sigma_* (a + kx)} \right\}$$

and is shown in Fig. 3 by the dashed line.

Fig. 3 Case of conical shell
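The conical-shell case can be evaluated directly. The sketch below is an illustration under stated assumptions: it takes the middle surface as $r(x) = a + kx$, so that $d^2r/dx^2 = 0$ and only the meridional stress is active, and it uses hypothetical function names and parameter values.

```python
import math


def conical_thickness(x, a, k, R1, sigma_star, h0):
    """h(x) = max{h0, R1*sqrt(1 + k^2) / (2*pi*sigma_star*(a + k*x))} for r(x) = a + k*x."""
    r = a + k * x
    required = R1 * math.sqrt(1.0 + k * k) / (2.0 * math.pi * sigma_star * r)
    return max(h0, required)


def meridional_stress(x, a, k, R1, h):
    """sigma_phi(x) = R1 / (2*pi*r*h) * sqrt(1 + (dr/dx)^2) with dr/dx = k."""
    r = a + k * x
    return R1 * math.sqrt(1.0 + k * k) / (2.0 * math.pi * r * h)


# Hypothetical data in consistent units.
a, k, R1, sigma_star, h0 = 1.0, 1.0, 2.0 * math.pi, 1.0, 0.1
h_at_1 = conical_thickness(1.0, a, k, R1, sigma_star, h0)
print(h_at_1, meridional_stress(1.0, a, k, R1, h_at_1))
```

Wherever the strength constraint is active, the meridional stress equals $\sigma_*$; where the radius is large enough, the geometric bound $h_0$ takes over and the stress stays below $\sigma_*$.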



5 Conclusion

In this paper, the problems of optimal structural design were investigated using multi-criteria formulations in cases where there is incomplete information or uncertainty about the essential factors (parameters). The basic examinations were performed in the framework of structural mechanics and fracture mechanics. The appearance of arbitrary penetrating cracks whose position, length, and direction are unknown beforehand has been taken into account. The possibility of transforming multi-criteria optimization problems and optimal design problems with incomplete information into conventional structural optimization problems has been established.

Acknowledgements This study is partially supported by the Ministry of Science and Higher Education within the framework of the Russian State Assignment under contract No 123021700050-1 and partially supported by RFBR Grant 20-08-00082a.

References

Banichuk NV (1976) Minimax approach to structural optimization problems. J Optim Theory Appl 20(1):111–127
Banichuk NV, Ivanova SY (2017) Structural design: Contact problems and high-speed penetration. Walter de Gruyter, Berlin
Banichuk NV, Jeronen J, Neittaanmäki P, Saksa T, Tuovinen J (2014) Mechanics of moving materials. Springer
Banichuk NV, Neittaanmäki P (2008) Incompleteness of information and reliable optimal design. Evolutionary and deterministic methods for design. Optimization and control: applications to industrial and societal problems (EUROGEN 2007, Jyväskylä). CIMNE, Barcelona, pp 29–38
Banichuk NV, Neittaanmäki P (2010) Structural optimization with uncertainties. Springer, Berlin
Eschenauer H, Koski J, Osyczka A (1990) Multicriteria optimization: Fundamentals and motivation. In: Eschenauer H, Koski J, Osyczka A (eds) Multicriteria Design Optimization. Springer, Berlin, pp 1–32
Griffith AA (1921) The phenomena of rupture and flow in solids. Philos Trans Roy Soc Lond Ser A 221:163–198
Hutchinson JW (1983) Fundamentals of the phenomenological theory of nonlinear fracture mechanics. J Appl Mech 50(4b):1042–1051
Irwin GR, Paris PC (1971) Fundamental aspects of crack growth and fracture. In: Liebowitz H (ed) Fracture: an advanced treatise, Vol. III: engineering fundamentals and environmental effects. Academic, New York, pp 2–48
Kanninen MF, Popelar CH (1985) Advanced fracture mechanics. Oxford University Press, New York
Stadler W (1986) Multicriteria optimization in mechanics: a survey. Appl Mech Rev 37(2):277–286
Stadler W (1988) Fundamentals of multicriteria optimization. In: Multicriteria optimization in engineering and in the sciences. Plenum, New York, pp 1–25
Timoshenko S (1956) Strength of materials. Part II: advanced theory and problems, 3rd edn. D. Van Nostrand, Princeton, NJ
Timoshenko S, Woinowsky-Krieger S (1959) Theory of plates and shells, 2nd edn. McGraw-Hill, New York

Stability in Discrete Games with Perturbed Payoffs Vladimir A. Emelichev, Marko M. Mäkelä, and Yury V. Nikulin

Abstract We consider a parametric equilibrium concept in a finite cooperative game of several players in normal form with discrete strategy choice. This concept is defined by partitioning the set of players into coalitions. Two extreme cases of such a partition correspond to Pareto optimal and Nash equilibrium outcomes, respectively. The game is characterized by a matrix in which each element is subject to independent perturbations. That is, the set of perturbing matrices is formed by a set of additive matrices, with two arbitrary Hölder norms independently specified in the outcome and criterion spaces. We undertake a post-optimal analysis of the so-called stability property. Analytical bounds are found for the supreme levels of such perturbations.

Keywords Multicriteria optimization · Nash equilibrium · Stability radius · Hölder norm · Coalitional games

1 Introduction

Rapid development of various fields of information technology, economics, and the social sphere requires corresponding development in the fields of system analysis, management, and operations research. One of the main problems arising in this direction is multicriteria decision making in the presence of conflict, uncertainty, and risk. An effective tool for modeling decision-making processes is the apparatus of mathematical game theory.

Vladimir Emelichev passed away in May 2022 at the age of 92.

V. A. Emelichev, Belarusian State University, Nezavisimosti 4, 220030 Minsk, Belarus
M. M. Mäkelä · Y. V. Nikulin, University of Turku, Vesilinnantie 5, 20014 Turku, Finland
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053




Game-theoretic models target finding classes of outcomes that are rationally coordinated in terms of possible actions and interests of participants (players) or a group of participants (coalitions). For each game in normal form, coalitional and noncoalitional equilibrium concepts (principles of optimality) are used, which usually lead to different game outcomes. In the theory of non-antagonistic games there is no single approach to the development of such concepts. The most famous one is the concept of the Nash equilibrium (Nash 1950, 1951), as well as its various generalizations related to the problems of group choice, which is understood as the reduction of various individual preferences into a single collective preference. This work implements a parametrization of the equilibrium concept of a finite game in normal form. The core of this parametrization is the method of dividing players into coalitions. The two extreme cases (a single coalition of players and a set of single-player coalitions) correspond to the Pareto optimal outcome and the Nash equilibrium outcome. Quantitative stability analysis for the set of all efficient (generalized equilibrium) outcomes is carried out. Usually stability in multicriteria discrete optimization is defined as a discrete analog of upper semicontinuity property for a vector valued choice function (Aubin 1990; Sergienko and Shylo 2003). The semicontinuity property implies the existence of a neighborhood in the space of game parameters such that no new efficient outcomes appear when perturbations belong to that neighborhood. In this work, the analytical bounds for the stability radius are found for the game with the given partition of players into coalitions under the assumption that the arbitrary Hölder norms are defined in the space of outcomes and criteria space. 
Note that analogous quantitative characteristics of the various stability types of multicriteria parameterized problems of game theory and discrete linear programming problems with other principles of optimality, stability types, and metrics defined in the space of parameters were obtained in a series of works (see, e.g., Emelichev and Karelkina 2021, 2009; Emelichev et al. 2014; Emelichev and Kuzmin 2006; Emelichev and Nikulin 2019, 2020; Nikulin et al. 2013). The apparatus of stability quantifying measures was also used in applications to the analysis of risk and uncertainty in financial markets (Korotkov et al. 2020; Korotkov and Wu 2020).

The paper is organized as follows. In Sect. 2, we formulate parametric optimality and introduce basic definitions along with the notation. Section 3 contains three auxiliary properties and one lemma used later in the proof of the main result. In Sect. 4, we formulate and prove the main result regarding the stability radius. Section 5 provides a list of important corollaries. Concluding remarks and a discussion of computational perspectives appear in Sect. 6.

2 Basic Definitions and Notations

We consider the main object of study in game theory, a finite game of $n$ players in normal form (Osborne and Rubinstein 1994), where each player $i \in N_n = \{1, 2, \ldots, n\}$, $n \ge 2$, has a set of outcomes $X_i \subset \mathbb{R}$, $2 \le |X_i| < \infty$. The outcome of the game is a realization of the strategies chosen by all the players. This choice is made by the players independently. Let linear payoff functions be given as follows:

$$f_i(x) = C_i x, \quad i \in N_n,$$

where $C_i$ is the $i$-th row of a square matrix $C = [c_{ij}] \in \mathbb{R}^{n \times n}$ and the decision vectors $x = (x_1, x_2, \ldots, x_n)^T$ are defined on the set of all possible outcomes of the game

$$X = \prod_{j \in N_n} X_j \subset \mathbb{R}^n,$$

i.e. the Cartesian product of the sets $X_j$, $j \in N_n$. As a result of the game, which we call the game with matrix $C$, each player $i$ gains a payoff $f_i(x)$, which the player tries to maximize using preference relationships. We assume that all players try to maximize their own payoffs simultaneously:

$$Cx = (C_1 x, C_2 x, \ldots, C_n x)^T \to \max_{x \in X}. \tag{1}$$


A non-empty subset $J \subseteq N_n$ is called a coalition of players. For a coalition $J$ and a game outcome $x^0 = (x_1^0, x_2^0, \ldots, x_n^0)^T$, we introduce the set

$$V(x^0, J) = \prod_{j \in N_n} V_j(x^0, J),$$

where

$$V_j(x^0, J) = \begin{cases} X_j & \text{if } j \in J, \\ \{x_j^0\} & \text{if } j \in N_n \setminus J. \end{cases}$$

Thus, $V(x^0, J)$ is the set of outcomes that are reachable by coalition $J$ from the outcome $x^0$. It is clear that $V(x^0, N_n) = X$ and $V_k(x^0, \{k\}) = X_k$ for any $x^0 \in X$, $k \in N_n$.

Further, we use the binary relation of Pareto preference $\prec$ (Pareto 1909) in the space $\mathbb{R}^k$ of arbitrary dimension $k \in \mathbb{N}$. We assume that for two different vectors $y = (y_1, y_2, \ldots, y_k)^T$ and $y' = (y'_1, y'_2, \ldots, y'_k)^T$ in this space the following formula is valid:

$$y \prec y' \iff y \le y' \text{ and } y \neq y'.$$

The symbol $\nprec$, as usual, denotes the negation of $\prec$.

Let $s \in N_n$, and let $N_n = \bigcup_{k \in N_s} J_k$ be a partition of the set $N_n$ into $s$ nonempty sets (coalitions), i.e. $J_k \neq \emptyset$, $k \in N_s$, and $p \neq q \Rightarrow J_p \cap J_q = \emptyset$. The set of $(J_1, J_2, \ldots, J_s)$-efficient outcomes is introduced according to the formula

$$G(C, J_1, J_2, \ldots, J_s) = \left\{ x \in X \mid \forall k \in N_s,\ \forall x' \in V(x, J_k):\ C_{J_k} x \nprec C_{J_k} x' \right\}, \tag{2}$$


V. A. Emelichev et al.

where $C_{J_k}$ is the $|J_k| \times n$ submatrix of matrix $C$ consisting of the rows that correspond to players in coalition $J_k$. For brevity, we denote this set by $G(C)$. Thus, preference relations between players within each coalition are based on Pareto dominance. Therefore, the set of all $N_n$-efficient outcomes $G(C, N_n)$ ($s = 1$, i.e. all players are united in one coalition) is the Pareto set of game (1) (the set of efficient outcomes) (Pareto 1909):
$$P(C) = \{ x \in X \mid X(x, C) = \emptyset \},$$
where
$$X(x, C) = \{ x' \in X \mid C x \prec C x' \}.$$
Rationality of a cooperative-efficient outcome $x \in P(C)$ means that an increase of the payoff of any player is possible only by decreasing the payoff of at least one of the other players. In the other extreme case, when $s = n$, $G(C, \{1\}, \{2\}, \ldots, \{n\})$ becomes the set of Nash equilibria (Nash 1950, 1951). This set is denoted by $NE(C)$ and defined as follows:
$$NE(C) = \bigl\{ x \in X \mid \nexists k \in N_n,\ \nexists x' \in X \text{ such that } C_k x < C_k x' \text{ and } x_{N_n \setminus \{k\}} = x'_{N_n \setminus \{k\}} \bigr\},$$
where $x_{N_n \setminus \{k\}}$ is the projection of vector $x \in X$ onto the coordinate axes of the space $\mathbb{R}^n$ with numbers from the set $N_n \setminus \{k\}$. It is easy to see that rationality of the Nash equilibrium means that no player can gain by individually deviating from their own equilibrium strategy while the others keep playing their equilibrium strategies. Strict axioms regarding perfect and common (shared) knowledge are assumed to be fulfilled (Osborne and Rubinstein 1994).

Thus, we have just introduced a parametrization of the equilibrium concept for a finite game in normal form. The parameter of this parametrization is the partitioning of all the players into coalitions $J = (J_1, J_2, \ldots, J_s)$, in which the two extreme cases (a single coalition of players and a set of $n$ single-player coalitions) correspond to finding the Pareto optimal outcomes $P(C)$ and the Nash equilibrium outcomes $NE(C)$, respectively. We denote by $Z(C, J_1, J_2, \ldots, J_s)$ the game that consists in finding the set $G(C, J_1, J_2, \ldots, J_s)$. Sometimes, for brevity, we use $Z(C)$ to denote the problem.

Without loss of generality, we assume that the elements of the partitioning $N_n = \bigcup_{k \in N_s} J_k$ are defined as follows:
$$J_1 = \{1, 2, \ldots, t_1\}, \quad J_2 = \{t_1 + 1, t_1 + 2, \ldots, t_2\}, \quad \ldots, \quad J_s = \{t_{s-1} + 1, t_{s-1} + 2, \ldots, n\}.$$
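The parametrized efficiency concept can be made concrete by brute-force enumeration on a tiny game. The following is a minimal sketch, not the authors' procedure: it evaluates definition (2) directly, players are indexed from 0 (an implementation convention), and the matrix `C` together with the helper names `dominated`, `payoffs`, `reachable`, and `efficient_outcomes` are hypothetical illustrations.

```python
from itertools import product

def dominated(y, y_prime):
    """True if y ≺ y': y <= y' componentwise and y != y'."""
    return all(a <= b for a, b in zip(y, y_prime)) and tuple(y) != tuple(y_prime)

def payoffs(C, rows, x):
    """Payoff vector C_J x restricted to the players (rows) of coalition J."""
    return tuple(sum(C[i][j] * x[j] for j in range(len(x))) for i in rows)

def reachable(X_sets, x, J):
    """V(x, J): outcomes that coalition J can reach from outcome x."""
    choices = [X_sets[j] if j in J else [x[j]] for j in range(len(x))]
    return product(*choices)

def efficient_outcomes(C, X_sets, partition):
    """G(C, J_1, ..., J_s) computed directly from definition (2)."""
    result = []
    for x in product(*X_sets):
        if all(
            not dominated(payoffs(C, J, x), payoffs(C, J, x_alt))
            for J in partition
            for x_alt in reachable(X_sets, x, J)
        ):
            result.append(x)
    return result

# Hypothetical 2-player Boolean game (a linear prisoner's-dilemma-like example).
C = [[1, -2], [-2, 1]]
X_sets = [[0, 1], [0, 1]]
pareto = efficient_outcomes(C, X_sets, [[0, 1]])   # s = 1: Pareto set P(C)
nash = efficient_outcomes(C, X_sets, [[0], [1]])   # s = n: Nash set NE(C)
```

In this example the two extreme partitions give disjoint answers: the single coalition keeps three outcomes, while the partition into singletons keeps only the outcome in which both players choose strategy 1.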



For any $k \in N_s$ let $C^k$ be the square submatrix of size $|J_k| \times |J_k|$ consisting of the elements of the matrix $C$ located at the intersection of rows and columns with the numbers $J_k$. Determine the Pareto set:
$$P(C^k) = \{ z \in X_{J_k} \mid X(z, C^k) = \emptyset \}.$$
Here
$$X(z, C^k) = \{ z' \in X_{J_k} \mid C^k z \prec C^k z' \}.$$
The $|J_k|$-criteria problem $Z(C^k)$ is defined as
$$C^k z \to \max_{z \in X_{J_k}},$$
where $z = (z_1, z_2, \ldots, z_{|J_k|})^T$, and $X_{J_k}$ is the projection of $X$ onto $J_k$, i.e.
$$X_{J_k} = \prod_{j \in J_k} X_j \subset \mathbb{R}^{|J_k|}.$$
This problem is called a partial problem of the game $Z(C, J_1, J_2, \ldots, J_s)$. Due to the fact that the linear payoff functions $C_i x$, $i \in N_n$, are separable, according to (2), the following equality is valid:
$$G(C, J_1, J_2, \ldots, J_s) = \prod_{k \in N_s} P(C^k). \tag{3}$$
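The decomposition (3) suggests a cheaper way to enumerate efficient outcomes: solve each partial problem separately and combine the results. The sketch below is illustrative only; it assumes consecutive coalitions as in the text (indexed from 0), and the matrix and the helper names `apply`, `pareto_set`, and `efficient_by_blocks` are hypothetical.

```python
from itertools import product

def dominated(y, y_prime):
    """True if y ≺ y': y <= y' componentwise and y != y'."""
    return all(a <= b for a, b in zip(y, y_prime)) and tuple(y) != tuple(y_prime)

def apply(M, z):
    """Matrix-vector product M z as a tuple."""
    return tuple(sum(m * v for m, v in zip(row, z)) for row in M)

def pareto_set(Ck, X_block):
    """P(C^k): outcomes of the partial problem that no other outcome dominates."""
    Z = list(product(*X_block))
    return [z for z in Z if not any(dominated(apply(Ck, z), apply(Ck, w)) for w in Z)]

def efficient_by_blocks(C, X_sets, partition):
    """G(C, J_1, ..., J_s) as the Cartesian product of the sets P(C^k), cf. (3)."""
    blocks = []
    for J in partition:
        Ck = [[C[i][j] for j in J] for i in J]          # diagonal block C^k
        blocks.append(pareto_set(Ck, [X_sets[j] for j in J]))
    # Concatenate the block outcomes; valid because coalitions are consecutive.
    return [sum(parts, ()) for parts in product(*blocks)]

# Hypothetical 3-player Boolean game with coalitions {0, 1} and {2}.
C = [[1, -2, 5],
     [-2, 1, 7],
     [4, 4, -3]]
X_sets = [[0, 1]] * 3
G = efficient_by_blocks(C, X_sets, [[0, 1], [2]])
```

Note that the off-diagonal entries 5, 7, 4, 4 never enter the computation, in line with the observation that only the diagonal blocks matter.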



According to the definition of $(J_1, J_2, \ldots, J_s)$-efficiency in a game defined by the matrix $C \in \mathbb{R}^{n \times n}$, only the block-diagonal elements $C^1, C^2, \ldots, C^s$ matter. Thus, the set of $(J_1, J_2, \ldots, J_s)$-efficient outcomes of the game $Z(C, J_1, J_2, \ldots, J_s)$ will be denoted by
$$G(\tilde{C}, J_1, J_2, \ldots, J_s),$$
where $\tilde{C} = \{C^1, C^2, \ldots, C^s\}$.

In the space $\mathbb{R}^k$ of arbitrary dimension, we define the Hölder norm $l_p$, $p \in [1, \infty]$, i.e. the norm of the vector $a = (a_1, a_2, \ldots, a_k)^T \in \mathbb{R}^k$ is
$$\|a\|_p = \begin{cases} \Bigl( \sum_{j \in N_k} |a_j|^p \Bigr)^{1/p} & \text{if } 1 \le p < \infty, \\ \max\{ |a_j| \mid j \in N_k \} & \text{if } p = \infty. \end{cases}$$
The norm of a matrix $C \in \mathbb{R}^{k \times k}$ with the rows $C_i$, $i \in N_k$, is defined as the norm of the vector whose components are the norms of the rows of the matrix $C$. Thus, we have
$$\|C\|_{pq} = \bigl\| \bigl( \|C_1\|_p, \|C_2\|_p, \ldots, \|C_k\|_p \bigr) \bigr\|_q,$$



where $l_q$, $q \in [1, \infty]$, is another Hölder norm, i.e. $l_q$ may differ from $l_p$ in the general case. It is easy to see that for any $p, q \in [1, \infty]$ and for any $i \in N_n$ we have
$$\|C_i\|_p \le \|C\|_{pq}.$$
The norm of the matrix bundle $\tilde{C} = \{C^1, C^2, \ldots, C^s\}$, $C^k \in \mathbb{R}^{|J_k| \times |J_k|}$, $k \in N_s$, is defined as follows:
$$\|\tilde{C}\|_{\max} = \max\{ \|C^k\|_{pq} \mid k \in N_s \}.$$
Perturbation of the elements of the matrix bundle $\tilde{C}$ is imposed by adding a perturbing matrix bundle $\tilde{B} = \{B^1, B^2, \ldots, B^s\}$, where $B^k \in \mathbb{R}^{|J_k| \times |J_k|}$ are matrices with rows $B_i^k$, $i \in J_k$, $k \in N_s$. Thus, the set of $(J_1, J_2, \ldots, J_s)$-efficient outcomes of the perturbed game is hereafter denoted by
$$G(\tilde{C} + \tilde{B}, J_1, J_2, \ldots, J_s).$$
For an arbitrary number $\varepsilon > 0$, we define a bundle of perturbing matrices
$$\Omega(\varepsilon) = \Bigl\{ \tilde{B} \in \prod_{k \in N_s} \mathbb{R}^{|J_k| \times |J_k|} \;\Big|\; \|\tilde{B}\|_{\max} < \varepsilon \Bigr\}.$$
The stability radius (in the terminology of Sergienko and Shylo (2003), $T_3$-stability; see also Lebedeva et al. (2021)) of the game $Z(C, J_1, J_2, \ldots, J_s)$ is a number defined as follows:
$$\rho = \rho_{pq}(J_1, J_2, \ldots, J_s) = \begin{cases} \sup \Xi & \text{if } \Xi \ne \emptyset, \\ 0 & \text{if } \Xi = \emptyset, \end{cases}$$
where
$$\Xi = \bigl\{ \varepsilon > 0 \mid \forall \tilde{B} \in \Omega(\varepsilon):\ G(\tilde{C} + \tilde{B}) \subseteq G(\tilde{C}) \bigr\}.$$
The game $Z(C, J_1, J_2, \ldots, J_s)$ is called $T_3$-stable or, simply, stable if $\rho_{pq}(J_1, J_2, \ldots, J_s) > 0$. Formally, the radius is defined for the problem $Z(C, J_1, J_2, \ldots, J_s)$, which depends on $C$ and $J$. For brevity of notation, we drop $C$ from the list of arguments and consider it a default parameter of the radius.

Obviously, if $G(\tilde{C}) = X$, then $G(\tilde{C} + \tilde{B}) \subseteq G(\tilde{C})$ for any perturbing matrix bundle $\tilde{B} \in \Omega(\varepsilon)$, where $\varepsilon > 0$. In this case, the stability radius is not bounded from above. For this reason, the problem with $\bar{G}(\tilde{C}) = X \setminus G(\tilde{C}) \ne \emptyset$ is called non-trivial.
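As a small numerical companion to the norm definitions above, the following sketch computes $\|a\|_p$, $\|C\|_{pq}$, and the bundle norm $\|\tilde{C}\|_{\max}$. The function names and the sample matrices are hypothetical and serve only as an illustration.

```python
import math

def holder_norm(a, p):
    """Hölder norm l_p of a vector, p in [1, inf]."""
    if p == math.inf:
        return max(abs(v) for v in a)
    return sum(abs(v) ** p for v in a) ** (1.0 / p)

def matrix_norm(C, p, q):
    """||C||_pq: the l_q norm of the vector of l_p norms of the rows of C."""
    return holder_norm([holder_norm(row, p) for row in C], q)

def bundle_norm_max(bundle, p, q):
    """||C~||_max: the largest ||C^k||_pq over the blocks of the bundle."""
    return max(matrix_norm(Ck, p, q) for Ck in bundle)

# Hypothetical blocks: both rows of C1 have Euclidean length 5.
C1 = [[3, -4], [0, 5]]
C2 = [[-7]]
norm_2_inf = matrix_norm(C1, 2, math.inf)            # max of the row norms
norm_2_1 = matrix_norm(C1, 2, 1)                     # sum of the row norms
bundle = bundle_norm_max([C1, C2], 2, math.inf)
```

The inequality $\|C_i\|_p \le \|C\|_{pq}$ can be checked on the same data, since each row norm of `C1` does not exceed either matrix norm.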



3 Auxiliary Lemma and Statements

In the outcome space $\mathbb{R}^n$, along with the norm $l_p$, $p \in [1, \infty]$, we will use the conjugate norm $l_{p^*}$, where the numbers $p$ and $p^*$ are connected, as usual, by the equality
$$\frac{1}{p} + \frac{1}{p^*} = 1,$$
assuming $p^* = 1$ if $p = \infty$, and $p^* = \infty$ if $p = 1$. Therefore, we further suppose that the range of variation of the numbers $p$ and $p^*$ is the closed interval $[1, \infty]$, and the numbers themselves are connected by the above conditions.

Further, we use the well-known Hölder inequality
$$|a^T b| \le \|a\|_p \|b\|_{p^*}, \tag{4}$$
which is true for any two vectors $a = (a_1, a_2, \ldots, a_n)^T \in \mathbb{R}^n$ and $b = (b_1, b_2, \ldots, b_n)^T \in \mathbb{R}^n$. It is easy to see that for any $a = (a_1, a_2, \ldots, a_n)^T \in \mathbb{R}^n$ with
$$|a_j| = \alpha, \quad j \in N_n,$$
the following equality holds:
$$\|a\|_v = \alpha n^{1/v} \quad \text{for any } v \in [1, \infty]. \tag{5}$$
Directly from (3), the following lemma, similar to one in Emelichev and Nikulin (2020), follows.

Lemma 1 The outcome $x = (x_1, x_2, \ldots, x_n)^T \in X$ is $(J_1, J_2, \ldots, J_s)$-efficient, i.e.
$$x \in G(\tilde{C}, J_1, J_2, \ldots, J_s),$$
if and only if for any index $k \in N_s$
$$x_{J_k} \in P(C^k).$$
Hereinafter, $x_{J_k}$ is the projection of the vector $x = (x_1, x_2, \ldots, x_n)^T$ onto the coordinate axes of $X$ with coalition numbers $J_k$.

Denote
$$K(\tilde{C}) = K(\tilde{C}, J_1, J_2, \ldots, J_s) = \{ k \in N_s \mid P(C^k) \ne X_{J_k} \}.$$
The following claims can easily be proven using Lemma 1.

Proposition 1 The game $Z(C)$ is non-trivial if and only if the set $K(\tilde{C})$ is non-empty.



Proposition 2 The outcome $x^0 \in X$ is not $(J_1, J_2, \ldots, J_s)$-efficient in the game $Z(C)$, i.e.
$$x^0 \notin G(\tilde{C}, J_1, J_2, \ldots, J_s),$$
if and only if there exists an index $r \in K(\tilde{C})$ such that
$$x^0_{J_r} \notin P(C^r).$$

Proposition 3 If the game $Z(C)$ is non-trivial, then for any $k \notin K(\tilde{C})$ we have
$$P(C^k) = X_{J_k}.$$

A game in which each player's strategy is antagonistic, i.e. $X_j = \{0, 1\}$, $j \in N_n$, is called Boolean and denoted by $Z^B(C)$.

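4 Main Result

For the non-trivial game $Z(C, J_1, J_2, \ldots, J_s)$, $C \in \mathbb{R}^{n \times n}$, $n \ge 2$, $s \in N_n$, and any $p, q \in [1, \infty]$, we define
$$\varphi = \varphi_p(J_1, J_2, \ldots, J_s) = \min_{k \in K(\tilde{C})} \ \min_{z \notin P(C^k)} \ \max_{z' \in X(z, C^k)} \ \min_{i \in J_k} \frac{C_i^k (z' - z)}{\|z' - z\|_{p^*}},$$
$$\psi = \psi_{pq}(J_1, J_2, \ldots, J_s) = \min_{k \in K(\tilde{C})} \ \min_{z \notin P(C^k)} \ \max_{z' \in X(z, C^k)} \ \min_{i \in J_k} |J_k|^{\frac{1}{p} + \frac{1}{q}} \frac{C_i^k (z' - z)}{\|z' - z\|_1},$$
$$\|\tilde{C}\|_{\min} = \min\{ \|C^k\|_{pq} \mid k \in K(\tilde{C}) \}.$$
It is obvious that $\varphi, \psi \ge 0$.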

Theorem 1 For any $p, q \in [1, \infty]$, $C \in \mathbb{R}^{n \times n}$, $n \ge 2$, and any coalition partition $(J_1, J_2, \ldots, J_s)$, $s \in N_n$, the stability radius $\rho_{pq}(J_1, J_2, \ldots, J_s)$ of the non-trivial game $Z(C, J_1, J_2, \ldots, J_s)$ has the following bounds:
$$\varphi \le \rho_{pq}(J_1, J_2, \ldots, J_s) \le \|\tilde{C}\|_{\min}.$$
If the game is Boolean, i.e. $Z(C) = Z^B(C)$, then
$$\varphi \le \rho_{pq}(J_1, J_2, \ldots, J_s) \le \min\{ \psi, \|\tilde{C}\|_{\min} \}.$$

Proof Due to Proposition 1, the set $K(\tilde{C})$ is non-empty. First, we prove the inequality $\rho \ge \varphi$. We assume that $\varphi > 0$, otherwise the statement of the theorem is obvious. Let $\tilde{B} = \{B^1, B^2, \ldots, B^s\}$ be a perturbing matrix bundle that belongs to the set $\Omega(\varphi)$,



and let $x^0 \notin G(\tilde{C}, J_1, J_2, \ldots, J_s)$. In order to prove the inequality $\rho \ge \varphi$, it is enough to show that $x^0 \notin G(\tilde{C} + \tilde{B}, J_1, J_2, \ldots, J_s)$.

According to the definition of the positive number $\varphi$, we have
$$\forall k \in K(\tilde{C}) \ \ \forall z \notin P(C^k): \quad \max_{z' \in X(z, C^k)} \ \min_{i \in J_k} \frac{C_i^k (z' - z)}{\|z' - z\|_{p^*}} \ge \varphi. \tag{6}$$
Due to Proposition 2, there exists an index $r \in K(\tilde{C})$ such that $x^0_{J_r} \notin P(C^r)$. Therefore, from (6) we conclude that there exists a vector $z^0 \in X(x^0_{J_r}, C^r)$ such that for any index $i \in J_r$ the following inequalities hold:
$$\frac{C_i^r (z^0 - x^0_{J_r})}{\|z^0 - x^0_{J_r}\|_{p^*}} \ge \varphi > \|\tilde{B}\|_{\max} \ge \|B^r\|_{pq} \ge \|B_i^r\|_p.$$
Taking into account the Hölder inequality (4), from the above we get
$$(C_i^r + B_i^r)(z^0 - x^0_{J_r}) = C_i^r (z^0 - x^0_{J_r}) + B_i^r (z^0 - x^0_{J_r}) \ge C_i^r (z^0 - x^0_{J_r}) - \|B_i^r\|_p \, \|z^0 - x^0_{J_r}\|_{p^*} > 0, \quad i \in J_r.$$
This implies $z^0 \in X(x^0_{J_r}, C^r + B^r)$, i.e. $x^0_{J_r} \notin P(C^r + B^r)$. Thus, again using Proposition 2, we conclude that $x^0 \notin G(\tilde{C} + \tilde{B}, J_1, J_2, \ldots, J_s)$. As a result, the formula
$$\forall \tilde{B} \in \Omega(\varphi): \quad G(\tilde{C} + \tilde{B}, J_1, J_2, \ldots, J_s) \subseteq G(\tilde{C}, J_1, J_2, \ldots, J_s)$$
is valid, and hence $\rho \ge \varphi$.

In order to prove the upper bound, it is sufficient to show that for any index $k \in K(\tilde{C})$ the inequality $\rho \le \|C^k\|_{pq}$ holds. Assuming $k \in K(\tilde{C})$ and $\varepsilon > \|C^k\|_{pq}$, we construct a bundle of perturbing matrices $\tilde{B} = \{B^1, B^2, \ldots, B^s\}$ as follows:
$$B^i = \begin{cases} -C^k & \text{if } i = k, \\ 0^{|J_i| \times |J_i|} & \text{if } i \in N_s \setminus \{k\}, \end{cases}$$
where $0^{|J_i| \times |J_i|}$ is the zero matrix of size $|J_i| \times |J_i|$. Then we deduce
$$\|\tilde{B}\|_{\max} = \|B^k\|_{pq} = \|C^k\|_{pq}, \qquad \tilde{B} \in \Omega(\varepsilon),$$
and
$$P(C^k + B^k) = X_{J_k}, \qquad G(\tilde{C} + \tilde{B}) \not\subseteq G(\tilde{C}).$$
Thus the following formula holds:
$$\forall k \in K(\tilde{C}) \ \ \forall \varepsilon > \|C^k\|_{pq} \ \ \exists \tilde{B} \in \Omega(\varepsilon): \quad G(\tilde{C} + \tilde{B}) \not\subseteq G(\tilde{C}).$$


Therefore, the stability radius of the game $Z(C, J_1, J_2, \ldots, J_s)$ cannot be larger than any $\|C^k\|_{pq}$, $k \in K(\tilde{C})$, i.e. $\rho_{pq}(J_1, J_2, \ldots, J_s) \le \|\tilde{C}\|_{\min}$.

Further, we consider the case when the game is Boolean, i.e. $Z(C) = Z^B(C)$. The inequality $\varphi \le \rho_{pq}(J_1, J_2, \ldots, J_s) \le \|\tilde{C}\|_{\min}$, proven earlier for the game $Z(C)$, obviously stays valid also for the game $Z^B(C)$. Now let us prove
$$\rho = \rho_{pq}(J_1, J_2, \ldots, J_s) \le \psi = \psi_{pq}(J_1, J_2, \ldots, J_s).$$
According to the definition of the number $\psi$, there exist $r \in K(\tilde{C})$ and $z^0 \notin P(C^r)$ such that for any $z \in X(z^0, C^r)$ there exists an index $l \in J_r$ such that
$$\psi \|z - z^0\|_1 \ge |J_r|^{\frac{1}{p} + \frac{1}{q}} \, C_l^r (z - z^0). \tag{7}$$
Assuming $\varepsilon > \psi$, we construct a bundle of perturbing matrices $\tilde{B} = \{B^1, B^2, \ldots, B^s\} \in \mathbb{R}^{n \times n}$ with elements $b_{ij}$ as follows:
$$b_{ij} = \begin{cases} -\delta & \text{if } (i, j) \in J_r \times J_r,\ z_j^0 = 0, \\ \delta & \text{if } (i, j) \in J_r \times J_r,\ z_j^0 = 1, \\ 0 & \text{if } (i, j) \notin J_r \times J_r, \end{cases}$$
where
$$\psi < \delta |J_r|^{\frac{1}{p} + \frac{1}{q}} < \varepsilon. \tag{8}$$
Using Eq. (5), from the above we obtain
$$\|B_i^r\|_p = \delta |J_r|^{\frac{1}{p}}, \quad i \in J_r, \qquad \|\tilde{B}\|_{\max} = \|B^r\|_{pq} = \delta |J_r|^{\frac{1}{p} + \frac{1}{q}}.$$
Here and further $B_i^r$, $i \in J_r$, is a row of the matrix $B^r \in \mathbb{R}^{|J_r| \times |J_r|}$. Moreover, we deduce the evident equalities
$$B_i^r (z - z^0) = -\delta \|z - z^0\|_1, \quad i \in J_r. \tag{9}$$
Therefore, due to (7) and (8), we conclude
$$(C_l^r + B_l^r)(z - z^0) \le \bigl( \psi |J_r|^{-\frac{1}{p} - \frac{1}{q}} - \delta \bigr) \|z - z^0\|_1 < 0.$$
Hence, $z \notin X(z^0, C^r + B^r)$. It is obvious that for any $z' \in X_{J_r} \setminus X(z^0, C^r)$ there exists an index $v \in J_r$ such that
$$C_v^r (z' - z^0) < 0.$$



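Using (9), we obtain
$$(C_v^r + B_v^r)(z' - z^0) = C_v^r (z' - z^0) - \delta \|z' - z^0\|_1 < 0,$$
i.e. $z' \notin X(z^0, C^r + B^r)$. Thus, we have $X(z^0, C^r + B^r) = \emptyset$, i.e. $z^0 \in P(C^r + B^r)$, and therefore $P(C^r + B^r) \not\subseteq P(C^r)$. So, it follows that
$$\forall \varepsilon > \psi \ \ \exists \tilde{B} \in \Omega(\varepsilon): \quad G(\tilde{C} + \tilde{B}) \not\subseteq G(\tilde{C}).$$
Finally, we have $\rho \le \psi$.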
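For very small Boolean games, the three quantities that bound the stability radius in Theorem 1 can be evaluated by direct enumeration. The sketch below is only an illustration under stated assumptions: players are indexed from 0, the sample matrix is hypothetical, and the function names (`theorem1_bounds`, `ratio`, and the other helpers) are not taken from the source. The enumeration is exponential in the coalition sizes, which matches the remark that the bounds have an enumerative structure.

```python
import math
from itertools import product

def norm(a, p):
    """Hölder norm l_p of a vector, p in [1, inf]."""
    if p == math.inf:
        return max(abs(v) for v in a)
    return sum(abs(v) ** p for v in a) ** (1.0 / p)

def conjugate(p):
    """Conjugate exponent p* with 1/p + 1/p* = 1."""
    if p == 1:
        return math.inf
    return 1.0 if p == math.inf else p / (p - 1.0)

def apply(M, z):
    """Matrix-vector product M z as a tuple."""
    return tuple(sum(m * v for m, v in zip(row, z)) for row in M)

def dominated(y, y_prime):
    """True if y ≺ y': y <= y' componentwise and y != y'."""
    return all(a <= b for a, b in zip(y, y_prime)) and tuple(y) != tuple(y_prime)

def ratio(Ck, z, w, r):
    """min_i C_i^k (w - z) / ||w - z||_r for an outcome w that dominates z."""
    diff = [wi - zi for wi, zi in zip(w, z)]
    return min(apply(Ck, diff)) / norm(diff, r)

def theorem1_bounds(C, partition, p, q):
    """Return (phi, psi, ||C~||_min) for a Boolean game; all inf if trivial."""
    phi = psi = c_min = math.inf
    for J in partition:
        Ck = [[C[i][j] for j in J] for i in J]
        Z = list(product([0, 1], repeat=len(J)))
        # dom[z] lists the outcomes of X(z, C^k), i.e. those dominating z.
        dom = {z: [w for w in Z if dominated(apply(Ck, z), apply(Ck, w))] for z in Z}
        if not any(dom.values()):
            continue  # P(C^k) = X_{J_k}, so k does not belong to K
        c_min = min(c_min, norm([norm(row, p) for row in Ck], q))
        scale = len(J) ** (1.0 / p + 1.0 / q)
        for z, ws in dom.items():
            if ws:  # z is not in P(C^k)
                phi = min(phi, max(ratio(Ck, z, w, conjugate(p)) for w in ws))
                psi = min(psi, scale * max(ratio(Ck, z, w, 1) for w in ws))
    return phi, psi, c_min

# Hypothetical 2-player Boolean game, Chebyshev norms p = q = inf.
C = [[1, -2], [-2, 1]]
pareto_bounds = theorem1_bounds(C, [[0, 1]], math.inf, math.inf)
nash_bounds = theorem1_bounds(C, [[0], [1]], math.inf, math.inf)
```

In this example the lower bound $\varphi$ and the Boolean upper bound $\psi$ coincide for both partitions, so the bounds of Theorem 1 pin down the radius exactly, while $\|\tilde{C}\|_{\min}$ is not tight for the single coalition.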

5 Other Results

First we focus on the stability radius $\rho_{pq}(N_n)$ of the non-trivial 1-coalitional (also referred to as non-coalitional) game $Z(C, N_n)$ of finding the Pareto set $P(C)$. The first result is well known (cf. Emelichev et al. 2002).

Corollary 1 Given $p, q \in [1, \infty]$, $C \in \mathbb{R}^{n \times n}$, $n \ge 2$, the stability radius of the non-trivial game $Z(C, N_n)$ of finding the Pareto set $P(C)$ has the following lower and upper bounds:
$$\min_{x \notin P(C)} \ \max_{x' \in X(x, C)} \ \min_{i \in N_n} \frac{C_i (x' - x)}{\|x' - x\|_{p^*}} \le \rho_{pq}(N_n) \le \|C\|_{pq}.$$

For the Boolean 1-coalitional game $Z^B(C, N_n)$ of finding the Pareto set $P(C)$, the lower and upper bounds can be specified more accurately.

Corollary 2 Given $p, q \in [1, \infty]$, $C \in \mathbb{R}^{n \times n}$, $n \ge 2$, the stability radius of the non-trivial Boolean game $Z^B(C, N_n)$ of finding the Pareto set $P(C)$ has the following lower and upper bounds:
$$\varphi_p(N_n) \le \rho_{pq}(N_n) \le n^{\frac{1}{p} + \frac{1}{q}} \varphi_p(N_n),$$
where
$$\varphi_p(N_n) = \min_{x \notin P(C)} \ \max_{x' \in X(x, C)} \ \min_{i \in N_n} \frac{C_i (x' - x)}{\|x' - x\|_{p^*}}.$$

From the above, another well-known result, $\rho_{\infty\infty}(N_n) = \varphi_\infty(N_n)$, also follows in the Boolean case (see again Emelichev et al. 2002). That is, in the Boolean case and with Chebyshev norms, the gap between the lower and upper bounds vanishes, and both bounds reduce to the same formula:


$$\rho_{\infty\infty}(N_n) = \min_{x \notin P(C)} \ \max_{x' \in X(x, C)} \ \min_{i \in N_n} \frac{C_i (x' - x)}{\|x' - x\|_1}.$$

Now we specify one known result (see Bukhtoyarov and Emelichev 2006) that follows from Theorem 1:

Corollary 3 For $p = q = \infty$, any $C \in \mathbb{R}^{n \times n}$, $n \ge 2$, and any coalition partition $(J_1, J_2, \ldots, J_s)$, $s \in N_n$, the stability radius $\rho_{\infty\infty}(J_1, J_2, \ldots, J_s)$ of the non-trivial Boolean game $Z^B(C, J_1, J_2, \ldots, J_s)$ of finding the set of efficient outcomes $G(\tilde{C}, J_1, J_2, \ldots, J_s)$ is expressed by the formula
$$\rho_{\infty\infty}(J_1, J_2, \ldots, J_s) = \varphi_\infty(J_1, J_2, \ldots, J_s) = \psi_{\infty\infty}(J_1, J_2, \ldots, J_s) = \min_{k \in K(\tilde{C})} \ \min_{z \notin P(C^k)} \ \max_{z' \in X(z, C^k)} \ \min_{i \in J_k} \frac{C_i^k (z' - z)}{\|z' - z\|_1}.$$
Finally, we focus on the stability radius $\rho_{pq}(\{1\}, \{2\}, \ldots, \{n\})$ of the non-trivial ($K(\tilde{C}) \ne \emptyset$, recall Proposition 1) game $Z(C, \{1\}, \{2\}, \ldots, \{n\})$ of finding the Nash set $NE(C)$. The next corollary follows directly from Eq. (3).

Corollary 4 An outcome $x^0 = (x_1^0, x_2^0, \ldots, x_n^0)^T \in X$ of the game with matrix $C \in \mathbb{R}^{n \times n}$ is a Nash equilibrium, i.e. $x^0 \in NE(C)$, if and only if the equilibrium strategy of every player $i \in N_n$ is defined as follows:
$$x_i^0 = \begin{cases} \max\{ x_i \mid x_i \in X_i \} & \text{if } c_{ii} > 0, \\ \min\{ x_i \mid x_i \in X_i \} & \text{if } c_{ii} < 0, \\ x_i \in X_i & \text{if } c_{ii} = 0. \end{cases}$$

Therefore, it is obvious that
$$K(\tilde{C}) = K(\tilde{C}, \{1\}, \{2\}, \ldots, \{n\}) = \{ k \in N_n \mid c_{kk} \ne 0 \}.$$
Besides that, in the case of the non-coalitional game ($s = n$), the inequality $P(c_{kk}) \ne X_k$ implies $c_{kk} \ne 0$.

Corollary 5 For any $p, q \in [1, \infty]$, $C \in \mathbb{R}^{n \times n}$, $n \ge 2$, for the stability radius of the non-trivial ($K(\tilde{C}) \ne \emptyset$) game $Z(C, \{1\}, \{2\}, \ldots, \{n\})$ of finding the Nash set $NE(C)$ the formula
$$\rho = \rho_{pq}(\{1\}, \{2\}, \ldots, \{n\}) = \min\{ |c_{kk}| \mid k \in K(\tilde{C}) \} \tag{10}$$
is valid.

Proof From Theorem 1 we get the bounds
$$\varphi \le \rho \le \|\tilde{C}\|_{\min},$$




where
$$\varphi = \varphi_p(\{1\}, \{2\}, \ldots, \{n\}) = \min_{k \in K(\tilde{C})} \ \min_{z \notin P(c_{kk})} \ \max_{z' \in X(z, c_{kk})} \frac{c_{kk} (z' - z)}{\|z' - z\|_{p^*}},$$
$$\|\tilde{C}\|_{\min} = \min\{ |c_{kk}| \mid k \in K(\tilde{C}) \}.$$
Therefore, taking into account Corollary 4, for any $k \in K(\tilde{C})$, $z \notin P(c_{kk})$ and $z' \in X(z, c_{kk})$ the following equalities are valid:
$$\frac{c_{kk} (z' - z)}{\|z' - z\|_{p^*}} = |c_{kk}| > 0.$$
Hence, (10) is true.

From Corollary 5 it follows that the non-trivial game $Z(C, \{1\}, \{2\}, \ldots, \{n\})$ is always stable.
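Corollaries 4 and 5 translate into a few lines of code, since both the Nash set and its stability radius depend only on the diagonal of $C$. The following sketch uses 0-based player indices; the matrix, the strategy sets, and the function names are hypothetical, and the value `inf` for a trivial game reflects the remark that the radius is then unbounded.

```python
from itertools import product

def nash_set(C, X_sets):
    """NE(C) built from Corollary 4: the sign of c_ii fixes player i's strategy."""
    choices = []
    for i, X_i in enumerate(X_sets):
        if C[i][i] > 0:
            choices.append([max(X_i)])
        elif C[i][i] < 0:
            choices.append([min(X_i)])
        else:
            choices.append(sorted(X_i))   # any strategy is an equilibrium choice
    return list(product(*choices))

def nash_stability_radius(C):
    """Formula (10): the smallest |c_kk| over the non-zero diagonal entries."""
    diagonal = [abs(C[k][k]) for k in range(len(C)) if C[k][k] != 0]
    return min(diagonal) if diagonal else float("inf")

# Hypothetical 3-player game; player 1 has three strategies and c_11 = 0.
C = [[1, -2, 0],
     [-2, 0, 3],
     [4, 4, -3]]
X_sets = [[0, 1], [0, 1, 2], [0, 1]]
equilibria = nash_set(C, X_sets)
radius = nash_stability_radius(C)
```

Here player 1 is indifferent, so the Nash set contains three outcomes, and the radius is determined by the smallest non-zero diagonal entry in absolute value.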

6 Conclusion

In summary, we emphasize that the bounds for the stability radius proven in the paper can be utilized as an analytical tool by game players when deciding on coalition formation. Based on stability analysis, the players may decide whether their coalitions are robust in the case when the game matrix is subject to potential perturbations. However, the bounds are mostly of theoretical value owing to their enumerative structure. The difficulty of exact stability radius calculation is a long-standing challenge pointed out in Karelkina et al. (2011). In practical applications, one can try to obtain a reasonable approximation of the bounds using meta-heuristics, e.g., evolutionary algorithms (see, e.g., Karelkina et al. 2011) or Monte Carlo simulation. Another possibility to continue research in this direction is to use the nonsmooth optimization methods developed in Bagirov et al. (2014), Mäkelä (2002), Mäkelä and Neittaanmäki (1992) within branch and bound algorithms in order to compute bounds efficiently, possibly utilizing Lagrangian relaxation techniques (see, e.g., Gaudioso 2020).

Acknowledgements On the 70th anniversary, the authors send their best greetings and sincere wishes to the jubilee celebrant. We wish Pekka Neittaanmäki many more years of a prosperous life and good health.



References

Aubin J-P, Frankowska H (1990) Set-valued analysis. Birkhäuser, Basel
Bagirov A, Karmitsa N, Mäkelä M (2014) Introduction to nonsmooth optimization: theory, practice and software. Springer, Cham
Bukhtoyarov S, Emelichev V (2006) Measure of stability for a finite cooperative game with a parametric optimality principle (from Pareto to Nash). Comput Math Math Phys 46(7):1193–1199
Emelichev V, Girlich E, Nikulin Yu, Podkopaev D (2002) Stability and regularization of vector problems of integer linear programming. Optimization 51(4):645–676
Emelichev V, Karelkina O (2009) Finite cooperative games: parametrisation of the concept of equilibrium (from Pareto to Nash) and stability of the efficient situation in the Hölder metric. Discret Math Appl 19(3):229–236
Emelichev V, Karelkina O (2021) Postoptimal analysis of a finite cooperative game. Buletinul Academiei de Stiinte a Moldovei. Matematica 1–2(95–96):121–136
Emelichev V, Kotov V, Kuzmin K, Lebedeva T, Semenova N, Sergienko T (2014) Stability and effective algorithms for solving multiobjective discrete optimization problems with incomplete information. J Autom Inf Sci 46(2):27–41
Emelichev V, Kuzmin K (2006) Stability radius of an efficient solution of a vector problem of integer linear programming in the Gölder metric. Cybern Syst Anal 42(4):609–614
Emelichev V, Nikulin Yu (2019) On the quasistability radius for a multicriteria integer linear programming problem of finding extremum solutions. Cybern Syst Anal 55(6):949–957
Emelichev V, Nikulin Yu (2020) Finite games with perturbed payoffs. In: Olenev N, Evtushenko Y, Khachay M, Malkova V (eds) Advances in optimization and applications: 11th international conference, OPTIMA 2020, Revised Selected Papers. Communications in computer and information science, vol 1340. Springer, Cham, pp 158–169
Gaudioso M (2020) A view of Lagrangian relaxation and its applications. In: Bagirov A, Gaudioso M, Karmitsa N, Mäkelä M, Taheri S (eds) Numerical nonsmooth optimization: state of the art algorithms. Springer, Cham, pp 579–617
Karelkina O, Nikulin Y, Mäkelä M (2011) An adaptation of NSGA-II to the stability radius calculation for the shortest path problem. TUCS Technical Report 1017, Turku Centre for Computer Science
Korotkov V, Emelichev V, Nikulin Y (2020) Multicriteria investment problem with Savage’s risk criteria: theoretical aspects of stability and case study. J Ind Manag Optim 16(3):1297–1310
Korotkov V, Wu D (2020) Evaluating the quality of solutions in project portfolio selection. Omega 91:102029
Lebedeva T, Semenova N, Sergienko T (2021) Stability kernel of a multicriteria optimization problem under perturbations of input data of the vector criterion. Cybern Syst Anal 57(4):578–583
Mäkelä M (2002) Survey of bundle methods for nonsmooth optimization. Optim Methods Softw 17(1):1–29
Mäkelä M, Neittaanmäki P (1992) Nonsmooth optimization: analysis and algorithms with applications to optimal control. World Scientific, Singapore
Nash J (1950) Equilibrium points in n-person games. Proc Natl Acad Sci 36(1):48–49
Nash J (1951) Non-cooperative games. Ann Math 54(2):286–295
Nikulin Yu, Karelkina O, Mäkelä M (2013) On accuracy, robustness and tolerances in vector Boolean optimization. Eur J Oper Res 224(3):449–457
Osborne M, Rubinstein A (1994) A course in game theory. MIT Press
Pareto V (1909) Manuel d’économie politique. Giard & Brière, Paris
Sergienko I, Shylo V (2003) Discrete optimization problems: challenges, solution techniques, and analysis. Naukova Dumka, Kiev (in Russian)

Optimal Control Problems in Nonsmooth Solid and Fluid Mechanics: Computational Aspects

Jaroslav Haslinger and Raino A. E. Mäkinen

Abstract The paper is devoted to the numerical realization of nonsmooth optimal control problems in solid and fluid mechanics, with special emphasis on contact shape optimization and parameter identification in fluid flow models. Nonsmoothness is usually owing to the state constraint, typically given by an inequality type problem governing the optimized system. To remove the nonsmooth character, which complicates numerical realization, the penalization/regularization of the state constraint is used. The resulting optimal control problem becomes smooth and can be solved by standard methods of smooth optimization. This approach is illustrated with a parameter identification in the system driven by the Stokes equation with threshold slip boundary conditions.

Keywords Threshold slip boundary conditions · Stokes system with slip conditions · Parameter identification in flow models

2010 Mathematics Subject Classification Primary: 49J20, 35J86, 65K15; Secondary: 65N30, 90C30

J. Haslinger
Department of Mathematical Analysis and Applications of Mathematics, Palacký University Olomouc, 17. listopadu 1192, 779 00 Olomouc, Czech Republic

R. A. E. Mäkinen
Faculty of Information Technology, University of Jyväskylä, P.O. Box 35, 40014 Jyväskylä, Finland

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. P. Neittaanmäki and M.-L. Rantalainen (eds.), Impact of Scientific Computing on Science and Society, Computational Methods in Applied Sciences 58

1 Introduction

The contribution deals with a special class of nonsmooth optimal control problems in solid and fluid mechanics with emphasis on computational aspects. By nonsmoothness we mean that the resulting objective functional as a function of control




variables is nondifferentiable in the classical sense. This may happen, among other cases, if an inequality type state relation drives the optimized system. Precisely this case will be considered in what follows. An appropriate discretization of the continuous setting leads to a constrained minimization of a continuous, nondifferentiable function on a convex set of admissible discrete controls. A direct application of standard gradient type minimization methods usually either fails due to wrong gradient information or does not give satisfactory results. Thus special methods of nonsmooth optimization should be used (Outrata 1998). Such methods need to be supplied with function values and values of a generalized gradient, which takes the place of the classical gradient of the minimized function. For convex functions (a rare case in optimal control) the generalized gradient is given by their subgradient (Ekeland and Temam 1976). For Lipschitz continuous functions, Clarke’s generalized gradient can be used (Clarke 1983). Since these methods are especially tailored for minimizing nonsmooth functions, they usually give better results than the classical (smooth) ones (Outrata 1998). On the other hand, and in contrast to classical gradient methods, their nonsmooth counterparts use more sophisticated tools of set-valued analysis. To fully exploit the capabilities of nonsmooth algorithms, a user should have some knowledge of their inner structure. Last but not least, providing the generalized gradient information required by nonsmooth methods is not an easy task either. On account of these facts it is not surprising that numerical methods of nonsmooth optimization are not widely used by engineering communities. Fortunately, there exist ways to overcome these difficulties by transforming nonsmooth optimal control problems into smooth ones.
Since we suppose that nonsmoothness results solely from a state constraint, a natural idea arises, namely to replace it by a sequence of “close” smooth equations. A smooth penalization of constraints and a regularization of nonsmooth terms, if they appear in the state constraint, are the most common techniques for doing that. The resulting system of equations depends on a small penalization/regularization parameter ε destined to tend to zero. Its solutions (under appropriate assumptions) converge to the solution of the original state constraint as ε → 0+ (Duvaut 1976; Glowinski 1984). To justify this approach, the mutual relation between solutions to the optimal control problem with the original state constraint and the ones using penalized/regularized states has to be established.

By the special class of nonsmooth optimization problems mentioned at the beginning of this section we mean contact shape optimization. Contact mechanics is a part of solid mechanics analyzing the behavior of structures that consist of several deformable bodies in mutual contact. These problems are characterized by boundary conditions prescribed on contact zones, which take into account nonpenetration and possible frictional effects. The corresponding mathematical models lead to different types of variational inequalities depending on the mechanical models used (Duvaut 1976; Kikuchi and Oden 1988; Hlaváček et al. 1988). A typical example of contact shape optimization can be formulated as follows: find shapes of the bodies in contact in such a way that the normal contact stresses are evenly distributed along contact zones (Klarbring and Haslinger 1993).



An important feature of shape optimization, in general, is its interdisciplinary character. It brings together several disciplines of mathematics, mechanics, and scientific computing. Which of them will be used depends on the aims of the investigation. A systematic mathematical study of this branch of optimal control problems commenced at the beginning of the seventies (Cea et al. 1974; Chenais 1975; Begis and Glowinski 1975; Cea 1976; Pironneau 1984), at first with elliptic PDEs as the state equation. Special attention was paid to the development of the shape differential calculus (Simon 1980; Sokolowski and Zolésio 1987; 1992). A significant part of the research in contact shape optimization has been done in collaboration with researchers from the University of Jyväskylä. For the sake of simplicity we focused on the Signorini problem in 2D as a state relation, i.e. the contact problem for a plane deformable body unilaterally supported by a rigid foundation. In all cases a simple parametrization of the geometry was used: the shapes of the contact zone, being the only object of optimization, were represented by the graphs of uniformly bounded and uniformly Lipschitz continuous functions. Similarly to Begis and Glowinski (1975) and Hlaváček and Nečas (1982), we aimed to treat problems in a comprehensive way: from the existence analysis, to a discretization, followed by convergence analysis and finally completed by sensitivity analysis. Rigorous sensitivity analysis requires certain regularity of solutions to the state problems. This regularity is usually not available in the case of contact problems with complicated friction laws. To avoid formal differentiation, sensitivity analysis was performed only for discretized optimal design problems. In Haslinger and Neittaanmäki (1983), shape optimization with a least squares objective functional and the state relation defined by scalar unilateral boundary value problems is studied.
The unilateral constraints on the boundary are treated by a penalty approach. The original state inequality in the shape optimization problem is then replaced by the penalized state equations. This example served as a simple prototype for contact shape optimization in linear elasticity, which is done in Haslinger and Neittaanmäki (1984) for the frictionless Signorini problem. The mathematical model of the state problem leads to a variational inequality of the first kind (Glowinski 1984). The stability of solutions with respect to domains, which is the key point in any existence analysis, is now more involved compared with the scalar case. This is due to the explicit presence of the design variable in the unilateral kinematical boundary constraint. To get more realistic models, the influence of friction has to be taken into account. Shape optimization for the Signorini problem with the Tresca friction law is studied in Haslinger et al. (1985). This is the simplest model of friction, which says that the magnitude of the tangential component of the stress vector on the contact zone is bounded by a value given a priori. A slip may occur only if this bound is attained. The mathematical model is given by a variational inequality of the second kind (Glowinski 1984). Until now we have considered contact problems for elastic bodies. The next two papers deal with contact shape optimization for deformable bodies made of materials that are characterized by nonlinear constitutive laws, namely elastic-perfectly plastic materials obeying the Hencky law of plasticity (Haslinger and Neittaanmäki 1986) and the deformation theory of plasticity (Haslinger and Mäkinen 1992). Constraints can be



prescribed not only on the boundary but also in the interior of the domain of definition, for example, in the mathematical model of a membrane whose deflection is restricted by a rigid obstacle. In Haslinger and Neittaanmäki (1988b) the so-called packaging problem is studied: to find a shape of the membrane of minimal area such that the set of points where the membrane comes into contact with the obstacle contains an a priori given subdomain. All these results and many others appeared in the books Haslinger and Neittaanmäki (1988a), Haslinger and Neittaanmäki (1996), Haslinger (2003). Later on, the research in contact shape optimization continued by considering the Signorini problem with more realistic friction laws, namely with Coulomb friction. Unlike the Tresca model, the upper bound restricting the magnitude of the contact shear stress is now given by the product of the coefficient of friction and the normal contact stress, i.e. a quantity which is not known in advance. The mathematical model of such problems and its analysis are much more involved compared with Tresca friction. It leads to an implicit variational inequality (Eck et al. 2005). This time, shape optimization problems with Coulomb friction were solved in their original, i.e. nonsmooth, form by methods of nonsmooth optimization. The two-dimensional case studied in Beremlijski et al. (2002) uses the implicit programming approach (Outrata 1998). Generalized gradient information required by minimization methods is obtained by the differential calculus of Clarke (1983). The situation for 3D problems is more difficult owing to the fact that the contact shear stress is now subject to quadratic constraints. The sensitivity analysis is performed by the generalized differential calculus of Mordukhovich (2006a; 2006b). Friction type conditions can also be useful in fluid mechanics to model the behavior of fluids on surfaces.
As an example we mention a water flow in a channel whose walls are coated with a hydrophobic material (e.g. Teflon). Water starts to slip only if the magnitude of the shear stress attains a critical value depending on the level of hydrophobicity; otherwise it adheres to the wall, i.e. slip has a threshold character, similarly to friction in contact mechanics. The resulting mathematical model leads again to an inequality problem. Shape stability of solutions to the Stokes system with a solution-dependent threshold bound has been studied in Haslinger and Stebel (2016). The Stokes system with the regularized form of the threshold slip condition was used as the state equation in a class of shape optimization problems in Haslinger et al. (2017).
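The effect of regularizing a nonsmooth threshold term can be seen on a scalar toy problem, which is our own illustration and not a model from the cited works. We minimize $\tfrac12 (t - a)^2 + g|t|$, whose minimizer vanishes whenever $|a| \le g$ (a threshold behaviour loosely analogous to stick and slip), and compare it with the minimizer of the smooth function obtained by replacing $|t|$ with $\sqrt{t^2 + \varepsilon^2}$. The choice of this particular smoothing, the bisection solver, and all names below are assumptions made for the sketch.

```python
import math

def exact_minimizer(a, g):
    """Minimizer of 0.5*(t - a)**2 + g*|t| (soft thresholding)."""
    return math.copysign(max(abs(a) - g, 0.0), a)

def regularized_minimizer(a, g, eps, iters=200):
    """Minimizer of the smooth function 0.5*(t - a)**2 + g*sqrt(t**2 + eps**2)."""
    def derivative(t):
        # Strictly increasing in t, so its unique root is the minimizer.
        return t - a + g * t / math.sqrt(t * t + eps * eps)

    # The derivative is negative at lo and positive at hi.
    lo, hi = -(abs(a) + g + 1.0), abs(a) + g + 1.0
    for _ in range(iters):          # plain bisection on the derivative
        mid = 0.5 * (lo + hi)
        if derivative(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# "Slip" case: |a| exceeds the threshold g, the minimizer is non-zero.
slip_exact = exact_minimizer(2.0, 0.5)
slip_smooth = regularized_minimizer(2.0, 0.5, 1e-4)
# "Stick" case: |a| stays below the threshold, the exact minimizer is zero.
stick_exact = exact_minimizer(0.3, 0.5)
stick_smooth = regularized_minimizer(0.3, 0.5, 1e-4)
```

In the stick case the regularized minimizer is not exactly zero but of the order of ε, which illustrates why the relation between the original and the regularized problems has to be analysed as ε → 0+.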

2 Identification of the Slip Bound in the Stokes System with Nonsmooth Slip Conditions

The aim of this section is to present theoretical results on the identification of the slip bound, assuming that the bound depends on the solution to the Stokes system with threshold slip boundary conditions. To this end we use the optimal control technique for an appropriate least squares cost functional and the state constraint given

Optimal Control Problems in Nonsmooth Solid and Fluid Mechanics: …


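by the velocity–pressure formulation of the flow model. To make the optimization problem smooth, the nonsmooth slip term in the original state inequality will be regularized. For the rigorous justification of this approach we refer to Haslinger and Mäkinen (2020).

We start with the definition of the state problem. Let $\Omega \subset \mathbb{R}^2$ be a bounded domain with the Lipschitz boundary $\partial\Omega = \overline{\Gamma} \cup \overline{S}$, where $\Gamma$ and $S$ are non-empty, disjoint parts open in $\partial\Omega$. The classical formulation of the state problem reads as follows: find the velocity vector $\boldsymbol{u} : \Omega \to \mathbb{R}^2$ and the pressure $p : \Omega \to \mathbb{R}$ such that

$$
\left\{
\begin{aligned}
-2\mu\,\operatorname{div}(D\boldsymbol{u}) + \nabla p &= \boldsymbol{f} && \text{in } \Omega,\\
\operatorname{div}\boldsymbol{u} &= 0 && \text{in } \Omega,\\
\boldsymbol{u} &= \boldsymbol{0} && \text{on } \Gamma,\\
u_\nu &= 0 && \text{on } S,\\
|\sigma_\tau| &\le g(|u_\tau|) && \text{on } S,\\
u_\tau \ne 0 \implies \sigma_\tau &= -g(|u_\tau|)\,\operatorname{sign} u_\tau && \text{on } S,
\end{aligned}
\right. \tag{1}
$$

where $\boldsymbol{f} \in (L^2(\Omega))^2$ stands for the external force, $\mu > 0$ is the dynamic viscosity of the fluid, $D\boldsymbol{u} = \tfrac{1}{2}(\nabla\boldsymbol{u} + (\nabla\boldsymbol{u})^T)$ is the symmetric part of the gradient of $\boldsymbol{u}$, and $\boldsymbol{n}$ and $\boldsymbol{t}$ are the unit normal and tangential vectors, respectively, to $\partial\Omega$. Further, $v_\nu = \boldsymbol{v}\cdot\boldsymbol{n}$ and $v_\tau = \boldsymbol{v}\cdot\boldsymbol{t}$ are the normal and tangential components of a vector $\boldsymbol{v} \in \mathbb{R}^2$, respectively, and $\sigma_\tau = 2\mu (D\boldsymbol{u})\boldsymbol{n}\cdot\boldsymbol{t}$ is the shear stress on $\partial\Omega$. Finally, $g : \mathbb{R}_+ \to \mathbb{R}_+$ is a nonnegative slip bound function that depends on $|u_\tau|$. For the weak formulation of (1) we shall need the following spaces and forms:

$$
\begin{aligned}
V(\Omega) &= \{\boldsymbol{v} \in (H^1(\Omega))^2 \mid \boldsymbol{v} = \boldsymbol{0} \text{ on } \Gamma,\ v_\nu = 0 \text{ on } S\},\\
L^2_0(\Omega) &= \Big\{q \in L^2(\Omega) \;\Big|\; \int_\Omega q\,dx = 0\Big\},\\
a(\boldsymbol{u},\boldsymbol{v}) &= 2\mu \int_\Omega D\boldsymbol{u} : D\boldsymbol{v}\,dx, && \boldsymbol{u},\boldsymbol{v} \in V(\Omega),\\
b(\boldsymbol{v},q) &= \int_\Omega q\,\operatorname{div}\boldsymbol{v}\,dx, && \boldsymbol{v} \in V(\Omega),\ q \in L^2(\Omega),\\
j_g(|u_\tau|,|v_\tau|) &= \int_S g(|u_\tau|)\,|v_\tau|\,ds, && \boldsymbol{u},\boldsymbol{v} \in V(\Omega).
\end{aligned}
$$

The slip bound $g$ will play the role of the control variable in the optimal control problem formulated below. From now on we shall suppose that $g \in U_{ad}$, where

$$
U_{ad} = \{g \in C_+(\mathbb{R}_+) \mid g(0) \le g_0,\ |g(\xi) - g(\bar\xi)| \le L|\xi - \bar\xi|\ \ \forall \xi,\bar\xi \in [0,\ell],\ g(\xi) = g(\ell)\ \ \forall \xi \ge \ell,\ g \text{ non-decreasing in } [0,\ell]\},
$$

with $g_0$, $L$, and $\ell$ being positive constants which do not depend on $g \in U_{ad}$.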


J. Haslinger and R. A. E. Mäkinen

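The weak velocity-pressure formulation of (1) for given $g \in U_{ad}$ reads as follows:

$$
\left\{
\begin{aligned}
&\text{Find } (\boldsymbol{u}^g, p^g) \in V(\Omega) \times L^2_0(\Omega) \text{ such that}\\
&a(\boldsymbol{u}^g, \boldsymbol{v} - \boldsymbol{u}^g) - b(\boldsymbol{v} - \boldsymbol{u}^g, p^g) + j_g(|u^g_\tau|,|v_\tau|) - j_g(|u^g_\tau|,|u^g_\tau|) \ge (\boldsymbol{f}, \boldsymbol{v} - \boldsymbol{u}^g)_{0,\Omega} \quad \forall \boldsymbol{v} \in V(\Omega),\\
&b(\boldsymbol{u}^g, q) = 0 \quad \forall q \in L^2_0(\Omega).
\end{aligned}
\right. \tag{$P(g)$}
$$

It is known that $(P(g))$ has a unique solution $(\boldsymbol{u}^g, p^g)$ for any $g \in U_{ad}$ (Haslinger and Mäkinen 2020). The identification of $g$ will be done through the following minimization problem:

$$
\text{Find } g^* \in U_{ad} \text{ such that } J(g^*) \le J(g) \quad \forall g \in U_{ad}, \tag{$P$}
$$

where

$$
J(g) = \frac{1}{2}\int_S (u^g_\tau - \tilde u_\tau)^2\,ds
$$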

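and $\tilde u_\tau$ is the measured value of the tangential velocity on $S$. If $g^* \in U_{ad}$ is a solution to $(P)$ then $(g^*, \boldsymbol{u}^{g^*}, p^{g^*})$ will be termed the optimal triplet of $(P)$. In Haslinger and Mäkinen (2020) the following existence result has been established.

Theorem 1 Problem $(P)$ has a solution.

The state constraint $(P(g))$ is represented by an implicit inequality. Its regularized form reads as follows:

$$
\left\{
\begin{aligned}
&\text{Find } (\boldsymbol{u}^g_\varepsilon, p^g_\varepsilon) \in V(\Omega) \times L^2_0(\Omega) \text{ such that}\\
&a(\boldsymbol{u}^g_\varepsilon, \boldsymbol{v}) - b(\boldsymbol{v}, p^g_\varepsilon) + \int_S g(T_\varepsilon(u^g_{\varepsilon,\tau}))\,T'_\varepsilon(u^g_{\varepsilon,\tau})\,v_\tau\,ds = (\boldsymbol{f}, \boldsymbol{v})_{0,\Omega} \quad \forall \boldsymbol{v} \in V(\Omega),\\
&b(\boldsymbol{u}^g_\varepsilon, q) = 0 \quad \forall q \in L^2_0(\Omega),
\end{aligned}
\right. \tag{$P_\varepsilon(g)$}
$$

where $T_\varepsilon : \mathbb{R} \to \mathbb{R}_+$ is defined by

$$
T_\varepsilon(x) =
\begin{cases}
|x| & \text{if } |x| \ge \varepsilon,\\[2pt]
\dfrac{x^2 + \varepsilon^2}{2\varepsilon} & \text{if } |x| < \varepsilon,
\end{cases}
$$

$\varepsilon > 0$ is the regularization parameter and $u^g_{\varepsilon,\tau}$ stands for the tangential component of $\boldsymbol{u}^g_\varepsilon$ on $S$. Again, there exists a unique solution to $(P_\varepsilon(g))$ for any $g \in U_{ad}$ and $\varepsilon > 0$. Problem $(P)$ will be replaced by the following family of problems, $\varepsilon \to 0+$:

$$
\text{Find } g^*_\varepsilon \in U_{ad} \text{ such that } J_\varepsilon(g^*_\varepsilon) \le J_\varepsilon(g) \quad \forall g \in U_{ad}, \tag{$P_\varepsilon$}
$$

where

$$
J_\varepsilon(g) = \frac{1}{2}\int_S (u^g_{\varepsilon,\tau} - \tilde u_\tau)^2\,ds.
$$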
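The regularization $T_\varepsilon$ of the absolute value used above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: the function names `T_eps` and `dT_eps` are hypothetical, and the derivative formula is obtained by differentiating the two branches of the definition directly.

```python
import numpy as np

def T_eps(x, eps):
    """Smoothed absolute value: |x| for |x| >= eps, quadratic blend otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) >= eps, np.abs(x), (x**2 + eps**2) / (2.0 * eps))

def dT_eps(x, eps):
    """Derivative of T_eps: sign(x) for |x| >= eps, x/eps otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) >= eps, np.sign(x), x / eps)
```

Both branches agree at $|x| = \varepsilon$ in value and in slope, so $T_\varepsilon$ is continuously differentiable, and $0 \le T_\varepsilon(x) - |x| \le \varepsilon/2$ for all $x$, which is why the regularized term approaches the nonsmooth one as $\varepsilon \to 0+$.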

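It has been shown in Haslinger and Mäkinen (2020) that there exists at least one solution to $(P_\varepsilon)$ for any $\varepsilon > 0$. The relation between $(P)$ and $(P_\varepsilon)$, $\varepsilon \to 0+$, is stated below.

Theorem 2 From any sequence of optimal triplets $\{(g^*_\varepsilon, \boldsymbol{u}^{g^*_\varepsilon}_\varepsilon, p^{g^*_\varepsilon}_\varepsilon)\}_\varepsilon$ of $(P_\varepsilon)$, $\varepsilon \to 0+$, one can find a subsequence (denoted by the same symbol) and an element $(g^*, \boldsymbol{u}^{g^*}, p^{g^*}) \in U_{ad} \times V(\Omega) \times L^2_0(\Omega)$ such that

$$
\left\{
\begin{aligned}
g^*_\varepsilon &\rightrightarrows g^* && \text{(uniformly) in } [0,\ell],\\
\boldsymbol{u}^{g^*_\varepsilon}_\varepsilon &\to \boldsymbol{u}^{g^*} && \text{in } (H^1(\Omega))^2,\\
p^{g^*_\varepsilon}_\varepsilon &\to p^{g^*} && \text{in } L^2(\Omega), \quad \varepsilon \to 0+.
\end{aligned}
\right. \tag{2}
$$

In addition, $(g^*, \boldsymbol{u}^{g^*}, p^{g^*})$ is an optimal triplet of $(P)$ and any accumulation point of $\{(g^*_\varepsilon, \boldsymbol{u}^{g^*_\varepsilon}_\varepsilon, p^{g^*_\varepsilon}_\varepsilon)\}_\varepsilon$ in the sense of (2) has this property.

For the proof we refer to Haslinger and Mäkinen (2020).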

3 Optimal Surface Coating Governed by the Stokes System with Threshold Slip Boundary Conditions

In the previous section, we searched for the threshold bound $g$ in the relatively large set $U_{ad}$. In many problems, however, the form of $g$ is known in advance and one only needs to adjust its parameters on the basis of measured data. For example, let $g$ be an affine function of $|u_\tau|$:

$$
g(|u_\tau|) = \kappa + k|u_\tau|, \tag{3}
$$

where $\kappa$ and $k$ are two parameters to be identified. Next, we shall consider the couples $(\kappa, k)$ belonging to the set

$$
\widetilde{U}_{ad} = \{(\kappa, k) \in (L^\infty(S))^2 \mid 0 \le \kappa \le \kappa_{\max},\ 0 \le k \le k_{\max} \text{ on } S\},
$$

where $\kappa_{\max}$ and $k_{\max}$ are given non-negative functions from $L^\infty(S)$. Let us mention that (3) involves three important slip models:

1. Classical Navier condition if $\kappa \equiv 0$ on $S$,
2. Tresca slip model if $k \equiv 0$ on $S$,
3. Threshold Navier condition if $\kappa \not\equiv 0$ and $k \not\equiv 0$ on $S$.

The weak velocity-pressure formulation $(P(g))$ with $g$ defined by (3) and $(\kappa, k) \in \widetilde{U}_{ad}$ reads as follows:

$$
\left\{
\begin{aligned}
&\text{Find } (\boldsymbol{u}^{(\kappa,k)}, p^{(\kappa,k)}) \in V(\Omega) \times L^2_0(\Omega) \text{ such that}\\
&a_k(\boldsymbol{u}^{(\kappa,k)}, \boldsymbol{v} - \boldsymbol{u}^{(\kappa,k)}) - b(\boldsymbol{v} - \boldsymbol{u}^{(\kappa,k)}, p^{(\kappa,k)}) + j_\kappa(|v_\tau|) - j_\kappa(|u^{(\kappa,k)}_\tau|) \ge (\boldsymbol{f}, \boldsymbol{v} - \boldsymbol{u}^{(\kappa,k)})_{0,\Omega} \quad \forall \boldsymbol{v} \in V(\Omega),\\
&b(\boldsymbol{u}^{(\kappa,k)}, q) = 0 \quad \forall q \in L^2_0(\Omega),
\end{aligned}
\right. \tag{$P(\kappa,k)$}
$$

where

$$
a_k(\boldsymbol{u}, \boldsymbol{v}) = a(\boldsymbol{u}, \boldsymbol{v}) + \int_S k\,u_\tau v_\tau\,ds, \qquad j_\kappa(|v_\tau|) = \int_S \kappa\,|v_\tau|\,ds.
$$

Problem $(P(\kappa, k))$ has a unique solution for any $(\kappa, k) \in \widetilde{U}_{ad}$.

Identification of $g$ given by (3) consists of finding a couple $(\kappa^*, k^*) \in \widetilde{U}_{ad}$ such that

$$
J(\kappa^*, k^*) \le J(\kappa, k) \quad \forall (\kappa, k) \in \widetilde{U}_{ad} \tag{$\widetilde{P}$}
$$

with

$$
J(\kappa, k) = \frac{1}{2}\int_S (u^{(\kappa,k)}_\tau - \tilde u_\tau)^2\,ds.
$$

Theorem 3 Problem $(\widetilde{P})$ has a solution.

The proof uses standard arguments: $\widetilde{U}_{ad}$ is weak$^*$ compact in $(L^\infty(S))^2$, the state problem $(P(\kappa, k))$ is stable with respect to $(\kappa, k) \in \widetilde{U}_{ad}$, and $J$ is continuous in $\widetilde{U}_{ad}$.

This section will be concluded by an example which can be interpreted as follows: consider a pipe assembled from a finite number of segments whose inner surfaces are coated with two different hydrophobic materials. These materials are characterized by the couples $(\kappa, k) = (\kappa_{\max}, 0)$ and $(\kappa, k) = (\kappa_{\min}, 0)$, i.e. the Tresca slip conditions with the threshold slip bounds $\kappa_{\max}$ and $\kappa_{\min}$, respectively, are prescribed on $S$, where $\kappa_{\max} \ge \kappa_{\min}$ are non-negative constants. Each segment is coated with only one of these materials. We aim to arrange the segments in the pipe in such a way that the tangential component of the velocity vector on $S$ is as close as possible to a given target. Unfortunately, this would lead to a nonlinear combinatorial optimization problem that is difficult to solve. Instead we use the relaxed optimal control approach (reminiscent of techniques used in topology optimization of structures, Bendsøe and Sigmund 2004):

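$$
\text{Find } \kappa^* \in U^\#_{ad} \text{ such that } J_\beta(\kappa^*) \le J_\beta(\kappa) \quad \forall \kappa \in U^\#_{ad}, \tag{$P^\#$}
$$

where

$$
\begin{aligned}
U^\#_{ad} &= \{\kappa \in L^\infty(S) \mid \kappa_{\min} \le \kappa \le \kappa_{\max} \text{ on } S\},\\
J_\beta(\kappa) &= \frac{1}{2}\int_S (u^{(\kappa,0)}_\tau - \tilde u_\tau)^2\,ds + \beta \int_S (\kappa_{\max} - \kappa)(\kappa - \kappa_{\min})\,ds,
\end{aligned}
$$

$\boldsymbol{u}^{(\kappa,0)}$ is the solution to $(P(\kappa, 0))$, and $\beta > 0$ is the penalty parameter. For sufficiently large $\beta$ the added penalty term forces the box constraints imposed on $\kappa$ to become (at least approximately) active on $S$ (bang-bang control).

For the numerical realization of the state problem, we use the finite element method. Let us assume that $\Omega$ is a polygonal domain and let $\{\mathcal{T}_h\}$, $h \to 0$, be a regular family of triangulations of $\overline{\Omega}$. With any $\mathcal{T}_h$, we associate the following finite-dimensional spaces:

$$
\begin{aligned}
V_h &= \{\boldsymbol{v}^h = (v^h_1, v^h_2) \in (C(\overline{\Omega}))^2 \mid \boldsymbol{v}^h|_T \in (P_2(T))^2\ \ \forall T \in \mathcal{T}_h,\ \boldsymbol{v}^h = \boldsymbol{0} \text{ on } \Gamma,\ v^h_\nu = 0 \text{ on } S\},\\
Q_h &= \Big\{q^h \in C(\overline{\Omega}) \;\Big|\; q^h|_T \in P_1(T)\ \ \forall T \in \mathcal{T}_h,\ \int_\Omega q^h\,dx = 0\Big\}.
\end{aligned}
$$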
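To see why the penalty term in $J_\beta$ favours bang-bang controls, it helps to evaluate it for a piecewise constant $\kappa$. The sketch below assumes a partition `t` of the arc length of $S$ and one coefficient per subinterval; the helper names `kappa_a` and `slip_bound_penalty` are hypothetical and serve only as an illustration.

```python
import numpy as np

def kappa_a(s, a, t):
    """Evaluate the piecewise constant slip bound with value a[i] on [t[i], t[i+1]]."""
    a = np.asarray(a, dtype=float)
    idx = np.clip(np.searchsorted(t, s, side="right") - 1, 0, len(a) - 1)
    return a[idx]

def slip_bound_penalty(a, t, kappa_min, kappa_max):
    """Integral over S of (kappa_max - kappa)(kappa - kappa_min) for piecewise constant kappa."""
    a = np.asarray(a, dtype=float)
    lengths = np.diff(np.asarray(t, dtype=float))
    return float(np.sum((kappa_max - a) * (a - kappa_min) * lengths))
```

The integrand is non-negative inside the box $[\kappa_{\min}, \kappa_{\max}]$ and vanishes exactly when each coefficient sits at one of the two bounds, so minimizing $J_\beta$ with a large $\beta$ pushes every segment towards one of the two materials.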

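This choice corresponds to the P2/P1 element satisfying the Ladyzhenskaya–Babuška–Brezzi (LBB) stability condition (Arnold et al. 1984). The finite element approximation of the regularized and discretized state problem then reads¹:

$$
\left\{
\begin{aligned}
&\text{Given } \kappa \in U^\#_{ad},\ \varepsilon > 0, \text{ find } (\boldsymbol{u}^h, p^h) \in V_h \times Q_h \text{ such that}\\
&a(\boldsymbol{u}^h, \boldsymbol{v}^h) - b(\boldsymbol{v}^h, p^h) + \int_S \kappa\,T'_\varepsilon(u^h_\tau)\,v^h_\tau\,ds = (\boldsymbol{f}, \boldsymbol{v}^h)_{0,\Omega} \quad \forall \boldsymbol{v}^h \in V_h,\\
&b(\boldsymbol{u}^h, q^h) = 0 \quad \forall q^h \in Q_h.
\end{aligned}
\right. \tag{$P^h_\varepsilon(\kappa,0)$}
$$

¹ To simplify notation, we write $(\boldsymbol{u}^h, p^h)$ only, even though this solution also depends on $\kappa$ and $\varepsilon$.

Existence and uniqueness of the solution of $(P^h_\varepsilon(\kappa, 0))$ are guaranteed by $\kappa \in U^\#_{ad}$ together with the LBB condition. Let $\ell$ be the arc length of $S$ and $0 = t_0 < t_1 < \dots < t_{m+1} = \ell$ the desired partition of $[0, \ell]$. Then the $\kappa$ to be identified will be discretized and represented as

$$
\kappa(s) := \kappa_{\boldsymbol{a}}(s) = \sum_{i=0}^{m} a_i\,\chi_{[t_i, t_{i+1}]}(s),
$$

where $\chi_{[t_i, t_{i+1}]}$ is the characteristic function of $[t_i, t_{i+1}]$ and the vector $\boldsymbol{a} = (a_0, \dots, a_m)$ belongs to $\mathcal{U} = \{\boldsymbol{a} \in \mathbb{R}^{m+1} \mid \kappa_{\min} \le a_i \le \kappa_{\max},\ i = 0, \dots, m\}$. Let $\boldsymbol{q} := [\boldsymbol{u}, \boldsymbol{p}]^T$ be the vector of degrees of freedom containing the nodal values of the discrete velocity $\boldsymbol{u}^h$ and the pressure $p^h$. Then the algebraic form of $(P^h_\varepsilon(\kappa, 0))$ is the nonlinear system of equations

$$
\boldsymbol{r}(\boldsymbol{a}; \boldsymbol{q}) :=
\begin{bmatrix}
\mathring{\mathbf{A}}\boldsymbol{u} + \boldsymbol{c}_\varepsilon(\boldsymbol{a}, \boldsymbol{u}) - \mathbf{B}\boldsymbol{p} - \boldsymbol{f}\\
\mathbf{B}^T\boldsymbol{u}
\end{bmatrix}
= \boldsymbol{0}. \tag{4}
$$

The block matrices $\mathring{\mathbf{A}}$ and $\mathbf{B}$ and the vector $\boldsymbol{f}$ arise from the standard discretized Stokes system. The nonlinear term $\boldsymbol{c}_\varepsilon(\boldsymbol{a}, \boldsymbol{u})$ corresponds to the boundary integral term in $(P^h_\varepsilon(\kappa, 0))$. Let $\mathcal{J}_\beta : \mathcal{U} \to \mathbb{R}$, $\mathcal{J}_\beta(\boldsymbol{a}) := J_\beta(\boldsymbol{a}, \boldsymbol{q}(\boldsymbol{a}))$, be the discretization of the cost functional $J_\beta$, where $\boldsymbol{q} = \boldsymbol{q}(\boldsymbol{a})$ solves (4). Then the mathematical programming problem to be solved reads:

$$
\text{Find } \boldsymbol{a}^* \in \arg\min\{\mathcal{J}_\beta(\boldsymbol{a}) \mid \boldsymbol{a} \in \mathcal{U}\}. \tag{5}
$$

The gradient of $\mathcal{J}_\beta$ is needed in order to use gradient type optimization algorithms for (5). To compute the partial derivatives of $\mathcal{J}_\beta$, the standard discrete adjoint variable approach (see, e.g., Haslinger and Mäkinen 2003) is used: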

$$
\frac{\partial \mathcal{J}_\beta(\boldsymbol{a})}{\partial a_i} = \frac{\partial J_\beta(\boldsymbol{a}, \boldsymbol{q})}{\partial a_i} + \boldsymbol{w}^T \frac{\partial \boldsymbol{r}(\boldsymbol{a}, \boldsymbol{q})}{\partial a_i}, \quad i = 0, \dots, m,
$$

where $\boldsymbol{w}$ solves the linear adjoint problem

$$
\left(\frac{\partial \boldsymbol{r}(\boldsymbol{a}, \boldsymbol{q})}{\partial \boldsymbol{q}}\right)^T \boldsymbol{w} = -\nabla_{\boldsymbol{q}} J_\beta(\boldsymbol{a}, \boldsymbol{q})
$$

with $\boldsymbol{q} = \boldsymbol{q}(\boldsymbol{a})$. Here $\partial \boldsymbol{r}(\boldsymbol{a}, \boldsymbol{q})/\partial \boldsymbol{q}$ is the Jacobian matrix of $\boldsymbol{r}$ at $(\boldsymbol{a}, \boldsymbol{q})$, which is also needed in the Newton method for solving (4). We finish this section with a numerical example aimed at illustrating the potential of our results in practical computations.

Example 1 Consider a state problem $(P(\kappa, 0))$ with $\Omega = \,]0, 1[ \times ]0, 1[$, $S = \,]0, 1[ \times \{0\}$, $\Gamma = \partial\Omega \setminus \overline{S}$, $\mu = 1$, and $\boldsymbol{f}(x_1, x_2) = [10 \sin(2\pi(0.5 - x_2)), 0]$. Further, let the identification problem $(P^\#)$ be defined by the target tangential velocity profile on $S$

$$
\tilde u_1(x_1) = 0.036 \left(\max\left\{0,\ \sin\left(\frac{\pi x_1}{b - a} + \frac{\pi a}{a - b}\right)\right\}\right)^2, \quad a = \frac{1}{4},\ b = \frac{3}{4},
$$

and $U^\#_{ad}$ by the constants $\kappa_{\min} = 0.2$ and $\kappa_{\max} = 0.45$. In the numerical realization we use the regularization parameter $\varepsilon = 10^{-5}$ and a structured but non-uniform P2/P1 triangular mesh whose elements are defined by a grid of $31 \times 31$ points (see Fig. 1). The set of points $\{t_i\}$ defining the discretization of $\kappa$ was chosen to be the nodal points of the finite element mesh on $S$. The finite element solver and the sensitivity analysis formulae were implemented using MATLAB (Mathworks 2016). Moreover, the fmincon optimizer from the MATLAB Optimization Toolbox was used. We solved the problem successively for $\beta \in \{0, 0.1, 0.2\}$ and used the optimized control as an initial guess for the next problem with a higher value of $\beta$. The optimized slip bound $\kappa^*_{\boldsymbol{a}}$ for $\beta = 0.2$ (i.e. the distribution of the two different materials on $S$) is shown in Fig. 2.

Fig. 1 Non-uniform structured mesh ($N_x = N_y = 31$)

Fig. 2 a optimized design $\kappa^*_{\boldsymbol{a}}$; b tangential velocities on $S$ (solid line $u_1^{\kappa^*_{\boldsymbol{a}}}$, dashed line $\tilde u_1$); c tangential stress on $S$ (solid line $-\sigma_1^{\kappa^*_{\boldsymbol{a}}}$, dashed line $\pm\kappa^*_{\boldsymbol{a}}$)
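The discrete adjoint formula of this section can be checked on a small model problem. The sketch below is only an illustration under stated assumptions: it does not reproduce the authors' MATLAB code or the Stokes discretization. The state equation is a hypothetical system $K q + a \odot \varphi(q) - f = 0$, where $\varphi(q) = q/\sqrt{q^2 + \varepsilon^2}$ is a smooth stand-in for the regularized slip term, and the cost mimics $J_\beta$ with a quadratic misfit plus the penalty. All function names are invented for the example.

```python
import numpy as np

EPS = 0.1  # smoothing parameter of the stand-in slip nonlinearity (assumption)

def phi(q):
    """Smooth approximation of sign(q); stand-in for the regularized slip term."""
    return q / np.sqrt(q**2 + EPS**2)

def dphi(q):
    """Derivative of phi."""
    return EPS**2 / (q**2 + EPS**2) ** 1.5

def residual(a, q, K, f):
    """Model residual r(a, q) = K q + a * phi(q) - f."""
    return K @ q + a * phi(q) - f

def jacobian_q(a, q, K):
    """Jacobian of the residual with respect to the state q."""
    return K + np.diag(a * dphi(q))

def solve_state(a, K, f, tol=1e-13, max_iter=50):
    """Newton iteration for r(a, q) = 0, started from the linear solution."""
    q = np.linalg.solve(K, f)
    for _ in range(max_iter):
        r = residual(a, q, K, f)
        if np.linalg.norm(r) < tol:
            break
        q = q - np.linalg.solve(jacobian_q(a, q, K), r)
    return q

def cost(a, q, q_target, beta, k_min, k_max):
    """Misfit plus bang-bang penalty, mimicking the structure of J_beta."""
    misfit = 0.5 * np.sum((q - q_target) ** 2)
    penalty = beta * np.sum((k_max - a) * (a - k_min))
    return misfit + penalty

def reduced_cost(a, K, f, q_target, beta, k_min, k_max):
    """Cost as a function of the control only, with q = q(a)."""
    return cost(a, solve_state(a, K, f), q_target, beta, k_min, k_max)

def gradient_adjoint(a, K, f, q_target, beta, k_min, k_max):
    """Gradient of the reduced cost via one linear adjoint solve."""
    q = solve_state(a, K, f)
    # Adjoint problem: (dr/dq)^T w = -grad_q J
    w = np.linalg.solve(jacobian_q(a, q, K).T, -(q - q_target))
    dJ_da = beta * (k_max + k_min - 2.0 * a)  # explicit partial derivative
    dr_da = np.diag(phi(q))                   # partial derivative of r w.r.t. a
    return dJ_da + dr_da.T @ w
```

One adjoint solve yields all partial derivatives at once, whereas finite differences need one nonlinear state solve per control variable; comparing the two on a small instance is a cheap way to validate a sensitivity analysis implementation before handing the gradient to an optimizer.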

4 Conclusions

In the first part of the paper, we explained why nonsmooth optimal control problems are considerably more difficult than smooth ones, not only from the theoretical but also from the computational point of view. In most applications in solid and fluid mechanics, nonsmoothness originates from state constraints represented typically by variational inequalities governing the optimized system. A possible way to overcome the difficulties produced by the nonsmooth character is a penalization/regularization of the original state constraint. As a result, the nonsmooth problem is replaced by a sequence of smooth ones which can be solved by standard tools. The potential of this approach is illustrated with the example of optimal coating of the inner surface of a tube for the Stokes flow obeying the Tresca slip law.

Acknowledgements The first author acknowledges the support of FW01010096 of the Czech Technological Agency.

References

Arnold D, Brezzi F, Fortin M (1984) A stable finite element for the Stokes equations. Calcolo 21:337–344
Begis D, Glowinski R (1975/76) Application de la méthode des éléments finis à l’approximation d’un problème de domaine optimal: Méthodes de résolution des problèmes approchés. Appl Math Optim 2(2):130–169
Bendsøe MP, Sigmund O (2004) Topology optimization: theory, methods, and applications, 2nd edn. Springer, Berlin
Beremlijski P, Haslinger J, Kočvara M, Outrata J (2002) Shape optimization in contact problems with Coulomb friction. SIAM J Optim 13(2):561–587
Céa J (1976) Une méthode numérique pour la recherche d’un domaine optimal. In: Glowinski R, Lions JL (eds) Computing methods in applied sciences and engineering (Second International Symposium, Versailles, 1975), Part 1, Lecture notes in economics and mathematical systems, vol 34. Springer, Berlin, pp 245–257
Céa J, Gioan A, Michel J (1974) Adaptation de la méthode du gradient à un problème d’identification de domaine. In: Glowinski R, Lions JL (eds) Computing methods in applied sciences and engineering (Proceeding international symposium, Versailles, 1973), Part 2, Lecture notes in computer science, vol 11. Springer, Berlin, pp 391–402
Chenais D (1975) On the existence of a solution in a domain identification problem. J Math Anal Appl 52(2):189–219
Clarke FH (1983) Optimization and nonsmooth analysis. Wiley, New York
Duvaut G, Lions JL (1976) Inequalities in mechanics and physics, vol 219. Grundlehren der Mathematischen Wissenschaften. Springer, Berlin
Eck C, Jarušek J, Krbec M (2005) Unilateral contact problems: variational methods and existence theorems. Chapman & Hall/CRC, Boca Raton, FL
Ekeland I, Temam R (1976) Convex analysis and variational problems. North-Holland, Amsterdam
Glowinski R (1984) Numerical methods for nonlinear variational problems. Springer series in computational physics. Springer, New York
Haslinger J, Horák V, Neittaanmäki P (1985/86) Shape optimization in contact problems with friction. Numer Funct Anal Optim 8(5–6):557–587
Haslinger J, Mäkinen R (1992) Shape optimization of elasto-plastic bodies under plane strains: Sensitivity analysis and numerical implementation. Struct Optim 4(3–4):133–141
Haslinger J, Mäkinen RAE (2003) Introduction to shape optimization: theory, approximation, and computation. SIAM, Philadelphia, PA
Haslinger J, Mäkinen RAE (2020) The parameter identification in the Stokes system with threshold slip boundary conditions. ZAMM Z Angew Math Mech 100(5):e201900209, 19 pp
Haslinger J, Mäkinen RAE, Stebel J (2017) Shape optimization for Stokes problem with threshold slip boundary conditions. Discret Contin Dyn Syst Ser S 10(6):1281–1301
Haslinger J, Neittaanmäki P (1983) Penalty method in design optimization of systems governed by a unilateral boundary value problem. Ann Fac Sci Toulouse Math (5) 5(3–4):199–216
Haslinger J, Neittaanmäki P (1984/85) On the existence of optimal shapes in contact problems. Numer Funct Anal Optim 7(2–3):107–124
Haslinger J, Neittaanmäki P (1986) On the existence of optimal shapes in contact problems: perfectly plastic bodies. Comput Mech 1(4):293–299
Haslinger J, Neittaanmäki P (1988a) Finite element approximation for optimal shape design: theory and applications. Wiley, Chichester
Haslinger J, Neittaanmäki P (1988b) On the design of the optimal covering of an obstacle. In: Zolesio JP (ed) Boundary control and boundary variations (Nice, 1986), Lecture notes in computer sciences, vol 100. Springer, Berlin, pp 192–211
Haslinger J, Neittaanmäki P (1996) Finite element approximation for optimal shape, material and topology design, 2nd edn. Wiley, Chichester
Haslinger J, Stebel J (2016) Stokes problem with a solution dependent slip bound: stability of solutions with respect to domains. ZAMM Z Angew Math Mech 96(9):1049–1060
Hlaváček I, Haslinger J, Nečas J, Lovíšek J (1988) Solution of variational inequalities in mechanics, vol 66. Applied mathematical sciences. Springer, New York
Hlaváček I, Nečas J (1982) Optimization of the domain in elliptic unilateral boundary value problems by finite element method. RAIRO Anal Numér 16(4):351–373
Kikuchi N, Oden JT (1988) Contact problems in elasticity: a study of variational inequalities and finite element methods. SIAM, Philadelphia, PA
Klarbring A, Haslinger J (1993) On almost constant contact stress distributions by shape optimization. Struct Optim 5(4):213–216
Mathworks announces release (2016) of the MATLAB and Simulink product families. MathWorks Inc., Natick, MA, p 2016
Mordukhovich BS (2006a) Variational analysis and generalized differentiation I: basic theory. Springer, Berlin
Mordukhovich BS (2006b) Variational analysis and generalized differentiation II: applications. Springer, Berlin
Outrata J, Kočvara M, Zowe J (1998) Nonsmooth approach to optimization problems with equilibrium constraints: theory, applications and numerical results, vol 28. Nonconvex optimization and its applications. Kluwer Academic Publishers, Dordrecht
Pironneau O (1984) Optimal shape design for elliptic systems. Springer series in computational physics. Springer, New York
Simon J (1980) Differentiation with respect to the domain in boundary value problems. Numer Funct Anal Optim 2(7–8):649–687
Sokolowski J, Zolésio JP (1987) Shape sensitivity analysis of unilateral problems. SIAM J Math Anal 18(5):1416–1437
Sokolowski J, Zolésio JP (1992) Introduction to shape optimization: shape sensitivity analysis, vol 16. Springer series in computational mathematics. Springer, Berlin

Optimal Control Approaches in Shape Optimization

Dan Tiba

Abstract This is a survey paper devoted to fixed domain methods in optimal design problems, based on optimal control methods. It discusses penalization of the state system and/or of the cost functional via Hamiltonian equations and implicit parametrizations. Special emphasis is placed on the treatment of various boundary conditions, and some new procedures are also briefly presented.

Keywords Shape optimization · Optimal control · Functional variations · Distributed penalization · Arbitrary dimension · General boundary condition · Fixed domains

MSC 2020: 49M41 · 49Q10

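1 Introduction

A typical shape optimization problem has the form

$$
\min_{\Omega \in \mathcal{O}} \int_\Lambda j(x, y(x), \nabla y(x))\,dx \tag{1}
$$

subject to

$$
Ay = f \quad \text{in } \Omega, \tag{2}
$$

$$
By = 0 \quad \text{on } \partial\Omega, \tag{3}
$$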

D. Tiba (B) Simion Stoilow Institute of Mathematics of Romanian Academy, Calea Griviței 21, Bucharest, Romania e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053,



D. Tiba

where $\mathcal{O}$ is a family of subdomains in $D \subset \mathbb{R}^n$, $D$ bounded, $n \in \mathbb{N}$; $\Lambda$ may denote $\Omega$ or $\partial\Omega$ or some given fixed domain $E \subset D$ such that $E \subset \Omega$ for any $\Omega \in \mathcal{O}$, etc.; $A$ is a second-order elliptic operator; $f \in L^p(D)$, $p > 1$, is given; and $B$ is the associated boundary condition operator. More general operators, including evolution equations, may be considered as well (Pironneau 1984; Sokolowski and Zolesio 1992; Tiba and Yamamoto 2020). The functional $j(\cdot, \cdot, \cdot) : D \times \mathbb{R} \times \mathbb{R}^n \to \mathbb{R}$ is Carathéodory. Precise hypotheses on the data in (1)–(3) will be imposed in the sequel, as the need arises.

The scientific literature on optimal design problems is very rich and we indicate, for general information, just the monographs Allaire (2007), Bucur and Buttazzo (2005), Delfour and Zolesio (2005), Haslinger and Neittaanmäki (1996), Henrot and Pierre (2005), Neittaanmäki et al. (2006), Novotny and Sokolowski (2013), Pironneau (1984), and Sokolowski and Zolesio (1992). A special class of such problems has its origin in mechanics: optimization of beams, plates, curved rods, arches, and shells with respect to their thickness, curvature, etc. Since the mechanical models contain the geometric parameters as coefficients in the (partial) differential equations involved, the resulting problems fall into the category of optimal control problems with respect to the coefficients. We quote Neittaanmäki et al. (2006, Chap. VI) and references therein for a detailed theoretical and computational treatment of this type of geometric optimization problem.

It is also obvious that the problem (1)–(3) has the general structure of an optimal control problem, the difference being that the minimization parameter is just the geometry $\Omega \in \mathcal{O}$. The idea of solving (1)–(3) via optimal control methods appears already in the classical monograph by Pironneau (1984), where some special cases are discussed. In this direction, the work Neittaanmäki et al. (2009) introduced the concept of functional variation, which combines in a certain sense the well known boundary variations and the topological variations (Clason et al. 2018; Novotny and Sokolowski 2013; Sokolowski and Zolesio 1992). See as well the survey Neittaanmaki and Tiba (2012) and the paper Makinen (1992), which is a first attempt to solve geometric optimization problems via level set methods.

As the backbone of this approach, we assume that the family $\mathcal{O}$ of admissible domains is obtained as the sublevel sets of the functions from a given family $F \subset C(D)$ of admissible mappings:

$$
\mathcal{O} = \{\Omega_g \mid g(x) < 0,\ \forall x \in \Omega_g,\ \forall g \in F\}. \tag{4}
$$


Notice that (4) defines just open sets that may not be connected. In order to obtain a domain via (4), a selection criterion for a connected component has to be introduced (for instance, to contain a prescribed domain $E \subset\subset D$, or $\partial\Omega_g$ to contain a prescribed point $x_0 \in D$, etc.). In general, $\Omega_g$ is not simply connected, that is, topology optimization will be considered as well. In this paper, we shall also examine Neumann boundary conditions and we shall assume regularity for $\partial\Omega_g$, as follows (see Neittaanmaki and Tiba 2008, 2012):

$$
F \subset C^1(D) \quad \text{and} \quad \nabla g(x) \ne 0,\ \forall x \in G = \{x \in D \mid g(x) = 0\}, \tag{5}
$$

for any $g \in F$. Due to (5), the implicit function theorem can be applied around any $x \in \partial\Omega_g$ and $\partial\Omega_g$ is of class $C^1$. More regularity can be obtained if more regularity is assumed for $F$.

Functional variations of $\Omega_g$ correspond to perturbations $\Omega_{g + \lambda r}$, where $\lambda \in \mathbb{R}$ and $r \in F$ are given. It is clear that $\Omega_{g + \lambda r}$ may have a different type of connectivity and $\partial\Omega_{g + \lambda r}$ is a perturbation of $\partial\Omega_g$. In this sense, functional variations combine boundary and topological variations.

In this geometrical context, for Dirichlet boundary conditions, the paper Neittaanmäki et al. (2009) introduced a penalization of the state system (2) that extends the given elliptic system from $\Omega_g$ to $D$ in an approximate way. Similar ideas were used earlier by Kawarada (1979) in the setting of free boundary problems. However, this procedure is strictly limited to Dirichlet boundary conditions. In this paper, we also briefly discuss the Neumann boundary conditions (3) via a new auxiliary mapping. Using it, we propose the extension of (2) (again in an approximate sense) from $\Omega_g$ to the universal domain $D$. This technique may be applied in the nonlinear case as well. However, the convergence properties are much weaker than in the Dirichlet case and further clarifications are needed in certain respects.

In Sect. 2, this is discussed in arbitrary dimension. In Sect. 3, in dimension two, the Hamiltonian approach based on the implicit parametrization theorem (Tiba 2013) is briefly introduced following the recent works Tiba (2018a, b, 2020). This method offers a complete approximation solution to shape optimization problems via optimal control theory, for any type of boundary conditions.
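The way a functional variation $g + \lambda r$ can change the connectivity of $\Omega_g$ is easy to observe numerically. The following sketch works on a one-dimensional grid for simplicity; the particular functions `g` and `r` and the helper `count_components` are hypothetical choices made only for this illustration.

```python
import numpy as np

def count_components(mask):
    """Number of maximal runs of True values in a 1D boolean array."""
    mask = np.asarray(mask, dtype=bool)
    previous = np.concatenate(([False], mask[:-1]))
    return int(np.count_nonzero(mask & ~previous))

def g(x):
    """Level function whose sublevel set {g < 0} is the interval (0.2, 0.8)."""
    return (x - 0.5) ** 2 - 0.09

def r(x):
    """Perturbation direction: a narrow bump centred at x = 0.5."""
    return np.exp(-(((x - 0.5) / 0.05) ** 2))
```

On a grid `x = np.linspace(0.0, 1.0, 1001)`, the set `g(x) < 0` is one interval; for a small step such as `g(x) + 0.05 * r(x) < 0` it stays connected, while for a larger step such as `g(x) + 0.2 * r(x) < 0` the perturbed function becomes positive near the centre and the set splits into two pieces. No remeshing or explicit boundary tracking is involved, only a change of the level function.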

2 Distributed Penalization of the State System

This section is devoted to techniques based on the penalization of the state equations (2), (3), extending the approach of Kawarada (1979), or on controllability arguments as in Neittaanmaki and Tiba (2008). See as well the survey Neittaanmaki and Tiba (2012) and the work Neittaanmäki et al. (2009), where the concept of functional variations was first used.

2.1 Dirichlet Conditions

We take (2), (3) in the simple form

$$
-\Delta y = f \quad \text{in } \Omega_g, \tag{6}
$$

$$
y = 0 \quad \text{on } \partial\Omega_g, \tag{7}
$$

where $\Omega_g$ is as defined in (4) and is assumed to be a domain here, not necessarily simply connected. A cost functional as in (1) and further constraints, as mentioned in Sect. 1, can be considered as well. The following general approximation property plays an important role. We recall that $\Omega_g \subset D$, for any $g \in F \subset C(D)$, and $D$ is a given bounded smooth domain in $\mathbb{R}^n$. In this general setting, the definition (4) has to be modified as

$$
\Omega_g = \operatorname{int}\{x \in D \mid g(x) \le 0\} \tag{8}
$$

since the closed set $\{x \in D \mid g(x) = 0\}$ may have positive measure. Then $\Omega_g$, as defined in (8), is a Carathéodory open set and its boundary is continuous (of class $C$, with the segment property). We define the approximate extension of (6), (7) as follows:

$$
-\Delta y + \frac{1}{\varepsilon}\,\chi_{D \setminus \Omega_g}\,y = f \quad \text{in } D, \tag{9}
$$

$$
y = 0 \quad \text{on } \partial D, \tag{10}
$$

where $\chi_{D \setminus \Omega_g}$ is the characteristic function of $D \setminus \Omega_g$. It turns out that it has a simple analytic expression, $\chi_{D \setminus \Omega_g} = 1 - H(-g)$, where $H$ is the Heaviside function on $\mathbb{R}$. Then, the geometry $\Omega_g$ disappears from (9), (10) and $g \in F$ becomes the control parameter acting in the coefficients of (9), (10).

Theorem 1 If $\Omega_g$ is of class $C$ and $y_\varepsilon \in H^2(D) \cap H^1_0(D)$ denotes the unique solution of (9), (10), then $y_\varepsilon|_{\Omega_g} \to y$, the solution of (6), (7), weakly in $H^1(\Omega_g)$ and strongly in $L^2(\Omega_g)$.

This was established in Neittaanmäki et al. (2009) and it is based on density properties of the Sobolev spaces (Adams 1975; Tiba 2002) in domains of class $C$. A regularization of the characteristic function in (9), given by $H^\varepsilon(-g)$, makes it possible to obtain differentiability properties of the mapping $g \mapsto y_\varepsilon$ with respect to functional variations and to use gradient methods in the associated control problem (1), (9), (10). Notice that, in order to avoid any geometric aspects, the functional (1) can be written in the form

$$
\int_D H(-g(x))\,j(x, y(x), \nabla y(x))\,dx,
$$

if $\Lambda = \Omega_g$, and further regularized by using $H^\varepsilon(-g)$. This transformation is not necessary if $\Lambda = E$, $E \subset \Omega_g$, $\forall g \in F$, $E \subset D$ given. If $\Lambda = \partial\Omega_g$ or some parts of $\partial\Omega_g$, then Sect. 3 has to be applied (boundary observation). In the recent paper Tiba and Murea (2019), such procedures were extended to the optimization of simply supported plates with holes (topological optimization). In fact, this approximation method allows a natural combination of boundary and topological variations that are intrinsically defined via the usual gradient algorithms applied to the approximating optimal control problems.
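The approximation property behind the penalized system (9), (10) can be observed on a one-dimensional analogue. The sketch below is an illustration under assumptions that are not taken from the source: $D = (0, 1)$, $g(x) = |x - 0.5| - 0.25$ (so that $\Omega_g = (0.25, 0.75)$), $f \equiv 1$, and a standard second-order finite difference discretization; the function names are hypothetical.

```python
import numpy as np

def solve_penalized(eps, n=400):
    """Finite differences for -y'' + (1/eps) * chi * y = 1 on (0, 1), y(0) = y(1) = 0."""
    h = 1.0 / n
    x = np.linspace(0.0, 1.0, n + 1)
    g = np.abs(x - 0.5) - 0.25          # level function; Omega_g = {g < 0}
    chi = (g > 0).astype(float)         # indicator of the complement of Omega_g
    main = 2.0 / h**2 + chi[1:-1] / eps
    off = -np.ones(n - 2) / h**2
    A = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    y = np.zeros(n + 1)
    y[1:-1] = np.linalg.solve(A, np.ones(n - 1))
    return x, g, y

def error_on_omega(eps, n=400):
    """Maximum deviation on Omega_g from the exact Dirichlet solution of -y'' = 1."""
    x, g, y = solve_penalized(eps, n)
    inside = g < 0
    exact = 0.5 * (x - 0.25) * (0.75 - x)
    return float(np.max(np.abs(y[inside] - exact[inside])))
```

Comparing `error_on_omega(1e-2)` with `error_on_omega(1e-6)` shows the restriction of the penalized solution approaching the Dirichlet solution on $\Omega_g$ as $\varepsilon$ decreases, while the computation itself never leaves the fixed domain $D$. For very small $\varepsilon$ the remaining error is dominated by the grid resolution near $\partial\Omega_g$.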

2.2 Neumann Conditions

Here, we fix the equations (2), (3) as follows:

$$
-\Delta y + y = f \quad \text{in } \Omega_g, \tag{11}
$$

$$
\frac{\partial y}{\partial n} = 0 \quad \text{on } \partial\Omega_g, \tag{12}
$$

where $F$ satisfies the regularity condition (5) and $\Omega_g$ is defined by (4) with the supplementary restriction $E \subset \Omega_g$, $\forall g \in F$, which can be expressed as

$$
g(x) < 0 \quad \text{in } E. \tag{13}
$$

The solution of (11), (12) can be understood in the weak sense: $y \in H^1(\Omega_g)$ and

$$
\int_{\Omega_g} [\nabla y \cdot \nabla v + yv]\,dx = \int_{\Omega_g} f v\,dx, \quad \forall v \in H^1(\Omega_g).
$$

The unit normal to $\partial\Omega_g$ is given by $\dfrac{\nabla g(x)}{|\nabla g(x)|}$ and exists in every $x \in \partial\Omega_g$, due to (5). If more regularity is imposed on $F$, in order to have $\partial\Omega_g$ in $C^{1,1}$ (for instance, if $F \subset C^2(D)$), then (11), (12) has a unique strong solution $y \in H^2(\Omega_g)$. Then, one may consider the cost functional (1) depending on $\nabla y$, involving boundary observation, etc. But the treatment of such a cost functional needs the Hamiltonian approach from Sect. 3. Under condition (5), relation (12) may be written as $\nabla g \cdot \nabla y = 0$ on $\partial\Omega_g$, and the function $\nabla g$ is in fact defined on $D$. A direct extension of the distributed penalized system (9), (10) to the Neumann case (11), (12) would be

$$
-\Delta y_\varepsilon + y_\varepsilon + \frac{1}{\varepsilon}\,\nabla g_+ \cdot \nabla y_\varepsilon = f \quad \text{in } D, \tag{14}
$$

$$
y_\varepsilon = 0 \quad \text{on } \partial D, \tag{15}
$$

where $g_+$ is the positive part of $g$ and $\operatorname{supp} g_+ \subset D \setminus \Omega_g$. The solution of (14), (15) is in $H^2(D) \cap H^1_0(D)$, and the replacement of Neumann conditions by Dirichlet ones in (15) is quite usual in such extension procedures and even helpful in the numerical aspects. However, it turns out that an approximation property as in Theorem 1 does not seem to be obtainable for (14), (15) and a more sophisticated method is necessary.
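The rewriting of the Neumann condition (12) as $\nabla g \cdot \nabla y = 0$ on $\partial\Omega_g$ only needs the gradient of the level function. A minimal sketch follows, assuming the hypothetical level function $g(x) = |x|^2 - r_0^2$ (a disc of radius $r_0$) and user-supplied gradient callables; none of the names below come from the source.

```python
import numpy as np

def grad_g_circle(x):
    """Gradient of g(x) = |x|^2 - r0^2; it does not depend on r0."""
    return 2.0 * np.asarray(x, dtype=float)

def unit_normal(grad_g, x):
    """Unit normal grad g / |grad g| at points x of the zero level set."""
    G = grad_g(x)
    return G / np.linalg.norm(G, axis=-1, keepdims=True)

def neumann_residual(grad_g, grad_y, x):
    """Pointwise value of grad g . grad y, which vanishes where (12) holds."""
    return np.sum(grad_g(x) * grad_y(x), axis=-1)
```

For the disc, `unit_normal` returns the radial direction, as expected. The residual `neumann_residual` is defined wherever $\nabla g$ and $\nabla y$ are, not only on $\partial\Omega_g$, which is the property the fixed-domain formulations of this section exploit.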


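Following Neittaanmaki and Tiba (2008, 2012), we introduce the following optimal control problem, again defined in $D$:

$$
\min_{u \in L^2(D)} \left\{ \frac{1}{2\varepsilon} \int_{\partial\Omega_g} (\nabla g \cdot \nabla y_\varepsilon)^2\,d\sigma + \frac{1}{2} \int_D H(g)\,u^2\,dx \right\}, \tag{16}
$$

$$
-\Delta y_\varepsilon + y_\varepsilon = f + H(g)u \quad \text{in } D, \tag{17}
$$

$$
\frac{\partial y_\varepsilon}{\partial n} = 0 \quad \text{on } \partial D. \tag{18}
$$

The problem (16)–(18) has a unique optimal pair $[y_\varepsilon, u_\varepsilon] \in H^2(D) \times L^2(D)$ due to the coercivity and strict convexity of the performance index (16). By the trace theorem, there is $\tilde y \in H^2(D \setminus \Omega_g)$ such that (18) is satisfied and $\tilde y = y$, the solution of (11), (12), and $\frac{\partial \tilde y}{\partial n} = 0$ on $\partial\Omega_g$. Then, the concatenation of $y$, $\tilde y$ satisfies (17), (18) in $D$, with some $\tilde u \in L^2(D)$. We have the inequality

$$
\frac{1}{2\varepsilon} \int_{\partial\Omega_g} (\nabla g \cdot \nabla y_\varepsilon)^2\,d\sigma + \frac{1}{2} \int_D H(g)\,u_\varepsilon^2\,dx \le \frac{1}{2} \int_{D \setminus \Omega_g} (\tilde u)^2\,dx. \tag{19}
$$

It yields that $\nabla g \cdot \nabla y_\varepsilon \to 0$ strongly in $L^2(\partial\Omega_g)$, $\{u_\varepsilon\}$ is bounded in $L^2(D)$, $\{y_\varepsilon\}$ is bounded in $H^2(D)$, and we may assume that $y_\varepsilon \to \widehat{y}$, $u_\varepsilon \to \widehat{u}$ weakly in $H^2(D) \times L^2(D)$, on a subsequence, and $\frac{\partial \widehat{y}}{\partial n} = 0$ on $\partial\Omega_g$.

It is clear, due to (19), that $y_\varepsilon|_{\Omega_g}$ provides a good approximation of the solution of (11), (12). By taking standard variations $u_\varepsilon + \lambda v$, $y_\varepsilon + \lambda z$, with $v \in L^2(D)$ and $z \in H^2(D)$ given by

$$
-\Delta z + z = H(g)v \quad \text{in } D, \tag{20}
$$

$$
\frac{\partial z}{\partial n} = 0 \quad \text{on } \partial D, \tag{21}
$$

one obtains (see Neittaanmaki and Tiba 2008):

Proposition 1 The gradient of the cost (16) is given by

$$
p_\varepsilon + H(g)u_\varepsilon = 0 \quad \text{in } D, \tag{22}
$$

where $p_\varepsilon \in L^2(D)$ is the adjoint state obtained as the unique transposition solution of

$$
\int_D p_\varepsilon(-\Delta z + z)\,dx = \frac{1}{\varepsilon} \int_{\partial\Omega_g} (\nabla y_\varepsilon \cdot \nabla g)(\nabla z \cdot \nabla g)\,d\sigma, \tag{23}
$$

for any $z \in H^2(D)$ satisfying (20), (21).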

Remark 1 Using (22), one can eliminate the control in (17) and obtain the system given by (23), (18) and

$$
-\Delta y_\varepsilon + y_\varepsilon = f - p_\varepsilon \quad \text{in } D,
$$

approximating the solution of (11), (12). It is to be noticed that (23) still involves the geometry $\partial\Omega_g$ essentially; therefore this extension preserves some of the difficulties of the original shape optimization problem.

We propose now a distributed extension of the Neumann problem (11), (12), similar to the Dirichlet case from Sect. 2.1, but again using an optimal control problem. We denote by $y_0$ the solution of (17), (18) with $u = 0$, and let $\widehat{y} \in H^1(D)$ satisfy the boundary control system

$$
-\Delta \widehat{y} + \widehat{y} = 0 \quad \text{in } D, \tag{24}
$$

$$
\frac{\partial \widehat{y}}{\partial n} = w \quad \text{on } \partial D, \quad w \in L^2(\partial D), \tag{25}
$$

as a weak solution:

$$
\int_D \nabla \widehat{y} \cdot \nabla \varphi\,dx + \int_D \widehat{y}\,\varphi\,dx - \int_{\partial D} w\,\varphi\,d\sigma = 0, \quad \forall \varphi \in H^1(D). \tag{26}
$$

To (24), (25) or (26) we associate the cost functional

$$
\int_{D \setminus \Omega_g} |\nabla g \cdot \nabla(\widehat{y} + y_0)|^2\,dx + \varepsilon \int_{\partial D} w^2\,d\sigma. \tag{27}
$$

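The sum $\widehat{y} + y_0$ satisfies (17), (25) as a weak solution. Denote by $[\widehat{y}_\varepsilon, w_\varepsilon]$ the unique optimal pair of the problem (24)–(27). Choosing $w = 0$, we get the inequality:

$$
\int_{D \setminus \Omega_g} |\nabla g \cdot \nabla(\widehat{y}_\varepsilon + y_0)|^2\,dx + \varepsilon \int_{\partial D} w_\varepsilon^2\,d\sigma \le \int_{D \setminus \Omega_g} |\nabla g \cdot \nabla y_0|^2\,dx. \tag{28}
$$

Consequently, $\{\nabla g \cdot \nabla \widehat{y}_\varepsilon\}$ is bounded in $L^2(D \setminus \Omega_g)$ and $\{\varepsilon^{1/2} w_\varepsilon\}$ is bounded in $L^2(\partial D)$. Let $l = \lim \nabla g \cdot \nabla \widehat{y}_\varepsilon$, weakly in $L^2(D \setminus \Omega_g)$, on a subsequence.

Taking variations $w_\varepsilon + \lambda v$, $v \in L^2(\partial D)$, and $\widehat{y}_\varepsilon + \lambda z$, with $z \in H^1(D)$ given by (in the weak sense)

$$
-\Delta z + z = 0 \quad \text{in } D, \tag{29}
$$

$$
\frac{\partial z}{\partial n} = v \quad \text{on } \partial D, \tag{30}
$$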


D. Tiba

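we obtain in a standard way

\[ \int_{D\setminus\Omega_g} \nabla g \cdot \nabla(\hat y_\varepsilon + y_0)\, \nabla g \cdot \nabla z \, dx + \varepsilon \int_{\partial D} w_\varepsilon v \, d\sigma = 0. \tag{31} \]

If $\varepsilon \to 0$, then (31) and the above remarks show that

\[ \int_{D\setminus\Omega_g} (l + \nabla g \cdot \nabla y_0)\, \nabla g \cdot \nabla z \, dx = 0, \tag{32} \]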

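for any $z$ satisfying (29), (30). We define the adjoint state $p_\varepsilon \in H^1(D)$ as the weak solution to the Neumann problem (with null boundary conditions):

\[ \int_D \nabla p_\varepsilon \cdot \nabla q \, dx + \int_D p_\varepsilon q \, dx = \int_{D\setminus\Omega_g} (\nabla g \cdot \nabla(\hat y_\varepsilon + y_0))(\nabla g \cdot \nabla q)\, dx, \tag{33} \]

for any $q \in H^1(D)$. Combining (33) and (29)–(31), we infer

\[ \int_{\partial D} v p_\varepsilon \, d\sigma = \int_D \nabla z \cdot \nabla p_\varepsilon \, dx + \int_D z p_\varepsilon \, dx = \int_{D\setminus\Omega_g} (\nabla g \cdot \nabla(\hat y_\varepsilon + y_0))(\nabla g \cdot \nabla z)\, dx = -\varepsilon \int_{\partial D} w_\varepsilon v \, d\sigma, \tag{34} \]

for any $v \in L^2(\partial D)$.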


It yields that $w_\varepsilon = -\frac{1}{\varepsilon} p_\varepsilon$ on $\partial D$, so the control can be eliminated from (25). The system (33) and (24), (25), with this replacement, provides an approximate extension of the Neumann problem (28), (29). Notice that the right-hand side in (34) may be written as

\[ \int_D H(g)(\nabla g \cdot \nabla(\hat y_\varepsilon + y_0))(\nabla g \cdot \nabla q)\, dx, \]

that is, the unknown geometry $\Omega_g$ is completely hidden in this formulation. From (31), (32), (33), $\{p_\varepsilon\}$ is bounded in $H^1(D)$, and (34), (28) give $p_\varepsilon \to 0$ in $L^2(\partial D)$. Let $\hat p$ denote the weak limit in $H^1(D)$ of $\{p_\varepsilon\}$, on a subsequence. By (34), we also get

\[ \int_D \nabla z \cdot \nabla \hat p \, dx + \int_D z \hat p \, dx = 0, \tag{35} \]

due to (31), for any $z$ satisfying (29), (30).

To (35), null Neumann and Dirichlet conditions are associated on $\partial D$. But the behaviour of $\{\hat y_\varepsilon\}$ as $\varepsilon \to 0$ cannot be estimated from such information. The Hamiltonian approach in the next section brings more clarifications.
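
The elimination of the boundary control relies on one short step that is only implicit in (34). As a sketch, using nothing beyond the fact that (34) holds for every $v \in L^2(\partial D)$:

```latex
\int_{\partial D} v\,p_\varepsilon\,d\sigma
  = -\varepsilon\int_{\partial D} w_\varepsilon v\,d\sigma
  \quad \forall v\in L^2(\partial D)
\;\Longrightarrow\;
\int_{\partial D} (p_\varepsilon+\varepsilon w_\varepsilon)\,v\,d\sigma = 0
  \quad \forall v\in L^2(\partial D)
\;\Longrightarrow\;
w_\varepsilon = -\tfrac{1}{\varepsilon}\,p_\varepsilon
  \ \text{ a.e. on } \partial D .
```

Combined with the bound on $\{\varepsilon^{1/2} w_\varepsilon\}$ in $L^2(\partial D)$, this identity is what gives $p_\varepsilon = -\varepsilon w_\varepsilon \to 0$ in $L^2(\partial D)$.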

3 Hamiltonian Approach in Dimension Two

It is an old observation (see, for instance, Thorpe (1979, p. 63)) that Hamiltonian systems provide a parametrization of general curves in dimension two. This has recently been extended to arbitrary dimension by using the notion of iterated Hamiltonian systems (Tiba 2018a). In general, the analytic representation of the geometry has a local character, in a neighbourhood of a noncritical point of the Hamiltonian. However, again in dimension two, under the classical assumption that the Hamiltonian system has no equilibrium points, a variant of the Poincaré–Bendixson theorem (Hirsch et al. 2013; Pontryagin 1968) shows that this parametrization of the curves is global and the solutions are periodic, that is, the curves are closed. Consequently, it is a natural idea to study shape optimization problems by using this powerful machinery of Hamiltonian systems. We limit our discussion to dimension two (which is an important case in optimal design) and to Dirichlet problems, although Neumann conditions have also been studied in Neittaanmäki and Tiba (2008, 2012), where some preliminary results are reported. This presentation is based on the recent publications Murea and Tiba (2019, 2021) and Tiba (2016, 2018b) of the author and his collaborators.

This optimal control approach in geometric optimization uses the same analytic representation of the domains given by (4). Under the classical Poincaré–Bendixson assumption (5), the boundary $\partial\Omega_g$ is parametrized by the unique solution of the Hamiltonian system, in dimension two:

\[ x_1'(t) = -\frac{\partial g}{\partial y}(x_1(t), x_2(t)), \quad t \in I, \tag{36} \]
\[ x_2'(t) = \frac{\partial g}{\partial x}(x_1(t), x_2(t)), \quad t \in I, \tag{37} \]
\[ (x_1(0), x_2(0)) = (x_{01}, x_{02}), \tag{38} \]

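where $(x_{01}, x_{02}) \in \partial\Omega_g$ is given and $I = [0, T]$, where $T$ is the main period of the solution of (36)–(38). This representation is global (Murea and Tiba 2021). We consider here the following minimization problem:

\[ \min_{\Omega \in \mathcal{O}} \int_E j(x, y(x))\, dx, \tag{39} \]
\[ -\Delta y(x) = f(x) \quad \text{in } \Omega, \tag{40} \]
\[ y(x) = 0 \quad \text{on } \partial\Omega, \tag{41} \]
\[ E \subset \Omega \subset D, \quad \forall\, \Omega \in \mathcal{O}, \tag{42} \]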

which is one of the simplest situations in optimal design. The family $\mathcal{O}$ of admissible domains is generated via (4), starting from a given family $\mathcal{F} \subset C^1(D)$ satisfying (5) and (13), for any $g \in \mathcal{F}$. Here, $E \subset\subset D \subset \mathbb{R}^2$ are some given domains and $f \in L^2(D)$ is fixed. The solution of (40), (41) may be understood in the weak sense, and the existence of optimal domains is valid under certain weak compactness assumptions (uniform segment property), see Neittaanmäki et al. (2006, Chap. 2). In Tiba (2018b) and Murea and Tiba (2019), the following approximation of (39)–(42) is introduced:

\[ \min_{g,u} \left\{ \int_E j(x, y_g(x))\, dx + \frac{1}{\varepsilon}\int_0^{T_g} y_g(x_1(t), x_2(t))^2 \sqrt{x_1'(t)^2 + x_2'(t)^2}\, dt \right\}, \quad \varepsilon > 0, \tag{43} \]

subject to $g \in \mathcal{F}$, $u$ measurable such that $g_+^2 u \in L^2(D)$ and

\[ -\Delta y_g(x) = f + g_+^2 u \quad \text{in } D, \tag{44} \]
\[ y_g(x) = 0 \quad \text{on } \partial D, \tag{45} \]
\[ g(x) < 0 \quad \text{in } D. \tag{46} \]

In (43)–(46), we denote shortly by $x_1$, $x_2$ the solution of (36)–(38) associated to $g \in \mathcal{F}$, and $T_g$ is the corresponding main period. Comparing with the previous section, we see that the modification of the state Eq. (40) is obtained just by adding the control term $g_+^2 u$, and the penalization process is moved into the cost functional (43), which is quite a standard process in optimization theory. This is due to the equivalence between the original shape optimization problem (39)–(42) and the control problem (39), (44)–(46), with an added state constraint:

Theorem 2 For any $g \in \mathcal{F}$, there is $u$ measurable such that $g_+^2 u \in L^2(D)$ and the solution of (44), (45) coincides in $\Omega_g$ with the solution of (40), (41) and satisfies the state constraint:

\[ \int_{\partial\Omega_g} |y_g(\sigma)|^2 \, d\sigma = \int_0^{T_g} y_g(x_1(t), x_2(t))^2 \sqrt{x_1'(t)^2 + x_2'(t)^2}\, dt = 0. \]

The cost (39) associated to $\Omega_g$ is equal to the cost associated to $y_g$, $g$, $u$, again by (39). This is a variant of a result in Murea and Tiba (2019). It is clear that Theorem 2 can be extended to any type of boundary conditions and this is the advantage of the



Hamiltonian approach. Concerning the approximation properties of the penalized problem (43)–(46) versus the original problem (39)–(42), we just quote the works Tiba (2018b) and Murea and Tiba (2019, 2021), where several situations are investigated.

As in the previous sections, one can use functional variations also in this setting, via the perturbed domains $\Omega_{g+\lambda r}$, with $\lambda \in \mathbb{R}$ and $r \in \mathcal{F}$. This allows both topological and boundary variations, simultaneously. Notice that in the penalized problem (43)–(46) one has to include the Hamiltonian system (36)–(38) as a part of the state system, since it is essential in the computation of the cost (43). The system in variation corresponding to this problem is:

Theorem 3 Under the above assumptions, we denote

\[ q = \lim_{\lambda \to 0} (y^\lambda - y)/\lambda, \qquad w = [w_1, w_2] = \lim_{\lambda \to 0} (z^\lambda - z)/\lambda, \]

with $y^\lambda \in H^2(D) \cap H_0^1(D)$ the solution of (44), (45) associated to $g + \lambda r$ and $u + \lambda v$, and $z^\lambda \in C^1(0, T_g)^2$ the solution of (36)–(38) associated to $g + \lambda r$. The limits exist in the above spaces and satisfy:

\[ -\Delta q = g_+^2 v + 2 g_+ u r \quad \text{in } D, \]
\[ q = 0 \quad \text{on } \partial D, \]
\[ w_1' = -\nabla \partial_2 g(z_g) \cdot w - \partial_2 r(z_g) \quad \text{in } [0, T_g], \]
\[ w_2' = \nabla \partial_1 g(z_g) \cdot w + \partial_1 r(z_g) \quad \text{in } [0, T_g], \]
\[ w_1(0) = w_2(0) = 0. \]

Here "·" denotes the scalar product in $\mathbb{R}^2$. The proof can be found in Murea and Tiba (2019). By Theorem 3, one can compute the gradient of the penalized cost with respect to functional variations, according to Murea and Tiba (2019, 2021). Numerical examples are also reported in these two papers.
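
The Hamiltonian parametrization (36)–(38) and the boundary penalty term in (43) can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes the hypothetical level function g(x1, x2) = x1² + x2² − 1 (so that the curve g = 0 is the unit circle and the main period is π), reads the right-hand sides of (36), (37) as derivatives with respect to the second and first argument, and uses a hand-written classical Runge–Kutta integrator. All function names are illustrative.

```python
import math

def g(x1, x2):
    # Hypothetical level function: the curve g = 0 is the unit circle.
    return x1 * x1 + x2 * x2 - 1.0

def grad_g(x1, x2):
    # Partial derivatives of g with respect to its first and second argument.
    return 2.0 * x1, 2.0 * x2

def rhs(x1, x2):
    # Hamiltonian system (36), (37): x1' = -dg/dx2, x2' = dg/dx1.
    g1, g2 = grad_g(x1, x2)
    return -g2, g1

def rk4_step(x1, x2, h):
    # One classical fourth-order Runge-Kutta step of size h.
    k1 = rhs(x1, x2)
    k2 = rhs(x1 + 0.5 * h * k1[0], x2 + 0.5 * h * k1[1])
    k3 = rhs(x1 + 0.5 * h * k2[0], x2 + 0.5 * h * k2[1])
    k4 = rhs(x1 + h * k3[0], x2 + h * k3[1])
    return (x1 + h * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0,
            x2 + h * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0)

def trace_boundary(x01, x02, period, n_steps):
    # Integrate from the initial point (38) over one period.
    h = period / n_steps
    pts = [(x01, x02)]
    for _ in range(n_steps):
        pts.append(rk4_step(pts[-1][0], pts[-1][1], h))
    return pts

def boundary_penalty(y, pts, period):
    # Trapezoidal rule for the integral over [0, T] of y(x(t))^2 |x'(t)| dt,
    # i.e. the boundary term appearing in (43) without the factor 1/eps.
    h = period / (len(pts) - 1)
    vals = []
    for a, b in pts:
        d1, d2 = rhs(a, b)
        vals.append(y(a, b) ** 2 * math.hypot(d1, d2))
    return h * (0.5 * vals[0] + sum(vals[1:-1]) + 0.5 * vals[-1])

points = trace_boundary(1.0, 0.0, math.pi, 400)
max_drift = max(abs(g(a, b)) for a, b in points)          # stays on g = 0
closing_gap = math.hypot(points[-1][0] - 1.0, points[-1][1])  # curve closes
length = boundary_penalty(lambda a, b: 1.0, points, math.pi)  # curve length
```

With y ≡ 1 the penalty integral reduces to the length of the curve, which gives a simple consistency check for the parametrization; a state y vanishing on the curve would give a zero penalty, as in the state constraint of Theorem 2.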

References

Adams R (1975) Sobolev spaces. Academic Press, New York
Allaire G (2007) Conception optimale de structures, vol 58. Mathématiques et Applications. Springer, Berlin
Bucur D, Buttazzo G (2005) Variational methods in shape optimization problems, vol 65. Progress in Nonlinear Differential Equations and their Applications. Birkhäuser, Boston, MA
Clason C, Kruse F, Kunisch K (2018) Total variation regularization of multi-material topology optimization. ESAIM Math Model Numer Anal 52(1):275–303
Delfour MC, Zolesio JP (2005) Shapes and geometries: differential calculus and optimization. SIAM, Philadelphia, PA
Haslinger J, Neittaanmäki P (1996) Finite element approximation for optimal shape, material and topology design, 2nd edn. J. Wiley & Sons, Chichester
Henrot A, Pierre M (2005) Variations et optimisation de formes: Une analyse géométrique, vol 48. Mathématiques et Applications. Springer, Berlin
Hirsch MW, Smale S, Devaney RL (2013) Differential equations, dynamical systems, and an introduction to chaos, 3rd edn. Elsevier/Academic Press, San Diego, CA
Kawarada H (1979) Numerical methods for free surface problems by means of penalty. Computing methods in applied sciences and engineering. Lecture notes in mathematics, vol 704. Springer, New York, pp 282–291
Mäkinen R, Neittaanmäki P, Tiba D (1992) On a fixed domain approach for a shape optimization problem. In: Ames WF, van der Houwen PJ (eds) Computational and applied mathematics II: differential equations (Dublin, 1991). Amsterdam, North Holland, pp 317–326
Murea CM, Tiba D (2019) Topological optimization via cost penalization. Topol Methods Nonlinear Anal 54(2B):1023–1050
Murea CM, Tiba D (2021) Implicit parametrizations in shape optimization: boundary observation. Pure Appl Funct Anal
Neittaanmäki P, Pennanen A, Tiba D (2009) Fixed domain approaches in shape optimization problems with Dirichlet boundary conditions. Inverse Probl 25(5):1–18
Neittaanmäki P, Sprekels J, Tiba D (2006) Optimization of elliptic systems: theory and applications. Springer, Berlin
Neittaanmäki P, Tiba D (2008) A fixed domain approach in shape optimization problems with Neumann boundary conditions. In: Glowinski R, Neittaanmäki P (eds) Partial differential equations: modeling and numerical simulations. Computational methods in applied sciences, vol 16. Springer, Berlin, pp 235–244
Neittaanmäki P, Tiba D (2012) Fixed domain approaches in shape optimization problems. Inverse Probl 28(9):1–35
Novotny A, Sokolowski J (2013) Topological derivatives in shape optimization. Springer, Berlin
Pironneau O (1984) Optimal shape design for elliptic systems. Springer, Berlin
Pontryagin LS (1968) Equations differentielles ordinaires. MIR, Moscow
Sokolowski J, Zolesio JP (1992) Introduction to shape optimization: shape sensitivity analysis. Springer, Berlin
Thorpe JA (1979) Elementary topics in differential geometry. Springer, Berlin
Tiba D (2002) A property of Sobolev spaces and existence in optimal design. Appl Math Optim 47(1):45–58
Tiba D (2013) The implicit function theorem and implicit parametrizations. Ann Acad Rom Sci Ser Math Appl 5(1–2):193–208
Tiba D (2016) Boundary observation in shape optimization. In: Barbu V, Lefter C, Vrabie I (eds) New trends in differential equations, control theory and optimization. World Scientific, pp 301–314
Tiba D (2018a) Iterated Hamiltonian type systems and applications. J Differ Equ 264(8):5465–5479
Tiba D (2018b) A penalization approach in shape optimization. Atti della Accademia Peloritana dei Pericolanti: Classe di Scienze Fisiche, Matematiche e Naturali 96(1):A8
Tiba D (2020) Implicit parametrizations and applications in optimization and control. Math Control Relat Fields 10(3):455–470
Tiba D, Murea CM (2019) Optimization of a plate with holes. Comput Math Appl 77(11):3010–3020
Tiba D, Yamamoto M (2020) A parabolic shape optimization problem. Ann Acad Rom Sci Ser Math Appl 12(1–2):312–328

Optimal Factor Taxation with Credibility

Tapio Palokangas

Abstract In this paper, optimal factor taxation is examined when output is produced from labor and capital and some (or all) households save in capital. The main results are as follows. There is a reputational equilibrium where the government has no incentive to change its announced tax policies. In that equilibrium, the assertion of zero capital taxation holds, independently of the wage earners' savings behavior and their weight in the social welfare function. A specific elasticity rule is derived for the optimal wage tax. Finally, it is shown how fertility and mortality can be introduced as endogenous variables in the model.

Keywords Optimal control theory · Differential games · Credibility of public policy

T. Palokangas (B) Helsinki Graduate School of Economics, University of Helsinki, FI-00014 Helsinki, Finland e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053,

1 Introduction

How could the government optimally control a market economy? The simplest model for analyzing that problem is the following:

1. Some (or all) of the households in the economy save for capital. These are called savers. The remainder of the households are called non-savers.
2. Capital is the only asset and the only predetermined state variable in the model.
3. The economy can be divided into two sectors. The formal sector produces goods from labor and capital and it can be taxed. The informal sector, where workers produce goods for themselves without capital, cannot be taxed.
4. The goods produced in the two sectors are aggregated into one numeraire good, the price of which is normalized at unity.

Without the informal sector, the government would fully dictate the allocation of resources in the economy, which leads to the first best (or Pareto) optimum. It is realistic to assume, however, that some economic activities cannot be taxed, which leads to the second best optimum.

Saving for the accumulation of capital leads to a dynamic model where the government attempts to control production and capital accumulation by taxes. This creates a strategic dependence between the government and the private agents (households and firms). Then, it is possible to consider the second best policy where a benevolent government chooses a sequence of tax rates over time to maximize social welfare. According to Benhabib and Rustichini (1997), this policy is inconsistent for the following reason. It is first optimal for the government to promote capital accumulation by promising low taxes on capital in the future. However, once capital has been accumulated, it becomes convenient for the government to tax capital rather than to impose distorting taxes on labor. The result of this inconsistency is that promises on tax policy are non-credible.

It is shown that if the government can tax consumption, labor and capital, then there is a reputational equilibrium with consistent public policy. Consequently, it is not optimal for the government to renege on its commitments. If the government has a commitment technology (e.g., the constitution of the economy), by which it can commit to its announced policy, the strategy for public policy can be credible.

In models of optimal taxation with capital accumulation and no inherent distortions, the classical Chamley–Judd (hereafter C-J) result holds: capital income should be taxed at a zero rate in the long run. Because capital appears only in the production function but not in the utility function, it should not be taxed if there are enough instruments to separate consumption and production decisions in the economy. This C-J result has been extended in different directions. Coleman (2000), Judd (1999, 2002) and Lansing (1999) show that the validity of the C-J result depends on the set of available tax instruments.
How many instruments are needed when only some households save? In this paper, the second best public policy is put in the form of a Stackelberg differential game with the government and the (representative) saver as players. Xie (1997) argues that in such a game, one commonly uses a boundary condition which is not a necessary condition of optimal public policy, but which is responsible for the time inconsistency in the models. In this study, this boundary condition stems from the necessary conditions. However, it does not violate consistency because, in a reputational equilibrium, the government has no incentive to cheat the public.

2 Households and Public Sector

The informal-sector output $N$ is a decreasing and concave function of labor supply in the formal sector $L^S$:

\[ N(L^S), \quad N' < 0, \quad N'' < 0. \tag{1} \]

Hence, more and more labor must be transferred from the formal to the informal sector to produce one more unit in the latter. This two-sector framework is the simplest way of introducing tax evasion that involves distortion in labor taxation.

It is assumed that an exogenous proportion $\beta \in [0, 1]$ of wage income $W$ is earned by the households that are so poor that they do not save at all. The parameter $\beta$ is thus used as a measure of income distribution. The model is an extension of two common models in the literature. For $\beta = 1$, it corresponds to the Judd (1985) model of a representative capitalist that earns all profits and does not work, while the workers earn all wages and do not save. For $\beta = 0$, it corresponds to the Chamley (1986) model of a representative agent that saves and earns both wages and profits.

The whole population has the same constant rate of time preference $\rho > 0$. The instantaneous utility of the (representative) saver $U$ and that of the (representative) non-saver $V$ are given by the following functions:

\[ U(C) = \frac{C^{1-\sigma} - 1}{1 - \sigma}, \quad \sigma > 0, \quad \sigma \ne 1, \qquad V(C_W) > 0, \quad V' > 0, \quad V'' < 0, \tag{2} \]

where $C$ is the saver's consumption, $C_W \doteq \beta W$ is the non-saver's income, which is immediately consumed, and the constant $1/\sigma$ is the intertemporal elasticity of substitution for the saver. The saver's and non-saver's utilities from a flow of consumption starting at time $t$ are

\[ \int_t^\infty U(C) e^{-\rho(s-t)}\, ds, \qquad \int_t^\infty V(C_W) e^{-\rho(s-t)}\, ds. \tag{3} \]

The social welfare function is a weighted average of the utilities of the saver and non-saver:

\[ \int_t^\infty [V(C_W) + \vartheta U(C)] e^{-\rho(s-t)}\, ds, \tag{4} \]

where the constant $\vartheta > 0$ is the social weight of the savers. The marginal rate of substitution between the saver's and non-saver's consumption, when social welfare is held constant, is given by

\[ \psi \doteq -\left.\frac{dC}{dC_W}\right|_{\text{welfare (4) constant}} = \frac{V'}{\vartheta U'}. \tag{5} \]

It is assumed that public expenditures $E$ are constant. Then, at each moment of time, the government budget constraint is given by

\[ E = \tau_C C + \tau_W wL + \tau_K \Pi, \tag{6} \]

where $\tau_C > -1$, $\tau_W > -1$ and $\tau_K \le 1$ are taxes on the saver's consumption $C$, wages in the formal sector $wL$, and (gross) capital income $\Pi$, respectively. To eliminate the possibility that $\tau_K$ would have the value $-\infty$ in the government's optimal program, it is assumed that there is a fixed upper limit $\upsilon \in [0, 1)$ for the capital subsidy $-\tau_K$:

\[ -\upsilon \le \tau_K \le 1. \tag{7} \]

3 Production and Investment

Capital $K$ is the only asset in which the saver can save. In the formal sector, output is produced from capital $K$ and labor $L$ through the function

\[ F(K, L), \quad F_K > 0, \quad F_L > 0, \quad F_{KK} < 0, \quad F_{LL} < 0, \]

where subscripts $K$ and $L$ denote the partial derivatives of $F$ with respect to $K$ and $L$. Because the labor markets are competitive, in the formal sector the labor supply $L^S$ is equal to the demand for labor $L$, and the marginal product of labor in the informal sector $-N'$ is equal to the wage in the formal sector $w$:

\[ L^S = L, \quad w = -N'(L^S) = -N'(L). \tag{8} \]

Given (1) and (8), the elasticity of the labor supply in the formal sector $L^S$ with respect to the wage $w$ can be defined as follows:

\[ \varepsilon \doteq \frac{w}{L^S}\frac{dL^S}{dw} = \frac{N'}{N'' L} > 0. \tag{9} \]

Labor income $W$ is equal to income from the informal sector $N$ plus wages in the formal sector $wL$. From (8) and (9) it follows that labor income is an increasing function of formal-sector employment $L$:

\[ W(L) = N + wL = N(L) - N'(L)L, \quad W' = -N'' L = -N'/\varepsilon > 0. \tag{10} \]

It is assumed, for simplicity, that a constant proportion $\mu \in (0, 1)$ of capital $K$ depreciates at each moment of time. Firms in the formal sector maximize profit by labor $L$, taking the wage $w$, capital $K$ and the wage tax $\tau_W$ as given. This and (8) yield

\[
\begin{aligned}
&\Pi(K, w, \tau_W) \doteq \max_L \big[ F(K, L) - (1 + \tau_W) wL - \mu K \big], \quad L(K, w) = \arg\max_L \Pi, \\
&F_L(K, L) = (1 + \tau_W) w = -(1 + \tau_W) N'(L), \quad \Pi_K \doteq \frac{\partial \Pi}{\partial K} = F_K - \mu, \\
&\Pi_{KK} \equiv 0 \;\Leftrightarrow\; \Pi = \Pi_K K.
\end{aligned}
\tag{11}
\]

It is furthermore assumed, for simplicity, that investment is irreversible, i.e., old capital goods have no aftermarket. This means that the capital stock $K$ can decrease, $\dot K < 0$, only up to the level $\mu K$ at which the stock of capital goods $K$ depreciates. The saver's budget constraint is then given by

\[ \dot K = \frac{dK}{dt} = (1 - \beta) W + (1 - \tau_K)\Pi(K, w, \tau_W) - (1 + \tau_C) C \ge -\mu K, \tag{12} \]

where $\dot K$ is capital accumulation (i.e., saving in the economy). In the whole economy, capital accumulation is equal to production in the two sectors, $N(L) + F(K, L)$, minus the saver's consumption $C$, the non-saver's consumption $\beta W$, government expenditure $E$ and depreciation $\mu K$. Given (10), this yields

\[ \dot K = N(L) + F(K, L) - C - \beta W(L) - E - \mu K \ge -\mu K. \tag{13} \]

Because the labor and goods markets are balanced by (8) and (13), then, by the Walras law, the government budget (6) is balanced as well. Hence the government budget is not a constraint in the problem of public policy.
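
The Walras-law argument can be verified by direct substitution. The following worked step is a sketch that writes $\Pi$ for the profit defined in (11) and capital income in (6) (the symbol itself is a notational assumption) and uses $W = N + wL$ from (10):

```latex
\begin{aligned}
\dot K &= (1-\beta)W + (1-\tau_K)\Pi - (1+\tau_C)C \\
       &= (N + wL) - \beta W + \big[F - (1+\tau_W)wL - \mu K\big] - C
          - (\tau_C C + \tau_K \Pi) \\
       &= N + F - C - \beta W - \mu K
          - (\tau_C C + \tau_W wL + \tau_K \Pi) \\
       &= N + F - C - \beta W - E - \mu K .
\end{aligned}
```

The last equality holds exactly when the government budget (6) is balanced, so the saver's budget (12) and the resource constraint (13) coincide if and only if (6) holds.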

4 Non-Credible Public Policy

In this section, it is assumed that the government can renege on an announced sequence of taxes. This means that the game must be solved backwards by dynamic programming as follows. At each moment $t$, both the saver and the government make their choices on the assumption that the opponent will make her choices optimally for the whole period $(t, \infty)$. The crucial property of dynamic programming is that the strategies of both parties are independent of the initial time $t$ (cf. Kamien and Schwartz (1985)).

The representative saver takes the wage $w$ and total labor income in the economy $W$ as given. She maximizes her utility

\[ \int_t^\infty U(C) e^{-\rho(s-t)}\, ds \]

by her consumption $C$ subject to capital accumulation (12), given $w$, $W$, and taxes $\tau_W$, $\tau_K$, $\tau_C$. When the strategies of the government and the saver are in a Stackelberg


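equilibrium at every moment from time $t$ onwards, the saver's utility from the flow of consumption from time $t$ onwards, given the accumulation of capital $K$, (12), is defined as follows:

\[ B(K, t) \doteq \max_{C \ \text{s.t.}\ (12)} \int_t^\infty U(C) e^{-\rho(s-t)}\, ds. \tag{14} \]

The partial derivatives of the function in (14) are denoted as

\[ B_K \doteq \frac{\partial B}{\partial K}, \quad B_t \doteq \frac{\partial B}{\partial t}, \quad B_{KK} \doteq \frac{\partial^2 B}{\partial K^2}, \quad B_{tt} \doteq \frac{\partial^2 B}{\partial t^2}, \quad B_{Kt} \doteq \frac{\partial^2 B}{\partial K \partial t}, \]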

for convenience. The Hamilton–Jacobi–Bellman equation for the saver is then given by

\[ \rho B - B_t = \max_C \big\{ U(C) + B_K [(1 - \beta) W + (1 - \tau_K)\Pi - (1 + \tau_C) C] \big\}, \tag{15} \]

where labor income $W$, the wage $w$, and taxes $\tau_W$, $\tau_K$, $\tau_C$ are given. Noting (12), the maximization in (15) yields the first-order condition

\[ \nu = \frac{1}{1 + \tau_C}\frac{\partial}{\partial C}\big[ U + B_K \dot K \big] = \frac{C^{-\sigma}}{1 + \tau_C} - B_K \;\begin{cases} = 0 & \text{for } \dot K > -\mu K, \\ > 0 & \text{for } \dot K = -\mu K, \end{cases} \tag{16} \]

where $\nu$ is the Kuhn–Tucker multiplier corresponding to the inequality $\dot K \ge -\mu K$. Differentiating Eq. (15) with respect to $K$ yields

\[ \rho B_K - B_{Kt} = B_{KK} [(1 - \beta) W + (1 - \tau_K)\Pi - (1 + \tau_C) C] + (1 - \tau_K) B_K \Pi_K. \]

Trying the solution $\lambda(t) \doteq B_K(K, t)$ for this equation implies $B_{KK} \equiv 0$ and $\dot\lambda = B_{Kt} = [\rho - (1 - \tau_K)\Pi_K] B_K = [\rho - (1 - \tau_K)\Pi_K]\lambda$. Hence, in equilibrium, $B_K = \lambda \equiv 0$. Given (13), (16), and $B_K \equiv 0$, it holds true that

\[ \dot K = -\mu K < 0, \quad C = N(L) + F(K, L) - \beta W(L) - E. \tag{17} \]
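The government chooses taxes $\tau_W$, $\tau_K$, $\tau_C$ to maximize social welfare (4), given the reaction of the saver (17), the dependence of labor income on employment (10), the determination of the wage $w = -N'(L)$ by (8), and the constraints (7). Because there is a one-to-one correspondence from $\tau_W$ to $L$ through (10), $\tau_W$ can be replaced by $L$ as a control variable. The social welfare from the flow of consumption starting at time $t$ is given by

\[ G(K, t) \doteq \max_{L,\ \tau_C,\ \tau_K \in [-\upsilon, 1]} \int_t^\infty \big[ V(\beta W(L)) + \vartheta U(C) \big] e^{-\rho(s-t)}\, ds. \]

The partial derivatives of the function $G$ are denoted as

\[ G_K \doteq \frac{\partial G}{\partial K}, \quad G_t \doteq \frac{\partial G}{\partial t}, \quad G_{KK} \doteq \frac{\partial^2 G}{\partial K^2}, \quad G_{tt} \doteq \frac{\partial^2 G}{\partial t^2}, \quad G_{Kt} \doteq \frac{\partial^2 G}{\partial K \partial t}, \]

for convenience. The Hamilton–Jacobi–Bellman equation for the government is given by

\[ \rho G - G_t = \max_{L, \tau_K} \big[ V(\beta W(L)) + \vartheta U(C) + G_K \dot K \big|_{(17)} \big] = \max_{L, \tau_K} \big[ V(\beta W(L)) + \vartheta U(C) - \mu K G_K \big]. \tag{18} \]

Noting (5), (10), (11), and (17), the maximization by $L$ in (18) yields

\[
\begin{aligned}
0 &= \beta V' W' + U' \vartheta \frac{\partial C}{\partial L} = \beta W' (V' - \vartheta U') + \vartheta U' (N' + F_L) \\
  &= \beta W' (V' - \vartheta U') - \vartheta U' N' \tau_W = \vartheta U' N' [\beta(1 - \psi)/\varepsilon - \tau_W],
\end{aligned}
\]

which is equivalent to

\[ \tau_W = \beta(1 - \psi)/\varepsilon. \tag{19} \]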

The result of this section can be summarized as follows.

Proposition 1 With non-credible public policy, the optimal wage tax is given by (19), and capital stock $K$ is exhausted at the rate $\mu = -\dot K/K$.

The tax rule (19) can be explained as follows. The lower the elasticity $\varepsilon$, the less distorting labor taxation is and the more labor should be taxed. If all households are savers, $\beta = 0$, then public expenditures should be financed by the non-distorting consumption tax $\tau_C$ and wages should not be taxed at all, $\tau_W = 0$. The lower the relative social value of the savers' consumption (i.e., the closer $\psi$ to zero) or the smaller the proportion of wages the savers earn (i.e., the closer $\beta$ to one), the higher the wage tax $\tau_W$ must be.
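
The elasticity rule (19) is simple enough to evaluate directly. The sketch below is purely illustrative: the function names and the numerical inputs are hypothetical, and the marginal utilities are passed in as given numbers rather than derived from a calibrated model.

```python
def marginal_rate_of_substitution(v_prime, u_prime, theta):
    # Equation (5): psi = V' / (theta * U').
    return v_prime / (theta * u_prime)

def wage_tax(beta, psi, eps):
    # Equation (19): tau_W = beta * (1 - psi) / eps.
    if eps <= 0:
        raise ValueError("the labor supply elasticity must be positive")
    return beta * (1.0 - psi) / eps

# Hypothetical inputs: V' = 0.5, U' = 1.0, social weight theta = 1.0.
psi_example = marginal_rate_of_substitution(0.5, 1.0, 1.0)

tax_all_savers = wage_tax(0.0, psi_example, 2.0)   # beta = 0
tax_low_eps = wage_tax(0.5, psi_example, 1.0)      # inelastic labor supply
tax_high_eps = wage_tax(0.5, psi_example, 2.0)     # elastic labor supply
```

In line with the discussion of (19), the rule returns a zero wage tax when all households are savers (β = 0) and a higher wage tax when the labor supply elasticity is lower.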

5 Savers

Assume that the government cannot renege on its announced sequence of taxes. The formal structure of the interaction between the government and the saver then corresponds to an open-loop Stackelberg equilibrium outcome for a non-cooperative, infinite differential game in which the government is the leader and the saver the follower (cf. Başar and Olsder (1989, Sect. 7.2)). Technically, the solution of this game is as follows. First, determine the unique optimal response of the saver to every strategy of the government. The saver's choices can be made on the basis of the initial capital stock without making any difference to the solution. Second, find the government's optimal strategy given the saver's optimal response. Since the government cannot depart from its announced strategy, then, at each moment, the government makes its choices by the initial capital stock.

Let $t = 0$ be the initial moment. Given (2) and (3), the saver maximizes her utility

\[ \int_t^\infty U(C) e^{-\rho(s-t)}\, ds \]

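by her consumption $C$ subject to capital accumulation (12), taking the wage $w$, labor income $W$ and taxes $(\tau_W, \tau_K, \tau_C)$ as given. This yields the Hamiltonian $H^C$ and the Lagrangian $\mathcal{L}^C$ as follows:

\[
\begin{aligned}
H^C &= U(C) + \lambda [(1 - \beta) W + (1 - \tau_K)\Pi(K, w, \tau_W) - (1 + \tau_C) C], \\
\mathcal{L}^C &= H^C + \delta [(1 - \beta) W + (1 - \tau_K)\Pi(K, w, \tau_W) - (1 + \tau_C) C + \mu K],
\end{aligned}
\]

where the co-state variable $\lambda$ evolves according to

\[ \dot\lambda = \rho\lambda - \frac{\partial \mathcal{L}^C}{\partial K} = \rho\lambda - (1 - \tau_K)\Pi_K(K, w, \tau_W)(\rho + \lambda), \quad \lim_{s \to \infty} \lambda K e^{-\rho(s-t)} = 0, \tag{20} \]

and the multiplier $\delta$ is subject to the Kuhn–Tucker conditions

\[ \delta(\dot K + \mu K) = \delta \big[ (1 - \beta) W + (1 - \tau_K)\Pi(K, w, \tau_W) - (1 + \tau_C) C + \mu K \big] = 0, \quad \delta \ge 0. \tag{21} \]

The first-order condition for the saver's optimization is given by

\[ C^{-\sigma} = U'(C) = (1 + \tau_C)(\lambda + \delta). \tag{22} \]

Assume first that $\tau_K = 1$. Equation (20) then takes the form $\dot\lambda = \rho\lambda$ and, by choosing the initial value $\lambda(0)$ subject to

\[ \lim_{s \to \infty} \lambda K e^{-\rho(s-t)} = 0, \]

it holds true that $\lambda = \lambda(0) = 0$. From (21), (22), and $\lambda \equiv 0$ it follows that

\[ \delta = C^{-\sigma}/(1 + \tau_C) > 0, \quad \dot K = -\mu K < 0 \quad \text{and} \quad C = N + F - \beta W - E \quad \text{for } \tau_K = 1. \tag{23} \]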




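Assume next that $\tau_K < 1$. In such a case, $\dot K > -\mu K$ holds. Noting (20), (21), and (22), it holds true that $\delta = 0$, $C^{-\sigma} = \lambda$, and

\[ \dot C/C = -(1/\sigma)\dot\lambda/\lambda = [(1 - \tau_K)\Pi_K(K, w, \tau_W) - \rho]/\sigma \quad \text{for } \tau_K < 1. \tag{24} \]

Variables $K$ and $C$ are governed by the system (12) and (24). With decreasing returns to scale, $\Pi_{KK} < 0$, the dynamics is as follows. Because

\[ \frac{\partial \dot K}{\partial K} = (1 - \tau_K)\Pi_K > 0, \quad \frac{\partial \dot K}{\partial C} < 0, \quad \frac{\partial \dot C}{\partial K} = (1 - \tau_K)\Pi_{KK} C < 0, \quad \left.\frac{\partial \dot C}{\partial C}\right|_{\dot C = 0} = 0, \]

it holds true that

\[ \frac{\partial \dot K}{\partial K} + \left.\frac{\partial \dot C}{\partial C}\right|_{\dot C = 0} > 0, \quad \frac{\partial \dot K}{\partial K}\left.\frac{\partial \dot C}{\partial C}\right|_{\dot C = 0} < \frac{\partial \dot K}{\partial C}\frac{\partial \dot C}{\partial K}, \]

and a saddle-point solution for the system exists. Hence, the co-state variable $C$ (that represents $\lambda$) jumps onto the saddle path that leads to the steady state, where $K$, $C$, and $\lambda$ are constants and $\lim_{s \to \infty} \lambda K e^{-\rho(s-t)} = 0$ holds.

With constant returns to scale, $\Pi_{KK} \equiv 0$, there must be $\Pi = \Pi_K K$ by (11). Given $\Pi = \Pi_K K$, (12), (20), and $\delta = 0$, one obtains

\[ \left[ \frac{\partial \dot\lambda}{\partial \lambda} + \frac{\partial \dot K}{\partial K} - \rho \right]_{\dot K = 0} = \left[ (1 - \beta)\frac{W}{K} - (1 + \tau_C)\frac{C}{K} \right]_{\dot K = 0} = (\tau_K - 1)\Pi_K < 0. \]

This implies the transversality condition $\lim_{s \to \infty} \lambda K e^{-\rho(s-t)} = 0$.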
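
The saddle-point property under decreasing returns can be illustrated with a small numerical check of the linearized system (12), (24). This is a sketch under stated assumptions: the profit function Π(K) = B·K^a with 0 < a < 1, the constant labor income W, and all parameter values are hypothetical, and the factor 1/σ from (24) is kept in the derivative of Ċ with respect to K.

```python
import math

# Hypothetical parameters for the illustration.
a, B = 0.3, 1.0          # profit function Pi(K) = B * K**a
rho, sigma = 0.05, 2.0   # time preference, inverse elasticity of substitution
tau_K, tau_C = 0.0, 0.1  # capital and consumption taxes
beta, W = 0.5, 1.0       # non-savers' wage share, labor income (held fixed)

def profit(K):
    return B * K ** a

def profit_K(K):
    return a * B * K ** (a - 1.0)

def profit_KK(K):
    return a * (a - 1.0) * B * K ** (a - 2.0)

def steady_state():
    # C-dot = 0 in (24): (1 - tau_K) * Pi_K(K) = rho.
    K_star = (a * B * (1.0 - tau_K) / rho) ** (1.0 / (1.0 - a))
    # K-dot = 0 in (12).
    C_star = ((1.0 - beta) * W + (1.0 - tau_K) * profit(K_star)) / (1.0 + tau_C)
    return K_star, C_star

def jacobian(K, C):
    # Partial derivatives of (K-dot, C-dot) with respect to (K, C),
    # evaluated where C-dot = 0.
    return [[(1.0 - tau_K) * profit_K(K), -(1.0 + tau_C)],
            [(1.0 - tau_K) * profit_KK(K) * C / sigma, 0.0]]

def eigenvalues(J):
    trace = J[0][0] + J[1][1]
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    root = math.sqrt(trace * trace - 4.0 * det)
    return (trace - root) / 2.0, (trace + root) / 2.0

K_star, C_star = steady_state()
J = jacobian(K_star, C_star)
trace_J = J[0][0] + J[1][1]
det_J = J[0][0] * J[1][1] - J[0][1] * J[1][0]
lam_minus, lam_plus = eigenvalues(J)
```

A positive trace together with a negative determinant gives one stable and one unstable eigenvalue, which is the saddle-path configuration described in the text: consumption jumps onto the stable branch leading to the steady state.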


6 Government

Given (2) and (4), the government sets taxes $\tau_W$, $\tau_K$, $\tau_C$ to maximize social welfare

\[ \int_t^\infty [V(C_W) + \vartheta U(C)] e^{-\rho(s-t)}\, ds \]

subject to the response of the private sector, (13), (23) and (24), the function $L(K, \tau_W)$ in (11), the determination of the wage $w = -N'(L)$ by (8), and the constraints (7). Because there is a one-to-one correspondence from $\tau_W$ to $L$ through $L(K, \tau_W)$ in (11), $\tau_W$ can be replaced by $L$ as a control variable.

Assume that $\tau_K \equiv 1$ for $s \in [t, \infty)$. Then, given (23), $\dot K = -\mu K$ holds and the Hamiltonian for the government's maximization is given by

\[ H^{I} = V\big(\beta W(L)\big) + \vartheta U\big(N(L) + F(K, L) - \beta W(L) - E\big) - \gamma \mu K, \]

where the co-state variable $\gamma$ for $K$ evolves according to

\[ \dot\gamma = \rho\gamma - \frac{\partial H^{I}}{\partial K}, \quad \lim_{s \to \infty} \lambda K e^{-\rho(s-t)} = 0. \]


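The first-order condition for $L$ is given by (18). These results can be rephrased as follows:

Proposition 2 If the government taxed all profits, $\tau_K \equiv 1$ for $s \in [t, \infty)$, then the dynamics of the economy would be the same as in the case of non-credible public policy in Proposition 1.

Assume that the government does not tax all profits away, $\tau_K \in [-\upsilon, 1)$ (cf. (7)). The Hamiltonian and the Lagrangian for the government's maximization are then

\[
\begin{aligned}
H^{II} &= V\big(\beta W(L)\big) + \vartheta U(C) + \eta \big[ (1 - \tau_K)\Pi_K(K, w, \tau_W) - \rho \big] C/\sigma \\
&\quad + \gamma [N(L) + F(K, L) - \beta W(L) - C - E - \mu K], \\
\mathcal{L} &= H^{II} + \chi_1 (1 - \tau_K) + \chi_2 (\tau_K + \upsilon),
\end{aligned}
\tag{25}
\]

where the co-state variables $\gamma$ and $\eta$ evolve according to

\[ \dot\gamma = \rho\gamma - \frac{\partial \mathcal{L}}{\partial K}, \quad \lim_{s \to \infty} K\gamma e^{-\rho(s-t)} = 0, \quad \dot\eta = \rho\eta - \frac{\partial \mathcal{L}}{\partial C}, \quad \lim_{s \to \infty} C\eta e^{-\rho(s-t)} = 0, \tag{26} \]

and variables $\chi_1$ and $\chi_2$ satisfy the Kuhn–Tucker conditions

\[ \chi_1 (1 - \tau_K) = 0, \quad \chi_1 \ge 0, \quad \chi_2 (\tau_K + \upsilon) = 0, \quad \chi_2 \ge 0. \tag{27} \]

The first-order conditions for the capital tax $\tau_K$ are

\[ \partial \mathcal{L}/\partial \tau_K = -(C/\sigma)\Pi_K \eta - \chi_1 + \chi_2 = 0. \tag{28} \]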

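Examine first the case $-\upsilon < \tau_K < 1$, where $\chi_1 = \chi_2 = 0$. Because $\partial^2 \mathcal{L}/\partial \tau_K^2 = \partial^2 H^{II}/\partial \tau_K^2 \equiv 0$ then holds, the capital tax $\tau_K$ must be solved through the generalized Legendre–Clebsch conditions (cf. Bell and Jacobson (1975, pp. 12–19))

\[
\begin{aligned}
\frac{\partial}{\partial \tau_K}\left[ \frac{d^p}{dt^p}\frac{\partial H^{II}}{\partial \tau_K} \right] &= 0 \quad \text{for any odd integer } p, \\
(-1)^q \frac{\partial}{\partial \tau_K}\left[ \frac{d^{2q}}{dt^{2q}}\frac{\partial H^{II}}{\partial \tau_K} \right] &\ge 0 \quad \text{for any integer } q,
\end{aligned}
\tag{29}
\]

where $t$ is time. Since $C > 0$ by (22), Eq. (28) yields $\eta = 0$. Differentiating (28) with respect to time $t$ and noting (25) and (26) yield

\[ \frac{d}{dt}\frac{\partial H^{II}}{\partial \tau_K} = -\frac{C}{\sigma}\Pi_K \dot\eta \Big|_{\eta = 0} = \frac{C}{\sigma}\Pi_K \frac{\partial H^{II}}{\partial C}\Big|_{\eta = 0} = \frac{C}{\sigma}\Pi_K (\vartheta C^{-\sigma} - \gamma) = 0, \quad \frac{\partial}{\partial \tau_K}\left[ \frac{d}{dt}\frac{\partial H^{II}}{\partial \tau_K} \right] = 0. \tag{30} \]

Given these and (24), it furthermore holds true that

\[ \frac{d^2}{dt^2}\frac{\partial H^{II}}{\partial \tau_K} = -\Pi_K \vartheta C^{-\sigma} \dot C = 0, \quad \frac{\partial}{\partial \tau_K}\left[ \frac{d^2}{dt^2}\frac{\partial H^{II}}{\partial \tau_K} \right] = \Pi_K^2 \vartheta C^{-\sigma} \frac{C}{\sigma} > 0. \tag{31} \]

Results (30) and (31) satisfy the Legendre–Clebsch conditions (29).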

7 Policy Rules

From (30) it follows that

\[ \gamma = \vartheta U' = \vartheta C^{-\sigma}. \tag{32} \]

Given this and (31), $\dot C = 0$ holds, and $C$ and $\gamma$ are kept constant. Noting (11), (25), (26), $\chi_1 = \chi_2 = 0$, and $\eta = 0$, it holds true that $0 = \dot\gamma = \rho\gamma - \partial H^{II}/\partial K = (\rho + \mu - F_K)\gamma$ and

\[ \rho = F_K(K, L) - \mu = \Pi_K. \tag{33} \]

By $\eta = 0$, (11), and (32), one can derive the first-order condition for $L$:

\[
\begin{aligned}
\partial H^{II}/\partial L &= \beta(V' - \gamma) W' + \gamma(N' + F_L) = \beta(V' - \gamma) W' - \gamma N' \tau_W \\
&= \gamma N' \big\{ \beta [V'/(\vartheta U') - 1](W'/N') - \tau_W \big\} = 0,
\end{aligned}
\]

which is equivalent to (19). Hence, the following result is obtained:

Proposition 3 The optimal wage tax in the steady state is (19).

Equations $\dot C = 0$, (24), and (33) imply $0 = (1 - \tau_K)\Pi_K - \rho = (1 - \tau_K)\rho - \rho = -\tau_K \rho$, hence $\tau_K = 0$ and the following result:

Proposition 4 The capital tax $\tau_K$ should be zero in the steady state.

This result is a dynamic version of aggregate production efficiency (cf. Chamley (1986, 2001); Correia (1996); Judd (1985)). Because capital is an intermediate good, appearing only in the production function but not in the utility function, it should not be taxed if there are enough instruments to separate consumption and production decisions.

In the steady state, $F_K(K^*, L(K^*)) = \rho + \mu$ holds and relations (27) and (28) imply the following. If $\eta > 0$, then capital is heavily subsidized, $-\tau_K = \upsilon$, and the saver accumulates wealth, $\dot K > 0$. If $\eta < 0$, then capital income is taxed away, $\tau_K = 1$, and the saver exhausts its wealth, $\dot K < 0$. In equilibrium $\eta = 0$, capital $K$ is held constant at $K^*$. This result can be summarized as follows:

Proposition 5 The steady-state level for capital, $K^*$, is determined by $\rho + \mu = F_K(K^*, L(K^*))$ and, outside the steady state, the capital tax evolves according to $\tau_K = 1$ for $K < K^*$ and $\tau_K = -\upsilon < 0$ for $K > K^*$.

Because the system produces a steady state in which $K$, $C$ and $\gamma$ are constants and $\eta = 0$,

\[ \lim_{s \to \infty} K\gamma e^{-\rho(s-t)} = 0 \quad \text{and} \quad \lim_{s \to \infty} C\eta e^{-\rho(s-t)} = 0 \]

hold. The government's choice set is more restrictive with the rule $\tau_K \equiv 1$ for $t \in [0, \infty)$ than with $\tau_K \in [-\upsilon, 1]$. Therefore, in the former case, the welfare is lower, i.e. $H^{I} < H^{II}$. Hence, Proposition 1 yields the following corollary.

Proposition 6 The government prefers credible to non-credible public policy. In other words, it has no incentive to cheat the public.

A government with a good reputation can always impose the same outcome as one with a bad reputation, but it will never have an incentive to do so. Because the saver knows this, it relies on the announced tax policy and invests in capital. In Propositions 1–6, parameters $\beta$ and $\vartheta$ do not affect the main results, i.e., the zero capital taxation in the limit and the equalization of the marginal utility of income. Hence, despite changes in the saver's social weight $\vartheta$ or in her share of wages, $1 - \beta$, the saver can expect that the main principles of the tax policy will remain invariant. Because the consumption tax $\tau_C$ does not appear in Propositions 1–6, it balances the government budget.

8 Conclusions and Extensions

It is shown that a rational government should maintain the credibility of its policy by some commitment technology. Once this condition is satisfied, it becomes possible to eliminate inefficiency from the economy so that each distortion in the economy will be corrected by a specific tax rule that is based on observable variables (e.g., prices, quantities and their statistical dependency). In the following, some extensions of this method are presented.


8.1 Endogenous Fertility

So far, it has been assumed that population is constant and normalized at unity, and that households derive utility from consumption only according to (2). Assume now that population $P$ grows at the rate

\[ \dot{P}/P = f - m, \qquad P(t) = P_t, \tag{34} \]

where $f$ is the fertility rate and $m$ the mortality rate in the economy. The transformation curve (1) can then be changed into

\[ N(L_S/P)\,P, \qquad N' < 0, \qquad N'' < 0. \]

With this specification, the growth of population $P$ increases both the formal and informal sector in the same proportion. Let us ignore the non-savers altogether and assume that all individuals are identical savers. Assume furthermore that the number of newborns $fP$ is a kind of consumption good along with the “normal” consumption good $C$. Hence, an individual derives utility from a linearly homogeneous function of consumption per person $c \doteq C/P$ and the fertility rate in its family $f$:

\[ u(c, f), \qquad \partial u/\partial c > 0, \qquad \partial u/\partial f > 0. \tag{35} \]

Then, the intertemporal utility function for a family, (2) and (3), can be replaced by that of an individual (35):

\[ \int_t^\infty \frac{u(c, f)^{1-\sigma} - 1}{1 - \sigma}\, e^{-\rho(s-t)}\, ds. \tag{36} \]

An individual’s capital is equal to capital per head $k \doteq K/P$ which, by (34), evolves according to

\[ \dot{k} = \frac{d}{dt}\left(\frac{K}{P}\right) = \frac{\dot{K}}{P} - \frac{K}{P}\frac{\dot{P}}{P} = \frac{\dot{K}}{P} + (m - f)k, \tag{37} \]

where $\dot{K}$ is given by (12) for an individual and by (13) for the government. The representative individual maximizes utility (36) by choosing its consumption $c$ and the fertility rate $f$, so that $c$ and $f$ become functions of the taxes and capital per head $k$. The government maximizes utility (36) by the same variables $c$ and $f$, but obtains population $P$ as an additional state variable that evolves according to (34). The author has examined this in Lehmijoki and Palokangas (2006, 2009, 2010, 2016).
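The identity in (37) can be checked numerically. The following is a minimal sketch, assuming hypothetical exponential paths for $K$ and $P$ and illustrative parameter values that do not come from the chapter; it compares a finite-difference estimate of $\dot{k}$ with the right-hand side of (37).

```python
import math

# Hypothetical parameters (illustrative only, not taken from the chapter).
F_RATE = 0.03   # fertility rate f
M_RATE = 0.01   # mortality rate m
G_RATE = 0.05   # assumed growth rate of aggregate capital K
K0, P0 = 10.0, 2.0

def aggregate_capital(t):
    # Assumed path K(t) = K0 * exp(g t); any differentiable path would do.
    return K0 * math.exp(G_RATE * t)

def population(t):
    # Solves dP/dt = (f - m) P, i.e. equation (34).
    return P0 * math.exp((F_RATE - M_RATE) * t)

def capital_per_head(t):
    return aggregate_capital(t) / population(t)

def k_dot_formula(t):
    # Right-hand side of (37): K_dot / P + (m - f) k.
    k_dot_aggregate = G_RATE * aggregate_capital(t)
    return k_dot_aggregate / population(t) + (M_RATE - F_RATE) * capital_per_head(t)

def k_dot_numeric(t, h=1e-5):
    # Central finite difference of k(t) = K(t) / P(t).
    return (capital_per_head(t + h) - capital_per_head(t - h)) / (2.0 * h)

t = 4.0
gap = abs(k_dot_formula(t) - k_dot_numeric(t))
print(f"formula: {k_dot_formula(t):.6f}, finite difference: {k_dot_numeric(t):.6f}")
```

With these assumed paths, capital per head grows at the rate $g - (f - m)$, so higher fertility dilutes capital per head exactly as the term $(m - f)k$ in (37) indicates.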



8.2 Endogenous Mortality

Assume that every individual faces the same mortality rate $m$ at each moment of time. Then, the probability of dying in a short time $dt$ is equal to $m\,dt$, the probability that an individual will survive beyond the period $[t, s]$ is given by $e^{(t-s)m(s)}$, and the individual’s expected utility at time $s$ is given by $e^{(t-s)m(s)} u(c(s), f(s))$. The utility function of the representative family is then the expected utility of the representative member of that family for the planning period $s \in [0, \infty)$ as follows:

\[ \int_0^\infty u\, e^{-mt} e^{-\rho t}\, dt = \int_t^\infty u\big(c(s), f(s)\big)\, e^{[\rho + m(s)](t-s)}\, ds, \qquad \rho > 0, \tag{38} \]

where $\rho$ is the constant rate of time preference for an individual that would live forever. The whole analysis would be the same as before, except that the mortality rate $m$ increases the rate of time preference from $\rho$ to $\rho + m$. It is plausible to assume the following:

1. Public spending $E$ is health expenditures.
2. Because capital is the only asset in the model, an individual’s wealth is equal to capital per individual $k$.
3. Health expenditures devoted to a particular individual are equal to health expenditures per individual, $E/P$.
4. An individual’s mortality rate $m$ is an increasing function of its wealth $k$ and the government’s health expenditures devoted to that individual, $E/P$.

Because the mortality rate $m(t) = m\big(k(t), E(t)/P(t)\big)$ is now an endogenous variable in the model, it must be eliminated from the discount factor of the utility function (38) by the Uzawa transformation (cf. Uzawa (1968)):

\[ \theta(t) = \big[\rho + m\big(k(t), E(t)/P(t)\big)\big]\, t \quad\text{with}\quad dt = \frac{d\theta}{\rho + m\big(k(\theta), E(\theta)/P(\theta)\big)}, \tag{39} \]

which transforms the model from real time $t$ to virtual time $\theta$. Because $\theta(T) = [\rho + m(T)]T$, $\theta(\infty) = \infty$ and

\[ \frac{dt}{d\theta} = \frac{1}{\rho + m(\theta)} > 0 \]

hold true, one can define $\theta(t)$ as an alternative virtual time variable and set the variables in terms of it. With the transformation (39), the utility function (38) and the evolution of the state variables $P$ (34) and $k$ (37) are changed from real time $s$ into virtual time $\theta$ as follows:

\[ \int_{\theta(t)}^\infty \frac{u\big(c(\theta), f(\theta)\big)}{\rho + m\big(k(\theta), E(\theta)/P(\theta)\big)}\, e^{\theta(t) - \theta}\, d\theta, \]

\[ \frac{dP}{d\theta} = \frac{f(\theta) - m\big(k(\theta), E(\theta)/P(\theta)\big)}{\rho + m\big(k(\theta), E(\theta)/P(\theta)\big)}\, P(\theta), \qquad P(t) = P_t, \]

\[ \frac{dk}{d\theta} = \frac{\dot{k}(\theta)}{\rho + m\big(k(\theta), E(\theta)/P(\theta)\big)}, \qquad k(t) = k_t. \]

After the system has been analysed in virtual time $\theta$ by the standard methods above, the results can be returned to real time $t$ by reversing the transformation (39). The author has applied this method to a model of population growth with endogenous fertility and endogenous mortality in Lehmijoki and Palokangas (2023). See also Lehmijoki and Rovenskaya (2010).
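The change of variables can be illustrated numerically. The sketch below is not the author's model: the mortality path, the flow utility, and the value of $\rho$ are hypothetical, and virtual time is accumulated from the differential form $d\theta = (\rho + m)\,dt$ in (39), which coincides with $\theta = (\rho + m)t$ when $m$ is constant. It checks that discounted utility computed in real time matches the integral of $u/(\rho + m)\,e^{-\theta}$ computed in virtual time.

```python
import math

RHO = 0.03  # hypothetical rate of time preference

def mortality(t):
    # Hypothetical mortality path; in the chapter m depends on k and E/P.
    return 0.02 + 0.01 * math.exp(-0.1 * t)

def flow_utility(t):
    # Hypothetical flow utility u(c(t), f(t)) along some given path.
    return 1.0 + 0.5 * math.sin(0.2 * t)

def discounted_utility(horizon=400.0, dt=0.01):
    """Trapezoidal integration of the same objective in real and virtual time."""
    steps = int(round(horizon / dt))
    theta = 0.0                      # virtual time, theta(0) = 0
    real_sum = 0.0                   # integral of u * exp(-theta) dt
    virtual_sum = 0.0                # integral of u / (rho + m) * exp(-theta) dtheta
    rate_prev = RHO + mortality(0.0)
    real_prev = flow_utility(0.0)    # exp(-theta) equals 1 at t = 0
    virt_prev = real_prev / rate_prev
    for i in range(1, steps + 1):
        t = i * dt
        rate = RHO + mortality(t)
        d_theta = 0.5 * (rate_prev + rate) * dt   # d(theta) = (rho + m) dt
        theta += d_theta
        real_now = flow_utility(t) * math.exp(-theta)
        virt_now = real_now / rate
        real_sum += 0.5 * (real_prev + real_now) * dt
        virtual_sum += 0.5 * (virt_prev + virt_now) * d_theta
        rate_prev, real_prev, virt_prev = rate, real_now, virt_now
    return real_sum, virtual_sum, theta

real_value, virtual_value, theta_end = discounted_utility()
print(f"real time: {real_value:.6f}, virtual time: {virtual_value:.6f}")
```

In virtual time the discount factor is simply $e^{-\theta}$ with a constant unit rate, which is what makes the standard methods applicable after the transformation.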

References

Başar T, Olsder GJ (1989) Dynamic noncooperative game theory, 2nd edn. Academic Press, London
Bell DJ, Jacobson DH (1975) Singular optimal control problems. Academic Press, London
Benhabib J, Rustichini A (1997) Optimal taxes without commitment. J Econ Theory 77(2):231–259
Chamley C (1986) Optimal taxation of capital income in general equilibrium with infinite lives. Econometrica 54(3):607–622
Chamley C (2001) Capital income taxation, wealth distribution and borrowing constraints. J Public Econ 79(1):55–69
Coleman WJ (2000) Welfare and optimum dynamic taxation of consumption and income. J Public Econ 76(1):1–39
Correia IH (1996) Should capital income be taxed in the steady state? J Public Econ 60(1):147–151
Judd KL (1985) Redistributive taxation in a simple perfect foresight model. J Public Econ 28(1):59–83
Judd KL (1999) Optimal taxation and spending in general competitive growth models. J Public Econ 71(1):1–26
Judd KL (2002) Capital-income taxation with imperfect competition. Am Econ Rev 92(2):417–421
Kamien MI, Schwartz NL (1985) Dynamic optimization: the calculus of variations and optimal control theory in economics and management. North-Holland, Amsterdam
Lansing KJ (1999) Optimal redistributive capital taxation in a neoclassical growth model. J Public Econ 73(3):423–453
Lehmijoki U, Palokangas T (2006) Political instability, gender discrimination, and population growth in developing countries. J Popul Econ 19(2):431–446
Lehmijoki U, Palokangas T (2009) Population growth overshooting and trade in developing countries. J Popul Econ 22(1):43–56
Lehmijoki U, Palokangas T (2010) Trade, population growth, and the environment in developing countries. J Popul Econ 23(4):1351–1370
Lehmijoki U, Palokangas T (2016) Land reforms and population growth. Port Econ J 15:1–15
Lehmijoki U, Palokangas T (2023) Optimal population policy with health care and lethal pollution. Port Econ J 22:31–47
Lehmijoki U, Rovenskaya E (2010) Environmental mortality and long-run growth. In: Crespo Cuaresma J, Palokangas T, Tarasyev A (eds) Dynamic systems, economic growth, and the environment. Springer, Heidelberg, pp 239–258
Uzawa H (1968) Time preference, the consumption function, and optimum asset holdings. In: Wolfe J (ed) Value, Capital, and Growth: Papers in Honor of Sir John Hicks. Aldine, Chicago, pp 485–504
Xie D (1997) On time inconsistency: a technical issue in Stackelberg differential games. J Econ Theory 76(2):412–430

Transformative Direction of R&D into Neo Open Innovation

Chihiro Watanabe and Yuji Tou

Abstract The advancement of the digital economy has transformed the concept of growth. Advanced economies have confronted a dilemma between increasing inputs and declining output. Contrary to traditional expectations, excessive increases in inputs have resulted in declining productivity. A solution to this dilemma can only be expected if the vigor of Soft Innovation Resources (SIRs), which lead to neo open innovation in the digital economy, is harnessed. This paper attempts to demonstrate this hypothetical view by following the authors’ previous analyses and using a techno-economic approach. An empirical analysis of the development of 500 global information and communication technology firms revealed a dynamism that led to a bipolar vicious cycle between increasing input and improving productivity. Thus, it was postulated that neo open innovation would harness the vigor of SIRs and maintain a virtuous cycle between input and output increases. These findings lead to insightful proposals for a new R&D concept and subsequent neo open innovation in the digital economy. Thus, a broadly applicable practical approach to assessing R&D investment in neo open innovation can be provided.

Keywords Neo open innovation · Soft innovation resources · Productivity decline · Dilemma of R&D increase · R&D transformation

¹ Supra-functionality beyond economic value illustrates people’s shifting preferences, which encompass social (e.g., creating and participating in social communication), cultural (e.g., brand value, cool and cute), aspirational (e.g., striving for traditional beauty), tribal (e.g., cognitive sense, fellow feeling), and emotional (e.g., perceptual value, five senses) values McDonagh (2008).

C. Watanabe (B)
Faculty of Information Technology, University of Jyväskylä, P.O. Box 35, Jyväskylä FI-40014, Finland
e-mail: [email protected]
International Institute for Applied Systems Analysis, Laxenburg, Austria

Y. Tou
Department of Industrial Engineering and Management, Tokyo Institute of Technology, Tokyo, Japan

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053,




1 Introduction

Advances in Information and Communication Technology (ICT) have generated a digital economy, which can be largely attributed to the dramatic development of the Internet Tapscott (1997). This economy has transformed the concept of growth OECD (2016). The Internet promotes a free culture, the consumption of which brings benefits and happiness to people, but which cannot be captured by Gross Domestic Product (GDP) as a measure of revenue Lowrey (2011). The authors have defined this Internet-driven added value as uncaptured GDP Watanabe et al. (2015b). The shift in people’s preferences from economic functionality to supra-functionality that goes beyond economic value (including social, cultural and emotional values)¹ will further develop the Internet, increasing its dependence on uncaptured GDP McDonagh (2008); Watanabe et al. (2015b). Thus, the technological shift from the computer age to the Internet and the abundance of freely available digital products and services in the digital economy Brynjolfsson and McAfee (2014) have resulted in a new co-evolution among people’s shifting preferences, the advancement of the Internet, and the increasing dependence on uncaptured GDP Watanabe et al. (2015b).

Based on these findings, the authors have demonstrated the spinoff dynamism from traditional computer-initiated ICT innovations in the era of the Product of Things (PoT) to the Internet-initiated ICT innovations in the Internet of Things (IoT) society McKinsey & Co. (2015), as illustrated in Fig. 1 Watanabe et al. (2015a, 2016b).

Fig. 1 Spinoff dynamism from the traditional to the Internet-initiated IoT society (source Watanabe et al. (2015b))

In the Internet-initiated ICT innovation, the bipolarization between ICT-advanced economies and ICT-developing economies has become a crucial issue. This is due to the two-faced nature of ICT, that is, although advances in ICT tend to drive up technology prices by generating new activities, the dramatic development of the Internet is leading to lower technology prices, including freebies, easy replication and mass standardization Cowen (2011); Watanabe et al. (2015a). As falling prices mean lower marginal productivity of technology, in a competitive environment, ICT-advanced economies seeking maximum profit confront a dilemma between rising ICT and declining productivity. Thus, ICT-advanced economies face a contradiction between increasing input and declining output Tou et al. (2019b).

Understanding the logistic nature of ICT Schelling (1998); Watanabe et al. (2004), the authors previously analyzed the logistic growth trajectory of 500 global ICT firms and identified the bipolarization described above. They then analyzed counteractions initiated by highly R&D-intensive firms confronting the dilemma of declining productivity as R&D increases. They postulated that this problem can be expected to be solved by harnessing the vigor of Soft Innovation Resources (SIRs), i.e., external innovation resources that respond to people’s shifting preferences, leading to neo open innovation in the digital economy Tou et al. (2019b). Although this postulate provides an insightful suggestion for an alternative development trajectory to avoid the above-mentioned dilemma, the details of the disruptive business model and subsequent strategy for changing R&D systems have remained conceptual.

To date, following Chesbrough’s pioneering work on open innovation Chesbrough (2003), many studies have analyzed the significance of open innovation in the digital economy (e.g., Chesbrough et al. (2008); West et al. (2014)). However, no one has satisfactorily answered the question of how to move from a conceptual stage to an operational stage.
Given the growing importance of a detailed techno-financing system that, by optimizing indigenous R&D and dependence on external resources, leads to strong business growth, this paper attempts to further develop the above postulate by following the authors’ previous studies and other techno-economic analyses. Using a techno-economic approach based on these previous analyses, the R&D-driven development trajectory of the top 500 global ICT firms was analyzed, focusing on the bipolarization of productivity between highly R&D-intensive leading firms and low-R&D-intensive firms. Particular attention was paid to how R&D leaders overcome the dilemma between R&D increases and declining productivity by harnessing the vigor of external innovation resources such as SIRs. These findings give rise to a proposal for a new R&D concept that will lead to a neo open innovation in the digital economy. Thus, a widely applicable practical approach to assessing R&D investment in neo open innovation can be provided.

The structure of this paper is as follows. Section 2 reviews the development of the digital economy and investigates the dilemma between R&D expansion and declining productivity. Section 3 postulates neo open innovation. Section 4 then summarizes noteworthy findings, policy suggestions and future research recommendations.



2 Development of the Digital Economy

R&D investment has played a decisive role in the competitiveness of the digital economy OECD (2016). The advancement of the IoT has accelerated this trend McKinsey & Co. (2015); Kahre et al. (2017). Consequently, global ICT firms have sought R&D-driven growth with a focus on increasing sales Watanabe et al. (2014); Naveed et al. (2018).

2.1 R&D-Driven Growth

The correlation between R&D and sales in the top 500 global ICT firms in 2016 is shown in Fig. 2. From the figure, we find that global ICT firms have sought to increase their digital values that focus on sales (S). They are divided into three clusters:

D1: 25 highly R&D-intensive firms, including such gigantic firms as GAFA (Google, Apple, Facebook and Amazon), Samsung, Intel, Microsoft, and Huawei;
D2: 140 R&D-intensive firms; and
D3: 335 relatively low-R&D-intensive firms.

Fig. 2 Correlational development between R&D and sales in 500 global ICT firms in 2016 (source European Union (2016))

Network externalities affect the correlation between innovations and institutional systems, creating new types of innovation. This will lead to exponential growth in ICT. Schelling (1998) portrays a set of logistically evolving and disseminating social mechanisms that are stimulated by these interactions. Advances in the Internet continue to stimulate these interactions, accelerating the growth of ICT, which is typically traced by a sigmoidal curve Watanabe et al. (2004). The R&D-driven development of 500 global ICT firms is a typical example (see Fig. 2). This development shows logistic growth (see Table 1 in the Appendix).

On this logistic growth trajectory, R&D-driven productivity (estimated as the marginal productivity of R&D in an IoT society) continues to grow before reaching an inflection point equal to half the carrying capacity (upper limit) and decreases after the inflection point is exceeded. Thus, ICT-driven logistic growth incorporates an inherent bipolarization at the inflection point Watanabe (2009). In the case of 500 global ICT firms in 2016, an inflection point was observed at the R&D level of €2.1 billion, indicating a bipolarization between the 25 highly R&D-intensive firms (D1) and the remaining 475 firms (D2 and D3), as shown in Fig. 3 (see Sect. A2 in the Appendix for details).

Fig. 3 Comparison of marginal productivity of ICT in 500 global ICT firms in 2016 (source European Union (2016))

This bipolarization is due to the duality of ICT (see Fig. 4). While advances in ICT tend to drive up prices with new features, dramatic developments in the Internet are driving down prices as ICTs focus on freebies, easy replication, and mass standardization on the Internet Watanabe et al. (2015a). Consequently, such bipolarization has become inevitable in the competition of global ICT firms. Highly R&D-intensive firms are approaching zero marginal cost Rifkin (2015). Provided that these firms seek to maximize profits in a competitive market, prices will fall as a consequence of the over-development of ICT, leading to lower marginal productivity. Figure 3 also illustrates such phenomena for 500 global ICT firms in 2016. This figure shows a clear bipolarization between Highly R&D-Intensive Firms (HRIFs) and other Low-R&D-Intensive Firms (LRIFs), D1 and D2 + D3, respectively, in Fig. 2.
HRIFs have fallen into a vicious cycle between R&D expenditure on ICT and its marginal productivity, as growth in the former leads to a decline in the latter. On the contrary, LRIFs have maintained a virtuous cycle as an increase in R&D leads to marginal productivity growth. The productivity of the digital economy has also declined in ICT-advanced countries OECD (2016); Rothwell (2016); World Bank (2016); Gaspar (2017); Watanabe et al. (2018c, a). Recent fears about stagnating trends in ICT giants Economist (2018) may be the reason for this.

Fig. 4 Duality of ICT (more detailed version of the graph in Watanabe et al. (2015b))
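The statement that marginal productivity peaks at the inflection point, where the trajectory reaches half of its carrying capacity, can be illustrated with a simple logistic function. The sketch below is not the authors' estimation (Table 1 in the Appendix reports that); the functional form $S(R) = N/(1 + b e^{-aR})$ and the values of $N$, $a$, and $b$ are assumptions, with $b$ chosen only so that the inflection falls at the R&D level of 2.1 reported in the text.

```python
import math

def logistic_sales(r, capacity, a, b):
    # Assumed logistic trajectory S(R) = N / (1 + b * exp(-a R)).
    return capacity / (1.0 + b * math.exp(-a * r))

def marginal_productivity(r, capacity, a, b):
    # dS/dR of the logistic curve: a * S * (1 - S / N).
    s = logistic_sales(r, capacity, a, b)
    return a * s * (1.0 - s / capacity)

# Hypothetical parameters (illustrative only).
CAPACITY = 100.0            # carrying capacity N
A = 1.5                     # diffusion velocity a
R_INFLECTION = 2.1          # R&D level at the inflection point, as in the text
B = math.exp(A * R_INFLECTION)

# Scan R from 0 to 6 and locate the peak of marginal productivity.
grid = [i * 0.01 for i in range(601)]
mp_values = [marginal_productivity(r, CAPACITY, A, B) for r in grid]
r_peak = grid[mp_values.index(max(mp_values))]
s_at_peak = logistic_sales(r_peak, CAPACITY, A, B)

print(f"MP peaks at R = {r_peak:.2f}, where S = {s_at_peak:.1f} (N / 2 = {CAPACITY / 2:.1f})")
```

Below the peak an increase in R raises marginal productivity (the LRIF side), and above it further R&D lowers marginal productivity (the HRIF side), which is the bipolarization described above.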

2.2 Dilemma Between R&D Expansion and Declining Productivity

Despite the decisive role of R&D in the digital economy, the dilemma between its expansion and declining productivity has become a worldwide concern that most digital economies now confront Watanabe et al. (2018b); Tou et al. (2018b, 2019a). Sales represent R&D-driven digital value creation behavior in 500 global ICT firms and are defined as follows (see mathematical details in the Appendix):

\[ S = F(R) = \frac{\partial S}{\partial R} \times R, \tag{1} \]

where $R$ is R&D investment and $\partial S/\partial R$ is the marginal productivity of R&D with respect to sales.² Taking logarithms,

\[ \ln S = \ln MP + \ln R, \tag{2} \]

where $MP$ denotes marginal productivity.

² This corresponds to the marginal productivity of the ICT products of the global ICT firms in the IoT society (see Section A1 in the Appendix).

Figure 5 compares the factors affecting sales between low-R&D-intensive firms (D3), R&D-intensive firms (D2) and highly R&D-intensive firms (D1). Global ICT firms seek to achieve sales growth by leveraging the contribution of ICT to sales growth, which consists of growth in ICT marginal productivity $MP$ and R&D investment $R$. The product of these factors ($MP \times R$) represents ICT’s contribution to sales growth. Figure 5 illustrates this strategy through the stages of the global position according to the level of R&D investment ($R$). The strategy can be followed by constructing a virtuous cycle between $MP$ and $R$ (an increase in $R$ leads to an increase in $MP$, which in turn induces further increases in $R$). While this positive cycle can be maintained as long as global ICT firms remain LRIFs (stages D3 and D2 in Fig. 5), when they become HRIFs (stage D1) they fall into a vicious cycle (an increase in $R$ leads to a decrease in $MP$). In this pitfall, $MP$ recovery can be attained only by reducing $R$ (moving back to D3), leading to a decrease in the sales growth target.

Fig. 5 Dilemma between R&D expansion and declining productivity
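The decomposition in (2) implies that log growth in sales is the sum of log growth in marginal productivity and log growth in R&D. The following sketch works through that arithmetic for two firms; all numbers are hypothetical and chosen only to contrast a virtuous cycle (both terms positive) with a vicious one (R rises while MP falls).

```python
import math

def sales(mp, r):
    """Equation (1): S = MP * R."""
    return mp * r

def log_growth_decomposition(mp_old, r_old, mp_new, r_new):
    """Return (d ln S, d ln MP, d ln R), which satisfy d ln S = d ln MP + d ln R."""
    d_ln_mp = math.log(mp_new / mp_old)
    d_ln_r = math.log(r_new / r_old)
    d_ln_s = math.log(sales(mp_new, r_new) / sales(mp_old, r_old))
    return d_ln_s, d_ln_mp, d_ln_r

# Hypothetical LRIF: more R&D raises marginal productivity (virtuous cycle).
lrif = log_growth_decomposition(mp_old=2.0, r_old=1.0, mp_new=2.5, r_new=1.5)

# Hypothetical HRIF: more R&D lowers marginal productivity (vicious cycle).
hrif = log_growth_decomposition(mp_old=4.0, r_old=2.0, mp_new=3.0, r_new=3.0)

for name, (d_s, d_mp, d_r) in (("LRIF", lrif), ("HRIF", hrif)):
    print(f"{name}: d ln S = {d_s:+.3f} = d ln MP {d_mp:+.3f} + d ln R {d_r:+.3f}")
```

In the hypothetical HRIF case the same 50% increase in R buys much less sales growth because the negative MP term offsets most of it.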

3 Neo Open Innovation

In order to attain the sales growth target essential for the survival of global ICT firms, HRIFs should find a solution to the dilemma between R&D expansion and declining productivity. Thus, global ICT leaders have sought to transform their business models from traditional to new to find an effective solution Watanabe et al. (2018c). As this dilemma is due to the unique feature of the logistic growth of ICT, this feature should be transformed.



3.1 Self-Propagating Function

As far as development depends on the foregoing logistic growth trajectory, its digital value is saturated with a fixed carrying capacity (upper limit of growth), which inevitably results in the above dilemma. However, a particular innovation, the dynamic carrying capacity of which is able to create new carrying capacity during the diffusion process (self-propagating function), enables the continuous growth of digital value Watanabe et al. (2004). This results in a positive cycle between them (see mathematical details in Sect. A3 of the Appendix). Therefore, the key to the sustainable growth of the ICT-driven development trajectory is to identify how such a virtuous cycle can be constructed. As ICT includes an indigenous self-propagating function that takes advantage of the effects of network externalities Watanabe and Hobo (2004), sustainable growth is achieved by activating this function.

Efforts to activate this latent self-propagating function stem from the repulsive power of falling prices (marginal productivity) as a consequence of the bipolarization caused by excessive R&D investment. This power forces ICT leaders to embrace resources for innovation in the following ways:

1. incorporating the vigor of LRIFs, which have a virtuous cycle between R&D investment and marginal productivity, and/or
2. harnessing the vigor of external innovation resources that do not reduce their own marginal productivity.

The former option can take advantage of the vigor of LRIFs that also enjoy the benefits of digital innovation. The authors have postulated that co-evolutionary adaptation strategies are relevant to this option, which may provide an opportunity for further advancement of ICT in HRIFs³ (see, e.g., Watanabe et al. (2015b)). However, HRIFs face increasing capital intensity, so this option is no longer realistic as it will increase the burden of such investments at a later stage Economist (2018).
Consequently, the latter option, which can be expected to correspond to the digital economy with the new disruptive business model, proves to be more promising.
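The contrast between a fixed and a dynamic carrying capacity can be sketched numerically. The appendix formulation (Sect. A3) is not reproduced here; the coupled logistic equations below, $dV/dt = aV(1 - V/N(t))$ with $N(t)$ itself following a logistic path, are one common way to formalize a carrying capacity that grows during diffusion, and every function name and parameter value is a hypothetical choice for illustration.

```python
import math

def fixed_capacity(t):
    # Conventional logistic growth: the upper limit never moves.
    return 100.0

def dynamic_capacity(t, ultimate=300.0, a_k=0.1, b_k=2.0):
    # Assumed self-propagating capacity N(t) = N_k / (1 + b_k * exp(-a_k t)),
    # which starts at 100 and approaches the ultimate capacity N_k = 300.
    return ultimate / (1.0 + b_k * math.exp(-a_k * t))

def simulate(capacity, a=0.5, v0=1.0, t_end=60.0, dt=0.01):
    """Explicit Euler integration of dV/dt = a * V * (1 - V / N(t))."""
    v = v0
    steps = int(round(t_end / dt))
    for i in range(steps):
        t = i * dt
        v += dt * a * v * (1.0 - v / capacity(t))
    return v

v_fixed = simulate(fixed_capacity)
v_dynamic = simulate(dynamic_capacity)

print(f"fixed capacity:   V(60) = {v_fixed:.1f}")
print(f"dynamic capacity: V(60) = {v_dynamic:.1f}")
```

Under the fixed ceiling the value saturates near 100, whereas under the assumed dynamic capacity it keeps growing as new capacity is created, which is the qualitative point of the self-propagating function.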

3.2 Spinoff to Uncaptured GDP

The above analyses demonstrate the significance of a new disruptive business model utilizing external innovation resources that arouses and activates the latent self-propagating function indigenous to ICT. The activated self-propagating function induces the development of functionality, leading to supra-functionality beyond economic value that encompasses social, cultural, emotional and aspirational value McDonagh (2008); Watanabe et al. (2015b) corresponding to a shift in people’s preferences Watanabe (2013); Japan Cabinet Office (2019). This shift, in turn, will induce the further advancement of the Internet Watanabe et al. (2015a) and leverage the co-evolution of innovation in the digital economy Tou et al. (2019b).

³ In HRIFs, the further advancement of ICT is leading to a decline in marginal productivity due to the vicious cycle between them. Therefore, the focus should be on advancing LRIFs, as their ICT development capability allows them to enjoy a virtuous cycle between investment and marginal productivity growth, leading to sustainable growth. HRIFs can harness the fruits of LRIFs’ growth, so collaboration between these two business clusters can be expected.

Because of such notable dynamism, and also postulating that the Internet promotes a free culture by providing people with utility and happiness that cannot be captured through GDP Lowrey (2011), the authors stressed in their previous studies the significance of increasing dependence on uncaptured GDP in the digital economy (e.g., Watanabe et al. (2015b, a)).

The shift in people’s preferences is driving the advancement of the Internet Watanabe et al. (2015a), increasing its dependence on uncaptured GDP. Thus, the co-evolution is increasingly shifting from traditional ICTs, captured GDP and economic functionality to the advancement of the Internet, increasing dependence on uncaptured GDP, and the shifting of people’s preferences beyond economic value, as illustrated in the upper double circle in Fig. 6 Watanabe et al. (2015b, a, 2016b, a).

Fig. 6 Dynamics of the transition to uncaptured GDP and the consequent R&D transformation (more detailed version of the graph in Watanabe et al. (2018c) and Tou et al. (2019a))

In such changing circumstances, HRIFs have confronted a dilemma between R&D expansion and declining productivity, as discussed earlier. While they satisfy people’s preferences, which are shifting beyond economic value to supra-functionality, their economic productivity has declined as R&D increases essentially to satisfy economic functionality.



3.3 Soft Innovation Resources

A solution to the above dilemma can be expected from a hybrid system of Soft Innovation Resources (SIRs) Tou et al. (2018b, 2019a). SIRs are seen as condensates and crystallizations in the advancement of the Internet Tou et al. (2018a, b). They consist of Internet-based resources that are either asleep, untapped, or the result of multisided interactions in a market where consumers are looking for functionality beyond economic value Tou et al. (2018a). A common feature of SIRs is that, although they lead to supra-functionality, they are not accounted for in traditional GDP Watanabe et al. (2018a). The authors have shown a structural change in the concept of production in the digital economy and revealed a limitation of GDP in measuring the output of the digital economy, showing increasing dependence on uncaptured GDP Watanabe et al. (2015a); Ahmad and Schreyer (2016); Byrne and Corrado (2016); Dervis and Qureshi (2016); Feldstein (2017).

The use of SIRs is a novel innovation applied in HRIFs to ensure their sustainable growth while avoiding a decline in productivity. In their previous studies, the authors described this hypothetical view Watanabe et al. (2016a, 2017c), stating that while such a transformative circumstance in the digital economy results in declining productivity, global ICT firms seek to survive by spontaneously creating uncaptured GDP by harnessing the vigor of SIRs Watanabe et al. (2018a).

Another finding from the authors’ previous studies provides important background to the analysis described in this paper. SIRs have been shown to remove structural barriers to GDP growth (and consequent growth in economic functionality), such as conflict between the public, employers and labor unions, gender inequalities and increasing disparities in an aging society.
Thus, the spontaneous creation of uncaptured GDP through the effective utilization of SIRs contributes to the resurgence of captured GDP growth through its hybrid function, which also activates the self-propagating function, as shown at the bottom of Fig. 6 Watanabe et al. (2018a); Tou et al. (2018b). Figure 6 demonstrates that SIRs activate the latent self-propagating function indigenous to ICT. Activation induces the development of functionality that leads to supra-functionality beyond economic value, corresponding to shifts in people’s preferences. Furthermore, SIRs contribute to the growth of captured GDP by removing structural barriers, which in turn activates the self-propagating function. Since supra-functionality accelerates co-evolutionary innovations in terms of functionality, the advancement of the Internet, and dependence on uncaptured GDP, SIRs play a hybrid role in promoting a balance between captured and uncaptured GDP as well as economic functionality and supra-functionality. Faced with the dilemma between expanding R&D and declining marginal productivity in ICT, HRIFs have sought to switch to a new business model that creates supra-functionality by harnessing the vigor of SIRs. This arouses and activates a latent self-propagating function. Figure 7 shows the typical SIRs deployed in 2016 by the seven global ICT companies investing the most in R&D: Amazon, Samsung, Intel, Google, Microsoft, Huawei, and Apple Naveed et al. (2018).



Fig. 7 Soft innovation resources in the world’s leading ICT companies (modified version of the graph in Naveed et al. (2018))

3.4 Concept of Neo Open Innovation

The neo open innovation that harnesses the vigor of SIRs corresponding to the digital economy has thus become a promising solution to the critical dilemma. The concept of neo open innovation in the digital economy is illustrated in Fig. 8. Like traditional open innovation Chesbrough (2003); Chesbrough et al. (2008); West et al. (2014), SIRs (which are identical to the digital economy and operate in the same way as R&D investment) sustain growth without being dependent on increased R&D investment [4], which undermines marginal productivity. Increased gross R&D, which consists of indigenous R&D and similar SIRs, will increase sales without a decline in marginal productivity. This increase enhances dynamic carrying capacity, enabling sustainable sales growth and leading to a virtuous cycle between them. Thus, the assimilation of SIRs into one’s own business can be considered a significant driver of growth but not additional expenditures resulting in a decline in productivity, as shown in Fig. 8. The magnitude of SIRs is proportional to the interaction with users according to Metcalfe’s law [5]. The assimilation capacity depends on the level of the R&D reserve and its growth rate Watanabe (2009). Since Amazon, a global leader in R&D investment, interacts significantly with users through user-driven innovation based on an architecture of participation [6] and has a high assimilation capacity based on rapidly increasing R&D investment Tou et al. (2019a), these systems can enable

[4] SIRs substitute for additional R&D investment, which can weaken marginal productivity and induce factors of production, contributing to output growth Tou et al. (2018b).
[5] The effect of the telecommunication network is proportional to the square of the number of users connected to the system.
[6] The “architecture of participation” introduced by O’Reilly (2003) means that users are helping to extend the platform.


C. Watanabe and Y. Tou

Fig. 8 Scheme of neo open innovation

Fig. 9 Trends in GAFA’s R&D investment and sales in 1999–2019 [USD billion] (source SEC (2020a, b, c, d))

noteworthy R&D and sales growth by overcoming the dilemma between them, as shown in Fig. 9. Amazon has been pursuing innovation and company-wide experimentation, thereby developing its growing empire and, consequently, its large data collection system. Thus, the power of users seeking SIRs reinforces the virtuous cycle, leading to the transformation of “routine or periodic alterations” into a “significant improvement” during the R&D process Tou et al. (2019a). This change has been made possible by the combination of a unique R&D system and a sophisticated financing system focusing on free cash flow management based on the cash conversion cycle Tou et al. (2020). With this orchestration, Amazon is leveraging the expectations of many stakeholders by providing supra-functionality beyond economic value and taking the initiative in stakeholder capitalism, where stakeholders believe in Amazon’s promising future because of its aggressive R&D Watanabe et al. (2020); Watanabe et al. (2021).

4 Conclusion

In light of the increasing significance of solving the dilemma between increasing inputs and declining output in the digital economy, this paper attempted to identify a trend in R&D that would allow sustainable growth without this dilemma. Following previous analyses and using a techno-economic approach, an empirical analysis of the development trajectory of 500 global ICT firms was conducted, focusing on the bipolarization of productivity between highly R&D-intensive and low-R&D-intensive firms, with a particular focus on R&D leaders. Notable discoveries include:

1. The development trajectory of 500 global ICT firms has bipolarized between highly R&D-intensive firms and low-R&D-intensive firms.
2. While the latter have maintained a virtuous cycle between increased R&D and productivity growth, the former have fallen into a vicious cycle between them.
3. The hybrid role of SIRs as a system can be expected to solve the dilemma of highly R&D-intensive firms.
4. SIRs are considered condensates and crystallizations of the advancement of the Internet and consist of Internet-based resources.
5. A neo open innovation that harnesses the vigor of SIRs has proven to be a promising solution to the critical dilemma.
6. Amazon has been pursuing innovation and company-wide experimentation, thus developing its growing empire and, as a result, its big data collection system that will enable it to harness the power of users searching for SIRs.

These findings provide the following insightful suggestions for leveraging the new R&D concept and subsequent neo open innovation in the digital economy:

1. The co-evolutional development between user-driven innovation and the disruptive advancement of SIRs should be applied to a disruptive business model that seeks to overcome the dilemma.
2. The dynamism that enables this co-evolution should be explored and conceptualized.
3. The function incorporated in Amazon’s institutional systems, which tempts large stakeholders to invest in it, should be further explored, conceptualized, and then applied to broad stakeholder capital.
4. The dynamism of the advancement of SIRs should be applied to foster neo open innovation by guiding the shifts to a digital, sharing and circular economy.



This will provide a broadly applicable practical approach to assessing R&D investment in neo open innovation and an insightful suggestion that highlights the significance of investor surplus for stakeholder capitalism. Future work should focus on further exploring, conceptualizing and operationalizing the functions that the above orchestration may lead to by transforming the concept of growth, particularly for the post-COVID-19 society.

Appendix: Transformation of ICT-driven Growth

A1. Logistic Growth Trajectory of Global ICT Firms

ICT outsourcing is changing the correlation between innovation and institutional systems, creating new features for innovation that lead to exponential growth. Schelling (1998) portrayed a set of logistically developing and diffusing social mechanisms that are stimulated by these interactions. Advances in the Internet continue to stimulate these interactions and accelerate the logistically developing and diffusing feature of ICT, typically traced by a sigmoid curve Watanabe et al. (2004). The digital value created by R&D-driven development in an IoT society can be depicted as follows:

\[ V \approx F(R), \]

where $R$ is R&D investment. Given the logistic growth of ICT, its R&D-driven growth $dV/dR$, which can be approximated by the digital value of marginal R&D productivity $\partial V/\partial R$ in the IoT society, can be described by the following epidemic function:

\[ \frac{dV}{dR} = aV\left(1 - \frac{V}{N}\right) \approx \frac{\partial V}{\partial R} \times \frac{dR}{dR} = \frac{\partial V}{\partial R}. \tag{3} \]

This epidemic function leads to an R&D-driven Simple Logistic Growth (SLG) function that depicts the R&D-driven digital value as a function of R&D investment as follows:

\[ V_S(R) = \frac{N}{1 + b e^{-aR}}, \]

where $N$ is the carrying capacity, $a$ is the rate of diffusion, and $b$ is the coefficient that indicates the initial level of diffusion. The development of global ICT firms follows this sigmoidal trajectory, which will continue to grow until it reaches its carrying capacity (upper limit of growth). Since sales represent the R&D-driven digital value creation behavior of 500 global ICT firms, their R&D-driven sales trajectory can be analyzed using the SLG function, which proves to be statistically significant (see Table 1).

Table 1 R&D-driven sales growth trajectory of the 500 global ICT firms (2016)

N              a              b              c              adj. R^2   D (gigantic firms treated by dummy variable)
59.62 (17.39)  1.32 (10.98)   15.91 (21.87)  99.09 (29.74)             Apple, Samsung, Hon Hai

All t-statistics are significant at the 1% level (figures in brackets)

A2. Bi-Polar Fatality of ICT-Driven Development

On this SLG trajectory, the growth rate (marginal productivity of R&D to sales) continues to rise before reaching an inflection point equal to half the carrying capacity, but falls after exceeding the inflection point as follows:

\[ \frac{\partial V}{\partial R} = aV\left(1 - \frac{V}{N}\right) = aN \times \frac{1}{1 + 1/x}\left(1 - \frac{1}{1 + 1/x}\right) = \frac{aN \times x}{(1 + x)^2}, \qquad b e^{-aR} \equiv \frac{1}{x}, \]

\[ \frac{d\,\frac{\partial V}{\partial R}}{dx} = \frac{d\,\frac{\partial V}{\partial R}}{dR} \times \frac{dR}{dx} = \frac{d\,\frac{\partial V}{\partial R}}{dR} \times \frac{1}{ax} = aN \times \frac{1 - x}{(1 + x)^3}, \qquad \frac{1}{ax} = \frac{b}{a}\, e^{-aR} > 0. \]

Digitization above a certain level of R&D ($R > \ln b / a$) leads to a decline in productivity:

\[ \frac{d\,\frac{\partial V}{\partial R}}{dR} = 0 \;\Leftrightarrow\; x = 1 \;\Leftrightarrow\; R = \frac{\ln b}{a} \;\rightarrow\; R > \frac{\ln b}{a} \;\Rightarrow\; \frac{d\,\frac{\partial V}{\partial R}}{dx} < 0. \]

Thus, ICT-driven logistic growth incorporates bipolarization fatality, where marginal productivity increases before and decreases after the inflection point Watanabe (2009). In the case of 500 global ICT firms in 2016, the R&D level of €2.1 billion identifies this inflection point, demonstrating a bipolarization between the 25 highly R&D-intensive firms ($D_1$) and the remaining 475 firms ($D_2$ and $D_3$), as illustrated in Fig. 3.
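As a numerical illustration of the inflection point, the following minimal sketch evaluates R = ln b / a with the SLG estimates reported in Table 1 (N = 59.62, a = 1.32, b = 15.91). The function names are illustrative and not taken from the source, and R is assumed to be measured in the same billions as the R&D level quoted in the text.

```python
import math

# SLG estimates reported in Table 1 (500 global ICT firms, 2016).
N, a, b = 59.62, 1.32, 15.91

def slg(R):
    """Simple logistic growth: V_S(R) = N / (1 + b * exp(-a * R))."""
    return N / (1.0 + b * math.exp(-a * R))

def marginal_productivity(R):
    """Marginal R&D productivity: dV/dR = a * V * (1 - V / N)."""
    V = slg(R)
    return a * V * (1.0 - V / N)

# Inflection point: x = 1, i.e. b * exp(-a * R) = 1, so R = ln(b) / a.
R_star = math.log(b) / a

print(f"inflection point R* = {R_star:.2f}")
print(f"V_S(R*) / N = {slg(R_star) / N:.2f}")
for R in (R_star - 1.0, R_star, R_star + 1.0):
    print(f"R = {R:.2f}: dV/dR = {marginal_productivity(R):.2f}")
```

With these estimates R* comes out at about 2.1, in line with the level quoted above, and the marginal productivity peaks there at aN/4 before declining.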

A3. Transformation into a Virtuous Cycle in a Neo Open Innovation

As far as the course of development depends on the SLG trajectory, its digital value $V_S(R)$ is saturated with a fixed carrying capacity (upper limit of growth) $N$, which inevitably results in a dilemma between R&D expansion and declining productivity. However, in a particular innovation that includes the dynamic carrying capacity $N_L(R)$, which creates new carrying capacity during the diffusion process (3), the epidemic function depicting the digital value $V(R)$ is developed as follows:

\[ \frac{dV(R)}{dR} = aV(R)\left(1 - \frac{V(R)}{N_L(R)}\right). \tag{4} \]

Equation (4) gives the following function of logistic growth within a dynamic carrying capacity (LGDCC):

\[ V_L(R) = \frac{N_k}{1 + b e^{-aR} + \dfrac{b_k}{1 - a_k/a}\, e^{-a_k R}}, \]

where $N_k$ is the ultimate carrying capacity and $a$, $b$, $a_k$, and $b_k$ are coefficients. This function includes a self-propagating function that allows its digital value $V_L(R)$ to increase further as it creates new carrying capacity successively during the development process Watanabe et al. (2004). The development trajectories of 500 global ICT firms in 2016 were analyzed by comparing SLG and LGDCC, see Table 2. It turned out that LGDCC is statistically more significant than SLG. This is because HRIFs are working to overcome the dilemma by shifting from SLG to LGDCC. In this way, they seek to arouse and activate the latent self-propagating function indigenous to ICT.

Table 2 Development trajectory of the 500 global ICT firms (2016)

          N                a              b              a_k           b_k
V_S(R)    59.62 (17.39)    1.32 (10.98)   15.91 (21.87)
V_L(R)    102.23 (178.83)  0.77 (26.13)   15.84 (9.72)   0.43 (7.06)   1.32 (2.53)

adj. R^2  0.784

V_S(R) and V_L(R) are digital values of SLG and LGDCC, respectively
Dummy variables are used in the V_S(R) estimate (see Table 1)
All t-statistics are significant at the 1% level (figures in brackets)
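The relation between Eq. (4) and the LGDCC closed form can be checked numerically. The sketch below is only an illustration under two assumptions that the text does not state explicitly: the dynamic carrying capacity is taken to be logistic, N_L(R) = N_k / (1 + b_k e^{-a_k R}), and the "N" entry of the V_L(R) row in Table 2 is read as N_k. Function names are hypothetical.

```python
import math

# V_L(R) estimates from Table 2; reading "N" as N_k is an assumption.
N_k, a, b, a_k, b_k = 102.23, 0.77, 15.84, 0.43, 1.32

def lgdcc(R):
    """V_L(R) = N_k / (1 + b e^{-aR} + b_k / (1 - a_k/a) * e^{-a_k R})."""
    return N_k / (1.0 + b * math.exp(-a * R)
                  + b_k / (1.0 - a_k / a) * math.exp(-a_k * R))

def dynamic_capacity(R):
    """Assumed logistic carrying capacity N_L(R) = N_k / (1 + b_k e^{-a_k R})."""
    return N_k / (1.0 + b_k * math.exp(-a_k * R))

def ode_residual(R, h=1e-6):
    """Central-difference dV/dR minus the right-hand side of Eq. (4)."""
    derivative = (lgdcc(R + h) - lgdcc(R - h)) / (2.0 * h)
    V = lgdcc(R)
    return derivative - a * V * (1.0 - V / dynamic_capacity(R))

for R in (0.5, 2.0, 5.0, 10.0):
    print(f"R = {R:5.1f}  V_L = {lgdcc(R):8.3f}  residual = {ode_residual(R):.1e}")
```

Under these assumptions the residual stays at rounding-error level, and V_L(R) approaches N_k for large R instead of saturating at the fixed SLG capacity.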

References Ahmad N, Schreyer P (2016) Are GDP and productivity measures up to the challenges of the digital economy? Int Prod Monit 30:4–27 Brynjolfsson E, McAfee A (2014) The second machine age: work, progress, and prosperity in a time of brilliant technologies. W.W. Norton & Co., New York Byrne D, Corrado C (2016) ICT prices and ICT services: what do they tell about productivity and technology?. In: Economics Program Working Paper Series EPWP #16-05, The Conference Board. New York Chesbrough HW (2003) Open innovation: the new imperative for creating and profiting from technology. Harvard Business School Press, Boston Chesbrough HW, Vanhaverbeke W, West J (eds) (2008). Oxford University Press, Oxford Cowen T (2011) The great stagnation: how America ate all the low-hanging fruit of modern history, got sick, and will (eventually) feel better. Dutton, New York



Dervis K, Qureshi Z (2016) The productivity slump—fact or fiction: the measurement debate. Brief, Global Economy and Development at Brookings, Washington, DC Economist (2018) Schumpeter: big tech’s sell-off. The Economist, 3rd edn European Union (2016) The 2016 EU industrial R&D investment scoreboard. Technical report, European Union, Brussels Feldstein M (2017) Understanding the real growth of GDP, personal income, and productivity. J Econ Perspect 31(2):145–164 Gaspar V (2017) Measuring the digital economy. Speech in the 5th IMF Statistical Forum. Washington, DC Japan Cabinet Office (2019) National survey of lifestyle preferences. Japan Cabinet Office, Tokyo Kahre C, Hoffmann D, Ahlemann F (2017) Beyond business-IT alignment-digital business strategies as a paradigmatic shift: a review and research agenda. In: Proceedings of the 50th Hawaii International Conference on System Sciences. pp 4706–4715 Lowrey A (2011) Freaks geeks, and GDP: Why hasn’t the internet helped the American economy grow as much as economists thought it would? slate. moneybox/2011/03/freaks_geeks_and_gdp.html. Accessed 20 June 2017 McDonagh D (2008) Satisfying needs beyond the functional: the changing needs of the Silver Market consumer. In: Proceedings of the International Symposium on the Silver Market Phenomenon: Business Opportunities and Responsibilities in the Aging Society. Tokyo McKinsey & Co (2015) The internet of things: mapping the value beyond the hype. Technical report, McKinsey & Co., New York Naveed K, Watanabe C, Neittaanmäki P (2017) Co-evolution between streaming and live music leads a way to the sustainable growth of music industry: lessons from the US experiences. Technol Soc 50:1–19 Naveed K, Watanabe C, Neittaanmäki P (2018) The transformative direction of innovation toward an IoT-based society: increasing dependency on uncaptured GDP in global ICT firms. Technol Soc 53:23–46 OECD (2016) Digital economy. 
OECD Observer No 307 O’Reilly T (2003) The open source paradigm shift. O’Reilly & Associates, Inc., Sebastopol. https:// Accessed 10 Jan 2019 Rifkin J (2015) The zero marginal cost society: the internet of things, the collaborative commons, and the eclipse of capitalism. Macmillan, New York Rothwell J (2016) No recovery: an analysis of long-term U.S. productivity decline. Technical report, The US Council on Competitiveness (USCC), Washington, DC Schelling TC (1998) Social mechanisms and social dynamics. In: Hedström P, Swedberg R (eds) Social mechanisms: an analytical approach to social theory. Cambridge University Press, Cambridge, pp 32–44 SEC (2020a) Annual report pursuant to Section 13 or 15(d) of the Securities Exchange Act of 1934: For the fiscal year ended december 31, 2019. Registrant Alphabet Inc., U.S. Securities and Exchange Commission (SEC), Washington, DC SEC (2020b) Annual report pursuant to Section 13 or 15(d) of the Securities Exchange Act of 1934: For the fiscal year ended december 31, 2019. Registrant, Inc., U.S. Securities and Exchange Commission (SEC), Washington, DC SEC (2020c) Annual report pursuant to Section 13 or 15(d) of the Securities Exchange Act of 1934: For the fiscal year ended september 26, 2020. Registrant Apple Inc., U.S. Securities and Exchange Commission (SEC), Washington, DC SEC (2020d) Annual report pursuant to Section 13 or 15(d) of the Securities Exchange Act of (1934) For the fiscal year ended december 31 (2020) Registrant Facebook Inc. U.S., Securities and Exchange Commission (SEC), Washington, DC Tapscott D (1997) The digital economy: promise and peril in the age of networked intelligence. McGraw-Hill, New York Tou Y, Moriya K, Watanabe C, Ilmola L, Neittaanmäki P (2018) Soft innovation resources: enabler for reversal in GDP growth in the digital economy. Int J Manag Inf Technol 10(3):9–27



Tou Y, Watanabe C, Ilmola L, Moriya K, Neittaanmäki P (2018) Hybrid role of soft innovation resources: Finland’s notable resurgence in the digital economy. Int J Manag Inf Technol 10(4):1– 22 Tou Y, Watanabe C, Moriya K, Naveed N, Vurpillat V, Neittaanmäki P (2019) The transformation of R&D into neo open innovation: a new concept in R&D endeavor triggered by Amazon. Technol Soc 58:101141 Tou Y, Watanabe C, Moriya K, Neittaanmäki P (2019) Harnessing soft innovation resources leads to neo open innovation. Technol Soc 58:101114 Tou Y, Watanabe C, Neittaanmäki P (2020) Fusion of technology management and financing management: Amazon’s transformative endeavor by orchestrating techno-financing systems. Technol Soc 60:101219 Watanabe C (2009) Managing innovation in Japan: the role institutions play in helping or hindering how companies develop technology. Springer, Berlin Watanabe C (2013) Innovation-consumption co-emergence leads a resilience business. Innov Supply Chain Manag 7(3):92–104 Watanabe C, Hobo M (2004) Creating a firm self-propagating function for advanced innovationoriented projects: lessons from ERP. Technovation 24(6):467–481 Watanabe C, Kondo R, Ouchi N, Wei H, Griffy-Brown C (2004) Institutional elasticity as a significant driver of it functionality development. Technol Forecast Soc Chang 71(7):723–750 Watanabe C, Naveed K, Zhao W (2014) Institutional sources of resilience in global ICT leaders: harness the vigor of emerging power. J Technol Manag Grow Econ 5(1):7–34 Watanabe C, Naveed K, Neittaanmäki P (2015) Dependency on un-captured GDP as a source of resilience beyond economic value in countries with advanced ICT infrastructure: similarities and disparities between Finland and Singapore. Technol Soc 42:104–122 Watanabe C, Naveed K, Zhao W (2015) New paradigm of ICT productivity: increasing role of un-captured GDP and growing anger of consumers. 
Technol Soc 41:21–44 Watanabe C, Naveed K, Neittaanmäki P (2016) Co-evolution of three mega-trends nurtures uncaptured GDP: uber’s ride-sharing revolution. Technol Soc 46:164–185 Watanabe C, Naveed K, Neittaanmäki P, Tou Y (2016) Operationalization of un-captured GDP: Innovation stream under new global mega-trends. Technol Soc 45:58–77 Watanabe C, Naveed K, Neittaanmäki P (2017) Co-evolution between trust in teachers and higher education toward digitally-rich learning environments. Technol Soc 48:70–96 Watanabe C, Naveed K, Neittaanmäki P (2017) ICT-driven disruptive innovation nurtures uncaptured GDP: harnessing women’s potential as untapped resources. Technol Soc 51:81–101 Watanabe C, Naveed K, Neittaanmäki P, Fox B (2017) Consolidated challenge to social demand for resilient platforms: lessons from Uber’s global expansion. Technol Soc 48:33–53 Watanabe C, Naveed K, Tou Y, Neittaanmäki P (2018) Measuring GDP in the digital economy: increasing dependence on uncaptured GDP. Technol Forecast Soc Chang 137:226–240 Watanabe C, Naveed N, Neittaanmäki P (2018) Digital solutions transform the forest-based bioeconomy into a digital platform industry: a suggestion for a disruptive business model in the digital economy. Technol Soc 54:168–188 Watanabe C, Tou Y, Neittaanmäki P (2018) A new paradox of the digital economy: structural sources of the limitation of GDP statistics. Technol Soc 55:9–23 Watanabe C, Tou Y, Neittaanmäki P (2020) Institutional systems inducing R&D in Amazon: the role of an investor surplus toward stakeholder capitalization. Technol Soc 63:101290 Watanabe C, Tou Y, Neittaanmäki P (2021) Transforming the socio economy with digital innovation. Elsevier, Amsterdam West J, Salter A, Vanhaverbeke W, Chesbrough H (2014) Open innovation: the next decade. Res Policy 43(5):805–811 World Bank (2016) World development report 2016: Digital dividends. World Bank, Washington, DC

On Tetrahedral and Square Pyramidal Numbers

Lawrence Somer and Michal Křížek

Abstract We prove that the only nontrivial solution of the Diophantine equation $N^2 = 1^2 + 2^2 + \cdots + n^2$ is $N = 70$. Our technique is based on some properties of the Pell equation and the Fermat method of infinite descent. We also show that 70 is the smallest weird number.

Keywords Tetrahedral numbers · Square pyramidal numbers · Diophantine equation · Congruence

Mathematics Subject Classification: 11B50 · 11D09

1 Introduction

Number theory has many real-life technical applications in completely different areas (see our recent book Krizek et al. (2021)). For example, International Bank Account Numbers (IBAN) are protected against possible errors with the help of prime numbers and error-detecting codes. ISBN, ISSN, and ISMN codes are protected in a similar way. Prime numbers enable us to produce pseudorandom numbers, and they are used to design messages to extraterrestrial civilizations. Large prime numbers help us to transmit secret messages and generate digital signatures. In Krizek et al. (2021), we also discuss the so-called error-correcting codes, which automatically correct errors. Number theory is closely related to fractals and chaos Krizek et al. (2001). Together with computers it helped to discover the Lorenz attractor and to find the famous Pentium Bug. Number theory can be applied for the construction of

L. Somer
Department of Mathematics, Catholic University of America, Washington, D.C. 20064, USA
e-mail: [email protected]

M. Křížek (B)
Institute of Mathematics, Czech Academy of Sciences, Žitná 25, 115 67, Prague 1, Czech Republic
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053,



L. Somer and M. Kˇrížek

hashing functions, in scheduling sport tournaments, and has many other applications in astronomy, chemistry, cryptography, genetics, physics, etc. In this paper, we shall study an interesting property of the number 70 (see Krizek et al. (2021)[Sect. 8.1]). For $n \in \mathbb{N}$, we define triangular numbers by

\[ T_n = \frac{n(n+1)}{2} = 1 + 2 + \cdots + n, \tag{1} \]

and square numbers by $S_n = n^2$. Thus for $n = 1, 2, 3, 4, 5, 6, \dots$ we get

• triangular numbers: 1, 3, 6, 10, 15, 21, ...
• square numbers: 1, 4, 9, 16, 25, 36, ...

Note that every square number is the sum of consecutive odd numbers:

\[ n^2 = 1 + 3 + 5 + \cdots + (2n - 3) + (2n - 1), \]

which can be directly proved from the formula for the sum of a finite number of terms in an arithmetic progression. Further, we easily find that

\[ T_n = n + T_{n-1}, \qquad S_n = n + 2T_{n-1}. \]

From relation (1) for $n = 8$, we can easily verify that the square number $36 = 6^2$ is at the same time a triangular number. It is known that there are infinitely many positive integers that are simultaneously a square number and a triangular number (see Beiler (1964)[p. 197]). All such numbers $A_n$ are given by the formula $A_n = a_n^2$, where

\[ a_{n+2} = 6a_{n+1} - a_n \]

with initial terms $a_1 = 1$ and $a_2 = 6$. The first few of these numbers are

\[ 1 = 1^2 = T_1, \quad 36 = 6^2 = T_8, \quad 1225 = 35^2 = T_{49}, \quad 41\,616 = 204^2 = T_{288}. \]

Lemma 1 The sum of two consecutive triangular numbers is the square of a positive integer.

Proof For an integer $n \ge 1$ we have

\[ T_{n-1} + T_n = \frac{n(n-1)}{2} + \frac{n(n+1)}{2} = n^2 \]

and the lemma is proved.

Tetrahedral numbers (also called triangular pyramidal numbers) are defined as the sum of the first $n$ triangular numbers

\[ U_n = T_1 + T_2 + \cdots + T_n. \]

In 1878, Meyl (1878) proved that $U_n$ is a square if and only if $n \in \{1, 2, 48\}$, which gives rise to the squares

\[ 1 = T_1, \quad 4 = 2^2 = T_1 + T_2 = 1 + 3, \quad 19600 = 140^2 = T_1 + T_2 + \cdots + T_{48} = 1 + 3 + \cdots + 1176. \]

Using Lemma 1, we find that

\[ 140^2 = S_2 + S_4 + S_6 + \cdots + S_{48}. \]

Dividing this equality by 4, we get

\[ 70^2 = S_1 + S_2 + S_3 + \cdots + S_{24}. \tag{2} \]
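The numerical facts quoted so far are easy to reproduce. The following sketch (with illustrative helper names, an arbitrary search bound, and the standard closed form n(n+1)(n+2)/6 for the tetrahedral numbers, none of which come from the paper) regenerates the square triangular numbers from the recurrence, searches for square tetrahedral numbers, and checks the two sums of squares.

```python
from math import isqrt

def is_square(m):
    """True if the nonnegative integer m is a perfect square."""
    r = isqrt(m)
    return r * r == m

def tetrahedral(n):
    """U_n = T_1 + ... + T_n, using the closed form n(n+1)(n+2)/6."""
    return n * (n + 1) * (n + 2) // 6

# Square triangular numbers via a_{n+2} = 6 a_{n+1} - a_n, a_1 = 1, a_2 = 6.
a = [1, 6]
while len(a) < 6:
    a.append(6 * a[-1] - a[-2])
square_triangular = [x * x for x in a]

# A positive integer m is triangular exactly when 8m + 1 is a perfect square.
all_triangular = all(is_square(8 * m + 1) for m in square_triangular)

# Indices n up to an arbitrary bound for which U_n is a square.
square_tetrahedral = [n for n in range(1, 5001) if is_square(tetrahedral(n))]

# 140^2 as a sum of even-indexed squares, and 70^2 after dividing by 4.
even_sum = sum((2 * k) ** 2 for k in range(1, 25))
first_24 = sum(k ** 2 for k in range(1, 25))

print(square_triangular[:4], all_triangular)
print(square_tetrahedral)
print(even_sum == 140 ** 2, first_24 == 70 ** 2)
```

A finite search of this kind only confirms the statements within the chosen bound; the full claims rest on the cited proofs.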


Analogously, we can investigate square pyramidal numbers 1, 5, 14, 30, 55, 91, 140, ... which can be obtained as particular sums of square numbers

\[ P_n = S_1 + S_2 + \cdots + S_n = \frac{1}{6}\, n(n + 1)(2n + 1). \]

It is of interest that the only square pyramidal numbers that are squares occur for $n = 1$ and $n = 24$, which give the squares $1^2$ and $70^2 = 1^2 + 2^2 + 3^2 + \cdots + 24^2$, i.e., the generalized Parseval equality (see (2)). This is known as the cannonball problem, which was proposed by Lucas (1875) as follows:

Problem. A square pyramid of cannonballs contains a square number of cannonballs only when it has 24 cannonballs along its base.

A correct complete solution to Lucas’ question was only given by Watson (1918), whose proof made use of Jacobian elliptic functions. We give an accessible elementary solution to the cannonball problem based on the paper by Anglin (1990), which makes use of a result by Ma (1985), properties of the Pell equation $X^2 - 3Y^2 = 1$, and the Fermat method of infinite descent Krizek et al. (2021).
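Before turning to the proof, the cannonball statement can be explored by direct search. The sketch below is a brute-force check up to an arbitrary bound, with illustrative function names; it does not replace the argument that follows.

```python
from math import isqrt

def pyramidal(n):
    """Square pyramidal number P_n = n(n+1)(2n+1)/6."""
    return n * (n + 1) * (2 * n + 1) // 6

def square_pyramidal_indices(limit):
    """Return all pairs (n, N) with n <= limit and P_n = N^2."""
    hits = []
    for n in range(1, limit + 1):
        p = pyramidal(n)
        r = isqrt(p)
        if r * r == p:
            hits.append((n, r))
    return hits

print(square_pyramidal_indices(100_000))  # [(1, 1), (24, 70)]
```

Integer arithmetic with `isqrt` avoids the rounding problems that a floating-point square root would cause for large P_n.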


2 Main Theorem

The main theorem is formulated as follows:

Theorem 1 The only values of $n$ for which the square pyramidal number $P_n = \frac{1}{6} n(n + 1)(2n + 1)$ is a nonzero square are $n = 1$ and $n = 24$, i.e., $P_1 = 1^2$ and $P_{24} = 70^2$.

The proof of this theorem will utilize the auxiliary results presented in Lemmas 2 and 3.

Lemma 2 There are no positive integers $x$ such that $2x^4 + 1$ is a square.

Proof Assume to the contrary that there exist positive integers $x$ and $y$ such that $2x^4 + 1 = y^2$. Then $y = 2s + 1$ for some $s \in \mathbb{N}$, and $x^4 = 2s(s + 1)$. Suppose that $s$ is odd. Then $s$ and $2(s + 1)$ are coprime, and for some positive integers $t$ and $u$, $s = t^4$ and $2(s + 1) = u^4$. This yields that $2(t^4 + 1) = u^4$ with $t$ odd and $u$ even. Since $t^4 \equiv 1 \pmod 8$ and $u^4 \equiv 0 \pmod 8$, we then have that $2(1 + 1) \equiv 0 \pmod 8$, which is impossible. Therefore, $s$ must be even. It follows that $2s$ and $s + 1$ are coprime, and there exist integers $t$ and $u$, both greater than 1, such that $2s = t^4$ and $s + 1 = u^4$. Let $v$ and $a$ be positive integers such that $t = 2v$ and $u^2 = 2a + 1$. Then

\[ \frac{t^4}{2} + 1 = s + 1 = u^4, \]

which implies that

\[ 2v^4 = \frac{u^4 - 1}{4} = a(a + 1). \]

Noting that $u^2 = 2a + 1$, we observe that $u^2 \equiv 1 \pmod 4$, which implies that $a$ is even. Since $2v^4 = a(a + 1)$, it now follows from the fact that $a$ and $a + 1$ are coprime that there exist positive integers $b$ and $c$ such that $a = 2b^4$ and $a + 1 = c^4$. Then

\[ 2b^4 + 1 = (c^2)^2. \]

However,

\[ c^2 \le c^4 = a + 1 < 2a + 1 = u^2 \le u^4 = s + 1 < 2s + 1 = y. \]

Lemma 2 now follows by the Fermat method of infinite descent Krizek et al. (2021)[Sect. 2.4].

Lemma 3 There exists exactly one positive integer $x$, namely 1, such that $8x^4 + 1$ is a square.

Proof Assume that $8x^4 + 1 = (2s + 1)^2$, where $x \ge 1$. Then $2x^4 = s(s + 1)$. Suppose that $s$ is even. Since $s$ and $s + 1$ are coprime, there exist positive integers $t$ and $u$ such that $s = 2t^4$ and $s + 1 = u^4$. Then $2t^4 + 1 = s + 1 = u^4$, and by Lemma 2, $t = 0$, and hence $x = 0$, contradicting $x \ge 1$. Thus, we only need to consider the case where $s$ is odd. It now follows that there exist integers $t$ and $u$ such that $s = t^4$ and $s + 1 = 2u^4$, i.e.,

\[ t^4 + 1 = 2u^4. \tag{3} \]

Since $t$ is odd, $t^4 + 1 \equiv 2 \pmod 4$, which implies by (3) that $u$ is odd. Squaring both sides of (3) yields that

\[ 4u^8 - 4t^4 = t^8 - 2t^4 + 1 = (t^4 - 1)^2 \]

and thus

\[ (u^4 - t^2)(u^4 + t^2) = \left(\frac{t^4 - 1}{2}\right)^2, \]

an integer square. Since $u^4$ and $t^2$ are coprime, it follows that both $\frac{1}{2}(u^4 - t^2)$ and $\frac{1}{2}(u^4 + t^2)$ are integer squares. We now observe that

\[ (u^2 - t)^2 + (u^2 + t)^2 = 4 \cdot \frac{u^4 + t^2}{2} = A^2 \quad \text{and} \quad \frac{(u^2 - t)(u^2 + t)}{2} = \frac{u^4 - t^2}{2} = B^2, \tag{4} \]

where $A > 0$. From this and relations (4), if $B \neq 0$, then the triangle with sides $u^2 - t$, $u^2 + t$, $\sqrt{2(u^4 + t^2)}$ is a Pythagorean triangle whose area is an integer square. This is impossible (see Krizek et al. (2021)). Thus $B = 0$ and $u^2 = t$. Since $t^4 + 1 = 2u^4$ by (3), we obtain

\[ t^4 - 2t^2 + 1 = 0, \]

which yields that $t^2 = 1$. It now follows that $s = t^4 = 1$ and $8x^4 + 1 = (2s + 1)^2 = 9$. Hence, $x = 1$.

3 Proof of the Main Theorem We prove Theorem 1 in two parts: (1) n is even, (2) n is odd. Several additional lemmas will be needed later for the proof.

3.1 Proof with Even n

Let us suppose that $x(x + 1)(2x + 1) = 6y^2$, where $x$ is a nonnegative even integer. We will later discard the trivial solution in which $x = 0$. Then $x + 1$ is odd. Since $x$, $x + 1$, and $2x + 1$ are coprime in pairs and both $x + 1$ and $2x + 1$ are odd, it follows that $x + 1$ and $2x + 1$ are either squares or triples of squares. Thus, $x + 1 \not\equiv 2 \pmod 3$ and $2x + 1 \not\equiv 2 \pmod 3$. Hence, $x \equiv 0 \pmod 3$. Since $x$ is even, $x \equiv 0 \pmod 6$ and $x + 1 \equiv 2x + 1 \equiv 1 \pmod 6$. Therefore, for some nonnegative integers $f$, $g$, and $h$, we have $x = 6g^2$, $x + 1 = f^2$, and $2x + 1 = h^2$. Then

\[ 6g^2 = h^2 - f^2 = (h - f)(h + f). \tag{5} \]

Since $h$ and $f$ are both odd, 4 is a factor of $(h - f)(h + f)$, and thus $4 \mid g^2$. Hence, by (5),

\[ 6\left(\frac{g}{2}\right)^2 = \frac{h - f}{2} \cdot \frac{h + f}{2}. \tag{6} \]

We note that $\frac{1}{2}(h - f)$ and $\frac{1}{2}(h + f)$ are coprime, because $h^2$ and $f^2$ are coprime. Thus, by (6), we have the following two cases:

Case 1. One of $\frac{1}{2}(h - f)$ and $\frac{1}{2}(h + f)$ has the form $6A^2$ and the other has the form $B^2$, where $A$ and $B$ are nonnegative integers. Then $f = \pm(6A^2 - B^2)$ and by equality (6),

\[ \frac{h - f}{2} \cdot \frac{h + f}{2} = 6A^2B^2 = \frac{6g^2}{4} \;\Leftrightarrow\; g^2 = 4A^2B^2 \;\Leftrightarrow\; g = 2AB. \tag{7} \]

Since $6g^2 + 1 = x + 1 = f^2$, we obtain from (7) that

\[ 24A^2B^2 + 1 = 6g^2 + 1 = (6A^2 - B^2)^2 \]

or

\[ (6A^2 - 3B^2)^2 - 8B^4 = 1. \tag{8} \]

By Lemma 3, $B = 0$ or 1. If $B = 0$, then by (7), $x = 6g^2 = 24A^2B^2 = 0$. If $B = 1$, then $A = 1$ by (8), and $x = 24A^2B^2 = 24$.

Case 2. One of $\frac{1}{2}(h - f)$ and $\frac{1}{2}(h + f)$ has the form $3A^2$ and the other has the form $2B^2$, where $A$ and $B$ are nonnegative integers. Then $f = \pm(3A^2 - 2B^2)$ and, as in (7), $g = 2AB$. By a similar argument as in Case 1, we have that

\[ 24A^2B^2 + 1 = (3A^2 - 2B^2)^2 \]

or

\[ (3A^2 - 6B^2)^2 - 2(2B)^4 = 1. \]

By Lemma 2, $B = 0$ and hence, $x = 24A^2B^2 = 0$.

Consequently, when $n$ is an even positive integer, the square pyramidal number $\frac{1}{6} n(n + 1)(2n + 1)$ is a square if and only if $n = 24$.

3.2 Proof with Odd n

We first examine the solutions of the Pell equation $X^2 - 3Y^2 = 1$. Let $\gamma = 2 + \sqrt{3}$ and $\delta = 2 - \sqrt{3}$. Note that $\gamma\delta = 1$. For $n$ a nonnegative integer, let

\[ v_n = \frac{\gamma^n + \delta^n}{2} \quad \text{and} \quad u_n = \frac{\gamma^n - \delta^n}{2\sqrt{3}}. \]

Then $v_n$ and $u_n$ are integers and it is well known that $(v_n, u_n)$ is the $n$th nonnegative integer solution of $X^2 - 3Y^2 = 1$ for $n \ge 0$ (see Robbins (2006)[pp. 273–274], Beiler (1964)[Table 91, pp. 252–255]). Below, this part of the proof will make use of Lemmas 4–12.

Lemma 4 Both the sequences $(v_n)_{n=0}^{\infty}$ and $(u_n)_{n=0}^{\infty}$ satisfy the recursion relation

\[ w_{n+2} = 4w_{n+1} - w_n \quad \text{for } n \ge 0. \tag{9} \]

Proof We observe that $\gamma$ and $\delta$ are the roots of the quadratic equation $x^2 - 4x + 1 = 0$. Thus, $\gamma^2 = 4\gamma - 1$ and $\delta^2 = 4\delta - 1$. It now follows that

\[ \gamma^{n+2} = 4\gamma^{n+1} - \gamma^n, \qquad \delta^{n+2} = 4\delta^{n+1} - \delta^n \quad \text{for } n \ge 0. \]

Hence, the sequences $(\gamma^n)_{n=0}^{\infty}$ and $(\delta^n)_{n=0}^{\infty}$ both satisfy the recursion relation (9). It follows from the definitions of $v_n$ and $u_n$ and induction that both the sequences $(v_n)_{n=0}^{\infty}$ and $(u_n)_{n=0}^{\infty}$ also satisfy the relation (9).
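The recursion of Lemma 4 gives a convenient way to generate the Pell solutions in exact integer arithmetic. The sketch below (function name illustrative) starts from $(v_0, u_0) = (1, 0)$ and $(v_1, u_1) = (2, 1)$, checks $X^2 - 3Y^2 = 1$, and compares the result with the closed form in $\gamma$ and $\delta$, evaluated in floating point for small $n$ only.

```python
import math

def pell_solutions(count):
    """First `count` pairs (v_n, u_n), n = 0, 1, ..., generated by
    w_{n+2} = 4 w_{n+1} - w_n from (v_0, u_0) = (1, 0), (v_1, u_1) = (2, 1)."""
    v, u = [1, 2], [0, 1]
    while len(v) < count:
        v.append(4 * v[-1] - v[-2])
        u.append(4 * u[-1] - u[-2])
    return list(zip(v[:count], u[:count]))

pairs = pell_solutions(8)

# Every pair should solve the Pell equation X^2 - 3 Y^2 = 1.
is_pell = all(v * v - 3 * u * u == 1 for v, u in pairs)

# Closed form via gamma = 2 + sqrt(3), delta = 2 - sqrt(3); rounding is safe
# here only because n is small.
gamma, delta = 2 + math.sqrt(3), 2 - math.sqrt(3)
closed = [(round((gamma ** n + delta ** n) / 2),
           round((gamma ** n - delta ** n) / (2 * math.sqrt(3))))
          for n in range(8)]

print(pairs)
print(is_pell, closed == pairs)
```

For larger indices the integer recursion is the safer choice, since powers of $\gamma$ quickly exceed double precision.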

Lemma 5 Let $n \ge 0$. Then $v_{-n} = v_n$ and $u_{-n} = -u_n$.

Proof By inspection, $v_0 = 1$, $v_1 = 2$ and $u_0 = 0$, $u_1 = 1$. The result now follows by induction upon making use of the recursion relation (9).

Lemma 6 For $m$ a positive integer, $v_{2m} = 2v_m^2 - 1 = 6u_m^2 + 1$ and $u_{2m} = 2u_m v_m$.

Proof The equalities follow from the definitions of $v_m$ and $u_m$ and from the facts that $\gamma\delta = 1$ and $v_m^2 - 3u_m^2 = 1$.

Lemma 7 Let $m, n \ge 0$. Then we have that

\[ v_{m+n} = v_m v_n + 3u_m u_n \quad \text{and} \quad u_{m+n} = u_m v_n + u_n v_m. \]

Further, if $m - n \ge 0$ then

\[ v_{m-n} = v_m v_n - 3u_m u_n \quad \text{and} \quad u_{m-n} = u_m v_n - u_n v_m. \]

Proof This follows from the definitions of $v_n$ and $u_n$ and from Lemma 5.

Lemma 8 If $k$ and $m$ are nonnegative integers, then the following hold:

\[ v_{(2k+1)m} \equiv 0 \pmod{v_m}, \tag{10} \]
\[ u_{2km} \equiv 0 \pmod{v_m}, \tag{11} \]
\[ v_{2km} \equiv (-1)^k \pmod{v_m}. \tag{12} \]

Proof We first prove (10) and (11) by induction. Clearly, (10) holds for $k = 0$. Assume that (10) holds also for $k \ge 0$. Then by Lemmas 6 and 7, we have that

\[ v_{(2k+3)m} = v_{(2k+1)m} v_{2m} + 3u_{(2k+1)m} u_{2m} = v_{(2k+1)m} v_{2m} + 6u_{(2k+1)m} u_m v_m \equiv 0 \pmod{v_m}. \]

Thus, (10) holds. Since $u_0 = 0$, we see by Lemma 6 that (11) is satisfied for $k = 0$ and $k = 1$. Now assume that (11) holds for $k \ge 1$. Then by Lemmas 6 and 7, we see that

\[ u_{(2k+2)m} = u_{2km} v_{2m} + u_{2m} v_{2km} = u_{2km} v_{2m} + 2u_m v_m v_{2km} \equiv 0 \pmod{v_m}. \]

Thus, (11) holds. Suppose first that $k$ is odd. Then by Lemma 6 and (10),

\[ v_{2km} = 2v_{km}^2 - 1 \equiv -1 \pmod{v_m}. \]

Now suppose that $k$ is even. Then by Lemma 6 and (11),

\[ v_{2km} = 6u_{km}^2 + 1 \equiv 1 \pmod{v_m}. \]

Thus, (12) holds.

Lemma 9 Let $k$, $m$, and $n$ be nonnegative integers such that $2km - n$ is nonnegative. Then

\[ v_{2km \pm n} \equiv (-1)^k v_n \pmod{v_m}. \]

Proof By Lemma 7 and (11) and (12),

\[ v_{2km \pm n} = v_{2km} v_n \pm 3u_{2km} u_n \equiv (-1)^k v_n \pmod{v_m}, \]

which proves the lemma.
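The identities and congruences of Lemmas 6 to 9 lend themselves to a mechanical sanity check on small indices. The sketch below is such a check, not part of the proof; the function names and the index ranges are arbitrary choices made for illustration.

```python
def pell_sequences(max_index):
    """Lists v[0..max_index], u[0..max_index] from w_{n+2} = 4 w_{n+1} - w_n."""
    v, u = [1, 2], [0, 1]
    while len(v) <= max_index:
        v.append(4 * v[-1] - v[-2])
        u.append(4 * u[-1] - u[-2])
    return v, u

v, u = pell_sequences(200)

def check_doubling(m):
    """Lemma 6: v_{2m} = 2 v_m^2 - 1 = 6 u_m^2 + 1 and u_{2m} = 2 u_m v_m."""
    return (v[2 * m] == 2 * v[m] ** 2 - 1 == 6 * u[m] ** 2 + 1
            and u[2 * m] == 2 * u[m] * v[m])

def check_addition(m, n):
    """Lemma 7: addition formulas, and subtraction formulas when m >= n."""
    ok = (v[m + n] == v[m] * v[n] + 3 * u[m] * u[n]
          and u[m + n] == u[m] * v[n] + u[n] * v[m])
    if m >= n:
        ok = ok and (v[m - n] == v[m] * v[n] - 3 * u[m] * u[n]
                     and u[m - n] == u[m] * v[n] - u[n] * v[m])
    return ok

def check_congruences(k, m):
    """Lemma 8: relations (10), (11), (12) modulo v_m."""
    mod = v[m]
    return (v[(2 * k + 1) * m] % mod == 0
            and u[2 * k * m] % mod == 0
            and (v[2 * k * m] - (-1) ** k) % mod == 0)

def check_shift(k, m, n):
    """Lemma 9: v_{2km +- n} = (-1)^k v_n modulo v_m, for 0 <= n <= 2km."""
    mod, sign = v[m], (-1) ** k
    return all((v[2 * k * m + s * n] - sign * v[n]) % mod == 0 for s in (1, -1))

lemma6 = all(check_doubling(m) for m in range(1, 51))
lemma7 = all(check_addition(m, n) for m in range(51) for n in range(51))
lemma8 = all(check_congruences(k, m) for k in range(6) for m in range(1, 7))
lemma9 = all(check_shift(k, m, n)
             for k in range(6) for m in range(1, 7) for n in range(2 * k * m + 1))
print(lemma6, lemma7, lemma8, lemma9)
```

Differences are reduced modulo $v_m$ before comparing with zero, so the sign $(-1)^k$ needs no special handling.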

Let us now consider the first several values of $v_n$. Beginning with $n = 0$, we have

1, 2, 7, 26, 97, 362, 1351, ...

Considering these values modulo 5, we obtain

1, 2, 2, 1, 2, 2, ...

By Lemma 4, we see that $(v_n)$ is purely periodic modulo 5 with a period length of 3. If we consider the values of $(v_n)$ modulo 8, we get

1, 2, 7, 2, 1, 2, 7, 2, ...

Again by Lemma 4, we find that $(v_n)$ is purely periodic modulo 8 with a period length of 4 and that $v_n$ is odd if and only if $n$ is even. Using the laws of quadratic reciprocity for the Jacobi symbol (see, e.g., Krizek et al. (2021)), we have the following two lemmas by our above remarks.

Lemma 10 Let $n$ be even. Then $v_n$ is an odd nonmultiple of 5 and $\left(\frac{5}{v_n}\right) = 1$ if and only if $n \equiv 0 \pmod 3$.

Lemma 11 Let $n$ be even. Then $v_n$ is odd and $\left(\frac{-2}{v_n}\right) = 1$ if and only if $n \equiv 0 \pmod 4$.

Lemma 12 was first proved by Ma (1985).

Lemma 12 For $n \ge 0$, $v_n$ has the form $4M^2 + 3$ only when $v_n = 7$.

Proof Suppose that $v_n = 4M^2 + 3$. Then $v_n \equiv 3$ or $7 \pmod 8$, and from the sequence of values of $(v_n)$ modulo 8, we observe that $n$ has the form $8k \pm 2$. Suppose that $n \neq 2$, which implies that $v_n \neq 7$. Then we can write $n$ in the form $2r \cdot 2^s \pm 2$, where $r$ is odd and $s \ge 2$. By Lemma 9,

\[ v_n = v_{2r \cdot 2^s \pm 2} \equiv (-1)^r v_2 \pmod{v_{2^s}}. \]

Since $r$ is odd and $v_2 = 7$, it follows that $4M^2 = v_n - 3 \equiv -10 \pmod{v_{2^s}}$ and thus,

\[ \left(\frac{-2}{v_{2^s}}\right)\left(\frac{5}{v_{2^s}}\right) = \left(\frac{-10}{v_{2^s}}\right) = \left(\frac{4M^2}{v_{2^s}}\right) = 1. \tag{13} \]

However, by Lemmas 11 and 10,

\[ \left(\frac{-2}{v_{2^s}}\right) = 1 \quad \text{and} \quad \left(\frac{5}{v_{2^s}}\right) = -1, \]

which contradicts (13). Thus $n = 2$ and $v_n = 7$.

Now we prove Theorem 1 for the case where $n$ is odd. Suppose that $x$ is a positive odd integer such that $x(x + 1)(2x + 1) = 6y^2$ for some integer $y$. Since $x$, $x + 1$, and $2x + 1$ are coprime in pairs, it follows that $x$ is either a square or triple a square, and thus

$$x \not\equiv 2 \pmod 3.$$

Furthermore, since $x + 1$ is even, we see that $x + 1$ is either double a square or six times a square, and consequently, $x + 1 \not\equiv 1 \pmod 3$. Hence,

$$x \equiv 1 \pmod 6, \quad x + 1 \equiv 2 \pmod 6, \quad 2x + 1 \equiv 3 \pmod 6.$$

Therefore, for some positive integers $a$, $b$, and $c$, we have $x = a^2$, $x + 1 = 2b^2$, $2x + 1 = 3c^2$. From this we get

$$6c^2 + 1 = 4x + 3 = 4a^2 + 3. \tag{14}$$

We also observe that

$$(6c^2 + 1)^2 - 3(4bc)^2 = 12c^2(3c^2 + 1 - 4b^2) + 1 = 12c^2\big(2x + 1 + 1 - 2(x + 1)\big) + 1 = 1. \tag{15}$$

Hence, by (14), (15), and Lemma 12, $6c^2 + 1 = 7$. Thus $c = 1$ and $x = 1$. Therefore, $n = 1$ is the only positive odd integer such that $\frac{1}{6} n(n + 1)(2n + 1)$ is a square.

We can now conclude that the only positive integers $n$ for which the square pyramidal number $\frac{1}{6} n(n + 1)(2n + 1)$ is a square are $n = 1$ and $n = 24$, which give the square numbers $1$ and $4900 = 70^2$, respectively.
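The conclusion can be illustrated by a brute-force search. The following sketch only tests $n$ up to a finite bound, so it illustrates the theorem rather than proving it; the bound and the function names are arbitrary choices made for this example.

```python
from math import isqrt


def square_pyramidal(n):
    """Sum of the first n squares, n(n + 1)(2n + 1) / 6."""
    return n * (n + 1) * (2 * n + 1) // 6


def square_cases(limit):
    """All pairs (n, root) with n <= limit whose pyramidal number equals root**2."""
    hits = []
    for n in range(1, limit + 1):
        p = square_pyramidal(n)
        root = isqrt(p)
        if root * root == p:
            hits.append((n, root))
    return hits


print(square_cases(100_000))  # [(1, 1), (24, 70)]
```

Using `isqrt` keeps the test in exact integer arithmetic, so no floating-point rounding can produce a false square.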

4 Weird Numbers

Denote by $s(n)$ the sum of all divisors of an integer $n$ that are less than $n$. A positive integer $n$ is called abundant if $s(n) > n$. We see that the number 70 is abundant, since

1 + 2 + 5 + 7 + 10 + 14 + 35 = 74 > 70.

A positive integer $n$ is called weird if it is abundant and if it cannot be expressed as the sum of some of its proper divisors.

Theorem 2 The number 70 is the smallest weird number.

Proof There are only 13 abundant numbers less than 70, namely

12, 18, 20, 24, 30, 36, 40, 42, 48, 54, 56, 60, 66.

By inspection, we find that they are not weird, e.g.,

12 = 2 + 4 + 6, 18 = 3 + 6 + 9, 20 = 1 + 4 + 5 + 10, 24 = 4 + 8 + 12, . . .

However, we easily find that no subset of the set {1, 2, 5, 7, 10, 14, 35} of proper divisors of 70 sums to 70.

Acknowledgements M. Křížek was supported by the Institute of Mathematics of the Czech Academy of Sciences (RVO 67985840).
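The inspection in the proof of Theorem 2 can be repeated mechanically. The sketch below lists the abundant numbers below 70 and tests each candidate with a simple subset-sum search over its proper divisors; the helper names are hypothetical.

```python
def proper_divisors(n):
    """Divisors of n that are smaller than n."""
    return [d for d in range(1, n) if n % d == 0]


def is_abundant(n):
    return sum(proper_divisors(n)) > n


def has_subset_sum(values, target):
    """True if some subset of `values` (each used at most once) sums to `target`."""
    reachable = {0}
    for value in values:
        reachable |= {r + value for r in reachable if r + value <= target}
    return target in reachable


def is_weird(n):
    divisors = proper_divisors(n)
    return sum(divisors) > n and not has_subset_sum(divisors, n)


abundant_below_70 = [n for n in range(1, 70) if is_abundant(n)]
print(abundant_below_70)  # [12, 18, 20, 24, 30, 36, 40, 42, 48, 54, 56, 60, 66]
print([n for n in range(1, 71) if is_weird(n)])  # [70]
```

The set of reachable sums never exceeds `target + 1` elements, so the search stays cheap even for numbers with many divisors.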

References

Anglin WS (1990) The square pyramid puzzle. Am Math Mon 97(2):120–124
Beiler AH (1964) Recreations in the theory of numbers: the queen of mathematics entertains, 2nd edn. Dover, New York
Křížek M, Luca F, Somer L (2001) 17 lectures on Fermat numbers: from number theory to geometry. CMS Books in Mathematics, vol 9, 2nd edn. Springer, New York 2011
Křížek M, Somer L, Šolcová A (2021) From great discoveries in number theory to applications. Springer, Cham
Lucas É (1875) Question 1180. Nouv Ann Math Ser 2(14):336
Ma DG (1985) An elementary proof of the solutions to the Diophantine equation 6y^2 = x(x + 1)(2x + 1). Sichuan Daxue Xuebao 4:107–116
Meyl A-J-J (1878) Solution de question 1194. Nouv Ann Math Ser 2(17):464–467
Robbins N (2006) Beginning number theory, 2nd edn. Jones & Bartlett Learning, Sudbury, MA
Watson GN (1918) The problem of the square pyramid. Messenger Math 48:1–22

AI and Applications in Health Care

Randomized Constructive Neural Network Based on Regularized Minimum Error Entropy

Mojtaba Nayyeri, Hadi Sadoghi Yazdi, Modjtaba Rouhani, Alaleh Maskooki, and Marko M. Mäkelä

Abstract So far, several types of random neural networks have been proposed in which optimal output weights are adjusted using the Mean Square Error (MSE) objective function. However, since many real-world phenomena do not follow a normal distribution, MSE-based methods perform poorly in such cases. This paper presents a new single-layer random constructive neural network based on the regularized Minimum Error Entropy (MEE) objective function. The proposed method combines the MSE and MEE objective functions through a regularization term to adjust the optimal output parameter for new nodes. Experimental results show that the proposed method performs well in the presence of both Gaussian and impulsive noise. Furthermore, due to the random assignment of the hidden layer parameters, the computational burden of the proposed method is reduced. The incremental constructive architecture of the proposed network helps optimize non-convex objective functions to achieve the desired performance. Computational comparisons indicate the superior performance of our method with several synthetic and benchmark datasets.

Keywords Randomized constructive neural network · Minimum error entropy · Regression · Single layer feedforward network · Non-Gaussian noise · Impulsive noise

M. Nayyeri
University of Bonn, Bonn, Germany

H. S. Yazdi · M. Rouhani
Ferdowsi University of Mashhad, Mashhad, Iran

A. Maskooki (corresponding author) · M. M. Mäkelä
University of Turku, Vesilinnantie 5, 20014 Turku, Finland

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053,




1 Introduction

Artificial neural networks, inspired by biological neural systems, are one of the most attractive parts of machine learning. Single hidden Layer Feedforward Networks (SLFNs) are the simplest form of neural networks and are often used in regression and classification problems because of their universal approximation (Hornik et al. 1989; Huang and Chen 2006; Leshno et al. 1993; Park and Sandberg 1991) and classification (Huang et al. 2000, 2012) capabilities. Furthermore, they are used for semi-supervised and unsupervised learning, such as the approximation of solutions to differential equations.

When using a neural network with a fixed topology, it is important to determine the appropriate number of hidden nodes prior to training. Too few or too many hidden nodes lead to under- or over-fitting, respectively (Anders and Korn 1999). There are some approaches to tackling this problem. For instance, a pruning algorithm trains a large number of hidden nodes and then removes unnecessary units and connections from the oversized network (Reed 1993; Nayyeri et al. 2017). In a constructive network, on the other hand, the network architecture is initially minimal in size and is enlarged until it reaches a certain predefined performance (Fahlman and Lebiere 1989; Huang and Chen 2007, 2006; Karmitsa et al. 2020; Kwok and Yeung 1997). Both methods have the advantage that the algorithm automatically determines the number of hidden nodes.

In a classical Feedforward Neural Network (FNN) (Hagan et al. 1996), the input parameters of the hidden layer are adjusted based on the error feedback. Various approaches are proposed in this framework. Kwok and Yeung (1997) introduced several objective functions to adjust the input parameters of a constructive network. These methods create a more compact network than random-based methods due to the optimization of hidden layer parameters. The approximation capability of FNNs has been widely studied.
FNNs have been shown to be capable of approximating generic classes of functions, including continuous and integrable ones (Barron 1993; Hornik et al. 1989; Park and Sandberg 1991; Scarselli and Tsoi 1998). Huang et al. (2006) proposed the Incremental Extreme Learning Machine (IELM), in which only the output parameter of the newly added node is adjusted, and the connections with and within hidden neurons are randomly fixed. They proved that such a network has the universal approximation capability. Liu et al. (2015) and Lin et al. (2015) studied the pros and cons of ELM. They showed that, by selecting an appropriate activation function, ELM reduces the computational burden without sacrificing the prediction accuracy (Liu et al. 2015). However, compared with the classical FNNs, the randomness results in uncertainty in both approximation and learning (Lin et al. 2015).

For a single hidden layer network with a constructive architecture and random input parameters, Schmidt et al. (1992) proposed a randomized algorithm in which the hidden layer parameters are randomly chosen with a uniform distribution from the interval [−1, 1]. Pao and Takefuji (1992) proposed the Random Vector Functional-Link (RVFL) neural networks, where the possible region for selecting the hidden layer parameters was discussed in detail. Tyukin and Prokhorov (2009) reviewed Barron's greedy approximation framework (Barron 1993) and compared it with RVFLs from both theoretical and experimental perspectives. Consistent with previous studies, they observed uncertainty in RVFL models, which is the result of the random assignment of inner weights and biases.

A common objective function for training SLFNs is the Mean Square Error (MSE). However, there are some restrictive assumptions in the use of the MSE objective function, including linear and Gaussian noise assumptions (Erdogmus and Principe 2000, 2002, 2003). Many phenomena do not satisfy such assumptions. Consequently, the MSE objective function performs poorly in such cases. To better tackle such problems, training is done based on Information Theoretic Learning (ITL) approaches. Entropy is one of the widely used concepts in ITL, providing a measure for the average information in a distribution function (Principe 2010).

Regarding some practical applications of entropy, Erdogmus and Principe (2000) utilized the error entropy objective function in adaptive system training. Later, the authors investigated the error entropy minimization and proved the equivalence of Renyi's error entropy and Csiszar distance minimization between the system output and desired probability densities (Erdogmus and Principe 2002). They compared the Minimum Error Entropy (MEE) and MSE criteria in the prediction of chaotic time series and in the identification of nonlinear systems. Furthermore, they generalized the error entropy criterion to any order of Renyi's entropy and any kernel function (Erdogmus and Principe 2003). Santos et al. (2004a) utilized error entropy minimization for neural network classification with a fixed architecture. Further, they proposed a model where the error entropy objective function is optimized using a variable learning rate parameter (Santos et al. 2004b).
Alexandre and Marques de Sá (2006) used error entropy minimization for a long short-term memory recurrent neural network. Error entropy led to successful results in classification when integrated into kernel-based methods (Han 2007a). Yuan and Hu (2009) employed the MEE objective function for the feature extraction task. In addition, Han (2007b) employed the MEE objective function in training an adaptive filter and showed that the presented filters are robust in the presence of non-Gaussian noise.

Using the MEE objective function in learning machines, such as the Multi-Layer Perceptron (MLP) and adaptive filters, results in high computational complexity. To reduce computing costs, we take advantage of the idea of random hidden nodes (Huang and Chen 2006; Igelnik and Pao 1995; Park and Sandberg 1991; Schmidt et al. 1992; Tyukin and Prokhorov 2009) with a constructive approach (Huang and Chen 2006; Huang et al. 2000, 2012; Kwok and Yeung 1997). This paper presents an extension to IELM (Huang and Chen 2006) that trains a new incremental constructive feedforward network based on minimum error entropy and minimum mean square error with random hidden nodes.

The rest of the paper is organized as follows. IELM is briefly described in Sect. 2. The proposed constructive method based on the MEE-MSE objective function is introduced in Sect. 3. Section 4 verifies the performance of the proposed method with several synthetic and benchmark datasets. Conclusions and future works are discussed in Sect. 5.



2 Incremental Extreme Learning Machine

In this section, we briefly review IELM proposed by Huang et al. (2006). IELM is a constructive feedforward network with random hidden nodes. In this method, the network starts with one hidden node, and new nodes are added and trained one by one. Suppose $N$ is the number of samples and the index $n$ indicates the last hidden node, which is initialized with zero and is increased in each step. Suppose $e_{n-1}(i)$ is the error of the network with $n - 1$ hidden nodes for the $i$th training sample, $g_n(i)$ is the output of the $n$th hidden node for the $i$th training sample, and $\beta_n$ is the output weight of the $n$th hidden node. In a constructive network the following equation holds:

$$E_n = E_{n-1} - \beta_n G_n. \tag{1}$$

Here $E_n$ and $G_n$ are the $n$th error and hidden node vectors, respectively:

$$G_n = [g_n(1), \ldots, g_n(N)]^T, \qquad E_{n-1} = [e_{n-1}(1), \ldots, e_{n-1}(N)]^T.$$

For the $n$th hidden node, the input parameters are assigned randomly and its output weight is adjusted based on maximization of the following objective function:

$$\begin{aligned}
\Delta(\beta_n) &= E_{n-1}^T E_{n-1} - E_n^T E_n = E_{n-1}^T E_{n-1} - (E_{n-1} - \beta_n G_n)^T (E_{n-1} - \beta_n G_n) \\
&= E_{n-1}^T E_{n-1} - E_{n-1}^T E_{n-1} - \beta_n^2 G_n^T G_n + 2\beta_n E_{n-1}^T G_n.
\end{aligned}$$

Thus, we have

$$\Delta(\beta_n) = 2\beta_n E_{n-1}^T G_n - \beta_n^2 G_n^T G_n.$$

Since the objective function is concave with respect to $\beta_n$, it attains its maximum when the derivative vanishes. In other words,

$$\frac{d\Delta(\beta_n)}{d\beta_n} = 2E_{n-1}^T G_n - 2\beta_n G_n^T G_n = 0,$$

which is true when

$$\beta_n = \frac{E_{n-1}^T G_n}{G_n^T G_n}. \tag{2}$$

There are two types of SLFN based on the activation function of its hidden nodes. Additive nodes are

$$g(a_i^T x + b_i), \quad a_i \in \mathbb{R}^d,\ b_i \in \mathbb{R},$$

and radial basis function nodes are

$$g\!\left(\frac{\|x - a_i\|}{b_i}\right), \quad a_i \in \mathbb{R}^d,\ b_i \in \mathbb{R}^+,$$

where $x \in \mathbb{R}^d$ is the input vector, and $g$ is the activation function. For each node, the output weight is adjusted using Eq. (2). The input parameters of the $i$th hidden node, $a_i$ and $b_i$, are assigned randomly. Once a new node is trained, its parameters are fixed and are not changed during the training of the next nodes. The network size is gradually increased until a predefined stopping condition is satisfied.
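The constructive procedure of this section can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the reference implementation of IELM: the sigmoid additive nodes, the sampling interval [−1, 1], the RMSE-based stopping rule, and all function and parameter names are assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_ielm(X, y, max_nodes=200, tol=1e-3, seed=0):
    """Add random additive nodes one by one; only the output weight is fitted."""
    rng = np.random.default_rng(seed)
    E = y.astype(float).copy()                  # E_0: error of the empty network
    nodes = []
    for _ in range(max_nodes):
        a = rng.uniform(-1.0, 1.0, size=X.shape[1])   # random input weights (assumed range)
        b = rng.uniform(-1.0, 1.0)                    # random bias (assumed range)
        G = sigmoid(X @ a + b)                        # hidden node vector G_n
        beta = (E @ G) / (G @ G)                      # output weight, Eq. (2)
        E = E - beta * G                              # error update, Eq. (1)
        nodes.append((a, b, beta))
        if np.sqrt(np.mean(E ** 2)) < tol:            # assumed stopping condition
            break
    return nodes, E


def predict(nodes, X):
    return sum(beta * sigmoid(X @ a + b) for a, b, beta in nodes)


X = np.linspace(-3.0, 3.0, 50).reshape(-1, 1)
y = np.sin(X[:, 0])
nodes, residual = train_ielm(X, y, max_nodes=30)
print(len(nodes), np.linalg.norm(residual))
```

Because each $\beta_n$ maximizes $\Delta(\beta_n)$, the squared error norm cannot increase when a node is added, whatever random input parameters were drawn.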

3 Proposed Method

In this section, the MEE objective function is used to adjust the output parameter of the newly added node in a constructive feedforward network. The training process for the new node is performed in two phases:

Phase 1 The input parameters of the newly added node are assigned randomly.
Phase 2 The combination of MSE and information potential is maximized with respect to the output parameter.

The input parameters of the new node remain fixed at the next levels. Figures 1 and 2 show the first and $L$th steps of the proposed algorithm, respectively. One new node is added at each step. The parameters of the new node are adjusted in two phases. After adjustment, they are fixed and do not change during training of the next nodes. Before introducing the details of the proposed method, the following subsection provides some preliminary information.

Fig. 1 Determining the parameters of the first hidden node in Step 1 of the algorithm: input weights of the node randomly (Phase 1), output weight of the node by an iterative optimization process (Phase 2)



Fig. 2 Determining the parameters of the Lth hidden node in Step L of the algorithm: input weights of the node randomly (Phase 1), output weight of the node by an iterative optimization process (Phase 2)

3.1 Problem Formulation

In this subsection, an optimization problem is formulated to adjust the output parameter of the new node. Similar to Alexandre and Marques de Sá (2006), Erdogmus and Principe (2000, 2002, 2003), Han (2007a, 2007b), Igelnik and Pao (1995), Principe (2010), Renyi (1984), Santos et al. (2004a, 2004b), Yuan and Hu (2009), the quadratic error entropy is defined as follows:

$$H(e) = -\log \int p^2(e)\, de,$$

where $p(e)$ is the error probability density function and $e$ is a random variable for error. The information potential is defined as follows:

$$V(e) = \int p^2(e)\, de. \tag{3}$$

So the error entropy is rewritten as $H(e) = -\log(V(e))$. Since the error probability density function is unknown, the Parzen window method (Parzen 1962) is used to estimate the error density from a sample error vector:

$$\hat{p}(e) = \frac{1}{N} \sum_{i=1}^{N} K_\sigma(e - e(i)),$$

where

$$K_\sigma(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\|x\|^2}{2\sigma^2}\right)$$

is a Gaussian kernel function and $e(i)$ is the error of the $i$th training sample. Similar to Erdogmus and Principe (2002), after substituting $\hat{p}(e)$ for $p(e)$ into Eq. (3) and integrating, we have

$$\hat{V}(e) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} K_{\sqrt{2}\sigma}(e(j) - e(i)).$$
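The closed form of the estimated information potential can be checked against a direct numerical integration of the squared Parzen estimate. The sketch below uses synthetic errors; the sample size, grid, and function names are assumptions chosen only for illustration.

```python
import numpy as np


def gaussian_kernel(x, sigma):
    return np.exp(-x ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)


def parzen_density(points, errors, sigma):
    """Parzen estimate of the error density, evaluated at each entry of `points`."""
    return gaussian_kernel(points[:, None] - errors[None, :], sigma).mean(axis=1)


def information_potential(errors, sigma):
    """Double-sum estimate with kernel width sqrt(2) * sigma."""
    diffs = errors[:, None] - errors[None, :]
    return gaussian_kernel(diffs, np.sqrt(2.0) * sigma).mean()


rng = np.random.default_rng(0)
errors = rng.normal(0.0, 0.3, size=40)      # synthetic error sample
sigma = 0.5

grid = np.linspace(-6.0, 6.0, 20001)
density = parzen_density(grid, errors, sigma)
numeric = np.sum(density ** 2) * (grid[1] - grid[0])   # Riemann sum of the integral of p_hat^2
closed = information_potential(errors, sigma)

print(abs(numeric - closed) < 1e-6)   # True
print(-np.log(closed))                # estimated quadratic error entropy
```

The agreement reflects the fact that the integral of a product of two Gaussian kernels of width $\sigma$ is a Gaussian kernel of width $\sqrt{2}\sigma$ evaluated at the difference of their centres.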

The information potential for the network with $n$ hidden nodes is defined as follows:

$$IP(\beta_n) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} K_{\sqrt{2}\sigma}(e_n(j) - e_n(i)). \tag{4}$$

From Eq. (1), we have the following:

$$e_n(j, i) = e_{n-1}(j, i) - \beta_n g_n(j, i), \tag{5}$$

where $e_n(j, i) = e_n(j) - e_n(i)$ and $g_n(j, i) = g_n(j) - g_n(i)$. According to (4) and (5), we have

$$IP(\beta_n) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} K_{\sqrt{2}\sigma}\big(e_{n-1}(j, i) - \beta_n g_n(j, i)\big).$$

Thus, to adjust the output parameter, the following maximization problem has to be solved:

$$R(\beta_n) = \max_{\beta_n} IP(\beta_n). \tag{6}$$

It should be noted that maximization of the information potential is equivalent to entropy minimization. Since $N^2$ is constant, the optimization problem (6) is equivalent to the following problem:

$$U(\beta_n) = \max_{\beta_n} \sum_{i=1}^{N} \sum_{j=1}^{N} K_{\sqrt{2}\sigma}\big(e_{n-1}(j, i) - \beta_n g_n(j, i)\big). \tag{7}$$

One of the disadvantages of MEE is that it admits undesired solutions. As a special case, when all elements $e(i)$ of the error vector take non-zero constant values that are all equal (or very close to each other), the pairwise differences $e_n(j, i)$ in problem (6) become zero (or close to zero). Consequently, the problem (7) is maximized (or reaches a predefined accuracy) and the algorithm stops, despite the large errors on the training samples. In order to avoid this undesired solution, we add the term $\sum_{i=1}^{N} e_n^2(i)$, which is the objective of the MSE problem, to the problem (7). After reformulation we have:

$$U(\beta_n) = \max_{\beta_n} \left( \sum_{i=1}^{N} \sum_{j=1}^{N} (1 - \lambda) K_{\sqrt{2}\sigma}\big(e_{n-1}(j, i) - \beta_n g_n(j, i)\big) - \lambda \sum_{i=1}^{N} e_n^2(i) \right), \tag{8}$$

where $\lambda \in (0, 1)$ is a constant. The efficiency of the MSE objective function in the presence of Gaussian noise has been studied and proven in the literature (Haykin 1994). Therefore, we formulate (8) as a bi-objective problem (the MEE and MSE objective functions) to take advantage of this characteristic when data are contaminated with Gaussian noise, and solve it with a weighting method. In practice, it is not known in advance which type of noise comes with the training data. Thus, the parameter $\lambda$ is added to (8) as a regularization parameter that balances the effect of the MSE and MEE terms in the objective function, and its value is determined over the interval (0, 1) by cross-validation. Based on this, we expect the proposed method to perform efficiently on data regardless of the type of noise.
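The undesired solution described above, and the effect of the MSE term in (8), can be seen on a small example. In this sketch the error vectors, $\sigma$, $\lambda$, and the function names are hypothetical values chosen for illustration.

```python
import numpy as np


def kernel_sum(errors, sigma):
    """Double sum of K_{sqrt(2) sigma} over all pairwise error differences, as in (7)."""
    s = np.sqrt(2.0) * sigma
    diffs = errors[:, None] - errors[None, :]
    return np.sum(np.exp(-diffs ** 2 / (2.0 * s ** 2)) / (np.sqrt(2.0 * np.pi) * s))


def regularized_objective(errors, sigma, lam):
    """Objective of (8): (1 - lam) * kernel sum - lam * sum of squared errors."""
    return (1.0 - lam) * kernel_sum(errors, sigma) - lam * np.sum(errors ** 2)


zero_errors = np.zeros(20)
biased_errors = np.full(20, 3.0)    # every sample is wrong by the same amount

# The pure MEE objective cannot tell the two error vectors apart.
print(np.isclose(kernel_sum(zero_errors, 0.5), kernel_sum(biased_errors, 0.5)))  # True

# With the MSE term, the zero-error solution scores strictly higher.
print(regularized_objective(zero_errors, 0.5, 0.1)
      > regularized_objective(biased_errors, 0.5, 0.1))  # True
```

The kernel sum depends only on differences of errors, so any constant shift of the error vector leaves it unchanged; the squared-error term removes this ambiguity.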

3.2 Problem Solving

In the previous subsection, an optimization problem was introduced to adjust the output parameter of the new node based on regularized error entropy minimization. The aim of this subsection is to provide a method for solving the introduced optimization problem. There are several methods for solving problem (8), including gradient descent based methods, expectation maximization (Yang et al. 2009) and the half-quadratic optimization technique (Rockafellar 1970). In this paper, the half-quadratic optimization technique is used to adjust the output parameter of the new node. Training of the new node is divided into two phases (see Figs. 1 and 2):

Phase 1 The input parameters are assigned randomly based on a desired sampling distribution. In this research, a uniform sampling distribution is used in the experiments for generating random input parameters.
Phase 2 The output parameter is adjusted by solving the optimization problem (8).

The parameters of the new node are fixed after they are adjusted. The number of hidden nodes is increased until two consecutive objective values are sufficiently close to each other. The following proposition is used to solve the problem (8). By employing it, a closed form for adjusting the output parameter of the new node is obtained.

Proposition 1 (Yuan and Hu 2009) There exists a convex conjugate function $\varphi$ such that

$$K_\sigma(x) = \sup_{\alpha \in \mathbb{R}^-} \left( \alpha \frac{x^2}{\sigma^2} - \varphi(\alpha) \right).$$

Moreover, for a fixed $x$, the supremum is reached at $\alpha = -K_\sigma(x)$.

After using Proposition 1 in Eq. (8), we obtain:

$$L(\beta_n, \alpha) = \max_{\alpha, \beta_n} \left( \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{ij} (1 - \lambda) \frac{\|e_{n-1}(j, i) - \beta_n g_n(j, i)\|^2}{4\sigma^2} - \lambda \sum_{i=1}^{N} e_n^2(i) \right), \tag{9}$$

where $\alpha_{ij} \in \mathbb{R}$, $i, j \in \{1, \ldots, N\}$, are the auxiliary variables appearing in half-quadratic optimization. The term $\varphi(\alpha)$ is removed from Eq. (9) since it does not depend on $\beta_n$ and would vanish after differentiation. In the presence of impulsive noise, the MEE objective function performs better than MSE. The role of the additional auxiliary variable $\alpha_{ij}$ is to take small values wherever the sample error is large. Thus, the method is robust to impulsive noise. In the presence of Gaussian noise, the method tends to increase the value of $\lambda$ (and thus the effect of the MSE objective function) in the optimization problem (9). In the special case $\lambda = 1$, the proposed method is equivalent to IELM introduced in Sect. 2. As a result, the performance of the two methods is equal. The advantage of the proposed method over IELM is that it performs efficiently on data regardless of the type of noise. For a fixed $\beta_n$, the following equation holds: $U(\beta_n) = L(\beta_n, \alpha)$. Inspired by the method used by Yuan and Hu (2009), we suggest the following iterative optimization process to find a local optimum of the problem (9):

$$\begin{cases}
\alpha_{ij}^{t+1} = -K_{\sqrt{2}\sigma}\big(e_{n-1}(j, i) - \beta_n^t g_n(j, i)\big), \\[4pt]
\beta_n^{t+1} = \arg\max_{\beta_n} \left( \displaystyle\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{ij}^{t+1} (1 - \lambda) \frac{\|e_{n-1}(j, i) - \beta_n g_n(j, i)\|^2}{4\sigma^2} - \lambda \sum_{i=1}^{N} e_n^2(i) \right),
\end{cases} \tag{10}$$

where $t$ indicates the $t$th iteration. According to Eq. (1), we have:

$$E_n^T E_n = (E_{n-1} - \beta_n G_n)^T (E_{n-1} - \beta_n G_n) = E_{n-1}^T E_{n-1} + \beta_n^2 G_n^T G_n - \beta_n G_n^T E_{n-1} - \beta_n E_{n-1}^T G_n.$$

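The double sum can be represented in a matrix form. Thus we have:

$$\begin{aligned}
\beta_n^{t+1} &= \arg\max_{\beta_n} \left( \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{ij}^{t+1} (1 - \lambda) \frac{\|e_{n-1}(j, i) - \beta_n g_n(j, i)\|^2}{4\sigma^2} - \lambda \sum_{i=1}^{N} e_n^2(i) \right) \\
&= \arg\max_{\beta_n} \operatorname{tr}\left( (1 - \lambda) \frac{(\mathbf{E}_{n-1} - \beta_n \mathbf{G}_n)^T \Lambda^{t+1} \circ (\mathbf{E}_{n-1} - \beta_n \mathbf{G}_n)}{4\sigma^2} - \lambda E_n E_n^T \right) \\
&= \arg\max_{\beta_n} \operatorname{tr}\Big( \frac{1 - \lambda}{4\sigma^2} \big[ (\mathbf{E}^T \Lambda^{t+1} \circ \mathbf{E}) + (\beta_n \mathbf{G}^T \Lambda^{t+1} \circ \mathbf{G} \beta_n) - (\beta_n \mathbf{G}^T \Lambda^{t+1} \circ \mathbf{E}) - (\mathbf{E}^T \Lambda^{t+1} \circ \mathbf{G} \beta_n) \big] \\
&\qquad\qquad - \lambda \big[ E_{n-1} E_{n-1}^T + \beta_n^2 G_n G_n^T - \beta_n G_n E_{n-1}^T - \beta_n E_{n-1} G_n^T \big] \Big),
\end{aligned} \tag{11}$$

where the symbol 'tr(·)' represents the matrix trace operation, '◦' is the Hadamard product, and, for brevity, $\mathbf{E} = \mathbf{E}_{n-1}$ and $\mathbf{G} = \mathbf{G}_n$. The auxiliary matrix variable $\Lambda^t$ in iteration $t$ is defined as:

$$\Lambda^t = \begin{bmatrix} \alpha_{11}^t & \cdots & \alpha_{1N}^t \\ \vdots & \ddots & \vdots \\ \alpha_{N1}^t & \cdots & \alpha_{NN}^t \end{bmatrix}$$

and

$$\mathbf{E}_n = \begin{bmatrix} e_n(1, 1) & \cdots & e_n(1, N) \\ \vdots & \ddots & \vdots \\ e_n(N, 1) & \cdots & e_n(N, N) \end{bmatrix}, \tag{12}$$

$$\mathbf{G}_n = \begin{bmatrix} g_n(1, 1) & \cdots & g_n(1, N) \\ \vdots & \ddots & \vdots \\ g_n(N, 1) & \cdots & g_n(N, N) \end{bmatrix}. \tag{13}$$

Taking the derivative with respect to $\beta_n$, the optimal solution to the problem (11) can be obtained as follows:

$$(1 - \lambda)\Big[ 2 \operatorname{tr}\big(\beta_n \mathbf{G}^T \Lambda \circ \mathbf{G}\big) - 2 \operatorname{tr}\big(\mathbf{G}^T \Lambda \circ \mathbf{E}\big) \Big] + 2\lambda \operatorname{tr}\big(\beta_n G_n G_n^T\big) - 2\lambda \operatorname{tr}\big(G_n E_{n-1}^T\big) = 0$$

and we get:

$$\beta_n = \frac{\operatorname{tr}\big[(1 - \lambda)\, \mathbf{G}^T \Lambda \circ \mathbf{E} + \lambda\, G_n E_{n-1}^T\big]}{\operatorname{tr}\big[(1 - \lambda)\, \mathbf{G}^T \Lambda \circ \mathbf{G} + \lambda\, G_n G_n^T\big]}. \tag{14}$$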

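According to Eq. (14), the optimal output weight for the $n$th hidden node is adjusted by the following iterative process:

$$\begin{cases}
\alpha_{ij}^{t+1} = -K_{\sqrt{2}\sigma}\big(e_{n-1}(i, j) - \beta_n^t g_n(i, j)\big), \\[4pt]
\beta_n^{t+1} = \dfrac{\operatorname{tr}\big[(1 - \lambda)\, \mathbf{G}^T \Lambda^{t+1} \circ \mathbf{E} + \lambda\, G_n E_{n-1}^T\big]}{\operatorname{tr}\big[(1 - \lambda)\, \mathbf{G}^T \Lambda^{t+1} \circ \mathbf{G} + \lambda\, G_n G_n^T\big]}.
\end{cases} \tag{15}$$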


This process is iterated until the sequence $\{L(\beta_n^t, \alpha^t),\ t = 1, 2, \ldots\}$ converges to its maximum value $L(\beta_n, \alpha)$ defined by (9). Convergence to the optimal value of the loss function leads to minimization of the error caused by noise in the data. The following proposition proves the convergence of the proposed method.

Proposition 2 The sequence $\{L(\beta_n^t, \alpha^t),\ t = 1, 2, \ldots\}$ generated by (9) converges.

Algorithm 1 RMSE-MEE-RCNN
Input: Training samples $\chi = \{x_i, y_i\}_{i=1}^{N}$, regularization parameter $\lambda$, kernel parameter $\sigma$.
Output: Number of hidden nodes $L$ and optimal output weights $\beta_n$, $n = 1, \ldots, L$.
Initialization: Set $n = 1$, $\varepsilon = 1$, and a sufficiently small initial value for the estimation of the loss function (9).
While $\varepsilon \ge 10^{-4}$
    Randomly initialize the input parameters $a_n$ and $b_n$ of the $n$th hidden node.
    Calculate hidden node vector $G$ and error vector $E$ by (12) and (13) and set $\beta_n^0 = 0$.
    For $t = 1, \ldots, 50$
        Update $\alpha^t$ and output weight $\beta_n^t$ according to (15).
    End
    Set $\alpha = \alpha^{50}$, $\beta_n = \beta_n^{50}$.
    Set $\varepsilon$ as the increment in the loss function (9).
    Increase $n$ by one.
End
Return $L = n$ and output weights $\beta_n$, $n = 1, \ldots, L$.

Proof First, we show that the following inequalities hold:

$$L(\beta_n^t, \alpha^t) \le L(\beta_n^t, \alpha^{t+1}) \le L(\beta_n^{t+1}, \alpha^{t+1}).$$

Starting with an initial value for $\beta_n^t$ and $\alpha^t$, $\alpha^{t+1}$ is updated by (15). Therefore, based on Proposition 1, the left side of the above inequality holds. In the next step, $\beta_n^{t+1}$ is updated by (15). Due to the maximization in Eq. (10), the right side of the inequality holds. (In fact, what is maximized in (10) is $K_\sigma$ in Proposition 1.) Therefore, the sequence $\{L(\beta_n^t, \alpha^t),\ t = 1, 2, \ldots\}$ is non-decreasing.

Second, we show that the sequence $L$ defined by (9) is bounded. The information potential (4) is a sum of Gaussian functions, which has an upper bound. In addition, $-\lambda \sum_i e_n^2(i)$ is bounded above by zero. Thus, $L$ is a summation over values which are bounded from above, and since it is monotonic, it converges. Therefore, the algorithm leads to an optimal output weight $\beta_n$.

Henceforth, we refer to our proposed method as the Regularized Mean Square Error and Minimum Error Entropy in a Randomized Constructive Neural Network (RMSE-MEE-RCNN). The proposed algorithm is summarized in Algorithm 1. The stopping condition is satisfied when the increment between two successive objective values is negligible (less than $10^{-4}$).
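Algorithm 1 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The output-weight update is obtained by setting the derivative of the surrogate in (10) to zero; it is written with the positive weights $w_{ij} = -\alpha_{ij}$ and keeps the factor $1/(4\sigma^2)$ explicit, so its scaling may differ from the printed form of (14). The sampling interval [−1, 1], the `max_nodes` safeguard, and all names are assumptions of this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def kernel(x, sigma):
    """Gaussian kernel of width sqrt(2) * sigma, as used in the information potential."""
    s2 = 2.0 * sigma ** 2
    return np.exp(-x ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)


def objective(e, sigma, lam):
    """Objective of (8) evaluated on an error vector e."""
    pair = e[:, None] - e[None, :]
    return (1.0 - lam) * kernel(pair, sigma).sum() - lam * np.sum(e ** 2)


def fit_output_weight(e_prev, g, sigma, lam, iters=50):
    """Half-quadratic iteration for one node; returns beta and the objective trace."""
    e_pair = e_prev[:, None] - e_prev[None, :]      # pairwise differences of e_{n-1}
    g_pair = g[:, None] - g[None, :]                # pairwise differences of g_n
    c = (1.0 - lam) / (4.0 * sigma ** 2)
    beta, trace = 0.0, []
    for _ in range(iters):
        w = kernel(e_pair - beta * g_pair, sigma)   # w = -alpha, evaluated at current beta
        num = c * np.sum(w * g_pair * e_pair) + lam * (g @ e_prev)
        den = c * np.sum(w * g_pair ** 2) + lam * (g @ g)
        beta = num / den                            # maximizer of the quadratic surrogate
        trace.append(objective(e_prev - beta * g, sigma, lam))
    return beta, trace


def train_rcnn(X, y, sigma=0.5, lam=0.5, eps=1e-4, max_nodes=100, seed=0):
    rng = np.random.default_rng(seed)
    e = y.astype(float).copy()
    nodes, previous = [], objective(e, sigma, lam)
    while len(nodes) < max_nodes:                   # max_nodes is an added safeguard
        a = rng.uniform(-1.0, 1.0, size=X.shape[1]) # Phase 1: random input parameters
        b = rng.uniform(-1.0, 1.0)
        g = sigmoid(X @ a + b)
        beta, trace = fit_output_weight(e, g, sigma, lam)   # Phase 2
        e = e - beta * g
        nodes.append((a, b, beta))
        if trace[-1] - previous < eps:              # increment of the objective
            break
        previous = trace[-1]
    return nodes, e


X = np.linspace(-5.0, 5.0, 60).reshape(-1, 1)
y = np.sinc(X[:, 0] / np.pi)
nodes, residual = train_rcnn(X, y)
print(len(nodes), np.sqrt(np.mean(residual ** 2)))
```

Each pass of the inner loop maximizes a quadratic lower bound of the objective that touches it at the current $\beta_n^t$, which is why the recorded trace is non-decreasing, in line with Proposition 2.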

4 Experimental Results

To evaluate RMSE-MEE-RCNN, with two synthetic and six benchmark datasets, we compare its performance with that of IELM (Huang and Chen 2006), which is the method most similar to our proposed algorithm, in Sect. 4.1. In addition, we evaluate RMSE-MEE-RCNN in comparison with three recent methods, ELM (Huang and Chen 2006), B-ELM (Yang et al. 2012) and MLP-MEE (Bessa et al. 2009), in Sect. 4.2. We obtain the best values for $\lambda$ and $\sigma$ by cross-validation for the first node only, and use these values for training all nodes. Parameter $\eta$ is the noise generated for each sample separately by the following function:

$$\eta = 0.95\, N(0, 10^{-4}) + 0.05\, N(0, 10), \tag{16}$$

where $N(m, s)$ is a Gaussian distribution function with the mean $m$ and variance $s$. According to the definition (16), $\eta$ takes large values with probability 0.05 and small values with probability 0.95. As a result, approximately 5% of the whole data take impulsive noise and the rest take Gaussian noise. In all the experiments, the random value $\eta(i)$ is generated for the $i$th sample and added to its desired output. In this way, a combination of Gaussian and impulsive noise is added to the datasets.

For each dataset, 20 trials are performed. In each trial, 50% of the samples are chosen randomly for training and the remaining are used for testing. The value of the Root Mean Square Error (RMSE) for testing and the number of nodes are saved after each trial. Then, the average of the RMSE values and the number of nodes for the 20 trials are calculated and reported. Trainings are performed using the best values of $\lambda$ and $\sigma$, which have already been obtained. The lowest error and smallest network size are shown in bold in each table. Datasets are contaminated with noise (16) in all experiments. To implement our proposed method, the sigmoid activation function

$$g(x) = \frac{1}{1 + e^{-x}}$$

is used for all hidden nodes.
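The noise model (16) can be sampled as a two-component Gaussian mixture, which is the reading suggested by the statement that $\eta$ takes large values with probability 0.05. The sketch below follows that reading; the function name, its parameters, and the seed are assumptions of this example.

```python
import numpy as np


def mixed_noise(size, rng, p_impulse=0.05, var_small=1e-4, var_large=10.0):
    """Sample eta from 0.95 * N(0, 1e-4) + 0.05 * N(0, 10), read as a mixture density.

    Returns the noise values and a boolean mask marking the impulsive samples.
    """
    impulsive = rng.random(size) < p_impulse
    std = np.where(impulsive, np.sqrt(var_large), np.sqrt(var_small))
    return rng.normal(0.0, 1.0, size) * std, impulsive


rng = np.random.default_rng(0)
eta, impulsive = mixed_noise(10_000, rng)
print(impulsive.mean())        # fraction of impulsive samples, close to 0.05
print(eta[~impulsive].std())   # close to 0.01, the square root of 1e-4
print(eta[impulsive].std())    # close to the square root of 10
```

Note that `N(m, s)` is parameterized by the variance in the text, so the standard deviations passed to the sampler are the square roots of $10^{-4}$ and $10$.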

4.1 Comparison of RMSE-MEE-RCNN with IELM

In this section, the experiments are performed with synthetic and benchmark datasets.


Experiments with Synthetic Datasets

In the following, two synthetic regression datasets are used to compare the performance of RMSE-MEE-RCNN and IELM (Huang and Chen 2006). When using the Sinc dataset, the data is generated with the following function:

$$y = \operatorname{Sinc}(x) = \frac{\sin(x)}{x}.$$



Fig. 3 Comparison of RMSE-MEE-RCNN and IELM with the Sinc dataset in the presence of noise with λ = 0.05 and σ = 0.5

Data is sampled from the interval [−5 : 0.1 : 5]. The random value $\eta(i)$ is generated using (16) for each sample $x_i$ and then added to its desired output. In more detail, $y_i = \operatorname{Sinc}(x_i) + \eta(i)$ for the $i$th sample. As mentioned earlier, the performance of the two methods is equal in the presence of Gaussian noise. Figure 3 illustrates the results of RMSE-MEE-RCNN and IELM with the Sinc dataset in the presence of noise (16). In this experiment, seven data points are turned into outliers. The regularization parameter $\lambda = 0.05$ and the kernel parameter $\sigma = 0.5$ are obtained by cross-validation over the set {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 10} and are used for this experiment. The parameter $\lambda$ provides a tradeoff between MSE and MEE. The test RMSE for RMSE-MEE-RCNN and IELM is 0.0892 and 0.2305, respectively. According to the results shown in Fig. 3, RMSE-MEE-RCNN is more robust than IELM.

With the Function1 dataset, data is generated by sampling the following function:

$$y(x_1, x_2) = \operatorname{Function1}(x_1, x_2) = x_1 \exp\!\big(-(x_1^2 + x_2^2)\big), \tag{17}$$

where $x_1, x_2 \in$ [−2 : 0.2 : 2]. Figure 4 shows the original function (17). An MSE-based method such as IELM has poor performance when facing non-Gaussian noise such as impulsive noise. According to Fig. 5, IELM performs poorly in this condition. As shown in Fig. 6, the proposed method is more robust to the combination of Gaussian and impulsive noise than IELM. The regularization and kernel parameters are chosen from the set {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 10} and set to $\lambda = 0.1$ and $\sigma = 0.5$ in this experiment.
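The two synthetic datasets can be generated as follows. This sketch produces the noise-free targets only, to which the noise (16) would then be added; the function names are hypothetical, and the grids are built with `np.linspace` so that both end points of each interval are included.

```python
import numpy as np


def sinc_dataset():
    """Sinc(x) = sin(x) / x sampled on [-5 : 0.1 : 5]."""
    x = np.linspace(-5.0, 5.0, 101)
    y = np.sinc(x / np.pi)      # np.sinc(t) is sin(pi t) / (pi t), so this equals sin(x) / x
    return x.reshape(-1, 1), y


def function1_dataset():
    """Function1(x1, x2) = x1 * exp(-(x1^2 + x2^2)) on the grid [-2 : 0.2 : 2]^2."""
    axis = np.linspace(-2.0, 2.0, 21)
    x1, x2 = np.meshgrid(axis, axis)
    y = x1 * np.exp(-(x1 ** 2 + x2 ** 2))
    return np.column_stack([x1.ravel(), x2.ravel()]), y.ravel()


X_sinc, y_sinc = sinc_dataset()
X_f1, y_f1 = function1_dataset()
print(X_sinc.shape, X_f1.shape)  # (101, 1) (441, 2)
```

Using `np.sinc` avoids a division by zero at $x = 0$, where the function takes the value 1.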



Fig. 4 Original Function1 (17)

Fig. 5 Result of IELM with the Function1 dataset in the presence of noise with λ = 0.1 and σ = 0.5


Experiments with Benchmark Datasets

The following experiments are performed with six real regression datasets in the presence of noise (16). Table 1 shows the set descriptions, including the number of training and testing samples and features. The sets Autoprice, Cleveland, Pyrim and Housing are taken from the UCI Machine Learning Repository (Newman et al. 1998), Strikes, Bodyfat and Cloud from Statlib (Igelnik and Pao 1995) and Quake and Baskball from (Simonoff 1996). For each dataset, the best values of the two parameters $\lambda$ and $\sigma$ are obtained by cross-validation over {0, 0.09, ..., 0.99} and {0.001, 0.005, ..., 1}, respectively.

Fig. 6 Result of RMSE-MEE-RCNN with the Function1 dataset in the presence of noise with λ = 0.1 and σ = 0.5

Table 1 Specification of benchmark datasets

Dataset      #Feature   Observation #train   Observation #test
Housing      13         253                  253
Strikes      6          313                  312
Quake        3          1089                 1089
Bodyfat      14         126                  126
Baskball     4          48                   48
Autoprice    15         80                   79
Pyrim        27         37                   37
Cloud        7          54                   54
Cleveland    13         149                  148

The mean and deviation of the 20 test RMSEs, the average number of nodes in the 20 trials, and the parameter $\lambda$ for each dataset are reported in Table 2. The best results are shown in bold. As can be seen from the table, RMSE-MEE-RCNN outperforms IELM in all experiments. The average number of hidden nodes in both networks is approximately the same. In most cases, however, the proposed method achieved better results than IELM with fewer nodes. Moreover, according to the standard deviation of the results, RMSE-MEE-RCNN is more robust. In terms of training time, IELM is a faster method. The increase in time in RMSE-MEE-RCNN is due to the inclusion of MEE in the objective function. In fact, computational complexity is the price paid for improving generalization performance and robustness to different types of noise.
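The evaluation protocol behind these numbers (20 random 50/50 splits, then the mean and deviation of the test RMSE) can be sketched as below. The least-squares model is only a stand-in used to exercise the protocol; it is not one of the compared methods, and the function names, the seed, and the rule that the training half receives the extra sample for odd sizes (chosen to match Table 1) are assumptions.

```python
import numpy as np


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def repeated_holdout(X, y, fit, predict, trials=20, seed=0):
    """Mean and standard deviation of the test RMSE over random 50/50 splits."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(trials):
        order = rng.permutation(len(y))
        half = (len(y) + 1) // 2            # training half gets the extra sample
        train, test = order[:half], order[half:]
        model = fit(X[train], y[train])
        scores.append(rmse(y[test], predict(model, X[test])))
    return float(np.mean(scores)), float(np.std(scores))


# Stand-in model (ordinary least squares), used only to demonstrate the protocol.
def fit_linear(X, y):
    design = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef


rng = np.random.default_rng(1)
X = rng.normal(size=(625, 6))
y = X @ np.arange(1.0, 7.0) + 2.0
print(repeated_holdout(X, y, fit_linear, predict_linear))
```

Any model exposing a `fit(X, y)` and a `predict(model, X)` function can be plugged into `repeated_holdout` in place of the stand-in.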



Table 2 Performance comparison of the two methods with six benchmark datasets

            RMSE-MEE-RCNN                                IELM
Dataset     Mean RMSE  Deviation  Mean nodes  λ      Mean RMSE  Deviation  Mean nodes
Housing     0.1132     0.0138     187.92      0.89   0.1412     0.0197     186.76
Strikes     0.2794     0.0082     128.06      0.89   0.2805     0.0080     128.84
Quake       0.1746     0.0066     137.66      0.69   0.1767     0.0066     156.54
Bodyfat     0.0934     0.0255     170.18      0.79   0.1636     0.0533     178.44
Baskball    0.1680     0.0252     96.96       0.79   0.2119     0.0462     125.78
Autoprice   0.3341     0.0461     127.44      0.89   0.3525     0.0496     119.18

4.2 Comparison of RMSE-MEE-RCNN with ELM, B-ELM and MLP-MEE

In this section, the performance of RMSE-MEE-RCNN is compared with the following methods:

ELM Huang et al. (2012) used an extreme learning machine with a fixed, sufficiently large number of hidden nodes (1000 random hidden nodes are used in their experiment). The output weight vector is then adjusted according to their proposed formula.

B-ELM Yang et al. (2012) extended the IELM method (Huang and Chen 2006) to a bidirectional extreme learning machine in which odd nodes are generated randomly and even nodes are created using the formulation proposed for them. The output weights are adjusted according to the IELM method.

MLP-MEE The algorithm is proposed by Bessa et al. (2009) on a multilayer perceptron with the MEE objective function. In our experiment, a single layer perceptron is used. The number of hidden nodes is determined by cross-validation.

The experiments are performed with synthetic and benchmark datasets.


Experiments with Synthetic Datasets

Four synthetic regression datasets are used for evaluation. With each dataset, experiments are performed in the presence of noise (16). Synthetic datasets are generated by the functions

$$f_1(x_1, x_2) = 10.391\,(x_1 - 0.4)\,\bigl((x_2 - 0.6) + 0.36\bigr),$$

$$r(x_1, x_2) = (x_1 - 0.5)^2 + (x_2 - 0.5)^2,$$

$$f_2(x_1, x_2) = 24.235\, r^2(x_1, x_2)\,\bigl(0.75 - r^2(x_1, x_2)\bigr),$$

$$r(x) = (x - 0.5)^2,$$

$$f_3(x_1, x_2) = 42.659\,\Bigl(0.1 + r(x_1)\bigl(0.05 + r^4(x_1) - 10\, r^2(x_1)\, r^2(x_2) + 5\, r^4(x_2)\bigr)\Bigr),$$

$$f_4(x_1, x_2) = 1.3356\,\Bigl(1.5(1 - x_1) + e^{2x_1 - 1} \sin\bigl(3\pi (x_1 - 0.6)^2\bigr) + e^{3(x_2 - 0.5)} + \sin\bigl(4\pi (x_2 - 0.9)^2\bigr)\Bigr),$$

where $(x_1, x_2)$ are random numbers from the interval [−1 : 0.01 : 1] generated by a uniform distribution. Target values are also normalized into [−1, 1]. The random vector η is added to the target values. The average of the test RMSE values obtained by the four methods, and also the average number of nodes produced by RMSE-MEE-RCNN in the 20 trials, are reported in Table 3. ELM has a fixed structure set to 1000 hidden nodes in all experiments. In B-ELM, the maximum number of nodes is set to 500, so the algorithm stopped when it reached this limit. In MLP-MEE, the number of nodes is determined by cross-validation over the set {10, 20, 30, 40}. However, the best structure was always 10 nodes in all experiments. The results show that RMSE-MEE-RCNN and MLP-MEE are more robust than IELM and B-ELM in the presence of noise. RMSE-MEE-RCNN achieves the smallest mean error in all experiments. Moreover, B-ELM was the least accurate. In terms of network size, MLP-MEE gave the most compact network. The regularization and kernel parameters are set to λ = 0.05 and σ = 0.5, respectively, for all datasets.

Table 3 Performance comparison of four methods with four synthetic datasets with noise and parameters λ = 0.5 and σ = 0.5

           RMSE-MEE-RCNN            IELM        B-ELM       MLP-MEE
Dataset    Mean RMSE  Mean Nodes    Mean RMSE   Mean RMSE   Mean RMSE
f1         0.1478     34.50         0.2746      0.5685      0.1567
f2         0.1171     17.75         0.4236      0.7143      0.1178
f3         0.1906     23.65         0.2660      0.5118      0.2040
f4         0.1616     28.50         0.3864      0.5958      0.1636


Experiments with Benchmark Datasets

The following experiments are performed with three benchmark datasets in the presence of noise (16). The dataset descriptions are in Table 1. The same values λ = 0.5 and σ = 0.5 are used to train the network in this section. The averages of RMSE and nodes from 20 trials are reported in Table 4. According to Table 4, RMSE-MEE-RCNN was the most accurate method even in real datasets in the presence of Gaussian and impulsive noise. Moreover, it gave a much more compact network compared to ELM and B-ELM. MLP-MEE has also performed efficiently with a small error difference compared to RMSE-MEE-RCNN,



which confirms that using the MEE objective function can improve performance when data is contaminated with impulsive noise.

5 Conclusion

This paper presents a new random constructive neural network trained based on minimum error entropy and minimum mean square error. There is one hidden layer in the network with random input parameters. The optimal output parameter for each newly added node is adjusted using a combination of MSE and MEE objective functions. The proposed method, referred to as RMSE-MEE-RCNN, has several advantages. First, it is robust regardless of the type of noise (Gaussian or impulsive noises). The regularization term optimizes the robustness for each type of noise. Second, it utilizes a randomized neural network to reduce the computational burden. Employing the MEE objective function in learning machines results in high computational complexity. In our method, only one parameter (the output parameter of the newly added node) is adjusted by the proposed objective function at each step, and the other parameters are assigned randomly. Third, it utilizes a constructive neural network to better optimize nonconvex MEE objective functions with the desired accuracy. Fourth, the algorithm automatically determines the number of hidden nodes. The experiments are performed in comparison with the Incremental Extreme Learning Machine (IELM) and three other recent methods with several synthetic and benchmark datasets in the presence of a combination of Gaussian and impulsive noise. The RMSE-MEE-RCNN performs efficiently in terms of both accuracy and robustness. Although random generation of inputs has reduced the computational burden of our method, further work is needed to improve running time in large-size datasets.

Table 4 Performance comparison of four methods with three benchmark datasets with noise and parameters λ = 0.5 and σ = 0.5

            RMSE-MEE-RCNN            IELM        B-ELM       MLP-MEE
Dataset     Mean RMSE  Mean Nodes    Mean RMSE   Mean RMSE   Mean RMSE
Cleveland   1.3577     6.70          7.4401      2.5936      1.6773
Pyrim       0.3344     1.40          5.2393      1.8117      0.9971
Cloud       0.2970     20.75         1.2537      0.6283      0.3312



References

Alexandre LA, de Sá JPM (2006) Error entropy minimization for LSTM training. In: Artificial neural networks–ICANN’06, Berlin. Springer, pp 244–253.
Anders U, Korn O (1999) Model selection in neural networks. Neural Netw 12(2):309–323. https://
Barron AR (1993) Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans Inf Theory 39(3):930–945.
Bessa RJ, Miranda V, Gama J (2009) Entropy and correntropy against minimum square error in offline and online three-day ahead wind power forecasting. IEEE Trans Power Syst 24(4):1657–1666.
Erdogmus D, Principe JC (2000) Comparison of entropy and mean square error criteria in adaptive system training using higher order statistics. In: Proceedings of the second international workshop on independent component analysis and blind signal separation, pp 75–80.
Erdogmus D, Principe JC (2002) An error-entropy minimization algorithm for supervised training of nonlinear adaptive systems. IEEE Trans Signal Proc 50:1780–1786. TSP.2002.1011217
Erdogmus D, Principe JC (2003) Convergence properties and data efficiency of the minimum error entropy criterion in adaline training. IEEE Trans Signal Proc 51:1966–1978. 1109/TSP.2003.812843
Fahlman SE, Lebiere C (1989) The cascade-correlation learning architecture. In: Touretzky DS (ed) Advances in neural information processing systems, vol 2, pp 524–532, San Francisco, CA. Morgan Kaufmann
Hagan MT, Demuth HB, Beale MH (1996) Neural network design, Boston
Han L (2007a) Kernel partial least squares (K-PLS) for scientific data mining. PhD thesis, Rensselaer Polytechnic Institute, Troy, NY
Han S (2007b) A family of minimum Renyi’s error entropy algorithm for information processing. PhD thesis, University of Florida, Gainesville, FL
Haykin S (1994) Neural networks: a comprehensive foundation. MacMillan, New York
Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approximators. Neural Netw 2:359–366.
Huang GB, Chen L (2007) Convex incremental extreme learning machine. Neurocomputing 70:3056–3062.
Huang GB, Chen L, Siew CK (2006) Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Trans Neural Netw 17:879–892. https://
Huang GB, Chen YQ, Babri HA (2000) Classification ability of single hidden layer feedforward neural networks. IEEE Trans Neural Netw 11:799–801.
Huang GB, Zhou H, Ding X, Zhang R (2012) Extreme learning machine for regression and multiclass classification. IEEE Trans Syst Man Cybern Part B: Cybern 42:513–529
Igelnik B, Pao YH (1995) Stochastic choice of basis functions in adaptive function approximation and the functional-link net. IEEE Trans Neural Netw 6:1320–1329. 471375
Karmitsa N, Taheri S, Joki K, Mäkinen P, Bagirov A, Mäkelä MM (2020) Hyperparameter-free NN algorithm for large-scale regression problems. Technical reports 1213, Turku Center for Computer Science (2020)
Kwok TY, Yeung DY (1997) Objective functions for training new hidden units in constructive neural networks. IEEE Trans Neural Netw 8:1131–1148.
Leshno M, Lin VY, Pinkus A, Schocken S (1993) Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw 6:861–867. https://
Lin S, Liu X, Fang J, Xu Z (2015) Is extreme learning machine feasible? A theoretical assessment (Part II). IEEE Trans Neural Netw Learn Syst 26:21–34



Liu X, Lin S, Fang J, Xu Z (2015) Is extreme learning machine feasible? A theoretical assessment (Part I). IEEE Trans Neural Netw Learn Syst 26:7–20
Nayyeri M, Maskooki A, Monsefi R (2017) A new sparse learning machine. Neural Proc Lett 46:15–28.
Newman D, Hettich S, Blake C, Merz C, Aha D (1998) UCI repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine, CA. Accessed 26 Feb 2018
Pao YH, Takefuji Y (1992) Functional-link net computing: theory, system architecture, and functionalities. Computer 25:76–79.
Park J, Sandberg IW (1991) Universal approximation using radial-basis-function networks. Neural Comput 3:246–257.
Parzen E (1962) On estimation of a probability density function and mode. Ann Math Stat 33:1065–1076.
Principe JC (2010) Information theoretic learning: Rényi’s entropy and kernel perspectives. Springer
Reed R (1993) Pruning algorithms-a survey. IEEE Trans Neural Netw 4:740–747. 10.1109/72.248452
Rényi A (1984) A diary on information theory. Akademiai Kiado
Rockafellar R (1970) Convex analysis. Princeton University Press
Santos JM, Alexandre LA, Marques de Sá JM (2004a) The error entropy minimization algorithm for neural network classification. In: International conference on recent advances in soft computing, pp 92–97
Santos JM, Marques de Sá JM, Alexandre LA, Sereno F (2004b) Optimization of the error entropy minimization algorithm for neural network classification. In: Dagli CH, Buczal A, Enke D, Embrechts M, Ersoy O (eds) Intelligent engineering systems through artificial neural networks, vol 14. ASME Press, pp 81–86
Scarselli F, Tsoi AC (1998) Universal approximation using feedforward neural networks: a survey of some existing methods, and some new results. Neural Netw 11:15–37
Schmidt WF, Kraaijveld MA, Duin RPW (1992) Feedforward neural networks with random weights. In: Pattern recognition, conference B: 11th IAPR international conference on pattern recognition methodology and systems proceedings, vol 2, pp 1–4.
Simonoff JS (1996) Smoothing methods in statistics. Springer, New York. http://people.stern.nyu.edu/jsimonof/SmoothMeth/Data/ASCII. Accessed 26 Feb 2018
Tyukin IY, Prokhorov DV (2009) Feasibility of random basis function approximators for modeling and control. In: IEEE international symposium on intelligent control, pp 1391–1396. https://doi.org/10.1109/CCA.2009.5281061
Yang SH, Zha H, Zhou SK, Hu BG (2009) Variational graph embedding for globally and locally consistent feature extraction. In: Buntine W, Grobelnik M, Mladenić D, Shawe-Taylor J (eds) Machine learning and knowledge discovery in databases. Springer, Berlin, pp 538–553
Yang Y, Wang Y, Yuan X (2012) Bidirectional extreme learning machine for regression problem and its learning effectiveness. IEEE Trans Neural Netw 23:1498–505. TNNLS.2012.2202289
Yuan XT, Hu BG (2009) Robust feature extraction via information theoretic learning. In: ICML ’09 Proceedings of the 26th annual international conference on machine learning, pp 1193–1200.

On the Role of Taylor’s Formula in Machine Learning

Tommi Kärkkäinen

Abstract The classical Taylor’s formula is an elementary tool in mathematical analysis and function approximation. Its role in the optimization theory, whose data-driven variants have a central role in machine learning training algorithms, is well-known. However, utilization of Taylor’s formula in the derivation of new machine learning methods is not common and the purpose of this article is to introduce such use cases. Both a feedforward neural network and a recently introduced distance-based method are used as data-driven models. We demonstrate and assess the proposed techniques empirically both in unsupervised and supervised learning scenarios.

Keywords Taylor’s formula · Machine learning · Neural networks · Distance-based methods

T. Kärkkäinen (B)
Faculty of Information Technology, University of Jyväskylä, P.O. Box 35, Jyväskylä 40014, Finland
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
P. Neittaanmäki and M.-L. Rantalainen (eds.), Impact of Scientific Computing on Science and Society, Computational Methods in Applied Sciences 58,

1 Introduction

Data-driven methods have increased their role and popularity within computational sciences during the 21st century. There are many disciplines that deal with the construction of parametric models that try to encapsulate the main characteristics of a dataset. The data can be given a priori or appear during the model construction in an on-line, streaming fashion. Here we restrict ourselves to the former case of given data. Data-driven modelling approaches include (but are not limited to) statistical approaches (Hastie et al. 2001), data mining and knowledge discovery algorithms (Han et al. 2011), neural computation techniques (Haykin 2009), and pattern recognition and machine learning methods (Bishop 2006). Most recently, machine learning and especially deep learning techniques (LeCun et al. 2015; Schmidhuber 2015; Goodfellow et al. 2016), where the model is composed of many transformation layers, seem to be the most actively addressed branch of artificial intelligence, especially in applications.

The term “Machine Learning” (ML) refers to the generation of trained machines, actually computer programs to populate and use data-driven models, that learn from a set of observations represented in the so-called training data. Typically three different classes of machine learning techniques are distinguished:

1. Unsupervised learning aiming to discover and summarize properties and characteristics of the data generation mechanism,
2. Supervised learning attempting to learn an unknown mapping between input and output spaces using given pairs of input-output examples,
3. Reinforcement learning where the model’s adaptation is based on feedback from the environment representing the degree of fulfilment of the goals of the learner (Simeone 2018).

Therefore, the last class could also be considered as a special case of supervised learning. This is the stance taken in this paper and, therefore, we consider only unsupervised and supervised machine learning techniques in what follows.

In the field of unsupervised learning, the most common tasks that are addressed with different techniques are dimension reduction and clustering. They both reduce the volume (i.e., number of observations or the dimension of observations) of data while attempting to maintain the intrinsic, relevant information. In the former, the dimension of high-dimensional data (for instance, a set of images) is reduced to simplify its data-based modelling. In the latter, original observations are joined as groups of similar observations (like similar images), being dissimilar to data in other groups. These groups, clusters, can be strictly separated or overlap each other when using, e.g., fuzzy clustering techniques (Saxena et al. 2017).
In supervised learning, the main focus is on the derivation of a predictive model encapsulating the unknown relationship exemplified by the input-output examples. Here we restrict ourselves to models whose basic construction yields a vector-valued output in the output dimension. For such models, only the encoding of the desired output distinguishes the main use cases of uni- or multi-target regression and classification (with trivial postprocessing to retrieve the suggested class in classification). Note that this restriction rules out one of the most popular machine learning techniques of the 21st century, the Support Vector Machine (Boateng et al. 2020). Even if unsupervised and supervised learning are introduced and treated separately in basic textbooks, there are many interesting approaches where these classes of learning can be combined. Most commonly, in deep learning an unsupervised, pretrained model can be used as is or with finetuning for processing the training data in the (Deng et al. 2017; Sun et al. 2019; Sun et al. 2019; Kim et al. 2020). Techniques integrating deep learning based dimension reduction and clustering were reviewed and developed in Min et al. (2018), Diallo et al. (2021). Furthermore, integration of clustering-based data division and individual model construction for the subsets using distance-based learning machines was proposed and tested in Hämäläinen et al. (2021). Such techniques generalize the earlier approaches of cluster-based regression (Torgo and DaCosta 2003). To this end, the distance-based methods use the core



concept of unsupervised learning, dissimilarity, in supervised learning (Kärkkäinen 2019). The purpose of this article is to introduce two cases, with two data-driven models, in which Taylor’s formula is used in the derivation of a machine learning method. Both unsupervised and supervised learning scenarios are considered, with emphasis on reducing the needed volume of data through dimension reduction or feature selection. The content of this chapter is as follows. After this introduction, we present the general framework and two Taylor’s formula inspired machine learning techniques in Sect. 2. A demonstrative set of computational experiments is then given in Sect. 3. Finally, the chapter is concluded in Sect. 4. Figures illustrating the computational experiments are given at the end of the chapter.

2 Methods

In this section, we present the preliminaries and depictions of the computational methods.

2.1 Basic Formulations

Let us first define data. We assume that a training set of $N$ input data vectors $X = \{x_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^n$, is given. For the supervised learning in case of classification, we assume that each observation is attached with a label, which defines the whole set of labels $L = \{l_i\}_{i=1}^{N}$, where $l_i \in \{1, \dots, k\}$ for $k$ classes. As anticipated in the introduction, we will use the 1-of-$k$ coding to replace the labels with a set of vectors $\{e_i\}_{i=1}^{N}$, where $e_i \in \mathbb{R}^k$ is the $l_i$th unit vector, i.e., the indicator of class $l_i$. Finally, we assume that a separate validation set $X_v = \{(x_v)_i\}_{i=1}^{N_v}$ with the corresponding labels $L_v = \{(l_v)_i\}_{i=1}^{N_v}$ is given, which can be used to estimate the quality of the constructed classifier.

Determination of a machine learning model typically targets to cope with two criteria:

1. Generalization of training data with sufficient accuracy (cf. over-learning) by minimizing a reconstruction error,
2. Restriction to as simple a model as possible by minimizing the model’s complexity.

Clearly these targets are contradictory, because trivial models have the lowest accuracy and, especially when the model is a so-called universal approximator (Kärkkäinen 2019), more complex and more flexible models with more adaptive parameters can approximate more complex functions and class boundaries. Indeed, multicriteria optimized multilayered perceptrons were already considered in Teixeira et al.



(2000), but it is much more common to use single-objective, directly combined forms of cost functions in ML. Interestingly, such loss functions are direct analogues of the formulations used in the optimization-based image processing and denoising as summarized in Myllykoski et al. (2015). Hence, in many cases parameters $\{W\}$ of a machine learning model are obtained through the minimization of a cost function that is composed of a direct sum of a fidelity term $J_f$ and a regularization term $J_r$:

$$J(\{W\}) = J_f(\{W\}; \{X, Y\}) + J_r(\{W\}). \qquad (1)$$


As our notation above indicates, the data-fitting type fidelity term includes the training data and the regularization term is defined with respect to the unknown parameters {W} of the model. Compared to the usually given formulations in optimization, the parameters in (1) are given as matrices instead of the usual vector-based form. The convenience of such a notation becomes evident below, where the two terms are specified in Taylor’s formula inspired machine learning methods.
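The 1-of-k label coding used in this section can be illustrated with a minimal sketch. The function name `one_of_k` is hypothetical and only serves this illustration; labels are assumed to be integers in {1, ..., k} as in the text.

```python
def one_of_k(labels, k):
    """Map labels l_i in {1, ..., k} to the l_i-th unit vector of R^k."""
    coded = []
    for label in labels:
        if not 1 <= label <= k:
            raise ValueError(f"label {label} is outside 1..{k}")
        e = [0.0] * k
        e[label - 1] = 1.0  # labels are 1-based, list indices are 0-based
        coded.append(e)
    return coded


labels = [2, 1, 3]
E = one_of_k(labels, k=3)
print(E)  # [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
```

With such a coding, the suggested class of a vector-valued model output can be retrieved by taking the index of its largest component.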

2.2 Taylor’s Formula

According to the online Encyclopaedia Britannica (Hosch 2009), Taylor’s series is named for the English mathematician Brook Taylor, who proposed the power series form in the early 18th century. More about the history and a depiction of application areas are given, e.g., in Barrio (2005). For our purposes, an appropriate more recent form of Taylor’s formula reads as follows: The value of a sufficiently smooth real-valued function $f(x)$ can be approximated as

$$f(x) = f(x_0) + \nabla f(x_0)^T (x - x_0) + \frac{1}{2} (x - x_0)^T \nabla^2 f(x_0) (x - x_0) + \dots \qquad (2)$$

Here, $x_0 \in \mathbb{R}^n$, $\nabla f(x_0)$ refers to the gradient vector at $x_0$ and $\nabla^2 f(x_0)$ to the Hessian matrix, respectively. Equation (2) is valid locally, in the neighborhood of $x_0$. Moreover, within the line segment connecting the two points $x_0$ and $x$, there exists $z \in \mathbb{R}^n$ such that the following identity holds (Dennis and Schnabel 1996, Lemma 4.1.5):

$$f(x) = f(x_0) + \nabla f(x_0)^T (x - x_0) + \frac{1}{2} (x - x_0)^T \nabla^2 f(z) (x - x_0). \qquad (3)$$


Because of the nonnegativity of the last term when $\nabla^2 f(z)$ is at least positive semidefinite, i.e., when $f$ is convex, this formula shows that $x \in \mathbb{R}^n$ is a local minimizer of $f$ when $\nabla f(x) = 0$ (Dennis and Schnabel 1996, Lemma 4.3.1).
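The local validity of the second-order expansion can be checked numerically. The following sketch is only an illustration: the test function f(x) = exp(x_1) + x_1 x_2^2, the expansion point, and the step sizes are assumptions chosen for this example and do not come from the chapter. Since the first neglected term is of third order, halving the step should reduce the error by roughly a factor of eight.

```python
import math


def f(x):
    # Illustrative smooth test function (an assumption of this sketch).
    return math.exp(x[0]) + x[0] * x[1] ** 2


def grad(x):
    return [math.exp(x[0]) + x[1] ** 2, 2.0 * x[0] * x[1]]


def hess(x):
    return [[math.exp(x[0]), 2.0 * x[1]],
            [2.0 * x[1], 2.0 * x[0]]]


def taylor2(x, x0):
    """Second-order Taylor approximation of f around x0."""
    d = [x[0] - x0[0], x[1] - x0[1]]
    g, H = grad(x0), hess(x0)
    linear = g[0] * d[0] + g[1] * d[1]
    quad = sum(d[i] * H[i][j] * d[j] for i in range(2) for j in range(2))
    return f(x0) + linear + 0.5 * quad


x0 = [0.0, 1.0]
errors = []
for h in (0.1, 0.05, 0.025):
    x = [x0[0] + h, x0[1] + h]
    errors.append(abs(f(x) - taylor2(x, x0)))
    print(h, errors[-1])
```

The printed errors shrink cubically with the step, which is the behavior expected from a truncation after the Hessian term.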



2.3 Autoencoder Inspired by Taylor’s Formula

We consider a special form of autoencoder, which performs nonlinear dimension reduction. The proposed composition of operators, with a shallow network approximating the nonlinear part, was proposed and tentatively tested to discover knowledge from capacitated vehicle routing problems in Kärkkäinen and Rasku (2020). A thorough treatment of the topic, with a more thorough contextualization, derivation, and experimentation, will be given in Kärkkäinen and Hänninen (2021). Preliminary formulations and experiments around the theme were presented in Hänninen and Kärkkäinen (2016).

Traditionally an autoencoder, i.e., a nonlinear feedforward network with a squeezing middle layer, is used for dimension reduction immediately after data preprocessing (Schmidhuber 2015). However, one abstraction of Taylor’s formula (2) is that, locally, one can approximate a smooth function using a sum of its bias (∼constant), a linear term (∼gradient), and a higher-order, nonlinear operator (∼Hessian + higher-order terms). An analogy of such an observation in the field of autoencoding is that a separation-of-concerns type of model for dimension reduction is composed of approximation of constant, linear, and nonlinear behavior of data. From these premises, we propose a serial or sequential way for dimension reduction, where the classical multilayered feedforward network based autoencoder is only used for nonlinear residual approximation. The phases of the algorithm, when the original dimension $n$ of data vectors is reduced to $m < n$, read as follows:

1. (Bias estimation/data normalization) Center data by subtracting the mean or the spatial median (Kärkkäinen and Saarela 2015) and unify the range of each variable into two.
2. (Linear trend estimation) Perform classical or robust (Kärkkäinen and Saarela 2015) principal component analysis in order to compute the eigenvectors $U \in \mathbb{R}^{n \times n}$ of the covariance matrix. With this new basis, remembering the effect of centering from the previous step, the linear transformation of a vector $x \in \mathbb{R}^n$ into the smaller-dimensional space spanned by the $m$ most significant principal components (PCs) is given by $z = U_m^T x$, where $U_m \in \mathbb{R}^{n \times m}$ denotes the first $m$ columns of $U$. The residual vector $\tilde{x} \in \mathbb{R}^n$ after the linear trend estimation by the PCs reads as

$$\tilde{x} = x - U_m z = x - U_m U_m^T x = (I - U_m U_m^T)\, x. \qquad (4)$$

Here, $I \in \mathbb{R}^{n \times n}$ denotes the identity matrix.
3. (Nonlinear trend estimation) Train a feedforward autoencoder with a squeezing layer of size $m$ for the residual data $\tilde{X}$. This step is specified below.

Formalism introduced in Kärkkäinen (2002), Kärkkäinen and Heikkola (2004) allows compact presentation and analysis of feedforward networks. We use the tanh activation function

$$t(x) = \frac{2}{1 + \exp(-2x)} - 1.$$
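Steps 1 and 2 of the sequential procedure above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the data are synthetic, observations are stored as rows, the mean (not the spatial median) is used for centering, classical (not robust) PCA is applied, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, m = 200, 5, 2
# Synthetic correlated data, rows = observations (an assumption of this sketch).
X = rng.normal(size=(N, n)) @ rng.normal(size=(n, n))

# Step 1: center by the mean and scale each variable to a range of two.
Xc = X - X.mean(axis=0)
Xc = 2.0 * Xc / (Xc.max(axis=0) - Xc.min(axis=0))

# Step 2: eigenvectors of the covariance matrix, sorted by decreasing eigenvalue.
C = np.cov(Xc, rowvar=False)
eigvals, U = np.linalg.eigh(C)
U = U[:, np.argsort(eigvals)[::-1]]
Um = U[:, :m]                      # first m principal directions

Z = Xc @ Um                        # z = Um^T x for every observation x
X_res = Xc - Z @ Um.T              # residual x~ = (I - Um Um^T) x

# The residual has no component left in the span of the retained PCs.
print(np.abs(X_res @ Um).max())
```

The residual matrix `X_res` is what would be passed to the nonlinear autoencoder in step 3.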



Its use on a whole network layer with $n_l$ neurons is specified with a diagonal function-matrix $F = F(\cdot) = \mathrm{Diag}\{f_i(\cdot)\}_{i=1}^{n_l}$, where $f_i = t$. Using the linear output layer, the feedforward transformation of a network with $L$ layers is given by

$$o = o^L = \mathcal{N}(\tilde{x}) = W^L o^{(L-1)},$$

where $o^0 = \tilde{x}$ and $o^l = F(W^l o^{(l-1)})$, $l = 1, \dots, L-1$. Note that the layer number $l$ of a weight matrix $W$ is given using the upper index. We assume that $L$ is even and consider the so-called symmetric autoencoder (Hänninen and Kärkkäinen 2016), for which $W^{\tilde{l}} = (W^l)^T$, $l = 1, \dots, L/2$, and $\tilde{l} = L - (l - 1)$. The dimensions of the weight matrices to be determined are given by $\dim(W^l) = n_l \times n_{l-1}$, $l = 1, \dots, L/2$. In autoencoding, $n_L = n_0$ and the central layer has the squeezing dimension $m = n_{L/2} < n$.

To determine the network weights, we minimize the regularized mean least-squares cost function $J$, whose fidelity and regularization terms are defined as follows:

$$J_f(\{W^l\}; \{\tilde{X}\}) = \frac{1}{2N} \sum_{i=1}^{N} \left\| W^L o_i^{(L-1)} - \tilde{x}_i \right\|^2, \qquad (5)$$

$$J_r(\{W^l\}) = \frac{\beta}{2 \sum_{l=1}^{L} \#(W_1^l)} \sum_{l=1}^{L} \left\| W^l - W_0^l \right\|_F^2. \qquad (6)$$

Here, $\|\cdot\|_F$ denotes the (matrix) Frobenius norm and $\#(W_1^l)$ the number of rows of $W^l$, $\beta > 0$ is a fixed regularization parameter, and $\{W_0^l\}_{l=1}^{L}$ refer to the randomly generated initial values of the weights. For simplicity, let us set

$$\tilde{\beta} = \frac{\beta}{\sum_{l=1}^{L} \#(W_1^l)}.$$

Note that both terms in (5) and (6) contain averaging over the data and over the number of weights, in order to balance the two terms when different numbers of observations and different numbers and sizes of layers are considered. The gradient matrices $\nabla_{W^l} J(\{W^l\})$, $l = L, \dots, 1$, for the optimization problem are given as (see Kärkkäinen 2002)


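$$\nabla_{W^l} J(\{W^l\}) = \frac{1}{N} \sum_{i=1}^{N} d_i^l \left(o_i^{(l-1)}\right)^T + \tilde{\beta} \left(W^l - W_0^l\right),$$

where the layer-wise error backpropagation reads as

$$d_i^L = e_i = W^L o_i^{(L-1)} - \tilde{x}_i,$$

$$d_i^l = \mathrm{Diag}\left\{F'\!\left(W^l o_i^{(l-1)}\right)\right\} \left(W^{(l+1)}\right)^T d_i^{(l+1)}.$$

For the symmetric autoencoder, the layer-wise optimality conditions for the weights in layers $l = 1, \dots, L/2$ read as

$$\nabla_{W^l} J = \frac{1}{N} \sum_{i=1}^{N} \left[ d_i^l \left(o_i^{(l-1)}\right)^T + o_i^{(\tilde{l}-1)} \left(d_i^{\tilde{l}}\right)^T \right] + \tilde{\beta} \left(W^l - W_0^l\right), \quad \tilde{l} = L - (l - 1).$$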
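The forward transformation, the regularized cost, and the layer-wise backpropagation can be sketched and sanity-checked as follows. This is a hedged illustration for the general (non-symmetric) case with L = 2 layers only: the function names, the random data, and the initial weights are assumptions of this sketch, observations are stored as columns, and the derivative of the tanh activation is written as 1 − (o^l)^2. The analytic gradient is compared against a central finite difference.

```python
import numpy as np


def t(x):
    # Activation 2 / (1 + exp(-2x)) - 1, which is algebraically equal to tanh(x).
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0


def forward(Ws, X):
    """Return layer outputs [o^0, ..., o^L] for data X (columns = observations)."""
    outs = [X]
    for W in Ws[:-1]:
        outs.append(t(W @ outs[-1]))
    outs.append(Ws[-1] @ outs[-1])  # linear output layer
    return outs


def cost_and_grads(Ws, W0s, X, beta):
    N = X.shape[1]
    beta_t = beta / sum(W.shape[0] for W in Ws)  # beta divided by total row count
    outs = forward(Ws, X)
    D = outs[-1] - X                              # output-layer error d^L
    J = 0.5 / N * np.sum(D ** 2)
    J += 0.5 * beta_t * sum(np.sum((W - W0) ** 2) for W, W0 in zip(Ws, W0s))
    grads = [None] * len(Ws)
    for l in range(len(Ws) - 1, -1, -1):
        grads[l] = D @ outs[l].T / N + beta_t * (Ws[l] - W0s[l])
        if l > 0:
            # Backpropagate through the tanh layer: derivative is 1 - (o^l)^2.
            D = (1.0 - outs[l] ** 2) * (Ws[l].T @ D)
    return J, grads


rng = np.random.default_rng(1)
n, m, N = 4, 2, 30
X = rng.uniform(-1.0, 1.0, size=(n, N))
W0s = [0.5 * rng.normal(size=(m, n)), 0.5 * rng.normal(size=(n, m))]
Ws = [W + 0.1 * rng.normal(size=W.shape) for W in W0s]

J, grads = cost_and_grads(Ws, W0s, X, beta=1e-2)

# Central finite difference for a single weight of the first layer.
eps = 1e-6
Wp = [W.copy() for W in Ws]
Wm = [W.copy() for W in Ws]
Wp[0][1, 2] += eps
Wm[0][1, 2] -= eps
fd = (cost_and_grads(Wp, W0s, X, 1e-2)[0]
      - cost_and_grads(Wm, W0s, X, 1e-2)[0]) / (2.0 * eps)
print(abs(fd - grads[0][1, 2]))  # small when the backpropagation is coded consistently
```

For the symmetric autoencoder, the two gradient contributions of a tied pair of layers would be combined as in the optimality condition above; that variant is not implemented in this sketch.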

Note that then the regularization term (6) is reduced to contain only these matrices. The autoencoding error of the sequential model is based on the accuracy of the reduced representation. For normalized data, the residual after the linear trend estimation is given by (4). This data is then fed to the nonlinear autoencoder, whose autoencoding error defines the overall error of representing the original data in a lower dimension. Instead of the usual root-mean-squared-error (RMSE), we apply the mean-root-squared-error (MRSE) according to the following formula:

$$e = \frac{1}{N} \sum_{i=1}^{N} \sqrt{\left| \tilde{x}_i - \mathcal{N}(\tilde{x}_i) \right|^2}.$$


This choice is made because the MRSE correlated better than the RMSE with the independent validation error in the experiments in Kärkkäinen (2014).
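The difference between the two error measures can be illustrated with a small sketch. It assumes the reading that the MRSE takes the square root per observation before averaging (i.e., it is the mean Euclidean norm of the errors), and that the RMSE is the root of the mean squared error norm; both function names and the toy data are hypothetical.

```python
import numpy as np


def rmse(X, X_hat):
    """Root of the mean squared error norm over observations (rows)."""
    return np.sqrt(np.mean(np.sum((X - X_hat) ** 2, axis=1)))


def mrse(X, X_hat):
    """Mean of the per-observation root squared errors (mean Euclidean error norm)."""
    return np.mean(np.sqrt(np.sum((X - X_hat) ** 2, axis=1)))


X = np.zeros((4, 2))
# Three small reconstruction errors and one large outlier error.
X_hat = np.array([[0.1, 0.0], [0.0, 0.1], [0.1, 0.0], [3.0, 4.0]])

print(mrse(X, X_hat), rmse(X, X_hat))
```

In this toy case the single large error dominates the RMSE more strongly than the MRSE, and under these definitions the MRSE never exceeds the RMSE.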

2.4 Feature Selection Method Based on Taylor’s Formula

The method depicted in this section will be thoroughly explored in Linja et al. (2021). The technique was originally proposed for the feedforward neural network in Kärkkäinen (2015). The predictive model to be used for constructing classifiers in this section was proposed in Kärkkäinen (2019), where it was referred to as the Extreme Minimal Learning Machine (EMLM). Of its ancestors, the Extreme Learning Machine (ELM) has been addressed in Kärkkäinen and Glowinski (2019) (see original articles therein), and the Minimal Learning Machine (MLM) in Kärkkäinen (2019), Hämäläinen et al. (2020), Linja et al. (2020) (see original articles therein). In short, we are combining the classical ridge regression with a distance-based feature mapping. Here ridge regression refers to the Tykhonov type of least-squares regularization (Hastie



et al. 2001), which could be changed into sparsity-favoring forms using nonsmooth and even nonconvex regularizers and appropriate solvers (Kärkkäinen et al. 2001; Kärkkäinen and Glowinski 2019). As noted in Kärkkäinen (2019, Remark 1), the structural resemblance with the classical Radial Basis Function Network (RBFN) means that the EMLM is a universal approximator. Differently from RBFN, where most commonly a clustering technique is used to identify the locations of the basis functions (Schwenker et al. 2001), a subset of the original observations referred to as reference points (RPs) is used in EMLM, similarly to MLM. For this purpose, the RS-maxmin algorithm can be applied (Hämäläinen et al. 2020; Kärkkäinen 2019). It starts from a particular observation and adds a new reference point from the not-yet-used observations that has the largest distance to the already chosen set. This continues until the selected number of RPs has been reached. This number, typically given as a portion of the size of the training data $N$, is the only metaparameter in the EMLM method, making its use unquestionably simpler than, e.g., utilization of a deep neural network with a plethora of metaparameters (e.g., what activation function, how many layers, what kind of layers, what kind of an overall network architecture, etc.). The original selection for the starting point of the RS-maxmin in regression problems was the data mean (Hämäläinen et al. 2020), but here we confine ourselves, in the context of classification problems, to a slightly different variant referred to as C-RS-maxmin. Firstly, we select the reference points in a class-wise fashion, similarly to the stratified cross-validation (Kärkkäinen 2014), to ensure that the class frequencies reoccur in the set of reference points. Secondly, within each class, we simply pick the smallest observation (in the Euclidean norm) as the first RP.
Similarly to the original form of the algorithm, these choices guarantee a completely deterministic selection strategy, which makes the whole EMLM method completely deterministic. Let us denote with $R \subseteq X$ the set of $m$ reference points selected from the input observations using the C-RS-maxmin algorithm. Now, using the set of reference points and the whole set of input vectors, define the distance matrix $H = \{h_i\}_{i=1}^{N} \in \mathbb{R}^{m \times N}$ as

$$(H)_{ij} = \| r_i - x_j \|_2, \quad i = 1, \dots, m, \quad j = 1, \dots, N. \qquad (9)$$
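The distance-based feature mapping can be sketched with NumPy broadcasting. The function name `distance_matrix` and the toy data are hypothetical; reference points and inputs are assumed to be stored as rows.

```python
import numpy as np


def distance_matrix(R, X):
    """H[i, j] = ||r_i - x_j||_2 for reference points R (m x n) and inputs X (N x n)."""
    diff = R[:, None, :] - X[None, :, :]  # shape (m, N, n) by broadcasting
    return np.sqrt(np.sum(diff ** 2, axis=2))


X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
R = X[[0, 2]]                             # two reference points taken from the data
H = distance_matrix(R, X)
print(H)                                  # rows: distances 0, 5, 10 and 10, 5, 0
```

Each column of `H` is the feature vector of one observation, with one entry per reference point.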


This computation performs a nonlinear transformation of the input data. The weights $W \in \mathbb{R}^{k \times m}$ of the EMLM model are then obtained from the quadratic terms

$$J_f(W) = \frac{1}{2N} \sum_{i=1}^{N} \|W h_i - y_i\|_2^2 \quad \text{and} \quad J_r(W) = \frac{\beta}{2m} \sum_{i=1}^{k} \sum_{j=1}^{m} |W_{ij}|^2,$$

yielding a linear optimality condition (Kärkkäinen 2019, Eq. (5)). Similarly to Sect. 2.3, the coefficients $\frac{1}{N}$ and $\frac{1}{m}$ balance the scales of the fidelity and the regularization terms with respect to the amount of data and the number of reference points, respectively (Kärkkäinen and Glowinski 2019). Similarly, $\beta > 0$ is the Tikhonov



regularization/penalization parameter, which restricts the increase in the magnitude of the weights and, by enforcing strict coercivity, guarantees that the minimization of $J_f(W) + J_r(W)$ is uniquely solvable.

Let us next turn our attention to feature selection. In the context of the EMLM, this simply refers to the question of whether it would be possible to find a subset of the original n features such that the accuracy of the distance-based classifier would improve, or at least not grow worse, while its construction, more precisely the computation of the distance matrix in (9), would be more efficient (Guyon and Elisseeff 2003). Additionally, if the data come from a set of real sensors, then the use of a reduced number of features would simplify the data collection and possibly save investments in the industrial measurement system. Wrapper-based feature selection (Kohavi and John 1997) is based on the estimation of feature importance in relation to the predictive model under construction. For this purpose, we confine ourselves to another observation in relation to Taylor’s formula in (3). Namely, one observes that a small value of an individual gradient component $\nabla_j f(x_0)$ indicates a weak relevance of the jth feature in depicting the function’s local behavior. Such an observation suggests a feature importance (FI) measure that is based on the Mean Absolute Sensitivity (MAS) over the training data:

$$\mathrm{FI} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{\partial M_i}{\partial x} \right|, \qquad (10)$$

where $M_i = W h_i$ denotes the ith output of the distance-based model. Note that because each element $\frac{\partial M_i}{\partial x} \in \mathbb{R}^{k \times n}$, Eq. (10) yields a $k \times n$ matrix. Different combinations or statistics over the number of classes k (e.g., max or mean) could be used to turn this into an n-dimensional feature importance vector, and we simply apply the mean here. In order to emphasize individual features, we chose the robust Cityblock distance (Huber 2004) in (10). Finally, from (9) it follows that the analytic derivative $\frac{\partial W h_i}{\partial x}$ can be computed using exactly the same kind of penalized expression as given in the context of clustering in Kärkkäinen and Äyrämö (2005, Eq. (3)), which is yet another notable link between unsupervised and supervised learning. Altogether, the derived approach provides us with n feature importance values, so that sorting these into descending order defines the actual ranking of the features from the most important to the least important. It is then up to a selection strategy to decide which features should be included in or excluded from the model. One suggestion was given in Linja et al. (2021), but here we refrain from such further considerations and simply address experimentally the utility of Eq. (10).
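To make the two steps concrete, the following NumPy sketch solves the linear optimality condition and evaluates the MAS-based feature importance. It assumes the cost $\frac{1}{2N}\sum_i \|W h_i - y_i\|_2^2 + \frac{\beta}{2m}\sum_{ij} |W_{ij}|^2$, whose stationarity condition is $W (\frac{1}{N} H H^T + \frac{\beta}{m} I) = \frac{1}{N} Y H^T$. The function names, the toy data, the choice of reference points, the value of beta, and the smoothing constant `delta` (standing in for the penalized derivative mentioned in the text) are assumptions made for illustration only.

```python
import numpy as np

def distances(R, X):
    """Distance matrix with entries ||r_i - x_j||_2, shape (m, N)."""
    diff = R[:, None, :] - X[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))

def fit_emlm(H, Y, beta):
    """Solve W (H H^T / N + (beta / m) I) = Y H^T / N for W (k x m)."""
    m, N = H.shape
    A = H @ H.T / N + (beta / m) * np.eye(m)   # symmetric positive definite
    B = Y @ H.T / N
    return np.linalg.solve(A, B.T).T

def model_jacobian(W, R, x, delta=1e-8):
    """Jacobian (k x n) of M(x) = W h(x) with the smoothed distance
    h_i(x) = sqrt(||x - r_i||^2 + delta^2)."""
    diff = x[None, :] - R                              # (m, n)
    h = np.sqrt(np.sum(diff ** 2, axis=1) + delta ** 2)
    return W @ (diff / h[:, None])

def mas_feature_importance(W, R, X, delta=1e-8):
    """Mean absolute sensitivity over the data, then mean over classes."""
    jac_abs = np.mean(
        [np.abs(model_jacobian(W, R, x, delta)) for x in X], axis=0
    )                                                  # (k, n)
    return jac_abs.mean(axis=0)                        # (n,)

# Hypothetical toy problem: the class depends on the first feature only.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = (X[:, 0] > 0).astype(int)
Y = np.eye(2)[y].T                 # one-hot targets, shape (k, N)
R = X[:6]                          # stand-in reference points
H = distances(R, X)
W = fit_emlm(H, Y, beta=1e-3)
fi = mas_feature_importance(W, R, X)
ranking = np.argsort(-fi)          # most important feature first
```

Taking the elementwise absolute value before averaging corresponds to the Cityblock choice in Eq. (10); replacing `mean` over the classes with `max` would give the alternative statistic mentioned in the text.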



3 Computational Experiments

Next, novel results from computational experiments with the methods depicted in the previous section are reported. As already emphasized, the proposed techniques are thoroughly treated in other works (Kärkkäinen and Hänninen 2021; Linja et al. 2021), so the purpose of the computations presented here is to augment the existing experiments, not to replace them. Concerning the autoencoder, the novel layerwise pretraining approach also used here is derived and depicted in Kärkkäinen and Hänninen (2021).

3.1 Experimental Setting

The methods were implemented by the author in the Matlab environment. Data management and linear algebra, including linear solvers, were applied as is from the environment. Solution of the nonlinear optimization problem related to the autoencoder’s last phase was based on the limited memory quasi-Newton method provided by the Poblano toolbox. Computation of the cost function and gradient matrices, with reshaping back and forth between matrix and vector presentations, followed the corresponding algebraic formulae given in (5), (6), and (7). The regularization coefficient β appearing in both methods was fixed to $\sqrt{\varepsilon}$, where ε refers to machine epsilon (about 2.2e-16 in 64-bit floating-point arithmetic). We use the same datasets, depicted in the first four columns of Table 1, in both experiments. In the table, after the name of the dataset, the sizes of the training set (N), the validation set ($N_v$), and the dimension of the data vectors (n) are given. The datasets originate from the UCI machine learning repository and are characterized by decent volumes and especially by a priori given training and validation sets. However, because we have no guarantees on the quality and distributional correspondence between the given training and validation sets, we carry out the experiments in a tandem fashion: using the given sets as is first, but then changing their roles by using the original training set as the validation set and vice versa. In this way, we hope to shed some light on the quality of these experimental datasets, which have an important role in the development and assessment of machine learning methods.
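As a small illustration of the setting (in Python rather than the Matlab environment used for the actual experiments), the regularization coefficient and the tandem use of the two given sets could be written as follows. The helper name `tandem_splits` is hypothetical; the labels follow the Train-Valid and Valid-Train naming used in the figures.

```python
import numpy as np

# Regularization coefficient: square root of the double-precision machine epsilon.
beta = np.sqrt(np.finfo(np.float64).eps)   # roughly 1.49e-8

def tandem_splits(train, valid):
    """Yield the given split first, then the same split with the roles swapped."""
    yield "Train-Valid", train, valid
    yield "Valid-Train", valid, train

# Usage with placeholder objects standing in for the two given datasets.
for name, fit_set, eval_set in tandem_splits("given training set", "given validation set"):
    print(name, "| fit on:", fit_set, "| evaluate on:", eval_set)
```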

3.2 Additive Autoencoder Inspired by Taylor’s Formula

Results for the additive autoencoder, as depicted in Sect. 2.3, are given in the last six columns of Table 1. In the column ‘$n_0$’, the maximum size of the squeezing layer tested is given. For all datasets, we have used $0.75\,n$ to fix this. The next


Table 1 Description of datasets and autoencoding results

Dataset      N       Nv
COIL         1 800   5 400
Satimage     4 435   2 000
Optdigits    3 823   1 797
COIL2000     5 822   4 000
USPS         7 291   2 007
Isolet       6 238   1 559

AE-Train 4.0e-3

Constant input features with ε tolerance were removed from original datasets

column ‘Inc’ then determines the increment for the squeezing dimension during the experiments. For small-dimension datasets, having fewer than one hundred features, we used the increment one, and for large-dimension sets the increment 10. In the last four columns of Table 1, we report the autoencoding errors according to Eq. (8) for $n_0$. As explained above, we performed the experiments twice, first using the original training set to determine the autoencoder and the validation set to compute the generalization error (column ‘AE-Train’ for the final training error and ‘AE-Valid’ for the final validation error). Then the roles of these datasets were switched and the tests, together with the error computations, were repeated (columns ‘AE-Valid’ and ‘AE-Train’). We conclude from Table 1 that all the autoencoding errors in columns 7–10 are small. Thus, the proposed model with a squeezing dimension $n_0$ was able to accurately encapsulate the variability of the training dataset. The quality of the overall transformation generalized to the validation set, which also showed a small and compatible MRSE. MRSE figures are given in Figs. 1 and 2 at the end of this chapter. In general, the figures show that the autoencoding errors for the training and validation sets usually agreed well, except for Isolet, where there is a very large discrepancy between these errors. This indicates that, for this dataset, the a priori given split into training and validation data is not distributionally compatible. Moreover, most of the error figures also clearly illustrate that there is a particular hidden dimension after which the autoencoding error level stabilizes, i.e., does not significantly decrease anymore. This suggests that beyond this hidden, reduced dimension, the deterministic error for representing the variability of the dataset cannot be improved anymore.

3.3 Feature Selection for a Distance-Based Classifier

First, we fixed the portion of C-RS-maxmin-based reference points to 85% of the number of training data. Results of these experiments are summarized in Table 2. The same datasets and the same increments (one or ten) for enlarging the number of features were used as in the previous tests. After the name of the dataset, the number


Fig. 1 Train-Valid and Valid-Train AE errors for COIL, Satimage, and Optdigits


Fig. 2 Train-Valid and Valid-Train AE errors for COIL2000, USPS, and Isolet




Table 2 Feature selection results

                        Train-Valid                                   Valid-Train
Dataset     k     Mcp    n0    Mcp    n*    Mcp    Corr         Mcp    n0    Mcp    n*    Mcp    Corr
COIL        100   3.96   15    4.04   14    4.04   9.995e-1     2.67   15    2.61   15    2.61   9.996e-1
Satimage    –     –      27    7.50   19    7.35   9.978e-1     8.79   27    8.84   22    8.57   9.956e-1
Optdigits   –     –      46    1.45   37    1.39   9.999e-1     2.22   46    2.28   46    2.28   9.996e-1
COIL2000    2     –      64    6.97   1     5.95   -0.16        –      64    6.89   1     5.98   3.185e-1
USPS        –     4.43   190   4.29   180   4.24   9.948e-1     2.74   190   2.59   170   2.59   9.972e-1
Isolet      –     3.34   460   3.34   230   3.27   9.990e-1     5.82   460   5.39   240   5.18   9.954e-1


of classes k is given. As before, the maximum number of features $n_0$ to be included in the reduced, feature-selection-based EMLM model was at most $0.75\,n$. The feature ranking based on Eq. (10) was tested twice: first using the original training-validation set division, and then interchanging the roles of the datasets. The misclassifications-in-percentages (i.e., the percentage of false classifications over the validation set), Mcp, are provided for the full feature model (‘Mcp’ in the third column of Table 2) and for the largest hidden dimension $n_0$ that was tested (‘Mcp’ in the fifth column). We also searched for a dimension $n^*$ over the tested numbers of features for which the Mcp has the smallest value. These are reported in columns six and seven. Moreover, in addition to computing the Mcp with respect to the validation set of labels (‘FS-MCP’), we also computed the same quantity (i.e., the relative portion of different labels in percentages) between the full feature set model and the reduced models (‘Full/FS-MCP’). Note that this quantity can be computed without actual knowledge of the labels in the validation set. The behavior of these Mcps is depicted in Figs. 3 and 4, and the correlation coefficients for both data splits are given in the columns ‘Corr’ in Table 2. We draw the following conclusions from Table 2. With smaller sets of features (COIL, Satimage, Optdigits), the full EMLM model has the smallest Mcp error. With larger feature sets (COIL2000, USPS, Isolet), the best reduced EMLM model is the best classifier. When the Valid-Train order is used, i.e., when the training set has a smaller portion of observations compared to the validation set, the smaller-dimension problems are also best solved with reduced models. A complete outlier to the general behavior is COIL2000, for which the best classification accuracy is obtained with only one feature and the correlation between the Train-Valid Mcp errors is very low.
Figures illustrating the behavior of the Mcp errors for the reduced model, in comparison with the true validation labels (‘FS-MCP’) and with respect to the labels predicted by the full feature model (‘Full/FS-MCP’), are given in Figs. 3 and 4 at the end of this chapter. In the figures, the constant Mcp of the distance-based model with the complete set of features is illustrated with a dashed red line. Also in the figures, the results for COIL2000 are completely different from those for the other datasets. All this indicates that there is some unknown deficiency between the features and the target classes, because in the AE experiments the dimension reduction errors for the features alone agreed with the overall characteristics for both Train-Valid orders. We conclude that simpler models with a reduced number of features provide a good approximation of the full model’s accuracy already with half of the features involved. Excluding COIL2000, the Mcp error plots with respect to the true validation labels are decreasing, which confirms that the Taylor-inspired MAS is, indeed, a very useful and reliable measure of feature importance for the distance-based model. Moreover, the high correlation between ‘FS-Mcp’ and ‘Full/FS-Mcp’ (except for COIL2000) and the stabilizing behavior of ‘Full/FS-Mcp’ suggest that the latter plot can provide a useful proxy for determining a sufficient number of features for the reduced distance-based model.

Fig. 3 Train-Valid and Valid-Train FS errors for COIL, Satimage, and Optdigits

Fig. 4 Train-Valid and Valid-Train FS errors for COIL2000, USPS, and Isolet
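The two error quantities and their correlation are simple to compute once label predictions are available. The sketch below uses small made-up label vectors purely for illustration; the variable names and the data are hypothetical and unrelated to the reported experiments. It shows that the ‘Full/FS’ quantity needs only the predictions of the full and reduced models, not the true validation labels.

```python
import numpy as np

def mcp(pred, ref):
    """Percentage of positions where two label vectors differ."""
    pred, ref = np.asarray(pred), np.asarray(ref)
    return 100.0 * np.mean(pred != ref)

# Made-up example: true labels, full-model predictions, and predictions of
# reduced models keyed by the number of features they use.
y_valid = [0, 1, 1, 0, 1, 0, 0, 1]
full_pred = [0, 1, 1, 0, 1, 0, 1, 1]
reduced_preds = {
    1: [1, 1, 0, 0, 0, 0, 1, 1],
    2: [0, 1, 0, 0, 1, 0, 1, 1],
    3: [0, 1, 1, 0, 1, 0, 1, 1],
}

n_features = sorted(reduced_preds)
fs_mcp = [mcp(reduced_preds[n], y_valid) for n in n_features]         # needs labels
full_fs_mcp = [mcp(reduced_preds[n], full_pred) for n in n_features]  # label-free
corr = np.corrcoef(fs_mcp, full_fs_mcp)[0, 1]
n_star = n_features[int(np.argmin(fs_mcp))]   # dimension with the smallest Mcp
```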

4 Conclusions

In this chapter, we presented and experimented with two machine learning methods for unsupervised and supervised learning. The learning machines were based on a classical feedforward neural network and a more recently proposed distance-based model. In the derivation of the methods, interpretation of the structure and behavior of Taylor’s formula was utilized. Experimental results demonstrated the usefulness of the methods in unsupervised dimension reduction and in the construction of efficient classifiers. The experimental novelty of interchanging the roles of the given training and validation sets provided interesting information on the quality of the open machine learning datasets, which have an important role in the research field when new machine learning methods are proposed. The possibility to compare unsupervised and supervised experiments was also informative in this respect (cf. results with ‘COIL2000’). In addition to the shared background, we observed very similar behavior between the error plots of the two different methods: computable quantities for both methods, without actual knowledge of the optimal reduced dimension in unsupervised learning or of the validation labels in supervised learning, seemed to stabilize when we reduced our search efforts to 75% of the number of features. Hence, visual interpretation of the results suggests that an appropriate dimension for both methods is detectable with other datasets as well. Throughout his career, Professor Pekka Neittaanmäki has been a visionary, actively seeking and integrating new ideas and competencies of world-leading scholars to advance both industry and academia. In this chapter, we have tried to show that even in the rapidly expanding field of artificial intelligence and machine learning, classical knowledge and tools from scientific computing and optimization can be useful and provide new perspectives.
The methods here were inspired by the centuries-old Taylor’s formula, and the realization of the proposed techniques was based on numerical linear algebra and analytic sensitivity analysis of discretized optimization problems. Compared to the common approach in deep learning, which is to use components from available software packages incorporating automatic differentiation, the analytical approach guarantees efficient and error-free implementation and supports rigorous analysis of the optimality conditions. The results given in Kärkkäinen (2002) and Kärkkäinen and Heikkola (2004), which explicate properties of a locally optimal feedforward neural network, underline the usefulness of such an approach. As already anticipated in the introduction, there are plenty of future directions for the line of research presented here. One can integrate and hybridize unsupervised and supervised learning in multiple ways, for example by using ensembles of local models and/or models which use different subsets of the most useful features. Such hybridized approaches also allow integration of classical machine learning research tracks like autoencoding with more recently proposed techniques like the distance-based supervised methods.

Acknowledgements The author would like to thank the Academy of Finland for the financial support (grants 311877 and 315550).

References

Barrio R (2005) Performance of the Taylor series method for ODEs/DAEs. Appl Math Comput 163(2):525–545
Bishop CM (2006) Pattern recognition and machine learning. Springer, New York
Boateng EY, Otoo J, Abaye DA (2020) Basic tenets of classification algorithms K-Nearest-Neighbor, support vector machine, random forest and neural network: A review. J Data Anal Inf Proc 8(4):341–357
Deng J, Frühholz S, Zhang Z, Schuller B (2017) Recognizing emotions from whispered speech based on acoustic feature transfer learning. IEEE Access 5:5235–5246
Dennis JE Jr, Schnabel RB (1996) Numerical methods for unconstrained optimization and nonlinear equations. Classics in applied mathematics, vol 16. SIAM, Philadelphia, PA
Diallo B, Hu J, Li T, Khan GA, Liang X, Zhao Y (2021) Deep embedding clustering based on contractive autoencoder. Neurocomputing 433:96–107
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Hämäläinen J, Alencar ASC, Kärkkäinen T, Mattos CLC, Souza AH Jr, Gomes JPP (2020) Minimal learning machine: theoretical results and clustering-based reference point selection. J Mach Learn Res 21:1–29
Hämäläinen J, Nieminen P, Kärkkäinen T (2021) Instance-based multi-label classification via multitarget distance regression. In: Proceedings of the 29th European symposium on artificial neural networks, computational intelligence and machine learning–ESANN 2021. ESANN, 2021. (6 pages, to appear)
Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques, 3rd ed. Elsevier
Hänninen J, Kärkkäinen T (2016) Comparison of four- and six-layered configurations for deep network pretraining. In: Proceedings of the European symposium on artificial neural networks, computational intelligence and machine learning–ESANN 2016, pp 533–538
Hastie T, Tibshirani R, Friedman JH (2001) The elements of statistical learning: Data mining, inference, and prediction. Springer, New York
Haykin SO (2009) Neural networks and learning machines, 3rd ed. Pearson
Hosch WL (2009) Taylor series. Britannica. Accessed 08 Sept 2021
Huber PJ (2004) Robust statistics, vol 523. Wiley Series in Probability and Statistics. Wiley, New York
Kärkkäinen T (2002) MLP in layer-wise form with applications to weight decay. Neural Comput 14(6):1451–1480
Kärkkäinen T (2014) On cross-validation for MLP model evaluation. In: Structural, syntactic, and statistical pattern recognition–S+SSPR 2014, Berlin, 2014. Springer, pp 291–300
Kärkkäinen T (2015) Assessment of feature saliency of MLP using analytic sensitivity. In: Proceedings of the European symposium on artificial neural networks, computational intelligence and machine learning–ESANN 2015, pp 273–278
Kärkkäinen T (2019) Extreme minimal learning machine: ridge regression with distance-based basis. Neurocomputing 342:33–48
Kärkkäinen T, Äyrämö S (2005) On computation of spatial median for robust data mining. In: Evolutionary and deterministic methods for design, optimization and control with applications to industrial and societal problems–EUROGEN 2005, Munich. FLM, pp 1–14
Kärkkäinen T, Glowinski R (2019) A Douglas-Rachford method for sparse extreme learning machine. Methods Appl Anal 26(3):217–234
Kärkkäinen T, Hänninen J (2021) An additive autoencoder for dimension estimation. Submitted (32 pp + supplementary material 31 pp)
Kärkkäinen T, Heikkola E (2004) Robust formulations for training multilayer perceptrons. Neural Comput 16(4):837–862
Kärkkäinen T, Majava K, Mäkelä MM (2001) Comparison of formulations and solution methods for image restoration problems. Inverse Probl 17(6):1977–1995
Kärkkäinen T, Rasku J (2020) Application of a knowledge discovery process to study instances of capacitated vehicle routing problems. In: Computation and big data for transport: digital innovations in surface and air transport systems, pp 77–102. Springer
Kärkkäinen T, Saarela M (2015) Robust principal component analysis of data with missing values. In: Machine learning and data mining in pattern recognition–MLDM 2015, Cham. Springer, pp 140–154
Kim S, Noh YK, Park FC (2020) Efficient neural network compression via transfer learning for machine vision inspection. Neurocomputing 413:294–304
Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intel 97(1–2):273–324
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
Linja J, Hämäläinen J, Nieminen P, Kärkkäinen T (2020) Do randomized algorithms improve the efficiency of minimal learning machine? Mach Learn Knowl Extract 2(4):533–557
Linja J, Hämäläinen J, Nieminen P, Kärkkäinen T (2021) Feature selection for distance-based regression. Manuscript
Min E, Guo X, Liu Q, Zhang G, Cui J, Long J (2018) A survey of clustering with deep learning: from the perspective of network architecture. IEEE Access 6:39501–39514
Myllykoski M, Glowinski R, Kärkkäinen T, Rossi T (2015) A new augmented Lagrangian approach for L1-mean curvature image denoising. SIAM J Imaging Sci 8(1):95–125
Saxena A, Prasad M, Gupta A, Bharill N, Patel OP, Tiwari A, Er MJ, Ding W, Lin CT (2017) A review of clustering techniques and developments. Neurocomputing 267:664–681
Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117
Schwenker F, Kestler HA, Palm G (2001) Three learning phases for radial-basis-function networks. Neural Netw 14(4–5):439–458
Simeone O (2018) A very brief introduction to machine learning with applications to communication systems. IEEE Trans Cogn Commun Netw 4(4):648–664
Sun C, Ma M, Zhao Z, Tian S, Yan R, Chen X (2019) Deep transfer learning based on sparse autoencoder for remaining useful life prediction of tool in manufacturing. IEEE Trans Ind Inf 15(4):2416–2425
Sun M, Wang H, Liu P, Huang S, Fan P (2019) A sparse stacked denoising autoencoder with optimized transfer learning applied to the fault diagnosis of rolling bearings. Measurement 146:305–314
Teixeira RA, Braga AP, Takahashi RHC, Saldanha RR (2000) Improving generalization of MLPs with multi-objective optimization. Neurocomputing 35:189–194
Torgo L, Da Costa JP (2003) Clustered partial linear regression. Mach Learn 50(3):303–319

Computational Methods in Spectral Imaging Ilkka Pölönen

Abstract Spectral imaging is an evolving technology with numerous applications. These images can be computationally processed in several ways. In addition to machine learning methods, spectral images can be processed mathematically by modelling or by combining both approaches. This chapter looks at spectral imaging and the computational methods commonly used in it. We review methods related to preprocessing, modelling, and machine learning, and become familiar with some applications.

Keywords Spectral imaging · Modelling · Machine learning · Data processing

1 Introduction

Automated interaction between machines and their environment based on visual detection is a rising trend. To detect the world around them, machines need sensors. Autonomous cars, planes, and industrial robots are inconceivable without the ability to detect their surroundings. Robot-based sorting would not be possible without detecting and identifying the objects to be sorted. Most people’s experience with visual sensors is limited to a mobile phone camera. Today these cameras have an enormous number of pixels and several automatic features, and they can be used with different kinds of applications to produce data. Traditional machine vision cameras also produce a wide range of data, but new technologies are currently emerging alongside them. In particular, various passive and active imaging sensors have emerged. An active sensor needs a separate source of electromagnetic radiation. For example, light detection and ranging (lidar) is one such sensor type (Guan et al. 2016). Of the passive sensors, spectral imagers have attracted increasing research interest (Garini et al. 2006).

I. Pölönen (B) Faculty of Information Technology, University of Jyväskylä, P.O. Box 35, 40014 Jyväskylä, Finland
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 P. Neittaanmäki and M.-L. Rantalainen (eds.), Impact of Scientific Computing on Science and Society, Computational Methods in Applied Sciences 58,




Fig. 1 Spectral image taken with the AVIRIS sensor from a field in Indian Pines (Baumgardner et al. 2015): Illustration of the spectral data cube including spatial and spectral domains (top left), Ground truth of the data set (top right), Mean spectrum of each class of the ground truth from the spectral image (bottom)

The idea of a spectral imager is to photograph an object so that the different wavelengths of electromagnetic radiation are stored in separate images. A standard color camera is such a device: it records the wavelengths of blue, green, and red on separate channels. Spectral imagers can be roughly divided into multispectral and hyperspectral imagers. Multispectral imagers capture up to a few dozen wavebands, while a hyperspectral imager typically captures hundreds of narrow wavebands so that each pixel in the image forms a spectrum through the captured wavebands. A single spectral image is thus represented in both the spatial and the spectral domain, as illustrated in Fig. 1 (Pölönen 2013).

Spectral imagers are available from visible light to thermal infrared. The wavelength ranges are limited by the optics and the sensor technology used. Widely used CMOS cells can reach the wavelength range of 400–1000 nm. Infrared requires specially made sensor cells. For example, in the shortwave infrared range of 1000–2400 nm, indium gallium arsenide (InGaAs) cells are used. The price of specially made cells is high compared to CMOS cells. Thus, applications have mainly been developed with spectral imagers based on CMOS cells and the corresponding wavelength range (ElMasry and Sun 2010).

Spectral imaging requires that incoming electromagnetic radiation is recorded as a function of wavelength. This requires optical devices that either refract light or filter the desired wavebands for the sensor cells. The first spectral imagers were single-point spectrometers (whiskbrooms), which were then moved in the spatial directions to form an image. A more advanced version is the pushbroom scanner, which scans a line. A pushbroom scanner uses optics to refract light so that the spectra of the line are recorded on the sensor cell. This type of imager is suitable for conveyor lines or remote sensing, where long imaging lines are taken from an aeroplane or a satellite (Qian 2021; Porter and Enmark 1987). In addition to scanner-like spectral imagers, there are spectral imagers that use filters. The spectral separation can be done with filter discs with separate optical filters for each desired wavelength. The disc is moved between wavebands (Harvey et al. 2005). Imaging hundreds of bands would require a massive filter holder; otherwise, only a limited number of bands can be captured. Filter-based imagers include those using a liquid crystal tunable filter (Glenar et al. 1994), an acousto-optic tunable filter (Hardeberg et al. 2002), and various interferometers. All of these have adjustable filters. For example, a piezo-actuated spectral imager based on a Fabry-Pérot interferometer adjusts two metal mirrors (Saari et al. 2013). The wavelength to be imaged, and its multiples, depend on the distance between the mirrors. In addition to these, spectral images can be taken in a snapshot. In this case, a diffraction or diffusion filter is placed in front of the imager optics, spreading the wavelengths to different parts of the imager cell. The actual spectral image is then constructed computationally (Hauser et al. 2017; Toivonen et al. 2021).
The spectral imager alone does not tell or describe much about an imaged object. Imagers can produce data in a continuous stream, but data as such is not information. At this point, we need tools to extract processed information from the data. These tools are computational methods based on applied mathematics and numerical methods. In spectral imaging, it is noteworthy that the size of the data quickly becomes massive. Thus, special attention must be paid to the computational efficiency and memory usage of the methods used. In the following sections, we will outline the computational tools that are commonly used in spectral image processing. We can note that it is possible to consider a wide variety of approaches in terms of image processing and analytics. Methods can use only the spectral domain or only the spatial domain, and some use the spectral-spatial domain simultaneously.

2 Noise Reduction

First, we formulate the spectral image. Let $X := [x_1, \ldots, x_n] \in \mathbb{R}^{z \times n}$ be a spectral image with n spectral vectors of size z. Here, the spectral data forms a data matrix that contains n pixels. By arranging the matrix to correspond to the original spatial dimensions, we get the spectra in the form of a spectral cube (tensor) $\mathcal{X} \in \mathbb{R}^{x \times y \times z}$, where x and y are the spatial dimensions.

There is always some noise in spectral images (Rasti et al. 2018). Because the light is divided into small portions in spectral imaging, the number of photons reaching the sensor can be minimal. In this case, Poisson noise is formed in the image. Warming of the devices during imaging may cause Gaussian noise in the sensor. Because we convert a continuous electromagnetic signal into a discrete one, quantization noise can be formed in the image. In addition, the cell itself may have hot pixels that cause, for example, streaks in the images. When imaging through a medium such as the atmosphere, various interferences may occur in the signal. Thus, noise is involved in the recorded spectral images. We can describe such a spectral image with a simple model

$$Y = X + N,$$


where $Y, N \in \mathbb{R}^{z \times n}$ are the measured spectral image and the noise in it, respectively. Now we are interested in estimating the noise-free spectral image X. Several gradient-based methods have been developed to remove noise from images. These estimate a noise-free image by minimizing a cost function consisting of the gradient norm and an appropriate regularization. The best known of these is probably the Total Variation (TV) method of Rudin et al. (1992). Let $u(x, y)$ and $f(x, y)$ be single bands from the noise-free and measured spectral images. Then TV noise reduction can be expressed as a minimization problem

$$\min_{u} \int_{\Omega} \|\nabla u\| \, dx \, dy + \lambda \|u - f\|,$$


where $\Omega$ is the domain where both images exist, $\lambda$ is the Lagrange multiplier, and $\|\cdot\|$ is the Euclidean norm. Minimization can be solved here using any global optimization method. A computationally efficient way is to use the split-Bregman algorithm, which is designed for convex optimization problems involving regularization. Using this 2D TV algorithm, noise reduction has to be done separately for each waveband. However, we do not need to be limited to the spatial domain, as noise removal can also be done in the spectral domain. Let x and y be noise-free and measured spectra. Then the 1D TV would have the following form:

$$\min_{x} \int \|\nabla x\| \, dx + \lambda \|x - y\|.$$
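A discrete version of the 1D problem can be tried out with a few lines of NumPy. The sketch below is not the split-Bregman algorithm mentioned above. It is a plain gradient descent on a smoothed variant of the cost, $\sum_i \sqrt{(x_{i+1} - x_i)^2 + \epsilon^2} + \frac{\lambda}{2} \|x - y\|_2^2$, where the squared fidelity term, the smoothing constant `eps`, the parameter values, and the synthetic test signal are all assumptions made for illustration.

```python
import numpy as np

def tv_objective(x, y, lam, eps):
    """Smoothed total variation plus squared fidelity to the measurement y."""
    return np.sum(np.sqrt(np.diff(x) ** 2 + eps ** 2)) + 0.5 * lam * np.sum((x - y) ** 2)

def tv_denoise_1d(y, lam=2.0, eps=0.05, n_iter=2000):
    """Gradient descent on the smoothed 1D TV cost, started from y."""
    x = y.copy()
    # 4 / eps + lam bounds the Lipschitz constant of the gradient,
    # so this step size cannot increase the cost.
    step = 1.0 / (4.0 / eps + lam)
    for _ in range(n_iter):
        dx = np.diff(x)
        w = dx / np.sqrt(dx ** 2 + eps ** 2)   # derivative of the smoothed |.|
        grad_tv = np.zeros_like(x)
        grad_tv[:-1] -= w
        grad_tv[1:] += w
        x = x - step * (grad_tv + lam * (x - y))
    return x

# Synthetic piecewise-constant "spectrum" with additive Gaussian noise.
rng = np.random.default_rng(0)
clean = np.concatenate([np.zeros(30), np.ones(40), 0.4 * np.ones(30)])
noisy = clean + 0.1 * rng.standard_normal(clean.size)
x_hat = tv_denoise_1d(noisy, lam=2.0, eps=0.05)
```

Larger values of `lam` keep the result closer to the measurement, while smaller values favor flatter spectra. For real data, a dedicated solver such as split-Bregman would normally be preferred over this simple iteration.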


We can also think of using both the spectral and spatial domains simultaneously by using 3D TV, where the optimization problem is

$$\min_{X} \int_{\Omega} \|\nabla X\| \, dx \, dy \, dz + \lambda \|X - Y\|.$$




Fig. 2 Example of 3D TV denoising for a hyperspectral image: Original noisy waveband from the Indian Pines data set (Baumgardner et al. 2015) (top left), Same waveband after 3D TV denoising (top right), Comparison of a single spectrum from the same image (bottom)

It is clear that in these approaches the noise N would be a difference between noisefree and measured spectral image. We can see example results of 3D TV noise reduction in Fig. 2. There are areas in natural images that are similar to each other. These parts in the image either repeat a pattern or color. In these cases, patch-based image restoration methods can be utilized. These methods can be considered as state-of-the-art methods for noise reduction. Methods such as Block-Matching and 3D filtering (BM3D) (Danielyan et al. 2009) and directional wavelet packets (Averbuch et al. 2019) can effectively remove Poisson and Gaussian noise from images. In spectral images, individual bands can have a very high correlation in the spectral domain. Spectra are actually in a low-dimensional subspace or manifold of Rn . Thus, we can utilize also sparse representation methods to remove noise from images. Assume that the spectrum xi is located in a subspace whose dimension k is much smaller than the original dimension d of the spectrum. Then we can assume that there exists a subspace base E ∈ Rd×k and a coefficient matrix C ∈ Rk×n . Noise is living slightly off this subspace. In this case, the noise-free spectral image


I. Pölönen

$$X = EC.$$


Because $E$ is an orthogonal basis and $N$ is assumed to be normally distributed (Gaussian and Poisson noise), $E^T N \approx 0$. By the Singular Value Decomposition (SVD), $Y = U \Sigma V^T$, where $U$ and $V^T$ are bases for the subspace and the columns of $U$ are the eigenvectors of the matrix $Y Y^T$. Now, using only the $k$ largest singular values $\tilde{\Sigma}$ and the corresponding vectors $\tilde{U}$, $\tilde{V}$, we can select $E = \tilde{U}$, $C = \tilde{\Sigma} \tilde{V}^T$. The main challenge in this approach is to find the right basis for the subspace. Here, the SVD does not take into account the spatial domain, which also carries significant information. There exist, for example, wavelet-based approaches where both the spatial and spectral domains are taken into account by forming separate bases for the domains (Rasti et al. 2012).

Noise removal is one of the essential preprocessing tasks for spectral images before the actual analysis. A few ways have been outlined above, but there are many more. In particular, different deep learning-based approaches have received much interest recently (Zhang et al. 2020; Luo et al. 2021).

In addition to noise reduction, other types of correction may also be applied to spectral images before the actual analysis. For example, inpainting is necessary if there are many dead pixels in the imager (Zhuang and Bioucas-Dias 2018). In satellite images, different wavelengths may have been collected by imagers with different spatial resolutions, in which case superresolution methods are used to match the resolutions of the wavelength channels. Superresolution is also used to improve the resolution of spectral images as such. By taking a high-resolution image with a panchromatic camera and recording spectral data at a coarser spatial resolution, it is possible to enhance the spatial resolution of the spectral image using pansharpening (Loncan et al. 2015; Akgun et al. 2005).
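The subspace denoising by truncated SVD described above can be sketched as follows. The data here are synthetic and the subspace dimension `k` is assumed to be known; the function name `svd_denoise` is hypothetical, and the sketch ignores the spatial domain just as the plain SVD approach does.

```python
import numpy as np

def svd_denoise(Y, k):
    """Project the d x n data matrix Y onto its k leading left singular vectors."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    E = U[:, :k]                     # d x k orthonormal basis of the subspace
    C = s[:k, None] * Vt[:k, :]      # k x n coefficient matrix
    return E @ C, E, C

# Synthetic test case: spectra lying in a k-dimensional subspace plus noise.
rng = np.random.default_rng(1)
d, n, k = 50, 400, 3
E_true, _ = np.linalg.qr(rng.standard_normal((d, k)))
X_clean = E_true @ rng.standard_normal((k, n))
Y = X_clean + 0.1 * rng.standard_normal((d, n))

X_hat, E, C = svd_denoise(Y, k)
err_noisy = np.linalg.norm(Y - X_clean)
err_denoised = np.linalg.norm(X_hat - X_clean)
```

In practice `k` is not known and has to be estimated, for example from the decay of the singular values.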

3 Model-Based Analysis

Usually several substances are present in a spectral image, each with its own unique spectrum. The shape of the spectrum depends mainly on the properties of the substance. Typical questions are how to identify these substances from the data and how to determine their relative occurrence. Occurrences may be small, and a substance may cover less than one pixel. In this case, we speak of sub-pixel detection (Bioucas-Dias et al. 2012).

We need a mathematical model that describes the presence of substances in spectral data. With the help of the model, we can start to find out what the image consists of and how these substances occur in the image. We can assume that a certain number of different substances with different spectra are present in the spectral image, and that the spectral image consists of combinations of these spectra. The simplest model to describe this situation is the linear mixture model


$$Y = MA + N,$$



where $Y \in \mathbb{R}^{d \times n}$, the columns of $M \in \mathbb{R}^{d \times p}$ are the pure spectra of $p$ substances, and $A \in \mathbb{R}^{p \times n}$ contains their relative occurrences in the data. In the literature the pure spectra $M$ are often called endmembers and $A$ the abundances of the endmembers in the data. The noise $N$ can be reduced with the techniques presented in Sect. 2 (Adams et al. 1986; Keshava and Mustard 2002). The problem in this form is ill-posed: we know neither $M$ nor $A$, so both have to be determined. This is known as spectral unmixing (Bioucas-Dias et al. 2012). Usually, we start by looking for the endmembers $M$ in the data. The matrix $M$ is non-negative, so that each $m_{i,j} \geq 0$.

In searching for endmembers in spectral data, we rely on either statistical or geometry-based methods. Geometric methods are intuitive to understand. The geometric approach relies on finding the data points that are vertices of the convex hull of the whole data set; these points are the endmembers. With geometry-based methods the number of endmembers to seek must be known in advance. The background assumption is that each endmember appears pure in at least one pixel. The traditional way to find these endmembers is the Pixel Purity Index (PPI) method (Chang and Plaza 2006), where random vectors are drawn in the same space as the data. Each data point is then projected onto the line spanned by a random vector. The points that are furthest apart on the line are given a +1 value in the index. By repeating this long enough, the endmembers will get the highest index values. PPI is computationally expensive because many iterations are needed, and the projections of all spectra onto the line have to be calculated in each round. A more computationally efficient method is Vertex Component Analysis (VCA) (Nascimento and Dias 2005), which utilizes orthogonal projections, affine transformations, and convex set theory.
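The PPI procedure described above can be sketched in a few lines. This is a simplified illustration on noise-free synthetic data, not the fast iterative variant of Chang and Plaza; the function name `pixel_purity_index`, the number of random vectors, and the test data are all assumptions of this example.

```python
import numpy as np

def pixel_purity_index(Y, n_skewers=2000, seed=0):
    """Count how often each column (pixel spectrum) of Y is an extreme projection."""
    rng = np.random.default_rng(seed)
    d, n = Y.shape
    counts = np.zeros(n, dtype=int)
    for _ in range(n_skewers):
        v = rng.standard_normal(d)       # random direction in spectral space
        proj = v @ Y                     # position of every pixel along the line
        counts[np.argmin(proj)] += 1     # the two points furthest apart get +1
        counts[np.argmax(proj)] += 1
    return counts

# Synthetic linear mixtures in which pixels 0..p-1 are pure endmembers.
rng = np.random.default_rng(2)
d, p, n = 20, 3, 500
M = rng.random((d, p))                         # synthetic endmember spectra
A = rng.dirichlet(np.ones(p), size=n).T        # p x n mixing weights
A[:, :p] = np.eye(p)                           # make the first p pixels pure
Y = M @ A

counts = pixel_purity_index(Y)
top = np.argsort(counts)[-p:]                  # p pixels with the highest index
```

Because a linear function attains its extremes at the vertices of a convex hull, only the pure pixels accumulate counts in this noise-free setting. With noisy data the counts spread over nearby pixels and a threshold is needed.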
In VCA, we first search for the signal subspace using either SVD or Principal Component Analysis (PCA). The spectrum with the highest value along the singular vector or principal component with the highest eigenvalue is then selected and fixed as the first endmember. For this point, an affine transformation is performed, and next, the spectrum with the highest value in the direction orthogonal to the first endmember is selected. The affine transformation is done for these two points, and the procedure is continued until all $p$ endmembers have been found. In Fig. 3, ten endmembers have been extracted from the Indian Pines data set. (The actual number of endmembers in the image is higher.)

Once $M$ is known, $A$ must be determined. We want to solve the optimization problem

$$\min_A \|Y - MA\|. \tag{7}$$

If the endmembers $M$ are sufficiently different from each other, then ordinary least squares produces the solution

$$A = (M^T M)^{-1} M^T Y.$$
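As a minimal sketch of this unconstrained abundance estimate, the following example evaluates the normal-equation formula on synthetic, noise-free data and compares it with `numpy.linalg.lstsq`, which solves the same least squares problem without forming the inverse explicitly. The data and dimensions are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(3)
d, p, n = 30, 4, 200
M = rng.random((d, p))                          # d x p endmember spectra (synthetic)
A_true = rng.dirichlet(np.ones(p), size=n).T    # p x n abundances
Y = M @ A_true                                  # noise-free mixtures for clarity

# Closed-form solution from the text: A = (M^T M)^{-1} M^T Y.
A_normal = np.linalg.inv(M.T @ M) @ M.T @ Y

# Numerically preferable equivalent based on an SVD solver.
A_lstsq, *_ = np.linalg.lstsq(M, Y, rcond=None)
```

When the endmembers are nearly collinear, $M^T M$ becomes ill-conditioned and the explicit inverse loses accuracy, which is one reason to prefer a dedicated least squares solver.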



Fig. 3 Spectral endmembers found using VCA from the Indian Pines data set

However, this is not always the case. If we consider the nature of the occurrence of spectra, then every $a_{i,j} \in A$ should satisfy

$$a_{i,j} \geq 0, \qquad \sum_{i=1}^{p} a_{i,j} = 1.$$


We would like to constrain the values of $A$ (Keshava and Mustard 2002). The non-negativity condition can be handled by adding a regularization term to the optimization problem, such as Tikhonov regularization (Golub et al. 1999)

$$\min_A \|Y - MA\| + \lambda \|A\|,$$


where $\lambda$ is the Lagrange multiplier, or by using a non-negative least squares solver. If both constraints are to be fulfilled, then we should use the fully constrained least squares method. In Fig. 4 the fully constrained least squares approach has been used to calculate the occurrence of the endmember spectra from Fig. 3. We can see that some endmembers correspond relatively well with the ground truth data.

The scattering of light in a medium is a physical phenomenon. This phenomenon can be modelled mathematically and thus also numerically. A spectrum or spectral image is a recording of this behavior of electromagnetic radiation in an object. The models used can be a rough approximation of reality or very detailed and accurate. Generally, accurate models are computationally heavy. Physical models need some background information on how electromagnetic radiation behaves in the medium (Hapke 2012).
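For the constrained abundance estimation discussed above, a minimal sketch using `scipy.optimize.nnls` is given below. The sum-to-one constraint is imposed approximately by appending a heavily weighted row of ones to $M$, a common device for fully constrained least squares; this choice, the weight `delta`, and the helper name `fcls` are assumptions of this example rather than the method used for Fig. 4.

```python
import numpy as np
from scipy.optimize import nnls

def fcls(M, y, delta=1e3):
    """Non-negative, approximately sum-to-one abundances for one spectrum y."""
    p = M.shape[1]
    # The extra row delta * [1, ..., 1] with target delta penalizes sum(a) != 1.
    M_aug = np.vstack([M, delta * np.ones((1, p))])
    y_aug = np.append(y, delta)
    a, _ = nnls(M_aug, y_aug)
    return a

rng = np.random.default_rng(4)
d, p = 30, 4
M = rng.random((d, p))                         # synthetic endmember spectra
a_true = np.array([0.6, 0.4, 0.0, 0.0])        # two endmembers are absent
y = M @ a_true + 0.01 * rng.standard_normal(d)

a_nn, _ = nnls(M, y)                           # non-negativity only
a_fc = fcls(M, y)                              # non-negativity and sum-to-one
```

For a whole image the solver is applied pixel by pixel, so the cost grows linearly with the number of pixels.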



Fig. 4 Example of hyperspectral unmixing: false color image of the original Indian Pines data (Baumgardner et al. 2015) (top left), ground truth labels for the different classes (2nd row left), and abundance maps of certain endmembers in the data (the rest)



The spectral sensitivity range of the sensor affects what kind of model can be used. Let us first consider a situation with an object that partially transmits light. We ignore possible scattering and consider only the light transmission

$$T = \frac{I}{I_0},$$


where $I$ is the flux coming through the sample and $I_0$ is the flux of the light source. The absorption $A$ of the object is obtained from the transmission as $T = e^{-A} \Leftrightarrow A = -\ln T$. According to the Beer law (Calloway 1997), $A = \varepsilon l c$, where $\varepsilon$ is the absorptivity of the target, $l$ the thickness of the sample, and $c$ the concentration. If the sample is a mixture of $k$ substances, then

$$A = \sum_{i=1}^{k} \varepsilon_i c_i l. \tag{10}$$
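When the absorptivities $\varepsilon_i$ are known at several wavelengths and the thickness $l$ is known, Eq. (10) is a linear system in the concentrations $c_i$. The sketch below simulates a transmission spectrum and recovers the concentrations by least squares; the absorptivity values are random placeholders, not measured data.

```python
import numpy as np

rng = np.random.default_rng(5)
n_bands, k = 40, 3
eps = rng.random((n_bands, k)) + 0.1     # hypothetical absorptivities per band and substance
c_true = np.array([0.2, 0.5, 0.3])       # concentrations of the k substances
l = 1.5                                  # sample thickness

# Forward model: A = sum_i eps_i * c_i * l per band, and T = I / I0 = exp(-A).
absorption = (l * eps) @ c_true
T = np.exp(-absorption)

# Inversion: A = -ln T, then solve the linear system (l * eps) c = A.
A_meas = -np.log(T)
c_hat, *_ = np.linalg.lstsq(l * eps, A_meas, rcond=None)
```

With at least as many wavebands as substances and sufficiently different absorptivity spectra, the system has a unique least squares solution.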

Assuming that the absorptions of the different substances and the thickness of the whole sample are known, the relative presence of each substance can be trivially solved from (10).

The imaging setup also has an impact on the mathematical model that we use to describe the behavior of light. For example, the commonly used reflectance imaging is much more complex to model than transmission. The reflection of light is affected by how the object scatters and absorbs light. Modelling light propagation even in one spatial dimension alone leads to a more complex differential equation model. Assume that $I$ is the flux reflected from the object and $J$ is the flux radiating to the object. In addition, assume that $K$ is the absorption of the object and $S$ is the scattering of the object. All of these are functions of wavelength. The Kubelka-Munk theory (Yang 2004) presents the change using two differential equations such that

$$\frac{dI}{dx} = -(K + S) I + S J, \tag{11}$$

$$\frac{dJ}{dx} = -(K + S) J + S I. \tag{12}$$

Equation (11) describes the change in the reflected flux due to absorption and scattering, and Eq. (12) the corresponding change in the radiant flux, both as functions of thickness. Depending on the boundary conditions, several different analytical solutions exist that can be used to determine the reflectance $R$ of an object from the above equations.

Since we now have a model for the data from the spectral imager, we can utilize it to characterize the imaged object. The target may consist of different substances, and the substances may be layered. In this case, we can utilize the model



to solve the inversion problem, which is to determine, for example, the ratio of these compositions and the thicknesses of the different layers. The inversion can be solved as an optimal control task, which can be computationally heavy depending on the situation; at the least, it is impractical for on-line measurement and monitoring. One possibility is to let the model generate a large number of reflectance spectra with different parameterizations. A machine learning method is then trained to approximate the model parameters from the spectral data, using the modelled spectra and their parameters as training data (Annala et al. 2020; Annala and Pölönen 2022).

However, we do not live in a one-dimensional world: electromagnetic radiation actually propagates in space-time. Models that take all dimensions into account are generally computationally heavy. They can be roughly divided into stochastic and deterministic models. In stochastic models, modelling is performed by utilizing the transition probabilities of photons in the computational lattice (Maier et al. 1999). From a modelling point of view, these are laborious because the final result requires calculations for a large number of photons. Deterministic models operating in space-time are not known for their computational efficiency. In these, the problem is often the size of the lattice and the number of degrees of freedom of the modelling problem (Räbinä et al. 2015). The things to be modelled, such as a living tree, are on a completely different scale than where modelling with deterministic models takes place. The lattice density in the different computational models is such that when the wavelength is 400 nm, lattice points should be placed at least every 40 nm. With a tree size of $10^1$–$10^2$ m and a modelling scale of $10^{-9}$ m, modelling an entire tree deterministically is still a long way off. Thus, research has mainly focused on how deterministic models can be used to refine meso-level models. On the other hand, there is also an interest in how computationally heavy models could be approximated to be useful in building both meso-level models and inverse models (Erkkilä et al. 2022).
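As a rough worked estimate of the scale gap described above, assume (as a simplification of this example) a uniform cubic lattice with spacing $h = 40$ nm filling a cube whose edge length $L$ equals the tree size:

$$N_{1\mathrm{D}} = \frac{L}{h} = \frac{10\ \mathrm{m}}{40 \times 10^{-9}\ \mathrm{m}} = 2.5 \times 10^{8}, \qquad N_{3\mathrm{D}} = N_{1\mathrm{D}}^{3} \approx 1.6 \times 10^{25}.$$

For $L = 100$ m the count grows to about $1.6 \times 10^{28}$ lattice points, before any time stepping or multiple field components are considered.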

4 Machine Learning

In the spectral imaging community, artificial intelligence is commonly referred to as machine learning methods. A wide variety of methods have been developed and used in various spectral imaging applications. The most widely used methodology is supervised learning, where spectral data and the corresponding ground truth are given to the machine learning training process. Classification tasks of this kind are currently very common in research. In 2020 almost 30,000 research papers were published on this topic (Ahmad et al. 2021). Unsupervised, semi-supervised and various reinforcement learning methodologies are also commonly used approaches.

Machine learning plays an important role in spectral imaging. Interpretation of spectral images requires processing, and machine learning provides an



effective tool here. Machine learning has, roughly speaking, been used since the early days of spectral imaging. In particular, measure- and threshold-based classifiers were developed. The Spectral Angle Mapper (SAM) (Kruse et al. 1993) is still widely used. It is defined as the inverse cosine of the normalized inner product

$$\mathrm{SAM} = \arccos\left(\frac{x \cdot y_i}{\|x\| \, \|y_i\|}\right),$$


where $x$ is the detected spectrum and $y_i$ is the reference spectrum of class $i$. Classification is done by setting a threshold for each class separately. SAM is invariant to intensity changes of the spectrum. If the target is in shadow or in bright daylight, the only difference in the spectrum should be in the intensity.

Unfortunately, the behavior of light is not that simple. The imaged objects are rarely so homogeneous that a single reference spectrum could be formed for them. In general, there is variation in the scattering properties of the imaged object, which comes from several things (structure, composition, the shape of the surface, angle of illumination, etc.). Thus, more complex machine learning models are required to cope with classification and clustering tasks.

Because the spectral imaging community has closely followed machine learning trends during its existence, new methods have typically been tested with spectral data quickly. Classification has been done with practically all available techniques (k-Nearest Neighbors (kNN) (Huang et al. 2016), decision tree (Velásquez et al. 2017), random forest (Ham et al. 2005), support vector machine (Gualtieri and Chettri 2000), Bayes (Xu and Li 2014), minimal learning machine (Hakola and Pölönen 2020), artificial neural network (Audebert et al. 2019), etc.). In suitable applications, it is justified to use only the spectral domain for the classification. But because spectral images are images, the spatial domain holds information that can improve classification accuracy. Example results of classification with kNN (using cosine distance), Support Vector Machine (SVM) with a radial basis kernel, random forest and a multi-layer perceptron neural network can be seen in Fig. 5, where a part of the Indian Pines data set has been used for training. Here kNN and SVM outperform random forest and the neural network in accuracy, but with carefully selected feature extraction methods the results can change drastically.
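The SAM classifier described above, with one threshold per class, can be sketched as follows. The reference spectra, thresholds, and the convention of returning -1 for an unclassified spectrum are illustrative assumptions of this example.

```python
import numpy as np

def spectral_angle(x, refs):
    """Angle in radians between spectrum x and each row of refs."""
    cos = refs @ x / (np.linalg.norm(refs, axis=1) * np.linalg.norm(x))
    return np.arccos(np.clip(cos, -1.0, 1.0))    # clip guards against rounding

def sam_classify(x, refs, thresholds):
    """Index of the closest class, or -1 if its angle exceeds that class's threshold."""
    ang = spectral_angle(x, refs)
    i = int(np.argmin(ang))
    return i if ang[i] <= thresholds[i] else -1

refs = np.array([[0.1, 0.4, 0.8, 0.6],      # hypothetical reference spectra, one per class
                 [0.7, 0.5, 0.2, 0.1]])
thresholds = np.array([0.1, 0.1])            # radians, chosen for illustration
x = np.array([0.12, 0.41, 0.79, 0.58])       # detected spectrum, close to class 0

label_bright = sam_classify(2.5 * x, refs, thresholds)   # same material, more light
label_shadow = sam_classify(0.2 * x, refs, thresholds)   # same material, in shadow
label_other = sam_classify(np.array([1.0, 0.0, 0.0, 0.0]), refs, thresholds)
```

Scaling the spectrum changes its length but not its direction, which is why the bright and shadowed versions receive the same label.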
There are many different ways to extract features from spectral images. The spectral domain alone has been handled with many methods known from signal processing (wavelets (Bruce et al. 2002), wavelet packets (Zheludev et al. 2015), Fourier transforms (Garcia Salgado and Ponomaryov 2015), first- and second-order derivatives (Tsai and Philpot 1998), etc.). Spatial features have been extracted with methods familiar from machine vision, such as histograms of oriented gradients (Chen et al. 2022) or local binary pattern (Li et al. 2015) type methods. Since a single spectrum itself lies in a high-dimensional space, adding information about its neighborhood further increases the dimension of the feature space.

There is a need to respond to the curse of dimensionality in the processing of spectral images. Many distance-based classifiers lose their performance when the data points are located in a high-dimensional space (Gualtieri and Chettri 2000). Another problem



Fig. 5 Example of hyperspectral classification using only spectral domain

is encountered with the training data of the classifiers. The number of data points for each category should be roughly five times the number of dimensions, which means that without further processing we run out of annotated data in many applications. In practice, this problem has been addressed by utilizing dimension reduction methods (Khodr and Younes 2011). Without further consideration of the nature of the spectral data, various linear methods have been used due to their computational efficiency. Principal component analysis (Datta et al. 2018) and random projections (Du et al. 2011) have been used extensively.

The more different features are included in the process, the more nonlinear the data becomes. It can be assumed that the data is located on a manifold, a multi-dimensional surface in space. Manifold learning methods have been used with nonlinear data (Lunga et al. 2014). These methods examine the data in local neighborhoods and form a graph from which the reduced dimension is retrieved using a singular value decomposition. Such methods include, for example, Isomap (Guangjun et al. 2007), Laplacian eigenmaps (Yan and Niu 2014), and diffusion maps (Zheludev et al. 2015). Nonlinear data can also be processed by utilizing the kernel trick, where a suitable kernel replaces the dot product. Here the aim is to make the data linearly separable by implicitly adding dimensions within the training function. This has been used widely with the support vector machine (Camps-Valls and Bruzzone 2005).

The last decade of machine learning with spectral images has been marked by an increase in the computing power of computers and an increase in the amount of data. This has enabled entirely new ways of doing classification. A large number of articles on the classification of spectral images are published annually. Many of these articles deal with deep learning neural networks in different applications (Ahmad et al. 2021).
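The linear dimension reduction mentioned above can be sketched with PCA computed through the SVD. The example uses synthetic data with a known low-dimensional structure; the function name `pca_reduce` and the choice of `k` are assumptions of this example.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X (n samples x d bands) onto the k leading principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean                                   # center each band
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                             # k x d principal directions
    explained = (s[:k] ** 2).sum() / (s ** 2).sum() # fraction of variance retained
    return Xc @ components.T, components, explained

# Synthetic spectra: 100 bands driven by 5 latent factors plus small noise.
rng = np.random.default_rng(6)
n, d, k = 300, 100, 5
latent = rng.standard_normal((n, k))
X = latent @ rng.standard_normal((k, d)) + 0.01 * rng.standard_normal((n, d))

Z, components, explained = pca_reduce(X, k)
```

With the rule of thumb of roughly five training points per dimension, reducing 100 bands to 5 features lowers the required number of annotated samples per category from about 500 to about 25.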



The extraction of features in these approaches is left to the neural networks. Relying on brute force, the convolution layers in the neural networks do the feature extraction automatically. These methods can be used to search for features in the spectral domain, in the spatial domain, or in both simultaneously. The methods have proven to be very effective. The downside of this effectiveness is the growing need for training data. Attempts have been made to compensate for this computationally by augmenting the data. One interesting starting point is the simulation of data with different radiometric models (Erkkilä et al. 2022; Annala et al. 2020). In this case, training data can be generated by modifying the simulation parameters.

5 Applications

We can use different computational methods to interpret spectral imaging data, which implies that there exists a wide variety of applications. In spatial scale, applications range from sub-wavelength measurements to galaxies. We can use different models to interpret the data or use machine learning methods to classify, cluster and estimate phenomena from the spectral data. Spectral imaging can be applied to any task where visual inspection is needed. In applications involving physics, chemistry or biology, spectral imaging with proper computational methods can generate better understanding and results.

The origins of spectral imaging are in remote sensing. Spectral imagers are used on satellites, planes and drones. For example, biomass (Honkavaara et al. 2013) and nitrogen content can be estimated from a spectral image using a relatively simple nearest neighbor regression approach (Pölönen et al. 2013). In silviculture, tree species identification is one of the most important things to gain from spectral images. Compared to any other sensor type, spectral imaging data has proven outstanding in distinguishing tree species (Nevalainen et al. 2017). For example, studies have shown that novel convolutional neural network approaches can classify the main Nordic tree species very accurately (Pölönen et al. 2018; Nezami et al. 2020). The advantage of spectral imaging becomes apparent when there are significantly more tree species (Näsi et al. 2016). The family and species of several different trees can be distinguished, which is especially important in measuring biodiversity.

One particularly interesting imaging target is algae, which can be observed with a spectral imager at macro and micro levels. In macro-level observation there is a particular interest in water quality. Erkkilä et al. (2017) use regression modelling to estimate water quality parameters, while Hakala et al. (2020) compare several convolutional neural network architectures in regression of the same parameters. Spectral imaging data is also used in algae cultivation monitoring. At the micro level, species identification and the estimation of biomass are of interest. For example, Salmi et al. (2021) use a simple linear model to estimate biomass.

Medical imaging is constantly looking for new opportunities for non-invasive research methods. Spectral imaging is one answer to applications where the object



being imaged can be seen. These include skin cancers (Neittaanmäki et al. 2017), teeth (Hyttinen 2021) and various surgeries (Lu and Fei 2014). In the case of cancers and tumors, typical uses are tumor identification and delineation, so that the tumor can be removed with as small a margin as possible (Salmivuori et al. 2019). For example, in the case of melanoma, a simple linear model is sufficient for delineation (Neittaanmäki-Perttu et al. 2015). Identification and delineation can also be based on a machine learning model. For example, in Räsänen et al. (2021), convolutional neural networks deliver both results simultaneously.

Several different spectral imagers currently observe the Earth. They can be found on the International Space Station and on satellites. One rising trend is to put miniaturized spectral imagers on cube satellites. The cost of these is a fraction of that of large satellites. Several companies are currently building sensor networks of spectral imagers around the globe. For example, the Finnish company Kuva Space is implementing this kind of satellite network. The challenge for spectral imaging in space is the energy and data transfer budget, which requires efficient computational methods for processing spectra (Wolfmayr et al. 2021; Lind et al. 2021).

6 Concluding Remarks

The increased power of computers, advanced computational methods, and new sensors such as spectral imagers shape our ability to perceive and understand the world. They offer new kinds of opportunities for different applications and multidisciplinary research. However, this would not be possible without people bringing ideas and skills together. Professor Pekka Neittaanmäki has been a significant background influencer in bringing together mathematicians, engineers and medical doctors to brainstorm the possibilities of spectral imaging in new research areas and applications.

References

Adams JB, Smith MO, Johnson PE (1986) Spectral mixture modeling: a new analysis of rock and soil types at the Viking Lander 1 site. J Geoph Res: Solid Earth 91(B8):8098–8112
Ahmad M, Shabbir S, Roy SK, Hong D, Wu, Yao J, Khan AM, Mazzara M, Distefano S, Chanussot J (2021) Hyperspectral image classification-traditional to deep models: a survey for future prospects. arXiv:2101.06116
Akgun T, Altunbasak Y, Mersereau RM (2005) Super-resolution reconstruction of hyperspectral images. IEEE Trans Image Proc 14(11):1860–1875
Annala L, Äyrämö S, Pölönen I (2020) Comparison of machine learning methods in stochastic skin optical model inversion. Appl Sci 10(20):7097
Annala L, Neittaanmäki N, Paoli J, Zaar O, Pölönen I (2020) Generating hyperspectral skin cancer imagery using generative adversarial neural network. In: 2020 42nd annual international conference of the IEEE engineering in medicine & biology society (EMBC). IEEE, pp 1600–1603



Annala L, Pölönen I (2022) Kubelka–Munk model and stochastic model comparison in skin physical parameter retrieval. In: Computational sciences and artificial intelligence in industry. Springer, pp 137–151
Audebert N, Le Saux B, Lefèvre S (2019) Deep learning for classification of hyperspectral data: a comparative review. IEEE Geosci Remote Sens Mag 7(2):159–173
Averbuch AZ, Neittaanmäki P, Zheludev VA (2019) Analytic and directional wavelet packets. In: 2019 13th international conference on sampling theory and applications (SampTA). IEEE, pp 1–4
Baumgardner MF, Biehl LL, Landgrebe DA (2015) 220 band AVIRIS hyperspectral image data set: June 12, 1992 Indian Pine test site 3. Purdue University Research Repository (PURR). https://
Bioucas-Dias JM, Plaza A, Dobigeon N, Parente M, Du Q, Gader P, Chanussot J (2012) Hyperspectral unmixing overview: geometrical, statistical, and sparse regression-based approaches. IEEE J Sel Top Appl Earth Obser Remote Sens 5(2):354–379
Bruce LM, Koger CH, Li J (2002) Dimensionality reduction of hyperspectral data using discrete wavelet transform feature extraction. IEEE Trans Geosci Remote Sens 40(10):2331–2338
Calloway D (1997) Beer-Lambert law. J Chem Educ 74(7):744
Camps-Valls G, Bruzzone L (2005) Kernel-based methods for hyperspectral image classification. IEEE Trans Geosc Remote Sens 43(6):1351–1362
Chang C-I, Plaza A (2006) A fast iterative algorithm for implementation of pixel purity index. IEEE Geosci Remote Sens Lett 3(1):63–67
Chen GY, Krzyzak A, Xie WF (2022) Hyperspectral face recognition with histogram of oriented gradient features and collaborative representation-based classifier. Multimed Tools Appl 81:2299–2310
Danielyan A, Vehviläinen M, Foi A, Katkovnik V, Egiazarian K (2009) Cross-color BM3D filtering of noisy raw data. In: 2009 international workshop on local and non-local approximation in image processing. IEEE, pp 125–129
Datta A, Ghosh S, Ghosh A (2018) PCA, kernel PCA and dimensionality reduction in hyperspectral images. In: Advances in principal component analysis. Springer, pp 19–46
Du Q, Fowler JE, Ma B (2011) Random-projection-based dimensionality reduction and decision fusion for hyperspectral target detection. In: 2011 IEEE international geoscience and remote sensing symposium. IEEE, pp 1790–1793
ElMasry G, Sun D-W (2010) Principles of hyperspectral imaging technology. In: Hyperspectral imaging for food quality analysis and control. Elsevier, pp 3–43
Erkkilä A-L, Pölönen I, Lindfors A, Honkavaara E, Nurminen K, Näsi R (2017) Choosing of optimal reference samples for boreal lake Chlorophyll a concentration modeling using aerial hyperspectral data. In: Frontiers in spectral imaging and 3D technologies for geospatial solutions, volume XLII-3/W3 of International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences. International Society for Photogrammetry and Remote Sensing, pp 39–46
Erkkilä A-L, Räbinä J, Pölönen I, Sajavaara T, Alakoski E, Tuovinen T (2022) Using wave propagation simulations and convolutional neural networks to retrieve thin film thickness from hyperspectral images. In: Computational sciences and artificial intelligence in industry. Springer, pp 261–275
Garcia Salgado BP, Ponomaryov V (2015) Feature extraction-selection scheme for hyperspectral image classification using Fourier transform and Jeffries-Matusita distance. In: Advances in artificial intelligence and its applications, MICAI 2015. Springer, pp 337–348
Garini Y, Young IT, McNamara G (2006) Spectral imaging: principles and applications. Cytometry A 69(8):735–747
Glenar DA, Hillman JJ, Saif B, Bergstralh J (1994) Acousto-optic imaging spectropolarimetry for remote sensing. Appl Opt 33(31):7412–7424
Golub GH, Hansen PC, O’Leary DP (1999) Tikhonov regularization and total least squares. SIAM J Matrix Anal Appl 21(1):185–194



Gualtieri JA, Chettri S (2000) Support vector machines for classification of hyperspectral data. In: Proceedings of the IGARSS 2000—IEEE 2000 international geoscience and remote sensing symposium—taking the pulse of the planet: the role of remote sensing in managing the environment, vol 2. IEEE, pp 813–815
Guan H, Li J, Cao S, Yu Y (2016) Use of mobile LiDAR in road information inventory: a review. Int J Image Data Fus 7(3):219–242
Guangjun D, Yongsheng Z, Song J (2007) Dimensionality reduction of hyperspectral data based on ISOMAP algorithm. In: 2007 8th international conference on electronic measurement and instruments. IEEE, pp 3–935–3–938
Hakala T, Pölönen I, Honkavaara E, Näsi R, Hakala T, Lindfors A (2020) Using aerial platforms in predicting water quality parameters from hyperspectral imaging data with deep neural networks. Comput Big Data for Transp: Digit Innov Surf Air Transp Syst 54:213–238
Hakola A-M, Pölönen I (2020) Minimal learning machine in hyperspectral imaging classification. In: Image and signal processing for remote sensing XXVI, volume 11533 of Proceedings of the SPIE. International Society for Optics and Photonics
Ham J, Chen Y, Crawford MM, Ghosh J (2005) Investigation of the random forest framework for classification of hyperspectral data. IEEE Trans Geosci Remote Sens 43(3):492–501
Hapke B (2012) Theory of reflectance and emittance spectroscopy. Cambridge University Press
Hardeberg JY, Schmitt FJM, Brettel H (2002) Multispectral color image capture using a liquid crystal tunable filter. Opt Eng 41(10):2532–2548
Harvey AR, Fletcher-Holmes DW, Gorman A, Altenbach K, Arlt J, Read ND (2005) Spectral imaging in a snapshot. In: Spectral imaging: instrumentation, applications, and analysis III, volume 5694 of Proceedings of the SPIE. International Society for Optics and Photonics, pp 110–119
Hauser J, Zheludev VA, Golub MA, Averbuch A, Nathan M, Inbar O, Neittaanmäki P, Pölönen I (2017) Snapshot spectral and color imaging using a regular digital camera with a monochromatic image sensor. Int Arch Photogramm Remote Sens Spatial Inf Sci XLII-3/W3:51–58
Honkavaara E, Saari H, Kaivosoja J, Pölönen I, Hakala T, Litkey P, Mäkynen J, Pesonen L (2013) Processing and assessment of spectrometric, stereoscopic imagery collected using a lightweight UAV spectral camera for precision agriculture. Remote Sens 5(10):5006–5039
Huang K, Li S, Kang X, Fang L (2016) Spectral-spatial hyperspectral image classification based on KNN. Sens Imaging 17:1
Hyttinen J (2021) Oral and dental spectral imaging for computational and optical visualization enhancement. PhD thesis, University of Eastern Finland
Keshava N, Mustard JF (2002) Spectral unmixing. IEEE Signal Proc Mag 19(1):44–57
Khodr J, Younes R (2011) Dimensionality reduction on hyperspectral images: a comparative review based on artificial datas. In: 2011 4th international congress on image and signal processing, vol 4. IEEE, pp 1875–1883
Kruse FA, Lefkoff AB, Boardman JW, Heidebrecht KB, Shapiro AT, Barloon PJ, Goetz AFH (1993) The spectral image processing system (SIPS): interactive visualization and analysis of imaging spectrometer data. Remote Sens Environ 44(2–3):145–163
Li W, Chen C, Su H, Du Q (2015) Local binary patterns and extreme learning machine for hyperspectral imagery classification. IEEE Trans Geosci Remote Sens 53(7):3681–3693
Lind L, Laamanen H, Pölönen I (2021) Hyperspectral imaging of asteroids using an FPI-based sensor. In: Sensors, systems, and next-generation satellites XXV, volume 11858 of Proceedings of the SPIE. SPIE
Loncan L, De Almeida LB, Bioucas-Dias JM, Briottet X, Chanussot J, Dobigeon N, Fabre S, Liao W, Licciardi GA, Simoes M, Tourneret J-Y, Veganzones MA, Vivone G, Wei Q, Yokoya N (2015) Hyperspectral pansharpening: a review. IEEE Geosci Remote Sens Mag 3(3):27–46
Lu G, Fei B (2014) Medical hyperspectral imaging: a review. J Biomed Opt 19(1):010901
Lunga D, Prasad S, Crawford MM, Ersoy O (2014) Manifold-learning-based feature extraction for classification of hyperspectral data: a review of advances in manifold learning. IEEE Signal Proc Mag 31(1):55–66



Luo Y-S, Zhao X-L, Jiang T-X, Zheng Y-B, Chang Y (2021) Hyperspectral mixed noise removal via spatial-spectral constrained unsupervised deep image prior. IEEE J Sel Top Appl Earth Obser Remote Sensing 14:9435–9449 Maier SW, Lüdeker W, Günther KP (1999) SLOP: a revised version of the stochastic model for leaf optical properties. Remote Sens Environ 68(3):273–280 Nascimento JMP, Dias JMB (2005) Vertex component analysis: a fast algorithm to unmix hyperspectral data. IEEE Trans Geosci Remote Sens 43(4):898–910 Näsi R, Honkavaara E, Tuominen S, Saari H, Pölönen I, Hakala T, Viljanen N, Soukkamäki J, Näkki I, Ojanen H, Reinikainen J (2016) UAS based tree species identification using the novel FPI based hyperspectral cameras in visible, NIR and SWIR spectral ranges. In: XXIII ISPRS Congress, volume XLI-B1 of International archives of the photogrammetry, remote sensing and spatial information sciences. International Society for Photogrammetry and Remote Sensing, pp 1143–1148 Neittaanmäki N, Salmivuori M, Pölönen I, Jeskanen L, Ranki A, Saksela O, Snellman E, Grönroos M (2017) Hyperspectral imaging in detecting dermal invasion in lentigo maligna melanoma. Brit J Dermatol 177(6):1742–1744 Neittaanmäki-Perttu N, Grönroos M, Jeskanen L, Pölönen I, Ranki A, Saksela O, Snellman E (2015) Delineating margins of lentigo maligna using a hyperspectral imaging system. Acta Dermatovenereologica 95(5):549–552 Nevalainen O, Honkavaara E, Tuominen S, Viljanen N, Hakala T, Yu X, Hyyppä J, Saari H, Pölönen I, Imai NN, Tommaselli AMG (2017) Individual tree detection and classification with UAV-based photogrammetric point clouds and hyperspectral imaging. Remote Sens 9(3):185 Nezami S, Khoramshahi E, Nevalainen O, Pölönen I, Honkavaara E (2020) Tree species classification of drone hyperspectral and RGB imagery with deep learning convolutional neural networks. Remote Sens 12(7):1070 Pölönen I (2013) Discovering knowledge in various applications with a novel hyperspectral imager. 
PhD thesis, University of Jyväskylä Pölönen I, Annala L, Rahkonen S, Nevalainen O, Honkavaara E, Tuominen S, Viljanen N, Hakala T (2018) Tree species identification using 3D spectral data and 3D convolutional neural network. In: 2018 9th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS). IEEE, pp 1–5 Pölönen I, Saari H, Kaivosoja J, Honkavaara E, Pesonen L (2013) Hyperspectral imaging based biomass and nitrogen content estimations from light-weight UAV. In: Remote sensing for agriculture, ecosystems, and hydrology XV, volume 8887 of Proceedings of the SPIE. International Society for Optics and Photonics, p 88870J Porter WM, Enmark HT (1987) A system overview of the airborne visible/infrared imaging spectrometer (AVIRIS). In: Imaging spectroscopy II, volume 0834 of Proceedings of the SPIE. International Society for Optics and Photonics, pp 22–31 Qian S-E (2021) Hyperspectral satellites, evolution, and development history. IEEE J Sel Top Appl Earth Obser Remote Sens 14:7032–7056 Räbinä J, Mönkölä S, Rossi T (2015) Efficient time integration of Maxwell’s equations with generalized finite differences. SIAM J Sci Comput 37(6):B834–B854 Räsänen J, Salmivuori M, Pölönen I, Grönroos M, Neittaanmäki N (2021) Hyperspectral imaging reveals spectral differences and can distinguish malignant melanoma from pigmented basal cell carcinomas: A pilot study. Acta Dermato-Venereologica 101(2):00405 Rasti B, Scheunders P, Ghamisi P, Licciardi G, Chanussot J (2018) Noise reduction in hyperspectral imagery: overview and application. Remote Sens 10(3):482 Rasti B, Sveinsson JR, Ulfarsson MO, Benediktsson JA (2012) Hyperspectral image denoising using 3D wavelets. In: 2012 IEEE international geoscience and remote sensing symposium. IEEE, pp 1349–1352 Rudin LI, Osher S, Fatemi E (1992) Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena 60(1–4):259–268



Saari H, Pölönen I, Salo H, Honkavaara E, Hakala T, Holmlund C, Mäkynen J, Mannila R, Antila T, Akujärvi, A (2013) Miniaturized hyperspectral imager calibration and UAV flight campaigns. In: Sensors, systems, and next-generation satellites XVII, volume 8889 of Proceedings of the SPIE. International Society for Optics and Photonics, p 88891O Salmi P, Eskelinen MA, Leppänen MT, Pölönen I (2021) Rapid quantification of microalgae growth with hyperspectral camera and vegetation indices. Plants 10(2):341 Salmivuori M, Neittaanmäki N, Pölönen I, Jeskanen L, Snellman E, Grönroos M (2019) Hyperspectral imaging system in the delineation of ill-defined basal cell carcinomas: a pilot study. J Eur Acad Dermatol Venereol 33(1):71–78 Toivonen ME, Rajani C, Klami A (2021) Snapshot hyperspectral imaging using wide dilation networks. Mach Vis Appl 32(1):9 Tsai F, Philpot W (1998) Derivative analysis of hyperspectral data. Remote Sens Environ 66(1):41– 51 Velásquez L, Cruz-Tirado JP, Siche R, Quevedo R (2017) An application based on the decision tree to classify the marbling of beef by hyperspectral imaging. Meat Sci 133:43–50 Wolfmayr M, Pölönen I, Lind L, Kašpárek T, Penttilä A, Kohout T (2021) Noise reduction in asteroid imaging using a miniaturized spectral imager. In: Sensors, systems, and next-generation satellites XXV, volume 11858 of Proceedings of the SPIE. SPIE, pp 102–114 Xu L, Li J (2014) Bayesian classification of hyperspectral imagery based on probabilistic sparse representation and Markov random field. IEEE Geosci Remote Sens Lett 11(4):823–827 Yan L, Niu X (2014) Spectral-angle-based Laplacian eigenmaps for nonlinear dimensionality reduction of hyperspectral imagery. Photogram Eng Remote Sens 80(9):849–861 Yang L, Kruse B (2004) Revised Kubelka–Munk theory. I. Theory and application. J Opt Soc Am A 21(10), 1933–1941 Zhang C, Zhou L, Zhao Y, Zhu S, Liu F, He Y (2020) Noise reduction in the spectral domain of hyperspectral images using denoising autoencoder methods. 
Chemom Intel Labo Syst 203:104063 Zheludev V, Pölönen I, Neittaanmäki-Perttu N, Averbuch A, Neittaanmäki P, Grönroos M, Saari H (2015) Delineation of malignant skin tumors by hyperspectral imaging using diffusion maps dimensionality reduction. Biomed Signal Proc Control 16:48–60 Zhuang L, Bioucas-Dias JM (2018) Fast hyperspectral image denoising and inpainting based on lowrank and sparse representations. IEEE J Sel Top Appl Earth Obser Remote Sensing 11(3):730–742

Method for Radiance Approximation of Hyperspectral Data Using Deep Neural Network

Samuli Rahkonen and Ilkka Pölönen

Abstract We propose a neural network model for calculating radiance from raw hyperspectral data gathered using a Fabry–Perot interferometer color camera developed by VTT Technical Research Centre of Finland. The hyperspectral camera works by taking multiple images at different wavelengths with varying interferometer settings. The raw data needs to be converted to radiance before it can be used, but this leads to larger file sizes. Because of the amount and structure of the raw data, the processing has to be run in parallel, which requires a lot of memory and time. Using the raw camera data directly could save processing time and file space in applications with computation time requirements. In addition, this kind of neural network could be used for generating synthetic training data or as part of generative models. The proposed model approaches these problems by combining spatial and spectral convolutions in a neural network and minimizing a loss function that combines the spectral distance and the mean squared error. The dataset used included images from many patients with melanoma skin cancer.

Keywords Spectral imaging · Modelling · Machine learning · Data processing

S. Rahkonen (B) · I. Pölönen
Faculty of Information Technology, University of Jyväskylä, P.O. Box 35, FI-40014 Jyväskylä, Finland
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
P. Neittaanmäki and M.-L. Rantalainen (eds.), Impact of Scientific Computing on Science and Society, Computational Methods in Applied Sciences 58

1 Introduction

VTT Technical Research Centre of Finland has developed a prototype hyperspectral camera based on a Fabry–Perot interferometer color camera. The camera works by taking multiple images at different wavelengths with varying interferometer settings, resulting in a set of raw sensor RGB images. Processing the large data cubes produced by these cameras consumes a lot of time and memory. Traditionally, the data has had to be converted to radiance before it could be used in applications, but this leads to large file sizes. The conversion step could be skipped completely by using the raw data, for example, in target segmentation in medical imaging. In addition, this kind of neural network could be used for generating synthetic training data or as part of generative models, such as generative adversarial networks. We propose a neural network model for calculating the radiance from the raw data produced by this kind of hyperspectral camera.

Fig. 1 Working principle of Fabry–Perot interferometer camera

Figure 1 illustrates how a Fabry–Perot interferometer works. It has two parallel half-mirrors close to each other. A beam of light entering the system interferes with itself as it reflects off the mirrors. Light is transmitted at those wavelengths for which the round-trip path between the mirrors is an integer multiple of the wavelength. With low- and high-pass filters, the setup can be used as a narrowband wavelength filter by controlling the mirror separation (Eskelinen 2019; Saari et al. 2009).

Conversion to a radiance data cube is traditionally done as follows. The raw RGB sensor data is bilinearly interpolated (using a Bayer matrix) to produce an array of RGB images. After interpolation, each image can be converted to multiple narrow-wavelength radiance bands, depending on the number of received peaks on the sensor. The coefficients needed for calculating the radiance at the corresponding wavelengths are solved during camera calibration. Therefore, the number of bands in the resulting radiance cube can be higher than in the original raw data cube. Finally, the produced bands are sorted by wavelength to produce the final radiance cube (Eskelinen 2019; Saari et al. 2009).

This paper introduces a neural network architecture for predicting a radiance cube from a given raw sensor data cube. We demonstrate the network's ability to calculate the radiance using a dataset of images of skin melanoma. We noticed reduced noise in the produced radiance cubes.
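To make the link between mirror separation and the number of received peaks concrete, the following sketch lists the wavelengths that an ideal Fabry–Perot cavity transmits. It assumes the textbook resonance condition for normal incidence in an air gap, m·λ = 2d. The function name, the 1100 nm gap and the 400–1000 nm pass band are illustrative choices, not values taken from the camera.

```python
def transmitted_wavelengths(gap_nm, band_nm=(400.0, 1000.0)):
    """Return (order, wavelength) pairs passed by an ideal air-gap cavity.

    Assumes normal incidence, so resonance occurs when m * wavelength = 2 * gap.
    band_nm models the combined low- and high-pass filters (hypothetical limits).
    """
    low, high = band_nm
    peaks = []
    order = 1
    # Wavelengths shrink as the order grows, so stop once below the pass band.
    while 2.0 * gap_nm / order >= low:
        wavelength = 2.0 * gap_nm / order
        if wavelength <= high:
            peaks.append((order, wavelength))
        order += 1
    return peaks


peaks = transmitted_wavelengths(1100.0)
for order, wavelength in peaks:
    print(f"order {order}: {wavelength:.1f} nm")
```

With these assumed values, three interference orders fall inside the pass band, so a single exposure would contribute three narrow bands. This is the reason a radiance cube can contain more bands than the raw cube.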

2 Methods

The experiments were run on a shared computer cluster with an Intel(R) Xeon(R) CPU E5-2640 v4, 264 GB of memory and an Nvidia Tesla P100 GPU with 16 GB of memory. We used Python 3.6 and TensorFlow 2.0.0.



2.1 Dataset and Network Architecture

Our dataset consisted of 62 raw hyperspectral images captured with the VTT Fabry–Perot interferometer camera, with 1920 × 1080 spatial resolution and 85 bands. A raw image has not yet been converted to a radiance image, and its bands have not been interpolated using the RGB Bayer matrix of the photosensor. The images were captured from patients with and without diagnosed melanoma. Figure 2 shows three example images from our dataset. We assume that a general method for radiance approximation using a neural network would require a very large and diverse dataset of hyperspectral data, which was not available for this type of camera. Therefore, we use this dataset and narrow the scope of our research to hyperspectral images of skin.

To create a ground truth dataset, we used the fpipy Python library (Hämäläinen et al. 2018) to generate radiance images from the same raw images. The resulting radiance cubes have 120 bands. Both the raw data and the ground truth radiance data included noise. The dataset was split by hand into 54 training and 8 test image cubes to make sure that both sets contained diverse images. The data pipeline normalized the full image cubes to values between 0 and 1 and extracted 21 patches from each raw cube (256 × 256 × 85) and from its corresponding radiance cube (256 × 256 × 120). During training, the images were also rotated and mirrored in the spatial domain to enlarge the data pool.

The network consists of 3D convolution layers with filters applied in different directions inside the data cube (see Fig. 3). First, we expand the dimensions by one (at the end of the tensor) so that the 3D convolution filters can be applied along the spatial and spectral axes of the cube; the 3D convolution layer also expects 5D tensors as input. Two 3 × 3 × 1 kernels are applied in the spatial plane of each band to find filters for bilinear interpolation. We use batch normalization and the leaky rectified linear unit (Maas et al. 2013) as the activation function of the hidden units to avoid the vanishing/exploding gradient problem. The number of filters and their size are kept small because, traditionally, bilinear interpolation of a raw Bayer matrix image can be achieved by convolving with just two 3 × 3 kernels.
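The remark about two 3 × 3 kernels can be illustrated with a small NumPy sketch of bilinear demosaicing. It assumes an RGGB Bayer layout and uses the textbook bilinear kernels; the layout, the function names and the reflect padding are assumptions made for illustration and are not taken from fpipy or from the network.

```python
import numpy as np

# Textbook bilinear demosaicing kernels: one for green, one shared by red and blue.
K_GREEN = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]], dtype=float) / 4.0
K_RED_BLUE = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 4.0


def bayer_masks(height, width):
    """Boolean masks of the R, G and B sites for an assumed RGGB layout."""
    rows, cols = np.indices((height, width))
    red = (rows % 2 == 0) & (cols % 2 == 0)
    blue = (rows % 2 == 1) & (cols % 2 == 1)
    green = ~(red | blue)
    return red, green, blue


def filter3x3(image, kernel):
    """Apply a symmetric 3 x 3 kernel with reflect padding."""
    height, width = image.shape
    padded = np.pad(image, 1, mode="reflect")
    out = np.zeros((height, width))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + height, dx:dx + width]
    return out


def demosaic_bilinear(raw):
    """Interpolate a single-channel Bayer image into an H x W x 3 RGB image."""
    red, green, blue = bayer_masks(*raw.shape)
    return np.stack(
        [
            filter3x3(raw * red, K_RED_BLUE),
            filter3x3(raw * green, K_GREEN),
            filter3x3(raw * blue, K_RED_BLUE),
        ],
        axis=-1,
    )


# A flat-colored scene is sampled through the mosaic and then interpolated back.
rgb = np.ones((6, 8, 3)) * np.array([0.2, 0.5, 0.8])
red, green, blue = bayer_masks(6, 8)
raw = rgb[..., 0] * red + rgb[..., 1] * green + rgb[..., 2] * blue
restored = demosaic_bilinear(raw)
```

Each color plane is zero wherever the sensor did not sample that color, so a single fixed kernel per plane is enough to fill in the gaps. This is the operation the small spatial filters of the network are expected to learn.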

Fig. 2 Examples of hyperspectral images (including only RGB bands for visualization) from patients; handmade markings on the skin in images b and c
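The normalization and augmentation steps of the data pipeline in Sect. 2.1 can be sketched as follows. This is a minimal NumPy illustration rather than the authors' pipeline: min-max scaling is an assumed normalization scheme, and the helper names are hypothetical. The same spatial transform is applied to a raw patch and to its radiance patch so that the pair stays aligned.

```python
import numpy as np


def normalize_cube(cube):
    """Scale a cube to [0, 1] using its global minimum and maximum (assumed scheme)."""
    cube = cube.astype(float)
    low, high = cube.min(), cube.max()
    if high == low:
        return np.zeros_like(cube)
    return (cube - low) / (high - low)


def augment_pair(raw_patch, radiance_patch, quarter_turns, mirror):
    """Rotate and optionally mirror both patches in the spatial plane (axes 0 and 1)."""
    def transform(patch):
        patch = np.rot90(patch, k=quarter_turns, axes=(0, 1))
        return np.flip(patch, axis=1) if mirror else patch

    return transform(raw_patch), transform(radiance_patch)


rng = np.random.default_rng(0)
raw = normalize_cube(rng.integers(0, 4096, size=(8, 8, 5)))
radiance = normalize_cube(rng.random((8, 8, 7)))

# Four rotations times two mirror states give eight variants of each pair.
variants = [augment_pair(raw, radiance, k, m) for k in range(4) for m in (False, True)]
```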



Fig. 3 Neural network model

Next, we use two consecutive layers with 1 × 1 × 5 kernels along the spectral axis. The idea is to look only at the spectra of the pixels to find filters with suitable coefficients that can convert the out-of-order raw sensor signals to radiance. The signals are out of order because each band in the raw data cube can be converted to multiple narrow-wavelength radiance bands, depending on the number of received peaks on the sensor. We drop the last dimension in order to upscale the spectral dimension. At the end of the network, we use a deconvolution layer with a sigmoid activation function to increase the number of bands from 85 to 120, which is the expected number of bands. The sigmoid function outputs activation values between 0 and 1.
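As a rough illustration of the spectral layers, the NumPy sketch below applies a single 1 × 1 × 5 filter, which amounts to sliding one 1D kernel along the band axis of every pixel independently. The helper names and the zero padding are assumptions. The second helper gives the output length of a transposed convolution without padding; under the further assumption of stride 1, a kernel spanning 36 bands would map 85 bands to 120, but the chapter does not state the kernel size that was actually used.

```python
import numpy as np


def spectral_filter(cube, kernel):
    """Slide one odd-length 1D kernel along the band axis of every pixel.

    This is cross-correlation with zero padding, the operation deep learning
    libraries call convolution. Spatial neighbors are never mixed, as with a
    1 x 1 x K kernel.
    """
    kernel = np.asarray(kernel, dtype=float)
    pad = len(kernel) // 2
    bands = cube.shape[2]
    padded = np.pad(cube, ((0, 0), (0, 0), (pad, pad)), mode="constant")
    out = np.zeros(cube.shape, dtype=float)
    for offset, weight in enumerate(kernel):
        out += weight * padded[:, :, offset:offset + bands]
    return out


def transposed_conv_length(n_in, kernel_size, stride=1):
    """Output length of an unpadded transposed convolution (kernel_size >= stride)."""
    return (n_in - 1) * stride + kernel_size


cube = np.random.default_rng(0).random((4, 4, 85))
smoothed = spectral_filter(cube, [0.1, 0.2, 0.4, 0.2, 0.1])
unchanged = spectral_filter(cube, [0, 0, 1, 0, 0])
print(smoothed.shape, transposed_conv_length(85, 36))
```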

2.2 Training

We trained the network for 205 steps. Each step included 100 batches, and each batch contained two training input-output data pairs. Training with full epochs over all training images slowed down the updating of the weights too much. The GPU memory and the cube size were the limiting factors. Training for too long would cause the network to start modeling the noise in the training data. This could be seen from individual test spectra during training. The loss function is

$$
\mathrm{Loss} = \lambda_1 \frac{1}{NMB} \sum_{i}^{N} \sum_{j}^{M} \sum_{k}^{B} \left( Y_{ijk} - \hat{Y}_{ijk} \right)^2 + \lambda_2 \frac{1}{NM} \sum_{i}^{N} \sum_{j}^{M} \theta\left( Y_{ij}, \hat{Y}_{ij} \right) + \lVert W \rVert_2 , \tag{1}
$$

$$
\theta(S, \hat{S}) = \cos(S, \hat{S}) = \frac{S \cdot \hat{S}}{\lVert S \rVert_2 \, \lVert \hat{S} \rVert_2} , \tag{2}
$$

where Y and Ŷ are the ground truth and predicted training data cubes, respectively. The pixel counts N and M are along the spatial axes and B is the number of bands along the spectral axis. The first term is the mean squared error, which minimizes the spatial errors and the errors in pixel intensity values. The experimentally defined coefficient λ1 = 1 gives a weight for this term. The second term is the mean of the spectral angles between the expected and predicted spectra (S and Ŝ) in a pixel; it uses Eq. (2) for calculating the spectral angle. This term is used to minimize the errors along the spectral axis and to keep the shape of the spectra closer to the ground truth data. The effect of this term on the predicted spectra was qualitatively validated during the experiments. The experimentally defined coefficient λ2 = 10 gives a weight for this term. The third term is the L2 regularization of the network weights W, which prevents overfitting and keeps the training more stable by avoiding gradient explosion.

We used the Adam optimizer with a learning rate of 2e−4 and ε = 0.01. The gradients were clipped to norm 1 for stability. Training the neural network took three hours on the GPU.

We compared the radiance cube generated by the network with the radiance calculated by the fpipy library (Hämäläinen et al. 2018), which calculates the radiance as described in Sect. 1. We used several quality metrics and qualitative visual inspection to measure the differences between these two radiance calculation methods. The metrics included the structural similarity index measure (SSIM) (Wang and Bovik 2009; Wang et al. 2004), peak signal-to-noise ratio (PSNR), mean absolute error (MAE) and mean squared error (MSE). SSIM can be used for measuring perceived image quality and ranges between −1 and 1, where the value 1 means that the input images are equal. SSIM takes better account of the structural information (texture, orderings, patterns) in natural signals and should therefore provide a better metric for the dataset used in the experiments (Wang and Bovik 2009). The scikit-image implementation of SSIM (van der Walt et al. 2014) was used.
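For readers who want to experiment with Eq. (1), the following NumPy sketch evaluates the three terms for one cube. It is not the TensorFlow code used in the experiments. One assumption should be noted: Eq. (2) is written as the cosine between two spectra, whereas the sketch applies the arccosine to that cosine, which is the conventional spectral angle and is zero for parallel spectra. The function names and the `eps` guard are hypothetical.

```python
import numpy as np


def spectral_angle(truth, pred, eps=1e-12):
    """Per-pixel angle (radians) between spectra stored along the last axis."""
    dot = np.sum(truth * pred, axis=-1)
    norms = np.linalg.norm(truth, axis=-1) * np.linalg.norm(pred, axis=-1)
    cosine = dot / np.maximum(norms, eps)  # eps guards against all-zero spectra
    return np.arccos(np.clip(cosine, -1.0, 1.0))


def radiance_loss(truth, pred, weights, lambda1=1.0, lambda2=10.0):
    """Weighted MSE + mean spectral angle + L2 norm of all weights, after Eq. (1)."""
    mse = np.mean((truth - pred) ** 2)
    angle = np.mean(spectral_angle(truth, pred))
    l2 = np.sqrt(sum(np.sum(w ** 2) for w in weights))
    return lambda1 * mse + lambda2 * angle + l2


rng = np.random.default_rng(0)
truth = rng.random((4, 4, 120))  # N x M x B cube with values in [0, 1]
pred = np.clip(truth + 0.05 * rng.standard_normal(truth.shape), 0.0, 1.0)
weights = [np.zeros((3, 3, 1)), np.zeros((1, 1, 5))]  # placeholder network weights

print(radiance_loss(truth, pred, weights))
```

Because the angle ignores the overall scale of a spectrum, the second term only constrains spectral shape; the intensity is constrained by the mean squared error term alone.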

3 Results

To interpret the MAE and MSE, note that the cube values lie between 0 and 1. The ranges of the metrics were as follows:

• MAE: 0.02–0.09,
• MSE: 0.001–0.009,
• PSNR: 20–30 dB,
• SSIM: 0.5–0.9.

Predicting a full image cube took 4.2–6.8 s with the neural network on the GPU. Radiance calculation with fpipy took approximately 48 s on the CPU. The results are listed in Table 1. Cube number 6 was selected as a representative example. Plots illustrating the differences between the ground truth and the predicted cube are shown in Figs. 4, 5, 6 and 7.

Table 1 Test results

Cube number   MAE     MSE     PSNR (dB)   SSIM
1             0.060   0.005   22.877      0.574
2             0.052   0.004   23.933      0.813
3             0.078   0.009   20.457      0.809
4             0.084   0.008   20.468      0.565
5             0.047   0.002   25.354      0.725
6*            0.040   0.002   26.073      0.831
7             0.029   0.001   29.208      0.831
8             0.038   0.002   26.405      0.831

* Results described and discussed in detail

Fig. 4 Visualization of the cube number 6 with the RGB bands

Figure 4 illustrates the errors between the two methods. The prediction has visible cut lines between the image patches, because the network tends to create artifacts at the patch edges, and these become visible after the full image cube is reconstructed. Comparing the ground truth (Fig. 5a) and prediction (Fig. 5b) histograms shows that the predicted pixels in bands 69–90 have larger values than in the ground truth. These bands correspond to red wavelengths. The errors can be seen in the histograms, in Fig. 6 around the corresponding wavelengths of 625–700 nm, and also in the RGB image (Fig. 4) as a reddish tint.

Based on visual inspection and Fig. 7, which illustrates one image patch and two example spectra, the network seems to reduce noise in the example spectra. Also, Fig. 5a shows that the ground truth bands have more variance in their data values than the predicted bands (Fig. 5b). The last two bands have a lot of noise and cause large error peaks.

Fig. 5 Histograms of the image spectra; 120 bands on the X-axis, 120 bins for each band on the Y-axis

Fig. 6 MAE per wavelength for the cube number 6; wavelength on the X-axis, absolute error on the Y-axis

Fig. 7 Top: example spectra and errors of a test cube patch in the cube number 6; Bottom: spectrum from the upper point (blue) and the lower point (green) in the patch

On closer visual inspection, the prediction image showed some noticeable color alterations in a recurring pattern along the spatial dimensions. This is probably because the network fails to completely reproduce the bilinear interpolation for the Bayer color filter array of the camera's photosensor.
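The error metrics reported above are straightforward to compute for cubes scaled to [0, 1]. The sketch below is a plain-Python illustration with hypothetical function names; it assumes a peak value of 1, in which case PSNR follows directly from MSE as 10·log10(1/MSE).

```python
import math


def mae(truth, pred):
    """Mean absolute error over two equally long sequences of values."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)


def mse(truth, pred):
    """Mean squared error over two equally long sequences of values."""
    return sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth)


def psnr_from_mse(mse_value, peak=1.0):
    """Peak signal-to-noise ratio in dB for data whose maximum value is `peak`."""
    return 10.0 * math.log10(peak ** 2 / mse_value)


truth = [0.10, 0.40, 0.80, 0.55]
pred = [0.12, 0.35, 0.86, 0.50]
print(mae(truth, pred), mse(truth, pred), psnr_from_mse(mse(truth, pred)))
```

Under this assumption, an MSE of 0.001 corresponds to 30 dB and an MSE of 0.009 to about 20.5 dB, which agrees with the 20–30 dB range reported for MSE values between 0.001 and 0.009.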

4 Discussion

In this section, we discuss the main findings and the related work on the topic.

4.1 Findings

The PSNR was between 20 and 30 dB and the MAE between 0.02 and 0.09 for cube values scaled to the range 0–1, which we consider quite good results. The largest MAE errors are in the range of 670–775 nm, as shown in Fig. 6. These bands had the most variance in pixel values, which explains the errors. The SSIM varied between 0.57 and 0.83, which can be interpreted as a good result. The ground truths of cubes number 1 and 4 were quite noisy along the spectral dimension and spatially unfocused, whereas the spectral histograms of the corresponding predicted cubes show much less variance per band. As a result, these cubes scored worse than the other test cubes (see Table 1). Visual inspection of the zoomed-in ground truth, prediction and error images in Fig. 7 shows that the results are quite close to each other in both the spatial and spectral dimensions. The largest L2 spectral errors are in the areas with the fewest samples of similar spectra.



The prediction histogram (Fig. 5b) illustrates how closely the spectra of the proposed method follow those of the ground truth method. On closer inspection, the ground truth histogram shows some repeating error patterns and a larger variance of the spectral values throughout the bands. The intensity values of the predicted spectra are more tightly concentrated. This could be a result of overgeneralizing the training data, or of the network reducing the noise. A strange effect can be observed in the prediction histogram (Fig. 5b): bands 0–50 form three intensity peaks instead of two as in the ground truth (Fig. 5a). This can be explained by the cube patching method used. The full cube was split into patches, which were fed through the neural network, and the predictions were assembled back into a larger cube. The network could predict each patch differently, which could show up in the histogram in this way.
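The patching scheme can be sketched as follows. This is a minimal NumPy illustration with hypothetical function names. It assumes non-overlapping tiles and a cube whose spatial size is a multiple of the patch size; how the actual pipeline handled image borders is not described. The `toy_predict` function stands in for the trained network.

```python
import numpy as np


def predict_by_patches(cube, patch, predict, out_bands):
    """Split a cube into non-overlapping spatial tiles, predict each, and reassemble."""
    height, width, _ = cube.shape
    if height % patch or width % patch:
        raise ValueError("spatial size must be a multiple of the patch size")
    out = np.empty((height, width, out_bands))
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = cube[top:top + patch, left:left + patch, :]
            out[top:top + patch, left:left + patch, :] = predict(tile)
    return out


def toy_predict(tile):
    """Placeholder for the network: repeats the band mean into 3 output bands."""
    return np.repeat(tile.mean(axis=-1, keepdims=True), 3, axis=-1)


cube = np.random.default_rng(0).random((8, 12, 5))
result = predict_by_patches(cube, patch=4, predict=toy_predict, out_bands=3)
```

Because every tile is predicted independently, any patch-dependent offset in the output appears as a seam between neighboring tiles, which is consistent with the cut lines and the histogram effect described above.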

4.2 Related Work

There are many cases where neural networks are used for image processing such that the mapping from input to output images is learnt from training images. For Bayer filter interpolation, there have been a few attempts at applying neural networks (Gharbi et al. 2016; Syu et al. 2018). Convolutional neural networks have been used for demosaicing photosensor image data (Syu et al. 2018). The authors quantitatively and qualitatively compare ten demosaicing algorithms with two methods based on neural networks (DMCNN and DMCNN-VD). DMCNN uses a few layers with the intent of learning features for interpolation, non-linearly mapping them to individual pixels and reconstructing them into a full image. The second method, DMCNN-VD, uses residual layers to construct a much deeper network. The results look promising, but artifacts are still present.

In the case of spatial interpolation, gathering and constructing a good ground truth dataset is a challenge. One way to enhance the existing data is to create a pseudo ground truth dataset by spatially downsampling the training images, adding noise and creating a mosaic from them to reduce noise (Gharbi et al. 2016).

Generative adversarial networks (GAN) have been used with hyperspectral data for visualization (Chen et al. 2018), image restoration (Fabbri et al. 2018) and generating new images. They have even been used for learning mappings between unpaired RGB images (Zhu et al. 2017).

5 Conclusions

The aim of the research was to find a neural network method for approximating the radiance from the raw data produced by an FPI camera. The proposed neural network method delivers promising results but suffers from some image artifacts in the form of per-band intensity fluctuations, depending on the quality of the raw input data. The model is also able to reduce the noise of the spectra even when trained with noisy data.

As topics for future research, we would consider producing better training data. Bilinear interpolation can be computed quite fast without a neural network, so it would make sense to perform this step as a separate preprocessing operation before running the radiance approximation. Alternatively, we could have undersampled the image data spatially to produce less noisy ground truth data for training.

We noticed that the training was not completely stable because of fluctuations in image sharpness along the spatial dimensions between training steps. Along the spectral dimension, we noticed variation that caused the intensities of all the band values to be off. The reason is probably the loss function, which emphasizes the shape of the spectra over the intensity. One solution could be to optimize the λ1 and λ2 coefficients in the loss function. In addition, the learning rate could be decreased during training.

We experimented with using just MAE as the loss function for the network, but it failed to capture the spatial and spectral fidelity of the ground truth data. The predictions tended to capture and produce only the mean spectra of the whole image. Using the spectral angle (2) in the loss function made it possible to distinguish different spectra, and MAE contributed to the spatial image quality. The results show that it is possible to train an application-specific neural network for radiance approximation with a relatively small number of training images.

Acknowledgements This research was supported by the University of Jyväskylä.

References

Chen S, Liao D, Qian Y (2018) Spectral image visualization using generative adversarial networks. In: Geng X, Kang BH (eds) PRICAI 2018: trends in artificial intelligence, vol 11012. Lecture notes in computer science. Springer, Cham, pp 388–401
Eskelinen M (2019) Computational methods for hyperspectral imaging using Fabry–Perot interferometers and colour cameras. PhD thesis, University of Jyväskylä
Fabbri C, Islam MJ, Sattar J (2018) Enhancing underwater imagery using generative adversarial networks. In: 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp 7159–7165
Gharbi M, Chaurasia G, Paris S, Durand F (2016) Deep joint demosaicking and denoising. ACM Trans Graph 35(6)
Hämäläinen J, Jääskeläinen S, Rahkonen S (2018) Fabry-Perot imaging in Python. GitHub, https://
Maas AL, Hannun AY, Ng AY (2013) Rectifier nonlinearities improve neural network acoustic models. In: ICML'13: Proceedings of the 30th International Conference on Machine Learning, vol 28. JMLR Workshop and Conference Proceedings
Saari H, Aallos V-V, Akujärvi A, Antila T, Holmlund C, Kantojärvi U, Mäkynen J, Ollila J (2009) Novel miniaturized hyperspectral sensor for UAV and space applications. In: Meynart R (ed) Sensors, systems, and next-generation satellites XIII, volume 7474 of Proceedings of the SPIE. International Society for Optics and Photonics
Syu NS, Chen YS, Chuang YY (2018) Learning deep convolutional networks for demosaicing. arXiv:1802.03769
van der Walt S, Schönberger JL, Nunez-Iglesias J, Boulogne F, Warner JD, Yager N, Gouillart E, Yu T, the scikit-image contributors (2014) scikit-image: image processing in Python. PeerJ 2:e453
Wang Z, Bovik AC (2009) Mean squared error: love it or leave it? A new look at signal fidelity measures. IEEE Signal Process Mag 26(1):98–117
Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612
Zhu J, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, pp 2242–2251

Directional Wavelet Packets for Image Processing

Amir Averbuch and Valery Zheludev

Abstract The paper presents a versatile library of quasi-analytic complex-valued wavelet packets originating from polynomial splines of arbitrary orders. The real parts of quasi-analytic wavelet packets are regular spline-based orthonormal wavelet packets derived from discretized periodic polynomial splines. Imaginary parts are the so-called complementary orthonormal wavelet packets derived from the Hilbert transforms of regular wavelet packets. The discrete Fourier transforms of quasi-analytic wavelet packets are located in either the positive or negative half-band of the frequency domain. Consequently, the discrete Fourier transforms of 2D wavelet packets, which are derived by the tensor products of 1D wavelet packets, occupy one of the quadrants of the 2D frequency domain. Such a structure results in the directionality of their real parts. The shapes of real quasi-analytic wavelet packets are close to windowed cosine waves that oscillate in several different directions at different frequencies. The paper provides a few examples of the successful application of the designed quasi-analytic wavelet packets to image denoising and inpainting.

Keywords Spline · Wavelet packet · Quasi-analytic wavelet packet · Directionality · Oscillatory waveforms · Image processing

1 Introduction It is apparent that for an image processing algorithm to be efficient, it has to take into considerations features that characterize the images, such as edges oriented in various directions, texture patterns, that can be approximated by patches oscillating in various directions with various frequencies, and smooth regions. The ability to extract such features even from degraded images is a key element for image denoising, inpainting, deblurring, classification and target detection. This stems from the fact that practically all the processed images have a sparse representation in a proper transform A. Averbuch · V. Zheludev (B) School of Computer Science, Tel Aviv University, 69978 Tel Aviv, Israel e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 P. Neittaanmäki and M.-L. Rantalainen (eds.), Impact of Scientific Computing on Science and Society, Computational Methods in Applied Sciences 58,



A. Averbuch and V. Zheludev

domain. The sparse representation of an image means that it can be approximated by a linear combination of a relatively small number of 2D “basic” elements selected from a versatile collection called dictionary, while retaining the above mentioned components of the image. Such a dictionary should comprise a variety of waveforms which are able to capture edges oriented in any direction, texture patches oscillating with any frequency and to represent smooth regions by a very few elements. The latter requirement can be fulfilled if the dictionary elements possess vanishing moments at least locally. In order to meet the former two, the dictionary elements have to be oriented in multiple directions and to have oscillating structures with multiple frequencies. Image processing applications are characterized by extensive research, especially in recent decades. Naturally, a number of dictionaries are reported in the literature and applied to image processing. We mention contourlets (Do and Vetterli 2003), curvelets (Candès and Donoho 2004; Candès et al. 2006), pseudo-polar Fourier transforms (Averbuch et al. 2008a, b) and shearlets (Kutyniok and Labate 2012; Kutyniok et al. 2012). These dictionaries are used in various image processing applications. However, while these dictionaries successfully capture edges in images, they did not demonstrate a satisfactory texture restoration due to lack of oscillating waveforms in the dictionaries. A number of publications (Jalobeanu et al. 2000; Bayram and Selesnick 2008; Han and Zhao 2014; Han et al. 2019, 2016; Ji et al. 2017, 2018), to name a few, design directional dictionaries by tensor multiplication of complex wavelets (Kingsbury 1999; Selesnick et al. 2005), wavelet frames and wavelet packets (WPs). The tight tensor-product complex wavelet frames (TP_CTFn )1 with a different number of directions, are designed in Han and Zhao (2014), Han et al. 
(2016, 2019), and some of them, in particular cptTP_CTF$_6$, TP_CTF$_6$ and TP_CTF$_6^{\downarrow}$, demonstrate impressive performance for image denoising and inpainting. The waveforms in these frames are oriented in 14 directions and, due to the 2-layer structure of their spectra, they possess certain, although limited, oscillatory properties. In the Digital Affine Shear Filter Transform with 2-Layer Structure (DAS-2) algorithm (Che and Zhuang 2018), the two-layer structure inherent in the TP_CTF$_6$ frames is incorporated into the shearlet-based directional filter banks introduced in Zhuang (2016). This improves the performance of DAS-2 in comparison to TP_CTF$_6$ for texture-rich images such as “Barbara”, but not for smoother images like “Lena”. We succeeded in designing a family of dictionaries that meet the requirements for image processing applications to the fullest extent. As a base for such a design, we have a library of orthonormal WPs originating from discretized polynomial splines of multiple orders (see Averbuch et al. 2019). The waveforms in the library are symmetric and well localized in the time domain, and their shapes vary from low-frequency smooth curves to high-frequency oscillating transients. They can have any number of local vanishing moments. Their spectra provide a variety of refined splits of the frequency domain, and the shapes of the magnitude spectra tend to rectangular as

$^1$ The index n refers to the number of filters in the underlying one-dimensional complex tight framelet filter bank.

Directional Wavelet Packets for Image Processing


the spline’s order increases. Their tensor products carry these properties over to the 2D setting, but they lack directionality, which is of crucial importance for image processing. Directionality is achieved by extending these orthonormal WPs to complex-valued quasi-analytic WPs (qWPs) (see Sect. 2). The paper is organized as follows. Section 2 outlines the design of directional qWPs originating from polynomial splines and the corresponding transforms. Section 3 presents a couple of examples of image denoising and inpainting. Section 4 comprises a brief discussion. The following abbreviations are used:

1D      One-dimensional
2D      Two-dimensional
cWP     Complementary wavelet packet
DFT     Discrete Fourier transform
dWP     Orthonormal wavelet packets derived from discretized polynomial splines
FFT     Fast Fourier transform
HT      Hilbert transform
qWP     Quasi-analytic wavelet packet
PSNR    Peak Signal to Noise Ratio
SSIM    Structural Similarity Index (Wang et al. 2004)

2 Quasi-analytic Directional Wavelet Packets

In this section, we describe the properties of qWPs and give some illustrations. An outline of the qWP design and the implementation of the corresponding transforms is provided in Averbuch et al. (2021), which describes the successful application of qWPs to image inpainting. A detailed description of the design and implementation is given in Averbuch et al. (2020).

2.1 Properties of qWPs

The qWPs are derived from the periodic WPs originating from orthonormal discretized polynomial splines of different orders (dWPs), which are described in Averbuch et al. (2019, Chap. 4) (a brief outline is given in Averbuch et al. 2020). The dWPs are denoted by $\psi^p_{[m],l}$, where $p$ is the generating spline’s order, $m$ is the decomposition level and $l = 0, \ldots, 2^m - 1$ is the index of an $m$-level wavelet packet. The $2^m$-sample shifts

$$\{\psi^p_{[m],l}(\cdot - 2^m k)\}, \quad l = 0, \ldots, 2^m - 1, \quad k = 0, \ldots, N/2^m - 1,$$


A. Averbuch and V. Zheludev

of the $m$-level dWPs form an orthonormal basis of the space $\Pi[N]$ of $N$-periodic discrete-time signals. Other orthonormal bases are possible, for example, the wavelet and Best bases (Coifman and Wickerhauser 1992). The spaces of 1D and 2D $N$-periodic signals are denoted by $\Pi[N]$ and $\Pi[N, N]$, respectively. We denote $N = 2^j$ and $\omega \stackrel{\mathrm{def}}{=} e^{2\pi i/N}$. The sequence $\delta[k] \in \Pi[N]$ is the $N$-periodic Kronecker delta. The discrete Fourier transform (DFT) of a signal $x \in \Pi[N]$ is the $N$-periodic complex sequence

$$\hat{x}[n] = \sum_{k=0}^{N-1} \omega^{-kn}\, x[k], \quad n \in \mathbb{Z}.$$
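The DFT convention above fixes the sign of the exponent and leaves the transform unnormalized. A minimal sketch of that convention follows, using only the Python standard library. The function names `dft_at` and `dft` are illustrative and do not come from the source.

```python
import cmath
import math

def dft_at(x, n):
    """Return x_hat[n] = sum_k omega**(-k*n) * x[k], with omega = exp(2*pi*i/N).

    The index n may be any integer, since the sequence x_hat is N-periodic.
    """
    N = len(x)
    omega = cmath.exp(2j * math.pi / N)
    return sum(omega ** (-k * n) * x[k] for k in range(N))

def dft(x):
    """Return one period of the DFT, n = 0, ..., N-1."""
    return [dft_at(x, n) for n in range(len(x))]

# The N-periodic Kronecker delta has a flat spectrum.
delta_hat = dft([1, 0, 0, 0, 0, 0, 0, 0])
```

Because no factor $1/N$ appears in the forward transform, the inverse transform has to carry it.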


Complementary wavelet packets (cWPs) are denoted by $\varphi^p_{[m],l}$ and $\varphi^p_{[m],j,l}$, and quasi-analytic wavelet packets (qWPs) by $\Psi^p_{\pm[m],l}$ and $\Psi^p_{+\pm[m],j,l}$, in the 1D and 2D cases, respectively.

2.1.1 One-Dimensional qWPs

The waveforms $\psi^p_{[m],l}[k]$ are symmetric, well localized in the spatial domain and have an oscillatory structure; their DFT spectra form a refined split of the frequency domain. The shapes of the magnitude spectra tend to rectangular as the spline’s order $p$ grows. A common way to extend 1D WP transforms to multiple dimensions is by the tensor-product extension. The 2D dWPs from the level $m$ are

$$\psi^p_{[m],j,l}[k, n] \stackrel{\mathrm{def}}{=} \psi^p_{[m],j}[k]\, \psi^p_{[m],l}[n].$$

Their $2^m$-sample shifts along the vertical and horizontal directions form orthonormal bases of the space $\Pi[N, N]$ of 2D signals that are $N$-periodic in both directions. The drawback is the lack of directionality. Directionality is achieved by switching to complex wavelet packets. For this, we start by applying the Hilbert transform (HT) to the dWPs $\psi^p_{[m],l}$ to get the signals

$$\tau^p_{[m],l} = H(\psi^p_{[m],l}), \quad m = 1, \ldots, M, \quad l = 0, \ldots, 2^m - 1.$$

A slight correction of these signals,

$$\varphi^p_{[m],l}[k] \stackrel{\mathrm{def}}{=} \hat{\psi}^p_{[m],l}[0]/N + \hat{\psi}^p_{[m],l}[N/2]/N + \tau^p_{[m],l}[k], \qquad (1)$$

provides us with a set of signals from the space $\Pi[N]$ whose properties are similar to the properties of the dWPs $\psi^p_{[m],l}$. In particular, their shifts form orthonormal bases in $\Pi[N]$, and their magnitude spectra coincide with the magnitude spectra of the dWPs $\psi^p_{[m],l}$. However, unlike the symmetric dWPs $\psi^p_{[m],l}$, the signals $\varphi^p_{[m],l}$ are antisymmetric for all $l$ except for $l_0 = 0$ and $l_m = 2^m - 1$. We refer to the signals $\varphi^p_{[m],l}$ as the complementary orthonormal WPs (cWPs). The sets of complex-valued WPs, which we refer to as the quasi-analytic wavelet packets (qWPs), are defined as

$$\Psi^p_{\pm[m],l} = \psi^p_{[m],l} \pm i\, \varphi^p_{[m],l}, \quad m = 1, \ldots, M, \quad l = 0, \ldots, 2^m - 1,$$

where $\varphi^p_{[m],l}$ are the cWPs defined in (1). The qWPs $\Psi^p_{\pm[m],l}$ differ from the analytic WPs by adding two values $\pm i\, \hat{\psi}^p_{[m],l}[0]$ and $\pm i\, \hat{\psi}^p_{[m],l}[N/2]$ into their DFT spectra, respectively. The DFT spectra of the qWPs $\Psi^p_{+[m],l}$ are located within the positive half-band of the frequency domain and vice versa for the qWPs $\Psi^p_{-[m],l}$.

Fig. 1 Top to bottom: signals $\psi^9_{[3],l}$; signals $\varphi^9_{[3],l}$, $l = 0, \ldots, 7$; their magnitude DFT spectra (right half-band); magnitude DFT spectra of complex qWPs $\Psi^9_{+[3],l}$; same for $\Psi^9_{-[3],l}$, $l = 0, \ldots, 7$

Figure 1 displays the signals $\psi^9_{[3],l}$ and $\varphi^9_{[3],l}$, $l = 0, \ldots, 7$, from the third decomposition level and their magnitude spectra (right half-band), which coincide with each other. Adding $\hat{\psi}^9_{[3],l}[0]$ and $\hat{\psi}^9_{[3],l}[N/2]$ to the spectra of $\varphi^9_{[3],l}$, $l = 0, 7$, results in an antisymmetry distortion. These WPs provide a collection of diverse symmetric and antisymmetric well-localized waveforms, which range from smooth wavelets for $l = 0, 1$ to fast oscillating transients for $l = 5, 6, 7$. Thus, this collection is well suited to catch smooth as well as oscillating local patterns in signals. In the 2D case, these valuable properties of the spline-based wavelet packets are complemented by the directionality of the tensor-product waveforms.
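The construction of the 1D qWPs can be illustrated numerically. The sketch below builds the complementary signal in the frequency domain, following the statement that the qWP spectra equal the analytic-signal spectra plus the two added values at bins $0$ and $N/2$. It then checks that the spectrum of $\psi + i\varphi$ vanishes on the negative half-band. The test waveform is a toy stand-in (a symmetric, localized, oscillating signal), not a spline-based dWP, and all function names are illustrative.

```python
import cmath
import math

def dft(x):
    """x_hat[n] = sum_k omega**(-k*n) * x[k], with omega = exp(2*pi*i/N)."""
    N = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * k * n / N) for k in range(N))
            for n in range(N)]

def idft(x_hat):
    """Inverse of dft; carries the 1/N normalization."""
    N = len(x_hat)
    return [sum(x_hat[n] * cmath.exp(2j * math.pi * k * n / N) for n in range(N)) / N
            for k in range(N)]

def complementary(psi):
    """Hilbert transform of a real signal psi, with psi's DC and Nyquist bins restored."""
    N = len(psi)
    psi_hat = dft(psi)
    phi_hat = []
    for n in range(N):
        if n == 0 or n == N // 2:
            phi_hat.append(psi_hat[n])        # the two added spectral values
        elif n < N // 2:
            phi_hat.append(-1j * psi_hat[n])  # positive half-band
        else:
            phi_hat.append(1j * psi_hat[n])   # negative half-band
    return [v.real for v in idft(phi_hat)]    # real-valued because psi is real

N = 16
# Toy stand-in waveform (assumption): symmetric about k = 0, localized, oscillating.
psi = [math.exp(-min(k, N - k) ** 2 / 8.0)
       * math.cos(2 * math.pi * 3 * min(k, N - k) / N) for k in range(N)]
phi = complementary(psi)
qwp_plus = [complex(a, b) for a, b in zip(psi, phi)]    # Psi_+ = psi + i*phi
qwp_minus = [complex(a, -b) for a, b in zip(psi, phi)]  # Psi_- = psi - i*phi

psi_hat, phi_hat = dft(psi), dft(phi)
plus_hat, minus_hat = dft(qwp_plus), dft(qwp_minus)
```

In this sketch the bins $n = 1, \ldots, N/2 - 1$ play the role of the positive half-band. The only nonzero entries of `plus_hat` outside it are at bins $0$ and $N/2$, which is where a quasi-analytic signal differs from an analytic one.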




2.1.2 Two-Dimensional qWPs

Similarly to the 2D dWPs $\psi^p_{[m],j,l}[k, n]$, the 2D cWPs $\varphi^p_{[m],j,l}[k, n]$ are defined as the tensor products of 1D WPs such that

$$\varphi^p_{[m],j,l}[k, n] = \varphi^p_{[m],j}[k]\, \varphi^p_{[m],l}[n].$$

The $2^m$-sample shifts of the cWPs $\{\varphi^p_{[m],j,l}\}$, $j, l = 0, \ldots, 2^m - 1$, in both directions form an orthonormal basis for the space $\Pi[N, N]$ of arrays that are $N$-periodic in both directions. The 2D dWPs $\{\psi^p_{[m],j,l}\}$ as well as the cWPs $\{\varphi^p_{[m],j,l}\}$ lack the directionality property, which is needed in many applications that process 2D data. However, real-valued 2D wavelet packets oriented in multiple directions can be derived from tensor products of the complex qWPs $\Psi^p_{\pm[m],\rho}$. The complex 2D qWPs are defined as follows:

$$\Psi^p_{++[m],j,l}[k, n] \stackrel{\mathrm{def}}{=} \Psi^p_{+[m],j}[k]\, \Psi^p_{+[m],l}[n],$$
$$\Psi^p_{+-[m],j,l}[k, n] \stackrel{\mathrm{def}}{=} \Psi^p_{+[m],j}[k]\, \Psi^p_{-[m],l}[n],$$

where $m = 1, \ldots, M$, $j, l = 0, \ldots, 2^m - 1$, and $k, n = 0, \ldots, N - 1$. The real parts of these 2D qWPs are

$$\theta^p_{+[m],j,l}[k, n] = \mathrm{Re}(\Psi^p_{++[m],j,l}[k, n]) = \psi^p_{[m],j,l}[k, n] - \varphi^p_{[m],j,l}[k, n],$$
$$\theta^p_{-[m],j,l}[k, n] = \mathrm{Re}(\Psi^p_{+-[m],j,l}[k, n]) = \psi^p_{[m],j,l}[k, n] + \varphi^p_{[m],j,l}[k, n]. \qquad (2)$$
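The signs in the real parts follow directly from expanding the tensor products of the 1D qWPs. The short derivation below is a worked step that is not spelled out in the text. It uses only the definitions of $\Psi^p_{\pm[m],l}$, $\psi^p_{[m],j,l}$ and $\varphi^p_{[m],j,l}$ given above.

```latex
\begin{aligned}
\Psi^p_{+[m],j}[k]\,\Psi^p_{\pm[m],l}[n]
 &= \bigl(\psi^p_{[m],j}[k] + i\,\varphi^p_{[m],j}[k]\bigr)
    \bigl(\psi^p_{[m],l}[n] \pm i\,\varphi^p_{[m],l}[n]\bigr) \\
 &= \bigl(\psi^p_{[m],j}[k]\,\psi^p_{[m],l}[n] \mp \varphi^p_{[m],j}[k]\,\varphi^p_{[m],l}[n]\bigr)
  + i\,\bigl(\varphi^p_{[m],j}[k]\,\psi^p_{[m],l}[n] \pm \psi^p_{[m],j}[k]\,\varphi^p_{[m],l}[n]\bigr), \\
\mathrm{Re}\bigl(\Psi^p_{+\pm[m],j,l}[k,n]\bigr)
 &= \psi^p_{[m],j,l}[k,n] \mp \varphi^p_{[m],j,l}[k,n].
\end{aligned}
```

The upper sign gives the difference that defines the “positive” real part, and the lower sign gives the sum that defines the “negative” one.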


Figure 2a illustrates the design of qWPs. The DFT spectra of the 2D qWPs $\Psi^p_{++[m],j,l}$, $j, l = 0, \ldots, 2^m - 1$, are tensor products of the one-sided spectra of the 1D qWPs,

$$\hat{\Psi}^p_{++[m],j,l}[p, q] = \hat{\Psi}^p_{+[m],j}[p]\, \hat{\Psi}^p_{+[m],l}[q],$$

and, as such, they fill the quadrant $q_0$ of the frequency domain, while the spectra of $\Psi^p_{+-[m],j,l}$, $j, l = 0, \ldots, 2^m - 1$, fill the quadrant $q_1$ (see Fig. 2). Figure 3 displays the magnitude spectra of the ninth-order 2D qWPs $\Psi^9_{++[2],j,l}$ and $\Psi^9_{+-[2],j,l}$ from the second decomposition level.

Fig. 2 a Diagram of the qWP design; b Quadrants of frequency domain

Fig. 3 Magnitude spectra of 2D qWPs from the second decomposition level

Figure 3 shows that the DFT spectra of the qWPs $\Psi^9_{+\pm[m],j,l}$ effectively occupy relatively small squares in the frequency domain. For deeper decomposition levels, the sizes of the corresponding squares decrease in geometric progression. Such a configuration of the spectra leads to the directionality of the real-valued 2D WPs $\theta^p_{\pm[m],j,l}$ defined in (2). The directionality of the WPs $\theta^p_{\pm[m],j,l}$ is discussed in Averbuch et al. (2020). It is established that if the spectrum of the WP $\Psi^p_{+\pm[m],j,l}$ occupies a square whose center lies at the point $[\kappa_0, \nu_0]$, then the respective real-valued WP $\theta^p_{\pm[m],j,l}$ is represented by

$$\theta^p_{\pm[m],j,l}[k, n] \approx \cos\frac{2\pi(\kappa_0 k + \nu_0 n)}{N}\; \theta[k, n],$$

where $\theta[k, n]$ is a spatially localized low-frequency waveform that has no directionality. But the 2D signal $\cos\frac{2\pi(\kappa_0 k + \nu_0 n)}{N}$ oscillates in the direction $D$, which is orthogonal to the vector $V = \kappa_0 \mathbf{i} + \nu_0 \mathbf{j}$. Therefore, the WP $\theta^p_{\pm[m],j,l}$ can be regarded as a directional cosine wave modulated by a localized low-frequency signal $\theta$. The cosine frequencies in the vertical and horizontal directions are determined by the indices $j$ and $l$, respectively, of the WP $\theta^p_{\pm[m],j,l}$. The bigger the index, the higher the frequency in the respective direction. The situation is illustrated in Fig. 4. The imaginary parts of the qWPs $\Psi^p_{+\pm[m],j,l}$ have a similar structure. Figures 5 and 6 display the WPs $\theta^9_{+[2],j,l}$ and $\theta^9_{-[2],j,l}$, $j, l = 0, 1, 2, 3$, from the second decomposition level and their magnitude spectra, respectively. The WPs $\theta^9_{+[3],j,l}$ and $\theta^9_{-[3],j,l}$, $j, l = 0, 1, \ldots, 7$, from the third decomposition level are shown in Fig. 7.

Fig. 4 a Magnitude spectra of 2D qWP $\Psi_{++[3],2,5}[k, n]$; b WP $\theta_{+[3],2,5} = \mathrm{Re}(\Psi_{++[3],2,5})$

Fig. 5 a WPs $\theta^9_{+[2],j,l}$ from the second decomposition level; b Their magnitude spectra
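The geometry of the modulated-cosine model can be checked with a few lines of code. The sketch below evaluates the unmodulated cosine factor and computes the angle of the direction $D$ orthogonal to $V = (\kappa_0, \nu_0)$. The angle is measured in the $(k, n)$ index plane with $k$ taken as the first axis, which is an assumption made here. How that maps to compass directions in a displayed image depends on the axis convention, which the text does not fix. The function names are illustrative.

```python
import math

def direction_deg(kappa0, nu0):
    """Angle in degrees (modulo 180) of the direction D orthogonal to V = (kappa0, nu0).

    Assumption: angles are measured in the (k, n) plane with k as the first axis.
    """
    return (math.degrees(math.atan2(nu0, kappa0)) + 90.0) % 180.0

def cosine_wave(kappa0, nu0, N):
    """The N-periodic factor cos(2*pi*(kappa0*k + nu0*n)/N) on an N x N grid."""
    return [[math.cos(2 * math.pi * (kappa0 * k + nu0 * n) / N) for n in range(N)]
            for k in range(N)]

N = 32
kappa0, nu0 = 2, 5  # hypothetical center of a spectral square
wave = cosine_wave(kappa0, nu0, N)

# Stepping by (-nu0, +kappa0), i.e. along D, leaves the phase unchanged.
k, n = 7, 11
same_value = wave[(k - nu0) % N][(n + kappa0) % N]
```

For equal center coordinates the function returns 135 degrees, and for coordinates of opposite sign it returns 45 degrees, which agrees with the orientations quoted for the “diagonal” WPs in Remark 1 under the stated axis assumption.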

Remark 1 Note that all the WPs $\theta^p_{+[m],j,l}$ whose spectra are located along the vector $V$ have approximately the same orientation, as shown in Figs. 5, 6 and 7. For example, all the “diagonal” qWPs $\{\theta^p_{\pm[m],j,j}\}$, $j = 0, \ldots, 2^m - 1$, oscillate with different frequencies in the directions of either 135° (for $\theta_+$) or 45° (for $\theta_-$). Consequently, the number of orientations of the $m$th-level WPs is less than $2 \times 4^m$, which is the number of WPs. These orientational numbers are given in Table 1.

Fig. 6 a WPs $\theta^9_{-[2],j,l}$ from the second decomposition level; b Their magnitude spectra

Fig. 7 WPs from the third decomposition level

Table 1 Numbers of different orientations of qWPs $\{\theta^p_{\pm[m],j,l}\}$, $j, l = 0, \ldots, 2^m - 1$, for different decomposition levels

Level m    Number of directions
1          6
2          22
3          86
4          314
5          1218
6          4606

2.2 Implementation Scheme for 2D qWP Transforms

The spectra of the 2D qWPs $\{\Psi^p_{++[m],j,l}\}$, $j, l = 0, \ldots, 2^m - 1$, fill the quadrant $q_0$ of the frequency domain (see Fig. 2), while the spectra of the 2D qWPs $\{\Psi^p_{+-[m],j,l}\}$ fill the quadrant $q_1$. Consequently, the spectra of the real-valued 2D WPs $\{\theta^p_{+[m],j,l}\}$, $j, l = 0, \ldots, 2^m - 1$, and $\{\theta^p_{-[m],j,l}\}$ fill the pairs of quadrants $q_+ = q_0 \cup q_2$ and $q_- = q_1 \cup q_3$, respectively. For this reason, no set of linear combinations of the WPs $\{\theta^p_{+[m],j,l}\}$ and their shifts can serve as a basis of the signal space $\Pi[N, N]$. The same is true for the WPs $\{\theta^p_{-[m],j,l}\}$. However, combinations of the WPs $\{\theta^p_{\pm[m],j,l}\}$ provide frames of the space $\Pi[N, N]$.

The transforms are implemented in the frequency domain using modulation matrices of the filter banks, which are built from the corresponding wavelet packets. It is important to mention that the structure of the filter banks $Q_+$ and $Q_-$ for the first decomposition level is different for the transforms with the “positive” $\Psi^p_{+[m],l}$ and “negative” $\Psi^p_{-[m],l}$ qWPs, respectively. However, the transforms from the first to the second and further decomposition levels are executed using the same filter banks $H_m$ for the “positive” and “negative” qWPs. This fact makes a parallel implementation of the transforms possible.

The one-level 2D qWP transforms of a signal $X = \{X[k, n]\} \in \Pi[N, N]$ are implemented by a tensor-product scheme. To be specific, for the transform with $\Psi^p_{++[1]}$, the 1D transform of the columns of the signal $X$ is executed using the filter bank $Q_+$, followed by the 1D transform of the rows of the produced coefficient arrays using the same filter bank $Q_+$. These operations produce the transform coefficient array

$$Z_{+[1]} = \left\{ Z^{j,l}_{+[1]} \right\}_{j,l=0}^{1}$$

comprising four blocks. The transform with $\Psi^p_{+-[1]}$ is implemented by a subsequent application of the filter banks $Q_+$ and $Q_-$ to the columns of the signal $X$ and the rows of the produced coefficient arrays, respectively. This results in the coefficient array

$$Z_{-[1]} = \left\{ Z^{j,l}_{-[1]} \right\}_{j,l=0}^{1}.$$

The further transforms starting from the arrays $Z_{+[1]}$ and $Z_{-[1]}$ produce two sets of coefficients

$$Z_{+[m]} = \left\{ Z^{j,l}_{+[m]} \right\}_{j,l=0}^{2^m-1}, \quad Z_{-[m]} = \left\{ Z^{j,l}_{-[m]} \right\}_{j,l=0}^{2^m-1}, \quad m = 2, \ldots, M,$$

respectively. The transforms are implemented by applying the same filter banks $H_m$, $m = 2, \ldots, M$, to the rows and columns of the “positive” and “negative” coefficient arrays. The coefficients from a level $m$ comprise $4^m$ “positive” blocks of coefficients $\{Z^{j,l}_{+[m]}\}$, $l, j = 0, \ldots, 2^m - 1$, and the same number of “negative” blocks $\{Z^{j,l}_{-[m]}\}$. The coefficients from a block are inner products of the signal $X = \{X[k, n]\} \in \Pi[N, N]$ with the shifts of the corresponding wavelet packet:

$$Z^{j,l}_{\pm[m]}[k, n] = \sum_{\lambda,\mu=0}^{N-1} X[\lambda, \mu]\, \Psi^p_{+\pm[m],j,l}[\lambda - 2^m k, \mu - 2^m n],$$

$$Y^{j,l}_{\pm[m]}[k, n] = \mathrm{Re}(Z^{j,l}_{\pm[m]}[k, n]) = \sum_{\lambda,\mu=0}^{N-1} X[\lambda, \mu]\, \theta^p_{\pm[m],j,l}[\lambda - 2^m k, \mu - 2^m n].$$

Fig. 8 a Image $\mathrm{Re}(X_+)$; b Its magnitude DFT spectrum; c Image $\mathrm{Re}(X_-)$; d Its magnitude DFT spectrum
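The coefficient formulas above can be evaluated directly from their definition. The sketch below does so by brute force for one block, with all indices taken modulo $N$ because the signals are $N$-periodic. It is meant only to make the indexing concrete. The text states that the actual transforms run in the frequency domain through filter banks, which this sketch does not attempt to reproduce. The function name and the placeholder waveform are hypothetical.

```python
def block_coefficients(X, W, m):
    """Z[k][n] = sum over (lam, mu) of X[lam][mu] * W[lam - 2^m k][mu - 2^m n].

    X is an N x N signal and W an N x N waveform, both treated as N-periodic.
    No complex conjugation is applied, following the formula as written in the text.
    """
    N = len(X)
    step = 2 ** m
    size = N // step
    Z = [[0j] * size for _ in range(size)]
    for k in range(size):
        for n in range(size):
            acc = 0j
            for lam in range(N):
                for mu in range(N):
                    acc += X[lam][mu] * W[(lam - step * k) % N][(mu - step * n) % N]
            Z[k][n] = acc
    return Z

N, m = 8, 1
X = [[float(lam * N + mu) for mu in range(N)] for lam in range(N)]

# Placeholder complex waveform (assumption): a scaled delta, not an actual qWP.
W = [[0j] * N for _ in range(N)]
W[0][0] = 1 + 2j
W_real = [[w.real for w in row] for row in W]

Z = block_coefficients(X, W, m)       # complex coefficients
Y = block_coefficients(X, W_real, m)  # coefficients with the real part of W
```

For a real signal `X`, the real part of `Z` coincides with `Y`. This is the relation between the two sums above, with the real part of the waveform playing the role of $\theta^p_{\pm[m],j,l}$.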

The inverse transforms are implemented accordingly. Prior to the reconstruction, some, possibly different, structures (e.g., 2D wavelet, Best Basis or single-level wavelet packets) are defined in the sets $\{Z^{j,l}_{+[m]}\}$ and $\{Z^{j,l}_{-[m]}\}$, $m = 1, \ldots, M$, and some manipulations of the coefficients (e.g., thresholding, shrinkage, $l_1$ minimization) are executed. The reconstruction produces two complex arrays $X_+$ and $X_-$. The signal $X$ is restored by $\tilde{X} = \mathrm{Re}(X_+ + X_-)/8$. Figure 8 illustrates the restoration of the “Fingerprint” image by the 2D signals $\mathrm{Re}(X_\pm)$. The signal $\mathrm{Re}(X_-)$ captures oscillations oriented to the north-east, while $\mathrm{Re}(X_+)$ captures oscillations oriented to the north-west. The signal $\tilde{X} = \mathrm{Re}(X_+ + X_-)/8$ restores the image perfectly, achieving PSNR = 312.3538 dB.
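A PSNR above 300 dB is consistent with reconstruction error at the level of floating-point round-off. The sketch below uses the common definition $\mathrm{PSNR} = 10 \log_{10}(\mathrm{peak}^2/\mathrm{MSE})$ with a peak value of 255. Both the definition and the peak value are assumptions here, since the text does not state the exact formula used.

```python
import math

def psnr(reference, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D arrays."""
    count = 0
    squared_error = 0.0
    for ref_row, res_row in zip(reference, restored):
        for a, b in zip(ref_row, res_row):
            squared_error += (a - b) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        return math.inf  # identical arrays
    return 10.0 * math.log10(peak ** 2 / mse)

image = [[0.0, 64.0], [128.0, 255.0]]
off_by_one = [[1.0, 65.0], [129.0, 254.0]]  # every pixel differs by one gray level
value = psnr(image, off_by_one)             # mean squared error equals 1
```

Under these assumptions, a PSNR of about 312 dB corresponds to a root-mean-square error of roughly $6 \times 10^{-14}$ gray levels, which is the magnitude expected from double-precision arithmetic rather than from any loss of information.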

2.3 Summary of 2D qWPs

The qWP waveforms have the following characteristics:

1. The qWP waveforms are oriented in multiple directions;
2. The qWP waveforms have oscillating structure with multiple frequencies;
3. The qWP waveforms have local vanishing moments;
4. The qWP waveforms are well localized in the spatial domain;
5. The DFT spectra of the qWPs produce a refined frequency separation;
6. The corresponding transforms are implemented in a fast way using the FFT;
7. The transform coefficients have a clear (explainable) physical meaning.



Due to their variety of orientations, the qWPs capture edges even in severely degraded images, and their oscillatory shapes with a variety of frequencies make it possible to recover fine structures. Multiple experiments on image denoising (Averbuch et al. 2020) and inpainting (Averbuch et al. 2021) demonstrate that qWP-based methods are quite competitive with the best state-of-the-art algorithms. Moreover, qWPs have a strong potential to handle a new important class of problems. Namely, the above properties of qWPs provide a well-suited tool for feature extraction from images and, in that capacity, can serve as a significant component of Deep Neural Networks (DNNs). Due to the versatility of the testing waveforms and, most importantly, the explanatory physical meaning of the transform coefficients resulting from image convolution with a variety of qWP waveforms, it is possible to replace at least some of the convolution layers in convolutional DNNs by convolving the image with qWPs. This is expected to lead to a significant reduction of the training dataset size. Our preliminary experiments demonstrate the feasibility of qWPs for such a task.

3 Numerical Examples

In this section, we present several examples.

3.1 Image Denoising

The BM3D algorithm (Dabov et al. 2007) is one of the best image denoising methods; it exploits the self-similarity of patches and the sparsity of the image in a transform domain. This method is efficient in restoring moderately noisy images. However, BM3D tends to over-smooth and to smear fine structure and edges when the noise is strong. BM3D is also unsuccessful when the image contains many edges oriented in multiple directions. On the other hand, algorithms that use directional oscillating waveforms provide the opportunity to capture lines, edges and texture details. Therefore, it is natural to combine the qWP-based and BM3D algorithms in order to retain the strong features of both algorithms and to get rid of their drawbacks. The description of two hybrid qWP-BM3D algorithms and the results of multiple experiments on image denoising are presented in a paper submitted to a journal (see the preprint Averbuch et al. 2020). The four original images of the experiments are “Fingerprint”, “Fabric”, “Bridge” and “Man”$^2$ (see Fig. 9). The images were degraded by zero-mean Gaussian noise with STD σ = 5, 10, 25, 40, 50, 80, 100 dB. The images restored by the two hybrid algorithms, denoted H1 and H2, were compared with BM3D-restored images based on PSNR and SSIM values$^3$ as well as visual perception. The diagrams in Fig. 10 illustrate the results of the experiments. Although the PSNR values of all three algorithms are very close to each other, the SSIM achieved by the hybrid algorithms is significantly higher than that achieved by BM3D. This is especially true for texture-rich images like “Fingerprint” and “Bridge”. Figure 11 displays the restoration of the “Bridge” image from the input degraded by Gaussian noise with STD σ = 50 dB. In this experiment, the image restored by H1 has PSNR = 23.76 dB and SSIM = 0.4963 versus PSNR = 23.61 dB and SSIM = 0.4625 achieved by BM3D. Accordingly, H1 managed to restore some fine structures that were blurred by BM3D. Figure 12 displays the restoration of the “Fingerprint” image from the input degraded by strong Gaussian noise with STD σ = 100 dB. In this experiment, the image restored by H1 has PSNR = 21.68 dB and SSIM = 0.7019 versus PSNR = 21.65 dB and SSIM = 0.6643 achieved by BM3D. Because of the strong noise, BM3D blurred some fragments of the image while H1 managed to restore them.

Fig. 9 Original images

Fig. 10 Diagrams of PSNR and SSIM values for restoration of “Fingerprint”, “Fabric”, “Bridge” and “Man” images degraded by noise

Fig. 11 Restoration of “Bridge” image from input degraded by Gaussian noise with STD σ = 50 dB

$^2$ These images were not reported in Averbuch et al. (2020).

$^3$ We used the Matlab 2020b function ssim.m to calculate SSIM values.

3.2 Image Inpainting

The designed qWPs demonstrated high efficiency in dealing with the image inpainting problem, that is, the restoration of images degraded by the loss of a significant share of their pixels and possibly by added noise. State-of-the-art results in image inpainting were achieved by the iterative algorithm using decreasing thresholding values presented in Shen et al. (2014, 2017). In Che and Zhuang (2018), the performance of that algorithm with different filter banks was investigated. To be specific, the set of filter banks DAS-2, DAS-1 (Zhuang 2016), TP-CTF$_6$ and TP-CTF$_6^{\downarrow}$, which we call the SET-4 filter banks, was utilized. Note that for different types of images, different filter banks from SET-4 were advantageous. We designed a qWP-based iterative algorithm, which combines the split Bregman iteration scheme (Goldstein and Osher 2009) with adaptive decreasing thresholding. In multiple experiments on the restoration of images corrupted by the loss of a large number of pixels and by the addition of Gaussian noise of various intensities, we compared the performance of our qWP-based algorithm with the performance of the SET-4 algorithms. The description of the algorithm and the results of multiple experiments on image inpainting are presented in Averbuch et al. (2021). The results are compared according to PSNR and SSIM values and by visual perception. As in the denoising experiments, the qWP algorithm prevailed in the restoration of edges and fine structure even in severely degraded images. This fact is reflected in the highest values of SSIM. A typical example is displayed in Fig. 13. In this example, the qWP restoration of the “Bridge” image$^4$ degraded by missing 50% of its pixels and additive Gaussian noise with σ = 50 dB is compared with the restoration by DAS-2, which was the best of the SET-4 algorithms.

Fig. 12 Restoration of “Fingerprint” image from input degraded by Gaussian noise with STD σ = 100 dB

$^4$ The “Bridge” image did not participate in the experiments presented in Averbuch et al. (2021).
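The thresholding step used inside such iterative schemes can be sketched generically. The code below shows complex soft thresholding (shrinkage) together with a geometric sequence of decreasing thresholds. The geometric schedule and both function names are hypothetical choices made for illustration. The adaptive rule of the qWP-based algorithm is described in Averbuch et al. (2021) and is not reproduced here.

```python
def soft_threshold(z, t):
    """Shrink the magnitude of a coefficient z by t while keeping its phase.

    Coefficients whose magnitude does not exceed t are set to zero.
    """
    magnitude = abs(z)
    if magnitude <= t:
        return 0j
    return z * (1.0 - t / magnitude)

def decreasing_thresholds(t_start, ratio, count):
    """Hypothetical geometric schedule: t_start, t_start*ratio, t_start*ratio**2, ..."""
    return [t_start * ratio ** i for i in range(count)]

schedule = decreasing_thresholds(8.0, 0.5, 4)
coefficients = [3 + 4j, 1 + 0j, -6j]
shrunk = [soft_threshold(z, 2.5) for z in coefficients]
```

Starting with a large threshold keeps only the strongest coefficients in early iterations, and lowering it gradually lets finer detail enter the estimate.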



Fig. 13 Restoration of “Bridge” image

4 Discussion

The paper describes the design of one- and two-dimensional quasi-analytic WPs (qWPs) originating from polynomial splines of arbitrary order and the corresponding transforms. The qWP transforms operate in spaces of periodic signals. The periodic setting provides substantial opportunities for the design and implementation of WP transforms. The properties of the designed qWPs, such as the orientation of the waveforms in multiple directions combined with oscillations at multiple frequencies (to name a few), proved to be highly beneficial for image restoration. Our experiments on image denoising and inpainting using qWP-based algorithms produced state-of-the-art results. In summary, this versatile and flexible tool can be applied to multiple data processing problems, such as image deblurring, superresolution, segmentation and classification, and target detection. In the last of these, directionality is extremely important. Directional 3D wavelet packets, which are under design, can be useful in seismic and hyperspectral processing.



Acknowledgements This research was supported by the Israel Science Foundation (ISF, 1556/17, 1873/21), Len Blavatnik and the Blavatnik Family Foundation, Israel Ministry of Science Technology and Space (3-16414, 3-14481, 3-17927) and Academy of Finland (grant 311514).

References

Averbuch A, Coifman RR, Donoho DL, Israeli M, Shkolnisky Y (2008a) A framework for discrete integral transformations I: the pseudopolar Fourier transform. SIAM J Sci Comput 30(2):764–784
Averbuch A, Coifman RR, Donoho DL, Israeli M, Shkolnisky Y, Sedelnikov I (2008b) A framework for discrete integral transformations II: the 2D discrete Radon transform. SIAM J Sci Comput 30(2):785–803
Averbuch A, Neittaanmäki P, Zheludev V (2019) Splines and spline wavelet methods with applications to signal and image processing, Selected topics, vol III. Springer, Cham
Averbuch A, Neittaanmäki P, Zheludev V (2020) Directional wavelet packets originating from polynomial splines. arXiv:2008.05364
Averbuch A, Neittaanmäki P, Zheludev V, Salhov M, Hauser J (2020) Coupling BM3D with directional wavelet packets for image denoising. arXiv:2008.11595
Averbuch A, Neittaanmäki P, Zheludev V, Salhov M, Hauser J (2021) Image inpainting using directional wavelet packets originating from polynomial splines. Signal Process Image Commun 97:116334
Bayram I, Selesnick IW (2008) On the dual-tree complex wavelet packet and M-band transforms. IEEE Trans Signal Process 56(6):2298–2310
Candès E, Demanet L, Donoho D, Ying LX (2006) Fast discrete curvelet transforms. Multiscale Model Simul 5(3):861–899
Candès EJ, Donoho DL (2004) New tight frames of curvelets and optimal representations of objects with piecewise $C^2$ singularities. Commun Pure Appl Math 57(2):219–266
Che Z, Zhuang X (2018) Digital affine shear filter banks with 2-layer structure and their applications in image processing. IEEE Trans Image Process 27(8):3931–3941
Coifman RR, Wickerhauser MV (1992) Entropy-based algorithms for best basis selection. IEEE Trans Inform Theory 38(2):713–718
Dabov K, Foi A, Katkovnik V, Egiazarian K (2007) Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans Image Process 16(8):2080–2095
Do MN, Vetterli M (2003) Contourlets. In: Welland GV (ed) Beyond Wavelets. Academic, San Diego, CA, pp 83–105
Goldstein T, Osher S (2009) The split Bregman method for L1-regularized problems. SIAM J Imaging Sci 2(2):323–343
Han B, Mo Q, Zhao Z, Zhuang X (2019) Directional compactly supported tensor product complex tight framelets with applications to image denoising and inpainting. SIAM J Imaging Sci 12(4):1739–1771
Han B, Zhao Z (2014) Tensor product complex tight framelets with increasing directionality. SIAM J Imaging Sci 7(2):997–1034
Han B, Zhao Z, Zhuang X (2016) Directional tensor product complex tight framelets with low redundancy. Appl Comput Harmon Anal 41(2):603–637
Jalobeanu A, Blanc-Féraud L, Zerubia J (2000) Satellite image deconvolution using complex wavelet packets. In: Proceedings 2000 international conference on image processing. IEEE, pp 809–812
Ji H, Shen Z, Zhao Y (2017) Directional frames for image recovery: multi-scale discrete Gabor frames. J Fourier Anal Appl 23(4):729–757
Ji H, Shen Z, Zhao Y (2018) Digital Gabor filters with MRA structure. Multiscale Model Simul 16(1):452–476
Kingsbury N (1999) Image processing with complex wavelets. Phil Trans R Soc A 357(1760):2543–2560
Kutyniok G, Labate D (eds) (2012) Shearlets: multiscale analysis for multivariate data. Birkhäuser, Boston
Kutyniok G, Lim W-Q, Zhuang X (2012) Digital shearlet transforms. In: Shearlets: multiscale analysis for multivariate data. Birkhäuser, Boston, pp 239–282
Selesnick IW, Baraniuk RG, Kingsbury NC (2005) The dual-tree complex wavelet transform. IEEE Signal Process Mag 22(6):123–151
Shen Y, Han B, Braverman E (2014) Image inpainting using directional tensor product complex tight framelets. arXiv:1407.3234
Shen Y, Han B, Braverman E (2017) Image inpainting from partial noisy data by directional complex tight framelets. ANZIAM J 58(3–4):247–255
Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612
Zhuang X (2016) Digital affine shear transforms: fast realization and applications in image/video processing. SIAM J Imaging Sci 9(3):1437–1466

Trends in Scientific Computing

Fifty Years of High-Performance Computing in Finland

Kimmo Koski, Pekka Manninen, and Tommi Nyrönen

Abstract In the past, scientific computing was primarily established around a supercomputer, with a number of applications adapted to it. HPC resources were used by only a few disciplines and to a limited extent. In recent decades, this has changed dramatically. Computing power has increased significantly, and its use is much wider and more heterogeneous. We now speak of an efficient HPC ecosystem in which computing, data, networks, applications and, above all, competent people play a major role. In Finland, services for research have been provided by CSC–IT Center for Science for the last 50 years, from the era of centralized supercomputing in the late 1980s to the HPC ecosystem consisting of various resources and support structures in 2020. The provision of services has rapidly developed from primarily national to increasingly global and European cooperation. This paper describes the Finnish system in the past and in the future. The EuroHPC initiative, which will result in one of the world’s most efficient computing environments being built at CSC, is described in more detail, as is ELIXIR, one of the largest data-intensive research areas and infrastructures. In addition, the future direction is discussed.

Keywords Data · Infrastructure · Research · HPC · EuroHPC · LUMI supercomputer · ELIXIR

K. Koski · P. Manninen · T. Nyrönen
CSC – IT Center for Science, P.O. Box 405, FI-02101 Espoo, Finland

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053,

1 Introduction

The era of scientific computing in Finland started in 1971, when a group of people began working on punch cards with the first “high-end compute” Univac 1108 installed in Finland. At that time, a very limited number of people with high-performance computing (HPC) competence in scientific computing worked at the state computing center (Valtion tietokonekeskus VTKK), where operations continued to develop. Eighteen years later, the first real supercomputer, the Cray X-MP, was



K. Koski et al.

completed and installed in 1989. By then, the VTKK unit had changed its format and working methods and had become CSC–IT Center for Science, the Finnish national center for scientific computing, networks and data, which supports versatile software environments in various disciplines. Since the era of the X-MP, scientific computing services have continued to develop, and in 2021 CSC is one of the major HPC centers in Europe. The computing performance of supercomputers has evolved from less than one teraflop/s ($10^{12}$ floating-point operations per second) in the Cray X-MP to 0.5 exaflop/s ($10^{18}$ floating-point operations per second) in the LUMI HPE Cray EX system, which is due to be completed in early 2022. This represents a half-million-fold increase in nominal performance over the past 30 years. Although we usually highlight computing performance, that is not the whole story. The strengths of the Finnish model in serving computational users include many other aspects, such as a balanced HPC environment with sufficient communication networks and data systems, efficient applications and a strong support structure. Due to long-term investment in the development of high-end computing competence, Finland is one of the most successful countries in HPC and was selected to host Europe’s most efficient system, LUMI, which will be in operation in late 2021 or early 2022. This paper reviews some of the history of the past 50 years in Sect. 2 and provides insights into possible future developments in Sect. 3. LUMI is a EuroHPC project and is described in more detail. One of the top science application areas of EuroHPC, built up by CSC for more than 10 years, is bioinformatics. We discuss the use of systems for computational genomics, which is one of the priorities of the European life sciences data infrastructure ELIXIR.

2 History

In 2021, CSC turned 50 years old. It took 18 years from the founding of the company before the first real supercomputer, the Cray X-MP, arrived in Otaniemi, Finland.

2.1 From Centralized Systems to Computing Ecosystem

In 1989, when the Cray X-MP was installed in Finland, the variety of applications for such a supercomputer was not very wide. Weather forecasts, fluid dynamics and astrophysics were typical areas where vector computing was efficient. Much of the computational load was carried by PCs and later by RISC workstations or their clusters, while supercomputing was reserved for rare, although important, computational challenges.



During the following decades, HPC services evolved and used various resources, such as the aforementioned clusters, massively parallel processors, or scalar processors with different memory sizes and architectures. Ideally, load balancing meant that each software application was run on the system whose processor architecture best matched its program code. This principle was called metacomputing in the mid-1990s. Vector systems began to disappear, and massively parallel systems, PC clusters and large shared-memory systems took their place. Later, new concepts such as grid computing and cloud computing entered the arena, sometimes only by introducing new terminology and a new approach to existing technology. Computing resources include traditional CPUs, GPUs, and eventually, in a few years or more, possibly quantum computers. The principles of metacomputing still apply: a wide variety of resources is needed to support different scientific software algorithms. Today, instead of a “supercomputer,” we like to refer to an “HPC ecosystem.” This illustrates the ongoing changes in hardware construction paradigms. Technical performance is the starting point, but there is an increasing emphasis on how the setup can be utilized efficiently to solve major scientific challenges. Therefore, software tuning, high-performance access to large data, and application development are important competencies for maximizing the impact of public investments.

2.2 Development of CSC At the same time as the decision on the Cray X-MP was made, it was foreseen that the efficient use of the system would require support services in many areas, primarily to keep the systems up but also to provide scientific support to enable efficient use of the system. CSC was commissioned to set up and operate the system and to provide technical support and research-specific services. The latter mainly included application software support, such as code tuning and methodology consultations. In 1989, CSC employed about 30 people, focusing mainly on supercomputing and research networks. During the 1990s, CSC’s business expanded. The company installed various computing resources, so the availability of data from these resources became important. CSC’s staff continued to grow. CSC employed 80 people in the late 1990s and now, after about 20 years, 500 people. Services and customer base have expanded into new areas, such as education, data preservation, and the digitalization of higher education and public administration. As a strong company with sufficient volume and expertise, CSC has been able to excel in international and European collaboration. The impact of EC funding on CSC’s service and capability development is significant. In 2019, a major breakthrough was achieved when it was decided to locate one of Europe’s most efficient systems in CSC’s data center in Kajaani. This EuroHPC LUMI supercomputer is a joint project of 10 countries and the EU, which shows strong trust in CSC through its investments in Finland.


K. Koski et al.

2.3 Towards the European HPC Ecosystem The first EU projects were launched in the mid-to-late 1990s. The focus of the national computing centers was to learn how to work together. The DEISA project for HPC and the EGEE project for grid computing were the first collaborations to pave the way for closer European collaboration. The HPCEUR initiative was launched in 2005 with the aim of co-financing the construction of large European HPC centers. This initiative failed and did not result in a viable funding model. As Europe had lagged behind the rest of the world, something had to be done. The HPC European Task Force (HET) was set up to find ways to work together sustainably. HET finally succeeded in providing a European HPC strategy and recommendations on how to proceed. As a result, 16 countries signed a Memorandum of Understanding, which led to the establishment of the association PRACE (Partnership for Advanced Computing in Europe). A few years ago, the competition for global HPC expertise once again received a lot of attention. Concerns grew that Europe would not keep pace with China, Japan and USA. The EuroHPC initiative was set up to reduce the gap. EuroHPC initiated a call for European pre-exascale supercomputers, which HPCEUR had already tried but failed. For the first time, the EU also funded infrastructure (50%) and not just the workload, as in several PRACE projects from 2008 to the present. Three EuroHPC pre-exascale centers were established in Finland (LUMI), Italy (Leonardo) and Spain (Mare Nostrum 5) and five smaller petascale centers elsewhere in Europe. The way to a smooth and successful European HPC collaboration has not been easy. However, steps in the right direction have been taken, and the importance of computational resources is now better understood than before. 
Today, the focus should be—and fortunately increasingly is—on a full HPC ecosystem that contains not only various computing resources but also access to data, high-performance network connections, high-tech software applications and, above all, competent people. Scientific problems require a variety of computational resources from highly connected parallel applications to data-intensive and distributed solutions. Perhaps for these reasons, we are seeing a number of financial streams in Europe, such as EuroHPC for HPC, EOSC for cloud computing, centers of excellence, artificial intelligence, and thematic calls ranging in many areas from transport to health. The key challenge is how all these initiatives work together to construct a digital Europe– taking into account that their governance and funding models are in many cases fragmented. The outcome of this collaboration will be decisive in evaluating the success of the European HPC Ecosystem, a topic in which billions have been invested over the past decade or two.



3 Future Developments In 2021, CSC was one of the main stakeholders in European HPC due to its active role and its extensive EU framework program project portfolio. Finland is the focus of HPC thanks to its LUMI infrastructure, which is expected to be the most efficient system in Europe when it becomes operational in late 2021/early 2022. The elements of success are in place, but the work has only begun with the writing of this chapter.

3.1 Prerequisites for Success A system capable of performing over 500 petaflop/s can make a big difference in science, but performance alone is not enough. Making full use of the capacity requires a lot of work in application development and in porting software to new architectures. And just performing a simulation is not enough: the results must also be understood. Much attention needs to be paid to training and education, as the efficient use of the system requires competent people. Much work has been done to enable computing capacity and build a support structure for the life cycle of current EuroHPC systems until 2026 or even 2027. The focus of activity is now shifting to boosting application development and scaling, developing competencies, and creating new capabilities. Developments are needed, for example, to make high-quality data about LUMI available, together with European data resources, including life and health sciences. Cooperation between EuroHPC countries would also be useful for emerging global collaboration opportunities. The focus should be on the overall HPC ecosystem, rather than just on peak performance and reaching the TOP500 list. Ecosystem thinking also enables better collaboration between countries. Not everyone has to be physically in the same place; people can take part in code development or training activities from their own country, to name a few examples. This can enhance collaboration, bring more synergy and enable higher quality services when done together. The recipe for major success is simple: a balanced ecosystem and collaboration between stakeholders will lead to the scientific discoveries and breakthroughs made possible by the system.

3.2 LUMI in Kajaani The LUMI (Large Unified Modern Infrastructure) consortium was formed for the EuroHPC application process. The participating countries are Finland, Belgium, the Czech Republic, Denmark, Estonia, Iceland, Norway, Poland, Sweden and Switzerland. The consortium provides a high-quality, cost-efficient and environmentally sustainable HPC ecosystem based on genuine European collaboration. Member countries also offer a high level of competence from computing and training to data management. The supercomputing and data infrastructure of CSC’s data center in Kajaani will help make Europe one of the world leaders in supercomputing, enabling European researchers to access world-class computing resources. State-of-the-art computing resources also provide a basis for research in areas that have previously been out of reach. This will increase the potential for scientific breakthroughs with an immense societal impact, such as understanding climate change and human health. New data management and HPC infrastructure will foster innovation and new data-based business opportunities, including the development of artificial intelligence.

LUMI is an HPE Cray EX supercomputer that consists of several partitions designed to support various user needs. The largest partition of the system is LUMI-G, which consists of GPU-accelerated nodes that use the next-generation AMD Instinct “MI200” GPUs. In addition to this, there is a smaller CPU partition, LUMI-C, with AMD Epyc “Milan” CPUs. For data analytics, the system includes an auxiliary section with large memory nodes and some GPUs for data visualization. The system is installed in two steps. In the first phase, in the summer of 2021, everything except the GPU partition was installed. In the second phase, in early 2022, upgrades will be made and full capacity will be available with the deployment of LUMI-G.

One of CSC’s major data resources for HPC and LUMI ecosystems comes from the life sciences. Life science research is both data-intensive and computationally intensive and requires increasingly distributed cloud, computing, storage and access services to thrive. Supporting HPC in the life sciences has been a strategic goal for CSC for more than 20 years.

3.3 ELIXIR and National Nodes ELIXIR is a European strategic research data infrastructure of global significance. Since its official launch in December 2013, ELIXIR has brought together the European national centers and the core bioinformatics resources into a single coordinated infrastructure. ELIXIR invests in and aims to sustain long-term national bioinformatics infrastructures and ensure their level of interoperability, helping to make scientific discoveries. ELIXIR connects national bioinformatics expertise into an international bioinformatics community, and Europe’s 23 ELIXIR nodes manage data resources and associated research services. ELIXIR Europe is represented by the European Molecular Biology Laboratory (EMBL), which has its own budget and governance. The ELIXIR Consortium Agreement defines ELIXIR. In Finland, ELIXIR was established as a state act in 2015, when CSC–IT Center for Science was authorized to operate the local ELIXIR node.




ELIXIR Europe facilitates collaboration between its member institutes and researchers through two intersecting groups of organizations, Platforms and Communities. In order to stay up-to-date, it is important that research infrastructures systematically reach user communities driven by researchers and bioinformaticians in domain- or technology-specific research around a specific theme, e.g., human data, metabolomics, microbial biotechnology. ELIXIR has a total of eleven communities. Overall, ELIXIR communities and platforms foster long-term collaboration that connects national resources to international infrastructure. CSC is the ELIXIR node in Finland. The node specializes in delivering computing and storage services with integrated computational access to very large biological data resources, as well as computational science training events tailored to the needs of life scientists and biological research. Life sciences is one of CSC’s main user communities. In 2020, CSC provided computing and data management services to 434 health and biology research projects. These research projects form one of Finland’s fastest growing disciplines. They already account for a third of the national computational research supported by the CSC infrastructure. The sensitive data cloud services cPouta and ePouta, developed to meet the needs of new uses for computational services, have been operational since 2015. CSC aims to ensure that life science projects, irrespective of their status with respect to ELIXIR, have access to the advanced and scalable long-term HPC systems needed to make scientific discoveries. Since the early 2000s, CSC has consistently driven technology and service development to enable the delivery of cloud computing and storage services in conjunction with bioinformatics data management and to focus more on appropriate data management. In 2021, CSC was the leader in the ELIXIR Europe Compute platform.
Over the past ten years, the Finland node has influenced security practices on a European scale with its services, such as the delivery of HPC to bioinformatics, user authentication and authorization infrastructure. The federated European identity and access management services for several interdependent parties and thousands of researchers are jointly operated by CSC in Finland and Masaryk University in the Czech Republic. Federated identity and access management is one of the most promising technologies for future development to manage the sharing of data on the computing capabilities of EuroHPC (e.g., LUMI). The strategy is to make these environments compatible with containerized applications for sensitive data (e.g., personalized medicine) in the context of large initiatives such as the European Health Data Space. The combined work of ELIXIR and European e-infrastructures such as EOSC and EuroHPC takes place on a daily basis at CSC. This work brings together two distinct highly skilled communities of technical experts. Data management experts, together with HPC experts, can ensure that technical solutions are suited to future scientific needs, such as the processing of sensitive data in large-scale e-infrastructures. Technological development in Finland supports Europe’s broader ambitions of increasing digital sovereignty. Each scientific community establishes and manages its own standards for describing and accessing datasets, reporting data, matching and comparing content, and eventually building linkages between datasets and their analysis. These specifications form the basis for digital platforms where computing is combined with access to data resources that serve the entire research process.

3.4 LUMI’s Eco-efficiency LUMI is world class in terms of environmental sustainability and cost efficiency. It will help the European ICT sector to become greener and reduce costs, which is necessary for the EU to meet its ambitious climate targets. Reducing CO2 emissions is a critical goal worldwide. In addition to its other Green Deal objectives, the EU aims to make European data centers climate-neutral by 2030. The supercomputer’s waste heat will account for about 20 percent of the district heating in the city of Kajaani, which will substantially reduce the city’s carbon footprint. The electricity company Loiste Lämpö has built a district heating interface for the data center, and the company will also take care of utilizing the waste heat generated by the computer. The need for cooling is also reduced by the fact that the outdoor temperature and, as a result, the building’s thermal stress is much lower in Kajaani than in Southern Europe, for instance. The brownfield building solution also decreases LUMI’s carbon footprint. LUMI’s energy consumption will be covered entirely by Vattenfall’s hydroelectricity. Additionally, the power grid in the data center is very reliable, and the electricity costs are low. In addition to technical and structural transformations, attention has also been paid to the visual appearance of the data center. LUMI, which means ‘snow’ in Finnish, will get a unique shield in a white space resembling a snow bank. The structure is made of perforated aluminum composite panels, which will also be illuminated.

4 Conclusions The future of HPC will be based on a balanced ecosystem and intense international collaboration. In the first phase, EuroHPC will contain three pre-exascale centers and five petascale centers, which are expected to work together. Multiple projects will be launched on these infrastructures with the target of expanding and improving their use to achieve scientific or industrial breakthroughs wherever HPC can impact. New modules, such as quantum computers, will complement the ecosystem. The focus cannot be just on computing power but on all aspects of the HPC ecosystem, including data structures, networks, software, and most of all, competent people. Computers do not solve problems—people do. LUMI will start operations in late 2021 and early 2022. CSC and the LUMI consortium must prove that the system is hosted in a flawless manner, facilitate its uptake and make the best possible use of it in the various scientific communities.



LUMI will see its best days in 2022–2024 and will likely need a minor upgrade or extension somewhere in the middle of its operations. At the same time, CSC and LUMI will start looking at the horizon and towards LUMI’s successor. CSC has now become a Tier-0 computing center and will certainly want to maintain that position in the future after the current LUMI system. The growth of health data and its applications is a globally recognized phenomenon. For example, the EU’s 1+Million Genomes initiative aims to make at least one million sequenced genomes available in the EU by 2022, an estimated 100 petabytes of access-controlled data for analysis on high-performance computers. The collection and management of genomic data requires the progress of scientific computing infrastructure as well as discussions between different governance sectors and with experts from national organizations, in Finland CSC, the Finnish Institute for Health and Welfare (THL) and FIMM Technology Centre at the University of Helsinki. These actions are a key factor in promoting the growth of the health sector in Europe. Thanks to increased computing power and the balanced ecosystem that supports it, previously unsolved problems can now be solved. The health sector faces some major scientific challenges, including the cure of cancer and other diseases and the identification of infectious pathogens such as COVID-19. In other areas, from environmental sciences to hard physics and more, there are equally important challenges ahead. CSC has been working for 50 years with the goal of providing high-quality services and infrastructure for Finnish and European scientists. Because of LUMI, the role will be even more global in the future.

Quantum Scientific Computing Matthias Möller

Abstract Quantum computing began in 1980 with Paul A. Benioff’s theoretical feasibility study. Since then, it has brought a wealth of publications on quantum information theory, quantum technologies, algorithms, and potential applications. In this paper, we give a brief overview of developments in this field, focusing on their potential impact on scientific computing in the post-quantum era. Keywords Quantum scientific computing · Quantum algorithms · Scientific computing · Quantum computing · Future computing

M. Möller, Delft University of Technology, Delft Institute of Applied Mathematics, Mekelweg 4, 2628 Delft, The Netherlands, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053,

1 Introduction Computational scientists are traditionally among the first to show interest in emerging compute technologies, long before these become mainstream or disappear into oblivion because they could not fulfil the high expectations or were simply superseded by another technology. About the time that Cray Research brought their first supercomputers, the Cray-1 (1976) and the Cray X-MP (1982), to the market, NASA was running the ‘Finite Element Machine’ project (Storaasli et al. 1982) to build and evaluate the performance of a parallel computer for structural analysis. This is not the only report of the successful early adoption of an emerging compute paradigm by computational scientists. Mackerle et al. (1996) gives a comprehensive overview of the early days of computational finite elements with hundreds of papers on this topic published between 1980 and 1990. Likewise, the unparalleled success of another emerging compute technology, General-Purpose Graphics Processing Units (GPGPU), started with early adopters from the scientific computing community, who explored the potential of graphics cards to speed up finite element analysis (Cecka et al. 2011; Komatitsch et al. 2009) and the solution of linear systems of equations (Göddeke et al. 2005). Since then, many-core processors and high-performance Graphics Processing Unit (GPU) computing have become mainstream technologies that nowadays power workstations and supercomputers alike. The next compute technology on the horizon that holds the promise to radically change the way we will be solving computational problems in the future is Quantum Computing (QC). To quote from the EU’s Quantum Flagship program, we are at the “second quantum revolution” which “should lead to devices with far superior performance and capabilities for [...] simulation and computing” (European Commission will launch 2021). In recent years, QC has changed from curiosity-driven research of individual scientists to a high-priority topic for politicians worldwide with massive investments to secure a leading position in the list of nations to have quantum technologies in their arsenal. The global quantum race between Europe, China, and the U.S. is not only about post-quantum cryptography technologies, admittedly a question of national security interest, but also about the expected advantage of quantum computers over classical ones to “open new opportunities to address grand challenges in such fields as energy, health, security and the environment” (European Commission will launch 2021). Interestingly enough, finite elements are again among the first methodologies with wide application in the scientific computing community and in daily engineering practice whose potential in a post-quantum era has been investigated in the literature (Clader et al. 2013; Montanaro and Pallister 2016). Both papers dampen the expectation to achieve exponential speedup over classical numerical methods as promised by the underlying quantum algorithm’s theoretical performance (Harrow et al.
2009) but leave at least some hope to obtain polynomial speedup for very high-dimensional problems (cf. Sect. 3.1). In this paper, we give a brief overview of developments in the field of scientific quantum computing. Our focus is on quantum algorithms and early applications. Section 2 provides an introduction to the principles of quantum computing and the challenges that can be observed in practice. We discuss different types of quantum and hybrid classical-quantum algorithms in Sect. 3. In this section, we also discuss quantum algorithms that are specifically tailored for solving partial differential equations. We conclude with a summary of opportunities for scientific quantum computing in Sect. 4.

2 Quantum Computing in a Nutshell While classical computers are centered around bits and bytes, quantum computers make use of so-called quantum bits (qubits) to store information. From an abstract mathematical point of view, a single qubit can hold any superposition of the two basis states $|0\rangle$ and $|1\rangle$, i.e.

$$|\psi\rangle = \alpha_0 |0\rangle + \alpha_1 |1\rangle, \quad \alpha_0, \alpha_1 \in \mathbb{C}, \quad |\alpha_0|^2 + |\alpha_1|^2 = 1,$$

where $|\alpha_0|^2$ and $|\alpha_1|^2$ are the probabilities of collapsing to the classical state 0 and 1 upon measurement, respectively. To illustrate this core concept of quantum computing, consider a coin flip experiment. A fair coin will fall on its head and tail to equal parts so that it can be considered to be in equal superposition of the two possible states, i.e.

$$|\psi\rangle = \frac{1}{\sqrt{2}} |0\rangle + \frac{1}{\sqrt{2}} |1\rangle.$$

Note that this does not mean that the coin is in an intermediate third state (‘standing on the edge’) but that, on average, in every second experiment the coin falls on head. Like in this example, we can only observe the outcome of a quantum experiment after forcing it to collapse into one of the two classical states, which is termed measurement. It is therefore imperative to perform multiple runs of the same quantum experiment and determine the correct solution(s) to the problem under consideration from the frequency of its occurrence in the measured output. What sounds like a severe limitation at first glance, e.g., for deterministic problems like adding two numbers, turns out to be an advantage for problems that involve uncertainties, e.g., in the input or boundary conditions. The power of quantum computing comes from combining multiple qubits to a quantum register, which is able to hold a superposition of the $2^n$ basis states, i.e.

$$|\Psi\rangle = |\psi_1\rangle \otimes \cdots \otimes |\psi_n\rangle = \sum_{i=0}^{2^n-1} \alpha_i |i\rangle, \quad \alpha_i \in \mathbb{C}, \quad \sum_{i=0}^{2^n-1} |\alpha_i|^2 = 1.$$

Here, $\otimes$ is the Kronecker product and $|i\rangle$ denotes the state that corresponds to the bit string representation of $i$, e.g., $|3\rangle = |0011\rangle$ assuming least-significant bit encoding and $n = 4$. Consequently, $|\alpha_i|^2$ is the probability of obtaining $i$ upon measurement. Typically starting from a freshly reset quantum register $|\Psi_{\mathrm{in}}\rangle = |0\rangle$, a sequence of unitary matrices $U_j$ is applied to manipulate the quantum register’s state, i.e.

$$|\Psi_{\mathrm{out}}\rangle = U_m \cdots U_1 |\Psi_{\mathrm{in}}\rangle. \tag{1}$$
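The measurement rule described above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the helper names `basis_state` and `measure` are hypothetical, the state is stored as a plain vector of amplitudes, and measurement is modelled as sampling the index $i$ with probability $|\alpha_i|^2$.

```python
import numpy as np


def basis_state(i, n):
    """Return |i> for an n-qubit register as a vector of length 2**n."""
    state = np.zeros(2**n, dtype=complex)
    state[i] = 1.0
    return state


def measure(state, shots, rng):
    """Sample `shots` measurement outcomes; returns {bit string: count}."""
    n = len(state).bit_length() - 1           # number of qubits
    probs = np.abs(state) ** 2                # |alpha_i|^2
    probs = probs / probs.sum()               # guard against round-off
    outcomes = rng.choice(len(state), size=shots, p=probs)
    counts = np.bincount(outcomes, minlength=len(state))
    return {format(i, f"0{n}b"): int(c) for i, c in enumerate(counts) if c}


rng = np.random.default_rng(seed=0)

# Fair coin: equal superposition of |0> and |1>.
coin = (basis_state(0, 1) + basis_state(1, 1)) / np.sqrt(2)
coin_counts = measure(coin, shots=10_000, rng=rng)

# A classical basis state always yields the same bit string: |3> = |0011>.
three_counts = measure(basis_state(3, 4), shots=10, rng=rng)

print(coin_counts)
print(three_counts)
```

A single run of `measure` with `shots=1` returns only one bit string, which is why the relative frequencies over many runs are needed to estimate the underlying probabilities.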


The $U_j$’s are themselves composed from small unitary matrices, the quantum gates, in Kronecker-product fashion, e.g., $U_1 = I \otimes H \otimes I \otimes X$ applies a NOT gate (‘$X$’) to the last qubit and a Hadamard gate (‘$H$’) to the first but one, leading to the state

$$|\Psi_{\mathrm{out}}\rangle = \frac{1}{\sqrt{2}} |0001\rangle + \frac{1}{\sqrt{2}} |0101\rangle.$$

The interested reader can find an extensive description of quantum gates in Nielsen and Chuang (2020) including multi-qubit gates that allow applying a particular quantum operation on one qubit depending on the state of another one. A problem that we are regularly observing when executing quantum circuits on real quantum computers is as follows: In the above example one would expect the



Fig. 1 Minimizing the weight of a truss in a 2-truss problem with discrete element sizes: probabilities of finding the respective options (encoded in bit strings) by combinatorial optimization (De Zoete 2021) using a quantum computer simulator (blue bars), probabilities when using a physical quantum device (red bars), both on the top axis; the value of the cost function corresponding to the respective option (stem plot) on the bottom axis

two states $|0001\rangle$ and $|0101\rangle$ to appear with equal probability 1/2 and all other $2^4 - 2 = 14$ states such as $|1111\rangle$ to appear not at all. However, the noise inherent to physical quantum devices today foils this theoretical result so that in practice a non-negligible amount of runs yields one of the invalid states. To support this claim with concrete numbers we have executed the above quantum circuit on IBM’s 5-qubit quantum computer Belem.¹ The two admissible states have been found in 46.7% and 41.1% of all cases, respectively, leaving 12.2% of all runs with an invalid outcome. What sounds like a tolerable false-positive rate can completely ruin the applicability of a quantum algorithm in practice. De Zoete (2021) presents a quantum-accelerated algorithm for minimizing the overall weight of a truss structure subject to the constraint that the member sizes must be chosen from a discrete set. This combinatorial optimization problem occurs in many practical applications ranging from large truss structures encountered in bridges and roof constructions to truss-like 3D printed lattice structures. It was found to be NP-hard (Yates et al. 1982), which makes an efficient quantum algorithm particularly attractive. Figure 1 shows the numerical results obtained for a 2-truss proof-of-concept implementation. Execution of the algorithm on a noiseless quantum computer simulator finds the global optimum (‘[100100]’ with minimal cost) in 50–60% of all runs (indicated by blue bars). Moreover, invalid options are not found at all. This is no longer the case when executed on a physical quantum device (Rigetti’s Aspen-9 system²) which results in invalid outcomes in more than 95% of all runs (indicated by red bars) rendering the algorithm unusable in practice.

Noise is not the only shortcoming of today’s quantum devices which are termed Noisy Intermediate-Scale Quantum (NISQ) computers for good reason. Another shortcoming is the small number of qubits and the sparse connectivity between them. Figure 2 shows the lattice topology of IBM’s 27-qubit quantum processor ‘Montreal’ as well as the various types of errors that occur during the application of single- and two-qubit gate operations (indicated by ‘H’ and ‘CNOT’, respectively) and reading out the result. Simply said, the deeper the circuit, that is, the longer the sequence of unitary matrices $U_j$ in (1), the more errors accumulate and the less accurate the results are. This effect is amplified by the fact that two-qubit operations can only be applied to pairs of adjacent qubits. To apply, say, a CNOT operation to the pair $(q_0, q_3)$ in Fig. 2, their information needs to be first swapped to $(q_1, q_2)$, which deepens the circuit even more and gives rise to extra errors in practice. Moreover, physical quantum devices only support a small number of native gates from which all software-visible gates need to be built. For example, Rigetti’s Aspen-9 system supports the two single-qubit rotation gates $R_x(k\pi/2)$, $k \in \mathbb{N}$, and $R_z(\theta)$, $\theta \in [0, 2\pi)$, and the two-qubit controlled-Pauli-Z gate. This can be compared to decomposing complex instructions in classical computing like fused-multiply-add into the natively supported instructions of a reduced instruction set computer. Finally, it should not be forgotten that current and near-future quantum computers are significantly slower than their classical counterparts both in I/O and computation.

¹ IBM offers access to some of their quantum computers free-of-charge via their public cloud service IBM Quantum available at

Fig. 2 IBM 27-qubit quantum processor ‘Montreal’: Lattice structure, readout errors of individual qubits and error rates of single- (‘H’) and two-qubit (‘CNOT’) gate operations
According to Troyer (2021), tomorrow’s quantum computers are expected to read data at 1 Gbit/s, whereas classical chips are already 10,000 times faster today.

² Rigetti offers paid access to their quantum computers via their Quantum Cloud Services (https:// and via Amazon Braket (



The discrepancy becomes even worse when looking at the speed at which operations are carried out. Optimistic scenarios expect quantum computers to be about ten orders of magnitude slower than their classical counterpart which leads Troyer to the conclusion that only quantum algorithms with super-quadratic speedup and high compute-to-data ratio are suitable to demonstrate a practical quantum advantage over classical computations.
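Looking back at the four-qubit example of this section, the ideal outcome and the effect of noise can be sketched with a small state-vector simulation. The code below builds $I \otimes H \otimes I \otimes X$ with NumPy's `kron`, applies it to the all-zero register and then applies a toy noise model. That noise model is purely an assumption for illustration: every measured bit is flipped independently with a hypothetical probability `P_FLIP`, whose value is arbitrary and not fitted to any device mentioned in the text. The helper names are likewise hypothetical.

```python
import numpy as np

I2 = np.eye(2)
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)   # Hadamard gate
X = np.array([[0.0, 1.0], [1.0, 0.0]])                 # NOT gate


def kron_all(*gates):
    """Kronecker product of the given gates, leftmost qubit first."""
    out = np.eye(1)
    for gate in gates:
        out = np.kron(out, gate)
    return out


n = 4
U1 = kron_all(I2, H, I2, X)
psi_in = np.zeros(2**n)
psi_in[0] = 1.0                        # freshly reset register
probs = np.abs(U1 @ psi_in) ** 2       # ideal measurement probabilities
ideal = {format(i, "04b"): float(p) for i, p in enumerate(probs) if p > 1e-12}


def with_readout_flips(probs, n, p_flip):
    """Toy noise model: each measured bit flips independently with p_flip."""
    noisy = np.zeros_like(probs)
    for i, p in enumerate(probs):
        for mask in range(2**n):
            k = bin(mask).count("1")   # number of flipped bits
            noisy[i ^ mask] += p * p_flip**k * (1 - p_flip) ** (n - k)
    return noisy


P_FLIP = 0.04                          # hypothetical value, for illustration only
noisy = with_readout_flips(probs, n, P_FLIP)
valid = [0b0001, 0b0101]
invalid_share = 1.0 - noisy[valid].sum()

print(ideal)
print(f"share of invalid outcomes in the toy model: {invalid_share:.3f}")
```

In this toy model a flip of the second bit maps one admissible state onto the other, so only flips of the remaining three bits produce invalid outcomes. Real devices also suffer from gate errors and decoherence, which this sketch does not attempt to capture.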

3 Quantum Algorithms for Scientific Computing About a decade after the seminal paper by Benioff (1980), which can be considered a first feasibility study of quantum computing, two of the most prominent quantum algorithms were published: Grover’s quantum search algorithm (Grover 1996) and Shor’s integer factorization algorithm (Shor 1994). The former promises only quadratic theoretical speedup over classical search algorithms and, based on Troyer’s analysis (Troyer 2021), might therefore not be of practical significance. A similar opinion is shared by Viamontes et al. (2005), who focus on identifying practical use cases for this textbook algorithm. In contrast, Shor’s algorithm is of practical relevance, as it promises to factorize integers into prime factors in polynomial time, which amounts to an almost exponential speedup over classical algorithms and could thereby put an end to public-key cryptography schemes such as the widely used RSA scheme. However, breaking RSA-2048 requires millions of qubits (Gidney and Ekerå 2021), which makes this algorithm impractical for near-term quantum devices with maybe a few hundred qubits. The reason for this excessive number of qubits is as follows: To mitigate the imperfection of qubits, a possible strategy towards fault-tolerant qubits is to combine multiple physical qubits into a logical qubit which is able to detect and correct errors (Egan et al. 2021). In this section, we shed some light on families of quantum algorithms that have potential applications in the field of scientific computing with a special focus on numerical linear algebra and the numerical solution of partial differential equations.
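Why a merely quadratic speedup may not pay off can be illustrated with a back-of-the-envelope calculation. The sketch below is a deliberately crude model and not Troyer's analysis itself: it assumes that a classical algorithm needs $N$ operations, that the quantum algorithm needs $N^{1/k}$ operations, and that each quantum operation is slower by a constant factor `slowdown`. The function name `break_even_size` and the model are assumptions made for this illustration; I/O, error correction and constant factors are ignored.

```python
def break_even_size(slowdown, k):
    """Problem size N at which N classical operations take as long as
    N**(1/k) quantum operations that are each `slowdown` times slower.

    Solving N = slowdown * N**(1/k) gives N = slowdown**(k / (k - 1)).
    """
    if k <= 1:
        raise ValueError("a polynomial speedup requires k > 1")
    return slowdown ** (k / (k - 1))


SLOWDOWN = 1e10   # ten orders of magnitude, the scenario quoted in the text

for k in (2, 3, 4):   # quadratic, cubic and quartic speedup
    print(f"k = {k}: break-even at N ~ {break_even_size(SLOWDOWN, k):.1e}")
```

Under these assumptions a quadratic speedup only starts to pay off at problem sizes that are the square of the slowdown factor, whereas higher-order speedups reach the break-even point at much smaller sizes.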

3.1 Quantum Linear Solver Algorithms

A quantum algorithm that falls into the same category of algorithms that cannot be implemented in the near term is the quantum linear solver by Harrow et al. (2009). For a Hermitian matrix $A \in \mathbb{C}^{N \times N}$ and a right-hand side vector $b \in \mathbb{C}^{N}$, the Harrow-Hassidim-Lloyd (HHL) algorithm returns the scalar quantity $\langle x | x \rangle_M = x^\dagger M x$ (an inner product induced by the linear operator $M$), with $x$ being the solution to the linear system of equations $Ax = b$. The HHL algorithm promises up to exponential speedup over the best classical algorithms, provided that the matrix $A$ is sparse and well conditioned.
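To make the input-output contract concrete, the following minimal sketch computes the same scalar classically with NumPy. It is only a reference value against which a quantum result could be compared, not an implementation of HHL; the function name `hhl_reference_output` and the 2 × 2 test system are hypothetical.

```python
import numpy as np


def hhl_reference_output(A, b, M) -> float:
    """Classically compute x^dagger M x, where x solves A x = b.

    Assumes A is Hermitian and invertible and M is Hermitian, so that the
    result is a real number.
    """
    A = np.asarray(A, dtype=complex)
    if not np.allclose(A, A.conj().T):
        raise ValueError("A must be Hermitian")
    x = np.linalg.solve(A, np.asarray(b, dtype=complex))
    # np.vdot conjugates its first argument, giving x^dagger (M x).
    return float(np.real(np.vdot(x, np.asarray(M) @ x)))


# Small, well-conditioned Hermitian example (condition number 3).
A = np.array([[2.0, -1.0], [-1.0, 2.0]])
b = np.array([1.0, 0.0])
M = np.eye(2)

value = hhl_reference_output(A, b, M)
print(value)  # x = (2/3, 1/3), so x^dagger x = 5/9
```

Note that only one number leaves the routine: the solution vector itself is never returned, which mirrors the restriction of the quantum algorithm.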

In subsequent publications, the theoretical complexity of the HHL algorithm and its dependency on the required precision have been improved (Ambainis 2010; Childs et al. 2017), and its restriction to sparse matrices has been relaxed (Wossnig et al. 2018), thereby paving the way for numerical approaches such as the boundary element method that give rise to dense system matrices. Besides algorithmic improvements, several HHL-based applications have been proposed in the literature, e.g., solution procedures for Ordinary Differential Equations (ODEs) (Berry 2014; Berry et al. 2017; Childs et al. 2020) and Partial Differential Equations (PDEs) (Childs et al. 2020; Clader et al. 2013; Linden et al. 2020; Leyton and Osborne 2008; Liu et al. 2021; Lloyd et al. 2020; Montanaro and Pallister 2016). Worked-out circuit designs for solving Poisson’s equation can be found in Cao et al. (2013) and Wang et al. (2020). However, despite a few experimental realizations of HHL-type algorithms for very small and hand-crafted linear systems (Barz et al. 2014; Cai et al. 2013; Lee et al. 2019; Pan et al. 2014; Perelshtein et al. 2021), the algorithm is expected to require millions of qubits before it might demonstrate a possible quantum advantage in practice. In our own experiments (Sigurdsson 2021) and in results published in the literature (Morrell Jr and Wong 2021), the algorithm’s susceptibility to noise degrades its fidelity to impractically small values even for 2 × 2 systems.

Montanaro and Pallister (2016) sketch an even more pessimistic future of HHL in the context of PDE analysis with the Finite Element Method (FEM) for elliptic PDE problems. According to their detailed end-to-end analysis, a potential advantage of HHL can only be achieved for very high-dimensional Boundary-Value Problems (BVPs) and those whose solutions have large high-order derivatives, even if all improvements (Ambainis 2010; Childs et al. 2017) of the original HHL algorithm (Harrow et al. 2009) and the SPAI-type preconditioning approach by Clader et al. (2013) are employed. “The maximal advantage of the quantum algorithm for the typical physically relevant PDE defined over 3 + 1 dimensions (three spatial and one temporal, such that d = 4) is approximately quadratic” (Montanaro and Pallister 2016) and is therefore, following Troyer’s taxonomy, of little practical relevance.

Furthermore, an often overlooked crux of the HHL algorithm is the fact that it does not return the full solution vector but condenses it into a single scalar quantity. In Möller and Vuik (2019), we sketch a conceptual framework for using this algorithm as a computational building block within a PDE-constrained optimization problem of the form

\[
\min_{\alpha} \; C_\alpha(x_\alpha) \tag{2}
\]
\[
\text{s.t.} \quad A_\alpha x_\alpha = b_\alpha, \tag{3}
\]

where $\alpha$ is an admissible set of design parameters (e.g., to specify the shape of the domain $\Omega_\alpha$), $A_\alpha$ and $b_\alpha$ are the system matrix and the right-hand side vector, respectively, resulting from the discretization of a BVP by, say, finite elements, and $C_\alpha(\cdot)$ is a cost function whose discrete counterpart can be expressed as $\langle \cdot | \cdot \rangle_M$. As an example, consider the task of optimizing the shape of the domain $\Omega_\alpha$ such that the $L^2$ error between the finite-element solution $y_\alpha$ to a BVP and a prescribed target profile $y_\alpha^*$ is minimized. This can be achieved by setting $x_\alpha := y_\alpha - y_\alpha^*$ and $b_\alpha := b_\alpha - A_\alpha y_\alpha^*$ and letting $M_\alpha$ be the consistent mass matrix, so that

\[
\langle x_\alpha | x_\alpha \rangle_{M_\alpha} = x_\alpha^\dagger M_\alpha x_\alpha = \int_{\Omega_\alpha} (y_\alpha - y_\alpha^*)(y_\alpha - y_\alpha^*) \, \mathrm{d}x.
\]
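The identity between the mass-matrix-induced inner product and the squared L² error can be checked classically on a small example. The sketch below assumes real-valued, piecewise-linear (P1) finite elements on a uniform 1D mesh of the unit interval; the helper `p1_mass_matrix` and the chosen profiles are hypothetical and serve only as an illustration.

```python
import numpy as np


def p1_mass_matrix(n_elements: int, length: float = 1.0) -> np.ndarray:
    """Consistent mass matrix for P1 elements on a uniform 1D mesh."""
    h = length / n_elements
    M = np.zeros((n_elements + 1, n_elements + 1))
    # Exact element mass matrix of two linear hat functions on one element.
    local = (h / 6.0) * np.array([[2.0, 1.0], [1.0, 2.0]])
    for e in range(n_elements):
        M[e:e + 2, e:e + 2] += local
    return M


n_elements = 8
nodes = np.linspace(0.0, 1.0, n_elements + 1)
M = p1_mass_matrix(n_elements)

y = 2.0 * nodes      # nodal values of a (made-up) finite-element solution
y_target = nodes     # nodal values of a (made-up) target profile
x = y - y_target     # difference vector, here the function x -> x

# x^T M x equals the integral of (y - y_target)^2 over [0, 1], i.e. 1/3.
l2_error_squared = float(x @ M @ x)
print(l2_error_squared)
```

Because the difference of the two profiles is itself piecewise linear, the quadratic form reproduces the integral exactly rather than approximately.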



Another example, in which the domain remains unchanged, is the task of finding boundary conditions and/or a suitable right-hand side vector $b$ that minimize the value of the finite element solution in some user-defined norm induced by the matrix $M$. Further applications of HHL-type algorithms are expected in Quantum Machine Learning (QML) (Zhao et al. 2019) and quantum-accelerated principal component analysis (Lloyd et al. 2014). Some of HHL’s caveats can be overcome easily. For instance, the restriction to Hermitian matrices can be lifted by considering the extended system

\[
\begin{pmatrix} 0 & A \\ A^\dagger & 0 \end{pmatrix}
\begin{pmatrix} 0 \\ x \end{pmatrix}
=
\begin{pmatrix} b \\ 0 \end{pmatrix}.
\]

Others, like the need for a so-called Quantum Random Access Memory (QRAM) (Giovannetti et al. 2008), require substantial advancements in quantum technology. We would like to refer the interested reader to the famous commentary ‘Read the fine print’ by Aaronson (2015).
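The Hermitian extension can be verified numerically with a few lines of NumPy. This is a classical sanity check under the assumption that $A$ is square and invertible; the function name `hermitian_embedding` and the test matrix are hypothetical.

```python
import numpy as np


def hermitian_embedding(A, b):
    """Build the extended Hermitian system [[0, A], [A^dagger, 0]] (0, x)^T = (b, 0)^T."""
    A = np.asarray(A, dtype=complex)
    b = np.asarray(b, dtype=complex)
    n = A.shape[0]
    zeros = np.zeros((n, n), dtype=complex)
    A_ext = np.block([[zeros, A], [A.conj().T, zeros]])
    b_ext = np.concatenate([b, np.zeros(n, dtype=complex)])
    return A_ext, b_ext


# A non-Hermitian but invertible example matrix.
A = np.array([[1.0, 2.0], [0.0, 1.0]])
b = np.array([3.0, 1.0])

A_ext, b_ext = hermitian_embedding(A, b)
z = np.linalg.solve(A_ext, b_ext)

# The upper half of z vanishes, the lower half carries the solution of A x = b.
x = z[2:]
print(np.real(x))
```

The price of the embedding is a system of twice the size, which on a quantum device translates into one additional qubit.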

3.2 Hybrid Quantum-Classical Algorithms

A recent trend in quantum algorithm development is Hybrid Quantum-Classical Algorithms (HQCAs), which are more suitable for NISQ devices as they split the work between a classical and a quantum computer. This also makes HQCAs more attractive for the scientific computing community, whose members are typically experienced in classical computing and can transition to quantum computing step by step. The core idea of so-called Variational Quantum Algorithms (VQAs) is to express the problem at hand as a (global) optimization problem, whereby the optimization loop runs on a classical computer that repeatedly executes a parametrized quantum circuit $U(\theta) |\psi_{\mathrm{in}}\rangle \equiv |\psi(\theta)\rangle$ to evaluate the cost function, i.e.

\[
\min_{\theta} \; C(\theta), \qquad C(\theta) = \langle \psi(\theta) | O | \psi(\theta) \rangle \tag{4}
\]
with some observable $O$, a Hermitian operator. The sought optimum of the problem is either the value of the cost function at the state of minimal energy or can be calculated from the set of optimal parameters $\theta_{\mathrm{opt}}$ by classical computations. In fact, the quantum-assisted PDE-constrained optimization problem (2)–(3) can be interpreted as a VQA with a full-fledged quantum algorithm executed in each evaluation of the cost function. However, the main motivation for resorting to hybrid algorithms is to lower the demands on the quantum computer side by crafting shallow circuits with only a few qubits that can actually be executed on NISQ devices.

One of the first algorithms of this type is the Variational Quantum Eigensolver (VQE) (Peruzzo et al. 2014), which computes the minimal eigenvalue of a Hermitian matrix $H$ by minimizing the cost function $C(\theta) = \langle \psi(\theta) | H | \psi(\theta) \rangle$. Another prominent VQA is the Quantum Approximate Optimization Algorithm (QAOA) introduced by Farhi et al. (2014) and its recent generalization by Hadfield et al. (2019), the Quantum Alternating Operator Ansatz (QAOA). The main difference lies in the design of the parametrized circuit $U(\theta)$, which is more ‘universal’ in the latter case. Despite their youth, QAOA-type algorithms have been extensively studied in the literature for hard combinatorial optimization problems such as the maximum cut problem, the constraint satisfiability problem, and the travelling salesman problem. Furthermore, a few publications even address preliminary applications such as battery revenue optimization (de la Grand’rive and Hullo 2019) and truss-weight minimization subject to discrete element-size choices (De Zoete 2021).

Another VQA of possible interest for the scientific computing community is the Variational Quantum Linear Solver (VQLS), which was independently developed by Bravo-Prieto et al. (2019) and Xu et al. (2019). While the aforementioned publications demonstrate the applicability of VQLS to hand-crafted matrices that are particularly suited for the algorithm, Liu et al. (2021) present an efficient realization of VQLS for Poisson’s equation discretized by the finite difference method. The main challenge consists in finding an efficient representation of the system matrix as a linear combination of unitary matrices that can be implemented on the quantum device. This is, however, not the only challenge of VQLS and VQAs in general.
Firstly, they often lack a rigorous complexity analysis, so that it is not clear for which types of problems they deliver a quantum advantage, if at all. Guerreschi and Matsuura (2019), for instance, expect several hundred qubits to be required before QAOA-type algorithms will show a speedup for representative combinatorial problems. Huang et al. (2019) propose special-purpose near-term variational quantum-classical algorithms for certain types of linear systems of equations that are expected to become beneficial once quantum devices with 50–70 high-quality qubits become available. Secondly, the performance of VQAs depends on many tuning parameters (e.g., the choice of the optimizer, the cost function, the ansatz of the parametrized quantum circuit, the evaluation of gradients in a gradient-based optimization approach, etc.). In their recent article, Cerezo et al. (2021) address most of the above issues and suggest several alternative strategies for each of them. One of the more severe problems that has received much attention in the literature is the occurrence of barren plateaus. In their seminal paper, McClean et al. (2018) “show that for a wide class of reasonable parameterized quantum circuits, the probability that the gradient along any reasonable direction is non-zero to some fixed precision is exponentially small as a function of the number of qubits”. Consequently, minimizing (4) becomes exponentially expensive, thereby ruling out any potential quantum speedup. This is not only a problem for gradient-based optimizers but likewise affects gradient-free approaches (Arrasmith et al. 2021).
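The variational loop described in this section can be sketched for a single qubit with a classical state-vector simulation. The example below is a toy illustration, not the implementation used in any of the cited works: the Hamiltonian $H = Z + 0.5X$, the one-parameter $R_y$ ansatz, the step size, and all function names are assumptions, and the gradient is obtained with the parameter-shift rule for a Pauli rotation.

```python
import numpy as np

# Pauli matrices and a toy Hamiltonian (hypothetical example).
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])
H = Z + 0.5 * X


def ansatz_state(theta: float) -> np.ndarray:
    """|psi(theta)> = R_y(theta)|0> for a single qubit."""
    return np.array([np.cos(theta / 2.0), np.sin(theta / 2.0)])


def cost(theta: float) -> float:
    """C(theta) = <psi(theta)|H|psi(theta)>.

    Computed exactly here; on hardware it would be estimated from
    repeated measurements.
    """
    psi = ansatz_state(theta)
    return float(psi @ H @ psi)


def parameter_shift_gradient(theta: float) -> float:
    """Derivative of C from two evaluations at theta +/- pi/2 (valid for R_y)."""
    return 0.5 * (cost(theta + np.pi / 2.0) - cost(theta - np.pi / 2.0))


def run_vqe(theta0: float = 0.1, learning_rate: float = 0.4, steps: int = 200):
    """Classical outer loop: plain gradient descent on the circuit parameter."""
    theta = theta0
    for _ in range(steps):
        theta -= learning_rate * parameter_shift_gradient(theta)
    return theta, cost(theta)


theta_opt, energy = run_vqe()
exact_ground_energy = float(np.linalg.eigvalsh(H)[0])
print(energy, exact_ground_energy)
```

For this real-valued Hamiltonian the $R_y$ ansatz can reach the exact ground state, so the optimized cost agrees with the smallest eigenvalue. With many qubits and deep circuits, the gradient magnitudes may shrink as described above, which this one-qubit toy cannot exhibit.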



3.3 Quantum Algorithms for Partial Differential Equations

The list of quantum algorithms given in Sects. 3.1 and 3.2 is by no means complete. Pesah (2021) suggests a coherent framework to classify quantum algorithms for solving PDEs based on their mapping to one of two quantum primitives: Hamiltonian simulation and quantum linear solvers. According to his work, a promising alternative to ‘solving’ the discretized problem with the aid of HHL or VQLS is to ‘solve’ it by means of Hamiltonian simulation. This approach is not applicable to all types of PDEs but only to those that admit a transformation to Schrödinger’s equation, cf. Dodin and Startsev (2021) and Engel et al. (2019). Like Montanaro and Pallister (2016) for the HHL algorithm, Pesah dampens the hope for an easily achievable exponential speedup for this alternative approach if the spatial dimension is kept fixed.

We would like to complete this paper by referencing a few unconventional approaches towards solving PDE problems with the aid of quantum computing. Todorova and Steijl (2020) suggest a quantum algorithm for fluid-flow simulation based on the collisionless Boltzmann equation. A hand-crafted quantum Navier-Stokes algorithm was proposed recently by Gaitan (2020), which combines the classical method-of-lines approach with Kacewicz’s quantum ODE solver (Kacewicz 2006). Lubasch et al. (2020) developed a variational approach to solving nonlinear PDEs with the help of tensor networks and applied it to the viscous Burgers equation. Another variational quantum PDE solver for specific applications from the field of applied physics is presented in García-Molina et al. (2021). Knudsen and Mendl (2020) present a Quantum Neural Network (QNN) approach for continuous-variable quantum computers, which is a topic of its own that goes beyond the scope of this paper and shall not be discussed any further. A very recent contribution in this direction is the seminal paper by Kyriienko et al. (2021) that combines the ideas of physics-informed neural networks (Raissi et al. 2019) with quantum neural networks. In essence, a QNN is trained to predict the solution of a (non-)linear differential equation, whereby derivatives are computed by means of the parameter shift rule (Schuld et al. 2019), which amounts to calculating exact gradients of the parametric quantum circuit from its evaluation at specifically shifted parameter settings (Kyriienko and Elfving 2021). As recent years have brought a vast number of publications in the field of QML, more approaches of this type can be expected in the future. The interested reader is referred to the review paper by Mishra et al. (2021).

Last but not least, the works by Zanger et al. (2021), Raisuddin and De (2022), and Srivastava and Sundararaghavan (2019) should be mentioned as some of the few publications on using Quantum Annealing (QA), a different paradigm of quantum computing, in the context of solving differential equations. Quantum annealers are often brushed aside as not truly quantum computers. Admittedly, quantum annealers are not as universal as gate-based quantum computers, which, at least theoretically, are much more flexible in algorithm development. Quantum annealers are restricted to solving Quadratic Unconstrained Binary Optimization (QUBO) problems. The art of algorithm design lies in casting the problem at hand into this QUBO form or an equivalent Ising model formulation. In contrast to gate-based NISQ computers, quantum annealers have reached a technology readiness level that allows solving real-world scientific problems of practical interest already today, which should be reason enough to pay more attention to this technology.
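The equivalence between the QUBO form and the Ising model mentioned above follows from the substitution $x_i = (1 + s_i)/2$ with $s_i \in \{-1, +1\}$. The following sketch works this out for a tiny, made-up QUBO matrix and checks the result by exhaustive enumeration, which stands in for the annealer here. All function names are hypothetical, and the sign and scaling conventions are one common choice among several.

```python
import itertools

import numpy as np


def qubo_to_ising(Q):
    """Map min x^T Q x, x in {0,1}^n, to s^T J s + h^T s + offset, s in {-1,+1}^n."""
    Q = np.asarray(Q, dtype=float)
    J = Q / 4.0
    np.fill_diagonal(J, 0.0)  # s_i * s_i = 1, so diagonal terms go into the offset
    h = (Q.sum(axis=0) + Q.sum(axis=1)) / 4.0
    offset = (Q.sum() + np.trace(Q)) / 4.0
    return J, h, offset


def qubo_value(Q, x) -> float:
    x = np.asarray(x, dtype=float)
    return float(x @ np.asarray(Q, dtype=float) @ x)


def ising_energy(J, h, offset, s) -> float:
    s = np.asarray(s, dtype=float)
    return float(s @ J @ s + h @ s + offset)


def brute_force_qubo(Q):
    """Exhaustive search over all bit strings; only feasible for tiny n."""
    n = len(Q)
    return min(itertools.product((0, 1), repeat=n), key=lambda x: qubo_value(Q, x))


# Toy QUBO: reward each selected bit, penalize selecting neighbouring bits.
Q = np.array([[-1.0, 2.0, 0.0],
              [0.0, -1.0, 2.0],
              [0.0, 0.0, -1.0]])

J, h, offset = qubo_to_ising(Q)
best_x = brute_force_qubo(Q)
best_s = tuple(2 * xi - 1 for xi in best_x)
print(best_x, qubo_value(Q, best_x), ising_energy(J, h, offset, best_s))
```

Both formulations assign the same value to every configuration, so a minimizer of one is a minimizer of the other.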

4 Opportunities for Scientific Quantum Computing

While the first generation of quantum algorithms was mostly developed by theorists with a focus on asymptotic theoretical performance, the second generation has brought a variety of potentially applicable hybrid algorithms, whose correct functioning can be explored on quantum computer simulators and real-world NISQ devices. This ongoing shift towards practical quantum computing is attracting researchers and practitioners from the scientific computing community who can help solve the major challenges of VQAs with their expertise in classical optimization techniques and explore novel application fields for quantum computing. Furthermore, the two disciplines Scientific Machine Learning (SciML) and Quantum Machine Learning (QML) can largely support each other’s advancement. On the one hand, there is some evidence that parametric quantum circuits might have higher expressive power than classical neural networks (Du et al. 2020), which might enable new SciML applications. On the other hand, classical machine-learning techniques can also help to solve problems encountered in quantum computing such as the generation of resource-efficient executable quantum circuits (Adarsh and Möller 2021; Khatri et al. 2019). Last but not least, alternative quantum-computing technologies like quantum annealing should be taken more seriously as they might help practitioners to transition into the field of quantum scientific computing in the near term.


BVP: Boundary-Value Problems
FEM: Finite Element Method
GPGPU: General-Purpose Graphics Processing Units
GPU: Graphics Processing Units
HHL: Algorithm designed by Aram Harrow, Avinatan Hassidim, and Seth Lloyd
HQCA: Hybrid Quantum-Classical Algorithms
NISQ: Noisy Intermediate-Scale Quantum
ODE: Ordinary Differential Equation
PDE: Partial Differential Equation
QA: Quantum Annealing
QAOA: Quantum Approximate Optimization Algorithm, its recent generalization Quantum Alternating Operator Ansatz
QC: Quantum Computing
QML: Quantum Machine Learning
QNN: Quantum Neural Network
QRAM: Quantum Random Access Memory
QUBO: Quadratic Unconstrained Binary Optimization
SciML: Scientific Machine Learning
VQA: Variational Quantum Algorithm
VQE: Variational Quantum Eigensolver
VQLS: Variational Quantum Linear Solver

Acknowledgements This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. The author would moreover like to thank J. de Zoete for providing code and results for the truss-weight minimization problem.

References

Aaronson S (2015) Read the fine print. Nat Phys 11:291–293.
Adarsh S, Möller M (2021) Resource optimal executable quantum circuit generation using approximate computing. In: 2021 IEEE international conference on quantum computing and engineering (QCE). IEEE, Los Alamitos, CA, pp 225–231.
Ambainis A (2010) Variable time amplitude amplification and a faster quantum algorithm for solving systems of linear equations. arXiv:1010.4458
Arrasmith A, Cerezo M, Czarnik P, Cincio L, Coles PJ (2021) Effect of barren plateaus on gradient-free optimization. Quantum 5:558.
Barz S, Kassal I, Ringbauer M, Lipp YO, Dakić B, Aspuru-Guzik A, Walther P (2014) A two-qubit photonic quantum processor and its application to solving systems of linear equations. Sci Rep 4:6115.
Benioff P (1980) The computer as a physical system: a microscopic quantum mechanical Hamiltonian model of computers as represented by Turing machines. J Stat Phys 22:563–591.
Berry DW (2014) High-order quantum algorithm for solving linear differential equations. J Phys A: Math Theor 47(10):105301.
Berry DW, Childs AM, Ostrander A, Wang G (2017) Quantum algorithm for linear differential equations with exponentially improved dependence on precision. Commun Math Phys 356:1057–1081.
Bravo-Prieto C, LaRose R, Cerezo M, Subasi Y, Cincio L, Coles PJ (2019) Variational quantum linear solver. arXiv:1909.05820
Cai X-D, Weedbrook C, Su Z-E, Chen M-C, Gu M, Zhu M-J, Li L, Liu N-L, Lu C-Y, Pan J-W (2013) Experimental quantum computing to solve systems of linear equations. Phys Rev Lett 110:230501.
Cao Y, Papageorgiou A, Petras I, Traub J, Kais S (2013) Quantum algorithm and circuit design solving the Poisson equation. New J Phys 15:013021.
Cecka C, Lew AJ, Darve E (2011) Assembly of finite element methods on graphics processors. Int J Numer Methods Eng 85(5):640–669.
Cerezo M, Arrasmith A, Babbush R, Benjamin SC, Endo S, Fujii K, McClean JR, Mitarai K, Yuan X, Cincio L, Coles PJ (2021) Variational quantum algorithms. arXiv:2012.09265



Childs AM, Kothari R, Somma RD (2017) Quantum algorithm for systems of linear equations with exponentially improved dependence on precision. SIAM J Comput 46(6). https://doi.org/10.1137/16M1087072
Childs AM, Liu J-P (2020) Quantum spectral methods for differential equations. Commun Math Phys 375:1427–1457.
Childs AM, Liu J-P, Ostrander A (2020) High-precision quantum algorithms for partial differential equations. arXiv:2002.07868
Clader BD, Jacobs BC, Sprouse CR (2013) Preconditioned quantum linear system algorithm. Phys Rev Lett 110:250504.
de la Grand’rive PD, Hullo J-F (2019) Knapsack problem variants of QAOA for battery revenue optimisation. arXiv:1908.02210
De Zoete J (2021) A practical quantum algorithm for solving structural optimization problems. Master’s thesis, Delft University of Technology.
Dodin IY, Startsev EA (2021) On applications of quantum computing to plasma simulations. Phys Plasmas 28:092101.
Du Y, Hsieh M-H, Liu T, Tao D (2020) Expressive power of parametrized quantum circuits. Phys Rev Res 2(3):033125.
Egan L, Debroy DM, Noel C, Risinger A, Zhu D, Biswas D, Newman M, Li M, Brown KR, Cetina M, Monroe C (2021) Fault-tolerant control of an error-corrected qubit. Nature 598:281–286.
Engel A, Smith G, Parker SE (2019) Quantum algorithm for the Vlasov equation. Phys Rev A 100:062315.
European Commission will launch €1 billion quantum technologies flagship. https://www.h2020.md/en/. Cited 15 Oct 2021
Farhi E, Goldstone J, Gutmann S (2014) A quantum approximate optimization algorithm. arXiv:1411.4028
Gaitan F (2020) Finding flows of a Navier-Stokes fluid through quantum computing. NPJ Quantum Inf 6:61.
García-Molina P, Rodríguez-Mediavilla J, García-Ripoll JJ (2021) Quantum Fourier analysis for multivariate functions and applications to a class of Schrödinger-type partial differential equations. arXiv:2104.02668
Gidney C, Ekerå M (2021) How to factor 2048 bit RSA integers in 8 h using 20 million noisy qubits. Quantum 5:433.
Giovannetti V, Lloyd S, Maccone L (2008) Quantum random access memory. Phys Rev Lett 100(16):160501.
Göddeke D, Strzodka R, Turek S (2005) Accelerating double precision FEM simulations with GPUs. In: Proceedings of ASIM 2005, 18th symposium on simulation technique
Grover LK (1996) A fast quantum mechanical algorithm for database search. In: STOC ’96: Proceedings of the twenty-eighth annual ACM symposium on theory of computing. ACM, New York, pp 212–219.
Guerreschi GG, Matsuura AY (2019) QAOA for Max-Cut requires hundreds of qubits for quantum speed-up. Sci Rep 9:6903.
Hadfield S, Wang Z, O’Gorman B, Rieffel EG, Venturelli D, Biswas R (2019) From the quantum approximate optimization algorithm to a quantum alternating operator ansatz. Algorithms 12(2):34.
Harrow AW, Hassidim A, Lloyd S (2009) Quantum algorithm for linear systems of equations. Phys Rev Lett 103(15):150502.
Huang H-Y, Bharti K, Rebentrost P (2019) Near-term quantum algorithms for linear systems of equations. arXiv:1909.07344
Kacewicz B (2006) Almost optimal solution of initial-value problems by randomized and quantum algorithms. J Complex 22(5):676–690.



Khatri S, LaRose R, Poremba A, Cincio L, Sornborger AT, Coles PJ (2019) Quantum-assisted quantum compiling. Quantum 3:140.
Knudsen M, Mendl CB (2020) Solving differential equations via continuous-variable quantum computers. arXiv:2012.12220
Komatitsch D, Michéa D, Erlebacher G (2009) Porting a high-order finite-element earthquake modeling application to NVIDIA graphics cards using CUDA. J Parallel Distr Comput 69(5):451–460.
Kyriienko O, Elfving VE (2021) Generalized quantum circuit differentiation rules. Phys Rev A 104(5):052417
Kyriienko O, Paine AE, Elfving VE (2021) Solving nonlinear differential equations with differentiable quantum circuits. Phys Rev A 103(5):052416.
Lee Y, Joo J, Lee S (2019) Hybrid quantum linear equation algorithm and its experimental test on IBM quantum experience. Sci Rep 9:4778.
Leyton SK, Osborne TJ (2008) A quantum algorithm to solve nonlinear differential equations. arXiv:0812.4423
Linden N, Montanaro A, Shao C (2020) Quantum versus classical algorithms for solving the heat equation. arXiv:2004.06516
Liu H-L, Wu Y-S, Wan L-C, Pan S-J, Qin S-J, Gao F, Wen Q-Y (2021) Variational quantum algorithm for the Poisson equation. Phys Rev A 104(2):022418.
Liu J-P, Kolden HØ, Krovi HK, Loureiro NF, Trivisa K, Childs AM (2021) Efficient quantum algorithm for dissipative nonlinear differential equations. PNAS 118(35):e2026805118.
Lloyd S, De Palma G, Gokler C, Kiani B, Liu Z-W, Marvian M, Tennie F, Palmer T (2020) Quantum algorithm for nonlinear differential equations. arXiv:2011.06571
Lloyd S, Mohseni M, Rebentrost P (2014) Quantum principal component analysis. Nat Phys 10:631–633.
Lubasch M, Joo J, Moinier P, Kiffner M, Jaksch D (2020) Variational quantum algorithms for nonlinear problems. Phys Rev A 101(1):010301.
Mackerle J (1996) Implementing finite element methods on supercomputers, workstations and PCs: a bibliography (1985–1995). Eng Comput 13(1):33–85.
McClean JR, Boixo S, Smelyanskiy VN, Babbush R, Neven H (2018) Barren plateaus in quantum neural network training landscapes. Nat Commun 9:4812.
Mishra N, Kapil M, Rakesh H, Anand A, Mishra N, Warke A, Sarkar S, Dutta S, Gupta S, Prasad Dash A, Gharat R, Chatterjee Y, Roy S, Raj S, Kumar Jain V, Bagaria S, Chaudhary S, Singh V, Maji R, Dalei P, Behera BK, Mukhopadhyay S, Panigrahi PK (2021) Quantum machine learning: a review and current status. In: Data management, analytics and innovation. Springer, Berlin, pp 101–145.
Möller M, Vuik C (2019) A conceptual framework for quantum accelerated automated design optimization. Microprocess Microsyst 66:67–71.
Montanaro A, Pallister S (2016) Quantum algorithms and the finite element method. Phys Rev A 93(3):032324.
Morrell Jr HJ, Wong HY (2021) Step-by-step HHL algorithm walkthrough to enhance the understanding of critical quantum computing concepts. arXiv:2108.09004
Nielsen MA, Chuang IL (2010) Quantum computation and quantum information, 2nd edn. Cambridge University Press, Cambridge
Pan J, Cao Y, Yao X, Li Z, Ju C, Chen H, Peng X, Kais S, Du J (2014) Experimental realization of quantum algorithm for solving linear systems of equations. Phys Rev A 89(2):022313.



Perelshtein MR, Pakhomchik AI, Melnikov AA, Novikov AA, Glatz A, Paraoanu GS, Vinokur VM, Lesovik GB (2021) Large-scale quantum hybrid solution for linear systems of equations. arXiv:2003.12770
Peruzzo A, McClean J, Shadbolt P, Yung M-H, Zhou X-Q, Love PJ, Aspuru-Guzik A, O’Brien JL (2014) A variational eigenvalue solver on a photonic quantum processor. Nat Commun 5:4213.
Pesah A (2021) Quantum algorithms for solving partial differential equations. MRES case study report.
Raissi M, Perdikaris P, Karniadakis GE (2019) Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J Comput Phys 378:686–707.
Raisuddin OM, De S (2022) FEqa: finite element computations on quantum annealers. Comput Methods Appl Mech Eng 395:115014
Schuld M, Bergholm V, Gogolin C, Izaac J, Killoran N (2019) Evaluating analytic gradients on quantum hardware. Phys Rev A 99(3):032331.
Shor PW (1994) Algorithms for quantum computation: discrete logarithms and factoring. In: Proceedings 35th annual symposium on foundations of computer science. IEEE, pp 124–134.
Sigurdsson S (2021) Implementations of quantum algorithms for solving linear systems. Master’s thesis, Delft University of Technology.
Srivastava S, Sundararaghavan V (2019) Box algorithm for the solution of differential equations on a quantum annealer. Phys Rev A 99(5):052355
Storaasli OO, Peebles SW, Crockett TW, Knott JD, Adams L (1982) The finite element machine: an experiment in parallel processing. NASA Technical Memorandum 84514. Available via NTRS
Todorova BN, Steijl R (2020) Quantum algorithm for the collisionless Boltzmann equation. J Comput Phys 409:109347.
Troyer M (2021) Disentangling hype from reality: achieving practical quantum advantage. Video of the presentation at Q2B practical quantum computing, 8–10 Dec 2020. https://connect-world.com/disentangling-hype-from-reality-achieving-practical-quantum-advantage/
Viamontes GF, Markov IL, Hayes JP (2005) Is quantum search practical? Comput Sci Eng 7(3):62–70.
Wang S, Wang Z, Li W, Fan L, Wei Z, Gu Y (2020) Quantum fast Poisson solver: the algorithm and complete and modular circuit design. Quantum Inf Process 19:170.
Wossnig L, Zhao Z, Prakash A (2018) Quantum linear system algorithm for dense matrices. Phys Rev Lett 120(5):050502.
Xu X, Sun J, Endo S, Li Y, Benjamin SC, Yuan X (2019) Variational algorithms for linear algebra. arXiv:1909.03898
Yates DF, Templeman AB, Boffey TB (1982) The complexity of procedures for determining minimum weight trusses with discrete member sizes. Int J Solids Struct 18(6):487–495. https://doi.org/10.1016/0020-7683(82)90065-8
Zanger B, Mendl CB, Schulz M, Schreiber M (2021) Quantum algorithms for solving ordinary differential equations via classical integration methods. Quantum 5:502.
Zhao Z, Pozas-Kerstjens A, Rebentrost P, Wittek P (2019) Bayesian deep learning on a quantum computer. Quantum Mach Intell 1:41–51.

Quantum Computing at IQM

Hermanni Heimonen, Adrian Auer, Ville Bergholm, Inés de Vega, and Mikko Möttönen

Abstract Classical information processing has dramatically changed the lives of almost everyone over the last century. Now we are on the verge of a new information processing revolution, quantum computing. In this chapter, we present an overview of the main differences between classical and quantum computing, followed by a discussion of the foundational concepts of quantum computing and an introduction to quantum algorithms. We then briefly report on the latest developments in the field, focusing on the software-hardware co-design approach currently pursued at IQM. Finally, we provide an insight into the future developments and opportunities of quantum computing.

Keywords Quantum computing · Information processing revolution · Quantum algorithm

H. Heimonen (✉) · V. Bergholm · M. Möttönen: IQM, Keilaranta 19, 02150 Espoo, Finland
A. Auer · I. de Vega: IQM, Nymphenburgerstr. 86, 80636 Munich, Germany
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053

1 Introduction to Quantum Technology

The 20th century was miraculous in introducing humankind to the new and exciting microscopic world of quantum physics. The 21st century will perhaps be equally amazing in demonstrating the use of quantum systems, controlled at the single-excitation level, in various technological applications (Acín et al. 2018). The emerging quantum industry highlights the potential impact of new scientific discoveries in this field of research and is showing a technological pathway to solving some of the hardest computational problems we are facing.

The first leaps in the understanding of quantum physics are largely thanks to quantum optics (Haroche and Raimond 2006), where the non-linearity of natural atoms together with lasers driving the atoms has been used to observe a plethora of remarkable phenomena such as photon bunching and anti-bunching (Haroche and Raimond 2006), electromagnetically induced transparency (Fleischhauer et al. 2005), and multiphoton entanglement (Pan et al. 2012). Quantum key distribution (Bennett and Brassard 1984), which promises eavesdropper-free communications over long distances, is a triumph of quantum optics and the first widespread commercial quantum-technological application. It is currently the backbone of quantum communications that is being pursued to evolve into a quantum internet (Kimble 2008).

In addition to quantum communications, the applications of quantum technology include quantum sensors (Degen et al. 2017), simulating other quantum systems (Cirac and Zoller 2012; Buluta and Nori 2009; Lloyd 1996; Feynman 1982), and quantum computing (Ladd et al. 2010). Quantum sensors are studied to provide, for example, more accurate atomic clocks, magnetic field sensors, and quantum radar. In quantum simulations, controlled quantum systems such as ultracold atoms (Gross and Bloch 2017) or trapped ions (Blatt and Roos 2012) are used to simulate the physics of other quantum systems that cannot be accessed as conveniently. Connecting these ideas to computation, digital quantum simulation (Somaroo et al. 1999) has emerged, in which a sequence of discrete quantum logic operations on individual parts of the whole system approximates the evolution of the simulated quantum system, thus bringing quantum simulators and computers closer to each other.

2 Overview of Quantum Computing

A basic requirement for computational devices is the ability to store information in a physical system, such as in vacuum tubes or transistors, and manipulate the information using logic gates. Decades of technological development have brought us to the point where we can control and manipulate quantum systems to a high degree of accuracy, enabling us to encode information in individual quantum states and to expand the computational tools from classical Boolean logic to quantum logic (Nielsen and Chuang 2002). Quantum logical operations allow the system to reach states that exhibit so-called superposition and entanglement between quantum bits, also known as qubits. These properties can be considered resources that quantum logic introduces over classical logic. They enable the solution of certain classes of problems using quantum algorithms that have a lower complexity than their best classical counterparts.
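Superposition and entanglement can be illustrated with a two-qubit state-vector calculation. The sketch below is a classical NumPy simulation, not code for any particular quantum device or software stack: a Hadamard gate puts the first qubit into a superposition, and a CNOT gate then entangles it with the second qubit. The Schmidt rank, obtained from a singular value decomposition, is used here as one possible entanglement check; the helper name `schmidt_rank` is hypothetical.

```python
import numpy as np

# Single- and two-qubit gates in the computational basis |q0 q1>.
HADAMARD = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
IDENTITY = np.eye(2)
CNOT = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0, 0.0]])  # control: first qubit, target: second qubit

ket00 = np.array([1.0, 0.0, 0.0, 0.0])

# Superposition on the first qubit, then entanglement via CNOT.
superposed = np.kron(HADAMARD, IDENTITY) @ ket00   # (|00> + |10>) / sqrt(2)
bell = CNOT @ superposed                           # (|00> + |11>) / sqrt(2)


def schmidt_rank(state: np.ndarray, tol: float = 1e-12) -> int:
    """Number of non-zero Schmidt coefficients of a two-qubit pure state.

    Rank 1 means a product state, rank 2 means the qubits are entangled.
    """
    singular_values = np.linalg.svd(state.reshape(2, 2), compute_uv=False)
    return int(np.sum(singular_values > tol))


print(bell, schmidt_rank(superposed), schmidt_rank(bell))
```

The intermediate state is a superposition but still a product state, whereas the final Bell state cannot be written as a product of two single-qubit states.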

2.1 History of Quantum Computing

Quantum computing (Nielsen and Chuang 2000) has its roots in the early ideas of quantum simulation (Feynman 1982). It witnessed a theoretical breakthrough about two and a half decades ago when the first quantum algorithms were found for

integer factorization (Shor 1994, 1997) and database search (Grover 1996), both providing a time complexity superior to the best-known classical algorithms. Since then, many quantum algorithms have been developed, such as those for quantum chemistry (McArdle et al. 2020), linear systems of equations (Harrow et al. 2009), and algebraic problems (Childs and van Dam 2010). The early theoretical discoveries called for a grand experimental mission to build a quantum computer capable of implementing quantum logical operations to a high degree of accuracy. This journey started from the demonstration of single qubits and study of many fundamental physics questions, before scaling to multi-qubit quantum processors. There has been great success in building these controllable quantum systems based on many different physical platforms such as nuclear magnetic resonance (Vandersypen et al. 2001), optical photons (Kok et al. 2007; Knill et al. 2001), color centers in diamond (Waldherr et al. 2014; Gurudev Dutt et al. 2007), semiconducting quantum dots (Veldhorst et al. 2015; Nowack et al. 2011), donor atoms in silicon (Pla et al. 2012; Morello et al. 2010), trapped ions (Blatt and Roos 2012; Leibfried et al. 2003), and superconducting circuits (Devoret and Schoelkopf 2013; Barends et al. 2014). Quantum computing is developed in constant interaction with classical computing, both in theory and in practice. In particular, quantum computers will only become of practical value once they can surpass classical computers in their computational power. This is in stark contrast with the initial development of computing, where early devices, such as Alan Turing’s Bombe, were immediately useful by reducing the workload of human computers. Recently, a quantum computer with 53 superconducting qubits was used to run quasi-random quantum logic circuits (Arute et al. 2019), a task that scales superpolynomially for classical computers as a function of the number of qubits. 
In this task that was designed to scale unfavourably for classical computers, the quantum computer carried out the computation much faster than the fastest classical supercomputer, which we refer to as quantum supremacy. Occasionally, such a demonstration has been also referred to as quantum advantage, but we reserve this term for the case where the quantum computer outperforms all classical computers in a computational task of practical value. Nevertheless, the quantum supremacy experiment was important in demonstrating that, at this level, there are no fundamental roadblocks for quantum computers to operate according to the theory, but that further advances are required for quantum advantage. In fact, it is yet to be demonstrated that this type of Noisy Intermediate-Scale Quantum (NISQ) computers, where the energy levels of physical systems are directly used as qubits, thus making them prone to errors, can deliver quantum advantage in practical problems. Further development in both algorithms and hardware is needed. On the other hand, the interaction between the fields of quantum and classical computing may also improve classical algorithms. Namely, a number of quantum algorithms that at the time of their discovery have outperformed their best classical counterparts have been dequantized (Tang 2019), i.e. researchers have found a way to replace the essential quantum operations in the algorithms by


polynomial-sized classical computations, effectively erasing the potential quantum advantage from the algorithm, but spurring the development of classical complexity theory and algorithms. Recently these quantum-inspired classical algorithms, outperforming previous algorithms, have been developed in particular for solving linear algebra problems (Arrazola et al. 2019). Quantum-inspired classical hardware has also been developed, such as the coherent Ising machine (Inagaki et al. 2000) and the digital annealer (Aramon et al. 2019) that are application-specific classical computers built for solving combinatorial optimization problems, reported to perform well in practice.

Quantum error correction (La Guardia 2020; Terhal 2015) is a way to build high-accuracy logical qubits out of error-prone physical qubits, provided that all the elementary operations, readout, gates, and reset, can be achieved at an error level below a certain threshold value. The threshold can be as high as a percent in advanced error correction codes (Fowler et al. 2009), with the tradeoff of each logical qubit using many physical qubits. Great progress in experimental studies of error correction has been witnessed, for example, in superconducting circuits (Kelly et al. 2015; Nakamura et al. 1999; Andersen et al. 2020; Shankar et al. 2013; Ofek et al. 2016; Campagne-Ibarcq et al. 2020; Krinner et al. 2020), ion traps (Schindler et al. 2011; Rosenblum et al. 2018; Chiaverini et al. 2004), and color centers (Waldherr et al. 2014; Taminiau et al. 2014), and an increased lifetime for an error-corrected logical qubit has been observed utilizing bosonic codes (Ofek et al. 2016; Campagne-Ibarcq et al. 2020).
However, further improvements in the fidelity of the elementary qubit operations are urgently needed to bring down the qubit overheads to manageable amounts in order to go deep into the regime where error correction becomes efficient and the device may be scaled up into a useful fault-tolerant quantum computer. While this is the long-term vision for quantum computing, researchers are currently also working to make use of non-error-corrected devices.

2.2 Differences Between Quantum and Classical Computing

Computational complexity theory aims to categorize computational problems, based on the resources required to solve them, by grouping them into complexity classes. For classical computing, the most important of these classes is P. It contains all the decision problems for which there is an algorithm that can solve them, using a classical computer, in an amount of time that scales polynomially with the size of the problem instance, i.e., the problem can be solved in polynomial time. In complexity theory, a polynomial scaling is considered easy, tractable, or efficient,¹ whereas anything superpolynomial is considered hard or even intractable. For every computational problem there are many different algorithms, and the classification of the problem is based on the most efficient one. In many cases,

¹ Ultimately this is just a convention. A problem with an execution time following a polynomial of a sufficiently high degree will still be intractable in practice.

Fig. 1 Various computational complexity classes and computational problems that belong to them

there is no proof that a given algorithm is the most efficient one possible, and thus our knowledge about the classification of a particular problem is often limited and becomes more precise as new algorithms and proofs are discovered. For example, the discovery of the Agrawal-Kayal-Saxena primality test in 2002 (Agrawal et al. 2004) proved that the integer primality problem is in P, whereas previously it was only known to belong to the larger class ZPP that contains P. In other words, we know that for all the problems that we currently know to reside in the complexity class P, there exists an efficient classical algorithm.

Quantum computers, however, are not limited to classical algorithms. New complexity classes needed to be introduced to represent the utilization of the rules of quantum mechanics to solve computational problems. The most important one of these new classes is BQP, the class of decision problems solvable with a bounded error (B) using a quantum computer (Q) in polynomial time (P). There are several important problems known to belong to BQP, but conjectured to be outside P. Figure 1 presents a summary of the most central complexity classes and various problems currently known to reside in them. In the figure, BQP represents problems that can be solved efficiently using a quantum computer. Its precise relationship to some other important classes, especially P and NP, is not yet known.

Importantly, quantum and classical computing are not entirely different. In fact, quantum computers can efficiently simulate classical computers, which means P ⊂ BQP, and classical computers can simulate quantum computers, albeit with a lower efficiency, yielding BQP ⊂ PSPACE, where PSPACE denotes the decision problems that can be solved using a polynomial amount of space, or memory, with no restriction on the computational time. However, the precise relationship of BQP with these complexity classes is not yet known. It is currently conjectured² as follows:

² It is possible that P = BQP, although this is currently considered very unlikely.


Conjecture 1 BQP is strictly larger than both P and BPP, implying that there are important problems that quantum computers can solve efficiently but classical ones cannot.

Conjecture 2 BQP contains no NP-complete problems, i.e., there will remain important problems that not even quantum computers can solve efficiently.

Here, NP is the class of decision problems whose solutions can be verified in polynomial time using a classical computer. For these problems, arriving at a solution may be very hard, but verifying the correctness of a proposed solution is relatively easy. Hence, the most important problems for quantum computing are found in the set (BQP ∩ NP) \ P. These are classically hard problems that can be efficiently solved using a quantum computer, and the obtained solutions can be efficiently verified on a classical computer.
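The asymmetry between finding and verifying a solution can be illustrated with integer factorisation, which the chapter later names as the target of Shor's algorithm. The following is a minimal sketch only: the function names and the toy instance are chosen here for illustration, and trial division merely stands in for a classical search. It is far from the best-known classical factoring method, and the formal complexity classes are defined for decision problems rather than for this search version.

```python
def verify_factor(n: int, candidate: int) -> bool:
    """Verify a proposed non-trivial factor of n with a single division."""
    return 1 < candidate < n and n % candidate == 0


def find_factor(n: int):
    """Search for the smallest non-trivial factor of n by trial division.

    The loop runs up to sqrt(n) times, which is exponential in the
    number of bits of n. Returns None when n has no such factor.
    """
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return None


n = 2021  # toy instance, 2021 = 43 * 47
factor = find_factor(n)
print(factor, verify_factor(n, factor))
```

Checking a candidate costs one division regardless of how the candidate was obtained, so a factor returned by a quantum computer could be verified classically in the same way.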

3 Basics of Quantum Computing Theory

In this section, we provide potential examples of the problems discussed in Sect. 2.

3.1 Qubit

To understand the basics of quantum computing, the first step is to introduce the quantum bit, referred to as a qubit. Theoretically, the qubit is a vector generalisation of a binary variable, in a way that guarantees that its state always obeys the axioms of quantum mechanics (Nielsen and Chuang 2002). A binary variable can take the values 0 or 1 either deterministically or probabilistically. In contrast, a quantum bit has the two computational basis states $|0\rangle$ and $|1\rangle$, using the ket vector notation. These correspond to the states 0 and 1 of a classical bit, respectively. In quantum computing, these are often represented as two-dimensional vectors as

$$|0\rangle \mathrel{\hat{=}} \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad |1\rangle \mathrel{\hat{=}} \begin{pmatrix} 0 \\ 1 \end{pmatrix},$$

where $\hat{=}$ highlights the difference between the ket vectors corresponding to the quantum mechanical state of the qubit (left) and their representations as usual vectors (right). The states that a qubit is allowed to take are complex linear combinations of these two basis states, such that a general state $|\psi\rangle$ is given by

$$|\psi\rangle = \alpha |0\rangle + \beta |1\rangle, \qquad \alpha, \beta \in \mathbb{C} \ \text{such that} \ |\alpha|^2 + |\beta|^2 = 1.$$

The ket vectors $\{|0\rangle, |1\rangle\}$ span a two-dimensional complex Hilbert space, i.e., a complete vector space with an inner product defined.

A state that has nonzero components along both the $|0\rangle$ and $|1\rangle$ vectors, or more generally along any two or more basis vectors in a given Hilbert space, is referred to as a superposition state. A qubit in a superposition state is sometimes informally referred to as being in both the zero and one state simultaneously since such a coherent superposition is fundamentally different from merely not knowing in which of these states the qubit is. Although this kind of superposition state can be an accurate description of the quantum state of a qubit that is fully isolated from the outside world, the superposition cannot be observed directly, only inferred indirectly. If a macroscopic apparatus, or a human, tried to observe the state of the qubit, it would be observed in the state $|0\rangle$ or $|1\rangle$ with probabilities $|\alpha|^2$ and $|\beta|^2$, respectively. The state of the qubit after this kind of an ideal measurement is either $|0\rangle$ or $|1\rangle$, which is sometimes referred to as a collapse of the state from a superposition to a basis state. The important difference between probabilistic classical bits and quantum bits is that there is a larger set of valid quantum states than there are classical states. The state of a probabilistic classical bit is fully determined by the probabilities of the basis states, which are non-negative real numbers that sum to unity. A quantum state instead has probability amplitudes $\alpha$ and $\beta$ that are complex numbers. The direct implication is that, e.g., both of the states $\frac{1}{\sqrt{2}}|0\rangle + \frac{1}{\sqrt{2}}|1\rangle$ and $\frac{1}{\sqrt{2}}|0\rangle - \frac{1}{\sqrt{2}}|1\rangle$ have equal probabilities of observing 0 and 1, but they are very different states. This has implications for quantum algorithms that can make use of this richer set of states in the intermediate steps of the algorithm before a measurement of the qubits is made. The difference between the superposition state and a classical probabilistic state is intimately related to the phenomenon of decoherence.
Namely, decoherence is the accumulation of errors in the computation due to the influence of the environment of the qubit. These errors may change the probability of the basis states, i.e., the magnitude of $\alpha$ and $\beta$, or the relative phase between $\alpha$ and $\beta$. The former process is generally referred to as qubit decay and the latter process as dephasing. Both of these processes tend to destroy the quantum-mechanical superposition and reduce the state of the qubit to a classical probability distribution. Thus, to avoid excessive errors, a quantum algorithm typically needs to be run on a time scale not greatly exceeding the time scales $T_1$ and $T_2$ related to these decoherence processes.
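The statement that two states can share the same measurement probabilities and still be very different can be checked numerically. The sketch below uses only the Python standard library. The helper names `probabilities` and `inner` are introduced here for illustration and do not come from the chapter.

```python
import math


def probabilities(state):
    """Measurement probabilities |alpha|^2 and |beta|^2 of a state (alpha, beta)."""
    return tuple(abs(amplitude) ** 2 for amplitude in state)


def inner(u, v):
    """Inner product <u|v>; the first argument is complex-conjugated."""
    return sum(a.conjugate() * b for a, b in zip(u, v))


s = 1 / math.sqrt(2)
plus = (complex(s), complex(s))    # (|0> + |1>) / sqrt(2)
minus = (complex(s), complex(-s))  # (|0> - |1>) / sqrt(2)

print(probabilities(plus), probabilities(minus))  # the same probabilities for both states
print(abs(inner(plus, minus)))                    # vanishing overlap: the states are orthogonal
```

Both states yield 0 and 1 with probability one half each, yet their inner product vanishes, so they are as distinguishable as $|0\rangle$ and $|1\rangle$ themselves. Only the relative phase, which dephasing scrambles, separates them.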

3.2 Quantum Information Processing

All computation by classical computers can be expressed as a specific sequence of logical gates acting on a register of bits, based on Boolean logic (Boole 1847). For example, the logic gates AND and OR act simultaneously on two bits, while the NOT gate only acts on a single bit at a time. These operations are usually defined through truth tables that relate gate inputs to gate outputs. Instead of truth tables, quantum logic gates are defined through unitary matrices. For example, the NOT gate is defined in its matrix form as


$$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$


Clearly this gate converts $|0\rangle$ to $|1\rangle$ and vice versa. Another, more interesting, quantum gate is the Hadamard gate

$$\hat{H} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}. \qquad (4)$$

Acting on the state $|0\rangle$ or $|1\rangle$, this gate creates equal superpositions of the basis states. Applying the gate to a qubit in the state $|0\rangle$ yields

$$\hat{H} |0\rangle \mathrel{\hat{=}} \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix} \mathrel{\hat{=}} \frac{1}{\sqrt{2}} |0\rangle + \frac{1}{\sqrt{2}} |1\rangle.$$

In the state $|1\rangle$ it yields

$$\hat{H} |1\rangle \mathrel{\hat{=}} \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix} \mathrel{\hat{=}} \frac{1}{\sqrt{2}} |0\rangle - \frac{1}{\sqrt{2}} |1\rangle.$$

All possible operations on qubits are described by all the matrices that correspond to mappings of valid quantum states to other valid quantum states. Like the quantum states themselves, these matrices can have complex valued entries, but in order to conserve probability, $|\alpha|^2 + |\beta|^2 = 1$, they must be unitary, i.e., their Hermitian conjugate equals their inverse. Since the Hermitian conjugate is merely a transpose and a complex conjugation, finding an inverse of any quantum gate, whether single or many-qubit gate, is as easy as expressing the gate itself. This implies that unlike classical logic gates, all quantum gates must be reversible. To extend the vector and matrix notation to systems of more than one qubit, the tensor product, or the Kronecker product $\otimes$, is a natural choice for combining the vector spaces of the individual qubits to a larger vector space. Thus, the basis states of an n-qubit Hilbert space are given by

$$|m\rangle := |m_1 m_2 \cdots m_n\rangle := |m_1\rangle \otimes |m_2\rangle \otimes \cdots \otimes |m_n\rangle,$$

where $m \in \{0, 1, \ldots, 2^n - 1\}$ and $m_k \in \{0, 1\}$ is the $k$th bit in the binary representation of $m$. Since the n-qubit Hilbert space is spanned by $2^n$ basis vectors, it follows that in general the quantum state of an n-qubit register needs to be described by $2^n$ complex numbers $\{c_m\}$:

$$|\Psi\rangle = \sum_{m=0}^{2^n - 1} c_m |m\rangle, \quad \text{where} \quad \sum_{m=0}^{2^n - 1} |c_m|^2 = 1.$$

Single-qubit gates for a certain qubit can likewise be obtained in this larger Hilbert space using the tensor product of the operation for that qubit with single-qubit identity

operators $\hat{I}$ acting on all other qubits. For the Hadamard gate on the first qubit, for example, we have $\hat{H} \otimes \hat{I} \otimes \cdots \otimes \hat{I}$. In general, n-qubit gates can be defined using unitary matrices of dimension $2^n$. Similar to classical logic gates, it turns out that one- and two-qubit gates are enough to implement arbitrary quantum computations. In fact, almost any two-qubit gate together with arbitrary single-qubit gates is enough. The best-known two-qubit gate is the controlled-NOT (CNOT) gate represented by the matrix

$$\mathrm{CNOT} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}. \qquad (6)$$

The CNOT gate applies a NOT operation on the target qubit (the first qubit in (5)) provided that the control qubit (the second qubit in (5)) is in the state $|1\rangle$. As observed above, the dimension of the Hilbert space covering all the possible states of an n-qubit system is, according to the rules of the tensor product, $2^n$, i.e., exponential in the number of qubits. This is the origin of the difficulty of simulating quantum systems with a classical computer, but also the opportunity that quantum computing offers.
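As a rough illustration of these two points, the sketch below applies the CNOT matrix of Eq. (6) to a two-qubit superposition and estimates the memory a classical computer needs to store a full state vector. Several things here are assumptions made for the example rather than statements from the chapter: the amplitude ordering (index 0 to 3 for the four basis states, as produced by a standard Kronecker product), the 16 bytes per complex amplitude, and the helper names.

```python
import math

# CNOT matrix exactly as written in Eq. (6): it exchanges the last two amplitudes.
CNOT = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]


def apply(gate, state):
    """Apply a gate (matrix) to a state (list of amplitudes)."""
    return [sum(g * c for g, c in zip(row, state)) for row in gate]


def is_product_state(state, tol=1e-12):
    """A pure two-qubit state (a, b, c, d) factorises exactly when a*d - b*c = 0."""
    a, b, c, d = state
    return abs(a * d - b * c) < tol


def statevector_bytes(n_qubits, bytes_per_amplitude=16):
    """Memory for 2**n complex amplitudes (assumed: two 8-byte floats each)."""
    return (2 ** n_qubits) * bytes_per_amplitude


s = 1 / math.sqrt(2)
before = [s, 0.0, s, 0.0]     # equal superposition of basis states 0 and 2
after = apply(CNOT, before)   # amplitudes move to basis states 0 and 3

print(is_product_state(before), is_product_state(after))
print(statevector_bytes(30))  # 2**30 amplitudes at 16 bytes each
```

The state before the CNOT factorises into single-qubit states, whereas the state after it does not, which is the entanglement mentioned in Sect. 2. Every additional qubit doubles the memory estimate.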

3.3 Quantum Algorithms

The development of quantum algorithms, although a vital field of research, is still in its early days. Relative to the development of algorithms for classical computers, only relatively few quantum algorithms and algorithm classes are known (Montanaro 2016). The first quantum algorithms, which are also the best-known, were developed for idealised quantum computers that do not suffer from errors due to decoherence or inaccuracies in gate implementations. These algorithms were developed by theoretical physicists and computer scientists before they had access to actual quantum computers on which to run and test them. Shor’s algorithm for integer factorisation (Shor 1994, 1997) and Grover’s database search algorithm (Grover 1996) are famous examples of these kinds of early quantum algorithms. They can be analytically shown to work and their complexity analysis is straightforward. Shor’s algorithm for integer factorisation famously provides a superpolynomial advantage over the best-known classical algorithms by utilizing the quantum Fourier transform. Grover’s database search is able to find with high probability an element from an unstructured database in time $O(\sqrt{N})$. However, note that since there is no physical long-term quantum data storage available at the moment, one must input data to a quantum computer to use Grover’s algorithm, which consumes in the general case $O(N)$ time, currently preventing any speedup of the algorithm as a full database search algorithm. Consequently, Grover’s algorithm is limited to acting as a subroutine in larger computations or to providing a quadratic speedup to brute-force searches of solutions to NP-complete problems. Both of these


algorithms also suffer from being sensitive to decoherence and computational errors, which at the moment are major concerns for practical quantum computers. We may need to implement full-scale quantum error correction before these algorithms can be run on problems of practical value, not just problems demonstrating a proof of principle (Martin-López et al. 2012; Vandersypen et al. 2001). However, these algorithms provided the first strong motivation to build quantum computers in practice, and to date, they continue to motivate researchers and investment in the field. Recently, heuristic algorithm types that may be more noise resilient have been developed for use on near-term devices (Bharti et al. 2021). These algorithms are mostly based on trying to distribute the computational load between a classical and quantum computer in such a way that all the computationally easy tasks are handled by a classical computer, while only the most critical hard steps are handled by the noisy and limited-size quantum computer. These kinds of algorithms are called hybrid quantum-classical algorithms and they represent the most recent development in the field. Thus far, no theoretical proof exists for complexity theoretic speed-ups with these algorithms beyond some specific cases, but these algorithms perform considerably better in practice on current hardware than the early algorithms, which, on the other hand, can easily be analysed. For example, the ground-state energy of a chain of 12 hydrogen atoms has been computed using the Variational Quantum Eigensolver (VQE) (Arute et al. 2020) and the Quantum Approximate Optimization Algorithm (QAOA) has been thoroughly benchmarked on the maximum cut problem (Crooks 2018; Bharti et al. 2021). Although these instances are still easy tasks for a classical computer, they represent the state of the art for quantum computation.
Quantum-classical hybrid algorithms have to this date been developed for a wide variety of problems, including but not limited to quantum chemistry and pharmaceutical industry problems, combinatorial optimization problems, machine-learning problems (in particular kernel methods), and number theoretic and linear-algebra problems (Bharti et al. 2021). It remains to be seen if proofs of speed-ups can be presented for hybrid algorithms, or if similarly to neural networks, they will simply perform well in practice without such theoretical guarantees, or if completely new algorithmic techniques need to be developed to reach quantum advantage with NISQ computers. The field is rapidly progressing at the moment and hardware improvements and new ideas constantly spur new algorithm developments.
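The $O(\sqrt{N})$ scaling of Grover's search mentioned earlier in this section can be illustrated by tracking the amplitudes of an idealised, error-free search classically. The sketch below is an illustration under simplifying assumptions: the function name and the choice of a single marked item are made here for the example. The classical simulation itself costs $O(N)$ per iteration, so it demonstrates the number of oracle calls, not a speed-up.

```python
import math


def grover_success_probability(n_items: int, marked: int, iterations: int) -> float:
    """Probability of measuring the marked item after the given Grover iterations."""
    amplitudes = [1 / math.sqrt(n_items)] * n_items  # uniform superposition
    for _ in range(iterations):
        amplitudes[marked] = -amplitudes[marked]     # oracle: flip the sign of the marked item
        mean = sum(amplitudes) / n_items
        amplitudes = [2 * mean - a for a in amplitudes]  # diffusion: inversion about the mean
    return amplitudes[marked] ** 2


N = 8
k = math.floor(math.pi / 4 * math.sqrt(N))  # about (pi/4) * sqrt(N) iterations
p = grover_success_probability(N, marked=5, iterations=k)
print(k, p)
```

For eight items, two iterations already raise the success probability from 1/8 to 121/128, whereas a classical search needs on average about N/2 queries.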

4 Quantum Computing at IQM

The quantum-computing company IQM was spun out in 2018–2019 from the research carried out at Aalto University and VTT Technical Research Centre of Finland, especially in the Quantum Computing and Devices research group. At the time of writing, IQM has roughly 120 employees and offices in Finland, Germany, and Spain. Since its beginning, IQM has focused on superconducting quantum-electrical circuits as the hardware platform to implement quantum computing. In addition to

such chips fabricated at IQM’s foundry in Finland, the company integrates a plethora of components and technologies into full on-site quantum computers that it offers across the world. As one of its key differentiators, IQM aims to launch new so-called co-design quantum computers, where the hardware and software are tailored for a specific commercially relevant problem, with the aim to achieve early quantum advantage in such problems.

4.1 IQM Technology

The quantum-computing hardware developed at IQM is based on superconducting microwave circuits on a chip, a leading technology for building large-scale Quantum Processing Units (QPUs). Here, IQM employs certain types of superconducting circuits as the physical qubits, such as the transmon (Koch et al. 2007), which has favorable properties in terms of controllability and noise resilience. In such a transmon, the occupation of the two lowest-energy quantum states can be controlled very precisely and they can therefore serve as the computational states $|0\rangle$ and $|1\rangle$, allowing the circuit to be used as a qubit. In addition to providing the complete hardware stack, including cryogenics and electronics, IQM is also developing a software stack that allows users to run complete quantum algorithms on IQM QPUs. A hardware overview of the parts of the quantum computer other than the chip is shown in Fig. 2. Next to the dilution refrigerator are two cases that enclose control electronics used to perform the logical operations in the QPU and a pumping system required to run the refrigerator. Signals from the control electronics are sent to the QPU via microwave coaxial cables inside the refrigerator. Microwave isolators

Fig. 2 Superconducting quantum computer consisting of a dilution refrigerator (marked with green circle on the left and shown in detail on the right) capable of cooling the QPU down to roughly 10 millikelvin


protect the QPU from excess noise from the amplifiers which are needed to read out the state of the qubits from originally weak microwave signals. The QPU is enclosed in a magnetic shield to prevent magnetic-field noise from inducing excess decoherence. In Sect. 2, it was explained that superpositions of computational states are necessary to utilize quantum computing. In the laboratory, however, such quantum superpositions are fragile because the interaction of qubits with their environment is unavoidable and not fully controllable. This imperfection leads to decoherence as discussed above. For the transmon qubits at IQM, the coherence time $T_2$, i.e., the duration for which superpositions can be maintained and thus quantum information can be stored, reaches tens of microseconds (IQM Quantum Engineers 2021). The quantum algorithm must be completed within one or a few coherence times to obtain a viable result. Therefore, one goal of the QPU development is to increase the number of operations achievable within the coherence time by improving and speeding up the three basic operations of digital quantum computing: qubit reset, quantum gate execution, and qubit readout. Resetting a qubit, typically to the lower-energy state of the two basis states, $|g\rangle$, is a common operation in the beginning or in the course of a quantum algorithm. Since the natural relaxation time $T_1$ from the higher-energy state $|e\rangle$ to the state $|g\rangle$ is at least half the coherence time, $T_2 \le 2T_1$, and on the order of a few tens or hundreds of microseconds, using the natural decay after a measurement leads to an impractically long duty cycle for the qubits. To circumvent this problem, IQM researchers have developed, together with co-authors, a proprietary solution for a fast reset, the quantum-circuit refrigerator (Tan et al. 2017), which enables qubit reset down to a few nanoseconds.
Fast quantum gates are another important resource in the execution of quantum algorithms since the gate duration determines the maximum possible depth of a quantum circuit, and therefore briefer gates enable more complex algorithms to be run, thus solving harder computational problems. For implementing fast high-fidelity two-qubit gates, IQM uses an advanced coupling scheme (Yan et al. 2018) compatible with the implementation of so-called iSWAP and CPHASE gates within tens of nanoseconds. In addition, IQM has also developed advanced techniques for the third fundamental operation, qubit measurement, which enable extremely low error in the readout of the qubit state (Ikonen et al. 2019). All these operations are continuously improved to reduce the time required for a typical cycle of reset, application of quantum gates, and qubit readout, to ultimately successfully implement complex quantum algorithms on a real QPU. The reset and readout speed will be especially important in the long term in large-scale error-corrected quantum computers, in which one almost constantly needs to monitor the potential appearance of errors. In addition to the development of QPUs for standard digital quantum computing described above, IQM is also working on specialized QPUs in a co-design approach with the goal of achieving quantum advantage before fully error-corrected quantum computers become available. This is the topic of the next section.
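The link between coherence time, gate duration, and circuit depth can be made concrete with a back-of-the-envelope estimate. The numbers below are illustrative assumptions picked from the ranges quoted above ("tens of microseconds", "tens of nanoseconds"), not measured IQM values, and the exponential decay used for the passive reset is the standard textbook model for relaxation with time constant $T_1$, assumed here for the example.

```python
import math


def max_sequential_gates(coherence_time_ns: int, gate_time_ns: int) -> int:
    """Number of back-to-back gates that fit within one coherence time."""
    return coherence_time_ns // gate_time_ns


def residual_excitation(wait_ns: float, t1_ns: float) -> float:
    """Excited-state population left after waiting, assuming exponential decay."""
    return math.exp(-wait_ns / t1_ns)


T2_NS = 50_000   # assumed coherence time of 50 microseconds
GATE_NS = 50     # assumed two-qubit gate duration of 50 nanoseconds
T1_NS = 50_000   # assumed relaxation time of 50 microseconds

print(max_sequential_gates(T2_NS, GATE_NS))   # 1000
print(residual_excitation(5 * T1_NS, T1_NS))  # population left after waiting 5 * T1
```

Under these assumptions roughly a thousand sequential gates fit into one coherence time, while a passive reset that waits five relaxation times takes 250 microseconds, longer than the whole computation. This is the duty-cycle problem that a nanosecond-scale active reset is meant to remove.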

4.2 Co-design Quantum Computers and Their Components

The typical way to process quantum information using a fixed set of one- and two-qubit gates, such as those introduced in Sect. 3.2, is referred to as Digital Quantum Computing (DQC). As opposed to other quantum-computing paradigms such as those discussed further in this section, DQC is one of the most successful in terms of algorithmic and hardware support. Independent of the size of the quantum processor for DQC, only the dynamics of single qubits and pairs of coupled qubits have to be characterized, calibrated and optimized, providing for favourable scaling properties. In addition, digital quantum computing is based on sequences of quantum gates, or quantum circuits, which have been proven to be equivalent to quantum Turing machines (Yao 1993; Molina and Watrous 2019). Hence, DQC is a universal computational model, in the sense that it can simulate any finitely realizable physical system (Deutsch 1985). However, most digital quantum algorithms require a relatively large number of logical qubits, which, as mentioned above, require an even larger number of physical qubits if quantum error correction is needed (Lidar et al. 2013). This limits the range of applicability of standard digital quantum computers. An alternative paradigm, Analog Quantum Computing (AQC), and the related hardware have been developed and tailored to efficiently solve a few specific problems, being in principle more robust against noise. In contrast to DQC, analog machines are based on operating on many qubits, sometimes even the whole quantum processor, at a time, using quantum operations with varying durations. Adiabatic quantum algorithms are one of the main types of analog algorithms. These are based on encoding the solution of a problem in the lowest energy state of a target Hamiltonian, towards which the adiabatic quantum computer is steered to evolve.
A second type, quantum annealing algorithms, are designed to solve similar problems while allowing for the exploration of excited states of the quantum computer during the search. In addition, the QAOA and its variants are a digitized classical-quantum hybrid version of quantum annealing. These can be implemented on a digital quantum computer supported by classical information processing. As the third paradigm of quantum computing, we present Digital-Analog Quantum Computing (DAQC), which aims to merge the best of the worlds of DQC and AQC. DAQC combines the continuous-time evolution of the full quantum processor with the use of high-fidelity single-qubit gates. It is precisely this latter ingredient that provides its universality, i.e., allows DAQC to implement any possible algorithm. Although DAQC is an interesting alternative to DQC, its advantage with respect to DQC in terms of noise resilience and overall efficiency is still under analysis, and the result may depend very heavily on the specific algorithm. However, our aim is not to address all problems with DAQC, but to find a suitable problem that may lead to expedited quantum advantage with the help of DAQC. The ability of DAQC devices to implement digital, analog, and digital-analog algorithms in the same quantum processor implies a dramatic extension of its tools and methods beyond the standard ones that can be found either in a universal purely digital or analog machine. In DAQC, we need control software and firmware that


is adapted not only to produce gates, but also to couple a large number of qubits for a varying controllable period of time. Furthermore, precise benchmarking, characterization, and optimal control of such a large number of qubits may be challenging since the standard methods like randomized benchmarking (Bharti et al. 2021) were originally designed for DQC and call for further research in the case of a large number of simultaneously coupled qubits. Nevertheless, a quantum computer able to operate in all three operational modes, DQC, AQC, and DAQC, enables flexibly in adapting to each particular problem or algorithm, thus optimizing the performance of the quantum computer. This includes also combinations such as digital algorithms that include DAQC components for producing n-qubit gates, or continuous time evolution for optimizing gates as in Ref. (Lacroix et al. 2020). At the first level of co-design quantum computers, the quantum processor is still based on a regular array of qubits, standard for a generic algorithm, and the only procedure that is changed with respect to DQC is how the qubits are operated. This is typically the case in the above-discussed multi-use quantum computer. The second level of co-design adds an optimization of the topology of the qubits within the quantum processor for improved performance in a specific algorithm, and therefore such a processor may solve more efficiently a certain class of problems in comparison to first-level co-design devices. By topology we mean here the graph of qubits and their connections. For example, a square lattice of qubits where each qubit is connected to four of its nearest neighbours. Note that within a quantum processor only the qubits that are physically connected can directly interact, which is determined by the chip topology. 
Since many quantum algorithms naturally call for two-qubit gates on every pair of qubits, gates on qubits that are not physically connected require a number of additional two-qubit gates, such as SWAP gates, that effectively connect them, thus increasing the circuit depth. Increased depth limits the size of the algorithms that can be successfully run on the quantum computer, since the additional operations must fit within the coherence time of the device. Quantum-computing-hardware companies have been experimenting with different topologies for qubits in their quantum processors. For instance, rather than the standard square lattice, some quantum processors are based on a so-called heavy-hex topology (IBM Quantum 2021). It is a hexagonal qubit lattice where, interestingly, the connectivity is drastically reduced. Although the reduced connectivity requires more operations to connect qubits, resulting in a deeper circuit and therefore more exposure to errors, fewer connections between the qubits lead to less cross-talk error, which arises when the control signal of a gate on one qubit induces residual gate operations on nearby qubits. In this context, while most quantum-computing companies follow a software-agnostic analysis of the hardware, the idea of co-design is to consider the algorithm type as an element in defining the optimal QPU topology. For instance, various algorithms natively introduce all-to-all connectivity or long-range connectivity between qubits, which suggests that a topology with high connectivity may be more suitable than the heavy-hex one. The literature that explores how the connectivity
and topology affect the performance of different algorithms is still limited (Lu et al. 2021; Hu et al. 2021; Li et al. 2021), but significant new optimization pathways can likely be discovered by co-designing the topology and the quantum algorithm. As an example, a topology based on a central element directly coupled to all other qubits, as well as variations thereof, may be more suitable to run algorithms where all-to-all qubit connectivity is beneficial. In these cases, the presence of a central element reduces the number of operations needed in comparison to a standard lattice. Simulation of quantum mechanical systems, e.g., to characterize the properties of complex molecules or materials, is one of the most appealing applications of quantum computing. The amount of memory required for such a simulation on a classical computer scales exponentially with the size of the simulated system, rendering the computation intractable for many relevant problems. Many complex systems contain a combination of different types of degrees of freedom, each associated with a specific algebra of observables. For instance, electrons are fermions, which arise in quantum chemistry, whereas phonons and photons are bosons, which are intimately related to vibrations in materials and to the electromagnetic field, respectively. Although fermions may have a one-to-one correspondence with qubits, mapping bosons to qubits is more challenging due to the large number of quantum levels required. For most problems it is sufficient to consider the first M levels of each bosonic mode, a gate-efficient encoding of which can be implemented with M + 1 qubits (Somma et al. 2003). This imbalance renders it appealing to consider hybrid quantum computing chips that encode fermionic or spin degrees of freedom in qubits, and bosonic ones in linear microwave elements that naturally host bosonic modes, such as resonators or waveguides.
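The two counting arguments in this passage can be illustrated with simple arithmetic. The sketch below assumes a full state-vector simulation that stores one double-precision complex number (16 bytes) per amplitude, and it takes the M + 1 qubits per truncated bosonic mode directly from the figure quoted above. The function names and the example system sizes are hypothetical.

```python
BYTES_PER_AMPLITUDE = 16  # assumed storage: one complex128 number per amplitude


def statevector_bytes(n_qubits):
    """Memory needed to hold all 2**n complex amplitudes of an n-qubit state."""
    return BYTES_PER_AMPLITUDE * 2 ** n_qubits


def qubits_for_hybrid_system(n_spin_like, n_bosonic_modes, levels_per_mode):
    """Qubit count when each mode truncated to M levels uses M + 1 qubits."""
    return n_spin_like + n_bosonic_modes * (levels_per_mode + 1)


# Exponential growth of classical memory with system size.
for n in (30, 40, 50):
    print(n, "qubits:", statevector_bytes(n) / 2 ** 30, "GiB")

# Hypothetical system: 10 spin-like degrees of freedom and 4 modes with M = 7.
print(qubits_for_hybrid_system(n_spin_like=10, n_bosonic_modes=4, levels_per_mode=7))
```

Every additional qubit doubles the classical memory requirement, while every additional bosonic mode costs M + 1 qubits in this encoding, which is the imbalance that motivates hosting bosonic modes in resonators instead.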
Thus, even though traditionally only qubits have been considered as computational elements, quantum processors in general consist of different types of components that may participate in the computation: resonators, buses, quantum circuit refrigerators, waveguides, memory, and interfaces (Wendin 2017; Kurizki et al. 2015). Here, the idea is to use some of these components also as computational elements in the quantum simulation. To this end, hybrid chips have been used for instance to simulate quantum optical models (Forn-Díaz et al. 2010) and the spin-boson model, a paradigmatic model to describe the dynamics of a simple quantum system coupled to an environment (Messinger et al. 2019; Indrajeet et al. 2020; Leppäkangas et al. 2018). However, rather than focusing on a chip that simulates a single model with an analog evolution, a more ambitious goal is to co-design a hybrid chip that is general enough to simulate a large number of different physical models or problems. Here, digital-analog algorithms may be an optimal choice. On the one hand, they are not based on logic gates operating on qubits, and therefore they can operate with other computational elements. On the other hand, they provide more versatility than a purely analog evolution, so that they can be used to simulate different dynamics. The full co-design of the hardware components and their topology for given types of algorithms constitutes the third level of co-design quantum computers. A comparison of standard and co-design approaches to quantum computing is shown in Fig. 3. The standard framework for quantum computing circuit elements includes qubits on a square lattice equipped with single- and two-qubit gates, compatible with running error-correction algorithms. IQM’s co-design quantum computers combine a larger set of possible building blocks, aiming for advantages in building application-specific quantum processors.

Fig. 3 Comparison of standard and co-design approaches to quantum computing

5 Future Outlook at IQM

Quantum computing is an emerging field of industry with many challenges ahead, but it holds great promise to address some of the hardest computational problems encountered in practice. Quantum computing differs fundamentally from all classical computation models and provides access to computational complexity classes commonly believed to be larger than their classical counterparts. To fulfill these expectations of computational speed-ups for practical problems, major hardware development is still needed. At IQM, we take inspiration from classical application-specific integrated chips to develop application-specific quantum computers where the algorithm and hardware specifications are co-designed to minimize the overheads and maximize the computational power of the device given the current technological limitations. Such an approach to quantum computing seems fruitful to achieve the first quantum advantage in solving practical problems, perhaps some years sooner than with hardware that is designed to run generic quantum algorithms. Thus it seems that quantum advantage will be achieved with various complementary approaches, each tailored for its own purposes. It is challenging to predict when exactly the first quantum advantage will be achieved, but the current rapid development of the quantum industry appears very encouraging. The above-discussed developments for co-design quantum computers will remain useful as quantum computing technology matures and error correction becomes feasible. In this regard, an interesting idea is to co-design the hardware to improve not only algorithmic performance but also error correction. Eventually, with the era of error-corrected general-purpose quantum computers, the full power of
quantum computing will be unleashed. Here, quantum computers will not only solve many existing problems with speed and accuracy superior to classical hardware but will also likely create their own use cases, much as the development of the transistor eventually gave rise to cloud services. Thus, we are currently living in the most exciting time in the history of computing.

Abbreviations
AQC    Analog Quantum Computing
DAQC   Digital-Analog Quantum Computing
DQC    Digital Quantum Computing
NISQ   Noisy Intermediate-Scale Quantum
QAOA   Quantum Approximate Optimization Algorithm
QPU    Quantum Processing Unit
VQE    Variational Quantum Eigensolver

Acknowledgements We congratulate Pekka Neittaanmäki on his 70th birthday and thank him for inviting us to write this chapter. We acknowledge Henrikki Mäkynen for assembling Figs. 2 and 3 based on the work of several IQM employees. We thank all IQM staff, management, investors, and supporters for the wonderful working environment and for joining us on this exciting journey to build world-leading quantum computers for the well-being of humankind, now and for the future. IQM has received funding from the Finnish Ministry of Employment and Economy through Business Finland, the European Commission through the EIC Accelerator Programme, and the German Ministry of Education and Research.

References
Acín A, Bloch I, Buhrman H, Calarco T, Eichler C, Eisert J, Esteve D, Gisin N, Glaser SJ, Jelezko F, Kuhr S, Lewenstein M, Riedel MF, Schmidt PO, Thew R, Wallraff A, Walmsley I, Wilhelm FK (2018) The quantum technologies roadmap: a European community view. New J Phys 20(8):080201
Agrawal M, Kayal N, Saxena N (2004) PRIMES is in P. Ann Math 160(2):781–793
Andersen CK, Remm A, Lazar S, Krinner S, Lacroix N, Norris GJ, Gabureac M, Eichler C, Wallraff A (2020) Repeated quantum error detection in a surface code. Nat Phys 16:875–880
Aramon M, Rosenberg G, Valiante E, Miyazawa T, Tamura H, Katzgraber HG (2019) Physics-inspired optimization for quadratic unconstrained problems using a digital annealer. Front Phys 7:48
Arrazola JM, Delgado A, Bardhan BR, Lloyd S (2019) Quantum-inspired algorithms in practice. arXiv:1905.10415
Arute F, Arya K, Babbush R, Bacon D et al (2019) Quantum supremacy using a programmable superconducting processor. Nature 574:505–510
Arute F, Arya K, Babbush R, Bacon D et al (2020) Hartree-Fock on a superconducting qubit quantum computer. Science 369:1084–1089
Barends R, Kelly J, Megrant A, Veitia A, Sank D, Jeffrey E, White TC, Mutus J, Fowler AG, Campbell B, Chen Y, Chen Z, Chiaro B, Dunsworth A, Neill C, O’Malley P, Roushan P, Vainsencher A, Wenner J, Korotkov AN, Cleland AN, Martinis JM (2014) Superconducting quantum circuits at the surface code threshold for fault tolerance. Nature 508:500–503
Bennett CH, Brassard G (1984) Quantum cryptography: public key distribution and coin tossing. In: Proceedings of the IEEE international conference on computers, systems, and signal processing, pp 175–179
Bharti K, Cervera-Lierta A, Kyaw TH, Haug T, Alperin-Lea S, Anand A, Degroote M, Heimonen H, Kottmann JS, Menke T, Mok W-K, Sim S, Kwek L-C, Aspuru-Guzik A (2021) Noisy intermediate-scale quantum (NISQ) algorithms. arXiv:2101.08448
Blatt R, Roos CF (2012) Quantum simulations with trapped ions. Nat Phys 8(4):277–284
Boole G (1847) The mathematical analysis of logic. Philosophical Library
Buluta I, Nori F (2009) Quantum simulators. Science 326:108–111
Campagne-Ibarcq P, Eickbusch A, Touzard S, Zalys-Geller E, Frattini NE, Sivak VV, Reinhold P, Puri S, Shankar S, Schoelkopf RJ, Frunzio L, Mirrahimi M, Devoret MH (2020) Quantum error correction of a qubit encoded in grid states of an oscillator. Nature 584:368–372
Chiaverini J, Leibfried D, Schaetz T, Barrett MD, Blakestad RB, Britton J, Itano WM, Jost JD, Knill E, Langer C, Ozeri R, Wineland DJ (2004) Realization of quantum error correction. Nature 432:602–605
Childs AM, van Dam W (2010) Quantum algorithms for algebraic problems. Rev Mod Phys 82(1):1–52
Cirac JI, Zoller P (2012) Goals and opportunities in quantum simulation. Nat Phys 8(4):264–266
Crooks GE (2018) Performance of the quantum approximate optimization algorithm on the maximum cut problem. arXiv:1811.08419
Degen CL, Reinhard F, Cappellaro P (2017) Quantum sensing. Rev Mod Phys 89(3):035002
Deutsch D (1985) Quantum theory, the Church-Turing principle and the universal quantum computer. Proc R Soc Lond A 400:97–117
Devoret MH, Schoelkopf RJ (2013) Superconducting circuits for quantum information: an outlook. Science 339:1169–1174
Feynman RP (1982) Simulating physics with computers. Int J Theor Phys 21(6–7):467–488
Fleischhauer M, Imamoglu A, Marangos JP (2005) Electromagnetically induced transparency: optics in coherent media. Rev Mod Phys 77(2):633–673
Forn-Díaz P, Lisenfeld J, Marcos D, García-Ripoll JJ, Solano E, Harmans CJPM, Mooij JE (2010) Observation of the Bloch-Siegert shift in a qubit-oscillator system in the ultrastrong coupling regime. Phys Rev Lett 105(23):237001
Fowler AG, Stephens AM, Groszkowski P (2009) High-threshold universal quantum computation on the surface code. Phys Rev A 80(5):052312
Gross C, Bloch I (2017) Quantum simulations with ultracold atoms in optical lattices. Science 357:995–1001
Grover LK (1996) A fast quantum mechanical algorithm for database search. In: STOC ’96: Proceedings of the 28th annual ACM symposium on theory of computing. ACM, pp 212–219
Gurudev Dutt MV, Childress L, Jiang L, Togan E, Maze J, Jelezko F, Zibrov AS, Hemmer PR, Lukin MD (2007) Quantum register based on individual electronic and nuclear spin qubits in diamond. Science 316:1312–1316
Haroche S, Raimond J-M (2006) Exploring the quantum: atoms, cavities, and photons. Oxford University Press, Oxford
Harrow AW, Hassidim A, Lloyd S (2009) Quantum algorithm for linear systems of equations. Phys Rev Lett 103(15):150502
Hu W, Yang Y, Xia W, Pi J, Huang E, Zhang X-D, Xu H (2021) Performance of superconducting quantum computing chips under different architecture designs. arXiv:2105.06062
Ikonen J, Goetz J, Ilves J, Keränen A, Gunyho AM, Partanen M, Tan KY, Hazra D, Grönberg L, Vesterinen V, Simbierowicz S, Hassel J, Möttönen M (2019) Qubit measurement by multichannel driving. Phys Rev Lett 122(8):080503
Inagaki T, Haribara Y, Igarashi K, Sonobe T, Tamate S, Honjo T, Marandi A, McMahon PL, Umeki T, Enbutsu K, Tadanaga O, Takenouchi H, Aihara K, Kawarabayashi K, Inoue K, Utsunomiya S, Takesue H (2016) A coherent Ising machine for 2000-node optimization problems. Science 354:603–606
Indrajeet S, Wang H, Hutchings MD, Taketani BG, Wilhelm FK, LaHaye MD, Plourde BLT (2020) Coupling a superconducting qubit to a left-handed metamaterial resonator. Phys Rev Appl 14(6):064033
IQM Quantum Engineers (2021) Closing the gaps in quantum computing: co-development and co-design. Digitale Welt 5(2):85–93
Kelly J, Barends R, Fowler AG, Megrant A, Jeffrey E, White TC, Sank D, Mutus JY, Campbell B, Chen Y, Chen Z, Chiaro B, Dunsworth A, Hoi I-C, Neill C, O’Malley PJJ, Quintana C, Roushan P, Vainsencher A, Wenner J, Cleland AN, Martinis JM (2015) State preservation by repetitive error detection in a superconducting quantum circuit. Nature 519:66–69
Kimble HJ (2008) The quantum internet. Nature 453:1023–1030
Knill E, Laflamme R, Milburn GJ (2001) A scheme for efficient quantum computation with linear optics. Nature 409:46–52
Koch J, Yu TM, Gambetta J, Houck AA, Schuster DI, Majer J, Blais A, Devoret MH, Girvin SM, Schoelkopf RJ (2007) Charge-insensitive qubit design derived from the Cooper pair box. Phys Rev A 76(4):042319
Kok P, Munro WJ, Nemoto K, Ralph TC, Dowling JP, Milburn GJ (2007) Linear optical quantum computing with photonic qubits. Rev Mod Phys 79(1):135–174
Krinner S, Lazar S, Remm A, Andersen CK, Lacroix N, Norris GJ, Hellings C, Gabureac M, Eichler C, Wallraff A (2020) Benchmarking coherent errors in controlled-phase gates due to spectator qubits. Phys Rev Appl 14(2):024042
Kurizki G, Bertet P, Kubo Y, Mølmer K, Petrosyan D, Rabl P, Schmiedmayer J (2015) Quantum technologies with hybrid systems. Proc Nat Acad Sci 112(13):3866–3873
La Guardia GG (2020) Quantum error correction: symmetric, asymmetric, synchronizable, and convolutional codes. Springer
Lacroix N, Hellings C, Andersen CK, Di Paolo A, Remm A, Lazar S, Krinner S, Norris GJ, Gabureac M, Heinsoo J, Blais A, Eichler C, Wallraff A (2020) Improving the performance of deep quantum optimization algorithms with continuous gate sets. PRX Quantum 1(2):020304
Ladd TD, Jelezko F, Laflamme R, Nakamura Y, Monroe C, O’Brien JL (2010) Quantum computers. Nature 464:45–53
Leibfried D, Blatt R, Monroe C, Wineland D (2003) Quantum dynamics of single trapped ions. Rev Mod Phys 75(1):281–324
Leppäkangas J, Braumüller J, Hauck M, Reiner J-M, Schwenk I, Zanker S, Fritz L, Ustinov AV, Weides M, Marthaler M (2018) Quantum simulation of the spin-boson model with a microwave circuit. Phys Rev A 97(5):052321
Li G, Shi Y, Javadi-Abhari AJ (2021) Software-hardware co-optimization for computational chemistry on superconducting quantum processors. arXiv:2105.07127
Lidar DA, Brun TA (eds) Quantum error correction. Cambridge University Press, Cambridge
Lloyd S (1996) Universal quantum simulators. Science 273:1073–1078
Lu B-H, Wu Y-C, Kong W-C, Zhou Q, Guo G-P (2021) Special-purpose quantum processor design. arXiv:2102.01228
Martin-López E, Laing A, Lawson T, Alvarez R, Zhou X-Q, O’Brien JL (2012) Experimental realization of Shor’s quantum factoring algorithm using qubit recycling. Nat Photon 6(11):773–776
McArdle S, Endo S, Aspuru-Guzik A, Benjamin SC, Yuan X (2020) Quantum computational chemistry. Rev Mod Phys 92(1):015003
Messinger A, Taketani BG, Wilhelm FK (2019) Left-handed superlattice metamaterials for circuit QED. Phys Rev A 99(3):032325
Molina A, Watrous J (2019) Revisiting the simulation of quantum Turing machines by quantum circuits. Proc R Soc A 475:20180767
Montanaro A (2016) Quantum algorithms: an overview. NPJ Quantum Inf 2(1):15023
Morello A, Pla JJ, Zwanenburg FA, Chan KW, Tan KY, Huebl H, Möttönen M, Nugroho CD, Yang C, van Donkelaar JA, Alves ADC, Jamieson DN, Escott CC, Hollenberg LCL, Clark RG, Dzurak AS (2010) Single-shot readout of an electron spin in silicon. Nature 467:687–691
Nakamura Y, Pashkin YuA, Tsai JS (1999) Coherent control of macroscopic quantum states in a single-Cooper-pair box. Nature 398:786–788
Nielsen MA, Chuang I (2000) Quantum computation and quantum information. Cambridge University Press, Cambridge
Nielsen MA, Chuang I (2002) Quantum computation and quantum information. Am J Phys 70(5):558–560
Nowack KC, Shafiei M, Laforest M, Prawiroatmodjo GEDK, Schreiber LR, Reichl C, Wegscheider W, Vandersypen LMK (2011) Single-shot correlations and two-qubit gate of solid-state spins. Science 333:1269–1272
Ofek N, Petrenko A, Heeres R, Reinhold P, Leghtas Z, Vlastakis B, Liu Y, Frunzio L, Girvin SM, Jiang L, Mirrahimi M, Devoret MH, Schoelkopf RJ (2016) Extending the lifetime of a quantum bit with error correction in superconducting circuits. Nature 536:441–445
Pan J-W, Chen Z-B, Lu C-Y, Weinfurter H, Zeilinger A, Żukowski M (2012) Multiphoton entanglement and interferometry. Rev Mod Phys 84(2):777–838
Pla JJ, Tan KY, Dehollain JP, Lim WH, Morton JJL, Jamieson DN, Dzurak AS, Morello A (2012) A single-atom electron spin qubit in silicon. Nature 489:541–545
Rosenblum S, Reinhold P, Mirrahimi M, Jiang L, Frunzio L, Schoelkopf RJ (2018) Fault-tolerant detection of a quantum error. Science 361:266–270
Schindler P, Barreiro JT, Monz T, Nebendahl V, Nigg D, Chwalla M, Hennrich M, Blatt R (2011) Experimental repetitive quantum error correction. Science 332:1059–1061
Shankar S, Hatridge M, Leghtas Z, Sliwa KM, Narla A, Vool U, Girvin SM, Frunzio L, Mirrahimi M, Devoret MH (2013) Autonomously stabilized entanglement between two superconducting quantum bits. Nature 504:419–422
Shor PW (1994) Algorithms for quantum computation: discrete logarithms and factoring. In: Proceedings 35th annual symposium on foundations of computer science. IEEE, pp 124–134
Shor PW (1997) Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM J Comput 26(5):1484–1509
Somaroo S, Tseng CH, Havel TF, Laflamme R, Cory DG (1999) Quantum simulations on a quantum computer. Phys Rev Lett 82(26):5381–5384
Somma RD, Ortiz G, Knill EH, Gubernatis J (2003) Quantum simulations of physics problems. In: Donkor E, Pirich AR, Brandt HE (eds) Quantum information and computation, vol 5105 Proceedings of the SPIE. SPIE
Taminiau TH, Cramer J, van der Sar T, Dobrovitski VV, Hanson R (2014) Universal control and error correction in multi-qubit spin registers in diamond. Nat Nanotech 9(3):171–176
Tan KY, Partanen M, Lake RE, Govenius J, Masuda S, Möttönen M (2017) Quantum-circuit refrigerator. Nat Commun 8(1):15189
Tang E (2019) A quantum-inspired classical algorithm for recommendation systems. In: STOC 2019: Proceedings of the 51st annual ACM SIGACT symposium on theory of computing. ACM, pp 217–228
Terhal BM (2015) Quantum error correction for quantum memories. Rev Mod Phys 87(2):307–346
The IBM Quantum heavy hex lattice (2021) Technical note. heavy-hex-lattice#fnref-6
Vandersypen LMK, Steffen M, Breyta G, Yannoni CS, Sherwood MH, Chuang IL (2001) Experimental realization of Shor’s quantum factoring algorithm using nuclear magnetic resonance. Nature 414:883–887
Veldhorst M, Yang CH, Hwang JCC, Huang W, Dehollain JP, Muhonen JT, Simmons S, Laucht A, Hudson FE, Itoh KM, Morello A, Dzurak AS (2015) A two-qubit logic gate in silicon. Nature 526:410–414
Waldherr G, Wang Y, Zaiser S, Jamali M, Schulte-Herbrüggen T, Abe H, Ohshima T, Isoya J, Du JF, Neumann P, Wrachtrup J (2014) Quantum error correction in a solid-state hybrid spin register. Nature 506:204–207
Wendin G (2017) Quantum information processing with superconducting circuits: a review. Rep Prog Phys 80(10):106001
Yan F, Krantz P, Sung Y, Kjaergaard M, Campbell DL, Orlando TP, Gustavsson S, Oliver WD (2018) Tunable coupling scheme for implementing high-fidelity two-qubit gates. Phys Rev Appl 10(5):054062
Yao A (1993) Quantum circuit complexity. In: Proceedings of 1993 IEEE 34th annual foundations of computer science. IEEE, pp 352–361

Contribution of Scientific Computing in European Research and Innovation for Greening Aviation
Dietrich Knoerzer

Abstract The aerospace industry has been and still is one of the industries worldwide that develops and applies scientific computing in design and production. In Europe, aerospace companies and research institutes have jointly developed computational methods since the 1980s. The increasing performance of computer technology has contributed to the development of complex numerical tools. The application of new numerical methods has not only enhanced the design and performance of aircraft, but since the early 1990s, it has also promoted research activities that reduce the environmental impact of aviation. Scientific computational methods play a significant role in developing greening technologies to reduce drag, aeroengine emissions, and aircraft noise. This paper describes the development and the application of scientific computation methods and their contribution to the greening technologies of aeronautics in research projects under the Research Framework Programmes of the European Union since 1990. It focuses on projects addressing the external flow around the aircraft, although the internal flows of turbomachinery and combustion have also been significantly improved by enhanced computational methods. Finally, some perspectives on future developments are given.

Keywords Numerical simulation, Drag reduction, Flow control, Aircraft noise reduction, European research projects, Computational methods

1 Introduction

In the 50 years to 2019, global air transport grew by an average of more than 5% per year, despite all kinds of global and regional crises. Reducing fuel consumption has long been a driving force in improving aircraft with higher payloads and longer range. In the late 1980s, the first environmental restrictions were imposed on civil aviation, beginning with noise and unburned particles (e.g., soot).

D. Knoerzer (B) Independent Aeronautics Consultant, Brussels, Belgium e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053,



In the 1990s, the increase in nitrogen oxides (NOx) in exhaust gases was a cause for concern, as NOx contributes to the destruction of the protective ozone layer in the upper atmosphere. Worldwide concerns about global warming due to increasing CO2 emissions challenged the aircraft industry to counter this negative impact by reducing drag through improved aerodynamics, by reducing weight through advanced lightweight airframe structures made of materials such as Carbon Fibre Reinforced Plastic (CFRP), and by improving combustion technologies. Because national resources were often fragmented and insufficient, the civil aviation industry produced aircraft and aircraft engines in international collaboration. Thus, joint technological development to tackle the challenges became an obvious next step. A new initiative of the European Union’s (EU) research Framework Programme (FP) offered a solution for international research cooperation. At the beginning of 1989, a pilot programme for specific aeronautical research was launched under the 2nd FP, following careful preparation by the European aeronautics industry, national research establishments and the European Commission. In addition to technologies that enhance safety and efficient production methods, environmental topics such as reducing aircraft drag and aviation noise, and improving engine efficiency, found their place in FP2’s aeronautics work programme. The development of computational methods for grid generation, code validation and design optimisation (Haase et al. 1993; Weatherill et al. 1993; Periaux et al. 1997) to support the design process was also a priority of the first European aeronautics research programme, which included 28 projects between 1989 and 1992. Figure 1 illustrates the state of Computational Fluid Dynamics (CFD) tools at that time, using the example of the NLR 7301s flapped airfoil. In FP3 (1992–1994), the extension of the pilot phase of aeronautics took place as part of the Industrial and Materials Technology Programme.
This allowed research to continue in key areas of aeronautical technology. Since FP4 (1994–1998), the budget for aeronautics has been higher than in previous framework programmes. Greening technologies in aviation have become an increasing priority for research projects. Since FP7 (2007–2013), Clean Sky, a public-private partnership between the European Commission and the European aeronautics industry, has performed large-scale demonstrations and research up to a high Technology Readiness Level (TRL). In addition, technology for a single European airspace has been developed and validated in the Single European Sky Air Traffic Management Research Joint Undertaking (SESAR JU). Within the 9th Framework Programme ‘Horizon Europe’ (2021–2027), the European Partnership for Clean Aviation became the follow-on to Clean Sky in 2021.

Fig. 1 NLR 7301s flapped airfoil at high angle of attack (AoA): M = 0.185, Re = 2.5 × 10^6, α = 13.1° (EUROVAL project, Haase et al. 1993)

This paper discusses in more detail some examples of the use of computational methods to develop greener aeronautical technology. Previously, numerical simulation was normally closely related to experimental investigations. In the 1990s, computational methods were intended to describe the physical behaviour of a flow, whereas later they were able to predict quite accurately the behaviour of a flow around complex geometries. Two greening areas will be addressed with a focus on scientific computing:

1. Laminar flow and flow control to reduce aircraft drag, including high-lift,
2. Low noise technologies and aeroacoustic research to reduce aircraft noise.

Section 4 deals with the development of numerical methods, in particular the contribution of European collaboration projects to this development within the EU Research Framework Programmes FP2 to FP7 and Horizon 2020 (2014–2020).

2 Technologies to Reduce Aircraft Drag

In 2011, Europe’s vision for aviation ‘Flightpath 2050’ (Flightpath 2050) set a target of reducing CO2 emissions by 75% by 2050 compared to aircraft emissions in 2000. This can be achieved by reducing aircraft drag and weight, improving aero-engine efficiency or introducing new propulsion systems, and enhancing flight operations. Reducing aircraft drag can provide the biggest share and represents a win-win situation: it reduces fuel consumption and thus CO2 emissions, and for operators, reduced fuel demand means longer range or more payload on each flight.
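The link between drag and range can be illustrated with the Breguet range equation for a jet aircraft in steady cruise, R = (V / c) (L/D) ln(W_initial / W_final). This textbook relation is not used in the chapter itself and is brought in here only as an illustration. All numbers in the sketch below are assumed values chosen for plausibility, not data from the projects discussed.

```python
import math


def breguet_range(speed, tsfc, lift_to_drag, weight_ratio):
    """Breguet range in metres for a jet aircraft in steady cruise.

    speed         cruise speed in m/s
    tsfc          thrust-specific fuel consumption in 1/s
                  (fuel weight flow per unit thrust)
    lift_to_drag  cruise lift-to-drag ratio L/D
    weight_ratio  initial weight divided by final weight
    """
    return speed / tsfc * lift_to_drag * math.log(weight_ratio)


# Illustrative assumptions only.
baseline = breguet_range(speed=230.0, tsfc=1.6e-4, lift_to_drag=18.0, weight_ratio=1.3)

# 5 % less drag at unchanged lift raises L/D by a factor of 1 / 0.95.
improved = breguet_range(speed=230.0, tsfc=1.6e-4, lift_to_drag=18.0 / 0.95, weight_ratio=1.3)

print(f"baseline range: {baseline / 1000:.0f} km")
print(f"improved range: {improved / 1000:.0f} km")
print(f"relative gain:  {improved / baseline - 1:.2%}")
```

Because range is proportional to L/D in this model, a given percentage of drag reduction translates almost one-to-one into additional range, or alternatively into less fuel burned over a fixed distance.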

2.1 Drag Reduction by Laminar Flow and Flow Control

Laminar flow around the wing sections and empennage is the single most effective technique for reducing frictional drag. Natural laminar flow (NLF) is mainly influenced by wing shape and surface quality. Hybrid Laminar Flow Control (HLFC), in turn, regulates the flow by suction to support natural laminar flow. Because the use of HLFC is very complex, natural laminar flow is the preferred option when physically possible. Propeller-driven aircraft, such as the World War II fighter P-51 Mustang or the four-seater Mooney, have flown and still fly successfully on wings with large natural laminar areas. Transferring this technology to the transonic wing of modern jet airliners is not easy because three types of boundary layer instability cause the transition from low-drag laminar flow to turbulent flow:

1. Tollmien-Schlichting Instability (TSI),
2. Cross-Flow Instability (CFI),
3. Attachment Line Transition (ALT).

The laminar flow limits are illustrated in Fig. 2. As the figure shows, at high Reynolds numbers (Re-number) Tollmien-Schlichting instability appears across the wing. Cross-flow instabilities can occur along the leading edge of the swept wing of a jet aircraft. At the larger sweep angles used in most modern airliners, the transition of the attachment line to turbulent flow is a major barrier to the use of laminar flow (Schrauf 2005).

Fig. 2 Laminar flow limits influenced by Re-number and leading edge sweep angle: Re-range of flight and S1MA wind tunnel tests (Source Schrauf (2005))

Predicting the transition from the laminar boundary layer to turbulent flow is one of the biggest challenges. A useful tool here is the e^N method, a semi-empirical method that Jan van Ingen developed at TU Delft in 1956, long before computers were widely available (van Ingen 2008). Flow calculations applying the Euler equations used in the 1980s were unable to predict the transition. Not even simulations based on the Navier-Stokes equations were able to provide the transition point.

At national and European level, a number of flight and large-scale wind tunnel experiments were carried out from 1989 to 2018. The DLR test aircraft VFW 614 ATTAS performed flights with natural laminar flow on the glove of the right wing (see Fig. 3). In 1992 and 1993, DLR, together with Rolls-Royce, conducted flight tests on the left engine nacelle of the VFW 614 ATTAS, first with natural laminar flow and later with hybrid laminar flow control using suction through the perforated nacelle surface. Numerical calculations were then performed using Euler codes: CEVCATS (DLR) for multi-block grids and FLITE (Rolls-Royce) for 3D unstructured grids (see Fig. 4) (Mullender and Riedel 1996).

Fig. 3 DLR test aircraft VFW 614 ATTAS performed flight tests in 1989/1990 with a glove for natural laminar flow on the right wing and also with a hybrid laminar flow nacelle (Source DLR)

Fig. 4 Flight experiments on the nacelle of the left engine of the VFW 614 ATTAS (Source Mullender and Riedel (1996))

At the end of 1989, within FP2, the European project ELFIN launched a series of large-scale studies on NLF and HLFC technologies relevant to industry. Two major experiments were conducted at high Mach and Re-numbers. NLF flight tests were performed with a large glove on the right wing of the Fokker 100 prototype (Fig. 5).


D. Knoerzer

Fig. 5 Calculated pressure distribution without and with the NLF glove on the starboard wing of the Fokker 100 (ELFIN I project, Source Airbus)

A high Re-number wind tunnel test investigated HLFC with a suction element on an ATTAS wing in a large 1:2 transonic wind tunnel ONERA S1MA. The design of the glove and suction section was done with Reynolds-averaged Navier-Stokes (RANS) solvers, and the e^N method was used to predict the transition. At the same time, this project and its follow-on projects conducted more systematic research on the use of RANS and even some investigations on direct numerical simulations (DNS). However, the computational capabilities were still limited at that time. In the follow-on projects ELFIN II, LARA, HYLDA and HYLTEC (within FP3 and FP4), additional laminar flow tests were performed with the support of numerical tools, thus developing the technology. As a first step towards the in-flight demonstration of high Reynolds number hybrid laminar flow in Europe, the FP4 project HYLDA (1996–2000) investigated the feasibility of a wing glove and a hybrid laminar flow. Airbus conducted flight tests with the A320 HLFC fin outside the project. With financial support from the project, however, operational aspects such as de-icing of suction surfaces and the influence of surface roughness were investigated. Various tasks were supported by suction surface design and data analysis activities. The work done in HYLDA became an integral part of a strategic long-term industrial approach to developing hybrid laminar flow technology for large transport aircraft (Re > 30 million, Ma > 0.8). In addition, detailed cross-functional strategy development and multidisciplinary design work were performed. The main objectives of the first European laminar flow projects ELFIN I and II, LARA and HYLDA were to obtain a better understanding of the aerodynamic phenomena affecting the stability of the boundary layer and to develop and verify mathematical tools for transition prediction. 
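The idea behind the e^N method mentioned above can be illustrated with a minimal sketch. It assumes the usual textbook form, in which the N-factor is the streamwise integral of the spatial amplification rate of a single disturbance frequency and transition is taken where N first reaches an empirically calibrated critical value. The function names, the constant growth rate and the critical value of 9 are illustrative assumptions, not data or tools from the projects described here. A real application would take growth rates from a stability code and use the envelope over many frequencies.

```python
def n_factor(x, growth_rate):
    """Cumulative N(x): trapezoidal integral of the spatial growth rate.

    x           -- streamwise stations, starting at the neutral point (N = 0)
    growth_rate -- amplification rate (-alpha_i) at each station, in 1/length
    """
    n = [0.0]
    for k in range(1, len(x)):
        dx = x[k] - x[k - 1]
        n.append(n[-1] + 0.5 * (growth_rate[k - 1] + growth_rate[k]) * dx)
    return n


def transition_location(x, n, n_crit):
    """First station where N reaches n_crit (linear interpolation), else None."""
    for k in range(1, len(x)):
        if n[k - 1] < n_crit <= n[k]:
            t = (n_crit - n[k - 1]) / (n[k] - n[k - 1])
            return x[k - 1] + t * (x[k] - x[k - 1])
    return None


# Hypothetical input: constant growth rate of 20 per unit chord length.
x = [i / 100.0 for i in range(101)]
rate = [20.0 for _ in x]
n = n_factor(x, rate)

# Hypothetical critical N-factor; real values are calibrated against experiments.
x_tr = transition_location(x, n, n_crit=9.0)
```

With these made-up inputs N grows linearly as 20x, so the predicted transition lies at 45% chord. A lower critical N-factor, as would be calibrated for a noisier environment, moves the predicted transition upstream.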
The follow-on project HYLTEC (1998–2001) performed the first investigations on issues related to HLFC manufacturing and systems (Hybrid laminar flow technology 2002). But despite all the progress in laminar flow technology, it became clear that the relevant technologies were not mature enough for industrial application in the short to medium term. The goal of HYLTEC’s strategy was to improve the handling of practical problems with HLFC-equipped aircraft. Several problem areas had been identified where knowledge of the industrial HLFC application was not yet sufficient.



The HYLTEC project consisted of three main work packages, which were supported by computational methods in the simulation: 1. Manufacturing, systems and operational issues that caused problems or risks in integrating laminar flow control into production aircraft, 2. Retrofitting requirements for the application of HLFC to aircraft, 3. Measurements for problems related to icing, contamination and HLFC retrofit restrictions. Work Package 1 required the fabrication of leading edges with porous suction surfaces to very small dimensional tolerances, and the maintenance of these tolerances throughout the operational life of the aircraft. In Work Package 2, new production aircraft and their derivatives were developed without up-front application of laminar flow. The purpose of the work package was to establish a general baseline approach for the design of an HLFC retrofit. The FP5 project ALTTA (2000–2003) was the last European project to advance hybrid laminar flow technology to reduce the aircraft drag by delaying the transition from laminar to turbulent flow. ALTTA also contributed to the design of HLFC applications, improving transition prediction techniques by replacing database methods with linear and non-linear stability codes. Due to relatively low crude oil prices in the late 1990s, aircraft manufacturers’ interest in the improvements declined. In particular, the expected additional complexity associated with them (e.g., airframe surface quality requirements and the possible suction system to be installed) did not seem attractive to airlines, given the limited savings in fuel costs. After 2000, with rising oil prices and the growing discussion about global warming and the need to reduce greenhouse gases such as CO2, Europe began to develop new technology to reduce drag through laminarization. Airlines were not interested in HLFC systems because HLFC could lead to greater complexity and increase operational maintenance. 
Therefore, the research focused on NLF, which aimed to push the physical limits towards the operating range of transonic airliners, although its drag reduction potential was lower than that of HLFC. The TELFONA project (2005–2009) designed the NLF transonic Pathfinder wing and tested it in cruising conditions in the cryogenic European Transonic Windtunnel (ETW). Numerical design was performed with a RANS solver using the one-equation turbulence model of Spalart and Allmaras (Fig. 6). The numerical tools developed and the design knowledge collected in the TELFONA project were applied within Smart Fixed Wing Aircraft, part of the Joint Technology Initiative (JTI) Clean Sky, to design the ‘BLADE’ laminar flow wing demonstrator. The wide-body Airbus A340-001 became a flight demonstrator by modifying the outer wings (Fig. 7). The NLF wing with a shielding Krueger high-lift device at the leading edge was given a reduced sweep angle to avoid cross-flow instability. In addition to the aerodynamic design of the high-lift system, two different structures made of CFRP were produced to provide the smooth surface quality required to keep the boundary layer laminar for as long as possible. Airbus successfully performed



Fig. 6 Pressure distribution of NLF Pathfinder wing calculated with RANS code TAU in two span positions for two lift coefficients (TELFONA project Streit et al. (2011))

Fig. 7 Flight test on an Airbus A340-001 with NLF outer wings at reduced sweep angle in 2017 (Airbus BLADE project)

flight tests on an aircraft with modified NLF outer wings in 2017–2018. The drag reduction potential of the NLF wing is estimated to be about 8% of the total aircraft drag, which represents a remarkable contribution to the needed CO2 reduction. It is now up to the aircraft manufacturers to implement NLF technology in the design of the next civil passenger aircraft.



2.2 High Lift Design for Transonic Wings

Advanced civil transport aircraft need high-lift systems that can provide the required performance on take-off and landing. A special challenge is the integration of a high-lift system into a laminar wing structure with minimal drag. It took some time before numerical RANS codes could calculate complex 3D high-lift configurations (Wild and Schmidt 2004). The EUROLIFT project (2000–2003) developed the know-how for the efficient design of a high-lift system and the necessary numerical and experimental tools. The EUROLIFT research used the cryogenic test facility ETW at a very high Re-number (Re = 20 × 10^6) to bridge the gap between sub-scale testing and flight conditions and to understand and incorporate scaling effects. It analysed experimental and numerical results based on structured and unstructured RANS calculations and combined them to improve the understanding of high-lift flow physics. The follow-on project EUROLIFT II validated numerical and theoretical methods for the exact prediction of complete aircraft aerodynamics in a high-lift configuration at flight Re-numbers. Numerical and experimental analysis of the physical interaction of aerodynamic effects dominated by different vortices was performed. The generic commercial aircraft configuration DLR F11 was used in various high-lift settings. High-end design tools with DLR’s RANS codes FLOWer (structured) and TAU (unstructured) were used together with experiments in the cryogenic wind tunnel ETW. The results show excellent agreement between numerical solutions and experimental data. In FP7, the EU project DeSiReH (2009–2013) focused on numerical design tools and experimental measurement techniques with tests under cryogenic conditions in ETW (Fig. 8). The objective was to improve the industrial design process of laminar wings in terms of product quality, efficiency and reduced development costs. The work concentrated on the design of high-lift devices. 
DeSiReH performed a complete high lift design for the NLF wing (Iannelli et al. 2013).

Fig. 8 Grid of DLR F11 wing in high-lift configuration inside the ETW wind tunnel for detailed CFD analysis based on the RANS/URANS equations (left); surface grid and cuts through a prismatic grid (right) (DeSiReH project Strüber and Wild (2014))



The UHURA project (2018–2022) addressed unsteady high-lift aerodynamics and worked to validate unsteady numerical simulations using RANS methods. In particular, it considered the validation of numerical simulation methods to predict unsteady aerodynamics and dynamic loads during the deployment and retraction phases of a high-lift system. Here, it specifically sought to measure the unknown aerodynamic characteristics of the slotted Krueger leading edge device during deployment and retraction.

2.3 Research on Shock Wave Boundary Layer Interaction

In the design of supercritical transonic wings, the interaction between the shock wave and the boundary layer is a critical problem that can lead to increased drag due to flow separation. The European project EUROSHOCK and its successor EUROSHOCK II tried to reduce drag by controlling the shock boundary layer interaction either passively by circulating through the surface cavity or actively (as a hybrid) by a combination of recirculation and suction. In these projects, numerical methods and experimental tests complemented each other. At the beginning, the experiments were supported by numerical calculations, but over the years, flow behaviour could be predicted by numerical methods, which were validated by relevant experiments. Investigations on the shock-wave boundary layer interaction in 1993–1999 were published in two books (Stanewsky et al. 1995, 2002). In the case of the transonic laminar wing in particular, the control of the shock position and its interaction with the laminar boundary layer is a critical design issue. Work on the interaction between the shock wave and the boundary layer continued within FP6 and FP7 in the UFAST and TFAST projects. UFAST investigated the unsteady effects of the separation caused by shock waves in numerous different experiments and related numerical calculations using Unsteady Reynolds-averaged Navier-Stokes (URANS) approaches and the Delayed Detached Eddy Simulation (DDES) technique. The results of the project were published by Springer (Doerffer et al. 2011) and the experimental database by the coordinating institute IMP-PAN in Gdansk, Poland, which also coordinated the follow-on project TFAST. In the TFAST project, a large European research community investigated the effects of the transition location on the shock wave boundary layer interaction through experiments and numerical simulation. 
The main objective was to minimise the negative impact of the shock on the laminar boundary layer at the end of the transonic areas. Several experiments were designed with proactive CFD work to combine the experiment and the CFD as closely as possible and to facilitate interaction. In 2021, the results were published in the Springer book of the NNFM series (Doerffer et al. 2021). An example of a DDES calculation is shown in Fig. 9. The TEAMAero project, launched in 2020 within Horizon 2020, is developing effective flow control methods to mitigate shock effects in aeronautical applications. The TEAMAero project focuses on improving a fundamental understanding of the physics of shock wave boundary layer interaction (SBLI), including



Fig. 9 3D visualisation of the DDES calculation of the V2C wing (Dassault Aviation) using an isosurface of the swirling strength criterion at Ma = 0.7, AoA = 7°, Re = 3 × 10^6 (TFAST project Doerffer et al. (2021))

three-dimensionality and unsteadiness, and the development of flow control schemes using wall transpiration (suction/blowing), vortex generators, and surface treatments that delay separation. In addition, novel numerical methods for improved prediction of SBLI effects are being developed.

3 Aeroacoustic Research to Reduce Aircraft Noise

At the latest when civil jet aircraft were introduced, aircraft noise became a problem in society, especially around large airports. With the increase in the bypass ratio, jet noise could be significantly reduced. However, the steadily growing number of civil aviation flights has led to aircraft noise remaining a continuous problem in aviation.

3.1 Low-Noise Technologies for Propulsion Systems

In the 1980s and 1990s, fast propeller-driven commuter aircraft were developed whose propeller noise became an environmental concern. Within FP4, several commuter aircraft manufacturers launched a research project SNAAP (1993–1996), which investigated the noise behaviour of a fast propeller (up to Mach 0.7). In addition



Fig. 10 2.5D simulation of the DLR-Berlin low speed fan with elsA code using LES (PROBAND project Enghardt (2009))

to large-scale wind tunnel tests, aerodynamic and acoustic codes were developed on the theoretical side. Based on Euler techniques, essential input data were provided for the acoustic code. The acoustic codes were intended to comprise several modules for different acoustic sources. Techniques for identifying acoustic sources from experimental data were developed and codes were validated and refined with test data. Within FP5, the PROBAND project (2005–2008) addressed aeroengine fan noise, which often causes high noise levels, especially during take-off and climb. In addition to scale experiments, advanced CFD methods such as Large Eddy Simulation (LES) and Detached Eddy Simulation (DES) were used to predict noise attenuation potential. To predict fan broadband noise, a 2.5D simulation of the DLR-Berlin low-speed fan was performed with the ONERA elsA code using LES (Fig. 10). Figure 10 shows the interaction between the blades and the wings (Enghardt 2009). Predictions were validated against DLR test results and showed reasonable agreement. It has been known for many years that the fuel consumption of counter-rotating open-rotor aero engines can be as much as 30% lower than that of classic turbojet engines. The implementation of these advanced engines was previously hindered by their high noise and vibration levels. Using the advanced computational methods of CFD and Computational Aero-Acoustics (CAA), the improved design of counter-rotating blades could reduce critical noise and vibration to a level that would allow future civil operations. Within JTI Clean Sky 2, the SCONE project performed broadband noise simulations of a counter-rotating open rotor using reduced order modelling (Kors 2004) (Fig. 11). In 2017, SAFRAN performed related full-scale demonstration tests in its engine test centre in Istres, France.



Fig. 11 High quality chorochronic LES for open rotor: adimensional density × axial velocity (Clean Sky 2, Source CERFACS Good vibrations (2021))

3.2 Airframe Noise Reduction Research

When engine noise could be reduced, airframe noise often became the dominant source of noise, especially during approach and landing. A number of European research projects have addressed this issue through large-scale wind tunnel experiments and flight tests. In addition, CAA prediction methods were developed that became increasingly reliable and precise. The RAIN, SILENCE(R) and OPENAIR projects focused on airframe noise from landing gear, including doors, and high-lift systems as well as engines and their installation. Within FP4, the RAIN project (1998–2001) investigated, amongst other things, the full-scale triple main landing gear, as it was later used in the Airbus A380. The tests were performed in a large acoustic test section of the German-Dutch Wind Tunnels (DNW). Various sources of noise could be identified and measures taken to reduce the noise they cause. Within FP5, the large SILENCE(R) project (2001–2007) investigated all relevant engine and airframe noise reduction technologies to achieve a total noise reduction of 5 dB, which is half of ACARE’s 2020 goal (Kors 2004). Among the numerous tests, it performed a series of low-altitude flight tests on an Airbus four-engine test aircraft A340-001 to measure noise sources (Fig. 12). Large sensitive microphone arrays were installed on the surface of the airfield to measure noise from the airframe and engines during the overflight. The follow-on project OPENAIR (FP7) sought to optimise low-noise aircraft at a higher TRL than SILENCE(R). It could further reduce noise by 2.5 dB. The technology demonstration was planned in the Clean Sky Joint Technology Initiative. Although acoustic experiments and testing played a dominant role in these projects, numerical methods provided a key contribution to the simulation and prediction of flow behaviour and related noise sources. 
To this end, as the power of computers increased, researchers used increasingly complex computer models of how air flows and how aerodynamic forces interact with surfaces, causing noise and vibration. In CFD, the RANS and URANS codes were used and the Ffowcs Williams and Hawkings (FW-H) equations were applied to the acoustic predictions.
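The decibel figures quoted in this section are easier to judge with the logarithmic scale written out. The sketch below is plain decibel arithmetic, under the assumption that the quoted reductions are read as simple level differences. Certification noise metrics are more involved, so this is an illustration of the scale only, and the source levels used are made up.

```python
import math


def power_ratio_from_db(reduction_db):
    """Fraction of acoustic power remaining after a level reduction in dB."""
    return 10.0 ** (-reduction_db / 10.0)


def combine_levels_db(levels_db):
    """Total level of several incoherent sources, each given in dB."""
    total_power = sum(10.0 ** (level / 10.0) for level in levels_db)
    return 10.0 * math.log10(total_power)


# Remaining power fraction for the reductions quoted in the text.
remaining_after_5_db = power_ratio_from_db(5.0)     # SILENCE(R) target
remaining_after_7_5_db = power_ratio_from_db(7.5)   # plus the further 2.5 dB

# Hypothetical example: engine and airframe equally loud, then engine quietened.
both_equal = combine_levels_db([90.0, 90.0])
engine_quieter = combine_levels_db([80.0, 90.0])
```

In the made-up example, lowering the engine level by 10 dB reduces the total by only about 2.6 dB, because the unchanged airframe source then dominates. This is the effect described at the start of this section.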



Fig. 12 In-flight measurements of airframe noise on an Airbus A340-001 with ground-based microphone arrays in Tarbes, France (SILENCE(R) project Kors (2004))

On the computational prediction side, within FP7, the VALIANT project (2009–2013) addressed the validation and improvement of numerical airframe noise prediction tools. Numerous partners focused on broadband noise generated by the turbulent flow around the airframe. VALIANT chose four key generic test cases that represent the major broadband airframe noise mechanisms associated with multiple body interactions (VALidation and Improvement of Airframe Noise prediction Tools (VALIANT) 2013). One of the test cases was wing flap simulations (Fig. 13), for which measurements were performed in an anechoic wind tunnel at the Ecole Centrale de Lyon. Based on the unsteady flow field data obtained, DLR, NTS, LMS and TU Berlin carried out several far-field noise predictions. Both near and far field results were compared with experimental data. The teams involved in the near-field simulations used different numerical methods. The predominant sound sources were either presented directly using incompressible LES (Von Karman-Institute for Fluid Dynamics, VKI) that resolves turbulence, or they were reconstructed based on time-averaged compressible RANS solutions of turbulent mean flows using a synthetic RPM model (DLR). In a two-strut test case, the partner CIMNE performed CFD simulations using a variational multi-scale technique. The inhomogeneous Helmholtz equation was solved to compute the far-field acoustic pressure. The Russian partner, the Institute for Mathematical Modelling of the Russian Academy of Sciences, used the hybrid Detached Eddy Simulation (DES) method for numerical representation of compressible viscous flow, while aeroacoustic analysis was based on the FW-H formulation.



Fig. 13 Visualization of the IDDES solution for the wing-flap in the case of turbulence as a source of broadband noise: 3D view of swirl iso-surface for accurate noise predictions (VALIANT project VALidation and Improvement of Airframe Noise prediction Tools (VALIANT) (2013))

NUMECA International performed LES simulations and completed an aeroacoustic analysis based on the FW-H formulation. TU Berlin performed CFD simulations using the Delayed DES technique (DDES) and completed an aeroacoustic analysis based on the FW-H formulation. The numerical near and far field results obtained by different partners were compared with experimental data (Koloszar 2012). Numerous European aeroacoustic projects were well connected through the European network X-NOISE (1998–2015). The X-NOISE network served as an efficient platform for European noise projects for information exchange or thematic workshops. The goal of the X-NOISE cluster was to create synergies between the activities of the involved aeroacoustic projects (Collin 2016). Within Horizon 2020, the DJINN project (2020–2023) aims to decrease jet installation noise by developing a new generation of hybrid CFD methods to assess promising noise reduction technologies. Advanced tools for coupled aerodynamics and aeroacoustics are provided for this purpose. The goal is to reduce the peak noise level of the interaction between the jet and the airframe by up to 5 dB at low frequencies. Improved CFD methods for multi-physical modelling using high performance computing are expected to reduce design times and costs by about 25% (Decrease jet-installation noise 2020). The DJINN project formed a scientific cluster with the INVENTOR and ENODISE projects, which complement each other in their activities and have common partners (Fig. 14). INVENTOR (2020–2024) addresses the design of installed airframe components to reduce aircraft noise. ENODISE (2020–2024) is working on disruptive airframe and propulsion integration concepts. 
Together with the Horizon 2020 ARTEM project, seven research establishments that are members of EREA have worked with leading European universities and major entities in the European aerospace industry to develop promising novel concepts and methods for new low noise and disruptive 2035 and 2050 aircraft configurations. Noise reduction technologies included modelling the future aircraft configurations as a blended wing body (BWB) and other innovative concepts with integrated engines or distributed electric propulsion. The impact of those new configurations with low noise technology has been assessed using a variety of methods, including industrial tools. ARTEM implemented a holistic approach to reducing noise from future aircraft (Aircraft noise Reduction Technologies and related Environmental iMpact 2020).



Fig. 14 Scientific cluster of DJINN, INVENTOR and ENODISE projects on aero-acoustic airframe-engine installation problems (Source DJINN)

4 Computational and Numerical Methods in Aeronautics

When the first operational computers appeared about sixty years ago, the aerospace industry was one of the first users, both in the calculation of lightweight structures by finite element methods (FEM) and in fluid mechanics calculations to optimise lift and drag. The first panel methods were used for aerodynamic shape design, which could originally be calculated analytically. As computer power increased, codes were developed for the frictionless Euler equations. Friction terms were introduced to capture drag. However, the flow behaviour could be captured more accurately with Reynolds averaged Navier-Stokes (RANS) solvers. These were already used in scientific computing, but for complex industrial applications they could only be used when sufficient computer power was available and pre- and postprocessing could be handled with reasonable effort.
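The averaging that gives RANS solvers their name can be shown on a small example. The sketch below applies the Reynolds decomposition, which splits a velocity signal into a mean and a fluctuation, and evaluates the correlation of the fluctuations that appears as the Reynolds stress in the averaged equations and has to be supplied by a turbulence model. The signals are synthetic and the function names are illustrative assumptions. No solver mentioned in this chapter is implied.

```python
import math


def reynolds_decompose(samples):
    """Split a time series into its mean and the fluctuation about that mean."""
    mean = sum(samples) / len(samples)
    return mean, [s - mean for s in samples]


def reynolds_shear_stress(u, v, rho=1.0):
    """Return -rho * <u'v'>, the shear component of the Reynolds stress."""
    _, u_fluct = reynolds_decompose(u)
    _, v_fluct = reynolds_decompose(v)
    correlation = sum(a * b for a, b in zip(u_fluct, v_fluct)) / len(u)
    return -rho * correlation


# Synthetic, made-up signals sampled uniformly over one full period.
n = 1000
t = [2.0 * math.pi * k / n for k in range(n)]
u = [10.0 + math.sin(tk) for tk in t]    # streamwise velocity
v = [-0.5 * math.sin(tk) for tk in t]    # wall-normal velocity, anti-correlated

u_mean, u_fluct = reynolds_decompose(u)
tau_xy = reynolds_shear_stress(u, v)
```

Here the mean velocity is 10 and the fluctuations average to zero, yet their product does not: the shear stress comes out as 0.25 for unit density. A RANS solver only computes the mean field, so this correlation is exactly the quantity that turbulence models have to approximate.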

4.1 Development of Reynolds Averaged Navier-Stokes (RANS) Flow Solvers

In the nineties, the RANS codes were developed mostly by research centres but also by industry and universities. In Germany, the national CFD project MEGAFLOW was initiated, which combined many of the CFD development activities from DLR,



universities and aircraft industry. Its goal was the development and validation of a dependable and efficient numerical tool for the aerodynamic simulation of complete aircraft. The MEGAFLOW software system includes the block-structured Navier-Stokes code FLOWer and the unstructured Navier-Stokes code TAU (Kroll and Heinrich 2004). While structured grids can better capture the wing surface with a high aspect ratio, unstructured grids built from triangular or tetrahedral cells can better cope with complex geometries. Both codes have been used successfully for a long time and still produce simulation results of state-of-the-art quality. Thanks to the drastically enhanced computer power and speed, unstructured codes have gained a certain advantage because of their higher flexibility for complex geometries. In France, ONERA developed its structured grid RANS code, programmed in C++, and Dassault Aviation successfully applied its internal Navier-Stokes code to the design of its business aircraft. In the first specific aeronautics programme, BRITE-EURAM Area 5, three collaborative scientific computing research projects were launched in 1990, all coordinated by industry: 1. EUROVAL (coordinator Dornier) validated the turbulence models used in the RANS codes (Haase et al. 1993); 2. EUROMESH (coordinator British Aerospace) addressed smart ways to generate meshes (see Fig. 15) (Weatherill et al. 1993); 3. EUROPT (coordinator Dassault Aviation) investigated optimal design methods in aeronautics (Periaux et al. 1997). As the aspects of scientific computing mentioned here are interdependent and therefore require coordination, the follow-on activities of these three projects were combined within FP3 into the European Computational Aerodynamics Research Project (ECARP). The results were published in three books (Haase et al. 1997; Periaux et al. 1998; Hills et al. 1999). 
While mesh generation was increasingly performed with smart automatic pre- and post-processing software packages, design

Fig. 15 Structured multi-block grid of the generic ONERA M6 wing after mesh adaptation (M = 0.84, α = 3°, 103 × 41 × 22 mesh nodes) (EUROMESH project Weatherill et al. (1993))



Fig. 16 Inviscid flow calculation for generic aeroelastic aircraft: pressure distribution and isolines for the final mesh (UNSI project Haase et al. (2003))

optimisation could gain from the increasing power and speed of the computer and consider more sophisticated methods that were more robust in finding the optimal solution. In addition to the “classical” computational themes of fluid mechanics, other European research teams investigated multi-dimensional upwind schemes for Euler and Navier-Stokes codes to achieve method robustness and grid independence. Two projects also worked on solution-adaptive multigrid techniques to solve discrete systems of these equations. These projects, coordinated by the Von Karman-Institute for Fluid Dynamics (VKI), ran from 1989 to 1995. The team's results for numerous test cases, such as the circular cylinder, the standard airfoil NACA 0012, the supercritical airfoil RAE-2822, the generic ONERA M6 wing, and others, were published in Deconinck and Koren (1996). Problems of fluid-structure coupling were addressed in the UNSI project (1998–2000). The goal was to improve CFD methods for time-accurate unsteady flow. The project performed a comprehensive calibration of CFD methods by carrying out parallel measurements and computations to realize cross-fertilization and improve turbulence modelling, particularly for unsteady flows (see Fig. 16). UNSI demonstrated the ability to predict unsteady aeroelastic phenomena, including flow with separation and/or flutter (Haase et al. 2003). In the FP5 project FLOMANIA (2002–2004), 17 European industry and research organisations collaborated on improving existing CFD methods. The focus was on the investigation and integration of differential Reynolds stress models into aeronautical applications and modelling of detached-eddy simulations (DES) (Haase et al. 2006). The FP6 project ATAAC (2009–2012) focused on approaches that are more affordable than LES:
• Differential Reynolds Stress Models (DRSM),



• advanced Unsteady Reynolds-averaged Navier-Stokes (URANS) models, including Scale-Adaptive Simulation (SAS),
• Wall-Modelled Large Eddy Simulation (WMLES), and
• different hybrid RANS-LES coupling schemes, including the latest versions of DES and Embedded LES.
The project focused on flows for which current models fail to provide sufficient accuracy, e.g. installed flows, high-lift applications, or swirling flows (Schwamborn and Strelets 2012). Uncertainty management and quantification, as well as robust design methods (RDM), are needed to bridge the gap on the way to industrial readiness, as addressing uncertainties allows for rigorous management of performance engagements and associated risks. The UMRIDA project (2013–2016) responded to this challenge by developing and applying new methods capable of handling large amounts of simultaneous uncertainties, including generalized geometric uncertainties in design and analysis within a turn-around time acceptable for industrial readiness (Hirsch et al. 2019).
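What uncertainty quantification means in practice can be sketched with the simplest possible baseline, plain Monte Carlo sampling. This is a generic textbook approach and not the method developed in UMRIDA, whose aim was precisely to handle many uncertainties at lower cost than brute-force sampling. The thin-airfoil lift formula stands in for an expensive CFD solver, and the input mean and standard deviation are assumed values chosen for illustration.

```python
import math
import random
import statistics


def lift_coefficient(alpha_rad):
    """Thin-airfoil estimate c_l = 2*pi*alpha for a symmetric airfoil.

    Stand-in for a flow solver; alpha is the angle of attack in radians.
    """
    return 2.0 * math.pi * alpha_rad


def propagate(model, mean, std, n_samples=20000, seed=0):
    """Sample a normally distributed input and return output mean and std."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    outputs = [model(rng.gauss(mean, std)) for _ in range(n_samples)]
    return statistics.mean(outputs), statistics.stdev(outputs)


# Assumed uncertain input: angle of attack of 2 degrees +/- 0.2 degrees.
alpha_mean = math.radians(2.0)
alpha_std = math.radians(0.2)

cl_mean, cl_std = propagate(lift_coefficient, alpha_mean, alpha_std)
```

Because the stand-in model is linear, the output statistics can be checked by hand: the mean lift coefficient is close to 2π times the mean angle and the standard deviation scales the same way. With a real solver each sample is a full CFD run, which is why sampling tens of thousands of cases is not acceptable within an industrial turn-around time.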

4.2 Numerical Methods for Industrial Needs

The aeronautics industry used numerical methods in the design and production process from the very beginning of computer-based procedures. However, for a long time this was limited to aircraft components such as the wings and to simple geometries because computers did not have sufficient performance and no capable numerical codes were available. In 1995, the aerospace research centres NLR, DLR and FFA, with the support of industrial partners, launched the FASTFLO project to develop a hybrid grid-based CFD technology for inviscid and viscous flow computation that can be applied to geometrically complex aircraft configurations (Berglind and van der Burg 2001). While the first project was based on Euler codes, Navier-Stokes codes were already used in the follow-on project FASTFLO II, as software development and computer performance had improved significantly. The main objective of the FP6 project DESider (2004–2007) was to overcome known weaknesses in turbulence-resolving approaches (detached-eddy simulation or DES, other RANS-LES hybrids and scale adaptive simulation or SAS) to support the European aeronautics industry with simulation methods that provide better predictive accuracy for complex turbulent flows. The project had the following main achievements (Haase et al. 2009):
• Investigation and development of advanced modelling approaches for unsteady flow simulations as a compromise between URANS and LES, which are now able to produce results comparable to LES for real aeronautical applications,
• Demonstration of the ability of hybrid RANS-LES approaches to solve industrially relevant applications focusing on aerodynamic flows characterised by separation, wakes, vortex interaction and buffeting.



Fig. 17 Instantaneous flow field pattern of the generic aircraft FA5; iso-vorticity surfaces coloured according to the pressure coefficient c_p (DESider project Haase et al. (2009))

• Investigation on the application of hybrid RANS-LES methods to multi-disciplinary topics such as aero-acoustics (noise reduction) and aeroelasticity (reduced weight, unsteady loads, fatigue issues, improved safety), facilitating cost-effective design.
Figure 17 shows the complex flow structures of a challenging test case. The FP6 project ADIGMA (2006–2009) was launched to take a major step in the development of next-generation CFD tools, significantly improving the accuracy and efficiency of simulations of turbulent aerodynamic flows (Kroll et al. 2010). The main objective of ADIGMA was to develop innovative high-order methods for compressible flow equations combined with reliable adaptive solution strategies, enabling mesh-independent numerical simulations for aerodynamic applications. The project concentrated on technologies that show the highest potential for efficient, high-order discretization in unstructured meshes, specifically Discontinuous Galerkin (DG) methods and Continuous Residual Distribution (CRD) schemes. Significant progress was made in ADIGMA, but in the end many of the remarkable achievements in high-order simulation methods were still far from industrial maturity. In 2010, the follow-on project IDIHOM (2010–2014) was initiated within FP7. It was motivated by the increasing demand from the European aerospace industry to improve CFD-aided design procedure and analysis using accurate and fast numerical methods. Following a top-down approach, IDIHOM addressed a set of challenging application cases proposed by the industry. The test set included turbulent steady and unsteady aerodynamic flows covering external and internal aerodynamics as well as aeroelastic and aeroacoustic applications. The complete process chain of high-order flow simulation capability was addressed, focusing on the efficiency and mesh generation capabilities of the flow solver, but also on visualising CFD and coupling it to other disciplines for multidisciplinary analysis. 
The challenging application cases defined by the industry formed the basis for demonstrating and assessing the current status of high-order methods as a workhorse in industrial applications. One of the chosen test cases was based on the aero-elastic experiments of the international cooperative program HART-II (Higher harmonic control Aeroacoustics Rotor Test) (Fig. 18).

Contribution of Scientific Computing in European Research …


Fig. 18 Aero-elastic results for the HART-II rotor (generic Bo 105): Visualization of the tip vortex through the iso-surfaces of the Q-criterion, coloured according to the magnitude of the vortices (left); sectional slice at 87% span (right) (IDIHOM project, Source Kroll et al. (2015))

IDIHOM’s goal was to help bridge the gap between the expectations for the capabilities of high-order methods and their use in industrial real-world applications by pursuing better, more accurate, and more time-saving design processes (Kroll et al. 2015).

The Horizon 2020 TILDA project (2015–2018) addressed methods and approaches that combine advanced and efficient high-order numerical schemes (HOMs) with innovative approaches to LES and DNS. The project sought to resolve all the relevant flow features on tens of thousands of processors in order to get close to a full LES/DNS solution at one billion Degrees of Freedom (DoF) without exceeding a turn-around time of a few days. The aim of the TILDA project was both to improve physical knowledge and to deliver more accurate predictions of nonlinear, unsteady flows near the borders of the flight envelope, which directly contributes to better reliability. The main highly innovative targets for industrial needs were:

• Development of methods for HOM acceleration in unsteady turbulence simulation on unstructured grids,
• Development of methods to accelerate LES and future DNS methodology through multilevel, adaptive, fractal, and similar approaches on unstructured grids,
• Use of existing large-scale HPC networks to bring LES/DNS industrial applications close(r) to daily practice.

Compact high-order methods offer a very high ratio between computational work and DoF combined with a low data dependency stencil, making these methods extremely well adapted for parallel shared-memory processors and allowing efficient redistribution to an increased number of processors. In addition, the TILDA project was able to demonstrate the multi-disciplinary capability of HOMs for LES in the field of aeroacoustics (Hirsch et al. 2021).
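To give a feel for the scale of the one-billion-DoF target, the following back-of-the-envelope sketch relates a DoF budget to element counts and per-processor load. It assumes a tensor-product basis on hexahedral elements, for which a scalar field of polynomial degree p carries (p + 1)^3 DoF per element. The polynomial degree and the processor count used below are illustrative assumptions, not figures reported by the TILDA project, and the function names are hypothetical.

```python
def dofs_per_hex_element(p: int) -> int:
    """DoF of one scalar field on a hexahedron with a tensor-product basis of degree p."""
    return (p + 1) ** 3


def elements_needed(total_dofs: int, p: int) -> int:
    """Smallest number of degree-p hexahedra whose DoF reach total_dofs (ceiling division)."""
    per_element = dofs_per_hex_element(p)
    return -(-total_dofs // per_element)


def dofs_per_processor(total_dofs: int, processors: int) -> float:
    """Average DoF load per processor for an ideal, perfectly balanced partition."""
    return total_dofs / processors


TARGET_DOFS = 10**9          # "one billion DoF" as stated in the text
ASSUMED_DEGREE = 3           # assumption: cubic polynomials
ASSUMED_PROCESSORS = 20_000  # assumption: one value within "tens of thousands"

print(dofs_per_hex_element(ASSUMED_DEGREE))
print(elements_needed(TARGET_DOFS, ASSUMED_DEGREE))
print(dofs_per_processor(TARGET_DOFS, ASSUMED_PROCESSORS))
```

Under these assumptions, raising the polynomial degree reduces the number of elements needed for the same DoF budget, which is one reason the mesh generation and partitioning steps differ from those of low-order solvers.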


D. Knoerzer

The capabilities of emerging leading-edge HPC architectures are often not fully utilised in industrial numerical simulation tools. Current state-of-the-art solvers used by industry do not take sufficient advantage of the immense capabilities of new computer hardware architectures, such as streaming processors or many-core platforms. Combined research focusing on algorithms and HPC is the only way to enable the development and promotion of simulation tools for the needs of the European aeronautical industry. Therefore, the Horizon 2020 project NextSim (2021–2024) focuses on the development of the numerical flow solver CODA (including finite volume capabilities and high-order discontinuous Galerkin schemes), which is planned to be a new reference solver for aerodynamic applications in the Airbus Group.

Improving the capabilities of models for complex fluid flows offers the potential to reduce the energy consumption of aircraft, cars and ships, which in turn reduces emissions and noise from combustion-based engines. This can have a major impact on economic and environmental factors in a highly competitive global industry. Hence, the ability to understand, model, and predict turbulence and transition phenomena is a key requirement in the design of efficient and environmentally acceptable fluid-based energy transfer systems. Against this background, the Horizon 2020 project HiFi-TURB (2019–2022) sets out a highly ambitious and innovative program of activities designed to address some of the deficiencies of advanced statistical turbulence models. The project focuses on the following pillars of excellence:

• Exploitation of high-fidelity LES/DNS data for a range of reference flows that contain important flow characteristics.
• Application of novel artificial intelligence and machine learning algorithms to identify significant correlations between representative turbulent quantities.
• Directing research toward improved models by four world-renowned industrial and academic turbulence experts.

The HiFi-TURB project involves major aeronautics companies and software providers, as well as well-known aeronautics research centres and academics as project partners.

Since 2018, DLR, ONERA and Airbus have jointly developed the new unstructured CFD solver CODA. The new code is written in C++17 and focuses heavily on HPC and scalability. DLR integrated its work on the TAU successor code Flucs into CODA as part of the co-development (Wagner 2021).

5 European Scientific Networks in Flow Physics and Design Methods

The partners involved in research projects, especially at European level, need to share a certain amount of their know-how in order to move forward together on the research topics of the project. Due to novel research areas and competition in industry, sharing information was sometimes difficult. On the other hand, at European level, several projects often work in parallel in complementary technology areas. These were the reasons for the creation of European scientific networks, which stimulated the exchange of scientific knowledge in workshops or panels of experts and coordinated research activities through linked projects or in-house activities. Network partners could also develop joint technology strategies for specific thematic areas. Some examples of networks in aerodynamics and design will illustrate their main functions.

The aim of the European network DragNet (1998–2002) was to facilitate the industrial return of European research on aerodynamic drag reduction technologies by coordinating European drag reduction projects, e.g., in the field of laminar flow technology. Like its follow-on networks KATnet and KATnet II, it provided an information and data exchange platform through conferences, workshops and the Internet (Thiede 2001). Its expert groups identified future research and technology priorities that could meet the needs of the European aircraft industry. DragNet was a partner in the TRA3 project (1998–2002), which aimed to cluster aerodynamics research projects within FP4 and FP5. KATnet and KATnet II had a wider scope than drag reduction technologies. They contributed to ACARE’s first Strategic Research Agenda (SRA) (Strategic Research Agenda 2002) and its updates.

Within FP5, the thematic network FLOWnet provided a code validation tool for flow modelling and computational/experimental methods (subsonic, transonic, supersonic and hypersonic regimes) through a common database to its large scientific and industrial community. After a successful start, it faced operational difficulties and ceased operations soon after the end of the supported network. Also within FP5, the successful thematic network QNET-CFD addressed the quality and reliability of industrial applications of CFD methods. It developed services and assessment procedures in cooperation with the scientific association ERCOFTAC, which took over the operation of the Knowledge Base Wiki and has used it until today.

Another network whose key activities continue to this day was the FP4 network INGENET. It dealt with industrial design and control applications using genetic algorithms and evolutionary strategies. Improved numerical methods and the development of high-performance computing made the use of robust but slowly converging evolutionary algorithms attractive for industrial applications. Through its networking activities, INGENET supported the bi-annual thematic conference EUROGEN, which continues to attract the related community. A selection of the conference papers has been published regularly in thematic books (see, e.g., Quagliarella et al. (1998); Miettinen et al. (1999)). Figure 19 shows Pareto-optimised wing planform shapes calculated using the Navier-Stokes equations. This work was presented by S. Obayashi (Tohoku University, Japan) at EUROGEN97 in Trieste, Italy (Quagliarella et al. 1998).
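Pareto-optimised designs such as those in Fig. 19 are the non-dominated members of a set of candidate designs. As a minimal sketch of that idea, assuming all objectives are to be minimised, the non-dominated set can be filtered as shown below. The two objective values per candidate are made-up numbers for illustration only and are not data from the cited work; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Objectives = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better in at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Objectives]) -> List[Objectives]:
    """Keep only the points that no other point dominates (minimisation)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Illustrative candidates: (objective 1, objective 2), both to be minimised.
candidates = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (2.5, 3.5)]
front = pareto_front(candidates)
print(front)
```

This brute-force filter compares every pair of candidates, which is adequate for small populations; evolutionary multi-objective optimizers typically use faster non-dominated sorting procedures.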




Fig. 19 Pareto-optimised wing planform shapes calculated using Navier-Stokes equations (presented at EUROGEN97 by Quagliarella et al. (1998))

6 Conclusions

The intensive cooperation in joint research projects described in this paper led to Europe’s strong position in the computational sciences world-wide. Efforts are needed to maintain this position. Today, computational methods and numerical tools play a key role in aerospace design processes to ensure timeliness, cost control, and design accuracy. Modern optimised aircraft require multidisciplinary approaches in the design process. Additional challenges will need to be tackled in the future, such as the use of high-performance computers, the application of Artificial Intelligence (AI) and the handling of big data.

Computational methods have contributed to the greening of the aviation sector over the last decade, especially in reducing drag and noise, in the use of enhanced lightweight structures, and in optimised air traffic management. But the aviation of the future will face more challenges. The ambitious goal of a world-wide CO2 neutral air transport system by 2050 will require an enormous effort in two directions. First, the performance of future aircraft has to be significantly better. Experts estimate that performance could improve by more than a third compared to today’s state-of-the-art airliners such as the Airbus A320neo or the A350 wide-body. In their paper (Grewe et al. 2021), Volker Grewe and his colleagues at DLR show the savings potential of the next generation of classic aircraft and of radically changed concepts (Table 1). The table shows the estimated contributions to be achieved by the different disciplines. High-performing computational design methods and numerical optimisers are needed to achieve the individual reduction targets and thereby the estimated overall improvement. Second, the other two thirds of the CO2 reduction must come from sustainable aviation fuel (SAF) and CO2 neutral propulsion systems using electricity or hydrogen, for example. Electricity and hydrogen might be the solution for smaller short-range aircraft. Big long-range aircraft will have to rely on combustion engines powered by SAF, which needs to be produced in sufficient quantities, at acceptable prices and without harm to society.

Table 1 Estimate of the potential for reducing the fuel consumption of the next generation of aircraft compared to the state-of-the-art aircraft A320neo (single-aisle) and A350 (twin-aisle) (Source Grewe et al. (2021))
Column headings: Technology; Next generation 2035; Future generation 2050; Twin-aisle (long-range), conventional tube and wing configuration; Novel aircraft; Turbofan Flying V and MF-BWB; Turboelectric propulsion NASA N3-X
Row labels: Airframe + engine integration (%); Novel configurations (%); Drag reduction (%); Lightweight structures (%); Combustion-based engines (%); Operations (%); Total estimated improvement (%)

Europe’s latest initiative for the development of science and technology is the European Union’s 9th Framework Programme for Research and Innovation, Horizon Europe, with a budget of 95.5 billion euros for the years 2021–2027. Thirty-five percent of the budget will contribute to climate objectives. The first Horizon Europe strategic plan defines the strategic orientations for research and innovation investments over 2021–2024. It plans that aviation research and innovation actions will develop integrated aircraft technologies that enable deep decarbonisation and reduce all emissions (including air pollutants) and aviation impacts (including noise). These actions strengthen the cooperation of the European aero-industry and its industrial leadership.

The specific programme for Horizon Europe is structured in three main parts, called pillars (see Fig. 20). Pillar II has by far the largest budget share with 53.5 billion euros. Within Pillar II, Cluster 4 (Digital, Industry and Space) and Cluster 5 (Climate, Energy and Mobility) are the most relevant for research on computational and numerical methods and for specific aviation research and technological development at the European level.

Fig. 20 Specific programme structure of Horizon Europe (2021–2027) with three pillars (Source European Commission)

In Horizon Europe, aviation research and innovation will follow a policy-driven approach aiming for climate neutrality by 2050 and a digital transformation. There are three main streams of activities:

1. Collaborative aviation research and innovation under the Cluster 5 work programme focuses on transformative low-TRL (1–4) technologies.
2. The Clean Aviation partnership focuses on clearly identified paths, as described in the Strategic Research and Innovation Agenda (SRIA), and on focused demonstrators (TRL 4–6).
3. The SESAR3 partnership addresses solutions supporting the evolving demand for using the European sky, the increased expectations on the quality of air traffic management, and service provisions for unmanned aircraft system traffic management.

Cluster 4 of Pillar II will support activities addressing key digital technologies including quantum technologies, emerging enabling technologies, artificial intelligence and robotics, advanced computing, and Big Data amongst others.



Overall, Horizon Europe offers many opportunities for relevant research and technological development in aviation, computer science, and also renewable energy for tackling future challenges.

Nomenclature

ACARE: Advisory Council for Aviation Research and Innovation in Europe
AI: Artificial Intelligence
ALT: Attachment Line Transition
AoA: Angle of Attack
ARA: Aircraft Research Association, Bedford, UK
BLADE: Breakthrough Laminar Aircraft Demonstrator in Europe (Airbus project within the European Clean Sky framework)
CAA: Computational Aero-Acoustics
CERFACS: Centre Européen de Recherche et de Formation Avancée en Calcul Scientifique, Toulouse, France
CFD: Computational Fluid Dynamics
CFI: Cross-Flow Instability
CFRP: Carbon Fibre Reinforced Plastic
CIMNE: International Centre for Numerical Methods in Engineering, Barcelona, Spain
Clean Aviation: Public-private partnership to develop cleaner aircraft (follow-on of Clean Sky)
Clean Sky: Public-private partnership between the European Commission and the European aeronautics industry
CODA: CFD for ONERA, DLR and Airbus (solver)
CRD: Continuous Residual Distribution (scheme)
DDES: Delayed Detached Eddy Simulation
DES: Detached Eddy Simulation
DG: Discontinuous Galerkin (method)
DLR: German Aerospace Center, Cologne, Germany
DNW: German-Dutch Wind Tunnels, Emmeloord, The Netherlands
DoF: Degree of Freedom
EADS: European Aeronautics Defence and Space Company, Leiden, The Netherlands
elsA: CFD software of ONERA
ERCOFTAC: European Research Community on Flow, Turbulence and Combustion
EREA: Association of European Research Establishments in Aeronautics
ETW: European Transonic Windtunnel, Cologne, Germany
EU: European Union
FP (FP2, ...): EU Research Framework Programme (2nd FP, ...)
FW-H: Ffowcs Williams and Hawkings equations
HART-II: 2nd Higher-Harmonic Control Aeroacoustic Rotor Test
HLFC: Hybrid Laminar Flow Control




High Order Methods
8th EU Framework Programme for Research & Innovation (2014–2020)
9th EU Framework Programme for Research & Innovation (2021–2027)
High Performance Computer
Institute of Fluid Flow Machinery, Polish Academy of Sciences, Gdansk, Poland
National Institute for Research in Digital Science and Technology, Le Chesnay, France
Large Eddy Simulation
Multi-Disciplinary Optimisation
Micro-ElectroMechanical Systems
Natural Laminar Flow
Notes on Numerical Fluid Mechanics and Multidisciplinary Design (book series of Vieweg/Springer)
Nitrogen Oxides
The French Aerospace Lab, Palaiseau, France
Reynolds Averaged Navier-Stokes equations
Robust Design Methods
Random Particle-Mesh
ONERA’s large transonic atmospheric wind tunnel, Modane-Avrieux, France
Sustainable Aviation Fuel
Sustainable and Green Engines (Clean Sky programme)
Single European Sky Air Traffic Management Research (Joint Undertaking of European Commission, Eurocontrol and industry)
Strategic Research Agenda
Strategic Research and Innovation Agenda
CFD software of DLR
Technology Readiness Level
Tollmien-Schlichting Instability
Unsteady Reynolds-Averaged Navier-Stokes
Von Karman Institute for Fluid Dynamics, Sint-Genesius-Rode, Belgium
Wall-Modelled Large Eddy Simulation
Zonal Detached Eddy Simulation

European Aeronautics Research Projects and Networks

Here is a list of European aeronautics research projects and networks in drag reduction technologies, CFD and aeroacoustics mentioned in this paper:

ADIGMA: Adaptive Higher-Order Variational Methods for Aerodynamic Applications in Industry (2006–2009)




Advanced Aerodynamic Flow Control Using MEMS (parts I and II) (2002–2005)
Aircraft Noise Reduction Technologies and Related Environmental Impact (2017–2022)
Advanced Turbulence Simulation for Aerodynamic Application Challenges (2009–2012)
Detached Eddy Simulation for Industrial Aerodynamics (2004–2007)
Design, Simulation and Flight Reynolds Number Testing for Advanced High-Lift Solutions (2009–2013)
Decrease Jet-Installation Noise (2020–2023)
European Drag Reduction Network (1998–2002)
European Computational Aerodynamics Research Project (1993–1995)
European Laminar Flow Investigation (parts I and II) (1989–1996)
Enabling Optimized Disruptive Airframe-Propulsion Integration Concepts (2020–2024)
Efficient Turbulence Models for Aeronautics (1992–1995)
European High Lift Programme (parts I and II) (2000–2007)
Multi-Block Mesh Generation for Computational Fluid Dynamics (1990–1992)
Optimum Design in Aerodynamics (1990–1993)
European Shock Control Investigation (1993–1995)
Drag Reduction by Shock and Boundary Layer Control (1996–1999)
Reduction of Wave and Lift-Dependent Drag for Supersonic Transport Aircraft (1996–1999)
European Program on Transition Prediction (1996–1999)
Computational Fluid Dynamics Code Validation (1990–1993)
Fully Automated CFD System for Three-dimensional Flow Simulations (parts I and II) (1996–2000)
Flow Physics Modelling - An Integrated Approach (2002–2004)
Flow Library on the Web Network (1998–2002)
Grey Area Mitigation for Hybrid RANS-LES Methods (2013–2015)
High-Fidelity LES/DNS Data for Innovative Turbulence Models (2019–2022)
Hybrid Laminar Flow Demonstration on Aircraft (1996–2000)
Hybrid Laminar Flow Technology (1998–2002)
Industrialisation of Higher-Order Methods (2010–2014)
Networked Industrial Design and Control Applications Using Genetic Algorithms and Evolution Strategies (1997–2002)





Innovative Design of Installed Airframe Components for Aircraft Noise Reduction (2020–2024)
Key Aerodynamic Technologies Network (parts I and II) (2002–2009)
Laminar Flow Research Action (1993–1994)
Low Emission Combustor Technology (phases I–III) (1990–2002)
Next Generation of Industrial Aerodynamic Simulation Code (2021–2024)
Optimisation for Low Environmental Noise Impact Aircraft (2009–2014)
Improvement of Fan Broadband Noise Prediction: Experimental Investigation and Computational Modelling (2005–2008)
Thematic Network for Quality and Trust in the Industrial Application of CFD (2000–2004)
Reduction of Airframe and Installation Noise (1998–2002)
Simulations of CROR and Fan Broadband Noise with Reduced Order Modelling (Clean Sky 2)
Significantly Lower Community Exposure to Aircraft (2001–2007)
Study of Noise and Aerodynamics of Advanced Propellers (1993–1996)
Towards Effective Flow Control and Mitigation of Shock Effects in Aeronautical Applications (2020–2024)
Testing for Laminar Flow on New Aircraft (2005–2009)
Transition Location Effect on Shock Wave Boundary Layer Interaction (2012–2016)
Towards Industrial LES/DNS in Aeronautics (2015–2018)
Targeted Research Action in Aerospace Aerodynamics (1998–2002)
Unsteady Effects in Shock Wave Induced Separation (2005–2009)
Unsteady High-Lift Aerodynamics - Unsteady RANS Validation (2018–2022)
Uncertainty Management for Robust Industrial Design in Aeronautics (2013–2016)
Unsteady Viscous Flow in the Context of Fluid-Structure Interaction (1998–2000)
Solution Adaptive Navier-Stokes Solvers Using Multidimensional Upwind Schemes and Multigrid Acceleration (1990–1995)




Validation and Improvement of Airframe Noise Prediction Tools (2009–2013)
Aviation Noise Research Network and Coordination (1998–2015)

More details about these research projects and networks can be obtained from the CORDIS Database of the European Commission.

References

Aircraft Noise Reduction Technologies and Related Environmental iMpact (ARTEM) (2020) Community Research and Development Information Service (CORDIS). project/id/769350
Berglind T, van der Burg JW (2001) A joint European initiative to develop hybrid grid based CFD technology for inviscid and viscous flow computations applicable to geometrically complex aircraft configurations. Aerospace engineering reports NLR-TP-2000-506, National Aerospace Laboratory NLR
Collin D, Bauer M, Bergmans D, Brok P, Brouwer H, Dimitriu D, Gély D, Humphreys N, Kors E, Lemaire S, Lempereur P, Mueller U, Van Oosten N (2016) Overview of aviation noise: research effort supported by the European Union. In: ICAO Environmental Report, pp 38–41
Deconinck H, Koren B (eds) (1996) Euler and Navier-Stokes solvers using multi-dimensional upwind schemes and multigrid acceleration: results of the BRITE/EURAM projects AEROCT89-0003 and AER2-CT92-00040, 1989–1995. Notes on Numerical Fluid Mechanics (NNFM), vol 57. Vieweg
Decrease Jet-Installation Noise (2020) Community Research and Development Information Service (CORDIS).
Doerffer P, Flaszynski P, Dussauge JP, Babinsky H, Grothe P, Petersen A, Billard F (eds) (2021) Transition location effect on shock wave boundary layer interaction: experimental and numerical findings from the TFAST project. Notes on Numerical Fluid Mechanics and Multidisciplinary Design (NNFM), vol 144. Springer, Berlin
Doerffer P, Hirsch C, Dussauge J-P, Babinsky H, Barakos GN (eds) (2011) Unsteady effects of shock wave induced separation. Notes on Numerical Fluid Mechanics and Multidisciplinary Design (NNFM), vol 114. Springer, Berlin
Enghardt L (2009) Improvement of fan broadband noise prediction: experimental investigation and computational modelling. Publishable final project results deliverable D6.1a, PROBAND Consortium.
Flightpath 2050: Europe’s Vision for Aviation (2011) Report of the high level group on aviation research, European Union. document/Flightpath2050_Final.pdf
Good Vibrations: Quieter Skies Ahead with Clean Sky (2021) Clean Aviation Joint Undertaking.
Grewe V, Gangoli Rao A, Grönstedt T, Xisto C, Linke F, Melkert J, Middel J, Ohlenforst B, Blakey S, Christie S, Matthes S, Dahlmann K (2021) Evaluating the climate impact of aviation emission scenarios towards the Paris agreement including COVID-19 effects. Nat Commun 12:3841
Haase W, Aupoix B, Bunge U, Schwamborn D (eds) (2006) FLOMANIA - a European initiative on flow physics modelling: results of the European-Union funded project, 2002–2004. Notes on Numerical Fluid Mechanics and Multidisciplinary Design (NNFM), vol 94. Springer, Berlin
Haase W, Brandsma F, Elsholz E, Leschziner M, Schwamborn D (eds) (1993) EUROVAL - an European initiative on validation of CFD codes: results of the EC/BRITE-EURAM project EUROVAL, 1990–1992. Notes on Numerical Fluid Mechanics (NNFM), vol 42. Vieweg
Haase W, Braza M, Revell A (eds) (2009) DESider - a European effort on hybrid RANS-LES modelling: results of the European-Union funded project, 2004–2007. Notes on Numerical Fluid Mechanics and Multidisciplinary Design (NNFM), vol 103. Springer, Berlin
Haase W, Chaput E, Elsholz E, Leschziner M, Müller U (eds) (1997) ECARP - European Computational Aerodynamics Research Project: validation of CFD codes and assessment of turbulence models. Notes on Numerical Fluid Mechanics (NNFM), vol 58. Vieweg
Haase W, Selmin V, Winzell B (eds) (2003) Progress in computational flow-structure interaction: results of the Project UNSI supported by the European Union 1998–2000. Notes on Numerical Fluid Mechanics and Multidisciplinary Design (NNFM), vol 81. Springer, Berlin
Hills D, Morris M, Marchant M, Guillen P (eds) (1999) Computational mesh adaptation: ECARP - European Computational Aerodynamics Research Project. Notes on Numerical Fluid Mechanics (NNFM), vol 69. Vieweg
Hirsch C, Hillewaert K, Hartmann R, Couaillier V, Boussuge J-F, Chalot F, Bosniakov S, Haase W (eds) (2021) TILDA - Towards Industrial LES/DNS in Aeronautics - paving the way for future accurate CFD: results of the H2020 research project TILDA funded by the European Union, 2015–2018. Notes on Numerical Fluid Mechanics and Multidisciplinary Design (NNFM), vol 148. Springer, Berlin
Hirsch C, Wunsch D, Szumbarski J, Laniewski-Wollk L, Pons-Prats J (eds) (2019) Uncertainty Management for Robust Industrial Design in Aeronautics: findings and best practice collected during UMRIDA, a collaborative research project (2013–2016) funded by the European Union. Notes on Numerical Fluid Mechanics and Multidisciplinary Design (NNFM), vol 140. Springer, Berlin
Hybrid Laminar Flow Technology. CORDIS (2002). BRPR970606
Iannelli P, Wild J, Minervino M, Strüber H, Moens F, Vervliet A (2013) Design of a high-lift system for a laminar wing. In: 5th European Conference for Aeronautics and Space Sciences (EUCASS). EUCASS Association
Koloszar L, Rambaud P, Planquart P, Schram C (2012) Numerical noise prediction of a generic flap configuration. In: 5th Symposium on Integrating CFD and Experiments in Aerodynamics
Kors E (2004) Significant lower community exposure to aircraft noise: halfway towards success. Key note presentation of 10th AIAA/CEAS Aeroacoustics Conference. Manchester
Kroll N, Bieler H, Deconinck H, Couaillier V, van der Ven H, Sørensen K (eds) (2010) ADIGMA - a European initiative on the development of adaptive higher-order variational methods for aerospace applications: results of a collaborative research project funded by the European Union, 2006–2009. Notes on Numerical Fluid Mechanics and Multidisciplinary Design (NNFM), vol 113. Springer, Berlin
Kroll N, Heinrich R (2004) MEGAFLOW - a numerical simulation tool for aircraft design. In: JAXA ANSS2004 (Aerospace Numerical Simulation Symposium). Tokyo
Kroll N, Hirsch C, Bassi F, Johnston C, Hillewaert K (eds) (2015) IDIHOM - Industrialization of High-Order Methods - a top-down approach: results of a collaborative research project funded by the European Union, 2010–2014. Notes on Numerical Fluid Mechanics and Multidisciplinary Design (NNFM), vol 128. Springer, Berlin
Miettinen K, Neittaanmäki P, Mäkelä MM, Périaux J (eds) (1999) Evolutionary algorithms in engineering and computer science: recent advances in genetic algorithms, evolution strategies, evolutionary programming, genetic programming and industrial applications. Wiley
Mullender AJ, Riedel H (1996) A laminar flow nacelle flight test programme. In: 2nd European Forum on Laminar Flow Technology, Paris. AAAF
Periaux J, Bugeda G, Chaviaropoulos PK, Giannakoglou K, Lanteri S, Mantel B (eds) (1998) Optimum aerodynamic design & parallel Navier-Stokes computations: ECARP - European computational aerodynamics research project. Notes on Numerical Fluid Mechanics (NNFM), vol 61. Vieweg
Periaux J, Bugeda G, Chaviaropoulos PK, Labrujere T, Stoufflet B (eds) (1997) EUROPT - a European initiative on optimum design methods in aerodynamics: proceedings of the Brite/Euram project workshop “Optimum design in aerodynamics”. Notes on Numerical Fluid Mechanics (NNFM), vol 55. Vieweg
Quagliarella D, Périaux J, Poloni C, Winter G (eds) (1998) Genetic algorithms and evolution strategies in engineering and computer science: recent advances and industrial applications. Wiley
Schrauf G (2005) Status and perspectives of laminar flow. Aeronaut J 109(1102):639–644
Schwamborn D, Strelets M (2012) ATAAC - an EU-project dedicated to hybrid RANS/LES methods. In: Progress in hybrid RANS-LES modelling: papers contributed to the 4th symposium on hybrid RANS-LES methods. Notes on Numerical Fluid Mechanics and Multidisciplinary Design (NNFM), vol 117. Springer, Berlin
Stanewsky E, Délery J, Fulker J, de Matteis P (eds) (2002) Drag reduction by shock and boundary layer control: results of the project EUROSHOCK II supported by the European Union 1996–1999. Notes on Numerical Fluid Mechanics and Multidisciplinary Design (NNFM), vol 80. Springer, Berlin
Stanewsky E, Délery J, Fulker J, Geissler W (eds) (1995) EUROSHOCK - drag reduction by passive shock control: results of the project EUROSHOCK, AER2-CT92-0049 supported by the European Union, 1993–1995. Notes on Numerical Fluid Mechanics (NNFM), vol 56. Vieweg
Strategic Research Agenda (2002) vol 1 and 2. Advisory council for aeronautics research in Europe
Streit T, Horstmann K, Schrauf G, Hein S, Fey U, Egami Y, Perraud J, El Din IS, Cella U, Quest J (2011) Complementary numerical and experimental data analysis of the ETW Telfona Pathfinder wing transition tests. In: 49th AIAA aerospace sciences meeting including the new horizons forum and aerospace exposition. AIAA
Strüber H, Wild J (2014) Aerodynamic design of a high-lift system compatible with a natural laminar flow wing within the DeSiReH project. In: 29th Congress of the International Council of the Aeronautical Sciences (ICAS 2014). ICAS, p 695
Thiede P (ed) (2001) Aerodynamic drag reduction technologies: proceedings of the CEAS/DragNet European Drag Reduction Conference. Notes on Numerical Fluid Mechanics (NNFM), vol 76. Springer, Berlin
Validation and Improvement of Airframe Noise prediction Tools (VALIANT) (2013) Community Research and Development Information Service (CORDIS). 233680
van Ingen JL (2008) The eN method for transition prediction: historical review of work at TU Delft. In: Proceedings of the 38th AIAA Fluid Dynamics Conference and Exhibit, AIAA paper 2008–3830. AIAA, pp 1–49
Wagner M (2021) The CFD solver CODA and the sparse linear systems solver spliss: evaluation of performance and scalability. In: Speech in HLRN CFD workshop day, 29 July 2021
Weatherill NP, Marchant MJ, King DA (eds) (1993) Multiblock grid generation: results of the EC/BRITE-EURAM project EUROMESH, 1990–1992. Notes on Numerical Fluid Mechanics (NNFM), vol 44. Vieweg
Wild J, Schmidt OT (2004) Prediction of attachment line transition for a high-lift wing based on two-dimensional flow calculations with RANS-solver. In: New results in Numerical and Experimental Fluid Mechanics V: contributions to the 14th STAB/DGLR symposium. Notes on Numerical Fluid Mechanics and Multidisciplinary Design (NNFM), vol 92, pp 200–207. Springer, Berlin

Thirty Years of Progress in Single/Multi-disciplinary Design Optimization with Evolutionary Algorithms and Game Strategies in Aeronautics and Civil Engineering

Jacques Periaux and Tero Tuovinen

Abstract This article reviews the major improvements in efficiency and quality of evolutionary multi-objective and multi-disciplinary design optimization techniques achieved during 1994–2021. First, we briefly introduce Evolutionary Algorithms (EAs) of increasing complexity as accelerated optimizers. We then introduce the hybridization of EAs with game strategies to gain higher efficiency. We review a series of papers in which this technique is treated as an accelerator of multi-objective optimizers and benchmarked on simple mathematical functions and simple aeronautical model optimization problems using friendly design frameworks. Results of numerical examples from real-life design applications in aeronautics and civil engineering, obtained with the chronologically improved EA models and hybridized game EAs, are listed, briefly summarized and discussed. This article aims to provide young scientists and engineers with a review of the development of numerical optimization methods and results in the field of EA-based design optimization, which can be further improved by, e.g., tools of artificial intelligence and machine learning.

Keywords Single/multi-disciplinary design optimization · Evolutionary algorithms · Game strategies · Hybridized games · Aeronautics · Civil engineering

J. Periaux
CIMNE (Centre Internacional de Mètodes Numèrics a l’Enginyeria), Campus Nord UPC, Gran Capità, S/N 08034 Barcelona, Spain

T. Tuovinen
Jyväskylä University of Applied Sciences, 40100 Jyväskylä, Finland

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053,




1 Introduction and Motivation

This article summarizes the progress made over the last three decades in evolutionary design optimization methods for aerospace and civil engineering applications. The different segments cover examples of complex evolutionary algorithms that combine evolutionary optimizers, parallel computing, hierarchical topologies of optimizers and solvers, asynchronous evaluation strategies, and game strategies that can significantly reduce computational costs in multidisciplinary design optimization procedures. The examples have been selected chronologically from 1994 to 2022, divided into three decades, and show the reader real problems of aviation and construction technology.

Traditional optimization tools in aeronautical design have been implemented with deterministic optimizers that are highly effective in finding local but not global optimal solutions. However, if the design problem is multimodal, discontinuous, or nondifferentiable, or involves multiple objectives, robust alternative tools are needed. One robust optimization technique is the use of evolutionary algorithms (EAs) based on Darwinian principles of evolution, in which populations of individuals evolve in search mode and adapt to the environment using mutation, crossover, and selection operators. Compared to traditional optimizers, EAs explore a wider range of possible solutions to complex problems. They have the ability to find global optimal solutions among the local optima of a hilly landscape (Mitchell 1998) by combining the genetic material of sub-solutions. The aim and novelty of our review article is to show the progress made on complex game strategies in many areas of aviation (civil aircraft and UAV systems) and civil engineering.
The applications are chosen to illustrate the flexibility of implementation and the efficiency and quality of evolutionary methods applied to different multi-objective and multi-disciplinary design optimization problems related to aviation systems. The computational results and conclusions provide a data repository that will be useful in addressing the green and carbon-neutral challenges of the coming decades, driven by ever-increasing air traffic and environmental structures.

2 Methods of Design Optimization

In this section, we briefly review the main methodologies of the design optimization field with some relevant references. Examples with numerical results obtained by using advanced Evolutionary Algorithms (EAs) are then presented in Sect. 3. Later in this article, we present applications to aeronautics and civil design problems.



2.1 Evolutionary Methods

Genetic Algorithms and Evolutionary Algorithms

One well-known class of methods for solving problems of Multi-disciplinary Design Optimization (MDO) and Multi-Objective Optimization (MO) is Evolutionary Algorithms (EAs). EAs belong to the category of stochastic (randomized) optimization methods. Genetic Algorithms (GAs) are decision-making algorithms whose main principle is to mimic the evolutionary principle of “survival of the fittest” associated with Charles Darwin’s famous book “On the Origin of Species” from 1859 (see Darwin (1859)). The main advantages of GAs are:
• GAs have the capability to capture a global optimum solution among many local optima;
• GAs can be easily run on parallel computers;
• GAs can be adapted to numerical solvers without major modification;
• GAs require no derivatives of the objective functions.
Finally, GAs can be used to solve multi-objective problems simultaneously by considering the objective vector instead of an aggregation of the objectives. This “vector” feature will be used in the later examples to capture Pareto, Nash or Stackelberg solutions of game theory in multi-criteria optimization problems. One of the main benefits of GAs is robustness: they are not limited by restrictive assumptions about the search space (continuity, differentiability of the objectives, uni-modality), and they also adapt well to discontinuous environments. Details of the general presentation, description, and mechanics of EAs using binary coding and selection, crossover and mutation operators can be found in Periaux et al. (2015).

Another evolutionary method is Evolution Strategies (ESs), introduced by Rechenberg (1973), which originally used only two real-coded individuals, the parent and its descendant. Further developments in evolution strategies introduced multi-membered populations for parents and their offspring. There is no longer a separation between GAs and ESs, both of which are now defined as Evolutionary Algorithms (EAs).
Various approaches have been developed to satisfy design constraints (see, e.g., Kumar (1992)). The use of penalty functions is the most common approach and is based on adding penalties (static, dynamic, annealing, adaptive, or death penalties, ...) to the objective function (see, e.g., Dasgupta and Michalewicz (1997)). Evolutionary Algorithms have been used for many years in multi-objective optimization and have been successfully applied to conceptual design problems, but they still face challenges in detailed design in the industrial environment due to computational costs, especially in optimization procedures using nonlinear analyzers.
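The selection, crossover, and mutation operators and the static penalty approach described above can be illustrated with a minimal sketch. It is not the implementation used in the cited works: the real-coded representation, the operator settings, the penalty weight, and the test problem (minimize x² + y² subject to x + y ≥ 1) are all assumptions chosen for illustration.

```python
import random

def penalized(f, constraints, x, weight=1e3):
    """Static penalty: objective plus weight * squared violation of g(x) <= 0."""
    violation = sum(max(0.0, g(x)) ** 2 for g in constraints)
    return f(x) + weight * violation

def tournament(pop, fitness, rng, k=2):
    """Selection: best of k randomly drawn individuals (minimization)."""
    contenders = rng.sample(range(len(pop)), k)
    return pop[min(contenders, key=lambda i: fitness[i])]

def mutate(x, bounds, rng, rate=0.2, scale=0.05):
    """Mutation: Gaussian perturbation of some genes, clipped to the bounds."""
    out = []
    for v, (lo, hi) in zip(x, bounds):
        if rng.random() < rate:
            v = min(hi, max(lo, v + rng.gauss(0.0, scale * (hi - lo))))
        out.append(v)
    return out

def evolve(f, constraints, bounds, pop_size=40, generations=150, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        fitness = [penalized(f, constraints, x) for x in pop]
        elite = pop[min(range(pop_size), key=lambda i: fitness[i])]
        children = [elite[:]]  # elitism: the best individual always survives
        while len(children) < pop_size:
            a = tournament(pop, fitness, rng)
            b = tournament(pop, fitness, rng)
            w = rng.random()  # crossover: random convex combination of the parents
            child = [w * ai + (1.0 - w) * bi for ai, bi in zip(a, b)]
            children.append(mutate(child, bounds, rng))
        pop = children
    return min(pop, key=lambda x: penalized(f, constraints, x))

# Hypothetical test problem: minimize x^2 + y^2 subject to x + y >= 1,
# written as g(x) = 1 - x - y <= 0. The constrained optimum is (0.5, 0.5).
objective = lambda x: x[0] ** 2 + x[1] ** 2
constraints = [lambda x: 1.0 - x[0] - x[1]]
best = evolve(objective, constraints, [(-2.0, 2.0), (-2.0, 2.0)])
```

The penalty weight controls how strongly infeasible candidates are rejected; dynamic and adaptive variants change this weight during the run instead of keeping it fixed.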



2.2 Multi-objective Optimization and Game Theory

Problems in engineering design often require the simultaneous optimization of inseparable objectives and are usually associated with a number of constraints. The problem of multi-criteria optimization is conventionally formulated as:

\[
\text{Minimize } f_i(x), \quad i = 1, \dots, N,
\]

subject to the constraints

\[
g_j(x) = 0, \quad j = 1, \dots, M, \qquad h_k(x) < 0, \quad k = 1, \dots, K,
\]

where $f_i$ are the objective functions, $N$ is the number of objectives, $x$ is an $N$-dimensional vector whose components are the decision variables, and $g_j$ and $h_k$ are the equality and inequality constraints, respectively. The Vector Evaluated Genetic Algorithm (VEGA) proposed by Schaffer in 1985 was the first genetic algorithm approach to handling multiple objectives, see Schaffer (1985). There are many variations and developments of multi-objective approaches, e.g., Pareto, Nash and Stackelberg approaches (Deb 2001). The cooperative Pareto game perspective, which captures a whole set of nondominated solutions, i.e., the Pareto front (Pareto 1896), is discussed in Periaux et al. (2015). Non-cooperative games based on the Nash equilibrium (Nash 1950), hierarchical Stackelberg games (Loridan and Morgan 1989), and hybridized games (acceleration of the search for a global solution or of the capture of a Pareto front with a dynamic Nash game through a Nash HAPEA topology), tested on mathematical functions to evaluate their performance, are also described in Periaux et al. (2015). The concept of the hybridized game dynamically combines a Nash game and Pareto optimality. The innovative idea of running a dynamic Nash game during Pareto optimization is motivated by an accelerated search for global solutions.
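The notion of a nondominated set can be made concrete with a short sketch. It assumes that all objectives are minimized; the function names and the sample objective vectors are illustrative and do not come from the cited works.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(points):
    """Return the nondominated subset of a list of objective vectors."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Five hypothetical candidates evaluated on two objectives (f1, f2).
candidates = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (2.5, 3.5)]
front = pareto_front(candidates)
```

Here (3.0, 4.0) and (2.5, 3.5) are both dominated by (2.0, 3.0), so the front consists of the remaining three points. This quadratic-time filter is adequate for small populations; sorting-based schemes are preferable for large ones.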

2.3 Advanced Methods and Tools for EAs

The main difficulty of EAs in industrial use is the computation time of each objective function evaluation of the candidate solutions. It is therefore necessary to introduce advanced techniques to accelerate the optimization procedures. Here we describe speed-up techniques including distributed and parallel EAs, hierarchical EAs, asynchronous EAs, advanced mutation operators, and hybridized games, which aim to increase diversity and speed up the search for optimal solutions.

Distributed and Parallel EAs

The most common approach to parallelization is global Parallel EAs (PEAs). The main feature is the use of a network of interconnected small populations (the so-called “Island Model”) introduced by Mühlenbein (1991). With this approach, the subpopulations on the nodes can simultaneously explore different regions of the search space, which improves robustness and makes it easier to escape from local minima. These subpopulations evolve independently on each node for an “epoch”. After each epoch, a period of migration follows in which information is exchanged between nodes, after which a new period of isolation begins (Mühlenbein (1991), see Fig. 1).

Fig. 1 Hierarchical multipopulation topology

Hierarchical EAs

Hierarchical EAs (HEAs) use a hierarchical topology for the layout of the subpopulations (see Sefrioui and Périaux (2000)). The main feature of the method is the interaction between the nodes of multiple layers. Each node can be handled by a different EA, whose specific parameters can be tuned as follows:
• The top layer concentrates on refining solutions (small mutation, precise but expensive models);
• The intermediate layer is a compromise between exploitation and exploration (intermediate models);
• The bottom layer is entirely devoted to exploration with approximate but fast models.
There exist many examples that demonstrate numerically the good performance of HEAs compared to EAs. Some of them are presented below in the sections related to aerodynamic applications.

Asynchronous Evolutionary Algorithms

The asynchronous variant of the classical parallel EA implementation was proposed by Whitney et al. (2003b). In this approach, the remote flow analysis solvers do not need to run at the same speed or even on the same local network. Solver nodes can be added or removed dynamically during the optimization procedure. This parallel implementation requires modifying the canonical EA, which ordinarily evaluates an entire population before proceeding to a new population. Following an individual-based approach, the method generates only one candidate solution at a time and re-incorporates only one individual at a time, rather than an entire population at every generation as is usual with traditional EAs. Solutions can therefore be generated and returned out of order, which allows an asynchronous fitness evaluation and gives the method its name. Compared to the classical notion of generations, this asynchronous approach has no waiting time (or bottleneck) for the return of individuals, since the buffer is continuously updated with evaluated individuals. As soon as one solution has been evaluated, its genetic material is re-incorporated into the buffer of the optimization procedure (see Fig. 2).

Fig. 2 Asynchronous evaluation

Nash Game and Hierarchical Asynchronous Parallel EAs

The competitive Nash game can be combined with the HEA or the Hierarchical Asynchronous Parallel Evolutionary Algorithm (HAPEA) approach. This idea is implemented using one hierarchical EA for each player. Two migrations are present in the hierarchical EA-Nash scheme. The first is a circulation of solutions up and down the hierarchy: the best solutions progress from the bottom layer to the top layer, where they are refined. The second is a Nash migration, in which information is exchanged between the players after each epoch. The new variables for each player are updated on each node and on each hierarchical tree.
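The Nash migration between players can be sketched as follows. This is a deliberately flat illustration without the hierarchical layers: each player runs a tiny evolution strategy on its own variable while the variable of the other player stays frozen, and the elites are exchanged after every epoch. The two quadratic criteria, the player optimizer, and all parameter values are assumptions made for this example. For these criteria the best responses are x = (1 + y)/2 and y = (3 + x)/2, so the Nash equilibrium is (5/3, 7/3).

```python
import random

def f1(x, y):
    """Criterion of player 1, who controls x only (hypothetical test function)."""
    return (x - 1.0) ** 2 + (x - y) ** 2

def f2(x, y):
    """Criterion of player 2, who controls y only (hypothetical test function)."""
    return (y - 3.0) ** 2 + (x - y) ** 2

def evolve_player(cost, start, rng, generations=30, offspring=10, sigma=0.3):
    """Tiny (1+lambda) evolution strategy on a single scalar variable."""
    best, best_cost = start, cost(start)
    for _ in range(generations):
        for _ in range(offspring):
            trial = best + rng.gauss(0.0, sigma)
            trial_cost = cost(trial)
            if trial_cost < best_cost:
                best, best_cost = trial, trial_cost
        sigma *= 0.9  # slowly narrow the search
    return best

def nash_ea(epochs=20, seed=1):
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    for _ in range(epochs):
        # One epoch: each player optimizes its own variable with the other frozen.
        new_x = evolve_player(lambda v: f1(v, y), x, rng)
        new_y = evolve_player(lambda v: f2(x, v), y, rng)
        # Nash migration: the players exchange their elite values.
        x, y = new_x, new_y
    return x, y

x_star, y_star = nash_ea()
```

In a hierarchical EA-Nash scheme, each call to `evolve_player` would be replaced by a full hierarchical EA, and the exchange at the end of the epoch would update the frozen variables on every node of each player's tree.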
Hybrid Game Coupled with Multi-objective Evolutionary Algorithms

A hybrid game is an advanced optimization method in which a Nash game and Pareto optimality are dynamically linked together, so that it can simultaneously capture a Nash equilibrium and a set of Pareto non-dominated solutions (Lee et al. 2010). The hybridized Nash/GA strategy speeds up the search for global solutions when solving multi-objective problems. The elitist solution from each Nash player is seeded to a Pareto player at each generation. This coalition-based mechanism increases the diversity of the Pareto player's search during the optimization procedure. Each Nash player has its own design criterion and uses its own optimization strategy. For a multi-objective design optimization problem, the hybridized Nash game splits the problem into several single-objective design problems. For a multi-disciplinary/multi-physics design problem, it splits the problem into several single-discipline design problems.

Meta Model Assisted EAs

Meta model assisted EAs are techniques for accelerating the convergence of an EA. The use of high-fidelity Computational Fluid Dynamics (CFD) and Finite Element Analysis (FEA) software for aeronautical and civil engineering system design is common practice in scientific and industrial environments. However, its use for optimization and MDO is limited by the time and cost of performing the many function evaluations required by the optimization procedure. An alternative is to construct an approximate model. Compared to the full analysis performed by high-fidelity tools, the approximate model requires relatively little time to compute the objective function. To build an approximation for optimization, a series of steps has to be carried out: (1) define the experimental design sample using Latin hypercube sampling; (2) define a model to represent the data (RSM, Kriging); (3) fit the model to the observed data (Karakasis and Giannakoglou 2006).
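The three steps above can be sketched with a quadratic response surface in two design variables. The sketch is only an illustration under stated assumptions: `expensive_model` is a hypothetical stand-in for a CFD or FEA evaluation, the quadratic basis is one possible choice of model, and the least-squares fit is done through the normal equations with a small hand-written solver so that the example needs nothing beyond the standard library.

```python
import random

def latin_hypercube(n, bounds, rng):
    """Step 1: one point per stratum in every dimension, strata shuffled per dimension."""
    columns = []
    for lo, hi in bounds:
        strata = list(range(n))
        rng.shuffle(strata)
        columns.append([lo + (hi - lo) * (k + rng.random()) / n for k in strata])
    return [list(point) for point in zip(*columns)]

def features(p):
    """Step 2: full quadratic basis in two variables (1, x, y, x^2, xy, y^2)."""
    x, y = p
    return [1.0, x, y, x * x, x * y, y * y]

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_response_surface(points, values):
    """Step 3: least-squares fit via the normal equations (F^T F) c = F^T y."""
    F = [features(p) for p in points]
    m = len(F[0])
    A = [[sum(row[i] * row[j] for row in F) for j in range(m)] for i in range(m)]
    b = [sum(row[i] * v for row, v in zip(F, values)) for i in range(m)]
    coeffs = solve(A, b)
    return lambda p: sum(c * phi for c, phi in zip(coeffs, features(p)))

def expensive_model(p):
    """Hypothetical stand-in for a costly CFD/FEA evaluation."""
    x, y = p
    return (x - 0.3) ** 2 + 2.0 * (y + 0.2) ** 2 + 0.5 * x * y

rng = random.Random(0)
bounds = [(-1.0, 1.0), (-1.0, 1.0)]
samples = latin_hypercube(30, bounds, rng)
surrogate = fit_response_surface(samples, [expensive_model(p) for p in samples])
checks = latin_hypercube(10, bounds, rng)
max_error = max(abs(surrogate(p) - expensive_model(p)) for p in checks)
```

Because the stand-in model is itself quadratic, the surrogate reproduces it up to rounding error; for a real analysis tool the error on the check points indicates whether more samples or a richer model such as Kriging is needed. Inside an EA, the surrogate would screen candidates so that only promising ones are passed to the expensive solver.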

3 Progress of the Use of Optimization Methods in Aeronautical Engineering Design

In this section, our aim is to collect significant articles and summarize their achievements in order to indicate the progress of optimization methods in the field of aeronautical engineering design. This study is not fully comprehensive, but it gives the reader a rapid and brief overview of the advances and of the relevant literature for a deeper review of specific topics. The section is divided into three decades. Table 1 presents a chronological summary of milestones in evolutionary design optimization in aeronautics.

Table 1 Progress of optimization methods and development of applications in the field of aviation

Milestone | Article title | Reference
First reconstruction problem in aerodynamics with a PDE flow analyzer | A transonic flow shape optimization by a genetic algorithm | Chen et al. (1994)
Results obtained in inverse and evolutionary optimization of airfoils operating in transonic Euler flows | Robust genetic algorithms for optimization problems in aerodynamic problems | Périaux et al. (1996)
Distance-dependent mutation | Fast convergence thanks to diversity | Sefrioui et al. (1996)
Hierarchical GAs with multi-models | A hierarchical genetic algorithm using multiple models for optimization | Sefrioui and Périaux (2000)
Multi-objective UAV design optimization: capture of Pareto front | Adaptive evolution design without problem-specific knowledge: UAV applications | Whitney et al. (2003a)
Aerostructural UAV design optimization | Multi-disciplinary optimization of unmanned aerial systems using evolutionary algorithms | Gonzalez et al. (2008)
UAS mission path planning with hybrid games | Optimal Mission Path Planning (MPP) for an air sampling unmanned aerial system | Gonzalez et al. (2009)
Multidisciplinary wing design optimization of UCAV using hierarchical asynchronous GAs | Efficient hybrid game strategies coupled to evolutionary algorithms for robust multidisciplinary design optimization in aerospace engineering | Lee et al. (2011b)
Shock control bump design optimization using hybrid games | Double-shock control bump design optimization using hybridized evolutionary algorithms | Lee et al. (2011a)
Distributed evolutionary optimization using Nash games and GPUs | Distributed evolutionary optimization using Nash games and GPUs: Applications to CFD design problems | Leskinen and Périaux (2013a)
Robust active shock control bump design optimization using parallel hybrid GAs | Robust active shock control bump design optimization using parallel hybrid MOGA | Lee et al. (2013)
Multi-objective robust design optimization procedure to the route planning of a commercial aircraft | Applying multi-objective robust design optimization procedure to the route planning of a commercial aircraft | Pons-Prats et al. (2015)
Wing design optimization with Stackelberg games | Two-objective aerodynamic shape optimization by a Stackelberg/adjoint method | Wang et al. (2020)
Multi-objective natural laminar flow optimization method in civil transport aircraft airfoil shape design | Drag reduction of NLF airfoils by evolutionary shape design optimization with Pareto games and trailing edge devices | Chen et al. (2021)

3.1 Period 1994–2004

The review begins in the 1990s, when the first shape design reconstruction problems of aerodynamics with a PDE flow analyzer were introduced. The challenge was that there are situations where computing the gradient and higher-order derivatives is prohibitive or even impossible. For highly multi-modal cost functions, derivative-based algorithms converge to the closest local minimum and are not able to capture a global solution. For these kinds of situations, it was a natural choice to consider alternative derivative-free methods such as Genetic Algorithms (GAs) for transonic flow shape optimization. In the aeronautical application context, the GA approach was tested on the inverse shape design reconstruction of a NACA0012 airfoil. This first application of GAs to CFD-based aeronautical design with a Partial Differential Equation (PDE) flow analysis solver subsequently opened the door to other, more complex models evaluated and analyzed in aeronautics (Chen et al. 1994).

In the mid-1990s, design strategy development reached a level where evolutionary inverse and optimization methods were applied to airfoils in transonic Euler flows (Périaux et al. 1996). That article discussed the use of GAs as part of a complex design process devoted to the solution of problems originating from aerospace engineering. For the first time, a major role of GAs in such design processes and their interactions with finite element mesh generators and Euler flow analysis solvers was introduced. A classical deterministic approach based on a conjugate gradient (CG) method was also presented, providing a reference for comparison with GAs. A major advantage of GAs in a complex industrial environment is robustness: they are not limited by restrictive assumptions about the search space such as continuity, the existence of derivatives, or unimodality. That robustness made it possible to consider only approximate solutions at the beginning of the optimization procedure and to refine them during the evolution.
The design problems considered in that article were quite simple, but soon thereafter CG and GAs were tested on more difficult constrained optimization problems, such as transonic multi-point design with strong viscous effects. Additional details on the topic can be found in Pironneau (1982), Michalewicz (1996), and Goldberg (1989).

In 1996, a new approach, distance-dependent mutation, was introduced to maintain population diversity in a genetic algorithm (Sefrioui et al. 1996). This advance made faster convergence possible; the idea was to avoid being trapped in local solutions due to premature convergence, a term describing the situation where the diversity of the gene pool is almost lost. The new approach was evaluated by means of the mathematical test functions in the article by Grefenstette (1986) with respect to robustness and convergence speed. Finally, it was applied to a more complex inverse problem originating from aerospace engineering (RAE2822 airfoil shape reconstruction), using an Euler flow analyzer and studying the convergence history. The numerical results on several test functions showed that distance-dependent mutation led to fast convergence and maintained population diversity. At the same time, a dynamical exploration process deals with the case where premature convergence might have occurred despite the diversity introduced by the distance-dependent mutation. More details can be found, e.g., in Grefenstette (1986); Mühlenbein et al. (1991); Sefrioui (1998).

In the early 2000s, hierarchical GAs with multiple models were developed. In the article (Sefrioui and Périaux 2000), the authors presented both the theoretical basis of Hierarchical Genetic Algorithms (HGAs) and some numerical results.



HGAs are explained and implemented along with the advantages conferred by their multi-layered hierarchical topology, which offers a compromise in the classical exploration/exploitation dilemma. Another feature is the introduction of multiple models for optimization problems within the frame of an HGA. By mixing simple models that are very fast with more complex models that are slower, it is possible to achieve the same quality faster. HGAs offered a good answer to the exploration/exploitation dilemma by introducing a hierarchical topology that allows different layers to perform different tasks. The concepts presented in that article are illustrated with experiments on a CFD problem, namely a convergent-divergent nozzle reconstruction for a transonic flow involving a shock. The overall numerical results confirmed that an HGA using multiple models can achieve the same quality of results as a classic GA using only a complex model, while the converged solutions were found three times faster in terms of CPU time (see Srinivas (1999); Schlierkamp-Voosen and Mühlenbein (1994)).

In 2003, researchers captured the Pareto front of a multi-objective Unmanned Aerial Vehicle (UAV) design optimization problem using adaptive evolution. The design optimization problem in aerodynamics was presented in a multi-objective form, while the flow solvers used to evaluate candidate shapes were generally “black box” software. In addition, the evaluation of the fitness functions took a lot of CPU time, but it could be programmed in multiple-fidelity forms. For this reason, a Hierarchical Asynchronous Parallel Evolutionary Algorithm (HAPEA) was devised by Sefrioui and Périaux (2000). Whitney et al. (2003a) designed an airfoil for a light and inexpensive projected UAV platform considering two separate multi-objective problems: on- and off-design cruise requirements, and cruise-landing requirements.
To compute the fitness of the airfoil shape candidates, an available panel method with a coupled boundary layer was used, which gave reliable results for attached and thinly separated flows (Drela and Youngren 2001). Since a Pareto process is a cooperative game, all the Pareto front member shapes needed to be analyzed with respect to both objectives. The evolution process identified geometries that were sub-optimal in reality but favorable from the optimizer's point of view. In essence, HAPEA exploited a weakness of the solver used (Whitney et al. (2003a), see also Srinivas and Deb (1994)).
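The asynchronous evaluation on which HAPEA relies can be illustrated with a small event-driven simulation. This is a sketch under stated assumptions rather than the algorithm of the cited works: solvers of different speeds are simulated with a priority queue of finish times instead of real remote processes, the `sphere` function stands in for a flow-solver evaluation, and the buffer update rule (replace the worst member if the newcomer is better) is one simple choice.

```python
import heapq
import random

def sphere(x):
    """Hypothetical cheap stand-in for a flow-solver evaluation."""
    return sum(v * v for v in x)

def make_candidate(buffer, rng, sigma=0.2):
    """Binary tournament on the buffer, followed by Gaussian mutation."""
    a, b = rng.sample(buffer, 2)
    parent = a if a[0] < b[0] else b
    return [v + rng.gauss(0.0, sigma) for v in parent[1]]

def async_ea(solver_speeds=(1.0, 0.5, 0.2), budget=600, buffer_size=20, dim=3, seed=0):
    rng = random.Random(seed)
    buffer = []
    for _ in range(buffer_size):
        x = [rng.uniform(-2.0, 2.0) for _ in range(dim)]
        buffer.append((sphere(x), x))
    running = []  # heap of (finish_time, solver_id, candidate)
    for sid, speed in enumerate(solver_speeds):
        job = (rng.random() / speed, sid, make_candidate(buffer, rng))
        heapq.heappush(running, job)
    finished_by = [0] * len(solver_speeds)
    for _ in range(budget):
        now, sid, x = heapq.heappop(running)  # whichever solver finishes first
        fx = sphere(x)
        finished_by[sid] += 1
        worst = max(range(len(buffer)), key=lambda i: buffer[i][0])
        if fx < buffer[worst][0]:
            buffer[worst] = (fx, x)  # re-incorporate one individual at a time
        # The freed solver receives a new candidate at once; there is no
        # waiting for the rest of a generation.
        duration = (0.5 + rng.random()) / solver_speeds[sid]
        heapq.heappush(running, (now + duration, sid, make_candidate(buffer, rng)))
    return min(buffer)[0], finished_by

best_value, finished_by = async_ea()
```

The list `finished_by` counts how many evaluations each simulated solver returned. Faster solvers contribute more individuals, and results enter the buffer out of order, yet no solver ever idles while waiting for a slower one.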

3.2 Period 2004–2014

Our chronology continues to 2008, when the focus was on optimizing the aerostructural design of Unmanned Aerial Systems (UAS) (Gonzalez et al. 2008). UASs have shown their importance as military and commercial assets for diverse applications, ranging from reconnaissance and surveillance to flying in regions where the presence of onboard human pilots is either too risky, unnecessary, or expensive. Examples of civilian applications are numerous, including coastal surveillance, power line inspection, traffic monitoring, bush-fire monitoring, precision farming, and remote sensing, among others. The multi-physics aspects of these unmanned vehicles can benefit from alternative approaches for design and optimization. An optimization tool named Hierarchical Asynchronous Parallel Multi-Objective Evolutionary Algorithm (HAPMOEA) was introduced by Gonzalez et al. (2005). That study also used a hierarchical population topology and asynchronous evaluation of candidates. In the asynchronous procedure, one candidate solution is generated at a time and sent for evaluation at its own speed. When candidates have been evaluated, they are returned to the optimizer and either accepted by insertion into the main population or rejected. The approach ignores any concept of a generation-based solution (Wakunda and Zell 2000). The application of this method considers the wing design and optimization of a Medium Altitude Long Endurance (MALE) UAV. The two objectives are the maximization of the lift-to-drag ratio (L/D) and the minimization of the wing weight. Numerical results indicated computational gains and better results from using a hierarchical topology of fidelity models.

Fig. 3 Advanced optimization techniques for UAS (courtesy of Gonzalez et al. (2009))

In 2009, the focus was on optimal route approaches in Mission Path Planning (MPP) problems. In the article (Gonzalez et al. 2009), the authors present advanced optimization techniques for UAS (see Fig. 3). The UAS is fitted with an air sampling device or spore trap to detect and monitor spores and plant pathogens in remote areas not accessible to current stationary monitoring methods. The optimal paths are computed using Multi-Objective Evolutionary Algorithms (MOEAs). Traditionally, optimal path plans are found using deterministic optimizers, but this approach may easily be trapped in local minima (Tang et al. 2005). Other techniques such as evolutionary algorithms are robust in finding global solutions but are computationally expensive.
Therefore, one of the main objectives in optimal path planning is to develop optimization techniques that are effective and efficient in terms of computational cost and solution quality (Lee et al. 2008). The hybrid game method applied to the Non-dominated Sorting Genetic Algorithm II (NSGA-II) couples the concept of the Nash game with the Pareto optimality of NSGA-II, and hence can simultaneously produce a Nash equilibrium and Pareto nondominated solutions. The hybrid game consists of several Nash players corresponding to the objectives of the problem. Each Nash player has its own optimization criterion and uses its own strategy. In the context of path planning, the overall objective is to minimize the path distance between two points, start and target. The optimization can be split in two, with (1) one player optimizing the path from start to target and (2) the other player optimizing the path from target to start. The reason for implementing a Nash game coupled with Pareto optimality is to accelerate the search for one of the global solutions (see Yi (2008)). The elite design from each Nash player is seeded to a Pareto player at every generation.

In the early 2010s, research in Multi-disciplinary Design Optimization (MDO) faced the need to develop robust and efficient optimization methods that produce higher-quality designs without high computational costs. One alternative methodology was the use of game-strategy-based approaches. Nash and Pareto strategies are game theory tools that can be used to reduce CPU usage and produce high-quality solutions due to their efficiency in design optimization. In the chosen application, the multi-objective and robust multidisciplinary design optimization of an Unmanned Combat Air Vehicle (UCAV) was carried out using a Hierarchical Asynchronous Parallel Evolutionary Algorithm (HAPEA) (see Lee et al. (2011b)). The main elements of the methodology include Multi-Objective Evolutionary Algorithms (MOEA), hierarchical multi-fidelity/population topology, Nash game, hybrid game (hybrid Nash), robust/uncertainty design, and algorithms for HAPMOEA and the hybrid game. As a real-world design problem, a Joint Unmanned Combat Air Vehicle (J-UCAV) was considered with a mission profile consisting of Reconnaissance, Intelligence, Surveillance, and Target Acquisition (RISTA). A test case considered the multidisciplinary design optimization of the UCAV when there was uncertainty in the operating conditions and a robust design technique was required.
This problem was selected to show the benefits of using the hybrid game (Nash-Pareto) method, since the addition of uncertainty increases the computational cost considerably. The objectives were to maximize Aerodynamic Quality (AQ) while minimizing Electromagnetic Quality (EQ) in order to maximize the survivability of the UCAV. Numerical results show that the Nash game decomposes the complex multi-objective problem into several single-objective problems that lead to the nondominated solutions on a Pareto front. Details can be found in the article (Lee et al. 2009).

In the same year, 2011, hybrid games were used to solve the Shock Control Bump (SCB) design optimization problem (Lee et al. 2011a). The main goal of this study was to investigate the efficiency of the hybrid game for a single-objective design problem. One Active Flow Control (AFC) device, a double-shock control bump, was applied on the suction and pressure sides of a natural laminar flow (NLF) airfoil (RAE 5243) to reduce the transonic total drag. A mathematical benchmark with two complex mathematical design problems was used. The convergence histories obtained by NSGA-II and by the hybrid game were shown in the article. The analysis clearly shows the benefits of using a hybrid game in terms of computational cost and design quality. The use of the SCB significantly reduced the transonic drag. In the long term, SCB technology has the potential to reduce critical aircraft emissions because less fuel is burnt (see Lee et al. (2010); Qin et al. (2004)).



Continuing our chronology from 2012, we focus on distributed evolutionary optimization that uses Nash games and GPUs. As seen in the earlier stages, flow simulation can be very time-consuming. When it is used within optimization, where the model must be re-evaluated for each design candidate, the situation becomes even worse. Global search methods such as Evolutionary Algorithms (EAs) or Differential Evolution (DE) can solve problems with difficult objective functions. In the article (Leskinen and Périaux 2013a), the authors show that it is possible to improve the convergence of these algorithms by exploiting the separability of design objective functions using a game theory approach. The method was coupled with Graphics Processing Units (GPUs) and applied to inverse shape reconstruction problems in aerodynamic design. The target of a multi-objective optimization study is usually the capture of a Pareto front. In game theory, a Pareto front is the result of a cooperative game. There are other strategies that can be applied to multi-objective optimization, such as competitive (Nash) or hierarchical leader-follower (Stackelberg) games. The Nash approach can also be implemented in single-objective inverse shape design as a Geometry Decomposition Method (GDM). The objective function and the design variables are decomposed between the players, forming a virtual Nash multi-objective problem. Since programmable and efficient GPUs became available in the early 2000s, the massive parallelism provided by GPUs has offered a fast and cost-effective computing platform for solving computationally intensive problems. Mature development platforms such as CUDA and OpenCL have enabled programmers familiar with C/C++ or Fortran to develop efficient codes. The test case considered in the article is an inverse shape reconstruction problem from the field of aeronautical design.
The results indicate that the geometry decomposition method using the virtual Nash games is very efficient if there is not too much interference between the design variables (Leskinen and Périaux 2013b). In 2013, progress was made in the robust design optimization of an active shock control bump using parallel hybrid GAs. Lee et al. (2013) consider the robust design optimization of the Shock Control Bump (SCB) on the suction side of the Natural Laminar Flow (NLF) RAE 5243 airfoil. The objectives were to minimize both the mean total drag and the drag sensitivity under variable boundary layer transition positions and lift coefficients. The design problem is solved by using two different Multi-Objective Evolutionary Algorithms (MOEAs). The first is a GA implemented on a Robust Multi-objective Optimization Platform (RMOP) (denoted as RMOGA). The second method uses a hybridized GA with a Nash game (Lee et al. 2011c) and parallel computation (denoted as HPRMOGA). The study showed how to control the transonic flow on a current airfoil using an AFC device and how to improve both the design quality, considering uncertain design parameters (Taguchi et al. 1999), and the optimization efficiency, using a hybrid game coupled to parallel computation. In RMOP with OpenMP-based (van der Pas 2009) parallel computation, the user can define the number of CPU cores, and the individuals of the GA are then distributed to the available CPUs to evaluate a design model with a CFD analysis tool during the optimization. This research presented the methodology including robust design with MOEA as an aerodynamic analysis tool and the


J. Periaux and T. Tuovinen

concept of SCB in the context of robust design optimization with uncertainties using RMOGA and HPRMOGA for comparison. Numerical results demonstrate the benefits of using robust design in terms of performance magnitude and sensitivity, and also the benefits of coupling hybrid games with parallel computing systems, considering both theoretical and physical aspects during the optimization.
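The two robust objectives, performance magnitude and sensitivity, can be illustrated with a short Python sketch. It assumes, as a simplification, that the uncertainty is represented by a small set of sampled operating conditions and that sensitivity is measured by the standard deviation over those samples. The drag model is a hypothetical placeholder for a CFD evaluation, and none of the names or numbers come from Lee et al. (2013).

```python
from statistics import mean, pstdev


def robust_objectives(design, performance, conditions):
    """Return (mean, standard deviation) of a performance measure.

    The design is evaluated once per uncertain operating condition; the
    mean measures the performance magnitude and the standard deviation
    measures its sensitivity. Both are to be minimized.
    """
    values = [performance(design, c) for c in conditions]
    return mean(values), pstdev(values)


def toy_drag(bump_height, lift_coefficient):
    """Hypothetical drag model, NOT a CFD result: a constant base drag
    plus a penalty that grows when the design is mismatched to the
    operating condition."""
    return 0.01 + (bump_height - lift_coefficient) ** 2


# Three sampled lift coefficients stand in for the uncertain conditions.
conditions = [0.4, 0.5, 0.6]
mean_drag, drag_spread = robust_objectives(0.5, toy_drag, conditions)
```

A multi-objective optimizer would then treat the returned pair as the two objectives of each candidate design.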

3.3 Period 2014–2022

In the late 2010s, the development of Mission Path Planning (MPP) problems focused on the robustness of optimization methods. Pons-Prats et al. (2015) study aircraft emissions and their climate impact, in view of the requirements placed on aircraft manufacturers to reduce water vapor, carbon dioxide (CO2) and nitrogen oxide (NOx) emissions. The difficulty of reducing emissions is due to the fact that commercial aircraft are usually designed for an optimal cruise altitude but may be requested to operate and deviate at different altitudes and speeds. This forms a multi-disciplinary problem with multiple trade-offs such as optimizing engine efficiency and minimizing fuel burnt and emissions while maintaining prescribed aircraft trajectories, altitude profiles, and air safety. Based on the rough data provided by an air carrier company, this article considers the coupling of an advanced evolutionary optimization technique (NSGA-II, Deb (2001)) with mathematical models and algorithms for aircraft emissions. The approach studies non-robust and robust optimization considering uncertainties on several parameters. Numerical results showed that the methods were able to capture a set of useful trade-off trajectories of Pareto solutions between aircraft range and fuel consumption as well as fuel consumption and flight time. We have now entered the early 2020s with the use of Stackelberg games. Wang et al. (2020) combine the Stackelberg game strategy (Stackelberg 1952) with an adjoint method and apply it to single- and two-objective aerodynamic shape design optimization. Game-theory-based methods employ a strategic process wherein players cooperate or compete in a symmetric or hierarchical way to improve their own gains in the game. Specifically, the improvement of the gains for each player can be conducted by a single-objective optimization method, which is referred to as the basic optimizer in the game.
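The hierarchical (leader-follower) play can be illustrated on a toy problem. In the Python sketch below, each player controls one scalar variable, the follower always answers the leader's choice with its own best response, and the leader optimizes knowing that answer. The two quadratic cost functions are hypothetical, and a derivative-free golden-section search is swapped in as the basic optimizer purely to keep the example self-contained; it is not the adjoint-based optimizer of Wang et al. (2020).

```python
import math


def golden_section(f, lo, hi, tol=1e-9):
    """Minimize a unimodal scalar function f on the interval [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return 0.5 * (a + b)


def follower_cost(x, y):
    """Hypothetical follower objective; the follower controls y."""
    return (y - x) ** 2 + y ** 2


def leader_cost(x, y):
    """Hypothetical leader objective; the leader controls x."""
    return (x - 1.0) ** 2 + (y - 1.0) ** 2


def follower_response(x):
    """Best response of the follower to a fixed leader decision x."""
    return golden_section(lambda y: follower_cost(x, y), -10.0, 10.0)


def stackelberg_solve():
    """Leader minimizes its cost while anticipating the follower's response."""
    x = golden_section(lambda x: leader_cost(x, follower_response(x)),
                       -10.0, 10.0, tol=1e-6)
    return x, follower_response(x)
```

For these particular costs the follower's best response is y = x/2, so the leader effectively minimizes (x - 1)^2 + (x/2 - 1)^2, whose minimum is at x = 1.2 and y = 0.6; the numerical search should return values close to these.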
The game-based approach can usually find the optimum efficiently since the objectives and the design variables are distributed to several sets. The deterministic optimizer of the Stackelberg game strategy referred to in this article is the adjoint method, in which the gradient-based Sequential Least Squares Programming (SLSQP) method is chosen as the optimization method (see Kraft and Schnepper (1989)), and the gradients of the objective function with respect to the design variables are calculated by solving continuous adjoint equations (Jameson 2003). Based on the Stackelberg game strategy and the continuous adjoint method, a Stackelberg/adjoint method is developed. Two types of players (a leader and a follower) are involved in the design and each of these players is responsible for the optimization of one objective function by adjusting a subset of design variables. Note that the success of the proposed method is highly dependent on the choice of a few influential factors, namely the maximal number of iterations for each player, the design cycle, the splitting and mapping schemes of design variables, and the allocation strategies of objective functions to different players. Single-objective and two-objective aerodynamic design optimizations of the RAE2822 airfoil and the ONERA M6 wing are conducted to verify the usefulness of these inferences and to validate the efficiency and effectiveness of the proposed method in optimization problems with a complex CFD numerical model or a large number of design variables. The conclusions obtained from the numerical experiments are the following: the two-objective optimization can obtain better and more robust results than the single-objective optimization. For instance, in the example of the RAE2822 airfoil, the two-objective optimization obtains a much larger drop in drag at all given Mach numbers. When solving single-objective optimization problems, the Stackelberg/adjoint method produces better results than the adjoint method. The adjoint method failed to meet the constraint in the example of the ONERA M6 wing (see details in Wang et al. (2020)). For additional details about this topic, the reader can consult the following articles: Bauso (2016); Deb et al. (2000); Reuther et al. (1996). Finally, we arrive at the year 2021 and the multi-objective natural laminar flow optimization of civil transport aircraft airfoils (Chen et al. 2021). At sufficiently high Reynolds numbers, many flows are turbulent rather than laminar. There are different transition mechanisms, which depend mainly on the external flow turbulence intensity level, the stream-wise pressure gradient, the solid wall geometry, and the surface roughness. One of the most challenging demands in the aerodynamic behavior of laminar airfoils and wings is a reliable computation of boundary layer transition. This transition to turbulence has been the object of many studies utilizing several approaches.
The main goal of the study was NLF airfoil shape design optimization in transonic flow. Maintaining a long portion of favorable pressure gradient is an efficient way to obtain an NLF airfoil. However, the larger the portion of favorable pressure gradient on the airfoil, the stronger the shock wave near its trailing edge due to the pressure recovery, i.e., the skin friction decreases on a laminar flow airfoil but the shock wave intensity increases simultaneously. In this article, a Trailing Edge Device (TED) is fitted to the trailing edge of the airfoil to reduce wave-induced drag in the process of laminar flow optimization. Therefore, the NLF airfoil design optimization problem is converted into a two-objective optimization problem: (1) optimization of the TED shape to reduce wave (or pressure) drag and (2) optimization of the airfoil shape to delay the transition. In the research, a passive shock wave control device (TED) is used to reduce wave-induced drag during the procedure of laminar flow optimization, and the e^N transition prediction method based on the Linear Stability Theory (LST) is used to predict the transition location. The 2D steady, compressible RANS equations are solved to obtain the aerodynamic coefficients of the flow over an airfoil by using a finite volume Galerkin method on structured meshes. For the Euler part of the equations, a second-order Roe scheme is used to compute turbulent flows with a Spalart-Allmaras one-equation turbulence model. Near the wall, turbulence is computed by a two-layer approach. The e^N method (LST), which is based on small disturbance theory in which a small sinusoidal disturbance is imposed on a given steady laminar flow, is used to see whether the disturbance intensifies or subsides over time. Then a Multi-Objective Genetic Algorithm (MOGA) coupled with a Pareto strategy is implemented to optimize the airfoil shape so as to obtain a larger laminar flow region and a weaker shock wave drag simultaneously. A Pareto-GAs tournament optimizer software can easily capture the Pareto front of this two-objective optimization problem: a parallelized version of the Nondominated Sorting Genetic Algorithm II (NSGA-II) is used to solve the laminar flow airfoil shape design optimization problems. Results of numerical experiments demonstrate that both the wave drag and the friction drag of several Pareto members are greatly improved with the optimal shape equipped with a TED compared to those of the baseline shape. From the numerical results obtained in the study, it can be concluded that the NLF shape design optimization method with the TED discussed in this paper is feasible and effective. An NLF 3D configuration design optimization remains a challenge that cannot be solved successfully without large HPC facilities and the assistance of Artificial Intelligence (AI) tools such as Machine Learning (ML). (See Fig. 4).

Fig. 4 Comparison of transition lines on the lower surface of a converged wing shape obtained by low-fidelity and high-fidelity optimization methods starting from an initial shape (courtesy of Chen et al. (2021))
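As a rough illustration of how an e^N criterion turns disturbance growth into a transition location, the Python sketch below integrates a given spatial growth rate along the surface to obtain the amplification exponent N and reports where N first reaches a critical value. It rests on several assumptions that are not taken from Chen et al. (2021): a single disturbance frequency is followed instead of an envelope over frequencies, the first station is taken as the point where amplification starts, the growth rates are supplied as input rather than computed from a stability analysis, and the critical value of 9 is only a commonly quoted default.

```python
def n_factor(x, growth_rate):
    """Cumulative amplification exponent N along the stations x.

    N is the integral of the local growth rate, evaluated here with the
    trapezoidal rule, so that the disturbance amplitude has grown by a
    factor e**N relative to the first station.
    """
    n = [0.0]
    for i in range(1, len(x)):
        avg = 0.5 * (growth_rate[i] + growth_rate[i - 1])
        n.append(n[-1] + avg * (x[i] - x[i - 1]))
    return n


def transition_location(x, growth_rate, n_crit=9.0):
    """First position where N reaches n_crit, or None if it never does.

    The crossing is located by linear interpolation between stations.
    """
    n = n_factor(x, growth_rate)
    for i in range(1, len(x)):
        if n[i - 1] < n_crit <= n[i]:
            t = (n_crit - n[i - 1]) / (n[i] - n[i - 1])
            return x[i - 1] + t * (x[i] - x[i - 1])
    return None
```

For example, with stations at 0, 0.25, 0.5, 0.75 and 1.0 chord and a constant hypothetical growth rate of 20 per chord, N grows linearly from 0 to 20 and crosses 9 at 45% chord.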

4 Applications to Civil Engineering Design

In this section, we present some examples, with references, of the use of the methods presented above in the field of civil engineering. The aim is to demonstrate the universality of the game-theory-based methods for solving optimization problems in different disciplines. A comprehensive review of the topic can be found, e.g., in the article by Greiner et al. (2017), which is highly recommended to our readers. Structural design optimization with Nash games is one example of applying the methods in fields outside aeronautics, but it is not the only one. Population-based global meta-heuristics had been used as optimizers in real-world complex engineering design problems for two decades when they first appeared in aeronautical engineering. The main characteristic of aeronautics design has been that the computational CPU costs of fitness calculation are frequently too high. In the following example, the minimum constrained weight optimization problem in structural engineering has been solved using a game-theory-based Nash evolutionary algorithm. The performance of a Nash GA procedure can also be analyzed in structural mechanics. For example, in a bar truss test problem, different variable split sets from a discrete real cross-section type 55-member structure are used, and the results are then compared to a standard panmictic evolutionary algorithm that solves the same test. (See Figs. 5 and 6). Numerical results have indicated that a significant increase in performance can be achieved using the proposed Nash approach, both in terms of convergence speed-up and in algorithm robustness when finding the optimum design solutions. The influence of the partitioning type has also been demonstrated: selecting a good partitioning is critical. The reported results show better performance of the Nash GA with left-right partitioning than with beam-column partitioning or 3-player partitioning (Greiner et al. 2015). Further reading and different applications can be found, e.g., in Greiner et al. (2004, 2015, 2017, 2019); Galan et al. (1996); Galván et al. (2003).

Fig. 5 Test case with 55 members (courtesy of Greiner et al. (2017))

Fig. 6 Best out of 30 independent executions of fitness function with 0.4% mutation rate (courtesy of Greiner et al. (2017))
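For readers unfamiliar with the baseline, a panmictic GA is one in which any individual of a single population may mate with any other. The Python sketch below shows such an algorithm with tournament selection, one-point crossover, bit-flip mutation and elitism. It is a generic illustration, not the code of Greiner et al.: the bit-counting fitness is a stand-in for a constrained truss weight evaluation, and the population size, mutation rate and other settings are arbitrary assumptions.

```python
import random


def onemax(bits):
    """Toy fitness to maximize: number of ones in the bit string."""
    return sum(bits)


def tournament(pop, fitness, rng):
    """Binary tournament: pick two distinct individuals, return the fitter."""
    a, b = rng.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b


def genetic_algorithm(fitness, n_bits=20, pop_size=30, generations=40,
                      p_mut=0.02, seed=1):
    """Panmictic GA; returns the best individual and the best-fitness history."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        best = max(pop, key=fitness)
        history.append(fitness(best))
        children = [best[:]]  # elitism: the current best survives unchanged
        while len(children) < pop_size:
            p1 = tournament(pop, fitness, rng)
            p2 = tournament(pop, fitness, rng)
            cut = rng.randint(1, n_bits - 1)  # one-point crossover position
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation applied independently to every gene.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = children
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history
```

A Nash variant would split the bit string between several such populations, each evolving its own part while the others' best parts are held fixed, in the same way as the variable split sets described above.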

5 Conclusions and Perspectives

The appearance of electronic computers in the 1960s has been the most revolutionary development in the history of scientific computing and technology. The early pioneers in computer science were interested in biology and electronics and looked to evolutionary systems to achieve their visions (Mitchell 1998). Renowned scientists like Ingo Rechenberg and then Hans-Paul Schwefel introduced “Evolution Strategies”, and John Holland and then David Goldberg introduced “Genetic Algorithms”, claiming that evolution could be used as an optimization tool for engineering problems. Holland’s innovative evolutionary method for adaptation used a population encoded as bit strings and introduced a computational version of the “natural selection of the fittest” principle (Darwin 1859) with the genetic operators of crossover, mutation, and selection. The notion of a global solution to design optimization problems applied to CFD in aeronautics appeared in 1995 for the first time with a flow analyzer modeled with Partial Differential Equations (PDEs). The substantial CPU time needed to capture a global solution was very quickly traced to the cost of evaluating candidates to obtain the values of the (multi-)objective functions. This drawback (“There is no free lunch to capture a global solution”) was particularly penalizing in industrial design environments. This article provides a series of real-life examples in aeronautics with milestones achieved from 1995 until 2021, accumulating improvements of EAs obtained with hierarchy, parallel asynchronism, and hybridization of EAs with game strategies. This article is a repository of well-documented examples, with data and relevant references, that we propose to the reader. It will help young scientists complete their Ph.D. theses in good time and young engineers in their work in the field of design optimization in aeronautics and civil engineering.
EAs are described in this paper as a powerful tool for solving multi-objective, multidisciplinary problems. From the examples selected, EAs are promising methods for solving complex technological applications assisted with machine learning AI tools. It was shown in many different problems that mimicking natural evolution provided a source of inspiration for introducing new ideas and models to be implemented as evolutionary-based software. Among the new ideas for increasing the efficiency of methods in complex applications, one can mention the combination of Cultural Algorithms (CAs) and Genetic Algorithms (GAs), both of which have broad potential in the field of optimization. An efficient Hybrid evolutionary optimization method coupling CA with GAs (HCGA) is proposed in Zhao et al. (2022). Moreover, there are interesting challenges in real-life examples, such as the aerodynamic configuration design of a distributed propulsion vehicle, in which a Distributed Propulsion System (DPS) consists of a number of small engines located in the rear of the aircraft instead of a few larger ones in the center section. This innovative configuration will improve the propulsion efficiency by distributing exhaust and filling the wake behind the air vehicle. Solutions implemented using Artificial Intelligence (AI) to the challenges of the new methods and applications mentioned above will benefit the rapid digitalization of industry and society and improve the well-being of citizens by reducing the environmental impact of aviation. The authors hope that the examples in this repository of data and methods will provide the readers with a pathway, guided by the initial inspiration of natural evolution, for solving their optimization problems. The famous sentence of John Holland, “Computer programs that ‘evolve’ in ways that resemble natural selection can solve complex problems even their creators do not fully understand” (Holland 1992),

remains topical in 2022. If engineers in 1992 used GA software and expert systems to design jet turbines with aero-engine databases (conceptual and preliminary design) and achieved better results than with more conventional tools, engineers and scientists of the new generation will design 3D (disruptive) configurations with non-linearly modeled complex flow or structural analyzers (detailed design) with more efficient new optimizers assisted with AI. The scientific and industrial communities will once again be astonished over the next decade by John Holland’s prediction, which may include more GA surprises.

Acknowledgements

We had the honor of being invited to contribute to a volume that celebrates the contributions of Pekka Neittaanmäki’s career in scientific computing. These include the education of many Ph.D. students working on challenging applications in the paper industry, health care, medicine, and urban transportation, in the areas of computational mathematics and, more recently, information technology. Pekka’s career has had a major impact on applied sciences. In this article, we have considered the progress in the efficiency and quality of design optimization methods with evolutionary algorithms and game strategies over the past three decades. Pekka Neittaanmäki played a significant role in disseminating the development of deterministic and statistical optimization methods through international cooperation, exchanges, and events. This contribution illustrates some milestones with referenced articles issued from industrial and research periods with co-authors since 1994. We would like to thank all the co-authors of the original articles on which this review is based: Gabriel Bugeda, Hong Quan Chen, Yongbin Chen, Lloyd Damp, Jose M. Emperador, Blas Galvan, Jean Gabriel Ganascia, Roland Glowinski, Luis Felipe Gonzalez, David Greiner, He Ji Wen, Anthony J. Kearsley, Dong Seop Chris Lee, Emmanuel Laporte, Jianda Sheng, Jyri Leskinen, Bertrand Mantel, Pekka Neittaanmäki, Eugenio Oñate, Jordi Pons Prats, Olivier Pironneau, Mourad Sefrioui, Karkenhalli Srinivas, Tang Zhili, Rodney Walker, Jin Wang, Eric Whitney, Gabriel Winter, Fangfang Xie, Francisco Zarate, Jifa Zhang, Yao Zheng, ... The methods and data results described in this review article have been obtained in cooperation with many colleagues in Spain, Finland, France, and other countries through numerous fruitful discussions on evolutionary design optimization. The support of the European Commission, the Chinese Ministry of Industry and Information Technology, Tekes (now Business Finland) and CIMNE/UPC in research projects is gratefully acknowledged. And last but not least, the authors are grateful to Marja-Leena Rantalainen for her patience and the final supervision of this article.



References

Bauso D (2016) Game theory with engineering applications. SIAM
Chen HQ, Glowinski R, He JW, Kearsley AJ, Périaux J, Pironneau O, Caughey DA, Hafez MM (1994) Remarks on optimal shape design problems. In: Frontiers of computational fluid dynamics. Wiley, pp 67–80
Chen Y, Tang Z, Periaux J (2021) Comparison and analysis of natural laminar flow airfoil shape optimization results at transonic regime with bumps and trailing edge devices solved by Pareto games and EAs. In: Greiner D, Asensio MI, Montenegro R (eds) Numerical simulation in physics and engineering: trends and applications. SEMA SIMAI Springer series, vol 24, pp 155–168
Darwin C (1859) On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life. John Murray
Dasgupta D, Michalewicz Z (1997) Evolutionary algorithms – an overview. In: Evolutionary algorithms in engineering applications. Springer, Berlin, pp 3–28
Deb K (2001) Multiobjective optimization using evolutionary algorithms. Wiley
Deb K, Agrawal S, Pratap A, Meyarivan T (2000) A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II. In: International conference on parallel problem solving from nature, PPSN 2000. Springer, Berlin, pp 849–858
Drela M, Youngren H (2001) XFOIL 6.94 user guide. MIT Aero & Astro
Galan M, Winter G, Montero G, Greiner D, Periaux J, Mantel B (1996) A transonic flow shape optimisation using genetic algorithms. In: Computational methods in applied sciences ’96 (Paris, 1996). Wiley, pp 273–283
Galván B, Greiner D, Periaux J, Sefrioui M, Winter G (2003) Parallel evolutionary computation for solving complex CFD optimization problems: a review and some nozzle applications. In: Parallel computational fluid dynamics 2002: new frontiers and multi-disciplinary applications. Elsevier, pp 573–604
Goldberg DE (1989) Genetic algorithms in search, optimization and machine learning. Addison-Wesley
Gonzalez F, Whitney E, Srinivas K, Armfield S, Periaux J (2005) A robust evolutionary technique for coupled and multidisciplinary design optimisation problems in aeronautics. Comput Fluid Dyn J 14(2):142–153
Gonzalez L, Lee DS, Walker R, Periaux J (2009) Optimal Mission Path Planning (MPP) for an air sampling unmanned aerial system. In: Proceedings of the 2009 Australasian conference on robotics and automation (ACRA). Australian Robotics & Automation Association, pp 1–9
Gonzalez LF, Periaux J, Damp L (2008) Multi-disciplinary design optimisation of unmanned aerial systems using evolutionary algorithms
Grefenstette JJ (1986) Optimization of control parameters for genetic algorithms. IEEE Trans Syst Man Cybern 16(1):122–128
Greiner D, Emperador JM, Winter G (2004) Single and multiobjective frame optimization by evolutionary algorithms and the auto-adaptive rebirth operator. Comput Methods Appl Mech Eng 193(33–35):3711–3743
Greiner D, Periaux J, Emperador JM, Galván B, Winter G (2015) A study of Nash-evolutionary algorithms for reconstruction inverse problems in structural engineering. In: Advances in evolutionary and deterministic methods for design, optimization and control in engineering and sciences. Springer, Berlin, pp 321–333
Greiner D, Periaux J, Emperador JM, Galván B, Winter G (2017) Game theory based evolutionary algorithms: a review with Nash applications in structural engineering optimization problems. Arch Comput Methods Eng 24(4):703–750
Greiner D, Périaux J, Emperador JM, Galván B, Winter G (2019) A diversity dynamic territory Nash strategy in evolutionary algorithms: enhancing performances in reconstruction problems in structural engineering. In: Advances in evolutionary and deterministic methods for design, optimization and control in engineering and sciences. Springer, Berlin, pp 283–301
Holland JH (1992) Genetic algorithms. Sci Am 267(1):66–73
Jameson A (2003) Aerodynamic shape optimization using the adjoint method. Lectures at the Von Karman Institute, Brussels
Karakasis MK, Giannakoglou KC (2006) On the use of metamodel-assisted, multi-objective evolutionary algorithms. Eng Optim 38(8):941–957
Kraft D, Schnepper K (1989) SLSQP – a nonlinear programming method with quadratic programming subproblems. DLR, Oberpfaffenhofen 545
Kumar V (1992) Algorithms for constraint-satisfaction problems: a survey. AI Mag 13(1):32–44
Lee DS, Gonzalez LF, Srinivas K, Periaux J (2008) Robust evolutionary algorithms for UAV/UCAV aerodynamic and RCS design optimisation. Comput Fluids 37(5):547–564
Lee DS, Gonzalez LF, Srinivas K, Periaux J (2009) Multifidelity Nash-game strategies for reconstruction design in aerospace engineering problems. In: Proceedings of 13th Australian international aerospace conference (AIAC 13)
Lee DS, Periaux J, Gonzalez LF (2010) UAS mission path planning system (MPPS) using hybrid-game coupled to multi-objective optimizer. J Dyn Syst Meas Control 132(4):041005
Lee DS, Gonzalez LF, Periaux J, Bugeda G (2011a) Double-shock control bump design optimization using hybridized evolutionary algorithms. Proc Inst Mech Eng, Part G: J Aerosp Eng 225(10):1175–1192
Lee DS, Gonzalez LF, Periaux J, Srinivas K (2011b) Efficient hybrid-game strategies coupled to evolutionary algorithms for robust multidisciplinary design optimization in aerospace engineering. IEEE Trans Evol Comput 15(2):133–150
Lee DS, Gonzalez LF, Periaux J, Srinivas K, Onate E (2011c) Hybrid-game strategies for multiobjective design optimization in engineering. Comput Fluids 47(1):189–204
Lee DS, Bugeda G, Periaux J, Oñate E (2013) Robust active shock control bump design optimisation using hybrid parallel MOGA. Comput Fluids 80:214–224
Leskinen J, Périaux J (2013a) Distributed evolutionary optimization using Nash games and GPUs – applications to CFD design problems. Comput Fluids 80:190–201
Leskinen J, Périaux J (2013b) A new distributed optimization approach for solving CFD design problems using Nash game coalition and evolutionary algorithms. In: Domain decomposition methods in science and engineering XX. Springer, Berlin, pp 607–614
Loridan P, Morgan J (1989) A theoretical approximation scheme for Stackelberg problems. J Optim Theory Appl 61(1):95–110
Michalewicz Z (1996) Genetic algorithms + data structures = evolution programs. Springer, Berlin
Mitchell M (1998) An introduction to genetic algorithms. MIT Press
Mühlenbein H (1991) Evolution in time and space – the parallel genetic algorithm. In: Foundations of genetic algorithms, vol 1. Elsevier, pp 316–337
Mühlenbein H, Schomisch M, Born J (1991) The parallel genetic algorithm as function optimizer. Parallel Comput 17(6–7):619–632
Nash JF (1950) Equilibrium points in n-person games. Proc Natl Acad Sci USA 36(1):48–49
Pareto V (1896) Cours d’économie politique professé à l’Université de Lausanne, vol 1. F. Rouge
Périaux J, Sefrioui M, Stoufflet B, Mantel B, Laporte E (1996) Robust genetic algorithms for optimization problems in aerodynamic design. In: Genetic algorithms in engineering and computer science. Wiley, Chichester, pp 371–396
Periaux J, Gonzalez F, Lee DSC (2015) Evolutionary optimization and game strategies for advanced multi-disciplinary design: applications to aeronautics and UAV design. Intelligent systems, control and automation: science and engineering, vol 75. Springer, Berlin
Pironneau O (1982) Optimal shape design for elliptic systems. In: System modeling and optimization: proceedings of the 10th IFIP conference. Springer, Berlin, pp 42–66
Pons-Prats J, Bugeda G, Zarate F, Oñate E, Periaux J (2015) Applying multi-objective robust design optimization procedure to the route planning of a commercial aircraft. In: Computational methods and models for transport, ECCOMAS 2015. Springer, Berlin, pp 147–167
Qin N, Zhu Y, Shaw ST (2004) Numerical study of active shock control for transonic aerodynamics. Int J Numer Methods Heat Fluid Flow 14(4):444–466
Rechenberg I (1973) Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Frommann-Holzboog
Reuther J, Jameson A, Farmer J, Martinelli L, Saunders D (1996) Aerodynamic shape optimization of complex aircraft configurations via an adjoint formulation. In: AIAA Paper 96-0094, 34th aerospace sciences meeting and exhibit
Schaffer JD (1985) Multiple objective optimization with vector evaluated genetic algorithms. In: Proceedings of the 1st international conference on genetic algorithms. L. Erlbaum Associates
Schlierkamp-Voosen D, Mühlenbein H (1994) Strategy adaptation by competing subpopulations. In: Parallel problem solving from nature – PPSN III: international conference on evolutionary computation, proceedings. Springer, Berlin, pp 199–208
Sefrioui M (1998) Algorithmes evolutionnaires pour le calcul scientifique: application a l’electromagnetisme et a la mecanique des fluides numeriques. PhD thesis, Paris, vol 6
Sefrioui M, Périaux J (2000) A hierarchical genetic algorithm using multiple models for optimization. In: International conference on parallel problem solving from nature, PPSN 2000. Springer, Berlin, pp 879–888
Sefrioui M, Periaux J, Ganascia J-G (1996) Fast convergence thanks to diversity. In: Fogel LJ, Angeline PJ, Bäck T (eds) Evolutionary programming V: proceedings of the fifth annual conference on evolutionary programming. MIT Press, pp 313–321
Srinivas K (1999) Computation of cascade flows by a modified CUSP scheme. Comput Fluid Dyn J 8(2):285–295
Srinivas N, Deb K (1994) Multiobjective optimization using nondominated sorting in genetic algorithms. Evol Comput 2(3):221–248
von Stackelberg H (1952) The theory of the market economy. Oxford University Press
Taguchi G, Chowdhury S, Taguchi S (1999) Robust engineering: learn how to boost quality while reducing costs & time to market. McGraw Hill
Tang Z, Périaux J, Désidéri JA (2005) Multi criteria robust design using adjoint methods and game strategies for solving drag optimization problems with uncertainties. In: East west high speed flow fields conference, pp 487–493
van der Pas R (2009) An overview of OpenMP 3.0. In: Presentation at the 5th international workshop on OpenMP
Wakunda J, Zell A (2000) Median-selection for parallel steady-state evolution strategies. In: International conference on parallel problem solving from nature, PPSN 2000. Springer, Berlin, pp 405–414
Wang J, Zheng Y, Chen J, Xie F, Zhang J, Périaux J, Tang Z (2020) Single/two-objective aerodynamic shape optimization by a Stackelberg/adjoint method. Eng Optim 52(5):753–776
Whitney E, Gonzalez L, Srinivas K, Periaux J (2003a) Adaptive evolution design without problem specific knowledge: UAV applications. In: International congress on evolutionary methods for design, optimization and control with applications to industrial and societal problems, EUROGEN
Whitney E, Gonzalez L, Srinivas K, Périaux J (2003b) Multi-criteria aerodynamic shape design problems in CFD using a modern evolutionary algorithm on distributed computers. In: Computational fluid dynamics 2002: proceedings of the second international conference on computational fluid dynamics, ICCFD. Springer, Berlin, pp 597–602
Yi T (2008) Uncertainty based multiobjective and multidisciplinary design optimisation in aerospace engineering. PhD thesis, University of Sydney
Zhao X, Tang Z, Cao F, Zhu C, Periaux J (2022) An efficient hybrid evolutionary optimization method coupling cultural algorithm with genetic algorithms and its application to aerodynamic shape design. Appl Sci 12(7):3482